Skill
schema-discoverer
Infer schema from sample data files (CSV, JSON, Parquet) and generate type definitions.
From majestic-dataInstall
1
Run in your terminal$
npx claudepluginhub majesticlabs-dev/majestic-marketplace --plugin majestic-dataTool Access
This skill is limited to using the following tools:
Read Grep Glob Bash
Skill Content
Schema Discoverer
Audience: Data engineers and analysts working with unfamiliar data files.
Goal: Analyze data files to infer schema and generate type definitions in multiple output formats.
Workflow
- Identify file format from extension and content
- Sample the data (first 1000 rows for large files)
- Infer column types with confidence scores
- Detect patterns (dates, emails, IDs, categories)
- Generate output in requested format
Type Inference Rules
Numeric Detection
- Integer: All values match
^\d+$with no leading zeros (except "0") - Float: Values match
^\d+\.\d+$or scientific notation - Currency: Values match
^\$?\d{1,3}(,\d{3})*(\.\d{2})?$
String Patterns
- Email:
^[\w\.-]+@[\w\.-]+\.\w+$ - URL:
^https?:// - UUID:
^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$ - Phone:
^\+?[\d\s\-\(\)]+$with 10+ digits
Date Patterns
- ISO:
YYYY-MM-DDorYYYY-MM-DDTHH:MM:SS - US:
MM/DD/YYYY - EU:
DD/MM/YYYY - Timestamp: Unix epoch (10 or 13 digits)
Categorical Detection
- Unique ratio < 5% of total rows
- Repeated values dominate
Output Formats
Pydantic Model
from pydantic import BaseModel, Field
from datetime import date
from typing import Literal
class Record(BaseModel):
id: int = Field(gt=0)
email: str = Field(pattern=r'^[\w\.-]+@[\w\.-]+\.\w+$')
status: Literal['active', 'pending', 'inactive']
created_at: date
amount: float = Field(ge=0)
Pandera Schema
import pandera as pa
schema = pa.DataFrameSchema({
"id": pa.Column(int, pa.Check.gt(0), unique=True),
"email": pa.Column(str, pa.Check.str_matches(r'^[\w\.-]+@')),
"status": pa.Column(str, pa.Check.isin(['active', 'pending', 'inactive'])),
"created_at": pa.Column("datetime64[ns]"),
"amount": pa.Column(float, pa.Check.ge(0)),
})
SQL CREATE TABLE
CREATE TABLE records (
id INTEGER PRIMARY KEY,
email VARCHAR(255) NOT NULL,
status VARCHAR(20) CHECK (status IN ('active', 'pending', 'inactive')),
created_at DATE NOT NULL,
amount DECIMAL(10, 2) CHECK (amount >= 0)
);
TypeScript Interface
interface Record {
id: number;
email: string;
status: 'active' | 'pending' | 'inactive';
created_at: string;
amount: number;
}
Execution Steps
-
Read file sample:
- CSV:
pd.read_csv(path, nrows=1000) - JSON:
pd.read_json(path, lines=True, nrows=1000) - Parquet:
pd.read_parquet(path).head(1000)
- CSV:
-
For each column, analyze: null percentage, unique count/ratio, sample values, pattern matches
-
Generate confidence score (0-100) for each type inference
-
Output schema in requested format with comments explaining inference
Report Format
## Schema Discovery Report
**File:** data.csv
**Rows sampled:** 1000 of 50000
**Columns:** 5
| Column | Inferred Type | Confidence | Notes |
|--------|--------------|------------|-------|
| id | integer | 100% | All positive, unique |
| email | string(email) | 98% | 2% invalid format |
| status | categorical | 100% | 3 unique values |
| created_at | date | 95% | ISO format |
| amount | float | 100% | 2 decimals, no negatives |
### Recommendations
- Add NOT NULL constraint to: id, email, created_at
- Consider UNIQUE constraint on: id, email
- Status should be ENUM: active, pending, inactive
Similar Skills
Stats
Parent Repo Stars30
Parent Repo Forks6
Last CommitMar 15, 2026