Infers schemas from CSV, JSON, and Parquet files by sampling data, detecting types and patterns, and generating Pydantic models, Pandera schemas, SQL DDL, or TypeScript interfaces. Useful for onboarding unfamiliar datasets.
**Install:**

```
npx claudepluginhub majesticlabs-dev/majestic-marketplace --plugin majestic-data
```

This skill is limited to using the following tools:
- Profiles CSV/TSV/Excel files: detects format, counts rows and headers, computes basic and advanced statistics (kurtosis, Gini, outliers), and shows top value distributions.
- Reads and explores Parquet, CSV, JSON, Arrow IPC, and Avro files, locally or from S3/GCS, using datafusion-cli for schema inspection, row counts, and data previews.
- Handles messy CSVs with encoding detection via chardet, delimiter inference via csv.Sniffer, and malformed row recovery with pandas; useful for cleaning real-world data files (see the sketch after this list).
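A minimal sketch of that messy-CSV flow, assuming a hypothetical `read_messy_csv` helper (illustrative, not the skill's actual implementation):

```python
import csv

import chardet
import pandas as pd

def read_messy_csv(path: str) -> pd.DataFrame:
    """Hypothetical helper: detect encoding and delimiter, then let
    pandas recover what it can from malformed rows."""
    # Detect the encoding from a raw byte sample.
    with open(path, "rb") as f:
        raw = f.read(64_000)
    encoding = chardet.detect(raw)["encoding"] or "utf-8"

    # Infer the delimiter from the decoded sample.
    dialect = csv.Sniffer().sniff(raw.decode(encoding, errors="replace"))

    # Skip rows that fail to parse instead of aborting the whole read.
    return pd.read_csv(
        path,
        encoding=encoding,
        sep=dialect.delimiter,
        on_bad_lines="skip",  # pandas >= 1.3
    )
```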
**Audience:** Data engineers and analysts working with unfamiliar data files.

**Goal:** Analyze data files to infer schema and generate type definitions in multiple output formats.
**Type detection patterns:**

| Type | Pattern |
|------|---------|
| integer | `^\d+$` with no leading zeros (except `"0"`) |
| float | `^\d+\.\d+$` or scientific notation |
| currency | `^\$?\d{1,3}(,\d{3})*(\.\d{2})?$` |
| email | `^[\w\.-]+@[\w\.-]+\.\w+$` |
| URL | `^https?://` |
| UUID | `^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$` |
| phone | `^\+?[\d\s\-\(\)]+$` with 10+ digits |
| date (ISO) | `YYYY-MM-DD` or `YYYY-MM-DDTHH:MM:SS` |
| date (US) | `MM/DD/YYYY` |
| date (EU) | `DD/MM/YYYY` |
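A minimal sketch of how sampled values might be tested against these patterns; `PATTERNS` and `match_ratio` are illustrative names, not part of the skill:

```python
import re

# Illustrative subset of the detection patterns above.
PATTERNS = {
    "integer": re.compile(r"^(0|[1-9]\d*)$"),  # no leading zeros, except "0"
    "email": re.compile(r"^[\w\.-]+@[\w\.-]+\.\w+$"),
    "uuid": re.compile(
        r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$"
    ),
}

def match_ratio(values: list[str], pattern: re.Pattern) -> float:
    """Fraction of non-empty sample values matching a pattern."""
    non_empty = [v for v in values if v]
    if not non_empty:
        return 0.0
    return sum(bool(pattern.match(v)) for v in non_empty) / len(non_empty)
```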
**Pydantic model:**

```python
from datetime import date
from typing import Literal

from pydantic import BaseModel, Field

class Record(BaseModel):
    id: int = Field(gt=0)
    email: str = Field(pattern=r'^[\w\.-]+@[\w\.-]+\.\w+$')
    status: Literal['active', 'pending', 'inactive']
    created_at: date
    amount: float = Field(ge=0)
```
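A generated model then validates rows directly (a usage sketch; the row values are illustrative):

```python
row = {"id": 1, "email": "a@example.com", "status": "active",
       "created_at": "2024-01-15", "amount": 19.99}
record = Record(**row)  # raises ValidationError if a value breaks a constraint
```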
**Pandera schema:**

```python
import pandera as pa

schema = pa.DataFrameSchema({
    "id": pa.Column(int, pa.Check.gt(0), unique=True),
    "email": pa.Column(str, pa.Check.str_matches(r'^[\w\.-]+@')),
    "status": pa.Column(str, pa.Check.isin(['active', 'pending', 'inactive'])),
    "created_at": pa.Column("datetime64[ns]"),
    "amount": pa.Column(float, pa.Check.ge(0)),
})
```
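Usage sketch (the DataFrame contents are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2],
    "email": ["a@example.com", "b@example.com"],
    "status": ["active", "pending"],
    "created_at": pd.to_datetime(["2024-01-01", "2024-01-02"]),
    "amount": [10.0, 0.0],
})
validated = schema.validate(df)  # raises SchemaError on violations
```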
**SQL DDL:**

```sql
CREATE TABLE records (
    id INTEGER PRIMARY KEY,
    email VARCHAR(255) NOT NULL,
    status VARCHAR(20) CHECK (status IN ('active', 'pending', 'inactive')),
    created_at DATE NOT NULL,
    amount DECIMAL(10, 2) CHECK (amount >= 0)
);
```
**TypeScript interface:**

```typescript
interface Record {
  id: number;
  email: string;
  status: 'active' | 'pending' | 'inactive';
  created_at: string;
  amount: number;
}
```
**Workflow:**

1. Read a file sample:
   - CSV: `pd.read_csv(path, nrows=1000)`
   - JSON Lines: `pd.read_json(path, lines=True, nrows=1000)`
   - Parquet: `pd.read_parquet(path).head(1000)`
2. For each column, analyze: null percentage, unique count/ratio, sample values, and pattern matches.
3. Generate a confidence score (0-100) for each type inference (see the sketch after this list).
4. Output the schema in the requested format, with comments explaining each inference.
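One way such a confidence score could be derived, reusing the hypothetical `PATTERNS` and `match_ratio` helpers sketched earlier (an illustrative heuristic, not the skill's documented algorithm):

```python
def infer_with_confidence(values: list[str]) -> tuple[str, int]:
    """Return (type_name, confidence 0-100) for a sampled column."""
    non_empty = [v for v in values if v]
    if not non_empty:
        return "unknown", 0
    # Pick the pattern that matches the largest share of the sample.
    best_type, best_ratio = "string", 0.0
    for name, pattern in PATTERNS.items():
        ratio = match_ratio(non_empty, pattern)
        if ratio > best_ratio:
            best_type, best_ratio = name, ratio
    if best_ratio == 0.0:
        return "string", 100  # every value is at least a valid string
    return best_type, round(best_ratio * 100)
```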
## Schema Discovery Report
**File:** data.csv
**Rows sampled:** 1000 of 50000
**Columns:** 5
| Column | Inferred Type | Confidence | Notes |
|--------|--------------|------------|-------|
| id | integer | 100% | All positive, unique |
| email | string(email) | 98% | 2% invalid format |
| status | categorical | 100% | 3 unique values |
| created_at | date | 95% | ISO format |
| amount | float | 100% | 2 decimals, no negatives |
### Recommendations
- Add NOT NULL constraint to: id, email, created_at
- Consider UNIQUE constraint on: id, email
- Status should be an ENUM: active, pending, inactive