Skill

schema-discoverer

Infer schema from sample data files (CSV, JSON, Parquet) and generate type definitions.

From majestic-data
Install
1
Run in your terminal
$
npx claudepluginhub majesticlabs-dev/majestic-marketplace --plugin majestic-data
Tool Access

This skill is limited to using the following tools:

Read Grep Glob Bash
Skill Content

Schema Discoverer

Audience: Data engineers and analysts working with unfamiliar data files.

Goal: Analyze data files to infer schema and generate type definitions in multiple output formats.

Workflow

  1. Identify file format from extension and content
  2. Sample the data (first 1000 rows for large files)
  3. Infer column types with confidence scores
  4. Detect patterns (dates, emails, IDs, categories)
  5. Generate output in requested format

Type Inference Rules

Numeric Detection

  • Integer: All values match ^\d+$ with no leading zeros (except "0")
  • Float: Values match ^\d+\.\d+$ or scientific notation
  • Currency: Values match ^\$?\d{1,3}(,\d{3})*(\.\d{2})?$

String Patterns

  • Email: ^[\w\.-]+@[\w\.-]+\.\w+$
  • URL: ^https?://
  • UUID: ^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$
  • Phone: ^\+?[\d\s\-\(\)]+$ with 10+ digits

Date Patterns

  • ISO: YYYY-MM-DD or YYYY-MM-DDTHH:MM:SS
  • US: MM/DD/YYYY
  • EU: DD/MM/YYYY
  • Timestamp: Unix epoch (10 or 13 digits)

Categorical Detection

  • Unique ratio < 5% of total rows
  • Repeated values dominate

Output Formats

Pydantic Model

from pydantic import BaseModel, Field
from datetime import date
from typing import Literal

class Record(BaseModel):
    id: int = Field(gt=0)
    email: str = Field(pattern=r'^[\w\.-]+@[\w\.-]+\.\w+$')
    status: Literal['active', 'pending', 'inactive']
    created_at: date
    amount: float = Field(ge=0)

Pandera Schema

import pandera as pa

schema = pa.DataFrameSchema({
    "id": pa.Column(int, pa.Check.gt(0), unique=True),
    "email": pa.Column(str, pa.Check.str_matches(r'^[\w\.-]+@')),
    "status": pa.Column(str, pa.Check.isin(['active', 'pending', 'inactive'])),
    "created_at": pa.Column("datetime64[ns]"),
    "amount": pa.Column(float, pa.Check.ge(0)),
})

SQL CREATE TABLE

CREATE TABLE records (
    id INTEGER PRIMARY KEY,
    email VARCHAR(255) NOT NULL,
    status VARCHAR(20) CHECK (status IN ('active', 'pending', 'inactive')),
    created_at DATE NOT NULL,
    amount DECIMAL(10, 2) CHECK (amount >= 0)
);

TypeScript Interface

interface Record {
    id: number;
    email: string;
    status: 'active' | 'pending' | 'inactive';
    created_at: string;
    amount: number;
}

Execution Steps

  1. Read file sample:

    • CSV: pd.read_csv(path, nrows=1000)
    • JSON: pd.read_json(path, lines=True, nrows=1000)
    • Parquet: pd.read_parquet(path).head(1000)
  2. For each column, analyze: null percentage, unique count/ratio, sample values, pattern matches

  3. Generate confidence score (0-100) for each type inference

  4. Output schema in requested format with comments explaining inference

Report Format

## Schema Discovery Report

**File:** data.csv
**Rows sampled:** 1000 of 50000
**Columns:** 5

| Column | Inferred Type | Confidence | Notes |
|--------|--------------|------------|-------|
| id | integer | 100% | All positive, unique |
| email | string(email) | 98% | 2% invalid format |
| status | categorical | 100% | 3 unique values |
| created_at | date | 95% | ISO format |
| amount | float | 100% | 2 decimals, no negatives |

### Recommendations
- Add NOT NULL constraint to: id, email, created_at
- Consider UNIQUE constraint on: id, email
- Status should be ENUM: active, pending, inactive
Stats
Parent Repo Stars30
Parent Repo Forks6
Last CommitMar 15, 2026