Validates pandas DataFrames using pandera with schema definitions, column checks, decorators, error collection, and schema inference. Ideal for ETL pipelines and data engineering.
Install:

```shell
npx claudepluginhub majesticlabs-dev/majestic-marketplace --plugin majestic-data
```
**Audience:** Data engineers validating pandas DataFrames.

**Goal:** Provide pandera patterns for schema validation and type checking.
Execute schema functions from `scripts/schemas.py`:

```python
from scripts.schemas import (
    create_user_schema,
    create_nullable_schema,
    create_date_range_schema,
    UserSchema,
    validate_with_errors,
    infer_and_export_schema,
)
```
Basic validation:

```python
from scripts.schemas import create_user_schema

schema = create_user_schema()
validated_df = schema.validate(df)  # raises SchemaError on failure
```
Collect all validation errors instead of failing on the first one:

```python
from scripts.schemas import create_user_schema, validate_with_errors

schema = create_user_schema()
validated_df, errors = validate_with_errors(df, schema)
if errors:
    for err in errors:
        print(f"{err['column']}: {err['check']} - {err['failure_case']}")
```
Class-based schemas double as type annotations:

```python
import pandas as pd
import pandera as pa

from scripts.schemas import UserSchema

# Validate with type hints
UserSchema.validate(df)

# Use as a function type hint
def process_users(df: pa.typing.DataFrame[UserSchema]) -> pd.DataFrame:
    return df.query("status == 'active'")
```
Infer a schema from an existing DataFrame and export it:

```python
from scripts.schemas import infer_and_export_schema

schema_export = infer_and_export_schema(df)
print(schema_export['python_code'])  # Python schema definition
print(schema_export['yaml'])         # YAML schema
```
| Check Type | Example | Description |
|---|---|---|
| Numeric | `Check.gt(0)`, `Check.in_range(0, 100)` | Comparisons |
| String | `Check.str_matches(r'pattern')` | Regex match |
| Set membership | `Check.isin(['A', 'B'])` | Allowed values |
| Uniqueness | `unique=True` on `Column` | No duplicates |
| Nullable | `nullable=True` on `Column` | Allow nulls |
Validate function inputs and outputs with decorators:

```python
import pandas as pd
import pandera as pa

@pa.check_output(schema)
def load_data(path: str) -> pd.DataFrame:
    return pd.read_csv(path)

@pa.check_input(schema, "df")
def process_data(df: pd.DataFrame) -> pd.DataFrame:
    return df.assign(processed=True)

@pa.check_io(df=input_schema, out=output_schema)
def transform_data(df: pd.DataFrame) -> pd.DataFrame:
    return df.transform(...)  # placeholder transformation
```
| Use Case | Pandera | Alternative |
|---|---|---|
| DataFrame validation | ✓ | - |
| Type hints for DataFrames | ✓ | - |
| ETL pipeline checks | ✓ | Great Expectations |
| Record-level validation | - | Pydantic |
Requirements:

```
pandera>=0.18
pandas
```