Provides Python data validation functions and pipelines for DataFrames using custom checks, Pydantic, Pandera, and Great Expectations. Includes schema evolution and pytest assertions.
Install via:

```shell
npx claudepluginhub majesticlabs-dev/majestic-marketplace --plugin majestic-data
```
**Audience:** Data engineers building validation pipelines.
**Goal:** Provide validation patterns for custom business rules.
Framework-specific skills:
- pydantic-validation - Record-level validation with Pydantic
- pandera-validation - DataFrame schema validation
- great-expectations - Pipeline expectations and monitoring

Execute validation functions from scripts/validators.py:
```python
from scripts.validators import (
    ValidationResult,
    DataValidator,
    validate_no_duplicates,
    validate_referential_integrity,
    validate_date_range,
    validate_value_in_set,
    run_validation_pipeline,
    validate_with_schema_version,
    assert_schema_match,
    assert_no_nulls,
    assert_unique,
    assert_values_in_set
)
```
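For orientation, here is a minimal sketch of the shape these helpers appear to have. The `ValidationResult` fields (`passed`, `message`, `failed_rows`) are inferred from the usage examples in this document, not from the actual source, so treat this as an illustration of the pattern rather than the shipped code:

```python
from dataclasses import dataclass
from typing import Optional

import pandas as pd


@dataclass
class ValidationResult:
    # Field names inferred from usage (result.passed, result.message,
    # result.failed_rows); the real class in scripts/validators.py may differ.
    passed: bool
    message: str = ""
    failed_rows: Optional[pd.DataFrame] = None


def validate_no_duplicates(df: pd.DataFrame, cols: list) -> ValidationResult:
    """Flag every row whose values in `cols` appear more than once."""
    dupes = df[df.duplicated(subset=cols, keep=False)]
    if dupes.empty:
        return ValidationResult(passed=True, message="No duplicates found")
    return ValidationResult(
        passed=False,
        message=f"{len(dupes)} duplicate rows on {cols}",
        failed_rows=dupes,
    )
```

A check returns a result object instead of raising, so callers can collect failures across many checks before deciding what to do.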
| Use Case | Framework |
|---|---|
| API request/response | Pydantic |
| Record-by-record ETL | Pydantic |
| DataFrame validation | Pandera |
| Type hints for DataFrames | Pandera |
| Pipeline monitoring | Great Expectations |
| Data warehouse checks | Great Expectations |
| Custom business rules | Custom functions (this skill) |
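To make the first two table rows concrete, record-by-record validation with Pydantic typically looks like the following. This is a minimal sketch; the `User` model and its fields are illustrative, not part of this skill:

```python
from pydantic import BaseModel, ValidationError


class User(BaseModel):
    # Illustrative record schema; each record is validated field-by-field.
    id: int
    email: str


records = [
    {"id": 1, "email": "a@example.com"},
    {"id": "not-an-int", "email": "b@example.com"},
]
valid, errors = [], []
for rec in records:
    try:
        valid.append(User(**rec))
    except ValidationError as exc:
        errors.append((rec, exc.errors()))
```

This row-at-a-time style suits API payloads and streaming ETL; for whole-DataFrame checks, Pandera's columnar schemas are usually faster and more concise.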
```python
from scripts.validators import validate_no_duplicates, validate_referential_integrity

# Check duplicates
result = validate_no_duplicates(df, cols=['id'])
if not result.passed:
    print(f"Error: {result.message}")
    print(result.failed_rows)

# Check referential integrity
result = validate_referential_integrity(df, 'user_id', users_df, 'id')
```
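Conceptually, a referential-integrity check like this amounts to a pandas `isin` anti-join: find child rows whose foreign key has no match in the parent table. A sketch of the idea (the sample frames are illustrative, and this is not the skill's actual implementation):

```python
import pandas as pd

df = pd.DataFrame({"user_id": [1, 2, 99]})
users_df = pd.DataFrame({"id": [1, 2, 3]})

# Rows whose foreign key has no matching parent row
orphans = df[~df["user_id"].isin(users_df["id"])]
```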
```python
from scripts.validators import DataValidator, validate_no_duplicates, validate_date_range

validator = DataValidator()
validator.add_check(lambda df: validate_no_duplicates(df, ['id']))
validator.add_check(lambda df: validate_date_range(df, 'created_at', '2020-01-01', '2025-12-31'))

results = validator.validate(df)
if not results['passed']:
    for check in results['checks']:
        if not check['passed']:
            print(f"Failed: {check['message']}")
```
```python
from scripts.validators import run_validation_pipeline

config = {
    'unique_columns': ['id'],
    'date_ranges': {
        'created_at': ('2020-01-01', '2025-12-31'),
        'updated_at': ('2020-01-01', '2025-12-31')
    }
}
clean_df, results = run_validation_pipeline(df, config)
```
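The pipeline's exact contract is defined in scripts/validators.py. One plausible reading of the config keys above (an assumption, not the shipped implementation) is that each check filters out failing rows and records a per-check summary:

```python
import pandas as pd


def run_validation_pipeline(df: pd.DataFrame, config: dict):
    """Hypothetical sketch: drop duplicate keys and out-of-range dates,
    returning the surviving rows plus a per-check summary."""
    results = {}
    mask = pd.Series(True, index=df.index)
    if "unique_columns" in config:
        dup = df.duplicated(subset=config["unique_columns"], keep="first")
        results["unique_columns"] = {"passed": not dup.any(), "failed": int(dup.sum())}
        mask &= ~dup
    for col, (start, end) in config.get("date_ranges", {}).items():
        ok = pd.to_datetime(df[col]).between(start, end)
        results[col] = {"passed": bool(ok.all()), "failed": int((~ok).sum())}
        mask &= ok
    return df[mask], results
```

Returning both the cleaned frame and the summary lets callers quarantine bad rows without aborting the whole run.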
```python
from scripts.validators import assert_schema_match, assert_no_nulls, assert_unique

# In pytest
def test_data_quality():
    assert_schema_match(df, {'id': 'int64', 'email': 'object'})
    assert_no_nulls(df, ['id', 'email'])
    assert_unique(df, ['id'])
```
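These helpers raise `AssertionError` on failure, so pytest reports them as ordinary test failures. A rough pandas approximation of `assert_no_nulls` (a sketch, not the shipped implementation):

```python
import pandas as pd


def assert_no_nulls(df: pd.DataFrame, cols: list) -> None:
    # Raise AssertionError naming the offending columns and their null
    # counts, mirroring how pytest assertion helpers report failures.
    null_counts = df[cols].isna().sum()
    bad = null_counts[null_counts > 0]
    assert bad.empty, f"Null values found: {bad.to_dict()}"
```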