Run comprehensive validation suites against datasets and generate pass/fail reports. Use this to validate data quality, schema, completeness, and business rules before processing.

Install via the plugin marketplace:

```
/plugin marketplace add majesticlabs-dev/majestic-marketplace
/plugin install majestic-data@majestic-marketplace
```

The plugin provides an autonomous agent that executes validation suites and produces quality reports. An example suite definition:
```yaml
# validation_suite.yml
suite_name: orders_validation
dataset: orders
version: 1.0

validations:
  # Schema checks
  - name: required_columns
    type: schema
    check: columns_exist
    columns: [order_id, customer_id, order_date, total]
    severity: critical

  - name: column_types
    type: schema
    check: dtype_match
    expectations:
      order_id: int64
      customer_id: int64
      order_date: datetime64
      total: float64
    severity: critical

  # Completeness checks
  - name: no_null_ids
    type: completeness
    check: not_null
    columns: [order_id, customer_id]
    severity: critical

  - name: row_count_range
    type: completeness
    check: row_count
    min: 1000
    max: 1000000
    severity: warning

  # Consistency checks
  - name: date_order
    type: consistency
    check: column_compare
    expression: "created_at <= updated_at"
    severity: error

  - name: foreign_key
    type: consistency
    check: referential_integrity
    column: customer_id
    reference_table: customers
    reference_column: id
    severity: error

  # Accuracy checks
  - name: positive_totals
    type: accuracy
    check: value_range
    column: total
    min: 0
    severity: error

  - name: valid_status
    type: accuracy
    check: allowed_values
    column: status
    values: [pending, confirmed, shipped, delivered, cancelled]
    severity: error
```
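A suite file like this can be loaded and dispatched with a small driver. The sketch below is illustrative, not the plugin's actual loader: `run_suite`, the check-to-method mapping, and the PyYAML dependency are assumptions layered on top of the `DataValidator` class shown later in this document.

```python
import pandas as pd
import yaml  # PyYAML; an assumed dependency for this sketch

def run_suite(path: str, df: pd.DataFrame) -> dict:
    """Hypothetical driver: load a suite YAML and dispatch each check."""
    with open(path) as f:
        suite = yaml.safe_load(f)

    validator = DataValidator(df)  # defined later in this document
    for v in suite["validations"]:
        sev = Severity(v.get("severity", "error"))
        if v["check"] == "not_null":
            validator.validate_not_null(v["columns"], severity=sev)
        elif v["check"] == "value_range":
            validator.validate_range(v["column"], v.get("min"), v.get("max"))
        # Schema, referential_integrity, and allowed_values handlers
        # are omitted here to keep the sketch short.
    return validator.run_all()
```

Executing a suite produces a markdown report like the one below.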
# Validation Report
**Dataset:** orders
**Suite:** orders_validation v1.0
**Executed:** 2024-01-15 10:30:00 UTC
**Duration:** 4.2 seconds
## Summary
| Severity | Total | Passed | Failed |
|----------|-------|--------|--------|
| Critical | 3 | 3 | 0 |
| Error | 4 | 3 | 1 |
| Warning | 1 | 1 | 0 |
| **Total** | **8** | **7** | **1** |

**Overall Status:** ⚠️ PASSED WITH ERRORS (no critical failures; one error-level check failed)
## Failed Validations
### ❌ foreign_key (ERROR)
- **Check:** referential_integrity
- **Column:** customer_id
- **Issue:** 47 rows reference non-existent customers
- **Sample IDs:** [12345, 12346, 12350, ...]
- **Recommendation:** Check customer sync or handle orphaned orders
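Triage usually starts by isolating the orphaned rows. A minimal pandas sketch, using toy stand-ins for the real `orders` and `customers` tables:

```python
import pandas as pd

# Toy frames standing in for the real tables (illustration only).
orders = pd.DataFrame({"order_id": [1, 2, 3], "customer_id": [10, 11, 99]})
customers = pd.DataFrame({"id": [10, 11]})

# The rows the referential_integrity check flags: customer_ids with no
# matching customers.id.
orphaned = orders[~orders["customer_id"].isin(customers["id"])]
print(f"{len(orphaned)} orphaned orders")  # 1 (customer 99)

# Typical remediations: re-sync customers, quarantine the rows for review,
# or filter them out if business rules allow.
orders_clean = orders[orders["customer_id"].isin(customers["id"])]
```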
## Passed Validations
| # | Name | Type | Status |
|---|------|------|--------|
| 1 | required_columns | schema | ✅ |
| 2 | column_types | schema | ✅ |
| 3 | no_null_ids | completeness | ✅ |
| 4 | row_count_range | completeness | ✅ |
| 5 | date_order | consistency | ✅ |
| 6 | positive_totals | accuracy | ✅ |
| 7 | valid_status | accuracy | ✅ |
## Data Quality Score
**Score: 94.7 / 100**
- Completeness: 100%
- Consistency: 85% (FK violations)
- Accuracy: 100%
- Schema: 100%
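The report does not document how the composite score is weighted. One plausible construction is a dimension-weighted average of pass rates; the weights below are assumptions for illustration and only approximate the figure above:

```python
# Hypothetical dimension weights -- the plugin's actual formula is not
# documented here. Consistency is weighted highest on the assumption that
# FK violations break downstream joins.
DIMENSION_WEIGHTS = {
    "completeness": 0.25,
    "consistency": 0.35,
    "accuracy": 0.20,
    "schema": 0.20,
}

def quality_score(pass_rates: dict[str, float]) -> float:
    """Weighted average of per-dimension pass rates (each 0-100)."""
    return sum(DIMENSION_WEIGHTS[d] * r for d, r in pass_rates.items())

score = quality_score(
    {"completeness": 100, "consistency": 85, "accuracy": 100, "schema": 100}
)
print(round(score, 1))  # 94.8 -- close to, not exactly, the report's 94.7
```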
Core validator implementation:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any

import pandas as pd


class Severity(Enum):
    CRITICAL = "critical"
    ERROR = "error"
    WARNING = "warning"
    INFO = "info"


@dataclass
class ValidationResult:
    name: str
    passed: bool
    severity: Severity
    message: str
    details: dict[str, Any] | None = None
    failed_rows: pd.DataFrame | None = None


class DataValidator:
    def __init__(self, df: pd.DataFrame):
        self.df = df
        self.results: list[ValidationResult] = []

    def _record(self, result: ValidationResult) -> ValidationResult:
        """Store a result so run_all() can summarize it."""
        self.results.append(result)
        return result

    def validate_not_null(self, columns: list[str],
                          severity: Severity = Severity.ERROR) -> ValidationResult:
        # Returns on the first failing column; remaining columns are not checked.
        for col in columns:
            null_count = int(self.df[col].isnull().sum())
            if null_count > 0:
                return self._record(ValidationResult(
                    name=f"not_null_{col}",
                    passed=False,
                    severity=severity,
                    message=f"Column {col} has {null_count} null values",
                    failed_rows=self.df[self.df[col].isnull()],
                ))
        return self._record(ValidationResult(
            name="not_null",
            passed=True,
            severity=severity,
            message=f"All columns {columns} have no nulls",
        ))

    def validate_unique(self, columns: list[str],
                        severity: Severity = Severity.ERROR) -> ValidationResult:
        dup_count = int(self.df.duplicated(subset=columns).sum())
        if dup_count > 0:
            return self._record(ValidationResult(
                name="unique",
                passed=False,
                severity=severity,
                message=f"Found {dup_count} duplicate rows for {columns}",
                failed_rows=self.df[self.df.duplicated(subset=columns, keep=False)],
            ))
        return self._record(ValidationResult(
            name="unique",
            passed=True,
            severity=severity,
            message=f"All rows unique for {columns}",
        ))

    def validate_range(self, column: str, min_val: float | None = None,
                       max_val: float | None = None) -> ValidationResult:
        # Build the mask on the frame's own index so alignment is preserved
        # even for non-default indexes.
        violations = pd.Series(False, index=self.df.index)
        if min_val is not None:
            violations |= self.df[column] < min_val
        if max_val is not None:
            violations |= self.df[column] > max_val
        if violations.any():
            return self._record(ValidationResult(
                name=f"range_{column}",
                passed=False,
                severity=Severity.ERROR,
                message=f"Column {column} has {int(violations.sum())} out-of-range values",
                failed_rows=self.df[violations],
            ))
        return self._record(ValidationResult(
            name=f"range_{column}",
            passed=True,
            severity=Severity.ERROR,
            message=f"All values in {column} within range",
        ))

    def run_all(self) -> dict:
        # Overall pass/fail turns only on critical checks; error- and
        # warning-level failures are reported but do not fail the run.
        critical_failed = any(
            r.severity == Severity.CRITICAL and not r.passed for r in self.results
        )
        return {
            "passed": not critical_failed,
            "total_checks": len(self.results),
            "passed_checks": sum(1 for r in self.results if r.passed),
            "results": self.results,
        }
```
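A quick usage sketch (the toy DataFrame and its values are made up for illustration; column names follow the orders example above):

```python
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "customer_id": [10, 11, None],
    "total": [25.0, -5.0, 12.5],
})

validator = DataValidator(orders)
validator.validate_not_null(["order_id", "customer_id"], severity=Severity.CRITICAL)
validator.validate_unique(["order_id"])
validator.validate_range("total", min_val=0)

report = validator.run_all()
print(f"{report['passed_checks']}/{report['total_checks']} checks passed")  # 1/3
print("overall:", report["passed"])  # False: a CRITICAL check failed
```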