Data quality framework — profiling, validation, anomaly detection, data contracts, SLA monitoring. Use when the user asks to "design data quality framework", "set up data contracts", "plan data validation", "detect data anomalies", "define data SLAs", or mentions data profiling, quarantine patterns, or remediation workflows.
Data quality architecture defines how organizations detect, prevent, and remediate data issues through profiling, validation rules, anomaly detection, contracts between teams, and SLA monitoring. This skill produces data quality documentation that enables teams to build trust in their data through systematic quality management.
Data quality is not inspected at the end; it is built in at every step. Prevention beats detection. Data contracts between producers and consumers are the first line of defense. Quarantine patterns protect the pipeline without halting it. Every validation rule has a severity, an owner, and a last-review date.
The user provides a system or project name as $ARGUMENTS. Parse $1 as the system/project name used throughout all output artifacts.
Parameters:
{MODO}: piloto-auto (default) | desatendido | supervisado | paso-a-paso
{FORMATO}: markdown (default) | html | dual
{VARIANTE}: ejecutiva (~40%: S1 profiling + S3 data contracts + S5 remediation) | técnica (full 6 sections, default)
Before generating architecture, detect the project context:
!find . -name "*.py" -o -name "*.yml" -o -name "*.yaml" -o -name "*.sql" -o -name "*.json" | head -30
Use detected tools to tailor recommendations. If reference materials exist, load them:
Read ${CLAUDE_SKILL_DIR}/references/quality-patterns.md
Select based on team profile and existing stack, not feature count alone.
| Criterion | Great Expectations | Soda Core | dbt Tests | Elementary |
|---|---|---|---|---|
| Best for | Python teams, diverse sources | SQL-native teams, fast setup | Teams already in dbt | dbt-native observability |
| Language | Python (Expectations API) | SodaCL (declarative YAML) | SQL + YAML | SQL + dbt macros |
| Learning curve | Steep — rich but verbose | Low — accessible to analysts | Low if dbt-fluent | Low — runs inside dbt |
| Anomaly detection | Custom via profiler | Built-in (SodaCL anomaly checks) | Via Elementary add-on | Built-in (volume, freshness, schema) |
| Data Docs / UI | Yes (auto-generated HTML) | Soda Cloud (paid) | dbt Docs | Elementary Cloud or OSS dashboard |
| CI/CD integration | Checkpoint CLI | soda scan CLI | dbt test | dbt test + edr |
| Cost | OSS free; GX Cloud paid | OSS free; Soda Cloud paid | Free (bundled with dbt) | OSS free; Cloud paid |
| Connector breadth | 50+ (Spark, pandas, SQL) | 20+ (SQL-native) | dbt-supported warehouses | dbt-supported warehouses |
Combined pattern (recommended for mature orgs): dbt tests for transformation-layer validation, Great Expectations for ingestion validation of raw sources, Soda Core for continuous production monitoring and alerting. Elementary adds anomaly detection on top of dbt without external tooling.
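For the transformation-layer validation in the combined pattern above, a minimal dbt `schema.yml` fragment might look like the following. Model and column names are illustrative assumptions, not part of any prescribed schema.

```yaml
# Illustrative dbt tests for an orders model (names are assumptions)
version: 2
models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - not_null
          - unique
      - name: amount_usd
        tests:
          - not_null
```

Running `dbt test` then executes these checks against the warehouse as part of CI/CD.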
Use these standard formulas for composite scoring. Weight per domain; no universal formula fits all.
| Dimension | Formula | Target (critical) |
|---|---|---|
| Accuracy | matching_records / total_records | >= 99.5% |
| Completeness | non_null_required_fields / total_required_fields | >= 99.9% |
| Timeliness | p95(event_time - available_time) <= SLA_target | Tier-dependent |
| Consistency | cross_system_matching_records / total_records | >= 99.0% |
| Validity | records_passing_rules / total_records | >= 99.5% |
| Uniqueness | 1 - (duplicate_records / total_records) | >= 99.99% |
Composite quality score: SUM(dimension_score * weight) where weights sum to 1.0. Adjust weights per domain — financial data weights accuracy higher; event streams weight timeliness higher.
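The composite score above can be sketched in a few lines. The weights and metric values below are illustrative assumptions for a financial-data domain (accuracy weighted higher, per the guidance), not prescribed values.

```python
# Hypothetical composite quality score following the dimension formulas above.
def composite_quality_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """SUM(dimension_score * weight); weights must sum to 1.0."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1.0")
    return sum(scores[dim] * weights[dim] for dim in weights)

# Illustrative financial-data weighting: accuracy weighted highest.
weights = {"accuracy": 0.30, "completeness": 0.20, "timeliness": 0.10,
           "consistency": 0.15, "validity": 0.15, "uniqueness": 0.10}
scores = {"accuracy": 0.997, "completeness": 0.999, "timeliness": 0.98,
          "consistency": 0.992, "validity": 0.996, "uniqueness": 0.9999}
print(round(composite_quality_score(scores, weights), 4))  # → 0.9951
```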
Cost benchmark: Poor data quality costs organizations an average of $12.9M per year (Gartner). Use this to justify governance investment: even a 10% reduction in data incidents saves >$1M annually for a mid-size org.
Establishes the statistical baseline for understanding data characteristics.
Includes:
Key decisions:
Defines systematic data validation with severity classification.
Includes:
Key decisions:
Formalizes agreements between data producers and consumers. Follows the data contract specification pattern (Andrew Jones): contracts defined as version-controlled YAML alongside pipeline code.
Contract specification fields:
Enforcement mechanisms:
Key decisions:
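A minimal sketch of a version-controlled contract file in the Andrew Jones pattern described above. All field names, values, and the file path are illustrative assumptions, not a normative contract schema.

```yaml
# contracts/orders.yaml: illustrative data contract (fields are assumptions)
dataset: analytics.orders
version: 1.2.0
owner: payments-team@example.com
consumers:
  - finance-reporting
  - ml-features
schema:
  - name: order_id
    type: string
    constraints: [not_null, unique]
  - name: amount_usd
    type: decimal(12,2)
    constraints: [not_null, ">= 0"]
sla:
  freshness: 1h        # data available within 1h of event time
  completeness: 99.9%  # Tier 1 target from SLA monitoring
severity_on_breach: tier1  # auto-page on-call per escalation matrix
```

Because the contract lives alongside pipeline code, schema changes surface in code review, where consumers can object before a breaking change ships.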
Implements statistical and ML-based methods for detecting unexpected data changes.
Statistical methods (start here):
ML-based detection (use when statistical methods produce too many false positives):
Detection targets:
Key decisions:
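A minimal sketch of the "start here" statistical approach: a z-score check on a daily row-count series. The window and the 3-sigma default are assumptions to tune per dataset (the no-baseline edge case below suggests starting wider, at 4-sigma).

```python
# Hedged sketch: sigma-threshold anomaly check on a metric history.
import statistics

def is_anomalous(history: list[float], today: float, sigma: float = 3.0) -> bool:
    """Flag today's value if it deviates more than `sigma` std devs from history."""
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        return today != mean
    return abs(today - mean) / stdev > sigma

row_counts = [10_100, 9_950, 10_230, 10_080, 9_990, 10_150, 10_040]
print(is_anomalous(row_counts, 4_200))   # sudden volume drop → True
print(is_anomalous(row_counts, 10_120))  # within normal range → False
```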
Processes for handling data quality failures from detection to resolution.
Quarantine pattern: Isolate bad records in a staging area, continue processing good records. Time-bound: 72h before escalation or auto-discard.
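The quarantine pattern above can be sketched as a batch splitter: failing records are routed aside with their errors and a timestamp (for the 72h time bound), while good records continue. The `validate` callable and record shapes are assumptions.

```python
# Minimal quarantine-pattern sketch: isolate bad records, keep processing good ones.
from datetime import datetime, timezone

def split_batch(records, validate):
    good, quarantined = [], []
    for rec in records:
        errors = validate(rec)  # returns a list of error strings; empty means valid
        if errors:
            quarantined.append({
                "record": rec,
                "errors": errors,
                # timestamp supports the 72h escalate-or-discard time bound
                "quarantined_at": datetime.now(timezone.utc).isoformat(),
            })
        else:
            good.append(rec)
    return good, quarantined

validate = lambda r: [] if r.get("amount", -1) >= 0 else ["negative amount"]
good, bad = split_batch([{"amount": 10}, {"amount": -5}], validate)
print(len(good), len(bad))  # → 1 1
```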
Dead letter queue (DLQ):
SLA Breach Escalation Matrix:
| Tier | Scope | Response Time | Escalation |
|---|---|---|---|
| Tier 1: Revenue-critical | Payment, billing, pricing data | <15 min | Auto-page on-call engineer |
| Tier 2: Operational | Core business metrics, user data | <1 hour | Alert data engineering lead |
| Tier 3: Analytical | Reports, dashboards, ML features | <4 hours | Notify domain data steward |
At 4h+ unresolved for any tier: leadership notification with customer/revenue impact estimate.
Post-mortem template: Timeline, blast radius (affected datasets/consumers), root cause, corrective action, prevention measures.
Dashboards and metrics for ongoing data quality visibility.
Monitoring targets:
Dashboard audiences:
Reporting cadence: Real-time monitoring + weekly summary + monthly executive report.
SLA targets (negotiate with consumers): 99.5% freshness, 99.9% completeness, 99.5% accuracy for Tier 1 datasets.
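If Soda Core is the monitoring tool, the Tier 1 targets above translate into SodaCL checks along these lines. The dataset and column names are assumptions; thresholds come from the negotiated SLA.

```yaml
# Illustrative SodaCL checks for Tier 1 freshness/completeness/uniqueness
checks for orders:
  - freshness(created_at) < 1h
  - missing_count(order_id) = 0      # completeness on a required field
  - duplicate_count(order_id) = 0    # uniqueness
```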
| Decision | Enables | Constrains | When to Use |
|---|---|---|---|
| Strict Data Contracts | Reliability, upstream accountability | Slower iteration, producer friction | Production-critical pipelines, multi-team |
| Inline Validation | Early detection, prevents propagation | Pipeline latency, compute cost | Critical datasets, real-time pipelines |
| ML Anomaly Detection | Catches novel issues, adapts | Complexity, false positives, training | Large-scale data with complex patterns |
| Statistical Detection | Simple, interpretable, low maintenance | Misses complex patterns | Stable datasets, well-understood distributions |
| Auto-Fix Rules | Reduced manual effort | Risk of incorrect corrections | Deterministic fixes only (formatting, defaults) |
| Quarantine Pattern | Isolates without blocking pipeline | Storage overhead, investigation burden | Streaming pipelines, high-volume ingestion |
| Case | Handling Strategy |
|---|---|
| No historical baseline | Use the first 30 days as baseline with wide thresholds (4-sigma); tighten gradually; accept a higher false-positive rate initially |
| Schema-on-read environment | Quality checks as the enforcement layer; profiling and validation on read; quarantine at first access instead of at ingestion |
| Third-party data sources | No control over producer quality; contracts are aspirational; monitoring and quarantine are essential; budget 2-3x more manual remediation |
| High-volume streaming | Sampling-based profiling (1-5%); micro-batch quality windows (1-5 min); accept statistical confidence instead of deterministic validation |
| Regulated data (GDPR, HIPAA) | Monitoring must not expose PII in logs or dashboards; aggregate metrics only; quarantine respects data classification; audit trail of quality decisions is mandatory |
| Decision | Rejected Alternative | Rationale |
|---|---|---|
| Prevention (data contracts) over detection (monitoring) | Post-ingestion detection only | Data contracts between producers and consumers are the first line of defense; detecting issues post-ingestion is 10x more expensive than preventing them |
| Quarantine pattern to isolate without halting the pipeline | Entire pipeline halts on any error | In high-volume pipelines, stopping everything over bad records hurts the SLA of good data; quarantine isolates the bad records and continues with the good ones |
| Auto-fix only for deterministic corrections | Auto-fix for any error type | Non-deterministic corrections (ambiguous business logic, cross-system conflicts) require human judgment; an incorrect auto-fix is worse than no fix |
| Alerts calibrated for < 5 false positives/week | Very tight thresholds that maximize detection | Alert fatigue is the #1 reason teams ignore real alerts; < 5 FP/week is the sustainable attention threshold |
```mermaid
graph TD
    subgraph Core["Core: Data Quality"]
        PROF[Data Profiling & Baseline]
        VAL[Validation Rule Engine]
        DC[Data Contracts]
        ANOM[Anomaly Detection]
        REM[Remediation Workflows]
        SLA[SLA Monitoring]
    end
    subgraph Inputs["Inputs"]
        DATA[Raw Data Sources]
        SCHEMA[Schema Definitions]
        BIZ[Business Rules]
        HIST[Historical Baselines]
    end
    subgraph Outputs["Outputs"]
        RULES[Validation Rule Catalog]
        CONTR[Contract YAML Templates]
        DASH[Quality Dashboard]
        REPORT[Incident Post-mortems]
    end
    subgraph Related["Related Skills"]
        DGOV[data-governance]
        DENG[data-engineering]
        BI[bi-architecture]
        OBS[observability]
    end
    DATA --> PROF
    SCHEMA --> VAL
    BIZ --> DC
    HIST --> ANOM
    PROF --> VAL --> DC --> ANOM --> REM --> SLA
    SLA --> RULES
    SLA --> CONTR
    SLA --> DASH
    SLA --> REPORT
    DGOV --> DC
    RULES --> DENG
    DASH --> BI
    SLA --> OBS
```
| Format | Name | Content |
|---|---|---|
| Markdown | A-01_Data_Quality_Framework.md | Complete framework with profiling baseline, validation rule engine, data contracts, anomaly detection, remediation workflows, and SLA monitoring. Mermaid diagrams of the validation flow and remediation workflow. |
| XLSX | A-01_Data_Quality_Scorecard.xlsx | Interactive scorecard with composite quality score per dimension (accuracy, completeness, timeliness, consistency, validity, uniqueness), 90-day trends, and SLA compliance per dataset. |
| HTML | A-01_Data_Quality_Framework_{cliente}_{WIP}.html | Same content as branded HTML (Design System MetodologIA v5). Light-First Technical, self-contained, WCAG AA, responsive. Includes per-dimension quality scorecard, visual SLA escalation matrix, and interactive remediation workflow with DLQ status. |
| DOCX | {fase}_Data_Quality_Framework_{cliente}_{WIP}.docx | Formal document via python-docx (Design System MetodologIA v5). Cover page, auto TOC, branded headers/footers, zebra tables. Poppins headings (navy), Montserrat body, gold accents. |
| PPTX | {fase}_Data_Quality_Framework_{cliente}_{WIP}.pptx | Via python-pptx with MetodologIA Design System v5. Navy gradient slide master, Poppins titles, Montserrat body, gold accents. Max 20 slides (executive) / 30 (technical). Speaker notes with evidence references. |
| Dimension | Weight | Criterion |
|---|---|---|
| Trigger Accuracy | 10% | Description fires the correct triggers (data quality, data contracts, validation, anomaly detection, SLA) without false positives against data-governance or analytics-engineering |
| Completeness | 25% | The 6 sections cover profiling, validation, contracts, anomaly detection, remediation, and SLA monitoring without gaps; all quality dimensions represented |
| Clarity | 20% | Instructions executable without ambiguity; quality formulas with numeric targets; severity classification with actions per level; SLA tiers with response times |
| Robustness | 20% | Handles no-baseline, schema-on-read, third-party sources, high-volume streaming, and regulated data with differentiated strategies |
| Efficiency | 10% | Process has no redundant steps; executive variant reduces to S1+S3+S5 without losing contracts and remediation |
| Value Density | 15% | Every section delivers direct practical value; the quality dimension formulas and SLA escalation matrix are immediately operational tools |
Minimum threshold: 7/10.
Before finalizing delivery, verify:
| Format | Default | Description |
|---|---|---|
| markdown | Yes | Markdown with embedded Mermaid (validation flow, remediation workflow). |
| html | On demand | Branded HTML (Design System). Visual impact. |
| dual | On demand | Both formats. |
Default output is Markdown with embedded Mermaid diagrams. HTML generation requires explicit {FORMATO}=html parameter.
Primary: A-01_Data_Quality_Framework.md — data profiling baseline, validation rule engine, data contracts, anomaly detection, remediation workflows, SLA monitoring dashboards.
Secondary: Validation rule catalog, data contract YAML templates, anomaly detection configuration, quality scorecard template, incident post-mortem template.
Author: Javier Montaño | Last updated: March 12, 2026