This skill should be used when the user asks to "design a data quality framework", "set up data contracts", "plan data validation", "detect data anomalies", or mentions data profiling, quarantine patterns, or remediation workflows. [EXPLICIT] It produces data quality documentation covering profiling, validation rule engines, data contracts, anomaly detection, remediation workflows, and SLA monitoring. [EXPLICIT] Use this skill whenever the user needs data quality strategy or implementation, even if they don't explicitly ask for "data-quality". [EXPLICIT]
Bundled resources: agents/guardian.md, agents/lead.md, agents/specialist.md, agents/support.md, evals/evals.json, knowledge/body-of-knowledge.md, knowledge/knowledge-graph.md, prompts/meta.md, prompts/primary.md, prompts/variations/deep.md, prompts/variations/quick.md, references/quality-patterns.md, templates/output.docx.md, templates/output.html
Data quality architecture defines how organizations detect, prevent, and remediate data issues through profiling, validation rules, anomaly detection, contracts between teams, and SLA monitoring. This skill produces data quality documentation that enables teams to build trust in their data through systematic quality management. [EXPLICIT]
Data quality is not inspected in at the end; it is built into every step. Prevention beats detection. Data contracts between producers and consumers are the first line of defense. Quarantine patterns protect the pipeline without stopping it. Every validation rule has a severity, an owner, and a last-review date.
The user provides a system or project name as $ARGUMENTS. Parse $1 as the system/project name used throughout all output artifacts. [EXPLICIT]
Parameters:
{MODO}: piloto-auto (default) | desatendido | supervisado | paso-a-paso
{FORMATO}: markdown (default) | html | dual
{VARIANTE}: ejecutiva (~40%: S1 profiling + S3 data contracts + S5 remediation) | técnica (full 6 sections, default)

Before generating architecture, detect the project context:
!find . -name "*.py" -o -name "*.yml" -o -name "*.yaml" -o -name "*.sql" -o -name "*.json" | head -30
Use detected tools to tailor recommendations. If reference materials exist, load them:
Read ${CLAUDE_SKILL_DIR}/references/quality-patterns.md
Select based on team profile and existing stack, not feature count alone. [EXPLICIT]
| Criterion | Great Expectations | Soda Core | dbt Tests | Elementary |
|---|---|---|---|---|
| Best for | Python teams, diverse sources | SQL-native teams, fast setup | Teams already in dbt | dbt-native observability |
| Language | Python (Expectations API) | SodaCL (declarative YAML) | SQL + YAML | SQL + dbt macros |
| Learning curve | Steep — rich but verbose | Low — accessible to analysts | Low if dbt-fluent | Low — runs inside dbt |
| Anomaly detection | Custom via profiler | Built-in (SodaCL anomaly checks) | Via Elementary add-on | Built-in (volume, freshness, schema) |
| Data Docs / UI | Yes (auto-generated HTML) | Soda Cloud (paid) | dbt Docs | Elementary Cloud or OSS dashboard |
| CI/CD integration | Checkpoint CLI | soda scan CLI | dbt test | dbt test + edr |
| Cost | OSS free; GX Cloud paid | OSS free; Soda Cloud paid | Free (bundled with dbt) | OSS free; Cloud paid |
| Connector breadth | 50+ (Spark, pandas, SQL) | 20+ (SQL-native) | dbt-supported warehouses | dbt-supported warehouses |
Combined pattern (recommended for mature orgs): dbt tests for transformation-layer validation, Great Expectations for ingestion validation of raw sources, Soda Core for continuous production monitoring and alerting. Elementary adds anomaly detection on top of dbt without external tooling.
Use these standard formulas for dimension and composite scoring. Weight dimensions per domain; no universal formula fits all. [EXPLICIT]
| Dimension | Formula | Target (critical) |
|---|---|---|
| Accuracy | matching_records / total_records | >= 99.5% |
| Completeness | non_null_required_fields / total_required_fields | >= 99.9% |
| Timeliness | p95(available_time - event_time) <= SLA_target | Tier-dependent |
| Consistency | cross_system_matching_records / total_records | >= 99.0% |
| Validity | records_passing_rules / total_records | >= 99.5% |
| Uniqueness | 1 - (duplicate_records / total_records) | >= 99.99% |
Composite quality score: SUM(dimension_score * weight) where weights sum to 1.0. Adjust weights per domain — financial data weights accuracy higher; event streams weight timeliness higher.
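The dimension formulas and the composite score can be sketched in a few lines. The function names and the financial-data weights below are illustrative, not taken from any particular tool:

```python
def completeness(non_null_required: int, total_required: int) -> float:
    """Completeness = non-null required fields / total required fields."""
    return non_null_required / total_required

def uniqueness(duplicates: int, total: int) -> float:
    """Uniqueness = 1 - (duplicate records / total records)."""
    return 1 - duplicates / total

def composite_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of dimension scores; weights must sum to 1.0."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1.0"
    return sum(scores[dim] * weights[dim] for dim in weights)

# Hypothetical financial-data weighting: accuracy dominates.
weights = {"accuracy": 0.4, "completeness": 0.2, "validity": 0.2, "uniqueness": 0.2}
scores = {"accuracy": 0.998, "completeness": 0.999, "validity": 0.995, "uniqueness": 0.9999}
print(round(composite_score(scores, weights), 4))  # 0.998
```

An event-stream domain would instead shift weight toward timeliness; the formula stays the same, only the weight map changes.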
Cost benchmark: Poor data quality costs organizations an average of $12.9M per year (Gartner). Use this to justify governance investment: even a 10% reduction in data incidents saves >$1M annually for a mid-size org.
Establishes the statistical baseline for understanding data characteristics. [EXPLICIT]
Includes:
Key decisions:
Defines systematic data validation with severity classification. [EXPLICIT]
Includes:
Key decisions:
Formalizes agreements between data producers and consumers. Follows the data contract specification pattern (Andrew Jones): contracts defined as version-controlled YAML alongside pipeline code. [EXPLICIT]
Contract specification fields:
Enforcement mechanisms:
Key decisions:
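As a concrete sketch of contract enforcement: the contract below is what a version-controlled YAML file might look like after loading; the field names and the checker are ours, not a formal specification:

```python
# Illustrative contract, as loaded from version-controlled YAML.
CONTRACT = {
    "dataset": "orders",
    "version": "1.2.0",
    "fields": {
        "order_id": {"type": str, "required": True},
        "amount":   {"type": float, "required": True},
        "coupon":   {"type": str, "required": False},
    },
}

def contract_violations(record: dict, contract: dict) -> list[str]:
    """Schema-level checks for one record against the contract."""
    errors = []
    for name, spec in contract["fields"].items():
        value = record.get(name)
        if value is None:
            if spec["required"]:
                errors.append(f"missing required field: {name}")
            continue
        if not isinstance(value, spec["type"]):
            errors.append(f"wrong type for {name}: {type(value).__name__}")
    return errors

print(contract_violations({"order_id": "A-1", "amount": 19.9}, CONTRACT))  # []
```

In practice the same contract file would also drive CI checks on the producer side, so breaking schema changes fail before deployment rather than in the consumer's pipeline.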
Implements statistical and ML-based methods for detecting unexpected data changes. [EXPLICIT]
Statistical methods (start here):
ML-based detection (use when statistical methods produce too many false positives):
Detection targets:
Key decisions:
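The statistical starting point is a simple k-sigma check on pipeline metrics such as daily row counts. A sketch, with k=3 by default and widened to 4 for cold-start baselines:

```python
import statistics

def k_sigma_anomaly(history: list[float], current: float, k: float = 3.0):
    """Flag `current` if it deviates more than k standard deviations
    from the mean of the historical window. Returns (flagged, z)."""
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return current != mean, 0.0
    z = (current - mean) / stdev
    return abs(z) > k, z

# Daily row counts: a sudden 50% volume jump is flagged.
flagged, z = k_sigma_anomaly([1000, 1010, 990, 1005, 995, 1002, 998], 1500)
print(flagged)  # True
```

The same function serves volume, null-rate, and distribution-summary metrics; ML-based detection only becomes worthwhile once checks like this drown teams in false positives.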
Processes for handling data quality failures from detection to resolution. [EXPLICIT]
Quarantine pattern: Isolate bad records in a staging area, continue processing good records. Time-bound: 72h before escalation or auto-discard.
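A minimal sketch of the quarantine split; the timestamp is what lets a scheduler enforce the 72h escalation window, and `is_valid` stands in for whatever rule engine is in place:

```python
import time

def quarantine_split(records, is_valid):
    """Route a batch: valid records continue downstream, invalid ones
    land in quarantine with an arrival timestamp so the 72h
    escalation/auto-discard window can be enforced later."""
    good, quarantined = [], []
    now = time.time()
    for record in records:
        if is_valid(record):
            good.append(record)
        else:
            quarantined.append({"record": record, "quarantined_at": now})
    return good, quarantined

batch = [{"amount": 10.0}, {"amount": None}, {"amount": 3.5}]
good, bad = quarantine_split(batch, lambda r: r["amount"] is not None)
print(len(good), len(bad))  # 2 1
```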
Dead letter queue (DLQ):
SLA Breach Escalation Matrix:
| Tier | Scope | Response Time | Escalation |
|---|---|---|---|
| Tier 1: Revenue-critical | Payment, billing, pricing data | <15 min | Auto-page on-call engineer |
| Tier 2: Operational | Core business metrics, user data | <1 hour | Alert data engineering lead |
| Tier 3: Analytical | Reports, dashboards, ML features | <4 hours | Notify domain data steward |
At 4h+ unresolved for any tier: leadership notification with customer/revenue impact estimate. [EXPLICIT]
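The escalation matrix and the 4-hour leadership rule can be encoded as data plus a small routing function. The tier numbers and action strings mirror the table above; the function itself is a sketch:

```python
# (response window in minutes, first escalation action) per tier
ESCALATION = {
    1: (15, "auto-page on-call engineer"),
    2: (60, "alert data engineering lead"),
    3: (240, "notify domain data steward"),
}

def due_actions(tier: int, minutes_unresolved: float) -> list[str]:
    """Actions that should already have happened for an incident
    unresolved this long."""
    window, action = ESCALATION[tier]
    actions = [action] if minutes_unresolved >= window else []
    if minutes_unresolved >= 240:  # 4h+ unresolved, any tier
        actions.append("notify leadership with customer/revenue impact estimate")
    return actions

print(due_actions(1, 20))  # ['auto-page on-call engineer']
```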
Post-mortem template: Timeline, blast radius (affected datasets/consumers), root cause, corrective action, prevention measures.
Dashboards and metrics for ongoing data quality visibility. [EXPLICIT]
Monitoring targets:
Dashboard audiences:
Reporting cadence: Real-time monitoring + weekly summary + monthly executive report.
SLA targets (negotiate with consumers): 99.5% freshness, 99.9% completeness, 99.5% accuracy for Tier 1 datasets.
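As a sketch, freshness attainment against a negotiated SLA can be measured as the share of loads landing within the agreed window (the numbers below are made up for illustration):

```python
def freshness_attainment(lags_minutes: list[float], sla_minutes: float) -> float:
    """Fraction of loads that arrived within the freshness SLA window."""
    on_time = sum(1 for lag in lags_minutes if lag <= sla_minutes)
    return on_time / len(lags_minutes)

# Ten daily loads, one of them an hour late against a 15-minute window.
lags = [5, 7, 4, 60, 6, 5, 8, 6, 5, 7]
print(freshness_attainment(lags, sla_minutes=15))  # 0.9 -- far below a 99.5% target
```

The same shape works for completeness and accuracy attainment: count passing checks over total checks per reporting period, then compare against the tier's target.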
| Decision | Enables | Constrains | When to Use |
|---|---|---|---|
| Strict Data Contracts | Reliability, upstream accountability | Slower iteration, producer friction | Production-critical pipelines, multi-team |
| Inline Validation | Early detection, prevents propagation | Pipeline latency, compute cost | Critical datasets, real-time pipelines |
| ML Anomaly Detection | Catches novel issues, adapts | Complexity, false positives, training | Large-scale data with complex patterns |
| Statistical Detection | Simple, interpretable, low maintenance | Misses complex patterns | Stable datasets, well-understood distributions |
| Auto-Fix Rules | Reduced manual effort | Risk of incorrect corrections | Deterministic fixes only (formatting, defaults) |
| Quarantine Pattern | Isolates without blocking pipeline | Storage overhead, investigation burden | Streaming pipelines, high-volume ingestion |
No Historical Baseline: Use first 30 days as baseline with wider thresholds (4-sigma instead of 3). Tighten gradually. Accept higher false positive rate initially.
Schema-on-Read Environments: Quality checks become the enforcement layer. Profile and validate on read; quarantine on first access rather than ingestion.
Third-Party Data Sources: No control over producer quality. Contracts are aspirational; monitoring and quarantine are essential. Budget 2-3x more manual remediation time than internal sources.
High-Volume Streaming: Sampling-based profiling (1-5% sample rate). Micro-batch quality windows (1-5 min). Accept statistical confidence instead of deterministic validation.
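Sampling-based profiling can use reservoir sampling (Algorithm R) to keep a uniform fixed-size sample from an unbounded stream; a sketch:

```python
import random

def reservoir_sample(stream, k: int) -> list:
    """Keep a uniform random sample of size k from a stream of unknown
    length, in O(k) memory, so profiling cost stays bounded at volume."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = random.randint(0, i)
            if j < k:
                sample[j] = item
    return sample

sample = reservoir_sample(range(1_000_000), 100)
print(len(sample))  # 100
```

Profiling statistics computed on the sample then carry a confidence interval rather than a deterministic guarantee, which is the trade the scenario above accepts.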
Regulated Data (GDPR, HIPAA): Quality monitoring must not expose PII in logs or dashboards. Aggregate metrics only. Quarantine areas must respect data classification. Audit trail of quality decisions required.
Before finalizing delivery, verify:
| Format | Default | Description |
|---|---|---|
| markdown | Yes | Markdown with embedded Mermaid (validation flow, remediation workflow). |
| html | On demand | Branded HTML (Design System). Visual impact. |
| dual | On demand | Both formats. |
Default output is Markdown with embedded Mermaid diagrams. HTML generation requires explicit {FORMATO}=html parameter. [EXPLICIT]
Primary: A-01_Data_Quality_Framework.html — Data profiling baseline, validation rule engine, data contracts, anomaly detection, remediation workflows, SLA monitoring dashboards.
Secondary: Validation rule catalog, data contract YAML templates, anomaly detection configuration, quality scorecard template, incident post-mortem template.
Author: Javier Montano | Last updated: March 18, 2026