Data pipeline architecture — ingestion, orchestration, quality, lineage, SLAs. Use when the user asks to "design data pipelines", "architect ingestion", "set up orchestration", "plan data lake", "design lakehouse", or mentions Airflow, Dagster, CDC, data lineage, or pipeline SLAs.
Data engineering architecture defines how data is ingested, orchestrated, stored, validated, and observed — the backbone infrastructure that feeds analytics, ML, and operational systems. This skill produces data engineering documentation that enables teams to build reliable, scalable, and cost-efficient data platforms.
A pipeline that cannot be observed cannot be trusted. Data engineering designs how data flows from sources to consumers, with validated quality, traceable lineage, and measurable SLAs at every step. Data does not simply "arrive"; it is orchestrated.
The user provides a system or project name as $ARGUMENTS. Parse $1 as the system/project name used throughout all output artifacts.
Parameters:
{MODO}: piloto-auto (auto-pilot, default) | desatendido (unattended) | supervisado (supervised) | paso-a-paso (step-by-step)
{FORMATO}: markdown (default) | html | dual
{VARIANTE}: ejecutiva (executive, ~40%: S1 ingestion + S3 storage + S5 lineage) | técnica (technical, full 6 sections, default)
Before generating the architecture, detect the project context:
!find . -name "*.py" -o -name "*.yaml" -o -name "*.yml" -o -name "Dockerfile" -o -name "*.tf" -o -name "*.sql" | head -30
Use detected tools (Airflow, Dagster, Prefect, Spark, Kafka, dbt, etc.) to tailor recommendations.
If reference materials exist, load them:
Read ${CLAUDE_SKILL_DIR}/references/pipeline-patterns.md
Defines how data enters the platform — sources, methods, schema handling.
Data contract enforcement at ingestion:
Includes:
Exactly-once delivery patterns:
Kafka idempotent producers (enable.idempotence=true) combined with transactional consumers (isolation.level=read_committed) for end-to-end exactly-once delivery.
Key decisions:
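The exactly-once settings above can be sketched as plain configuration dicts in the confluent-kafka-python style; broker address, topic, and group names are illustrative placeholders, and no broker connection is made here.

```python
# Sketch: exactly-once-oriented Kafka settings, shown as plain config dicts
# (confluent-kafka-python key names). Values here are illustrative.

producer_conf = {
    "bootstrap.servers": "localhost:9092",
    "enable.idempotence": True,        # broker de-duplicates producer retries
    "acks": "all",                     # wait for all in-sync replicas
    "transactional.id": "orders-etl",  # enables transactional writes
}

consumer_conf = {
    "bootstrap.servers": "localhost:9092",
    "group.id": "orders-etl-consumer",
    "isolation.level": "read_committed",  # skip records from aborted transactions
    "enable.auto.commit": False,          # commit offsets with the transaction
}
```

Note that idempotence implies `acks=all`: the broker can only de-duplicate retries when every in-sync replica has acknowledged the write.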
Maps DAG patterns, scheduling, dependency management, and SLA enforcement.
Orchestrator comparison — select by team profile:
| Criterion | Airflow | Dagster | Prefect | Mage |
|---|---|---|---|---|
| Philosophy | Task-centric DAGs | Asset-centric software-defined data | Flow-centric with dynamic tasks | Notebook-style hybrid |
| Best for | Large teams, complex operator ecosystem | Analytics/ML teams, typed data contracts | Event-driven, dynamic workflows | Small teams, rapid prototyping |
| Local dev | Heavy (Docker-based) | Lightweight dev server | Lightweight agent | Built-in UI + notebooks |
| Data contracts | Convention-based (no native typing) | Native typed inputs/outputs | Pydantic validation | Schema validation |
| Kubernetes | Battle-tested KubernetesExecutor | Kubernetes support (newer) | Kubernetes agent | Kubernetes support |
| Community size | Largest (10K+ contributors) | Growing rapidly (2K+) | Medium (1.5K+) | Smaller (600+) |
| Choose when | Existing Airflow investment, heavy K8s ops | Greenfield, asset-first thinking, dbt integration | Dynamic/event-driven workflows | Data scientists building pipelines |
Practical guidance: run Dagster for new analytics/ML pipelines; keep Airflow for legacy or operator-heavy workflows; trigger across systems via APIs. Migrate incrementally.
Includes:
Key decisions:
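SLA enforcement is orchestrator-agnostic at its core: compare each pipeline's latest successful run against its freshness SLA. A minimal sketch (pipeline names and SLAs are hypothetical):

```python
from datetime import datetime, timedelta

def sla_breaches(last_success: dict, sla_minutes: dict, now: datetime) -> list:
    """Return pipelines whose latest successful run is older than the SLA
    (or that have never succeeded at all)."""
    breached = []
    for pipeline, sla in sla_minutes.items():
        last = last_success.get(pipeline)
        if last is None or now - last > timedelta(minutes=sla):
            breached.append(pipeline)
    return breached

now = datetime(2026, 3, 15, 12, 0)
last_success = {"orders_daily": datetime(2026, 3, 15, 11, 30)}
slas = {"orders_daily": 60, "customers_cdc": 5}  # customers_cdc never ran
print(sla_breaches(last_success, slas, now))  # -> ['customers_cdc']
```

In practice the same check runs as a sensor or monitoring job inside the chosen orchestrator, alerting on the returned list.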
Designs the data platform storage layers — zones, formats, and lifecycle.
Lakehouse table format comparison:
| Criterion | Apache Iceberg | Delta Lake | Apache Hudi |
|---|---|---|---|
| Multi-engine support | Strongest (Spark, Flink, Trino, Presto, Dremio, Snowflake) | Spark-native, growing (Trino, Flink via UniForm) | Spark, Flink, Presto |
| Best for | Multi-engine analytics, large-scale reads | Databricks ecosystem, streaming + batch | CDC-heavy workloads, record-level upserts |
| Hidden partitioning | Yes (partition evolution without rewrite) | No (explicit partition columns) | No (explicit partition columns) |
| Time travel | Snapshot-based, branch/tag support | Version-based log | Timeline-based |
| Compaction | Automatic + manual | Auto-optimize, Z-order | Inline + async compaction |
| Catalog | Nessie, Polaris, HMS, AWS Glue | Unity Catalog, HMS | HMS |
| Choose when | Multi-engine is priority, vendor-neutral | Databricks-centric, need Delta Sharing | Heavy CDC from transactional DBs |
Decision guidance: Iceberg is the 2025-2026 momentum leader for multi-engine portability; Delta Lake for Databricks-committed shops; Hudi only for CDC-dominant workloads.
Includes:
Key decisions:
Establishes validation, profiling, and remediation within data pipelines.
Data observability stack integration:
| Tool | Focus | Integration | Best For |
|---|---|---|---|
| Monte Carlo | Full observability (freshness, volume, schema, distribution) | Native warehouse + orchestrator connectors | Enterprise teams wanting managed observability |
| Elementary | dbt-native data observability | dbt package, runs with dbt test | dbt-centric teams, budget-conscious |
| Soda | Data quality checks as code | YAML-based checks, CI/CD integration | Cross-platform, polyglot teams |
| Great Expectations | Programmable data validation | Python library, checkpoint-based | Engineering teams wanting full control |
Selection criteria: Elementary for dbt shops (zero incremental infra); Soda for multi-tool environments; Monte Carlo for enterprise-wide observability; Great Expectations for Python-first teams.
Includes:
Key decisions:
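The "checks as code" idea behind tools like Soda and Great Expectations can be illustrated with a dependency-free sketch: checks are declarative data, and a small engine evaluates them at a zone boundary. Check types and field names here are invented for illustration, not any tool's API.

```python
def run_checks(rows: list, checks: list) -> list:
    """Evaluate declarative quality checks against rows; return failure messages."""
    failures = []
    for check in checks:
        col = check["column"]
        if check["type"] == "not_null":
            bad = sum(1 for r in rows if r.get(col) is None)
            if bad:
                failures.append(f"{col}: {bad} null value(s)")
        elif check["type"] == "min":
            bad = sum(1 for r in rows
                      if r.get(col) is not None and r[col] < check["value"])
            if bad:
                failures.append(f"{col}: {bad} value(s) below {check['value']}")
    return failures

rows = [{"order_id": 1, "amount": 10.0}, {"order_id": None, "amount": -2.0}]
checks = [{"column": "order_id", "type": "not_null"},
          {"column": "amount", "type": "min", "value": 0}]
print(run_checks(rows, checks))
# -> ['order_id: 1 null value(s)', 'amount: 1 value(s) below 0']
```

Because the checks are data, they can live in version control and run in CI, which is what makes contract enforcement auditable.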
Tracks data flow, monitors pipeline health, and enables incident response.
Includes:
Key decisions:
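Lineage tracking here assumes the OpenLineage event model (run, job, input/output datasets). A minimal sketch of emitting such an event as a plain dict; namespaces and dataset names are illustrative, and this is not the full spec:

```python
import uuid
from datetime import datetime, timezone

def lineage_event(event_type: str, job_name: str,
                  inputs: list, outputs: list) -> dict:
    """Build a minimal OpenLineage-style run event (sketch, not the full spec)."""
    return {
        "eventType": event_type,  # START | COMPLETE | FAIL
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": "data-platform", "name": job_name},
        "inputs": [{"namespace": "warehouse", "name": n} for n in inputs],
        "outputs": [{"namespace": "warehouse", "name": n} for n in outputs],
    }

event = lineage_event("COMPLETE", "orders_daily",
                      inputs=["raw.orders"], outputs=["curated.orders"])
print(event["job"]["name"], len(event["inputs"]))
```

In a real deployment the orchestrator integration (Airflow and Dagster both have OpenLineage integrations) emits these events automatically to a lineage backend such as Marquez.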
Optimizes data platform for growth while controlling costs.
Includes:
Key decisions:
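Per-pipeline cost attribution reduces, at its simplest, to duration times hourly compute rate times parallelism. A sketch with hypothetical pipelines and rates:

```python
def cost_per_run(avg_duration_min: float, rate_per_hour: float,
                 workers: int = 1) -> float:
    """Estimate compute cost of one pipeline run: duration x hourly rate x workers."""
    return round(avg_duration_min / 60 * rate_per_hour * workers, 4)

# Hypothetical pipelines: (avg minutes per run, $/hour per worker, workers)
pipelines = {
    "orders_daily": (12, 0.50, 4),
    "customers_cdc": (3, 0.50, 1),
}
for name, (mins, rate, workers) in pipelines.items():
    print(name, cost_per_run(mins, rate, workers))
```

Multiplying by runs per day gives the daily attribution feeding the cost dashboard; storage and egress costs would be added per pipeline on top of this compute estimate.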
| Decision | Enables | Constrains | Threshold |
|---|---|---|---|
| CDC Ingestion | Low latency, minimal source impact | CDC tool dependency, schema coupling | Transactional DBs with <5min freshness need |
| Batch Ingestion | Simple, predictable, easy debugging | Higher latency, full-scan cost | APIs, file drops, daily-freshness acceptable |
| Lakehouse Architecture | Unified batch+streaming, ACID, multi-engine | Learning curve, table format maturity | Modern platforms, mixed workloads |
| Event-Driven Orchestration | Responsive, decoupled | Harder debugging, eventual consistency | Data-availability triggers, microservice events |
| Managed Connectors | Fast setup, low maintenance | Vendor lock-in, limited customization | Standard SaaS sources, team < 5 engineers |
| Column-Level Lineage | Precise impact analysis, compliance | Tool cost, implementation effort | Regulated industries, 100+ tables |
Greenfield Data Platform: Start with managed connectors for quick wins, event-driven architecture for new systems, batch for legacy. Avoid custom connectors until managed options fail. Choose Iceberg as table format for future portability.
Legacy ETL Migration (Informatica, SSIS, Talend): Map existing jobs to modern orchestration. Document undocumented business logic. Run parallel validation before cutover. Expect 20-30% of logic to be obsolete.
Multi-Cloud Data Platform: Data ingested from AWS, processed in GCP, served from Azure. Address cross-cloud networking costs ($0.01-0.02/GB transfer), format compatibility (use Iceberg for portability), and unified catalog.
Real-Time Streaming at Scale: Kafka/Kinesis with millions of events per second. Address exactly-once semantics, consumer group management, dead-letter queues. Backpressure handling: bounded buffers, flow control that slows producers when consumers lag, consumer lag as first-class metric.
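The backpressure pattern above (bounded buffers that slow producers when consumers lag) can be sketched with the standard library alone; the event counts and buffer size are arbitrary:

```python
import queue
import threading

buf = queue.Queue(maxsize=100)  # bounded buffer: put() blocks when full

def producer(events):
    for e in events:
        buf.put(e, timeout=5)   # blocking put applies backpressure upstream
    buf.put(None)               # sentinel: end of stream

def consumer(out):
    while True:
        e = buf.get()           # buf.qsize() here is the consumer-lag metric
        if e is None:
            break
        out.append(e)

out = []
t = threading.Thread(target=consumer, args=(out,))
t.start()
producer(range(1000))           # 1000 events flow through a 100-slot buffer
t.join()
print(len(out))  # 1000
```

In Kafka terms the same role is played by consumer group lag: the producer is never allowed to run unboundedly ahead of consumption without the gap being visible as a first-class metric.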
Compliance-Heavy Environment (GDPR, CCPA, HIPAA): Data must be classifiable, deletable, and auditable. Support PII tagging in catalog, right-to-delete pipelines (Iceberg row-level deletes), and access logging at record level.
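The core of a right-to-delete pipeline, independent of the table format, is a record-level filter plus an auditable count of what was removed. A minimal sketch with hypothetical field names (in an Iceberg deployment the same logic becomes a row-level DELETE):

```python
def right_to_delete(records: list, subject_ids: set,
                    pii_key: str = "customer_id") -> tuple:
    """Drop records belonging to data subjects who requested deletion.
    Returns (surviving records, number deleted) for the audit log."""
    kept = [r for r in records if r.get(pii_key) not in subject_ids]
    return kept, len(records) - len(kept)

records = [{"customer_id": "c1", "amount": 10},
           {"customer_id": "c2", "amount": 20},
           {"customer_id": "c1", "amount": 30}]
kept, deleted = right_to_delete(records, {"c1"})
print(deleted, [r["customer_id"] for r in kept])  # 2 ['c2']
```

The deleted count, the subject IDs, and the execution timestamp belong in the access log so the deletion itself is auditable.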
Before finalizing delivery, verify:
| Format | Default | Description |
|---|---|---|
markdown | ✅ | Rich Markdown + Mermaid diagrams. Token-efficient. |
html | On demand | Branded HTML (Design System). Visual impact. |
dual | On demand | Both formats. |
Default output is Markdown with embedded Mermaid diagrams. HTML generation requires explicit {FORMATO}=html parameter.
Primary: A-01_Data_Engineering.html — Ingestion patterns, orchestration design, storage architecture, quality framework, lineage and observability, scalability and cost management.
Secondary: Source inventory catalog, DAG dependency diagram, storage zone map, pipeline runbook templates, cost attribution dashboard spec.
| Case | Handling Strategy |
|---|---|
| Greenfield data platform | Managed connectors for quick wins, event-driven for new systems, batch for legacy. Iceberg as the table format for future portability. |
| Legacy ETL migration (Informatica, SSIS, Talend) | Map existing jobs to modern orchestration. Document undocumented business logic. Parallel validation before cutover. Expect 20-30% of logic to be obsolete. |
| Multi-cloud data platform | Cross-cloud networking costs ($0.01-0.02/GB transfer). Iceberg for format portability. Unified catalog (Polaris or similar). |
| Real-time streaming at scale (millions of events/sec) | Kafka/Kinesis with exactly-once semantics. Consumer group management. Dead-letter queues. Backpressure via bounded buffers, with consumer lag as the primary metric. |
| Compliance-heavy (GDPR, CCPA, HIPAA) | PII tagging in the catalog, right-to-delete pipelines (Iceberg row-level deletes), record-level access logging. Mandatory data classification. |
| Decision | Rejected Alternative | Rationale |
|---|---|---|
| Idempotency as a foundational property | At-least-once without deduplication | A replayable pipeline that does not duplicate data is the baseline reliability requirement. Without idempotency, re-runs corrupt data. |
| Lakehouse as the 2025-2026 baseline | Pure data warehouse, pure data lake | Lakehouse (open table formats + catalog on object storage) unifies batch + streaming, provides ACID, and supports multiple engines. Warehouse for teams that prioritize simplicity. |
| Data contracts between teams (producer defines) | Schema-on-read without contracts | Without contracts, quality is nobody's responsibility. The producer publishes schema + SLA + ownership; the consumer registers its dependency; breaking changes require sign-off. |
| Dagster for greenfield, Airflow for legacy | A single orchestrator for everything | Dagster is asset-centric with typed data contracts (ideal for analytics/ML). Airflow has the largest ecosystem (ideal for operator-heavy workflows). Coexistence via APIs. |
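The producer-defined data contract pattern can be sketched as a validation step at ingestion. The contract format, field names, and dataset are illustrative, not any registry's schema language:

```python
# Sketch of a producer-defined data contract checked at ingestion.
# The contract structure and field names are illustrative.

CONTRACT = {
    "dataset": "orders",
    "owner": "payments-team",
    "freshness_sla_minutes": 60,
    "fields": {"order_id": int, "amount": float, "currency": str},
}

def validate(record: dict, contract: dict) -> list:
    """Return a list of contract violations for one record."""
    errors = []
    for field, typ in contract["fields"].items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], typ):
            errors.append(f"{field}: expected {typ.__name__}")
    return errors

print(validate({"order_id": 1, "amount": 9.99, "currency": "EUR"}, CONTRACT))  # []
print(validate({"order_id": "1", "amount": 9.99}, CONTRACT))
# -> ['order_id: expected int', 'missing field: currency']
```

Running this check in the producer's CI pipeline is what turns the contract from documentation into enforcement: a breaking change fails the build before it reaches consumers.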
graph TD
subgraph Core["Core Concepts"]
INGEST["Ingestion Patterns"]
ORCH["Orchestration Design"]
STORAGE["Storage Architecture"]
QUALITY["Data Quality"]
LINEAGE["Lineage & Observability"]
SCALE["Scalability & Cost"]
end
subgraph Inputs["Inputs"]
SOURCES["Source Systems"]
REQS["Consumer Requirements"]
INFRA["Cloud/On-prem Infrastructure"]
SLA["Freshness SLAs"]
end
subgraph Outputs["Outputs"]
ARCH["Data Engineering Architecture"]
CATALOG["Source Inventory Catalog"]
DAG["DAG Dependency Diagram"]
ZONES["Storage Zone Map"]
RUNBOOK["Pipeline Runbook Templates"]
end
subgraph Related["Related Skills"]
AE["analytics-engineering"]
BIARCH["bi-architecture"]
DS["data-science-architecture"]
DQ["data-quality"]
end
SOURCES --> INGEST
REQS --> STORAGE
INFRA --> ORCH
SLA --> QUALITY
INGEST --> ORCH
ORCH --> STORAGE
STORAGE --> QUALITY
QUALITY --> LINEAGE
LINEAGE --> SCALE
ARCH --> CATALOG
ARCH --> DAG
ARCH --> ZONES
ARCH --> RUNBOOK
AE -.-> STORAGE
BIARCH -.-> STORAGE
DS -.-> INGEST
DQ -.-> QUALITY
Markdown format (default):
# Data Engineering Architecture: {project}
## S1: Ingestion Patterns
### Source Inventory
| Source | Type | Method | Freshness SLA | Connector |
...
### Schema Registry & Data Contracts
### Exactly-Once Delivery Patterns
## S2: Orchestration Design
### Orchestrator Selection: {Airflow|Dagster|Prefect}
### DAG Architecture (Mermaid)
### SLA Monitoring & Alerting
## S3: Storage Architecture
### Zone Architecture (Landing > Curated > Consumption > Archive)
### Table Format: {Iceberg|Delta|Hudi}
### Lifecycle Policies
## S4: Data Quality Framework
### Quality Checks per Zone Boundary
### Observability Stack Selection
## S5: Lineage & Observability
### Lineage Tracking (OpenLineage)
### Pipeline Monitoring Dashboard
## S6: Scalability & Cost Management
### Cost Attribution per Pipeline
HTML format (on demand):
A-01_Data_Engineering_{project}_{WIP}.html
Self-contained branded HTML (MetodologIA Design System v5). Light-First Technical. Includes an interactive DAG dependency diagram, a visual storage zone map, and a pipeline observability dashboard layout. WCAG AA, responsive, print-ready.
XLSX format (on demand):
Sheet 1: Source Inventory — source, type, method, freshness SLA, connector, owner
Sheet 2: Pipeline Catalog — pipeline name, DAG, sources, targets, schedule, SLA, owner
Sheet 3: Storage Zones — zone, format, partitioning, retention, lifecycle
Sheet 4: Quality Rules — dataset, check type, threshold, severity, remediation
Sheet 5: Lineage Map — source table, transformation, target table, pipeline
Sheet 6: Cost Attribution — pipeline, compute type, avg duration, estimated cost/run
DOCX format (on demand):
A-01_Data_Engineering_{project}_{WIP}.docx
Via python-docx with MetodologIA Design System v5. Cover page, auto-generated TOC, branded headers/footers, zebra-striped tables. Poppins headings (navy), Montserrat body, gold accents.
PPTX format (on demand):
{fase}_Data_Engineering_{cliente}_{WIP}.pptx
Via python-pptx with MetodologIA Design System v5. Navy gradient slide master, Poppins titles, Montserrat body, gold accents. Max 20 slides (executive) / 30 (technical). Speaker notes with evidence references.
| Dimension | Weight | Criterion |
|---|---|---|
| Trigger Accuracy | 10% | Correct activation on keywords: data pipelines, ingestion, orchestration, data lake, lakehouse, Airflow, Dagster, CDC, lineage, pipeline SLAs. |
| Completeness | 25% | Six sections cover ingestion, orchestration, storage, quality, lineage, and scalability. Schema registry and data contracts included. |
| Clarity | 20% | Comparison matrices (orchestrators, table formats, quality tools) with clear selection criteria. Decision rules per context. |
| Robustness | 20% | Edge cases (greenfield, legacy ETL, multi-cloud, streaming at scale, compliance) handled. Exactly-once delivery patterns documented. |
| Efficiency | 10% | The executive variant trims output to S1+S3+S5 (~40%). Automatic context detection adapts to the detected stack. |
| Value Density | 15% | Data contracts with CI enforcement. Actionable orchestrator comparison. Per-pipeline cost attribution. Pipeline runbook templates. |
Minimum threshold: 7/10. Below this threshold, revisit the idempotency design and data contract enforcement.
Author: Javier Montano · MetodologIA Community | Last updated: March 15, 2026