This skill should be used when the user asks to "design data pipelines", "architect data ingestion", "set up orchestration", "plan a data lake", or mentions Airflow, Dagster, CDC, data lineage, or pipeline SLAs. [EXPLICIT] It produces data engineering documentation covering ingestion patterns, orchestration design, storage architecture, quality frameworks, lineage, and cost management for scalable data platforms. [EXPLICIT] Use this skill whenever the user needs data platform architecture, even if they don't explicitly ask for "data-engineering". [EXPLICIT]
Bundled files:

- agents/guardian.md, agents/lead.md, agents/specialist.md, agents/support.md
- evals/evals.json
- knowledge/body-of-knowledge.md, knowledge/knowledge-graph.md
- prompts/meta.md, prompts/primary.md, prompts/variations/deep.md, prompts/variations/quick.md
- references/pipeline-patterns.md
- templates/output.docx.md, templates/output.html
Data engineering architecture defines how data is ingested, orchestrated, stored, validated, and observed — the backbone infrastructure that feeds analytics, ML, and operational systems. This skill produces data engineering documentation that enables teams to build reliable, scalable, and cost-efficient data platforms. [EXPLICIT]
A pipeline that cannot be observed cannot be trusted. Data engineering designs how data flows from sources to consumers, with validated quality, traceable lineage, and measurable SLAs at every step. Data does not simply "arrive"; it is orchestrated.
The user provides a system or project name as $ARGUMENTS. Parse $1 as the system/project name used throughout all output artifacts. [EXPLICIT]
Parameters:
{MODO}: piloto-auto (default) | desatendido | supervisado | paso-a-paso
{FORMATO}: markdown (default) | html | dual
{VARIANTE}: ejecutiva (~40% — S1 ingestion + S3 storage + S5 lineage) | técnica (full 6 sections, default)

Before generating architecture, detect the project context:
!find . -name "*.py" -o -name "*.yaml" -o -name "*.yml" -o -name "Dockerfile" -o -name "*.tf" -o -name "*.sql" | head -30
Use detected tools (Airflow, Dagster, Prefect, Spark, Kafka, dbt, etc.) to tailor recommendations. [EXPLICIT]
If reference materials exist, load them:
Read ${CLAUDE_SKILL_DIR}/references/pipeline-patterns.md
## S1. Ingestion Patterns

Defines how data enters the platform — sources, methods, schema handling. [EXPLICIT]
Data contract enforcement at ingestion:
Includes:
Exactly-once delivery patterns:
- Idempotent producers (`enable.idempotence=true`) combined with transactional consumers (`isolation.level=read_committed`) for end-to-end exactly-once delivery

Key decisions:
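As a sketch, the exactly-once pattern above maps to standard Kafka client properties. The property names are real Kafka configuration keys; the broker address, group id, and transactional id are placeholder assumptions:

```python
# Kafka client settings for end-to-end exactly-once delivery.
# Property names are standard Kafka configs; values marked "placeholder"
# are assumptions for illustration.

producer_config = {
    "bootstrap.servers": "broker:9092",     # placeholder broker address
    "enable.idempotence": True,             # broker dedupes retried sends
    "acks": "all",                          # wait for full ISR acknowledgement
    "transactional.id": "orders-ingest-1",  # placeholder; stable id enables atomic commits
}

consumer_config = {
    "bootstrap.servers": "broker:9092",
    "group.id": "orders-ingest",            # placeholder consumer group
    "isolation.level": "read_committed",    # skip records from aborted transactions
    "enable.auto.commit": False,            # commit offsets inside the producer transaction
}
```

Pass these to whichever Kafka client library the platform uses; the exactly-once guarantee only holds when the producer writes results and commits consumer offsets inside the same transaction.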
## S2. Orchestration Design

Maps DAG patterns, scheduling, dependency management, and SLA enforcement. [EXPLICIT]
Orchestrator comparison — select by team profile:
| Criterion | Airflow | Dagster | Prefect | Mage |
|---|---|---|---|---|
| Philosophy | Task-centric DAGs | Asset-centric software-defined data | Flow-centric with dynamic tasks | Notebook-style hybrid |
| Best for | Large teams, complex operator ecosystem | Analytics/ML teams, typed data contracts | Event-driven, dynamic workflows | Small teams, rapid prototyping |
| Local dev | Heavy (Docker-based) | Lightweight dev server | Lightweight agent | Built-in UI + notebooks |
| Data contracts | Convention-based (no native typing) | Native typed inputs/outputs | Pydantic validation | Schema validation |
| Kubernetes | Battle-tested KubernetesExecutor | Kubernetes support (newer) | Kubernetes agent | Kubernetes support |
| Community size | Largest (10K+ contributors) | Growing rapidly (2K+) | Medium (1.5K+) | Smaller (600+) |
| Choose when | Existing Airflow investment, heavy K8s ops | Greenfield, asset-first thinking, dbt integration | Dynamic/event-driven workflows | Data scientists building pipelines |
Practical guidance: run Dagster for new analytics/ML pipelines; keep Airflow for legacy or operator-heavy workflows; trigger across systems via APIs. Migrate incrementally. [EXPLICIT]
Includes:
Key decisions:
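Independent of which orchestrator is chosen, the core mechanics of dependency resolution and SLA tracking can be sketched with the standard library; the task names and SLA budgets below are hypothetical:

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: task -> set of upstream dependencies.
dag = {
    "extract_orders": set(),
    "extract_customers": set(),
    "transform_join": {"extract_orders", "extract_customers"},
    "load_warehouse": {"transform_join"},
}

# Illustrative per-task SLA budgets in minutes.
sla_minutes = {
    "extract_orders": 10,
    "extract_customers": 10,
    "transform_join": 20,
    "load_warehouse": 15,
}

def run_order(graph):
    """A valid execution order that respects all dependencies."""
    return list(TopologicalSorter(graph).static_order())

def sla_breaches(observed_minutes, budgets):
    """Tasks whose observed runtime exceeded their SLA budget."""
    return [task for task, mins in observed_minutes.items()
            if mins > budgets.get(task, float("inf"))]
```

An orchestrator layers retries, state, and scheduling on top of exactly this dependency resolution; SLA enforcement then amounts to alerting on `sla_breaches` for every run.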
## S3. Storage Architecture

Designs the data platform storage layers — zones, formats, and lifecycle. [EXPLICIT]
Lakehouse table format comparison:
| Criterion | Apache Iceberg | Delta Lake | Apache Hudi |
|---|---|---|---|
| Multi-engine support | Strongest (Spark, Flink, Trino, Presto, Dremio, Snowflake) | Spark-native, growing (Trino, Flink via UniForm) | Spark, Flink, Presto |
| Best for | Multi-engine analytics, large-scale reads | Databricks ecosystem, streaming + batch | CDC-heavy workloads, record-level upserts |
| Hidden partitioning | Yes (partition evolution without rewrite) | No (explicit partition columns) | No (explicit partition columns) |
| Time travel | Snapshot-based, branch/tag support | Version-based log | Timeline-based |
| Compaction | Automatic + manual | Auto-optimize, Z-order | Inline + async compaction |
| Catalog | Nessie, Polaris, HMS, AWS Glue | Unity Catalog, HMS | HMS |
| Choose when | Multi-engine is priority, vendor-neutral | Databricks-centric, need Delta Sharing | Heavy CDC from transactional DBs |
Decision guidance: Iceberg is the 2025-2026 momentum leader for multi-engine portability; Delta Lake for Databricks-committed shops; Hudi only for CDC-dominant workloads. [EXPLICIT]
Includes:
Key decisions:
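One concrete piece of the zone design is a deterministic path convention. The bucket, medallion-style zone names, and partition key below are assumptions for illustration, not a standard:

```python
from datetime import date

# Assumed three-zone layout; rename to match your platform's conventions.
ZONES = ("bronze", "silver", "gold")

def table_path(zone, domain, table, ds, bucket="s3://data-platform"):
    """Partitioned object-store path, e.g.
    s3://data-platform/silver/sales/orders/ds=2026-03-18/"""
    if zone not in ZONES:
        raise ValueError(f"unknown zone {zone!r}; expected one of {ZONES}")
    return f"{bucket}/{zone}/{domain}/{table}/ds={ds.isoformat()}/"
```

Note that with Iceberg's hidden partitioning the table format manages physical layout itself, so a convention like this matters most for raw/bronze file drops that land before any table format is involved.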
## S4. Quality Framework

Establishes validation, profiling, and remediation within data pipelines. [EXPLICIT]
Data observability stack integration:
| Tool | Focus | Integration | Best For |
|---|---|---|---|
| Monte Carlo | Full observability (freshness, volume, schema, distribution) | Native warehouse + orchestrator connectors | Enterprise teams wanting managed observability |
| Elementary | dbt-native data observability | dbt package, runs with dbt test | dbt-centric teams, budget-conscious |
| Soda | Data quality checks as code | YAML-based checks, CI/CD integration | Cross-platform, polyglot teams |
| Great Expectations | Programmable data validation | Python library, checkpoint-based | Engineering teams wanting full control |
Selection criteria: Elementary for dbt shops (zero incremental infra); Soda for multi-tool environments; Monte Carlo for enterprise-wide observability; Great Expectations for Python-first teams. [EXPLICIT]
Includes:
Key decisions:
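The "quality checks as code" idea reduces to named predicates over a dataset plus a runner. This plain-Python sketch shows the shape that Soda or Great Expectations provide declaratively; the column names and thresholds are assumptions:

```python
# Toy dataset; in practice rows come from a warehouse query or dataframe.
rows = [
    {"order_id": 1, "amount": 120.0, "email": "a@example.com"},
    {"order_id": 2, "amount": None,  "email": "b@example.com"},
]

def null_rate(rows, column):
    """Fraction of rows where `column` is NULL."""
    return sum(r[column] is None for r in rows) / len(rows)

# Named checks: (description, predicate over the full row set).
checks = [
    ("row_count >= 1", lambda rs: len(rs) >= 1),
    ("amount null rate <= 5%", lambda rs: null_rate(rs, "amount") <= 0.05),
]

def failed_checks(rows, checks):
    """Descriptions of every check the dataset fails."""
    return [name for name, predicate in checks if not predicate(rows)]
```

A pipeline would run `failed_checks` after each load and route failures to quarantine or alerting; the dedicated tools add scheduling, history, and anomaly detection on top of this core loop.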
## S5. Lineage & Observability

Tracks data flow, monitors pipeline health, and enables incident response. [EXPLICIT]
Includes:
Key decisions:
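Column-level lineage is a directed graph, and impact analysis is a graph traversal. The column names in this sketch are hypothetical:

```python
from collections import deque

# Hypothetical lineage edges: upstream column -> columns derived from it.
lineage = {
    "raw.orders.amount": ["silver.orders.amount_usd"],
    "silver.orders.amount_usd": ["gold.revenue.daily_total"],
    "raw.orders.country": ["silver.orders.region"],
}

def downstream_impact(column, edges):
    """All columns transitively derived from `column` (breadth-first walk)."""
    seen, queue = set(), deque([column])
    while queue:
        for child in edges.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)
```

A proposed schema change to `raw.orders.amount` can then flag every derived column before it breaks a downstream dashboard; lineage tools compute these edges automatically from SQL and orchestration metadata.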
## S6. Scalability & Cost Management

Optimizes data platform for growth while controlling costs. [EXPLICIT]
Includes:
Key decisions:
| Decision | Enables | Constrains | Threshold |
|---|---|---|---|
| CDC Ingestion | Low latency, minimal source impact | CDC tool dependency, schema coupling | Transactional DBs with <5min freshness need |
| Batch Ingestion | Simple, predictable, easy debugging | Higher latency, full-scan cost | APIs, file drops, daily-freshness acceptable |
| Lakehouse Architecture | Unified batch+streaming, ACID, multi-engine | Learning curve, table format maturity | Modern platforms, mixed workloads |
| Event-Driven Orchestration | Responsive, decoupled | Harder debugging, eventual consistency | Data-availability triggers, microservice events |
| Managed Connectors | Fast setup, low maintenance | Vendor lock-in, limited customization | Standard SaaS sources, team < 5 engineers |
| Column-Level Lineage | Precise impact analysis, compliance | Tool cost, implementation effort | Regulated industries, 100+ tables |
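The ingestion rows of the table can be encoded as a first-pass heuristic; this is deliberately simplified, and real decisions weigh more factors than these two:

```python
def ingestion_method(source_is_transactional, freshness_minutes):
    """First-pass pick between CDC and batch, per the thresholds above."""
    if source_is_transactional and freshness_minutes < 5:
        return "cdc"   # low-latency change capture from a transactional DB
    return "batch"     # APIs, file drops, or relaxed freshness needs
```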
Greenfield Data Platform: Start with managed connectors for quick wins, event-driven architecture for new systems, batch for legacy. Avoid custom connectors until managed options fail. Choose Iceberg as table format for future portability. [EXPLICIT]
Legacy ETL Migration (Informatica, SSIS, Talend): Map existing jobs to modern orchestration. Document undocumented business logic. Run parallel validation before cutover. Expect 20-30% of logic to be obsolete. [EXPLICIT]
Multi-Cloud Data Platform: Data ingested from AWS, processed in GCP, served from Azure. Address cross-cloud networking costs ($0.01-0.02/GB transfer), format compatibility (use Iceberg for portability), and unified catalog. [EXPLICIT]
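A quick sanity check on the cited transfer range, using an assumed daily volume:

```python
# Back-of-envelope cross-cloud egress cost. The $0.01-0.02/GB range comes
# from the scenario above; actual rates vary by provider and network path.

def monthly_egress_cost(gb_per_day, rate_per_gb=0.02):
    """Approximate monthly transfer cost in dollars (30-day month)."""
    return round(gb_per_day * 30 * rate_per_gb, 2)
```

At 500 GB/day the pessimistic rate already lands at $300/month for a single flow, which is why collocating processing with storage usually beats moving data between clouds.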
Real-Time Streaming at Scale: Kafka/Kinesis with millions of events per second. Address exactly-once semantics, consumer group management, dead-letter queues. Backpressure handling: bounded buffers, flow control that slows producers when consumers lag, consumer lag as first-class metric. [EXPLICIT]
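The bounded-buffer form of backpressure can be shown with the standard library: the producer blocks as soon as the consumer lags by more than the buffer size, and queue depth doubles as the lag metric. Buffer size and event counts here are illustrative:

```python
import queue
import threading

buffer = queue.Queue(maxsize=10)  # bounded: put() blocks when 10 items are in flight

def produce(n):
    for i in range(n):
        buffer.put(i)  # applies backpressure automatically when the buffer is full

def consume(n, out):
    for _ in range(n):
        out.append(buffer.get())

consumed = []
consumer = threading.Thread(target=consume, args=(100, consumed))
consumer.start()
produce(100)   # never runs more than 10 events ahead of the consumer
consumer.join()
```

Kafka consumers get the same effect from bounded fetch and poll sizes; the first-class metric to alert on is consumer group lag, not producer throughput.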
Compliance-Heavy Environment (GDPR, CCPA, HIPAA): Data must be classifiable, deletable, and auditable. Support PII tagging in catalog, right-to-delete pipelines (Iceberg row-level deletes), and access logging at record level. [EXPLICIT]
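A right-to-delete pass driven by catalog PII tags might look like this sketch. The catalog shape and tag names are assumptions; on a lakehouse the kept-rows step becomes an Iceberg row-level DELETE rather than an in-memory filter:

```python
# Hypothetical catalog: table -> {column: classification tag}.
catalog_tags = {
    "users": {"email": "pii", "signup_date": None},
}

def delete_subject(rows, table, subject_value):
    """Drop rows belonging to a data subject, keyed on the PII-tagged column.
    Returns (kept_rows, deleted_count) so the deletion can be audit-logged."""
    pii_cols = [col for col, tag in catalog_tags[table].items() if tag == "pii"]
    if not pii_cols:
        raise ValueError(f"no PII column tagged for table {table!r}")
    key = pii_cols[0]
    kept = [r for r in rows if r[key] != subject_value]
    return kept, len(rows) - len(kept)
```

Returning the deletion count supports the auditability requirement: every right-to-delete request gets a logged record of what was removed and when.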
Before finalizing delivery, verify:
| Format | Default | Description |
|---|---|---|
| markdown | ✅ | Rich Markdown + Mermaid diagrams. Token-efficient. |
| html | On demand | Branded HTML (Design System). Visual impact. |
| dual | On demand | Both formats. |
Default output is Markdown with embedded Mermaid diagrams. HTML generation requires explicit {FORMATO}=html parameter. [EXPLICIT]
Primary: A-01_Data_Engineering.html — Ingestion patterns, orchestration design, storage architecture, quality framework, lineage and observability, scalability and cost management.
Secondary: Source inventory catalog, DAG dependency diagram, storage zone map, pipeline runbook templates, cost attribution dashboard spec.
Author: Javier Montano | Last updated: March 18, 2026