Design batch and streaming data pipelines. Plan ingestion, transformation, quality checks, and failure recovery. Use when building ETL/ELT systems or data infrastructure.
From data-architecture: `npx claudepluginhub sethdford/claude-skills --plugin architect-data-architecture`. This skill uses the workspace's default tool permissions.
Design robust, maintainable data pipelines that reliably move, transform, and validate data at scale.
You are designing data pipelines (batch or streaming). Plan data flow, transformations, quality gates, failure recovery, and monitoring. Start by reviewing the source systems, target requirements, latency expectations, and volume projections.
Based on modern data engineering practices (Spark, Airflow, Kafka, Beam):
Choose Processing Model: Batch (daily jobs)? Streaming (real-time features)? Hybrid (a Lambda architecture combines batch and streaming for both accuracy and speed)? Weigh the latency SLA against cost.
Design Data Stages: Raw ingestion (as-is from source) → Bronze. Cleansing and normalization → Silver. Business logic and enrichment → Gold. This layered medallion architecture separates concerns.
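As a sketch of these layers, the following PySpark snippet walks raw records through Bronze, Silver, and Gold; the paths, column names, and the "orders" dataset are hypothetical assumptions:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("medallion-demo").getOrCreate()

# Bronze: ingest records as-is from the source, adding only ingestion metadata.
bronze = (spark.read.json("s3://lake/raw/orders/")
          .withColumn("_ingested_at", F.current_timestamp()))
bronze.write.mode("append").parquet("s3://lake/bronze/orders/")

# Silver: cleanse and normalize (dedupe, fix types, drop malformed rows).
silver = (bronze
          .dropDuplicates(["order_id"])
          .withColumn("amount", F.col("amount").cast("decimal(12,2)"))
          .filter(F.col("order_id").isNotNull()))
silver.write.mode("overwrite").parquet("s3://lake/silver/orders/")

# Gold: apply business logic and enrichment for downstream consumers.
gold = (silver.groupBy("customer_id")
        .agg(F.sum("amount").alias("lifetime_value"),
             F.count("*").alias("order_count")))
gold.write.mode("overwrite").parquet("s3://lake/gold/customer_value/")
```

Each layer writes to its own storage path, so downstream stages can be rebuilt from the layer above without re-touching the source system.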
Implement Quality Gates: Validate at each stage, and fail the pipeline when data quality drops below thresholds. Track anomalies: unexpected null rates, shifts in value distributions, cardinality changes.
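A minimal sketch of one such gate between stages, assuming illustrative column names and thresholds; it fails the run when a null-rate check breaches its limit:

```python
from pyspark.sql import DataFrame, functions as F

def check_quality(df: DataFrame, column: str, max_null_rate: float = 0.01) -> None:
    """Fail fast if the null rate of `column` exceeds the threshold."""
    total = df.count()
    nulls = df.filter(F.col(column).isNull()).count()
    null_rate = nulls / total if total else 1.0
    if null_rate > max_null_rate:
        # Raising here stops bad data from propagating to Silver/Gold.
        raise ValueError(
            f"Quality gate failed: {column} null rate {null_rate:.2%} "
            f"exceeds threshold {max_null_rate:.2%}"
        )

# Example: gate the Silver write on the cleansed frame.
# check_quality(silver, "customer_id", max_null_rate=0.001)
```

The same pattern extends to distribution and cardinality checks (e.g., comparing `F.countDistinct(column)` against the previous run).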
Handle Failures and Recovery: Idempotent transformations allow safe retries. Checkpoint state for streaming pipelines; resume from last checkpoint on failure. Use dead-letter queues for unparseable records.
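As an illustration of checkpointed recovery plus a dead-letter path, here is a Structured Streaming sketch; the Kafka topic, broker address, schema, and storage paths are assumptions. The checkpoint lets the job resume from its last committed offsets after a failure, and records that fail JSON parsing are routed to a dead-letter sink instead of crashing the stream:

```python
from pyspark.sql import SparkSession, functions as F, types as T

spark = SparkSession.builder.appName("resumable-stream").getOrCreate()

schema = T.StructType([
    T.StructField("order_id", T.StringType()),
    T.StructField("amount", T.DoubleType()),
])

raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "orders")
       .load())

# from_json yields a null struct for unparseable payloads,
# which is how we split good records from dead letters.
parsed = raw.select(
    F.from_json(F.col("value").cast("string"), schema).alias("rec"),
    F.col("value").cast("string").alias("raw_value"))

good = parsed.filter(F.col("rec").isNotNull()).select("rec.*")
bad = parsed.filter(F.col("rec").isNull()).select("raw_value")

# Checkpointed sinks: safe to restart the job after any failure.
(good.writeStream.format("parquet")
 .option("path", "s3://lake/bronze/orders/")
 .option("checkpointLocation", "s3://lake/_checkpoints/orders/")
 .start())

(bad.writeStream.format("parquet")
 .option("path", "s3://lake/dead_letter/orders/")
 .option("checkpointLocation", "s3://lake/_checkpoints/orders_dlq/")
 .start())

spark.streams.awaitAnyTermination()
```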
Plan Monitoring and Alerting: Track freshness (when was the last successful run?), latency (time from source to sink), volume (record counts by stage), and error rates. Alert on anomalies and SLA misses.
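A hedged sketch of these checks in plain Python; the SLA value, volume floor, and alert hook are illustrative assumptions, not a specific monitoring stack:

```python
from datetime import datetime, timedelta, timezone

FRESHNESS_SLA = timedelta(hours=2)   # assumed SLA: data less than 2h old
MIN_EXPECTED_ROWS = 10_000           # assumed volume floor per run

def check_pipeline_health(last_success: datetime, row_count: int,
                          error_count: int, alert) -> None:
    """Emit alerts on freshness, volume, and error-rate anomalies."""
    now = datetime.now(timezone.utc)
    if now - last_success > FRESHNESS_SLA:
        alert(f"Freshness SLA missed: last success at {last_success.isoformat()}")
    if row_count < MIN_EXPECTED_ROWS:
        alert(f"Volume anomaly: {row_count} rows (expected >= {MIN_EXPECTED_ROWS})")
    if error_count > 0:
        alert(f"{error_count} records failed processing this run")

# Example wiring, with a trivial alert hook:
# check_pipeline_health(run_finished_at, rows_written, errors,
#                       alert=lambda msg: print("ALERT:", msg))
```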