Provides patterns for designing data pipelines, choosing between ETL and ELT approaches, and implementing transformation strategies. Use when designing data pipelines, deciding between ETL and ELT, or implementing incremental loads, CDC, or data quality validation patterns. Covers modern data pipeline architecture.
/plugin marketplace add melodic-software/claude-code-plugins
/plugin install systems-design@melodic-software
Patterns for data extraction, loading, and transformation including modern ELT approaches and pipeline design.
ETL (Extract, Transform, Load):

┌─────────┐      ┌─────────────┐      ┌─────────────┐
│ Sources │ ──►  │  Transform  │ ──►  │  Warehouse  │
└─────────┘      │   Server    │      └─────────────┘
                 └─────────────┘
        (Transformation happens before loading)
Characteristics:
- Transform before load
- Requires ETL server/tool
- Schema-on-write
- Traditional approach
Best for:
- Complex transformations
- Limited target storage
- Strict data quality requirements
- Legacy systems
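To make the ETL flow concrete, here is a minimal sketch in Python, assuming hypothetical DB-API style `source_conn` and `warehouse_conn` objects and illustrative table names: rows are pulled from the source, transformed in the pipeline's own process, and only the transformed result reaches the warehouse.

# Minimal ETL sketch (hypothetical connections; transformation happens
# in the pipeline process, before anything reaches the warehouse).

def extract(source_conn):
    cur = source_conn.cursor()
    cur.execute("SELECT id, email, amount_cents FROM orders")
    return cur.fetchall()

def transform(rows):
    # Business logic runs on the ETL server, not in the warehouse.
    return [
        (row_id, email.lower(), amount_cents / 100.0)
        for (row_id, email, amount_cents) in rows
    ]

def load(warehouse_conn, rows):
    cur = warehouse_conn.cursor()
    cur.executemany(
        "INSERT INTO orders_clean (id, email, amount_usd) VALUES (%s, %s, %s)",
        rows,
    )
    warehouse_conn.commit()

def run_etl(source_conn, warehouse_conn):
    load(warehouse_conn, transform(extract(source_conn)))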
ELT (Extract, Load, Transform):

┌─────────┐      ┌─────────────┐      ┌─────────────┐
│ Sources │ ──►  │   Target    │ ──►  │  Transform  │
└─────────┘      │ (Load raw)  │      │ (in-place)  │
                 └─────────────┘      └─────────────┘
        (Transformation happens after loading)
Characteristics:
- Load first, transform later
- Uses target system's compute
- Schema-on-read
- Modern approach
Best for:
- Cloud data warehouses
- Flexible exploration
- Iterative development
- Large data volumes
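For contrast, a minimal ELT sketch under the same assumptions: raw rows are loaded as-is and the transformation is pushed down to the warehouse as SQL (the `CREATE OR REPLACE TABLE` syntax is warehouse-specific).

# Minimal ELT sketch: load raw rows first, then transform inside the
# warehouse using its own compute (schema-on-read over the raw table).

def load_raw(warehouse_conn, rows):
    cur = warehouse_conn.cursor()
    cur.executemany(
        "INSERT INTO raw_orders (id, email, amount_cents) VALUES (%s, %s, %s)",
        rows,
    )
    warehouse_conn.commit()

def transform_in_warehouse(warehouse_conn):
    # The transformation is just SQL executed by the target system.
    warehouse_conn.cursor().execute("""
        CREATE OR REPLACE TABLE orders_clean AS
        SELECT id, LOWER(email) AS email, amount_cents / 100.0 AS amount_usd
        FROM raw_orders
    """)
    warehouse_conn.commit()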
ETL vs ELT comparison:

| Factor | ETL | ELT |
|---|---|---|
| Transform timing | Before load | After load |
| Compute location | Separate server | Target system |
| Raw data access | Limited | Full |
| Flexibility | Low | High |
| Latency | Higher | Lower |
| Cost model | ETL server + storage | Storage + target compute |
| Best for | Complex, pre-defined | Exploratory, iterative |
Modern ELT stack:

Extract:    Fivetran, Airbyte, Stitch, Custom
               │
               ▼
Load:       Cloud Warehouse (Snowflake, BigQuery, Redshift)
               │
               ▼
Transform:  dbt, Dataform, SQLMesh
               │
               ▼
Visualize:  Looker, Tableau, Metabase
dbt core concepts:
- Models: SQL SELECT statements that define transformations
- Tests: Data quality assertions
- Documentation: Inline docs and lineage
- Macros: Reusable SQL snippets
Example model:
-- models/customers.sql
SELECT
    customer_id,
    first_name,
    last_name,
    order_count
FROM {{ ref('stg_customers') }}
LEFT JOIN {{ ref('customer_orders') }} USING (customer_id)
Full refresh strategy: drop and recreate the entire table (see the sketch below).
Process:
1. Extract all data from source
2. Truncate target table
3. Load all data
Pros: Simple, consistent
Cons: Slow and costly for large tables (every run rewrites the full table)
Best for: Small dimension tables, reference data
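A full refresh reduces to two statements against the target; the sketch below assumes a DB-API style connection and an illustrative `dim_customers` table.

# Full refresh sketch: wipe the target and reload everything from the
# extracted data. Simple and self-correcting, but the whole table is
# rewritten on every run.

def full_refresh(warehouse_conn, rows):
    cur = warehouse_conn.cursor()
    cur.execute("TRUNCATE TABLE dim_customers")
    cur.executemany(
        "INSERT INTO dim_customers (customer_id, name, segment) VALUES (%s, %s, %s)",
        rows,
    )
    warehouse_conn.commit()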
Incremental load strategy: only process new or changed records (see the watermark sketch below).
Process:
1. Track high watermark (last processed timestamp/ID)
2. Extract records > watermark
3. Merge into target
Pros: Fast, efficient
Cons: More complex; misses hard deletes, and can miss updates if the watermark tracks only inserts
Best for: Large fact tables, event data
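A sketch of high-watermark incremental extraction, assuming the source table carries an `updated_at` column; persisting the watermark between runs is left out.

from datetime import datetime

# Incremental load sketch: pull only rows changed since the last run's
# high watermark, then advance the watermark to the newest value seen.

def extract_incremental(source_conn, last_watermark: datetime):
    cur = source_conn.cursor()
    cur.execute(
        "SELECT id, email, amount_cents, updated_at "
        "FROM orders WHERE updated_at > %s ORDER BY updated_at",
        (last_watermark,),
    )
    rows = cur.fetchall()
    # New watermark is the max updated_at actually processed; if nothing
    # changed, keep the old one.
    new_watermark = rows[-1][3] if rows else last_watermark
    return rows, new_watermark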
Change data capture (CDC) strategy: capture every change from the source.
Approaches:
┌────────────────────────────────────────────────┐
│ Log-based CDC (Debezium, AWS DMS)              │
│ - Reads database transaction log               │
│ - Captures inserts, updates, deletes           │
│ - No source table modification needed          │
└────────────────────────────────────────────────┘
┌────────────────────────────────────────────────┐
│ Trigger-based CDC                              │
│ - Database triggers on changes                 │
│ - Writes to change table                       │
│ - Adds load to source                          │
└────────────────────────────────────────────────┘
┌────────────────────────────────────────────────┐
│ Timestamp-based CDC                            │
│ - Query by updated_at timestamp                │
│ - Simple but misses hard deletes               │
└────────────────────────────────────────────────┘
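To show how captured changes are usually applied, a sketch that routes Debezium-style change events (a dict with `op` and `before`/`after` fields; adjust for your CDC tool) to insert, update, or delete handling:

# CDC apply sketch: route Debezium-style change events to the right
# operation on the target. Field names ("op", "before", "after") follow
# Debezium's envelope; other tools differ.

def apply_change_event(cur, event: dict):
    op = event["op"]
    if op in ("c", "r"):          # create / snapshot read
        row = event["after"]
        cur.execute(
            "INSERT INTO customers (id, name) VALUES (%s, %s)",
            (row["id"], row["name"]),
        )
    elif op == "u":               # update
        row = event["after"]
        cur.execute(
            "UPDATE customers SET name = %s WHERE id = %s",
            (row["name"], row["id"]),
        )
    elif op == "d":               # delete: log-based CDC can see these
        row = event["before"]
        cur.execute("DELETE FROM customers WHERE id = %s", (row["id"],))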
A set-based alternative is to stage the changes and merge them into the target in one statement:

-- Snowflake/BigQuery style MERGE
MERGE INTO target t
USING source s ON t.id = s.id
WHEN MATCHED THEN UPDATE SET
    t.name = s.name,
    t.updated_at = CURRENT_TIMESTAMP
WHEN NOT MATCHED THEN INSERT (id, name, created_at)
    VALUES (s.id, s.name, CURRENT_TIMESTAMP);
Pipeline with quality gates:
Extract → Validate → Load → Transform → Validate → Serve
              │                             │
              ▼                             ▼
         Quarantine                    Alert/Block
        (bad records)                (quality issue)
Schema validation:
- Required fields present
- Data types match
- Field lengths within limits
Data validation:
- Null checks
- Range checks
- Format validation (dates, emails)
- Referential integrity
Statistical validation:
- Row count within expected range
- Value distributions within expected bounds
- No unexpected duplicates
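A minimal record-level validation sketch covering the checks above; the field names and limits are illustrative, and failing rows are returned separately so they can be quarantined instead of failing the whole batch.

import re

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

# Validation sketch: split a batch into good rows and quarantined rows,
# each quarantined row carrying the reasons it failed.

def validate_batch(records):
    good, quarantined = [], []
    for rec in records:
        errors = []
        if not rec.get("customer_id"):                       # required field
            errors.append("missing customer_id")
        if rec.get("email") and not EMAIL_RE.match(rec["email"]):
            errors.append("malformed email")                 # format check
        amount = rec.get("amount_usd")
        if amount is not None and not (0 <= amount <= 1_000_000):
            errors.append("amount out of range")             # range check
        if errors:
            quarantined.append({**rec, "_errors": errors})
        else:
            good.append(rec)
    return good, quarantined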
Define expectations between producer and consumer:
{
  "contract_version": "1.0",
  "schema": {
    "customer_id": {"type": "string", "required": true},
    "email": {"type": "string", "format": "email"},
    "created_at": {"type": "timestamp"}
  },
  "quality": {
    "freshness": "< 1 hour",
    "completeness": "> 99%",
    "row_count": "10000-100000"
  }
}
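Dataset-level checks from the `quality` block can be enforced at load time; the sketch below mirrors the thresholds in the example contract and is illustrative only.

from datetime import datetime, timedelta, timezone

# Contract enforcement sketch: verify freshness, completeness, and row
# count against the thresholds declared in the data contract above.

def check_quality(rows, latest_event_time: datetime) -> list[str]:
    violations = []
    if datetime.now(timezone.utc) - latest_event_time > timedelta(hours=1):
        violations.append("freshness: data older than 1 hour")
    non_null_emails = sum(1 for r in rows if r.get("email"))
    if rows and non_null_emails / len(rows) <= 0.99:
        violations.append("completeness: email below 99%")
    if not (10_000 <= len(rows) <= 100_000):
        violations.append(f"row_count: {len(rows)} outside 10000-100000")
    return violations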
Batch processing (schedule-based):

┌─────────┐      ┌─────────┐      ┌─────────┐
│  Cron   │ ──►  │  Spark  │ ──►  │   DW    │
│ (daily) │      │  (ETL)  │      │         │
└─────────┘      └─────────┘      └─────────┘
Best for: Non-real-time, large volumes
Tools: Airflow, Dagster, Prefect
Stream processing (real-time):

┌─────────┐      ┌─────────┐      ┌─────────┐
│  Kafka  │ ──►  │  Flink  │ ──►  │   DW    │
│ (events)│      │(process)│      │(stream) │
└─────────┘      └─────────┘      └─────────┘
Best for: Real-time analytics, event-driven
Tools: Kafka Streams, Flink, Spark Streaming
Lambda architecture (batch + speed layers):

                   ┌─────────────────┐
            ┌────► │   Batch Layer   │ ────┐
            │      │ (comprehensive) │     │
┌─────────┐ │      └─────────────────┘     │   ┌─────────┐
│  Data   │─┤                              ├─► │ Serving │
│ Sources │ │      ┌─────────────────┐     │   │  Layer  │
└─────────┘ └────► │   Speed Layer   │ ────┘   └─────────┘
                   │   (real-time)   │
                   └─────────────────┘
Pros: Complete + real-time
Cons: Complex, duplicate logic
Kappa architecture (streaming only, reprocess from the log):

┌─────────┐      ┌─────────┐      ┌─────────┐
│  Kafka  │ ──►  │ Stream  │ ──►  │ Serving │
│  (log)  │      │ Process │      │  Layer  │
└─────────┘      └─────────┘      └─────────┘
     │
     └── Replay for reprocessing
Pros: Simple, single codebase
Cons: Requires replayable log
Directed Acyclic Graph of tasks:
extract_a ──┐
            ├── transform ── load
extract_b ──┘
Tools: Airflow, Dagster, Prefect
Orchestration best practices (see the DAG sketch after this list):
1. Idempotent tasks (safe to retry)
2. Clear dependencies
3. Appropriate granularity
4. Monitoring and alerting
5. Backfill support
6. Parameterized runs
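A minimal Airflow-style DAG tying these ideas together: parameterized by the logical run date, with dependencies matching the diagram above. Task bodies are placeholders, and the `schedule` argument name varies across Airflow versions.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Orchestration sketch: two extracts fan in to a transform, then a load.
# Each task receives the logical run date so backfills and reruns stay
# idempotent.

def extract(source: str, ds: str, **_):
    print(f"extracting {source} for {ds}")   # placeholder task body

def transform(ds: str, **_):
    print(f"transforming for {ds}")

def load(ds: str, **_):
    print(f"loading for {ds}")

with DAG(
    dag_id="daily_orders_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",        # "schedule_interval" on older Airflow versions
    catchup=True,             # enables backfills over past dates
) as dag:
    extract_a = PythonOperator(task_id="extract_a",
                               python_callable=extract,
                               op_kwargs={"source": "orders_db"})
    extract_b = PythonOperator(task_id="extract_b",
                               python_callable=extract,
                               op_kwargs={"source": "payments_api"})
    transform_t = PythonOperator(task_id="transform", python_callable=transform)
    load_t = PythonOperator(task_id="load", python_callable=load)

    [extract_a, extract_b] >> transform_t >> load_t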
Transient errors (network, timeout):
- Exponential backoff
- Max retry count
- Circuit breaker
Data errors (validation failure):
- Quarantine bad records
- Continue processing good records
- Alert for review
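A sketch of the transient-error strategy: retry with exponential backoff and jitter, capped at a maximum number of attempts (the delay values are arbitrary).

import random
import time

# Retry sketch for transient errors: exponential backoff with jitter,
# giving up (and re-raising) after max_attempts.

def with_retries(fn, *, max_attempts: int = 5, base_delay: float = 1.0):
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts:
                raise                      # out of retries: surface the error
            delay = base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.5)
            time.sleep(delay)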
Dead letter queue (DLQ) flow: Failed records → DLQ → Manual review → Reprocess
Capture for each failed record (sketched below):
- Original record
- Error message
- Timestamp
- Retry count
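A sketch of the message that lands in the dead letter queue, capturing the fields listed above; how it is actually enqueued depends on your queue or table.

import json
from datetime import datetime, timezone

# Dead-letter sketch: wrap the failed record with enough context to
# review and reprocess it later.

def to_dlq_message(record: dict, error: Exception, retry_count: int) -> str:
    return json.dumps({
        "original_record": record,
        "error_message": str(error),
        "failed_at": datetime.now(timezone.utc).isoformat(),
        "retry_count": retry_count,
    })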
Pipeline best practices:
1. Idempotent transformations
2. Clear lineage tracking
3. Appropriate checkpointing
4. Graceful failure handling
5. Comprehensive logging
6. Data quality gates
Performance optimization (see the Parquet sketch after this list):
1. Partition data appropriately
2. Incremental processing when possible
3. Parallel extraction
4. Efficient file formats (Parquet, ORC)
5. Compression
6. Resource sizing
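As one example of points 1, 4, and 5, writing compressed, partitioned Parquet with pandas (assuming pyarrow is installed as the Parquet engine; column and path names are illustrative):

import pandas as pd

# Performance sketch: write the day's extract as Snappy-compressed
# Parquet, partitioned by event_date so downstream reads can prune.

def write_partitioned(df: pd.DataFrame, path: str = "warehouse/orders/") -> None:
    df.to_parquet(
        path,
        engine="pyarrow",
        partition_cols=["event_date"],   # one directory per day
        compression="snappy",
    )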
Related skills:
- data-architecture: Data platform design
- stream-processing: Real-time processing
- ml-system-design: Feature engineering