Use when designing data platforms, choosing between data lakes, warehouses, and lakehouses, or implementing data mesh patterns. Provides guidance on modern data architecture approaches, helping you choose based on your specific use case and requirements.
/plugin marketplace add melodic-software/claude-code-plugins
/plugin install systems-design@melodic-software
Modern data architecture patterns including data lakes, lakehouses, data mesh, and data platform design.
Generation 1: Data Warehouse (1990s-2000s)
- Structured data only
- ETL into warehouse
- Star/snowflake schemas
- SQL-based analytics
Generation 2: Data Lake (2010s)
- All data types (structured, semi, unstructured)
- Schema-on-read
- Hadoop/HDFS based
- Cheap storage, complex processing
Generation 3: Lakehouse (2020s)
- Best of both: lake flexibility + warehouse features
- ACID transactions on lake
- Schema enforcement optional
- Unified analytics and ML
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Sources   │ ──► │     ETL     │ ──► │  Warehouse  │
│ (Structured)│     │ (Transform) │     │ (Star/Snow) │
└─────────────┘     └─────────────┘     └─────────────┘
                                               │
                                               ▼
                                        ┌─────────────┐
                                        │     BI      │
                                        │  Analytics  │
                                        └─────────────┘
Characteristics:
- Schema-on-write
- Optimized for SQL queries
- Structured data only
- High data quality
- Expensive storage
Best for:
- Business intelligence
- Financial reporting
- Structured analytics
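To make schema-on-write concrete, here is a minimal sketch using Python's built-in sqlite3 module: the table's structure and constraints are declared before any data lands, and non-conforming rows are rejected at load time. The table and column names are illustrative, not from any particular warehouse.

```python
import sqlite3

# Schema-on-write: structure and constraints are declared up front.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE fact_sales (
        sale_id INTEGER PRIMARY KEY,
        amount  REAL NOT NULL CHECK (amount >= 0),
        region  TEXT NOT NULL
    )
""")

# A conforming row loads cleanly.
conn.execute("INSERT INTO fact_sales VALUES (1, 99.5, 'EU')")

# A non-conforming row is rejected at write time -- this enforcement
# is what gives warehouses their high data quality.
try:
    conn.execute("INSERT INTO fact_sales VALUES (2, -10.0, NULL)")
except sqlite3.IntegrityError as e:
    print(f"rejected: {e}")
```

The trade-off is rigidity: every source must be transformed to fit the declared schema before it can be stored at all.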
┌─────────────┐     ┌─────────────┐
│   Sources   │ ──► │  Data Lake  │
│    (All)    │     │    (Raw)    │
└─────────────┘     └─────────────┘
                           │
          ┌────────────────┼────────────────┐
          ▼                ▼                ▼
    ┌─────────┐      ┌─────────┐      ┌─────────┐
    │   ML    │      │   ETL   │      │  Spark  │
    │ Training│      │  to DW  │      │ Analysis│
    └─────────┘      └─────────┘      └─────────┘
Characteristics:
- Schema-on-read
- All data types
- Cheap storage
- Flexible processing
- Risk of "data swamp"
Best for:
- Data science/ML
- Unstructured data
- Experimental analysis
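Schema-on-read is the mirror image: structure is imposed at query time rather than at ingest. A small sketch with illustrative records, using only the standard library:

```python
import json

# Raw, heterogeneous records as they might land in a lake:
# no upfront schema, and fields vary from record to record.
raw_lines = [
    '{"user": "a", "event": "click", "ms": 120}',
    '{"user": "b", "event": "view"}',
    '{"user": "c", "event": "click", "ms": 80, "extra": {"ab": true}}',
]

def read_clicks(lines):
    """Impose structure at read time: project only the fields this
    analysis needs, tolerating records where some are missing."""
    for line in lines:
        rec = json.loads(line)
        if rec.get("event") == "click":
            yield {"user": rec["user"], "ms": rec.get("ms")}

clicks = list(read_clicks(raw_lines))
print(clicks)
```

The flexibility is also the risk: nothing stops malformed or undocumented data from accumulating, which is how a lake degrades into a "data swamp".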
┌─────────────┐     ┌─────────────────────────────────┐
│   Sources   │ ──► │         Data Lakehouse          │
│    (All)    │     │  ┌──────────────────────────┐   │
└─────────────┘     │  │      Metadata Layer      │   │
                    │  │   (Delta/Iceberg/Hudi)   │   │
                    │  └──────────────────────────┘   │
                    │  ┌──────────────────────────┐   │
                    │  │      Storage Layer       │   │
                    │  │     (Object Storage)     │   │
                    │  └──────────────────────────┘   │
                    └─────────────────────────────────┘
                                     │
                ┌────────────────────┼────────────────────┐
                ▼                    ▼                    ▼
           ┌─────────┐          ┌─────────┐          ┌─────────┐
           │   SQL   │          │   ML    │          │ Stream  │
           │   BI    │          │ Workload│          │ Process │
           └─────────┘          └─────────┘          └─────────┘
Characteristics:
- ACID transactions
- Schema evolution
- Time travel
- Unified batch/streaming
- Open formats
Best for:
- Unified analytics
- Both BI and ML
- Modern data platforms
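The key idea behind the metadata layer is a transaction log over immutable files in object storage. The sketch below is a deliberately simplified toy, loosely modeled on how Delta/Iceberg/Hudi track table state; it is not any real format's log protocol. All names are illustrative.

```python
# Toy metadata layer over "object storage" (a dict of immutable files).
# Each commit appends a log entry naming the files that are live at that
# version, so readers get a consistent snapshot and old versions remain
# queryable (time travel).
objects = {}   # path -> contents (stand-in for object storage)
log = []       # append-only transaction log; index == version

def commit(version, add_files):
    for path, data in add_files.items():
        objects[path] = data                       # write data files first
    live = (log[-1]["live"] if log else []) + list(add_files)
    log.append({"version": version, "live": live})  # then commit atomically

def read(version=None):
    """Read the table at a given version (None = latest)."""
    entry = log[-1] if version is None else log[version]
    return [row for path in entry["live"] for row in objects[path]]

commit(0, {"part-0": [{"id": 1}]})
commit(1, {"part-1": [{"id": 2}]})

print(read())           # latest snapshot: both rows
print(read(version=0))  # time travel: only the first file's rows
```

Because data files are never mutated in place, readers at an old version are unaffected by concurrent writers; the log is the single point of coordination.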
| Factor | Warehouse | Lake | Lakehouse |
|---|---|---|---|
| Data types | Structured | All | All |
| Query performance | Excellent | Poor-Medium | Good |
| Data quality | High | Variable | Configurable |
| Cost | High | Low | Medium |
| ML workloads | Limited | Excellent | Excellent |
| Real-time | Limited | Good | Good |
| Governance | Strong | Weak | Strong |
| Complexity | Low | High | Medium |
Decision Tree:
Is data mostly structured with BI focus?
├── Yes → Data Warehouse
└── No
    └── Need ML + BI on same data?
        ├── Yes → Lakehouse
        └── No
            └── Primarily ML/unstructured?
                ├── Yes → Data Lake
                └── No → Lakehouse
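The decision tree above can be encoded directly; the three boolean parameters mirror its three questions:

```python
def choose_architecture(mostly_structured_bi: bool,
                        ml_and_bi_on_same_data: bool,
                        primarily_ml_unstructured: bool) -> str:
    """Encodes the decision tree: each branch is checked in order."""
    if mostly_structured_bi:
        return "data warehouse"
    if ml_and_bi_on_same_data:
        return "lakehouse"
    if primarily_ml_unstructured:
        return "data lake"
    return "lakehouse"  # default for mixed/unclear workloads

print(choose_architecture(True, False, False))   # data warehouse
print(choose_architecture(False, True, False))   # lakehouse
print(choose_architecture(False, False, True))   # data lake
```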
Delta Lake
Features:
- ACID transactions
- Time travel (data versioning)
- Schema enforcement/evolution
- Unified batch/streaming
- Optimized performance (Z-ordering, compaction)
File format: Parquet + Delta log
Apache Iceberg
Features:
- ACID transactions
- Hidden partitioning
- Schema evolution
- Time travel
- Vendor neutral
File format: Parquet/ORC/Avro + metadata
Apache Hudi
Features:
- ACID transactions
- Incremental processing
- Record-level updates
- Time travel
- Optimized for streaming
File format: Parquet + Hudi metadata
| Feature | Delta Lake | Iceberg | Hudi |
|---|---|---|---|
| ACID | Yes | Yes | Yes |
| Time Travel | Yes | Yes | Yes |
| Schema Evolution | Good | Excellent | Good |
| Streaming | Excellent | Good | Excellent |
| Ecosystem | Databricks-centric | Wide | Wide |
| Performance | Excellent | Excellent | Good |
| Community | Large | Growing | Medium |
Data Mesh = Decentralized data architecture
Four Principles:
1. Domain Ownership
- Data owned by domain teams
- Not centralized data team
2. Data as a Product
- Treat data like a product
- Quality, discoverability, usability
3. Self-Serve Platform
- Platform enables domain teams
- Reduces friction
4. Federated Governance
- Global standards
- Local implementation
Data Product = Autonomous unit of data
Components:
┌──────────────────────────────────────┐
│             Data Product             │
│  ┌──────────┐  ┌──────────────────┐  │
│  │   Data   │  │     Metadata     │  │
│  │ (Tables) │  │  (Schema, docs)  │  │
│  └──────────┘  └──────────────────┘  │
│  ┌──────────┐  ┌──────────────────┐  │
│  │   Code   │  │       APIs       │  │
│  │  (ETL)   │  │  (Access layer)  │  │
│  └──────────┘  └──────────────────┘  │
│  ┌────────────────────────────────┐  │
│  │         Quality + SLAs         │  │
│  └────────────────────────────────┘  │
└──────────────────────────────────────┘
| Aspect | Centralized | Data Mesh |
|---|---|---|
| Ownership | Central data team | Domain teams |
| Scaling | Team bottleneck | Scales with org |
| Domain knowledge | Lost in translation | Preserved |
| Governance | Centralized | Federated |
| Implementation | Uniform | Heterogeneous |
| Complexity | Lower initially | Higher initially |
Star Schema:
           ┌─────────────┐
           │  Dim_Time   │
           └──────┬──────┘
                  │
┌───────────┐     │     ┌────────────┐
│Dim_Product├─────┼─────┤Dim_Customer│
└───────────┘     │     └────────────┘
                  │
           ┌──────┴──────┐
           │ Fact_Sales  │
           └─────────────┘
Pros: Simple, fast queries
Cons: Denormalized, redundancy
Best for: BI, reporting
Snowflake Schema:
Normalized dimensions:
Dim_Product → Dim_Category → Dim_Subcategory
Pros: Less redundancy
Cons: More joins, slower
Best for: Complex hierarchies
Data Vault:
Hub (business keys) ←→ Link (relationships) ←→ Satellite (attributes)
Pros: Auditable, flexible, scalable
Cons: Complex, learning curve
Best for: Enterprise data warehouse
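The hub/link/satellite split can be sketched with plain records; the shapes below are a minimal illustration (field names are my own, not a Data Vault standard):

```python
from datetime import datetime, timezone

# Hub: only the stable business key.
def hub(business_key):
    return {"key": business_key}

# Satellite: descriptive attributes plus load metadata for auditability.
def satellite(business_key, attributes, source):
    return {
        "key": business_key,
        "attrs": attributes,
        "load_ts": datetime.now(timezone.utc).isoformat(),
        "source": source,
    }

# Link: a relationship between two business keys.
def link(key_a, key_b):
    return {"keys": (key_a, key_b)}

customer = hub("CUST-42")
details  = satellite("CUST-42", {"name": "Acme"}, source="crm")
order    = link("CUST-42", "ORDER-7")

# Changes are appended as new satellite rows, never overwritten --
# this insert-only history is what makes the model auditable.
history = [details, satellite("CUST-42", {"name": "Acme Corp"}, source="crm")]
print(len(history))  # 2 versions retained
```

The cost of this flexibility is query complexity: reconstructing a "current state" view requires joining hubs, links, and the latest satellite rows.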
Medallion Architecture:
┌─────────┐     ┌─────────┐     ┌─────────┐
│ Bronze  │ ──► │ Silver  │ ──► │  Gold   │
│  (Raw)  │     │(Cleaned)│     │(Curated)│
└─────────┘     └─────────┘     └─────────┘
Bronze: Raw ingestion, append-only
Silver: Cleaned, validated, conformed
Gold: Business-level aggregates, features
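The three layers can be sketched end to end with plain Python structures (the records and field names are illustrative):

```python
# Bronze: raw, append-only ingestion -- bad records are kept as they landed.
bronze = [
    {"id": "1", "amount": "10.5", "region": "eu"},
    {"id": "2", "amount": "oops", "region": "eu"},   # malformed, still stored
    {"id": "3", "amount": "4.5",  "region": "us"},
]

# Silver: cleaned and conformed -- types cast, invalid rows dropped.
silver = []
for rec in bronze:
    try:
        silver.append({"id": int(rec["id"]),
                       "amount": float(rec["amount"]),
                       "region": rec["region"].upper()})
    except ValueError:
        pass  # in practice, route to a quarantine table instead of discarding

# Gold: business-level aggregate, ready for BI consumption.
gold = {}
for rec in silver:
    gold[rec["region"]] = gold.get(rec["region"], 0.0) + rec["amount"]

print(gold)  # {'EU': 10.5, 'US': 4.5}
```

Keeping bronze untouched means silver and gold can always be rebuilt from scratch when cleaning logic changes.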
Data lake zones:
Landing Zone: Raw files from sources
Raw Zone: Structured raw data
Curated Zone: Transformed, quality-checked
Consumption Zone: Ready for analytics
Sandbox Zone: Exploration and experimentation
Implement quality gates:
- Schema validation
- Null checks
- Range validation
- Referential integrity
- Freshness monitoring
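A minimal quality gate covering most of the checks above (referential integrity is omitted, since it needs the dimension tables; field names and thresholds are illustrative):

```python
from datetime import datetime, timedelta, timezone

REQUIRED = {"id", "amount", "ts"}

def quality_gate(record, max_age=timedelta(hours=24)):
    """Return the list of failed checks; an empty list means the record passes."""
    failures = []
    # Schema validation: all required fields present.
    if not REQUIRED <= record.keys():
        failures.append("schema")
        return failures
    # Null checks.
    if any(record[f] is None for f in REQUIRED):
        failures.append("nulls")
        return failures
    # Range validation.
    if not (0 <= record["amount"] <= 1_000_000):
        failures.append("range")
    # Freshness monitoring: record must be recent.
    if datetime.now(timezone.utc) - record["ts"] > max_age:
        failures.append("stale")
    return failures

now = datetime.now(timezone.utc)
print(quality_gate({"id": 1, "amount": 50.0, "ts": now}))                 # []
print(quality_gate({"id": 2, "amount": -5.0, "ts": now}))                 # ['range']
print(quality_gate({"id": 3, "amount": 1.0, "ts": now - timedelta(days=2)}))  # ['stale']
```

Gates like this typically run at the bronze-to-silver boundary, with failing records routed to a quarantine area rather than silently dropped.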
Key capabilities:
- Data catalog
- Lineage tracking
- Access control
- Privacy compliance
- Audit logging
Optimization techniques:
- Partitioning (by date, region)
- Clustering/Z-ordering
- Compaction
- Caching
- Materialized views
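Partitioning pays off through pruning: when files are laid out by partition key, the engine can skip whole directories that cannot match a query's filter. A sketch of Hive-style `key=value` path pruning (the paths and keys are illustrative):

```python
# Hive-style partition layout: one directory per (date, region) combination.
files = [
    "events/date=2024-01-01/region=eu/part-0.parquet",
    "events/date=2024-01-01/region=us/part-0.parquet",
    "events/date=2024-01-02/region=eu/part-0.parquet",
]

def parse_partitions(path):
    """Extract key=value partition segments from a path."""
    return dict(seg.split("=", 1) for seg in path.split("/") if "=" in seg)

def prune(paths, **predicate):
    """Keep only files whose partition values match the predicate --
    the engine never has to open the skipped files."""
    return [p for p in paths
            if all(parse_partitions(p).get(k) == v
                   for k, v in predicate.items())]

print(prune(files, region="eu"))                      # 2 of 3 files read
print(prune(files, date="2024-01-01", region="us"))   # only 1 file read
```

Choose partition keys that match common filters (date is the usual first choice); over-partitioning on high-cardinality columns creates many tiny files, which is what compaction then has to undo.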
Related skills:
- etl-elt-patterns - Data transformation
- stream-processing - Real-time data
- database-scaling - Database patterns