npx claudepluginhub nwave-ai/nwave --plugin nw
This skill uses the workspace's default tool permissions.
Guides data platform design by comparing data warehouses, data lakes, and lakehouses, and by applying data mesh patterns, with trade-offs for each.
Provides Snowflake architecture blueprints: traditional data warehouse, Iceberg lakehouse, data mesh with sharing. Includes diagrams and SQL for multi-scale deployments.
Builds scalable data pipelines, modern data warehouses, and real-time streaming architectures using Apache Spark, dbt, Airflow, and cloud-native platforms.
Structured only -> Data Warehouse | Mixed + SQL analytics -> Data Lakehouse | Mixed + ML-primary -> Data Lake | Large org + autonomous domains -> Data Mesh
Schema: structured, schema-on-write | Data: tables, rows, columns | Governance: centralized | Query: SQL analytics, BI | Architecture: centralized single source of truth
Star Schema: Central fact table (measures) surrounded by denormalized dimension tables. Best for BI dashboards, standard reporting.
Snowflake Schema: Normalized dimensions (dimensions reference other dimensions). Reduces storage, increases JOIN complexity. Best when storage cost matters more than query speed.
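A minimal star-schema sketch in Python, using the standard-library sqlite3 as a stand-in warehouse; the table and column names (fact_sales, dim_date, dim_product) are illustrative, not from the skill.

```python
import sqlite3

# In-memory SQLite stands in for a warehouse; all names here are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Denormalized dimension tables surrounding a central fact table (star schema)
    CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, full_date TEXT, month TEXT);
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE fact_sales  (date_key INTEGER, product_key INTEGER, quantity INTEGER, revenue REAL);
""")
conn.executemany("INSERT INTO dim_date VALUES (?, ?, ?)",
                 [(20240101, "2024-01-01", "2024-01")])
conn.executemany("INSERT INTO dim_product VALUES (?, ?, ?)",
                 [(1, "Widget", "Hardware"), (2, "Gadget", "Software")])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
                 [(20240101, 1, 3, 29.97), (20240101, 2, 1, 19.99)])

# Typical BI query: join the fact table to its dimensions and aggregate a measure.
for row in conn.execute("""
    SELECT d.month, p.category, SUM(f.revenue) AS revenue
    FROM fact_sales f
    JOIN dim_date d    ON d.date_key = f.date_key
    JOIN dim_product p ON p.product_key = f.product_key
    GROUP BY d.month, p.category
"""):
    print(row)
```

A snowflake schema would split dim_product into product and category tables, trading an extra JOIN for less redundancy.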
Kimball (Bottom-Up): Build data marts first, integrate later | Star schema, business-process driven | Faster initial delivery | Best for quick wins, department-level analytics
Inmon (Top-Down): Build enterprise DW first, derive data marts | Normalized 3NF enterprise model | Higher upfront effort | Best for large enterprises needing single source of truth
Technology: Snowflake | Amazon Redshift | Google BigQuery | Azure Synapse Analytics
Schema-on-read, flexible | All formats (structured, semi-structured, unstructured) | Raw data in native format | Query via Athena, Spark SQL, PySpark, Pandas | Risk: "data swamp" without governance
Zones: raw (landing, original format) -> curated (cleaned, validated) -> processed (transformed for use cases) -> archive (cold storage)
Technology: S3 + Athena/Glue | Azure Data Lake Storage + Synapse | HDFS + Hive
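A small Python sketch of the raw -> curated hop between zones, assuming pandas with pyarrow installed; the zone paths, column names, and ingest date are placeholders (production zones would live on S3/ADLS).

```python
import pandas as pd

# Zone paths are illustrative; in practice these would be s3:// or abfss:// URIs.
RAW_ZONE = "lake/raw/events/2024-06-01/events.json"
CURATED_ZONE = "lake/curated/events"

# Raw zone: data lands in its original format (JSON lines here), schema-on-read.
raw = pd.read_json(RAW_ZONE, lines=True)

# Curated zone: validated, cleaned, and stored in a columnar format.
curated = raw.dropna(subset=["event_id", "user_id"]).drop_duplicates("event_id")
curated["ingest_date"] = "2024-06-01"

# Partition by ingestion date so engines like Athena or Spark SQL can prune partitions.
curated.to_parquet(CURATED_ZONE, partition_cols=["ingest_date"])  # requires pyarrow
```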
Combines warehouse reliability with lake flexibility | Schema enforcement on write with evolution support | ACID transactions on lake storage | Supports both BI/SQL and ML/data science workloads
Bronze: Raw data as-is | Append-only for auditability | Partitioned by ingestion date | Schema-on-read
Silver: Quality rules (null checks, range validation, referential integrity) | Deduplication on business keys | Schema enforced | SCD applied
Gold: Business-level aggregations | Dimensional models (star/snowflake) | Pre-computed metrics/KPIs | Optimized for BI/reporting
Technology: Databricks (Delta Lake) | Apache Iceberg | Apache Hudi
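A hedged PySpark sketch of the bronze -> silver hop; it assumes a Spark session already configured with Delta Lake, and the paths, business key (order_id), and quality rules are illustrative.

```python
from pyspark.sql import SparkSession, functions as F

# Assumes a cluster/session already configured with the Delta Lake extensions.
spark = SparkSession.builder.appName("medallion-demo").getOrCreate()

bronze_path = "/lake/bronze/orders"   # raw, append-only, partitioned by ingest date
silver_path = "/lake/silver/orders"   # cleaned, deduplicated, schema enforced

bronze = spark.read.format("delta").load(bronze_path)

silver = (
    bronze
    # Quality rules: drop rows that fail basic null / range checks
    .filter(F.col("order_id").isNotNull() & (F.col("amount") >= 0))
    # Deduplicate on the business key, keeping one row per order
    .dropDuplicates(["order_id"])
)

# Schema is enforced on write; schema evolution would need mergeSchema or ALTER TABLE.
silver.write.format("delta").mode("overwrite").save(silver_path)
```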
Use when: Large org with autonomous domain teams | Central data team is bottleneck | Domain expertise needed | Platform engineering maturity exists
Avoid when: Small team (<50 engineers) | Simple data needs | No platform capability | Unclear domain boundaries
ETL: Transform before loading via a dedicated engine (Informatica, Talend, SSIS). Best for complex transforms, constrained targets, regulatory requirements. Scaling limited by the transform engine.
ELT: Load raw data first, then transform using the target's compute (dbt, Snowflake SQL, BigQuery SQL). Best for cloud DWs with elastic compute and for preserving raw data. Scales with the target system.
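An ELT sketch in Python with DuckDB standing in for the cloud warehouse's compute; the CSV path and table names are assumptions, and in practice step 2 would be a dbt model or warehouse SQL.

```python
import duckdb

con = duckdb.connect("warehouse.duckdb")

# 1. Load: land the raw file as-is into a staging table (no transformation yet),
#    preserving the raw data for later reprocessing.
con.execute("""
    CREATE OR REPLACE TABLE raw_orders AS
    SELECT * FROM read_csv_auto('exports/orders.csv')
""")

# 2. Transform: use the target engine's own SQL (the dbt-model pattern) to build
#    an analytics-ready table from the raw table.
con.execute("""
    CREATE OR REPLACE TABLE orders_clean AS
    SELECT order_id,
           customer_id,
           CAST(order_ts AS TIMESTAMP) AS order_ts,
           amount
    FROM raw_orders
    WHERE order_id IS NOT NULL
""")
```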
Apache Airflow: DAG-based, Python-native, wide adoption | Prefect: modern, dynamic workflows | Dagster: software-defined assets
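A minimal Airflow DAG sketch (assumes a recent Airflow 2.x); the dag_id, schedule, and task callables are placeholders for real extract/load/transform logic.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder callables; real tasks would call extraction, loading, and dbt/SQL steps.
def extract():
    print("extract")

def load():
    print("load")

def transform():
    print("transform")

with DAG(
    dag_id="daily_elt",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",   # cron expressions also work
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_load = PythonOperator(task_id="load", python_callable=load)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)

    # DAG edges: extract -> load -> transform
    t_extract >> t_load >> t_transform
```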
Distributed event streaming platform. Concepts: topics, partitions, consumer groups, offsets. At-least-once delivery (exactly-once with transactions). Use as event bus, message broker, stream storage.
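A producer/consumer sketch using the kafka-python client; the broker address, topic, and group id are assumptions.

```python
import json
from kafka import KafkaProducer, KafkaConsumer  # kafka-python client

# Producer: append JSON events to a topic (the durable event log).
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("orders", {"order_id": 42, "amount": 19.99})
producer.flush()

# Consumer: the consumer group tracks progress via offsets per partition.
# Delivery is at-least-once by default, so processing should be idempotent.
consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    group_id="billing-service",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:  # blocks, polling the topic
    print(message.partition, message.offset, message.value)
```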
Stateful stream processing engine. Concepts: DataStreams, windows (tumbling, sliding, session), state management. Exactly-once with checkpointing. Common pattern: Sources -> Kafka (durable event buffer) -> Flink (stateful compute) -> Sinks.
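A rough PyFlink sketch of a keyed tumbling-window count, assuming a recent PyFlink release where the DataStream window API is available; the element shape and window size are illustrative, and a real job would read from an unbounded source such as Kafka rather than a small local collection.

```python
from pyflink.common import Types
from pyflink.common.time import Time
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.window import TumblingProcessingTimeWindows

env = StreamExecutionEnvironment.get_execution_environment()

# In production the source would be a Kafka connector (the durable event buffer);
# a small collection stands in here.
events = env.from_collection(
    [("user_a", 1), ("user_b", 1), ("user_a", 1)],
    type_info=Types.TUPLE([Types.STRING(), Types.INT()]),
)

(
    events
    .key_by(lambda e: e[0])                                      # partition state by user
    .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))  # 10s tumbling window
    .reduce(lambda a, b: (a[0], a[1] + b[1]))                    # per-user count in window
    .print()
)

env.execute("windowed_counts")
```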
Streaming: real-time dashboards, fraud detection, IoT, event-driven | Batch: overnight reporting, historical analysis, ML training | Lambda: parallel batch + stream (complex, prefer Kappa) | Kappa: stream-only, reprocess from Kafka log (simpler)
Vertical Scaling: Add CPU/RAM/storage to the existing server | Simpler ops, no app changes | Hard limit: largest available hardware | Use first for moderate growth
Read Replicas: Replicate to read-only copies | Route reads to replicas, writes to primary | Trade-off: replication lag (eventual consistency) | Use for read-heavy workloads
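A toy routing layer illustrating the read-replica pattern: writes go to the primary, reads round-robin across replicas; the DSNs and the naive SELECT check are placeholders for a real driver or proxy.

```python
import itertools

class RoutingPool:
    """Route writes to the primary and reads round-robin across replicas."""

    def __init__(self, primary_dsn, replica_dsns):
        self.primary_dsn = primary_dsn
        self.replicas = itertools.cycle(replica_dsns)

    def dsn_for(self, sql: str) -> str:
        # Naive classification: anything that is not a SELECT goes to the primary.
        # Reads routed to replicas may observe replication lag (eventual consistency).
        if sql.lstrip().lower().startswith("select"):
            return next(self.replicas)
        return self.primary_dsn

pool = RoutingPool("postgres://primary:5432/app",
                   ["postgres://replica1:5432/app", "postgres://replica2:5432/app"])
print(pool.dsn_for("SELECT * FROM orders"))    # -> a replica
print(pool.dsn_for("INSERT INTO orders ..."))  # -> the primary
```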
Partitioning (Single Server): Range (date, alphabetical) | List (region, category) | Hash (even distribution) | Benefits: query pruning, maintenance (drop old partitions)
Sharding (Multiple Servers): Distribute data across DB instances by shard key | Strategies: range-based, hash-based, directory-based, geographic
Shard Key Selection (most impactful decision): High cardinality | Even distribution across shards | Matches dominant query/access patterns | Stable (rarely updated)
Challenges: Cross-shard queries need scatter-gather | Distributed transactions (2PC) complex/slow | Resharding expensive | App complexity increases
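A hash-based shard-routing sketch; the shard list and shard key (customer_id) are illustrative.

```python
import hashlib

SHARDS = ["orders_db_0", "orders_db_1", "orders_db_2", "orders_db_3"]

def shard_for(customer_id: str) -> str:
    """Hash-based sharding: a stable hash of the shard key picks the instance."""
    # Use a stable hash (not Python's randomized hash()) so routing survives restarts.
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(SHARDS)
    return SHARDS[index]

print(shard_for("customer-42"))  # always maps to the same shard
print(shard_for("customer-7"))

# Note: changing len(SHARDS) remaps most keys -- this is why resharding is
# expensive; consistent hashing or directory-based routing mitigates it.
```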
Not exceeding single server -> optimize queries/indexes first | Read-heavy -> add read replicas | Write-heavy + partitionable -> partition then shard | Write-heavy + not partitionable -> write-optimized DBs (Cassandra, DynamoDB)
Normalize (3NF): OLTP with frequent writes | Data integrity paramount | Storage optimization | Write > read performance
Denormalize: OLAP/analytics (star schema) | Read-heavy, predictable queries | Query > write performance | Acceptable redundancy
Practical approach: Start normalized for transactional tables | Add denormalized/materialized views for reporting | Denormalize selectively based on measured performance | Document decisions and rationale