Data Engineer Roadmap
🎯 Agent Overview
Your comprehensive guide to data engineering. Master how to design, build, and maintain data infrastructure that powers analytics, machine learning, and business intelligence at scale.
Perfect For: Data engineers, data architects, backend engineers building data systems
Your Success Path: Fundamentals → Data Modeling → Warehousing → ETL Design → Scaling & Architecture → Advanced Patterns
📊 Agent Expertise & Comprehensive Competencies (150+ hours content)
1. SQL & Transformation Mastery (Complete)
- Advanced SQL: Window functions, CTEs, recursive queries, dynamic SQL, procedural SQL
- Data Transformation: Column/row transformations, aggregations, de-normalization, pivoting
- Performance at Scale: Parallel query execution, partitioning strategies, indexing optimization
- Incremental Processing: Change Data Capture (CDC), delta loads, incremental aggregations
- Data Lineage: Tracking data origin, transformation steps, business logic documentation
- Version Control: Managing SQL versions, testing transformations, data validation rules
2. Relational Database Technologies (Complete)
- PostgreSQL: Advanced features (JSON, arrays, extensions), replication, partitioning, performance tuning
- MySQL/MariaDB: Sharding strategies, backup/recovery, high availability, storage engines
- Oracle Database: Partitioning, advanced indexing, materialized views, AWR reports
- Amazon Aurora: Multi-master, global read replicas, serverless architecture, cost optimization
- Microsoft SQL Server: T-SQL advanced features, SSIS, columnstore indexes, distributed queries
- Database Internals: Query planning, statistics, locking mechanisms, transaction handling
3. NoSQL & Alternative Databases (Complete)
- Document Stores:
- MongoDB: Sharding, aggregation pipelines, TTL indexes
- Elasticsearch: Inverted indexes, aggregations, full-text search
- Column Stores:
- Cassandra: Distributed architecture, eventual consistency, C* Query Language
- HBase: HDFS integration, row key design, scans and gets
- Graph Databases:
- Neo4j: Cypher queries, property graphs, relationship traversal
- Neptune: AWS managed graph, RDF support
- Key-Value Stores:
- Redis: Persistence, clustering, Lua scripting
- DynamoDB: Partitioning, global secondary indexes, stream processing
- Search Engines: Solr, Opensearch, specialized indexing
4. Data Warehouse Design & Modeling (Deep-Dive)
- Dimensional Modeling:
- Fact tables (transactional, periodic snapshot, accumulating snapshot)
- Dimension tables (slowly changing dimensions types 0-7)
- Conformed dimensions, role-playing dimensions
- Degenerate dimensions, junk dimensions
- Star Schema vs Snowflake Schema: Denormalization trade-offs, query optimization
- Slowly Changing Dimensions (SCDs):
- Type 0: Fixed dimensions (no change)
- Type 1: Overwrite (lose history)
- Type 2: Add rows (full history with dates)
- Type 3: Add columns (limited history)
- Hybrid approaches
- Bus Matrix: Enterprise dimensional architecture, conformed dimensions
- Aggregate Tables: Pre-aggregated data, materialization strategies
- Design Validation: Query patterns, cardinality analysis, join efficiency
5. ETL/ELT Pipeline Design (Enterprise-Grade)
- Pipeline Architecture:
- Extraction methods (full dump, CDC, API, streaming)
- Transformation logic (business rules, data quality checks)
- Loading strategies (bulk load, incremental, streaming)
- ETL Patterns:
- Kimball methodology (dimensional approach)
- Data vault 2.0 (hub-and-spoke model)
- Lambda architecture (batch + real-time)
- Kappa architecture (streaming-first)
- Error Handling: Retry logic, dead letter queues, alerting, logging
- Monitoring: SLA tracking, job performance, data quality metrics
- Orchestration: Scheduling, dependency management, failure recovery
- Version Control: Schema versioning, backwards compatibility
6. Cloud Data Platforms (Complete)
- AWS Data Services:
- S3: Data lake storage, partitioning, cost optimization
- Redshift: MPP warehouse, columnar storage, spectrum
- Glue: Serverless ETL, crawlers, schema detection
- Kinesis: Real-time data streams, scaling
- Data Pipeline: Orchestration service
- Google Cloud:
- BigQuery: Serverless warehouse, nested structures, real-time analytics
- Dataflow: Batch and streaming pipeline execution
- Cloud Storage: Data lake organization
- Pub/Sub: Message-oriented architecture
- Azure Data Services:
- Synapse: Unified analytics, on-demand SQL, Spark pools
- Data Factory: Enterprise ETL orchestration
- ADLS Gen2: Hierarchical data lake
- Event Hubs: Real-time data ingestion
- Hybrid Approaches: Multi-cloud strategies, data mesh patterns
7. Data Quality & Governance (Production-Grade)
- Data Quality Framework:
- Completeness checks (null/missing values)
- Accuracy validation (business rule checks, cross-table validation)
- Consistency checks (referential integrity, format validation)
- Timeliness SLAs (freshness, latency)
- Data Lineage: Tracking data origin, transformations, dependencies
- Master Data Management: Golden records, MDM architecture
- Data Governance: Metadata management, data dictionary, access control
- Compliance: GDPR, CCPA, data retention policies, audit logging
- Metrics & Monitoring: Data quality scorecards, SLA dashboards
8. Big Data & Distributed Processing (Scalability)
- Hadoop Ecosystem:
- HDFS: Distributed file system, block replication
- MapReduce: Batch processing framework
- Hive: SQL on Hadoop, optimization techniques
- Spark: In-memory distributed computing, RDDs, DataFrames
- Spark Architecture: Master-worker, partitioning, shuffling, optimization
- Stream Processing: Kafka, Flink, Spark Streaming, windowing strategies
- Data Parallelization: Partitioning strategies, skew handling, shuffle optimization
- Cost Optimization: Spot instances, auto-scaling, query result caching
9. Real-Time & Event Processing (Streaming)
- Event-Driven Architecture: Pub/Sub patterns, event sourcing, CQRS
- Streaming Platforms: Kafka, Pulsar, Event Hubs
- Stream Processing Frameworks: Flink, Spark Streaming, ksqlDB
- Windowing Strategies: Tumbling, sliding, session windows
- State Management: Stateful processing, checkpointing, exactly-once semantics
- Real-Time Aggregations: Running totals, moving averages, anomaly detection
10. Advanced Architectures & Patterns (Expert-Level)
- Data Mesh Architecture: Domain-driven data ownership, self-service platforms
- Data Lake House: Combines lake and warehouse benefits, ACID transactions
- Medallion Architecture: Bronze (raw), silver (cleaned), gold (business-ready) layers
- Delta Lake: ACID transactions on data lakes, time travel, schema enforcement
- Iceberg Tables: Snapshots, hidden partitioning, schema evolution
- Disaster Recovery: RTO/RPO planning, multi-region architectures, backup strategies
Comprehensive Competency Integration
Data Engineering Workflow
Business Requirements
↓
Data Modeling & Design (Dimensional, Data Vault)
↓
Source System Analysis (Data Discovery)
↓
ETL/ELT Pipeline Design (Extraction, Transformation, Load)
↓
Data Quality & Validation (Testing, Monitoring)
↓
Warehouse Implementation (DW, Data Lake)
↓
Monitoring & Optimization (Performance, Governance)
↓
Real-Time Processing (Streaming, Event-Driven)
Learning Path
- Foundation - SQL and database fundamentals
- Data Warehousing - Schema design and dimensional modeling
- ETL Concepts - Pipeline design and transformation
- Scaling & Architecture - Big data, lakes, clouds
- Expert Operations - Enterprise data platforms
When to Invoke This Agent
- You're designing data warehouse schemas
- You're building ETL/ELT pipelines
- You need to scale databases for big data
- You're implementing data lakes or meshes
- You want to optimize data pipeline performance
- You're planning cloud data architecture
- You need to implement data quality checks
- You're working with streaming data
Key Topics Covered
Architecture & Design
- Data warehouse architecture
- Data lake design principles
- Data mesh patterns
- Real-time data architectures
Implementation
- ETL pipeline design patterns
- Incremental load strategies
- Data transformation logic
- Error handling and recovery
Scaling & Performance
- Sharding and partitioning
- Distributed database design
- Query optimization at scale
- Storage optimization
Cloud Platforms
- AWS (Redshift, Glue, S3, DMS)
- Azure (Synapse, Data Lake, Data Factory)
- GCP (BigQuery, Dataflow, Cloud Storage)
Enterprise Practices
- Data governance and lineage
- Data quality frameworks
- SLA management
- Cost optimization
Resources Included
- Data warehouse design templates
- ETL pattern examples
- Incremental load strategies
- Slowly changing dimension implementations
- Cloud architecture guides
- Data quality frameworks
- Performance tuning techniques
- Scaling best practices