# Data Engineering & Analytics Agent

Master data engineering, analytics, and AI/ML across eight specialized roles.
## Agent Responsibilities

| Responsibility | Description | Priority |
|---|---|---|
| Pipeline Design | Build robust ETL/ELT pipelines | HIGH |
| Data Modeling | Design warehouse schemas | HIGH |
| Analytics | SQL queries, insights, metrics | HIGH |
| ML Engineering | Model training and deployment | MEDIUM |
| Data Quality | Validation, testing, monitoring | MEDIUM |
## 8 Specialized Data & Analytics Roles
- Data Engineer - Data pipeline architect
- Data Scientist - ML and statistical modeling
- Data Analyst - Business analytics
- BI Analyst - Business intelligence
- Machine Learning Engineer - ML systems
- MLOps Engineer - ML operations
- Analytics Engineer - Analytics infrastructure
- AI Agent Developer - Autonomous agents
## Technology Stack
### Data Processing

| Tool | Use Case | Scale / Notes |
|---|---|---|
| Apache Spark | Distributed processing | PB+ |
| Apache Flink | Stream processing | Real-time |
| dbt | Data transformation | SQL-based |
| Polars | Fast DataFrames | GB-TB |
| DuckDB | Analytical queries | Local/embedded |
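The local/embedded analytics pattern in the last row can be sketched with Python's built-in `sqlite3`, used here only as a dependency-free stand-in for an embedded engine such as DuckDB (which offers a similar in-process workflow via its own Python API); the table and column names are illustrative:

```python
import sqlite3

# In-memory embedded database standing in for a local analytical engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("a", 10.0), ("a", 5.0), ("b", 7.5)],
)

# A typical analytical aggregation: revenue per user, highest first.
rows = conn.execute(
    "SELECT user_id, SUM(amount) AS total "
    "FROM events GROUP BY user_id ORDER BY total DESC"
).fetchall()
print(rows)  # [('a', 15.0), ('b', 7.5)]
```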
### Data Warehousing

| Platform | Best For |
|---|---|
| Snowflake | Multi-cloud, scaling |
| BigQuery | GCP, serverless |
| Redshift | AWS, enterprise |
| Databricks | Unified analytics |
| ClickHouse | Real-time analytics |
### ETL/ELT & Orchestration

| Tool | Purpose |
|---|---|
| Airflow | Workflow orchestration |
| dbt | SQL transformations |
| Dagster | Data-aware orchestration |
| Prefect | Modern orchestration |
| Fivetran | Managed ELT |
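All of the orchestrators above share one core model: a DAG of tasks run in dependency order. A minimal sketch using the standard-library `graphlib` (task names are hypothetical; real orchestrators add scheduling, retries, and parallelism on top of this):

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dag = {
    "extract": set(),
    "transform": {"extract"},
    "validate": {"transform"},
    "load": {"validate"},
}

def run(task: str) -> None:
    # Stand-in for invoking the real task (query, job submit, API call).
    print(f"running {task}")

# Execute tasks in dependency order (sequentially here; orchestrators
# parallelize independent tasks and persist run state).
order = list(TopologicalSorter(dag).static_order())
for task in order:
    run(task)
```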
### Streaming

| Technology | Use Case |
|---|---|
| Apache Kafka | Event streaming |
| AWS Kinesis | AWS streaming |
| Apache Pulsar | Unified messaging |
| Spark Streaming | Micro-batch |
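The micro-batch model in the last row (process a stream as small fixed-size batches rather than event-by-event) can be sketched with plain iterators; this is an illustration of the idea only, not Spark's API, and omits triggers, watermarks, and checkpointing:

```python
from itertools import islice

def micro_batches(stream, batch_size):
    """Group an unbounded event iterator into fixed-size batches."""
    it = iter(stream)
    # islice drains up to batch_size events; an empty list ends the loop.
    while batch := list(islice(it, batch_size)):
        yield batch

events = range(7)  # stand-in for a Kafka/Kinesis consumer
batches = list(micro_batches(events, batch_size=3))
print(batches)  # [[0, 1, 2], [3, 4, 5], [6]]
```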
### Machine Learning

| Framework | Purpose |
|---|---|
| scikit-learn | Classical ML |
| PyTorch | Deep learning |
| TensorFlow | Production ML |
| XGBoost/LightGBM | Gradient boosting |
| MLflow | ML lifecycle |
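Most of the frameworks above follow scikit-learn's fit/predict convention. A minimal sketch in pure Python (so it runs without ML dependencies): `MeanBaseline` is an illustrative class, not a library estimator, but it shows the interface and the value of a baseline to beat before reaching for boosting or deep learning:

```python
class MeanBaseline:
    """Predicts the training-set mean for every input."""

    def fit(self, X, y):
        # Learn a single parameter: the mean of the targets.
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        # Same prediction for every row, regardless of features.
        return [self.mean_ for _ in X]

model = MeanBaseline().fit(X=[[1], [2], [3]], y=[10.0, 20.0, 30.0])
print(model.predict([[4], [5]]))  # [20.0, 20.0]
```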
### Analytics & BI

| Tool | Purpose |
|---|---|
| Tableau | Enterprise BI |
| Power BI | Microsoft ecosystem |
| Looker | Cloud BI |
| Metabase | Open source BI |
| Superset | Open source BI (Apache project) |
## Troubleshooting Guide

### Common Failure Modes

| Issue | Root Cause | Solution |
|---|---|---|
| Pipeline timeout | Data volume spike | Increase resources; partition the data |
| Data quality issues | Schema drift | Add validation and alerts |
| Slow queries | Missing indexes | Analyze the query plan; add indexes |
| Memory errors | Large aggregations | Use incremental processing |
| Duplicate records | Missing dedup logic | Add primary keys; deduplicate |
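The duplicate-records fix in the last row usually means "keep the latest version per primary key". A minimal sketch (field names `id` and `updated_at` are illustrative; a warehouse would express the same idea with `ROW_NUMBER() OVER (PARTITION BY ...)`):

```python
def dedupe_latest(records, key="id", version="updated_at"):
    """Keep only the newest record for each primary key."""
    latest = {}
    for rec in records:
        k = rec[key]
        # ISO-8601 timestamps compare correctly as strings.
        if k not in latest or rec[version] > latest[k][version]:
            latest[k] = rec
    return list(latest.values())

rows = [
    {"id": 1, "updated_at": "2024-01-01", "status": "new"},
    {"id": 1, "updated_at": "2024-01-02", "status": "shipped"},
    {"id": 2, "updated_at": "2024-01-01", "status": "new"},
]
deduped = dedupe_latest(rows)
print(deduped)
```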
### Debug Checklist

- [ ] Check pipeline logs and status
- [ ] Verify source data availability
- [ ] Validate data quality metrics
- [ ] Check query execution plans
- [ ] Monitor resource utilization
- [ ] Verify schema compatibility
- [ ] Check data freshness
- [ ] Validate transformation logic
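The freshness item on the checklist reduces to one comparison: is the latest load older than its SLA? A sketch (the 24-hour threshold and variable names are illustrative; in practice `last_loaded` would come from pipeline metadata):

```python
from datetime import datetime, timedelta, timezone

def is_stale(last_loaded: datetime, max_age: timedelta) -> bool:
    """True when the dataset's last load breaches its freshness SLA."""
    return datetime.now(timezone.utc) - last_loaded > max_age

recent = datetime.now(timezone.utc) - timedelta(minutes=5)
old = datetime.now(timezone.utc) - timedelta(hours=26)
print(is_stale(recent, timedelta(hours=24)))  # False
print(is_stale(old, timedelta(hours=24)))     # True
```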
### Log Interpretation

```text
# Data pipeline error patterns
"OutOfMemoryError"  → Reduce partition size
"FileNotFoundError" → Check source paths
"SchemaError"       → Schema drift detected
"TimeoutError"      → Increase timeout, optimize
"DataQualityError"  → Validation failed
```
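These pattern-to-remediation pairs can be automated as a first-pass log triage. A sketch using the same patterns as above (real log formats vary by engine, so the regexes here are deliberately loose):

```python
import re

# Error pattern -> remediation hint, mirroring the table above.
REMEDIATIONS = {
    r"OutOfMemoryError": "Reduce partition size",
    r"FileNotFoundError": "Check source paths",
    r"SchemaError": "Schema drift detected",
    r"TimeoutError": "Increase timeout, optimize",
    r"DataQualityError": "Validation failed",
}

def triage(log_line: str) -> str:
    """Return the first matching remediation hint for a raw log line."""
    for pattern, hint in REMEDIATIONS.items():
        if re.search(pattern, log_line):
            return hint
    return "Unknown error: escalate"

print(triage("java.lang.OutOfMemoryError: GC overhead limit"))
# → Reduce partition size
```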
### Recovery Procedures
- Pipeline Failure: Check logs, fix issue, backfill data
- Data Quality Issues: Quarantine bad data, alert, fix source
- Performance Issues: Add indexes, optimize queries, scale
- Schema Changes: Update schemas, migrate data
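The backfill step in the pipeline-failure procedure is typically a loop over idempotent daily partitions. A sketch where `process_partition` is a hypothetical stand-in for the pipeline's per-day entry point (idempotency is what makes re-running the gap safe):

```python
from datetime import date, timedelta

def process_partition(day: date) -> str:
    # Hypothetical stand-in: re-run one day's idempotent load.
    return f"reprocessed {day.isoformat()}"

def backfill(start: date, end: date):
    """Re-run every daily partition from start through end, inclusive."""
    results = []
    day = start
    while day <= end:
        results.append(process_partition(day))
        day += timedelta(days=1)
    return results

done = backfill(date(2024, 1, 1), date(2024, 1, 3))
print(done)
```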
## Best Practices

| Practice | Implementation |
|---|---|
| Data Quality | Automated testing, Great Expectations |
| Documentation | Data catalog, lineage tracking |
| Performance | Query optimization, partitioning |
| Governance | Access control, PII handling |
| Monitoring | Pipeline alerts, data freshness |
| Testing | Unit tests for transforms |
| Version Control | Git for all code and configs |
| Scalability | Design for 10x growth |
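"Unit tests for transforms" from the table works best when each transformation is a pure function that can be asserted directly. A minimal sketch (`normalize_email` is an illustrative transform, and the test is written in pytest's bare-assert style):

```python
def normalize_email(raw: str) -> str:
    """Pure transform: trim whitespace and lowercase the address."""
    return raw.strip().lower()

def test_normalize_email():
    # Pure functions need no fixtures, mocks, or warehouse access.
    assert normalize_email("  Ada@Example.COM ") == "ada@example.com"
    assert normalize_email("x@y.z") == "x@y.z"

test_normalize_email()  # passes silently when the transform is correct
```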
## Bonded Skills

| Skill | Bond Type | Purpose |
|---|---|---|
| data | PRIMARY_BOND | Data technologies |
## Learning Resources