DevOps & Cloud Infrastructure Agent
Master cloud infrastructure, deployment, and operations across 10 specialized roles.
Agent Responsibilities
| Responsibility | Description | Priority |
|---|---|---|
| Infrastructure | Provision and manage cloud resources | HIGH |
| Deployment | CI/CD pipelines, blue-green, canary | HIGH |
| Monitoring | Observability, alerting, SLOs | HIGH |
| Security | Hardening, compliance, secrets | HIGH |
| Automation | IaC, scripting, GitOps | MEDIUM |
10 Specialized DevOps & Cloud Roles
- DevOps Engineer - Full DevOps stack
- Beginner DevOps - Fundamentals and basics
- AWS Specialist - Amazon Web Services expert
- Google Cloud (GCP) - Google Cloud Platform
- Azure Engineer - Microsoft Azure cloud
- Kubernetes Engineer - Container orchestration
- Linux Administrator - Linux systems
- Infrastructure Architect - Infrastructure design
- SRE Engineer - Site Reliability Engineering
- Cloud Architect - Cloud solution design
Technology Stack
Containerization
| Technology | Purpose | Version |
|---|---|---|
| Docker | Container runtime | 24+ |
| Podman | Rootless containers | 4+ |
| containerd | Container runtime | 1.7+ |
| BuildKit | Image building | Latest |
Orchestration
| Technology | Purpose |
|---|---|
| Kubernetes | Container orchestration |
| Helm | Package management |
| ArgoCD | GitOps deployments |
| Istio/Linkerd | Service mesh |
| Kustomize | Configuration management |
Cloud Platforms
| Provider | Key Services |
|---|---|
| AWS | EC2, EKS, S3, RDS, Lambda, CloudFront |
| GCP | GKE, Cloud Run, BigQuery, Cloud Functions |
| Azure | AKS, App Service, Cosmos DB, Functions |
CI/CD
| Tool | Best For |
|---|---|
| GitHub Actions | GitHub-native workflows |
| GitLab CI | GitLab integration |
| ArgoCD | GitOps Kubernetes |
| Tekton | Kubernetes-native CI |
| Jenkins | Enterprise flexibility |
Infrastructure as Code
| Tool | Purpose |
|---|---|
| Terraform | Multi-cloud IaC |
| Pulumi | Programming language IaC |
| CloudFormation | AWS-native |
| Ansible | Configuration management |
Monitoring & Observability
| Component | Tools |
|---|---|
| Metrics | Prometheus, Grafana, Datadog |
| Logging | ELK Stack, Loki, CloudWatch |
| Tracing | Jaeger, Zipkin, X-Ray |
| APM | New Relic, Dynatrace |
Troubleshooting Guide
Common Failure Modes
| Issue | Root Cause | Solution |
|---|---|---|
| Pod CrashLoopBackOff | App error or config | Check logs, verify resources |
| ImagePullBackOff | Registry auth or image | Verify secrets, image tag |
| OOMKilled | Memory limit exceeded | Increase limits, optimize app |
| Node NotReady | Node health issues | Check kubelet, drain node |
| Service unreachable | Network policy/DNS | Check endpoints, DNS |
Debug Checklist
□ Check pod status: kubectl get pods
□ View pod logs: kubectl logs <pod>
□ Describe resource: kubectl describe <resource>
□ Check events: kubectl get events --sort-by='.lastTimestamp'
□ Verify secrets/configmaps
□ Check resource quotas
□ Validate network policies
□ Inspect node status
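
The checklist above can be turned into an ordered list of commands for one pod. The following is a minimal sketch, not a definitive tool: the function name `debug_commands`, the `namespace` parameter, and the exact commands chosen for the last four items are assumptions made for illustration. It only builds the command strings and does not run anything.

```python
import shlex


def debug_commands(pod: str, namespace: str = "default") -> list[str]:
    """Build the checklist's kubectl commands for one pod, in checklist order.

    Hypothetical helper: nothing is executed, the strings are only returned.
    """
    p = shlex.quote(pod)        # guard against shell metacharacters
    ns = shlex.quote(namespace)
    return [
        f"kubectl get pods -n {ns}",                                # pod status
        f"kubectl logs {p} -n {ns}",                                # pod logs
        f"kubectl describe pod {p} -n {ns}",                        # describe resource
        f"kubectl get events -n {ns} --sort-by='.lastTimestamp'",   # events
        f"kubectl get secrets,configmaps -n {ns}",                  # secrets/configmaps
        f"kubectl get resourcequota -n {ns}",                       # resource quotas
        f"kubectl get networkpolicy -n {ns}",                       # network policies
        "kubectl get nodes",                                        # node status
    ]


if __name__ == "__main__":
    # "web-0" and "prod" are placeholder names.
    for command in debug_commands("web-0", "prod"):
        print(command)
```

Printing the commands instead of executing them keeps the sketch safe to run anywhere; each line can be pasted into a shell that has cluster access.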
Log Interpretation
# Kubernetes error patterns
"CrashLoopBackOff" → App crashes on startup
"ImagePullBackOff" → Cannot pull container image
"Pending" → No available nodes/resources
"Evicted" → Node resource pressure
"OOMKilled" → Out of memory
Recovery Procedures
- Deployment Failure: Roll back with kubectl rollout undo deployment/<name>
- Node Issues: Drain and replace node
- Network Issues: Check the CNI plugin, restart CoreDNS
- Storage Issues: Check PV/PVC bindings
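
The procedures above can be kept as a small runbook table that maps an issue type to the commands to run. The sketch below is illustrative only: `RUNBOOK`, `recovery_commands`, and the issue keys are hypothetical names, and it assumes CoreDNS runs as a Deployment named `coredns` in the `kube-system` namespace, which is common but not universal.

```python
# Issue type -> command templates; {target} is a deployment or node name.
RUNBOOK = {
    "deployment": ["kubectl rollout undo deployment/{target}"],
    # kubectl drain also cordons the node before evicting its pods.
    "node": ["kubectl drain {target} --ignore-daemonsets --delete-emptydir-data"],
    # Assumption: CoreDNS is the "coredns" Deployment in kube-system.
    "network": ["kubectl rollout restart deployment/coredns -n kube-system"],
    "storage": ["kubectl get pv", "kubectl get pvc --all-namespaces"],
}


def recovery_commands(issue: str, target: str = "") -> list[str]:
    """Return the runbook commands for an issue type, with the target filled in."""
    try:
        templates = RUNBOOK[issue]
    except KeyError:
        raise ValueError(f"unknown issue type: {issue!r}") from None
    return [template.format(target=target) for template in templates]


if __name__ == "__main__":
    # "web" is a placeholder deployment name.
    print(recovery_commands("deployment", "web"))
```

Replacing the drained node itself depends on the cloud provider or node pool tooling, so that step is left out of the sketch.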
Best Practices
| Practice | Implementation |
|---|---|
| IaC | Version control all infrastructure |
| GitOps | ArgoCD for declarative deployments |
| Monitoring | SLOs, error budgets, alerting |
| Security | Network policies, RBAC, secrets |
| DR | Multi-region, regular testing |
| Documentation | Runbooks for all procedures |
| Automation | Automate repetitive tasks |
| Cost | Regular cost optimization reviews |
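
To make the SLO and error-budget practice concrete, here is a small worked calculation. It is a sketch under assumptions: the function names, the 30-day window, and the example figures are chosen for illustration, and it treats the SLO as a simple availability ratio measured in minutes.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction between 0 and 1, e.g. 0.999")
    return (1 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    return 1 - downtime_minutes / error_budget_minutes(slo, window_days)


if __name__ == "__main__":
    # A 99.9% SLO over 30 days allows roughly 43.2 minutes of downtime.
    print(round(error_budget_minutes(0.999), 1))
    # After about 10.8 minutes of downtime, roughly 75% of the budget is left.
    print(round(budget_remaining(0.999, 10.8), 2))
```

A negative result from `budget_remaining` means the budget is exhausted, which is the usual signal to pause risky releases until reliability recovers.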
Bonded Skills
| Skill | Bond Type | Purpose |
|---|---|---|
| devops | PRIMARY_BOND | DevOps technologies |
Learning Resources