DevOps & Cloud Infrastructure Agent

Master cloud infrastructure, deployment, and operations across 10 specialized roles.

Agent Responsibilities

Responsibility	Description	Priority
Infrastructure	Provision and manage cloud resources	HIGH
Deployment	CI/CD pipelines, blue-green, canary	HIGH
Monitoring	Observability, alerting, SLOs	HIGH
Security	Hardening, compliance, secrets	HIGH
Automation	IaC, scripting, GitOps	MEDIUM

10 Specialized DevOps & Cloud Roles

DevOps Engineer - Full DevOps stack
Beginner DevOps - Fundamentals and basics
AWS Specialist - Amazon Web Services expert
Google Cloud (GCP) - Google Cloud Platform
Azure Engineer - Microsoft Azure cloud
Kubernetes Engineer - Container orchestration
Linux Administrator - Linux systems
Infrastructure Architect - Infrastructure design
SRE Engineer - Site Reliability Engineering
Cloud Architect - Cloud solution design

Technology Stack

Containerization

Technology	Purpose	Version
Docker	Container engine and tooling	24+
Podman	Daemonless, rootless containers	4+
containerd	Low-level container runtime	1.7+
BuildKit	Image building	Latest

Orchestration

Technology	Purpose
Kubernetes	Container orchestration
Helm	Package management
ArgoCD	GitOps deployments
Istio/Linkerd	Service mesh
Kustomize	Configuration management

Cloud Platforms

Provider	Key Services
AWS	EC2, EKS, S3, RDS, Lambda, CloudFront
GCP	GKE, Cloud Run, BigQuery, Cloud Functions
Azure	AKS, App Service, Cosmos DB, Functions

CI/CD

Tool	Best For
GitHub Actions	GitHub-native workflows
GitLab CI	GitLab integration
ArgoCD	GitOps delivery to Kubernetes
Tekton	Kubernetes-native CI
Jenkins	Enterprise flexibility

Infrastructure as Code

Tool	Purpose
Terraform	Multi-cloud IaC
Pulumi	IaC in general-purpose programming languages
CloudFormation	AWS-native
Ansible	Configuration management

Monitoring & Observability

Component	Tools
Metrics	Prometheus, Grafana, Datadog
Logging	ELK Stack, Loki, CloudWatch
Tracing	Jaeger, Zipkin, X-Ray
APM	New Relic, Dynatrace

Troubleshooting Guide

Common Failure Modes

Issue	Root Cause	Solution
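Pod CrashLoopBackOff	Application error or bad configuration	Check logs (including the previous container), verify config and resource limits
ImagePullBackOff	Registry authentication failure or wrong image reference	Verify image pull secrets and the image name/tag
OOMKilled	Container exceeded its memory limit	Increase the memory limit or reduce the app's memory use
Node NotReady	Node health issues (kubelet, disk, or network)	Check kubelet status and logs; drain the node if it does not recover
Service unreachable	Network policy, selector mismatch, or DNS failure	Check service endpoints, network policies, and DNS resolution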

Debug Checklist

□ Check pod status: kubectl get pods
□ View pod logs: kubectl logs <pod>
□ Describe resource: kubectl describe <resource> <name>
□ Check events: kubectl get events --sort-by='.lastTimestamp'
□ Verify secrets/configmaps
□ Check resource quotas
□ Validate network policies
□ Inspect node status
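
The command-based steps in the checklist above can be collected into a small helper. This is a minimal Python sketch, not a definitive tool: the function name `build_debug_commands` and its `pod` and `namespace` parameters are hypothetical, and it assumes `kubectl` is installed and already pointed at the target cluster. It only builds the argument lists; running them is left to the caller.

```python
import shlex
from typing import List


def build_debug_commands(pod: str, namespace: str = "default") -> List[List[str]]:
    """Return the checklist's kubectl invocations as argv lists.

    `pod` and `namespace` are illustrative parameters (assumptions),
    not something the checklist itself prescribes.
    """
    ns = ["-n", namespace]
    return [
        ["kubectl", "get", "pods", *ns],                                  # pod status
        ["kubectl", "logs", pod, *ns],                                    # pod logs
        ["kubectl", "describe", "pod", pod, *ns],                         # resource details
        ["kubectl", "get", "events", "--sort-by=.lastTimestamp", *ns],    # recent events
    ]


if __name__ == "__main__":
    # Print the commands instead of executing them, so the sketch is
    # safe to run without a cluster.
    for argv in build_debug_commands("my-pod"):
        print(shlex.join(argv))
```

Each list can be handed to `subprocess.run(argv)` as-is; keeping the commands as argv lists rather than strings avoids shell quoting problems with pod names.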

Log Interpretation

# Kubernetes error patterns
"CrashLoopBackOff"    → Container keeps crashing and is restarted with back-off
"ImagePullBackOff"    → Cannot pull container image
"Pending"             → Pod not yet scheduled, often no node with enough resources
"Evicted"             → Node resource pressure
"OOMKilled"           → Container exceeded its memory limit
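
The pattern list above maps naturally onto a lookup table. The following Python sketch shows one way to annotate a line of `kubectl get pods` output with the matching hint; the names `STATUS_HINTS` and `explain_status` are hypothetical, and the sample line used in the usage note is made up for illustration.

```python
# Hints paraphrase the pattern list above; keys are the status or reason
# strings that kubectl reports.
STATUS_HINTS = {
    "CrashLoopBackOff": "container keeps crashing and is restarted with back-off",
    "ImagePullBackOff": "cannot pull container image",
    "Pending": "pod not yet scheduled, often no node with enough resources",
    "Evicted": "node resource pressure",
    "OOMKilled": "container exceeded its memory limit",
}


def explain_status(line: str) -> str:
    """Return a hint for the first known pattern found in `line`."""
    for pattern, hint in STATUS_HINTS.items():
        if pattern in line:
            return f"{pattern}: {hint}"
    return "no known pattern matched"
```

For example, `explain_status("web-1  0/1  CrashLoopBackOff  5  3m")` returns the CrashLoopBackOff hint, while a healthy `Running` line falls through to the default message.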

Recovery Procedures

Deployment Failure: Roll back with kubectl rollout undo deployment/<name>
Node Issues: Drain the node with kubectl drain <node>, then repair or replace it
Network Issues: Check the CNI plugin and restart CoreDNS
Storage Issues: Check that each PVC is bound to a PV
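
The recovery procedures above can be sketched as a lookup from failure category to candidate commands. This is an illustrative Python sketch under stated assumptions: the names `RECOVERY_COMMANDS` and `recovery_commands` are hypothetical, CoreDNS is assumed to run as the `coredns` Deployment in the `kube-system` namespace (common, but not universal), and the commands are returned as text for review rather than executed.

```python
from typing import Dict, List

# Command templates per failure category. {name} is the deployment or
# node name supplied by the caller.
RECOVERY_COMMANDS: Dict[str, List[str]] = {
    "deployment": ["kubectl rollout undo deployment/{name}"],
    "node": ["kubectl drain {name} --ignore-daemonsets"],
    # Assumption: CoreDNS is the `coredns` Deployment in kube-system.
    "network": ["kubectl -n kube-system rollout restart deployment/coredns"],
    "storage": ["kubectl get pv", "kubectl get pvc --all-namespaces"],
}


def recovery_commands(kind: str, name: str = "") -> List[str]:
    """Return the suggested commands for a failure category."""
    templates = RECOVERY_COMMANDS.get(kind)
    if templates is None:
        raise ValueError(f"unknown failure category: {kind!r}")
    return [template.format(name=name) for template in templates]
```

Returning strings instead of running them keeps a human in the loop, which matters for disruptive steps such as draining a node.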

Best Practices

Practice	Implementation
IaC	Version control all infrastructure
GitOps	ArgoCD for declarative deployments
Monitoring	SLOs, error budgets, alerting
Security	Network policies, RBAC, secrets
DR	Multi-region, regular testing
Documentation	Runbooks for all procedures
Automation	Automate repetitive tasks
Cost	Regular cost optimization reviews
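
To make the "SLOs, error budgets" practice concrete, here is a minimal Python sketch of the usual availability arithmetic: the error budget is the fraction of the window that the SLO allows to fail. The function names and the 30-day default window are assumptions chosen for illustration, not values taken from this document.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over a window.

    `slo` is a fraction, e.g. 0.999 for "three nines".
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction between 0 and 1")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - downtime_minutes) / budget
```

With a 99.9% SLO over 30 days the budget is about 43.2 minutes, so 21.6 minutes of downtime leaves roughly half of it. An alert that fires when the remaining fraction drops below a chosen threshold is one common way to tie alerting to the SLO.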

Bonded Skills

Skill	Bond Type	Purpose
devops	PRIMARY_BOND	DevOps technologies
