Complete DevOps/SRE toolkit: incident response, observability, reliability engineering, on-call management, and automation. The most comprehensive open-source DevOps plugin available.
npx claudepluginhub latestaiagents/agent-skills --plugin devops-sre
Pre and post deployment checklist with automated verification and rollback preparation
Start structured incident response with automatic triage, communication, and resolution tracking
Generate comprehensive on-call handoff reports with active issues, context, and watch items
Generate blameless postmortem documents with timeline, root cause analysis, and action items
Execute runbook procedures step-by-step with safety checks and verification
Generate SLO/SLA status reports with error budget analysis and recommendations
Implement safe deployment strategies including rolling, blue-green, canary, and feature flags. Use this skill when planning deployments, reducing deployment risk, or implementing progressive delivery. Activate when: deployment strategy, rolling update, blue-green, canary deployment, feature flags, progressive delivery, zero downtime deployment, rollback, deployment risk.
Diagnose and fix common Kubernetes issues with systematic debugging approaches. Use this skill when troubleshooting K8s clusters, pods not starting, deployments failing, or networking issues. Activate when: kubernetes, k8s, pod, deployment, kubectl, container, crashloopbackoff, imagepullbackoff, pending pods, kubernetes networking, service not working, ingress issues.
Guide incident response as an Incident Commander with structured communication and coordination. Use this skill when there's an active incident, outage, service degradation, or production issue. Activate when: incident, outage, service down, production issue, SEV1, SEV2, pages, alerts firing, something broke, users complaining, error spike, latency spike.
Systematic root cause analysis using 5 Whys, fishbone diagrams, and fault tree analysis. Use this skill when investigating why an incident happened, performing RCA, or writing postmortems. Activate when: root cause, why did this happen, 5 whys, incident analysis, postmortem investigation, how did this happen, what caused, failure analysis.
Design effective alerting strategies that catch real issues without causing alert fatigue. Use this skill when setting up alerts, reducing noise, or improving on-call experience. Activate when: alerting, alerts, pagerduty, on-call, alert fatigue, too many alerts, missed alerts, monitoring thresholds, alert tuning.
Implement comprehensive observability with metrics, logs, and distributed traces. Use this skill when setting up monitoring, debugging production issues, or implementing observability. Activate when: metrics, logs, traces, observability, monitoring, Datadog, Prometheus, Grafana, OpenTelemetry, distributed tracing, logging, APM, what's happening in production.
Manage on-call rotations with sustainable practices, fair scheduling, and effective handoffs. Use this skill when setting up on-call, improving on-call experience, or managing rotations. Activate when: on-call, pagerduty, rotation, schedule, handoff, on-call burden, being paged, night pages, weekend on-call, on-call fatigue.
Implement Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets. Use this skill when defining reliability targets, measuring service health, or balancing reliability vs velocity. Activate when: SLO, SLI, SLA, error budget, reliability targets, service level, uptime target, availability target, latency target, nine nines, 99.9%.
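The error budget idea behind the SLO skill can be illustrated with a little arithmetic. The following is a minimal sketch, not the plugin's own implementation: the function name `error_budget` and its parameters are hypothetical, and it assumes a simple request-based availability SLI (good requests divided by total requests).

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Summarize error budget usage for a request-based availability SLO.

    Hypothetical helper for illustration only; slo_target is a fraction
    such as 0.999 for a 99.9% objective.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0 or failed_requests < 0:
        raise ValueError("total_requests must be positive and failed_requests non-negative")

    # The budget is the share of requests that may fail without breaching the SLO.
    allowed_failures = (1 - slo_target) * total_requests
    consumed = failed_requests / allowed_failures

    return {
        "allowed_failures": allowed_failures,
        "budget_consumed": consumed,          # 1.0 means the budget is fully spent
        "budget_remaining": 1 - consumed,     # negative once the SLO is breached
        "slo_met": failed_requests <= allowed_failures,
    }


if __name__ == "__main__":
    # Example inputs are made up: a 99.9% target over one million requests.
    report = error_budget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
    for key, value in report.items():
        print(f"{key}: {value}")
```

With a 99.9% target, roughly one request in a thousand may fail, so 250 failures out of a million requests use about a quarter of the budget. A negative `budget_remaining` signals that the objective has been missed for the window.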
Access AWS CloudWatch logs, metrics, and alarms
Interact with Kubernetes clusters - pods, deployments, services
Access repositories, PRs, issues, and deployments
Post incident updates and communicate with team
Query Prometheus metrics and alerts
Manage infrastructure as code with Terraform
Query metrics, logs, traces, monitors, and incidents from Datadog
Manage incidents, on-call schedules, and escalations in PagerDuty
Site Reliability Engineering discipline agent for reliability, monitoring, and incident response
Requires secrets
Needs API keys or credentials to function