From ai-skills
Use this skill when investigating a production incident, when an alert fires (latency spike, error rate, pod crashloop), when a customer-reported issue needs prod telemetry, or as the diagnosis step of an incident-response or production-bugfix flow, including when the user describes a prod symptom without asking to "analyze". The skill analyzes the production environment by collecting Kubernetes pod status, managed database health, logs, metrics, and networking data, then diagnoses issues. It supports GCP, Azure, and AWS via the `cloud-platforms` skill and applies the SRE or DevOps role.
npx claudepluginhub alex-voloshin-dev/ai-skills --plugin ai-skills
Slash command
/ai-skills:analyze-prod service name, symptom, or incident description
Systematic production-environment analysis. Collects cluster status, pod health, database metrics, logs, and diagnoses issues. Standalone or as the entry point for production bugfixing.
Cloud platform detection: Read CLAUDE.md to identify the cloud platform, then load the matching reference module (GCP / Azure / AWS) from @cloud-platforms.
⚠️ SAFETY: READ-ONLY commands only. No mutations (scale, delete, restart, deploy) without explicit user approval at Step 6.
Ask the user:
If invoked as part of a bugfix / incident flow, extract context from the parent conversation instead.
Select role by problem type:
| Problem Type | Primary Role |
|---|---|
| Pod crashes, restarts, health-check failures | Agent(sre-engineer) |
| High latency, error-rate spikes, SLO burn | Agent(sre-engineer) |
| Managed DB issues (connections, replication, CPU) | Agent(sre-engineer) |
| Networking, ingress, DNS, connectivity | Agent(sre-engineer) + Agent(devops-engineer) |
| Deployment failures, rollback needed | Agent(devops-engineer) |
| Terraform / infra drift, resource config | Agent(devops-engineer) |
| Application errors in logs | Stack-specific (Agent(java-engineer) / Agent(python-engineer) / Agent(frontend-engineer)) |
| General / unclear | Agent(sre-engineer) |
Announce the applied role(s). For P1/P2 incidents, always apply Agent(sre-engineer).
Detect platform from CLAUDE.md and verify authentication via @cloud-platforms:
- GCP: gcloud config get-value project + gcloud auth list
- Azure: az account show + az aks list
- AWS: aws sts get-caller-identity + aws eks list-clusters
Confirm the active project / subscription / account matches the user's target.
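The per-platform identity checks above can be collected behind a single dispatcher so the same step works on any of the three clouds. This is a minimal sketch: the function name `auth_check_cmds` and the lowercase platform keys are illustrative assumptions, not part of the skill. It only prints the commands and runs nothing.

```shell
# Print the read-only identity checks for one platform.
# `auth_check_cmds` and the keys gcp/azure/aws are hypothetical names.
auth_check_cmds() {
  case "$1" in
    gcp)   printf '%s\n' 'gcloud config get-value project' 'gcloud auth list' ;;
    azure) printf '%s\n' 'az account show' 'az aks list' ;;
    aws)   printf '%s\n' 'aws sts get-caller-identity' 'aws eks list-clusters' ;;
    *)     echo "unknown platform: $1" >&2; return 1 ;;
  esac
}

# Usage: auth_check_cmds aws
```

Printing rather than executing keeps the step read-only and lets the user see exactly which identity is about to be checked.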
// turbo
kubectl config current-context
kubectl cluster-info
Run the platform-specific cluster list command from @cloud-platforms to verify cluster status.
Record: Cluster name, location, version, node count, current kubectl context.
Run the following read-only diagnostics. Present results as a structured summary.
// turbo
kubectl get nodes -o wide
kubectl top nodes
Flag: NotReady nodes, CPU/memory >80%, version skew.
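The 80% utilization flag can be applied mechanically to `kubectl top nodes` output. The sketch below assumes the usual five-column layout (name, CPU cores, CPU %, memory bytes, memory %) with headers suppressed. The helper name `flag_hot_nodes` is hypothetical.

```shell
# Read `kubectl top nodes --no-headers` rows on stdin and print nodes whose
# CPU% or MEMORY% exceeds the limit (default 80).
# `flag_hot_nodes` is a hypothetical helper; the column positions are an assumption.
flag_hot_nodes() {
  awk -v limit="${1:-80}" '
    {
      cpu = $3; mem = $5
      sub(/%/, "", cpu); sub(/%/, "", mem)
      if (cpu + 0 > limit || mem + 0 > limit)
        print $1, "cpu=" cpu "%", "mem=" mem "%"
    }'
}

# Usage: kubectl top nodes --no-headers | flag_hot_nodes 80
```

Rows that report `<unknown>` evaluate to 0 here and are not flagged, so NotReady nodes still need the separate `kubectl get nodes` check.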
For the affected namespace (or all namespaces if unspecified):
// turbo
kubectl get pods -n <namespace> -o wide --sort-by='.status.startTime'
kubectl get pods -n <namespace> --field-selector=status.phase!=Running
Flag: CrashLoopBackOff / Error / Pending / ImagePullBackOff, restart count >3 in last hour, replica count mismatch (kubectl get deployments).
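A rough filter for the pod flags above, assuming the default `kubectl get pods --no-headers` columns (name, ready, status, restarts, age). The helper name `flag_unhealthy_pods` is hypothetical. Note that the RESTARTS column is a lifetime total, so it only approximates the "last hour" criterion; confirm timing with `kubectl describe pod`.

```shell
# Read `kubectl get pods --no-headers` rows on stdin.
# Print pods in a failing status, or with more restarts than the threshold (default 3).
# `flag_unhealthy_pods` is a hypothetical helper, not part of the skill.
flag_unhealthy_pods() {
  awk -v max_restarts="${1:-3}" '
    $3 ~ /^(CrashLoopBackOff|Error|Pending|ImagePullBackOff)$/ {
      print $1, "status=" $3
      next
    }
    $4 + 0 > max_restarts { print $1, "restarts=" $4 }'
}

# Usage: kubectl get pods -n <namespace> --no-headers | flag_unhealthy_pods 3
```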
kubectl describe pod <pod-name> -n <namespace>
kubectl logs --tail=200 --timestamps <pod-name> -n <namespace>
kubectl logs --previous --tail=50 <pod-name> -n <namespace>
Look for: OOMKilled (exit 137), app exceptions, connection errors, startup failures, readiness probe failures.
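Exit code 137 follows the convention that a process killed by a signal reports 128 plus the signal number, so 137 means signal 9 (SIGKILL). A small decoder makes that explicit. The function name `describe_exit_code` is an illustrative assumption.

```shell
# Decode a container exit code using the 128 + signal convention.
# `describe_exit_code` is a hypothetical helper.
describe_exit_code() {
  code=$1
  if [ "$code" -gt 128 ]; then
    echo "killed by signal $((code - 128))"
  else
    echo "exited with status $code"
  fi
}

# Usage: describe_exit_code 137
```

SIGKILL alone does not prove an out-of-memory kill. Confirm it by checking that the last termination reason in the `kubectl describe pod` output reads OOMKilled.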
// turbo
kubectl top pods -n <namespace> --sort-by=memory
kubectl get hpa -n <namespace>
Flag: Pods near memory limits, HPA at max replicas, CPU throttling.
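To spot an HPA pinned at its ceiling without parsing the human-readable table, one option is to emit tab-separated name, current replicas, and max replicas through kubectl's JSONPath output and compare the two numbers. The helper name `flag_maxed_hpa` is hypothetical; the field paths are the standard HPA `status.currentReplicas` and `spec.maxReplicas`.

```shell
# Read "name<TAB>currentReplicas<TAB>maxReplicas" rows on stdin and print
# autoscalers that have reached their maximum.
# `flag_maxed_hpa` is a hypothetical helper.
flag_maxed_hpa() {
  awk -F '\t' '$3 + 0 > 0 && $2 + 0 >= $3 + 0 {
    print $1, "at max replicas (" $2 "/" $3 ")"
  }'
}

# Usage:
#   kubectl get hpa -n <namespace> -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.currentReplicas}{"\t"}{.spec.maxReplicas}{"\n"}{end}' | flag_maxed_hpa
```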
DB diagnostic commands per cloud — see @cloud-platforms. Key metrics: CPU / memory / disk utilization, active vs max connections, replication lag (HA / read replicas), failed connection count.
// turbo
kubectl get ingress -n <namespace>
kubectl get svc -n <namespace>
kubectl get endpoints -n <namespace>
Flag: Services with 0 endpoints (no healthy backends), Ingress with no address, port mismatches.
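Services with no healthy backends show `<none>` in the ENDPOINTS column of `kubectl get endpoints`. The sketch below filters for that value, assuming headers are suppressed. The helper name `flag_empty_endpoints` is hypothetical.

```shell
# Read `kubectl get endpoints --no-headers` rows on stdin and print the
# names whose ENDPOINTS column is <none>.
# `flag_empty_endpoints` is a hypothetical helper.
flag_empty_endpoints() {
  awk '$2 == "<none>" { print $1 }'
}

# Usage: kubectl get endpoints -n <namespace> --no-headers | flag_empty_endpoints
```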
// turbo
kubectl get events -n <namespace> --sort-by='.lastTimestamp' --field-selector type!=Normal
Record: Warning / Error events — especially FailedScheduling, OOMKilled, FailedMount, Unhealthy, BackOff.
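When the warning list is long, counting events by reason shows quickly which failure dominates. A minimal sketch, assuming one reason per input line (for example from a custom-columns output of `.reason`). The helper name `count_event_reasons` is hypothetical.

```shell
# Read one event reason per line on stdin and print "reason count",
# most frequent first.
# `count_event_reasons` is a hypothetical helper.
count_event_reasons() {
  sort | uniq -c | sort -rn | awk '{ print $2, $1 }'
}

# Usage:
#   kubectl get events -n <namespace> --field-selector type!=Normal \
#     -o custom-columns=REASON:.reason --no-headers | count_event_reasons
```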
Observability methodology (Golden Signals / RED / USE / Distributed Tracing) — see @observability-methods. Pick a named method per problem class, then cross-reference SLI metrics, error-budget burn, and active alerts against that method's signals.
Telemetry stack patterns + per-vendor queries — see @telemetry-stacks. Identify the stack from CLAUDE.md, helm charts, prometheus-operator CRDs, or the OTel collector config, then query directly using the vendor patterns documented there.
Using the applied role's expertise:
Agent(sre-engineer) is active): error budget consumed, burn rate.
<common_prod_issues>
Structure the diagnosis:
## Production Environment Summary
- Cloud: [GCP/Azure/AWS] | Project/Sub/Account: [id] | Cluster: [name] ([location])
- Nodes: [count] ([healthy]/[total]) | K8s: [version] | Managed DB: [instance] ([state], [tier])
## SLO Impact (if applicable)
- SLO: [target] | Current: [actual] | Error budget: [remaining]% | Burn rate: [Nx] | Time to exhaustion: [duration]
## Findings
### [Issue 1: title]
- **Symptom / Evidence / Root cause / Severity (P1–P4) / Blast radius**
## Recommendations
1. [Immediate mitigation] — [command] ⚠️ REQUIRES APPROVAL
2. [Root cause fix] — [change description]
3. [Prevention] — [long-term improvement]
## Environment Health: [HEALTHY | DEGRADED | OUTAGE]
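The burn rate and time-to-exhaustion fields in the SLO Impact line can be derived with simple arithmetic: burn rate is the observed error ratio divided by the error budget (1 minus the SLO target), and time to exhaustion is the remaining budget fraction times the SLO window divided by the burn rate. The function names and the 720-hour (30-day) window in the usage comment are assumptions for illustration.

```shell
# burn_rate <slo_target_fraction> <observed_error_ratio>
# Example inputs: 0.999 for a 99.9% SLO, 0.005 for 0.5% errors.
# Both function names are hypothetical helpers.
burn_rate() {
  awk -v slo="$1" -v err="$2" 'BEGIN { printf "%.1f\n", err / (1 - slo) }'
}

# hours_to_exhaustion <remaining_budget_fraction> <window_hours> <burn_rate>
hours_to_exhaustion() {
  awk -v rem="$1" -v win="$2" -v burn="$3" 'BEGIN { printf "%.1f\n", rem * win / burn }'
}

# Usage (assumed 30-day window = 720 hours, 60% of budget remaining):
#   burn_rate 0.999 0.005            # prints 5.0
#   hours_to_exhaustion 0.6 720 5    # prints 86.4
```

A burn rate of 1 consumes the budget exactly over the SLO window; values above 1 exhaust it early.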
⚠️ All production mutations require explicit user approval.
/analyze-local), deploy via normal CI/CD.
Agent(devops-engineer), never direct apply.
After any fix, re-run relevant Step 4 commands to verify resolution.
/bugfix (production environment diagnostics)
Agent(sre-engineer), Agent(devops-engineer)
@cloud-platforms (platform-specific CLI commands, managed service diagnostics), @observability-methods (Golden Signals / RED / USE / Tracing), @telemetry-stacks (Prometheus / Datadog / Honeycomb / New Relic / Sentry / OTel queries)