From devops-skills
Diagnoses and fixes Kubernetes pod failures like CrashLoopBackOff, Pending, DNS, networking, storage mounts, and rollout issues using kubectl workflows and scripts.
npx claudepluginhub akin-ozer/cc-devops-skills --plugin devops-skills

This skill uses the workspace's default tool permissions.
Systematic toolkit for debugging Kubernetes clusters, workloads, networking, and storage with a deterministic, safety-first workflow.
Use this skill when requests resemble:
- "CrashLoopBackOff; help me find the root cause."
- "Pending and not scheduling."

Run from the skill directory (devops-skills-plugin/skills/k8s-debug) so relative script paths work as written.
kubectl installed and configured.

Quick preflight:
kubectl config current-context
kubectl auth can-i get pods -A
kubectl auth can-i get events -A
kubectl get ns
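The preflight checks above can be wrapped in a small guard. This is an illustrative sketch, not part of the skill's scripts; returning 2 mirrors the scripts' "blocked preconditions" exit code described later.

```shell
# Sketch: gate all diagnosis behind the preflight checks. Returning 2
# mirrors the scripts' "blocked preconditions" exit code; the wrapper
# itself is illustrative, not part of the skill's scripts.
preflight() {
  kubectl config current-context >/dev/null 2>&1 || return 2
  kubectl auth can-i get pods -A >/dev/null 2>&1 || return 2
  kubectl get ns >/dev/null 2>&1 || return 2
}

# Usage: preflight || { echo "fix access/context first" >&2; exit 2; }
```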
Optional but recommended:

- jq for more precise filtering in ./scripts/cluster_health.sh.
- Metrics API (metrics-server) for kubectl top.
- In-pod network tools (nslookup, getent, curl, wget, ip) for deep network tests.

Fallback behavior: if kubectl top is unavailable, continue with kubectl describe and events.

Use this skill for:
Default mode is read-only diagnosis first. Only execute disruptive commands after confirming blast radius and rollback.
Commands requiring explicit confirmation:
- kubectl delete pod ... --force --grace-period=0
- kubectl drain ...
- kubectl rollout restart ...
- kubectl rollout undo ...
- kubectl debug ... --copy-to=...

Before disruptive actions:
# Snapshot current state for rollback and incident notes
kubectl get deploy,rs,pod,svc -n <namespace> -o wide
kubectl get pod <pod-name> -n <namespace> -o yaml > before-<pod-name>.yaml
kubectl get events -n <namespace> --sort-by='.lastTimestamp' > before-events.txt
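A hypothetical helper that bundles the three snapshot commands into one call; the overview filename (before-<namespace>-overview.txt) is an invented convention for illustration, not part of the skill's scripts.

```shell
# Hypothetical helper (not part of the skill's scripts): capture the
# three pre-change snapshots in one call. The overview filename
# before-<namespace>-overview.txt is an invented convention.
snapshot_state() {
  local ns="$1" pod="$2"
  kubectl get deploy,rs,pod,svc -n "$ns" -o wide > "before-$ns-overview.txt"
  kubectl get pod "$pod" -n "$ns" -o yaml > "before-$pod.yaml"
  kubectl get events -n "$ns" --sort-by='.lastTimestamp' > before-events.txt
}

# Usage: snapshot_state payments payments-api-7c97f95dfb-q9l7k
```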
Load only the section needed for the observed symptom.
| Symptom / Need | Open | Start section |
|---|---|---|
| You need an end-to-end diagnosis path | ./references/troubleshooting_workflow.md | General Debugging Workflow |
| Pod state is Pending, CrashLoopBackOff, or ImagePullBackOff | ./references/troubleshooting_workflow.md | Pod Lifecycle Troubleshooting |
| Service reachability or DNS failure | ./references/troubleshooting_workflow.md | Network Troubleshooting Workflow |
| Node pressure or performance regression | ./references/troubleshooting_workflow.md | Resource and Performance Workflow |
| PVC / PV / storage class issues | ./references/troubleshooting_workflow.md | Storage Troubleshooting Workflow |
| Quick symptom-to-fix lookup | ./references/common_issues.md | matching issue heading |
| Post-mortem fix options for known issues | ./references/common_issues.md | Solutions sections |
| Script | Purpose | Required args | Optional args | Output | Fallback behavior |
|---|---|---|---|---|---|
| ./scripts/cluster_health.sh | Cluster-wide health snapshot (nodes, workloads, events, common failure states) | None | --strict, K8S_REQUEST_TIMEOUT env var | Sectioned report to stdout | Continues on check failures, tracks them in summary and exit code |
| ./scripts/network_debug.sh | Pod-centric network and DNS diagnostics | <pod-name> (<namespace> defaults to default) | --strict, --insecure, K8S_REQUEST_TIMEOUT env var | Sectioned report to stdout | Uses secure API probe by default; insecure TLS requires explicit --insecure |
| ./scripts/pod_diagnostics.py | Deep pod diagnostics (status, describe, YAML, events, per-container logs, node context) | <pod-name> | -n/--namespace, -o/--output | Sectioned report to stdout or file | Fails fast on missing access; skips optional metrics/log blocks with clear messages |
./scripts/cluster_health.sh and ./scripts/network_debug.sh share the same contract:
- 0: checks completed with no check failures (warnings allowed unless --strict is set).
- 1: one or more checks failed, or warnings occurred in --strict mode.
- 2: blocked preconditions (for example: missing kubectl, no active context, inaccessible namespace/pod).

Follow this systematic approach for any Kubernetes issue:
kubectl config current-context
kubectl get ns
kubectl auth can-i get pods -n <namespace>
If preflight fails, stop and fix access/context first.
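When scripting this workflow, the scripts' shared exit-code contract (0/1/2) can drive the decision to continue. In this sketch, run_health is a stub standing in for ./scripts/cluster_health.sh or ./scripts/network_debug.sh, and MOCK_EXIT simulates their exit status.

```shell
# Sketch: branch on the scripts' documented exit-code contract.
# run_health is a stub standing in for either diagnostic script;
# MOCK_EXIT simulates its exit status for illustration.
run_health() { return "${MOCK_EXIT:-0}"; }

classify() {
  local rc=0
  run_health || rc=$?
  case "$rc" in
    0) echo "ok" ;;              # continue with normal triage
    1) echo "checks-failed" ;;   # investigate failed checks / warnings
    2) echo "blocked" ;;         # fix access/context before anything else
    *) echo "unknown" ;;
  esac
}

MOCK_EXIT=2 classify
# → blocked
```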
Categorize the issue:
Use the appropriate diagnostic script based on scope:
Use ./scripts/pod_diagnostics.py for comprehensive pod analysis:
python3 ./scripts/pod_diagnostics.py <pod-name> -n <namespace>
This script gathers:

- Pod status and describe output
- Full pod YAML
- Recent namespace events
- Per-container logs
- Node context for the scheduled node
Output can be saved for analysis:
python3 ./scripts/pod_diagnostics.py <pod-name> -n <namespace> -o diagnostics.txt
Use ./scripts/cluster_health.sh for overall cluster diagnostics:
./scripts/cluster_health.sh > cluster-health-$(date +%Y%m%d-%H%M%S).txt
This script checks:

- Node health
- Workload status
- Recent events
- Common failure states
Use ./scripts/network_debug.sh for connectivity issues:
./scripts/network_debug.sh <namespace> <pod-name>
# or force warning sensitivity / insecure TLS only when explicitly needed:
./scripts/network_debug.sh --strict <namespace> <pod-name>
./scripts/network_debug.sh --insecure <namespace> <pod-name>
This script analyzes:

- Pod network configuration
- DNS resolution from the pod
- API server reachability (secure probe by default; --insecure only when explicitly needed)
Based on the identified issue, consult ./references/troubleshooting_workflow.md:
Refer to ./references/common_issues.md for symptom-specific fixes.
Run final verification:
kubectl get pods -n <namespace> -o wide
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20
kubectl rollout status deployment/<name> -n <namespace>
Issue is done when user-visible behavior is healthy and no new critical warning events appear.
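The "no new critical warning events" check can be approximated by filtering events newer than the time of the fix. The line shape `<lastTimestamp> <type> <reason> ...` is an assumption (for example, produced with kubectl get events -o custom-columns=...); the filter itself is plain awk and runs on stubbed input here.

```shell
# Sketch: count Warning events newer than a reference timestamp.
# Assumes lines shaped "<lastTimestamp> <type> <reason> ..."; ISO
# timestamps compare correctly as strings in awk.
count_new_warnings() {
  local since="$1"
  awk -v since="$since" '$2 == "Warning" && $1 > since { n++ } END { print n + 0 }'
}

printf '%s\n' \
  '2024-01-01T10:00:00Z Warning BackOff pod/x' \
  '2024-01-01T12:00:00Z Normal Pulled pod/x' \
  | count_new_warnings 2024-01-01T11:00:00Z
# → 0
```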
Example: CrashLoopBackOff in the payments namespace

python3 ./scripts/pod_diagnostics.py payments-api-7c97f95dfb-q9l7k -n payments -o payments-diagnostics.txt
kubectl logs payments-api-7c97f95dfb-q9l7k -n payments --previous --tail=100
kubectl get deploy payments-api -n payments -o yaml | grep -A 8 livenessProbe
Then open ./references/common_issues.md and apply the CrashLoopBackOff solutions.
Example: service connectivity failure in the checkout namespace

./scripts/network_debug.sh checkout checkout-api-75f49c9d8f-z6qtm
kubectl get svc checkout-api -n checkout
kubectl get endpoints checkout-api -n checkout
kubectl get networkpolicies -n checkout
Then follow Service Connectivity Workflow in ./references/troubleshooting_workflow.md.
# View pod status
kubectl get pods -n <namespace> -o wide
# Detailed pod information
kubectl describe pod <pod-name> -n <namespace>
# View logs
kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous # Previous container
kubectl logs <pod-name> -n <namespace> -c <container> # Specific container
# Execute commands in pod
kubectl exec <pod-name> -n <namespace> -it -- /bin/sh
# Get pod YAML
kubectl get pod <pod-name> -n <namespace> -o yaml
# Check services
kubectl get svc -n <namespace>
kubectl describe svc <service-name> -n <namespace>
# Check endpoints
kubectl get endpoints -n <namespace>
# Test DNS
kubectl exec <pod-name> -n <namespace> -- nslookup kubernetes.default
# View events
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
# Node resources
kubectl top nodes
kubectl describe nodes
# Pod resources
kubectl top pods -n <namespace>
kubectl top pod <pod-name> -n <namespace> --containers
# Restart deployment
kubectl rollout restart deployment/<name> -n <namespace>
# Rollback deployment
kubectl rollout undo deployment/<name> -n <namespace>
# Force delete stuck pod
kubectl delete pod <pod-name> -n <namespace> --force --grace-period=0
# Drain node (maintenance)
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
# Cordon node (prevent scheduling)
kubectl cordon <node-name>
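Minimal container images often ship without nslookup, so the DNS test above can fail for the wrong reason. A hedged sketch that prefers getent (present in most glibc-based images) before nslookup; the tool ordering is an assumption, not something the skill's scripts mandate.

```shell
# Sketch: DNS lookup that degrades gracefully inside minimal
# containers. Prefers getent and falls back to nslookup; reports
# clearly when neither resolver tool exists.
dns_lookup() {
  local name="$1"
  if command -v getent >/dev/null 2>&1; then
    getent hosts "$name"
  elif command -v nslookup >/dev/null 2>&1; then
    nslookup "$name"
  else
    echo "no resolver tool available in this container" >&2
    return 1
  fi
}

# Paste into a kubectl exec shell in the pod, then run e.g.:
#   dns_lookup kubernetes.default
```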
Troubleshooting session is complete when all are true:
- The reference consulted (./references/troubleshooting_workflow.md or ./references/common_issues.md) is documented in notes.

Useful additional tools for Kubernetes debugging: