Kubernetes debugging, problem diagnosis, and issue resolution
Provides systematic Kubernetes troubleshooting for pods, networks, nodes, and cluster components using decision trees and debug commands. Triggers when diagnosing CrashLoopBackOff, ImagePullBackOff, DNS failures, or service connectivity issues.
/plugin marketplace add pluginagentmarketplace/custom-plugin-kubernetes
/plugin install kubernetes-assistant@pluginagentmarketplace-kubernetes
This skill inherits all available tools. When active, it can use any tool Claude has access to.
assets/config.yaml
assets/schema.json
references/GUIDE.md
references/PATTERNS.md
scripts/validate.py
Production-grade Kubernetes troubleshooting covering systematic diagnosis, debugging techniques, and resolution patterns. This skill provides deep expertise in rapid incident response, root cause analysis, and creating effective runbooks for enterprise environments.
Status Decision Tree
Pod Issue?
│
├── Pending
│ ├── Insufficient resources → Check node capacity, requests
│ ├── No matching node → Check nodeSelector, affinity
│ ├── PVC not bound → Check StorageClass, PV availability
│ └── Image pull issues → Check registry, imagePullSecrets
│
├── CrashLoopBackOff
│ ├── Check: kubectl logs <pod> --previous
│ ├── App error → Fix application code
│ ├── OOMKilled → Increase memory limits
│ └── Probe failure → Adjust probe settings
│
├── ImagePullBackOff
│ ├── Wrong image name → Verify image:tag
│ ├── Private registry → Check imagePullSecrets
│ └── Registry down → Check registry availability
│
└── Running but not ready
  ├── Readiness probe failing → Check probe config
  └── Dependency unavailable → Check upstream services
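The branches of the tree above can be condensed into a small lookup helper. This is a minimal sketch, not part of the skill itself: `next_step` is a hypothetical function name, and the status strings are assumed to match the STATUS column of `kubectl get pods`.

```shell
#!/bin/sh
# Hypothetical helper: map a pod status to the first check from the tree above.
next_step() {
  case "$1" in
    Pending)          echo "kubectl describe pod <pod-name> -n <namespace>" ;;
    CrashLoopBackOff) echo "kubectl logs <pod-name> --previous" ;;
    ImagePullBackOff) echo "verify image:tag, imagePullSecrets, registry availability" ;;
    Running)          echo "check readiness probe config and upstream services" ;;
    *)                echo "unknown status: start with kubectl describe pod" ;;
  esac
}

# Example: next_step CrashLoopBackOff
```

The fallback branch deliberately points back to `kubectl describe pod`, since its Events section usually names the failing step.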
Debug Commands
# Comprehensive pod info
kubectl describe pod <pod-name> -n <namespace>
kubectl get pod <pod-name> -o yaml
# Container logs
kubectl logs <pod-name> -c <container> --tail=100
kubectl logs <pod-name> --previous # crashed container
kubectl logs -l app=myapp --all-containers
# Live debugging
kubectl debug <pod-name> -it --image=nicolaka/netshoot
kubectl exec -it <pod-name> -- /bin/sh
# Resource usage
kubectl top pod <pod-name>
kubectl describe node | grep -A 5 "Allocated resources"
Connectivity Decision Tree
Network Issue?
│
├── DNS not resolving
│ ├── Check CoreDNS pods: kubectl get pods -n kube-system -l k8s-app=kube-dns
│ ├── Test resolution: kubectl run debug --rm -it --image=busybox:1.28 -- nslookup kubernetes.default
│ └── Check NetworkPolicy egress for DNS
│
├── Service unreachable
│ ├── Check endpoints: kubectl get endpoints <service>
│ ├── No endpoints → Pod selector mismatch
│ ├── Verify port mapping: targetPort matches container port
│ └── Check NetworkPolicy ingress
│
├── Pod-to-pod fails
│ ├── Same node → CNI issue, check CNI pods
│ ├── Cross-node → Node networking, firewall rules
│ └── Check NetworkPolicies blocking traffic
│
└── External access fails
  ├── Ingress → Check ingress controller logs
  ├── LoadBalancer → Check cloud LB status
  └── NodePort → Check node firewall
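The "Service unreachable" branch turns on whether the Service has any endpoints. The sketch below shows one way to script that decision, assuming the ready endpoint IPs are read with a jsonpath query; `endpoint_verdict` is a hypothetical helper name, not something the skill ships.

```shell
#!/bin/sh
# Hypothetical helper: decide the next step from the list of ready endpoint IPs.
# An empty list means the Service selector matches no ready pods.
endpoint_verdict() {
  if [ -z "$1" ]; then
    echo "no endpoints: compare the Service selector with the pod labels"
  else
    echo "endpoints present: verify targetPort and NetworkPolicy ingress"
  fi
}

# Usage against a cluster (assumes kubectl access):
#   ips=$(kubectl get endpoints <service> -n <namespace> \
#         -o jsonpath='{.subsets[*].addresses[*].ip}')
#   endpoint_verdict "$ips"
```

Only ready addresses appear under `addresses`, so an empty result can also mean the pods exist but are failing their readiness probes.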
Network Debug Commands
# DNS testing
kubectl run debug --rm -it --image=nicolaka/netshoot -- \
nslookup <service>.<namespace>.svc.cluster.local
# Connectivity testing
kubectl run debug --rm -it --image=nicolaka/netshoot -- \
curl -v http://<service>:<port>/health
# TCP connection test
kubectl run debug --rm -it --image=nicolaka/netshoot -- \
nc -zv <service> <port>
# Network policy check
kubectl get networkpolicy -n <namespace>
kubectl describe networkpolicy <name>
Node Health Analysis
# Node conditions
kubectl describe node <node> | grep -A 20 "Conditions:"
# Node events
kubectl get events --field-selector involvedObject.name=<node>
# System resource usage
kubectl top node <node>
# Pod distribution
kubectl get pods -A --field-selector spec.nodeName=<node>
# Node logs (via ssh)
journalctl -u kubelet --since "1 hour ago"
journalctl -u containerd --since "1 hour ago"
Common Node Issues
Node NotReady?
│
├── Check kubelet: systemctl status kubelet
├── Check container runtime: systemctl status containerd
├── Check certificates: ls -la /var/lib/kubelet/pki/
├── Check disk: df -h /var/lib/kubelet
└── Check network: ping <api-server-ip>
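The NotReady checks above are meant to be run in order, stopping at the first failure. A minimal sketch of such a runner follows; `run_checks` and its `label:command` argument format are assumptions made for illustration, and the commands in the usage comment must be run on the node itself.

```shell
#!/bin/sh
# Hypothetical runner: each argument is "label:command".
# Runs the commands in order and stops at the first one that fails.
run_checks() {
  for check in "$@"; do
    label=${check%%:*}   # text before the first colon
    cmd=${check#*:}      # text after the first colon
    if sh -c "$cmd" >/dev/null 2>&1; then
      echo "ok $label"
    else
      echo "FAIL $label"
      return 1
    fi
  done
}

# Usage on a node (via ssh):
#   run_checks \
#     "kubelet:systemctl is-active --quiet kubelet" \
#     "runtime:systemctl is-active --quiet containerd" \
#     "certs:test -d /var/lib/kubelet/pki" \
#     "network:ping -c 1 <api-server-ip>"
```

Disk usage is left out of the usage example because `df -h` always exits successfully; its output still needs a manual look.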
Control Plane Checks
# API server
kubectl get pods -n kube-system -l component=kube-apiserver
kubectl logs -n kube-system kube-apiserver-<node>
# Scheduler
kubectl logs -n kube-system kube-scheduler-<node>
kubectl get events --field-selector reason=FailedScheduling
# Controller Manager
kubectl logs -n kube-system kube-controller-manager-<node>
# etcd
kubectl get pods -n kube-system -l component=etcd
ETCDCTL_API=3 etcdctl endpoint health
Resource Analysis
# High CPU pods
kubectl top pods -A --sort-by=cpu | head -10
# High memory pods
kubectl top pods -A --sort-by=memory | head -10
# OOMKilled detection
kubectl get pods -A -o json | jq -r '.items[] | select(any(.status.containerStatuses[]?; .lastState.terminated.reason == "OOMKilled")) | "\(.metadata.namespace)/\(.metadata.name)"'
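# Raw pod metrics (requires metrics-server; reports CPU/memory usage, not throttling)
kubectl get --raw /apis/metrics.k8s.io/v1beta1/pods
# CPU throttling counters (from the kubelet's cAdvisor endpoint)
kubectl get --raw /api/v1/nodes/<node>/proxy/metrics/cadvisor | grep container_cpu_cfs_throttled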
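CPU throttling is commonly expressed as the share of CFS periods in which a container was throttled. The sketch below shows only that arithmetic. It assumes the two counters come from cAdvisor's `container_cpu_cfs_throttled_periods_total` and `container_cpu_cfs_periods_total` metrics, which the original does not name, and `throttle_pct` is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: percentage of CFS periods that were throttled.
# $1 = throttled periods, $2 = total periods (both cumulative counters).
throttle_pct() {
  awk -v t="$1" -v p="$2" 'BEGIN {
    if (p > 0) printf "%.1f\n", 100 * t / p
    else print "n/a"
  }'
}

# Example: throttle_pct 25 200   (prints 12.5)
```

Because both counters are cumulative, comparing the difference between two samples gives a more useful picture than the lifetime ratio.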
Essential Tools
# Install debug tools in cluster
kubectl apply -f https://k8s.io/examples/admin/dns/dnsutils.yaml
# Quick debug pod
kubectl run debug --rm -it --image=nicolaka/netshoot -- bash
# stern for multi-pod logs
stern -n <namespace> <pod-prefix>
# k9s for interactive UI
k9s -n <namespace>
| Issue | Diagnosis | Resolution |
|---|---|---|
| CrashLoopBackOff | Check logs --previous | Fix app, adjust resources |
| ImagePullBackOff | Check image name, secrets | Fix image, add pullSecret |
| Pending pods | kubectl describe | Add resources, fix affinity |
| OOMKilled | Check memory usage | Increase limits |
| DNS failures | Test CoreDNS | Check egress policy |
| Service unreachable | Check endpoints | Fix selector |
| Node NotReady | Check kubelet | Restart kubelet |

| Metric | Target |
|---|---|
| MTTR | <15 minutes |
| First response | <5 minutes |
| Root cause found | 95% |
| Runbook coverage | Core scenarios |
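The MTTR target above is a mean over incident resolution times. A minimal sketch of the arithmetic follows; `mttr_minutes` is a hypothetical helper, and the durations in the usage comment are made-up sample values, not measured data.

```shell
#!/bin/sh
# Hypothetical helper: read one resolution time (in minutes) per line
# from stdin and print the mean to one decimal place.
mttr_minutes() {
  awk '{ sum += $1 } END { if (NR > 0) printf "%.1f\n", sum / NR }'
}

# Example with three sample incidents of 12, 9 and 21 minutes:
#   printf '%s\n' 12 9 21 | mttr_minutes   (prints 14.0)
```

In this sample the mean of 14.0 minutes meets the <15 minute target even though one incident took 21 minutes, so it can be worth tracking the worst case alongside the mean.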