Master Kubernetes cluster administration, from initial setup through production management. Learn cluster installation, scaling, upgrades, and HA strategies.
Provides production-grade Kubernetes cluster administration covering lifecycle from deployment to day-2 operations. Use when you need to set up HA clusters, manage nodes, perform upgrades, configure etcd backups, or troubleshoot cluster health issues.
/plugin marketplace add pluginagentmarketplace/custom-plugin-kubernetes
/plugin install kubernetes-assistant@pluginagentmarketplace-kubernetes
This skill inherits all available tools. When active, it can use any tool Claude has access to.
assets/config.yaml
references/GUIDE.md
scripts/helper.py
Production-grade Kubernetes cluster administration covering the complete lifecycle from initial deployment to day-2 operations. This skill provides deep expertise in cluster architecture, high availability configurations, upgrade strategies, and operational best practices aligned with CKA/CKS certification standards.
Control Plane Components
┌───────────────────────────────────────────────────────────────────────────┐
│                               CONTROL PLANE                               │
├──────────────────┬──────────────────┬──────────────────┬──────────────────┤
│ API Server       │ Scheduler        │ Controller       │ etcd             │
│                  │                  │ Manager          │                  │
│ - AuthN          │ - Pod            │ - ReplicaSet     │ - Cluster state  │
│ - AuthZ          │   placement      │ - Endpoints      │ - 3+ nodes for HA│
│ - Admission      │ - Node           │ - Namespace      │ - Regular backups│
│   control        │   affinity       │ - ServiceAcc     │ - Encryption     │
└──────────────────┴──────────────────┴──────────────────┴──────────────────┘
                                      │
                                      ▼
┌───────────────────────────────────────────────────────────────────────────┐
│                               WORKER NODES                                │
├───────────────────────┬───────────────────────┬───────────────────────────┤
│ kubelet               │ kube-proxy            │ Container Runtime         │
│ - Pod lifecycle       │ - iptables/ipvs       │ - containerd (recommended)│
│ - Node status         │ - Service VIPs        │ - CRI-O                   │
│ - Volume mount        │ - Load balance        │ - gVisor (sandboxed)      │
└───────────────────────┴───────────────────────┴───────────────────────────┘
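A quick way to see these components on a running kubeadm-built cluster (assumes kubectl admin access; pod names vary by node):
kubectl get --raw='/readyz?verbose'      # API server health checks, including the etcd check
kubectl get pods -n kube-system -o wide  # static pods: kube-apiserver, kube-scheduler, kube-controller-manager, etcd
kubectl get nodes -o wide                # kubelet version and container runtime on each node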
Production Cluster Bootstrap (kubeadm)
# Initialize control plane with HA
sudo kubeadm init \
  --control-plane-endpoint "k8s-api.example.com:6443" \
  --upload-certs \
  --pod-network-cidr=10.244.0.0/16 \
  --service-cidr=10.96.0.0/12 \
  --apiserver-advertise-address=<node-primary-ip> \
  --apiserver-cert-extra-sans=k8s-api.example.com

# Join additional control plane nodes
kubeadm join k8s-api.example.com:6443 \
  --token <token> \
  --discovery-token-ca-cert-hash sha256:<hash> \
  --control-plane \
  --certificate-key <cert-key>

# Join worker nodes
kubeadm join k8s-api.example.com:6443 \
  --token <token> \
  --discovery-token-ca-cert-hash sha256:<hash>
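The <token> and <cert-key> placeholders come from the kubeadm init output and expire (tokens after 24 hours, the uploaded certificate key after 2 hours); if you join nodes later, regenerate them on an existing control plane node, roughly as follows:
kubeadm token create --print-join-command            # prints a fresh worker join command
sudo kubeadm init phase upload-certs --upload-certs  # prints a new certificate key for control plane joins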
Node Lifecycle Operations
# View node details with resource usage
kubectl get nodes -o wide
kubectl top nodes
# Label nodes for workload placement
kubectl label nodes worker-01 node-type=compute tier=production
kubectl label nodes worker-02 node-type=gpu accelerator=nvidia-a100
# Taint nodes for dedicated workloads
kubectl taint nodes worker-gpu dedicated=gpu:NoSchedule
# Cordon node (prevent new pods)
kubectl cordon worker-03
# Drain node safely (for maintenance)
kubectl drain worker-03 \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --grace-period=300 \
  --timeout=600s
# Return node to service
kubectl uncordon worker-03
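To confirm labels and taints before workloads depend on them (node names above are examples), checks along these lines help:
kubectl get nodes -L node-type,tier,accelerator   # show the labels as columns
kubectl describe node worker-gpu | grep -i -A2 taints
kubectl get pdb -A                                # PodDisruptionBudgets that can slow or block a drain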
Node Problem Detector Configuration
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-problem-detector
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: node-problem-detector
  template:
    metadata:
      labels:
        app: node-problem-detector
    spec:
      containers:
        - name: node-problem-detector
          image: registry.k8s.io/node-problem-detector/node-problem-detector:v0.8.14
          securityContext:
            privileged: true
          env:
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
          volumeMounts:
            - name: log
              mountPath: /var/log
              readOnly: true
            - name: kmsg
              mountPath: /dev/kmsg
              readOnly: true
      volumes:
        - name: log
          hostPath:
            path: /var/log
        - name: kmsg
          hostPath:
            path: /dev/kmsg
      tolerations:
        - operator: Exists
          effect: NoSchedule
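Once the DaemonSet is running, detected problems surface as node conditions and events; a quick check (worker-01 is a placeholder node name):
kubectl -n kube-system get daemonset node-problem-detector
kubectl get node worker-01 -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
kubectl get events -A -o wide | grep -i node-problem-detector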
HA Architecture Pattern
                          ┌─────────────────┐
                          │  Load Balancer  │
                          │  (HAProxy/NLB)  │
                          └────────┬────────┘
                                   │
              ┌────────────────────┼────────────────────┐
              │                    │                    │
              ▼                    ▼                    ▼
      ┌───────────────┐    ┌───────────────┐    ┌───────────────┐
      │ Control Plane │    │ Control Plane │    │ Control Plane │
      │    Node 1     │    │    Node 2     │    │    Node 3     │
      ├───────────────┤    ├───────────────┤    ├───────────────┤
      │  API Server   │    │  API Server   │    │  API Server   │
      │  Scheduler    │    │  Scheduler    │    │  Scheduler    │
      │  Controller   │    │  Controller   │    │  Controller   │
      │  etcd         │◄──►│  etcd         │◄──►│  etcd         │
      └───────────────┘    └───────────────┘    └───────────────┘
              │                    │                    │
              └────────────────────┴────────────────────┘
                                   │
                          ┌────────┴────────┐
                          │  Worker Nodes   │
                          │  (N instances)  │
                          └─────────────────┘
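To verify the pieces of this topology, the following sketch assumes the hostname above, kubeadm default certificate paths, and that etcdctl is run on a control plane node:
curl -k https://k8s-api.example.com:6443/livez   # API reachability through the load balancer
ETCDCTL_API=3 etcdctl member list --write-out=table \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key      # expect all three members listed as started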
etcd Backup & Restore
# Backup etcd
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot-$(date +%Y%m%d).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Verify backup
ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-snapshot-*.db --write-out=table

# Restore etcd (disaster recovery)
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot-*.db \
  --data-dir=/var/lib/etcd-restored \
  --name=etcd-0 \
  --initial-cluster=etcd-0=https://10.0.0.10:2380 \
  --initial-advertise-peer-urls=https://10.0.0.10:2380
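The restore only writes a new data directory; on kubeadm clusters etcd runs as a static pod, so point it at the restored copy. A sketch using the kubeadm default paths (adjust for your layout):
# On the affected control plane node
sudo mv /var/lib/etcd /var/lib/etcd.bak
sudo mv /var/lib/etcd-restored /var/lib/etcd
# Alternatively, edit the hostPath in /etc/kubernetes/manifests/etcd.yaml to point at /var/lib/etcd-restored
sudo systemctl restart kubelet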
# Automated backup CronJob
kubectl apply -f - <<EOF
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
spec:
  schedule: "0 */6 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          hostNetwork: true        # reach etcd on 127.0.0.1:2379 of the control plane node
          containers:
            - name: backup
              image: bitnami/etcd:3.5
              securityContext:
                runAsUser: 0       # etcd certs on the host are readable by root only
              command:
                - /bin/sh
                - -c
                - |
                  etcdctl snapshot save /backup/etcd-\$(date +%Y%m%d-%H%M).db
              env:
                - name: ETCDCTL_API
                  value: "3"
                - name: ETCDCTL_ENDPOINTS
                  value: https://127.0.0.1:2379
                - name: ETCDCTL_CACERT
                  value: /etc/kubernetes/pki/etcd/ca.crt
                - name: ETCDCTL_CERT
                  value: /etc/kubernetes/pki/etcd/server.crt
                - name: ETCDCTL_KEY
                  value: /etc/kubernetes/pki/etcd/server.key
              volumeMounts:
                - name: backup
                  mountPath: /backup
                - name: etcd-certs
                  mountPath: /etc/kubernetes/pki/etcd
                  readOnly: true
          volumes:
            - name: backup
              persistentVolumeClaim:
                claimName: etcd-backup-pvc
            - name: etcd-certs
              hostPath:
                path: /etc/kubernetes/pki/etcd
          restartPolicy: OnFailure
          nodeSelector:
            node-role.kubernetes.io/control-plane: ""
          tolerations:
            - key: node-role.kubernetes.io/control-plane
              effect: NoSchedule
EOF
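Verify the schedule and exercise it once before trusting it (the manual job name below is arbitrary):
kubectl -n kube-system get cronjob etcd-backup
kubectl -n kube-system create job etcd-backup-manual --from=cronjob/etcd-backup
kubectl -n kube-system logs job/etcd-backup-manual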
Upgrade Strategy Decision Tree
Upgrade Required?
│
├── Minor Version (1.29 → 1.30)
│   ├── Review release notes for breaking changes
│   ├── Test in staging environment
│   ├── Upgrade control plane first
│   │   └── One node at a time
│   └── Upgrade workers (rolling)
│
├── Patch Version (1.30.0 → 1.30.1)
│   ├── Generally safe, security fixes
│   └── Can upgrade more aggressively
│
└── Major changes in components
    ├── Test thoroughly
    ├── Have rollback plan
    └── Consider blue-green cluster
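Before choosing a path, check the current versions and skew across the cluster:
kubectl version            # client and API server versions
kubectl get nodes -o wide  # kubelet version on every node; must stay within the supported skew of the API server
kubeadm version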
Production Upgrade Process
# Step 1: Upgrade kubeadm on control plane
sudo apt-mark unhold kubeadm
sudo apt-get update && sudo apt-get install -y kubeadm=1.30.0-1.1
sudo apt-mark hold kubeadm
# Step 2: Plan the upgrade
sudo kubeadm upgrade plan
# Step 3: Apply upgrade on first control plane
sudo kubeadm upgrade apply v1.30.0
# Step 4: Upgrade kubelet and kubectl
kubectl drain control-plane-1 --ignore-daemonsets
sudo apt-mark unhold kubelet kubectl
sudo apt-get install -y kubelet=1.30.0-1.1 kubectl=1.30.0-1.1
sudo apt-mark hold kubelet kubectl
sudo systemctl daemon-reload
sudo systemctl restart kubelet
kubectl uncordon control-plane-1
# Step 5: Upgrade additional control planes
sudo kubeadm upgrade node
# Then upgrade kubelet/kubectl as above
# Step 6: Upgrade worker nodes (rolling)
for node in $(kubectl get nodes -l node-role.kubernetes.io/worker -o name); do
  kubectl drain $node --ignore-daemonsets --delete-emptydir-data
  # On the node: upgrade the kubeadm/kubelet/kubectl packages, run "kubeadm upgrade node", restart kubelet
  kubectl uncordon $node
  sleep 60  # Allow pods to stabilize
done
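A post-upgrade verification pass, assuming the target version above:
kubectl get nodes -o wide                                    # every node should now report the new kubelet version
kubectl get pods -A --field-selector=status.phase!=Running   # anything stuck after the rolling drains
kubeadm certs check-expiration                               # kubeadm upgrade apply also renews control plane certs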
Namespace Resource Quotas
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-backend
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    persistentvolumeclaims: "10"
    requests.storage: 500Gi
    pods: "50"
    services: "20"
    secrets: "50"
    configmaps: "50"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: team-backend
spec:
  limits:
    - type: Container
      default:
        cpu: 500m
        memory: 512Mi
      defaultRequest:
        cpu: 100m
        memory: 128Mi
      min:
        cpu: 50m
        memory: 64Mi
      max:
        cpu: 4
        memory: 8Gi
    - type: PersistentVolumeClaim
      min:
        storage: 1Gi
      max:
        storage: 100Gi
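To confirm the quota and defaults are enforced, something like the following works; the nginx pod is only a throwaway test, and the server-side dry run goes through admission so the LimitRange defaults appear in the output:
kubectl describe resourcequota team-quota -n team-backend
kubectl describe limitrange default-limits -n team-backend
kubectl run quota-test --image=nginx -n team-backend --dry-run=server -o yaml | grep -A6 'resources:'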
Cluster Autoscaler Configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      serviceAccountName: cluster-autoscaler
      containers:
        - name: cluster-autoscaler
          image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.30.0
          command:
            - ./cluster-autoscaler
            - --v=4
            - --stderrthreshold=info
            - --cloud-provider=aws
            - --skip-nodes-with-local-storage=false
            - --expander=least-waste
            - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/my-cluster
            - --balance-similar-node-groups
            - --scale-down-enabled=true
            - --scale-down-delay-after-add=10m
            - --scale-down-unneeded-time=10m
            - --scale-down-utilization-threshold=0.5
          resources:
            limits:
              cpu: 100m
              memory: 600Mi
            requests:
              cpu: 100m
              memory: 600Mi
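The autoscaler logs its scale-up and scale-down decisions and, with default flags, writes a status ConfigMap; a quick look:
kubectl -n kube-system logs deploy/cluster-autoscaler --tail=50
kubectl -n kube-system get configmap cluster-autoscaler-status -o yaml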
Cluster Health Problem?
│
├── API Server unreachable
│   ├── Check: kube-apiserver static pod (crictl ps on a control plane node), or the systemd unit on non-kubeadm installs
│   ├── Check: API server logs (crictl logs, or /var/log/pods/kube-system_kube-apiserver-*/)
│   ├── Verify: etcd connectivity
│   └── Verify: certificates not expired
│
├── Node NotReady
│   ├── Check: kubelet status on node
│   ├── Check: container runtime status
│   ├── Verify: node network connectivity
│   └── Check: disk pressure, memory pressure
│
├── Pods Pending (no scheduling)
│   ├── Check: kubectl describe pod
│   ├── Verify: node resources available
│   ├── Check: taints and tolerations
│   └── Verify: PVC bound (if using volumes)
│
└── etcd Issues
    ├── Check: etcdctl endpoint health
    ├── Check: etcd member list
    ├── Verify: disk I/O performance
    └── Check: cluster quorum
# Cluster-wide diagnostics
kubectl cluster-info dump --output-directory=/tmp/cluster-dump
kubectl get componentstatuses  # deprecated since v1.19; prefer kubectl get --raw='/readyz?verbose'
kubectl get nodes -o wide
kubectl get events --sort-by='.lastTimestamp' -A
# Control plane health
kubectl get pods -n kube-system
kubectl logs -n kube-system kube-apiserver-<node>
kubectl logs -n kube-system kube-scheduler-<node>
kubectl logs -n kube-system kube-controller-manager-<node>
# etcd health
ETCDCTL_API=3 etcdctl endpoint health \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
# Node diagnostics
kubectl describe node <node-name>
kubectl get node <node-name> -o yaml | grep -A 10 conditions
ssh <node> "journalctl -u kubelet --since '1 hour ago'"
# Certificate expiration check
kubeadm certs check-expiration
# Resource usage
kubectl top nodes
kubectl top pods -A --sort-by=memory
| Challenge | Solution |
|---|---|
| etcd performance degradation | Use SSD storage, tune compaction |
| Certificate expiration | Automate with cert-manager, kubeadm certs renew (example below) |
| Node resource exhaustion | Configure eviction thresholds, resource quotas |
| Control plane overload | Add more control plane nodes, tune rate limits |
| Upgrade failures | Always backup etcd, use staged rollouts |
| kubelet not starting | Check containerd socket, certificates |
| API server latency | Enable priority/fairness, scale API servers |
| Cluster state drift | GitOps, regular audits, policy enforcement |
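For the certificate-expiration row above, kubeadm can renew everything it manages in one step; the control plane static pods must be restarted afterwards to pick up the new certificates:
sudo kubeadm certs renew all
sudo kubeadm certs check-expiration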
| Metric | Target |
|---|---|
| Cluster uptime | 99.9% |
| API server latency p99 | <200ms |
| etcd backup success | 100% |
| Node ready status | 100% |
| Upgrade success rate | 100% |
| Certificate validity | >30 days |
| Control plane pods healthy | 100% |