From openshift-ops
Comprehensive guide for OpenShift node lifecycle management including adding/removing nodes, cordoning/draining, machine management, node maintenance, and troubleshooting node issues. Use when managing cluster capacity, performing node maintenance, or resolving node-level problems.
npx claudepluginhub redhat-community-ai-tools/claude-plugins --plugin openshift-ops
Slash command: /openshift-ops:openshift-node-operations
This skill provides guidance for managing OpenShift nodes throughout their lifecycle, from provisioning to decommissioning, including maintenance and troubleshooting.
# List all nodes
oc get nodes
oc get nodes -o wide
# Get detailed node information
oc describe node <node-name>
# View node labels
oc get nodes --show-labels
# Check node resource usage
oc adm top nodes
# Get node allocatable resources
oc describe nodes | grep -A 5 "Allocatable:"
# View nodes with specific roles
oc get nodes -l node-role.kubernetes.io/worker
oc get nodes -l node-role.kubernetes.io/master
For Automated Infrastructure (AWS, Azure, GCP, OpenStack)
# View existing machine sets
oc get machinesets -n openshift-machine-api
# Scale machine set to add nodes
oc scale machineset <machineset-name> -n openshift-machine-api --replicas=<new-count>
# Create new machine set (for different instance types/zones)
oc get machineset <existing-machineset> -n openshift-machine-api -o yaml > new-machineset.yaml
# Edit new-machineset.yaml (change name, replicas, instance type, etc.)
oc create -f new-machineset.yaml
# Monitor machine creation
oc get machines -n openshift-machine-api -w
# Verify nodes join cluster
oc get nodes -w
For Bare Metal/Manual Infrastructure
# Watch for pending CSRs
oc get csr
# Approve CSRs
oc adm certificate approve <csr-name>
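# Approve all pending CSRs (use with caution); the template selects only CSRs with no status yet
oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs --no-run-if-empty oc adm certificate approve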
# Verify node joined
oc get nodes
Labels
# Add label to node
oc label node <node-name> <key>=<value>
# Remove label from node
oc label node <node-name> <key>-
# Update existing label
oc label node <node-name> <key>=<new-value> --overwrite
# Label multiple nodes
oc label nodes -l <selector> <key>=<value>
# Common labels
oc label node <node-name> node-role.kubernetes.io/infra=""
oc label node <node-name> node-role.kubernetes.io/storage=""
oc label node <node-name> environment=production
Taints and Tolerations
# Add taint to node
oc adm taint node <node-name> <key>=<value>:<effect>
# Effects: NoSchedule, PreferNoSchedule, NoExecute
# Remove taint from node
oc adm taint node <node-name> <key>:<effect>-
# Examples
oc adm taint node <node-name> dedicated=gpu:NoSchedule
oc adm taint node <node-name> maintenance=true:NoExecute
oc adm taint node <node-name> special=true:PreferNoSchedule
# View node taints
oc describe node <node-name> | grep Taints
Cordoning prevents new pods from being scheduled on a node:
# Cordon a node (mark as unschedulable)
oc adm cordon <node-name>
# Uncordon a node (mark as schedulable)
oc adm uncordon <node-name>
# Cordon multiple nodes
oc adm cordon <node1> <node2> <node3>
# Verify node status
oc get nodes
# Look for "SchedulingDisabled" in STATUS column
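To pick out cordoned nodes without scanning the STATUS column by eye, the output of `oc get nodes` can be filtered. This is a minimal sketch that assumes the default table layout (NAME in column 1, STATUS in column 2); the function name `cordoned_nodes` is hypothetical and not part of `oc`.

```shell
# cordoned_nodes: read `oc get nodes` table output on stdin and print the
# names of nodes whose STATUS contains "SchedulingDisabled".
# Assumes the default column layout: NAME STATUS ROLES AGE VERSION.
cordoned_nodes() {
  awk 'NR > 1 && $2 ~ /SchedulingDisabled/ { print $1 }'
}

# Usage (requires a logged-in oc session):
#   oc get nodes | cordoned_nodes
```

An empty result means no node is currently cordoned.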
Draining safely evicts pods from a node:
# Drain a node (standard)
oc adm drain <node-name> --ignore-daemonsets --delete-emptydir-data
# Drain with grace period
oc adm drain <node-name> --ignore-daemonsets --delete-emptydir-data --grace-period=300
# Drain with timeout
oc adm drain <node-name> --ignore-daemonsets --delete-emptydir-data --timeout=5m
# Force drain (use with extreme caution)
oc adm drain <node-name> --ignore-daemonsets --delete-emptydir-data --force
# Dry run to see what would be drained
oc adm drain <node-name> --ignore-daemonsets --delete-emptydir-data --dry-run=client
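# Stop waiting for pods whose deletion has been pending longer than N seconds (N must be greater than 0; 0 never skips)
oc adm drain <node-name> --ignore-daemonsets --delete-emptydir-data --skip-wait-for-delete-timeout=<seconds>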
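A drain stalls when evicting a pod would violate a PodDisruptionBudget, so it can help to look for budgets that currently allow no disruptions before starting. This is a rough sketch: `blocking_pdbs` is a hypothetical helper, and it expects the `namespace/name allowed` lines produced by the jsonpath query in the usage comment.

```shell
# blocking_pdbs: read lines of the form "<namespace>/<name> <disruptionsAllowed>"
# on stdin and print the budgets that currently allow zero disruptions.
# Lines without a disruptionsAllowed value are ignored.
blocking_pdbs() {
  awk 'NF == 2 && $2 == 0 { print $1 }'
}

# Usage (requires a logged-in oc session):
#   oc get pdb -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"/"}{.metadata.name}{" "}{.status.disruptionsAllowed}{"\n"}{end}' | blocking_pdbs
```

Any budget listed here covers pods that cannot be evicted right now; if one of those pods runs on the node being drained, the drain will wait until the budget allows a disruption or the timeout expires.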
Best Practices for Draining:
Standard Maintenance Workflow:
# 1. Cordon the node
oc adm cordon <node-name>
# 2. Drain the node
oc adm drain <node-name> --ignore-daemonsets --delete-emptydir-data
# 3. Verify pods have been evicted
oc get pods -A -o wide | grep <node-name>
# 4. Perform maintenance on the node
# - SSH to node or use oc debug
# - Apply updates, reboot, etc.
# 5. Verify node is back and healthy
oc get nodes
# 6. Uncordon the node
oc adm uncordon <node-name>
# 7. Verify workloads are scheduling
oc get pods -A -o wide | grep <node-name>
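The cordon, drain, and uncordon steps of the workflow above can be wrapped in two small functions so they are run the same way every time. This is a sketch under stated assumptions: the function names, the `OC` variable (an override for the client binary), and the `DRAIN_TIMEOUT` variable with its `10m` default are all hypothetical and chosen for illustration.

```shell
# begin_maintenance / end_maintenance wrap the cordon, drain, and uncordon steps.
# OC is a hypothetical override for the client binary (defaults to "oc");
# DRAIN_TIMEOUT is likewise an assumed knob, defaulting to 10m.
begin_maintenance() {
  node="$1"
  [ -n "$node" ] || { echo "usage: begin_maintenance <node-name>" >&2; return 2; }
  # Stop here if the cordon fails, so the drain never runs on a schedulable node.
  "${OC:-oc}" adm cordon "$node" || return 1
  "${OC:-oc}" adm drain "$node" --ignore-daemonsets --delete-emptydir-data \
    --timeout="${DRAIN_TIMEOUT:-10m}"
}

end_maintenance() {
  node="$1"
  [ -n "$node" ] || { echo "usage: end_maintenance <node-name>" >&2; return 2; }
  "${OC:-oc}" adm uncordon "$node"
}

# Usage:
#   begin_maintenance <node-name>   # steps 1-2
#   ...perform maintenance, verify the node is Ready...
#   end_maintenance <node-name>     # step 6
```

Setting `OC=echo` prints the commands instead of running them, which is a cheap way to review what would be executed.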
Accessing Node for Maintenance:
# Debug node (creates privileged pod)
oc debug node/<node-name>
# Once in debug shell
chroot /host
# Common maintenance commands
systemctl status kubelet
systemctl restart kubelet
journalctl -u kubelet -f
rpm-ostree status
# Exit debug session
exit
For Automated Infrastructure:
# Scale down machine set
oc scale machineset <machineset-name> -n openshift-machine-api --replicas=<new-count>
# Delete specific machine
oc delete machine <machine-name> -n openshift-machine-api
# Monitor deletion
oc get machines -n openshift-machine-api -w
oc get nodes -w
For Manual Infrastructure:
# 1. Cordon and drain the node
oc adm cordon <node-name>
oc adm drain <node-name> --ignore-daemonsets --delete-emptydir-data --force
# 2. Delete the node from cluster
oc delete node <node-name>
# 3. Power down or repurpose the physical/virtual machine
Replacing Failed Node (Automated Infrastructure):
# Delete failed machine
oc delete machine <machine-name> -n openshift-machine-api
# Machine set controller will create replacement automatically
oc get machines -n openshift-machine-api -w
# Approve CSRs if needed
oc get csr
oc adm certificate approve <csr-name>
Replacing Failed Node (Manual Infrastructure):
oc delete node <node-name>
# List all machines
oc get machines -n openshift-machine-api
# Get machine details
oc describe machine <machine-name> -n openshift-machine-api
# Delete machine (will be recreated by machine set)
oc delete machine <machine-name> -n openshift-machine-api
# View machine status
oc get machines -n openshift-machine-api -o wide
# List machine sets
oc get machinesets -n openshift-machine-api
# Describe machine set
oc describe machineset <machineset-name> -n openshift-machine-api
# Scale machine set
oc scale machineset <machineset-name> -n openshift-machine-api --replicas=<count>
# Edit machine set
oc edit machineset <machineset-name> -n openshift-machine-api
# Delete machine set
oc delete machineset <machineset-name> -n openshift-machine-api
# Example: Create infra node machine set
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  name: <cluster-id>-infra-<zone>
  namespace: openshift-machine-api
spec:
  replicas: 3
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: <cluster-id>
      machine.openshift.io/cluster-api-machineset: <cluster-id>-infra-<zone>
  template:
    metadata:
      labels:
        machine.openshift.io/cluster-api-cluster: <cluster-id>
        machine.openshift.io/cluster-api-machine-role: infra
        machine.openshift.io/cluster-api-machine-type: infra
        machine.openshift.io/cluster-api-machineset: <cluster-id>-infra-<zone>
    spec:
      metadata:
        labels:
          node-role.kubernetes.io/infra: ""
      taints:
      - key: node-role.kubernetes.io/infra
        effect: NoSchedule
      # Provider-specific configuration here
# View machine autoscaler
oc get machineautoscaler -n openshift-machine-api
# Create machine autoscaler
oc create -f machine-autoscaler.yaml
# View cluster autoscaler
oc get clusterautoscaler
# Edit autoscaler settings
oc edit clusterautoscaler default
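The `machine-autoscaler.yaml` file passed to `oc create` above is not shown, so here is one way it might be generated. This is a minimal sketch: `render_machine_autoscaler` and its argument order are hypothetical, and the manifest assumes the `autoscaling.openshift.io/v1beta1` MachineAutoscaler API targeting an existing machine set.

```shell
# render_machine_autoscaler: print a MachineAutoscaler manifest to stdout.
# Arguments: <machineset-name> <min-replicas> <max-replicas>
# The function name and argument order are illustrative, not part of oc.
render_machine_autoscaler() {
  cat <<EOF
apiVersion: autoscaling.openshift.io/v1beta1
kind: MachineAutoscaler
metadata:
  name: $1
  namespace: openshift-machine-api
spec:
  minReplicas: $2
  maxReplicas: $3
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: $1
EOF
}

# Usage (requires a logged-in oc session):
#   render_machine_autoscaler <machineset-name> 1 6 > machine-autoscaler.yaml
#   oc create -f machine-autoscaler.yaml
```

A MachineAutoscaler only takes effect when a ClusterAutoscaler resource also exists in the cluster.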
# Check node status
oc get nodes
oc describe node <node-name>
# Check node conditions
oc get node <node-name> -o jsonpath='{.status.conditions}'
# Common causes:
# - Network issues
# - Disk pressure
# - Memory pressure
# - kubelet not running
# - Certificate issues
# Debug the node
oc debug node/<node-name>
chroot /host
systemctl status kubelet
journalctl -u kubelet -n 100
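The raw `.status.conditions` JSON is hard to read at a glance. One option is to print each condition as `type=status` and keep only the problematic ones. This is a sketch, assuming the standard kubelet conditions where `Ready` should be `True` and the others (such as `MemoryPressure`, `DiskPressure`, `PIDPressure`) should be `False`; `unhealthy_conditions` is a hypothetical helper name.

```shell
# unhealthy_conditions: read "<type>=<status>" lines on stdin and print the
# ones that indicate a problem: Ready should be True, and every other
# condition is expected to be False on a healthy node.
unhealthy_conditions() {
  awk -F= '
    $1 == "Ready" && $2 != "True" { print; next }
    $1 != "Ready" && $1 != "" && $2 != "False" { print }
  '
}

# Usage (requires a logged-in oc session):
#   oc get node <node-name> -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}' | unhealthy_conditions
```

No output suggests the node reports healthy conditions; a line such as `Ready=Unknown` usually points at the kubelet or the network path to the API server.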
# Check disk usage on node
oc debug node/<node-name>
chroot /host
df -h
# Clean up images
crictl rmi --prune
# Check image registry cache
du -sh /var/lib/containers/
# Clear logs
journalctl --vacuum-time=3d
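When checking for disk pressure, it can be quicker to list only the filesystems above a usage threshold than to read the full `df` table. This is a minimal sketch: `full_filesystems` is a hypothetical helper, and the 85% default is an arbitrary illustration, not the kubelet's eviction threshold.

```shell
# full_filesystems: read `df -P` output on stdin and print the mount point and
# usage of filesystems at or above a threshold percentage (first argument,
# default 85). Mount points containing spaces are not handled.
full_filesystems() {
  awk -v limit="${1:-85}" '
    NR > 1 {
      use = $5
      sub(/%/, "", use)          # strip the trailing percent sign
      if (use + 0 >= limit + 0) print $6, $5
    }
  '
}

# Usage (inside `oc debug node/<node-name>` after `chroot /host`):
#   df -P | full_filesystems 90
```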
# Check memory usage
oc adm top nodes
oc describe node <node-name> | grep -A 10 "Allocated resources"
# Identify high memory pods
oc adm top pods -A --sort-by=memory | head -20
# Check for memory leaks
oc debug node/<node-name>
chroot /host
free -h
top
# Check SDN/CNI pods on node
oc get pods -n openshift-sdn -o wide | grep <node-name>
oc get pods -n openshift-ovn-kubernetes -o wide | grep <node-name>
# Check network operator
oc get clusteroperator network
# Debug network from node
oc debug node/<node-name>
chroot /host
ip addr
ip route
ping <api-server>
# View pending CSRs
oc get csr | grep Pending
# Check CSR details
oc describe csr <csr-name>
# Approve CSR
oc adm certificate approve <csr-name>
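# Automated approval of all pending CSRs (for testing/dev only); the template selects only CSRs with no status yet
oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs --no-run-if-empty oc adm certificate approve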
# Quick node health check
oc get nodes && oc get machines -n openshift-machine-api
# Find pods on a specific node
oc get pods -A -o wide --field-selector spec.nodeName=<node-name>
# Count pods per node
oc get pods -A --no-headers -o custom-columns=NODE:.spec.nodeName | sort | uniq -c
# View node capacity
oc get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.capacity.cpu,MEMORY:.status.capacity.memory
# Check PodDisruptionBudgets
oc get pdb -A
# Emergency node reset (use with extreme caution)
oc adm drain <node-name> --force --delete-emptydir-data --ignore-daemonsets --skip-wait-for-delete-timeout=<seconds>
openshift-debugging - For troubleshooting node-related issues
openshift-cluster-upgrade - Nodes are updated during cluster upgrades
openshift-operator-troubleshooting - Machine API operator issues