Perform comprehensive health check on OpenShift cluster and report issues
Analyzes OpenShift/Kubernetes cluster health and reports critical issues and warnings.
/plugin marketplace add openshift-eng/ai-helpers
/plugin install openshift@ai-helpers
/openshift:cluster-health-check [--verbose] [--output-format json|text]
The cluster-health-check command performs a comprehensive health analysis of an OpenShift/Kubernetes cluster and reports any detected issues. It examines cluster operators, nodes, deployments, pods, persistent volumes, and other critical resources to identify problems that may affect cluster stability or workload availability.
This command is useful for:
- Routine cluster health monitoring
- Validating a cluster before and after upgrades or maintenance
- Initial triage when workloads or operators misbehave
Before using this command, ensure you have:
- Kubernetes/OpenShift CLI: either oc (OpenShift) or kubectl (Kubernetes)
  - oc from: https://mirror.openshift.com/pub/openshift-v4/clients/ocp/
  - kubectl from: https://kubernetes.io/docs/tasks/tools/
  - Verify with: oc version or kubectl version
- Active cluster connection: must be connected to a running cluster
  - Verify with: oc whoami or kubectl cluster-info
- Sufficient permissions: must have read access to cluster resources (a quick verification sketch follows this list)
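A quick way to confirm connectivity and read access before running the full check (a sketch; kubectl auth can-i behaves the same as oc auth can-i):
# Confirm the kubeconfig points at a live cluster and the current user
# can read the resources the health check inspects.
oc whoami                           # or: kubectl cluster-info
oc auth can-i get nodes             # cluster-scoped read access
oc auth can-i list pods --all-namespaces
oc auth can-i get clusteroperators  # OpenShift only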
- --verbose (optional): Enable detailed output with additional context
- --output-format (optional): Output format for results
  - text (default): Human-readable text format
  - json: Machine-readable JSON format for automation
The command performs the following health checks:
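The check snippets below assume these flags have already been parsed into shell variables. A minimal sketch of that parsing (VERBOSE and OUTPUT_FORMAT are illustrative names, not defined by the command itself):
# Parse command-line flags into shell variables used by the later snippets.
VERBOSE=false
OUTPUT_FORMAT="text"
while [ $# -gt 0 ]; do
  case "$1" in
    --verbose) VERBOSE=true ;;
    --output-format) OUTPUT_FORMAT="$2"; shift ;;
    *) echo "Unknown option: $1" >&2; exit 2 ;;
  esac
  shift
done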
Detect which Kubernetes CLI is available:
if command -v oc &> /dev/null; then
CLI="oc"
CLUSTER_TYPE="OpenShift"
elif command -v kubectl &> /dev/null; then
CLI="kubectl"
CLUSTER_TYPE="Kubernetes"
else
echo "Error: Neither 'oc' nor 'kubectl' CLI found. Please install one of them."
exit 1
fi
Check if connected to a cluster:
if ! $CLI cluster-info &> /dev/null; then
echo "Error: Not connected to a cluster. Please configure your KUBECONFIG."
exit 1
fi
# Get cluster version info
if [ "$CLUSTER_TYPE" = "OpenShift" ]; then
CLUSTER_VERSION=$($CLI version -o json 2>/dev/null | jq -r '.openshiftVersion // "unknown"')
else
# kubectl no longer supports --short; read the server version from JSON output instead
CLUSTER_VERSION=$($CLI version -o json 2>/dev/null | jq -r '.serverVersion.gitVersion // "unknown"')
fi
Create a report structure to collect findings:
REPORT_FILE=".work/cluster-health-check/report-$(date +%Y%m%d-%H%M%S).txt"
mkdir -p .work/cluster-health-check
# Initialize counters
CRITICAL_ISSUES=0
WARNING_ISSUES=0
INFO_MESSAGES=0
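The checks below print their findings to stdout; one way to also capture them in the report file created above (an assumption about the wiring, which the command itself does not show) is to mirror all output with tee:
# Mirror everything the checks print into the report file as well as the terminal.
exec > >(tee -a "$REPORT_FILE") 2>&1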
For OpenShift clusters, check cluster operator health:
if [ "$CLUSTER_TYPE" = "OpenShift" ]; then
echo "Checking Cluster Operators..."
# Get all cluster operators
DEGRADED_COs=$($CLI get clusteroperators -o json | jq -r '.items[] | select(.status.conditions[] | select(.type=="Degraded" and .status=="True")) | .metadata.name')
UNAVAILABLE_COs=$($CLI get clusteroperators -o json | jq -r '.items[] | select(.status.conditions[] | select(.type=="Available" and .status=="False")) | .metadata.name')
PROGRESSING_COs=$($CLI get clusteroperators -o json | jq -r '.items[] | select(.status.conditions[] | select(.type=="Progressing" and .status=="True")) | .metadata.name')
if [ -n "$DEGRADED_COs" ]; then
CRITICAL_ISSUES=$((CRITICAL_ISSUES + $(echo "$DEGRADED_COs" | wc -l)))
echo "❌ CRITICAL: Degraded cluster operators found:"
echo "$DEGRADED_COs" | while read co; do
echo " - $co"
# Get degraded message
$CLI get clusteroperator "$co" -o json | jq -r '.status.conditions[] | select(.type=="Degraded") | " Reason: \(.reason)\n Message: \(.message)"'
done
fi
if [ -n "$UNAVAILABLE_COs" ]; then
CRITICAL_ISSUES=$((CRITICAL_ISSUES + $(echo "$UNAVAILABLE_COs" | wc -l)))
echo "❌ CRITICAL: Unavailable cluster operators found:"
echo "$UNAVAILABLE_COs" | while read co; do
echo " - $co"
done
fi
if [ -n "$PROGRESSING_COs" ]; then
WARNING_ISSUES=$((WARNING_ISSUES + $(echo "$PROGRESSING_COs" | wc -l)))
echo "⚠️ WARNING: Cluster operators in progress:"
echo "$PROGRESSING_COs" | while read co; do
echo " - $co"
done
fi
fi
Examine all cluster nodes for issues:
echo "Checking Node Health..."
# Get nodes that are not Ready
NOT_READY_NODES=$($CLI get nodes -o json | jq -r '.items[] | select(.status.conditions[] | select(.type=="Ready" and .status!="True")) | .metadata.name')
if [ -n "$NOT_READY_NODES" ]; then
CRITICAL_ISSUES=$((CRITICAL_ISSUES + $(echo "$NOT_READY_NODES" | wc -l)))
echo "❌ CRITICAL: Nodes not in Ready state:"
echo "$NOT_READY_NODES" | while read node; do
echo " - $node"
# Get node conditions
$CLI get node "$node" -o json | jq -r '.status.conditions[] | " \(.type): \(.status) - \(.message // "N/A")"'
done
fi
# Check for SchedulingDisabled nodes
DISABLED_NODES=$($CLI get nodes -o json | jq -r '.items[] | select(.spec.unschedulable==true) | .metadata.name')
if [ -n "$DISABLED_NODES" ]; then
WARNING_ISSUES=$((WARNING_ISSUES + $(echo "$DISABLED_NODES" | wc -l)))
echo "⚠️ WARNING: Nodes with scheduling disabled:"
echo "$DISABLED_NODES" | while read node; do
echo " - $node"
done
fi
# Check for node pressure conditions (MemoryPressure, DiskPressure, PIDPressure)
PRESSURE_NODES=$($CLI get nodes -o json | jq -r '.items[] | select(any(.status.conditions[]?; (.type=="MemoryPressure" or .type=="DiskPressure" or .type=="PIDPressure") and .status=="True")) | .metadata.name')
if [ -n "$PRESSURE_NODES" ]; then
WARNING_ISSUES=$((WARNING_ISSUES + $(echo "$PRESSURE_NODES" | wc -l)))
echo "⚠️ WARNING: Nodes under resource pressure:"
echo "$PRESSURE_NODES" | while read node; do
echo " - $node"
$CLI get node "$node" -o json | jq -r '.status.conditions[] | select((.type=="MemoryPressure" or .type=="DiskPressure" or .type=="PIDPressure") and .status=="True") | " \(.type): \(.message // "N/A")"'
done
fi
# Check node resource utilization if metrics-server is available
if $CLI top nodes &> /dev/null; then
echo "Node Resource Utilization:"
$CLI top nodes
fi
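The command only prints the utilization table; a sketch of how nodes above a threshold could additionally be flagged (the 90% cutoff and the warning increment are assumptions, not part of the command):
# Flag nodes whose CPU or memory utilization exceeds 90% (columns 3 and 5 of `top nodes`).
if $CLI top nodes --no-headers &> /dev/null; then
  HOT_NODES=$($CLI top nodes --no-headers | awk '{cpu=$3; mem=$5; gsub("%","",cpu); gsub("%","",mem); if (cpu+0 > 90 || mem+0 > 90) print $1}')
  if [ -n "$HOT_NODES" ]; then
    WARNING_ISSUES=$((WARNING_ISSUES + $(echo "$HOT_NODES" | wc -l)))
    echo "⚠️ WARNING: Nodes above 90% CPU or memory utilization:"
    echo "$HOT_NODES" | while read node; do echo "  - $node"; done
  fi
fi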
Identify problematic pods:
echo "Checking Pod Health..."
# Get pods that are not Running or Completed
FAILED_PODS=$($CLI get pods --all-namespaces -o json | jq -r '.items[] | select(.status.phase != "Running" and .status.phase != "Succeeded") | "\(.metadata.namespace)/\(.metadata.name) [\(.status.phase)]"')
if [ -n "$FAILED_PODS" ]; then
CRITICAL_ISSUES=$((CRITICAL_ISSUES + $(echo "$FAILED_PODS" | wc -l)))
echo "❌ CRITICAL: Pods in failed/pending state:"
echo "$FAILED_PODS"
fi
# Check for pods with high restart counts (report the highest count across containers)
HIGH_RESTART_PODS=$($CLI get pods --all-namespaces -o json | jq -r '.items[] | select(any(.status.containerStatuses[]?; .restartCount > 5)) | "\(.metadata.namespace)/\(.metadata.name) [Restarts: \([.status.containerStatuses[].restartCount] | max)]"')
if [ -n "$HIGH_RESTART_PODS" ]; then
WARNING_ISSUES=$((WARNING_ISSUES + $(echo "$HIGH_RESTART_PODS" | wc -l)))
echo "⚠️ WARNING: Pods with high restart count (>5):"
echo "$HIGH_RESTART_PODS"
fi
# Check for CrashLoopBackOff pods
CRASHLOOP_PODS=$($CLI get pods --all-namespaces -o json | jq -r '.items[] | select(any(.status.containerStatuses[]?; .state.waiting?.reason == "CrashLoopBackOff")) | "\(.metadata.namespace)/\(.metadata.name)"')
if [ -n "$CRASHLOOP_PODS" ]; then
CRITICAL_ISSUES=$((CRITICAL_ISSUES + $(echo "$CRASHLOOP_PODS" | wc -l)))
echo "❌ CRITICAL: Pods in CrashLoopBackOff:"
echo "$CRASHLOOP_PODS"
fi
# Check for ImagePullBackOff pods
IMAGE_PULL_PODS=$($CLI get pods --all-namespaces -o json | jq -r '.items[] | select(any(.status.containerStatuses[]?; .state.waiting?.reason == "ImagePullBackOff" or .state.waiting?.reason == "ErrImagePull")) | "\(.metadata.namespace)/\(.metadata.name)"')
if [ -n "$IMAGE_PULL_PODS" ]; then
CRITICAL_ISSUES=$((CRITICAL_ISSUES + $(echo "$IMAGE_PULL_PODS" | wc -l)))
echo "❌ CRITICAL: Pods with image pull errors:"
echo "$IMAGE_PULL_PODS"
fi
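Pods stuck in Pending are usually waiting on the scheduler; recent FailedScheduling events explain why (an optional follow-up check, not part of the command's pass/fail logic):
# Show the most recent scheduling failures across the cluster.
$CLI get events --all-namespaces --field-selector reason=FailedScheduling --sort-by=.lastTimestamp | tail -20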
Verify workload controllers:
echo "Checking Deployments..."
# Check deployments with unavailable replicas
UNHEALTHY_DEPLOYMENTS=$($CLI get deployments --all-namespaces -o json | jq -r '.items[] | select(.status.unavailableReplicas > 0 or .status.replicas != .status.readyReplicas) | "\(.metadata.namespace)/\(.metadata.name) [Ready: \(.status.readyReplicas // 0)/\(.spec.replicas)]"')
if [ -n "$UNHEALTHY_DEPLOYMENTS" ]; then
WARNING_ISSUES=$((WARNING_ISSUES + $(echo "$UNHEALTHY_DEPLOYMENTS" | wc -l)))
echo "⚠️ WARNING: Deployments with unavailable replicas:"
echo "$UNHEALTHY_DEPLOYMENTS"
fi
echo "Checking StatefulSets..."
UNHEALTHY_STATEFULSETS=$($CLI get statefulsets --all-namespaces -o json | jq -r '.items[] | select(.status.replicas != .status.readyReplicas) | "\(.metadata.namespace)/\(.metadata.name) [Ready: \(.status.readyReplicas // 0)/\(.spec.replicas)]"')
if [ -n "$UNHEALTHY_STATEFULSETS" ]; then
WARNING_ISSUES=$((WARNING_ISSUES + $(echo "$UNHEALTHY_STATEFULSETS" | wc -l)))
echo "⚠️ WARNING: StatefulSets with unavailable replicas:"
echo "$UNHEALTHY_STATEFULSETS"
fi
echo "Checking DaemonSets..."
UNHEALTHY_DAEMONSETS=$($CLI get daemonsets --all-namespaces -o json | jq -r '.items[] | select(.status.numberReady != .status.desiredNumberScheduled) | "\(.metadata.namespace)/\(.metadata.name) [Ready: \(.status.numberReady)/\(.status.desiredNumberScheduled)]"')
if [ -n "$UNHEALTHY_DAEMONSETS" ]; then
WARNING_ISSUES=$((WARNING_ISSUES + $(echo "$UNHEALTHY_DAEMONSETS" | wc -l)))
echo "⚠️ WARNING: DaemonSets with unavailable pods:"
echo "$UNHEALTHY_DAEMONSETS"
fi
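Comparing replica counts misses rollouts that have stalled; deployments that exceeded their progress deadline can be surfaced through the standard Progressing condition (a sketch, not part of the command):
# List deployments whose rollout hit ProgressDeadlineExceeded.
STALLED_DEPLOYMENTS=$($CLI get deployments --all-namespaces -o json | jq -r '.items[] | select(.status.conditions[]? | select(.type=="Progressing" and .reason=="ProgressDeadlineExceeded")) | "\(.metadata.namespace)/\(.metadata.name)"')
if [ -n "$STALLED_DEPLOYMENTS" ]; then
  echo "⚠️ WARNING: Deployments that exceeded their progress deadline:"
  echo "$STALLED_DEPLOYMENTS"
fi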
Check for storage issues:
echo "Checking Persistent Volume Claims..."
# Get PVCs that are not Bound
PENDING_PVCS=$($CLI get pvc --all-namespaces -o json | jq -r '.items[] | select(.status.phase != "Bound") | "\(.metadata.namespace)/\(.metadata.name) [\(.status.phase)]"')
if [ -n "$PENDING_PVCS" ]; then
WARNING_ISSUES=$((WARNING_ISSUES + $(echo "$PENDING_PVCS" | wc -l)))
echo "⚠️ WARNING: PVCs not in Bound state:"
echo "$PENDING_PVCS"
fi
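The command checks claims only; PersistentVolumes themselves can be screened the same way using the standard Failed and Released phases (a sketch, not part of the command):
# List PVs that are no longer usable by new claims.
BAD_PVS=$($CLI get pv -o json | jq -r '.items[] | select(.status.phase=="Failed" or .status.phase=="Released") | "\(.metadata.name) [\(.status.phase)]"')
if [ -n "$BAD_PVS" ]; then
  echo "⚠️ WARNING: PersistentVolumes in Failed or Released state:"
  echo "$BAD_PVS"
fi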
For OpenShift, check critical namespaces:
if [ "$CLUSTER_TYPE" = "OpenShift" ]; then
echo "Checking Critical Namespaces..."
CRITICAL_NAMESPACES="openshift-kube-apiserver openshift-etcd openshift-authentication openshift-console openshift-monitoring"
for ns in $CRITICAL_NAMESPACES; do
# Check if namespace exists
if ! $CLI get namespace "$ns" &> /dev/null; then
CRITICAL_ISSUES=$((CRITICAL_ISSUES + 1))
echo "❌ CRITICAL: Critical namespace missing: $ns"
continue
fi
# Check for failed pods in critical namespace
FAILED_IN_NS=$($CLI get pods -n "$ns" -o json | jq -r '.items[] | select(.status.phase != "Running" and .status.phase != "Succeeded") | .metadata.name')
if [ -n "$FAILED_IN_NS" ]; then
CRITICAL_ISSUES=$((CRITICAL_ISSUES + $(echo "$FAILED_IN_NS" | wc -l)))
echo "❌ CRITICAL: Failed pods in critical namespace $ns:"
echo "$FAILED_IN_NS" | while read pod; do
echo " - $pod"
done
fi
done
fi
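Failed pods are not the only signal in these namespaces; recently restarted containers are often an early warning (an optional extension; the >3 threshold is an assumption):
# Reuse CRITICAL_NAMESPACES from above to spot restarting containers.
if [ "$CLUSTER_TYPE" = "OpenShift" ]; then
  for ns in $CRITICAL_NAMESPACES; do
    RESTARTED=$($CLI get pods -n "$ns" -o json | jq -r '.items[] | select(any(.status.containerStatuses[]?; .restartCount > 3)) | .metadata.name')
    if [ -n "$RESTARTED" ]; then
      echo "⚠️ WARNING: Pods with restarting containers in $ns:"
      echo "$RESTARTED" | while read pod; do echo "  - $pod"; done
    fi
  done
fi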
Look for recent warning/error events:
echo "Checking Recent Events..."
# Get events from last 30 minutes with Warning or Error type
RECENT_WARNINGS=$($CLI get events --all-namespaces --field-selector type=Warning -o json | jq -r --arg since "$(date -u -d '30 minutes ago' +%Y-%m-%dT%H:%M:%SZ 2>/dev/null || date -u -v-30M +%Y-%m-%dT%H:%M:%SZ)" '.items[] | select(.lastTimestamp > $since) | "\(.lastTimestamp) [\(.involvedObject.namespace)/\(.involvedObject.name)]: \(.message)"' | head -20)
if [ -n "$RECENT_WARNINGS" ]; then
echo "⚠️ Recent Warning Events (last 30 minutes):"
echo "$RECENT_WARNINGS"
fi
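When the warning list is long, aggregating events by reason shows what dominates (an optional summary, not part of the command):
# Count Warning events by reason, most frequent first.
$CLI get events --all-namespaces --field-selector type=Warning -o json | jq -r '[.items[].reason] | group_by(.) | map("\(length)\t\(.[0])") | .[]' | sort -rn | head -10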
Create a summary of findings:
echo ""
echo "==============================================="
echo "Cluster Health Check Summary"
echo "==============================================="
echo "Cluster Type: $CLUSTER_TYPE"
echo "Cluster Version: $CLUSTER_VERSION"
echo "Check Time: $(date)"
echo ""
echo "Results:"
echo " Critical Issues: $CRITICAL_ISSUES"
echo " Warnings: $WARNING_ISSUES"
echo ""
if [ $CRITICAL_ISSUES -eq 0 ] && [ $WARNING_ISSUES -eq 0 ]; then
echo "✅ Cluster is healthy - no issues detected"
exit 0
elif [ $CRITICAL_ISSUES -gt 0 ]; then
echo "❌ Cluster has CRITICAL issues requiring immediate attention"
exit 1
else
echo "⚠️ Cluster has warnings - monitoring recommended"
exit 0
fi
If --output-format json is specified, export findings as JSON:
{
"cluster": {
"type": "OpenShift",
"version": "4.21.0",
"checkTime": "2025-10-31T12:00:00Z"
},
"summary": {
"criticalIssues": 2,
"warnings": 5,
"healthy": false
},
"findings": {
"clusterOperators": {
"degraded": ["authentication", "monitoring"],
"unavailable": [],
"progressing": ["network"]
},
"nodes": {
"notReady": ["worker-1"],
"schedulingDisabled": ["worker-2"],
"underPressure": []
},
"pods": {
"failed": ["namespace/pod-1", "namespace/pod-2"],
"crashLooping": [],
"imagePullErrors": ["namespace/pod-3"]
},
"workloads": {
"unhealthyDeployments": [],
"unhealthyStatefulSets": [],
"unhealthyDaemonSets": []
},
"storage": {
"pendingPVCs": []
}
}
}
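One way such a document could be assembled from the shell variables collected earlier (a sketch, not the command's actual implementation; OUTPUT_FORMAT is the illustrative variable from the flag-parsing sketch, and the findings arrays would be built the same way with jq):
# Emit the cluster and summary sections as JSON.
if [ "$OUTPUT_FORMAT" = "json" ]; then
  jq -n \
    --arg type "$CLUSTER_TYPE" \
    --arg version "$CLUSTER_VERSION" \
    --arg time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
    --argjson critical "$CRITICAL_ISSUES" \
    --argjson warnings "$WARNING_ISSUES" \
    '{cluster: {type: $type, version: $version, checkTime: $time},
      summary: {criticalIssues: $critical, warnings: $warnings, healthy: ($critical == 0 and $warnings == 0)}}'
fi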
Run a basic health check:
/openshift:cluster-health-check
Output:
Checking Cluster Operators...
✅ All cluster operators healthy
Checking Node Health...
⚠️ WARNING: Nodes with scheduling disabled:
- ip-10-0-51-201.us-east-2.compute.internal
Checking Pod Health...
✅ All pods healthy
...
===============================================
Cluster Health Check Summary
===============================================
Cluster Type: OpenShift
Cluster Version: 4.21.0
Check Time: 2025-10-31 12:00:00
Results:
Critical Issues: 0
Warnings: 1
⚠️ Cluster has warnings - monitoring recommended
Run with verbose output for additional context:
/openshift:cluster-health-check --verbose
Export the results as JSON for automation:
/openshift:cluster-health-check --output-format json
The command returns different exit codes based on findings: 0 when the cluster is healthy or only warnings were found, and 1 when critical issues are detected (prerequisite failures such as a missing CLI or no cluster connection also exit 1).
Output Format: text (the default) prints the human-readable report shown above; json emits the machine-readable structure described earlier, suitable for automation.
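If the checks are saved as a standalone script (cluster-health-check.sh is a hypothetical name), the exit code can gate automation such as CI or a cron-driven alert:
# Fail the pipeline only when critical issues are reported (exit code 1).
if ./cluster-health-check.sh --output-format json > health.json; then
  echo "Cluster healthy or warnings only"
else
  echo "Critical issues detected - see health.json" >&2
  exit 1
fi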
Degraded Cluster Operators
Symptoms: Cluster operators showing Degraded=True or Available=False
Investigation:
oc get clusteroperator <operator-name> -o yaml
oc logs -n openshift-<operator-namespace> -l app=<operator-name>
Remediation: Check operator logs and events for specific errors
Nodes Not Ready
Symptoms: Nodes in NotReady state
Investigation:
oc describe node <node-name>
oc get events --field-selector involvedObject.name=<node-name>
Remediation: Common causes include network issues, disk pressure, or kubelet problems
Pods in CrashLoopBackOff
Symptoms: Pods continuously restarting
Investigation:
oc logs <pod-name> -n <namespace> --previous
oc describe pod <pod-name> -n <namespace>
Remediation: Check application logs, resource limits, and configuration
Image Pull Failures
Symptoms: Pods unable to pull container images
Investigation:
oc describe pod <pod-name> -n <namespace>
Remediation: Verify image name, registry credentials, and network connectivity
See also: /prow-job:analyze-test-failure, /must-gather:analyze