Perform comprehensive health check on HCP cluster and report issues
Install the plugin:
/plugin marketplace add openshift-eng/ai-helpers
/plugin install hcp@ai-helpers
Usage:
/hcp:cluster-health-check <cluster-name> [--verbose] [--output-format json|text]
The /hcp:cluster-health-check command performs an extensive diagnostic of a Hosted Control Plane (HCP) cluster to assess its operational health, stability, and performance. It automates the validation of multiple control plane and infrastructure components to ensure the cluster is functioning as expected.
The command runs a comprehensive set of health checks covering the HostedCluster conditions, control plane pods, etcd, NodePools, networking and ingress, and infrastructure readiness. It also detects degraded or progressing states, certificate issues, and recent warning events. Each check is described in detail below.
Before using this command, ensure you have:
Kubernetes/OpenShift CLI: Either oc (OpenShift) or kubectl (Kubernetes)
Install oc from: https://mirror.openshift.com/pub/openshift-v4/clients/ocp/
Install kubectl from: https://kubernetes.io/docs/tasks/tools/
Verify with: oc version or kubectl version
Active cluster connection: Must be connected to a running cluster
Verify with: oc whoami or kubectl cluster-info
Sufficient permissions: Must have read access to cluster resources
jq: Required for parsing JSON output in the health check steps below
<cluster-name> (required): Name of the HostedCluster to check. This uniquely identifies the HCP instance whose health will be analyzed. Example: hcp-demo
[namespace] (optional): The namespace in which the HostedCluster resides. Defaults to clusters if not provided. Example: /hcp:cluster-health-check hcp-demo clusters-dev
--verbose (optional): Enable detailed output with additional context
--output-format (optional): Output format for results
text (default): Human-readable text format
json: Machine-readable JSON format for automation
The command performs the following health checks:
Detect which Kubernetes CLI is available and verify cluster connection:
if command -v oc &> /dev/null; then
  CLI="oc"
elif command -v kubectl &> /dev/null; then
  CLI="kubectl"
else
  echo "Error: Neither 'oc' nor 'kubectl' CLI found. Please install one of them."
  exit 1
fi

# Verify cluster connectivity
if ! $CLI cluster-info &> /dev/null; then
  echo "Error: Not connected to a cluster. Please configure your KUBECONFIG."
  exit 1
fi
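The checks below lean heavily on jq to parse JSON output, so it is worth failing fast if it is missing. A minimal sketch that mirrors the CLI detection above:
# Verify jq is available; the health checks below parse JSON with it
if ! command -v jq &> /dev/null; then
  echo "Error: 'jq' not found. Please install jq to run the health checks."
  exit 1
fi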
Create a report structure to collect findings:
CLUSTER_NAME=$1
NAMESPACE=${2:-"clusters"}
CONTROL_PLANE_NS="${NAMESPACE}-${CLUSTER_NAME}"
REPORT_FILE=".work/hcp-health-check/report-${CLUSTER_NAME}-$(date +%Y%m%d-%H%M%S).txt"
mkdir -p .work/hcp-health-check
# Initialize counters
CRITICAL_ISSUES=0
WARNING_ISSUES=0
INFO_MESSAGES=0
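The report file created above is only useful if findings are actually written to it; the steps below print findings directly, so persisting them is optional. One way to do both, shown here as a sketch, is a small helper that prints a finding, appends it to $REPORT_FILE, and bumps the matching counter. The helper name log_finding and its severity argument are illustrative, not part of the command:
# Illustrative helper (sketch): print a finding, append it to the report, update counters
log_finding() {
  local severity="$1"   # CRITICAL, WARNING, or INFO
  local message="$2"
  local prefix
  case "$severity" in
    CRITICAL) CRITICAL_ISSUES=$((CRITICAL_ISSUES + 1)); prefix="❌ CRITICAL" ;;
    WARNING)  WARNING_ISSUES=$((WARNING_ISSUES + 1));   prefix="⚠️ WARNING" ;;
    *)        INFO_MESSAGES=$((INFO_MESSAGES + 1));     prefix="ℹ️" ;;
  esac
  echo "$prefix: $message" | tee -a "$REPORT_FILE"
}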
Arguments: $1 is the HostedCluster name, as shown in kubectl get hostedcluster output (required); $2 is the namespace (optional, defaulting to clusters).
Verify the HostedCluster resource health:
echo "Checking HostedCluster Status..."
# Check if HostedCluster exists
if ! $CLI get hostedcluster "$CLUSTER_NAME" -n "$NAMESPACE" &> /dev/null; then
  echo "❌ CRITICAL: HostedCluster '$CLUSTER_NAME' not found in namespace '$NAMESPACE'"
  exit 1
fi
# Get HostedCluster conditions
HC_AVAILABLE=$($CLI get hostedcluster "$CLUSTER_NAME" -n "$NAMESPACE" -o json | jq -r '.status.conditions[] | select(.type=="Available") | .status')
HC_PROGRESSING=$($CLI get hostedcluster "$CLUSTER_NAME" -n "$NAMESPACE" -o json | jq -r '.status.conditions[] | select(.type=="Progressing") | .status')
HC_DEGRADED=$($CLI get hostedcluster "$CLUSTER_NAME" -n "$NAMESPACE" -o json | jq -r '.status.conditions[] | select(.type=="Degraded") | .status')
if [ "$HC_AVAILABLE" != "True" ]; then
CRITICAL_ISSUES=$((CRITICAL_ISSUES + 1))
echo "❌ CRITICAL: HostedCluster is not Available"
# Get reason and message
$CLI get hostedcluster "$CLUSTER_NAME" -n "$NAMESPACE" -o json | jq -r '.status.conditions[] | select(.type=="Available") | " Reason: \(.reason)\n Message: \(.message)"'
fi
if [ "$HC_DEGRADED" == "True" ]; then
CRITICAL_ISSUES=$((CRITICAL_ISSUES + 1))
echo "❌ CRITICAL: HostedCluster is Degraded"
$CLI get hostedcluster "$CLUSTER_NAME" -n "$NAMESPACE" -o json | jq -r '.status.conditions[] | select(.type=="Degraded") | " Reason: \(.reason)\n Message: \(.message)"'
fi
if [ "$HC_PROGRESSING" == "True" ]; then
WARNING_ISSUES=$((WARNING_ISSUES + 1))
echo "⚠️ WARNING: HostedCluster is Progressing"
$CLI get hostedcluster "$CLUSTER_NAME" -n "$NAMESPACE" -o json | jq -r '.status.conditions[] | select(.type=="Progressing") | " Reason: \(.reason)\n Message: \(.message)"'
fi
# Check version and upgrade status
HC_VERSION=$($CLI get hostedcluster "$CLUSTER_NAME" -n "$NAMESPACE" -o json | jq -r '.status.version.history[0].version // "unknown"')
echo "ℹ️ HostedCluster version: $HC_VERSION"
Examine control plane components in the management cluster:
echo "Checking Control Plane Pods..."
# Check if control plane namespace exists
if ! $CLI get namespace "$CONTROL_PLANE_NS" &> /dev/null; then
  CRITICAL_ISSUES=$((CRITICAL_ISSUES + 1))
  echo "❌ CRITICAL: Control plane namespace '$CONTROL_PLANE_NS' not found"
else
  # Get pods that are not Running
  FAILED_CP_PODS=$($CLI get pods -n "$CONTROL_PLANE_NS" -o json | jq -r '.items[] | select(.status.phase != "Running" and .status.phase != "Succeeded") | "\(.metadata.name) [\(.status.phase)]"')
  if [ -n "$FAILED_CP_PODS" ]; then
    CRITICAL_ISSUES=$((CRITICAL_ISSUES + $(echo "$FAILED_CP_PODS" | wc -l)))
    echo "❌ CRITICAL: Control plane pods in failed state:"
    echo "$FAILED_CP_PODS" | while read pod; do
      echo " - $pod"
    done
  fi

  # Check for pods with high restart count
  HIGH_RESTART_CP_PODS=$($CLI get pods -n "$CONTROL_PLANE_NS" -o json | jq -r '.items[] | select(.status.containerStatuses[]? | .restartCount > 3) | "\(.metadata.name) [Restarts: \(.status.containerStatuses[0].restartCount)]"')
  if [ -n "$HIGH_RESTART_CP_PODS" ]; then
    WARNING_ISSUES=$((WARNING_ISSUES + $(echo "$HIGH_RESTART_CP_PODS" | wc -l)))
    echo "⚠️ WARNING: Control plane pods with high restart count (>3):"
    echo "$HIGH_RESTART_CP_PODS" | while read pod; do
      echo " - $pod"
    done
  fi

  # Check for CrashLoopBackOff pods
  CRASHLOOP_CP_PODS=$($CLI get pods -n "$CONTROL_PLANE_NS" -o json | jq -r '.items[] | select(.status.containerStatuses[]? | .state.waiting?.reason == "CrashLoopBackOff") | .metadata.name')
  if [ -n "$CRASHLOOP_CP_PODS" ]; then
    CRITICAL_ISSUES=$((CRITICAL_ISSUES + $(echo "$CRASHLOOP_CP_PODS" | wc -l)))
    echo "❌ CRITICAL: Control plane pods in CrashLoopBackOff:"
    echo "$CRASHLOOP_CP_PODS" | while read pod; do
      echo " - $pod"
    done
  fi

  # Check for ImagePullBackOff
  IMAGE_PULL_CP_PODS=$($CLI get pods -n "$CONTROL_PLANE_NS" -o json | jq -r '.items[] | select(.status.containerStatuses[]? | .state.waiting?.reason == "ImagePullBackOff" or .state.waiting?.reason == "ErrImagePull") | .metadata.name')
  if [ -n "$IMAGE_PULL_CP_PODS" ]; then
    CRITICAL_ISSUES=$((CRITICAL_ISSUES + $(echo "$IMAGE_PULL_CP_PODS" | wc -l)))
    echo "❌ CRITICAL: Control plane pods with image pull errors:"
    echo "$IMAGE_PULL_CP_PODS" | while read pod; do
      echo " - $pod"
    done
  fi

  # Check critical control plane components
  echo "Checking critical control plane components..."
  CRITICAL_COMPONENTS="kube-apiserver etcd kube-controller-manager kube-scheduler"
  for component in $CRITICAL_COMPONENTS; do
    COMPONENT_PODS=$($CLI get pods -n "$CONTROL_PLANE_NS" -l app="$component" -o json 2>/dev/null | jq -r '.items[].metadata.name')
    if [ -z "$COMPONENT_PODS" ]; then
      # Try alternative label
      COMPONENT_PODS=$($CLI get pods -n "$CONTROL_PLANE_NS" -o json | jq -r ".items[] | select(.metadata.name | startswith(\"$component\")) | .metadata.name")
    fi
    if [ -z "$COMPONENT_PODS" ]; then
      WARNING_ISSUES=$((WARNING_ISSUES + 1))
      echo "⚠️ WARNING: No pods found for component: $component"
    else
      # Check if any are not running
      for pod in $COMPONENT_PODS; do
        POD_STATUS=$($CLI get pod "$pod" -n "$CONTROL_PLANE_NS" -o json | jq -r '.status.phase')
        if [ "$POD_STATUS" != "Running" ]; then
          CRITICAL_ISSUES=$((CRITICAL_ISSUES + 1))
          echo "❌ CRITICAL: $component pod $pod is not Running (Status: $POD_STATUS)"
        fi
      done
    fi
  done
fi
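When problem pods are found, it can help to capture their describe output and previous-container logs under the .work directory for later inspection, in the spirit of the diagnostic commands suggested at the end of this document. A sketch, assuming the $CRASHLOOP_CP_PODS variable from the step above; multi-container pods may need an explicit container name:
# Optional (sketch): capture diagnostics for crash-looping control plane pods
DIAG_DIR=".work/hcp-health-check/diagnostics-${CLUSTER_NAME}"
mkdir -p "$DIAG_DIR"
for pod in $CRASHLOOP_CP_PODS; do
  $CLI describe pod "$pod" -n "$CONTROL_PLANE_NS" > "$DIAG_DIR/${pod}-describe.txt" 2>&1
  $CLI logs "$pod" -n "$CONTROL_PLANE_NS" --previous > "$DIAG_DIR/${pod}-previous.log" 2>&1 || true
done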
Verify etcd cluster health and performance:
echo "Checking Etcd Health..."
# Find etcd pods
ETCD_PODS=$($CLI get pods -n "$CONTROL_PLANE_NS" -l app=etcd -o json 2>/dev/null | jq -r '.items[].metadata.name')
if [ -z "$ETCD_PODS" ]; then
ETCD_PODS=$($CLI get pods -n "$CONTROL_PLANE_NS" -o json | jq -r '.items[] | select(.metadata.name | startswith("etcd")) | .metadata.name')
fi
if [ -n "$ETCD_PODS" ]; then
for pod in $ETCD_PODS; do
# Check etcd endpoint health
ETCD_HEALTH=$($CLI exec -n "$CONTROL_PLANE_NS" "$pod" -- etcdctl endpoint health 2>/dev/null || echo "failed")
if echo "$ETCD_HEALTH" | grep -q "unhealthy"; then
CRITICAL_ISSUES=$((CRITICAL_ISSUES + 1))
echo "❌ CRITICAL: Etcd pod $pod is unhealthy"
elif echo "$ETCD_HEALTH" | grep -q "failed"; then
WARNING_ISSUES=$((WARNING_ISSUES + 1))
echo "⚠️ WARNING: Could not check etcd health for pod $pod"
fi
done
else
WARNING_ISSUES=$((WARNING_ISSUES + 1))
echo "⚠️ WARNING: No etcd pods found"
fi
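Beyond endpoint health, etcdctl endpoint status reports the database size, leader, and raft term, which is useful context for the kind of latency warning shown in the example output later in this document. A sketch that reuses the exec pattern above; depending on the etcd image, explicit TLS flags or ETCDCTL_API=3 may be required:
# Optional (sketch): collect etcd endpoint status for additional context
for pod in $ETCD_PODS; do
  echo "ℹ️ etcd status for $pod:"
  $CLI exec -n "$CONTROL_PLANE_NS" "$pod" -- etcdctl endpoint status --write-out=table 2>/dev/null \
    || echo "⚠️ WARNING: Could not read etcd status from $pod"
done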
Verify NodePool health and scaling:
echo "Checking NodePool Status..."
# Get all NodePools for this HostedCluster
NODEPOOLS=$($CLI get nodepool -n "$NAMESPACE" -l hypershift.openshift.io/hostedcluster="$CLUSTER_NAME" -o json | jq -r '.items[].metadata.name')
if [ -z "$NODEPOOLS" ]; then
WARNING_ISSUES=$((WARNING_ISSUES + 1))
echo "⚠️ WARNING: No NodePools found for HostedCluster $CLUSTER_NAME"
else
for nodepool in $NODEPOOLS; do
# Get NodePool status
NP_REPLICAS=$($CLI get nodepool "$nodepool" -n "$NAMESPACE" -o json | jq -r '.spec.replicas // 0')
NP_READY=$($CLI get nodepool "$nodepool" -n "$NAMESPACE" -o json | jq -r '.status.replicas // 0')
NP_AVAILABLE=$($CLI get nodepool "$nodepool" -n "$NAMESPACE" -o json | jq -r '.status.conditions[] | select(.type=="Ready") | .status')
echo " NodePool: $nodepool [Ready: $NP_READY/$NP_REPLICAS]"
if [ "$NP_AVAILABLE" != "True" ]; then
CRITICAL_ISSUES=$((CRITICAL_ISSUES + 1))
echo "❌ CRITICAL: NodePool $nodepool is not Ready"
$CLI get nodepool "$nodepool" -n "$NAMESPACE" -o json | jq -r '.status.conditions[] | select(.type=="Ready") | " Reason: \(.reason)\n Message: \(.message)"'
fi
if [ "$NP_READY" != "$NP_REPLICAS" ]; then
WARNING_ISSUES=$((WARNING_ISSUES + 1))
echo "⚠️ WARNING: NodePool $nodepool has mismatched replicas (Ready: $NP_READY, Desired: $NP_REPLICAS)"
fi
# Check for auto-scaling
AUTOSCALING=$($CLI get nodepool "$nodepool" -n "$NAMESPACE" -o json | jq -r '.spec.autoScaling // empty')
if [ -n "$AUTOSCALING" ]; then
MIN=$($CLI get nodepool "$nodepool" -n "$NAMESPACE" -o json | jq -r '.spec.autoScaling.min')
MAX=$($CLI get nodepool "$nodepool" -n "$NAMESPACE" -o json | jq -r '.spec.autoScaling.max')
echo " Auto-scaling enabled: Min=$MIN, Max=$MAX"
fi
done
fi
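If --verbose is set, it can also be worth printing every NodePool condition rather than only Ready, since other conditions (for example, ones reporting update or validation progress) often explain a replica mismatch. A sketch of that extra output; the VERBOSE variable is assumed to be populated during argument parsing and is not defined in the snippets above:
# Optional verbose output (sketch): dump all conditions for each NodePool
if [ "$VERBOSE" == "true" ]; then
  for nodepool in $NODEPOOLS; do
    echo "ℹ️ Conditions for NodePool $nodepool:"
    $CLI get nodepool "$nodepool" -n "$NAMESPACE" -o json \
      | jq -r '.status.conditions[] | " \(.type): \(.status) (\(.reason // "N/A"))"'
  done
fi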
Validate infrastructure components (AWS/Azure specific):
echo "Checking Infrastructure Status..."
# Get infrastructure platform
PLATFORM=$($CLI get hostedcluster "$CLUSTER_NAME" -n "$NAMESPACE" -o json | jq -r '.spec.platform.type')
echo " Platform: $PLATFORM"
# Check for infrastructure-related conditions
INFRA_READY=$($CLI get hostedcluster "$CLUSTER_NAME" -n "$NAMESPACE" -o json | jq -r '.status.conditions[] | select(.type=="InfrastructureReady") | .status')
if [ "$INFRA_READY" != "True" ]; then
CRITICAL_ISSUES=$((CRITICAL_ISSUES + 1))
echo "❌ CRITICAL: Infrastructure is not ready"
$CLI get hostedcluster "$CLUSTER_NAME" -n "$NAMESPACE" -o json | jq -r '.status.conditions[] | select(.type=="InfrastructureReady") | " Reason: \(.reason)\n Message: \(.message)"'
fi
# Platform-specific checks
if [ "$PLATFORM" == "AWS" ]; then
# Check AWS-specific resources
echo " Performing AWS-specific checks..."
# Check if endpoint service is configured
ENDPOINT_SERVICE=$($CLI get hostedcluster "$CLUSTER_NAME" -n "$NAMESPACE" -o json | jq -r '.status.platform.aws.endpointService // "not-found"')
if [ "$ENDPOINT_SERVICE" == "not-found" ]; then
WARNING_ISSUES=$((WARNING_ISSUES + 1))
echo "⚠️ WARNING: AWS endpoint service not configured"
fi
fi
Verify network connectivity and ingress configuration:
echo "Checking Network and Ingress..."
# Check if connectivity is healthy (for private clusters)
CONNECTIVITY_PODS=$($CLI get pods -n "$CONTROL_PLANE_NS" -l app=connectivity-server -o json 2>/dev/null | jq -r '.items[].metadata.name')
if [ -n "$CONNECTIVITY_PODS" ]; then
for pod in $CONNECTIVITY_PODS; do
POD_STATUS=$($CLI get pod "$pod" -n "$CONTROL_PLANE_NS" -o json | jq -r '.status.phase')
if [ "$POD_STATUS" != "Running" ]; then
CRITICAL_ISSUES=$((CRITICAL_ISSUES + 1))
echo "❌ CRITICAL: Connectivity server pod $pod is not Running"
fi
done
fi
# Check ingress/router pods
ROUTER_PODS=$($CLI get pods -n "$CONTROL_PLANE_NS" -l app=router -o json 2>/dev/null | jq -r '.items[].metadata.name')
if [ -z "$ROUTER_PODS" ]; then
ROUTER_PODS=$($CLI get pods -n "$CONTROL_PLANE_NS" -o json | jq -r '.items[] | select(.metadata.name | contains("router") or contains("ingress")) | .metadata.name')
fi
if [ -n "$ROUTER_PODS" ]; then
for pod in $ROUTER_PODS; do
POD_STATUS=$($CLI get pod "$pod" -n "$CONTROL_PLANE_NS" -o json | jq -r '.status.phase')
if [ "$POD_STATUS" != "Running" ]; then
WARNING_ISSUES=$((WARNING_ISSUES + 1))
echo "⚠️ WARNING: Router/Ingress pod $pod is not Running"
fi
done
fi
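The JSON report later in this document exposes a konnectivityHealthy field, and depending on the HyperShift version the tunnel pods may be named konnectivity-* rather than labeled app=connectivity-server, in which case the connectivity check above finds nothing. A hedged fallback sketch that also looks for konnectivity-named pods:
# Fallback (sketch): some HyperShift versions name the tunnel pods "konnectivity-*"
if [ -z "$CONNECTIVITY_PODS" ]; then
  CONNECTIVITY_PODS=$($CLI get pods -n "$CONTROL_PLANE_NS" -o json \
    | jq -r '.items[] | select(.metadata.name | contains("konnectivity")) | .metadata.name')
  for pod in $CONNECTIVITY_PODS; do
    POD_STATUS=$($CLI get pod "$pod" -n "$CONTROL_PLANE_NS" -o json | jq -r '.status.phase')
    if [ "$POD_STATUS" != "Running" ]; then
      CRITICAL_ISSUES=$((CRITICAL_ISSUES + 1))
      echo "❌ CRITICAL: Konnectivity pod $pod is not Running"
    fi
  done
fi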
Look for recent warning/error events:
echo "Checking Recent Events..."
# Get recent warning events for HostedCluster
HC_EVENTS=$($CLI get events -n "$NAMESPACE" --field-selector involvedObject.name="$CLUSTER_NAME",involvedObject.kind=HostedCluster,type=Warning --sort-by='.lastTimestamp' | tail -10)
if [ -n "$HC_EVENTS" ]; then
echo "⚠️ Recent Warning Events for HostedCluster:"
echo "$HC_EVENTS"
fi
# Get recent warning events in control plane namespace
CP_EVENTS=$($CLI get events -n "$CONTROL_PLANE_NS" --field-selector type=Warning --sort-by='.lastTimestamp' 2>/dev/null | tail -10)
if [ -n "$CP_EVENTS" ]; then
echo "⚠️ Recent Warning Events in Control Plane:"
echo "$CP_EVENTS"
fi
# Get recent events for NodePools
for nodepool in $NODEPOOLS; do
NP_EVENTS=$($CLI get events -n "$NAMESPACE" --field-selector involvedObject.name="$nodepool",involvedObject.kind=NodePool,type=Warning --sort-by='.lastTimestamp' 2>/dev/null | tail -5)
if [ -n "$NP_EVENTS" ]; then
echo "⚠️ Recent Warning Events for NodePool $nodepool:"
echo "$NP_EVENTS"
fi
done
Verify certificate validity:
echo "Checking Certificate Status..."
# Check certificate expiration warnings
CERT_CONDITIONS=$($CLI get hostedcluster "$CLUSTER_NAME" -n "$NAMESPACE" -o json | jq -r '.status.conditions[] | select(.type | contains("Certificate")) | "\(.type): \(.status) - \(.message // "N/A")"')
if [ -n "$CERT_CONDITIONS" ]; then
echo "$CERT_CONDITIONS" | while read line; do
if echo "$line" | grep -q "False"; then
WARNING_ISSUES=$((WARNING_ISSUES + 1))
echo "⚠️ WARNING: $line"
fi
done
fi
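The check above only surfaces what the HostedCluster conditions report. For an independent look at expiry dates, one approach is to decode a serving certificate secret from the control plane namespace with openssl. Secret names vary across HyperShift versions, so the name below is an illustrative assumption, not a guaranteed resource:
# Illustrative (sketch): inspect a serving certificate's expiry date
CERT_SECRET="kube-apiserver-server-cert"   # assumed name; adjust to your environment
if $CLI get secret "$CERT_SECRET" -n "$CONTROL_PLANE_NS" &> /dev/null; then
  $CLI get secret "$CERT_SECRET" -n "$CONTROL_PLANE_NS" -o jsonpath='{.data.tls\.crt}' \
    | base64 -d | openssl x509 -noout -enddate
fi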
Create a summary of findings:
echo ""
echo "==============================================="
echo "HCP Cluster Health Check Summary"
echo "==============================================="
echo "Cluster Name: $CLUSTER_NAME"
echo "Namespace: $NAMESPACE"
echo "Control Plane Namespace: $CONTROL_PLANE_NS"
echo "Platform: $PLATFORM"
echo "Version: $HC_VERSION"
echo "Check Time: $(date)"
echo ""
echo "Results:"
echo " Critical Issues: $CRITICAL_ISSUES"
echo " Warnings: $WARNING_ISSUES"
echo ""
if [ $CRITICAL_ISSUES -eq 0 ] && [ $WARNING_ISSUES -eq 0 ]; then
  echo "✅ OVERALL STATUS: HEALTHY - No issues detected"
  exit 0
elif [ $CRITICAL_ISSUES -gt 0 ]; then
  echo "❌ OVERALL STATUS: CRITICAL - Immediate attention required"
  exit 1
else
  echo "⚠️ OVERALL STATUS: WARNING - Monitoring recommended"
  exit 0
fi
If --output-format json is specified, export findings as JSON:
{
  "cluster": {
    "name": "my-hcp-cluster",
    "namespace": "clusters",
    "controlPlaneNamespace": "clusters-my-hcp-cluster",
    "platform": "AWS",
    "version": "4.21.0",
    "checkTime": "2025-11-11T12:00:00Z"
  },
  "summary": {
    "criticalIssues": 0,
    "warnings": 3,
    "overallStatus": "WARNING"
  },
  "findings": {
    "hostedCluster": {
      "available": true,
      "progressing": true,
      "degraded": false,
      "infrastructureReady": true
    },
    "controlPlane": {
      "failedPods": [],
      "crashLoopingPods": [],
      "highRestartPods": ["kube-controller-manager-xxx"],
      "imagePullErrors": []
    },
    "etcd": {
      "healthy": true,
      "pods": ["etcd-0", "etcd-1", "etcd-2"]
    },
    "nodePools": {
      "total": 2,
      "ready": 2,
      "details": [
        {
          "name": "workers",
          "replicas": 3,
          "ready": 3,
          "autoScaling": {
            "enabled": true,
            "min": 2,
            "max": 5
          }
        }
      ]
    },
    "network": {
      "konnectivityHealthy": true,
      "ingressHealthy": true
    },
    "events": {
      "recentWarnings": 5
    }
  }
}
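A minimal sketch of how the summary portion of this JSON could be assembled with jq -n at the end of the script, assuming the counters and variables gathered in the earlier steps; the OVERALL_STATUS variable and the .json report path are illustrative:
# Sketch: emit a JSON summary when --output-format json is requested
OVERALL_STATUS="HEALTHY"
if [ "$CRITICAL_ISSUES" -gt 0 ]; then
  OVERALL_STATUS="CRITICAL"
elif [ "$WARNING_ISSUES" -gt 0 ]; then
  OVERALL_STATUS="WARNING"
fi
jq -n \
  --arg name "$CLUSTER_NAME" \
  --arg namespace "$NAMESPACE" \
  --arg cpns "$CONTROL_PLANE_NS" \
  --arg platform "$PLATFORM" \
  --arg version "$HC_VERSION" \
  --arg status "$OVERALL_STATUS" \
  --argjson critical "$CRITICAL_ISSUES" \
  --argjson warnings "$WARNING_ISSUES" \
  '{cluster: {name: $name, namespace: $namespace, controlPlaneNamespace: $cpns, platform: $platform, version: $version},
    summary: {criticalIssues: $critical, warnings: $warnings, overallStatus: $status}}' \
  > "${REPORT_FILE%.txt}.json"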
/hcp:cluster-health-check my-cluster
Performs a health check on cluster "my-cluster" in the default "clusters" namespace.
Example output for a healthy cluster:
HCP Cluster Health Check: my-cluster (namespace: clusters)
================================================================================
OVERALL STATUS: ✅ HEALTHY
COMPONENT STATUS:
✅ HostedCluster: Available
✅ Control Plane: All pods running (0 restarts)
✅ NodePool: 3/3 nodes ready
✅ Infrastructure: AWS resources validated
✅ Network: Operational
✅ Storage/Etcd: Healthy
No critical issues found. Cluster is operating normally.
DIAGNOSTIC COMMANDS:
Run these commands for detailed information:
kubectl get hostedcluster my-cluster -n clusters -o yaml
kubectl get pods -n clusters-my-cluster
kubectl get nodepool -n clusters
Example output with warnings:
HCP Cluster Health Check: test-cluster (namespace: clusters)
================================================================================
OVERALL STATUS: ⚠️ WARNING
COMPONENT STATUS:
✅ HostedCluster: Available
⚠️ Control Plane: 1 pod restarting
✅ NodePool: 2/2 nodes ready
✅ Infrastructure: Validated
✅ Network: Operational
⚠️ Storage/Etcd: High latency detected
ISSUES FOUND:
[WARNING] Control Plane Pod Restarting
- Component: kube-controller-manager
- Location: clusters-test-cluster namespace
- Restarts: 3 in the last hour
- Impact: May cause temporary API instability
- Recommended Action:
kubectl logs -n clusters-test-cluster kube-controller-manager-xxx --previous
Check for configuration issues or resource constraints
[WARNING] Etcd High Latency
- Component: etcd
- Metrics: Backend commit duration > 100ms
- Impact: Slower cluster operations
- Recommended Action:
Check management cluster node performance
Review etcd disk I/O with: kubectl exec -n clusters-test-cluster etcd-0 -- etcdctl endpoint status
DIAGNOSTIC COMMANDS:
1. Check HostedCluster details:
kubectl get hostedcluster test-cluster -n clusters -o yaml
2. View control plane events:
kubectl get events -n clusters-test-cluster --sort-by='.lastTimestamp' | tail -20
3. Check etcd metrics:
kubectl exec -n clusters-test-cluster etcd-0 -- etcdctl endpoint health
/hcp:cluster-health-check production-cluster prod-clusters
Performs a health check on cluster "production-cluster" in the "prod-clusters" namespace.
/hcp:cluster-health-check my-cluster --verbose
Performs a health check with detailed, verbose output.
/hcp:cluster-health-check my-cluster --output-format json
Performs a health check and prints the results as machine-readable JSON.
The command returns a structured health report containing the overall cluster status, per-component status, a list of detected issues with recommended actions, and suggested diagnostic commands. It exits with code 0 for healthy or warning states and 1 when critical issues are found.
Troubleshooting common issues:
Degraded or unavailable cluster operators
Symptoms: Cluster operators showing Degraded=True or Available=False
Investigation:
oc get clusteroperator <operator-name> -o yaml
oc logs -n openshift-<operator-namespace> -l app=<operator-name>
Remediation: Check operator logs and events for specific errors
Nodes NotReady
Symptoms: Nodes in NotReady state
Investigation:
oc describe node <node-name>
oc get events --field-selector involvedObject.name=<node-name>
Remediation: Common causes include network issues, disk pressure, or kubelet problems
Pods in CrashLoopBackOff
Symptoms: Pods continuously restarting
Investigation:
oc logs <pod-name> -n <namespace> --previous
oc describe pod <pod-name> -n <namespace>
Remediation: Check application logs, resource limits, and configuration
Image pull failures
Symptoms: Pods unable to pull container images
Investigation:
oc describe pod <pod-name> -n <namespace>
Remediation: Verify image name, registry credentials, and network connectivity
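As a quick follow-up for image pull failures, it can help to confirm which pull secrets the pod references and which image and waiting reason are being reported; a sketch using standard queries, with placeholder names as in the investigation commands above:
# Which pull secrets does the pod reference? (placeholder names)
oc get pod <pod-name> -n <namespace> -o jsonpath='{.spec.imagePullSecrets[*].name}{"\n"}'
# Which image is failing, and with what waiting reason?
oc get pod <pod-name> -n <namespace> -o jsonpath='{range .status.containerStatuses[*]}{.image}{" => "}{.state.waiting.reason}{"\n"}{end}'
# Recent events for the pod often include the registry error message
oc get events -n <namespace> --field-selector involvedObject.name=<pod-name> --sort-by='.lastTimestamp' | tail -5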