Check etcd cluster health, member status, and identify issues
Performs comprehensive health check of the etcd cluster, verifying member status, leadership, connectivity, and identifying issues like database size or disk space problems. Use this command when diagnosing control plane issues or verifying cluster stability.
/plugin marketplace add openshift-eng/ai-helpers/plugin install etcd@ai-helpers[--verbose]etcd:health-check
/etcd:health-check [--verbose]
The health-check command performs a comprehensive health check of the etcd cluster in an OpenShift environment. It examines etcd member status, cluster health, leadership, connectivity, and identifies potential issues that could affect cluster stability.
Etcd is the critical key-value store that holds all cluster state for Kubernetes/OpenShift. Issues related to etcd can cause cluster-wide failures, so monitoring its health is essential.
This command is useful for:
Before using this command, ensure you have:
OpenShift CLI (oc)
oc versionActive cluster connection
oc whoamiCluster admin permissions
oc auth can-i get pods -n openshift-etcdHealthy etcd namespace
The command performs the following checks:
Check if oc CLI is available and cluster is accessible:
if ! command -v oc &> /dev/null; then
echo "Error: oc CLI not found. Please install OpenShift CLI."
exit 1
fi
if ! oc whoami &> /dev/null; then
echo "Error: Not connected to an OpenShift cluster."
exit 1
fi
Verify the etcd namespace exists and get pod status:
echo "Checking etcd namespace and pods..."
if ! oc get namespace openshift-etcd &> /dev/null; then
echo "CRITICAL: openshift-etcd namespace not found"
exit 1
fi
# Get etcd pod status
ETCD_PODS=$(oc get pods -n openshift-etcd -l app=etcd -o json)
TOTAL_PODS=$(echo "$ETCD_PODS" | jq '.items | length')
RUNNING_PODS=$(echo "$ETCD_PODS" | jq '[.items[] | select(.status.phase == "Running")] | length')
echo "Etcd pods: $RUNNING_PODS/$TOTAL_PODS running"
if [ "$RUNNING_PODS" -eq 0 ]; then
echo "CRITICAL: No etcd pods are running"
exit 1
fi
# List all etcd pods with status
echo ""
echo "Etcd Pod Status:"
oc get pods -n openshift-etcd -l app=etcd -o custom-columns=NAME:.metadata.name,STATUS:.status.phase,READY:.status.containerStatuses[0].ready,RESTARTS:.status.containerStatuses[0].restartCount,NODE:.spec.nodeName
Use etcdctl to check cluster health from each running etcd pod:
echo ""
echo "Checking etcd cluster health..."
# Get the first running etcd pod
ETCD_POD=$(oc get pods -n openshift-etcd -l app=etcd --field-selector=status.phase=Running -o jsonpath='{.items[0].metadata.name}')
if [ -z "$ETCD_POD" ]; then
echo "CRITICAL: No running etcd pod found"
exit 1
fi
# Check cluster health
HEALTH_OUTPUT=$(oc exec -n openshift-etcd "$ETCD_POD" -c etcdctl -- etcdctl endpoint health --cluster -w table 2>&1)
if echo "$HEALTH_OUTPUT" | grep -q "is healthy"; then
echo "Cluster Health Status:"
echo "$HEALTH_OUTPUT"
else
echo "CRITICAL: Etcd cluster health check failed"
echo "$HEALTH_OUTPUT"
exit 1
fi
List all etcd members and verify quorum:
echo ""
echo "Checking etcd member list..."
MEMBER_LIST=$(oc exec -n openshift-etcd "$ETCD_POD" -c etcdctl -- etcdctl member list -w table 2>&1)
echo "Etcd Members:"
echo "$MEMBER_LIST"
# Count members
MEMBER_COUNT=$(oc exec -n openshift-etcd "$ETCD_POD" -c etcdctl -- etcdctl member list -w json 2>/dev/null | jq '.members | length')
echo ""
echo "Total members: $MEMBER_COUNT"
if [ "$MEMBER_COUNT" -lt 3 ]; then
echo "WARNING: Etcd cluster has less than 3 members (quorum at risk)"
fi
# Check for unstarted members
UNSTARTED=$(echo "$MEMBER_LIST" | grep "unstarted" | wc -l)
if [ "$UNSTARTED" -gt 0 ]; then
echo "WARNING: $UNSTARTED member(s) in unstarted state"
fi
Verify there is a healthy leader:
echo ""
echo "Checking etcd leadership..."
ENDPOINT_STATUS=$(oc exec -n openshift-etcd "$ETCD_POD" -c etcdctl -- etcdctl endpoint status --cluster -w table 2>&1)
echo "Endpoint Status:"
echo "$ENDPOINT_STATUS"
# Check if there's a leader
if echo "$ENDPOINT_STATUS" | grep -q "true"; then
LEADER_ENDPOINT=$(echo "$ENDPOINT_STATUS" | grep "true" | awk '{print $2}')
echo ""
echo "Leader: $LEADER_ENDPOINT"
else
echo "CRITICAL: No etcd leader elected"
exit 1
fi
Check database size and fragmentation:
echo ""
echo "Checking etcd database size..."
# Get database size from endpoint status
DB_SIZE=$(oc exec -n openshift-etcd "$ETCD_POD" -c etcdctl -- etcdctl endpoint status --cluster -w json 2>/dev/null)
echo "$DB_SIZE" | jq -r '.[] | "Endpoint: \(.Endpoint) | DB Size: \(.Status.dbSize) bytes | DB Size in Use: \(.Status.dbSizeInUse) bytes"'
# Calculate fragmentation percentage
echo "$DB_SIZE" | jq -r '.[] |
if .Status.dbSize > 0 then
"Fragmentation: \(((.Status.dbSize - .Status.dbSizeInUse) * 100 / .Status.dbSize) | floor)%"
else
"Fragmentation: N/A"
end'
# Warn if database is too large
MAX_DB_SIZE=$((8 * 1024 * 1024 * 1024)) # 8GB threshold
CURRENT_SIZE=$(echo "$DB_SIZE" | jq -r '.[0].Status.dbSize')
if [ "$CURRENT_SIZE" -gt "$MAX_DB_SIZE" ]; then
echo "WARNING: Etcd database size ($CURRENT_SIZE bytes) exceeds recommended maximum (8GB)"
echo "Consider defragmentation or checking for excessive key growth"
fi
Verify disk space on nodes running etcd:
echo ""
echo "Checking disk space on etcd nodes..."
for pod in $(oc get pods -n openshift-etcd -l app=etcd --field-selector=status.phase=Running -o jsonpath='{.items[*].metadata.name}'); do
echo "Pod: $pod"
oc exec -n openshift-etcd "$pod" -c etcd -- df -h /var/lib/etcd | tail -1
# Get disk usage percentage
DISK_USAGE=$(oc exec -n openshift-etcd "$pod" -c etcd -- df -h /var/lib/etcd | tail -1 | awk '{print $5}' | sed 's/%//')
if [ "$DISK_USAGE" -gt 80 ]; then
echo "WARNING: Disk usage on $pod is ${DISK_USAGE}% (threshold: 80%)"
fi
echo ""
done
Check recent logs for errors or warnings:
echo ""
echo "Checking recent etcd logs for errors..."
RECENT_ERRORS=$(oc logs -n openshift-etcd "$ETCD_POD" -c etcd --tail=100 | grep -i "error\|warn\|fatal" | tail -10)
if [ -n "$RECENT_ERRORS" ]; then
echo "Recent errors/warnings found:"
echo "$RECENT_ERRORS"
else
echo "No recent errors in etcd logs"
fi
If verbose mode is enabled, check performance metrics:
if [ "$VERBOSE" = "true" ]; then
echo ""
echo "Checking etcd performance metrics..."
# Get metrics from etcd pod
METRICS=$(oc exec -n openshift-etcd "$ETCD_POD" -c etcd -- curl -s http://localhost:2379/metrics 2>/dev/null)
# Parse key metrics
echo "Backend Commit Duration (p99):"
echo "$METRICS" | grep "etcd_disk_backend_commit_duration_seconds" | grep "quantile=\"0.99\"" | head -1
echo ""
echo "WAL Fsync Duration (p99):"
echo "$METRICS" | grep "etcd_disk_wal_fsync_duration_seconds" | grep "quantile=\"0.99\"" | head -1
echo ""
echo "Leader Changes:"
echo "$METRICS" | grep "etcd_server_leader_changes_seen_total" | head -1
fi
Create a summary of findings:
echo ""
echo "==============================================="
echo "Etcd Health Check Summary"
echo "==============================================="
echo "Check Time: $(date)"
echo "Cluster: $(oc whoami --show-server)"
echo ""
echo "Results:"
echo " Etcd Pods Running: $RUNNING_PODS/$TOTAL_PODS"
echo " Cluster Members: $MEMBER_COUNT"
echo " Leader Elected: Yes"
echo " Cluster Health: Healthy"
echo ""
if [ "$WARNINGS" -gt 0 ]; then
echo "Status: WARNING - Found $WARNINGS warnings requiring attention"
exit 0
else
echo "Status: HEALTHY - All checks passed"
exit 0
fi
The command returns different exit codes:
Output Format:
/etcd:health-check
Output:
Checking etcd namespace and pods...
Etcd pods: 3/3 running
Etcd Pod Status:
NAME STATUS READY RESTARTS NODE
etcd-ip-10-0-21-125.us-east-2... Running true 0 ip-10-0-21-125
etcd-ip-10-0-43-249.us-east-2... Running true 0 ip-10-0-43-249
etcd-ip-10-0-68-109.us-east-2... Running true 0 ip-10-0-68-109
Checking etcd cluster health...
Cluster Health Status:
+------------------------------------------+--------+
| ENDPOINT | HEALTH |
+------------------------------------------+--------+
| https://10.0.21.125:2379 | true |
| https://10.0.43.249:2379 | true |
| https://10.0.68.109:2379 | true |
+------------------------------------------+--------+
Checking etcd member list...
Etcd Members:
+------------------+---------+------------------------+
| ID | STATUS | NAME |
+------------------+---------+------------------------+
| 3a2b1c4d5e6f7890 | started | ip-10-0-21-125 |
| 4b3c2d5e6f708901 | started | ip-10-0-43-249 |
| 5c4d3e6f70890123 | started | ip-10-0-68-109 |
+------------------+---------+------------------------+
Total members: 3
Checking etcd leadership...
Leader: https://10.0.21.125:2379
===============================================
Etcd Health Check Summary
===============================================
Status: HEALTHY - All checks passed
/etcd:health-check --verbose
Symptoms: Cluster shows no leader elected
Investigation:
oc logs -n openshift-etcd <etcd-pod> -c etcd | grep -i "leader"
oc get events -n openshift-etcd
Remediation:
Symptoms: Database size exceeds 8GB
Investigation:
oc exec -n openshift-etcd <etcd-pod> -c etcdctl -- etcdctl endpoint status -w table
Remediation:
/etcd:defrag (if command exists)Symptoms: Disk usage > 80% on etcd data directory
Investigation:
oc exec -n openshift-etcd <etcd-pod> -c etcd -- df -h /var/lib/etcd
Remediation:
Symptoms: Member shows "unstarted" status
Investigation:
oc logs -n openshift-etcd <etcd-pod> -c etcd
oc describe pod -n openshift-etcd <etcd-pod>
Remediation:
/etcd:analyze-performance