Analyze etcd performance metrics, latency, and identify bottlenecks
Analyzes etcd performance metrics, logs, and cluster health to identify bottlenecks and provide optimization recommendations.
/plugin marketplace add openshift-eng/ai-helpers
/plugin install etcd@ai-helpers
/etcd:analyze-performance [--duration <minutes>]
The analyze-performance command analyzes etcd performance metrics to identify latency issues, slow operations, and potential bottlenecks. It examines disk performance, commit latency, and network latency, and provides optimization recommendations.
Etcd performance is critical for cluster responsiveness. Slow etcd operations can cause:
This command is useful for:
Before using this command, ensure you have:
OpenShift CLI (oc)
oc version
Active cluster connection
oc whoami
Cluster admin permissions
oc auth can-i get pods -n openshift-etcd
Running etcd pods
oc get pods -n openshift-etcd -l app=etcd
--duration 15 for 15-minute window
The command performs the following analysis:
if ! command -v oc &> /dev/null; then
echo "Error: oc CLI not found"
exit 1
fi
if ! oc whoami &> /dev/null; then
echo "Error: Not connected to cluster"
exit 1
fi
# Parse duration argument (default: 5 minutes)
DURATION=5
if [[ "$1" == "--duration" ]] && [[ -n "$2" ]]; then
DURATION=$2
fi
echo "Analyzing etcd performance (last $DURATION minutes)..."
ETCD_POD=$(oc get pods -n openshift-etcd -l app=etcd --field-selector=status.phase=Running -o jsonpath='{.items[0].metadata.name}')
if [ -z "$ETCD_POD" ]; then
echo "Error: No running etcd pod found"
exit 1
fi
echo "Using etcd pod: $ETCD_POD"
echo ""
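The argument handling above assigns whatever follows --duration straight to DURATION, so a typo such as --duration abc only surfaces later, when oc logs rejects the --since value. A minimal sketch of stricter parsing follows. The helper name parse_duration is hypothetical and not part of the command, and unlike the original it treats a missing value after --duration as an error rather than falling back to the default.

```shell
# parse_duration: hypothetical helper, not part of the original command.
# Prints the analysis window in minutes (default 5) or fails on bad input.
parse_duration() {
    duration=5
    if [ "${1:-}" = "--duration" ]; then
        case "${2:-}" in
            '' | *[!0-9]*)
                # Empty value or anything containing a non-digit
                echo "Error: --duration expects a whole number of minutes" >&2
                return 1
                ;;
            *) duration=$2 ;;
        esac
    fi
    if [ "$duration" -eq 0 ]; then
        echo "Error: --duration must be greater than zero" >&2
        return 1
    fi
    echo "$duration"
}
```

In the script this could be called as DURATION=$(parse_duration "$@") || exit 1, which keeps the rest of the analysis unchanged.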
Get database statistics using etcdctl:
echo "==============================================="
echo "DATABASE PERFORMANCE ANALYSIS"
echo "==============================================="
echo ""
echo "Fetching database statistics..."
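# Get database sizes from endpoint status
DB_STATUS=$(oc exec -n openshift-etcd "$ETCD_POD" -c etcdctl -- etcdctl endpoint status --cluster -w json 2>/dev/null)
# Stop early if etcdctl returned nothing; every later jq filter depends on this JSON
if [ -z "$DB_STATUS" ]; then
echo "Error: Failed to retrieve endpoint status from $ETCD_POD"
exit 1
fi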
echo "Database Statistics:"
echo "$DB_STATUS" | jq -r '.[] |
"Endpoint: \(.Endpoint)
Version: \(.Status.version)
DB Size: \(.Status.dbSize) bytes (\((.Status.dbSize / 1024 / 1024) | floor)MB)
DB In Use: \(.Status.dbSizeInUse) bytes (\((.Status.dbSizeInUse / 1024 / 1024) | floor)MB)
Revision: \(.Status.header.revision)
Raft Index: \(.Status.raftIndex)
Raft Term: \(.Status.raftTerm)
Leader: \(if .Status.leader == .Status.header.member_id then "YES" else "NO" end)
"'
echo ""
echo "Fragmentation Analysis:"
echo "$DB_STATUS" | jq -r '.[] |
if .Status.dbSize > 0 then
((.Status.dbSize - .Status.dbSizeInUse) * 100 / .Status.dbSize) as $frag |
"Endpoint: \(.Endpoint)
Fragmentation: \($frag | floor)%" +
if $frag > 50 then
" - WARNING: High fragmentation detected, consider defragmentation"
elif $frag > 30 then
" - NOTICE: Moderate fragmentation"
else
" - OK"
end
else
"Endpoint: \(.Endpoint)
Fragmentation: N/A"
end'
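The fragmentation figure is the share of allocated database space that is not in use: (dbSize - dbSizeInUse) * 100 / dbSize. The sketch below reproduces that arithmetic in plain shell so it can be checked without jq. The helper names frag_percent and frag_level are hypothetical. Because the thresholds are applied to the truncated integer, a value such as 50.4% is labelled NOTICE here, whereas the jq filter above compares the unrounded value and would label it WARNING.

```shell
# frag_percent: hypothetical helper. Arguments are dbSize and dbSizeInUse in bytes.
# Prints the whole-number percentage of allocated space that is not in use.
frag_percent() {
    db_size=$1
    in_use=$2
    if [ "$db_size" -le 0 ]; then
        echo "N/A"
        return 0
    fi
    # Integer division truncates, which matches `floor` for non-negative values
    echo $(( (db_size - in_use) * 100 / db_size ))
}

# frag_level: maps a whole-number percentage to the labels used in the report above.
frag_level() {
    if [ "$1" -gt 50 ]; then
        echo "WARNING"
    elif [ "$1" -gt 30 ]; then
        echo "NOTICE"
    else
        echo "OK"
    fi
}
```

For a member reporting dbSize 94941184 and dbSizeInUse 51789824, frag_percent prints 45 and frag_level 45 prints NOTICE.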
Verify etcd cluster health:
echo ""
echo "==============================================="
echo "CLUSTER HEALTH"
echo "==============================================="
echo ""
oc exec -n openshift-etcd "$ETCD_POD" -c etcdctl -- etcdctl endpoint health --cluster 2>/dev/null || echo "Health check failed"
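The health check prints a "took = <duration>" value for each endpoint, which the command displays but does not evaluate. As a rough sketch, the helper below lists endpoints whose health proposal exceeded a limit given in milliseconds. The name slow_health_endpoints is hypothetical, any limit passed to it is an illustrative choice rather than an etcd recommendation, and the parsing assumes simple Go-style durations ending in ms, s, or a microsecond suffix; compound values such as 1m2s are not handled.

```shell
# slow_health_endpoints: hypothetical helper, not part of the original command.
# Reads `etcdctl endpoint health` output on stdin and prints each endpoint
# whose "took = ..." value is greater than $1 milliseconds.
slow_health_endpoints() {
    awk -v limit_ms="$1" '
        /took = / {
            d = $NF                                       # e.g. 9.848973ms or 1.5s
            if (d ~ /ms$/)          ms = d + 0            # milliseconds
            else if (d ~ /[0-9]s$/) ms = (d + 0) * 1000   # seconds
            else                    ms = (d + 0) / 1000   # assumed microseconds
            if (ms > limit_ms) print $1
        }'
}
```

It could be fed from the same call as above, for example by piping the etcdctl endpoint health --cluster output into slow_health_endpoints 10. Lines for unhealthy endpoints carry no "took" value and are skipped.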
Parse etcd logs for performance warnings:
echo ""
echo "==============================================="
echo "LOG ANALYSIS (Last $DURATION minutes)"
echo "==============================================="
echo ""
echo "Searching for performance-related warnings..."
# Get recent logs
LOGS=$(oc logs -n openshift-etcd "$ETCD_POD" -c etcd --since="${DURATION}m" 2>/dev/null)
# Count slow operations
SLOW_OPS=$(echo "$LOGS" | grep -i "slow" | wc -l)
echo "Slow operations logged: $SLOW_OPS"
if [ "$SLOW_OPS" -gt 0 ]; then
echo ""
echo "Recent slow operations (last 10):"
echo "$LOGS" | grep -i "slow" | tail -10
fi
echo ""
# Check for disk warnings
DISK_WARNINGS=$(echo "$LOGS" | grep -iE "disk|fdatasync|fsync" | grep -iE "slow|took|latency" | wc -l)
echo "Disk-related warnings: $DISK_WARNINGS"
if [ "$DISK_WARNINGS" -gt 0 ]; then
echo ""
echo "Disk performance warnings:"
echo "$LOGS" | grep -iE "disk|fdatasync|fsync" | grep -iE "slow|took|latency" | tail -5
fi
echo ""
# Check for apply warnings
APPLY_WARNINGS=$(echo "$LOGS" | grep -iE "apply.*took|slow.*apply" | wc -l)
echo "Apply operation warnings: $APPLY_WARNINGS"
if [ "$APPLY_WARNINGS" -gt 0 ]; then
echo ""
echo "Apply warnings:"
echo "$LOGS" | grep -iE "apply.*took|slow.*apply" | tail -5
fi
echo ""
# Check for compaction info
echo "Recent compaction operations:"
echo "$LOGS" | grep "finished scheduled compaction" | tail -3
if ! echo "$LOGS" | grep -q "finished scheduled compaction"; then
echo " No compaction operations in this time window"
fi
echo ""
# Check for snapshot operations
echo "Snapshot operations:"
SNAPSHOTS=$(echo "$LOGS" | grep -i "snapshot" | wc -l)
echo "Snapshot events: $SNAPSHOTS"
if [ "$SNAPSHOTS" -gt 0 ]; then
echo "$LOGS" | grep -i "snapshot" | tail -3
fi
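Each counter in the log analysis repeats the same grep and wc -l pipeline. One possible way to factor that out is sketched below, assuming the log text arrives on stdin. The helper name count_matches is hypothetical. grep -c prints 0 but exits with status 1 when nothing matches, so the sketch swallows that status to stay safe under set -e.

```shell
# count_matches: hypothetical helper, not part of the original command.
# $1 is a case-insensitive extended regex; log text is read from stdin.
# Prints the number of matching lines (0 when there are none).
count_matches() {
    grep -ciE -e "$1" || true
}
```

With this in place the counters above could be written as SLOW_OPS=$(echo "$LOGS" | count_matches "slow"), and the two-stage disk filter as echo "$LOGS" | grep -iE "disk|fdatasync|fsync" | count_matches "slow|took|latency".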
Check for leader changes and stability issues:
echo ""
echo "==============================================="
echo "LEADER STABILITY ANALYSIS"
echo "==============================================="
echo ""
LEADER_CHANGES=$(echo "$LOGS" | grep -i "leader.*changed\|became leader\|lost leader" | wc -l)
echo "Leader change events: $LEADER_CHANGES"
if [ "$LEADER_CHANGES" -gt 0 ]; then
echo ""
echo "Leader change events:"
echo "$LOGS" | grep -i "leader.*changed\|became leader\|lost leader"
fi
# Check for proposal/commit issues
echo ""
echo "Proposal and commit operations:"
PROPOSAL_LOGS=$(echo "$LOGS" | grep -iE "proposal|commit" | grep -iE "slow|took|failed" | wc -l)
echo "Slow proposal/commit operations: $PROPOSAL_LOGS"
if [ "$PROPOSAL_LOGS" -gt 0 ]; then
echo ""
echo "Sample slow operations:"
echo "$LOGS" | grep -iE "proposal|commit" | grep -iE "slow|took|failed" | tail -5
fi
Check for network-related issues:
echo ""
echo "==============================================="
echo "NETWORK ANALYSIS"
echo "==============================================="
echo ""
NETWORK_ISSUES=$(echo "$LOGS" | grep -iE "network|connection|timeout|peer" | grep -iE "error|fail|slow" | wc -l)
echo "Network-related issues: $NETWORK_ISSUES"
if [ "$NETWORK_ISSUES" -gt 0 ]; then
echo ""
echo "Network issues:"
echo "$LOGS" | grep -iE "network|connection|timeout|peer" | grep -iE "error|fail|slow" | tail -5
fi
Create summary with recommendations:
echo ""
echo "==============================================="
echo "PERFORMANCE SUMMARY & RECOMMENDATIONS"
echo "==============================================="
echo ""
ISSUES=0
WARNINGS=0
# Check fragmentation from DB status (floored to a whole percent, matching the per-endpoint report)
MAX_FRAG=$(echo "$DB_STATUS" | jq -r '[.[] | if .Status.dbSize > 0 then ((.Status.dbSize - .Status.dbSizeInUse) * 100 / .Status.dbSize) else 0 end] | (max // 0) | floor')
if (( $(echo "$MAX_FRAG > 50" | bc -l 2>/dev/null || echo 0) )); then
echo "ISSUE: High database fragmentation (${MAX_FRAG}%)"
echo " Recommendation: Run defragmentation on all etcd members"
echo " Command: oc exec -n openshift-etcd <pod> -c etcdctl -- etcdctl defrag"
echo ""
ISSUES=$((ISSUES + 1))
elif (( $(echo "$MAX_FRAG > 30" | bc -l 2>/dev/null || echo 0) )); then
echo "WARNING: Moderate database fragmentation (${MAX_FRAG}%)"
echo " Recommendation: Monitor and consider defragmentation if performance degrades"
echo ""
WARNINGS=$((WARNINGS + 1))
fi
if [ "$LEADER_CHANGES" -gt 5 ]; then
echo "WARNING: Frequent leader changes ($LEADER_CHANGES in last ${DURATION}m)"
echo " Recommendation: Check network stability between etcd nodes"
echo " - Verify network latency between control plane nodes"
echo " - Check for packet loss or network congestion"
echo ""
WARNINGS=$((WARNINGS + 1))
fi
if [ "$SLOW_OPS" -gt 10 ]; then
echo "WARNING: High number of slow operations ($SLOW_OPS in last ${DURATION}m)"
echo " Recommendation: Investigate disk I/O and workload patterns"
echo " - Check disk performance with 'fio' benchmarks"
echo " - Review etcd workload and consider optimization"
echo ""
WARNINGS=$((WARNINGS + 1))
fi
if [ "$DISK_WARNINGS" -gt 5 ]; then
echo "WARNING: Multiple disk performance warnings ($DISK_WARNINGS in last ${DURATION}m)"
echo " Recommendation: Investigate disk I/O performance"
echo " - Ensure etcd is using SSD/NVMe storage"
echo " - Check for disk saturation or competing I/O"
echo " - Verify disk benchmarks meet etcd requirements (> 50 sequential IOPS)"
echo ""
WARNINGS=$((WARNINGS + 1))
fi
# Get average DB size
AVG_DB_SIZE=$(echo "$DB_STATUS" | jq -r '[.[] | .Status.dbSize] | add / length')
AVG_DB_SIZE_MB=$(echo "scale=0; $AVG_DB_SIZE / 1024 / 1024" | bc)
if [ "$AVG_DB_SIZE_MB" -gt 8000 ]; then
echo "WARNING: Large database size (${AVG_DB_SIZE_MB}MB)"
echo " Recommendation: Review data retention and compaction policies"
echo " - Check event retention policies"
echo " - Consider more frequent compaction"
echo ""
WARNINGS=$((WARNINGS + 1))
fi
echo "Performance Metrics Summary:"
echo " - Database size: ${AVG_DB_SIZE_MB}MB (recommended: < 8GB)"
echo " - Fragmentation: ${MAX_FRAG}% (recommended: < 30%)"
echo " - Slow operations (${DURATION}m): $SLOW_OPS (recommended: < 10)"
echo " - Leader changes (${DURATION}m): $LEADER_CHANGES (recommended: < 5)"
echo ""
if [ "$ISSUES" -eq 0 ] && [ "$WARNINGS" -eq 0 ]; then
echo "Status: ✓ HEALTHY - Performance within acceptable ranges"
exit 0
elif [ "$ISSUES" -gt 0 ]; then
echo "Status: ✗ CRITICAL - Found $ISSUES performance issues requiring attention"
exit 1
else
echo "Status: ⚠ WARNING - Found $WARNINGS performance warnings"
exit 0
fi
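The final status follows a simple precedence: any issue makes the run CRITICAL with exit code 1, otherwise any warning makes it WARNING, otherwise it is HEALTHY, and both of the latter exit 0. A minimal sketch of that decision as a reusable function is shown below; the name overall_status is hypothetical.

```shell
# overall_status: hypothetical helper, not part of the original command.
# Arguments: issue count, warning count.
# Prints HEALTHY, WARNING, or CRITICAL and returns 1 only for CRITICAL.
overall_status() {
    issues=$1
    warnings=$2
    if [ "$issues" -gt 0 ]; then
        echo "CRITICAL"
        return 1
    elif [ "$warnings" -gt 0 ]; then
        echo "WARNING"
    else
        echo "HEALTHY"
    fi
}
```

Because warnings still return 0, a wrapper that only checks the exit code treats a WARNING run as a success; callers that care about warnings need to inspect the printed status.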
Output Format:
/etcd:analyze-performance
Output:
===============================================
ETCD PERFORMANCE ANALYSIS
===============================================
Analyzing etcd performance (last 5 minutes)...
Using etcd pod: etcd-dis016-p6vvv-master-0.us-central1-a.c.openshift-qe.internal
===============================================
DATABASE PERFORMANCE ANALYSIS
===============================================
Fetching database statistics...
Database Statistics:
Endpoint: https://10.0.0.5:2379
Version: 3.5.24
DB Size: 94941184 bytes (90MB)
DB In Use: 51789824 bytes (49MB)
Revision: 50240
Raft Index: 57097
Raft Term: 8
Leader: YES
Endpoint: https://10.0.0.3:2379
Version: 3.5.24
DB Size: 95363072 bytes (90MB)
DB In Use: 51789824 bytes (49MB)
Revision: 50240
Raft Index: 57097
Raft Term: 8
Leader: NO
Endpoint: https://10.0.0.6:2379
Version: 3.5.24
DB Size: 94613504 bytes (90MB)
DB In Use: 51834880 bytes (49MB)
Revision: 50240
Raft Index: 57097
Raft Term: 8
Leader: NO
Fragmentation Analysis:
Endpoint: https://10.0.0.5:2379
Fragmentation: 45% - NOTICE: Moderate fragmentation
Endpoint: https://10.0.0.3:2379
Fragmentation: 45% - NOTICE: Moderate fragmentation
Endpoint: https://10.0.0.6:2379
Fragmentation: 45% - NOTICE: Moderate fragmentation
===============================================
CLUSTER HEALTH
===============================================
https://10.0.0.5:2379 is healthy: successfully committed proposal: took = 9.848973ms
https://10.0.0.3:2379 is healthy: successfully committed proposal: took = 14.309216ms
https://10.0.0.6:2379 is healthy: successfully committed proposal: took = 14.829731ms
===============================================
LOG ANALYSIS (Last 5 minutes)
===============================================
Searching for performance-related warnings...
Slow operations logged: 0
Disk-related warnings: 0
Apply operation warnings: 0
Recent compaction operations:
{"level":"info","ts":"2025-11-19T06:15:10.136401Z","caller":"mvcc/kvstore_compaction.go:72","msg":"finished scheduled compaction","compact-revision":48026,"took":"175.577699ms","hash":1330697744}
===============================================
LEADER STABILITY ANALYSIS
===============================================
Leader change events: 0
===============================================
NETWORK ANALYSIS
===============================================
Network-related issues: 0
===============================================
PERFORMANCE SUMMARY & RECOMMENDATIONS
===============================================
WARNING: Moderate database fragmentation (45%)
Recommendation: Monitor and consider defragmentation if performance degrades
Performance Metrics Summary:
- Database size: 90MB (recommended: < 8GB)
- Fragmentation: 45% (recommended: < 30%)
- Slow operations (5m): 0 (recommended: < 10)
- Leader changes (5m): 0 (recommended: < 5)
Status: ⚠ WARNING - Found 1 performance warnings
/etcd:analyze-performance --duration 30
Symptoms: Database size significantly larger than in-use size (>30% fragmentation)
Investigation:
# Check current fragmentation
oc exec -n openshift-etcd <pod> -c etcdctl -- etcdctl endpoint status --cluster -w json | jq .
Remediation:
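# Defragment one etcd member at a time; repeat for each pod.
# --cluster is omitted because it would defragment every member in a single invocation.
oc exec -n openshift-etcd <pod> -c etcdctl -- sh -c 'unset ETCDCTL_ENDPOINTS; etcdctl --command-timeout=30s --endpoints=https://localhost:2379 defrag'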
Recommendations:
Symptoms:
Investigation:
# Check disk performance on etcd nodes
oc debug node/<node-name> -- chroot /host fio --name=test --rw=write --bs=4k --size=1G --direct=1
Recommendations:
Symptoms: Multiple leader change events in logs
Investigation:
# Test network latency between control plane nodes
oc debug node/<node1> -- ping -c 10 <node2-ip>
# Check for network packet loss
oc debug node/<node1> -- ping -c 100 <node2-ip>
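To turn the ping run into a number that can be compared against a threshold, the loss percentage can be pulled from the summary line. The sketch below assumes the Linux iputils summary format ("N packets transmitted, M received, X% packet loss, ..."); the helper name packet_loss_percent is hypothetical.

```shell
# packet_loss_percent: hypothetical helper, not part of the original command.
# Reads ping output on stdin and prints the packet-loss percentage
# from the summary line. Prints nothing if no summary line is found.
packet_loss_percent() {
    sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}
```

For a summary line reading "100 packets transmitted, 98 received, 2% packet loss, time 99123ms" the helper prints 2.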
Recommendations:
Symptoms:
Investigation:
# Check database size across cluster
oc exec -n openshift-etcd <pod> -c etcdctl -- etcdctl endpoint status --cluster -w table
Remediation:
# Check event retention settings
oc get kubeapiserver cluster -o yaml | grep -A5 eventTTL
# Review compaction settings
oc logs -n openshift-etcd <pod> -c etcd | grep compaction
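When reviewing compaction, the duration of each run is usually more telling than the raw log line. Assuming the JSON log format shown in the sample output, with a "took" field on each "finished scheduled compaction" entry, the durations can be extracted as sketched below. The helper name compaction_took is hypothetical.

```shell
# compaction_took: hypothetical helper, not part of the original command.
# Reads etcd JSON log lines on stdin and prints the "took" value of every
# "finished scheduled compaction" entry, one per line.
compaction_took() {
    grep 'finished scheduled compaction' |
        sed -n 's/.*"took":"\([^"]*\)".*/\1/p'
}
```

Piping the oc logs output for the etcd container into compaction_took gives one duration per compaction, which makes a trend of steadily longer runs easier to spot.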
Recommendations:
/etcd:health-check
etcdctl and log analysis rather than direct metrics endpoint access
--duration parameter