Analyze OpenShift installation failures in Prow CI jobs by examining installer logs, log bundles, and sosreports. Use when a CI job fails the "install should succeed" test at bootstrap, cluster creation, or another stage.
/plugin marketplace add openshift-eng/ai-helpers
/plugin install prow-job@ai-helpers

This skill inherits all available tools. When active, it can use any tool Claude has access to.
This skill helps debug OpenShift installation failures in CI jobs by downloading and analyzing installer logs, log bundles, and sosreports from Google Cloud Storage.
Use this skill when:
- A Prow CI job fails the "install should succeed" test
- You have a Prow job URL and need to determine the failure stage and root cause from the job's artifacts
Before starting, verify these prerequisites:
gcloud CLI Installation
which gcloud

gcloud Authentication (Optional)
Authentication is generally not needed because the test-platform-results bucket is publicly accessible.

The user will provide:
A Prow job URL, for example:
https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.21-e2e-aws-ovn-techpreview/1983307151598161920

Everything after test-platform-results/ in the URL is the bucket path.

Job names contain important clues about the test environment and what to look for:
Upgrade Jobs (names containing "upgrade")
Examples:
- periodic-ci-openshift-release-master-ci-4.21-upgrade-from-stable-4.20-e2e-gcp-ovn-rt-upgrade
- periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-upgrade-fips
- periodic-ci-openshift-release-master-ci-4.21-e2e-azure-ovn-upgrade

FIPS Jobs (names containing "fips")
IPv6 and Dualstack Jobs (names containing "ipv6" or "dualstack")
Metal Jobs with IPv6 (names containing "metal" and "ipv6")
Single-Node Jobs (names containing "single-node")
Platform-Specific Indicators
- aws, gcp, azure: Cloud platform used
- metal, baremetal: Bare metal environment (uses specialized metal install failure skill)
- ovn: OVN-Kubernetes networking (standard)

Extract bucket path
- Everything after test-platform-results/ in the URL

Extract build_id
- Match /(\d{10,})/ in the bucket path

Determine job type
- If the job name contains "metal" or "baremetal": is_metal_job = true
- Otherwise: is_metal_job = false

Construct GCS paths
- Bucket: test-platform-results
- Base path: gs://test-platform-results/{bucket-path}/

mkdir -p .work/prow-job-analyze-install-failure/{build_id}/logs
mkdir -p .work/prow-job-analyze-install-failure/{build_id}/analysis
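The extraction and setup steps above can be tied together with a small shell sketch. This is only an illustration of the logic; the variable names are not part of the skill, and the URL is the example from earlier.

# Minimal setup sketch (illustrative variable names)
PROW_URL="https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.21-e2e-aws-ovn-techpreview/1983307151598161920"
BUCKET_PATH="${PROW_URL#*test-platform-results/}"                 # logs/periodic-ci-.../1983307151598161920
BUILD_ID="$(grep -oE '[0-9]{10,}' <<<"$BUCKET_PATH" | tail -n1)"  # the long digit run is the build id
case "$BUCKET_PATH" in *metal*) IS_METAL_JOB=true ;; *) IS_METAL_JOB=false ;; esac
GCS_BASE="gs://test-platform-results/${BUCKET_PATH}"              # base path for gcloud storage commands
mkdir -p ".work/prow-job-analyze-install-failure/${BUILD_ID}"/{logs,analysis}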
Use .work/prow-job-analyze-install-failure/ as the base directory (already in .gitignore):
- logs/ subdirectory for all downloads
- analysis/ subdirectory for analysis files
- everything keyed by build ID under .work/prow-job-analyze-install-failure/{build_id}/

Download prowjob.json
gcloud storage cp gs://test-platform-results/{bucket-path}/prowjob.json .work/prow-job-analyze-install-failure/{build_id}/logs/prowjob.json --no-user-output-enabled
Parse and validate
- Read .work/prow-job-analyze-install-failure/{build_id}/logs/prowjob.json
- Look for --target=([a-zA-Z0-9-]+)

Extract target name
- For example: e2e-aws-ovn-techpreview

Note on install-status.txt: You may see an install-status.txt file in the artifacts. This file contains only the installer's exit code (a single number). The junit_install.xml file translates this exit code into a human-readable failure mode, so always prefer junit_install.xml for determining the failure stage.
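Returning to the target extraction above, a minimal sketch that pulls the --target value out of the downloaded prowjob.json (path as created earlier; the regex mirrors the one in this step):

# Print the first --target value found in prowjob.json
grep -oE -e '--target=[a-zA-Z0-9-]+' \
  .work/prow-job-analyze-install-failure/{build_id}/logs/prowjob.json | head -n1 | cut -d= -f2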
Find junit_install.xml
gcloud storage ls -r gs://test-platform-results/{bucket-path}/artifacts/ 2>&1 | grep "junit_install.xml"
Download junit_install.xml
# Download from wherever it was found
gcloud storage cp {full-gcs-path-to-junit_install.xml} .work/prow-job-analyze-install-failure/{build_id}/logs/junit_install.xml --no-user-output-enabled
Parse junit_install.xml to find failure stage
The failed test case is named "install should succeed: <stage>", where <stage> identifies the failure mode:

cluster bootstrap - Early install failure where we failed to bootstrap the cluster. Bootstrap is typically an ephemeral VM that runs a temporary kube apiserver. Check bootkube logs in the bundle.
infrastructure - Early failure before we're able to create all cloud resources. Often but not always due to cloud quota, rate limiting, or outages. Check installer log for cloud API errors.
cluster creation - Usually means one or more operators were unable to stabilize. Check operator logs in gather-must-gather artifacts.
configuration - Extremely rare failure mode where we failed to create the install-config.yaml for one reason or another. Check installer log for validation errors.
cluster operator stability - Operators never stabilized (available=True, progressing=False, degraded=False). Check specific operator logs to determine why they didn't reach stable state.
other - Unknown install failure, could be for one of the previously listed reasons, or an unknown one. Requires full log analysis.
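The stage can usually be spotted with a simple grep; this is only a rough sketch, since the exact XML layout can vary, so verify against the downloaded file.

# Show each "install should succeed" test case with one line of trailing context,
# so the case carrying a <failure> element (and therefore the failure stage) stands out.
grep -A1 'install should succeed' \
  .work/prow-job-analyze-install-failure/{build_id}/logs/junit_install.xml

List all artifacts to find installer logs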
Look for files matching .openshift_install*.log (excluding deprovision):

gcloud storage ls -r gs://test-platform-results/{bucket-path}/artifacts/ 2>&1 | grep -E "\.openshift_install.*\.log$" | grep -v "deprovision"
Download all installer logs found
# For each installer log found in the listing (excluding deprovision)
gcloud storage cp {full-gcs-path-to-installer-log} .work/prow-job-analyze-install-failure/{build_id}/logs/ --no-user-output-enabled
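A sketch of that loop, reusing the listing command from the previous step (placeholders as used elsewhere in this document):

gcloud storage ls -r gs://test-platform-results/{bucket-path}/artifacts/ 2>&1 \
  | grep -E '\.openshift_install.*\.log$' | grep -v "deprovision" \
  | while read -r log; do
      # copy each non-deprovision installer log into the work directory
      gcloud storage cp "$log" .work/prow-job-analyze-install-failure/{build_id}/logs/ --no-user-output-enabled
    done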
List all artifacts to find log bundle
Look for .tar files (NOT .tar.gz) starting with log-bundle-:

# Find all log bundles, preferring non-deprovision
gcloud storage ls -r gs://test-platform-results/{bucket-path}/artifacts/ 2>&1 | grep -E "log-bundle.*\.tar$"
Download log bundle
# If non-deprovision log bundles exist, download one of those (prefer most recent by timestamp)
# Otherwise, download deprovision log bundle if that's the only one available
gcloud storage cp {full-gcs-path-to-log-bundle} .work/prow-job-analyze-install-failure/{build_id}/logs/ --no-user-output-enabled
Extract log bundle
tar -xf .work/prow-job-analyze-install-failure/{build_id}/logs/log-bundle-{timestamp}.tar -C .work/prow-job-analyze-install-failure/{build_id}/logs/
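The selection, download, and extraction steps above can be scripted roughly as follows; this sketch assumes the timestamps embedded in the bundle names sort lexically (they normally do):

BUNDLES=$(gcloud storage ls -r gs://test-platform-results/{bucket-path}/artifacts/ 2>&1 | grep -E 'log-bundle.*\.tar$')
BUNDLE=$(grep -v deprovision <<<"$BUNDLES" | sort | tail -n1)   # prefer the most recent non-deprovision bundle
[ -z "$BUNDLE" ] && BUNDLE=$(sort <<<"$BUNDLES" | tail -n1)     # otherwise fall back to the deprovision bundle
gcloud storage cp "$BUNDLE" .work/prow-job-analyze-install-failure/{build_id}/logs/ --no-user-output-enabled
tar -xf ".work/prow-job-analyze-install-failure/{build_id}/logs/$(basename "$BUNDLE")" \
  -C .work/prow-job-analyze-install-failure/{build_id}/logs/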
IMPORTANT: Only perform this step if is_metal_job = true
Metal IPI jobs use dev-scripts with Metal3 and Ironic to install OpenShift on bare metal. These require specialized analysis.
Invoke the metal install failure skill
Invoke prow-job:prow-job-analyze-metal-install-failure, passing:
- {build_id}
- {bucket-path}
- {target}
- Work directory: .work/prow-job-analyze-install-failure/{build_id}/

The metal skill will:
Continue with standard analysis:
CRITICAL: Understanding OpenShift's Eventual Consistency
OpenShift installations exhibit "eventual consistency" behavior, which means:
Error Analysis Strategy:
Example: If installation fails at 40 minutes with "kube-apiserver not available", an error at 5 minutes saying "ingress operator degraded" is likely irrelevant because it probably resolved. Focus on what was still broken when the timeout occurred.
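A quick way to apply this strategy is to check whether an early error still appears near the end of the installer log; the "ingress" pattern below is just the hypothetical example from this paragraph.

# If the pattern stops appearing well before the failure time, it most likely self-resolved.
grep -n 'ingress' .work/prow-job-analyze-install-failure/{build_id}/logs/.openshift_install-*.log | tail -n 5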
Read installer log
Log lines follow the format: time="YYYY-MM-DDTHH:MM:SSZ" level=<level> msg="<message>"

Identify key failure indicators (WORK BACKWARDS FROM END)
- level=error or level=fatal near the end of the log
- level=warning near the failure time that may indicate problems

Extract relevant log sections (prioritize recent errors)
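For example, a grep pass like the following (installer log filename pattern assumed) surfaces the most recent errors and warnings:

# Work backwards: the last error/fatal lines, then the warnings closest to the failure time
grep -nE 'level=(error|fatal)' .work/prow-job-analyze-install-failure/{build_id}/logs/.openshift_install-*.log | tail -n 20
grep -nE 'level=warning' .work/prow-job-analyze-install-failure/{build_id}/logs/.openshift_install-*.log | tail -n 20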
Create installer log summary
- Save it to .work/prow-job-analyze-install-failure/{build_id}/analysis/installer-summary.txt

Skip this step if no log bundle was downloaded
Understand log bundle structure
log-bundle-{timestamp}/
bootstrap/journals/ - Journal logs from bootstrap node
  bootkube.log - Bootkube service that starts initial control plane
  kubelet.log - Kubelet service logs
  crio.log - Container runtime logs
  journal.log.gz - Complete system journal (gzipped)
bootstrap/network/ - Network configuration
  ip-addr.txt - IP addresses
  ip-route.txt - Routing table
  hostname.txt - Hostname
serial/ - Serial console logs from all nodes
  {cluster-name}-bootstrap-serial.log - Bootstrap node console
  {cluster-name}-master-N-serial.log - Master node consoles
clusterapi/ - Cluster API resources
  *.yaml - Kubernetes resource definitions
  etcd.log - etcd logs
  kube-apiserver.log - API server logs
failed-units.txt - List of systemd units that failed
gather.log - Log bundle collection process log
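Before opening individual files, a quick listing shows what the extracted bundle actually contains; the directory name follows the layout above.

find .work/prow-job-analyze-install-failure/{build_id}/logs/log-bundle-*/ -maxdepth 2 -type f | sort

Analyze based on failure mode from junit_install.xml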
For "cluster bootstrap" failures:
Check:
- bootstrap/journals/bootkube.log for bootkube errors
- bootstrap/journals/kubelet.log for kubelet issues
- clusterapi/kube-apiserver.log for API server startup issues
- clusterapi/etcd.log for etcd cluster formation issues
- serial/{cluster-name}-bootstrap-serial.log for bootstrap VM boot issues

For "infrastructure" failures:
- Check the installer log for cloud API errors (quota, rate limiting, or provider outages); the log bundle may be partial or missing
For "cluster creation" failures:
- Check for must-gather*.tar files in the gather-must-gather step directory and review operator logs there

For "configuration" failures:
- Check the installer log for install-config.yaml validation errors (no log bundle is produced this early)
For "cluster operator stability" failures:
- Identify which operators never reached a stable state and review their logs in gather-must-gather (look for must-gather*.tar files)

For "other" failures:
- Review all available logs systematically for timeouts and unusual error patterns
Extract key information
- If failed-units.txt exists, read it to find failed services
- Save the log bundle findings to .work/prow-job-analyze-install-failure/{build_id}/analysis/log-bundle-summary.txt
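A quick shell pass over the extracted bundle surfaces this information; the paths follow the bundle layout described above.

# List failed systemd units and the last error-ish lines from the bootstrap journals
cat .work/prow-job-analyze-install-failure/{build_id}/logs/log-bundle-*/failed-units.txt
grep -iE 'error|fail' \
  .work/prow-job-analyze-install-failure/{build_id}/logs/log-bundle-*/bootstrap/journals/bootkube.log | tail -n 20

Create comprehensive analysis report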
Report structure
OpenShift Installation Failure Analysis
========================================
Job: {job-name}
Build ID: {build_id}
Job Type: {metal/cloud}
Prow URL: {original-url}
Failure Stage: {stage from junit_install.xml}
Summary
-------
{High-level summary of the failure}
Installer Log Analysis
----------------------
{Key findings from installer log}
First Error:
{First error message with timestamp}
Context:
{Surrounding log lines}
Log Bundle Analysis
-------------------
{Findings from log bundle}
Failed Units:
{List from failed-units.txt}
Key Journal Errors:
{Important errors from journal logs}
Metal Installation Analysis (Metal Jobs Only)
---------------------------------------------
{Summary from metal install failure skill}
- Dev-scripts setup status
- Console log findings
- Key metal-specific errors
See detailed metal analysis: .work/prow-job-analyze-install-failure/{build_id}/analysis/metal-analysis.txt
Recommended Next Steps
----------------------
{Actionable debugging steps based on failure mode:
For "configuration" failures:
- Review install-config.yaml validation errors
- Check for missing required fields
- Verify credential format and availability
For "infrastructure" failures:
- Check cloud provider quota and limits
- Review cloud provider service status for outages
- Verify API credentials and permissions
- Check for rate limiting in cloud API calls
For "cluster bootstrap" failures:
- Review bootkube logs for control plane startup issues
- Check etcd cluster formation in etcd.log
- Examine kube-apiserver startup in kube-apiserver.log
- Review bootstrap VM serial console for boot issues
For "cluster creation" failures:
- Identify which operators failed to deploy
- Check if must-gather was collected (look for must-gather*.tar files)
- If must-gather exists: Review specific operator logs in gather-must-gather
- If must-gather doesn't exist: Cluster was too unstable to collect diagnostics; rely on installer log and log bundle
- Check for resource conflicts or missing dependencies
For "cluster operator stability" failures:
- Identify operators not reaching stable state
- Check operator conditions (available, progressing, degraded)
- Check if must-gather exists before suggesting to review it
- Review operator logs for stuck operations (if must-gather available)
- Look for time-series of operator status changes
For "other" failures:
- Perform comprehensive log review
- Look for timeout or unusual error patterns
- Check all available artifacts systematically
}
Artifacts Location
------------------
All artifacts downloaded to:
.work/prow-job-analyze-install-failure/{build_id}/logs/
- Installer logs: .openshift_install*.log
- Log bundle: log-bundle-*/
- sosreport: sosreport-*/ (metal jobs only)
Save report
- Save it to .work/prow-job-analyze-install-failure/{build_id}/analysis/report.txt

Display summary
Offer next steps
Provide artifact locations
Understanding the installation stages helps target analysis:
Pre-installation (Failure mode: "configuration")
Infrastructure Creation (Failure mode: "infrastructure")
Bootstrap (Failure mode: "cluster bootstrap")
Master Node Bootstrap
Bootstrap Complete
Cluster Operators Initialization (Failure mode: "cluster creation")
Cluster Operators Stabilization (Failure mode: "cluster operator stability")
Install Complete
| JUnit Failure Mode | Installation Stage | Where to Look | Artifacts Available |
|---|---|---|---|
| configuration | Pre-installation | Installer log only | No log bundle |
| infrastructure | Infrastructure Creation | Installer log, cloud API errors | Partial or no log bundle |
| cluster bootstrap | Bootstrap | Log bundle (bootkube, etcd, kube-apiserver) | Full log bundle |
| cluster creation | Operators Initialization | gather-must-gather, operator logs | Full artifacts |
| cluster operator stability | Operators Stabilization | gather-must-gather, operator status | Full artifacts |
| other | Unknown | All available logs | Varies |
The installer log is named .openshift_install-{timestamp}.log. Key patterns to search for:
- level=error - Error messages
- level=fatal - Fatal errors that stop installation
- waiting for - Timeout/waiting messages
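A minimal grep over the installer log (filename pattern as above) shows what the installer was still waiting on when it gave up, alongside any errors:

grep -hE 'waiting for|level=(error|fatal)' \
  .work/prow-job-analyze-install-failure/{build_id}/logs/.openshift_install-*.log | tail -n 40

Metal installations are handled by the prow-job-analyze-metal-install-failure skill.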
See that skill's documentation for details on dev-scripts, libvirt logs, sosreport, and squid logs.
Downloads are kept in .work/prow-job-analyze-install-failure/{build_id}/ for re-analysis

Eventual Consistency Behavior
Upgrade Jobs and Installation
Log Bundle Availability
Must-Gather Availability
Confirm a must-gather*.tar file is present before recommending a must-gather review

Metal Job Specifics
Handled by the dedicated prow-job-analyze-metal-install-failure skill

Debugging Workflow
Common Failure Patterns
File Formats