Add a wait step to a CI workflow for debugging test failures
Adds a wait step to CI workflows or jobs for debugging test failures.
/plugin marketplace add openshift-eng/ai-helpers
/plugin install ci@ai-helpers
/ci:add-debug-wait <workflow-or-job-name> [timeout]
The ci:add-debug-wait command adds a wait step to a CI job/workflow for debugging test failures.
What it does:
- Adds a - ref: wait entry before the last test step (with optional timeout configuration)
That's it! Simple, fast, and automated.
The command performs the following steps:
Prompt user for (in this order):
Workflow/Job Name: (from command argument $1 or prompt)
Workflow or job name: <user-input>
Example: aws-c2s-ipi-disc-priv-fips-f7
Example: baremetalds-two-node-arbiter-e2e-openshift-test-private-tests
Timeout (optional, from command argument $2):
Wait timeout in hours (optional, default: 3h):
Examples: "1h", "2h", "8h", "24h", "72h"
Valid range: 1h to 72h
Applied as the timeout: property on the wait step in the workflow/job YAML
OCP Version: (prompt - REQUIRED for searching job configs)
OCP version for debugging (e.g., 4.18, 4.19, 4.20, 4.21, 4.22):
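This is used to:
- Filter job config files by release (the --include="*release-${ocp_version}*.yaml" pattern)
- Label the branch name and commit message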
OpenShift Release Repo Path: (prompt if not in current directory)
Path to openshift/release repository:
Default: ~/repos/openshift-release
Silently validate (no user prompts):
cd <repo-path>
# Check 1: Repository exists and is correct
git remote -v | grep "openshift/release" || exit 1
# Skip repo update - work with current state
# User can manually update their repo if needed
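The silent repository check above can be sketched as a pair of small functions. This is a minimal sketch, not the command's actual implementation: the names is_release_remote and repo_is_openshift_release are hypothetical, and the match is deliberately stricter than a plain grep for "openshift/release" (it requires a github.com host and an exact repository name).

```shell
# Hypothetical helper: succeed only if the URL points at github.com/openshift/release.
# Accepts HTTPS and SSH forms, with or without a trailing ".git".
is_release_remote() {
  case "$1" in
    *github.com[:/]openshift/release | *github.com[:/]openshift/release.git)
      return 0 ;;
    *)
      return 1 ;;
  esac
}

# Hypothetical wrapper: succeed if any remote of the repo at path $1 matches.
repo_is_openshift_release() {
  git -C "$1" remote -v 2>/dev/null | while read -r _name url _kind; do
    if is_release_remote "$url"; then echo match; fi
  done | grep -q match
}
```

Checking every remote rather than only origin keeps the check usable when origin is a personal fork and upstream points at openshift/release.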
Priority 1: Search job configs first (more specific and targeted):
cd <repo-path>
# Search for job config files matching the OCP version
# The job name could be in various config files, so search broadly
grep -r "as: ${job_name}" ci-operator/config/ --include="*release-${ocp_version}*.yaml" -l
Example searches:
For job aws-c2s-ipi-disc-priv-fips-f7 and OCP 4.21:
grep -r "as: aws-c2s-ipi-disc-priv-fips-f7" ci-operator/config/ --include="*release-4.21*.yaml" -l
Handle job config search results:
1 file found:
✅ Found job configuration:
${file_path}
Type: Job configuration file
Proceeding with job config modification...
→ Continue to Step 4a: Analyze Job Configuration
Multiple files found:
Found ${count} matching job config files:
1. ci-operator/config/.../release-4.21__amd64-nightly.yaml
2. ci-operator/config/.../release-4.21__arm64-nightly.yaml
3. ci-operator/config/.../release-4.21__ppc64le-nightly.yaml
Select file (1-${count}) or 'q' to quit:
Prompt user to select which file to modify, then continue to Step 4a: Analyze Job Configuration
0 files found:
ℹ️ No job config found for: ${job_name} (OCP ${ocp_version})
Searching for workflow files instead...
→ Continue to Priority 2 below
Priority 2: Search workflow files (if job config not found):
cd <repo-path>
# Search for workflow files
find ci-operator/step-registry -type f -name "*${workflow_name}*workflow*.yaml"
Handle workflow search results:
0 files found:
❌ No job config or workflow file found for: ${job_name}
Suggestions:
1. Check spelling of job/workflow name
2. Verify OCP version (${ocp_version})
3. Try with partial name
4. Search manually:
- Job configs: grep -r "as: ${job_name}" ci-operator/config/
- Workflows: find ci-operator/step-registry -name "*workflow*.yaml" | grep <partial-name>
1 file found:
✅ Found workflow file:
${file_path}
Type: Workflow file
Proceeding with workflow modification...
→ Continue to Step 4b: Analyze Workflow File
Multiple files found:
Found ${count} matching workflow files:
1. ci-operator/step-registry/.../workflow1.yaml
2. ci-operator/step-registry/.../workflow2.yaml
3. ci-operator/step-registry/.../workflow3.yaml
Select file (1-${count}) or 'q' to quit:
Prompt user to select which file to modify, then continue to Step 4b: Analyze Workflow File
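The two-priority search above (job configs first, workflow files as a fallback) could be combined into one function. The sketch below is an illustration under assumptions: find_target is a hypothetical name, and the "kind path" output format is invented here so a caller can count lines to choose between the 0, 1, and multiple-match branches.

```shell
# Hypothetical helper: print one "<kind> <path>" line per match.
# Arguments: repo path, job or workflow name, OCP version.
# Returns 1 when neither a job config nor a workflow file matches.
find_target() {
  repo=$1
  name=$2
  version=$3

  # Priority 1: job configs for the given release.
  kind=job-config
  matches=$(grep -rl --include="*release-${version}*.yaml" \
    "as: ${name}" "$repo/ci-operator/config" 2>/dev/null || true)

  # Priority 2: workflow files, only when no job config matched.
  if [ -z "$matches" ]; then
    kind=workflow
    matches=$(find "$repo/ci-operator/step-registry" -type f \
      -name "*${name}*workflow*.yaml" 2>/dev/null || true)
  fi

  [ -n "$matches" ] || return 1
  printf '%s\n' "$matches" | sed "s|^|${kind} |"
}
```

Note that the pattern "as: ${name}" also matches longer job names that share the same prefix, so a multiple-match result may need manual selection even for a fully spelled-out name.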
Read and parse the job config YAML:
# Find the specific test definition
grep -A 30 "as: ${job_name}" <job-config-file>
Check for:
- steps: section
- test: section inside steps
- Existing - ref: wait
Example current structure:
- as: aws-c2s-ipi-disc-priv-fips-f7
  cron: 36 16 3,12,19,26 * *
  steps:
    cluster_profile: aws-c2s-qe
    env:
      BASE_DOMAIN: qe.devcluster.openshift.com
      FIPS_ENABLED: "true"
    test:
    - chain: openshift-e2e-test-qe
    workflow: cucushift-installer-rehearse-aws-c2s-ipi-disconnected-private
If wait already exists:
ℹ️ Wait step already configured in job config
Current test section:
test:
- ref: wait
- chain: openshift-e2e-test-qe
No changes needed. The job is already set up for debugging.
If no test section found:
ℹ️ Job config found but no test: section
This job uses only the workflow's test steps.
Searching for the workflow: ${workflow_name}
→ Fall back to searching for workflow (Priority 2 in Step 3)
→ Otherwise (test: section present, no wait step yet), continue to Step 5a: Show Diff for Job Config
Read and parse the workflow YAML:
cat <workflow-file>
Check for:
- workflow: section
- test: section
- Existing - ref: wait
Example current structure:
workflow:
  as: baremetalds-two-node-arbiter-upgrade
  steps:
    pre:
    - chain: baremetalds-ipi-pre
    test:
    - chain: baremetalds-ipi-test
    post:
    - chain: baremetalds-ipi-post
If wait already exists:
ℹ️ Wait step already configured in workflow
Current test section:
test:
- ref: wait
- chain: baremetalds-ipi-test
No changes needed. The workflow is already set up for debugging.
If no test section exists:
ℹ️ Workflow has no test: section
This workflow is provision/deprovision only.
The test steps must be defined in the job config.
Please provide the full job name to modify the job config instead.
→ Exit or prompt for job name
→ Otherwise (test: section present, no wait step yet), continue to Step 5b: Modify Workflow File
Edit the job config file directly - no confirmation needed:
# Add wait step before the last test step
# If timeout is provided, add it as a step property
# See Step 6 for the YAML modification algorithm
Two scenarios:
Without custom timeout (uses wait step's built-in default of 3h):
test:
- ref: wait
- chain: openshift-e2e-test-qe
Note: No timeout or best_effort needed - the wait step will use its default TIMEOUT env var (3 hours)
With custom timeout (user provided timeout parameter):
test:
- ref: wait
  timeout: 8h0m0s
  best_effort: true
- chain: openshift-e2e-test-qe
Note: best_effort: true is required when timeout is customized to prevent the wait step from failing the job if it times out
Show brief confirmation:
✅ Modified: ${job_name} (OCP ${ocp_version})
File: <job-config-file-path>
Added: - ref: wait${timeout:+ (timeout: ${timeout})}
Edit the workflow file directly - no confirmation needed:
# Add wait step before the last test step
# If timeout is provided, add it as a step property
# See Step 6 for the YAML modification algorithm
Two scenarios:
Without custom timeout (uses wait step's built-in default of 3h):
test:
- ref: wait
- chain: baremetalds-ipi-test
Note: No timeout or best_effort needed - the wait step will use its default TIMEOUT env var (3 hours)
With custom timeout (user provided timeout parameter):
test:
- ref: wait
  timeout: 8h0m0s
  best_effort: true
- chain: baremetalds-ipi-test
Note: best_effort: true is required when timeout is customized to prevent the wait step from failing the job if it times out
Show brief confirmation:
✅ Modified: ${workflow_name} workflow
File: <workflow-file-path>
Added: - ref: wait${timeout:+ (timeout: ${timeout})}
⚠️ Impact: Affects ALL jobs using this workflow
Branch naming:
debug-${workflow_name}-${ocp_version}-$(date +%Y%m%d)
Example: debug-baremetalds-two-node-arbiter-4.21-20250131
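The naming scheme above can be wrapped in a small function. This is a sketch only: debug_branch_name is a hypothetical name, and the optional third argument (a fixed date stamp) is an addition made here so the result is reproducible.

```shell
# Hypothetical helper: build debug-<name>-<version>-<YYYYMMDD>.
# The third argument overrides today's date when provided.
debug_branch_name() {
  name=$1
  version=$2
  stamp=${3:-$(date +%Y%m%d)}
  printf 'debug-%s-%s-%s\n' "$name" "$version" "$stamp"
}

debug_branch_name baremetalds-two-node-arbiter 4.21 20250131
# prints: debug-baremetalds-two-node-arbiter-4.21-20250131
```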
Git operations:
# Create branch
git checkout -b "${branch_name}"
# Modify the file (add wait step using the implementation below)
# Add '- ref: wait' before the last step in the test: section
# Stage change
git add <workflow-file>
# Commit
git commit -m "[Debug] Add wait step to ${workflow_name} for OCP ${ocp_version}
This adds a wait step to enable debugging of test failures in OCP ${ocp_version}.
The wait step pauses the workflow before the last test step runs, allowing QE to:
- SSH into the test environment
- Inspect system state and logs
- Debug configuration issues
- Investigate test failures
OCP Version: ${ocp_version}
Workflow: ${workflow_name}"
YAML Modification Algorithm:
The modification process for both job configs and workflow files follows the same pattern:
Locate the target: Find the test: section (for job configs, inside the entry that starts with - as: ${job_name})
Find test steps: Identify all steps (lines with - ref: or - chain:)
Check for duplicates: Ensure - ref: wait doesn't already exist
Insert wait step: Add before the last test step with matching indentation
Handle timeout:
- Without timeout: add only - ref: wait
- With timeout: also add the timeout and best_effort properties
Example transformation:
Before:
test:
- chain: openshift-e2e-test-qe
After (without timeout):
test:
- ref: wait
- chain: openshift-e2e-test-qe
After (with timeout=8h):
test:
- ref: wait
  timeout: 8h0m0s
  best_effort: true
- chain: openshift-e2e-test-qe
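The algorithm above could be sketched with awk as follows. This is an illustrative line-based sketch, not the command's actual implementation: add_wait_step and its exit codes (2 = entry not found, 3 = no test: steps, 4 = wait step already present) are hypothetical, it assumes space-indented YAML with no blank lines or comments inside the test: list, and it writes the modified document to stdout instead of editing in place.

```shell
# Hypothetical helper. Arguments: file, job or workflow name, optional timeout
# already normalized to Go duration format (for example 8h0m0s).
add_wait_step() {
  file=$1
  anchor=$2
  timeout=${3:-}
  awk -v anchor="$anchor" -v timeout="$timeout" '
    # Number of leading spaces on a line.
    function indent_of(s,    n) { n = 0; while (substr(s, n + 1, 1) == " ") n++; return n }
    # True when the first non-space text on the line is a list marker.
    function is_item(s) { return substr(s, indent_of(s) + 1, 2) == "- " }
    { lines[NR] = $0 }
    END {
      # 1. Locate the target entry by its "as: <name>" line.
      for (i = 1; i <= NR; i++) if (index(lines[i], "as: " anchor)) { start = i; break }
      if (!start) exit 2
      base = indent_of(lines[start])
      # 2. Find its test: key, stopping at the next sibling list entry.
      for (i = start + 1; i <= NR; i++) {
        if (lines[i] ~ /^ *test: *$/) { t = i; break }
        if (is_item(lines[i]) && indent_of(lines[i]) <= base) break
      }
      if (!t) exit 3
      # 3. Walk the step list, remembering the last step and checking for duplicates.
      item_indent = -1
      for (i = t + 1; i <= NR; i++) {
        cur = indent_of(lines[i])
        if (item_indent < 0) { if (!is_item(lines[i])) break; item_indent = cur }
        if (cur < item_indent || (cur == item_indent && !is_item(lines[i]))) break
        if (cur == item_indent) {
          if (lines[i] ~ /^ *- ref: wait *$/) exit 4
          last = i
        }
      }
      if (!last) exit 3
      # 4. Print the file, inserting the wait step before the last step.
      pad = ""; for (k = 0; k < item_indent; k++) pad = pad " "
      for (i = 1; i <= NR; i++) {
        if (i == last) {
          print pad "- ref: wait"
          if (timeout != "") {
            print pad "  timeout: " timeout
            print pad "  best_effort: true"
          }
        }
        print lines[i]
      }
    }
  ' "$file"
}
```

A caller might run add_wait_step "$file" "$job_name" 8h0m0s > "$file.new" && mv "$file.new" "$file", and treat exit code 4 as the "wait step already configured" case. A real YAML parser would be more robust than this indentation-based scan.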
Critical constraints:
- When a timeout is set, best_effort: true is required to prevent job failure
Auto-push the branch:
git push origin "${branch_name}"
Display GitHub PR creation link:
✅ Changes pushed successfully!
Create PR here:
https://github.com/openshift/release/compare/master...${branch_name}
Branch: ${branch_name}
Job: ${job_name}
OCP: ${ocp_version}
⚠️ Remember to close PR after debugging (DO NOT MERGE)
That's it! Simple and clean.
Error: Repository Not Found
❌ Error: Repository not found at ${repo_path}
Please provide the correct path to openshift/release repository.
To clone:
git clone https://github.com/openshift/release.git
Error: Not in openshift/release Repo
❌ Error: This doesn't appear to be the openshift/release repository
Remote URL: ${current_remote}
Expected: github.com/openshift/release
Please navigate to the correct repository.
Error: Workflow File Not Found
❌ Error: Workflow file not found
Searched for: *${workflow_name}*workflow*.yaml
Location: ci-operator/step-registry/
Suggestions:
1. Verify the workflow name
2. Try a partial match
3. Search manually: find ci-operator/step-registry -name "*workflow*.yaml"
Error: Wait Step Already Exists
ℹ️ Wait step already configured in this workflow
No action needed - you can proceed with debugging using the existing wait step.
Error: Invalid OCP Version
❌ Invalid OCP version: ${version}
Valid versions: 4.18, 4.19, 4.20, 4.21, 4.22, master
Please provide a valid version.
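The version check behind this error could look like the sketch below. The function name is_valid_ocp_version is hypothetical, and the accepted values are simply the ones listed in the message above; a real implementation may derive them differently.

```shell
# Hypothetical helper: accept only the versions listed in the error message.
is_valid_ocp_version() {
  case "$1" in
    4.18|4.19|4.20|4.21|4.22|master) return 0 ;;
    *) return 1 ;;
  esac
}

version=4.21
if is_valid_ocp_version "$version"; then
  echo "OCP version OK: ${version}"
else
  echo "Invalid OCP version: ${version}" >&2
fi
```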
Error: Invalid Timeout Format
❌ Invalid timeout format: ${timeout}
Valid format: Integer followed by 'h' (e.g., "1h", "2h", "8h", "24h", "72h")
Valid range: 1h to 72h
Examples:
- "1h" (1 hour)
- "8h" (8 hours)
- "24h" (24 hours)
- "72h" (72 hours, maximum)
Please provide a valid timeout in hours.
When a user provides a timeout like "8h", the implementation should normalize it to the standard Go duration format "8h0m0s" for consistency with existing configurations in the codebase.
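The validation and normalization described above can be sketched as a single POSIX shell function. The name normalize_timeout is hypothetical, and rejecting leading zeros (for example "08h") is an assumption made here, not something the rules above state.

```shell
# Hypothetical helper: validate "<N>h" with N in 1..72 and print "<N>h0m0s".
# Returns 1 and prints a message on stderr for invalid input.
normalize_timeout() {
  input=$1
  hours=${input%h}

  # Must end with a single trailing "h".
  case "$input" in
    *h) ;;
    *) echo "Invalid timeout format: ${input}" >&2; return 1 ;;
  esac

  # The part before "h" must be digits only, non-empty, with no leading zero.
  case "$hours" in
    '' | *[!0-9]* | 0*)
      echo "Invalid timeout format: ${input}" >&2; return 1 ;;
  esac

  # Enforce the 1h to 72h range (the length check guards against huge numbers).
  if [ "${#hours}" -gt 2 ] || [ "$hours" -gt 72 ]; then
    echo "Timeout out of range (1h to 72h): ${input}" >&2
    return 1
  fi

  printf '%sh0m0s\n' "$hours"
}

normalize_timeout 8h
# prints: 8h0m0s
```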
/ci:add-debug-wait aws-ipi-f7-longduration-workload
Prompts for: OCP version (4.21), repo path
Result:
test:
- ref: wait
- chain: openshift-e2e-test-qe
Returns: PR creation link
/ci:add-debug-wait aws-ipi-f7-longduration-workload 8h
Prompts for: OCP version (4.21), repo path
Result:
test:
- ref: wait
  timeout: 8h0m0s
  best_effort: true
- chain: openshift-e2e-test-qe
Returns: PR creation link with timeout info
/ci:add-debug-wait baremetalds-two-node-arbiter-upgrade 24h
Behavior: Searches job config first, falls back to workflow if not found. Warns that workflow changes affect ALL jobs using it.
Returns: PR creation link
Before Running Command:
During Debugging:
After Debugging:
Consider adding companion commands:
- /ci:close-debug-pr - Lists open debug PRs, prompts for findings, closes PR
- /ci:list-debug-prs - Show all open debug PRs
- /ci:revert-debug-pr - Revert a debug PR that was merged by mistake