Extract and decompress must-gather archives from Prow CI job artifacts, generating an interactive HTML file browser with filters
Install via:

- `/plugin marketplace add openshift-eng/ai-helpers`
- `/plugin install prow-job@ai-helpers`

This skill inherits all available tools. When active, it can use any tool Claude has access to.
Skill files: `CHANGELOG.md`, `README.md`, `extract_archives.py`, `generate_html_report.py`

This skill extracts and decompresses must-gather archives from Prow CI job artifacts, automatically handling nested tar and gzip archives, and generating an interactive HTML file browser.
Use this skill when the user wants to:

- Extract and decompress a must-gather archive from a Prow CI job
- Browse or search the extracted files through an interactive HTML report
Before starting, verify these prerequisites:
1. **gcloud CLI installation**: verify with `which gcloud`
2. **gcloud authentication (optional)**: the `test-platform-results` bucket is publicly accessible

The user will provide:
A gcsweb URL containing `test-platform-results/`, for example:

`https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.20-e2e-aws-ovn-techpreview/1965715986610917376/`

1. **Extract bucket path**: everything after `test-platform-results/` in the URL
2. **Extract build_id**: match `/(\d{10,})/` in the bucket path
3. **Extract prowjob name**: the path segment before the build ID, e.g. `.../periodic-ci-openshift-release-master-ci-4.20-e2e-aws-ovn-techpreview/1965715986610917376/` yields `periodic-ci-openshift-release-master-ci-4.20-e2e-aws-ovn-techpreview`
4. **Construct GCS paths**: prefix the bucket path with the `test-platform-results` bucket: `gs://test-platform-results/{bucket-path}/`
5. **Check for existing extraction first**: if the `.work/prow-job-extract-must-gather/{build_id}/logs/` directory exists and has content, ask whether to reuse it or clean up with `rm -rf .work/prow-job-extract-must-gather/{build_id}/logs/` and `rm -rf .work/prow-job-extract-must-gather/{build_id}/tmp/`
6. **Create directory structure**:
mkdir -p .work/prow-job-extract-must-gather/{build_id}/logs
mkdir -p .work/prow-job-extract-must-gather/{build_id}/tmp
All work happens under:

- `.work/prow-job-extract-must-gather/` as the base directory (already in `.gitignore`)
- `logs/` subdirectory for extraction
- `tmp/` subdirectory for temporary files
- everything for one job under `.work/prow-job-extract-must-gather/{build_id}/`

**Download prowjob.json**
gcloud storage cp gs://test-platform-results/{bucket-path}/prowjob.json .work/prow-job-extract-must-gather/{build_id}/tmp/prowjob.json --no-user-output-enabled
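The URL-parsing steps above (bucket path, build ID, prowjob name) can be sketched in Python. This is a minimal illustration of the stated rules, not the skill's actual implementation:

```python
import re

def parse_gcsweb_url(url: str):
    """Derive bucket path, build ID, and prowjob name from a gcsweb URL."""
    # Bucket path: everything after "test-platform-results/"
    marker = "test-platform-results/"
    bucket_path = url.split(marker, 1)[1].strip("/")
    # Build ID: a run of 10+ digits in its own path segment
    build_id = re.search(r"/(\d{10,})(?:/|$)", "/" + bucket_path + "/").group(1)
    # Prowjob name: the path segment immediately before the build ID
    segments = bucket_path.split("/")
    prowjob_name = segments[segments.index(build_id) - 1]
    return bucket_path, build_id, prowjob_name

url = ("https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/"
       "logs/periodic-ci-openshift-release-master-ci-4.20-e2e-aws-ovn-techpreview/"
       "1965715986610917376/")
bucket_path, build_id, prowjob_name = parse_gcsweb_url(url)
```

For the example URL, this yields the build ID `1965715986610917376` and the prowjob name `periodic-ci-openshift-release-master-ci-4.20-e2e-aws-ovn-techpreview`.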
**Parse and validate**

Read `.work/prow-job-extract-must-gather/{build_id}/tmp/prowjob.json` and search for `--target=([a-zA-Z0-9-]+)`.

**Extract target name**

Capture the target from the match (e.g. `e2e-aws-ovn-techpreview`).

**Construct must-gather path**

- Source: `gs://test-platform-results/{bucket-path}/artifacts/{target}/gather-must-gather/artifacts/must-gather.tar`
- Destination: `.work/prow-job-extract-must-gather/{build_id}/tmp/must-gather.tar`

**Download must-gather.tar**
gcloud storage cp gs://test-platform-results/{bucket-path}/artifacts/{target}/gather-must-gather/artifacts/must-gather.tar .work/prow-job-extract-must-gather/{build_id}/tmp/must-gather.tar --no-user-output-enabled
Use `--no-user-output-enabled` to suppress progress output.

IMPORTANT: Use the provided Python script `extract_archives.py` from the skill directory.
Usage:
python3 plugins/prow-job/skills/prow-job-extract-must-gather/extract_archives.py \
.work/prow-job-extract-must-gather/{build_id}/tmp/must-gather.tar \
.work/prow-job-extract-must-gather/{build_id}/logs
What the script does:
1. **Extract must-gather.tar** into the `{build_id}/logs/` directory
2. **Rename long subdirectory to `content/`**, e.g. `registry-build09-ci-openshift-org-ci-op-m8t77165-stable-sha256-d1ae126eed86a47fdbc8db0ad176bf078a5edebdbb0df180d73f02e5f03779e0/` becomes `content/`
3. **Recursively process nested archives**
For .tar.gz and .tgz files:
# Extract in place
with tarfile.open(archive_path, 'r:gz') as tar:
tar.extractall(path=parent_dir)
# Remove original archive
os.remove(archive_path)
For .gz files (no tar):
# Gunzip in place
with gzip.open(gz_path, 'rb') as f_in:
with open(output_path, 'wb') as f_out:
shutil.copyfileobj(f_in, f_out)
# Remove original archive
os.remove(gz_path)
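One way to combine these handlers into the recursive pass is sketched below. This mirrors the fragments above but is an illustration only; the actual internals of `extract_archives.py` may differ:

```python
import gzip
import os
import shutil
import tarfile

def process_nested_archives(root: str) -> int:
    """Repeatedly walk `root`, expanding .tar.gz/.tgz/.gz files until none remain."""
    extracted = 0
    changed = True
    while changed:  # re-walk: each extraction may reveal new nested archives
        changed = False
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                if name.endswith((".tar.gz", ".tgz")):
                    # Extract tarball in place, then remove the original archive
                    with tarfile.open(path, "r:gz") as tar:
                        tar.extractall(path=dirpath)
                    os.remove(path)
                elif name.endswith(".gz"):
                    # Gunzip in place (strip the ".gz" suffix), then remove the original
                    out = path[:-3]
                    with gzip.open(path, "rb") as f_in, open(out, "wb") as f_out:
                        shutil.copyfileobj(f_in, f_out)
                    os.remove(path)
                else:
                    continue
                extracted += 1
                changed = True
    return extracted
```

The outer `while` loop re-walks the tree until a full pass finds no archives, which handles arbitrarily nested cases such as a `.gz` inside a `.tar.gz`.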
Progress reporting
Error handling
IMPORTANT: Use the provided Python script generate_html_report.py from the skill directory.
Usage:
python3 plugins/prow-job/skills/prow-job-extract-must-gather/generate_html_report.py \
.work/prow-job-extract-must-gather/{build_id}/logs \
"{prowjob_name}" \
"{build_id}" \
"{target}" \
"{gcsweb_url}"
Output: The script generates .work/prow-job-extract-must-gather/{build_id}/must-gather-browser.html
What the script does:
1. **Scan directory tree**: walk the `{build_id}/logs/` directory
2. **Classify files** by extension:
   - Logs: `.log`, `.txt`
   - YAML: `.yaml`, `.yml`
   - JSON: `.json`
   - XML: `.xml`
   - Certificates: `.crt`, `.pem`, `.key`
   - Archives: `.tar`, `.gz`, `.tgz`, `.tar.gz`
3. **Generate HTML structure**
Header Section:
<div class="header">
<h1>Must-Gather File Browser</h1>
<div class="metadata">
<p><strong>Prow Job:</strong> {prowjob-name}</p>
<p><strong>Build ID:</strong> {build_id}</p>
<p><strong>gcsweb URL:</strong> <a href="{original-url}">{original-url}</a></p>
<p><strong>Target:</strong> {target}</p>
<p><strong>Total Files:</strong> {count}</p>
<p><strong>Total Size:</strong> {human-readable-size}</p>
</div>
</div>
Filter Controls:
<div class="filters">
<div class="filter-group">
<label class="filter-label">File Type (multi-select)</label>
<div class="filter-buttons">
<button class="filter-btn" data-filter="type" data-value="log">Logs ({count})</button>
<button class="filter-btn" data-filter="type" data-value="yaml">YAML ({count})</button>
<button class="filter-btn" data-filter="type" data-value="json">JSON ({count})</button>
<!-- etc -->
</div>
</div>
<div class="filter-group">
<label class="filter-label">Filter by Regex Pattern</label>
<input type="text" class="search-box" id="pattern" placeholder="Enter regex pattern (e.g., .*etcd.*, .*\.log$)">
</div>
<div class="filter-group">
<label class="filter-label">Search by Name</label>
<input type="text" class="search-box" id="search" placeholder="Search file names...">
</div>
</div>
File List:
<div class="file-list">
<div class="file-item" data-type="{type}" data-path="{path}">
<div class="file-icon">{icon}</div>
<div class="file-info">
<div class="file-name">
<a href="{relative-path}" target="_blank">{file-name}</a>
</div>
<div class="file-meta">
<span class="file-path">{directory-path}</span>
<span class="file-size">{size}</span>
<span class="file-type badge badge-{type}">{type}</span>
</div>
</div>
</div>
</div>
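The `data-type` values in the markup above come from extension-based classification. A plausible mapping, following the categories listed earlier (the exact set used by `generate_html_report.py` is an assumption):

```python
import os

# Extension-to-type map mirroring the categories listed earlier;
# the exact categories used by the real script may differ.
TYPE_BY_EXT = {
    ".log": "log", ".txt": "log",
    ".yaml": "yaml", ".yml": "yaml",
    ".json": "json",
    ".xml": "xml",
    ".crt": "cert", ".pem": "cert", ".key": "cert",
    ".tar": "archive", ".gz": "archive", ".tgz": "archive",
}

def classify(path: str) -> str:
    """Return the file-type badge for a path, defaulting to 'other'."""
    if path.endswith(".tar.gz"):  # special-case the compound extension
        return "archive"
    ext = os.path.splitext(path)[1].lower()
    return TYPE_BY_EXT.get(ext, "other")
```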
CSS Styling:
JavaScript Interactivity:
// Multi-select file type filters
document.querySelectorAll('.filter-btn').forEach(btn => {
    btn.addEventListener('click', function() {
        this.classList.toggle('active');  // toggle active state
        applyFilters();
    });
});

// Regex pattern filter
document.getElementById('pattern').addEventListener('input', applyFilters);

// Name search filter
document.getElementById('search').addEventListener('input', applyFilters);

// Combine all active filters
function applyFilters() {
    const activeTypes = Array.from(document.querySelectorAll('.filter-btn.active'))
        .map(btn => btn.dataset.value);
    const patternText = document.getElementById('pattern').value;
    let regex = null;
    try {
        if (patternText) regex = new RegExp(patternText);
    } catch (e) {
        // Ignore partially typed / invalid regex input
    }
    const query = document.getElementById('search').value.toLowerCase();

    document.querySelectorAll('.file-item').forEach(item => {
        const path = item.dataset.path;
        const typeOk = activeTypes.length === 0 || activeTypes.includes(item.dataset.type);
        const regexOk = regex === null || regex.test(path);
        const searchOk = query === '' || path.toLowerCase().includes(query);
        item.style.display = (typeOk && regexOk && searchOk) ? '' : 'none';
    });
}
Statistics Section:
<div class="stats">
<div class="stat">
<div class="stat-value">{total-files}</div>
<div class="stat-label">Total Files</div>
</div>
<div class="stat">
<div class="stat-value">{total-size}</div>
<div class="stat-label">Total Size</div>
</div>
<div class="stat">
<div class="stat-value">{log-count}</div>
<div class="stat-label">Log Files</div>
</div>
<div class="stat">
<div class="stat-value">{yaml-count}</div>
<div class="stat-label">YAML Files</div>
</div>
<!-- etc -->
</div>
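The per-type counts and totals in the statistics section can be gathered in a single walk. A sketch under the assumption that statistics are keyed by file extension (the real script may key by classified type instead):

```python
import os
from collections import Counter

def gather_stats(root: str):
    """Walk `root`, returning total file count, total bytes, and per-extension counts."""
    counts = Counter()
    total_files = 0
    total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            total_files += 1
            total_bytes += os.path.getsize(os.path.join(dirpath, name))
            counts[os.path.splitext(name)[1].lower() or "(none)"] += 1
    return total_files, total_bytes, counts
```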
**Write HTML to file**

Write the report to `.work/prow-job-extract-must-gather/{build_id}/must-gather-browser.html`.

**Display summary**
Must-Gather Extraction Complete
Prow Job: {prowjob-name}
Build ID: {build_id}
Target: {target}
Extraction Statistics:
- Total files: {file-count}
- Total size: {human-readable-size}
- Archives extracted: {archive-count}
- Log files: {log-count}
- YAML files: {yaml-count}
- JSON files: {json-count}
Extracted to: .work/prow-job-extract-must-gather/{build_id}/logs/
File browser generated: .work/prow-job-extract-must-gather/{build_id}/must-gather-browser.html
Open in browser to browse and search extracted files.
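The `{human-readable-size}` placeholders in the summary can be produced by a small helper like the following (a sketch; the script's own formatting may differ):

```python
def human_size(num_bytes: int) -> str:
    """Format a byte count as a human-readable string, e.g. 234.0 MB."""
    size = float(num_bytes)
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if size < 1024:
            break
        size /= 1024
    # Whole numbers for bytes, one decimal place for larger units
    return f"{int(size)} {unit}" if unit == "B" else f"{size:.1f} {unit}"
```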
Open report in browser
- Linux: `xdg-open .work/prow-job-extract-must-gather/{build_id}/must-gather-browser.html`
- macOS: `open .work/prow-job-extract-must-gather/{build_id}/must-gather-browser.html`
- Windows: `start .work/prow-job-extract-must-gather/{build_id}/must-gather-browser.html`

**Offer next steps**
Suggest exploring specific files under `.work/prow-job-extract-must-gather/{build_id}/logs/`.

Handle these error scenarios gracefully:
- **Invalid URL format**
- **Build ID not found**
- **gcloud not installed**: check with `which gcloud`
- **prowjob.json not found**
- **Not a ci-operator job**
- **must-gather.tar not found**
- **Corrupted archive**
- **No "-ci-" subdirectory found**
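Several of these scenarios can be caught before any download with a small preflight check. A sketch with illustrative error messages (not the skill's actual code):

```python
import re
import shutil

def preflight(url: str) -> list[str]:
    """Return error messages for problems detectable before any download."""
    errors = []
    # gcloud not installed
    if shutil.which("gcloud") is None:
        errors.append("gcloud not installed: install the Google Cloud CLI first")
    # Invalid URL format
    if "test-platform-results/" not in url:
        errors.append("Invalid URL format: expected a gcsweb test-platform-results URL")
    # Build ID not found
    elif not re.search(r"/\d{10,}(?:/|$)", url):
        errors.append("Build ID not found: no 10+ digit path segment in URL")
    return errors
```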
- **Avoid re-extracting**: skip extraction when `.work/prow-job-extract-must-gather/{build_id}/logs/` already has content
- **Efficient downloads**: use `gcloud storage cp` with `--no-user-output-enabled` to suppress verbose output
- **Memory efficiency**
- **Progress indicators**
User: "Extract must-gather from this Prow job: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.20-e2e-aws-ovn-techpreview/1965715986610917376"
Output:
- Downloads must-gather.tar to: .work/prow-job-extract-must-gather/1965715986610917376/tmp/
- Extracts to: .work/prow-job-extract-must-gather/1965715986610917376/logs/
- Renames long subdirectory to: content/
- Processes 247 nested archives (.tar.gz, .tgz, .gz)
- Creates: .work/prow-job-extract-must-gather/1965715986610917376/must-gather-browser.html
- Opens browser with interactive file list (3,421 files, 234 MB)
- Uses the `.work/prow-job-extract-must-gather/{build_id}/` directory structure for organization
- Keeps everything under `.work/`, which is already in `.gitignore`
- Checks `.work/prow-job-extract-must-gather/{build_id}/` to avoid re-extraction

**Archive Processing:**
Directory Renaming:
File Type Detection:
Regex Pattern Filtering:
Working with Scripts:
- `plugins/prow-job/skills/prow-job-extract-must-gather/extract_archives.py`: extracts and processes archives
- `plugins/prow-job/skills/prow-job-extract-must-gather/generate_html_report.py`: generates the interactive HTML file browser