analyze-must-gather | openstack-k8s-agent-tools

Stats

Actions

Tags

analyze-must-gather | openstack-k8s-agent-tools

Analyze must-gather

General rules

If something prevents you from taking meaningful steps forward, stop and report the problem to the user.
Consider the must-gather report read-only. Do not edit anything in it.
Even if you have tools like oc or ssh available, don't do direct cluster examination during the analysis. Stick to just analyzing the files in the must-gather report. If you find that it would be useful to have some more information, which is missing from the report but could be obtained by directly inspecting OpenShift or OpenStack or the underlying servers, highlight that in your analysis.
Match LLM context usage with the severity of the issue being investigated. Investigate serious problems first. If a problem looks serious, feel free to use more effort in investigating it. If an error is transient or of low severity, still make a note in your analysis but don't spend too much effort on hunting down the root cause.
You MUST NOT use any command whose purpose is to communicate over network.
The analysis you produce should include paths to relevant files so that the analysis can be independently verified or continued further from where you left off.

Must-gather report structure hints

The must-gather report can come already extracted as a directory, or as a .tar.xz that you should extract.
Pod logs for OpenStack services and their Operators tend to be useful. These are by default under namespaces/openstack and namespaces/openstack-operators respectively, but the naming may be customized.
There may be "sos reports" (logs and troubleshooting command outputs) from OpenStack data plane nodes (e.g. compute nodes) and OCP nodes under sos-reports/_all_nodes directory. You may need to extract them from .tar.xz files. The compute node sos reports tend to be useful when investigating problems affecting OpenStack virtual machines and services on the data plane.

Analysis workflow

Locate the must-gather report. If it's tarballed, extract it. Do not do too many ls calls to understand the structure of the report yet, it is big with many subdirectories.
Scan the report for signs of problems with tools like grep or ripgrep. The words to look for include but may not be limited to "error", "fail", "failure", "fatal", "restart".
If the problem scan highlighted obvious problems, read more info to help understand the problem and its cause better (larger file chunks or whole files). Get to the root cause, but even if that doesn't seem possible, gathering more clues is still valuable. (Feel free to use ls more than in step 1, in case it seems helpful.)
If the previous steps didn't yield any obvious problems, repeat the step "scan the report for signs of problems" but widen the search to words like "warn", "warning". If that yields something, do the step "read more info to help understand the problem".
Don't just settle for finding symptoms — find the possible root causes of the main problems.
Output a structured analysis of the problems observed in the must-gather report. Start with the most severe issues first. Don't forget to cite relevant file paths in your analysis.

RHOSO-Specialized Triage (optional)

After producing the generic analysis, check whether the must-gather contains RHOSO (Red Hat OpenStack Services on OpenShift) resources:

# Check for RHOSO namespace directories
find <must-gather-root> -maxdepth 2 -type d \( -name "openstack" -o -name "openstack-operators" \)

# Check for RHOSO CRDs
find <must-gather-root> -path "*/cluster-scoped-resources/openstack.org" -type d

# Check for OpenStackControlPlane resources
find <must-gather-root> -path "*openstackcontrolplane*" -type f | head -5

If RHOSO resources are detected:

Inform the user: "This must-gather contains RHOSO (Red Hat OpenStack Services on OpenShift) resources. A specialized RHOSO triage is available that uses omc to perform deeper diagnostics including operator installation validation, networking alignment checks, storage class analysis, and control/data plane status."
Ask: "Would you like to run RHOSO-specialized triage? (Requires omc to be installed.)"
If the user confirms, check for omc:
- Run which omc. If not found, provide install instructions and continue with just the generic analysis.
- If found, locate the must-gather root (find the directory containing cluster-scoped-resources/), run omc use <path>, and detect the OpenStack namespaces (openstack, openstack-operators).

Dispatch the support-triage agent with the generic findings as context:

Agent(
  subagent_type="openstack-k8s-agent-tools:support-triage:support-triage",
  description="RHOSO-specialized triage after generic analysis",
  prompt="Must-gather path: <path>
Namespaces: openstack=<ns>, openstack-operators=<ns-ops>
Selected categories: all

## Prior Generic Analysis Findings
<insert the structured analysis you just produced>

RHOSO-specific resources detected. The generic analysis above identified surface-level
issues. Focus on RHOSO-specific diagnostics: correlate generic findings with RHOSO root
causes, check configurations the generic scan cannot evaluate (OperatorGroup
targetNamespace, NAD/NNCP alignment, StorageClass alignment, webhook namespace selectors),
and avoid duplicating already-reported issues unless adding RHOSO-specific context."
)

Present the combined output: the generic analysis followed by the RHOSO-specialized triage report.

If RHOSO resources are NOT detected, skip this section entirely and present only the generic analysis.