Performs root cause analysis on failed Ansible Automation Platform jobs by correlating local AAP logs with Splunk OCP pod logs and retrieving AgnosticD/AgnosticV configs from GitHub.
Install via the plugin hub:

npx claudepluginhub redhat-et/rhdp-rca-plugin
Investigate failed jobs by correlating Ansible Automation Platform (AAP) job logs with Splunk OCP pod logs and analyzing AgnosticD/AgnosticV configuration to identify root causes.
When a user asks to analyze a failed job, execute these steps automatically. The skill's base path is provided when this skill is invoked. Run scripts relative to this folder.
# Create virtual environment and install dependencies (if .venv doesn't exist)
python3 -m venv .venv && .venv/bin/pip install -q -r requirements.txt
# Check all prerequisites (use --json for structured output)
.venv/bin/python scripts/cli.py setup --json
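Once the command returns, the JSON report can be triaged in code. A minimal sketch, assuming hypothetical field names (`checks`, `status`, `configurable`) rather than the documented schema:

```python
import json

# A hypothetical example of the report from `scripts/cli.py setup --json`;
# the field names here are assumptions, not the skill's actual schema.
report = json.loads("""
{
  "checks": [
    {"name": "JOB_LOGS_DIR", "status": "ok", "configurable": true},
    {"name": "GITHUB_TOKEN", "status": "missing", "configurable": true},
    {"name": "rsync", "status": "missing", "configurable": false}
  ]
}
""")

def triage(report):
    """Split missing checks into those the user can configure
    interactively and those that need a manual fix."""
    missing = [c for c in report["checks"] if c["status"] == "missing"]
    configurable = [c["name"] for c in missing if c["configurable"]]
    manual = [c["name"] for c in missing if not c["configurable"]]
    return configurable, manual

configurable, manual = triage(report)
```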
Review the JSON output. Some settings are required, others are optional:
- Required (the skill will not proceed without these)
- Recommended (analysis works without these, but functionality is reduced)
- Optional (the skill runs with reduced functionality when missing -- e.g., the --fetch flag won't work and the user must provide logs in JOB_LOGS_DIR manually)

If any checks have `"status": "missing"` and `"configurable": true`, offer to help the user configure them:
For each check with `"configurable": true`:
- Use `env_vars[].prompt` to explain what's needed
- If there is a `"default"`, mention it (the user can press enter to accept it)
- If `"optional": true`, let the user know they can skip it
- If `"ssh_setup_needed": true`:
  - Check whether a matching Host entry already exists in `~/.ssh/config` -- if so, use it as REMOTE_HOST
  - Otherwise read `~/.ssh/config`, append the new Host block, write it back, and set REMOTE_HOST to the alias name

Then write the configuration into the user's `.claude/settings.json` file:
- Add the environment variables to the `"env"` block (create it if it doesn't exist)
- Add the following to the `"hooks"` block (create it if it doesn't exist):
"hooks": {
"SessionStart": [
{
"hooks": [
{
"type": "command",
"command": "INPUT=$(cat); SESSION_ID=$(echo \"$INPUT\" | jq -r '.session_id'); echo \"export CLAUDE_SESSION_ID='$SESSION_ID'\" >> \"$CLAUDE_ENV_FILE\""
},
{
"type": "command",
"command": "if ! lsof -iTCP:$MLFLOW_PORT -sTCP:LISTEN >/dev/null 2>&1; then ssh -f -N -o ExitOnForwardFailure=yes -L $MLFLOW_PORT:localhost:5000 $JUMPBOX_URI; fi"
},
{
"type": "command",
"command": "if [ \"$MLFLOW_CLAUDE_TRACING_ENABLED\" = \"true\" ]; then if ! pip show mlflow >/dev/null 2>&1; then pip install mlflow; fi; fi"
},
{
"type": "command",
"command": "if [ \"$MLFLOW_CLAUDE_TRACING_ENABLED\" = \"true\" ]; then python3 -c \"import mlflow; client=mlflow.tracking.MlflowClient(tracking_uri='http://127.0.0.1:$MLFLOW_PORT'); exp=client.get_experiment_by_name('$MLFLOW_EXPERIMENT_NAME'); client.restore_experiment(exp.experiment_id) if exp and exp.lifecycle_stage == 'deleted' else None\" && mlflow autolog claude -u http://127.0.0.1:$MLFLOW_PORT -n $MLFLOW_EXPERIMENT_NAME; fi"
}
]
}
],
"Stop": [{ "hooks": [{ "type": "command", "command": "python -c \"from mlflow.claude_code.hooks import stop_hook_handler; stop_hook_handler()\"" }] }]
}
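The settings.json edits described above can be sketched in Python. This is illustrative only; the helper names and argument shapes are our own, not part of the skill:

```python
import json
from pathlib import Path

def merge_settings(settings, env_updates, hook_updates):
    """Merge new entries into the "env" and "hooks" blocks of a
    settings dict, creating either block if it doesn't exist.
    Key names mirror the .claude/settings.json example above."""
    settings.setdefault("env", {}).update(env_updates)
    hooks = settings.setdefault("hooks", {})
    for event, entries in hook_updates.items():
        hooks.setdefault(event, []).extend(entries)
    return settings

def update_settings_file(path, env_updates, hook_updates):
    """Read .claude/settings.json (if present), merge, and write back."""
    p = Path(path)
    settings = json.loads(p.read_text()) if p.exists() else {}
    merged = merge_settings(settings, env_updates, hook_updates)
    p.write_text(json.dumps(merged, indent=2) + "\n")
    return merged
```

Merging (rather than overwriting) keeps any hooks or env vars the user already has in the file.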
After updating `.claude/settings.json`, ensure this file is in `.gitignore`.

If checks show non-configurable errors (e.g., venv issues, rsync not found), provide the fix command instead.
The MLFlow server preflight check automatically handles server connectivity: when JUMPBOX_URI is configured, it starts an SSH tunnel automatically.

If any required checks (JOB_LOGS_DIR, JUMPBOX_URI) are still missing after the setup flow, do not proceed to analysis -- tell the user what's still needed. If MLFlow is missing, warn that tracing won't be recorded but proceed. If all required checks pass (recommended/optional items may remain missing), proceed to analysis.
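The listen check used by the SessionStart hook can also be done from Python. This is a sketch of the same idea as `lsof -iTCP:$MLFLOW_PORT -sTCP:LISTEN`, using a plain TCP connect; the helper name is ours, not part of the skill:

```python
import socket

def port_is_listening(port, host="127.0.0.1", timeout=1.0):
    """Return True if something accepts TCP connections on the port.
    If this returns False, the preflight would start the SSH tunnel:
      ssh -f -N -o ExitOnForwardFailure=yes \
          -L $MLFLOW_PORT:localhost:5000 $JUMPBOX_URI
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```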
Always use --fetch when analyzing by job ID. This automatically downloads the log from the remote server if it's not already present locally, and skips fetching if the log is already there.
# By job ID (auto-fetches log from remote if not found locally)
.venv/bin/python scripts/cli.py analyze --job-id <JOB_ID> --fetch
# By explicit path (when you already have the log file)
.venv/bin/python scripts/cli.py analyze --job-log <path-to-job-log>
The cli.py analyze command automatically runs all steps:
1. Parse the AAP job log into structured job context
2. Query Splunk for the matching OCP pod logs
3. Correlate the AAP and Splunk timelines
4. Fetch AgnosticD/AgnosticV configs from GitHub (requires GITHUB_TOKEN to be configured)

Outputs: `.analysis/<job-id>/step1_job_context.json`, `step2_splunk_logs.json`, `step3_correlation.json`, `step4_github_fetch_history.json`
This skill automatically searches for job logs in the configured JOB_LOGS_DIR.
From this skill's directory:
# Setup virtual environment (one time)
python3 -m venv .venv
.venv/bin/pip install -r requirements.txt
# Analyze by job ID (auto-fetches log if needed, runs steps 1-4)
.venv/bin/python scripts/cli.py analyze --job-id 1234567 --fetch
# Or analyze with explicit path
.venv/bin/python scripts/cli.py analyze --job-log /path/to/job_123.json.gz
Post-Step4 GitHub MCP Verification: If the step4 output contains "error": "all_paths_failed" or any error status (e.g., "status": "404", "status": "timeout", "status": "500") in the paths_tried arrays, verify those errors using MCP tools. See post-step4-verification.md for the complete verification process.
Input: Read the following files in order:
1. step1_job_context.json - Job metadata and failed task details
2. step3_correlation.json - Correlated timeline with relevant pod logs (DO NOT read step2 unless needed)
3. step4_github_fetch_history.json - Configuration and code context
4. step2_splunk_logs.json - Only read if step3 indicates errors needing deeper investigation

Output: .analysis/<job-id>/step5_analysis_summary.json
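The reading order for the Step 5 inputs can be sketched as a small loader. The helper name and `base` default are illustrative, not part of the skill; step2 is intentionally skipped unless step3 flags the need:

```python
import json
from pathlib import Path

def load_step5_inputs(job_id, base=".analysis"):
    """Read the Step 5 input files in the documented order.
    Dicts preserve insertion order, so iteration order matches
    reading order."""
    d = Path(base) / str(job_id)
    order = [
        "step1_job_context.json",
        "step3_correlation.json",
        "step4_github_fetch_history.json",
    ]
    return {name: json.loads((d / name).read_text()) for name in order}
```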
Post-Step 5 Action: After saving the summary, you MUST run the upload command to send the analysis to the Jumpbox:
.venv/bin/python scripts/cli.py upload --job-id <job-id>
Configuration Analysis:
- Check variable definitions and references in the AgnosticV config (includes/secrets, `!vault` values)

Task Analysis:
- Map common failing modules to likely causes: `kubernetes.core.k8s_info` (RBAC/resource), `ansible.builtin.uri` (network/auth), `command`/`shell` (paths/permissions)
- Check for `when:` conditions executed incorrectly (variable precedence issue)

Summary Requirements:
- Root cause with a category (`configuration|infrastructure|workload_bug|credential|resource|dependency`), summary, and confidence
- When an evidence source is agnosticv_config or agnosticd_code, MUST include github_path in the format `owner/repo:path/to/file.yml:line`
- For `step4.github_fetches[].fetched_configs.{purpose}.path` → construct as `{config_owner}/{config_repo}:{path}` (e.g., example-org/config-repo:platform/account.yaml)
- For `step4.github_fetches[].location.parsed` → construct as `{owner}/{repo}:{file_path}:{line_number}` (e.g., example-org/workload-repo:roles/example-role/tasks/main.yml:42)
- Include `github_path` in recommendations when referencing GitHub files (format: owner/repo:path/to/file.yml:line)

Note: Job details, failed tasks, and configuration data are available in the step1 and step4 files - reference them rather than duplicating them in the summary.
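The github_path construction rules can be sketched as two small helpers (the function names are ours, not part of the skill); the expected outputs match the examples given above:

```python
def github_path_from_config(config_owner, config_repo, path):
    """github_path for an agnosticv_config evidence entry:
    {config_owner}/{config_repo}:{path}"""
    return f"{config_owner}/{config_repo}:{path}"

def github_path_from_location(owner, repo, file_path, line_number):
    """github_path for an agnosticd_code evidence entry:
    {owner}/{repo}:{file_path}:{line_number}"""
    return f"{owner}/{repo}:{file_path}:{line_number}"
```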
See schemas/summary.schema.json for complete structure. Example:
{
"job_id": "{job_id}",
"analyzed_at": "2025-01-15T10:30:45Z",
"root_cause": {
"summary": "Brief description of root cause",
"category": "configuration",
"confidence": "high"
},
"correlation": {
"method": "namespace_time_match",
"identifiers": {
"guid": "{guid}",
"namespace": "{namespace}",
"pod_name": "{pod_name}"
},
"time_overlap": {
"aap_job_start": "2025-01-15T10:30:00Z",
"aap_job_end": "2025-01-15T10:35:00Z",
"splunk_first_error": "2025-01-15T10:30:15Z",
"splunk_last_error": "2025-01-15T10:34:45Z",
"overlap_confirmed": true
}
},
"evidence": [
{
"source": "aap_job",
"timestamp": "2025-01-15T10:30:45Z",
"message": "'aws_access_key_id' is undefined"
},
{
"source": "agnosticv_config",
"timestamp": "2025-01-15T10:30:45Z",
"message": "Missing variable 'aws_access_key_id' in environment config",
"github_path": "example-org/config-repo:platform/account.yaml"
},
{
"source": "agnosticd_code",
"timestamp": "2025-01-15T10:30:45Z",
"message": "Task at line 42 uses undefined variable",
"github_path": "example-org/workload-repo:roles/example-role/tasks/main.yml:42"
}
],
"recommendations": [
{
"priority": "high",
"action": "Add missing variable",
"file": "platform/account.yaml",
"github_path": "example-org/config-repo:platform/account.yaml",
"github_url": "https://github.com/example-org/config-repo/blob/main/platform/account.yaml",
"change": "Add aws_access_key_id variable",
"details": "Variable is referenced but not defined"
}
],
"contributing_factors": ["Missing variable definition", "Incomplete configuration"]
}
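The `time_overlap.overlap_confirmed` field in the example can be sanity-checked with a small helper. The containment semantics (the Splunk error window falling inside the AAP job window) is our assumption about what "overlap confirmed" means:

```python
from datetime import datetime

def parse_ts(ts):
    """Parse the UTC timestamps used in the summary, e.g. 2025-01-15T10:30:00Z."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

def overlap_confirmed(time_overlap):
    """Return True if the Splunk error window lies within the AAP job
    window, mirroring the time_overlap block in the example above."""
    start = parse_ts(time_overlap["aap_job_start"])
    end = parse_ts(time_overlap["aap_job_end"])
    first = parse_ts(time_overlap["splunk_first_error"])
    last = parse_ts(time_overlap["splunk_last_error"])
    return start <= first and last <= end
```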
| Step | File | Author |
|---|---|---|
| 1 | step1_job_context.json | Python |
| 2 | step2_splunk_logs.json | Python |
| 3 | step3_correlation.json | Python |
| 4 | step4_github_fetch_history.json | Python (Optional Claude updates for MCP verification) |
| 5 | step5_analysis_summary.json | Claude |
All files in .analysis/<job-id>/