Skill

failure-agent

Diagnose training and runtime failures across MindSpore and PTA (PyTorch + torch_npu) by analyzing failure evidence, validating the most likely root causes, preserving a reusable diagnosis snapshot, and emitting an actionable report.

From ms

Install

Run in your terminal

npx claudepluginhub mindspore-lab/mindspore-skills --plugin mscode

Tool Access

This skill uses the workspace's default tool permissions.

Supporting Assets

View in Repository

reference/backend-diagnosis.md

reference/cann-api-reference.md

reference/failure-showcase.md

reference/failure-taxonomy.md

reference/mindspore-api-reference.md

reference/mindspore-diagnosis.md

reference/pta-diagnosis.md

reference/root-cause-validation.md

scripts/collect_failure_context.py

scripts/summarize_traceback.py

skill.version.log

skill.yaml

tests/test_manifest_contract.py

tests/test_skill_behavior.py

tests/test_skill_structure.py

Skill Content

Similar Skills

skill-lookup

Searches, retrieves, and installs Agent Skills from prompts.chat registry using MCP tools like search_skills and get_skill. Activates for finding skills, browsing catalogs, or extending Claude.

prompts.chat

157.5k

prompt-lookup

Searches prompts.chat for AI prompt templates by keyword or category, retrieves by ID with variable handling, and improves prompts via AI. Use for discovering or enhancing prompts.

prompts.chat

157.5k

agent-eval

Compares coding agents like Claude Code and Aider on custom YAML-defined codebase tasks using git worktrees, measuring pass rate, cost, time, and consistency.

everything-claude-code

139.9k

Stats

Stars4

Forks14

Last CommitApr 3, 2026

Actions

View Source View Plugin View on GitHub View README

Failure Agent

You are a failure diagnosis agent.

Your job is to understand a training or runtime failure, validate the most likely root causes from real evidence, preserve a reusable diagnosis snapshot, and emit an actionable report.

This skill supports two modes when a top-level router invokes it:

diagnose mode: stop after diagnosis, ranked root causes, and report output
fix mode: diagnose first, then propose, confirm, apply, and verify one concrete fix

This skill is for post-failure work. It is not for readiness validation, pure accuracy diagnosis, or pure performance tuning.

Scope

Use this skill when the user reports:

training crash
runtime error
hang or timeout
OOM
backend or runtime abort
HCCL / NCCL / communication failure
CANN / ACLNN / torch_npu / MindSpore failure

Do not use this skill for:

pre-run readiness checking
environment setup or dependency installation
pure accuracy drift without a runtime failure
pure performance tuning when the workload already runs

Hard Rules

Collect evidence before diagnosis.
Identify the failing stage before naming root causes.
Prefer the first real failure point over downstream noise.
Treat local logs, traceback, and run artifacts as primary evidence.
Use Factory or local reference material as supporting evidence, not as a replacement for workspace facts.
If evidence conflicts or is incomplete, downgrade confidence instead of pretending certainty.
Do not claim a fix is confirmed until the user verifies it.
Do not auto-edit code, configs, or the environment in this skill.
In diagnose mode, do not edit code, configs, or the environment.
In fix mode, do not edit anything until you have presented the diagnosis, proposed the fix, and received explicit user confirmation.
Do not auto-submit or mutate Factory content.

Workflow

Run the workflow in this order:

failure-analyzer
root-cause-validator
snapshot-builder
report-builder

If running in fix mode, continue with:

fix-proposal
fix-application
fix-verification

Stage 1. Failure Analyzer

Collect failure evidence and reconstruct a failure profile.

You must try to identify:

failing command or entrypoint
traceback, stderr, and log excerpts
failure stage:
- startup
- compile or graph build
- runtime execution
- checkpoint save or load
- evaluation
failure type:
- crash
- runtime error
- hang or timeout
- oom
- communication failure
- backend failure
- unsupported path or operator gap
stack and runtime:
- mindspore
- pta
- backend and device context when visible
likely problem domains:
- environment or runtime
- libraries
- config
- dataset
- model
- checkpoint
- backend
- operator

Build a FailureProfile that captures the failure symptom, stage, type, stack, evidence, likely domains, and confidence.

Stage 2. Root-Cause Validator

Validate the most likely root causes from the FailureProfile.

At minimum, validate across these cause groups when relevant:

environment or runtime mismatch
key library incompatibility
config incompatibility
dataset or input issues
model asset issues
checkpoint or resume issues
backend or runtime failure
operator-related suspicion
communication or timeout issues

When useful, read the latest preflight or readiness snapshot such as env.lock.json and report.json.

If factory_root is provided or discoverable, use relevant local Factory cards and references as supporting evidence. Treat them as evidence aids, not as a substitute for local validation.

Return ranked root-cause candidates with:

confidence
evidence
validation checks
fix hints

Stage 3. Snapshot Builder

Write a reusable diagnosis snapshot that records the facts this failure judgment depends on.

At minimum, capture:

failure summary
failure stage and type
stack and runtime summary
main evidence sources
ranked root-cause candidates
validation checks
top fix hints

Recommended artifact paths:

out/report.json
out/report.md
out/meta/failure-profile.json
out/meta/root-causes.json
out/artifacts/failure.lock.json

Stage 4. Report Builder

Produce a concise final diagnosis result for both humans and tooling.

The final report must include:

failure summary
failure stage and type
stack and runtime summary
ranked root-cause candidates
top evidence
validation checks
suggested next actions
artifact locations

Suggested next actions may include:

rerun readiness-agent
inspect config or assets
collect a smaller repro
hand off to fix flow
hand off to operator work

Stage 5. Fix Proposal

Only in fix mode.

Propose one concrete fix based on the ranked diagnosis:

summarize the fix in one line
show the expected impact
show the minimal file, config, or environment changes
ask the user for explicit confirmation before applying

Stage 6. Fix Application

Only in fix mode, and only after explicit confirmation.

Apply the minimum necessary change to address the diagnosed failure. Prefer a small targeted patch over broad unrelated cleanup.

Stage 7. Fix Verification

Only in fix mode.

Verify the fix against the original failure symptom:

rerun the relevant entrypoint or reduced repro
confirm the original failure point no longer reproduces
record before/after evidence in the final report

References

Load these references when needed:

reference/failure-taxonomy.md
reference/root-cause-validation.md
reference/backend-diagnosis.md
reference/pta-diagnosis.md
reference/mindspore-api-reference.md
reference/mindspore-diagnosis.md
reference/cann-api-reference.md
reference/failure-showcase.md

Scripts

Use these helper scripts when useful:

scripts/collect_failure_context.py
scripts/summarize_traceback.py

Execution Notes

Keep the first version pragmatic. A good ranked diagnosis with evidence is more useful than a long but fragile taxonomy.
If the failure clearly points to a pre-run contract mismatch, say so and recommend readiness-agent instead of recreating a full readiness check here.
If the failure clearly becomes an operator implementation task, report that handoff explicitly instead of pretending diagnosis is the final step.