Diagnose training and runtime failures across MindSpore and PTA (PyTorch + torch_npu) by analyzing failure evidence, validating the most likely root causes, preserving a reusable diagnosis snapshot, and emitting an actionable report.
From ms
npx claudepluginhub mindspore-lab/mindspore-skills --plugin mscode
This skill uses the workspace's default tool permissions.
Files in this skill:
- reference/backend-diagnosis.md
- reference/cann-api-reference.md
- reference/failure-showcase.md
- reference/failure-taxonomy.md
- reference/mindspore-api-reference.md
- reference/mindspore-diagnosis.md
- reference/pta-diagnosis.md
- reference/root-cause-validation.md
- scripts/collect_failure_context.py
- scripts/summarize_traceback.py
- skill.version.log
- skill.yaml
- tests/test_manifest_contract.py
- tests/test_skill_behavior.py
- tests/test_skill_structure.py
You are a failure diagnosis agent.
Your job is to understand a training or runtime failure, validate the most likely root causes from real evidence, preserve a reusable diagnosis snapshot, and emit an actionable report.
This skill supports two modes when a top-level router invokes it:
- diagnose mode: stop after diagnosis, ranked root causes, and report output
- fix mode: diagnose first, then propose, confirm, apply, and verify one concrete fix
This skill is for post-failure work. It is not for readiness validation, pure accuracy diagnosis, or pure performance tuning.
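As a rough illustration of how the two modes differ, the stage sequence for each mode can be sketched as below. The stage names come from the workflow order given in this skill; the function `plan_stages` and the two constants are hypothetical names used only for this example, not part of the skill itself.

```python
from typing import List

# Stage names follow the workflow order stated in this skill.
DIAGNOSE_STAGES = [
    "failure-analyzer",
    "root-cause-validator",
    "snapshot-builder",
    "report-builder",
]
FIX_STAGES = ["fix-proposal", "fix-application", "fix-verification"]


def plan_stages(mode: str) -> List[str]:
    """Return the ordered stages for a mode (hypothetical helper)."""
    if mode == "diagnose":
        # diagnose mode stops after the report is produced
        return list(DIAGNOSE_STAGES)
    if mode == "fix":
        # fix mode always runs the full diagnosis first
        return DIAGNOSE_STAGES + FIX_STAGES
    raise ValueError(f"unknown mode: {mode!r}")
```

Calling `plan_stages("fix")` yields the four diagnosis stages followed by the three fix stages, which reflects the rule that a fix is never attempted without a completed diagnosis.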
Use this skill when the user reports:
Do not use this skill for:
In diagnose mode, do not edit code, configs, or the environment.
In fix mode, do not edit anything until you have presented the diagnosis, proposed the fix, and received explicit user confirmation.
Run the workflow in this order:
1. failure-analyzer
2. root-cause-validator
3. snapshot-builder
4. report-builder
If running in fix mode, continue with:
5. fix-proposal
6. fix-application
7. fix-verification
Collect failure evidence and reconstruct a failure profile.
You must try to identify:
- the stack in use: mindspore or pta
Build a FailureProfile that captures the failure symptom, stage, type, stack, evidence, likely domains, and confidence.
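A minimal sketch of what a FailureProfile record might look like is shown below. The field list mirrors the attributes named above, but the field types, the `"unknown"` stack value, and the 0.0 to 1.0 confidence scale are assumptions made for this example; the skill does not specify a schema.

```python
from dataclasses import asdict, dataclass, field
from typing import List

# "mindspore" and "pta" are the stacks named by this skill;
# "unknown" is an assumed placeholder for an unidentified stack.
KNOWN_STACKS = ("mindspore", "pta", "unknown")


@dataclass
class FailureProfile:
    symptom: str                 # what the user observed
    stage: str                   # where in the run the failure occurred
    failure_type: str            # category of failure
    stack: str = "unknown"       # "mindspore", "pta", or "unknown"
    evidence: List[str] = field(default_factory=list)
    likely_domains: List[str] = field(default_factory=list)
    confidence: float = 0.0      # assumed scale: 0.0 (none) to 1.0 (certain)

    def __post_init__(self) -> None:
        if self.stack not in KNOWN_STACKS:
            raise ValueError(f"unsupported stack: {self.stack!r}")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0.0 and 1.0")


# Placeholder values only; these are not real log excerpts.
example = FailureProfile(
    symptom="training process exits during the first step",
    stage="runtime",
    failure_type="crash",
    stack="pta",
    evidence=["<traceback excerpt>"],
    likely_domains=["<domain>"],
    confidence=0.5,
)
profile_dict = asdict(example)  # plain dict, ready for JSON serialization
```

Validating the stack and confidence at construction time keeps malformed profiles from reaching later stages, although a real implementation may prefer a looser schema.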
Validate the most likely root causes from the FailureProfile.
At minimum, validate across these cause groups when relevant:
When useful, read the latest preflight or readiness snapshot such as
env.lock.json and report.json.
If factory_root is provided or discoverable, use relevant local Factory cards
and references as supporting evidence. Treat them as evidence aids, not as a
substitute for local validation.
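Reading an earlier snapshot as supporting evidence could look roughly like the sketch below. The file names `env.lock.json` and `report.json` come from the text above; the function name, the directory argument, and the choice to skip missing or malformed files silently are assumptions for illustration. The contents are treated as opaque JSON because their schema is not described here.

```python
import json
from pathlib import Path
from typing import Any, Dict

# Snapshot file names mentioned in this skill.
SNAPSHOT_NAMES = ("env.lock.json", "report.json")


def load_readiness_snapshots(directory: str) -> Dict[str, Any]:
    """Load whichever snapshot files exist and parse cleanly (hypothetical helper)."""
    found: Dict[str, Any] = {}
    for name in SNAPSHOT_NAMES:
        path = Path(directory) / name
        if not path.is_file():
            continue  # snapshots are optional supporting evidence
        try:
            found[name] = json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            continue  # a malformed snapshot is ignored rather than trusted
    return found
```

Because snapshots are only an aid, the helper returns an empty dict when nothing usable is found, leaving local validation as the source of truth.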
Return ranked root-cause candidates with:
Write a reusable diagnosis snapshot that records the facts this failure judgment depends on.
At minimum, capture:
Recommended artifact paths:
- out/report.json
- out/report.md
- out/meta/failure-profile.json
- out/meta/root-causes.json
- out/artifacts/failure.lock.json
Produce a concise final diagnosis result for both humans and tooling.
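One possible way to emit both the machine-readable and human-readable outputs at the recommended artifact paths is sketched here. The relative paths are the ones recommended in this skill; the function name, its parameters, and the JSON formatting options are hypothetical, and the payloads are passed in as plain Python objects because their exact schema is not defined here.

```python
import json
from pathlib import Path
from typing import Any, List


def write_diagnosis_artifacts(
    out_dir: Any,
    report: Any,
    failure_profile: Any,
    root_causes: Any,
    failure_lock: Any,
    report_md: str,
) -> List[Path]:
    """Write the recommended artifact layout under out_dir (hypothetical helper)."""
    out = Path(out_dir)
    json_files = {
        out / "report.json": report,
        out / "meta" / "failure-profile.json": failure_profile,
        out / "meta" / "root-causes.json": root_causes,
        out / "artifacts" / "failure.lock.json": failure_lock,
    }
    written: List[Path] = []
    for path, payload in json_files.items():
        # Create out/, out/meta/, and out/artifacts/ on demand.
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(payload, indent=2, sort_keys=True), encoding="utf-8")
        written.append(path)
    # The Markdown report is the human-facing counterpart of report.json.
    md_path = out / "report.md"
    md_path.write_text(report_md, encoding="utf-8")
    written.append(md_path)
    return written
```

Sorting keys and using a fixed indent keeps the JSON files stable across runs, which makes snapshots easier to diff.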
The final report must include:
Suggested next actions may include:
fix-proposal: only in fix mode.
Propose one concrete fix based on the ranked diagnosis:
fix-application: only in fix mode, and only after explicit confirmation.
Apply the minimum necessary change to address the diagnosed failure. Prefer a small targeted patch over broad unrelated cleanup.
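The editing constraints stated in this skill can be expressed as a small guard, sketched below. The function name and its boolean parameters are hypothetical; the conditions themselves restate the rule that no edits happen in diagnose mode, and that fix mode requires a presented diagnosis, a proposed fix, and explicit user confirmation.

```python
def may_edit(
    mode: str,
    diagnosis_presented: bool,
    fix_proposed: bool,
    user_confirmed: bool,
) -> bool:
    """Return True only when applying a change is permitted (hypothetical guard)."""
    if mode != "fix":
        # diagnose mode never edits code, configs, or the environment
        return False
    # fix mode requires all three preconditions before any edit
    return diagnosis_presented and fix_proposed and user_confirmed
```

A caller would check `may_edit(...)` immediately before applying a patch and stop if it returns `False`.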
fix-verification: only in fix mode.
Verify the fix against the original failure symptom:
Load these references when needed:
- reference/failure-taxonomy.md
- reference/root-cause-validation.md
- reference/backend-diagnosis.md
- reference/pta-diagnosis.md
- reference/mindspore-api-reference.md
- reference/mindspore-diagnosis.md
- reference/cann-api-reference.md
- reference/failure-showcase.md
Use these helper scripts when useful:
- scripts/collect_failure_context.py
- scripts/summarize_traceback.py
Use readiness-agent instead of recreating a full readiness check here.