From curry-train
Diagnose a training failure or stall by inspecting recent logs, loss curves, OOM traces, NaN events, and config. Activate when the user asks "why did my training crash", "loss went to NaN", "OOM during step X", "training is not improving", or "help me debug this run". Delegates to the failure-diagnoser agent.
npx claudepluginhub curryfromuestc/curry-train --plugin curry-trainThis skill uses the workspace's default tool permissions.
Read the artifacts of a recent run, classify the failure mode, and produce a ranked list of likely causes with concrete fixes. Most of the analytical work is delegated to the `failure-diagnoser` agent.
Guides Next.js Cache Components and Partial Prerendering (PPR): 'use cache' directives, cacheLife(), cacheTag(), revalidateTag() for caching, invalidation, static/dynamic optimization. Auto-activates on cacheComponents: true.
Processes PDFs: extracts text/tables/images, merges/splits/rotates pages, adds watermarks, creates/fills forms, encrypts/decrypts, OCRs scans. Activates on PDF mentions or output requests.
Share bugs, ideas, or general feedback.
Read the artifacts of a recent run, classify the failure mode, and produce a ranked list of likely causes with concrete fixes. Most of the analytical work is delegated to the failure-diagnoser agent.
$1 (optional): path to a run directory or a single log file. If omitted, look for the most recently modified directory under runs/ or outputs/ in the current project.Resolve the target. If the user gave a directory, use it. Else find the most recent runs/<timestamp>/ or outputs/<timestamp>/. If nothing exists, ask the user where their logs are.
Collect the evidence bundle. Read at most:
*.log or stderr.txt in the directory.config.yaml (the resolved Hydra config).metrics.jsonl (last 200 entries).git_sha.txt, seed.txt, cuda_info.txt if present.core.* traceback or Python traceback files.Spawn the failure-diagnoser agent. Hand it the evidence bundle and ask for a structured diagnosis (see agents/failure-diagnoser.md). Output schema:
classification: <NaN | OOM | divergence | stall | crash | unclear>
primary cause: <one line>
contributing factors: <bulleted>
fix candidates: <ranked, each with concrete code or config change>
verification next step: <a single command or skill to confirm the fix>
Suggest the verification skill. Match the classification to one of:
stage5-loss-spike-rollback (recipe to recover) or stage2-init-loss-check (early sanity).primitive-recompute (activation checkpointing) or batch-size reduction.stage2-overfit-single-batch (verify pipeline at all) or stage2-grad-flow-viz (find dead layers).classification: unclear and ask for more information./tmp/ or to a job scheduler.