Cluster the dev-set errors of a model and surface the dominant failure modes — pointing at the most leverage-worthy next experiment. Activate when the user asks "what should I try next", "what is my model getting wrong", "error analysis", "failure mode analysis", or after a completed run that's no longer SOTA.
npx claudepluginhub curryfromuestc/curry-train --plugin curry-train
This skill uses the workspace's default tool permissions.
A targeted analysis of where the model is failing on the dev set, organized into clusters that suggest concrete next experiments. Replaces "I'll try X" with "X is the largest cluster of errors, so try addressing it specifically".
"What is the model actually getting wrong, and what's the biggest single cluster I can fix?"
The next experiment should target the biggest cluster. Random ideas are noise.
Score the dev set with the trained model. Record per-example predictions, ground truth, and a per-example loss/score (a sketch of this step appears after this list).
Pick the failures. What counts as a failure depends on the task: a wrong prediction, a per-example loss above a threshold, or a score below a cutoff.
Cluster the failures. Methods, in order of effort: manual inspection of a sample first, then LLM- or embedding-assisted clustering to scale up (see the note on manual inspection below).
Size each cluster. Count failures per bucket; report as a percentage of total failures and percentage of total dev set.
Rank by leverage. The bucket to address next is the largest one whose root cause you can plausibly fix. Don't pick the largest if it's "noise inherent in the data" — that's not actionable.
Define the next experiment. The fix for the chosen bucket becomes the variant in stage3-small-scale-ablation. Loop closes.
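To make the scoring step concrete, here is a minimal Python sketch of a per-example scoring pass. The model.predict and loss_fn callables, the field names, and the JSONL layout are assumptions for illustration, not the actual interface of tools/score_dev.py.

    # Hypothetical sketch of per-example dev-set scoring.
    # `model`, `loss_fn`, and the record fields are placeholders.
    import json

    def score_dev_set(model, dev_examples, loss_fn, out_path):
        """Write one JSON record per dev example: prediction, ground truth, loss."""
        with open(out_path, "w") as f:
            for ex in dev_examples:
                pred = model.predict(ex["inputs"])
                loss = loss_fn(pred, ex["target"])
                record = {
                    "id": ex["id"],
                    "prediction": pred,
                    "target": ex["target"],
                    "loss": float(loss),
                }
                f.write(json.dumps(record) + "\n")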
LLM clustering and embedding clustering are tempting but routinely produce wrong clusters that look plausible. Manual inspection of ~50 failures takes an hour and is irreplaceable for understanding what's actually going wrong. Use automated methods afterwards, to scale up the clusters you've already identified.
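To illustrate the manual pass, one possible way to pull the worst ~50 failures into a readable file is sketched below. It assumes the dev_eval.jsonl fields from the scoring sketch above and treats "failure" as loss above a fixed threshold, which is an assumption rather than the definition tools/sample_failures.py actually uses.

    # Hypothetical failure-sampling sketch; field names and the loss-threshold
    # failure rule are assumptions carried over from the scoring sketch above.
    import json

    def sample_failures(eval_path, out_path, n=50, loss_threshold=0.5):
        with open(eval_path) as f:
            records = [json.loads(line) for line in f]
        failures = [r for r in records if r["loss"] > loss_threshold]
        # Highest-loss first; a random sample is equally valid if you worry
        # about over-indexing on outliers.
        failures.sort(key=lambda r: r["loss"], reverse=True)
        with open(out_path, "w") as f:
            for r in failures[:n]:
                f.write(f"id={r['id']}  loss={r['loss']:.3f}\n"
                        f"  target:     {r['target']}\n"
                        f"  prediction: {r['prediction']}\n\n")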
1. python tools/score_dev.py --run runs/<id> --out runs/<id>/dev_eval.jsonl
2. python tools/sample_failures.py --eval runs/<id>/dev_eval.jsonl --n 50 --out failures.txt
3. (manual) Read failures.txt, write a 5-bucket draft into clusters.yaml.
4. python tools/size_clusters.py --eval runs/<id>/dev_eval.jsonl --clusters clusters.yaml (see the sketch after this list)
5. (decide) Which bucket → which experiment.
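A sketch of what step 4 could look like, assuming clusters.yaml maps each cluster name to the list of example ids assigned during the manual pass (the real format is whatever you wrote in step 3) and reusing the same failure rule as the sketches above. It needs PyYAML and prints the markdown table described later in this page.

    # Hypothetical cluster-sizing sketch; the clusters.yaml layout
    # ({cluster_name: [example_id, ...]}) and the failure rule are assumptions.
    import json
    import yaml  # PyYAML

    def size_clusters(eval_path, clusters_path, loss_threshold=0.5):
        with open(eval_path) as f:
            records = [json.loads(line) for line in f]
        failure_ids = {r["id"] for r in records if r["loss"] > loss_threshold}
        with open(clusters_path) as f:
            clusters = yaml.safe_load(f)

        print("| cluster | count | % of failures | % of dev set |")
        print("|---|---:|---:|---:|")
        for name, ids in clusters.items():
            count = len(set(ids) & failure_ids)
            pct_fail = 100 * count / max(len(failure_ids), 1)
            pct_dev = 100 * count / max(len(records), 1)
            print(f"| {name} | {count} | {pct_fail:.1f}% | {pct_dev:.1f}% |")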
For an LM: if "eval noise" is the largest cluster, the right action is to fix the eval, not the model.
Confirm the run is fully evaluated. If not, score the dev set first.
Walk through 30–50 manually selected failures with the user. Don't outsource this to an LLM yet — the manual pass is irreplaceable.
Help articulate 3–7 clusters with sharp definitions. Vague clusters ("hard examples") aren't actionable.
Size the clusters quantitatively. Render as a markdown table with cluster name, count, % of failures.
Ask the user which cluster they want to address next. Confirm the proposed fix has a plausible mechanism (this is where stage3-surrogate-task becomes useful).
Open a new entry in the run journal: cluster analysis result + chosen next experiment.
skills/stage3-small-scale-ablation — the next experiment is shaped by the chosen cluster.
skills/stage3-surrogate-task — design a surrogate that targets the cluster.
skills/stage6-ablation-matrix — multi-variant comparison; each variant addresses one cluster.