Cluster the dev-set errors of a model and surface the dominant failure modes — pointing at the most leverage-worthy next experiment. Activate when the user asks "what should I try next", "what is my model getting wrong", "error analysis", "failure mode analysis", or after a completed run that's no longer SOTA.
npx claudepluginhub curryfromuestc/curry-train --plugin curry-train
This skill uses the workspace's default tool permissions.
A targeted analysis of where the model is failing on the dev set, organized into clusters that suggest concrete next experiments. Replaces "I'll try X" with "X is the largest cluster of errors, so try addressing it specifically".
"What is the model actually getting wrong, and what's the biggest single cluster I can fix?"
The next experiment should target the biggest cluster. Random ideas are noise.
Score the dev set with the trained model. Record per-example predictions, ground truth, and a per-example loss/score (a sketch of this step appears after this list).
Pick the failures. What counts as a failure depends on the task: a wrong prediction, a per-example loss above a threshold, or a score below a cutoff.
Cluster the failures. Methods, in order of effort: manual inspection of a sample first, then LLM- or embedding-assisted clustering to scale up (see the note on manual inspection below).
Size each cluster. Count failures per bucket; report as a percentage of total failures and percentage of total dev set.
Rank by leverage. The bucket to address next is the largest one whose root cause you can plausibly fix. Don't pick the largest if it's "noise inherent in the data" — that's not actionable.
Define the next experiment. The fix for the chosen bucket becomes the variant in stage3-small-scale-ablation. Loop closes.
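To make the scoring step concrete, here is a minimal Python sketch of a per-example scoring pass. The model.predict and loss_fn callables, the field names, and the JSONL layout are assumptions for illustration, not the actual interface of tools/score_dev.py.

    # Hypothetical sketch of per-example dev-set scoring.
    # `model`, `loss_fn`, and the record fields are placeholders.
    import json

    def score_dev_set(model, dev_examples, loss_fn, out_path):
        """Write one JSON record per dev example: prediction, ground truth, loss."""
        with open(out_path, "w") as f:
            for ex in dev_examples:
                pred = model.predict(ex["inputs"])
                loss = loss_fn(pred, ex["target"])
                record = {
                    "id": ex["id"],
                    "prediction": pred,
                    "target": ex["target"],
                    "loss": float(loss),
                }
                f.write(json.dumps(record) + "\n")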
LLM clustering and embedding clustering are tempting but routinely produce wrong clusters that look plausible. Manual inspection of ~50 failures takes an hour and is irreplaceable for understanding what's actually going wrong. Use automated methods afterwards, to scale up the clusters you've already identified.
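To illustrate the manual pass, one possible way to pull the worst ~50 failures into a readable file is sketched below. It assumes the dev_eval.jsonl fields from the scoring sketch above and treats "failure" as loss above a fixed threshold, which is an assumption rather than the definition tools/sample_failures.py actually uses.

    # Hypothetical failure-sampling sketch; field names and the loss-threshold
    # failure rule are assumptions carried over from the scoring sketch above.
    import json

    def sample_failures(eval_path, out_path, n=50, loss_threshold=0.5):
        with open(eval_path) as f:
            records = [json.loads(line) for line in f]
        failures = [r for r in records if r["loss"] > loss_threshold]
        # Highest-loss first; a random sample is equally valid if you worry
        # about over-indexing on outliers.
        failures.sort(key=lambda r: r["loss"], reverse=True)
        with open(out_path, "w") as f:
            for r in failures[:n]:
                f.write(f"id={r['id']}  loss={r['loss']:.3f}\n"
                        f"  target:     {r['target']}\n"
                        f"  prediction: {r['prediction']}\n\n")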
1. python tools/score_dev.py --run runs/<id> --out runs/<id>/dev_eval.jsonl
2. python tools/sample_failures.py --eval runs/<id>/dev_eval.jsonl --n 50 --out failures.txt
3. (manual) Read failures.txt, write a 5-bucket draft into clusters.yaml.
4. python tools/size_clusters.py --eval runs/<id>/dev_eval.jsonl --clusters clusters.yaml (see the sketch after this list)
5. (decide) Which bucket → which experiment.
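A sketch of what step 4 could look like, assuming clusters.yaml maps each cluster name to the list of example ids assigned during the manual pass (the real format is whatever you wrote in step 3) and reusing the same failure rule as the sketches above. It needs PyYAML and prints the markdown table described later in this page.

    # Hypothetical cluster-sizing sketch; the clusters.yaml layout
    # ({cluster_name: [example_id, ...]}) and the failure rule are assumptions.
    import json
    import yaml  # PyYAML

    def size_clusters(eval_path, clusters_path, loss_threshold=0.5):
        with open(eval_path) as f:
            records = [json.loads(line) for line in f]
        failure_ids = {r["id"] for r in records if r["loss"] > loss_threshold}
        with open(clusters_path) as f:
            clusters = yaml.safe_load(f)

        print("| cluster | count | % of failures | % of dev set |")
        print("|---|---:|---:|---:|")
        for name, ids in clusters.items():
            count = len(set(ids) & failure_ids)
            pct_fail = 100 * count / max(len(failure_ids), 1)
            pct_dev = 100 * count / max(len(records), 1)
            print(f"| {name} | {count} | {pct_fail:.1f}% | {pct_dev:.1f}% |")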
For an LM: if "eval noise" is the largest cluster, the right action is to fix the eval, not the model.
Confirm the run is fully evaluated. If not, score the dev set first.
Walk through 30–50 manually selected failures with the user. Don't outsource this to an LLM yet — the manual pass is irreplaceable.
Help articulate 3–7 clusters with sharp definitions. Vague clusters ("hard examples") aren't actionable.
Size the clusters quantitatively. Render as a markdown table with cluster name, count, % of failures.
Ask the user which cluster they want to address next. Confirm the proposed fix has a plausible mechanism (this is where stage3-surrogate-task becomes useful).
Open a new entry in the run journal: cluster analysis result + chosen next experiment.
skills/stage3-small-scale-ablation — the next experiment is shaped by the chosen cluster.
skills/stage3-surrogate-task — design a surrogate that targets the cluster.
skills/stage6-ablation-matrix — multi-variant comparison; each variant addresses one cluster.