From curry-train
Run a tiny, cheap A/B comparison between a baseline and a new idea on a small model and short training budget — to predict whether the idea is worth a full-scale run. Activate when the user asks "should I scale this up", "is this idea worth running for real", "test this idea cheaply first", "small-scale ablation", or "validate before scaling".
npx claudepluginhub curryfromuestc/curry-train --plugin curry-train

This skill uses the workspace's default tool permissions.
The cheapest way to find out if a new idea (architecture change, optimizer, loss function, data filter) actually helps before committing 10× compute. The output is a yes/no/inconclusive verdict on whether to scale up.
"On a small model and short budget, does the new variant produce a measurable, statistically meaningful improvement over the baseline?"
If yes with significance: scale up. If no: redirect compute to a different idea. If inconclusive: redo with more seeds or larger small-scale model before scaling.
Define the comparison before running anything:
Pick a small surrogate model that satisfies these criteria:
Run N ≥ 3 seeds per arm. Use the same data, the same LR (calibrated via stage3-lr-range-test), and the same schedule for both arms.
Check kill-criterion during each run: if any run obviously diverges, abort that seed (don't include it).
Compute the effect:
Decision: variant is worth scaling iff:
- mean(V) − mean(B) is in the right direction, and
- |mean(V) − mean(B)| > 2 × pooled_std (the gap is statistically significant), and
- the relative gap |mean(V) − mean(B)| / mean(B) > 5% (the gap is practically meaningful).

If any of these fail: not worth scaling.
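The kill-criterion in the procedure above can be sketched as a simple divergence check. This is a hypothetical implementation: the function name `should_abort`, the window size, and the 3× blow-up factor are illustrative choices, not part of the skill.

```python
import math

def should_abort(loss_history, window=50, explode_factor=3.0):
    """Hypothetical kill-criterion for a single seed: abort if the loss
    goes non-finite, or if the recent average blows up past
    explode_factor x the early-training average."""
    if any(not math.isfinite(l) for l in loss_history):
        return True
    if len(loss_history) < 2 * window:
        return False  # too early to judge
    early = sum(loss_history[:window]) / window
    recent = sum(loss_history[-window:]) / window
    return recent > explode_factor * early
```

An aborted seed is excluded from the arm's statistics rather than averaged in as a huge loss, which would otherwise distort Δ and pooled_std.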
For LM tasks: token-level loss at fixed step T. Avoid eval suite metrics (MMLU, etc.) at small scale — they have huge variance.
For classification: dev accuracy at the end of training, with multi-seed-variance for the std estimate.
For generative tasks where final quality matters: use a proxy like "loss on validation tokens" — small-scale generation quality is meaningless.
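Whatever metric is chosen, it must be read off at the same fixed step T for every seed before summarizing. A minimal sketch, assuming a hypothetical log format mapping seed → {step: metric}:

```python
import statistics

def per_arm_summary(runs, step_t):
    """Pull the metric at the SAME fixed step T for every seed, then
    summarize. `runs` maps seed -> {step: metric} (hypothetical format);
    comparing arms at different steps invalidates the test."""
    vals = [history[step_t] for history in runs.values()]
    return statistics.mean(vals), statistics.stdev(vals)
```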
A small-scale ablation does not perfectly predict large-scale behavior. Things that help small models can hurt big ones (e.g. some regularizers), and vice versa. To mitigate:
- stage3-mup-coord-check.
- stage3-scaling-fit.

1. Predefine: baseline, variant (1 change), metric, threshold, kill-criterion
2. Run B and V with N=3 seeds each at small scale
3. Compute mean, std, Δ, pooled_std
4. Decision rule:
gap_significant = abs(Δ) > 2 * pooled_std
gap_meaningful = abs(rel_Δ) > 5%
scale_up = direction_correct AND gap_significant AND gap_meaningful
5. If "scale up": commit to the larger run.
If "no": archive the result with config + verdict in run journal.
If "inconclusive": rerun with N=10 seeds at the same scale before deciding.
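Steps 3–5 above can be sketched in one function. The function name and defaults are assumptions, the pooled std uses one common convention (averaging the two sample variances), and mapping "not significant" to "inconclusive" is one reasonable reading of the rule:

```python
import statistics

def ablation_verdict(baseline, variant, rel_threshold=0.05,
                     sigma_mult=2.0, lower_is_better=True):
    """Apply the decision rule to per-seed final metrics (N >= 3 per arm)."""
    mean_b = statistics.mean(baseline)
    mean_v = statistics.mean(variant)
    # Pooled std: one common convention averages the two sample variances.
    pooled_std = ((statistics.variance(baseline)
                   + statistics.variance(variant)) / 2) ** 0.5
    delta = mean_v - mean_b
    direction_correct = (delta < 0) if lower_is_better else (delta > 0)
    gap_significant = abs(delta) > sigma_mult * pooled_std
    gap_meaningful = abs(delta / mean_b) > rel_threshold
    if direction_correct and gap_significant and gap_meaningful:
        return "scale up"
    if not gap_significant:
        return "inconclusive"  # too noisy to call -> rerun with more seeds
    return "no"
```

For example, with per-seed eval losses B = [2.0, 2.1, 1.9] and V = [1.5, 1.6, 1.4], the gap is both significant and meaningful and the verdict is "scale up".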
Force them to write down the predefined threshold before running. Many ablations get rationalized post-hoc.
Insist on N ≥ 3 seeds. One seed each is meaningless; surface this loudly.
Run both arms with kill-criterion enabled — don't include diverged runs in the average.
After results, plug numbers into the decision rule. Output the verdict in one line.
If verdict is "scale up", hand off to stage4-capacity-sweep or stage4-optuna-integration. If "no", put the negative result in the run journal — negative results are valuable.
- skills/stage3-multi-seed-variance — how to compute the pooled std properly.
- skills/stage3-kill-criterion — when to abort a single seed.
- skills/stage3-scaling-fit — confirm the gap doesn't vanish with size.
- skills/runs-diff — the same comparison logic used post-hoc on full runs.