From curry-train
Run a tiny, cheap A/B comparison between a baseline and a new idea on a small model and short training budget — to predict whether the idea is worth a full-scale run. Activate when the user asks "should I scale this up", "is this idea worth running for real", "test this idea cheaply first", "small-scale ablation", or "validate before scaling".
npx claudepluginhub curryfromuestc/curry-train --plugin curry-train

This skill uses the workspace's default tool permissions.
The cheapest way to find out if a new idea (architecture change, optimizer, loss function, data filter) actually helps before committing 10× compute. The output is a yes/no/inconclusive verdict on whether to scale up.
"On a small model and short budget, does the new variant produce a measurable, statistically meaningful improvement over the baseline?"
If yes with significance: scale up. If no: redirect compute to a different idea. If inconclusive: redo with more seeds or larger small-scale model before scaling.
Define the comparison before running anything:
Pick a small surrogate model that satisfies these criteria:
Run N ≥ 3 seeds per arm. Use the same data, the same LR (calibrated via stage3-lr-range-test), and the same schedule for both arms.
Check kill-criterion during each run: if any run obviously diverges, abort that seed (don't include it).
Compute the effect:
Decision: variant is worth scaling iff:
- mean(V) − mean(B) is in the right direction, and
- |mean(V) − mean(B)| > 2 × pooled_std (the gap is statistically significant), and
- the relative gap |mean(V) − mean(B)| / mean(B) > 5% (the gap is practically meaningful).

If any of these fail: not worth scaling.
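The kill-criterion in the procedure above can be sketched as a simple divergence check. This is a hypothetical implementation: the function name `should_abort`, the window size, and the 3× blow-up factor are illustrative choices, not part of the skill.

```python
import math

def should_abort(loss_history, window=50, explode_factor=3.0):
    """Hypothetical kill-criterion for a single seed: abort if the loss
    goes non-finite, or if the recent average blows up past
    explode_factor x the early-training average."""
    if any(not math.isfinite(l) for l in loss_history):
        return True
    if len(loss_history) < 2 * window:
        return False  # too early to judge
    early = sum(loss_history[:window]) / window
    recent = sum(loss_history[-window:]) / window
    return recent > explode_factor * early
```

An aborted seed is excluded from the arm's statistics rather than averaged in as a huge loss, which would otherwise distort Δ and pooled_std.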
For LM tasks: token-level loss at fixed step T. Avoid eval suite metrics (MMLU, etc.) at small scale — they have huge variance.
For classification: dev accuracy at the end of training, with multi-seed-variance for the std estimate.
For generative tasks where final quality matters: use a proxy like "loss on validation tokens" — small-scale generation quality is meaningless.
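Whatever metric is chosen, it must be read off at the same fixed step T for every seed before summarizing. A minimal sketch, assuming a hypothetical log format mapping seed → {step: metric}:

```python
import statistics

def per_arm_summary(runs, step_t):
    """Pull the metric at the SAME fixed step T for every seed, then
    summarize. `runs` maps seed -> {step: metric} (hypothetical format);
    comparing arms at different steps invalidates the test."""
    vals = [history[step_t] for history in runs.values()]
    return statistics.mean(vals), statistics.stdev(vals)
```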
A small-scale ablation does not perfectly predict large-scale behavior. Things that help small models can hurt big ones (e.g. some regularizers), and vice versa. To mitigate:
- stage3-mup-coord-check.
- stage3-scaling-fit.

1. Predefine: baseline, variant (1 change), metric, threshold, kill-criterion
2. Run B and V with N=3 seeds each at small scale
3. Compute mean, std, Δ, pooled_std
4. Decision rule:
gap_significant = abs(Δ) > 2 * pooled_std
gap_meaningful = abs(rel_Δ) > 5%
scale_up = direction_correct AND gap_significant AND gap_meaningful
5. If "scale up": commit to the larger run.
If "no": archive the result with config + verdict in run journal.
If "inconclusive": rerun with N=10 seeds at the same scale before deciding.
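Steps 3–5 above can be sketched in one function. The function name and defaults are assumptions, the pooled std uses one common convention (averaging the two sample variances), and mapping "not significant" to "inconclusive" is one reasonable reading of the rule:

```python
import statistics

def ablation_verdict(baseline, variant, rel_threshold=0.05,
                     sigma_mult=2.0, lower_is_better=True):
    """Apply the decision rule to per-seed final metrics (N >= 3 per arm)."""
    mean_b = statistics.mean(baseline)
    mean_v = statistics.mean(variant)
    # Pooled std: one common convention averages the two sample variances.
    pooled_std = ((statistics.variance(baseline)
                   + statistics.variance(variant)) / 2) ** 0.5
    delta = mean_v - mean_b
    direction_correct = (delta < 0) if lower_is_better else (delta > 0)
    gap_significant = abs(delta) > sigma_mult * pooled_std
    gap_meaningful = abs(delta / mean_b) > rel_threshold
    if direction_correct and gap_significant and gap_meaningful:
        return "scale up"
    if not gap_significant:
        return "inconclusive"  # too noisy to call -> rerun with more seeds
    return "no"
```

For example, with per-seed eval losses B = [2.0, 2.1, 1.9] and V = [1.5, 1.6, 1.4], the gap is both significant and meaningful and the verdict is "scale up".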
Force them to write down the predefined threshold before running. Many ablations get rationalized post-hoc.
Insist on N ≥ 3 seeds. One seed each is meaningless; surface this loudly.
Run both arms with kill-criterion enabled — don't include diverged runs in the average.
After results, plug numbers into the decision rule. Output the verdict in one line.
If verdict is "scale up", hand off to stage4-capacity-sweep or stage4-optuna-integration. If "no", put the negative result in the run journal — negative results are valuable.
- skills/stage3-multi-seed-variance — how to compute the pooled std properly.
- skills/stage3-kill-criterion — when to abort a single seed.
- skills/stage3-scaling-fit — confirm the gap doesn't vanish with size.
- skills/runs-diff — the same comparison logic used post-hoc on full runs.