Skill

stage4-capacity-sweep

Sweep model capacity (width, depth, parameter count) at fixed compute to find the saturation point — where adding more parameters stops reducing the train loss. Activate when the user asks "how big should my model be", "capacity sweep", "is my model big enough", "find the right model size", or after Stage 3 pre-validation passes.

npx claudepluginhub curryfromuestc/curry-train --plugin curry-train

Tool Access

This skill uses the workspace's default tool permissions.

Preview

A short series of runs at increasing model sizes to identify where the train loss saturates — the size beyond which more parameters don't help. Confirms the architecture's capacity is well-matched to the data and budget.

SKILL.md


Stats

Stars: 0

Forks: 0

Last commit: May 4, 2026


Stage 4 · Scale-up · Capacity sweep

Stage question

"Past which model size does train loss stop decreasing meaningfully?"

The answer tells you the smallest model worth running at full scale, and serves as a sanity check that the architecture is not silently bottlenecked.

The protocol

1. Pick three to five sizes spaced roughly 2–3× apart (e.g., 5M, 15M, 50M, 150M parameters).
2. Run each to the same compute budget (same total FLOPs, not same wall time).
3. Use the muP-tuned hyperparameters (one set, transferred from stage3-mup-coord-check).
4. Plot train loss vs parameter count on log-log axes.
5. Identify the saturation point: the size beyond which loss plateaus.
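The first two steps can be sketched as follows. This is a minimal illustration, not part of the skill itself: it assumes the common approximation that training FLOPs are about 6 × parameters × tokens, and the function names and the 1e17 FLOPs budget are hypothetical.

```python
def geometric_size_grid(smallest: int, largest: int, n_sizes: int) -> list[int]:
    """Return n_sizes parameter counts spaced evenly in log space."""
    if n_sizes < 2:
        raise ValueError("need at least two sizes")
    ratio = (largest / smallest) ** (1 / (n_sizes - 1))
    return [round(smallest * ratio**i) for i in range(n_sizes)]


def tokens_for_budget(n_params: int, flops_budget: float) -> int:
    """Tokens to train on so that every run spends the same FLOPs.

    Assumption: training FLOPs ~= 6 * n_params * n_tokens.
    """
    return int(flops_budget / (6 * n_params))


sizes = geometric_size_grid(5_000_000, 150_000_000, 4)
budget = 1e17  # hypothetical per-run FLOPs budget
plan = {n: tokens_for_budget(n, budget) for n in sizes}

for n, tokens in plan.items():
    print(f"{n:>12,d} params -> {tokens:>14,d} tokens")
```

Because the budget is fixed in FLOPs, the larger models train on proportionally fewer tokens; that is the intended trade-off, not a bug in the plan.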

If train loss is still decreasing at the largest size: the architecture has more capacity to use; consider running larger.

If train loss plateaus before the target size: the architecture is bottlenecked; investigate before scaling further.
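One way to make "plateaus" concrete is to look at the local slope of the log-log curve between adjacent sizes. The sketch below is an assumption-laden illustration: the `min_slope` threshold of 0.02 is a hypothetical default, and the loss values in the usage lines are made up for demonstration.

```python
import math


def local_slopes(sizes, losses):
    """Slope of log(loss) vs log(params) between each adjacent pair of sizes."""
    pairs = list(zip(sizes, losses))
    return [
        (math.log(l2) - math.log(l1)) / (math.log(n2) - math.log(n1))
        for (n1, l1), (n2, l2) in zip(pairs, pairs[1:])
    ]


def saturation_size(sizes, losses, min_slope=0.02):
    """Smallest size after which every step improves loss by less than min_slope.

    Returns None if loss is still improving at the largest size.
    min_slope is a hypothetical threshold on the log-log slope magnitude.
    """
    slopes = local_slopes(sizes, losses)
    for i in range(len(slopes)):
        # Slopes are negative while loss improves; a segment counts as flat
        # when the improvement is smaller than min_slope (or loss gets worse).
        if all(-s < min_slope for s in slopes[i:]):
            return sizes[i]
    return None


sizes = [5_000_000, 15_000_000, 50_000_000, 150_000_000]
plateau = saturation_size(sizes, [3.20, 2.90, 2.75, 2.74])        # illustrative losses
still_scaling = saturation_size(sizes, [3.20, 2.90, 2.65, 2.45])  # illustrative losses
print(plateau, still_scaling)
```

A return value of `None` corresponds to the "still decreasing at the largest size" case; any other value is the candidate saturation point to investigate.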

Distinguishing capacity bottleneck from data bottleneck

A train-loss plateau at moderate size has two possible causes:

Capacity bottleneck: model can't represent more. Fix by changing the architecture (more depth, wider FFN, better attention).
Data bottleneck: model is starting to memorize. Fix by adding more data or regularization.

Distinguish by checking the gap between train and val loss at the plateau:

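Small gap, both flat → capacity bottleneck or genuine ceiling (the model is not memorizing, so this is not overfitting). Scaling the same architecture further won't help; an architecture change or better data might.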
Large gap (train low, val high) → overfitting; data bottleneck. Add data, dropout, or weight decay.
Small gap, both still decreasing → still in the productive regime; can probably scale further.
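The three cases above can be turned into a rough check on the last two sizes of the sweep. This is only a sketch: `gap_tol` and `flat_tol` are hypothetical thresholds that would need tuning to the loss scale of the task, and the function name is illustrative.

```python
def diagnose(train_prev, train_last, val_prev, val_last,
             gap_tol=0.05, flat_tol=0.01):
    """Classify the end of a sweep from losses at the last two sizes.

    gap_tol and flat_tol are hypothetical thresholds in loss units.
    """
    gap = val_last - train_last
    train_flat = (train_prev - train_last) < flat_tol
    val_flat = (val_prev - val_last) < flat_tol

    if gap > gap_tol:
        return "large gap"                    # train low, val high
    if train_flat and val_flat:
        return "small gap, both flat"
    if not train_flat and not val_flat:
        return "small gap, both decreasing"
    return "mixed signal"                     # inspect the curves by hand


print(diagnose(2.755, 2.750, 2.785, 2.780))
print(diagnose(2.60, 2.40, 2.90, 3.00))
print(diagnose(2.90, 2.70, 2.93, 2.73))
```

The "mixed signal" branch is deliberate: when train and val disagree on the trend, a threshold rule is not trustworthy and the curves should be read directly.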

Procedure when assisting a user

1. Confirm Stage 3 has passed: pre-validation showed the variant works at small scale with statistical significance.
2. Pick the size grid from the user's compute budget. Keep the smallest size cheap but large enough to give signal, and don't include sizes the user couldn't afford to run at full scale.
3. Run with the stage5-warmup-cosine schedule and stage3-kill-criterion enabled.
4. Plot the saturation curve. Render the verdict:
   - "Saturation at ~50M params; running larger likely wastes compute."
   - "Still scaling at 150M; running 500M is justified."
   - "Train and val gap widening past 50M; data is the bottleneck before capacity."
5. Recommend next step:
   - Saturated → run at the saturation size for the full budget; redirect spare budget to data or different ideas.
   - Not saturated → run at target size with confidence.
   - Bottlenecked → revisit Stage 1 (architecture) before scaling more.

Boundaries

Capacity sweeps test train loss, not generalization. A model can have low train loss and high val loss; that's a separate concern.
Sweeps are cheap proxies, not commitments. Final size should be chosen based on full-scale economics (stage3-compute-budget).
Sweeps assume you have a clean baseline; if the architecture is buggy, the sweep just tells you about the bug, not capacity.

Common mistakes

Sweeping with different LRs at each size → confounds capacity with hyperparameter tuning. Use muP.
Sweeping with too-narrow size range (e.g. 10M and 12M) → no signal. Use 2–3× spacing.
Stopping too early (runs that end before the loss has flattened at each size) → saturation can't be seen.
Conflating train and val loss when interpreting the curve → read saturation from train loss, and read the train/val gap separately.
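Several of these mistakes can be caught before any run starts. The following is a minimal pre-flight sketch; the run dictionaries and their keys (`params`, `base_lr`, `flops`) are hypothetical and not a format defined by this skill.

```python
def check_sweep(runs, min_ratio=2.0):
    """Return warnings for a planned sweep.

    Each run is a dict with hypothetical keys 'params', 'base_lr', 'flops'.
    """
    warnings = []
    runs = sorted(runs, key=lambda r: r["params"])

    # Adjacent sizes should be at least min_ratio apart to give signal.
    for small, big in zip(runs, runs[1:]):
        ratio = big["params"] / small["params"]
        if ratio < min_ratio:
            warnings.append(
                f"sizes {small['params']} and {big['params']} "
                f"are only {ratio:.2f}x apart"
            )

    # One muP-transferred hyperparameter set means one base LR for all sizes.
    if len({r["base_lr"] for r in runs}) > 1:
        warnings.append("base learning rate differs across sizes")

    # Every run should spend the same total FLOPs.
    if len({r["flops"] for r in runs}) > 1:
        warnings.append("compute budget differs across sizes")

    return warnings


bad = check_sweep([
    {"params": 10_000_000, "base_lr": 3e-3, "flops": 1e17},
    {"params": 12_000_000, "base_lr": 1e-3, "flops": 1e17},
])
good = check_sweep([
    {"params": 5_000_000, "base_lr": 3e-3, "flops": 1e17},
    {"params": 15_000_000, "base_lr": 3e-3, "flops": 1e17},
    {"params": 50_000_000, "base_lr": 3e-3, "flops": 1e17},
])
print(bad)
print(good)
```

Early stopping is not covered here because it can only be judged from the loss curves after the runs finish.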

Related

skills/stage3-scaling-fit: fits a curve to the sweep results.
skills/stage4-optuna-integration: once size is chosen, refine other hyperparameters.
skills/stage4-parallel-primitive-intro: at large enough sizes, parallelism primitives become necessary.
skills/stage3-mup-coord-check: must hold for the sweep to be interpretable.
