Fit a power-law scaling curve to small-scale runs at multiple sizes, then extrapolate to predict large-scale loss before committing the compute. Activate when the user asks "scaling laws", "Chinchilla", "Kaplan", "predict large-scale loss from small-scale", "is my idea going to scale", or wants to do compute-optimal training.
npx claudepluginhub curryfromuestc/curry-train --plugin curry-train

This skill uses the workspace's default tool permissions.
A small set of cheap runs at varying compute budgets, fit to a power law, then used to predict large-scale loss. Cuts the scaling decision from "guess and pray" to "extrapolate from data".
"Given losses at compute budgets C₁ < C₂ < C₃ (small), what loss should I expect at the target compute budget C* (large), and is the variant still better at C*?"
Loss is a power law in parameter count N:
L(N) ≈ A · N^(-α) + L_∞
where L_∞ is the irreducible loss (data entropy floor). Useful when data and steps scale with N.
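As a quick sanity check, the Kaplan form can be evaluated directly. The constants below (A, α, L_∞) are made-up illustrative values, not fitted ones:

```python
import numpy as np

# Hypothetical constants for illustration only -- real values come from the fit.
A, alpha, L_inf = 5.0, 0.08, 1.8

def kaplan_loss(N):
    """Kaplan-style loss curve: L(N) = A * N^-alpha + L_inf."""
    return A * np.power(N, -alpha) + L_inf

for N in [5e6, 20e6, 80e6]:
    print(f"N={N:.0e}  predicted loss={kaplan_loss(N):.3f}")
```

The curve decreases monotonically in N and flattens toward the irreducible floor L_∞, which is what makes extrapolation to large N meaningful.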
Loss is jointly a power law in parameters N and tokens T:
L(N, T) ≈ A · N^(-α) + B · T^(-β) + L_∞
with the compute-optimal allocation N* ∝ C^0.5 and T* ∝ C^0.5 (where compute C ≈ 6 N T).
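Under that allocation, a budget C splits into N* and T* as sketched below. The proportionality constant k is a placeholder (the real one comes from the Chinchilla-style fit); only the square-root scaling matters here:

```python
def chinchilla_allocation(C, k=0.1):
    """Compute-optimal split of budget C ~= 6*N*T.

    k is a hypothetical proportionality constant; with N* = k*sqrt(C),
    T* = C / (6*N*) follows directly from C ~= 6*N*T.
    """
    N_star = k * C ** 0.5
    T_star = C / (6 * N_star)
    return N_star, T_star

# 100x more compute -> 10x more parameters and 10x more tokens.
N1, T1 = chinchilla_allocation(1e18)
N2, T2 = chinchilla_allocation(1e20)
```

Note the symmetry: scaling compute by 100x grows both the model and the token budget by 10x, rather than pouring everything into parameters.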
For most curryTrain users at the start of an idea, the simpler Kaplan-style fit is enough to make a scaling decision. Chinchilla becomes useful when the question is "given a compute budget, what model + tokens is optimal".
- Train the baseline at several sizes, e.g. N ∈ {5M, 20M, 80M} parameters.
- Use muP so hyperparameters transfer across sizes (see stage3-mup-coord-check).
- Record each run's final loss L_i (or last-10%-mean loss for stability).
- Fit L = A · N^(-α) + L_∞ using non-linear least squares (3 parameters, 3 data points = barely enough; more sizes are better).
- Predict the loss at the target size N* using the fitted parameters.
- Repeat for the variant arm. Compare predicted losses at N*.
The variant earns a scale-up budget iff:
- α_B and α_V are similar (not anomalous).
- L_V(N*) < L_B(N*) by a margin larger than the fit's prediction interval.

If the gap between B and V shrinks as size grows (i.e., L_V(small) − L_B(small) > L_V(large) − L_B(large)), the variant probably won't help at scale, even if it helps at small scale.
Lives at template/curry_train/prevalidate/scaling_fit.py. Sketch:
```python
import numpy as np
from scipy.optimize import curve_fit

def kaplan_fit(sizes, losses):
    """Fit L = A * N^-alpha + L_inf to (sizes, losses)."""
    def model(N, A, alpha, L_inf):
        return A * np.power(N, -alpha) + L_inf

    p0 = (1.0, 0.05, min(losses) * 0.9)
    popt, pcov = curve_fit(model, sizes, losses, p0=p0,
                           bounds=((0, 1e-3, 0), (np.inf, 1.0, np.inf)))
    return {"A": popt[0], "alpha": popt[1], "L_inf": popt[2],
            "predict": lambda N: model(N, *popt),
            "ci_width_at_target": float(np.sqrt(np.diag(pcov)).sum())}
```
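A minimal usage example on synthetic data. The sizes and losses below are fabricated to follow an exact power law, so the fit should recover the curve closely (kaplan_fit is repeated here so the example is self-contained):

```python
import numpy as np
from scipy.optimize import curve_fit

def kaplan_fit(sizes, losses):
    """Fit L = A * N^-alpha + L_inf to (sizes, losses)."""
    def model(N, A, alpha, L_inf):
        return A * np.power(N, -alpha) + L_inf
    p0 = (1.0, 0.05, min(losses) * 0.9)
    popt, pcov = curve_fit(model, sizes, losses, p0=p0,
                           bounds=((0, 1e-3, 0), (np.inf, 1.0, np.inf)))
    return {"A": popt[0], "alpha": popt[1], "L_inf": popt[2],
            "predict": lambda N: model(N, *popt)}

sizes = np.array([5e6, 20e6, 80e6])
true = lambda N: 5.0 * N ** -0.08 + 1.8   # hidden "ground truth" curve
fit = kaplan_fit(sizes, true(sizes))
print(fit["predict"](1e9))                # extrapolated loss at 1B params
```

With real runs the losses are noisy, so the fit interpolates the points only approximately and the prediction interval at N* widens accordingly.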
Insist on at least 3 sizes for the fit, ideally 4–5. Two-point fits cannot detect curvature.
Run sizes spaced approximately log-uniformly (e.g. ~3× apart). Don't run all three near each other.
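Log-uniform spacing is exactly what `np.geomspace` produces; the endpoints below are illustrative:

```python
import numpy as np

# Four sizes spaced evenly in log space between 5M and 135M parameters.
sizes = np.geomspace(5e6, 135e6, num=4)
ratios = sizes[1:] / sizes[:-1]   # constant ratio between neighbors => log-uniform
```

Here each size is 3x the previous one, so the points land evenly on a log axis instead of clustering.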
Use muP if at all possible. Without muP, hyperparameters at each size are different, and the loss differences contain a tuning-quality artifact that breaks the fit.
Plot loss vs N on log-log axes. The fitted line should look like a power law (linear on log-log, except for the curving offset toward L_∞). Anomalies at one size are a red flag — investigate before scaling.
Predict L(N*) for both arms. Render the verdict as: predicted gap at scale, with a confidence interval from pcov.
If gap at scale is < the within-fit uncertainty: not safe to scale; collect more sizes first.
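The decision rule can be made mechanical. In the sketch below, `ci` stands in for the fit's prediction-interval width at N* (how that uncertainty is computed is up to the fitting code):

```python
def scale_verdict(loss_base, loss_variant, ci):
    """Decide whether a variant's predicted advantage at N* survives fit uncertainty.

    loss_base / loss_variant: predicted losses at the target size N*.
    ci: combined uncertainty of the two predictions (assumed precomputed).
    """
    gap = loss_base - loss_variant
    if gap <= ci:
        return "not safe to scale: collect more sizes"
    return f"scale up: predicted gain {gap:.3f} exceeds uncertainty {ci:.3f}"
```

For example, a predicted gain of 0.10 nats against an uncertainty of 0.05 clears the bar; a 0.02 gain does not.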
Related skills and files:
- stage3-lr-range-test (or use muP).
- skills/stage3-mup-coord-check: without muP, scaling fits are unreliable.
- skills/stage3-small-scale-ablation: the small-scale gap is the input; this skill extrapolates that gap.
- skills/stage3-compute-budget: once you have the predicted loss, decide whether the compute is justified.
- template/curry_train/prevalidate/scaling_fit.py.