Fit a power-law scaling curve to small-scale runs at multiple sizes, then extrapolate to predict large-scale loss before committing the compute. Activate when the user asks "scaling laws", "Chinchilla", "Kaplan", "predict large-scale loss from small-scale", "is my idea going to scale", or wants to do compute-optimal training.
npx claudepluginhub curryfromuestc/curry-train --plugin curry-train

This skill uses the workspace's default tool permissions.
A small set of cheap runs at varying compute budgets, fit to a power law, then used to predict large-scale loss. Cuts the scaling decision from "guess and pray" to "extrapolate from data".
"Given losses at compute budgets C₁ < C₂ < C₃ (small), what loss should I expect at the target compute budget C* (large), and is the variant still better at C*?"
Loss is a power law in parameter count N:
L(N) ≈ A · N^(-α) + L_∞
where L_∞ is the irreducible loss (data entropy floor). Useful when data and steps scale with N.
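As a quick sanity check, the Kaplan form can be evaluated directly. The constants below (A, α, L_∞) are made-up illustrative values, not fitted ones:

```python
import numpy as np

# Hypothetical constants for illustration only -- real values come from the fit.
A, alpha, L_inf = 5.0, 0.08, 1.8

def kaplan_loss(N):
    """Kaplan-style loss curve: L(N) = A * N^-alpha + L_inf."""
    return A * np.power(N, -alpha) + L_inf

for N in [5e6, 20e6, 80e6]:
    print(f"N={N:.0e}  predicted loss={kaplan_loss(N):.3f}")
```

The curve decreases monotonically in N and flattens toward the irreducible floor L_∞, which is what makes extrapolation to large N meaningful.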
Loss is jointly a power law in parameters N and tokens T:
L(N, T) ≈ A · N^(-α) + B · T^(-β) + L_∞
with the compute-optimal allocation N* ∝ C^0.5 and T* ∝ C^0.5 (where compute C ≈ 6 N T).
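Under that allocation, a budget C splits into N* and T* as sketched below. The proportionality constant k is a placeholder (the real one comes from the Chinchilla-style fit); only the square-root scaling matters here:

```python
def chinchilla_allocation(C, k=0.1):
    """Compute-optimal split of budget C ~= 6*N*T.

    k is a hypothetical proportionality constant; with N* = k*sqrt(C),
    T* = C / (6*N*) follows directly from C ~= 6*N*T.
    """
    N_star = k * C ** 0.5
    T_star = C / (6 * N_star)
    return N_star, T_star

# 100x more compute -> 10x more parameters and 10x more tokens.
N1, T1 = chinchilla_allocation(1e18)
N2, T2 = chinchilla_allocation(1e20)
```

Note the symmetry: scaling compute by 100x grows both the model and the token budget by 10x, rather than pouring everything into parameters.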
For most curryTrain users at the start of an idea, the simpler Kaplan-style fit is enough to make a scaling decision. Chinchilla becomes useful when the question is "given a compute budget, what model + tokens is optimal".
- Train the baseline at several sizes, e.g. N ∈ {5M, 20M, 80M} parameters.
- Use muP so hyperparameters transfer across sizes (see stage3-mup-coord-check).
- Record each run's final loss L_i (or last-10%-mean loss for stability).
- Fit L = A · N^(-α) + L_∞ using non-linear least squares (3 parameters, 3 data points = barely enough; more sizes are better).
- Predict the loss at the target size N* using the fitted parameters.
- Repeat for the variant arm. Compare predicted losses at N*.
The variant earns a scale-up budget iff:
- α_B and α_V are similar (not anomalous).
- L_V(N*) < L_B(N*) by a margin larger than the fit's prediction interval.

If the gap between B and V shrinks as size grows (i.e., L_V(small) − L_B(small) > L_V(large) − L_B(large)), the variant probably won't help at scale, even if it helps at small scale.
Lives at template/curry_train/prevalidate/scaling_fit.py. Sketch:
```python
import numpy as np
from scipy.optimize import curve_fit

def kaplan_fit(sizes, losses):
    """Fit L = A * N^-alpha + L_inf to (sizes, losses)."""
    def model(N, A, alpha, L_inf):
        return A * np.power(N, -alpha) + L_inf

    p0 = (1.0, 0.05, min(losses) * 0.9)
    popt, pcov = curve_fit(model, sizes, losses, p0=p0,
                           bounds=((0, 1e-3, 0), (np.inf, 1.0, np.inf)))
    return {"A": popt[0], "alpha": popt[1], "L_inf": popt[2],
            "predict": lambda N: model(N, *popt),
            "ci_width_at_target": float(np.sqrt(np.diag(pcov)).sum())}
```
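A minimal usage example on synthetic data. The sizes and losses below are fabricated to follow an exact power law, so the fit should recover the curve closely (kaplan_fit is repeated here so the example is self-contained):

```python
import numpy as np
from scipy.optimize import curve_fit

def kaplan_fit(sizes, losses):
    """Fit L = A * N^-alpha + L_inf to (sizes, losses)."""
    def model(N, A, alpha, L_inf):
        return A * np.power(N, -alpha) + L_inf
    p0 = (1.0, 0.05, min(losses) * 0.9)
    popt, pcov = curve_fit(model, sizes, losses, p0=p0,
                           bounds=((0, 1e-3, 0), (np.inf, 1.0, np.inf)))
    return {"A": popt[0], "alpha": popt[1], "L_inf": popt[2],
            "predict": lambda N: model(N, *popt)}

sizes = np.array([5e6, 20e6, 80e6])
true = lambda N: 5.0 * N ** -0.08 + 1.8   # hidden "ground truth" curve
fit = kaplan_fit(sizes, true(sizes))
print(fit["predict"](1e9))                # extrapolated loss at 1B params
```

With real runs the losses are noisy, so the fit interpolates the points only approximately and the prediction interval at N* widens accordingly.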
Insist on at least 3 sizes for the fit, ideally 4–5. Two-point fits cannot detect curvature.
Run sizes spaced approximately log-uniformly (e.g. ~3× apart). Don't run all three near each other.
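Log-uniform spacing is exactly what `np.geomspace` produces; the endpoints below are illustrative:

```python
import numpy as np

# Four sizes spaced evenly in log space between 5M and 135M parameters.
sizes = np.geomspace(5e6, 135e6, num=4)
ratios = sizes[1:] / sizes[:-1]   # constant ratio between neighbors => log-uniform
```

Here each size is 3x the previous one, so the points land evenly on a log axis instead of clustering.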
Use muP if at all possible. Without muP, hyperparameters at each size are different, and the loss differences contain a tuning-quality artifact that breaks the fit.
Plot loss vs N on log-log axes. The fitted line should look like a power law (linear on log-log, except for the curving offset toward L_∞). Anomalies at one size are a red flag — investigate before scaling.
Predict L(N*) for both arms. Render the verdict as: predicted gap at scale, with a confidence interval from pcov.
If gap at scale is < the within-fit uncertainty: not safe to scale; collect more sizes first.
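The decision rule can be made mechanical. In the sketch below, `ci` stands in for the fit's prediction-interval width at N* (how that uncertainty is computed is up to the fitting code):

```python
def scale_verdict(loss_base, loss_variant, ci):
    """Decide whether a variant's predicted advantage at N* survives fit uncertainty.

    loss_base / loss_variant: predicted losses at the target size N*.
    ci: combined uncertainty of the two predictions (assumed precomputed).
    """
    gap = loss_base - loss_variant
    if gap <= ci:
        return "not safe to scale: collect more sizes"
    return f"scale up: predicted gain {gap:.3f} exceeds uncertainty {ci:.3f}"
```

For example, a predicted gain of 0.10 nats against an uncertainty of 0.05 clears the bar; a 0.02 gain does not.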
Related skills and files:
- stage3-lr-range-test (or use muP).
- skills/stage3-mup-coord-check: without muP, scaling fits are unreliable.
- skills/stage3-small-scale-ablation: the small-scale gap is the input; this skill extrapolates that gap.
- skills/stage3-compute-budget: once you have the predicted loss, decide whether the compute is justified.
- template/curry_train/prevalidate/scaling_fit.py.