From curry-train
Estimate trial-to-trial variance by running the same configuration with multiple random seeds and checking whether claimed improvements exceed that variance. Activate when the user asks "is my improvement real or noise", "how many seeds do I need", "multi-seed variance check", "statistical significance for ML", or after any A/B comparison that ran with only one seed each.
```shell
npx claudepluginhub curryfromuestc/curry-train --plugin curry-train
```

This skill uses the workspace's default tool permissions.
The single most over-skipped step in deep learning evaluation: distinguishing real improvements from luck. Many published "improvements" don't survive a 3-seed re-run.
"Is my reported improvement larger than the variance I would get from running the same configuration with different random seeds?"
If yes: progress is real (likely). If no: the "improvement" is within noise; do not draw a conclusion.
Run each arm (baseline B and variant V) with N different seeds. Minimum: N = 3.

- Compute mean_B and std_B over seeds; same for V.
- pooled_std = sqrt((var_B + var_V) / 2) (when N is small, equal-arm pooled std is a reasonable simplification).
- Declare a real difference only if |mean_V − mean_B| > 2 × pooled_std (rough 2-sigma rule).

This is a coarse approximation of a t-test. If the user has the budget, use the actual two-sample Welch's t-test from scipy.stats.ttest_ind(equal_var=False).
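When the budget allows the real test, Welch's t-test can be run directly on the per-seed scores. A minimal sketch — the per-seed numbers below are illustrative, not from any real run:

```python
from scipy import stats

# Per-seed headline metrics for baseline (B) and variant (V); values are made up.
scores_b = [0.801, 0.795, 0.803, 0.798, 0.800]
scores_v = [0.815, 0.812, 0.818, 0.810, 0.816]

# Welch's two-sample t-test: equal_var=False drops the equal-variance assumption,
# which the crude pooled-std rule above bakes in.
t_stat, p_value = stats.ttest_ind(scores_v, scores_b, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

A small p-value (conventionally below 0.05) plays the same role as the 2-sigma rule: the gap between arms is unlikely to be seed noise.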
For approximately Gaussian per-seed scores, 2 × std gives roughly 95% confidence that the improvement is not from chance. Underlying assumptions: per-seed scores are independent and approximately Gaussian, and both arms have similar variance (which the equal-arm pooled std requires).
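The ~95% figure is just the Gaussian mass within ±2σ, which can be checked from the standard normal CDF using only the standard library:

```python
import math

# P(|Z| < 2) for a standard normal Z:
# Phi(2) - Phi(-2) = erf(2 / sqrt(2))
mass_within_2_sigma = math.erf(2 / math.sqrt(2))
print(f"{mass_within_2_sigma:.4f}")  # ~0.9545
```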
Seed everything that introduces stochasticity:
```python
import os
import random

import numpy as np
import torch

def set_seed(seed: int):
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # if cuDNN nondeterminism matters:
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```
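A quick sanity check that seeding actually pins down the stochastic parts, shown here for the NumPy portion only (the torch checks are analogous):

```python
import numpy as np

def draw(seed: int) -> np.ndarray:
    np.random.seed(seed)   # same global-seed call set_seed makes
    return np.random.rand(4)

# Same seed -> identical draws; different seed -> (almost surely) different draws.
a, b, c = draw(0), draw(0), draw(1)
assert np.array_equal(a, b)
assert not np.array_equal(a, c)
```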
The DataLoader's shuffle must use a Generator seeded with the same seed (see skills/stage1-data-pipeline).
Sources of seed-to-seed variance, in roughly decreasing order of magnitude for transformers:
Reducing variance by aggressive determinism is not necessarily desirable — the variance is a real property of the algorithm. Better to measure it than hide it.
Lives at template/curry_train/prevalidate/variance.py. Sketch:
```python
import math

import numpy as np

# run_training(cfg) is the template's training entry point, defined elsewhere;
# it returns an object with a .headline_metric attribute.

def multi_seed_run(config, n_seeds: int) -> dict:
    """Run the same config with n_seeds different seeds.

    Returns {"per_seed_metrics": [...], "mean": ..., "std": ..., "n": n_seeds}.
    """
    results = []
    for seed in range(n_seeds):
        cfg = config.copy()
        cfg.seed = seed
        result = run_training(cfg)
        results.append(result.headline_metric)
    arr = np.array(results)
    return {"per_seed_metrics": results,
            "mean": arr.mean(),
            "std": arr.std(ddof=1),   # ddof=1: sample std, since seeds are a sample
            "n": n_seeds}

def compare_arms(arm_b, arm_v) -> str:
    pooled = math.sqrt((arm_b["std"] ** 2 + arm_v["std"] ** 2) / 2)
    delta = arm_v["mean"] - arm_b["mean"]
    if abs(delta) > 2 * pooled:
        return f"V {'better' if delta > 0 else 'worse'} (Δ {delta:+.4f}, pooled σ {pooled:.4f})"
    return f"indistinguishable (Δ {delta:+.4f}, pooled σ {pooled:.4f}, |Δ| < 2σ)"
```
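The same 2-sigma comparison, written as a standalone helper over raw per-seed arrays so it can be tried without the training loop (the per-seed numbers are invented for illustration):

```python
import math
import numpy as np

def verdict(scores_b, scores_v):
    b, v = np.asarray(scores_b), np.asarray(scores_v)
    pooled = math.sqrt((b.std(ddof=1) ** 2 + v.std(ddof=1) ** 2) / 2)
    delta = v.mean() - b.mean()
    if abs(delta) > 2 * pooled:
        return f"V {'better' if delta > 0 else 'worse'} (Δ {delta:+.4f}, pooled σ {pooled:.4f})"
    return f"indistinguishable (Δ {delta:+.4f}, pooled σ {pooled:.4f})"

# Clearly separated arms vs. arms within noise:
print(verdict([0.795, 0.801, 0.798], [0.815, 0.812, 0.818]))
print(verdict([0.795, 0.801, 0.798], [0.799, 0.805, 0.796]))
```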
Insist on N ≥ 3. If the user only has results for one seed each, tell them: the comparison is variance-blind. Do not let them claim improvement until variance is measured.
If running multi-seed is infeasible (very expensive runs), at minimum collect the baseline std by running B with N seeds, even if V gets only one seed. Then use 2 × std_B as the comparison threshold.
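Under that fallback, the check degenerates to a threshold test against the baseline's spread. A sketch with hypothetical numbers:

```python
import numpy as np

baseline_scores = np.array([0.795, 0.801, 0.798, 0.800])  # B run with N seeds
variant_score = 0.812                                      # V run with a single seed

std_b = baseline_scores.std(ddof=1)
threshold = 2 * std_b                      # 2 × std_B, as above
delta = variant_score - baseline_scores.mean()
improved = delta > threshold
print(f"Δ {delta:+.4f} vs threshold {threshold:.4f} -> "
      f"{'tentative improvement' if improved else 'within noise'}")
```

Note this is weaker than the full multi-seed comparison: it assumes V's seed-to-seed variance resembles B's, which is exactly what was not measured.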
Render the comparison as a one-liner verdict, not a paragraph. Examples:
```
verdict: V better (Δ = +0.014, pooled σ = 0.005, |Δ| > 2σ)
verdict: indistinguishable (Δ = +0.003, pooled σ = 0.012)
```

If the verdict is "indistinguishable", redirect compute to a different idea or commit to N = 10 seeds before deciding.
- skills/stage3-small-scale-ablation — uses this skill to make the scale-up decision.
- skills/stage3-kill-criterion — exclude diverged runs cleanly.
- skills/stage6-variance-aware-decision — same logic at full scale.
- skills/runs-diff — produces a variance-aware verdict from journals.
- template/curry_train/prevalidate/variance.py.