From curry-train
Estimate trial-to-trial variance by running the same configuration with multiple random seeds and checking whether claimed improvements exceed that variance. Activate when the user asks "is my improvement real or noise", "how many seeds do I need", "multi-seed variance check", "statistical significance for ML", or after any A/B comparison that ran with only one seed each.
```shell
npx claudepluginhub curryfromuestc/curry-train --plugin curry-train
```

This skill uses the workspace's default tool permissions.
The single most over-skipped step in deep learning evaluation: distinguishing real improvements from luck. Many published "improvements" don't survive a 3-seed re-run.
"Is my reported improvement larger than the variance I would get from running the same configuration with different random seeds?"
If yes: progress is real (likely). If no: the "improvement" is within noise; do not draw a conclusion.
Run each arm (baseline B and variant V) with N different seeds. Minimum: N = 3.

- Compute mean_B and std_B over seeds; same for V.
- pooled_std = sqrt((var_B + var_V) / 2) (when N is small, equal-arm pooled std is a reasonable simplification).
- Declare a real difference only if |mean_V − mean_B| > 2 × pooled_std (rough 2-sigma rule).

This is a coarse approximation of a t-test. If the user has the budget, use the actual two-sample Welch's t-test from scipy.stats.ttest_ind(equal_var=False).
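When the budget allows the real test, Welch's t-test can be run directly on the per-seed scores. A minimal sketch — the per-seed numbers below are illustrative, not from any real run:

```python
from scipy import stats

# Per-seed headline metrics for baseline (B) and variant (V); values are made up.
scores_b = [0.801, 0.795, 0.803, 0.798, 0.800]
scores_v = [0.815, 0.812, 0.818, 0.810, 0.816]

# Welch's two-sample t-test: equal_var=False drops the equal-variance assumption,
# which the crude pooled-std rule above bakes in.
t_stat, p_value = stats.ttest_ind(scores_v, scores_b, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

A small p-value (conventionally below 0.05) plays the same role as the 2-sigma rule: the gap between arms is unlikely to be seed noise.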
For approximately Gaussian per-seed scores, 2 × std gives roughly 95% confidence that the improvement is not from chance. Underlying assumptions: per-seed scores are independent and approximately Gaussian, and both arms have similar variance (which the equal-arm pooled std requires).
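The ~95% figure is just the Gaussian mass within ±2σ, which can be checked from the standard normal CDF using only the standard library:

```python
import math

# P(|Z| < 2) for a standard normal Z:
# Phi(2) - Phi(-2) = erf(2 / sqrt(2))
mass_within_2_sigma = math.erf(2 / math.sqrt(2))
print(f"{mass_within_2_sigma:.4f}")  # ~0.9545
```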
Seed everything that introduces stochasticity:
```python
import os
import random

import numpy as np
import torch

def set_seed(seed: int):
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # if cuDNN nondeterminism matters:
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```
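A quick sanity check that seeding actually pins down the stochastic parts, shown here for the NumPy portion only (the torch checks are analogous):

```python
import numpy as np

def draw(seed: int) -> np.ndarray:
    np.random.seed(seed)   # same global-seed call set_seed makes
    return np.random.rand(4)

# Same seed -> identical draws; different seed -> (almost surely) different draws.
a, b, c = draw(0), draw(0), draw(1)
assert np.array_equal(a, b)
assert not np.array_equal(a, c)
```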
The DataLoader's shuffle must use a Generator seeded with the same seed (see skills/stage1-data-pipeline).
Sources of seed-to-seed variance, in roughly decreasing order of magnitude for transformers:
Reducing variance by aggressive determinism is not necessarily desirable — the variance is a real property of the algorithm. Better to measure it than hide it.
Lives at template/curry_train/prevalidate/variance.py. Sketch:
```python
import math

import numpy as np

# run_training(cfg) is the template's training entry point, defined elsewhere;
# it returns an object with a .headline_metric attribute.

def multi_seed_run(config, n_seeds: int) -> dict:
    """Run the same config with n_seeds different seeds.

    Returns {"per_seed_metrics": [...], "mean": ..., "std": ..., "n": n_seeds}.
    """
    results = []
    for seed in range(n_seeds):
        cfg = config.copy()
        cfg.seed = seed
        result = run_training(cfg)
        results.append(result.headline_metric)
    arr = np.array(results)
    return {"per_seed_metrics": results,
            "mean": arr.mean(),
            "std": arr.std(ddof=1),   # ddof=1: sample std, since seeds are a sample
            "n": n_seeds}

def compare_arms(arm_b, arm_v) -> str:
    pooled = math.sqrt((arm_b["std"] ** 2 + arm_v["std"] ** 2) / 2)
    delta = arm_v["mean"] - arm_b["mean"]
    if abs(delta) > 2 * pooled:
        return f"V {'better' if delta > 0 else 'worse'} (Δ {delta:+.4f}, pooled σ {pooled:.4f})"
    return f"indistinguishable (Δ {delta:+.4f}, pooled σ {pooled:.4f}, |Δ| < 2σ)"
```
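The same 2-sigma comparison, written as a standalone helper over raw per-seed arrays so it can be tried without the training loop (the per-seed numbers are invented for illustration):

```python
import math
import numpy as np

def verdict(scores_b, scores_v):
    b, v = np.asarray(scores_b), np.asarray(scores_v)
    pooled = math.sqrt((b.std(ddof=1) ** 2 + v.std(ddof=1) ** 2) / 2)
    delta = v.mean() - b.mean()
    if abs(delta) > 2 * pooled:
        return f"V {'better' if delta > 0 else 'worse'} (Δ {delta:+.4f}, pooled σ {pooled:.4f})"
    return f"indistinguishable (Δ {delta:+.4f}, pooled σ {pooled:.4f})"

# Clearly separated arms vs. arms within noise:
print(verdict([0.795, 0.801, 0.798], [0.815, 0.812, 0.818]))
print(verdict([0.795, 0.801, 0.798], [0.799, 0.805, 0.796]))
```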
Insist on N ≥ 3. If the user only has results for one seed each, tell them: the comparison is variance-blind. Do not let them claim improvement until variance is measured.
If running multi-seed is infeasible (very expensive runs), at minimum collect the baseline std by running B with N seeds, even if V gets only one seed. Then use 2 × std_B as the comparison threshold.
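Under that fallback, the check degenerates to a threshold test against the baseline's spread. A sketch with hypothetical numbers:

```python
import numpy as np

baseline_scores = np.array([0.795, 0.801, 0.798, 0.800])  # B run with N seeds
variant_score = 0.812                                      # V run with a single seed

std_b = baseline_scores.std(ddof=1)
threshold = 2 * std_b                      # 2 × std_B, as above
delta = variant_score - baseline_scores.mean()
improved = delta > threshold
print(f"Δ {delta:+.4f} vs threshold {threshold:.4f} -> "
      f"{'tentative improvement' if improved else 'within noise'}")
```

Note this is weaker than the full multi-seed comparison: it assumes V's seed-to-seed variance resembles B's, which is exactly what was not measured.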
Render the comparison as a one-liner verdict, not a paragraph. Examples:
```
verdict: V better (Δ = +0.014, pooled σ = 0.005, |Δ| > 2σ)
verdict: indistinguishable (Δ = +0.003, pooled σ = 0.012)
```

If the verdict is "indistinguishable", redirect compute to a different idea or commit to N = 10 seeds before deciding.
- skills/stage3-small-scale-ablation — uses this skill to make the scale-up decision.
- skills/stage3-kill-criterion — exclude diverged runs cleanly.
- skills/stage6-variance-aware-decision — same logic at full scale.
- skills/runs-diff — produces a variance-aware verdict from journals.
- template/curry_train/prevalidate/variance.py.