At full scale, after multi-seed runs of two configs, decide whether one is genuinely better, using the same multi-seed variance machinery as Stage 3 but applied to long runs. Activate when the user asks "is run A really better than run B", "did this change help at scale", or "post-hoc significance", or asks to compare two completed long runs.
`npx claudepluginhub curryfromuestc/curry-train --plugin curry-train`

This skill uses the workspace's default tool permissions.
The full-scale counterpart of `stage3-multi-seed-variance`. After multi-seed runs at the target size, decide whether the variant is genuinely better than the baseline.
"Has this change produced a real, statistically meaningful improvement at the target scale?"
This is what the project is for; don't get it wrong by under-seeding or over-claiming.
- N_B baseline seeds at the target scale (typically N_B ≥ 3).
- N_V variant seeds at the target scale.

The decision rule is identical to Stage 3:

|mean_V − mean_B| > 2 × pooled_std → the improvement is probably real.

The difference vs. Stage 3 is cost: each seed at the target scale is expensive, which is exactly what creates the pressure to under-seed and over-claim.
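A minimal sketch of the rule, assuming per-seed final metrics are already collected; the function name and the dof-weighted pooling of per-arm variances are illustrative choices:

```python
import numpy as np

def two_sigma_verdict(baseline, variant):
    """Apply |mean_V - mean_B| > 2 * pooled_std to per-seed final metrics."""
    b = np.asarray(baseline, dtype=float)
    v = np.asarray(variant, dtype=float)
    # Pool the Bessel-corrected per-arm variances, weighted by degrees of freedom.
    pooled_var = (((len(b) - 1) * b.var(ddof=1) + (len(v) - 1) * v.var(ddof=1))
                  / (len(b) + len(v) - 2))
    pooled_std = float(np.sqrt(pooled_var))
    delta = float(v.mean() - b.mean())
    return delta, pooled_std, abs(delta) > 2 * pooled_std
```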
The target number of seeds depends on the variance and the effect size: the smaller the effect relative to run-to-run noise, the more seeds each arm needs.
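For a rough seed budget, a standard normal-approximation power calculation can be used; this is textbook statistics rather than anything this skill prescribes, and the names below are illustrative:

```python
from scipy.stats import norm

def seeds_per_arm(effect, run_std, alpha=0.05, power=0.8):
    """Rough seeds per arm to detect `effect` given run-to-run `run_std`:
    n = 2 * ((z_{1-alpha/2} + z_{power}) * run_std / effect) ** 2
    """
    z_a = norm.ppf(1 - alpha / 2)
    z_b = norm.ppf(power)
    return 2 * ((z_a + z_b) * run_std / effect) ** 2

# e.g. detecting a 0.01 loss gap against 0.008 run-to-run std needs ~10 seeds/arm.
print(seeds_per_arm(effect=0.01, run_std=0.008))
```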
Don't accept N = 1 per arm at scale. If budget forces N = 1, frame the result as "preliminary" and explicitly state the variance is unmeasured.
Wording is part of the rigor. Use quantified statements that report the effect size, the spread, and the seed count per arm.

Avoid: "V is better" (no quantification), "V tends to be better" (a weasel phrase with no numbers).
Use Welch's t-test for unequal sample sizes. The 2-sigma rule still works as a coarse approximation when one arm is much smaller (e.g. N_B = 5, N_V = 3).
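A sketch using `scipy.stats.ttest_ind`; the per-seed losses below are placeholders:

```python
from scipy.stats import ttest_ind

baseline = [2.412, 2.405, 2.431, 2.398, 2.420]  # N_B = 5, placeholder final losses
variant = [2.389, 2.401, 2.377]                 # N_V = 3, placeholder final losses

# equal_var=False selects Welch's t-test, which tolerates unequal N and variance.
t_stat, p_value = ttest_ind(variant, baseline, equal_var=False)
print(f"Welch t = {t_stat:.2f}, p = {p_value:.3f}")
```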
A run that hit its kill-criterion is not a "result"; it's a divergence. Two reasonable choices: exclude the diverged seed from the statistics and report the divergence rate alongside the means, or count the divergence against the variant as part of its result. The right choice depends on whether divergences are due to bad luck or are an intrinsic property of the variant.
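A sketch of both options, assuming diverged runs are recorded as NaN (an encoding chosen here purely for illustration):

```python
import math

variant_runs = [2.389, float("nan"), 2.377, 2.401]  # NaN = hit kill-criterion

finished = [x for x in variant_runs if not math.isnan(x)]
divergence_rate = 1 - len(finished) / len(variant_runs)

# Choice 1: exclude divergences from the stats, but report the rate alongside.
mean_finished = sum(finished) / len(finished)
print(f"divergence rate {divergence_rate:.0%}, mean of finished runs {mean_finished:.4f}")

# Choice 2: count each divergence against the variant (e.g. as a worst-case
# value), so instability shows up directly in the comparison.
```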
Some metrics (rare-event accuracy, BLEU on small test sets) are not approximately Gaussian. Use a non-parametric test (Mann-Whitney U) instead of a t-test; `scipy.stats.mannwhitneyu` is the standard choice.
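A sketch using `scipy.stats.mannwhitneyu`; the per-seed scores are placeholders:

```python
from scipy.stats import mannwhitneyu

baseline = [31.2, 30.8, 31.9, 30.5, 31.4]  # placeholder BLEU per seed
variant = [32.1, 31.7, 32.4]

# Rank-based test: makes no Gaussian assumption about the metric.
u_stat, p_value = mannwhitneyu(variant, baseline, alternative="two-sided")
print(f"U = {u_stat}, p = {p_value:.3f}")
```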
1. Reject one-seed-per-arm comparisons. Tell the user the comparison is variance-blind and refuse to issue a verdict.
2. If multi-seed runs exist, compute mean, std, Δ, pooled std, and the decision threshold.
3. Render a one-line verdict using the wording guidance above (a sketch follows this list). Pick the wording the data justifies, not the wording the user wants.
4. Surface secondary regressions: throughput delta, memory delta, divergence rate. "Marginally better loss but 30% slower" is often net-worse.
5. Recommend follow-up: if marginal, run another N seeds; if conclusive, archive the result and move on.
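A sketch of the verdict line from steps 2-3; the tier boundaries and the phrasings are illustrative, not this skill's exact wording:

```python
def verdict_line(metric, delta, pooled_std, n_b, n_v):
    """Pick the wording the data justifies (tiers here are illustrative)."""
    if abs(delta) > 2 * pooled_std:
        claim = "difference is probably real"
    elif abs(delta) > pooled_std:
        claim = "suggestive only; add seeds before claiming a win"
    else:
        claim = "no measurable difference at this seed count"
    return (f"{metric}: delta={delta:+.4f}, 2*pooled_std={2 * pooled_std:.4f}, "
            f"N_B={n_b}, N_V={n_v}: {claim}")

print(verdict_line("val_loss", -0.021, 0.008, n_b=5, n_v=3))
```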
- `stage3-surrogate-task` (if not already done).
- `skills/stage3-multi-seed-variance`: same logic at small scale.
- `skills/runs-diff`: produces the verdict from journal data.
- `skills/runs-diff`: slash command that wraps this analysis.
- `skills/stage6-ablation-matrix`: applies the decision across a set of variants, not just two arms.