At full scale, after multi-seed runs of two configs, decide whether one is genuinely better, using the same multi-seed variance machinery as Stage 3 but applied to long runs. Activate when the user asks "is run A really better than run B", "did this change help at scale", or "post-hoc significance", or asks to compare two completed long runs.
`npx claudepluginhub curryfromuestc/curry-train --plugin curry-train`

This skill uses the workspace's default tool permissions.
The full-scale counterpart of `stage3-multi-seed-variance`. After multi-seed runs at the target size, decide whether the variant is genuinely better than the baseline.
"Has this change produced a real, statistically meaningful improvement at the target scale?"
This is what the project is for; don't get it wrong by under-seeding or over-claiming.
- N_B baseline seeds at the target scale (typically N_B ≥ 3).
- N_V variant seeds at the target scale.

The decision rule is identical to Stage 3:

|mean_V − mean_B| > 2 × pooled_std → the improvement is probably real.

The difference vs. Stage 3 is cost: each seed at the target scale is expensive, which is exactly what creates the pressure to under-seed and over-claim.
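A minimal sketch of the rule, assuming per-seed final metrics are already collected; the function name and the dof-weighted pooling of per-arm variances are illustrative choices:

```python
import numpy as np

def two_sigma_verdict(baseline, variant):
    """Apply |mean_V - mean_B| > 2 * pooled_std to per-seed final metrics."""
    b = np.asarray(baseline, dtype=float)
    v = np.asarray(variant, dtype=float)
    # Pool the Bessel-corrected per-arm variances, weighted by degrees of freedom.
    pooled_var = (((len(b) - 1) * b.var(ddof=1) + (len(v) - 1) * v.var(ddof=1))
                  / (len(b) + len(v) - 2))
    pooled_std = float(np.sqrt(pooled_var))
    delta = float(v.mean() - b.mean())
    return delta, pooled_std, abs(delta) > 2 * pooled_std
```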
The target number of seeds depends on the variance and the effect size: the smaller the effect relative to run-to-run noise, the more seeds each arm needs.
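For a rough seed budget, a standard normal-approximation power calculation can be used; this is textbook statistics rather than anything this skill prescribes, and the names below are illustrative:

```python
from scipy.stats import norm

def seeds_per_arm(effect, run_std, alpha=0.05, power=0.8):
    """Rough seeds per arm to detect `effect` given run-to-run `run_std`:
    n = 2 * ((z_{1-alpha/2} + z_{power}) * run_std / effect) ** 2
    """
    z_a = norm.ppf(1 - alpha / 2)
    z_b = norm.ppf(power)
    return 2 * ((z_a + z_b) * run_std / effect) ** 2

# e.g. detecting a 0.01 loss gap against 0.008 run-to-run std needs ~10 seeds/arm.
print(seeds_per_arm(effect=0.01, run_std=0.008))
```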
Don't accept N = 1 per arm at scale. If budget forces N = 1, frame the result as "preliminary" and explicitly state the variance is unmeasured.
Wording is part of the rigor. Use quantified statements that report the effect size, the spread, and the seed count per arm.

Avoid: "V is better" (no quantification), "V tends to be better" (a weasel phrase with no numbers).
Use Welch's t-test for unequal sample sizes. The 2-sigma rule still works as a coarse approximation when one arm is much smaller (e.g. N_B = 5, N_V = 3).
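A sketch using `scipy.stats.ttest_ind`; the per-seed losses below are placeholders:

```python
from scipy.stats import ttest_ind

baseline = [2.412, 2.405, 2.431, 2.398, 2.420]  # N_B = 5, placeholder final losses
variant = [2.389, 2.401, 2.377]                 # N_V = 3, placeholder final losses

# equal_var=False selects Welch's t-test, which tolerates unequal N and variance.
t_stat, p_value = ttest_ind(variant, baseline, equal_var=False)
print(f"Welch t = {t_stat:.2f}, p = {p_value:.3f}")
```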
A run that hit its kill-criterion is not a "result"; it's a divergence. Two reasonable choices: exclude the diverged seed from the statistics and report the divergence rate alongside the means, or count the divergence against the variant as part of its result. The right choice depends on whether divergences are due to bad luck or are an intrinsic property of the variant.
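A sketch of both options, assuming diverged runs are recorded as NaN (an encoding chosen here purely for illustration):

```python
import math

variant_runs = [2.389, float("nan"), 2.377, 2.401]  # NaN = hit kill-criterion

finished = [x for x in variant_runs if not math.isnan(x)]
divergence_rate = 1 - len(finished) / len(variant_runs)

# Choice 1: exclude divergences from the stats, but report the rate alongside.
mean_finished = sum(finished) / len(finished)
print(f"divergence rate {divergence_rate:.0%}, mean of finished runs {mean_finished:.4f}")

# Choice 2: count each divergence against the variant (e.g. as a worst-case
# value), so instability shows up directly in the comparison.
```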
Some metrics (rare-event accuracy, BLEU on small test sets) are not approximately Gaussian. Use a non-parametric test (Mann-Whitney U) instead of a t-test; `scipy.stats.mannwhitneyu` is the standard choice.
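A sketch using `scipy.stats.mannwhitneyu`; the per-seed scores are placeholders:

```python
from scipy.stats import mannwhitneyu

baseline = [31.2, 30.8, 31.9, 30.5, 31.4]  # placeholder BLEU per seed
variant = [32.1, 31.7, 32.4]

# Rank-based test: makes no Gaussian assumption about the metric.
u_stat, p_value = mannwhitneyu(variant, baseline, alternative="two-sided")
print(f"U = {u_stat}, p = {p_value:.3f}")
```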
1. Reject one-seed-per-arm comparisons. Tell the user the comparison is variance-blind and refuse to issue a verdict.
2. If multi-seed runs exist, compute mean, std, Δ, pooled std, and the decision threshold.
3. Render a one-line verdict using the wording guidance above (a sketch follows this list). Pick the wording the data justifies, not the wording the user wants.
4. Surface secondary regressions: throughput delta, memory delta, divergence rate. "Marginally better loss but 30% slower" is often net-worse.
5. Recommend follow-up: if marginal, run another N seeds; if conclusive, archive the result and move on.
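A sketch of the verdict line from steps 2-3; the tier boundaries and the phrasings are illustrative, not this skill's exact wording:

```python
def verdict_line(metric, delta, pooled_std, n_b, n_v):
    """Pick the wording the data justifies (tiers here are illustrative)."""
    if abs(delta) > 2 * pooled_std:
        claim = "difference is probably real"
    elif abs(delta) > pooled_std:
        claim = "suggestive only; add seeds before claiming a win"
    else:
        claim = "no measurable difference at this seed count"
    return (f"{metric}: delta={delta:+.4f}, 2*pooled_std={2 * pooled_std:.4f}, "
            f"N_B={n_b}, N_V={n_v}: {claim}")

print(verdict_line("val_loss", -0.021, 0.008, n_b=5, n_v=3))
```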
- `stage3-surrogate-task` (if not already done).
- `skills/stage3-multi-seed-variance`: same logic at small scale.
- `skills/runs-diff`: produces the verdict from journal data.
- `skills/runs-diff`: slash command that wraps this analysis.
- `skills/stage6-ablation-matrix`: applies the decision across a set of variants, not just two arms.