Compare two training runs and produce a concise markdown diff covering config, key metrics, loss curves, and grad-norm trajectory. Activate when the user asks to "compare run A and run B", "diff two experiments", "did this change actually help", or "is this run better than the previous one". This is both the implementation of the action and the methodology guide for variance-aware comparison.
Install with `npx claudepluginhub curryfromuestc/curry-train --plugin curry-train`. This skill uses the workspace's default tool permissions.
Produce a short, decision-grade comparison between two run directories. Output is markdown with a config diff table, a metric delta table, and a one-line verdict that explicitly states whether the difference is within trial variance (i.e., probably noise) or appears real.
Also use it to back a `stage6-variance-aware-decision` skill activation that needs concrete numbers.

## Inputs

- `<run-a>`, `<run-b>`: paths to two run directories. A "run directory" is the journal layout produced by `template/curry_train/infra/journal.py` and contains `config.yaml`, `metrics.jsonl`, `git_sha.txt`, and `seed.txt`.
- `--metric=<name>` (optional): the metric to treat as the headline (default: `loss`).

## Steps

**Locate the journals.** Glob for `<run-*>/{config.yaml,metrics.jsonl,git_sha.txt,seed.txt}`. If any of those four files are missing, warn but continue with what is available.
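A minimal sketch of this step, assuming the four journal files sit at the top level of each run directory (the helper name is hypothetical, not part of `journal.py`):

```python
from pathlib import Path

JOURNAL_FILES = ("config.yaml", "metrics.jsonl", "git_sha.txt", "seed.txt")

def locate_journal(run_dir: str) -> dict[str, Path]:
    """Collect the journal files present in run_dir, warning about any that are missing."""
    root = Path(run_dir)
    found = {}
    for name in JOURNAL_FILES:
        path = root / name
        if path.exists():
            found[name] = path
        else:
            print(f"warning: {path} missing; continuing with what is available")
    return found
```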
**Diff the configs.** Run `diff -u` on the two `config.yaml` files and present only the changed keys. Do not dump the entire YAML.
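As a pure-Python alternative to shelling out to `diff -u`, here is a minimal sketch that assumes a flat (non-nested) `config.yaml`; nested configs would need a recursive walk:

```python
from pathlib import Path

import yaml  # PyYAML

def config_diff(path_a: str, path_b: str) -> list[str]:
    """Unified-diff-style lines for keys whose values differ between two flat configs."""
    a = yaml.safe_load(Path(path_a).read_text()) or {}
    b = yaml.safe_load(Path(path_b).read_text()) or {}
    out = []
    for key in sorted(set(a) | set(b)):
        va, vb = a.get(key, "<absent>"), b.get(key, "<absent>")
        if va != vb:
            out.append(f"- {key}: {va}")
            out.append(f"+ {key}: {vb}")
    return out
```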
**Summarize the headline metric.** From `metrics.jsonl`, compute for each run:

- the final logged value,
- the best value and the step at which it occurred,
- the mean over the last 10% of logged points (`last-10%-mean`).
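A minimal sketch of these three statistics, assuming one JSON object per line, a `step` field (the field name is an assumption), and a lower-is-better metric:

```python
import json

def summarize_metric(metrics_path: str, metric: str = "loss") -> dict | None:
    """final, best (with step), and last-10%-mean for one metric series, lower-is-better."""
    steps, values = [], []
    with open(metrics_path) as f:
        for i, line in enumerate(f):
            rec = json.loads(line)
            if metric in rec:
                steps.append(rec.get("step", i))
                values.append(rec[metric])
    if not values:
        return None  # metric never logged in this run
    best_i = min(range(len(values)), key=values.__getitem__)
    tail = values[-max(1, len(values) // 10):]  # last 10% of logged points
    return {
        "final": values[-1],
        "best": (values[best_i], steps[best_i]),
        "last-10%-mean": sum(tail) / len(tail),
    }
```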
**Variance check (the key step).** If both runs have a sibling sweep of additional seeds (look for `seed-*` peers in the same parent directory), compute the standard deviation of the headline metric across seeds for each run, then the difference in means, and report whether it exceeds `2 * pooled_std`. If there is only one seed per arm, say so explicitly and warn the user that the comparison is variance-blind.
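A minimal sketch of the decision rule, assuming the per-seed headline values have already been collected and that lower is better; the pooled standard deviation here is the standard two-sample formula, which the skill text does not pin down:

```python
import statistics

def variance_verdict(seeds_a: list[float], seeds_b: list[float]) -> str:
    """Map per-seed headline values to a verdict; a difference is real only if |Δ| > 2σ."""
    if len(seeds_a) < 2 or len(seeds_b) < 2:
        return "verdict: unknown: only one seed per arm; rerun with N>=3 seeds"
    n_a, n_b = len(seeds_a), len(seeds_b)
    pooled = (((n_a - 1) * statistics.variance(seeds_a)
               + (n_b - 1) * statistics.variance(seeds_b)) / (n_a + n_b - 2)) ** 0.5
    delta = statistics.mean(seeds_b) - statistics.mean(seeds_a)
    if abs(delta) <= 2 * pooled:
        return "verdict: indistinguishable within trial variance"
    # lower-is-better assumed: a negative delta means run B improved the metric
    winner = "B" if delta < 0 else "A"
    return f"verdict: {winner} clearly better (|Δ| > 2σ across seeds)"
```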
**Inspect grad-norm and lr trajectories.** Read the `grad_norm` and `lr` series from `metrics.jsonl` and note any divergence patterns (e.g., spikes in B but not A, or a warmup mismatch).
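One crude way to surface spikes, shown for illustration only; the 3x-median threshold is an arbitrary choice, not something the skill specifies:

```python
import statistics

def spike_steps(grad_norms: list[float], factor: float = 3.0) -> list[int]:
    """Indices where grad-norm exceeds `factor` times the series median (crude spike test)."""
    if not grad_norms:
        return []
    med = statistics.median(grad_norms)
    return [i for i, g in enumerate(grad_norms) if g > factor * med]
```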
**Verdict line.** End the report with exactly one of:

- `verdict: B clearly better (Δ > 2σ across seeds)`
- `verdict: A clearly better (Δ > 2σ across seeds)`
- `verdict: indistinguishable within trial variance`
- `verdict: unknown — only one seed per arm; rerun with N≥3 seeds before claiming improvement`

The full report follows this skeleton:

# Runs diff: <a> vs <b>
## Config diff
<unified diff of changed keys>
## Headline metric: <metric>
| metric | run-a | run-b | Δ (B−A) |
|-----------------|-------|-------|---------|
| final | ? | ? | ? |
| best (step) | ?(?) | ?(?) | ? |
| last-10%-mean | ? | ? | ? |
## Stability
- grad-norm trajectory: <summary>
- lr schedule diff: <summary or "identical">
## Variance
- run-a seeds: N, σ = ?
- run-b seeds: N, σ = ?
- Δ vs 2σ: <inside | outside>
## Verdict
<one line, verdict: ...>
## Notes

- Read metrics from `metrics.jsonl` only.
- If `--metric` is not present in both runs, list the available metrics and ask which one to use.
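A minimal sketch of collecting the candidate metric names for that prompt; treating every numeric key other than `step` as a metric is an assumption about the journal format:

```python
import json

def available_metrics(metrics_path: str) -> set[str]:
    """Every numeric key logged in a metrics.jsonl file, excluding the step counter."""
    keys: set[str] = set()
    with open(metrics_path) as f:
        for line in f:
            rec = json.loads(line)
            keys |= {k for k, v in rec.items()
                     if isinstance(v, (int, float)) and not isinstance(v, bool)
                     and k != "step"}
    return keys
```

If the requested metric is not in `available_metrics(run_a) & available_metrics(run_b)`, print the sorted union and ask the user to pick one.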