Compare two training runs and produce a concise markdown diff covering config, key metrics, loss curves, and grad-norm trajectory. Activate when the user asks to "compare run A and run B", "diff two experiments", "did this change actually help", or "is this run better than the previous one". This is both the implementation of the action and the methodology guide for variance-aware comparison.
Install with `npx claudepluginhub curryfromuestc/curry-train --plugin curry-train`. This skill uses the workspace's default tool permissions.
Produce a short, decision-grade comparison between two run directories. Output is markdown with a config diff table, a metric delta table, and a one-line verdict that explicitly states whether the difference is within trial variance (i.e., probably noise) or appears real.
Also use it to back a `stage6-variance-aware-decision` skill activation that needs concrete numbers.

## Inputs

- `<run-a>`, `<run-b>`: paths to two run directories. A "run directory" is the journal layout produced by `template/curry_train/infra/journal.py` and contains `config.yaml`, `metrics.jsonl`, `git_sha.txt`, and `seed.txt`.
- `--metric=<name>` (optional): the metric to treat as the headline (default: `loss`).

## Steps

**Locate the journals.** Glob for `<run-*>/{config.yaml,metrics.jsonl,git_sha.txt,seed.txt}`. If any of those four files are missing, warn but continue with what is available.
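A minimal sketch of this step, assuming the four journal files sit at the top level of each run directory (the helper name is hypothetical, not part of `journal.py`):

```python
from pathlib import Path

JOURNAL_FILES = ("config.yaml", "metrics.jsonl", "git_sha.txt", "seed.txt")

def locate_journal(run_dir: str) -> dict[str, Path]:
    """Collect the journal files present in run_dir, warning about any that are missing."""
    root = Path(run_dir)
    found = {}
    for name in JOURNAL_FILES:
        path = root / name
        if path.exists():
            found[name] = path
        else:
            print(f"warning: {path} missing; continuing with what is available")
    return found
```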
**Diff the configs.** Run `diff -u` on the two `config.yaml` files and present only the changed keys. Do not dump the entire YAML.
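As a pure-Python alternative to shelling out to `diff -u`, here is a minimal sketch that assumes a flat (non-nested) `config.yaml`; nested configs would need a recursive walk:

```python
from pathlib import Path

import yaml  # PyYAML

def config_diff(path_a: str, path_b: str) -> list[str]:
    """Unified-diff-style lines for keys whose values differ between two flat configs."""
    a = yaml.safe_load(Path(path_a).read_text()) or {}
    b = yaml.safe_load(Path(path_b).read_text()) or {}
    out = []
    for key in sorted(set(a) | set(b)):
        va, vb = a.get(key, "<absent>"), b.get(key, "<absent>")
        if va != vb:
            out.append(f"- {key}: {va}")
            out.append(f"+ {key}: {vb}")
    return out
```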
**Summarize the headline metric.** From `metrics.jsonl`, compute for each run:

- the final logged value,
- the best value and the step at which it occurred,
- the mean over the last 10% of logged points (`last-10%-mean`).
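A minimal sketch of these three statistics, assuming one JSON object per line, a `step` field (the field name is an assumption), and a lower-is-better metric:

```python
import json

def summarize_metric(metrics_path: str, metric: str = "loss") -> dict | None:
    """final, best (with step), and last-10%-mean for one metric series, lower-is-better."""
    steps, values = [], []
    with open(metrics_path) as f:
        for i, line in enumerate(f):
            rec = json.loads(line)
            if metric in rec:
                steps.append(rec.get("step", i))
                values.append(rec[metric])
    if not values:
        return None  # metric never logged in this run
    best_i = min(range(len(values)), key=values.__getitem__)
    tail = values[-max(1, len(values) // 10):]  # last 10% of logged points
    return {
        "final": values[-1],
        "best": (values[best_i], steps[best_i]),
        "last-10%-mean": sum(tail) / len(tail),
    }
```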
**Variance check (the key step).** If both runs have a sibling sweep of additional seeds (look for `seed-*` peers in the same parent directory), compute the standard deviation of the headline metric across seeds for each run, then the difference in means, and report whether it exceeds `2 * pooled_std`. If there is only one seed per arm, say so explicitly and warn the user that the comparison is variance-blind.
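A minimal sketch of the decision rule, assuming the per-seed headline values have already been collected and that lower is better; the pooled standard deviation here is the standard two-sample formula, which the skill text does not pin down:

```python
import statistics

def variance_verdict(seeds_a: list[float], seeds_b: list[float]) -> str:
    """Map per-seed headline values to a verdict; a difference is real only if |Δ| > 2σ."""
    if len(seeds_a) < 2 or len(seeds_b) < 2:
        return "verdict: unknown: only one seed per arm; rerun with N>=3 seeds"
    n_a, n_b = len(seeds_a), len(seeds_b)
    pooled = (((n_a - 1) * statistics.variance(seeds_a)
               + (n_b - 1) * statistics.variance(seeds_b)) / (n_a + n_b - 2)) ** 0.5
    delta = statistics.mean(seeds_b) - statistics.mean(seeds_a)
    if abs(delta) <= 2 * pooled:
        return "verdict: indistinguishable within trial variance"
    # lower-is-better assumed: a negative delta means run B improved the metric
    winner = "B" if delta < 0 else "A"
    return f"verdict: {winner} clearly better (|Δ| > 2σ across seeds)"
```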
**Inspect grad-norm and lr trajectories.** Read the `grad_norm` and `lr` series from `metrics.jsonl` and note any divergence patterns (e.g., spikes in B but not A, or a warmup mismatch).
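One crude way to surface spikes, shown for illustration only; the 3x-median threshold is an arbitrary choice, not something the skill specifies:

```python
import statistics

def spike_steps(grad_norms: list[float], factor: float = 3.0) -> list[int]:
    """Indices where grad-norm exceeds `factor` times the series median (crude spike test)."""
    if not grad_norms:
        return []
    med = statistics.median(grad_norms)
    return [i for i, g in enumerate(grad_norms) if g > factor * med]
```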
**Verdict line.** End the report with exactly one of:

- `verdict: B clearly better (Δ > 2σ across seeds)`
- `verdict: A clearly better (Δ > 2σ across seeds)`
- `verdict: indistinguishable within trial variance`
- `verdict: unknown — only one seed per arm; rerun with N≥3 seeds before claiming improvement`

The full report follows this skeleton:

# Runs diff: <a> vs <b>
## Config diff
<unified diff of changed keys>
## Headline metric: <metric>
| metric | run-a | run-b | Δ (B−A) |
|-----------------|-------|-------|---------|
| final | ? | ? | ? |
| best (step) | ?(?) | ?(?) | ? |
| last-10%-mean | ? | ? | ? |
## Stability
- grad-norm trajectory: <summary>
- lr schedule diff: <summary or "identical">
## Variance
- run-a seeds: N, σ = ?
- run-b seeds: N, σ = ?
- Δ vs 2σ: <inside | outside>
## Verdict
<one line, verdict: ...>
## Notes

- Read metrics from `metrics.jsonl` only.
- If `--metric` is not present in both runs, list the available metrics and ask which one to use.
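A minimal sketch of collecting the candidate metric names for that prompt; treating every numeric key other than `step` as a metric is an assumption about the journal format:

```python
import json

def available_metrics(metrics_path: str) -> set[str]:
    """Every numeric key logged in a metrics.jsonl file, excluding the step counter."""
    keys: set[str] = set()
    with open(metrics_path) as f:
        for line in f:
            rec = json.loads(line)
            keys |= {k for k, v in rec.items()
                     if isinstance(v, (int, float)) and not isinstance(v, bool)
                     and k != "step"}
    return keys
```

If the requested metric is not in `available_metrics(run_a) & available_metrics(run_b)`, print the sorted union and ask the user to pick one.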