From curry-train
Run a short, reproducible benchmark of one optimizer step (forward + backward + optimizer step over N microbatches) using the project's registered runtime. Activate when the user asks to "benchmark a training step", "measure throughput", "time one optimizer step", or "smoke test the runtime". Wraps run_accumulated_step from curry_train.benchmark.
How this skill is triggered — by the user, by Claude, or both
Slash command
/curry-train:benchThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
Execute `run_accumulated_step` from `curry_train.benchmark` over a small number of optimizer steps and report wall-clock, tokens/sec, peak memory, and final loss. The intent is **smoke testing**, not throughput tuning — see `stage4-capacity-sweep` and `stage4-optuna-integration` for proper benchmarking.
Execute run_accumulated_step from curry_train.benchmark over a small number of optimizer steps and report wall-clock, tokens/sec, peak memory, and final loss. The intent is smoke testing, not throughput tuning — see stage4-capacity-sweep and stage4-optuna-integration for proper benchmarking.
new-experiment skill, to confirm the freshly scaffolded model produces a non-NaN loss in one step.--model=<name> (optional): a registered model name. Defaults to whatever the project's configs/default.yaml selects.--steps=N (optional, default 5): number of optimizer steps to run.--accum=K (optional, default 1): gradient-accumulation microbatches per optimizer step. Should match GradientAccumulation.steps in the config.Resolve runtime and model. Use curry_train.models.create_runtime(<model>) and runtime.build_model() to get a ModelHandle.
Build the gradient-accumulation primitive.
from curry_train.primitives import GradientAccumulation
accum = GradientAccumulation(steps=K)
Loop N optimizer steps. Each step calls run_accumulated_step(runtime, handle, microbatches, accum, ...) with synthetic or fixture microbatches (the model package should expose a dummy_batch() helper for this purpose).
Measure. Record per-step:
time.perf_counter)AccumulatedStepResult.tokens)torch.cuda.max_memory_allocated() if CUDA available)OptimizerStepResult.grad_norm)Report. Print a small markdown table. Add a single warning line if loss is NaN, grad-norm is 0, or step time variance > 50 % (suggests warmup or first-step compile cost not yet absorbed).
# bench — model=<name>, steps=N, accum=K
| step | loss | grad-norm | tokens | step-time (s) | peak-mem (MiB) |
|------|--------|-----------|--------|---------------|----------------|
| 1 | ? | ? | ? | ? | ? |
| 2 | ? | ? | ? | ? | ? |
| ...
throughput: ? tokens/sec (mean of steps 2..N)
verdict: <ok | NaN-loss | grad-norm-zero | step-time-unstable>
dummy_batch() from the model package or generate torch.randn of the expected shape.stage4-capacity-sweep and Optuna-driven sweeps.curry_train.models.list_models() and ask the user which one.diagnose skill (the user can ask to debug the failure in natural language).stage1-scaffolder for the template.npx claudepluginhub curryfromuestc/curry-train --plugin curry-trainCreates, edits, and optimizes skills for Claude Code, including drafting, evaluating with test prompts, iterating on performance, and improving skill descriptions for better triggering accuracy.