Skill

bench

Run a short, reproducible benchmark of one optimizer step (forward + backward + optimizer step over N microbatches) using the project's registered runtime. Activate when the user asks to "benchmark a training step", "measure throughput", "time one optimizer step", or "smoke test the runtime". Wraps run_accumulated_step from curry_train.benchmark.

npx claudepluginhub curryfromuestc/curry-train --plugin curry-train

Tool Access

This skill uses the workspace's default tool permissions.



Stats

Stars: 0

Forks: 0

Last commit: May 4, 2026


Quick training-step benchmark

Execute `run_accumulated_step` from `curry_train.benchmark` over a small number of optimizer steps and report wall-clock time, tokens/sec, peak memory, and final loss. The intent is **smoke testing**, not throughput tuning; see `stage4-capacity-sweep` and `stage4-optuna-integration` for proper benchmarking.

When to invoke

- The user asks to "smoke test the runtime", "time a step", "is the model running at all", "how many tokens/sec", or "verify the accumulation works".
- After scaffolding a new experiment via the `new-experiment` skill, to confirm that the freshly scaffolded model produces a non-NaN loss in one step.

Inputs

- `--model=<name>` (optional): a registered model name. Defaults to whatever the project's `configs/default.yaml` selects.
- `--steps=N` (optional, default 5): number of optimizer steps to run.
- `--accum=K` (optional, default 1): number of gradient-accumulation microbatches per optimizer step. Should match `GradientAccumulation.steps` in the config.
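The three inputs above could be parsed along these lines. This is a minimal sketch using the standard library's `argparse`; the skill's real argument handling is not shown in this document, and `build_parser` and `parse_inputs` are hypothetical names.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented inputs and defaults."""
    parser = argparse.ArgumentParser(prog="bench")
    # None means "defer to whatever configs/default.yaml selects".
    parser.add_argument("--model", default=None, help="registered model name")
    parser.add_argument("--steps", type=int, default=5, help="optimizer steps to run")
    parser.add_argument("--accum", type=int, default=1, help="microbatches per optimizer step")
    return parser


def parse_inputs(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the inputs and reject values that cannot describe a run."""
    args = build_parser().parse_args(argv)
    if args.steps < 1:
        raise ValueError("--steps must be at least 1")
    if args.accum < 1:
        raise ValueError("--accum must be at least 1")
    return args
```

For example, `parse_inputs(["--model=tiny", "--steps=3"])` yields `model="tiny"`, `steps=3`, and the default `accum=1`; `tiny` is a made-up model name.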

Procedure

1. Resolve runtime and model. Use `curry_train.models.create_runtime(<model>)` and `runtime.build_model()` to get a `ModelHandle`.

2. Build the gradient-accumulation primitive.

       from curry_train.primitives import GradientAccumulation
       # K is the value of --accum (default 1)
       accum = GradientAccumulation(steps=K)

3. Loop N optimizer steps. Each step calls `run_accumulated_step(runtime, handle, microbatches, accum, ...)` with synthetic or fixture microbatches (the model package should expose a `dummy_batch()` helper for this purpose).
4. Measure. Record per step:
   - wall time (`time.perf_counter`)
   - tokens processed (`AccumulatedStepResult.tokens`)
   - loss
   - peak GPU memory (`torch.cuda.max_memory_allocated()` if CUDA is available)
   - grad-norm (`OptimizerStepResult.grad_norm`)
5. Report. Print a small markdown table. Add a single warning line if the loss is NaN, the grad-norm is 0, or step-time variance exceeds 50% (which suggests warmup or first-step compile cost not yet absorbed).
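The loop and measurement steps might look roughly like the sketch below. It does not import `curry_train`: `step_fn` is a hypothetical zero-argument callable standing in for one `run_accumulated_step(...)` call, and it is assumed to return `(loss, grad_norm, tokens)`. `StepRecord` and `bench_loop` are likewise illustrative names. Resetting the peak-memory counter before every step is an assumption made so the figure is per step; `torch` is treated as optional.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

try:
    import torch
except ImportError:  # the sketch still runs on machines without PyTorch
    torch = None


@dataclass
class StepRecord:
    step: int
    loss: float
    grad_norm: float
    tokens: int
    seconds: float
    peak_mem_mib: Optional[float]  # None when CUDA is unavailable


def _cuda_available() -> bool:
    return torch is not None and torch.cuda.is_available()


def bench_loop(step_fn: Callable[[], Tuple[float, float, int]], n_steps: int) -> List[StepRecord]:
    """Time n_steps calls of step_fn and collect one record per step."""
    records: List[StepRecord] = []
    for step in range(1, n_steps + 1):
        if _cuda_available():
            # Assumption: reset so max_memory_allocated() reflects this step only.
            torch.cuda.reset_peak_memory_stats()
        start = time.perf_counter()
        loss, grad_norm, tokens = step_fn()
        if _cuda_available():
            # Wait for queued GPU work so the wall time covers the whole step.
            torch.cuda.synchronize()
        seconds = time.perf_counter() - start
        peak = torch.cuda.max_memory_allocated() / 2**20 if _cuda_available() else None
        records.append(StepRecord(step, loss, grad_norm, tokens, seconds, peak))
    return records
```

In real use, `step_fn` would wrap the call to `run_accumulated_step` and unpack the loss, grad-norm, and token count from its result; how those fields are laid out on the result object is not specified here.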

Output template

# bench — model=<name>, steps=N, accum=K

| step | loss   | grad-norm | tokens | step-time (s) | peak-mem (MiB) |
|------|--------|-----------|--------|---------------|----------------|
| 1    |   ?    |     ?     |   ?    |       ?       |       ?        |
| 2    |   ?    |     ?     |   ?    |       ?       |       ?        |
| ...

throughput: ? tokens/sec (mean of steps 2..N)
verdict: <ok | NaN-loss | grad-norm-zero | step-time-unstable>
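The `throughput` and `verdict` lines of the template could be derived as in the sketch below. Two points are assumptions rather than documented behavior: throughput is computed as total tokens over total time for steps 2..N, and "step-time variance > 50%" is read as a coefficient of variation (population standard deviation divided by mean) above 0.5. The order in which the checks are applied is also a guess, and both function names are illustrative.

```python
import math
import statistics
from typing import Sequence


def throughput(tokens: Sequence[int], seconds: Sequence[float]) -> float:
    """Tokens/sec over steps 2..N; step 1 is dropped as warmup."""
    if len(tokens) < 2 or len(tokens) != len(seconds):
        raise ValueError("need matching sequences with at least two steps")
    return sum(tokens[1:]) / sum(seconds[1:])


def verdict(
    losses: Sequence[float],
    grad_norms: Sequence[float],
    seconds: Sequence[float],
    max_rel_spread: float = 0.5,  # assumed reading of the 50% threshold
) -> str:
    """Return one of the template's verdict values; the first failing check wins."""
    if any(math.isnan(loss) for loss in losses):
        return "NaN-loss"
    if any(norm == 0 for norm in grad_norms):
        return "grad-norm-zero"
    if len(seconds) >= 2:
        mean = statistics.fmean(seconds)
        # Step 1 is kept here on purpose: a slow first step is what this flags.
        if mean > 0 and statistics.pstdev(seconds) / mean > max_rel_spread:
            return "step-time-unstable"
    return "ok"
```

With step times of 10 s, 1 s, and 1 s the spread is well above the threshold, so the verdict is `step-time-unstable` even though steps 2 and 3 agree with each other.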

Boundaries

- Do not use real datasets. Bench is a smoke test; use `dummy_batch()` from the model package or generate a `torch.randn` tensor of the expected shape.
- Do not run for thousands of steps. If the user wants a real throughput measurement, point them to `stage4-capacity-sweep` and Optuna-driven sweeps.
- Do not save a checkpoint. Bench is ephemeral.

Failure modes

- No registered model: list the registered models from `curry_train.models.list_models()` and ask the user which one to use.
- Runtime build fails: surface the traceback verbatim, then suggest the `diagnose` skill (the user can ask to debug the failure in natural language).
- `dummy_batch` missing: tell the user to add it to their model package; refer them to `stage1-scaffolder` for the template.
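The first failure mode can be sketched as a small check that turns an unknown name into a message listing the alternatives. The sketch assumes `curry_train.models.list_models()` returns an iterable of name strings, which this document does not confirm; the names are passed in as `registered` so the example runs on its own. `UnknownModelError` and `check_model` are hypothetical.

```python
from typing import Iterable, Optional


class UnknownModelError(LookupError):
    """Raised when the requested model is not among the registered names."""


def check_model(requested: Optional[str], registered: Iterable[str]) -> Optional[str]:
    """Return the requested name if it is registered, else raise with the options."""
    names = sorted(registered)
    if requested is None:
        # No --model given: defer to the project's default config.
        return None
    if requested in names:
        return requested
    options = ", ".join(names) if names else "(none registered)"
    raise UnknownModelError(
        f"model {requested!r} is not registered; choose one of: {options}"
    )
```

The message text is what the skill would relay to the user before asking which model to use.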