Run a short, reproducible benchmark of one optimizer step (forward + backward + optimizer step over N microbatches) using the project's registered runtime. Activate when the user asks to "benchmark a training step", "measure throughput", "time one optimizer step", or "smoke test the runtime". Wraps run_accumulated_step from curry_train.benchmark.
Install:

```
npx claudepluginhub curryfromuestc/curry-train --plugin curry-train
```

This skill uses the workspace's default tool permissions.
Execute `run_accumulated_step` from `curry_train.benchmark` over a small number of optimizer steps and report wall-clock, tokens/sec, peak memory, and final loss. The intent is **smoke testing**, not throughput tuning — see `stage4-capacity-sweep` and `stage4-optuna-integration` for proper benchmarking.
Useful right after the `new-experiment` skill, to confirm the freshly scaffolded model produces a non-NaN loss in one step.

**Arguments**

- `--model=<name>` (optional): a registered model name. Defaults to whatever the project's `configs/default.yaml` selects.
- `--steps=N` (optional, default 5): number of optimizer steps to run.
- `--accum=K` (optional, default 1): gradient-accumulation microbatches per optimizer step. Should match `GradientAccumulation.steps` in the config.

**Steps**

1. **Resolve runtime and model.** Use `curry_train.models.create_runtime(<model>)` and `runtime.build_model()` to get a `ModelHandle`.
2. **Build the gradient-accumulation primitive.**

   ```python
   from curry_train.primitives import GradientAccumulation

   accum = GradientAccumulation(steps=K)
   ```
3. **Loop N optimizer steps.** Each step calls `run_accumulated_step(runtime, handle, microbatches, accum, ...)` with synthetic or fixture microbatches (the model package should expose a `dummy_batch()` helper for this purpose); see the consolidated sketch after this list.
4. **Measure.** Record per-step:
   - step time (`time.perf_counter`)
   - tokens processed (`AccumulatedStepResult.tokens`)
   - peak memory (`torch.cuda.max_memory_allocated()` if CUDA available)
   - gradient norm (`OptimizerStepResult.grad_norm`)
5. **Report.** Print a small markdown table. Add a single warning line if loss is NaN, grad-norm is 0, or step-time variance exceeds 50 % (which suggests warmup or first-step compile cost not yet absorbed).
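A minimal end-to-end sketch of steps 1–5, for orientation only. It assumes the `curry_train` calls named above (`create_runtime`, `build_model`, `GradientAccumulation`, `run_accumulated_step`) and that the step result exposes the `loss`, `grad_norm`, and `tokens` fields described in step 4; `make_batch` is a stand-in for the model package's `dummy_batch()` helper:

```python
import time

import torch

from curry_train.benchmark import run_accumulated_step
from curry_train.models import create_runtime
from curry_train.primitives import GradientAccumulation


def bench(make_batch, model_name=None, steps=5, accum_steps=1):
    # Step 1: resolve runtime and model.
    runtime = create_runtime(model_name)
    handle = runtime.build_model()

    # Step 2: build the gradient-accumulation primitive.
    accum = GradientAccumulation(steps=accum_steps)

    records = []
    for step in range(1, steps + 1):
        # Step 3: one synthetic microbatch per accumulation step.
        microbatches = [make_batch() for _ in range(accum_steps)]

        if torch.cuda.is_available():
            torch.cuda.reset_peak_memory_stats()

        # Step 4: wall-clock one full optimizer step.
        t0 = time.perf_counter()
        result = run_accumulated_step(runtime, handle, microbatches, accum)
        step_time = time.perf_counter() - t0

        peak_mib = (torch.cuda.max_memory_allocated() / 2**20
                    if torch.cuda.is_available() else float("nan"))
        records.append({
            "step": step,
            "loss": float(result.loss),
            "grad_norm": float(result.grad_norm),
            "tokens": int(result.tokens),
            "step_time": step_time,
            "peak_mib": peak_mib,
        })
    return records
```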
**Report format**

```markdown
# bench — model=<name>, steps=N, accum=K

| step | loss | grad-norm | tokens | step-time (s) | peak-mem (MiB) |
|------|------|-----------|--------|---------------|----------------|
| 1    | ?    | ?         | ?      | ?             | ?              |
| 2    | ?    | ?         | ?      | ?             | ?              |
| ...  |      |           |        |               |                |

throughput: ? tokens/sec (mean of steps 2..N)
verdict: <ok | NaN-loss | grad-norm-zero | step-time-unstable>
```
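A sketch of the report step under the same assumptions, consuming the `records` list from the loop above. The 50 % instability check reads "variance" as relative spread, (max − min) / mean, over steps 2..N — one plausible interpretation, not a project-defined metric:

```python
import math
import statistics


def report(records, model_name, accum_steps):
    lines = [
        f"# bench — model={model_name}, steps={len(records)}, accum={accum_steps}",
        "| step | loss | grad-norm | tokens | step-time (s) | peak-mem (MiB) |",
        "|------|------|-----------|--------|---------------|----------------|",
    ]
    for r in records:
        lines.append(
            f"| {r['step']} | {r['loss']:.4f} | {r['grad_norm']:.3f} "
            f"| {r['tokens']} | {r['step_time']:.3f} | {r['peak_mib']:.0f} |"
        )

    # Throughput over steps 2..N: step 1 often pays warmup/compile cost.
    warm = records[1:] or records
    tokens_per_sec = sum(r["tokens"] for r in warm) / sum(r["step_time"] for r in warm)
    lines.append(f"throughput: {tokens_per_sec:.0f} tokens/sec (mean of steps 2..N)")

    times = [r["step_time"] for r in warm]
    spread = ((max(times) - min(times)) / statistics.mean(times)
              if len(times) > 1 else 0.0)
    if any(math.isnan(r["loss"]) for r in records):
        verdict = "NaN-loss"
    elif any(r["grad_norm"] == 0 for r in records):
        verdict = "grad-norm-zero"
    elif spread > 0.5:
        verdict = "step-time-unstable"
    else:
        verdict = "ok"
    lines.append(f"verdict: {verdict}")
    return "\n".join(lines)
```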
**Notes**

- If no batch fixture is available, use `dummy_batch()` from the model package or generate `torch.randn` of the expected shape (see the fallback sketch below).
- This skill is for smoke testing; for real throughput work, use `stage4-capacity-sweep` and Optuna-driven sweeps.
- If the requested model is not registered, call `curry_train.models.list_models()` and ask the user which one to use.
- If the step fails, hand off to the `diagnose` skill (the user can ask to debug the failure in natural language).
- To scaffold a new experiment first, see `stage1-scaffolder` for the template.
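Where no `dummy_batch()` helper exists, a synthetic fallback along these lines can stand in; the batch keys, vocabulary size, and shapes are placeholders to be read from the model config, not real `curry_train` values:

```python
import torch


def random_batch(batch_size=2, seq_len=128, vocab_size=32_000):
    # Integer token ids for a language model; use torch.randn(...) instead
    # for models that consume continuous inputs of a known shape.
    tokens = torch.randint(0, vocab_size, (batch_size, seq_len))
    return {"input_ids": tokens, "labels": tokens.clone()}
```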