Methodology-first deep learning training framework. Ideas are cheap; infrastructure that lets you validate them fast is valuable.
```
npx claudepluginhub curryfromuestc/curry-train --plugin curry-train
```

Use this agent to diagnose a failed or stalled training run by reading recent logs, metrics, config, and traces. Trigger when the user asks "why did this crash", reports NaN/OOM/divergence, or shows an unhealthy loss curve. Produces a structured diagnosis with ranked candidate causes and fixes.
Use this agent to propose an Optuna search space and kick off a hyperparameter study. Trigger when the user asks to "tune hyperparameters", "set up an Optuna sweep", "search over LR and weight decay", or "find the best hyperparameters for this experiment". Operates only on configs that already passed Stage 3 pre-validation.
Use this agent to compare two completed training runs and produce a concise variance-aware markdown diff (config, metrics, stability, verdict). Trigger when the user asks "did this change help", "is run A better than run B", or "compare two experiments". Reads run journals only; does not re-run training.
Use this agent to scaffold a new model package (config.py, model.py, checkpoint.py, protocol.py + Hydra config) inside a curryTrain project. Trigger when the user asks to "add a new model called X", "scaffold an experiment", or "generate a curryTrain model from this HF model".
Run a short, reproducible benchmark of one optimizer step (forward + backward + optimizer step over N microbatches) using the project's registered runtime. Activate when the user asks to "benchmark a training step", "measure throughput", "time one optimizer step", or "smoke test the runtime". Wraps run_accumulated_step from curry_train.benchmark.
Diagnose a training failure or stall by inspecting recent logs, loss curves, OOM traces, NaN events, and config. Activate when the user asks "why did my training crash", "loss went to NaN", "OOM during step X", "training is not improving", or "help me debug this run". Delegates to the failure-diagnoser agent.
Lightning Fabric integration recipe — minimal 5-line setup that gives DDP / FSDP / mixed precision while keeping a raw PyTorch training loop. Activate when the user asks "Lightning Fabric", "torchrun", "DDP setup", "FSDP setup", "mixed precision", or wires up the launch script.
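As a rough sketch of what that setup looks like (assuming an existing `model`, `optimizer`, and `train_loader`; the strategy and precision flags are illustrative, not curryTrain's exact wiring):

```python
from lightning.fabric import Fabric

fabric = Fabric(accelerator="cuda", devices=4, strategy="ddp", precision="bf16-mixed")
fabric.launch()
model, optimizer = fabric.setup(model, optimizer)
train_loader = fabric.setup_dataloaders(train_loader)
# In the raw loop, replace loss.backward() with fabric.backward(loss).
```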
Hydra + OmegaConf configuration layout for curryTrain projects — composable defaults, structured configs, CLI override syntax, sweep integration. Activate when the user asks "Hydra setup", "config management", "compose configs", "override CLI", "Hydra defaults list", or builds the experiment configuration.
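A minimal entry point in that style (the `conf` directory and `config` name are illustrative):

```python
import hydra
from omegaconf import DictConfig, OmegaConf

@hydra.main(config_path="conf", config_name="config", version_base=None)
def main(cfg: DictConfig) -> None:
    print(OmegaConf.to_yaml(cfg))   # the composed config, after CLI overrides

if __name__ == "__main__":
    main()

# CLI override syntax: python train.py model=llama_small optimizer.lr=3e-4
```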
Concrete recipe for running an Optuna-driven hyperparameter sweep through Hydra, with TPE/CMA-ES/Hyperband, distributed multi-rank trials, study persistence, and per-trial run journal. Activate when the user asks "set up an Optuna sweep", "run hyperparameter search", "Hydra Optuna sweeper", or "parallel HPO".
A Logger protocol decoupling the training code from any specific tracking backend (W&B, MLflow, Aim, TensorBoard) — with TensorBoard as the zero-dependency default. Activate when the user asks "experiment tracking", "W&B integration", "TensorBoard setup", "MLflow", "switch tracking backend", or wants tracking without lock-in.
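A sketch of what such a protocol can look like (method names here are illustrative, not necessarily curryTrain's actual interface):

```python
from typing import Protocol

class Logger(Protocol):
    """Minimal tracking interface; any backend matching it plugs in."""
    def log_scalar(self, name: str, value: float, step: int) -> None: ...

class TensorBoardLogger:
    """Default backend: no third-party tracking service required."""
    def __init__(self, log_dir: str = "runs"):
        from torch.utils.tensorboard import SummaryWriter
        self.writer = SummaryWriter(log_dir)

    def log_scalar(self, name: str, value: float, step: int) -> None:
        self.writer.add_scalar(name, value, step)
```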
Bootstrap a new curryTrain project by copying the framework Python template into the user's working directory. Use when the user runs /curry-train:init, asks to "start a new training project with curryTrain", "initialize curryTrain", "scaffold a curryTrain project", or wants the framework code copied locally for editing.
Scaffold a new model + dataset + config triple inside an existing curryTrain project. Activate when the user asks to "add a new model to curryTrain", "scaffold a new experiment", "generate model.py / dataset.py / config.yaml for X", or describes an idea they want to start training. Delegates the actual generation to the scaffolder agent.
Sharding the sequence dimension across ranks (Ring Attention) for very long contexts that don't fit attention memory on a single GPU. Activate when the user asks "context parallel", "CP", "Ring Attention", "long context training", "32k+ sequence", or runs out of memory on attention rather than parameters.
Distributed checkpoint format that survives changes in world size and parallelism topology. Built on torch.distributed.checkpoint. Activate when the user asks "DCP", "distributed checkpoint", "resume on different topology", "FSDP checkpoint", or "save sharded model".
Optimizer state sharding across DP ranks (ZeRO-1, ZeRO-2, ZeRO-3 / FSDP). Reduces per-rank memory by sharding gradient and/or parameter copies. Activate when the user asks "ZeRO", "FSDP", "optimizer sharding", "distributed optimizer", or "OOM in optimizer".
A bank of parallel expert MLPs that consume routed tokens from TopKRouter and return per-token outputs. The "doer" half of an MoE block. Activate when the user asks "MoE experts", "expert MLPs", "Mixtral", "expert parallel", or builds an MoE block.
Grouped-query attention with rotary positional embedding (RoPE). Standard component in modern LLMs (Llama-2/3, Qwen2/3, Mistral). Activate when the user asks "GQA", "grouped query attention", "RoPE", "rotary embedding", "attention with KV groups", or builds a transformer.
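A compact sketch of the two ingredients; the shapes and the GQA expansion via `repeat_interleave` follow the common HF-style formulation, not curryTrain-specific code:

```python
import torch

def rotate_half(x):
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rope(q, k, positions, base=10000.0):
    # q: (B, S, n_heads, d); k: (B, S, n_kv_heads, d); positions: (S,)
    d = q.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, d, 2, dtype=torch.float32) / d))
    freqs = torch.outer(positions.float(), inv_freq)      # (S, d/2)
    emb = torch.cat((freqs, freqs), dim=-1)               # (S, d)
    cos, sin = emb.cos()[None, :, None, :], emb.sin()[None, :, None, :]
    return q * cos + rotate_half(q) * sin, k * cos + rotate_half(k) * sin

def expand_kv(k, n_heads):
    # GQA: each KV head serves a group of query heads.
    return k.repeat_interleave(n_heads // k.shape[2], dim=2)
```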
Compose multiple microbatch forward/backward passes into a single optimizer step, enabling effective batch sizes larger than memory permits. Activate when the user asks "gradient accumulation", "accumulate gradients", "effective batch size", "OOM at larger batch", or asks how to set the GA factor.
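The core pattern, sketched (`microbatches`, `model`, `loss_fn`, and `optimizer` are placeholders):

```python
accum_steps = 8     # effective batch = microbatch size * accum_steps

optimizer.zero_grad(set_to_none=True)
for x, y in microbatches:                      # accum_steps microbatches
    loss = loss_fn(model(x), y) / accum_steps  # scale so gradients average
    loss.backward()                            # grads sum into param.grad
optimizer.step()                               # a single optimizer step
```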
Bidirectional weight conversion between HuggingFace transformers format and curryTrain internal format, including offline fallback when the HF Hub is unreachable. Activate when the user asks "load HF weights", "HuggingFace bridge", "convert weights", "HF Hub unreachable", or "offline weight loading".
Leaky Integrate-and-Fire spiking neuron with surrogate gradient — converts continuous activations into binary spike trains over T timesteps. Used by spiking transformer architectures (CSLA-MT). Activate when the user asks "LIF neuron", "spiking neural network", "SNN", "spike encoding", "surrogate gradient", or wires up a spiking layer.
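A minimal LIF layer with a rectangular surrogate gradient, for illustration (the `tau`, threshold, and hard-reset choices are common conventions, not the only options):

```python
import torch

class SpikeFn(torch.autograd.Function):
    """Heaviside spike in forward; rectangular surrogate gradient in backward."""
    @staticmethod
    def forward(ctx, v, threshold):
        ctx.save_for_backward(v)
        ctx.threshold = threshold
        return (v >= threshold).float()

    @staticmethod
    def backward(ctx, grad_out):
        (v,) = ctx.saved_tensors
        surrogate = (torch.abs(v - ctx.threshold) < 0.5).float()  # box window
        return grad_out * surrogate, None

def lif_forward(x, tau=2.0, threshold=1.0):
    # x: (T, batch, features), continuous input current over T timesteps.
    v = torch.zeros_like(x[0])
    spikes = []
    for t in range(x.shape[0]):
        v = v + (x[t] - v) / tau        # leaky integration of membrane potential
        s = SpikeFn.apply(v, threshold)
        v = v * (1.0 - s)               # hard reset where a spike fired
        spikes.append(s)
    return torch.stack(spikes)          # binary spike train, same shape as x
```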
Centralized state tracking the multi-dimensional parallelism topology — DP rank, TP rank, PP rank, EP rank, CP rank — and the communication groups for each. Activate when the user asks "parallel state", "process group", "rank topology", "world setup", or wires up multi-dim parallelism.
Pipeline-parallel schedules (1F1B, interleaved 1F1B, GPipe). Manages microbatches flowing through stages on different ranks. Activate when the user asks "pipeline parallel", "PP", "1F1B", "GPipe", "interleaved pipeline", or has more layers than fit on a single node.
Activation checkpointing — recompute forward activations during backward instead of storing them, trading compute for memory. Activate when the user asks "activation checkpointing", "recompute", "OOM during backward", "gradient checkpointing", or needs to fit a larger model into memory.
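In plain PyTorch the trade looks like this (a sketch; curryTrain's wrapper may differ):

```python
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedStack(nn.Module):
    def __init__(self, blocks: nn.ModuleList):
        super().__init__()
        self.blocks = blocks

    def forward(self, x):
        for block in self.blocks:
            # Activations inside `block` are freed after forward and
            # recomputed during backward: less memory, extra compute.
            x = checkpoint(block, x, use_reentrant=False)
        return x
```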
Root Mean Square LayerNorm — drop the mean-subtraction from LayerNorm, keep only the RMS-based scaling. Used by Llama, Qwen, and most modern LLMs. Activate when the user asks "RMSNorm", "Llama norm", "RMS layer norm", "skip mean centering", or compares RMSNorm vs LayerNorm.
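The whole idea fits in a few lines (the eps placement follows the common Llama-style formulation):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        # Scale by the root-mean-square; no mean subtraction, no bias.
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return self.weight * (x * rms)
```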
Top-K token routing for Mixture-of-Experts — for each token, pick the K experts with highest gating score. Activate when the user asks "MoE routing", "top-k router", "switch transformer routing", "expert choice", or builds an MoE model.
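A minimal router sketch (renormalizing the softmax over the selected K, as Mixtral does; the load-balancing auxiliary loss is omitted):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKRouter(nn.Module):
    def __init__(self, dim: int, n_experts: int, k: int = 2):
        super().__init__()
        self.gate = nn.Linear(dim, n_experts, bias=False)
        self.k = k

    def forward(self, x):
        # x: (tokens, dim) -> per-token expert ids and renormalized weights.
        logits = self.gate(x)
        topk = logits.topk(self.k, dim=-1)
        weights = F.softmax(topk.values, dim=-1)   # renormalize over chosen K
        return topk.indices, weights
```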
Tensor-parallel linear layers — column-parallel and row-parallel — for splitting matmuls across GPUs along the output or input feature dimension. Activate when the user asks "tensor parallel", "column parallel linear", "row parallel linear", "TP", or "split matmul across GPUs".
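A row-parallel sketch to show the idea (assumes `torch.distributed` is already initialized; a column-parallel layer is the mirror image, sharding `out_features` and gathering outputs instead):

```python
import torch
import torch.nn as nn
import torch.distributed as dist

class RowParallelLinear(nn.Module):
    """Each TP rank holds a slice of in_features; partial outputs are summed."""
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        world = dist.get_world_size()
        assert in_features % world == 0, "in_features must divide evenly"
        self.weight = nn.Parameter(torch.empty(out_features, in_features // world))
        nn.init.normal_(self.weight, std=0.02)

    def forward(self, x_shard):
        # x_shard: (..., in_features // world), the local slice of the input.
        y_partial = x_shard @ self.weight.t()
        dist.all_reduce(y_partial)      # sum partial matmuls across TP ranks
        return y_partial
```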
Compare two training runs and produce a concise markdown diff covering config, key metrics, loss curves, and grad-norm trajectory. Activate when the user asks to "compare run A and run B", "diff two experiments", "did this change actually help", or "is this run better than the previous one". This is both the implementation of the action and the methodology guide for variance-aware comparison.
Methodology for building a leakage-safe data pipeline — split before preprocess, fit transforms on train only, time-aware splits for temporal data, deterministic shuffle. Activate when the user asks "how do I split my data", "data pipeline best practice", "is my normalizer leaking", "how to set up a dataset for curryTrain", or shows a pipeline that fits a transform on the full dataset.
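The non-negotiable ordering, sketched on a plain array (`data` is assumed to be an `(N, D)` NumPy array):

```python
import numpy as np

rng = np.random.default_rng(seed=0)              # deterministic shuffle
idx = rng.permutation(len(data))
cut = int(0.9 * len(data))
train, val = data[idx[:cut]], data[idx[cut:]]    # split BEFORE preprocessing

# Fit normalization statistics on the training split only.
mu, sigma = train.mean(axis=0), train.std(axis=0) + 1e-8
train = (train - mu) / sigma
val = (val - mu) / sigma                         # reuse train stats; never refit
```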
A canonical set of low-cost assertions to run before any non-trivial training, catching the most common "silent" bugs (zero_grad missed, train/eval mode wrong, wrong tensor shape, label leakage in transforms). Activate when the user asks "what should I check before training", "preflight checks", "is my training set up correctly", or any time a fresh model is about to be trained.
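A few of those assertions, sketched (the framework's actual check set may be broader; argument names are placeholders):

```python
import torch

def preflight(model, batch, loss_fn, optimizer):
    x, y = batch
    assert x.shape[0] == y.shape[0], "input/label batch dimensions disagree"
    model.train()                           # dropout on, BatchNorm updating
    optimizer.zero_grad(set_to_none=True)
    loss = loss_fn(model(x), y)
    assert torch.isfinite(loss), f"non-finite loss before training: {loss.item()}"
    loss.backward()
    # Every trainable parameter should receive a gradient from one backward.
    dead = [n for n, p in model.named_parameters()
            if p.requires_grad and (p.grad is None or p.grad.abs().sum() == 0)]
    assert not dead, f"parameters receiving no gradient: {dead}"
```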
Methodology for Stage 1 Skeleton — set up the minimum architecture (model, dataset adapter, config, registration) so the data-flow can be traced end-to-end before any optimization. Activate when the user asks "how do I add a new model", "what files does a curryTrain model need", "set up the architecture skeleton", "where does my model.py go", or "how does registration work".
Per-layer histograms of gradient magnitudes and activation statistics — used to detect dead layers, exploding gradients, or pathological depth scaling early. Activate when the user asks "is my gradient flow healthy", "are any layers dead", "exploding gradients", "vanishing gradients", or "what's a coord check" (related to muP).
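The collection side is cheap; for instance, with a TensorBoard writer (used here purely for illustration):

```python
from torch.utils.tensorboard import SummaryWriter

def log_grad_flow(model, writer: SummaryWriter, step: int):
    # Call after loss.backward() and before optimizer.step()/zero_grad().
    for name, p in model.named_parameters():
        if p.grad is not None:
            writer.add_histogram(f"grad/{name}", p.grad, step)
            if p.grad.abs().max() == 0:
                print(f"possible dead layer: {name} (all-zero grads at step {step})")
```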
Verify the loss at step 0 matches the value implied by a uniform-random model — typically -log(1/C) for C-way cross-entropy. Catches initialization bugs, double-softmax, missing bias init, and hidden activation issues. Activate when the user asks "what should my initial loss be", "init loss seems wrong", "is my model initialized correctly", or right after building a new model.
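The check itself is one line of math plus an assertion (the tolerance is a judgment call):

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def check_init_loss(model, x, y, num_classes, tol=0.1):
    loss = F.cross_entropy(model(x), y).item()
    expected = math.log(num_classes)        # -log(1/C) for a uniform model
    assert abs(loss - expected) < tol, (
        f"init loss {loss:.3f}, expected ~{expected:.3f}: "
        "suspect init scale, double-softmax, or biased logits")
```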
The canonical sanity check — train on 2-3 examples until the loss is near zero, proving the entire forward/backward/optimizer pipeline can fit. Activate when the user asks "how do I sanity check my model", "overfit one batch", "is my pipeline working", "loss is not decreasing", or before any real training of a fresh architecture.
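Sketched below (`train_loader`, `model`, and `optimizer` are assumed to exist; the point is the frozen batch and the near-zero target):

```python
import torch.nn.functional as F

x, y = next(iter(train_loader))          # freeze one tiny batch
x, y = x[:3], y[:3]                      # 2-3 examples is enough
for step in range(500):
    optimizer.zero_grad(set_to_none=True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()
assert loss.item() < 1e-2, "cannot memorize 3 examples: the pipeline has a bug"
```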
Estimate the compute and dollar cost of a proposed training run before launching it, and compare against the expected gain from the small-scale ablation. Activate when the user asks "how much will this cost", "is this run worth the compute", "compute budget estimator", "how long will this take", or considers launching a multi-day run.
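A back-of-the-envelope version using the standard ~6·N·D FLOPs estimate for dense transformers (the MFU and price below are assumptions, not measurements):

```python
def train_flops(n_params: float, n_tokens: float) -> float:
    # Standard dense-transformer estimate: ~6 FLOPs per parameter per token.
    return 6 * n_params * n_tokens

flops = train_flops(7e9, 1e12)            # e.g. 7B parameters on 1T tokens
effective = 312e12 * 0.40                 # A100 bf16 peak * assumed 40% MFU
gpu_hours = flops / effective / 3600
print(f"{flops:.2e} FLOPs ~= {gpu_hours:,.0f} A100-hours, "
      f"~${gpu_hours * 2:,.0f} at an assumed $2/GPU-hour")
```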
Define an abort condition before launching a run, so that a clearly broken or clearly under-performing run stops itself instead of consuming the full compute budget. Activate when the user asks "when should I kill a run", "abort condition", "early stop a bad run", "kill criterion", or before any expensive run.
Find a near-optimal learning rate by sweeping LR exponentially over a few hundred mini-batches and watching where the loss starts to diverge — the Leslie Smith "LR range test". Activate when the user asks "what learning rate should I use", "lr finder", "lr range test", "calibrate the learning rate before training", or after Stage 2 sanity checks pass and they're ready to commit compute.
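A bare-bones version of the sweep (the 4x-divergence stop rule and step count are conventions from the original paper, not hard requirements):

```python
def lr_range_test(model, optimizer, loader, loss_fn,
                  lr_min=1e-7, lr_max=1.0, num_steps=300):
    gamma = (lr_max / lr_min) ** (1.0 / num_steps)   # multiplicative LR step
    lr, history = lr_min, []
    data = iter(loader)
    for _ in range(num_steps):
        for group in optimizer.param_groups:
            group["lr"] = lr
        x, y = next(data)                 # assumes loader has >= num_steps batches
        optimizer.zero_grad(set_to_none=True)
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        history.append((lr, loss.item()))
        if loss.item() > 4 * history[0][1]:          # clearly diverging: stop
            break
        lr *= gamma
    return history   # pick an LR roughly a decade below the divergence point
```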
Estimate trial-to-trial variance by running the same configuration with multiple random seeds and checking whether claimed improvements exceed that variance. Activate when the user asks "is my improvement real or noise", "how many seeds do I need", "multi-seed variance check", "statistical significance for ML", or after any A/B comparison that ran with only one seed each.
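The arithmetic is simple; the discipline is in actually running the seeds (numbers below are illustrative):

```python
import statistics

baseline = [0.712, 0.715, 0.709]   # same config, seeds 0/1/2 (illustrative)
variant  = [0.721, 0.718, 0.716]

delta = statistics.mean(variant) - statistics.mean(baseline)
noise = max(statistics.stdev(baseline), statistics.stdev(variant))
print(f"delta = {delta:.4f}, seed noise ~ {noise:.4f}")
# Rule of thumb: trust the change only if |delta| clearly exceeds ~2x the noise.
print("likely real" if abs(delta) > 2 * noise else "indistinguishable from noise")
```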
Verify that activation statistics and gradient magnitudes are width-invariant under muP parameterization, so that hyperparameters tuned at small width transfer zero-shot to large width. Activate when the user asks "muP", "muTransfer", "tune small predict big", "coord check", "width-invariant init", or before any large-scale training where they want to skip per-width hyperparameter tuning.
Fit a power-law scaling curve to small-scale runs at multiple sizes, then extrapolate to predict large-scale loss before committing the compute. Activate when the user asks "scaling laws", "Chinchilla", "Kaplan", "predict large-scale loss from small-scale", "is my idea going to scale", or wants to do compute-optimal training.
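A sketch with SciPy (the data points are illustrative; a real fit needs compute-matched runs and care with the fitted range):

```python
import numpy as np
from scipy.optimize import curve_fit

# Final losses from short runs at several sizes (numbers are illustrative).
n_params = np.array([1e7, 3e7, 1e8, 3e8])
losses   = np.array([3.90, 3.60, 3.30, 3.05])

def power_law(n, a, b, c):
    return a * n ** (-b) + c            # L(N) = a * N^(-b) + c

(a, b, c), _ = curve_fit(power_law, n_params, losses, p0=(10.0, 0.1, 2.0))
print(f"predicted loss at 7B params: {power_law(7e9, a, b, c):.3f}")
```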
Run a tiny, cheap A/B comparison between a baseline and a new idea on a small model and short training budget — to predict whether the idea is worth a full-scale run. Activate when the user asks "should I scale this up", "is this idea worth running for real", "test this idea cheaply first", "small-scale ablation", or "validate before scaling".
Construct a tiny synthetic task that *requires* the new feature to solve, run the model on it, and use success/failure as a structural signal of whether the feature is doing what it claims. Activate when the user asks "how do I test if my new mechanism actually works", "surrogate task", "synthetic benchmark", "probe task", or claims a new architecture component "helps with X" without quantitative evidence.
Sweep model capacity (width, depth, parameter count) at fixed compute to find the saturation point — where adding more parameters stops reducing the train loss. Activate when the user asks "how big should my model be", "capacity sweep", "is my model big enough", "find the right model size", or after Stage 3 pre-validation passes.
Set up an Optuna hyperparameter study integrated with Hydra and the project's Logger protocol, supporting TPE/CMA-ES/PBT samplers and Hyperband pruning. Activate when the user asks "set up Optuna", "hyperparameter sweep", "tune hyperparameters", "Bayesian optimization for training", or wants to refine LR/batch/dropout after capacity is chosen.
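The Optuna side, stripped to essentials (`train_and_eval` is a placeholder for your short training run; the Hydra and Logger wiring is omitted):

```python
import optuna

def objective(trial: optuna.Trial) -> float:
    lr = trial.suggest_float("lr", 1e-5, 1e-2, log=True)
    wd = trial.suggest_float("weight_decay", 1e-4, 1e-1, log=True)
    dropout = trial.suggest_float("dropout", 0.0, 0.3)
    return train_and_eval(lr=lr, weight_decay=wd, dropout=dropout)

study = optuna.create_study(
    direction="minimize",
    sampler=optuna.samplers.TPESampler(seed=0),
    pruner=optuna.pruners.HyperbandPruner(),
    storage="sqlite:///study.db",       # persist trials across restarts
    study_name="lr-wd-dropout",
    load_if_exists=True,
)
study.optimize(objective, n_trials=50)
```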
Decide which parallelism primitive (DP, ZeRO, TP, PP, EP, CP) to introduce next based on what bottleneck appears at the current model size. Activate when the user asks "do I need tensor parallelism", "OOM at scale", "training too slow", "should I add pipeline parallel", "how to scale beyond N GPUs", or after capacity-sweep when single-GPU runs no longer fit.
Decide how often to checkpoint, what to checkpoint (full vs parameter-only), and how many to keep — balancing recovery, rollback, and storage. Activate when the user asks "how often should I checkpoint", "checkpoint policy", "rollback checkpoint", "DCP setup", "best-K checkpoints", or before any long-running training.
An automated recovery procedure for loss spikes during long-running training — detect a spike, roll back to a recent checkpoint, skip a window of batches, resume. Modeled on the PaLM training paper. Activate when the user asks "loss spike", "training spiked then crashed", "recover from divergence", "PaLM rollback recipe", or experiences instability mid-run.
Maintain a structured per-run journal capturing seed, config diff, git SHA, full training curves, kill events, rollbacks, and resumes — so that any run is fully reproducible and comparable later. Activate when the user asks "experiment tracking", "reproducibility", "run journal", "what should I record", or shows runs without traceable metadata.
A standard warmup-then-cosine learning rate schedule that prevents early divergence and produces stable long-run training. Activate when the user asks "what learning rate schedule", "warmup", "cosine schedule", "no warmup is bad", "schedule diverges at start", or before any long run.
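One common implementation via `LambdaLR` (the 10% floor is a convention, not a law):

```python
import math
from torch.optim.lr_scheduler import LambdaLR

def warmup_cosine(optimizer, warmup_steps: int, total_steps: int, floor: float = 0.1):
    def lr_lambda(step: int) -> float:
        if step < warmup_steps:
            return step / max(1, warmup_steps)           # linear warmup from 0
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
        return floor + (1.0 - floor) * cosine            # decay to floor * base LR
    return LambdaLR(optimizer, lr_lambda)
```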
Run a structured grid of ablation experiments (multiple changes vs baseline, possibly combinations), report the matrix with variance-aware verdicts, isolate which changes actually contributed. Activate when the user asks "ablation study", "which changes matter", "ablation table", "isolate contribution of X", or after multiple variants have been evaluated.
Cluster the dev-set errors of a model and surface the dominant failure modes — pointing at the most leverage-worthy next experiment. Activate when the user asks "what should I try next", "what is my model getting wrong", "error analysis", "failure mode analysis", or after a completed run that's no longer SOTA.
At full scale, after multi-seed runs of two configs, decide whether one is genuinely better — using the same multi-seed variance machinery as Stage 3 but applied to long runs. Activate when the user asks "is run A really better than run B", "did this change help at scale", "post-hoc significance", or comparing two completed long runs.
A methodology-first deep learning training framework, packaged as a Claude Code plugin.
Ideas are cheap. Infrastructure that lets you validate them fast is valuable.
curryTrain organizes deep learning training around the actual end-to-end workflow, not around an algorithm catalog. The plugin provides Skills, Agents, and a minimal Python template that scaffolds a new training project and assists you through six well-defined stages.
| Stage | Question it answers | Representative skills |
|---|---|---|
| 1. Skeleton | Does the architecture exist and does data flow through it? | scaffolder, preflight-asserts, data-pipeline |
| 2. Sanity | Is the implementation actually correct? | overfit-single-batch, init-loss-check, grad-flow-viz |
| 3. Pre-validate | Will this idea pay off, before I burn the compute? | lr-range-test, small-scale-ablation, multi-seed-variance, mup-coord-check, scaling-fit, surrogate-task, compute-budget, kill-criterion |
| 4. Scale-up | Will it scale stably to the target size? | capacity-sweep, optuna-integration, parallel-primitive-intro |
| 5. Stabilize | Will it survive a long run? | warmup-cosine, loss-spike-rollback, checkpoint-cadence, run-journal |
| 6. Iterate | Which experiment was actually better? | variance-aware-decision, error-cluster, ablation-matrix, runs-diff |
Stage 3 is where most projects waste compute and where curryTrain provides the most differentiated value.
/curry-train:init is exposed as a slash command; the other 46 skills auto-activate from natural-language phrasing in your messages.

template/curry_train/ — a minimal layered scaffold (Runtime / Primitive / Model) you copy into your project via /curry-train:init.

In Claude Code, run:
```
/plugin marketplace add curryfromuestc/curry-train
/plugin install curry-train@curry-train
```
This adds the GitHub repo as a marketplace and installs the curry-train plugin from it. After installation, the /curry-train:init slash command and all description-activated skills (workflow, methodology, primitive, infra) become available in your sessions.
If you cloned this repo locally and want to edit the plugin while using it:
```bash
git clone https://github.com/curryfromuestc/curry-train.git
mkdir -p ~/.claude/plugins
ln -s "$(pwd)/curry-train" ~/.claude/plugins/curry-train
```
Reload Claude Code (or run /reload-plugins) and the plugin will be picked up.
Alternatively, launch Claude Code pointing directly at the plugin directory:

```bash
claude --plugin-dir /path/to/curry-train
```
/curry-train:init is the only explicit slash command; everything else activates from natural-language phrasing.
```
# Bootstrap a new training project (copies the Python template into ./curry_train)
/curry-train:init my-experiment
```
Then drive the rest of the workflow by describing what you want:
- new-experiment skill (Stage 1)
- bench skill
- diagnose skill
- runs-diff skill

This is by design: the methodology lives in skills and triggers on what you describe, so you don't have to memorize a command surface.
Design choices:

- Logger protocol with TensorBoard as the default backend (no lock-in to W&B / MLflow)
- torchrun for launch (no custom launcher)

Architecture inspired by NVIDIA Bumblebee's three-layer split (Runtime ↔ Primitive ↔ Model). Workflow inspired by Karpathy's "A Recipe for Training Neural Networks". Built for engineers who train models — including unconventional ones (SNN, CV, multimodal) — and need fast, trustworthy iteration.
The Python core is intentionally kept small: the value lies in the methodology (skills), not in re-implementing what Lightning Fabric / Accelerate / DeepSpeed already do well.