From curry-train
Methodology for Stage 1 Skeleton — set up the minimum architecture (model, dataset adapter, config, registration) so the data-flow can be traced end-to-end before any optimization. Activate when the user asks "how do I add a new model", "what files does a curryTrain model need", "set up the architecture skeleton", "where does my model.py go", or "how does registration work".
npx claudepluginhub curryfromuestc/curry-train --plugin curry-train

This skill uses the workspace's default tool permissions.
Before any tuning, before any sanity check, the architecture must exist as **four small files** with a clean, layered boundary. This skill describes the contract.
"Does the architecture exist and does data flow through it from input to loss?"
If the answer is "I'm not sure", you're still in Stage 1.
Every model in curryTrain lives at curry_train/models/<name>/ and has exactly these files:
| File | Job | Typical size |
|---|---|---|
| config.py | Architecture parameters as a frozen dataclass. HuggingFace-style. | ~50–90 lines |
| model.py | Layers and the model class, built from curry_train.primitives.*. | ~150–260 lines |
| checkpoint.py | HF weight ↔ internal weight conversion (uses primitive-hf-bridge). | ~120–180 lines |
| protocol.py | register_model(...) call with a build function. | ~30–50 lines |
If a file grows past these sizes, you are likely mixing layers and should split.
- The training loop depends only on the TrainingRuntime protocol from curry_train.runtime. It never imports model code.
- model.py gets its building blocks via from curry_train.primitives import .... It never imports a specific runtime backend.
- configs/<name>.yaml and register_model(...) are the only public seams.

If a dependency arrow goes the wrong way, the architecture is wrong. Surface this to the user immediately.
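One way to make the layering rule mechanically checkable is to scan a file's imports. The sketch below is an illustration only: `banned_imports`, the sample source, and the ban list are hypothetical, and the `curry_train.runtimes` package path is an assumption inferred from the `runtimes/local_torch` mention in this skill.

```python
import ast


def banned_imports(source: str, banned_prefixes: tuple) -> list:
    """Return imported module names in `source` that fall under a banned prefix."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # node.module is None for "from . import x"; treat it as empty.
            names = [node.module or ""]
        else:
            continue
        for name in names:
            if any(name == p or name.startswith(p + ".") for p in banned_prefixes):
                found.append(name)
    return found


# Hypothetical model.py contents that break two layering rules.
MODEL_SOURCE = """
import torch.distributed as dist
from curry_train.primitives import RMSNorm
from curry_train.runtimes.local_torch import LocalTorchRuntime
"""

# Assumed ban list for model.py: distributed code and concrete runtime backends.
MODEL_BANNED = ("torch.distributed", "curry_train.runtimes")

violations = banned_imports(MODEL_SOURCE, MODEL_BANNED)
print(violations)
```

The source is only parsed, never executed, so the check runs even when the imported packages are not installed.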
Follow this exact order; every step is verifiable.
Decide the tensor shape contract before writing any code.

- (B, N, D).
- (B, T, N, D) where T is the spike-time dimension.
- (B, C, H, W) until a flattening point.

Document the shape contract at the top of model.py as a comment.

Write config.py first. Frozen dataclass with __post_init__ validation. No defaults that hide bugs (e.g. don't default n_layers=12 silently; require it).
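A minimal sketch of what such a config.py could look like. The class name `TinyModelConfig` and its fields are hypothetical, chosen for illustration; only the pattern (frozen dataclass, no defaults, validation in `__post_init__`) comes from the step above.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class TinyModelConfig:
    # No defaults: every architecture parameter must be passed explicitly,
    # so a forgotten value fails at construction instead of hiding a bug.
    n_layers: int
    d_model: int
    n_heads: int
    vocab_size: int

    def __post_init__(self) -> None:
        for field_name in ("n_layers", "d_model", "n_heads", "vocab_size"):
            value = getattr(self, field_name)
            if not isinstance(value, int) or value <= 0:
                raise ValueError(f"{field_name} must be a positive int, got {value!r}")
        if self.d_model % self.n_heads != 0:
            raise ValueError(
                f"d_model={self.d_model} is not divisible by n_heads={self.n_heads}"
            )

    @property
    def head_dim(self) -> int:
        # Derived value, so it cannot drift from d_model and n_heads.
        return self.d_model // self.n_heads


cfg = TinyModelConfig(n_layers=2, d_model=64, n_heads=4, vocab_size=1000)
print(cfg.head_dim)
```

Omitting `n_layers` raises `TypeError` at construction, and assigning to a field afterwards raises `FrozenInstanceError`, which is the behavior the "no silent defaults" rule is after.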
Write model.py second. Use only curry_train.primitives.* for the building blocks. If a primitive is missing, write a stub in curry_train.primitives and a one-line note in the relevant primitive-* skill — do not inline the missing logic into model.py.
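The shape contract can also be enforced at the model boundary, not just documented. The helper below is a hypothetical sketch, not a curry_train primitive: it works on plain shape tuples so it runs without any tensor library, and with PyTorch you would pass `tuple(x.shape)`.

```python
# Shape contract (illustrative): inputs are (B, N, D)
#   B = batch size, N = sequence length, D = model width


def check_shape(shape: tuple, contract: tuple, fixed: dict) -> dict:
    """Validate `shape` against named dims and return the name -> size binding.

    `contract` names each dimension, e.g. ("B", "N", "D").
    `fixed` pins the dims whose size is known from the config, e.g. {"D": 64}.
    """
    if len(shape) != len(contract):
        raise ValueError(f"expected {len(contract)} dims {contract}, got shape {shape}")
    bound = {}
    for name, size in zip(contract, shape):
        expected = fixed.get(name)
        if expected is not None and size != expected:
            raise ValueError(f"dim {name} must be {expected}, got {size} in shape {shape}")
        bound[name] = size
    return bound


dims = check_shape((8, 128, 64), ("B", "N", "D"), {"D": 64})
print(dims)
```

Raising at the entry point keeps a wrong-rank or wrong-width tensor from being silently broadcast or reshaped further down the stack.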
Write checkpoint.py only when an HF source exists. If the user is starting from scratch (no pre-trained weights), this file may be empty initially. Mark it with TODO: HF weight conversion not yet needed.
Write protocol.py last. Call register_model(ModelSpec(name=..., package=..., impls={...})). The impls dict points to functions that build a runtime; for V1 most users will register a single impl backed by runtimes/local_torch.LocalTorchRuntime.
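To make the registration step concrete, here is a self-contained toy registry. It is a sketch of the pattern only: the real `register_model`, `ModelSpec`, and `create_runtime` signatures in curryTrain may differ, and `FakeLocalRuntime`, `build_local`, and the model name `"tiny"` are hypothetical stand-ins.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Optional


@dataclass(frozen=True)
class ModelSpec:
    name: str
    package: str
    # Maps an impl name to a function that builds a runtime from a config dict.
    impls: Dict[str, Callable[[dict], Any]] = field(default_factory=dict)


_REGISTRY: Dict[str, ModelSpec] = {}


def register_model(spec: ModelSpec) -> None:
    if spec.name in _REGISTRY:
        raise ValueError(f"model {spec.name!r} is already registered")
    _REGISTRY[spec.name] = spec


def create_runtime(name: str, impl: str = "local_torch", config: Optional[dict] = None) -> Any:
    spec = _REGISTRY[name]  # KeyError here means the model was never registered
    return spec.impls[impl](config or {})


class FakeLocalRuntime:
    """Stand-in for a real runtime backend; it only records the config."""

    def __init__(self, config: dict) -> None:
        self.config = config


def build_local(config: dict) -> FakeLocalRuntime:
    return FakeLocalRuntime(config)


# What a protocol.py would do at import time.
register_model(
    ModelSpec(
        name="tiny",
        package="curry_train.models.tiny",
        impls={"local_torch": build_local},
    )
)

runtime = create_runtime("tiny", config={"n_layers": 2})
print(type(runtime).__name__, runtime.config)
```

In a registry of this shape, registration happens as a side effect of importing protocol.py, so a model whose protocol.py is never imported fails at lookup time with `KeyError`.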
Run preflight asserts (stage1-preflight-asserts) and bench (1 step) before declaring Stage 1 complete.
- No training loops in model.py. Loops live in curry_train.loop.
- No import torch.distributed in model.py. Distributed concerns live in primitives like parallel-state and distributed-optimizer.
- No silent shape fixes in model.py. If a tensor enters with the wrong shape, raise immediately.
- No hyperparameters in model.py that aren't sourced from config.py.

Stage 1 is done when:
- bench --steps=1 produces a finite loss and finite grad-norm.
- stage1-preflight-asserts passes all checks.

If any of these fail, do not let the user advance to Stage 2.
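The "finite loss and finite grad-norm" criterion can be sketched as a small gate. This is an assumption-laden illustration, not the actual bench implementation: `global_grad_norm` and `stage1_exit_ok` are hypothetical helpers, and gradients are modeled as plain lists of floats so the example runs without a tensor library.

```python
import math


def global_grad_norm(grads: list) -> float:
    """L2 norm over all gradient values, given one list of floats per parameter."""
    return math.sqrt(sum(g * g for param_grads in grads for g in param_grads))


def stage1_exit_ok(loss: float, grad_norm: float) -> bool:
    """True only when both values are finite (neither NaN nor +/-inf)."""
    return math.isfinite(loss) and math.isfinite(grad_norm)


# One hypothetical bench step: two parameters with small gradients.
grad_norm = global_grad_norm([[3.0], [4.0]])
ok = stage1_exit_ok(loss=2.31, grad_norm=grad_norm)
print(grad_norm, ok)
```

A NaN anywhere in the gradients propagates into the norm, so one finiteness check on the global norm covers every parameter.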
Common mistakes:

- Distributed calls in the model (e.g. all_reduce inside model.py).
- Using nn.LayerNorm directly when an RMSNorm primitive exists (or vice versa). Every choice should go through a primitive.
- Hand-rolling attention instead of using primitive-gqattention.
- A missing registration: create_runtime(<name>) raises KeyError before the user notices.

See also:

- skills/stage1-preflight-asserts: the checks to run after scaffolding.
- skills/stage1-data-pipeline: leakage-safe split + transform.
- agents/scaffolder.md: the agent that actually writes the four files for you.
- template/curry_train/runtime.py: the protocol you must satisfy.