From curry-train
Construct a tiny synthetic task that *requires* the new feature to solve, run the model on it, and use success/failure as a structural signal of whether the feature is doing what it claims. Activate when the user asks "how do I test if my new mechanism actually works", "surrogate task", "synthetic benchmark", "probe task", or claims a new architecture component "helps with X" without quantitative evidence.
`npx claudepluginhub curryfromuestc/curry-train --plugin curry-train`

This skill uses the workspace's default tool permissions.
A targeted synthetic task that the model **cannot** solve without the new feature, but **can** solve with it. Used to validate the *mechanism* of an architectural change, not just the loss number.
"Does my new component actually do what I claim it does, in isolation?"
A surrogate task gives a yes/no with a clear mechanistic interpretation. Loss numbers on real data don't.
In each case, the surrogate is smaller than the real task, more controlled, and diagnostic — failure on it is decisive.
A useful surrogate satisfies all four:

1. The claimed property is the bottleneck — everything else about the task is easy.
2. The baseline (architecture without the feature) fails it.
3. The variant (architecture with the feature) solves it.
4. It is cheap — a full run takes minutes, not hours.
If the surrogate is solved by both arms, it's not diagnostic — design a harder one. If neither arm solves it, the surrogate is wrong (too hard, or the feature implementation is wrong).
| Claim | Surrogate task |
|---|---|
| Long-context attention works | Copying: input is <random tokens><DELIM><same random tokens>; the model must copy. Length L pinned. |
| Positional encoding extrapolates | Train on length L sequences; test on length 4L. Metric: accuracy at extrapolated length. |
| MoE experts specialize | Multi-task synthetic: 4 different deterministic mappings, route by class. Confirm experts cluster. |
| Recurrent state preserves info | Long-range memorize-then-retrieve task with arbitrary delay. |
| New optimizer escapes saddle | Train a known-saddle objective (e.g., 2-layer linear net with synthetic data); compare convergence. |
| Spiking time encoding helps | Time-varying signal classification at different rates; SNN T-dim used to encode rate. |
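As a concrete sketch of the first row, a minimal copy-task generator in pure Python. The `-1` ignore-index for the prefix targets is an assumption (it matches PyTorch's cross-entropy `ignore_index` default, so loss is only taken on the copied half):

```python
import random

def make_copy_example(length, vocab_size=16, delim=0, seed=None):
    """Build one copy-task example: <random tokens> <DELIM> <same tokens>.

    Tokens are drawn from 1..vocab_size so the delimiter (0) is unambiguous.
    Returns (inputs, targets) where targets is -1 (ignored) up to and
    including the delimiter, then the tokens the model must reproduce.
    """
    rng = random.Random(seed)
    src = [rng.randrange(1, vocab_size + 1) for _ in range(length)]
    inputs = src + [delim] + src           # full teacher-forced sequence
    targets = [-1] * (length + 1) + src    # loss only on the copied half
    return inputs, targets
```

Pin `length` per run and sweep it across runs to trace the success curve.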
1. Articulate the claim in one sentence: "Component C makes property P hold". P should be testable with a synthetic task.
2. Design the surrogate so that property P is the bottleneck. Strip everything else (small model, small vocab, deterministic data).
3. Run the baseline (architecture without C). Confirm it fails (or has high loss). This step is non-optional — without it, success on the surrogate doesn't tell you the feature was needed.
4. Run the variant (architecture with C). Confirm it succeeds.
5. Vary the difficulty of the surrogate (e.g. sequence length for retrieval). Plot success rate. The variant's success curve should extend further than the baseline's.
6. Document the surrogate in the run journal alongside the small-scale ablation result. A surrogate-task pass does not replace a small-scale ablation — they are complementary.
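The difficulty sweep in the steps above can be sketched as a small harness. `evaluate` is a hypothetical callable you supply — it trains and tests one arm at one difficulty and returns pass/fail:

```python
def sweep_difficulty(evaluate, lengths, n_trials=20):
    """Run evaluate(length) -> bool n_trials times per difficulty level.

    Returns {length: success_rate} for plotting the success curve.
    """
    return {L: sum(bool(evaluate(L)) for _ in range(n_trials)) / n_trials
            for L in lengths}

def breaking_length(rates, threshold=0.9):
    """First difficulty at which the success rate drops below threshold.

    Returns None if the arm never breaks within the swept range.
    """
    for L in sorted(rates):
        if rates[L] < threshold:
            return L
    return None
```

Run it once per arm; the variant's breaking length should sit well past the baseline's.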
- Press the user to articulate the mechanistic claim in one sentence. If they cannot, the work is at the "I think this might help" stage and the surrogate-task work is premature — go back to small-scale ablation first.
- Help design the surrogate by working backward from the claim. The right surrogate is usually obvious once the claim is sharp.
- Confirm the surrogate is cheap. If it takes more than ~30 minutes to run, redesign.
- Confirm the baseline fails. Without this control, the result is uninterpretable.
- After running, render the result as one line: "Variant solves surrogate at length L=4096 with accuracy 0.94; baseline 0.21. Mechanistic claim supported." Or: "Variant fails surrogate at L=4096 (0.34); claim not supported at this scale."
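A minimal formatter for that one-line verdict — a sketch, with the 0.9 pass threshold as an illustrative assumption, and a third branch for the "both arms pass" case, which signals a non-diagnostic surrogate rather than a supported claim:

```python
def render_verdict(length, variant_acc, baseline_acc, threshold=0.9):
    """One-line summary of a surrogate-task run, in the format above."""
    if variant_acc >= threshold and baseline_acc >= threshold:
        return (f"Both arms solve the surrogate at L={length} "
                f"(variant {variant_acc:.2f}, baseline {baseline_acc:.2f}); "
                "not diagnostic: design a harder surrogate.")
    if variant_acc >= threshold:
        return (f"Variant solves surrogate at length L={length} with accuracy "
                f"{variant_acc:.2f}; baseline {baseline_acc:.2f}. "
                "Mechanistic claim supported.")
    return (f"Variant fails surrogate at L={length} ({variant_acc:.2f}); "
            "claim not supported at this scale.")
```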
A surrogate-task pass does not replace skills/stage3-small-scale-ablation on real data; they test different things.

- skills/stage3-small-scale-ablation — measures aggregate improvement on real data; the surrogate isolates mechanism.
- skills/stage3-multi-seed-variance — apply to the surrogate too; one-seed surrogates are noisy.