`npx claudepluginhub leeroo-ai/superml --plugin superml`

This skill uses the workspace's default tool permissions.
Externalize your experimental reasoning. Every ML project is a sequence of hypotheses tested — this skill makes that sequence visible, persistent, and learnable.
**NO NEW EXPERIMENT WITHOUT LOGGING THE HYPOTHESIS FIRST**
If you're about to change a hyperparameter, swap a dataset, try a new architecture, or modify a training recipe — write down what you expect to happen and why BEFORE running it. This is how you learn from experiments instead of just running them.
Maintain these files in the project root (create if they don't exist):
```
experiments/
├── journal.md — Running experiment log (append-only)
└── lessons.md — Distilled patterns and rules (curated)
```
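The two files can be scaffolded with a short script. This is a minimal sketch; the `ensure_experiment_files` helper is hypothetical, not part of the skill:

```python
from pathlib import Path

# Hypothetical helper: scaffold the experiments/ directory if it is missing.
def ensure_experiment_files(root: str = ".") -> None:
    exp = Path(root) / "experiments"
    exp.mkdir(parents=True, exist_ok=True)
    for name, header in [("journal.md", "# Experiment Journal\n"),
                         ("lessons.md", "## Lessons\n")]:
        f = exp / name
        if not f.exists():  # never clobber an existing log
            f.write_text(header)

ensure_experiment_files()
```

Creating the files only when absent preserves the append-only contract of the journal.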
Before changing anything or running anything new:
Read experiments/journal.md (if it exists) to see what's been tried.

### YYYY-MM-DD HH:MM — [Experiment Name]
**Status**: PLANNED
**Hypothesis**: [What you expect to happen and why]
**Change**: [Exactly what's being modified — one variable at a time]
**Config**:
- key_param_1: old_value → new_value
- key_param_2: value (unchanged)
**Expected outcome**: [Specific metric target or qualitative expectation]
**Baseline**: [Current best metric to beat]
Gate: Entry is written before any code runs. No exceptions.
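Writing the PLANNED entry can be automated. A minimal sketch, assuming a hypothetical `log_hypothesis` helper that fills the template above:

```python
from datetime import datetime
from pathlib import Path

# Hypothetical helper: append a PLANNED entry before any code runs.
def log_hypothesis(name, hypothesis, change, config, expected, baseline,
                   journal="experiments/journal.md"):
    stamp = datetime.now().strftime("%Y-%m-%d %H:%M")
    cfg = "\n".join(f"- {k}: {v}" for k, v in config.items())
    entry = (
        f"\n### {stamp} — {name}\n"
        "**Status**: PLANNED\n"
        f"**Hypothesis**: {hypothesis}\n"
        f"**Change**: {change}\n"
        f"**Config**:\n{cfg}\n"
        f"**Expected outcome**: {expected}\n"
        f"**Baseline**: {baseline}\n"
    )
    path = Path(journal)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a") as f:  # append-only, matching the journal contract
        f.write(entry)

log_hypothesis(
    "qlora-rank-64",
    "Higher rank captures more task-specific features",
    "lora_r: 32 → 64",
    {"lora_r": "32 → 64", "lora_alpha": "64 (unchanged)"},
    "+2% eval accuracy",
    "78% eval accuracy",
)
```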
Once results are in:
**Status**: COMPLETED
**Actual outcome**: [What actually happened — metrics, behavior]
**Delta**: [How this compared to expectation — better/worse/different than expected]
**Duration**: [Wall time, GPU hours]
**Learning**: [One sentence — what this taught you]
**Next**: [What to try based on this result]
Gate: Result is logged before starting the next experiment.
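Closing out an entry is the mirror operation. A sketch with a hypothetical `log_result` helper that finds the most recent PLANNED entry and fills in the result fields:

```python
from pathlib import Path

# Hypothetical helper: complete the most recent PLANNED entry with results.
def log_result(actual, delta, duration, learning, next_step,
               journal="experiments/journal.md"):
    path = Path(journal)
    text = path.read_text()
    marker = "**Status**: PLANNED"
    i = text.rfind(marker)  # most recent open experiment
    if i == -1:
        raise ValueError("no open entry — log the hypothesis first")
    completed = (
        "**Status**: COMPLETED\n"
        f"**Actual outcome**: {actual}\n"
        f"**Delta**: {delta}\n"
        f"**Duration**: {duration}\n"
        f"**Learning**: {learning}\n"
        f"**Next**: {next_step}"
    )
    path.write_text(text[:i] + completed + text[i + len(marker):])
```

Raising when no open entry exists enforces the gate: you cannot log a result for a hypothesis that was never written down.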
Before proposing or starting the next experiment:
- experiments/journal.md — scan recent entries
- experiments/lessons.md — are there rules that apply?

Gate: You can articulate why this experiment is different from previous attempts.
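The review scan can be as simple as listing recent entry headings. A hypothetical `recent_experiments` helper:

```python
import re
from pathlib import Path

# Hypothetical helper: list the last n experiment headings for review.
def recent_experiments(n=5, journal="experiments/journal.md"):
    path = Path(journal)
    if not path.exists():
        return []
    # Entry headings follow the "### <timestamp> — <name>" template.
    heads = re.findall(r"^### (.+)$", path.read_text(), flags=re.M)
    return heads[-n:]
```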
After every 3-5 experiments, or when a pattern emerges, append to experiments/lessons.md:

## Lessons
- [YYYY-MM-DD] [Context]: [Lesson]. Source: [user correction / experiment result / KB finding]
Example: "2024-03-15 QLoRA: alpha/r ratio matters more than absolute rank for 7B models. Source: experiments showed r=32/alpha=64 outperformed r=64/alpha=64"
## Rules (hard-won)
- NEVER [thing that always fails] because [reason]. Learned: [date]
- ALWAYS [thing that always works] when [condition]. Learned: [date]
Gate: Lessons file has been updated before closing out a series of experiments.
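Hard-won rules are only useful if they are re-read. A hypothetical `load_rules` helper that surfaces NEVER/ALWAYS lines before the next run:

```python
from pathlib import Path

# Hypothetical helper: extract hard-won rules from the lessons file.
def load_rules(lessons="experiments/lessons.md"):
    path = Path(lessons)
    if not path.exists():
        return []
    rules = []
    for line in path.read_text().splitlines():
        body = line.lstrip("- ").strip()  # drop the list bullet
        if body.startswith(("NEVER", "ALWAYS")):
            rules.append(body)
    return rules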
| Mistake | Why it happens | What to do instead |
|---|---|---|
| Running without logging | "I'll just try this quick thing" | Even quick experiments get logged — they compound into knowledge |
| Changing multiple variables | "Let me also bump the LR while I'm at it" | One variable per experiment. Otherwise you can't attribute the result. |
| Not recording the baseline | "I'll remember what the old score was" | Write the baseline metric in the entry. Memory is unreliable. |
| Skipping the review step | "I know what I tried before" | Read the journal. You'll find experiments you forgot about. |
| Never distilling lessons | "The journal has everything" | A 200-entry journal is noise. Lessons are signal. Distill regularly. |
Starting a fine-tuning experiment:
User: "Let's try QLoRA with rank 64 instead of 32"
Agent: [Reads experiments/journal.md]
Agent: [Writes new entry with hypothesis: "Higher rank captures more task-specific features, expecting +2% accuracy"]
Agent: [Proceeds with implementation]
After getting results:
User: "Training finished, eval accuracy went from 78% to 81%"
Agent: [Updates journal entry with actual outcome, delta (+3% vs expected +2%), learning]
Agent: [Suggests next experiment based on result]
Before next iteration:
User: "What should we try next?"
Agent: [Reads journal — sees rank 64 worked, rank 16 didn't, data augmentation untested]
Agent: [Invokes ml-iterate with full history context]