Manages ML experiment lifecycle via YAML registry: register experiments, record benchmarks, compare runs, track status. For Python ML research metadata without databases or job launching.
```bash
npx claudepluginhub jiahao-shao1/sjh-skills --plugin sjh-skills
```
Structured YAML experiment registry for ML research. YAML was chosen over databases or JSON because experiment files should be human-readable, git-diffable, and hand-editable — researchers often need to inspect or tweak entries directly.
The `exp` CLI must be installed:

```bash
pip install exp-registry
```

If `exp` is not found, install it first. If the project has no `exp.config.yaml`, run `exp init` before any other command.
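A minimal shell sketch of that check (the `command -v` guard is illustrative; `pip install exp-registry` and `exp init` are the documented steps):

```bash
# Install the CLI if it's missing, then initialize the project if needed.
command -v exp >/dev/null 2>&1 || pip install exp-registry
[ -f exp.config.yaml ] || exp init
```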
This tool manages the metadata layer of experiments — what was run, with what config, and what results came out. It does NOT manage training code, checkpoints, or logs.
The mental model: each experiment gets a YAML file that serves as its "identity card." Over time, benchmark results accumulate in that file as the experiment progresses through training steps. When it's time to report or decide next steps, you compare across experiments.
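As a sketch, such an identity card might look like this (a hypothetical `experiments/exp01.yaml`; the values are illustrative, and the layout follows the registry format described below):

```yaml
id: exp01
name: tool-use RL baseline   # illustrative name
type: rl
series: exp01                # auto-inferred from the ID prefix
date: 2025-01-15             # illustrative; defaults to today
status: running
benchmarks:
  - dataset: zebra-cot
    eval_mode: agent
    samples: 50
    steps:
      50: { text_only: 0.42, with_tools: 0.52 }
```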
When the user mentions experiments, follow this flow:
1. Check environment first. Does `exp.config.yaml` exist? If not, ask if they want to initialize (`exp init`). Don't silently init — the user should confirm the project root.
2. Understand intent. Map the user's request to the right action (see the sketch after this list):
   - Register a new experiment → `exp register` (but first `exp list` to check for ID conflicts)
   - See what exists → `exp list` (suggest filters if there are many)
   - Record results → `exp add-benchmark` (ask for dataset and step if not provided)
   - Compare runs → `exp compare` (auto-discover shared datasets from `exp show`)
   - Summarize an experiment → `exp show` first, then summarize findings
3. Be proactive with context. After any operation, offer useful next steps.
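For example, a request like "log step-100 results for exp03 on mmlu" maps to a single `add-benchmark` call (the experiment ID and metric value here are hypothetical; the flags are the documented ones):

```bash
exp add-benchmark exp03 --dataset mmlu --eval-mode cot --samples 100 --step 100 --extra acc=0.71
```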
When the user asks "compare experiments" or "which is better," don't just dump the table. First run exp show on each experiment to discover which datasets and eval_modes they share, then run exp compare on those shared dimensions. If experiments have no common benchmarks, tell the user — don't produce an empty comparison silently.
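A sketch of that discovery step, assuming the `--json` output of `exp show` exposes a top-level `benchmarks` list (the exact JSON shape is an assumption; the commands themselves are documented below):

```bash
# List each experiment's benchmarked datasets and look for overlap.
exp show exp07a --json | jq -r '.benchmarks[].dataset'
exp show exp07b --json | jq -r '.benchmarks[].dataset'
# If both report zebra-cot, compare on that shared dataset:
exp compare exp07a exp07b --dataset zebra-cot
```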
Experiment IDs like exp07a, exp07b, exp07c automatically group into series exp07. This lets you filter by series to see all variants of one experimental idea. When a user discusses "the exp07 experiments," use exp list --series exp07 to get the full picture.
| Task | Command |
|---|---|
| Initialize | exp init |
| List all | exp list |
| Filter by status | exp list --status completed |
| Filter by type | exp list --type rl |
| Filter by series | exp list --series exp07 |
| Show details | exp show <id> |
| JSON output | exp show <id> --json |
| Register new | exp register <id> --type rl --model Qwen3-VL-8B |
| Add benchmark | exp add-benchmark <id> --dataset mmlu --eval-mode cot --samples 100 --step 50 --extra acc=0.72 |
| Compare | exp compare <id1> <id2> --dataset mmlu |
| Update status | exp update <id> --status completed |
| Add finding | exp update <id> --finding "key insight" |
All list/show/compare commands support --json for machine-readable output.
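For example, to pull completed experiment IDs into a script (the JSON field names are an assumption based on the registry fields described below):

```bash
exp list --status completed --json | jq -r '.[].id'
```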
register → (train) → add-benchmark at step N → add-benchmark at step M → update status + finding
Example:

```bash
exp register exp01 --type rl --model Qwen3-VL-8B --reward reward_tool_strict
exp add-benchmark exp01 --dataset zebra-cot --eval-mode agent --samples 50 --step 50 --extra text_only=0.42 with_tools=0.52
exp add-benchmark exp01 --dataset zebra-cot --eval-mode agent --samples 50 --step 90 --extra text_only=0.44 with_tools=0.55
exp update exp01 --status completed --finding "tool use improves reasoning"
```

Benchmarks are organized by step number within each dataset — this tracks how performance evolves during training, which is critical for deciding when to stop or which checkpoint to use.
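After those two `add-benchmark` calls, the experiment's benchmarks block would plausibly read (following the schema in the registry format section below):

```yaml
benchmarks:
  - dataset: zebra-cot
    eval_mode: agent
    samples: 50
    steps:
      50: { text_only: 0.42, with_tools: 0.52 }
      90: { text_only: 0.44, with_tools: 0.55 }
```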
```bash
exp compare exp07a exp07b exp07c --dataset zebra-cot
```
Outputs a Markdown table — paste directly into docs or slides.
`exp.config.yaml` at the project root:

```yaml
registry_dir: experiments/    # where YAML files live
paths_template:
  local: outputs/{id}/        # {id} is replaced with experiment ID
defaults:
  type: rl                    # default for `exp register`
types:
  rl:
    fields: [model, config, script, reward]
  sft:
    fields: [model, config, script, base_model]
```
Config is discovered by walking up from CWD. No config = sensible defaults (registry_dir: experiments/).
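So commands work from anywhere inside the project, e.g. (directory name illustrative):

```bash
cd experiments/subdir && exp list   # still finds the project-root exp.config.yaml
```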
Each experiment is one file in `<registry_dir>/<exp_id>.yaml`:

Required fields: `id`, `name`, `type`, `series`, `date`, `status`

Auto-generated: `series` (inferred from ID prefix), `paths` (from template), `date` (today)
Structured benchmarks:
```yaml
benchmarks:
  - dataset: string
    eval_mode: string
    samples: int
    steps:
      <step_number>: { <metric>: <value>, ... }
```
| Error | Cause | What to Do |
|---|---|---|
| "experiment already exists" | Duplicate ID | exp show <id> to check, use a different ID |
| "experiment not found" | Wrong ID | exp list to see available IDs |
| "No benchmark data found" | Wrong dataset in compare | exp show <id> to check available datasets |
| "Missing required fields" | Corrupted YAML | Inspect and fix the YAML file directly — it's designed to be human-editable |