Manages ML experiment lifecycle via YAML registry: register experiments, record benchmarks, compare runs, track status. For Python ML research metadata without databases or job launching.
```bash
npx claudepluginhub jiahao-shao1/sjh-skills --plugin sjh-skills
```
Structured YAML experiment registry for ML research. YAML was chosen over databases or JSON because experiment files should be human-readable, git-diffable, and hand-editable — researchers often need to inspect or tweak entries directly.
The `exp` CLI must be installed:

```bash
pip install exp-registry
```

If `exp` is not found, install it first. If the project has no `exp.config.yaml`, run `exp init` before any other command.
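A minimal shell sketch of that check (the `command -v` guard is illustrative; `pip install exp-registry` and `exp init` are the documented steps):

```bash
# Install the CLI if it's missing, then initialize the project if needed.
command -v exp >/dev/null 2>&1 || pip install exp-registry
[ -f exp.config.yaml ] || exp init
```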
This tool manages the metadata layer of experiments — what was run, with what config, and what results came out. It does NOT manage training code, checkpoints, or logs.
The mental model: each experiment gets a YAML file that serves as its "identity card." Over time, benchmark results accumulate in that file as the experiment progresses through training steps. When it's time to report or decide next steps, you compare across experiments.
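As a sketch, such an identity card might look like this (a hypothetical `experiments/exp01.yaml`; the values are illustrative, and the layout follows the registry format described below):

```yaml
id: exp01
name: tool-use RL baseline   # illustrative name
type: rl
series: exp01                # auto-inferred from the ID prefix
date: 2025-01-15             # illustrative; defaults to today
status: running
benchmarks:
  - dataset: zebra-cot
    eval_mode: agent
    samples: 50
    steps:
      50: { text_only: 0.42, with_tools: 0.52 }
```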
When the user mentions experiments, follow this flow:
1. Check environment first. Does `exp.config.yaml` exist? If not, ask if they want to initialize (`exp init`). Don't silently init — the user should confirm the project root.
2. Understand intent. Map the user's request to the right action (see the sketch after this list):
   - Register a new experiment → `exp register` (but first `exp list` to check for ID conflicts)
   - See what exists → `exp list` (suggest filters if there are many)
   - Record results → `exp add-benchmark` (ask for dataset and step if not provided)
   - Compare runs → `exp compare` (auto-discover shared datasets from `exp show`)
   - Summarize an experiment → `exp show` first, then summarize findings
3. Be proactive with context. After any operation, offer useful next steps.
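For example, a request like "log step-100 results for exp03 on mmlu" maps to a single `add-benchmark` call (the experiment ID and metric value here are hypothetical; the flags are the documented ones):

```bash
exp add-benchmark exp03 --dataset mmlu --eval-mode cot --samples 100 --step 100 --extra acc=0.71
```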
When the user asks "compare experiments" or "which is better," don't just dump the table. First run exp show on each experiment to discover which datasets and eval_modes they share, then run exp compare on those shared dimensions. If experiments have no common benchmarks, tell the user — don't produce an empty comparison silently.
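A sketch of that discovery step, assuming the `--json` output of `exp show` exposes a top-level `benchmarks` list (the exact JSON shape is an assumption; the commands themselves are documented below):

```bash
# List each experiment's benchmarked datasets and look for overlap.
exp show exp07a --json | jq -r '.benchmarks[].dataset'
exp show exp07b --json | jq -r '.benchmarks[].dataset'
# If both report zebra-cot, compare on that shared dataset:
exp compare exp07a exp07b --dataset zebra-cot
```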
Experiment IDs like exp07a, exp07b, exp07c automatically group into series exp07. This lets you filter by series to see all variants of one experimental idea. When a user discusses "the exp07 experiments," use exp list --series exp07 to get the full picture.
| Task | Command |
|---|---|
| Initialize | exp init |
| List all | exp list |
| Filter by status | exp list --status completed |
| Filter by type | exp list --type rl |
| Filter by series | exp list --series exp07 |
| Show details | exp show <id> |
| JSON output | exp show <id> --json |
| Register new | exp register <id> --type rl --model Qwen3-VL-8B |
| Add benchmark | exp add-benchmark <id> --dataset mmlu --eval-mode cot --samples 100 --step 50 --extra acc=0.72 |
| Compare | exp compare <id1> <id2> --dataset mmlu |
| Update status | exp update <id> --status completed |
| Add finding | exp update <id> --finding "key insight" |
All list/show/compare commands support --json for machine-readable output.
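For example, to pull completed experiment IDs into a script (the JSON field names are an assumption based on the registry fields described below):

```bash
exp list --status completed --json | jq -r '.[].id'
```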
register → (train) → add-benchmark at step N → add-benchmark at step M → update status + finding
Example:

```bash
exp register exp01 --type rl --model Qwen3-VL-8B --reward reward_tool_strict
exp add-benchmark exp01 --dataset zebra-cot --eval-mode agent --samples 50 --step 50 --extra text_only=0.42 with_tools=0.52
exp add-benchmark exp01 --dataset zebra-cot --eval-mode agent --samples 50 --step 90 --extra text_only=0.44 with_tools=0.55
exp update exp01 --status completed --finding "tool use improves reasoning"
```

Benchmarks are organized by step number within each dataset — this tracks how performance evolves during training, which is critical for deciding when to stop or which checkpoint to use.
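After those two `add-benchmark` calls, the experiment's benchmarks block would plausibly read (following the schema in the registry format section below):

```yaml
benchmarks:
  - dataset: zebra-cot
    eval_mode: agent
    samples: 50
    steps:
      50: { text_only: 0.42, with_tools: 0.52 }
      90: { text_only: 0.44, with_tools: 0.55 }
```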
```bash
exp compare exp07a exp07b exp07c --dataset zebra-cot
```
Outputs a Markdown table — paste directly into docs or slides.
`exp.config.yaml` at the project root:

```yaml
registry_dir: experiments/    # where YAML files live
paths_template:
  local: outputs/{id}/        # {id} is replaced with experiment ID
defaults:
  type: rl                    # default for `exp register`
types:
  rl:
    fields: [model, config, script, reward]
  sft:
    fields: [model, config, script, base_model]
```
Config is discovered by walking up from CWD. No config = sensible defaults (registry_dir: experiments/).
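So commands work from anywhere inside the project, e.g. (directory name illustrative):

```bash
cd experiments/subdir && exp list   # still finds the project-root exp.config.yaml
```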
Each experiment is one file in `<registry_dir>/<exp_id>.yaml`:

Required fields: `id`, `name`, `type`, `series`, `date`, `status`

Auto-generated: `series` (inferred from ID prefix), `paths` (from template), `date` (today)
Structured benchmarks:
```yaml
benchmarks:
  - dataset: string
    eval_mode: string
    samples: int
    steps:
      <step_number>: { <metric>: <value>, ... }
```
| Error | Cause | What to Do |
|---|---|---|
| "experiment already exists" | Duplicate ID | exp show <id> to check, use a different ID |
| "experiment not found" | Wrong ID | exp list to see available IDs |
| "No benchmark data found" | Wrong dataset in compare | exp show <id> to check available datasets |
| "Missing required fields" | Corrupted YAML | Inspect and fix the YAML file directly — it's designed to be human-editable |