Standardizes training experiment tracking with per-experiment notebooks and a project-level index. Use when config changes between runs to keep results comparable.
npx claudepluginhub lightning-rod-labs/lightningrod-python-sdk --plugin lightningrod-python-sdk
Slash command: /lightningrod-python-sdk:experiment-tracking
Every meaningful training run is its own experiment: a self-contained notebook plus one row in a project-level index. This makes runs comparable, makes regressions visible, and lets the user (or you, in a later session) understand at a glance what has been tried and what worked.
./userland/<project>/
├── experiments/
│ ├── exp_001_baseline.ipynb
│ ├── exp_002_more_steps.ipynb
│ ├── exp_003_conf_threshold.ipynb
│ └── ...
└── experiments.md # the index (single source of truth for results)
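As a minimal sketch of the numbering scheme in the layout above, the next notebook filename can be derived from the names already present in `experiments/`. The helper name `next_experiment_name` and the filename regex are illustrative assumptions, not part of the Lightning Rod SDK.

```python
import re
from typing import Iterable

# Assumed filename shape, taken from the layout: exp_NNN_<slug>.ipynb
EXP_NAME = re.compile(r"^exp_(\d{3})_\w+\.ipynb$")


def next_experiment_name(existing: Iterable[str], slug: str) -> str:
    """Return the next exp_NNN_<slug>.ipynb name given the existing filenames."""
    ids = [int(m.group(1)) for name in existing if (m := EXP_NAME.match(name))]
    next_id = max(ids, default=0) + 1  # IDs only grow; gaps are never reused
    return f"exp_{next_id:03d}_{slug}.ipynb"


names = ["exp_001_baseline.ipynb", "exp_002_more_steps.ipynb", "README.md"]
print(next_experiment_name(names, "conf_threshold"))  # exp_003_conf_threshold.ipynb
```

Taking `max + 1` rather than counting files keeps IDs stable even if a notebook is missing from the directory.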
Name each notebook after the change it tests: `exp_004_lora_rank_64`, not `exp_004_run`. The first experiment is `exp_001_baseline`: it captures the out-of-the-box config from the relevant example skill, with no modifications.
Create a new `exp_NNN_<slug>.ipynb` whenever the next training run's tracked config differs from the last experiment's config. Tracked config knobs:
- `base_model_id`
- `training_steps`, `lora_rank`, `batch_size`, `num_rollouts`, `max_response_length`, `learning_rate`
- `epochs` (SFT)
- `max_questions` if the training dataset was regenerated
Skip creating a new experiment if you are re-running the exact same config (e.g. recovering from a transient failure). Append a note to the existing experiment's notebook instead.
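The "did the tracked config change?" check can be sketched as a dictionary comparison. This assumes each config has been flattened to a plain dict first; `tracked_diff` and `needs_new_experiment` are hypothetical helper names, and the example values are made up for illustration. For simplicity the sketch treats any change in `max_questions` as a tracked change.

```python
# Knob names are the tracked config knobs listed above.
TRACKED_KNOBS = (
    "base_model_id",
    "training_steps",
    "lora_rank",
    "batch_size",
    "num_rollouts",
    "max_response_length",
    "learning_rate",
    "epochs",         # SFT only
    "max_questions",  # relevant when the training dataset was regenerated
)


def tracked_diff(previous: dict, current: dict) -> dict:
    """Return {knob: (old, new)} for every tracked knob whose value differs."""
    return {
        knob: (previous.get(knob), current.get(knob))
        for knob in TRACKED_KNOBS
        if previous.get(knob) != current.get(knob)
    }


def needs_new_experiment(previous: dict, current: dict) -> bool:
    """True when at least one tracked knob changed since the last experiment."""
    return bool(tracked_diff(previous, current))


# Hypothetical configs, flattened to dicts for the comparison.
last = {"base_model_id": "openai/gpt-oss-120b", "lora_rank": 32, "training_steps": 100}
rerun = dict(last)                   # same config, e.g. retry after a transient failure
changed = {**last, "lora_rank": 64}  # one tracked knob differs

print(needs_new_experiment(last, rerun))  # False
print(tracked_diff(last, changed))        # {'lora_rank': (32, 64)}
```

The diff also gives the text for the header's "Change vs parent" field, since it names each changed knob with its old and new value.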
Initial small-scale verification runs (max_questions=50 test) are not experiments — they belong in the main pipeline notebook. The first experiment starts when you have a dataset you intend to train on.
Each exp_NNN_<slug>.ipynb is a self-contained record. It must run end-to-end from a clean kernel using artifacts in the project (dataset ID, training config). Required cells, in order:
Header (markdown): fixed format. Fill every field:
# exp_003 — Raise labeler confidence threshold
- **Date started:** 2026-05-14
- **Hypothesis:** Raising WebSearchLabeler confidence from 0.6 → 0.8 removes the noisiest labels and improves Brier vs frontier.
- **Parent experiment:** exp_002
- **Change vs parent:** `labeler.confidence_threshold` 0.6 → 0.8. All other config unchanged.
- **Dataset:** `<dataset_id>` (n_train=..., n_test=...)
- **Frontier baseline:** `openai/gpt-5.5`
- **Status:** running
Update **Status** to `done` or `failed` once eval completes, and add a **Result** line to the header (see the Result cell).
Config cell: the full `GRPOTrainingConfig` / `SFTTrainingConfig` literal. Inline, not imported. Future-you reading this notebook should see every knob without cross-referencing another file.
Cost estimate cell: `lr.training.estimate_cost(config, dataset=train_dataset)`. Print the result.
Train cell: `lr.training.run(config, dataset=train_dataset, name=f"<project>-exp_003")`. The job name must include the experiment ID so it is identifiable in the dashboard.
Eval cell: `lr.evals.run_from_training_job(config, job, test_dataset, extra_models=[EvalModel(model_id="openai/gpt-5.5", label="GPT-5.5")])`. The frontier model is always included (see lightningrod-assistant "Frontier benchmark"). Print the eval summary.
Result cell (markdown): fill in once eval returns. Always report both Brier and ECE, and both deltas (vs frontier and vs base model). Lower is better for both metrics, so each Δ is computed as comparison − fine-tuned and a positive Δ means the fine-tuned model beat the comparison model.
## Result
| | Brier | ECE |
|---------------------|---------|---------|
| Fine-tuned | 0.1821 | 0.0612 |
| Base (gpt-oss-120b) | 0.1980 | 0.0834 |
| Frontier (GPT-5.5) | 0.2003 | 0.0701 |
| **Δ vs base** | +0.0159 | +0.0222 |
| **Δ vs frontier** | +0.0182 | +0.0089 |
- **Verdict:** Beat both base and frontier on Brier and ECE; threshold change worked as hypothesised.
- **Next:** Try 0.8 → 0.9 in exp_004, or revert and explore a different axis.
Index update cell (last cell): appends or updates the row in `../experiments.md` (see below). Keep this as the final cell so it only runs once the result is real.
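The sign convention used in the Result cell can be checked with a few lines of arithmetic. The sketch below reuses the numbers from the Result template; `signed_delta`, `result_deltas`, and `fmt` are hypothetical helper names, and the metric dicts are an assumed shape, not what the eval call returns.

```python
METRICS = ("brier", "ece")


def signed_delta(comparison: float, fine_tuned: float) -> float:
    """comparison - fine_tuned: positive means the fine-tuned score is lower (better)."""
    return round(comparison - fine_tuned, 4)


def result_deltas(fine_tuned: dict, base: dict, frontier: dict) -> dict:
    """Deltas vs base and vs frontier for every reported metric."""
    return {
        "vs_base": {m: signed_delta(base[m], fine_tuned[m]) for m in METRICS},
        "vs_frontier": {m: signed_delta(frontier[m], fine_tuned[m]) for m in METRICS},
    }


def fmt(delta: float) -> str:
    """Always show the sign so wins and losses are visible at a glance."""
    return f"{delta:+.4f}"


# Values copied from the Result template above.
fine_tuned = {"brier": 0.1821, "ece": 0.0612}
base = {"brier": 0.1980, "ece": 0.0834}
frontier = {"brier": 0.2003, "ece": 0.0701}

deltas = result_deltas(fine_tuned, base, frontier)
print(fmt(deltas["vs_base"]["brier"]), fmt(deltas["vs_base"]["ece"]))          # +0.0159 +0.0222
print(fmt(deltas["vs_frontier"]["brier"]), fmt(deltas["vs_frontier"]["ece"]))  # +0.0182 +0.0089
```

Computing every delta through one function keeps the sign convention from drifting between notebooks.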
One markdown table at ./userland/<project>/experiments.md. Newest row on top. Single source of truth — read it before designing the next experiment.
# Experiments — <project name>
Base model: `openai/gpt-oss-120b` · Frontier benchmark: `openai/gpt-5.5` · Metrics: Brier, ECE (lower is better; Δ shown as comparison − fine-tuned, so positive = we win)
| ID | Date | Hypothesis | Δ Brier (base / frontier) | Δ ECE (base / frontier) | Status | Notebook |
|-----|------------|------------------------------|---------------------------|-------------------------|--------|-----------------------------------------------------|
| 003 | 2026-05-14 | Raise conf threshold 0.6→0.8 | +0.016 / +0.018 | +0.022 / +0.009 | done | [exp_003](experiments/exp_003_conf_threshold.ipynb) |
| 002 | 2026-05-12 | 2x training steps | -0.001 / -0.004 | +0.003 / -0.002 | done | [exp_002](experiments/exp_002_more_steps.ipynb) |
| 001 | 2026-05-10 | Baseline (golf defaults) | +0.010 / +0.012 | +0.015 / +0.005 | done | [exp_001](experiments/exp_001_baseline.ipynb) |
Column rules:
- ID: `NNN`, zero-padded. Matches the notebook filename.
- Date: `YYYY-MM-DD` of the day the experiment started.
- Δ columns: two values separated by `/`: fine-tuned vs base model, then fine-tuned vs frontier. Sign convention: positive = fine-tuned wins (since lower Brier is better, this is `base − fine_tuned` / `frontier − fine_tuned`). Use `—` for any delta column while the experiment is still running.
- Status: `running`, `done`, or `failed`. Update at completion.
- Notebook: link to the experiment's `.ipynb`.
When the assistant starts a new experiment, it writes a row with Status `running` and `—` in both Δ columns, then updates that row in place once eval completes. Never reorder rows except to keep newest on top when inserting.
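The "write a `running` row, then update it in place" rule can be sketched as an upsert on the Markdown text. This is a minimal sketch that assumes the file already contains the table header and separator line and that no cell contains a literal `|`; `upsert_row` is a hypothetical helper, not an SDK function.

```python
from typing import List


def _cells(line: str) -> List[str]:
    """Split one Markdown table line into stripped cell values."""
    return [c.strip() for c in line.strip().strip("|").split("|")]


def _is_separator(line: str) -> bool:
    """True for the |-----|-----| line that follows the table header."""
    s = line.strip()
    return s.startswith("|") and "-" in s and set(s) <= set("|-: ")


def upsert_row(index_md: str, cells: List[str]) -> str:
    """Update the row whose ID equals cells[0], or insert it as the top row."""
    new_row = "| " + " | ".join(cells) + " |"
    lines = index_md.splitlines()
    sep = next(i for i, line in enumerate(lines) if _is_separator(line))
    i = sep + 1
    while i < len(lines) and lines[i].lstrip().startswith("|"):
        if _cells(lines[i])[0] == cells[0]:
            lines[i] = new_row          # existing experiment: update in place
            return "\n".join(lines) + "\n"
        i += 1
    lines.insert(sep + 1, new_row)      # new experiment: newest row on top
    return "\n".join(lines) + "\n"


index = (
    "| ID | Date | Hypothesis | Δ Brier (base / frontier) | Δ ECE (base / frontier) | Status | Notebook |\n"
    "|-----|------|------------|------|------|--------|----------|\n"
    "| 001 | 2026-05-10 | Baseline | +0.010 / +0.012 | +0.015 / +0.005 | done | [exp_001](experiments/exp_001_baseline.ipynb) |\n"
)
link = "[exp_002](experiments/exp_002_more_steps.ipynb)"
index = upsert_row(index, ["002", "2026-05-12", "2x training steps", "—", "—", "running", link])
index = upsert_row(index, ["002", "2026-05-12", "2x training steps", "-0.001 / -0.004", "+0.003 / -0.002", "done", link])
print(index)
```

Keying on the ID column means the second call rewrites the `running` row rather than adding a duplicate, and the existing row order is left untouched.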
- Read `./userland/<project>/experiments.md`. Find the previous experiment's tracked config (from its notebook's config cell).
- If the tracked config changed, create `exp_NNN_<slug>.ipynb` from the template above. If nothing changed, do not create a new experiment.
- Add the row to `experiments.md` with `running` and `—` before kicking off `lr.training.run`.
- Once eval completes, fill in the Result cell and update the row in `experiments.md`.
- When asked what has been tried, read `experiments.md` first, not the notebooks.
- If a run fails, set `Status: failed` with a short note in the Result cell. Do not renumber.
- Compute every delta as `comparison − fine_tuned`, so positive always means the fine-tuned model wins. Keep this consistent across every row and project.
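Reading the index before designing the next experiment can be sketched as a small table parser. It assumes the index holds a single Markdown table laid out as in the template above; `read_index` is a hypothetical helper name.

```python
from typing import Dict, List


def _cells(line: str) -> List[str]:
    """Split one Markdown table line into stripped cell values."""
    return [c.strip() for c in line.strip().strip("|").split("|")]


def read_index(index_md: str) -> List[Dict[str, str]]:
    """Parse the experiments table into dicts keyed by column name, top row first."""
    table = [ln.strip() for ln in index_md.splitlines() if ln.strip().startswith("|")]
    if len(table) < 3:  # header + separator + at least one row
        return []
    header = _cells(table[0])
    return [dict(zip(header, _cells(row))) for row in table[2:]]


# Rows abbreviated from the index template above.
index_md = """\
# Experiments
| ID | Date | Hypothesis | Status |
|-----|------------|------------------------------|--------|
| 003 | 2026-05-14 | Raise conf threshold 0.6→0.8 | done |
| 002 | 2026-05-12 | 2x training steps | done |
| 001 | 2026-05-10 | Baseline (golf defaults) | done |
"""

rows = read_index(index_md)
latest = rows[0]  # newest row is on top
print(latest["ID"], latest["Hypothesis"])
print([r["ID"] for r in rows if r["Status"] != "done"])  # experiments still open
```

Because the newest row is always on top, `rows[0]` is the parent candidate for the next experiment.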