ml-research
Systematic experiment tracking, comparison, and analysis for machine learning research.
Directory Structure:
logs/
├── 2026-02-22/
│   ├── 14-30-22/                 # Timestamp of run
│   │   ├── .hydra/
│   │   │   ├── config.yaml       # Full resolved config
│   │   │   ├── overrides.yaml    # CLI overrides
│   │   │   └── hydra.yaml
│   │   ├── checkpoints/
│   │   ├── metrics.csv
│   │   └── train.log
│   └── 15-45-10/
└── experiment_registry.json      # Central registry
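Given this layout, the latest run can be located programmatically. A minimal sketch (the helper name and the logs/ root are illustrative, not part of the skill's scripts):

from pathlib import Path

def latest_run_dir(root: str = "logs") -> Path:
    """Return the newest logs/<date>/<time> run directory."""
    # Dates (YYYY-MM-DD) and times (HH-MM-SS) sort lexicographically,
    # so the maximum (date, time) pair is the most recent run.
    runs = [p for p in Path(root).glob("*/*") if p.is_dir()]
    if not runs:
        raise FileNotFoundError(f"no runs found under {root}/")
    return max(runs, key=lambda p: (p.parent.name, p.name))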
Interactive Setup - Ask User for the experiment name, a short description, and tags.
Generate: configs/experiment/<name>.yaml
# @package _global_

# Metadata
name: "vit_imagenet_finetuning"
description: "Fine-tune Vision Transformer on ImageNet subset"
tags: ["vision-transformer", "transfer-learning", "imagenet"]

# Compose from existing configs
defaults:
  - override /model: vit_base
  - override /data: imagenet
  - override /trainer: gpu_multi
  - override /logger: wandb

# Seed
seed: 42

# Model overrides
model:
  pretrained: true
  freeze_backbone: false
  num_classes: 1000
  optimizer:
    lr: 0.001

# Data overrides
data:
  batch_size: 256
  num_workers: 8
  image_size: 224

# Trainer overrides
trainer:
  max_epochs: 50
  precision: "16-mixed"
  devices: 4
  strategy: "ddp"

# Callbacks
callbacks:
  model_checkpoint:
    monitor: "val/acc"
    mode: "max"
    save_top_k: 3
  early_stopping:
    monitor: "val/loss"
    patience: 10
    mode: "min"

# Logger
logger:
  wandb:
    project: "imagenet-classification"
    tags: ${tags}
    notes: ${description}
Run experiment:
python src/train.py experiment=vit_imagenet_finetuning
See templates/experiment-templates.yaml for common experiment types.
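The command above assumes a Hydra entrypoint in src/train.py. A minimal sketch of such an entrypoint (the config_path and config_name here are assumptions based on the layout shown above):

import hydra
from omegaconf import DictConfig

@hydra.main(config_path="../configs", config_name="train", version_base="1.3")
def main(cfg: DictConfig) -> None:
    # experiment=<name> merges configs/experiment/<name>.yaml over the defaults;
    # cfg is the fully resolved config that Hydra also writes to .hydra/config.yaml.
    print(cfg.name, cfg.seed)

if __name__ == "__main__":
    main()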
# In LightningModule
def on_train_end(self):
    # Log experiment to registry
    from scripts.experiment_registry import log_experiment

    log_experiment(
        name=self.hparams.experiment_name,
        config_path=self.hparams.config_path,
        metrics={
            # model_checkpoint monitors "val/acc" (mode: max), so its
            # best_model_score is the best validation accuracy
            "best_val_acc": self.trainer.checkpoint_callback.best_model_score.item(),
            "best_val_loss": self.trainer.callback_metrics["val/loss"].item(),
            "epochs_trained": self.trainer.current_epoch,
        },
        hyperparameters={
            "lr": self.hparams.optimizer.lr,
            "batch_size": self.hparams.data.batch_size,
            "optimizer": self.hparams.optimizer._target_,
        },
        tags=self.hparams.tags,
    )
logs/experiment_registry.json:
{
  "experiments": [
    {
      "id": "exp_001",
      "name": "baseline_resnet50",
      "timestamp": "2026-02-22T14:30:22",
      "config": "configs/experiment/baseline.yaml",
      "status": "completed",
      "metrics": {
        "best_val_acc": 0.876,
        "best_val_loss": 0.324,
        "final_train_loss": 0.145,
        "epochs_trained": 45
      },
      "hyperparameters": {
        "lr": 0.001,
        "batch_size": 128,
        "optimizer": "AdamW"
      },
      "runtime": "2h 34m",
      "gpu_count": 2,
      "tags": ["baseline", "resnet"]
    }
  ]
}
See scripts/experiment_registry.py for implementation.
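For orientation, a minimal sketch of what the log_experiment helper might look like; the shipped scripts/experiment_registry.py may differ:

import json
from datetime import datetime
from pathlib import Path

REGISTRY = Path("logs/experiment_registry.json")

def log_experiment(name, config_path, metrics, hyperparameters, tags):
    """Append one experiment record to the central JSON registry."""
    registry = json.loads(REGISTRY.read_text()) if REGISTRY.exists() else {"experiments": []}
    registry["experiments"].append({
        "id": f"exp_{len(registry['experiments']) + 1:03d}",
        "name": name,
        "timestamp": datetime.now().isoformat(timespec="seconds"),
        "config": config_path,
        "status": "completed",
        "metrics": metrics,
        "hyperparameters": hyperparameters,
        "tags": tags,
    })
    REGISTRY.write_text(json.dumps(registry, indent=2))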
Compare specific experiments:
python scripts/compare_experiments.py exp_001 exp_002 exp_003
Output:
ID       Name               Val Acc  Val Loss  LR     Batch  Runtime
exp_001  baseline_resnet50  0.876    0.324     0.001  128    2h 34m
exp_002  resnet50_tuned     0.892    0.298     0.005  256    3h 12m
exp_003  resnet50_dropout   0.884    0.312     0.001  128    2h 45m
Comparison plot:
# Generates logs/experiment_comparison.png
# - Bar charts for accuracy and loss
# - Side-by-side comparison
See scripts/compare_experiments.py for full implementation.
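The core of such a script could look like this rough sketch, assuming the registry format shown above (field names follow the registry example; the real script may differ):

import json
import sys
import matplotlib.pyplot as plt

def compare(ids):
    with open("logs/experiment_registry.json") as f:
        registry = json.load(f)
    exps = [e for e in registry["experiments"] if e["id"] in ids]
    # Text table: one row per experiment
    for e in exps:
        m, h = e["metrics"], e["hyperparameters"]
        print(f"{e['id']}  {e['name']:<20}  {m['best_val_acc']:.3f}  "
              f"{m['best_val_loss']:.3f}  {h['lr']}  {h['batch_size']}")
    # Side-by-side bar charts for accuracy and loss
    names = [e["name"] for e in exps]
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ax1.bar(names, [e["metrics"]["best_val_acc"] for e in exps])
    ax1.set_title("Best Val Acc")
    ax2.bar(names, [e["metrics"]["best_val_loss"] for e in exps])
    ax2.set_title("Best Val Loss")
    fig.savefig("logs/experiment_comparison.png", bbox_inches="tight")

if __name__ == "__main__":
    compare(set(sys.argv[1:]))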
# configs/experiment/baseline.yaml
name: "baseline"
description: "Baseline with default hyperparameters"
tags: ["baseline"]

# Use defaults from model/data/trainer
model: {}
data: {}
trainer:
  max_epochs: 100
# configs/experiment/ablation_dropout.yaml
name: "ablation_dropout"
description: "Effect of dropout rate"
tags: ["ablation", "regularization"]

# Run with: --multirun model.dropout=0.0,0.1,0.2,0.3,0.4,0.5
model:
  dropout: 0.3
# configs/experiment/hp_optimization.yaml
name: "hp_optimization"
description: "Hyperparameter optimization with Optuna"
tags: ["optimization", "tuning"]

defaults:
  - override hydra/sweeper: optuna

hydra:
  sweeper:
    n_trials: 100
    direction: maximize
    study_name: "model_optimization"
    params:
      model.hidden_dims:
        type: categorical
        choices: [[512, 256], [1024, 512, 256]]
      model.optimizer.lr:
        type: float
        low: 0.0001
        high: 0.01
        log: true

optimized_metric: "val/acc"
See templates/ for more experiment types.
# Save package versions
pixi list > logs/exp_001/environment.txt
# or
uv pip freeze > logs/exp_001/requirements.txt
# Save git commit
git rev-parse HEAD > logs/exp_001/commit_hash.txt
# Save system info
python -c "import torch; print(f'PyTorch: {torch.__version__}\nCUDA: {torch.version.cuda}')" > logs/exp_001/system_info.txt
# Checkout exact code
git checkout $(cat logs/exp_001/commit_hash.txt)
# Restore environment
pixi install
# or
uv pip install -r logs/exp_001/requirements.txt
# Run with exact config
python src/train.py \
--config-path ../logs/exp_001/.hydra \
--config-name config
Reproducibility Checklist:
- Full resolved config saved (.hydra/config.yaml)
- Random seed fixed and recorded (see the seed sketch below)
- Package versions exported (environment.txt or requirements.txt)
- Git commit hash recorded (commit_hash.txt)
- System info captured (PyTorch and CUDA versions)
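A minimal seed-fixing sketch, assuming PyTorch Lightning (seed_everything covers Python, NumPy, and torch RNGs; workers=True also seeds DataLoader workers):

from lightning.pytorch import seed_everything

# Matches the `seed: 42` field in the experiment config above.
seed_everything(42, workers=True)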
python scripts/analyze_experiment.py logs/2026-02-22/14-30-22/
Generates:
analysis.png - Training curves (loss, accuracy, LR)
Example:
Experiment Summary:
Best Val Acc: 0.8921
Best Val Loss: 0.2984
Epochs Trained: 45
Final LR: 0.000123
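A hedged sketch of how this summary could be computed from the run's metrics.csv (the column names val/acc, val/loss, epoch, and lr are assumptions based on the config above):

import pandas as pd

df = pd.read_csv("logs/2026-02-22/14-30-22/metrics.csv")

print("Experiment Summary:")
print(f"  Best Val Acc:   {df['val/acc'].max():.4f}")
print(f"  Best Val Loss:  {df['val/loss'].min():.4f}")
print(f"  Epochs Trained: {int(df['epoch'].max())}")
print(f"  Final LR:       {df['lr'].dropna().iloc[-1]:.6f}")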
# List all experiments
python scripts/list_experiments.py
# Filter by tags
python scripts/list_experiments.py --tags baseline ablation
# Export to CSV
python scripts/export_results.py --output results.csv
# Generate markdown report
python scripts/generate_report.py --format markdown --output report.md
See examples/experiment-analysis.md for detailed analysis workflows.
import wandb

api = wandb.Api()
runs = api.runs("my-project")

# Filter runs
runs = api.runs("my-project", filters={"tags": "baseline"})

# Get metrics
for run in runs:
    print(f"{run.name}: val_acc={run.summary['val/acc']:.4f}")

# Download artifacts
best_run = runs[0]
best_run.file("model.pt").download()
# Workspaces and comparison reports are created from the W&B web UI
# (or programmatically via the W&B Reports API)
# Initialize sweep
wandb sweep configs/sweep/bayesian_optimization.yaml
# Run sweep agent
wandb agent <sweep-id>
See examples/wandb-integration.md for complete guide.
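The same sweep can also be driven from Python. A minimal sketch (train_fn and the parameter names are placeholders; the spec mirrors W&B's bayes method):

import wandb

sweep_config = {
    "method": "bayes",
    "metric": {"name": "val/acc", "goal": "maximize"},
    "parameters": {
        "lr": {"distribution": "log_uniform_values", "min": 1e-4, "max": 1e-2},
        "batch_size": {"values": [128, 256]},
    },
}

def train_fn():
    # Placeholder: read wandb.config, train, and log the monitored metric.
    with wandb.init() as run:
        run.log({"val/acc": 0.0})

sweep_id = wandb.sweep(sweep_config, project="imagenet-classification")
wandb.agent(sweep_id, function=train_fn, count=20)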
Naming conventions: use descriptive names (vit_large_imagenet_pretrained), date-prefixed iterations (exp_2026_02_baseline), and type prefixes (ablation_, optimization_, baseline_).
Organize configs by type:
configs/experiment/
├── baselines/
│   ├── resnet_baseline.yaml
│   └── vit_baseline.yaml
├── ablations/
│   ├── ablation_dropout.yaml
│   └── ablation_lr.yaml
└── optimizations/
    └── hp_optimization.yaml
Baseline - Purpose: Establish reference performance.
name: "baseline"
tags: ["baseline"]
model: {}  # Use defaults

Ablation Study - Purpose: Isolate the effect of a single component.
name: "ablation_batch_norm"
tags: ["ablation"]
model:
  use_batch_norm: false  # Remove batch norm

Hyperparameter Optimization - Purpose: Find optimal hyperparameters.
name: "hp_tuning"
tags: ["optimization"]
# Use with --multirun or the Optuna sweeper

Transfer Learning - Purpose: Fine-tune a pretrained model.
name: "transfer_learning"
tags: ["transfer-learning"]
model:
  pretrained: true
  freeze_backbone: true  # Freeze early layers

Architecture Comparison - Purpose: Compare different architectures.
# Run multiple architectures
python src/train.py --multirun \
  experiment=architecture_search \
  model=resnet18,resnet50,vit_base
# Create new experiment
python src/train.py experiment=<name>
# List experiments
python scripts/list_experiments.py
# Compare experiments
python scripts/compare_experiments.py exp_001 exp_002 exp_003
# Analyze experiment
python scripts/analyze_experiment.py logs/2026-02-22/14-30-22/
# Clean old experiments (keep best 5)
python scripts/clean_experiments.py --keep-best 5
# Export results
python scripts/export_results.py --output results.csv
# Generate report
python scripts/generate_report.py --format markdown --output report.md
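As an illustration, the keep-best cleanup could work along these lines; a hedged sketch against the registry format above (the shipped scripts/clean_experiments.py may differ):

import json
from pathlib import Path

def clean_experiments(keep_best: int = 5) -> None:
    """Rank experiments by best_val_acc and report everything below the top N."""
    registry = json.loads(Path("logs/experiment_registry.json").read_text())
    ranked = sorted(
        registry["experiments"],
        key=lambda e: e["metrics"]["best_val_acc"],
        reverse=True,
    )
    for exp in ranked[keep_best:]:
        # Dry run: print candidates instead of deleting their log directories.
        print(f"would remove: {exp['id']} ({exp['name']})")

clean_experiments(keep_best=5)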
Experiment registry not updating:
- Check that logs/experiment_registry.json exists and is writable
- Verify the on_train_end callback is called

Can't reproduce results:
- Confirm the saved seed, git commit, and environment match (see the reproducibility steps above)

W&B runs not logging:
- Check that WANDB_API_KEY is set
- Run wandb login again

Metrics not saving:
- Check that log_every_n_steps is set

Experiments are well-organized and easily comparable!