Execute training runs with proper monitoring, checkpointing, and experiment tracking. Use when starting training, resuming training, debugging training issues, or setting up multi-GPU/distributed training with PyTorch Lightning and Hydra.
Install: npx claudepluginhub nishide-dev/claude-code-ml-research
Execute training runs with proper monitoring, checkpointing, and experiment tracking using PyTorch Lightning and Hydra.
Choose a training template based on your setup:
Basic training:
python src/train.py
With specific experiment config:
python src/train.py experiment=my_experiment
With CLI overrides:
python src/train.py \
model.learning_rate=1e-3 \
data.batch_size=64 \
trainer.max_epochs=100
Resume from checkpoint:
python src/train.py ckpt_path="checkpoints/epoch_42.ckpt"
Multi-GPU training:
python src/train.py \
trainer.devices=4 \
trainer.strategy=ddp
Hyperparameter sweep:
python src/train.py --multirun \
model.learning_rate=1e-4,1e-3,1e-2 \
data.batch_size=32,64,128
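These commands assume a Hydra entry point that instantiates the model, data module, and Trainer from config. A minimal sketch of such an entry point (the config layout and _target_ paths are assumptions, not from this skill):
# src/train.py — minimal Hydra + Lightning entry point (sketch)
import hydra
import pytorch_lightning as pl
from omegaconf import DictConfig

@hydra.main(version_base=None, config_path="../configs", config_name="train")
def main(cfg: DictConfig) -> None:
    # Instantiate objects declared via _target_ in the config tree
    model = hydra.utils.instantiate(cfg.model)
    datamodule = hydra.utils.instantiate(cfg.data)
    trainer = pl.Trainer(**cfg.trainer)
    trainer.fit(model, datamodule=datamodule, ckpt_path=cfg.get("ckpt_path"))

if __name__ == "__main__":
    main()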
Before starting training, verify:
Environment:
# Check Python version
python --version # Should be >= 3.10
# Check CUDA availability
python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}, GPUs: {torch.cuda.device_count()}')"
# Check package installation
python -c "import pytorch_lightning as pl; print(f'Lightning: {pl.__version__}')"
# Validate config
python src/train.py --cfg job
# Dry run
python src/train.py trainer.fast_dev_run=5
Disk space:
Estimate checkpoint storage (MB): model_size_mb × save_top_k × num_epochs / checkpoint_freq
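As a quick back-of-envelope check of that formula (the numbers below are illustrative, not from the source):
# Worst-case checkpoint storage from the formula above
model_size_mb = 500      # size of one checkpoint on disk
save_top_k = 3           # ModelCheckpoint(save_top_k=...)
num_epochs = 100
checkpoint_freq = 1      # save every N epochs
required_gb = model_size_mb * save_top_k * num_epochs / checkpoint_freq / 1024
print(f"Budget up to {required_gb:.0f} GB for checkpoints")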
Experiment tracking:
# Initialize W&B (if using)
wandb login
export WANDB_PROJECT="your-project-name"
# Sync offline runs (if needed)
wandb sync wandb/offline-run-*
Real-time monitoring:
watch -n 1 nvidia-smi
Key metrics to watch: training/validation loss, learning rate, and GPU utilization and memory.
Red flags and common fixes:
Gradient issues:
# Add to trainer config
trainer = Trainer(
    gradient_clip_val=1.0,
    gradient_clip_algorithm="norm",
)
# Note: track_grad_norm was removed in Lightning 2.x; log gradient norms
# from the LightningModule instead (see the sketch below).
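A minimal sketch of that gradient-norm logging, assuming Lightning 2.x:
import pytorch_lightning as pl
from pytorch_lightning.utilities import grad_norm

class MyModel(pl.LightningModule):
    def on_before_optimizer_step(self, optimizer):
        # Compute the 2-norm of gradients per layer plus the total, then log them
        norms = grad_norm(self, norm_type=2)
        self.log_dict(norms)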
Memory issues:
# Use mixed precision + gradient accumulation
trainer = Trainer(
    precision="16-mixed",
    accumulate_grad_batches=4,
)
# Compile model (PyTorch 2.0+)
model = torch.compile(model)
Slow data loading:
# Profile to identify bottleneck
trainer = Trainer(profiler="simple")
# Optimize data loading
data_module = DataModule(
    num_workers=8,
    pin_memory=True,
    persistent_workers=True,
    prefetch_factor=2,
)
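Those arguments are ultimately DataLoader arguments; a hypothetical LightningDataModule wiring them up:
import pytorch_lightning as pl
from torch.utils.data import DataLoader, Dataset

class DataModule(pl.LightningDataModule):
    def __init__(self, dataset: Dataset, batch_size: int = 64, num_workers: int = 8):
        super().__init__()
        self.dataset = dataset
        self.batch_size = batch_size
        self.num_workers = num_workers

    def train_dataloader(self):
        return DataLoader(
            self.dataset,
            batch_size=self.batch_size,
            shuffle=True,
            num_workers=self.num_workers,    # parallel worker processes
            pin_memory=True,                 # faster host-to-GPU copies
            persistent_workers=True,         # keep workers alive across epochs
            prefetch_factor=2,               # batches prefetched per worker
        )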
Overfitting:
# Add early stopping
from pytorch_lightning.callbacks import EarlyStopping
trainer = Trainer(
    callbacks=[
        EarlyStopping(monitor="val/loss", patience=10, min_delta=0.001)
    ]
)
# Increase regularization (Hydra model config)
model:
  dropout: 0.3
  weight_decay: 0.0001
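One hypothetical way the model side consumes those keys, with weight_decay applied as L2 regularization in the optimizer:
import torch
import pytorch_lightning as pl

class MyModel(pl.LightningModule):
    def __init__(self, dropout: float = 0.3, weight_decay: float = 1e-4, learning_rate: float = 1e-3):
        super().__init__()
        self.save_hyperparameters()
        self.dropout = torch.nn.Dropout(dropout)

    def configure_optimizers(self):
        # AdamW applies weight decay decoupled from the gradient update
        return torch.optim.AdamW(
            self.parameters(),
            lr=self.hparams.learning_rate,
            weight_decay=self.hparams.weight_decay,
        )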
DDP (Distributed Data Parallel):
trainer.strategy=ddp trainer.devices=4
FSDP (Fully Sharded Data Parallel):
trainer.strategy=fsdp trainer.devices=4
DeepSpeed:
trainer.strategy=deepspeed_stage_3
Mixed precision training:
python src/train.py trainer.precision=16-mixed
Gradient checkpointing (save memory):
model.gradient_checkpointing_enable()  # Hugging Face transformers models
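For plain PyTorch models (the call above is the Hugging Face transformers API), a sketch of the same idea with torch.utils.checkpoint:
import torch
from torch.utils.checkpoint import checkpoint

class CheckpointedNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.block1 = torch.nn.Sequential(torch.nn.Linear(256, 256), torch.nn.ReLU())
        self.block2 = torch.nn.Sequential(torch.nn.Linear(256, 256), torch.nn.ReLU())
        self.head = torch.nn.Linear(256, 10)

    def forward(self, x):
        # Each block's activations are recomputed during backward instead of
        # stored, trading compute for memory; use_reentrant=False is the
        # recommended mode in recent PyTorch releases.
        x = checkpoint(self.block1, x, use_reentrant=False)
        x = checkpoint(self.block2, x, use_reentrant=False)
        return self.head(x)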
Learning rate finder:
from pytorch_lightning.tuner import Tuner

trainer = Trainer()
lr_finder = Tuner(trainer).lr_find(model, datamodule=dm)
fig = lr_finder.plot(suggest=True)
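To apply the result (a sketch; assumes the model exposes learning_rate as a hyperparameter):
# Use the suggested learning rate for the real run
model.hparams.learning_rate = lr_finder.suggestion()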
Stochastic Weight Averaging:
from pytorch_lightning.callbacks import StochasticWeightAveraging
trainer = Trainer(callbacks=[StochasticWeightAveraging(swa_lrs=1e-2)])
Load best checkpoint:
best_model_path = trainer.checkpoint_callback.best_model_path
model = MyModel.load_from_checkpoint(best_model_path)
Evaluate on test set:
trainer.test(model, datamodule=dm, ckpt_path="best")
Generate predictions:
predictions = trainer.predict(model, datamodule=dm)
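predict() returns one entry per batch; assuming each step returns a tensor, combine them with:
import torch

# Flatten the per-batch list into a single tensor
all_preds = torch.cat(predictions, dim=0)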
For graph neural networks, see GNN training guide:
# Node classification
python src/train.py \
model=gnn \
data=graph \
data.dataset_name=Cora
# Graph classification with batching
python src/train.py \
model=gnn \
data=graph \
data.dataset_name=PROTEINS \
data.batch_size=32
# Large graph sampling
python src/train.py \
model=gnn \
data=graph \
data.use_sampling=true \
"data.num_neighbors=[15,10,5]"
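Under the hood, a model=gnn config might point at a LightningModule like this hypothetical node classifier (PyTorch Geometric assumed; names are illustrative):
import torch
import torch.nn.functional as F
import pytorch_lightning as pl
from torch_geometric.nn import GCNConv

class GNNNodeClassifier(pl.LightningModule):
    def __init__(self, in_dim: int, hidden_dim: int, num_classes: int, lr: float = 1e-3):
        super().__init__()
        self.save_hyperparameters()
        self.conv1 = GCNConv(in_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, num_classes)

    def forward(self, x, edge_index):
        x = F.relu(self.conv1(x, edge_index))
        return self.conv2(x, edge_index)

    def training_step(self, batch, batch_idx):
        out = self(batch.x, batch.edge_index)
        # Node classification: evaluate the loss on the training mask only
        loss = F.cross_entropy(out[batch.train_mask], batch.y[batch.train_mask])
        self.log("train/loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.hparams.lr)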
For GNN-specific metrics to monitor, plus architectures, sampling strategies, and troubleshooting, see the complete GNN guide.
Grid sweep:
python src/train.py --multirun \
model.learning_rate=1e-4,1e-3,1e-2 \
data.batch_size=32,64,128
Optuna sweep (requires the hydra-optuna-sweeper plugin):
python src/train.py \
--multirun \
hydra/sweeper=optuna \
hydra.sweeper.n_trials=50
For advanced sweep configurations (random search, Bayesian optimization, multi-objective), see reference guide.
# In LightningModule
def training_step(self, batch, batch_idx):
    loss = ...
    # Log metrics
    self.log("train/loss", loss)
    self.log("train/acc", accuracy)
    self.log("lr", self.optimizers().param_groups[0]["lr"])
    return loss
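accuracy above is a placeholder; one common way to compute it is torchmetrics, sketched here with a hypothetical classifier:
import torch
import torch.nn.functional as F
import pytorch_lightning as pl
import torchmetrics

class Classifier(pl.LightningModule):
    def __init__(self, in_dim: int = 32, num_classes: int = 10):
        super().__init__()
        self.net = torch.nn.Linear(in_dim, num_classes)
        # torchmetrics handles device placement and epoch-level aggregation
        self.train_acc = torchmetrics.classification.MulticlassAccuracy(num_classes=num_classes)

    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self.net(x)
        loss = F.cross_entropy(logits, y)
        self.train_acc(logits, y)
        self.log("train/loss", loss)
        self.log("train/acc", self.train_acc, on_step=False, on_epoch=True)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters())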
W&B can automatically log system metrics, hyperparameters, and model checkpoints (with log_model=true). See the reference guide for logging confusion matrices, sample predictions, and custom artifacts.
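Wiring that up in code (the project name is a placeholder):
from pytorch_lightning import Trainer
from pytorch_lightning.loggers import WandbLogger

# log_model=True uploads checkpoints to W&B as artifacts
wandb_logger = WandbLogger(project="your-project-name", log_model=True)
trainer = Trainer(logger=wandb_logger)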
# Quick debug run (5 batches)
python src/train.py trainer.fast_dev_run=5
# Overfit single batch (check model capacity)
python src/train.py trainer.overfit_batches=1 trainer.max_epochs=100
# Profile training (identify bottlenecks)
python src/train.py trainer.profiler=advanced trainer.max_epochs=1
For the complete command reference and full training examples, see the reference guide.
Training doesn't start:
python src/train.py --cfg job
python -c "import pytorch_lightning; import hydra"
python -c "import torch; print(torch.cuda.is_available())"
Training is unstable: try gradient clipping and mixed-precision settings (see Gradient issues above).
Training is slow:
trainer.profiler="advanced"
torch.compile() (PyTorch 2.0+)
For advanced topics, see the reference guide.
Happy training! 🚀