Comprehensive validation of ML project structure, configurations, code quality, and training readiness. Use when setting up a new project, before training runs, or debugging configuration issues. Validates config loading, data pipeline, model architecture, and dependencies.
Install:
npx claudepluginhub nishide-dev/claude-code-ml-research
This skill uses the workspace's default tool permissions.
# Run full validation
python scripts/validate_project.py
# Quick config check
python src/train.py --cfg job
# Fast dev run (1 batch train/val/test)
python src/train.py trainer.fast_dev_run=true
Required directories:
src/ - Source codesrc/models/ - Model implementationssrc/data/ - DataModule implementationsconfigs/ - Hydra configuration filestests/ - Unit tests (recommended)Required files:
src/train.py - Training scriptconfigs/config.yaml - Main configpyproject.toml or pixi.toml - Package managerCheck manually:
# Verify structure
test -d src && test -d configs && echo "✓ Basic structure OK"
test -f src/train.py && echo "✓ Training script found"
test -f configs/config.yaml && echo "✓ Main config found"
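The manual checks above can also be folded into a small Python helper, which is roughly what an automated structure check might look like. This is a sketch, not the actual scripts/validate_project.py; the path lists mirror the required directories and files above:

```python
from pathlib import Path

# Required layout from the lists above; adjust for your project.
REQUIRED_DIRS = ["src", "src/models", "src/data", "configs"]
REQUIRED_FILES = ["src/train.py", "configs/config.yaml"]

def check_structure(root: str = ".") -> list[str]:
    """Return missing required paths; an empty list means the layout is OK."""
    root_path = Path(root)
    missing = [d for d in REQUIRED_DIRS if not (root_path / d).is_dir()]
    missing += [f for f in REQUIRED_FILES if not (root_path / f).is_file()]
    return missing

if __name__ == "__main__":
    problems = check_structure()
    print("✓ structure OK" if not problems else f"❌ missing: {problems}")
```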
YAML syntax:
# Validate all YAML files
python -c "
import yaml
from pathlib import Path

for yaml_file in Path('configs').rglob('*.yaml'):
    try:
        yaml.safe_load(yaml_file.read_text())
        print(f'✓ {yaml_file}')
    except yaml.YAMLError as e:
        print(f'❌ {yaml_file}: {e}')
"
Config composition:
# Test Hydra config loads correctly
python src/train.py --cfg job
_target_ validation:
_target_ paths must be importable. Use scripts/validate_project.py for automated checking.
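One way to sketch that importability check (stdlib targets are used here purely for illustration):

```python
import importlib

def target_importable(target: str) -> bool:
    """Check that a dotted _target_ path resolves to an importable attribute."""
    module_path, _, attr = target.rpartition(".")
    if not module_path:
        return False
    try:
        module = importlib.import_module(module_path)
    except ImportError:
        return False
    return hasattr(module, attr)

print(target_importable("collections.OrderedDict"))  # True
print(target_importable("nonexistent_pkg.Model"))    # False
```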
Linting:
# Ruff checks
ruff check src/ tests/
# Auto-fix issues
ruff check --fix src/ tests/
Type checking:
# ty (type checker)
ty check src/
# mypy (alternative)
mypy src/ --ignore-missing-imports
Import validation:
# Check all files have valid Python syntax
import ast
from pathlib import Path

for py_file in Path("src").rglob("*.py"):
    try:
        ast.parse(py_file.read_text())
        print(f"✓ {py_file}")
    except SyntaxError as e:
        print(f"❌ {py_file}: {e}")
Required packages:
torch - PyTorch
pytorch_lightning - Lightning framework
hydra-core - Configuration management
Optional but recommended:
wandb - Experiment tracking
tensorboard - Visualization
torch_geometric - For GNNs
transformers - For NLP
Check installation:
python -c "
import torch
import pytorch_lightning
import hydra
print(f'PyTorch: {torch.__version__}')
print(f'Lightning: {pytorch_lightning.__version__}')
print(f'Hydra: {hydra.__version__}')
"
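The optional packages can be probed without importing them (which avoids pulling in heavy dependencies just to check availability). A small sketch using importlib.util.find_spec:

```python
import importlib.util

# Optional packages from the list above.
OPTIONAL = ["wandb", "tensorboard", "torch_geometric", "transformers"]

def probe(names):
    """Map each package name to whether it is importable, without importing it."""
    return {n: importlib.util.find_spec(n) is not None for n in names}

for name, installed in probe(OPTIONAL).items():
    print(f"{'✓' if installed else '–'} {name}{'' if installed else ' (optional, not installed)'}")
```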
GPU availability:
python -c "
import torch

print(f'CUDA available: {torch.cuda.is_available()}')
if torch.cuda.is_available():
    print(f'CUDA version: {torch.version.cuda}')
    print(f'GPU count: {torch.cuda.device_count()}')
    for i in range(torch.cuda.device_count()):
        print(f'GPU {i}: {torch.cuda.get_device_name(i)}')
"
DataModule instantiation:
from hydra import compose, initialize_config_dir
from hydra.utils import instantiate
from pathlib import Path

# Load config
config_dir = Path.cwd() / "configs"
with initialize_config_dir(version_base=None, config_dir=str(config_dir)):
    cfg = compose(config_name="config")

# Instantiate DataModule
dm = instantiate(cfg.data)
print(f"✓ DataModule: {type(dm).__name__}")

# Test setup
dm.setup("fit")
print("✓ DataModule.setup() successful")

# Check dataloaders
train_loader = dm.train_dataloader()
print(f"✓ Train batches: {len(train_loader)}")
Data directory:
# Verify data path exists
python -c "
from omegaconf import OmegaConf
from pathlib import Path

cfg = OmegaConf.load('configs/config.yaml')
data_dir = Path(cfg.data.data_dir)
if data_dir.exists():
    print(f'✓ Data directory: {data_dir}')
    print(f' Files: {len(list(data_dir.rglob(\"*\")))}')
else:
    print(f'⚠️ Data directory not found: {data_dir}')
"
Model instantiation:
from hydra import compose, initialize_config_dir
from hydra.utils import instantiate
from pathlib import Path

# Load config
config_dir = Path.cwd() / "configs"
with initialize_config_dir(version_base=None, config_dir=str(config_dir)):
    cfg = compose(config_name="config")

# Instantiate model
model = instantiate(cfg.model)
print(f"✓ Model: {type(model).__name__}")

# Count parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f" Total params: {total_params:,}")
print(f" Trainable: {trainable_params:,}")
Forward pass test:
import torch

# Create dummy input (adjust the shape for your model)
batch_size = 2
dummy_input = torch.randn(batch_size, 3, 224, 224)

# Test forward pass
model.eval()
with torch.no_grad():
    output = model(dummy_input)

print("✓ Forward pass OK")
print(f" Input: {dummy_input.shape}")
print(f" Output: {output.shape}")
Fast dev run:
# Run 1 batch of train/val/test
python src/train.py trainer.fast_dev_run=true
# Expected output:
# - No errors
# - Completes in <1 minute
# - Shows train/val/test progress
Logger check:
from hydra import compose, initialize_config_dir
from pathlib import Path
import os

config_dir = Path.cwd() / "configs"
with initialize_config_dir(version_base=None, config_dir=str(config_dir)):
    cfg = compose(config_name="config")

if "logger" in cfg:
    print(f"✓ Logger: {cfg.logger.get('_target_', 'unknown')}")
    # Check W&B credentials if using wandb
    if "wandb" in str(cfg.logger.get("_target_", "")):
        if "WANDB_API_KEY" in os.environ:
            print("✓ W&B API key set")
        else:
            print("⚠️ W&B not logged in (run: wandb login)")
Use the automated validation script:
python scripts/validate_project.py
What it checks:
Project structure (required directories and files)
Config validity (YAML syntax and Hydra composition)
Code quality (linting and syntax)
Dependency installation
Model and DataModule instantiation
Fast dev run completion
Example output:
INFO: Starting ML project validation...
INFO: ✓ Project structure valid
INFO: ✓ All configs valid
INFO: ✓ Code quality OK
INFO: ✓ All dependencies installed
INFO: ✓ Model instantiated successfully
INFO: ✓ DataModule instantiated successfully
INFO: ✓ Fast dev run completed
INFO: ✓ All validation checks passed!
See scripts/validate_project.py for implementation.
# Config only
python src/train.py --cfg job && echo "✓ Config OK"
# Full validation
python scripts/validate_project.py && echo "✓ All OK"
# 1. Structure
test -d src && test -d configs && test -f src/train.py && echo "✓ Structure"
# 2. Config
python src/train.py --cfg job && echo "✓ Config"
# 3. Dependencies
python -c "import torch, pytorch_lightning, hydra" && echo "✓ Deps"
# 4. GPU
python -c "import torch; assert torch.cuda.is_available()" && echo "✓ GPU"
# 5. Fast dev run
python src/train.py trainer.fast_dev_run=true && echo "✓ Training"
Add to .github/workflows/validate.yml:
name: Validate ML Project

on: [push, pull_request]

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      # uv must be installed before the `uv` steps below can run
      - uses: astral-sh/setup-uv@v5
      - name: Install dependencies
        run: uv sync --all-extras
      - name: Validate project
        run: uv run python scripts/validate_project.py
      - name: Test config
        run: uv run python src/train.py --cfg job
      - name: Fast dev run
        run: uv run python src/train.py trainer.fast_dev_run=true
Cause: Typo in defaults or invalid YAML.
Fix:
# Check YAML syntax
python -c "import yaml; yaml.safe_load(open('configs/config.yaml'))"
# Check defaults exist
ls configs/model/ configs/data/ configs/trainer/
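The defaults in configs/config.yaml can also be cross-checked programmatically against the files on disk. A minimal sketch, assuming plain `- group: name` entries (Hydra's special entries such as `_self_` are skipped):

```python
import yaml
from pathlib import Path

def missing_defaults(config_dir: str = "configs", config_name: str = "config.yaml") -> list[str]:
    """Return defaults entries (as 'group/name') whose YAML file does not exist."""
    cfg = yaml.safe_load((Path(config_dir) / config_name).read_text()) or {}
    missing = []
    for entry in cfg.get("defaults", []):
        if not isinstance(entry, dict):
            continue  # skip _self_ and bare entries
        for group, name in entry.items():
            candidate = Path(config_dir) / str(group) / f"{name}.yaml"
            if not candidate.is_file():
                missing.append(f"{group}/{name}")
    return missing
```

Calling missing_defaults() from the project root returns an empty list when every referenced group file exists.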
Cause: Module path incorrect or not installed.
Fix:
# Check import works
python -c "from src.models.my_model import MyModel"
# Verify path in config matches file structure
Cause: Data directory missing or incorrect path.
Fix:
# Check data path in config
grep data_dir configs/data/*.yaml
# Create data directory
mkdir -p data/
Cause: Various issues in training loop.
Fix:
# Run with verbose logging
python src/train.py trainer.fast_dev_run=true hydra.verbose=true
# Check logs for specific error
✅ Project is ready for training!