Master training pipelines: orchestration, distributed training, and hyperparameter tuning
Build production training pipelines with orchestration, distributed training, and hyperparameter tuning. Use when creating end-to-end ML workflows that require multi-GPU training, Optuna hyperparameter sweeps, or Kubeflow pipeline deployment.
```
/plugin marketplace add pluginagentmarketplace/custom-plugin-mlops
/plugin install custom-plugin-mlops@pluginagentmarketplace-mlops
```
This skill inherits all available tools. When active, it can use any tool Claude has access to.
Bundled files:
- assets/config.yaml
- assets/schema.json
- references/GUIDE.md
- references/PATTERNS.md
- scripts/validate.py

Learn: Build production training pipelines with orchestration and distributed training.
| Attribute | Value |
|---|---|
| Bonded Agent | 04-training-pipelines |
| Difficulty | Intermediate to Advanced |
| Duration | 40 hours |
| Prerequisites | mlops-basics, experiment-tracking |
Pipeline Architecture:
```
┌────────────────────────────────────────────────────────────────────┐
│                         TRAINING PIPELINE                          │
├────────────────────────────────────────────────────────────────────┤
│                                                                    │
│  ┌──────┐   ┌──────────┐   ┌───────┐   ┌──────────┐   ┌──────────┐ │
│  │ Data │──▶│Preprocess│──▶│ Train │──▶│ Evaluate │──▶│ Register │ │
│  │ Load │   │          │   │       │   │          │   │          │ │
│  └──────┘   └──────────┘   └───────┘   └──────────┘   └──────────┘ │
│                                ║                                   │
│                                ▼                                   │
│                        [Hyperparameter]                            │
│                        [    Tuning    ]                            │
│                                                                    │
└────────────────────────────────────────────────────────────────────┘
```
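Conceptually, the stages chain sequentially, with the tuning loop wrapped around the Train stage. A minimal, framework-free sketch of that wiring; every function below is a toy placeholder, not part of any real API:

```python
# Framework-free sketch of the stage chain above; every function is a
# toy placeholder standing in for a real pipeline step.
def data_load(ctx: dict) -> dict:
    ctx["raw"] = f"rows from {ctx['dataset_uri']}"
    return ctx

def preprocess(ctx: dict) -> dict:
    ctx["features"] = f"features({ctx['raw']})"
    return ctx

def train(ctx: dict) -> dict:
    ctx["model"] = f"model({ctx['features']}, lr={ctx['params']['lr']})"
    return ctx

def evaluate(ctx: dict) -> dict:
    ctx["metrics"] = {"val_loss": 0.1}  # stand-in metric
    return ctx

def register(ctx: dict) -> dict:
    ctx["model_version"] = "v1"
    return ctx

def run_pipeline(dataset_uri: str, params: dict) -> dict:
    ctx = {"dataset_uri": dataset_uri, "params": params}
    for stage in (data_load, preprocess, train, evaluate, register):
        ctx = stage(ctx)
    return ctx
```

The hyperparameter tuning branch in the diagram re-enters at the Train stage: a tuner such as Optuna calls the pipeline repeatedly with different params and keeps the version with the best metrics.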
PyTorch DDP Setup:
```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def setup_distributed() -> int:
    """Initialize the process group and pin this process to its GPU."""
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    return local_rank

# Wrap the model for synchronized gradient all-reduce across ranks
model = DDP(model, device_ids=[local_rank])

# Use DistributedSampler so each rank sees a distinct data shard
sampler = DistributedSampler(dataset)
loader = DataLoader(dataset, sampler=sampler)
```
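One detail the snippet leaves implicit: DistributedSampler shuffles identically every epoch unless it is told the epoch number. A sketch of the per-epoch loop, assuming the model, loader, and sampler defined above plus a hypothetical train_step helper:

```python
# train_step is a hypothetical helper standing in for the usual
# forward/backward/optimizer-step sequence.
num_epochs = 10
for epoch in range(num_epochs):
    sampler.set_epoch(epoch)  # vary the shuffle across epochs and ranks
    for batch in loader:
        train_step(model, batch)

dist.destroy_process_group()  # tidy shutdown once training finishes
```

The script is then launched with torchrun (for example, `torchrun --nproc_per_node=4 train.py`), which sets the LOCAL_RANK environment variable that setup_distributed reads.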
Optuna Configuration:
```python
import optuna
from optuna.pruners import HyperbandPruner
from optuna.samplers import TPESampler

def objective(trial):
    # Sample hyperparameters from the search space
    lr = trial.suggest_float("lr", 1e-5, 1e-2, log=True)
    batch_size = trial.suggest_categorical("batch_size", [16, 32, 64])
    hidden_size = trial.suggest_int("hidden_size", 64, 512, step=64)

    model = build_model(hidden_size)
    metrics = train_model(model, lr, batch_size)
    return metrics["val_loss"]

study = optuna.create_study(
    direction="minimize",
    sampler=TPESampler(),
    pruner=HyperbandPruner(),
)
study.optimize(objective, n_trials=100)
```
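Note that HyperbandPruner only takes effect if the objective reports intermediate values. A sketch of the reporting loop that would sit inside objective, assuming a hypothetical train_one_epoch helper that returns the current validation loss:

```python
# Inside objective(): report per-epoch validation loss so Hyperband
# can terminate unpromising trials early. train_one_epoch is a
# hypothetical helper returning the epoch's validation loss.
for epoch in range(20):
    val_loss = train_one_epoch(model, lr, batch_size)
    trial.report(val_loss, step=epoch)
    if trial.should_prune():
        raise optuna.TrialPruned()
```

After study.optimize returns, study.best_params holds the winning configuration.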
Kubeflow Pipeline:
```python
from kfp import dsl, compiler

@dsl.component
def preprocess_data(input_path: str) -> str:
    # Preprocessing logic; returns the path of the processed data
    return input_path

@dsl.component
def train_model(data_path: str):
    # Training logic
    pass

@dsl.pipeline(name="training-pipeline")
def training_pipeline(dataset_uri: str):
    preprocess_task = preprocess_data(input_path=dataset_uri)
    train_task = train_model(data_path=preprocess_task.output)
    train_task.set_gpu_limit(1)
```
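Compiling the pipeline function produces a spec that can be uploaded to a Kubeflow Pipelines instance; the output filename below is just an example:

```python
# Compile to pipeline IR YAML; the filename is arbitrary.
compiler.Compiler().compile(training_pipeline, "training_pipeline.yaml")
```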
```python
# templates/train.py
import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping, ModelCheckpoint

class ProductionTrainer:
    """Production-ready training wrapper."""

    def __init__(self, config: dict):
        self.config = config

    def train(self, model, train_loader, val_loader):
        callbacks = [
            ModelCheckpoint(
                monitor="val_loss",
                mode="min",
                save_top_k=3,
            ),
            EarlyStopping(
                monitor="val_loss",
                patience=5,
            ),
        ]
        trainer = pl.Trainer(
            max_epochs=self.config["epochs"],
            accelerator="gpu",
            devices=self.config["gpus"],
            strategy="ddp" if self.config["gpus"] > 1 else "auto",
            callbacks=callbacks,
            precision="16-mixed",
        )
        trainer.fit(model, train_loader, val_loader)
        return trainer
```
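A usage sketch; the config keys match those read by ProductionTrainer, while MyLightningModule and the two loaders are hypothetical placeholders:

```python
# Hypothetical usage; MyLightningModule and the loaders are placeholders.
config = {"epochs": 50, "gpus": 2}
trainer = ProductionTrainer(config).train(
    MyLightningModule(), train_loader, val_loader
)
```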
Troubleshooting:

| Issue | Cause | Solution |
|---|---|---|
| GPU OOM | Batch size too large | Reduce batch size or use gradient accumulation |
| Slow training | I/O bottleneck | Increase DataLoader workers, enable prefetching |
| Distributed hang | NCCL timeout | Check inter-node network; increase the timeout |
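The first two fixes map directly onto Trainer and DataLoader arguments. A sketch with illustrative values, reusing the dataset from the DDP snippet; tune the numbers to the hardware:

```python
import pytorch_lightning as pl
from torch.utils.data import DataLoader

# Values below are illustrative, not recommendations.
trainer = pl.Trainer(
    accumulate_grad_batches=4,  # effective batch = 4 x micro-batch (OOM fix)
)
loader = DataLoader(
    dataset,
    batch_size=16,
    num_workers=8,      # parallel loading workers (I/O bottleneck fix)
    prefetch_factor=4,  # batches fetched ahead per worker
    pin_memory=True,    # faster host-to-GPU copies
)
```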
| Version | Date | Changes |
|---|---|---|
| 2.0.0 | 2024-12 | Production-grade with DDP examples |
| 1.0.0 | 2024-11 | Initial release |