Router to training optimization skills based on symptoms and training problems
Routes you to the right training optimization specialist based on symptoms. Diagnoses issues like loss not decreasing, instability, overfitting, or slow training, then directs you to the specific skill needed.
/plugin marketplace add tachyon-beep/skillpacks
/plugin install yzmir-training-optimization@foundryside-marketplace

This skill inherits all available tools. When active, it can use any tool Claude has access to.
Reference sheets in this pack:

- batch-size-and-memory-tradeoffs.md
- data-augmentation-strategies.md
- experiment-tracking.md
- gradient-management.md
- hyperparameter-tuning.md
- learning-rate-scheduling.md
- loss-functions-and-objectives.md
- optimization-algorithms.md
- overfitting-prevention.md
- training-loop-architecture.md

This meta-skill routes you to the right training optimization specialist based on symptoms. Training issues often have multiple potential causes; this skill helps diagnose symptoms and route to the appropriate specialist. Load this skill when you encounter training problems but aren't sure which specific technique to apply.
Core Principle: Diagnose before routing. Training issues often have multiple causes. Ask clarifying questions to understand symptoms before routing to specific skills. Wrong diagnosis wastes time—systematic routing saves it.
Load this skill when:

- Training shows a problem (loss not decreasing, instability, NaN, overfitting, slow training) and you aren't sure which technique applies
- You face a direct hyperparameter question (optimizer, learning rate, batch size, loss function) and need the right specialist
- You're setting up a new training project or experiment tracking and want the recommended skill sequence
Don't use for: PyTorch implementation bugs (use pytorch-engineering), model architecture selection (use neural-architectures), production deployment (use ml-production), RL-specific training (use deep-rl), LLM fine-tuning specifics (use llm-specialist)
IMPORTANT: All reference sheets are located in the SAME DIRECTORY as this SKILL.md file.
When this skill is loaded from:
skills/using-training-optimization/SKILL.md
Reference sheets like optimization-algorithms.md are at:
skills/using-training-optimization/optimization-algorithms.md
NOT at:
skills/optimization-algorithms.md ← WRONG PATH
When you see a link like [optimization-algorithms.md](optimization-algorithms.md), read the file from the same directory as this SKILL.md.
Keywords: stuck, flat loss, not improving, not learning, accuracy not increasing
Diagnostic questions (ask BEFORE routing):

- Has the loss been flat since epoch 0, or did it learn for a while and then plateau?
- Is the loss oscillating wildly, or going to NaN/Inf?
- Does the loss function actually match the task?
- What does validation loss do relative to training loss?
Route based on answers:
| Loss Behavior | Likely Cause | Route To | Why |
|---|---|---|---|
| Flat from epoch 0 | LR too low OR wrong optimizer OR inappropriate loss | learning-rate-scheduling + optimization-algorithms | Need to diagnose starting conditions |
| Was learning, then plateaued | Local minima OR LR too high OR overfitting | learning-rate-scheduling (scheduler) + check validation loss for overfitting | Adaptation needed during training |
| Oscillating wildly | LR too high OR gradient instability | learning-rate-scheduling + gradient-management | Instability issues |
| NaN or Inf | Gradient explosion OR numerical instability | gradient-management (PRIMARY) + loss-functions-and-objectives | Stability critical |
| Loss function doesn't match task | Wrong objective | loss-functions-and-objectives | Fundamental mismatch |
Multi-skill routing: Often need optimization-algorithms (choose optimizer) + learning-rate-scheduling (choose LR/schedule) together for "not learning" issues.
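To answer these diagnostic questions quickly, it helps to log loss and total gradient norm every step. A minimal sketch, assuming a standard PyTorch setup; the function and argument names here are hypothetical, not part of any specific API:

```python
import torch

def diagnostic_step(model, optimizer, loss_fn, batch):
    """One training step that also reports the signals used for routing."""
    inputs, targets = batch
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    # max_norm=inf computes the total gradient norm without clipping anything.
    # Near-zero norm suggests LR/optimizer issues; huge or NaN norm points
    # to gradient-management.
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), float("inf"))
    optimizer.step()
    return loss.item(), grad_norm.item()
```

A few hundred steps of these two numbers usually tells you which row of the table above you are in.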
Keywords: NaN, Inf, exploding, diverging, unstable, spikes
Diagnostic questions:

- When does the NaN/Inf first appear (immediately, or after many stable steps)?
- What is the learning rate, and was there a warmup?
- Are you using mixed precision (AMP)?
- Have you measured gradient norms before the failure?
Route to (in priority order):
gradient-management (PRIMARY)
learning-rate-scheduling (SECONDARY)
loss-functions-and-objectives (if numerical issues)
Cross-pack note: If using mixed precision (AMP), also check pytorch-engineering (mixed-precision-and-optimization) for gradient scaling issues.
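As a first response while diagnosing, gradient clipping plus a finiteness check usually localizes the problem. A minimal sketch, assuming hypothetical model/loss names; the specialist skill covers when and how aggressively to clip:

```python
import torch

def guarded_step(model, optimizer, loss_fn, inputs, targets, step):
    """Training step with NaN detection and gradient clipping."""
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    # Fail loudly the moment the loss goes non-finite, instead of
    # training on garbage for hundreds of steps.
    if not torch.isfinite(loss):
        raise RuntimeError(f"Non-finite loss at step {step}: {loss}")
    loss.backward()
    # Clip to a conservative norm; 1.0 is a common default, not a universal answer.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    return loss.item()
```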
Keywords: overfitting, train/val gap, poor generalization, memorizing training data
Diagnostic questions:

- How large is the train/val accuracy (or loss) gap?
- How big is the dataset?
- What regularization is already in place (dropout, weight decay, early stopping)?
- Is data augmentation applicable to this data type?
Route to (multi-skill approach):
overfitting-prevention (PRIMARY)
data-augmentation-strategies (HIGHLY RECOMMENDED)
hyperparameter-tuning
Decision factors:

- Small dataset → data-augmentation-strategies is often the highest-leverage fix
- Large train/val gap with little or no regularization → start with overfitting-prevention basics
- Regularization already applied → hyperparameter-tuning to balance its strength
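One concrete way to act on the train/val gap while the specialist skills are applied is early stopping on validation loss. A minimal self-contained sketch; the patience value is an illustrative assumption:

```python
class EarlyStopping:
    """Stop training when validation loss hasn't improved for `patience` epochs."""

    def __init__(self, patience: int = 5, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        """Call once per epoch; returns True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```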
Keywords: slow training, low GPU utilization, takes too long, time per epoch
CRITICAL: Diagnose bottleneck before routing
Diagnostic questions:

- What is GPU utilization during training (consistently low, or high)?
- How heavy is CPU-side preprocessing/augmentation?
- Are you memory-limited (OOM when you try to raise the batch size)?
Route based on bottleneck:
| GPU Utilization | Likely Cause | Route To | Why |
|---|---|---|---|
| < 50% consistently | Data loading bottleneck OR CPU preprocessing | pytorch-engineering (data loading, profiling) | Not compute-bound, infrastructure issue |
| High (> 80%) but still slow | Batch size too small OR need distributed training | batch-size-and-memory-tradeoffs + possibly pytorch-engineering (distributed) | Compute-bound, need scaling |
| High + heavy augmentation | Augmentation overhead | data-augmentation-strategies (optimization) + pytorch-engineering (profiling) | Augmentation CPU cost |
| Memory-limited batch size | Can't increase batch size due to OOM | batch-size-and-memory-tradeoffs (gradient accumulation) + pytorch-engineering (memory) | Memory constraints limiting throughput |
Cross-pack boundaries:

- Profiling, data loading, distributed training, and memory tooling → pytorch-engineering
- Batch size, gradient accumulation, and augmentation strategy → this pack
Key principle: Profile FIRST before optimizing. Low GPU utilization means compute isn't the bottleneck, so tuning the training algorithm is the wrong optimization target.
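A quick way to check data-bound vs. compute-bound without a full profiler is to time the dataloader separately from the training step. A rough sketch, assuming a CUDA device and hypothetical `loader`/`train_step` names:

```python
import time
import torch

def profile_epoch(loader, train_step, device="cuda"):
    """Split wall time into data-loading vs compute to locate the bottleneck."""
    data_time = compute_time = 0.0
    t0 = time.perf_counter()
    for batch in loader:
        t1 = time.perf_counter()
        data_time += t1 - t0          # time blocked waiting for the next batch
        train_step(batch)
        if device == "cuda":
            torch.cuda.synchronize()  # flush async GPU work into wall time
        t0 = time.perf_counter()
        compute_time += t0 - t1       # forward/backward/step time
    print(f"data: {data_time:.1f}s  compute: {compute_time:.1f}s")
```

If data time dominates, route to pytorch-engineering (data loading); if compute dominates, route to batch-size-and-memory-tradeoffs.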
Direct hyperparameter questions route to specific skills:
| Question | Route To | Examples |
|---|---|---|
| "Which optimizer?" | optimization-algorithms | SGD vs Adam vs AdamW, momentum, betas |
| "Which learning rate?" | learning-rate-scheduling | Initial LR, warmup, schedule type |
| "Which batch size?" | batch-size-and-memory-tradeoffs | Batch size effects on convergence and speed |
| "Which loss function?" | loss-functions-and-objectives | Cross-entropy vs focal vs custom |
| "How to prevent overfitting?" | overfitting-prevention | Dropout, weight decay, early stopping |
| "Which augmentation?" | data-augmentation-strategies | Type and strength of augmentation |
| "How to tune hyperparameters?" | hyperparameter-tuning | Search strategies, AutoML |
| "How to track experiments?" | experiment-tracking | MLflow, W&B, TensorBoard |
For new project setup, route to MULTIPLE skills in sequence:

1. optimization-algorithms (choose the optimizer)
2. learning-rate-scheduling (initial LR, warmup, schedule)
3. batch-size-and-memory-tradeoffs (batch size for your hardware)
4. loss-functions-and-objectives (match the objective to the task)
5. experiment-tracking + training-loop-architecture (infrastructure)
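The first two steps of that sequence usually come together. A minimal sketch of one common pairing (AdamW with cosine decay); the model and all numeric values are illustrative assumptions, not recommendations:

```python
import torch

model = torch.nn.Linear(128, 10)  # stand-in model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(100):
    # ... training steps for this epoch ...
    scheduler.step()  # advance the schedule once per epoch
```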
Keywords: experiment tracking, MLflow, wandb, tensorboard, compare runs, log metrics
Route to:
experiment-tracking (PRIMARY)
hyperparameter-tuning (if systematic search)
training-loop-architecture (for integration)
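For a sense of what the primary skill covers, minimal metric logging with TensorBoard (one of the tools listed above) looks like this; the run name and loss values are hypothetical stand-ins:

```python
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/baseline-adamw")  # hypothetical run name
for step in range(1000):
    loss = 1.0 / (step + 1)  # stand-in for the real training loss
    writer.add_scalar("train/loss", loss, step)
writer.close()
```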
Route to (in order):

1. optimization-algorithms + learning-rate-scheduling + batch-size-and-memory-tradeoffs (foundation)
2. loss-functions-and-objectives (objective)
3. experiment-tracking + training-loop-architecture (infrastructure)
Why this order: Foundation (optimizer/LR/batch) → Objective (loss) → Infrastructure (tracking/loop)
Route to (diagnose first, then in order):

1. gradient-management (stabilize gradients)
2. learning-rate-scheduling (adjust LR and warmup)
3. optimization-algorithms (only then consider changing optimizer)
Why this order: Stability (gradients) → Adaptation (LR) → Algorithm (optimizer)
Common mistakes:

- Lowering the LR before checking gradient norms (masks explosion instead of fixing it)
- Switching optimizers before stabilizing gradients
- Ignoring mixed-precision gradient scaling as a possible cause
Route to (multi-pronged approach):

1. overfitting-prevention (regularization techniques)
2. data-augmentation-strategies (more effective data)
3. hyperparameter-tuning (balance regularization strength)
Why all three: Overfitting needs comprehensive strategy, not single technique.
Prioritization:

- Augmentation first when the dataset is small
- Regularization (dropout, weight decay, early stopping) next
- Tuning last, to balance the combined strength
Route to:

1. batch-size-and-memory-tradeoffs (PRIMARY)
2. pytorch-engineering (profiling, memory tooling, distributed options)
Why: Speed and memory often coupled—need to balance batch size, memory, and throughput.
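Gradient accumulation is the usual bridge when memory caps the batch size: accumulate gradients over several small batches before stepping. A minimal sketch, with hypothetical model/loader names passed in as arguments:

```python
def train_accumulated(model, optimizer, loss_fn, loader, accum_steps: int = 4):
    """Simulate a larger batch by accumulating gradients over accum_steps batches."""
    optimizer.zero_grad()
    for i, (inputs, targets) in enumerate(loader):
        loss = loss_fn(model(inputs), targets) / accum_steps  # keep gradient scale
        loss.backward()                                       # grads accumulate in .grad
        if (i + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```

The effective batch size becomes the loader's batch size times `accum_steps`, at the memory cost of a single small batch.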
Route to:

1. loss-functions-and-objectives (PRIMARY: design the objective)
2. gradient-management (verify gradients flow through the new loss)
3. hyperparameter-tuning (tune any new loss weights)
Why: Custom losses need careful design + gradient analysis + tuning.
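As an example of what "careful design + gradient analysis" means in practice, here is one common custom loss (focal loss for class imbalance) written so every op stays differentiable and numerically safe; the gamma value is an illustrative default:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma: float = 2.0):
    """Focal loss: down-weights easy examples via (1 - p)^gamma.

    logits: (N, C) raw scores; targets: (N,) class indices.
    """
    log_p = F.log_softmax(logits, dim=-1)  # numerically stable log-probs
    log_pt = log_p.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    pt = log_pt.exp()
    return (-((1 - pt) ** gamma) * log_pt).mean()
```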
When the symptom is unclear, ASK ONE diagnostic question before routing:
| Vague Query | Clarifying Question | Why |
|---|---|---|
| "Fix my training" | "What specific issue? Not learning? Unstable? Overfitting? Too slow?" | 4+ different routing paths |
| "Improve model" | "Improve what? Training speed? Accuracy? Generalization?" | Different optimization targets |
| "Training not working well" | "What's 'not working'? Loss behavior? Accuracy? Convergence speed?" | Need specific symptoms |
| "Optimize hyperparameters" | "Which hyperparameters? All of them? Specific ones like LR?" | Specific vs broad search |
| "Model performs poorly" | "Training accuracy poor or validation accuracy poor or both?" | Underfitting vs overfitting |
Never guess when ambiguous. Ask once, route accurately.
| Symptom | Wrong Route | Correct Route | Why |
|---|---|---|---|
| "Training slow" | batch-size-and-memory | ASK: Check GPU utilization first | Might be data loading, not compute |
| "Not learning" | optimization-algorithms | ASK: Diagnose loss behavior | Could be LR, gradients, loss function |
| "Loss NaN" | learning-rate-scheduling | gradient-management FIRST | Gradient explosion most common cause |
| "Overfitting" | overfitting-prevention only | overfitting-prevention + data-augmentation | Need multi-pronged approach |
| "Need to speed up training" | optimization-algorithms | Profile first (pytorch-engineering) | Don't optimize without measuring |
| "Which optimizer for transformer" | neural-architectures | optimization-algorithms | Optimizer choice, not architecture |
Key principle: Diagnosis before solutions, clarification before routing, multi-skill for complex issues.
Skip training-optimization when:
| Symptom | Wrong Pack | Correct Pack | Why |
|---|---|---|---|
| "CUDA out of memory" | training-optimization | pytorch-engineering | Infrastructure issue, not training algorithm |
| "DDP not working" | training-optimization | pytorch-engineering | Distributed setup, not hyperparameters |
| "Which architecture to use" | training-optimization | neural-architectures | Architecture choice precedes training |
| "Model won't load" | training-optimization | pytorch-engineering | Checkpointing/serialization issue |
| "Inference too slow" | training-optimization | ml-production | Production optimization, not training |
| "How to deploy model" | training-optimization | ml-production | Deployment concern |
| "RL exploration issues" | training-optimization | deep-rl | RL-specific training concern |
| "RLHF for LLM" | training-optimization | llm-specialist | LLM-specific technique |
Training-optimization pack is for: Framework-agnostic training algorithms, hyperparameters, optimization techniques, and training strategies that apply across architectures.
Boundaries: training-optimization owns algorithms and hyperparameters; pytorch-engineering owns infrastructure; neural-architectures owns model selection; ml-production owns deployment; deep-rl and llm-specialist own their domain-specific training concerns.
If you catch yourself about to:

- Give direct training advice without identifying specific symptoms
- Suggest an optimizer or LR change without seeing the loss behavior
- Recommend a single fix for a symptom with multiple likely causes
- Answer an infrastructure, architecture, or deployment question from this pack
All of these mean: You're about to give incomplete advice. Route to specialist instead.
| Rationalization | Reality | What To Do Instead |
|---|---|---|
| "User is rushed, skip diagnostic questions" | Diagnosis takes 30 seconds, wrong route wastes 10+ minutes | Ask ONE quick diagnostic question: "Is loss flat, oscillating, or NaN?" |
| "Symptoms are obvious, route immediately" | Symptoms often have multiple causes | Ask clarifying question to eliminate ambiguity |
| "User suggested optimizer change" | User self-diagnosis can be wrong | "What loss behavior are you seeing?" to verify root cause |
| "Expert user doesn't need routing" | Expert users benefit from specialist skills too | Route based on symptoms, not user sophistication |
| "Just a quick question" | Quick questions deserve correct answers | Route to specialist—they have quick diagnostics too |
| "Single solution will fix it" | Training issues often multi-causal | Consider multi-skill routing for complex symptoms |
| "Time pressure means guess quickly" | Wrong guess wastes MORE time | Fast systematic diagnosis faster than trial-and-error |
| "They already tried X" | Maybe tried X wrong or X wasn't the issue | Route to specialist to verify X was done correctly |
| "Too complex to route" | Complex issues need specialists MORE | Use multi-skill routing for complex scenarios |
| "Direct answer is helpful" | Wrong direct answer wastes time | Routing IS the helpful answer |
If you catch yourself thinking ANY of these, STOP and route to specialist or ask diagnostic question.
| Pressure | Wrong Response | Correct Response |
|---|---|---|
| "Demo tomorrow, need quick fix" | Give untested suggestions | "Fast systematic diagnosis ensures demo success: [question]" |
| "Production training failing" | Panic and guess | "Quick clarification prevents longer outage: [question]" |
| "Just tell me which optimizer" | "Use Adam" | "30-second clarification ensures right choice: [question]" |
Emergency protocol: Fast clarification (30 sec) → Correct routing (60 sec) → Specialist handles efficiently
| Pressure | Wrong Response | Correct Response |
|---|---|---|
| "Senior said use SGD" | Accept without verification | "To apply SGD effectively, let me verify the symptoms: [question]" |
| "PM wants optimizer change" | Change without diagnosis | "Let's diagnose to ensure optimizer is the issue: [question]" |
Authority protocol: Acknowledge → Verify symptoms → Route based on evidence, not opinion
| Pressure | Wrong Response | Correct Response |
|---|---|---|
| "I think it's the optimizer" | Discuss optimizer choice | "What loss behavior makes you think optimizer? [diagnostic]" |
| "Obviously need to clip gradients" | Implement clipping | "What symptoms suggest gradient issues? [verify]" |
Verification protocol: User attribution is hypothesis, not diagnosis. Verify with symptoms.
Before giving ANY training advice or routing, ask yourself:
❓ Did I identify specific symptoms?
❓ Is this symptom in my routing table?
❓ Am I about to give direct advice?
❓ Could this symptom have multiple causes?
❓ Is this training-optimization or another pack?
❓ Am I feeling pressure to skip routing?
If you failed ANY check, do NOT give direct advice. Route to specialist or ask clarifying question.
After routing, load the appropriate specialist skill for detailed guidance:
| Symptom | Primary Skill | Secondary Skills | Diagnostic Question |
|---|---|---|---|
| Loss flat/stuck | learning-rate-scheduling | optimization-algorithms | "Flat from start or plateaued later?" |
| Loss NaN/Inf | gradient-management | learning-rate-scheduling, loss-functions | "When does NaN appear?" |
| Overfitting | overfitting-prevention | data-augmentation, hyperparameter-tuning | "Dataset size? Current regularization?" |
| Training slow | batch-size-and-memory OR pytorch-engineering | data-augmentation | "GPU utilization percentage?" |
| Oscillating loss | learning-rate-scheduling | gradient-management | "What's your current LR?" |
| Which optimizer | optimization-algorithms | learning-rate-scheduling | Task type, architecture, dataset |
| Which LR | learning-rate-scheduling | optimization-algorithms | Optimizer, task, current symptoms |
| Track experiments | experiment-tracking | hyperparameter-tuning, training-loop | Tools preference, scale of experiments |
| Poor generalization | overfitting-prevention | data-augmentation | Train vs val accuracy gap |
| Convergence issues | gradient-management | learning-rate-scheduling, optimization-algorithms | Gradient norms, loss curve |
Cross-pack boundaries: pytorch-engineering (infrastructure), neural-architectures (architecture selection), deep-rl (RL-specific), llm-specialist (LLM-specific), ml-production (deployment).
This meta-skill: Diagnose symptoms → Route to specialists → Resist pressure to give direct advice. Training issues often have multiple causes. Clarify symptoms, route to specialists. Wrong routing wastes more time than asking one clarifying question. Route based on symptoms, not guesses.