Routes to appropriate PyTorch specialist skill based on symptoms and problem type
Routes to appropriate PyTorch specialist based on symptoms like OOM errors, slow training, or NaN losses. Use when encountering PyTorch-specific issues to match problems to experts for memory, distributed training, performance, debugging, or custom operations.
```
/plugin marketplace add tachyon-beep/skillpacks
/plugin install yzmir-pytorch-engineering@foundryside-marketplace
```

This skill inherits all available tools. When active, it can use any tool Claude has access to.
Reference sheets in this pack:

- checkpointing-and-reproducibility.md
- custom-autograd-functions.md
- debugging-techniques.md
- distributed-training-strategies.md
- mixed-precision-and-optimization.md
- module-design-patterns.md
- performance-profiling.md
- tensor-operations-and-memory.md

This meta-skill routes you to the right PyTorch specialist based on symptoms. PyTorch engineering problems fall into distinct categories that require specialized knowledge. Load this skill when you encounter PyTorch-specific issues but aren't sure which specialized skill to use.
Core Principle: Different PyTorch problems require different specialists. Match symptoms to the appropriate specialist skill. Don't guess at solutions—route to the expert.
Load this skill when:
Don't use for: Framework-agnostic ML theory, non-PyTorch frameworks, algorithm selection (use training-optimization or other packs)
IMPORTANT: All reference sheets are located in the SAME DIRECTORY as this SKILL.md file.
When this skill is loaded from:
skills/using-pytorch-engineering/SKILL.md
Reference sheets like tensor-operations-and-memory.md are at:
skills/using-pytorch-engineering/tensor-operations-and-memory.md
NOT at:
skills/tensor-operations-and-memory.md ← WRONG PATH
When you see a link like [tensor-operations-and-memory.md](tensor-operations-and-memory.md), read the file from the same directory as this SKILL.md.
Symptoms:
Route to: See tensor-operations-and-memory.md for memory management and optimization.
Why: Memory management is foundational. Must understand tensor lifecycles, efficient operations, and profiling before other optimizations.
Example queries:
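As one illustration of the kind of pitfall this specialist covers (a minimal sketch, not taken from the reference sheet): accumulating loss tensors instead of plain floats keeps every iteration's autograd graph alive and grows memory over time.

```python
import torch

model = torch.nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

losses = []
for _ in range(3):
    loss = (model(torch.randn(4, 10)) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    # .item() extracts a detached Python float; appending `loss` itself
    # would keep each iteration's entire computation graph alive.
    losses.append(loss.item())
```

The specialist sheet covers this class of lifecycle issue systematically, along with profiling to confirm where memory actually goes.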
Symptoms:
Route to: See module-design-patterns.md for model architecture and nn.Module patterns.
Why: Proper module design prevents bugs and enables features like checkpointing, distributed training, and serialization.
Example queries:
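A hedged sketch of the core pattern the specialist teaches: submodules must be registered through `nn.Module` containers (attributes or `nn.ModuleList`), otherwise `parameters()`, `state_dict()`, and `.to(device)` silently miss them. The `Block` class here is a hypothetical example, not from the reference sheet.

```python
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        # nn.ModuleList registers each layer; a plain Python list would hide
        # them from parameters(), state_dict(), and .to(device).
        self.layers = nn.ModuleList(nn.Linear(dim, dim) for _ in range(2))
        self.act = nn.ReLU()

    def forward(self, x):
        for layer in self.layers:
            x = self.act(layer(x))
        return x

block = Block(8)
```

Proper registration is what makes checkpointing, DDP wrapping, and serialization work later without surprises.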
Symptoms:
Route to: See distributed-training-strategies.md for DDP setup and multi-GPU training.
Why: Distributed training has unique setup requirements, synchronization patterns, and pitfalls. Generic advice breaks in distributed settings.
Example queries:
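For orientation, a minimal single-process DDP sketch using the CPU `gloo` backend (an assumption for illustration; real launches normally use `torchrun`, which sets the rendezvous environment variables per rank):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-node, world-size-1 setup; torchrun would normally set these.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = DDP(torch.nn.Linear(4, 4))  # gradients are all-reduced across ranks
out = model(torch.randn(2, 4))

dist.destroy_process_group()
```

Even this toy setup shows why generic advice breaks: process groups, rank-aware launching, and wrapped models all behave differently from single-GPU code.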
Symptoms:
Route to: See performance-profiling.md FIRST for systematic bottleneck identification.
Why: MUST profile before optimizing. Many "performance" problems are actually data loading or other non-compute bottlenecks. Profile to identify the real bottleneck.
After profiling, may route to:
Example queries:
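A quick sketch of what "profile first" looks like in practice, using `torch.profiler` (CPU-only here for simplicity):

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Sequential(torch.nn.Linear(256, 256), torch.nn.ReLU())
x = torch.randn(32, 256)

# Profile a few iterations before changing anything: the top rows show
# whether time actually goes to compute, data loading, or something else.
with profile(activities=[ProfilerActivity.CPU]) as prof:
    for _ in range(5):
        model(x)

report = prof.key_averages().table(sort_by="cpu_time_total", row_limit=5)
print(report)
```

If the top entries are data-loading or host-side ops rather than kernels, no amount of mixed precision will help — which is exactly why this skill routes to profiling first.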
Symptoms:
Route to: See mixed-precision-and-optimization.md for AMP and numerical stability.
Why: Mixed precision requires careful handling of numerical stability, gradient scaling, and operation compatibility.
Example queries:
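A hedged sketch of the AMP pattern the specialist covers: autocast for the forward pass plus a gradient scaler so fp16 gradients don't underflow. (The scaler is a transparent no-op when disabled, e.g. on CPU, where bf16 needs no scaling.)

```python
import torch

use_cuda = torch.cuda.is_available()
device = "cuda" if use_cuda else "cpu"
model = torch.nn.Linear(16, 1).to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
# Rescales the loss so fp16 gradients don't underflow to zero.
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)

x = torch.randn(8, 16, device=device)
with torch.autocast(device_type=device,
                    dtype=torch.float16 if use_cuda else torch.bfloat16):
    loss = model(x).pow(2).mean()

scaler.scale(loss).backward()
scaler.step(opt)   # silently skips the step if any gradient is inf/nan
scaler.update()
```

The subtle parts — unscaling before gradient clipping, ops that must stay in fp32 — are exactly what the specialist sheet addresses.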
Symptoms:
Route to: See debugging-techniques.md for systematic NaN/Inf debugging.
Why: NaN/Inf issues require systematic debugging—checking gradients layer by layer, identifying numerical instability sources, and targeted fixes.
Example queries:
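To give a flavor of "layer by layer" debugging, a minimal sketch (hypothetical helper, not from the reference sheet) combining forward hooks with anomaly detection:

```python
import torch

def nan_check_hook(name):
    # Forward hook that flags the first layer to emit a non-finite value.
    def hook(module, inputs, output):
        if torch.is_tensor(output) and not torch.isfinite(output).all():
            raise RuntimeError(f"non-finite output in layer {name!r}")
    return hook

model = torch.nn.Sequential(torch.nn.Linear(4, 4), torch.nn.ReLU())
for name, mod in model.named_modules():
    mod.register_forward_hook(nan_check_hook(name))

# Anomaly mode makes a backward-pass error point at the forward op that caused it.
with torch.autograd.set_detect_anomaly(True):
    loss = model(torch.randn(2, 4)).sum()
    loss.backward()
```

Hooks localize *where* the NaN appears; the specialist sheet then covers *why* (overflow, bad initialization, unstable ops) and the targeted fix.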
Symptoms:
Route to: See checkpointing-and-reproducibility.md for complete state management.
Why: Proper checkpointing requires saving ALL state (model, optimizer, scheduler, RNG states). Reproducibility requires deterministic operations and careful seed management.
Example queries:
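As a minimal sketch of "save ALL state" (filenames and the `epoch` field are illustrative assumptions):

```python
import torch

model = torch.nn.Linear(4, 2)
opt = torch.optim.Adam(model.parameters())
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=10)

# Save every piece of resumable state, not just the model weights.
ckpt = {
    "model": model.state_dict(),
    "optimizer": opt.state_dict(),
    "scheduler": sched.state_dict(),
    "epoch": 3,
    "torch_rng": torch.get_rng_state(),
}
torch.save(ckpt, "checkpoint.pt")

restored = torch.load("checkpoint.pt", weights_only=False)
model.load_state_dict(restored["model"])
opt.load_state_dict(restored["optimizer"])
sched.load_state_dict(restored["scheduler"])
torch.set_rng_state(restored["torch_rng"])
```

Skipping the optimizer or RNG state is the classic way a "resumed" run silently diverges from the original; the specialist sheet covers the full list, including CUDA and dataloader RNG.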
Symptoms:
Route to: See custom-autograd-functions.md for custom backward passes.
Why: Custom autograd functions require understanding the autograd engine, proper gradient computation, and numerical stability.
Example queries:
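A minimal hypothetical example of the pattern involved — a hand-written `torch.autograd.Function` verified against finite differences:

```python
import torch

class MyExp(torch.autograd.Function):
    # Toy example: exp(x) with an explicit backward pass.

    @staticmethod
    def forward(ctx, x):
        y = torch.exp(x)
        ctx.save_for_backward(y)   # stash what backward will need
        return y

    @staticmethod
    def backward(ctx, grad_output):
        (y,) = ctx.saved_tensors
        return grad_output * y     # d/dx exp(x) = exp(x)

# gradcheck compares the analytic backward against finite differences
x = torch.randn(5, dtype=torch.double, requires_grad=True)
ok = torch.autograd.gradcheck(MyExp.apply, (x,))
```

Even here the easy mistakes — forgetting `save_for_backward`, returning gradients in the wrong order, skipping `gradcheck` — are why the confident-sounding cases in particular should route to the specialist.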
Some scenarios require multiple specialized skills in sequence:
Distributed training with memory constraints:
Performance optimization:
Custom module with proper patterns:
Training instability with mixed precision:
Load in order of execution: Setup before optimization, diagnosis before fixes, structure before customization.
When symptom unclear, ASK ONE clarifying question:
"Fix my PyTorch training" → Ask: "What specific issue? Memory? Speed? Accuracy? NaN?"
"Optimize my model" → Ask: "Optimize what? Training speed? Memory usage? Inference?"
"Setup distributed training" → Ask: "Single-node multi-GPU or multi-node? What's not working?"
"Model not working" → Ask: "What's broken? Training fails? Wrong outputs? Performance?"
Never guess when ambiguous. Ask once, route accurately.
| Symptom | Wrong Route | Correct Route | Why |
|---|---|---|---|
| "Training slow" | mixed-precision | performance-profiling FIRST | Don't optimize without profiling |
| "OOM in distributed" | tensor-memory | distributed-strategies FIRST | Distributed setup might be wrong |
| "Custom layer slow" | performance-profiling | module-design-patterns FIRST | Design might be inefficient |
| "NaN with AMP" | mixed-precision | debugging-techniques FIRST | Debug NaN source, then fix AMP |
| "Save model" | module-design | checkpointing FIRST | Checkpointing is specialized topic |
Key principle: Diagnosis before solutions, setup before optimization, root cause before fixes.
If you catch yourself about to:
All of these mean: You're about to give incomplete advice. Route to the specialist instead.
| Excuse | Reality | What To Do |
|---|---|---|
| "User is rushed, skip routing" | Routing takes 5 seconds. Wrong fix wastes minutes. | Route anyway - specialists have quick diagnostics |
| "They already tried X" | May have done X wrong, misunderstood, or X wasn't applicable. | Route to specialist to verify X was done correctly |
| "Authority/senior says Y" | Authority can misdiagnose bottlenecks without profiling. | Profile first, authority second. Trust the skill's diagnostics over seniority. |
| "User is tired, don't ask" | Exhaustion makes clarity MORE important, not less. | Ask ONE clarifying question - saves time overall |
| "User suggested Z" | Z might not be best option for their specific case. | Route to specialist to evaluate if Z is right approach |
| "Too complex, can't route" | Complex scenarios need specialists MORE, not less. | Use cross-cutting section - route to multiple skills in sequence |
| "User sounds confident" | Confidence about custom autograd often precedes subtle bugs. | Route to specialist for systematic verification |
| "Just a quick question" | No such thing - symptoms need diagnosis. | Quick questions deserve correct answers - route properly |
| "Simple issue" | Simple symptoms can have complex root causes. | Route based on symptoms, not perceived complexity |
| "Direct answer is helpful" | Wrong direct answer wastes time and frustrates user. | Routing to specialist IS the helpful answer |
If you catch yourself thinking ANY of these, STOP and route to the specialist.
Before giving ANY PyTorch advice, ask yourself:
❓ Did I identify the symptom?
❓ Is this symptom in my routing table?
❓ Am I about to give advice directly?
❓ Is this a diagnosis issue or solution issue?
❓ Is query ambiguous?
❓ Am I feeling pressure to skip routing?
If you failed ANY check above, do NOT give direct advice. Route to specialist or ask clarifying question.
Skip PyTorch pack when:
PyTorch pack is for: PyTorch-specific implementation, infrastructure, debugging, and optimization issues.
Critical: Many PyTorch issues require diagnosis before solutions:
| Issue Type | Diagnosis Skill | Then Solution Skill |
|---|---|---|
| Performance | performance-profiling | mixed-precision / distributed |
| Memory | tensor-memory (profiling section) | tensor-memory (optimization) |
| NaN/Inf | debugging-techniques | mixed-precision / module-design |
| Training bugs | debugging-techniques | Appropriate fix |
If unclear what's wrong, route to diagnostic skill first.
After routing, load the appropriate specialist skill for detailed guidance:
Phase 1 - Standalone: PyTorch skills are self-contained
Future cross-references:
Current focus: Route within PyTorch pack only. Other packs handle other concerns.