Routes to appropriate deep-RL skills based on problem type and algorithm family
Routes to 13 specialized deep RL skills based on your problem type. Use when implementing RL algorithms, training agents, or debugging sequential decision-making problems. Triggered by questions about Q-learning, PPO, SAC, offline datasets, multi-agent systems, or agents not learning.
Install with `/plugin marketplace add tachyon-beep/skillpacks`, then `/plugin install yzmir-deep-rl@foundryside-marketplace`.

This skill inherits all available tools. When active, it can use any tool Claude has access to.
Reference sheets in this pack: actor-critic-methods.md, counterfactual-reasoning.md, exploration-strategies.md, model-based-rl.md, multi-agent-rl.md, multi-skill-scenarios.md, offline-rl.md, policy-gradient-methods.md, reward-shaping-engineering.md, rl-debugging.md, rl-environments.md, rl-evaluation.md, rl-foundations.md, value-based-methods.md.

Invoke this meta-skill when you encounter questions about implementing RL algorithms, training agents, or debugging sequential decision-making problems: Q-learning, PPO, SAC, offline datasets, multi-agent systems, or agents that are not learning.
This is the entry point for the deep-rl pack. It routes to 13 specialized skills based on problem characteristics.
IMPORTANT: All reference sheets are located in the SAME DIRECTORY as this SKILL.md file.
When this skill is loaded from:
skills/using-deep-rl/SKILL.md
Reference sheets like rl-foundations.md are at:
skills/using-deep-rl/rl-foundations.md
NOT at:
skills/rl-foundations.md ← WRONG PATH
Problem type determines algorithm family. The correct approach depends on the action space (discrete vs continuous), the data regime (online interaction vs a fixed offline dataset), and constraints such as sample efficiency and stability. Always clarify the problem BEFORE suggesting algorithms.

Why foundations first: you cannot implement algorithms correctly without understanding MDPs, Bellman equations, and the exploration-exploitation tradeoff.
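For orientation, the Bellman optimality equation that value-based methods approximate (standard textbook form, included here only as a reminder, not something this pack redefines):

$$
Q^*(s, a) = \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\left[\, r(s, a) + \gamma \max_{a'} Q^*(s', a') \,\right]
$$

If this equation or the exploration-exploitation tradeoff is unfamiliar, route to rl-foundations before anything else.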
For discrete action spaces:

| Condition | Route To | Why |
|---|---|---|
| Small action space (< 100) + online | value-based-methods (DQN) | Q-networks excel at discrete |
| Large action space OR need policy flexibility | policy-gradient-methods (PPO) | Scales to larger spaces |
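A minimal sketch of checking the "small discrete space" condition before routing, assuming Gymnasium is installed; CartPole-v1 is just a stand-in for the user's environment:

```python
# Sketch: inspect the action space before committing to an algorithm family.
import gymnasium as gym
from gymnasium.spaces import Discrete

env = gym.make("CartPole-v1")  # placeholder environment
space = env.action_space

if isinstance(space, Discrete):
    if space.n < 100:
        print("Small discrete space -> value-based-methods (DQN)")
    else:
        print("Large discrete space -> policy-gradient-methods (PPO)")
else:
    print("Continuous (Box) space -> actor-critic-methods (SAC/TD3) or PPO")
```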
For continuous action spaces:

| Condition | Route To | Why |
|---|---|---|
| Sample efficiency critical | actor-critic-methods (SAC) | Off-policy, automatic entropy |
| Stability critical | actor-critic-methods (TD3) | Deterministic, handles overestimation |
| Simplicity preferred | policy-gradient-methods (PPO) | On-policy, simpler |
CRITICAL: NEVER suggest DQN for continuous actions. DQN requires discrete actions.
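To illustrate the continuous-control routes, a sketch using Stable-Baselines3 (an assumed dependency, not something this pack requires); Pendulum-v1 is a placeholder environment:

```python
# Sketch: the same continuous-control task under SAC (off-policy, sample-efficient)
# and PPO (on-policy, simpler). Assumes gymnasium + stable-baselines3.
import gymnasium as gym
from stable_baselines3 import PPO, SAC

env = gym.make("Pendulum-v1")  # Box (continuous) action space

sac = SAC("MlpPolicy", env)    # replay buffer reuse -> better sample efficiency
sac.learn(total_timesteps=10_000)

ppo = PPO("MlpPolicy", env)    # discards data after each update, but simpler to tune
ppo.learn(total_timesteps=10_000)

# DQN("MlpPolicy", env) would fail here: DQN only supports Discrete action spaces.
```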
Fixed dataset with no environment interaction → offline-rl (CQL, IQL)
Red Flag: If user has fixed dataset and suggests DQN/PPO/SAC, STOP and route to offline-rl. Standard algorithms assume online interaction and will fail.
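To make the distribution-shift point concrete, here is a rough sketch of the conservative term CQL adds on top of an ordinary TD loss (discrete-action case; `q_net` and the tensors are assumed inputs, and this is not a full CQL implementation):

```python
# Sketch of a CQL-style conservative penalty: push Q-values down on all actions
# and up on actions actually present in the offline dataset.
import torch
import torch.nn.functional as F

def cql_loss(q_net, states, actions, td_targets, alpha=1.0):
    q_all = q_net(states)                                      # (batch, n_actions)
    q_data = q_all.gather(1, actions.unsqueeze(1)).squeeze(1)  # Q(s, a) for dataset actions
    td_loss = F.mse_loss(q_data, td_targets)                   # standard TD regression
    conservative = (torch.logsumexp(q_all, dim=1) - q_data).mean()
    return td_loss + alpha * conservative
```

A plain DQN trained on the same batch lacks the conservative term and can badly overestimate Q-values for actions the dataset never contains, which is why the red flag above routes to offline-rl.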
| Problem | Route To | Key Consideration |
|---|---|---|
| Multiple agents | multi-agent-rl | Non-stationarity, credit assignment |
| Sample efficiency extreme | model-based-rl | Learns environment model |
| Counterfactual/causal | counterfactual-reasoning | HER, off-policy evaluation |
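For the counterfactual row, a sketch of the core HER trick (hindsight relabeling); the transition fields and `compute_reward` are hypothetical names for whatever your goal-conditioned setup uses:

```python
# Sketch of hindsight experience replay (HER) relabeling: treat the outcome the
# agent actually achieved as if it had been the goal, turning failed episodes
# into useful (counterfactual) training data.
def relabel_with_hindsight(episode, compute_reward):
    final_achieved = episode[-1]["achieved_goal"]
    relabeled = []
    for transition in episode:
        new_reward = compute_reward(transition["achieved_goal"], final_achieved)
        relabeled.append({**transition, "goal": final_achieved, "reward": new_reward})
    return relabeled
```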
| Problem | Route To | Why |
|---|---|---|
| "Not learning" / reward flat | rl-debugging FIRST | 80% of issues are bugs, not algorithms |
| Exploration problems | exploration-strategies | Curiosity, RND, intrinsic motivation |
| Reward design issues | reward-shaping | Potential-based shaping (see the sketch after this table), inverse RL |
| Environment setup | rl-environments | Gym API, wrappers, vectorization |
| Evaluation questions | rl-evaluation | Deterministic vs stochastic, multiple seeds |
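For the reward-design row, a sketch of potential-based shaping, which preserves the optimal policy (Ng et al., 1999); `phi` is an assumed, user-supplied potential function such as negative distance to goal:

```python
# Sketch: add F = gamma * phi(s') - phi(s) to the environment reward.
# Because F is potential-based, the optimal policy is unchanged.
def shaped_reward(reward, phi, s, s_next, gamma=0.99):
    return reward + gamma * phi(s_next) - phi(s)
```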
Red Flag: If user immediately wants to change algorithms because "it's not learning," route to rl-debugging first.
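A minimal sketch of the kind of first-pass checks rl-debugging asks for before any algorithm change, assuming a Gymnasium-style environment with flat array observations:

```python
# Sketch: establish a random-policy baseline and eyeball reward/observation
# scales before blaming the algorithm.
import numpy as np

def sanity_check(env, episodes=5):
    returns, observations = [], []
    for _ in range(episodes):
        obs, _ = env.reset()
        done, total = False, 0.0
        while not done:
            obs, reward, terminated, truncated, _ = env.step(env.action_space.sample())
            done = terminated or truncated
            total += reward
            observations.append(np.asarray(obs, dtype=np.float64).ravel())
        returns.append(total)
    obs_arr = np.stack(observations)
    print("random-policy return:", np.mean(returns), "+/-", np.std(returns))
    print("mean reward per step (watch for extreme scales):", np.sum(returns) / len(observations))
    print("observation mean/std (normalize if extreme):", obs_arr.mean(), obs_arr.std())
```

Run it across multiple seeds; a trained policy that cannot beat this random baseline usually indicates a bug, not a bad algorithm choice.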
| Rationalization | Reality | Counter-Guidance |
|---|---|---|
| "Just use PPO for everything" | PPO is general but not optimal for all cases | Clarify: discrete or continuous? Sample efficiency constraints? |
| "DQN for continuous actions" | DQN requires discrete actions | Use SAC or TD3 for continuous |
| "Offline RL is just RL on a dataset" | Offline has distribution shift, needs special algorithms | Route to offline-rl for CQL, IQL |
| "More data always helps" | Sample efficiency and distribution matter | Off-policy vs on-policy matters |
| "My algorithm isn't learning, I need a better one" | Usually bugs, not algorithm | Route to rl-debugging first |
| "I'll discretize continuous actions for DQN" | Discretization loses precision, explodes action space | Use actor-critic-methods |
| "Epsilon-greedy is enough for exploration" | Complex environments need sophisticated exploration | Route to exploration-strategies |
| "I'll just increase the reward when it doesn't learn" | Reward scaling breaks learning | Route to rl-debugging |
| "I can reuse online RL code for offline data" | Offline needs conservative algorithms | Route to offline-rl |
| "Test reward lower than training = overfitting" | Exploration vs exploitation difference | Route to rl-evaluation |
Watch for the rationalizations and red flags above as signs of incorrect routing.
If any red flag triggered → STOP → Ask diagnostic questions → Route correctly
START: RL problem
├─ Need foundations? → rl-foundations
│
├─ DISCRETE actions?
│ ├─ Small space + online → value-based-methods (DQN)
│ └─ Large space → policy-gradient-methods (PPO)
│
├─ CONTINUOUS actions?
│ ├─ Sample efficiency → actor-critic-methods (SAC)
│ ├─ Stability → actor-critic-methods (TD3)
│ └─ Simplicity → policy-gradient-methods (PPO)
│
├─ OFFLINE data? → offline-rl (CQL, IQL) [CRITICAL]
│
├─ MULTI-AGENT? → multi-agent-rl
│
├─ Sample efficiency EXTREME? → model-based-rl
│
├─ COUNTERFACTUAL? → counterfactual-reasoning
│
└─ DEBUGGING?
├─ Not learning → rl-debugging
├─ Exploration → exploration-strategies
├─ Reward design → reward-shaping
├─ Environment → rl-environments
└─ Evaluation → rl-evaluation
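The same decision tree as a sketch in code; the dictionary keys are hypothetical intake answers, not an API this pack defines, and offline data is checked before the action-space branches per the red flag above:

```python
# Sketch: route an RL problem description to a specialist skill.
def route(problem: dict) -> str:
    if problem.get("needs_foundations"):
        return "rl-foundations"
    if problem.get("debugging"):
        return {
            "not_learning": "rl-debugging",
            "exploration": "exploration-strategies",
            "reward_design": "reward-shaping",
            "environment": "rl-environments",
            "evaluation": "rl-evaluation",
        }[problem["debug_topic"]]
    if problem.get("offline_dataset"):        # CRITICAL: overrides the action-space branches
        return "offline-rl"
    if problem.get("multi_agent"):
        return "multi-agent-rl"
    if problem.get("extreme_sample_efficiency"):
        return "model-based-rl"
    if problem.get("counterfactual"):
        return "counterfactual-reasoning"
    if problem["action_space"] == "discrete":
        return "value-based-methods" if problem.get("small_action_space") else "policy-gradient-methods"
    # Continuous actions: SAC for sample efficiency, TD3 for stability, PPO for simplicity.
    return "policy-gradient-methods" if problem.get("prefers_simplicity") else "actor-critic-methods"
```

For example, `route({"action_space": "continuous", "prefers_simplicity": True})` returns "policy-gradient-methods" (PPO).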
Requests that belong to a different pack entirely:

| User Request | Correct Pack | Reason |
|---|---|---|
| "Train classifier on labeled data" | training-optimization | Supervised learning |
| "Design transformer architecture" | neural-architectures | Architecture design |
| "Deploy model to production" | ml-production | Deployment |
| "Fine-tune LLM with RLHF" | llm-specialist | LLM-specific |
See multi-skill-scenarios.md for detailed routing sequences.

After routing, load the appropriate specialist skill for detailed guidance.