Routes to appropriate deep-RL skills based on problem type and algorithm family
Routes to 13 specialized deep RL skills based on your problem type. Use when implementing RL algorithms, training agents, or debugging sequential decision-making problems. Triggered by questions about Q-learning, PPO, SAC, offline datasets, multi-agent systems, or agents not learning.
Install with `/plugin marketplace add tachyon-beep/skillpacks`, then `/plugin install yzmir-deep-rl@foundryside-marketplace`.

This skill inherits all available tools. When active, it can use any tool Claude has access to.
Reference sheets in this pack: actor-critic-methods.md, counterfactual-reasoning.md, exploration-strategies.md, model-based-rl.md, multi-agent-rl.md, multi-skill-scenarios.md, offline-rl.md, policy-gradient-methods.md, reward-shaping-engineering.md, rl-debugging.md, rl-environments.md, rl-evaluation.md, rl-foundations.md, value-based-methods.md.

Invoke this meta-skill when you encounter questions about implementing RL algorithms, training agents, or debugging sequential decision-making problems: Q-learning, PPO, SAC, offline datasets, multi-agent systems, or agents that are not learning.
This is the entry point for the deep-rl pack. It routes to 13 specialized skills based on problem characteristics.
IMPORTANT: All reference sheets are located in the SAME DIRECTORY as this SKILL.md file.
When this skill is loaded from:
skills/using-deep-rl/SKILL.md
Reference sheets like rl-foundations.md are at:
skills/using-deep-rl/rl-foundations.md
NOT at:
skills/rl-foundations.md ← WRONG PATH
Problem type determines algorithm family. The correct approach depends on the action space (discrete vs continuous), the data regime (online interaction vs a fixed offline dataset), and constraints such as sample efficiency and stability. Always clarify the problem BEFORE suggesting algorithms.

Why foundations first: you cannot implement algorithms correctly without understanding MDPs, Bellman equations, and the exploration-exploitation tradeoff.
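For orientation, the Bellman optimality equation that value-based methods approximate (standard textbook form, included here only as a reminder, not something this pack redefines):

$$
Q^*(s, a) = \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\left[\, r(s, a) + \gamma \max_{a'} Q^*(s', a') \,\right]
$$

If this equation or the exploration-exploitation tradeoff is unfamiliar, route to rl-foundations before anything else.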
For discrete action spaces:

| Condition | Route To | Why |
|---|---|---|
| Small action space (< 100) + online | value-based-methods (DQN) | Q-networks excel at discrete |
| Large action space OR need policy flexibility | policy-gradient-methods (PPO) | Scales to larger spaces |
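A minimal sketch of checking the "small discrete space" condition before routing, assuming Gymnasium is installed; CartPole-v1 is just a stand-in for the user's environment:

```python
# Sketch: inspect the action space before committing to an algorithm family.
import gymnasium as gym
from gymnasium.spaces import Discrete

env = gym.make("CartPole-v1")  # placeholder environment
space = env.action_space

if isinstance(space, Discrete):
    if space.n < 100:
        print("Small discrete space -> value-based-methods (DQN)")
    else:
        print("Large discrete space -> policy-gradient-methods (PPO)")
else:
    print("Continuous (Box) space -> actor-critic-methods (SAC/TD3) or PPO")
```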
For continuous action spaces:

| Condition | Route To | Why |
|---|---|---|
| Sample efficiency critical | actor-critic-methods (SAC) | Off-policy, automatic entropy |
| Stability critical | actor-critic-methods (TD3) | Deterministic, handles overestimation |
| Simplicity preferred | policy-gradient-methods (PPO) | On-policy, simpler |
CRITICAL: NEVER suggest DQN for continuous actions. DQN requires discrete actions.
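To illustrate the continuous-control routes, a sketch using Stable-Baselines3 (an assumed dependency, not something this pack requires); Pendulum-v1 is a placeholder environment:

```python
# Sketch: the same continuous-control task under SAC (off-policy, sample-efficient)
# and PPO (on-policy, simpler). Assumes gymnasium + stable-baselines3.
import gymnasium as gym
from stable_baselines3 import PPO, SAC

env = gym.make("Pendulum-v1")  # Box (continuous) action space

sac = SAC("MlpPolicy", env)    # replay buffer reuse -> better sample efficiency
sac.learn(total_timesteps=10_000)

ppo = PPO("MlpPolicy", env)    # discards data after each update, but simpler to tune
ppo.learn(total_timesteps=10_000)

# DQN("MlpPolicy", env) would fail here: DQN only supports Discrete action spaces.
```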
Fixed dataset with no environment interaction → offline-rl (CQL, IQL)
Red Flag: If user has fixed dataset and suggests DQN/PPO/SAC, STOP and route to offline-rl. Standard algorithms assume online interaction and will fail.
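To make the distribution-shift point concrete, here is a rough sketch of the conservative term CQL adds on top of an ordinary TD loss (discrete-action case; `q_net` and the tensors are assumed inputs, and this is not a full CQL implementation):

```python
# Sketch of a CQL-style conservative penalty: push Q-values down on all actions
# and up on actions actually present in the offline dataset.
import torch
import torch.nn.functional as F

def cql_loss(q_net, states, actions, td_targets, alpha=1.0):
    q_all = q_net(states)                                      # (batch, n_actions)
    q_data = q_all.gather(1, actions.unsqueeze(1)).squeeze(1)  # Q(s, a) for dataset actions
    td_loss = F.mse_loss(q_data, td_targets)                   # standard TD regression
    conservative = (torch.logsumexp(q_all, dim=1) - q_data).mean()
    return td_loss + alpha * conservative
```

A plain DQN trained on the same batch lacks the conservative term and can badly overestimate Q-values for actions the dataset never contains, which is why the red flag above routes to offline-rl.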
| Problem | Route To | Key Consideration |
|---|---|---|
| Multiple agents | multi-agent-rl | Non-stationarity, credit assignment |
| Sample efficiency extreme | model-based-rl | Learns environment model |
| Counterfactual/causal | counterfactual-reasoning | HER, off-policy evaluation |
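For the counterfactual row, a sketch of the core HER trick (hindsight relabeling); the transition fields and `compute_reward` are hypothetical names for whatever your goal-conditioned setup uses:

```python
# Sketch of hindsight experience replay (HER) relabeling: treat the outcome the
# agent actually achieved as if it had been the goal, turning failed episodes
# into useful (counterfactual) training data.
def relabel_with_hindsight(episode, compute_reward):
    final_achieved = episode[-1]["achieved_goal"]
    relabeled = []
    for transition in episode:
        new_reward = compute_reward(transition["achieved_goal"], final_achieved)
        relabeled.append({**transition, "goal": final_achieved, "reward": new_reward})
    return relabeled
```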
| Problem | Route To | Why |
|---|---|---|
| "Not learning" / reward flat | rl-debugging FIRST | 80% of issues are bugs, not algorithms |
| Exploration problems | exploration-strategies | Curiosity, RND, intrinsic motivation |
| Reward design issues | reward-shaping | Potential-based shaping (see the sketch after this table), inverse RL |
| Environment setup | rl-environments | Gym API, wrappers, vectorization |
| Evaluation questions | rl-evaluation | Deterministic vs stochastic, multiple seeds |
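For the reward-design row, a sketch of potential-based shaping, which preserves the optimal policy (Ng et al., 1999); `phi` is an assumed, user-supplied potential function such as negative distance to goal:

```python
# Sketch: add F = gamma * phi(s') - phi(s) to the environment reward.
# Because F is potential-based, the optimal policy is unchanged.
def shaped_reward(reward, phi, s, s_next, gamma=0.99):
    return reward + gamma * phi(s_next) - phi(s)
```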
Red Flag: If user immediately wants to change algorithms because "it's not learning," route to rl-debugging first.
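A minimal sketch of the kind of first-pass checks rl-debugging asks for before any algorithm change, assuming a Gymnasium-style environment with flat array observations:

```python
# Sketch: establish a random-policy baseline and eyeball reward/observation
# scales before blaming the algorithm.
import numpy as np

def sanity_check(env, episodes=5):
    returns, observations = [], []
    for _ in range(episodes):
        obs, _ = env.reset()
        done, total = False, 0.0
        while not done:
            obs, reward, terminated, truncated, _ = env.step(env.action_space.sample())
            done = terminated or truncated
            total += reward
            observations.append(np.asarray(obs, dtype=np.float64).ravel())
        returns.append(total)
    obs_arr = np.stack(observations)
    print("random-policy return:", np.mean(returns), "+/-", np.std(returns))
    print("mean reward per step (watch for extreme scales):", np.sum(returns) / len(observations))
    print("observation mean/std (normalize if extreme):", obs_arr.mean(), obs_arr.std())
```

Run it across multiple seeds; a trained policy that cannot beat this random baseline usually indicates a bug, not a bad algorithm choice.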
| Rationalization | Reality | Counter-Guidance |
|---|---|---|
| "Just use PPO for everything" | PPO is general but not optimal for all cases | Clarify: discrete or continuous? Sample efficiency constraints? |
| "DQN for continuous actions" | DQN requires discrete actions | Use SAC or TD3 for continuous |
| "Offline RL is just RL on a dataset" | Offline has distribution shift, needs special algorithms | Route to offline-rl for CQL, IQL |
| "More data always helps" | Sample efficiency and distribution matter | Off-policy vs on-policy matters |
| "My algorithm isn't learning, I need a better one" | Usually bugs, not algorithm | Route to rl-debugging first |
| "I'll discretize continuous actions for DQN" | Discretization loses precision, explodes action space | Use actor-critic-methods |
| "Epsilon-greedy is enough for exploration" | Complex environments need sophisticated exploration | Route to exploration-strategies |
| "I'll just increase the reward when it doesn't learn" | Reward scaling breaks learning | Route to rl-debugging |
| "I can reuse online RL code for offline data" | Offline needs conservative algorithms | Route to offline-rl |
| "Test reward lower than training = overfitting" | Exploration vs exploitation difference | Route to rl-evaluation |
Watch for the rationalizations and red flags above as signs of incorrect routing.
If any red flag triggered → STOP → Ask diagnostic questions → Route correctly
START: RL problem
├─ Need foundations? → rl-foundations
│
├─ DISCRETE actions?
│ ├─ Small space + online → value-based-methods (DQN)
│ └─ Large space → policy-gradient-methods (PPO)
│
├─ CONTINUOUS actions?
│ ├─ Sample efficiency → actor-critic-methods (SAC)
│ ├─ Stability → actor-critic-methods (TD3)
│ └─ Simplicity → policy-gradient-methods (PPO)
│
├─ OFFLINE data? → offline-rl (CQL, IQL) [CRITICAL]
│
├─ MULTI-AGENT? → multi-agent-rl
│
├─ Sample efficiency EXTREME? → model-based-rl
│
├─ COUNTERFACTUAL? → counterfactual-reasoning
│
└─ DEBUGGING?
├─ Not learning → rl-debugging
├─ Exploration → exploration-strategies
├─ Reward design → reward-shaping
├─ Environment → rl-environments
└─ Evaluation → rl-evaluation
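The same decision tree as a sketch in code; the dictionary keys are hypothetical intake answers, not an API this pack defines, and offline data is checked before the action-space branches per the red flag above:

```python
# Sketch: route an RL problem description to a specialist skill.
def route(problem: dict) -> str:
    if problem.get("needs_foundations"):
        return "rl-foundations"
    if problem.get("debugging"):
        return {
            "not_learning": "rl-debugging",
            "exploration": "exploration-strategies",
            "reward_design": "reward-shaping",
            "environment": "rl-environments",
            "evaluation": "rl-evaluation",
        }[problem["debug_topic"]]
    if problem.get("offline_dataset"):        # CRITICAL: overrides the action-space branches
        return "offline-rl"
    if problem.get("multi_agent"):
        return "multi-agent-rl"
    if problem.get("extreme_sample_efficiency"):
        return "model-based-rl"
    if problem.get("counterfactual"):
        return "counterfactual-reasoning"
    if problem["action_space"] == "discrete":
        return "value-based-methods" if problem.get("small_action_space") else "policy-gradient-methods"
    # Continuous actions: SAC for sample efficiency, TD3 for stability, PPO for simplicity.
    return "policy-gradient-methods" if problem.get("prefers_simplicity") else "actor-critic-methods"
```

For example, `route({"action_space": "continuous", "prefers_simplicity": True})` returns "policy-gradient-methods" (PPO).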
Requests that belong to a different pack entirely:

| User Request | Correct Pack | Reason |
|---|---|---|
| "Train classifier on labeled data" | training-optimization | Supervised learning |
| "Design transformer architecture" | neural-architectures | Architecture design |
| "Deploy model to production" | ml-production | Deployment |
| "Fine-tune LLM with RLHF" | llm-specialist | LLM-specific |
See multi-skill-scenarios.md for detailed routing sequences.

After routing, load the appropriate specialist skill for detailed guidance.