```shell
npx claudepluginhub aznatkoiny/zai-skills --plugin AI-Toolkit
```

To install just this skill: `npx claudepluginhub u/[userId]/[slug]`
Reinforcement Learning best practices for Python using modern libraries (Stable-Baselines3, RLlib, Gymnasium). Use when:

- Implementing RL algorithms (PPO, SAC, DQN, TD3, A2C)
- Creating custom Gymnasium environments
- Training, debugging, or evaluating RL agents
- Setting up hyperparameter tuning for RL
- Deploying RL models to production
This skill uses the workspace's default tool permissions.
# Reinforcement Learning Best Practices
## Overview
This skill provides comprehensive guidance for implementing reinforcement learning in Python using the modern ecosystem (2024-2025). Gymnasium has replaced OpenAI Gym as the standard environment interface. Stable-Baselines3 (SB3) is recommended for prototyping, RLlib for production/distributed training, and CleanRL for research.
## When to Use
- Building RL agents for discrete or continuous control tasks
- Creating custom simulation environments
- Tuning hyperparameters for RL algorithms
- Debugging training issues (reward curves, policy collapse, numerical instability)
- Deploying trained policies to production
## Library Selection
| Library | Best For | Ease | Flexibility | Production |
|---|---|---|---|---|
| Stable-Baselines3 | Prototyping, learning | High | Medium | Good |
| RLlib | Production, distributed | Medium | High | Excellent |
| CleanRL | Research, understanding | High | Low | Poor |
| TorchRL | Custom implementations | Low | Highest | Good |
## Algorithm Decision Tree

```
Start
  |
  v
Action space type?
  |
  +-- Discrete ----> Sample efficiency critical?
  |                    |
  |                    +-- Yes --> DQN (or Double/Dueling DQN)
  |                    +-- No  --> Stability critical?
  |                                  |
  |                                  +-- Yes --> PPO
  |                                  +-- No  --> A2C (faster iterations)
  |
  +-- Continuous --> Sample efficiency critical?
                       |
                       +-- Yes --> SAC (auto entropy) or TD3
                       +-- No  --> PPO (more stable, less efficient)
```
Quick Selection Table:
| Scenario | Recommended | Why |
|---|---|---|
| Discrete actions, getting started | PPO | Stable, good defaults |
| Continuous control | SAC or TD3 | Sample efficient, handles continuous well |
| Sample efficiency critical | SAC, DQN | Off-policy, reuses experience |
| Stability critical | PPO | Trust region, consistent |
| High-dimensional obs (images) | PPO + CNN | Handles visual input well |
| Fast iteration needed | A2C | Simpler, faster per update |
## Quick Start with Stable-Baselines3

### Basic Training
```python
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

# Create vectorized environment (4 parallel envs)
env = make_vec_env("CartPole-v1", n_envs=4)

# Initialize and train
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=100_000)

# Save and load
model.save("ppo_cartpole")
loaded_model = PPO.load("ppo_cartpole")

# Evaluate
obs = env.reset()
for _ in range(1000):
    action, _ = loaded_model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)
```
### Custom Environment Template
```python
import gymnasium as gym
from gymnasium import spaces
import numpy as np

class CustomEnv(gym.Env):
    metadata = {"render_modes": ["human", "rgb_array"]}

    def __init__(self, render_mode=None):
        super().__init__()
        self.observation_space = spaces.Box(
            low=-np.inf, high=np.inf, shape=(4,), dtype=np.float32
        )
        self.action_space = spaces.Discrete(2)
        self.render_mode = render_mode

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.state = self.np_random.uniform(low=-0.05, high=0.05, size=(4,))
        return self.state.astype(np.float32), {}

    def step(self, action):
        # Implement environment dynamics here
        observation = self.state.astype(np.float32)
        reward = 1.0
        terminated = False  # Episode ended due to task completion/failure
        truncated = False   # Episode ended due to time limit
        info = {}
        return observation, reward, terminated, truncated, info

    def render(self):
        pass
```
## Hyperparameter Tuning with Optuna
```python
import optuna
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

def objective(trial):
    learning_rate = trial.suggest_float("learning_rate", 1e-5, 1e-3, log=True)
    n_steps = trial.suggest_categorical("n_steps", [256, 512, 1024, 2048])
    gamma = trial.suggest_float("gamma", 0.9, 0.9999)

    model = PPO(
        "MlpPolicy", "CartPole-v1",
        learning_rate=learning_rate,
        n_steps=n_steps,
        gamma=gamma,
        verbose=0,
    )
    model.learn(total_timesteps=50_000)

    mean_reward, _ = evaluate_policy(model, model.get_env(), n_eval_episodes=10)
    return mean_reward

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(f"Best params: {study.best_params}")
```
## Core Workflow

1. Define the environment - Use the Gymnasium API, validate spaces
2. Select algorithm - Based on action space and requirements
3. Start simple - Default hyperparameters, short training runs
4. Monitor training - TensorBoard, check reward curves
5. Debug issues - Use the debugging playbook
6. Tune hyperparameters - Optuna for systematic search
7. Evaluate properly - Separate eval env, multiple seeds
8. Deploy - Export to ONNX/TorchScript
## Reference Files

- `algorithms.md` - Deep dive on DQN, PPO, SAC, A2C, TD3
- `environments.md` - Gymnasium setup, custom envs, wrappers
- `training.md` - Hyperparameters, reward engineering, normalization
- `debugging.md` - Failure modes, diagnostics, sanity checks
- `evaluation.md` - Metrics, logging, reproducibility
- `deployment.md` - ONNX export, inference optimization, safety
## Essential Dependencies

```shell
pip install gymnasium stable-baselines3 tensorboard optuna

# For Atari environments
pip install "gymnasium[atari]" "gymnasium[accept-rom-license]"

# For MuJoCo
pip install "gymnasium[mujoco]"
```
## Common Pitfalls to Avoid

- Not normalizing observations - Use the `VecNormalize` wrapper
- Wrong action space handling - Check discrete vs continuous
- Ignoring seed management - Set seeds for reproducibility
- Training and eval on same env - Use a separate eval environment
- Not monitoring entropy - Low entropy = policy collapse
- Sparse rewards without shaping - Add intermediate rewards
- Too large/small learning rate - Start with 3e-4 for most algorithms