From AI-Toolkit
Reinforcement Learning best practices for Python using modern libraries (Stable-Baselines3, RLlib, Gymnasium). Use when:
- Implementing RL algorithms (PPO, SAC, DQN, TD3, A2C)
- Creating custom Gymnasium environments
- Training, debugging, or evaluating RL agents
- Setting up hyperparameter tuning for RL
- Deploying RL models to production
npx claudepluginhub aznatkoiny/zai-skills --plugin AI-Toolkit
Slash command: /AI-Toolkit:reinforcement-learning
This skill provides comprehensive guidance for implementing reinforcement learning in Python using the modern ecosystem (2024-2025). Gymnasium has replaced OpenAI Gym as the standard environment interface. Stable-Baselines3 (SB3) is recommended for prototyping, RLlib for production/distributed training, and CleanRL for research.
| Library | Best For | Ease | Flexibility | Production |
|---|---|---|---|---|
| Stable-Baselines3 | Prototyping, learning | High | Medium | Good |
| RLlib | Production, distributed | Medium | High | Excellent |
| CleanRL | Research, understanding | High | Low | Poor |
| TorchRL | Custom implementations | Low | Highest | Good |
Start
  |
  v
Action space type?
  |
  +-- Discrete --> Sample efficiency critical?
  |                  |
  |                  +-- Yes --> DQN (or Double/Dueling DQN)
  |                  +-- No  --> Stability critical?
  |                                |
  |                                +-- Yes --> PPO
  |                                +-- No  --> A2C (faster iterations)
  |
  +-- Continuous --> Sample efficiency critical?
                       |
                       +-- Yes --> SAC (auto entropy) or TD3
                       +-- No  --> PPO (more stable, less efficient)
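The decision flow above can also be written as a small helper. This is only a sketch of the chart: the function name `choose_algorithm` and its parameters are hypothetical, not part of any library, and the returned strings are the algorithm names taken from the chart.

```python
def choose_algorithm(action_space: str,
                     sample_efficiency_critical: bool,
                     stability_critical: bool = False) -> str:
    """Return the algorithm suggested by the decision chart (hypothetical helper)."""
    if action_space == "discrete":
        if sample_efficiency_critical:
            return "DQN"
        return "PPO" if stability_critical else "A2C"
    if action_space == "continuous":
        # The chart does not ask about stability on the continuous branch.
        return "SAC or TD3" if sample_efficiency_critical else "PPO"
    raise ValueError(f"unknown action space type: {action_space!r}")


print(choose_algorithm("discrete", sample_efficiency_critical=False, stability_critical=True))
print(choose_algorithm("continuous", sample_efficiency_critical=True))
```

Treat the result as a starting point for experiments rather than a final choice.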
Quick Selection Table:
| Scenario | Recommended | Why |
|---|---|---|
| Discrete actions, getting started | PPO | Stable, good defaults |
| Continuous control | SAC or TD3 | Sample efficient, handles continuous well |
| Sample efficiency critical | SAC, DQN | Off-policy, reuses experience |
| Stability critical | PPO | Clipped updates approximate a trust region, consistent |
| High-dimensional obs (images) | PPO + CNN | Handles visual input well |
| Fast iteration needed | A2C | Simpler, faster per update |
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

# Create vectorized environment (4 parallel envs)
env = make_vec_env("CartPole-v1", n_envs=4)

# Initialize and train
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=100_000)

# Save and load
model.save("ppo_cartpole")
loaded_model = PPO.load("ppo_cartpole")

# Evaluate. SB3 vectorized envs use their own API: reset() returns only the
# observations, step() returns a 4-tuple of batched values, and finished
# sub-environments reset automatically.
obs = env.reset()
for _ in range(1000):
    action, _ = loaded_model.predict(obs, deterministic=True)
    obs, rewards, dones, infos = env.step(action)
import gymnasium as gym
from gymnasium import spaces
import numpy as np


class CustomEnv(gym.Env):
    metadata = {"render_modes": ["human", "rgb_array"]}

    def __init__(self, render_mode=None):
        super().__init__()
        self.observation_space = spaces.Box(
            low=-np.inf, high=np.inf, shape=(4,), dtype=np.float32
        )
        self.action_space = spaces.Discrete(2)
        self.render_mode = render_mode

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.state = self.np_random.uniform(low=-0.05, high=0.05, size=(4,))
        return self.state.astype(np.float32), {}

    def step(self, action):
        # Implement environment dynamics here
        observation = self.state.astype(np.float32)
        reward = 1.0
        terminated = False  # Episode ended due to task completion/failure
        truncated = False  # Episode ended due to time limit
        info = {}
        return observation, reward, terminated, truncated, info

    def render(self):
        pass
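The `terminated`/`truncated` split matters mostly when computing value targets. A common convention is to stop bootstrapping only on true termination, while a time-limit truncation still bootstraps from the next state's value estimate. The sketch below illustrates that convention; `td_target` and `episode_over` are hypothetical names for illustration, not a Gymnasium or Stable-Baselines3 API.

```python
def td_target(reward: float, next_value: float, terminated: bool,
              truncated: bool, gamma: float = 0.99) -> float:
    """One-step TD target r + gamma * V(s') with termination handling (illustrative)."""
    # `truncated` is accepted only to make the contrast explicit: it does not
    # zero the bootstrap term, because the task itself has not ended.
    bootstrap = 0.0 if terminated else next_value
    return reward + gamma * bootstrap


def episode_over(terminated: bool, truncated: bool) -> bool:
    """Either flag means the environment must be reset before the next step."""
    return terminated or truncated


print(td_target(1.0, 10.0, terminated=True, truncated=False, gamma=0.9))
print(td_target(1.0, 10.0, terminated=False, truncated=True, gamma=0.9))
```

Collapsing both flags into a single `done` before computing targets would treat a time limit as a failure state, which is why the two are reported separately.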
import optuna
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy


def objective(trial):
    learning_rate = trial.suggest_float("learning_rate", 1e-5, 1e-3, log=True)
    n_steps = trial.suggest_categorical("n_steps", [256, 512, 1024, 2048])
    gamma = trial.suggest_float("gamma", 0.9, 0.9999)

    model = PPO(
        "MlpPolicy", "CartPole-v1",
        learning_rate=learning_rate,
        n_steps=n_steps,
        gamma=gamma,
        verbose=0
    )
    model.learn(total_timesteps=50_000)

    mean_reward, _ = evaluate_policy(model, model.get_env(), n_eval_episodes=10)
    return mean_reward


study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(f"Best params: {study.best_params}")
pip install gymnasium stable-baselines3 tensorboard optuna
# For Atari environments (quotes keep shells such as zsh from expanding the brackets)
pip install "gymnasium[atari]" "gymnasium[accept-rom-license]"
# For MuJoCo
pip install "gymnasium[mujoco]"
VecNormalize wrapper
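As a rough illustration of the idea behind a normalizing wrapper such as `VecNormalize`: values are rescaled with running estimates of their mean and variance, then clipped. The class below is a pure-Python sketch under that assumption. `RunningNormalizer`, `clip`, and `eps` are hypothetical names and defaults, and this is not Stable-Baselines3's implementation.

```python
import math


class RunningNormalizer:
    """Tracks a running mean/variance (Welford's method) and normalizes scalars."""

    def __init__(self, clip: float = 10.0, eps: float = 1e-8):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean
        self.clip = clip
        self.eps = eps

    def update(self, x: float) -> None:
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def var(self) -> float:
        # Population variance; falls back to 1.0 until two samples are seen.
        return self.m2 / self.count if self.count > 1 else 1.0

    def normalize(self, x: float) -> float:
        z = (x - self.mean) / math.sqrt(self.var + self.eps)
        return max(-self.clip, min(self.clip, z))


norm = RunningNormalizer()
for value in [1.0, 2.0, 3.0, 4.0, 5.0]:
    norm.update(value)
print(norm.mean, norm.var)
```

Because the statistics are part of what the agent learned against, they generally need to be stored and restored together with the trained model.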