```shell
npx claudepluginhub aznatkoiny/zai-skills --plugin AI-Toolkit
```

To install just this skill: `npx claudepluginhub u/[userId]/[slug]`
Reinforcement Learning best practices for Python using modern libraries (Stable-Baselines3, RLlib, Gymnasium). Use when:

- Implementing RL algorithms (PPO, SAC, DQN, TD3, A2C)
- Creating custom Gymnasium environments
- Training, debugging, or evaluating RL agents
- Setting up hyperparameter tuning for RL
- Deploying RL models to production
This skill uses the workspace's default tool permissions.
# Reinforcement Learning Best Practices
## Overview
This skill provides comprehensive guidance for implementing reinforcement learning in Python using the modern ecosystem (2024-2025). Gymnasium has replaced OpenAI Gym as the standard environment interface. Stable-Baselines3 (SB3) is recommended for prototyping, RLlib for production/distributed training, and CleanRL for research.
## When to Use
- Building RL agents for discrete or continuous control tasks
- Creating custom simulation environments
- Tuning hyperparameters for RL algorithms
- Debugging training issues (reward curves, policy collapse, numerical instability)
- Deploying trained policies to production
## Library Selection
| Library | Best For | Ease | Flexibility | Production |
|---|---|---|---|---|
| Stable-Baselines3 | Prototyping, learning | High | Medium | Good |
| RLlib | Production, distributed | Medium | High | Excellent |
| CleanRL | Research, understanding | High | Low | Poor |
| TorchRL | Custom implementations | Low | Highest | Good |
## Algorithm Decision Tree

```
Start
  |
  v
Action space type?
  |
  +-- Discrete ----> Sample efficiency critical?
  |                    |
  |                    +-- Yes --> DQN (or Double/Dueling DQN)
  |                    +-- No  --> Stability critical?
  |                                  |
  |                                  +-- Yes --> PPO
  |                                  +-- No  --> A2C (faster iterations)
  |
  +-- Continuous --> Sample efficiency critical?
                       |
                       +-- Yes --> SAC (auto entropy) or TD3
                       +-- No  --> PPO (more stable, less efficient)
```
Quick Selection Table:
| Scenario | Recommended | Why |
|---|---|---|
| Discrete actions, getting started | PPO | Stable, good defaults |
| Continuous control | SAC or TD3 | Sample efficient, handles continuous well |
| Sample efficiency critical | SAC, DQN | Off-policy, reuses experience |
| Stability critical | PPO | Trust region, consistent |
| High-dimensional obs (images) | PPO + CNN | Handles visual input well |
| Fast iteration needed | A2C | Simpler, faster per update |
## Quick Start with Stable-Baselines3

### Basic Training
```python
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

# Create vectorized environment (4 parallel envs)
env = make_vec_env("CartPole-v1", n_envs=4)

# Initialize and train
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=100_000)

# Save and load
model.save("ppo_cartpole")
loaded_model = PPO.load("ppo_cartpole")

# Evaluate
obs = env.reset()
for _ in range(1000):
    action, _ = loaded_model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)
```
### Custom Environment Template
```python
import gymnasium as gym
from gymnasium import spaces
import numpy as np

class CustomEnv(gym.Env):
    metadata = {"render_modes": ["human", "rgb_array"]}

    def __init__(self, render_mode=None):
        super().__init__()
        self.observation_space = spaces.Box(
            low=-np.inf, high=np.inf, shape=(4,), dtype=np.float32
        )
        self.action_space = spaces.Discrete(2)
        self.render_mode = render_mode

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.state = self.np_random.uniform(low=-0.05, high=0.05, size=(4,))
        return self.state.astype(np.float32), {}

    def step(self, action):
        # Implement environment dynamics here
        observation = self.state.astype(np.float32)
        reward = 1.0
        terminated = False  # Episode ended due to task completion/failure
        truncated = False   # Episode ended due to time limit
        info = {}
        return observation, reward, terminated, truncated, info

    def render(self):
        pass
```
## Hyperparameter Tuning with Optuna
```python
import optuna
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

def objective(trial):
    learning_rate = trial.suggest_float("learning_rate", 1e-5, 1e-3, log=True)
    n_steps = trial.suggest_categorical("n_steps", [256, 512, 1024, 2048])
    gamma = trial.suggest_float("gamma", 0.9, 0.9999)

    model = PPO(
        "MlpPolicy", "CartPole-v1",
        learning_rate=learning_rate,
        n_steps=n_steps,
        gamma=gamma,
        verbose=0,
    )
    model.learn(total_timesteps=50_000)

    mean_reward, _ = evaluate_policy(model, model.get_env(), n_eval_episodes=10)
    return mean_reward

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(f"Best params: {study.best_params}")
```
## Core Workflow

1. Define the environment - Use the Gymnasium API, validate spaces
2. Select algorithm - Based on action space and requirements
3. Start simple - Default hyperparameters, short training runs
4. Monitor training - TensorBoard, check reward curves
5. Debug issues - Use the debugging playbook
6. Tune hyperparameters - Optuna for systematic search
7. Evaluate properly - Separate eval env, multiple seeds
8. Deploy - Export to ONNX/TorchScript
## Reference Files

- `algorithms.md` - Deep dive on DQN, PPO, SAC, A2C, TD3
- `environments.md` - Gymnasium setup, custom envs, wrappers
- `training.md` - Hyperparameters, reward engineering, normalization
- `debugging.md` - Failure modes, diagnostics, sanity checks
- `evaluation.md` - Metrics, logging, reproducibility
- `deployment.md` - ONNX export, inference optimization, safety
## Essential Dependencies

```shell
pip install gymnasium stable-baselines3 tensorboard optuna

# For Atari environments
pip install "gymnasium[atari]" "gymnasium[accept-rom-license]"

# For MuJoCo
pip install "gymnasium[mujoco]"
```
## Common Pitfalls to Avoid

- Not normalizing observations - Use the `VecNormalize` wrapper
- Wrong action space handling - Check discrete vs continuous
- Ignoring seed management - Set seeds for reproducibility
- Training and eval on same env - Use a separate eval environment
- Not monitoring entropy - Low entropy = policy collapse
- Sparse rewards without shaping - Add intermediate rewards
- Too large/small learning rate - Start with 3e-4 for most algorithms