From agentdb-learning
Train one of AgentDB's 9 RL algorithms on a stream of episodes. Use when the user has accumulated successful/failed episodes and wants to derive a policy, or when a task type is repeated enough to benefit from RL routing.
npx claudepluginhub ruvnet/agentdb --plugin agentdb-learning

This skill uses the workspace's default tool permissions.
Train an RL agent on episode data to derive a policy.
| Algo | Best for |
|---|---|
| Q-Learning | Tabular state spaces, discrete actions |
| SARSA | On-policy Q-Learning variant, conservative exploration |
| DQN | High-dimensional states, neural Q-function |
| PPO | Continuous control, high-dimensional action spaces |
| Actor-Critic | Baseline reduction, stable training |
| Policy Gradient | Direct policy parameterization |
| Decision Transformer | Offline RL on trajectories |
| MCTS | Tree-search planning under known dynamics |
| Model-Based RL | Sample-efficient when env model is learnable |
If you don't know which algorithm to pick, call agentdb_learning_route first; the bandit suggests one based on past performance on similar task signatures.
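A minimal sketch of that routing call, in the same pseudocode style as the trainer below. The return shape in the comment is an assumption for illustration; the skill's actual output format isn't specified here.

```
suggestion = agentdb_learning_route("summarize-pr-diffs")   // illustrative task signature
// hypothetical result: { algorithm: "PPO", expectedReward: 0.71 }
```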
Then call the trainer:

```
agentdb_learning_train(
  algorithm:   <one of above>,        // or 'auto' to let the bandit pick
  episodes:    [<episodeId>, ...],    // or a task name → fetched automatically
  hyperparams: { lr, gamma, epsilon, ... },
  iterations:  N
)
```
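For example, a hypothetical call training Q-Learning on a few stored episodes. The episode IDs are made up, and the hyperparameter values are illustrative starting points, not tuned recommendations:

```
agentdb_learning_train(
  algorithm:   "q-learning",
  episodes:    ["ep-1042", "ep-1043", "ep-1051"],     // made-up IDs
  hyperparams: { lr: 0.1, gamma: 0.95, epsilon: 0.2 },
  iterations:  500
)
```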
agentdb_learning_route(task) may pick this trained skill if it scores well. If agentdb_learning_route says "no algorithm has > 0.6 expected reward on this task", the answer is to gather more episodes, not to force-train.
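Put together, the route-first loop might look like the sketch below. The expectedReward field and the runTaskAndRecordEpisodes helper are hypothetical, named here only to make the 0.6 rule concrete:

```
suggestion = agentdb_learning_route(task)
if (suggestion.expectedReward > 0.6) {
  agentdb_learning_train(algorithm: suggestion.algorithm, episodes: task, iterations: 500)
} else {
  runTaskAndRecordEpisodes(task)   // hypothetical: accumulate more episodes first
}
```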