Skill guidance for AgentDB reinforcement learning training.
/plugin marketplace add DNYoussef/context-cascade
/plugin install dnyoussef-context-cascade@DNYoussef/context-cascade

This skill inherits all available tools. When active, it can use any tool Claude has access to.
PROCESS.md, README.md, SKILL-meta.yaml, process-diagram.gv

Before writing ANY code, you MUST check:
.claude/library/catalog.json
.claude/docs/inventories/LIBRARY-PATTERNS-GUIDE.md
D:\Projects\*

| Match | Action |
|---|---|
| Library >90% | REUSE directly |
| Library 70-90% | ADAPT minimally |
| Pattern exists | FOLLOW pattern |
| In project | EXTRACT |
| No match | BUILD (add to library after) |
Train AI learning plugins with AgentDB's 9 reinforcement learning algorithms including Decision Transformer, Q-Learning, SARSA, Actor-Critic, PPO, and more. Build self-learning agents, implement RL, and optimize agent behavior through experience.
Use this skill when you need to:
- Train learning plugins with AgentDB's reinforcement learning algorithms
- Build self-learning agents that improve through trial-and-error experience
- Optimize and benchmark agent behavior before deploying it to production
Objective: Set up AgentDB learning infrastructure with environment configuration
Agent: ml-developer
Steps:
npm install agentdb-learning@latest
npm install @agentdb/rl-algorithms @agentdb/environments
import { AgentDB, LearningPlugin } from 'agentdb-learning';
const learningDB = new AgentDB({
name: 'rl-training-db',
dimensions: 512, // State embedding dimension
learning: {
enabled: true,
persistExperience: true,
replayBufferSize: 100000
}
});
await learningDB.initialize();
// Create learning plugin
const learningPlugin = new LearningPlugin({
database: learningDB,
algorithms: ['q-learning', 'dqn', 'ppo', 'actor-critic'],
config: {
batchSize: 64,
learningRate: 0.001,
discountFactor: 0.99,
explorationRate: 1.0,
explorationDecay: 0.995
}
});
await learningPlugin.initialize();
import { Environment } from '@agentdb/environments';
const environment = new Environment({
name: 'grid-world',
stateSpace: {
type: 'continuous',
shape: [10, 10],
bounds: [[0, 10], [0, 10]]
},
actionSpace: {
type: 'discrete',
actions: ['up', 'down', 'left', 'right']
},
rewardFunction: (state, action, nextState) => {
// Distance to goal reward
const goalDistance = Math.sqrt(
Math.pow(nextState[0] - 9, 2) +
Math.pow(nextState[1] - 9, 2)
);
return -goalDistance + (goalDistance === 0 ? 100 : 0);
},
terminalCondition: (state) => {
return state[0] === 9 && state[1] === 9; // Reached goal
}
});
await environment.initialize();
const monitor = learningPlugin.createMonitor({
metrics: ['reward', 'loss', 'exploration-rate', 'episode-length'],
logInterval: 100, // Log every 100 episodes
saveCheckpoints: true,
checkpointInterval: 1000
});
monitor.on('episode-complete', (episode) => {
console.log('Episode:', episode.number, 'Reward:', episode.totalReward);
});
Memory Pattern:
await agentDB.memory.store('agentdb/learning/environment', {
name: environment.name,
stateSpace: environment.stateSpace,
actionSpace: environment.actionSpace,
initialized: Date.now()
});
Validation:
Objective: Select and configure RL algorithm for the learning task
Agent: ml-developer
Steps:
// Example: Deep Q-Network (DQN)
const dqnAgent = learningPlugin.createAgent({
algorithm: 'dqn',
config: {
networkArchitecture: {
layers: [
{ type: 'dense', units: 128, activation: 'relu' },
{ type: 'dense', units: 128, activation: 'relu' },
{ type: 'dense', units: environment.actionSpace.size, activation: 'linear' }
]
},
learningRate: 0.001,
batchSize: 64,
replayBuffer: {
size: 100000,
prioritized: true,
alpha: 0.6,
beta: 0.4
},
targetNetwork: {
updateFrequency: 1000,
tauSync: 0.001 // Soft update
},
exploration: {
initial: 1.0,
final: 0.01,
decay: 0.995
},
training: {
startAfter: 1000, // Start training after 1000 experiences
updateFrequency: 4
}
}
});
await dqnAgent.initialize();
const hyperparameters = {
// Learning parameters
learningRate: 0.001,
discountFactor: 0.99, // Gamma
batchSize: 64,
// Exploration
epsilonStart: 1.0,
epsilonEnd: 0.01,
epsilonDecay: 0.995,
// Experience replay
replayBufferSize: 100000,
minReplaySize: 1000,
prioritizedReplay: true,
// Training
maxEpisodes: 10000,
maxStepsPerEpisode: 1000,
targetUpdateFrequency: 1000,
// Evaluation
evalFrequency: 100,
evalEpisodes: 10
};
dqnAgent.setHyperparameters(hyperparameters);
import { PrioritizedReplayBuffer } from '@agentdb/rl-algorithms';
const replayBuffer = new PrioritizedReplayBuffer({
capacity: 100000,
alpha: 0.6, // Prioritization exponent
beta: 0.4, // Importance sampling
betaIncrement: 0.001,
epsilon: 0.01 // Small constant for stability
});
dqnAgent.setReplayBuffer(replayBuffer);
const trainingConfig = {
episodes: 10000,
stepsPerEpisode: 1000,
warmupSteps: 1000,
trainFrequency: 4,
targetUpdateFrequency: 1000,
saveFrequency: 1000,
evalFrequency: 100,
earlyStoppingPatience: 500,
earlyStoppingThreshold: 0.01
};
dqnAgent.setTrainingConfig(trainingConfig);
Memory Pattern:
await agentDB.memory.store('agentdb/learning/algorithm-config', {
algorithm: 'dqn',
hyperparameters: hyperparameters,
trainingConfig: trainingConfig,
configured: Date.now()
});
Validation:
Objective: Execute training iterations and optimize agent behavior
Agent: safla-neural
Steps:
async function trainAgent() {
console.log('Starting RL training...');
const trainingStats = {
episodes: [],
totalReward: [],
episodeLength: [],
loss: [],
explorationRate: []
};
for (let episode = 0; episode < trainingConfig.episodes; episode++) {
let state = await environment.reset();
let episodeReward = 0;
let episodeLength = 0;
let episodeLoss = 0;
for (let step = 0; step < trainingConfig.stepsPerEpisode; step++) {
// Select action
const action = await dqnAgent.selectAction(state, {
explore: true
});
// Execute action
const { nextState, reward, done } = await environment.step(action);
// Store experience
await dqnAgent.storeExperience({
state,
action,
reward,
nextState,
done
});
// Train if enough experiences
if (dqnAgent.canTrain()) {
const loss = await dqnAgent.train();
episodeLoss += loss;
}
episodeReward += reward;
episodeLength += 1;
state = nextState;
if (done) break;
}
// Update target network
if (episode % trainingConfig.targetUpdateFrequency === 0) {
await dqnAgent.updateTargetNetwork();
}
// Decay exploration
dqnAgent.decayExploration();
// Log progress
trainingStats.episodes.push(episode);
trainingStats.totalReward.push(episodeReward);
trainingStats.episodeLength.push(episodeLength);
trainingStats.loss.push(episodeLoss / episodeLength);
trainingStats.explorationRate.push(dqnAgent.getExplorationRate());
if (episode % 100 === 0) {
console.log(`Episode ${episode}:`, {
reward: episodeReward.toFixed(2),
length: episodeLength,
loss: (episodeLoss / episodeLength).toFixed(4),
epsilon: dqnAgent.getExplorationRate().toFixed(3)
});
}
// Save checkpoint
if (episode % trainingConfig.saveFrequency === 0) {
await dqnAgent.save(`checkpoint-${episode}`);
}
// Evaluate
if (episode % trainingConfig.evalFrequency === 0) {
// evaluateAgent() returns an object; log its meanReward
const { meanReward: evalReward } = await evaluateAgent(dqnAgent, environment, hyperparameters.evalEpisodes);
console.log(`Evaluation at episode ${episode}: ${evalReward.toFixed(2)}`);
}
// Early stopping
if (checkEarlyStopping(trainingStats, episode)) {
console.log('Early stopping triggered');
break;
}
}
return trainingStats;
}
const trainingStats = await trainAgent();
monitor.on('training-update', (stats) => {
// Calculate moving averages
const window = 100;
const recentRewards = stats.totalReward.slice(-window);
const avgReward = recentRewards.reduce((a, b) => a + b, 0) / recentRewards.length;
// Store metrics
agentDB.memory.store('agentdb/learning/training-progress', {
episode: stats.episodes[stats.episodes.length - 1],
avgReward: avgReward,
explorationRate: stats.explorationRate[stats.explorationRate.length - 1],
timestamp: Date.now()
});
// Plot learning curve (if visualization enabled)
if (monitor.visualization) {
monitor.plot('reward-curve', stats.episodes, stats.totalReward);
monitor.plot('loss-curve', stats.episodes, stats.loss);
}
});
function checkConvergence(stats, windowSize = 100, threshold = 0.01) {
if (stats.totalReward.length < windowSize * 2) {
return false;
}
const recent = stats.totalReward.slice(-windowSize);
const previous = stats.totalReward.slice(-windowSize * 2, -windowSize);
const recentAvg = recent.reduce((a, b) => a + b, 0) / recent.length;
const previousAvg = previous.reduce((a, b) => a + b, 0) / previous.length;
const improvement = (recentAvg - previousAvg) / Math.abs(previousAvg);
return improvement < threshold;
}
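// The training loop above calls checkEarlyStopping(), which is not defined in these
// snippets. A minimal sketch, assuming it mirrors checkConvergence() and uses the
// earlyStoppingPatience / earlyStoppingThreshold fields from trainingConfig:
function checkEarlyStopping(stats, episode, window = 100) {
  const { earlyStoppingPatience: patience, earlyStoppingThreshold: threshold } = trainingConfig;
  if (stats.totalReward.length < window + patience) return false;
  const movingAvg = (end) => {
    const slice = stats.totalReward.slice(Math.max(0, end - window), end);
    return slice.reduce((a, b) => a + b, 0) / slice.length;
  };
  const current = movingAvg(stats.totalReward.length);
  const past = movingAvg(stats.totalReward.length - patience);
  // Stop when the relative improvement over the patience horizon falls below the threshold
  return (current - past) / (Math.abs(past) || 1) < threshold;
}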
await dqnAgent.save('trained-agent-final', {
includeReplayBuffer: false,
includeOptimizer: false,
metadata: {
trainingStats: trainingStats,
hyperparameters: hyperparameters,
finalReward: trainingStats.totalReward[trainingStats.totalReward.length - 1]
}
});
console.log('Training complete. Model saved.');
Memory Pattern:
await agentDB.memory.store('agentdb/learning/training-results', {
algorithm: 'dqn',
episodes: trainingStats.episodes.length,
finalReward: trainingStats.totalReward[trainingStats.totalReward.length - 1],
converged: checkConvergence(trainingStats),
modelPath: 'trained-agent-final',
timestamp: Date.now()
});
Validation:
Objective: Benchmark trained agent and validate performance
Agent: performance-benchmarker
Steps:
const trainedAgent = await learningPlugin.loadAgent('trained-agent-final');
async function evaluateAgent(agent, env, numEpisodes = 100) {
const results = {
rewards: [],
episodeLengths: [],
successRate: 0
};
for (let i = 0; i < numEpisodes; i++) {
let state = await env.reset();
let episodeReward = 0;
let episodeLength = 0;
let success = false;
for (let step = 0; step < 1000; step++) {
const action = await agent.selectAction(state, { explore: false });
const { nextState, reward, done } = await env.step(action);
episodeReward += reward;
episodeLength += 1;
state = nextState;
if (done) {
success = env.isSuccessful(state);
break;
}
}
results.rewards.push(episodeReward);
results.episodeLengths.push(episodeLength);
if (success) results.successRate += 1;
}
results.successRate /= numEpisodes;
return {
meanReward: results.rewards.reduce((a, b) => a + b, 0) / results.rewards.length,
stdReward: calculateStd(results.rewards),
meanLength: results.episodeLengths.reduce((a, b) => a + b, 0) / results.episodeLengths.length,
successRate: results.successRate,
results: results
};
}
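// evaluateAgent() relies on calculateStd(), which is not defined above.
// A minimal population standard-deviation helper, added for completeness:
function calculateStd(values) {
  const mean = values.reduce((a, b) => a + b, 0) / values.length;
  const variance = values.reduce((sum, v) => sum + (v - mean) ** 2, 0) / values.length;
  return Math.sqrt(variance);
}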
const evalResults = await evaluateAgent(trainedAgent, environment, 100);
console.log('Evaluation results:', evalResults);
// Random policy baseline
const randomAgent = learningPlugin.createAgent({ algorithm: 'random' });
const randomResults = await evaluateAgent(randomAgent, environment, 100);
// Calculate improvement
const improvement = {
rewardImprovement: (evalResults.meanReward - randomResults.meanReward) / Math.abs(randomResults.meanReward),
lengthImprovement: (randomResults.meanLength - evalResults.meanLength) / randomResults.meanLength,
successImprovement: evalResults.successRate - randomResults.successRate
};
console.log('Improvement over random:', improvement);
const benchmarks = {
performanceMetrics: {
meanReward: evalResults.meanReward,
stdReward: evalResults.stdReward,
successRate: evalResults.successRate,
meanEpisodeLength: evalResults.meanLength
},
algorithmComparison: {
dqn: evalResults,
random: randomResults,
improvement: improvement
},
inferenceTiming: {
actionSelection: 0,
totalEpisode: 0
}
};
// Measure inference speed
const timingTrials = 1000;
const startTime = performance.now();
for (let i = 0; i < timingTrials; i++) {
const state = await environment.randomState();
await trainedAgent.selectAction(state, { explore: false });
}
const endTime = performance.now();
benchmarks.inferenceTiming.actionSelection = (endTime - startTime) / timingTrials;
await agentDB.memory.store('agentdb/learning/benchmarks', benchmarks);
Memory Pattern:
await agentDB.memory.store('agentdb/learning/validation', {
evaluated: true,
meanReward: evalResults.meanReward,
successRate: evalResults.successRate,
improvement: improvement,
timestamp: Date.now()
});
Validation:
Objective: Deploy trained agents to production environment
Agent: ml-developer
Steps:
await trainedAgent.export('production-agent', {
format: 'onnx', // or 'tensorflowjs', 'pytorch'
optimize: true,
quantize: 'int8', // Quantization for faster inference
includeMetadata: true
});
import express from 'express';
const app = express();
app.use(express.json());
// Load production agent
const productionAgent = await learningPlugin.loadAgent('production-agent');
app.post('/api/predict', async (req, res) => {
try {
const { state } = req.body;
const action = await productionAgent.selectAction(state, {
explore: false,
returnProbabilities: true
});
res.json({
action: action.action,
probabilities: action.probabilities,
confidence: action.confidence
});
} catch (error) {
res.status(500).json({ error: error.message });
}
});
app.listen(3000, () => {
console.log('RL agent API running on port 3000');
});
import { ProductionMonitor } from '@agentdb/monitoring';
const prodMonitor = new ProductionMonitor({
agent: productionAgent,
metrics: ['inference-latency', 'action-distribution', 'reward-feedback'],
alerting: {
latencyThreshold: 100, // ms
anomalyDetection: true
}
});
await prodMonitor.start();
const deploymentPipeline = {
stages: [
{
name: 'validation',
steps: [
'Load trained model',
'Run validation suite',
'Check performance metrics',
'Verify inference speed'
]
},
{
name: 'export',
steps: [
'Export to production format',
'Optimize model',
'Quantize weights',
'Package artifacts'
]
},
{
name: 'deployment',
steps: [
'Deploy to staging',
'Run smoke tests',
'Deploy to production',
'Monitor performance'
]
}
]
};
await agentDB.memory.store('agentdb/learning/deployment-pipeline', deploymentPipeline);
Memory Pattern:
await agentDB.memory.store('agentdb/learning/production', {
deployed: true,
modelPath: 'production-agent',
apiEndpoint: 'http://localhost:3000/api/predict',
monitoring: true,
timestamp: Date.now()
});
Validation:
#!/bin/bash
# train-rl-agent.sh
set -e
echo "AgentDB RL Training Script"
echo "=========================="
# Phase 1: Initialize
echo "Phase 1: Initializing learning environment..."
npm install agentdb-learning @agentdb/rl-algorithms
# Phase 2: Configure
echo "Phase 2: Configuring algorithm..."
node -e "require('./config-algorithm.js')"
# Phase 3: Train
echo "Phase 3: Training agent..."
node -e "require('./train-agent.js')"
# Phase 4: Validate
echo "Phase 4: Validating performance..."
node -e "require('./evaluate-agent.js')"
# Phase 5: Deploy
echo "Phase 5: Deploying to production..."
node -e "require('./deploy-agent.js')"
echo "Training complete!"
// quickstart-rl.ts
import { setupRLTraining } from './setup';
async function quickStart() {
console.log('Starting RL training quick setup...');
// Setup
const { learningDB, environment, agent } = await setupRLTraining({
algorithm: 'dqn',
environment: 'grid-world',
episodes: 1000
});
// Train
console.log('Training agent...');
const stats = await agent.train(environment, {
episodes: 1000,
logInterval: 100
});
// Evaluate
console.log('Evaluating agent...');
const results = await agent.evaluate(environment, {
episodes: 100
});
console.log('Results:', results);
// Save
await agent.save('quickstart-agent');
console.log('Quick start complete!');
}
quickStart().catch(console.error);
Training Convergence (Self-Consistency)
Performance Benchmarks (Quantitative)
Algorithm Validation (Chain-of-Verification)
Production Readiness (Multi-Agent Consensus)
This skill operates using AgentDB's npm package and API only. No additional MCP servers required.
All AgentDB learning plugin operations are performed through:
npx agentdb@latest create-plugin
import { AgentDB, LearningPlugin } from 'agentdb-learning'

AgentDB Reinforcement Learning Training operates on 3 fundamental principles:
Naive online learning suffers from correlation bias (the agent learns from runs of consecutive, highly similar experiences) and catastrophic forgetting (new experiences overwrite old knowledge). Prioritized experience replay stores diverse experiences in a buffer, samples across that buffer to break temporal correlations, and weights high-error transitions more heavily so learning focuses on difficult scenarios.
In practice:
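The sketch below illustrates the mechanism independently of the `PrioritizedReplayBuffer` API used earlier (whose internals are not documented here): transitions are drawn in proportion to priority^alpha and corrected with importance-sampling weights. The class name and the linear sampling loop are illustrative assumptions; a production buffer would use a sum-tree.

```javascript
// Minimal sketch of proportional prioritized sampling (illustrative only).
class SimplePrioritizedBuffer {
  constructor(capacity, alpha = 0.6) {
    this.capacity = capacity;
    this.alpha = alpha;
    this.items = [];       // stored transitions
    this.priorities = [];  // |TD error| + epsilon per transition
  }

  add(transition, tdError, eps = 0.01) {
    if (this.items.length >= this.capacity) {
      this.items.shift();
      this.priorities.shift();
    }
    this.items.push(transition);
    this.priorities.push(Math.abs(tdError) + eps);
  }

  sample(batchSize, beta = 0.4) {
    const scaled = this.priorities.map((p) => p ** this.alpha);
    const total = scaled.reduce((a, b) => a + b, 0);
    const probs = scaled.map((s) => s / total);

    const batch = [];
    for (let i = 0; i < batchSize; i++) {
      // Roulette-wheel selection over the priority distribution
      let r = Math.random();
      let idx = 0;
      while (r > probs[idx] && idx < probs.length - 1) r -= probs[idx++];
      // Importance-sampling weight corrects the bias introduced by prioritization
      const weight = (this.items.length * probs[idx]) ** -beta;
      batch.push({ transition: this.items[idx], index: idx, weight });
    }
    return batch;
  }
}
```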
The exploration-exploitation dilemma is fundamental to RL: explore too much and you waste time on suboptimal actions; exploit too much and you miss better strategies. Epsilon-greedy with decay starts high (explore aggressively to map the state space) and decays toward a small floor (exploit the learned policy once it is confident).
In practice:
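A minimal sketch of epsilon-greedy selection with exponential decay, reusing the epsilonStart/epsilonEnd/epsilonDecay values configured earlier; the agent's built-in `selectAction`/`decayExploration` may implement this differently.

```javascript
// Epsilon-greedy with exponential decay: explore with probability epsilon,
// otherwise act greedily with respect to the current Q-estimates.
let epsilon = 1.0;          // epsilonStart
const epsilonEnd = 0.01;
const epsilonDecay = 0.995;

function selectEpsilonGreedy(qValues) {
  if (Math.random() < epsilon) {
    return Math.floor(Math.random() * qValues.length);   // explore: random action index
  }
  return qValues.indexOf(Math.max(...qValues));           // exploit: greedy action index
}

function decayEpsilon() {
  epsilon = Math.max(epsilonEnd, epsilon * epsilonDecay); // never drop below the floor
}
```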
Q-learning suffers from instability when the target (expected future reward) shifts while training. Using the same network for both action selection and target computation creates a moving target problem. Target networks freeze value estimates periodically, stabilizing training and improving convergence reliability.
In practice:
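A minimal sketch of the two target-update styles referenced in the DQN configuration (`updateFrequency` for periodic hard copies, `tauSync` for Polyak/soft updates). The plain-array weight handling is illustrative and not tied to any specific tensor library.

```javascript
// Hard update: replace the target network's weights with a copy of the online
// weights every `updateFrequency` steps. Weights are arrays of per-layer arrays.
function hardUpdate(onlineWeights) {
  return onlineWeights.map((layer) => layer.slice());
}

// Soft (Polyak) update: target <- tau * online + (1 - tau) * target, applied every step.
function softUpdate(targetWeights, onlineWeights, tau = 0.001) {
  return targetWeights.map((layer, i) =>
    layer.map((w, j) => tau * onlineWeights[i][j] + (1 - tau) * w)
  );
}
```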
| Anti-Pattern | Problem | Solution |
|---|---|---|
| Reward Hacking - Agent Finds Unintended Policy Shortcuts | Poorly designed reward functions incentivize agents to exploit loopholes rather than solve the intended task. Classic example: agent learns to pause game indefinitely to avoid losing instead of playing well. | Use shaped rewards with multiple components (task completion + efficiency + constraints). Validate reward function with adversarial testing - manually identify shortcuts and penalize them. Prefer sparse terminal rewards over dense step rewards when task definition is clear. |
| Training Convergence Blindness - Run Fixed Episode Count Without Monitoring | Training for arbitrary 10K episodes wastes compute if convergence happens at 3K or fails to converge at all. Agents either plateau early or train indefinitely without improvement. | Implement early stopping with patience threshold (stop if no improvement in 500 episodes). Monitor moving average reward over 100-episode window. Track loss curves alongside rewards - diverging loss indicates hyperparameter tuning needed before continuing training. |
| Hyperparameter Lottery - Use Default Values Without Task-Specific Tuning | RL algorithms are notoriously sensitive to hyperparameters. Default learning_rate=0.001 may be 10x too high for high-dimensional state spaces or 10x too low for simple tasks. | Start with baseline hyperparameters from algorithm papers for similar task domains. Run hyperparameter sweeps on key parameters (learning_rate, discount_factor, exploration_rate) using small-scale experiments (1K episodes). Use grid search or Bayesian optimization to find task-specific optimal values before full-scale training. |
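As a concrete illustration of the shaped-reward mitigation in the first row, the sketch below combines a sparse completion bonus, a dense progress term, a step cost, and a constraint penalty for the grid-world environment defined earlier. The weights are placeholders to tune per task, not recommended values.

```javascript
// Shaped reward: completion bonus + progress term + step cost + constraint penalty.
function shapedReward(state, action, nextState, { collided = false } = {}) {
  const goal = [9, 9];
  const dist = Math.hypot(nextState[0] - goal[0], nextState[1] - goal[1]);
  const prevDist = Math.hypot(state[0] - goal[0], state[1] - goal[1]);

  const completion = dist < 0.5 ? 100 : 0;   // sparse terminal bonus
  const progress = (prevDist - dist) * 1.0;  // dense shaping toward the goal
  const stepCost = -0.1;                     // discourages stalling or pausing
  const constraint = collided ? -10 : 0;     // penalize known shortcuts/violations

  return completion + progress + stepCost + constraint;
}
```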
AgentDB Reinforcement Learning Training provides a production-ready framework for training autonomous agents across 9 RL algorithms, from classic Q-Learning to state-of-the-art Decision Transformers. The 5-phase SOP systematically guides you from environment initialization and algorithm configuration through training iterations, performance validation, and production deployment, with comprehensive monitoring and benchmarking at each stage. By integrating experience replay, exploration-exploitation balancing, and target network stabilization, the framework implements proven RL best practices that accelerate convergence and improve final policy quality.
This skill is essential when building self-learning agents for game AI, robotics control, resource optimization, or any domain where optimal behavior must be discovered through trial-and-error rather than explicitly programmed. The key differentiator is systematic validation - rather than blindly training for arbitrary episode counts, the framework monitors convergence, validates against baselines, and implements early stopping to prevent wasted computation. The deployment pipeline ensures trained policies are properly exported, optimized (quantization, format conversion), and monitored in production with latency and action distribution tracking.
The choice of RL algorithm matters critically. Q-Learning and SARSA suit discrete action spaces with full observability. DQN scales to high-dimensional state spaces (images, sensor data). Policy gradient methods (PPO, A2C) handle continuous control and partial observability. Decision Transformers enable offline RL from logged data without environment interaction. The framework provides all 9 algorithms with unified interfaces, allowing rapid experimentation to find the optimal approach for your specific task complexity, state/action space characteristics, and data availability constraints. With proper validation against baselines and comprehensive benchmarking, you can confidently deploy RL agents that genuinely learn and improve rather than memorizing fixed policies.
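A hedged sketch of encoding that selection logic against the unified `createAgent` interface shown earlier; the decision thresholds and the 'decision-transformer' algorithm identifier are assumptions, not documented values.

```javascript
// Illustrative mapping from task characteristics to an algorithm choice.
function chooseAlgorithm({ actionSpace, stateDim, offlineDataOnly, partiallyObservable }) {
  if (offlineDataOnly) return 'decision-transformer';  // learn from logged trajectories
  if (actionSpace === 'continuous' || partiallyObservable) return 'ppo';
  if (stateDim > 100) return 'dqn';                    // high-dimensional observations
  return 'q-learning';                                 // small, discrete, fully observable
}

const algorithm = chooseAlgorithm({
  actionSpace: 'discrete',
  stateDim: 100,          // 10x10 grid
  offlineDataOnly: false,
  partiallyObservable: false
});
const agent = learningPlugin.createAgent({ algorithm });
```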