name: reinforcement-learning-trading
description: Reinforcement learning for trading — Q-learning, policy gradient, reward shaping for portfolio management. Use when exploring RL-based trading systems.
Trading can be framed as a Markov Decision Process (MDP):

State s_t:
- Market features: prices, returns, volume, volatility, order book
- Portfolio features: current positions, cash, P&L, holding periods
- Account features: margin, risk limits, exposure

Action a_t:
- Discrete: {buy, sell, hold} for each asset
- Continuous: target portfolio weights w_t in [0,1]^N (or [-1,1]^N with shorting)
- Execution: order size, limit price, timing

Reward r_t:
- Simple: portfolio return r_{t+1}
- Risk-adjusted: r_{t+1} - lambda * risk_penalty
- Sharpe-like: (r_{t+1} - r_f) / running_std
- Log utility: log(1 + r_{t+1})

Transition: s_{t+1} = f(s_t, a_t, market_dynamics)
- Without market impact, the market is not influenced by the agent (price-taking assumption)
- With market impact, the transition also depends on order size
Tabular Q-Learning (for simple cases):

Q(s, a) <- Q(s, a) + alpha * [r + gamma * max_a' Q(s', a') - Q(s, a)]

- alpha: learning rate (0.01-0.1)
- gamma: discount factor (0.99 typical for daily trading; discounts future rewards)
- epsilon: exploration rate (start at 1.0, decay to 0.01)

Requires a discretized state space — impractical for high-dimensional financial states.
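The tabular update above fits in a few lines of plain Python. This is a sketch: the state labels (like "up_trend"), the action set, and the coefficient values are illustrative, not part of any real discretization scheme.

```python
import random
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.05, gamma=0.99):
    """One tabular Q-learning step: move Q(s, a) toward the TD target."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def epsilon_greedy(Q, s, actions, epsilon):
    """Explore with probability epsilon, otherwise act greedily."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

Q = defaultdict(float)              # unseen (state, action) pairs default to 0.0
actions = ["buy", "sell", "hold"]
q_update(Q, s="up_trend", a="buy", r=0.01, s_next="flat", actions=actions)
```

Using a `defaultdict` avoids pre-enumerating the state space, but the table still grows with every distinct discretized state visited, which is exactly why this approach breaks down on rich market features.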
Deep Q-Network (DQN):

Replace the Q-table with a neural network: Q(s, a; theta)

Key techniques:
- Experience replay: store (s, a, r, s') transitions in a buffer, sample mini-batches
- Target network: separate network for target Q-values, updated periodically
- Double DQN: use the online network for action selection, the target network for evaluation
- Dueling DQN: separate value and advantage streams

Architecture:
- Input: state features (flattened, or fed through an LSTM for sequences)
- Hidden: 2-3 dense layers (128-256 units)
- Output: one Q-value per discrete action

Limitations for trading:
- Requires discrete actions (buy/sell/hold or discretized position sizes)
- Overestimates Q-values in noisy environments (financial data is very noisy)
- Requires a large experience replay buffer
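The Double DQN target computation can be illustrated with NumPy. Here the two networks' outputs for s' are supplied directly as arrays rather than produced by real networks, so the sketch shows only the select-with-online, evaluate-with-target logic:

```python
import numpy as np

def double_dqn_targets(rewards, q_online_next, q_target_next, gamma=0.99, dones=None):
    """Double DQN targets: pick a' with the online net, evaluate it with the target net."""
    rewards = np.asarray(rewards, dtype=float)
    if dones is None:
        dones = np.zeros_like(rewards)
    best_actions = np.argmax(q_online_next, axis=1)                   # selection: online network
    evaluated = q_target_next[np.arange(len(rewards)), best_actions]  # evaluation: target network
    return rewards + gamma * (1.0 - dones) * evaluated

# Toy batch of one transition: the online net prefers action 1,
# which the (typically more conservative) target net values at 1.5.
q_online = np.array([[1.0, 2.0, 0.5]])
q_target = np.array([[0.9, 1.5, 0.4]])
targets = double_dqn_targets([0.01], q_online, q_target)
```

Decoupling selection from evaluation is what dampens the overestimation bias mentioned above, which matters most in exactly the noisy-return regime financial data lives in.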
Policy gradient methods learn a policy pi(a|s; theta) directly, without estimating a Q-function.

REINFORCE:
- Gradient: nabla_theta J = E[sum_t nabla_theta log(pi(a_t|s_t)) * G_t]
- G_t: cumulative discounted reward from time t
- Simple but high variance; impractical for long trading episodes
Actor-Critic (A2C):
- Actor: pi(a|s; theta_a) — policy network
- Critic: V(s; theta_c) — value network
- Advantage: A_t = r_t + gamma * V(s_{t+1}) - V(s_t)
- Actor update: ascend nabla_theta_a log(pi(a_t|s_t)) * A_t
- Critic update: minimize (r_t + gamma * V(s_{t+1}) - V(s_t))^2
- Lower variance than REINFORCE thanks to the critic baseline
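The one-step advantages and critic targets above can be computed for a whole trajectory at once. This NumPy sketch is framework-agnostic; `values` carries one extra bootstrap entry for the state reached after the last reward:

```python
import numpy as np

def advantages_and_targets(rewards, values, gamma=0.99):
    """One-step advantages A_t = r_t + gamma*V(s_{t+1}) - V(s_t) over a trajectory.

    `values` must have length len(rewards) + 1: it includes the bootstrap
    value of the final state.
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    targets = rewards + gamma * values[1:]   # regression targets for the critic
    adv = targets - values[:-1]              # advantages for the actor update
    return adv, targets

adv, targets = advantages_and_targets([0.01, -0.02], [0.5, 0.4, 0.0])
```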
PPO (Proximal Policy Optimization):
- Clips the policy ratio to prevent destructively large updates:
  L = min(r_t * A_t, clip(r_t, 1-epsilon, 1+epsilon) * A_t)
  r_t = pi_new(a_t|s_t) / pi_old(a_t|s_t) (here r_t denotes the probability ratio, not the reward)
- epsilon: clipping parameter (typically 0.2)
PPO is currently the most popular RL algorithm for trading:
- Stable training
- Works with continuous action spaces
- Handles multi-asset portfolios naturally
- Good sample efficiency relative to other PG methods
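The clipped surrogate objective can be written down directly from the formula above. This NumPy sketch takes log-probabilities, as most implementations do for numerical stability, and returns a loss to minimize:

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """PPO clipped surrogate loss (negated for minimization), from log-probs."""
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))  # pi_new / pi_old
    advantages = np.asarray(advantages, dtype=float)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -np.mean(np.minimum(unclipped, clipped))  # minimizing this ascends the surrogate

# A ratio of 1.25 gets clipped to 1.2 when the advantage is positive.
loss = ppo_clip_loss(np.log([0.5]), np.log([0.4]), [1.0])
```

Taking the elementwise minimum makes the objective pessimistic: the policy gains nothing by pushing the ratio outside the clip band, which is what keeps updates small and training stable.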
Reward design is the most critical and difficult aspect of RL for trading:
Naive reward (portfolio return):
r_t = portfolio_value_t / portfolio_value_{t-1} - 1
Problem: Ignores risk. Agent takes maximum leverage.
Risk-adjusted reward:
r_t = return_t - lambda * return_t^2
Approximates mean-variance utility. lambda controls risk aversion.
Differential Sharpe ratio (Moody & Saffell):
r_t = (B_{t-1} * delta_A_t - 0.5 * A_{t-1} * delta_B_t) / (B_{t-1} - A_{t-1}^2)^{3/2}
Where A_t, B_t are exponential moving averages of return and squared return
Directly optimizes Sharpe ratio in an online manner
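A minimal online implementation of the differential Sharpe reward, following the EMA recursion above. The eta value and the guard that returns 0.0 while the variance estimate is still zero are implementation choices, not part of the original formulation:

```python
class DifferentialSharpe:
    """Online differential Sharpe ratio reward (after Moody & Saffell).

    A and B are exponential moving averages of the return and squared return;
    eta is the EMA decay rate.
    """
    def __init__(self, eta=0.01):
        self.eta, self.A, self.B = eta, 0.0, 0.0

    def update(self, r):
        dA = self.eta * (r - self.A)
        dB = self.eta * (r * r - self.B)
        denom = (self.B - self.A ** 2) ** 1.5   # (B_{t-1} - A_{t-1}^2)^{3/2}
        dsr = 0.0 if denom <= 0 else (self.B * dA - 0.5 * self.A * dB) / denom
        self.A += dA                             # update EMAs only after computing the reward
        self.B += dB
        return dsr

ds = DifferentialSharpe(eta=0.01)
first = ds.update(0.01)   # variance estimate still zero -> reward 0.0
second = ds.update(0.01)  # consistent positive returns -> positive reward
```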
Transaction cost penalty:
r_t = return_t - cost * |w_t - w_{t-1}|
Essential: without this, agent churns the portfolio
Drawdown penalty:
r_t = return_t - lambda_dd * max(0, max_value - current_value)
Penalizes drawdowns to keep them bounded
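In practice these components are combined into one shaped reward. The coefficients below (lam_risk, cost, lam_dd) are illustrative placeholders, not recommended values; they have to be tuned jointly, since each penalty changes what the others are worth:

```python
def shaped_reward(ret, turnover, value, peak,
                  lam_risk=2.0, cost=0.001, lam_dd=0.1):
    """Composite reward: return minus risk, transaction-cost, and drawdown penalties.

    lam_risk, cost, and lam_dd are illustrative placeholders, not tuned values.
    """
    risk_pen = lam_risk * ret ** 2            # mean-variance style penalty
    cost_pen = cost * turnover                # turnover = sum_i |w_t,i - w_{t-1},i|
    dd_pen = lam_dd * max(0.0, peak - value)  # distance below the high-water mark
    return ret - risk_pen - cost_pen - dd_pen

r = shaped_reward(ret=0.01, turnover=0.1, value=99.0, peak=100.0)
```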
The environment is where RL agents train. Its realism determines whether learned policies transfer to live trading.
Minimum viable environment:
- Historical price data (open, high, low, close, volume)
- Transaction costs (spread + commission + slippage model)
- Position limits and leverage constraints
- Cash management and margin requirements
Better environment:
- Order book simulation (level 2 data)
- Market impact model (e.g. square-root impact: kappa * sqrt(volume/ADV))
- Partial fills and queue position
- Corporate actions (splits, dividends, delistings)
- Realistic data delivery (no look-ahead in state construction)
Environment frameworks:
- OpenAI Gym / Gymnasium: Standard interface
- FinRL: Finance-specific RL environment
- Custom: Most flexibility, most development effort
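A minimum viable environment along these lines can be sketched against the Gymnasium-style reset/step interface without depending on the library. This is a single-asset toy with proportional costs and a window of past log-returns as the observation; all parameters are illustrative:

```python
import numpy as np

class MinimalTradingEnv:
    """Gymnasium-style single-asset environment sketch (no library dependency).

    prices: 1-D array of closes; cost: proportional transaction cost per unit
    of position change. Action is a target position in {-1, 0, 1}.
    """
    def __init__(self, prices, cost=0.001, window=5):
        self.prices = np.asarray(prices, dtype=float)
        self.cost, self.window = cost, window

    def reset(self):
        self.t = self.window
        self.position = 0
        return self._obs(), {}

    def _obs(self):
        # Only past `window` log-returns — no look-ahead into prices[t+1:].
        return np.diff(np.log(self.prices[self.t - self.window:self.t + 1]))

    def step(self, action):
        ret = self.prices[self.t + 1] / self.prices[self.t] - 1.0
        reward = action * ret - self.cost * abs(action - self.position)
        self.position = action
        self.t += 1
        terminated = self.t >= len(self.prices) - 1
        return self._obs(), reward, terminated, False, {}
```

Even this toy enforces two of the checklist items above: transaction costs in the reward and a look-ahead-free state, both of which are the first things a naive environment gets wrong.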
RL backtesting has unique pitfalls beyond standard strategy backtesting:
Issues:
1. Training on historical data IS the backtest — there is no separate OOS
Solution: Walk-forward training with held-out test periods
2. Agent learns to exploit simulator artifacts
Solution: Test on multiple data sources, add noise to environment
3. Episode boundaries create discontinuities
Solution: Use continuing tasks (no terminal states) for portfolio management
4. Hyperparameter tuning overfits to historical data
Solution: Tune on validation period, test on held-out period (ONCE)
Walk-forward RL backtesting:
For each year Y in [2015, 2016, ..., 2024]:
Train agent on data up to Y-1
Test on year Y (no retraining within test year)
Record: Sharpe, returns, drawdown, turnover
Aggregate: OOS Sharpe, consistency across years
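The walk-forward loop above can be expressed as a small driver function. Here train_fn and eval_fn are placeholders for your own agent training and evaluation code; only the protocol (train strictly on the past, test on one held-out year, never retrain inside it) is the point:

```python
def walk_forward(data_by_year, train_fn, eval_fn, years):
    """Walk-forward protocol: train on everything before year Y, test on Y alone.

    train_fn(history) -> agent and eval_fn(agent, year_data) -> metrics are
    placeholders for real training/evaluation code.
    """
    results = {}
    for y in years:
        history = [d for yr, d in sorted(data_by_year.items()) if yr < y]
        agent = train_fn(history)                     # fit only on past data
        results[y] = eval_fn(agent, data_by_year[y])  # no retraining inside Y
    return results
```

Aggregating the per-year metrics afterwards (OOS Sharpe, consistency across years) is deliberately left outside the loop so the test periods stay independent.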
| Challenge | Mitigation |
|---|---|
| Low signal-to-noise | Use risk-adjusted rewards, not raw returns |
| Non-stationarity | Periodic retraining, meta-learning |
| Limited data | Data augmentation, transfer from simulation |
| Reward hacking | Careful reward design, multiple reward components |
| High dimensionality | Feature selection before RL, PCA |
| Transaction costs | Include in reward, penalize turnover explicitly |
| Sim-to-real gap | Conservative policies, model uncertainty |
Task: Multi-asset portfolio allocation (10 ETFs)
Algorithm: PPO (continuous actions)
State space: 10 assets x 15 features = 150 dimensions + 10 current weights
Action space: 10 continuous weights in [-1, 1], normalized to sum <= 1
Reward: Differential Sharpe ratio - 0.001 * turnover
Network:
Actor: [150+10] -> Dense(256) -> Dense(128) -> Dense(10) -> Softmax (note: softmax produces long-only weights summing to 1; a tanh head with renormalization would be needed to match the [-1, 1] action space above)
Critic: [150+10] -> Dense(256) -> Dense(128) -> Dense(1)
Training:
Episodes: 10,000 (each = 252 trading days)
Learning rate: 3e-4 (Adam)
Gamma: 0.99
PPO clip: 0.2
Entropy bonus: 0.01
Walk-Forward Test Results (2018-2024):
Sharpe: 0.72 (vs 0.85 equal-weight benchmark)
Max DD: -18% (vs -34% benchmark)
Turnover: 45% annual
Comment: Agent learned risk management but underperforms on returns.
Value is in drawdown reduction, not alpha generation.
| Method | Sharpe | Max DD | Turnover | Compute | Interpretability |
|---|---|---|---|---|---|
| Equal Weight | 0.85 | -34% | 5% | None | Full |
| Mean-Variance | 0.62 | -42% | 120% | Minimal | High |
| Risk Parity | 0.78 | -22% | 15% | Minimal | High |
| RL (PPO) | 0.72 | -18% | 45% | 40 hrs | Low |
| RL (DQN) | 0.58 | -28% | 85% | 60 hrs | Low |
Conclusion: RL excels at dynamic risk management but rarely generates
alpha above well-constructed traditional approaches. Compute cost is
significant. Consider RL for execution optimization rather than allocation.
Before deploying an RL trading agent: