name: reinforcement-learning-trading
description: Reinforcement learning for trading — Q-learning, policy gradient, reward shaping for portfolio management. Use when exploring RL-based trading systems.
Trading can be framed as a Markov Decision Process (MDP):

State s_t:
- Market features: prices, returns, volume, volatility, order book
- Portfolio features: current positions, cash, P&L, holding periods
- Account features: margin, risk limits, exposure

Action a_t:
- Discrete: {buy, sell, hold} for each asset
- Continuous: target portfolio weights w_t in [0,1]^N (or [-1,1]^N with shorting)
- Execution: order size, limit price, timing

Reward r_t:
- Simple: portfolio return r_{t+1}
- Risk-adjusted: r_{t+1} - lambda * risk_penalty
- Sharpe-like: (r_{t+1} - r_f) / running_std
- Log utility: log(1 + r_{t+1})

Transition: s_{t+1} = f(s_t, a_t, market_dynamics)
- Without market impact, the market is not influenced by the agent (price-taking assumption)
- With market impact, the transition also depends on order size
Tabular Q-Learning (for simple cases):

Q(s, a) <- Q(s, a) + alpha * [r + gamma * max_a' Q(s', a') - Q(s, a)]

- alpha: learning rate (0.01-0.1)
- gamma: discount factor (0.99 typical for daily trading; discounts future rewards)
- epsilon: exploration rate (start at 1.0, decay to 0.01)

Requires a discretized state space — impractical for high-dimensional financial states.
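The tabular update above fits in a few lines of plain Python. This is a sketch: the state labels (like "up_trend"), the action set, and the coefficient values are illustrative, not part of any real discretization scheme.

```python
import random
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.05, gamma=0.99):
    """One tabular Q-learning step: move Q(s, a) toward the TD target."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def epsilon_greedy(Q, s, actions, epsilon):
    """Explore with probability epsilon, otherwise act greedily."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

Q = defaultdict(float)              # unseen (state, action) pairs default to 0.0
actions = ["buy", "sell", "hold"]
q_update(Q, s="up_trend", a="buy", r=0.01, s_next="flat", actions=actions)
```

Using a `defaultdict` avoids pre-enumerating the state space, but the table still grows with every distinct discretized state visited, which is exactly why this approach breaks down on rich market features.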
Deep Q-Network (DQN):

Replace the Q-table with a neural network: Q(s, a; theta)

Key techniques:
- Experience replay: store (s, a, r, s') transitions in a buffer, sample mini-batches
- Target network: separate network for target Q-values, updated periodically
- Double DQN: use the online network for action selection, the target network for evaluation
- Dueling DQN: separate value and advantage streams

Architecture:
- Input: state features (flattened, or fed through an LSTM for sequences)
- Hidden: 2-3 dense layers (128-256 units)
- Output: one Q-value per discrete action

Limitations for trading:
- Requires discrete actions (buy/sell/hold or discretized position sizes)
- Overestimates Q-values in noisy environments (financial data is very noisy)
- Requires a large experience replay buffer
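The Double DQN target computation can be illustrated with NumPy. Here the two networks' outputs for s' are supplied directly as arrays rather than produced by real networks, so the sketch shows only the select-with-online, evaluate-with-target logic:

```python
import numpy as np

def double_dqn_targets(rewards, q_online_next, q_target_next, gamma=0.99, dones=None):
    """Double DQN targets: pick a' with the online net, evaluate it with the target net."""
    rewards = np.asarray(rewards, dtype=float)
    if dones is None:
        dones = np.zeros_like(rewards)
    best_actions = np.argmax(q_online_next, axis=1)                   # selection: online network
    evaluated = q_target_next[np.arange(len(rewards)), best_actions]  # evaluation: target network
    return rewards + gamma * (1.0 - dones) * evaluated

# Toy batch of one transition: the online net prefers action 1,
# which the (typically more conservative) target net values at 1.5.
q_online = np.array([[1.0, 2.0, 0.5]])
q_target = np.array([[0.9, 1.5, 0.4]])
targets = double_dqn_targets([0.01], q_online, q_target)
```

Decoupling selection from evaluation is what dampens the overestimation bias mentioned above, which matters most in exactly the noisy-return regime financial data lives in.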
Policy gradient methods learn a policy pi(a|s; theta) directly, without estimating a Q-function.

REINFORCE:
- Gradient: nabla_theta J = E[sum_t nabla_theta log(pi(a_t|s_t)) * G_t]
- G_t: cumulative discounted reward from time t
- Simple but high variance; impractical for long trading episodes
Actor-Critic (A2C):
- Actor: pi(a|s; theta_a) — policy network
- Critic: V(s; theta_c) — value network
- Advantage: A_t = r_t + gamma * V(s_{t+1}) - V(s_t)
- Actor update: ascend nabla_theta_a log(pi(a_t|s_t)) * A_t
- Critic update: minimize (r_t + gamma * V(s_{t+1}) - V(s_t))^2
- Lower variance than REINFORCE thanks to the critic baseline
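The one-step advantages and critic targets above can be computed for a whole trajectory at once. This NumPy sketch is framework-agnostic; `values` carries one extra bootstrap entry for the state reached after the last reward:

```python
import numpy as np

def advantages_and_targets(rewards, values, gamma=0.99):
    """One-step advantages A_t = r_t + gamma*V(s_{t+1}) - V(s_t) over a trajectory.

    `values` must have length len(rewards) + 1: it includes the bootstrap
    value of the final state.
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    targets = rewards + gamma * values[1:]   # regression targets for the critic
    adv = targets - values[:-1]              # advantages for the actor update
    return adv, targets

adv, targets = advantages_and_targets([0.01, -0.02], [0.5, 0.4, 0.0])
```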
PPO (Proximal Policy Optimization):
- Clips the policy ratio to prevent destructively large updates:
  L = min(r_t * A_t, clip(r_t, 1-epsilon, 1+epsilon) * A_t)
  r_t = pi_new(a_t|s_t) / pi_old(a_t|s_t) (here r_t denotes the probability ratio, not the reward)
- epsilon: clipping parameter (typically 0.2)
PPO is currently the most popular RL algorithm for trading:
- Stable training
- Works with continuous action spaces
- Handles multi-asset portfolios naturally
- Good sample efficiency relative to other PG methods
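The clipped surrogate objective can be written down directly from the formula above. This NumPy sketch takes log-probabilities, as most implementations do for numerical stability, and returns a loss to minimize:

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """PPO clipped surrogate loss (negated for minimization), from log-probs."""
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))  # pi_new / pi_old
    advantages = np.asarray(advantages, dtype=float)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -np.mean(np.minimum(unclipped, clipped))  # minimizing this ascends the surrogate

# A ratio of 1.25 gets clipped to 1.2 when the advantage is positive.
loss = ppo_clip_loss(np.log([0.5]), np.log([0.4]), [1.0])
```

Taking the elementwise minimum makes the objective pessimistic: the policy gains nothing by pushing the ratio outside the clip band, which is what keeps updates small and training stable.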
Reward design is the most critical and difficult aspect of RL for trading:
Naive reward (portfolio return):
r_t = portfolio_value_t / portfolio_value_{t-1} - 1
Problem: Ignores risk. Agent takes maximum leverage.
Risk-adjusted reward:
r_t = return_t - lambda * return_t^2
Approximates mean-variance utility. lambda controls risk aversion.
Differential Sharpe ratio (Moody & Saffell):
r_t = (B_{t-1} * delta_A_t - 0.5 * A_{t-1} * delta_B_t) / (B_{t-1} - A_{t-1}^2)^{3/2}
Where A_t, B_t are exponential moving averages of return and squared return
Directly optimizes Sharpe ratio in an online manner
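A minimal online implementation of the differential Sharpe reward, following the EMA recursion above. The eta value and the guard that returns 0.0 while the variance estimate is still zero are implementation choices, not part of the original formulation:

```python
class DifferentialSharpe:
    """Online differential Sharpe ratio reward (after Moody & Saffell).

    A and B are exponential moving averages of the return and squared return;
    eta is the EMA decay rate.
    """
    def __init__(self, eta=0.01):
        self.eta, self.A, self.B = eta, 0.0, 0.0

    def update(self, r):
        dA = self.eta * (r - self.A)
        dB = self.eta * (r * r - self.B)
        denom = (self.B - self.A ** 2) ** 1.5   # (B_{t-1} - A_{t-1}^2)^{3/2}
        dsr = 0.0 if denom <= 0 else (self.B * dA - 0.5 * self.A * dB) / denom
        self.A += dA                             # update EMAs only after computing the reward
        self.B += dB
        return dsr

ds = DifferentialSharpe(eta=0.01)
first = ds.update(0.01)   # variance estimate still zero -> reward 0.0
second = ds.update(0.01)  # consistent positive returns -> positive reward
```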
Transaction cost penalty:
r_t = return_t - cost * |w_t - w_{t-1}|
Essential: without this, agent churns the portfolio
Drawdown penalty:
r_t = return_t - lambda_dd * max(0, max_value - current_value)
Penalizes drawdowns to keep them bounded
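In practice these components are combined into one shaped reward. The coefficients below (lam_risk, cost, lam_dd) are illustrative placeholders, not recommended values; they have to be tuned jointly, since each penalty changes what the others are worth:

```python
def shaped_reward(ret, turnover, value, peak,
                  lam_risk=2.0, cost=0.001, lam_dd=0.1):
    """Composite reward: return minus risk, transaction-cost, and drawdown penalties.

    lam_risk, cost, and lam_dd are illustrative placeholders, not tuned values.
    """
    risk_pen = lam_risk * ret ** 2            # mean-variance style penalty
    cost_pen = cost * turnover                # turnover = sum_i |w_t,i - w_{t-1},i|
    dd_pen = lam_dd * max(0.0, peak - value)  # distance below the high-water mark
    return ret - risk_pen - cost_pen - dd_pen

r = shaped_reward(ret=0.01, turnover=0.1, value=99.0, peak=100.0)
```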
The environment is where RL agents train. Its realism determines whether learned policies transfer to live trading.
Minimum viable environment:
- Historical price data (open, high, low, close, volume)
- Transaction costs (spread + commission + slippage model)
- Position limits and leverage constraints
- Cash management and margin requirements
Better environment:
- Order book simulation (level 2 data)
- Market impact model (e.g. square-root impact: kappa * sqrt(volume/ADV))
- Partial fills and queue position
- Corporate actions (splits, dividends, delistings)
- Realistic data delivery (no look-ahead in state construction)
Environment frameworks:
- OpenAI Gym / Gymnasium: Standard interface
- FinRL: Finance-specific RL environment
- Custom: Most flexibility, most development effort
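A minimum viable environment along these lines can be sketched against the Gymnasium-style reset/step interface without depending on the library. This is a single-asset toy with proportional costs and a window of past log-returns as the observation; all parameters are illustrative:

```python
import numpy as np

class MinimalTradingEnv:
    """Gymnasium-style single-asset environment sketch (no library dependency).

    prices: 1-D array of closes; cost: proportional transaction cost per unit
    of position change. Action is a target position in {-1, 0, 1}.
    """
    def __init__(self, prices, cost=0.001, window=5):
        self.prices = np.asarray(prices, dtype=float)
        self.cost, self.window = cost, window

    def reset(self):
        self.t = self.window
        self.position = 0
        return self._obs(), {}

    def _obs(self):
        # Only past `window` log-returns — no look-ahead into prices[t+1:].
        return np.diff(np.log(self.prices[self.t - self.window:self.t + 1]))

    def step(self, action):
        ret = self.prices[self.t + 1] / self.prices[self.t] - 1.0
        reward = action * ret - self.cost * abs(action - self.position)
        self.position = action
        self.t += 1
        terminated = self.t >= len(self.prices) - 1
        return self._obs(), reward, terminated, False, {}
```

Even this toy enforces two of the checklist items above: transaction costs in the reward and a look-ahead-free state, both of which are the first things a naive environment gets wrong.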
RL backtesting has unique pitfalls beyond standard strategy backtesting:
Issues:
1. Training on historical data IS the backtest — there is no separate OOS
Solution: Walk-forward training with held-out test periods
2. Agent learns to exploit simulator artifacts
Solution: Test on multiple data sources, add noise to environment
3. Episode boundaries create discontinuities
Solution: Use continuing tasks (no terminal states) for portfolio management
4. Hyperparameter tuning overfits to historical data
Solution: Tune on validation period, test on held-out period (ONCE)
Walk-forward RL backtesting:
For each year Y in [2015, 2016, ..., 2024]:
Train agent on data up to Y-1
Test on year Y (no retraining within test year)
Record: Sharpe, returns, drawdown, turnover
Aggregate: OOS Sharpe, consistency across years
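The walk-forward loop above can be expressed as a small driver function. Here train_fn and eval_fn are placeholders for your own agent training and evaluation code; only the protocol (train strictly on the past, test on one held-out year, never retrain inside it) is the point:

```python
def walk_forward(data_by_year, train_fn, eval_fn, years):
    """Walk-forward protocol: train on everything before year Y, test on Y alone.

    train_fn(history) -> agent and eval_fn(agent, year_data) -> metrics are
    placeholders for real training/evaluation code.
    """
    results = {}
    for y in years:
        history = [d for yr, d in sorted(data_by_year.items()) if yr < y]
        agent = train_fn(history)                     # fit only on past data
        results[y] = eval_fn(agent, data_by_year[y])  # no retraining inside Y
    return results
```

Aggregating the per-year metrics afterwards (OOS Sharpe, consistency across years) is deliberately left outside the loop so the test periods stay independent.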
| Challenge | Mitigation |
|---|---|
| Low signal-to-noise | Use risk-adjusted rewards, not raw returns |
| Non-stationarity | Periodic retraining, meta-learning |
| Limited data | Data augmentation, transfer from simulation |
| Reward hacking | Careful reward design, multiple reward components |
| High dimensionality | Feature selection before RL, PCA |
| Transaction costs | Include in reward, penalize turnover explicitly |
| Sim-to-real gap | Conservative policies, model uncertainty |
Task: Multi-asset portfolio allocation (10 ETFs)
Algorithm: PPO (continuous actions)
State space: 10 assets x 15 features = 150 dimensions + 10 current weights
Action space: 10 continuous weights in [-1, 1], normalized to sum <= 1
Reward: Differential Sharpe ratio - 0.001 * turnover
Network:
Actor: [150+10] -> Dense(256) -> Dense(128) -> Dense(10) -> Softmax (note: softmax produces long-only weights summing to 1; a tanh head with renormalization would be needed to match the [-1, 1] action space above)
Critic: [150+10] -> Dense(256) -> Dense(128) -> Dense(1)
Training:
Episodes: 10,000 (each = 252 trading days)
Learning rate: 3e-4 (Adam)
Gamma: 0.99
PPO clip: 0.2
Entropy bonus: 0.01
Walk-Forward Test Results (2018-2024):
Sharpe: 0.72 (vs 0.85 equal-weight benchmark)
Max DD: -18% (vs -34% benchmark)
Turnover: 45% annual
Comment: Agent learned risk management but underperforms on returns.
Value is in drawdown reduction, not alpha generation.
| Method | Sharpe | Max DD | Turnover | Compute | Interpretability |
|---|---|---|---|---|---|
| Equal Weight | 0.85 | -34% | 5% | None | Full |
| Mean-Variance | 0.62 | -42% | 120% | Minimal | High |
| Risk Parity | 0.78 | -22% | 15% | Minimal | High |
| RL (PPO) | 0.72 | -18% | 45% | 40 hrs | Low |
| RL (DQN) | 0.58 | -28% | 85% | 60 hrs | Low |
Conclusion: RL excels at dynamic risk management but rarely generates
alpha above well-constructed traditional approaches. Compute cost is
significant. Consider RL for execution optimization rather than allocation.
Before deploying an RL trading agent: