---
name: deep-learning-trading
description: Deep learning for trading — LSTM, transformers, sequence models for price prediction. Use when applying neural networks to financial data.
---
Deep learning excels when:
- Datasets are large (roughly 50K+ samples: intraday data, long histories, or wide cross-sections)
- Inputs are high-dimensional (many features, order book data, text/sentiment signals)
- The signal involves complex nonlinear or cross-asset interactions
- The pipeline already supports GPU training and neural network ops

Deep learning is often NOT worth it when:
- Samples are scarce (a few thousand daily observations per asset)
- Gradient-boosted trees already match performance at a fraction of the compute
- Interpretability is a hard requirement
- The team cannot support the added training and deployment complexity
| Architecture | Strengths | Weaknesses | Best For |
|---|---|---|---|
| LSTM/GRU | Sequential dependencies, variable length | Slow training, gradient issues | Mid-frequency time series |
| TCN | Parallel training, fixed receptive field | Fixed context window | Regular-frequency data |
| Transformer | Long-range dependencies, attention | Data hungry, positional encoding needed | Large datasets, multi-asset |
| CNN-1D | Fast, local pattern detection | Fixed kernel size | Technical pattern recognition |
| Autoencoder | Unsupervised feature learning | No direct prediction | Feature extraction, anomaly detection |
Long Short-Term Memory networks maintain a cell state that can capture long-range dependencies:
Architecture for return prediction:
```
Input:   (batch_size, sequence_length, n_features)
Layer 1: LSTM(hidden_size=64, return_sequences=True)
Layer 2: Dropout(0.3)
Layer 3: LSTM(hidden_size=32, return_sequences=False)
Layer 4: Dropout(0.3)
Layer 5: Dense(16, activation='relu')
Layer 6: Dense(1, activation='linear')     [regression]
     or: Dense(3, activation='softmax')    [classification: down/flat/up]
```
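A minimal PyTorch sketch of this stack (layer sizes follow the spec above; the class and variable names are illustrative, not part of any library):

```python
import torch
import torch.nn as nn

class ReturnLSTM(nn.Module):
    """Two-layer LSTM for return prediction; sizes match the spec above."""
    def __init__(self, n_features: int, dropout: float = 0.3):
        super().__init__()
        self.lstm1 = nn.LSTM(n_features, 64, batch_first=True)  # return_sequences=True
        self.lstm2 = nn.LSTM(64, 32, batch_first=True)
        self.drop = nn.Dropout(dropout)
        self.head = nn.Sequential(nn.Linear(32, 16), nn.ReLU(), nn.Linear(16, 1))

    def forward(self, x):                             # x: (batch, seq_len, n_features)
        out, _ = self.lstm1(x)                        # (batch, seq_len, 64)
        out, _ = self.lstm2(self.drop(out))           # (batch, seq_len, 32)
        return self.head(self.drop(out[:, -1, :]))    # last step only -> (batch, 1)

model = ReturnLSTM(n_features=35)
pred = model(torch.randn(8, 40, 35))  # 8 sequences of 40 days x 35 features
```

Swap the final `nn.Linear(16, 1)` for `nn.Linear(16, 3)` (with cross-entropy loss) for the down/flat/up classification variant.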
Key hyperparameters:
- sequence_length: 20-60 for daily data (lookback window)
- hidden_size: 32-128 (larger = more capacity, more overfit risk)
- num_layers: 1-3 (deeper rarely helps for finance)
- dropout: 0.2-0.5 (critical for regularization)
- learning_rate: 1e-4 to 1e-3 (use scheduler with decay)
- batch_size: 32-256
GRU (Gated Recurrent Unit): Simpler than LSTM (2 gates vs 3), fewer parameters. Often performs comparably. Preferred when data is limited.
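The parameter saving is easy to verify: a GRU layer carries three weight blocks against the LSTM's four, so exactly 25% fewer parameters at the same hidden size (PyTorch sketch, illustrative sizes):

```python
import torch.nn as nn

def n_params(m: nn.Module) -> int:
    """Total trainable parameter count of a module."""
    return sum(p.numel() for p in m.parameters())

lstm = nn.LSTM(input_size=35, hidden_size=64)
gru = nn.GRU(input_size=35, hidden_size=64)
# GRU packs 3 weight blocks per layer vs the LSTM's 4 -> 25% fewer parameters
```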
Training tips for financial LSTM:
- Clip gradients (e.g. max norm 1.0) to control exploding gradients
- Use early stopping on a chronologically later validation set; never shuffle samples across the train/validation boundary
- Normalize features with rolling statistics computed from the training window only
- Prefer the smallest model that fits; dropout and weight decay matter more than depth
Temporal Convolutional Networks (TCNs) use dilated causal convolutions that process sequences in parallel:
Architecture:
```
Input: (batch_size, n_features, sequence_length)
Residual Block 1: Conv1D(filters=64, kernel=3, dilation=1) + ReLU + Dropout
Residual Block 2: Conv1D(filters=64, kernel=3, dilation=2)
Residual Block 3: Conv1D(filters=64, kernel=3, dilation=4)
Residual Block 4: Conv1D(filters=64, kernel=3, dilation=8)
Global Average Pooling
Dense(1)
```
Receptive field = sum of dilations * (kernel_size - 1) + 1
With 4 blocks, dilation [1,2,4,8], kernel=3: receptive field = 31
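The receptive-field formula can be checked in a few lines (helper name is illustrative):

```python
def tcn_receptive_field(dilations, kernel_size):
    """Receptive field of stacked dilated causal convs (one conv per block)."""
    return sum(d * (kernel_size - 1) for d in dilations) + 1

# dilations [1, 2, 4, 8], kernel 3: (1+2+4+8) * 2 + 1 = 31
print(tcn_receptive_field([1, 2, 4, 8], 3))  # 31
```

Adding a fifth block with dilation 16 would roughly double the window to 63 steps.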
Advantages over LSTM:
- Parallelizable (much faster training)
- Fixed receptive field (predictable memory)
- No vanishing gradient problem
- Often matches LSTM performance on financial data
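A minimal causal dilated convolution, the building block of the residual blocks above, might look like this in PyTorch (an illustrative sketch; the comparison at the end demonstrates that no future information leaks backward):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """Dilated 1-D convolution padded on the left only, so output[t]
    depends solely on inputs at times <= t."""
    def __init__(self, in_ch, out_ch, kernel_size, dilation):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

    def forward(self, x):                           # x: (batch, channels, seq_len)
        return self.conv(F.pad(x, (self.left_pad, 0)))  # pad the past side only

torch.manual_seed(0)
block = CausalConv1d(4, 8, kernel_size=3, dilation=2)
a = torch.randn(1, 4, 20)
b = a.clone()
b[:, :, -1] += 1.0                 # perturb only the final time step
ya, yb = block(a), block(b)
# every output before the last step is unchanged: causality holds
```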
The self-attention mechanism captures relationships between any two positions in the sequence:
Architecture:
```
Input Embedding:     Linear(n_features, d_model=64)
Positional Encoding: Learnable or sinusoidal
Transformer Encoder (N=2-4 layers):
    Multi-Head Attention(heads=4, d_model=64)
    Feed-Forward(d_ff=256, dropout=0.1)
    Layer Normalization
Output Head:
    Global Average Pooling or [CLS] token
    Dense(1)
```
Financial adaptations:
- Use CAUSAL attention mask (prevent looking into the future)
- Relative positional encoding (handles variable-length sequences better)
- Small model (d_model=32-128) — financial data cannot support GPT-scale models
- Cross-asset attention: attend across assets at each time step
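The causal mask mentioned above reduces to a single upper-triangular boolean matrix (numpy sketch):

```python
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    """True marks positions that must NOT be attended to (the future)."""
    return np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

mask = causal_mask(4)  # row t may attend to columns 0..t only
```

Frameworks differ on mask polarity (True = blocked vs True = allowed), so check the convention of the attention implementation before passing it in.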
Attention mechanism insight: Attention weights reveal which past time steps the model considers most important for prediction. This provides limited interpretability — inspect attention maps to verify the model is not attending to noise.
Non-stationarity mitigation:
1. Use returns, not prices (first-order stationarity)
2. Z-score features using ROLLING statistics (e.g., 252-day rolling mean/std)
3. Retrain periodically (quarterly or monthly) with expanding or rolling window
4. Use domain-specific normalization (e.g., normalize volume by ADV, volatility by long-run average)
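The rolling z-score in step 2 can be written with pandas so that each point is normalized against trailing statistics only (a sketch; window length and series are illustrative):

```python
import numpy as np
import pandas as pd

def rolling_zscore(s: pd.Series, window: int = 252) -> pd.Series:
    """Z-score each observation against trailing stats only (no look-ahead)."""
    mean = s.rolling(window, min_periods=window).mean()
    std = s.rolling(window, min_periods=window).std()
    return (s - mean) / std

s = pd.Series(np.arange(300.0))  # toy feature series
z = rolling_zscore(s)            # first 251 values are NaN by construction
```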
Handling class imbalance (for direction prediction):
- Weight the loss inversely to class frequency (the flat class usually dominates)
- Tune the return threshold that defines the flat class so the classes are roughly balanced
- Tune decision thresholds on the validation set rather than taking the raw argmax
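Inverse-frequency class weights, one common remedy, can be computed directly (numpy sketch; label counts are illustrative):

```python
import numpy as np

def inverse_freq_weights(labels: np.ndarray, n_classes: int = 3) -> np.ndarray:
    """weight_c = N / (K * count_c): rare classes get larger loss weight."""
    counts = np.bincount(labels, minlength=n_classes)
    return len(labels) / (n_classes * counts)

y = np.array([0] * 10 + [1] * 80 + [2] * 10)  # down / flat / up, flat dominates
w = inverse_freq_weights(y)                    # pass to a weighted cross-entropy loss
```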
Data augmentation (limited options in finance):
- Add small Gaussian noise to input features
- Block bootstrap: resample contiguous windows so autocorrelation is preserved
- Train across many assets, treating each asset's sequences as additional samples
- Avoid transforms that break temporal structure (random shuffling, time reversal)
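A block bootstrap, one of the few augmentations that respects autocorrelation, might look like this (numpy sketch; block sizes are illustrative):

```python
import numpy as np

def block_bootstrap(x: np.ndarray, block_len: int, n_blocks: int, rng=None):
    """Resample contiguous blocks so short-range autocorrelation is preserved."""
    rng = rng or np.random.default_rng(0)
    starts = rng.integers(0, len(x) - block_len + 1, size=n_blocks)
    return np.concatenate([x[s:s + block_len] for s in starts])

series = np.arange(100.0)  # toy return series
sample = block_bootstrap(series, block_len=10, n_blocks=5)
```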
Pre-train on a large related dataset, fine-tune on the target:
Approaches:
1. Cross-asset transfer: Pre-train on liquid assets (SPY, QQQ), fine-tune on illiquid
2. Cross-market transfer: Pre-train on US equities, fine-tune on EM equities
3. Cross-frequency: Pre-train on high-frequency, fine-tune on daily
4. Foundation models: Pre-train on broad market data, fine-tune for specific tasks
Practical considerations:
- Freeze early layers (general feature extraction), fine-tune later layers
- Use smaller learning rate for fine-tuning (1/10 of pre-training LR)
- Limited evidence of large transfer benefits in finance (unlike NLP/vision)
- Most useful when target dataset is very small
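The freezing and learning-rate advice can be sketched in PyTorch (a toy two-stage model; the layer split and rates are illustrative):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(35, 64), nn.ReLU(),  # "early" layers: general feature extraction
    nn.Linear(64, 1),              # "late" head: task-specific
)
for p in model[0].parameters():    # freeze the pre-trained early layers
    p.requires_grad = False
# fine-tune only what remains, at 1/10 of the pre-training rate (1e-4 vs 1e-3)
opt = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
```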
Attention analysis:
- Plot attention weights per head across the lookback window; verify mass concentrates on plausible time steps rather than spreading uniformly over noise

Gradient-based methods:
- Saliency (gradient of the output with respect to each input) and integrated gradients give per-feature, per-time-step attributions

SHAP for neural networks:
- Use DeepSHAP or gradient-based SHAP variants over a background sample; compare the resulting feature ranking against a tree-model SHAP ranking as a sanity check
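As a minimal illustration of the gradient-based approach: for a linear model the input gradient recovers the weight vector exactly, which makes it a useful sanity check before applying it to a real network (PyTorch sketch, illustrative sizes):

```python
import torch
import torch.nn as nn

model = nn.Linear(5, 1)                    # stand-in for a trained network
x = torch.randn(3, 5, requires_grad=True)  # 3 samples, 5 features
model(x).sum().backward()                  # gradient of output w.r.t. each input
saliency = x.grad.abs()                    # per-sample, per-feature sensitivity
# for a linear model, every row of x.grad equals the weight vector
```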
```
Model: 2-Layer LSTM (64/32 hidden units)
Task: Next-day return sign prediction
Universe: S&P 500 constituents
Features: 35 (technical + fundamental + sentiment)
Sequence Length: 40 trading days
Training: 2012-2019 | Validation: 2020-2021 | Test: 2022-2024
GPU: NVIDIA A100 | Training time: 4.5 hours

--- Test Performance ---
AUC: 0.534
Directional Acc: 52.8%
IC (rank corr): 0.028
Top-Bottom Decile: 4.2% annualized (gross)

--- Baselines ---
LightGBM: AUC 0.531, IC 0.025
Linear Regression: AUC 0.519, IC 0.018
Buy-and-Hold: N/A (benchmark)

--- Resource vs Performance ---
LSTM training: 4.5 hours, IC = 0.028
LightGBM training: 12 minutes, IC = 0.025
Marginal IC gain: +0.003 for 22x compute cost

Verdict: Marginal improvement over tree models. DL justified only
if compute is cheap and the pipeline supports neural network ops.
```
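The IC figures above are Spearman rank correlations between predicted and realized returns; a minimal numpy sketch of the computation (assuming no tied values):

```python
import numpy as np

def rank_ic(pred: np.ndarray, actual: np.ndarray) -> float:
    """Spearman rank correlation (no ties): Pearson correlation of the ranks."""
    rp = np.argsort(np.argsort(pred))
    ra = np.argsort(np.argsort(actual))
    return float(np.corrcoef(rp, ra)[0, 1])

pred = np.array([0.1, 0.3, 0.2, 0.5])
actual = np.array([1.0, 3.0, 2.0, 4.0])  # same ordering as pred
```

With ties, use an average-rank scheme (e.g. scipy's `spearmanr`) instead of this bare argsort trick.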
| Data Size | Frequency | Features | Recommended Architecture |
|---|---|---|---|
| < 5K samples | Daily | < 20 | DO NOT use deep learning. Use trees. |
| 5K-50K | Daily | 20-100 | LSTM or TCN (small, heavy regularization) |
| 50K-500K | Daily/Hour | 50-200 | TCN or Transformer (medium) |
| > 500K | Intraday | 100+ | Transformer (can scale) |
| Order book data | Tick | LOB | TCN or CNN-1D (fast inference needed) |
| Multi-asset | Any | Per-asset | Transformer with cross-asset attention |
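The sample-count column of the table can be encoded as a simple helper (a heuristic sketch; the function name and strings are illustrative):

```python
def recommend_architecture(n_samples: int) -> str:
    """Heuristic from the sample-count column of the selection table."""
    if n_samples < 5_000:
        return "gradient-boosted trees (no deep learning)"
    if n_samples < 50_000:
        return "small LSTM or TCN, heavy regularization"
    if n_samples < 500_000:
        return "medium TCN or Transformer"
    return "Transformer"
```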
Before deploying a deep learning model for trading:
- The model beats a tuned tree baseline (e.g. LightGBM) net of transaction costs, not just gross
- Performance was measured on a strictly out-of-sample, chronologically later test period
- Production feature normalization uses the same rolling statistics as training (no look-ahead)
- A retraining cadence (monthly or quarterly) and a drift-monitoring plan are defined
- Inference latency fits the trading frequency, and a simpler fallback model exists