From everything-claude-trading
- Setting up Jupyter/research notebook workflows for quantitative research
Install: npx claudepluginhub brainbytes-dev/everything-claude-trading
Slash command: /everything-claude-trading:research-notebooks
Standard Notebook Template:
1. Title and Hypothesis
- Clear statement of what is being tested
- Expected outcome and success criteria defined upfront
2. Setup
- Import statements
- Configuration (date ranges, tickers, parameters)
- Random seed setting
- Data source specification
3. Data Loading and Validation
- Load data with explicit source attribution
- Data quality checks (missing values, outliers, splits/dividends)
- Summary statistics and date range confirmation
4. Exploratory Analysis
- Descriptive statistics
- Distribution plots
- Correlation analysis
- Regime identification
5. Signal Construction
- Feature engineering
- Signal generation
- Signal analysis (distribution, autocorrelation, decay)
6. Backtest Execution
- Strategy logic implementation
- Transaction cost model
- Results computation
7. Evaluation
- Performance metrics (Sharpe, drawdown, hit rate)
- Walk-forward or cross-validation results
- Statistical significance tests
8. Conclusions
- Does the evidence support the hypothesis?
- What are the limitations?
- Next steps for research
Anti-Patterns to Avoid:
- Running cells out of order (breaks reproducibility)
- Modifying cells after seeing results (hidden selection bias)
- No markdown documentation between code cells
- Giant notebooks (>500 lines) that try to do everything
- Hardcoded file paths or credentials
- No version control for notebook state
- Presenting only successful experiments (survivorship bias in research)
Random Seed Management:
# Set seeds at the TOP of every notebook
import numpy as np
import random
SEED = 42
np.random.seed(SEED)
random.seed(SEED)
# If using PyTorch/TensorFlow:
# torch.manual_seed(SEED)
# torch.cuda.manual_seed_all(SEED)
# If using sklearn with randomized components:
# Pass random_state=SEED to all estimators
# Document why this seed was chosen (or that it is arbitrary)
# Test with 2-3 different seeds to ensure results are not seed-dependent
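The seed-sensitivity check in the last comment can be sketched as follows. This is a minimal illustration using only the standard library; `run_experiment` is a hypothetical stand-in for whatever stochastic step the notebook contains (bootstrap, model fit, Monte Carlo), and the simulated returns inside it are placeholder noise, not real data.

```python
import random
import statistics


def run_experiment(seed: int) -> float:
    """Hypothetical stand-in for a stochastic research step; returns one metric."""
    rng = random.Random(seed)  # local generator, so no hidden global state
    returns = [rng.gauss(0.0005, 0.01) for _ in range(252)]
    # Annualized Sharpe-style ratio of the simulated daily returns.
    return statistics.mean(returns) / statistics.stdev(returns) * 252 ** 0.5


def seed_sensitivity(seeds):
    """Run the experiment once per seed and summarize how much the metric moves."""
    results = {seed: run_experiment(seed) for seed in seeds}
    values = list(results.values())
    return {
        "results": results,
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
        "range": max(values) - min(values),
    }


summary = seed_sensitivity([42, 7, 2024])
for seed, value in summary["results"].items():
    print(f"seed={seed}: metric={value:.2f}")
print(f"spread across seeds (max - min): {summary['range']:.2f}")
```

If the spread across seeds is comparable to the effect being claimed, the result is probably seed-dependent and should be reported as such.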
Data Versioning:
Problem: research conducted on a dataset that gets updated/corrected later
cannot be reproduced with the original data.
Solutions:
1. Snapshot data with timestamp at download
- Store as: data/prices_20240115_snapshot.parquet
- Record data source URL and download time
2. Use data versioning tools
- DVC (Data Version Control): git-like versioning for large files
- Quilt: versioned data packages
- LakeFS: git for data lakes
3. Checksum verification
- Compute and store SHA256 hash of input data
- Verify hash at notebook start: if mismatch, warn researcher
4. Point-in-time databases
- Use as-of queries: "what was known on date X?"
- Prevents lookahead bias in fundamental data
- Sources: Quandl/Nasdaq Data Link, Bloomberg point-in-time
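Checksum verification (solution 3) might look like the sketch below. The function names and the temporary sample file are illustrative assumptions; in a real notebook the expected hash would come from a stored record such as a checksums file.

```python
import hashlib
import tempfile
import warnings
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large datasets do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path, expected_hex):
    """Return True if the file matches the recorded hash; warn and return False otherwise."""
    actual = sha256_of_file(path)
    if actual != expected_hex:
        warnings.warn(
            f"Checksum mismatch for {path}: expected {expected_hex[:12]}..., "
            f"got {actual[:12]}... (input data may have changed)"
        )
        return False
    return True


# Demonstration on a throwaway file (placeholder contents, not real prices).
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "prices_snapshot.bin"
    sample.write_bytes(b"date,close\n2024-01-15,100.0\n")
    recorded = sha256_of_file(sample)  # store this alongside the snapshot
    print("verified:", verify_checksum(sample, recorded))
```

Warning rather than raising matches the "warn researcher" wording above; a stricter project could raise an exception instead.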
Environment Management:
Document the exact environment:
- Python version
- Package versions (pip freeze > requirements.txt)
- OS and hardware (affects floating point reproducibility)
Best practice:
- requirements.txt or environment.yml per project
- Docker containers for critical research
- Virtual environments (venv, conda) per project — never use system Python
Notebook header should include:
import sys
import numpy as np
import pandas as pd
print(f"Python: {sys.version}")
print(f"NumPy: {np.__version__}")
print(f"Pandas: {pd.__version__}")
# etc. (add every library the analysis depends on)
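As an alternative to printing versions one by one, the environment can be captured as a single record and saved next to the results. A minimal sketch, assuming the package list is supplied by the researcher; `capture_environment` is a hypothetical helper name, and packages that are not installed are recorded as `None` rather than raising.

```python
import json
import platform
from importlib import metadata


def capture_environment(packages):
    """Collect interpreter, OS, and package versions into one dictionary."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "machine": platform.machine(),
        "packages": versions,
    }


env = capture_environment(["numpy", "pandas", "matplotlib"])
print(json.dumps(env, indent=2))
```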
Mandatory Charts for Strategy Research:
1. Equity Curve
- Cumulative returns (log scale recommended)
- Include benchmark for comparison
- Shade drawdown periods
- Mark regime changes (if applicable)
2. Drawdown Chart
- Underwater curve (% from peak)
- Duration of each drawdown annotated
- Horizontal lines at key thresholds (-10%, -20%)
3. Rolling Performance
- Rolling 12-month Sharpe ratio
- Rolling 12-month return
- Shows performance stability over time
4. Return Distribution
- Histogram with normal overlay
- QQ plot to assess tail behavior
- Skewness and kurtosis annotated
5. Monthly Return Heatmap
- Years on Y-axis, months on X-axis
- Color-coded returns
- Shows seasonality and year-by-year consistency
6. Signal Analysis
- Signal distribution (histogram)
- Signal autocorrelation (ACF/PACF)
- Signal vs forward return scatter plot
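The data behind the drawdown chart (underwater curve plus per-episode duration) can be computed before any plotting. The sketch below is one way to do it in plain Python; the sample returns are made-up numbers for illustration, and durations are expressed as index positions rather than dates.

```python
def underwater_curve(returns):
    """Drawdown from the running peak for each period (0.0 at new highs)."""
    equity, peak, curve = 1.0, 1.0, []
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        curve.append(equity / peak - 1.0)
    return curve


def drawdown_episodes(curve):
    """List (start_index, end_index, depth) for each stretch spent below the peak."""
    episodes, start = [], None
    for i, dd in enumerate(curve):
        if dd < 0 and start is None:
            start = i
        elif dd == 0 and start is not None:
            episodes.append((start, i - 1, min(curve[start:i])))
            start = None
    if start is not None:  # still under water at the end of the sample
        episodes.append((start, len(curve) - 1, min(curve[start:])))
    return episodes


# Illustrative returns only (not from any real strategy).
returns = [0.10, -0.05, -0.05, 0.12, 0.02, -0.10]
curve = underwater_curve(returns)
episodes = drawdown_episodes(curve)
for start, end, depth in episodes:
    print(f"periods {start}-{end}: depth {depth:.2%}, duration {end - start + 1}")
```

The episode list supplies the duration annotations, and `curve` is the series to plot against the -10% and -20% threshold lines.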
Chart Formatting Standards:
- Title: descriptive, includes asset and date range
- Axis labels: always labeled with units
- Legend: always present if multiple series
- Date format: YYYY-MM-DD on x-axis
- Font size: minimum 12pt for readability
- Color palette: colorblind-friendly (viridis, Set2)
- Grid: light gray, aids reading values
- Figure size: minimum 10x6 inches for presentations
- Save as both PNG (presentations) and SVG/PDF (papers)
- DPI: 150 for screen, 300 for print
Metrics Reporting Standard:
Performance Summary:
- CAGR (annualized return)
- Volatility (annualized)
- Sharpe Ratio (annualized, excess return / volatility)
- Sortino Ratio (downside deviation only)
- Maximum Drawdown (% from peak)
- Maximum Drawdown Duration (days/months)
- Calmar Ratio (CAGR / max drawdown)
- Win Rate (% of positive return periods)
- Profit Factor (gross profit / gross loss)
- Average Win / Average Loss ratio
- Number of trades
- Average holding period
Always report:
- Time period (start date, end date)
- Frequency (daily, weekly, monthly)
- Benchmark comparison
- Transaction cost assumptions
- Whether returns are gross or net
- Whether returns are aggregated arithmetically (summed) or geometrically (compounded)
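Several of the return-based metrics above can be computed from a list of per-period returns, as in this sketch. It assumes simple (not log) returns, geometric compounding for CAGR, a constant annual risk-free rate, and downside deviation measured against that rate; win rate is the share of positive periods, as defined above. Trade count and holding period need trade-level data and are not covered. The function name and defaults are illustrative.

```python
import math
import statistics


def performance_summary(returns, periods_per_year=252, risk_free_rate=0.0):
    """Summarize simple per-period returns (net or gross, as supplied)."""
    n = len(returns)
    if n < 2:
        raise ValueError("need at least two return observations")
    annualizer = math.sqrt(periods_per_year)
    excess = [r - risk_free_rate / periods_per_year for r in returns]

    # Geometric (compounded) growth, annualized to CAGR.
    growth = math.prod(1.0 + r for r in returns)
    cagr = growth ** (periods_per_year / n) - 1.0

    stdev = statistics.stdev(excess)
    sharpe = statistics.mean(excess) / stdev * annualizer if stdev > 0 else math.nan

    # Downside deviation only counts periods below the risk-free rate.
    downside = math.sqrt(sum(min(e, 0.0) ** 2 for e in excess) / n)
    sortino = statistics.mean(excess) / downside * annualizer if downside > 0 else math.inf

    # Maximum drawdown: most negative decline from the running peak.
    equity = peak = 1.0
    max_drawdown = 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        max_drawdown = min(max_drawdown, equity / peak - 1.0)

    gains = sum(r for r in returns if r > 0)
    losses = -sum(r for r in returns if r < 0)
    return {
        "cagr": cagr,
        "volatility": statistics.stdev(returns) * annualizer,
        "sharpe": sharpe,
        "sortino": sortino,
        "max_drawdown": max_drawdown,
        "calmar": cagr / abs(max_drawdown) if max_drawdown < 0 else math.inf,
        "win_rate": sum(1 for r in returns if r > 0) / n,
        "profit_factor": gains / losses if losses > 0 else math.inf,
    }


# Illustrative monthly returns only.
stats = performance_summary([0.01, -0.02, 0.03, -0.01, 0.02], periods_per_year=12)
for name, value in stats.items():
    print(f"{name}: {value:.4f}")
```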
Research Log:
Maintain a running log of ALL experiments, not just successful ones:
Date: 2024-01-15
Hypothesis: RSI(14) mean reversion on SPY works on 5-min bars
Result: Sharpe 0.3, not significant after costs
Notes: Transaction costs dominate at this frequency
Next: Test on daily bars or reduce trading frequency
Date: 2024-01-18
Hypothesis: RSI(14) mean reversion on SPY daily bars
Result: Sharpe 1.1, passes walk-forward with walk-forward efficiency (WFE) = 0.65
Notes: Promising, but performance concentrated in high-vol regimes
Next: Add vol regime filter, test on other ETFs
Tracking failed experiments prevents:
- Re-testing the same idea months later
- Underestimating the number of trials when computing the Deflated Sharpe Ratio (DSR)
- Team members duplicating failed research
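One way to keep the log consistent is to append entries through a small helper instead of typing them by hand. The sketch below writes entries in the same five-field layout shown above and counts them, which gives the trial count directly. The class and function names are hypothetical, and the demo writes to a temporary directory rather than a real project.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Experiment:
    date: str
    hypothesis: str
    result: str
    notes: str
    next_step: str

    def to_text(self) -> str:
        """Render the entry in the log layout used in this document."""
        return (
            f"Date: {self.date}\n"
            f"Hypothesis: {self.hypothesis}\n"
            f"Result: {self.result}\n"
            f"Notes: {self.notes}\n"
            f"Next: {self.next_step}\n"
        )


def append_experiment(log_path: Path, experiment: Experiment) -> None:
    """Append one entry; earlier entries are never rewritten."""
    with open(log_path, "a", encoding="utf-8") as handle:
        handle.write(experiment.to_text())


def count_trials(log_path: Path) -> int:
    """Count logged experiments, successful or not."""
    with open(log_path, encoding="utf-8") as handle:
        return sum(1 for line in handle if line.startswith("Date: "))


with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "research_log.md"
    append_experiment(log_path, Experiment(
        "2024-01-15", "RSI(14) mean reversion on SPY works on 5-min bars",
        "Sharpe 0.3, not significant after costs",
        "Transaction costs dominate at this frequency",
        "Test on daily bars or reduce trading frequency"))
    append_experiment(log_path, Experiment(
        "2024-01-18", "RSI(14) mean reversion on SPY daily bars",
        "Sharpe 1.1, passes walk-forward at WFE=0.65",
        "Promising, but performance concentrated in high-vol regimes",
        "Add vol regime filter, test on other ETFs"))
    n_trials = count_trials(log_path)

print(f"Trials logged: {n_trials}")  # prints "Trials logged: 2"
```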
Git Best Practices:
Problem: Jupyter notebooks are JSON files with embedded output
- Diffs are unreadable (base64-encoded image data, execution counts)
- Merge conflicts are a nightmare to resolve
- Output bloats repository size
Solutions:
1. nbstripout: automatically strip output on commit
pip install nbstripout
nbstripout --install # sets up git filter
Only source code is committed; outputs regenerated on execution
2. Jupytext: pair notebooks with plain Python scripts
Sync .ipynb with .py (percent format)
Version control the .py file
.ipynb in .gitignore (regenerated from .py)
3. Papermill: parameterize notebooks for batch execution
Execute notebooks with different parameters programmatically
Store results separately from notebook code
4. Review process:
- Review .py diffs (readable, standard code review)
- Re-run notebook to verify outputs match claims
- CI/CD pipeline can auto-run notebooks on merge
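For reference, a notebook paired through Jupytext in percent format is an ordinary Python file in which `# %%` comments mark cell boundaries. The sketch below shows roughly what the version-controlled `.py` file might contain; the variable names and the helper function are illustrative. The cell tagged `parameters` is the one Papermill uses as the injection point for overrides.

```python
# %% [markdown]
# # RSI mean reversion on SPY
# Hypothesis and success criteria go here, before any results are seen.

# %% tags=["parameters"]
# Default values; Papermill injects overrides in a new cell after this one.
ticker = "SPY"
rsi_window = 14

# %%
def describe_run(ticker: str, rsi_window: int) -> str:
    """Build a label for this run so outputs can be told apart."""
    return f"{ticker} RSI({rsi_window})"


print(describe_run(ticker, rsi_window))
```

Pairing is typically set up once with `jupytext --set-formats ipynb,py:percent notebook.ipynb`, and a parameterized run looks like `papermill notebook.ipynb output.ipynb -p rsi_window 21` (file names here are placeholders).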
Pipeline Stages:
Stage 1: Data Pipeline
Input: raw data sources (APIs, databases, files)
Process: download, clean, validate, store
Output: clean, versioned datasets
Cadence: daily or on-demand
Stage 2: Feature Pipeline
Input: clean data
Process: compute features, signals, indicators
Output: feature matrix with timestamps
Cadence: daily, synced with data pipeline
Stage 3: Research Pipeline
Input: feature matrix
Process: hypothesis testing, backtesting, walk-forward
Output: research notebooks, performance reports
Cadence: ad-hoc (researcher-driven)
Stage 4: Deployment Pipeline
Input: validated strategy code
Process: convert notebook logic to production code
Output: production-ready strategy module
Cadence: on strategy approval
Separation of concerns:
- Data pipeline team maintains data quality
- Research uses data pipeline outputs (never raw data directly)
- Production code is NEVER a Jupyter notebook
- Notebook insights are translated into clean, tested Python modules
Notebook to Production Translation:
Research notebook: exploratory, messy, visual, single-use
Production code: clean, tested, modular, runs daily
Translation checklist:
[ ] Extract strategy logic into functions/classes
[ ] Remove all hardcoded values (use config files)
[ ] Add type hints and docstrings
[ ] Write unit tests for each function
[ ] Add error handling and logging
[ ] Remove visualization code (separate monitoring dashboard)
[ ] Add input validation (handle missing data, wrong types)
[ ] Performance test (can it run within time constraints?)
[ ] Code review by someone who did NOT write the notebook
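To make the checklist concrete, here is a sketch of what notebook signal logic might look like after translation: typed, documented, validated, and logged. The momentum definition (trailing total return over `lookback` periods) and both function names are assumptions chosen for illustration, not a prescribed implementation.

```python
import logging
import math
from typing import List, Mapping, Sequence

logger = logging.getLogger(__name__)


def momentum_signal(prices: Sequence[float], lookback: int) -> float:
    """Trailing total return over `lookback` periods.

    Raises ValueError on insufficient history or invalid prices, so bad
    inputs fail loudly instead of silently producing a signal.
    """
    if lookback < 1:
        raise ValueError("lookback must be a positive integer")
    if len(prices) <= lookback:
        raise ValueError(f"need more than {lookback} prices, got {len(prices)}")
    start, end = prices[-lookback - 1], prices[-1]
    for value in (start, end):
        if not isinstance(value, (int, float)) or math.isnan(value) or value <= 0:
            raise ValueError(f"invalid price: {value!r}")
    return end / start - 1.0


def rank_universe(price_history: Mapping[str, Sequence[float]], lookback: int) -> List[str]:
    """Rank tickers from strongest to weakest momentum, skipping bad inputs."""
    signals = {}
    for ticker, prices in price_history.items():
        try:
            signals[ticker] = momentum_signal(prices, lookback)
        except ValueError as exc:
            logger.warning("Skipping %s: %s", ticker, exc)
    return sorted(signals, key=signals.get, reverse=True)


# Placeholder tickers and prices, for illustration only.
history = {"AAA": [100, 105, 121], "BBB": [50, 49, 51], "CCC": [10.0]}
print(rank_universe(history, lookback=2))
```

Each function is small enough to unit test in isolation, and the lookback arrives as an argument so it can live in a config file rather than in the code.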
Before sharing results:
[ ] All cells run top-to-bottom without error (restart kernel and run all)
[ ] Random seeds set and documented
[ ] Data version/snapshot documented
[ ] All parameter choices justified (not arbitrary)
[ ] Transaction costs included and realistic
[ ] No lookahead bias (verified data alignment)
[ ] Statistical significance assessed (not just point estimates)
[ ] Benchmark comparison included
[ ] Limitations explicitly stated
[ ] Failed experiments documented in research log
[ ] Notebook is under 300 lines of code (split if larger)
project-momentum-etfs/
  README.md                      # Project description and hypothesis
  requirements.txt               # Exact package versions
  config.yaml                    # Parameters, date ranges, tickers
  data/
    raw/                         # Original downloaded data (read-only)
    processed/                   # Cleaned, aligned data
    checksums.json               # SHA256 hashes of all data files
  notebooks/
    01_data_exploration.ipynb    # EDA, no optimization
    02_signal_construction.ipynb # Feature engineering
    03_backtest.ipynb            # Strategy backtest
    04_validation.ipynb          # Walk-forward, CPCV, Monte Carlo
    05_robustness.ipynb          # Parameter sensitivity, alternative assets
  src/
    data_loader.py               # Reusable data loading functions
    signals.py                   # Signal computation functions
    backtest_engine.py           # Backtest execution logic
    metrics.py                   # Performance metric calculations
  tests/
    test_signals.py              # Unit tests for signal logic
    test_backtest.py             # Unit tests for backtest engine
  research_log.md                # Running log of all experiments
"""
Research Notebook: Momentum Signal on ETF Universe
Hypothesis: 12-month price momentum predicts next-month ETF returns
Author: [Name]
Date: 2024-01-15
Data: CRSP ETF returns, 2005-2024
Version: 1.2 (added transaction costs)
"""
# --- Setup ---
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
SEED = 42
np.random.seed(SEED)
# Configuration
CONFIG = {
    'start_date': '2005-01-01',
    'end_date': '2024-01-01',
    'lookback_months': 12,
    'rebalance_freq': 'monthly',
    'transaction_cost_bps': 10,
    'n_quantiles': 5,
    'data_file': '../data/processed/etf_returns_v2.parquet',
    'data_checksum': 'a1b2c3d4...',  # SHA256
}
# Environment
import sys
print(f"Python: {sys.version}")
print(f"NumPy: {np.__version__}, Pandas: {pd.__version__}")
# Success criteria (defined BEFORE analysis):
# Sharpe > 0.5, WFE > 0.5, significant at p < 0.10 after DSR adjustment
## Experiment Log: Momentum ETF Strategy
### 2024-01-15 — Experiment 1
Hypothesis: 12-month momentum, top quintile long, bottom quintile short
Parameters: lookback=12, rebalance=monthly, cost=10bps
Result: Sharpe=0.82, Max DD=18%, 245 trades over 19 years
WFE: 0.61 (3-year rolling IS, 6-month OOS)
Notes: Reasonable but momentum crash in 2009 caused 35% drawdown
Status: Promising, continue investigation
### 2024-01-17 — Experiment 2
Hypothesis: Add volatility scaling (inverse vol weighting)
Parameters: same as Exp 1, plus vol_lookback=60 days
Result: Sharpe=0.95, Max DD=14%, improved risk-adjusted returns
WFE: 0.58 (slightly lower — vol scaling adds a parameter)
Notes: Reduced momentum crash drawdown by 40%
Status: Better than Exp 1, investigate further
### 2024-01-20 — Experiment 3 (FAILED)
Hypothesis: Use 6-month momentum instead of 12-month
Parameters: lookback=6, everything else same as Exp 2
Result: Sharpe=0.35, not significant
Notes: Short-term momentum has higher turnover and lower signal quality
Status: Rejected, stick with 12-month lookback
Total trials: 3 (need to account for all in DSR calculation)
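The trial count feeds into the Deflated Sharpe Ratio as sketched below, following the formula of Bailey and López de Prado. Several things here are assumptions: Sharpe ratios are converted to per-period (monthly) units by dividing the annualized figures by sqrt(12); the sample length is taken as 19 years of monthly observations (228); returns are treated as normal (skewness 0, kurtosis 3) because the log does not report higher moments; and the three logged trials are treated as independent, which overstates independence for closely related variants.

```python
import math
from statistics import NormalDist, variance

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def expected_max_sharpe(trial_sharpes):
    """Expected maximum Sharpe ratio across trials if the true Sharpe were zero."""
    n = len(trial_sharpes)
    if n < 2:
        raise ValueError("need at least two trials to estimate dispersion")
    normal = NormalDist()
    spread = math.sqrt(variance(trial_sharpes))  # sample variance across trials
    return spread * (
        (1 - EULER_GAMMA) * normal.inv_cdf(1 - 1 / n)
        + EULER_GAMMA * normal.inv_cdf(1 - 1 / (n * math.e))
    )


def deflated_sharpe_ratio(sharpe, trial_sharpes, n_obs, skew=0.0, kurt=3.0):
    """Probability that the selected Sharpe exceeds the multiple-testing benchmark.

    All Sharpe ratios must be in per-period (non-annualized) units.
    """
    benchmark = expected_max_sharpe(trial_sharpes)
    denominator = math.sqrt(1 - skew * sharpe + (kurt - 1) / 4 * sharpe ** 2)
    z_score = (sharpe - benchmark) * math.sqrt(n_obs - 1) / denominator
    return NormalDist().cdf(z_score)


# Annualized Sharpe ratios from the three logged experiments.
annualized = [0.82, 0.95, 0.35]
monthly = [s / math.sqrt(12) for s in annualized]  # assumes monthly returns
dsr = deflated_sharpe_ratio(max(monthly), monthly, n_obs=19 * 12)
print(f"Deflated Sharpe Ratio for the best trial: {dsr:.3f}")
```

Leaving failed experiments out of `annualized` would shrink both the trial count and the dispersion, lowering the benchmark and inflating the result, which is why the log has to include every trial.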
Before considering research complete, verify: