From everything-claude-trading
- Evaluating whether a backtest result reflects genuine alpha or data mining artifacts
Definition: A strategy is overfit when it captures noise (random patterns) in historical data rather than signal (persistent patterns), resulting in poor live performance despite excellent backtest results.
Symptoms of Overfitting:
1. Exceptional backtest performance (Sharpe > 2.5 for daily strategies is suspicious)
2. Many tuned parameters relative to data points
3. Performance degrades sharply out-of-sample
4. Strategy only works on one specific asset or time period
5. Small parameter changes cause large performance changes
6. Strategy requires frequent re-optimization to maintain performance
7. Complex, unintuitive logic with no economic rationale
8. Backtest includes survivorship bias, lookahead bias, or unrealistic fills
Overfitting Probability:
With N independent strategy configurations tested:
P(at least one appears significant at p=0.05) = 1 - (0.95)^N
N=1: 5% false positive rate
N=10: 40% false positive rate
N=20: 64% false positive rate
N=50: 92% false positive rate
N=100: 99.4% false positive rate
Implication: If you test 100 parameter combinations, you are almost
guaranteed to find one that "works" even on random data.
This is why multiple testing adjustment is essential.
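The arithmetic above can be checked directly (the function name is illustrative):

```python
# False positive rate across N independent tests at per-test alpha = 0.05.
# Values reproduce the table above.
def family_wise_fp_rate(n_trials: int, alpha: float = 0.05) -> float:
    """P(at least one configuration looks significant on pure noise)."""
    return 1.0 - (1.0 - alpha) ** n_trials

for n in (1, 10, 20, 50, 100):
    print(f"N={n:>3}: {family_wise_fp_rate(n):.1%}")
```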
Problem: Standard Sharpe ratio does not account for the number of strategies tested (selection bias) or the statistical properties of returns (skewness, kurtosis).
Deflated Sharpe Ratio (DSR):
DSR adjusts the Sharpe ratio for:
1. Number of trials (strategies tested)
2. Data length (more data = more reliable)
3. Skewness (negative skew inflates Sharpe)
4. Kurtosis (fat tails inflate Sharpe)
DSR = Prob(SR > SR_benchmark | trials, skew, kurtosis, T)
Using the Bailey-Lopez de Prado formulation:
DSR = Z[ (SR_observed - SR*) * sqrt(T - 1) / sqrt(1 - skewness * SR_observed + ((kurtosis - 1) / 4) * SR_observed^2) ]
Where Z is the standard normal CDF and SR* is the expected maximum Sharpe
across all N trials under the null of zero true skill:
SR* = sqrt(V[SR]) * ((1 - gamma) * Z^-1(1 - 1/N) + gamma * Z^-1(1 - 1/(N*e)))
with gamma ≈ 0.5772 (the Euler-Mascheroni constant)
Practical interpretation:
- If you tested 50 strategies and the best has Sharpe = 1.5:
DSR might show this is only significant at p=0.15 (not significant)
- If you tested 3 strategies and the best has Sharpe = 1.2:
DSR might show this is significant at p=0.02 (significant)
- Rule of thumb: record ALL strategies tested, not just the ones that worked
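A sketch of the Bailey and Lopez de Prado (2014) formulation using only the standard library. Function names are illustrative; all Sharpe inputs are per-period (not annualized), and var_sr is the cross-trial variance of the Sharpe estimates:

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()
EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant

def expected_max_sharpe(n_trials: int, var_sr: float) -> float:
    """Expected maximum Sharpe across n_trials strategies under the null (true SR = 0)."""
    return math.sqrt(var_sr) * (
        (1 - EULER_GAMMA) * _STD_NORMAL.inv_cdf(1 - 1 / n_trials)
        + EULER_GAMMA * _STD_NORMAL.inv_cdf(1 - 1 / (n_trials * math.e))
    )

def deflated_sharpe_ratio(sr: float, n_obs: int, skew: float,
                          kurt: float, n_trials: int, var_sr: float) -> float:
    """Probability that the observed per-period SR beats the noise-driven maximum."""
    sr_star = expected_max_sharpe(n_trials, var_sr)
    numerator = (sr - sr_star) * math.sqrt(n_obs - 1)
    denominator = math.sqrt(1 - skew * sr + (kurt - 1) / 4 * sr ** 2)
    return _STD_NORMAL.cdf(numerator / denominator)
```

A DSR close to 1 means the observed Sharpe is unlikely to be the lucky best of N noise strategies; a DSR near 0.5 or below means it is entirely consistent with data mining.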
Minimum Required Sharpe:
Given N trials, the hurdle is the best Sharpe that pure noise would be expected to produce:
SR_min ≈ E[max(SR) under null] ≈ sqrt(2 * ln(N)) * SE(SR), where SE(SR) ≈ 1/sqrt(T)
(for N=1 this reduces to the ordinary significance hurdle z_alpha * SE(SR))
Approximate minimum Sharpe for significance (T=252 daily obs per year, 5 years):
N=1: SR_min ≈ 0.40
N=10: SR_min ≈ 0.70
N=50: SR_min ≈ 0.95
N=100: SR_min ≈ 1.10
N=1000: SR_min ≈ 1.40
This means: if you tested 100 strategies, only those with Sharpe > 1.10
should be considered potentially significant.
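One way to sanity-check the hurdle without trusting any closed form is to simulate pure-noise strategies and record the best Sharpe found. This hypothetical helper does exactly that:

```python
import math
import random

def simulated_max_sharpe(n_trials: int, n_obs: int,
                         n_sims: int = 100, seed: int = 0) -> float:
    """Monte Carlo estimate of E[max per-period Sharpe] across n_trials
    strategies whose returns are pure noise (true Sharpe = 0)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_sims):
        best = float("-inf")
        for _ in range(n_trials):
            rets = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
            mu = sum(rets) / n_obs
            var = sum((r - mu) ** 2 for r in rets) / (n_obs - 1)
            best = max(best, mu / math.sqrt(var))
        total += best
    return total / n_sims

# e.g. 20 trials on one year of daily data: the best pure-noise strategy
# shows a daily Sharpe of roughly sqrt(2 * ln(20)) / sqrt(252) ≈ 0.12
```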
Problem with Standard k-Fold CV: financial returns are serially correlated and labels often overlap in time, so randomly assigned folds leak information between training and test sets and inflate cross-validated performance.
CPCV (Lopez de Prado):
Key innovations:
1. Purging: remove observations from training set that are near the test set boundary
- Purge window = max holding period of the strategy
- Prevents lookahead: if strategy holds for 5 days, purge 5 days around test boundaries
2. Embargo: additional buffer after purging
- Prevents serial correlation from leaking information
- Embargo period = 1-2x the autocorrelation decay period
3. Combinatorial: instead of sequential folds, use all possible combinations
- N groups, select k for testing, use remaining for training
- C(N,k) combinations provides many more test paths
- Each path is a valid walk-forward simulation
Implementation:
- Split data into N groups (e.g., N=10)
- For each combination of k test groups (e.g., k=2):
- Remove test groups from training data
- Purge observations near test group boundaries
- Apply embargo period
- Train on remaining data
- Evaluate on test groups
- Average performance across all C(N,k) paths
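The split-generation steps above can be sketched as follows; the equal-width group boundaries, the choice to purge on both sides of each test group, and the function name are simplifying assumptions:

```python
from itertools import combinations

def cpcv_splits(n_obs: int, n_groups: int = 10, k_test: int = 2,
                purge: int = 5, embargo: int = 2):
    """Yield (test_idx, train_idx) pairs, one per C(n_groups, k_test) combination.

    Observations within `purge` of a test-group boundary are dropped from
    training, plus an extra `embargo` buffer after each test group.
    """
    bounds = [(g * n_obs // n_groups, (g + 1) * n_obs // n_groups)
              for g in range(n_groups)]
    for test_groups in combinations(range(n_groups), k_test):
        test_idx, banned = set(), set()
        for g in test_groups:
            start, end = bounds[g]
            test_idx.update(range(start, end))
            banned.update(range(max(0, start - purge), start))            # purge before
            banned.update(range(end, min(n_obs, end + purge + embargo)))  # purge + embargo after
        train_idx = [i for i in range(n_obs)
                     if i not in test_idx and i not in banned]
        yield sorted(test_idx), train_idx
```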
CPCV vs Standard Walk-Forward:
Walk-forward: produces 1 OOS path (single sequence of OOS results)
CPCV: produces C(N,k) OOS paths (many sequences)
Advantages of CPCV:
- More statistically robust (larger sample of test paths)
- Can estimate distribution of performance, not just mean
- Probability of loss (% of paths with negative return) is directly observable
- Less dependent on specific IS/OOS split choice
Example: N=10, k=2 -> C(10,2) = 45 test paths
If 40/45 paths are profitable: strong evidence of genuine edge
If 25/45 paths are profitable: strategy likely fragile or overfit
White's Reality Check (2000):
Purpose: Test if the BEST strategy among many is significantly better
than a benchmark, accounting for data snooping.
Method:
1. Define null hypothesis: best strategy is no better than benchmark
2. Bootstrap the data (resample returns with replacement)
3. For each bootstrap sample, find the best strategy's performance
4. Build distribution of "best strategy performance under null"
5. Compare actual best strategy to this distribution
6. p-value = fraction of bootstrap samples whose best performance >= the actual best
If p < 0.05: best strategy is significantly better than benchmark
even after accounting for having searched many strategies
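A minimal sketch of the procedure; for brevity it uses an i.i.d. bootstrap rather than the stationary bootstrap of the original paper, and the function name is illustrative:

```python
import random

def reality_check_pvalue(excess_returns, n_boot: int = 1000, seed: int = 0) -> float:
    """Bootstrap p-value for H0: the best strategy's mean excess return is zero.

    excess_returns: one list of per-period returns vs the benchmark per strategy.
    """
    rng = random.Random(seed)
    t = len(excess_returns[0])
    means = [sum(r) / t for r in excess_returns]
    best_actual = max(means)
    exceed = 0
    for _ in range(n_boot):
        idx = [rng.randrange(t) for _ in range(t)]
        # Recentre each strategy at zero mean (impose the null), then take the best
        boot_best = max(sum(r[i] - m for i in idx) / t
                        for r, m in zip(excess_returns, means))
        if boot_best >= best_actual:
            exceed += 1
    return exceed / n_boot
```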
Hansen's Superior Predictive Ability (SPA) Test (2005):
Improvement over White's Reality Check:
- Less conservative (more power to detect true outperformance)
- Uses studentized statistics (accounts for variance of each strategy)
- Better handles strategies with different volatilities
Implementation:
1. Compute loss differential for each strategy vs benchmark
2. Studentize by strategy-specific standard error
3. Bootstrap using stationary bootstrap (preserves time series structure)
4. Test statistic: max studentized loss differential
5. p-value from bootstrap distribution
Practical note: both tests require many bootstrap samples (1000+)
and are computationally intensive for large strategy universes.
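The studentization in step 2 can be sketched in isolation; the helper name and input layout are assumptions:

```python
import math

def studentized_loss_differentials(loss_diffs):
    """Per-strategy mean loss differential divided by its standard error.

    loss_diffs: one list per strategy of (benchmark loss - strategy loss);
    positive values mean the strategy beat the benchmark in that period.
    """
    stats = []
    for d in loss_diffs:
        t = len(d)
        mu = sum(d) / t
        var = sum((x - mu) ** 2 for x in d) / (t - 1)
        stats.append(mu / math.sqrt(var / t))  # mean / standard error
    return stats
```

Because each strategy is scaled by its own standard error, a low-volatility strategy with a modest edge is no longer dominated by a high-volatility strategy with a lucky run, which is the source of SPA's extra power over the unstudentized Reality Check.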
Concept: How much data do you need for a Sharpe ratio estimate to be statistically meaningful?
MBL (in years) for a given Sharpe ratio and significance level:
Formula: MBL = (z_alpha / SR)^2 + 1 (approximate, in years)
Where z_alpha = 1.96 for 95% confidence
SR=0.5: MBL ≈ 16.4 years
SR=1.0: MBL ≈ 4.8 years
SR=1.5: MBL ≈ 2.7 years
SR=2.0: MBL ≈ 2.0 years
SR=3.0: MBL ≈ 1.4 years
Implication: A strategy with Sharpe 0.5 needs 16+ years of data
to be statistically distinguishable from zero. Strategies tested
on 2-3 years of data can only reliably detect Sharpe > 1.5.
For daily strategies: multiply by 252 to get minimum observations
For monthly strategies: multiply by 12
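The MBL formula is a one-liner; this reproduces the table above (helper name is illustrative):

```python
def minimum_backtest_length_years(sharpe: float, z_alpha: float = 1.96) -> float:
    """Approximate years needed to distinguish an annualized Sharpe from zero."""
    return (z_alpha / sharpe) ** 2 + 1

for sr in (0.5, 1.0, 1.5, 2.0, 3.0):
    print(f"SR={sr}: {minimum_backtest_length_years(sr):.1f} years")
```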
Concept: Each additional parameter or rule in a strategy consumes a degree of freedom, reducing the reliability of the backtest.
Adjusted Sharpe ≈ SR * sqrt(1 - k/N)
Where k = number of free parameters, N = number of independent observations
Example:
- Strategy with 5 parameters, 1000 daily observations
- SR = 1.5
- Adjusted SR = 1.5 * sqrt(1 - 5/1000) = 1.5 * 0.9975 = 1.496
(minimal penalty — sufficient data)
- Strategy with 20 parameters, 200 daily observations
- SR = 2.0
- Adjusted SR = 2.0 * sqrt(1 - 20/200) = 2.0 * 0.949 = 1.90
(noticeable penalty — not enough data for this many parameters)
Rule of thumb: observations per parameter should be > 50
(ideally >100) for reliable estimation
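The adjustment is equally simple to encode (hypothetical helper name); it reproduces both worked examples above:

```python
import math

def dof_adjusted_sharpe(sr: float, n_params: int, n_obs: int) -> float:
    """Haircut a backtested Sharpe for the degrees of freedom it consumed."""
    if n_params >= n_obs:
        raise ValueError("need more observations than free parameters")
    return sr * math.sqrt(1 - n_params / n_obs)
```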
Tier 1 (Deploy with full allocation):
- DSR significant at p < 0.05 after accounting for all trials
- CPCV: >80% of paths profitable
- Walk-forward WFE > 0.6
- Economic rationale is clear and documented
- Works on at least 2 independent datasets/assets
Tier 2 (Deploy with reduced allocation, monitor closely):
- DSR significant at p < 0.10
- CPCV: >65% of paths profitable
- Walk-forward WFE > 0.4
- Economic rationale exists but is less clear
Tier 3 (Paper trade only, continue research):
- DSR significant at p < 0.20
- CPCV: >50% of paths profitable
- Walk-forward WFE < 0.4
- Possible economic rationale but unproven
Reject:
- DSR not significant at p < 0.20
- CPCV: <50% of paths profitable
- No economic rationale
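The tiers above form a cascade that can be encoded directly. This illustrative helper hard-codes the thresholds and reduces the softer criteria (economic rationale, multi-dataset evidence) to simple inputs:

```python
def classify_strategy(dsr_p: float, cpcv_profitable: float, wfe: float,
                      has_rationale: bool, n_datasets: int = 1) -> str:
    """Map validation results onto the acceptance tiers (illustrative helper)."""
    if (dsr_p < 0.05 and cpcv_profitable > 0.80 and wfe > 0.6
            and has_rationale and n_datasets >= 2):
        return "Tier 1: deploy with full allocation"
    if dsr_p < 0.10 and cpcv_profitable > 0.65 and wfe > 0.4:
        return "Tier 2: deploy with reduced allocation"
    if dsr_p < 0.20 and cpcv_profitable > 0.50:
        return "Tier 3: paper trade only"
    return "Reject"
```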
Research process:
- Tested 5 lookback periods
- Tested 4 entry thresholds
- Tested 3 exit methods
- Tested 2 position sizing rules
- Total configurations: 5 * 4 * 3 * 2 = 120
Best configuration:
- Sharpe ratio: 1.85 (5 years daily data)
- Looks great in isolation
DSR analysis:
- N = 120 trials
- SR_min at p=0.05: approximately 1.25
- Best SR (1.85) > SR_min (1.25): passes DSR threshold
- But: with negative skew (-0.8) and excess kurtosis (4.2):
DSR adjusts down further to SR_required ≈ 1.45
- 1.85 > 1.45: still passes (strategy is likely genuine)
If best SR were 1.35 instead of 1.85:
- 1.35 > 1.25 but 1.35 < 1.45 (fails after skew/kurtosis adjustment)
- Would not pass DSR — likely a data mining artifact
Strategy: Mean reversion on ETF basket (10 ETFs)
Data: 8 years daily (2016-2024)
CPCV: N=8 groups, k=2 -> C(8,2) = 28 paths
Purge window: 5 days (max holding period)
Embargo: 2 days
Results across 28 paths:
- Paths with positive return: 24/28 (86%)
- Paths with Sharpe > 0.5: 20/28 (71%)
- Paths with Sharpe > 1.0: 12/28 (43%)
- Average Sharpe across paths: 0.82
- Std of Sharpe across paths: 0.55
- Worst path Sharpe: -0.35
- Best path Sharpe: 1.95
Assessment: 86% profitable paths with average Sharpe 0.82
is strong evidence. The worst path (-0.35) shows the strategy
can underperform for extended periods but losses are limited.
Strategy meets Tier 1 acceptance criteria.
Original strategy: 8 parameters, Sharpe = 2.1
Simplified versions:
- Remove 2 parameters (fix at defaults): Sharpe = 1.85 (12% drop)
- Remove 4 parameters: Sharpe = 1.55 (26% drop)
- Remove 6 parameters (2 remaining): Sharpe = 1.20 (43% drop)
Analysis:
- Core edge is captured by 4 parameters (Sharpe 1.55)
- Additional 4 parameters add 0.55 Sharpe — much of this is likely overfitting
- With 4 parameters, DSR is more lenient (fewer trials per parameter)
- 4-parameter version has better parameter stability in walk-forward
Decision: deploy the 4-parameter version (Sharpe 1.55)
instead of the 8-parameter version (Sharpe 2.1).
Expected live Sharpe: 1.0-1.2 (after realistic degradation)
vs 8-parameter expected live Sharpe: 0.8-1.0 (more degradation from overfitting)
Before accepting a strategy as not overfit, verify: