From everything-claude-trading
- Evaluating whether a backtest result reflects genuine alpha or data mining artifacts
Definition: A strategy is overfit when it captures noise (random patterns) in historical data rather than signal (persistent patterns), resulting in poor live performance despite excellent backtest results.
Symptoms of Overfitting:
1. Exceptional backtest performance (Sharpe > 2.5 for daily strategies is suspicious)
2. Many tuned parameters relative to data points
3. Performance degrades sharply out-of-sample
4. Strategy only works on one specific asset or time period
5. Small parameter changes cause large performance changes
6. Strategy requires frequent re-optimization to maintain performance
7. Complex, unintuitive logic with no economic rationale
8. Backtest includes survivorship bias, lookahead bias, or unrealistic fills
Overfitting Probability:
With N independent strategy configurations tested:
P(at least one appears significant at p=0.05) = 1 - (0.95)^N
N=1: 5% false positive rate
N=10: 40% false positive rate
N=20: 64% false positive rate
N=50: 92% false positive rate
N=100: 99.4% false positive rate
Implication: If you test 100 parameter combinations, you are almost
guaranteed to find one that "works" even on random data.
This is why multiple testing adjustment is essential.
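The arithmetic above can be checked directly (the function name is illustrative):

```python
# False positive rate across N independent tests at per-test alpha = 0.05.
# Values reproduce the table above.
def family_wise_fp_rate(n_trials: int, alpha: float = 0.05) -> float:
    """P(at least one configuration looks significant on pure noise)."""
    return 1.0 - (1.0 - alpha) ** n_trials

for n in (1, 10, 20, 50, 100):
    print(f"N={n:>3}: {family_wise_fp_rate(n):.1%}")
```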
Problem: Standard Sharpe ratio does not account for the number of strategies tested (selection bias) or the statistical properties of returns (skewness, kurtosis).
Deflated Sharpe Ratio (DSR):
DSR adjusts the Sharpe ratio for:
1. Number of trials (strategies tested)
2. Data length (more data = more reliable)
3. Skewness (negative skew inflates Sharpe)
4. Kurtosis (fat tails inflate Sharpe)
DSR = Prob(SR > SR_benchmark | trials, skew, kurtosis, T)
Using the Bailey-Lopez de Prado formulation:
DSR = Z[ (SR_observed - SR*) * sqrt(T - 1) / sqrt(1 - skewness * SR_observed + ((kurtosis - 1) / 4) * SR_observed^2) ]
Where Z is the standard normal CDF and SR* is the expected maximum Sharpe
across all N trials under the null of zero true skill:
SR* = sqrt(V[SR]) * ((1 - gamma) * Z^-1(1 - 1/N) + gamma * Z^-1(1 - 1/(N*e)))
with gamma ≈ 0.5772 (the Euler-Mascheroni constant)
Practical interpretation:
- If you tested 50 strategies and the best has Sharpe = 1.5:
DSR might show this is only significant at p=0.15 (not significant)
- If you tested 3 strategies and the best has Sharpe = 1.2:
DSR might show this is significant at p=0.02 (significant)
- Rule of thumb: record ALL strategies tested, not just the ones that worked
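A sketch of the Bailey and Lopez de Prado (2014) formulation using only the standard library. Function names are illustrative; all Sharpe inputs are per-period (not annualized), and var_sr is the cross-trial variance of the Sharpe estimates:

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()
EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant

def expected_max_sharpe(n_trials: int, var_sr: float) -> float:
    """Expected maximum Sharpe across n_trials strategies under the null (true SR = 0)."""
    return math.sqrt(var_sr) * (
        (1 - EULER_GAMMA) * _STD_NORMAL.inv_cdf(1 - 1 / n_trials)
        + EULER_GAMMA * _STD_NORMAL.inv_cdf(1 - 1 / (n_trials * math.e))
    )

def deflated_sharpe_ratio(sr: float, n_obs: int, skew: float,
                          kurt: float, n_trials: int, var_sr: float) -> float:
    """Probability that the observed per-period SR beats the noise-driven maximum."""
    sr_star = expected_max_sharpe(n_trials, var_sr)
    numerator = (sr - sr_star) * math.sqrt(n_obs - 1)
    denominator = math.sqrt(1 - skew * sr + (kurt - 1) / 4 * sr ** 2)
    return _STD_NORMAL.cdf(numerator / denominator)
```

A DSR close to 1 means the observed Sharpe is unlikely to be the lucky best of N noise strategies; a DSR near 0.5 or below means it is entirely consistent with data mining.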
Minimum Required Sharpe:
Given N trials, the hurdle is the best Sharpe that pure noise would be expected to produce:
SR_min ≈ E[max(SR) under null] ≈ sqrt(2 * ln(N)) * SE(SR), where SE(SR) ≈ 1/sqrt(T)
(for N=1 this reduces to the ordinary significance hurdle z_alpha * SE(SR))
Approximate minimum Sharpe for significance (T=252 daily obs per year, 5 years):
N=1: SR_min ≈ 0.40
N=10: SR_min ≈ 0.70
N=50: SR_min ≈ 0.95
N=100: SR_min ≈ 1.10
N=1000: SR_min ≈ 1.40
This means: if you tested 100 strategies, only those with Sharpe > 1.10
should be considered potentially significant.
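One way to sanity-check the hurdle without trusting any closed form is to simulate pure-noise strategies and record the best Sharpe found. This hypothetical helper does exactly that:

```python
import math
import random

def simulated_max_sharpe(n_trials: int, n_obs: int,
                         n_sims: int = 100, seed: int = 0) -> float:
    """Monte Carlo estimate of E[max per-period Sharpe] across n_trials
    strategies whose returns are pure noise (true Sharpe = 0)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_sims):
        best = float("-inf")
        for _ in range(n_trials):
            rets = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
            mu = sum(rets) / n_obs
            var = sum((r - mu) ** 2 for r in rets) / (n_obs - 1)
            best = max(best, mu / math.sqrt(var))
        total += best
    return total / n_sims

# e.g. 20 trials on one year of daily data: the best pure-noise strategy
# shows a daily Sharpe of roughly sqrt(2 * ln(20)) / sqrt(252) ≈ 0.12
```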
Problem with Standard k-Fold CV: financial returns are serially correlated and labels often overlap in time, so randomly assigned folds leak information between training and test sets and inflate cross-validated performance.
CPCV (Lopez de Prado):
Key innovations:
1. Purging: remove observations from training set that are near the test set boundary
- Purge window = max holding period of the strategy
- Prevents lookahead: if strategy holds for 5 days, purge 5 days around test boundaries
2. Embargo: additional buffer after purging
- Prevents serial correlation from leaking information
- Embargo period = 1-2x the autocorrelation decay period
3. Combinatorial: instead of sequential folds, use all possible combinations
- N groups, select k for testing, use remaining for training
- C(N,k) combinations provides many more test paths
- Each path is a valid walk-forward simulation
Implementation:
- Split data into N groups (e.g., N=10)
- For each combination of k test groups (e.g., k=2):
- Remove test groups from training data
- Purge observations near test group boundaries
- Apply embargo period
- Train on remaining data
- Evaluate on test groups
- Average performance across all C(N,k) paths
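The split-generation steps above can be sketched as follows; the equal-width group boundaries, the choice to purge on both sides of each test group, and the function name are simplifying assumptions:

```python
from itertools import combinations

def cpcv_splits(n_obs: int, n_groups: int = 10, k_test: int = 2,
                purge: int = 5, embargo: int = 2):
    """Yield (test_idx, train_idx) pairs, one per C(n_groups, k_test) combination.

    Observations within `purge` of a test-group boundary are dropped from
    training, plus an extra `embargo` buffer after each test group.
    """
    bounds = [(g * n_obs // n_groups, (g + 1) * n_obs // n_groups)
              for g in range(n_groups)]
    for test_groups in combinations(range(n_groups), k_test):
        test_idx, banned = set(), set()
        for g in test_groups:
            start, end = bounds[g]
            test_idx.update(range(start, end))
            banned.update(range(max(0, start - purge), start))            # purge before
            banned.update(range(end, min(n_obs, end + purge + embargo)))  # purge + embargo after
        train_idx = [i for i in range(n_obs)
                     if i not in test_idx and i not in banned]
        yield sorted(test_idx), train_idx
```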
CPCV vs Standard Walk-Forward:
Walk-forward: produces 1 OOS path (single sequence of OOS results)
CPCV: produces C(N,k) OOS paths (many sequences)
Advantages of CPCV:
- More statistically robust (larger sample of test paths)
- Can estimate distribution of performance, not just mean
- Probability of loss (% of paths with negative return) is directly observable
- Less dependent on specific IS/OOS split choice
Example: N=10, k=2 -> C(10,2) = 45 test paths
If 40/45 paths are profitable: strong evidence of genuine edge
If 25/45 paths are profitable: strategy likely fragile or overfit
White's Reality Check (2000):
Purpose: Test if the BEST strategy among many is significantly better
than a benchmark, accounting for data snooping.
Method:
1. Define null hypothesis: best strategy is no better than benchmark
2. Bootstrap the data (resample returns with replacement)
3. For each bootstrap sample, find the best strategy's performance
4. Build distribution of "best strategy performance under null"
5. Compare actual best strategy to this distribution
6. p-value = fraction of bootstrap samples whose best performance >= the actual best
If p < 0.05: best strategy is significantly better than benchmark
even after accounting for having searched many strategies
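A minimal sketch of the procedure; for brevity it uses an i.i.d. bootstrap rather than the stationary bootstrap of the original paper, and the function name is illustrative:

```python
import random

def reality_check_pvalue(excess_returns, n_boot: int = 1000, seed: int = 0) -> float:
    """Bootstrap p-value for H0: the best strategy's mean excess return is zero.

    excess_returns: one list of per-period returns vs the benchmark per strategy.
    """
    rng = random.Random(seed)
    t = len(excess_returns[0])
    means = [sum(r) / t for r in excess_returns]
    best_actual = max(means)
    exceed = 0
    for _ in range(n_boot):
        idx = [rng.randrange(t) for _ in range(t)]
        # Recentre each strategy at zero mean (impose the null), then take the best
        boot_best = max(sum(r[i] - m for i in idx) / t
                        for r, m in zip(excess_returns, means))
        if boot_best >= best_actual:
            exceed += 1
    return exceed / n_boot
```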
Hansen's Superior Predictive Ability (SPA) Test (2005):
Improvement over White's Reality Check:
- Less conservative (more power to detect true outperformance)
- Uses studentized statistics (accounts for variance of each strategy)
- Better handles strategies with different volatilities
Implementation:
1. Compute loss differential for each strategy vs benchmark
2. Studentize by strategy-specific standard error
3. Bootstrap using stationary bootstrap (preserves time series structure)
4. Test statistic: max studentized loss differential
5. p-value from bootstrap distribution
Practical note: both tests require many bootstrap samples (1000+)
and are computationally intensive for large strategy universes.
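The studentization in step 2 can be sketched in isolation; the helper name and input layout are assumptions:

```python
import math

def studentized_loss_differentials(loss_diffs):
    """Per-strategy mean loss differential divided by its standard error.

    loss_diffs: one list per strategy of (benchmark loss - strategy loss);
    positive values mean the strategy beat the benchmark in that period.
    """
    stats = []
    for d in loss_diffs:
        t = len(d)
        mu = sum(d) / t
        var = sum((x - mu) ** 2 for x in d) / (t - 1)
        stats.append(mu / math.sqrt(var / t))  # mean / standard error
    return stats
```

Because each strategy is scaled by its own standard error, a low-volatility strategy with a modest edge is no longer dominated by a high-volatility strategy with a lucky run, which is the source of SPA's extra power over the unstudentized Reality Check.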
Concept: How much data do you need for a Sharpe ratio estimate to be statistically meaningful?
MBL (in years) for a given Sharpe ratio and significance level:
Formula: MBL = (z_alpha / SR)^2 + 1 (approximate, in years)
Where z_alpha = 1.96 for 95% confidence
SR=0.5: MBL ≈ 16.4 years
SR=1.0: MBL ≈ 4.8 years
SR=1.5: MBL ≈ 2.7 years
SR=2.0: MBL ≈ 2.0 years
SR=3.0: MBL ≈ 1.4 years
Implication: A strategy with Sharpe 0.5 needs 16+ years of data
to be statistically distinguishable from zero. Strategies tested
on 2-3 years of data can only reliably detect Sharpe > 1.5.
For daily strategies: multiply by 252 to get minimum observations
For monthly strategies: multiply by 12
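The MBL formula is a one-liner; this reproduces the table above (helper name is illustrative):

```python
def minimum_backtest_length_years(sharpe: float, z_alpha: float = 1.96) -> float:
    """Approximate years needed to distinguish an annualized Sharpe from zero."""
    return (z_alpha / sharpe) ** 2 + 1

for sr in (0.5, 1.0, 1.5, 2.0, 3.0):
    print(f"SR={sr}: {minimum_backtest_length_years(sr):.1f} years")
```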
Concept: Each additional parameter or rule in a strategy consumes a degree of freedom, reducing the reliability of the backtest.
Adjusted Sharpe ≈ SR * sqrt(1 - k/N)
Where k = number of free parameters, N = number of independent observations
Example:
- Strategy with 5 parameters, 1000 daily observations
- SR = 1.5
- Adjusted SR = 1.5 * sqrt(1 - 5/1000) = 1.5 * 0.9975 = 1.496
(minimal penalty — sufficient data)
- Strategy with 20 parameters, 200 daily observations
- SR = 2.0
- Adjusted SR = 2.0 * sqrt(1 - 20/200) = 2.0 * 0.949 = 1.90
(noticeable penalty — not enough data for this many parameters)
Rule of thumb: observations per parameter should be > 50
(ideally >100) for reliable estimation
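The adjustment is equally simple to encode (hypothetical helper name); it reproduces both worked examples above:

```python
import math

def dof_adjusted_sharpe(sr: float, n_params: int, n_obs: int) -> float:
    """Haircut a backtested Sharpe for the degrees of freedom it consumed."""
    if n_params >= n_obs:
        raise ValueError("need more observations than free parameters")
    return sr * math.sqrt(1 - n_params / n_obs)
```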
Tier 1 (Deploy with full allocation):
- DSR significant at p < 0.05 after accounting for all trials
- CPCV: >80% of paths profitable
- Walk-forward WFE > 0.6
- Economic rationale is clear and documented
- Works on at least 2 independent datasets/assets
Tier 2 (Deploy with reduced allocation, monitor closely):
- DSR significant at p < 0.10
- CPCV: >65% of paths profitable
- Walk-forward WFE > 0.4
- Economic rationale exists but is less clear
Tier 3 (Paper trade only, continue research):
- DSR significant at p < 0.20
- CPCV: >50% of paths profitable
- Walk-forward WFE < 0.4
- Possible economic rationale but unproven
Reject:
- DSR not significant at p < 0.20
- CPCV: <50% of paths profitable
- No economic rationale
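The tiers above form a cascade that can be encoded directly. This illustrative helper hard-codes the thresholds and reduces the softer criteria (economic rationale, multi-dataset evidence) to simple inputs:

```python
def classify_strategy(dsr_p: float, cpcv_profitable: float, wfe: float,
                      has_rationale: bool, n_datasets: int = 1) -> str:
    """Map validation results onto the acceptance tiers (illustrative helper)."""
    if (dsr_p < 0.05 and cpcv_profitable > 0.80 and wfe > 0.6
            and has_rationale and n_datasets >= 2):
        return "Tier 1: deploy with full allocation"
    if dsr_p < 0.10 and cpcv_profitable > 0.65 and wfe > 0.4:
        return "Tier 2: deploy with reduced allocation"
    if dsr_p < 0.20 and cpcv_profitable > 0.50:
        return "Tier 3: paper trade only"
    return "Reject"
```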
Research process:
- Tested 5 lookback periods
- Tested 4 entry thresholds
- Tested 3 exit methods
- Tested 2 position sizing rules
- Total configurations: 5 * 4 * 3 * 2 = 120
Best configuration:
- Sharpe ratio: 1.85 (5 years daily data)
- Looks great in isolation
DSR analysis:
- N = 120 trials
- SR_min at p=0.05: approximately 1.25
- Best SR (1.85) > SR_min (1.25): passes DSR threshold
- But: with negative skew (-0.8) and excess kurtosis (4.2):
DSR adjusts down further to SR_required ≈ 1.45
- 1.85 > 1.45: still passes (strategy is likely genuine)
If best SR were 1.35 instead of 1.85:
- 1.35 > 1.25 but 1.35 < 1.45 (fails after skew/kurtosis adjustment)
- Would not pass DSR — likely a data mining artifact
Strategy: Mean reversion on ETF basket (10 ETFs)
Data: 8 years daily (2016-2024)
CPCV: N=8 groups, k=2 -> C(8,2) = 28 paths
Purge window: 5 days (max holding period)
Embargo: 2 days
Results across 28 paths:
- Paths with positive return: 24/28 (86%)
- Paths with Sharpe > 0.5: 20/28 (71%)
- Paths with Sharpe > 1.0: 12/28 (43%)
- Average Sharpe across paths: 0.82
- Std of Sharpe across paths: 0.55
- Worst path Sharpe: -0.35
- Best path Sharpe: 1.95
Assessment: 86% profitable paths with average Sharpe 0.82
is strong evidence. The worst path (-0.35) shows the strategy
can underperform for extended periods but losses are limited.
Strategy meets Tier 1 acceptance criteria.
Original strategy: 8 parameters, Sharpe = 2.1
Simplified versions:
- Remove 2 parameters (fix at defaults): Sharpe = 1.85 (12% drop)
- Remove 4 parameters: Sharpe = 1.55 (26% drop)
- Remove 6 parameters (2 remaining): Sharpe = 1.20 (43% drop)
Analysis:
- Core edge is captured by 4 parameters (Sharpe 1.55)
- Additional 4 parameters add 0.55 Sharpe — much of this is likely overfitting
- With 4 parameters, DSR is more lenient (fewer trials per parameter)
- 4-parameter version has better parameter stability in walk-forward
Decision: deploy the 4-parameter version (Sharpe 1.55)
instead of the 8-parameter version (Sharpe 2.1).
Expected live Sharpe: 1.0-1.2 (after realistic degradation)
vs 8-parameter expected live Sharpe: 0.8-1.0 (more degradation from overfitting)
Before accepting a strategy as not overfit, verify: