This chapter explains why most backtested strategies in finance are false discoveries. The core problem is Selection Bias under Multiple Testing (SBuMT), also known as "backtest overfitting."
A backtest is a historical simulation, not a controlled experiment. Researchers (and firms) run thousands of trials and only present the best-performing ones. This process of "cherry-picking" inflates performance and leads to strategies that fail in live trading. The chapter provides a framework for quantifying this inflation.
The Problem in Terms of Precision and Recall
The reliability of a "significant" backtest depends on the "prior" probability that a strategy is truly profitable.
Let θ=sT/sF be the odds ratio of true strategies (sT) to false strategies (sF). In finance, θ is very low.
Let α be the false positive rate (Type I error) and β be the false negative rate (Type II error).
The precision (the probability that a strategy proven significant is actually true) is:
$$\text{precision} = \frac{(1-\beta)\theta}{(1-\beta)\theta + \alpha}$$
Key Insight: Even with a low α (e.g., a 5% significance level), if the odds θ are tiny (e.g., 1/99), the precision will be extremely low. The text calculates that with α = 0.05, β = 0.2 and θ = 1/99 the precision is only about 14%, which implies an 86% false discovery rate.
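The arithmetic behind that figure can be checked with a short sketch. The function name `precision` and the 80% power (β = 0.2) are assumptions made for illustration; only α = 0.05 and θ = 1/99 come from the text above.

```python
def precision(theta: float, alpha: float, beta: float) -> float:
    """Probability that a 'significant' strategy is actually a true one."""
    true_positives = (1.0 - beta) * theta  # true strategies that are detected
    false_positives = alpha                # false strategies that pass the test
    return true_positives / (true_positives + false_positives)

# Assumed inputs: 1 true strategy per 99 false ones, 5% significance, 80% power.
p = precision(theta=1 / 99, alpha=0.05, beta=0.20)
fdr = 1.0 - p
print(f"precision = {p:.4f}, false discovery rate = {fdr:.4f}")
# prints: precision = 0.1391, false discovery rate = 0.8609
```

Because θ multiplies the numerator, lowering α alone does little when true strategies are rare.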
Multiple Testing and Error Rates
When K independent trials are run, the error rates are compounded:
Familywise Error Rate (FWER): The probability of getting at least one false positive.
$$\alpha_K = 1 - (1-\alpha)^K$$
Familywise Miss Rate: The probability of missing all true positives.
$$\beta_K = \beta^K$$
This makes the adjusted precision even worse:
$$\text{precision} = \frac{(1-\beta^K)\theta}{(1-\beta^K)\theta + 1-(1-\alpha)^K}$$
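A minimal sketch of how quickly the adjusted precision decays with the number of trials. The function name `familywise_precision` and the example inputs (θ = 1/99, α = 0.05, β = 0.2) are illustrative assumptions, and the trials are taken to be independent, as in the formulas above.

```python
def familywise_precision(theta: float, alpha: float, beta: float, k: int) -> float:
    """Precision after K independent trials, using the familywise rates."""
    alpha_k = 1.0 - (1.0 - alpha) ** k  # P(at least one false positive)
    beta_k = beta ** k                  # P(missing every true positive)
    recall_k = 1.0 - beta_k
    return recall_k * theta / (recall_k * theta + alpha_k)

# Assumed example values; K = 1 reproduces the single-test precision.
for k in (1, 5, 10, 25):
    print(k, round(familywise_precision(1 / 99, 0.05, 0.20, k), 4))
```

Recall saturates near 1 after a few trials, while the familywise false positive rate keeps growing, so precision falls steadily as K increases.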
The Sharpe Ratio (SR) and SBuMT
The chapter provides a framework for correcting the Sharpe Ratio (SR) for selection bias.
1. The Distribution of the Sharpe Ratio
The estimated SR ($\widehat{SR}$) is asymptotically Normal, even if returns are non-Normal. However, its variance depends on the returns' skewness (γ3) and kurtosis (γ4).
Mertens' Asymptotic Distribution:
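$$(\widehat{SR} - SR) \xrightarrow{a} \mathcal{N}\left[0, \frac{1 + \frac{1}{2}SR^2 - \gamma_3 SR + \frac{\gamma_4 - 3}{4}SR^2}{T}\right]$$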
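To see how skewness and kurtosis widen the uncertainty around an estimated SR, the asymptotic variance can be evaluated directly. This is a sketch: the function name `sharpe_std_error` and the example numbers are assumptions, and the SR is taken to be non-annualized, measured at the same frequency as the T observations.

```python
import math

def sharpe_std_error(sr: float, t: int, skew: float = 0.0, kurt: float = 3.0) -> float:
    """Asymptotic standard error of the estimated SR (Mertens' variance)."""
    variance = (1.0 + 0.5 * sr**2 - skew * sr + (kurt - 3.0) / 4.0 * sr**2) / t
    return math.sqrt(variance)

# Hypothetical inputs: daily SR of 0.1 estimated from 1,250 observations.
se_normal = sharpe_std_error(0.1, 1250)                          # Gaussian returns
se_fat_tails = sharpe_std_error(0.1, 1250, skew=-3.0, kurt=10.0)
print(f"{se_normal:.4f} vs {se_fat_tails:.4f}")
```

Negative skewness and excess kurtosis both increase the standard error, so the same observed SR is weaker evidence when returns have fat left tails.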
2. The "False Strategy" Theorem
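This theorem estimates the expected maximum Sharpe ratio, $E[\max_k\{\widehat{SR}_k\}]$, that a researcher would get by chance after running K trials of a false strategy (where the true SR is 0):
$$E[\max_k\{\widehat{SR}_k\}] \, \left(V[\{\widehat{SR}_k\}]\right)^{-1/2} \approx (1-\gamma)\, Z^{-1}\left[1 - \frac{1}{K}\right] + \gamma\, Z^{-1}\left[1 - \frac{1}{Ke}\right]$$
(where γ is the Euler-Mascheroni constant, $Z^{-1}$ is the inverse Gaussian CDF, e is Euler's number, and $V[\{\widehat{SR}_k\}]$ is the variance of the estimated SRs across trials).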
Implication: A researcher running 1,000 independent trials of a random strategy (true SR=0, with unit variance of the estimated SRs across trials) should expect to find a "winner" with SR≈3.26.
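That figure can be reproduced with a sketch of the approximation usually stated for this theorem, (1−γ)·Z⁻¹[1−1/K] + γ·Z⁻¹[1−1/(Ke)], scaled by the standard deviation of the SR estimates across trials. The function name `expected_max_sharpe` and its `variance` parameter are illustrative choices, and the trials are assumed independent.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def expected_max_sharpe(k: int, variance: float = 1.0) -> float:
    """Expected maximum SR across k independent trials when the true SR is 0."""
    if k < 2:
        raise ValueError("k must be at least 2 for the approximation to apply")
    z = NormalDist()  # standard Normal
    q1 = z.inv_cdf(1.0 - 1.0 / k)
    q2 = z.inv_cdf(1.0 - 1.0 / (k * math.e))
    return math.sqrt(variance) * ((1.0 - EULER_GAMMA) * q1 + EULER_GAMMA * q2)

print(round(expected_max_sharpe(1000), 2))  # roughly 3.26 with unit variance
```

The expected maximum grows with K, so the hurdle a reported SR must clear rises with every additional trial.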
Solutions for Quantifying Overfitting
The chapter provides two methods to test if a strategy's SR is truly significant or just the result of multiple testing.
Step 1: Find the Effective Number of Trials (K)
A researcher may run 1,000 backtests, but many are correlated (not independent). We must find the effective number of independent trials, E[K].
Solution: Use an ML clustering algorithm (like ONC) on the return series of all 1,000 backtests. The resulting number of clusters is the effective number of trials, E[K].
Step 2 (Option A): The Deflated Sharpe Ratio (DSR)
The DSR recalculates the p-value of the SR by testing it against the correct null hypothesis (the expected maximum from the False Strategy Theorem) and adjusting for non-Normal returns.
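One way to sketch the DSR uses its commonly stated form: the benchmark SR* is the expected maximum SR under the False Strategy Theorem, and the test statistic is (SR̂ − SR*)·√(T−1) / √(1 − γ3·SR̂ + (γ4−1)/4·SR̂²). All function names, parameter names and example inputs below are assumptions for illustration, not the package's API; SR values are non-annualized.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def expected_max_sharpe(k: int, variance: float) -> float:
    """Benchmark SR*: expected maximum SR over k independent false trials."""
    z = NormalDist()
    q1 = z.inv_cdf(1.0 - 1.0 / k)
    q2 = z.inv_cdf(1.0 - 1.0 / (k * math.e))
    return math.sqrt(variance) * ((1.0 - EULER_GAMMA) * q1 + EULER_GAMMA * q2)

def deflated_sharpe_ratio(sr_hat: float, t: int, k: int, sr_variance: float,
                          skew: float = 0.0, kurt: float = 3.0) -> float:
    """Probability that the true SR exceeds 0 after deflating for k trials."""
    sr_star = expected_max_sharpe(k, sr_variance)
    denom = math.sqrt(1.0 - skew * sr_hat + (kurt - 1.0) / 4.0 * sr_hat**2)
    z_stat = (sr_hat - sr_star) * math.sqrt(t - 1) / denom
    return NormalDist().cdf(z_stat)

# Hypothetical inputs: annualized SR of 2.5 on 1,250 daily observations,
# 100 effective trials, annualized cross-trial SR variance of 0.5.
dsr = deflated_sharpe_ratio(sr_hat=2.5 / math.sqrt(250), t=1250, k=100,
                            sr_variance=0.5 / 250, skew=-3.0, kurt=10.0)
print(round(dsr, 3))
```

A DSR below the chosen confidence level (say 0.95) means the observed SR is not distinguishable from the best of k lucky draws.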
Step 2 (Option B): The Familywise Error Rate (FWER) for SR
This approach calculates the familywise p-value ($\alpha_K$) of the observed SR, given the effective number of trials E[K].
First, calculate the z-statistic for the observed SR assuming the true SR=0 (using the Mertens distribution): $\hat{z}[0]$.
Find the single-test p-value: $\alpha = 1 - Z[\hat{z}[0]]$.
Apply the FWER formula using the estimated E[K] from clustering:
$$\alpha_K = 1 - (1-\alpha)^{E[K]} = 1 - Z[\hat{z}[0]]^{E[K]}$$
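This $\alpha_K$ is the probability that at least one of E[K] independent trials of a false strategy (true SR=0) produces an SR as high as the one observed. It is the familywise false positive probability attached to the "discovered" strategy.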
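The three steps can be sketched as follows. The z-statistic is taken in its commonly stated form, SR̂·√(T−1) / √(1 − γ3·SR̂ + (γ4−1)/4·SR̂²); the function name `familywise_alpha`, its parameters and the example inputs are illustrative assumptions, not the package's API.

```python
import math
from statistics import NormalDist

def familywise_alpha(sr_hat: float, t: int, k_eff: int,
                     skew: float = 0.0, kurt: float = 3.0):
    """Return (z0, single-test alpha, familywise alpha_K) for an observed SR."""
    # z-statistic under the null that the true SR is 0 (non-annualized SR).
    denom = math.sqrt(1.0 - skew * sr_hat + (kurt - 1.0) / 4.0 * sr_hat**2)
    z0 = sr_hat * math.sqrt(t - 1) / denom
    alpha = 1.0 - NormalDist().cdf(z0)       # single-test p-value
    alpha_k = 1.0 - (1.0 - alpha) ** k_eff   # corrected for k_eff trials
    return z0, alpha, alpha_k

# Hypothetical inputs: annualized SR of 2.5 from 1,250 daily observations,
# skewness -3, kurtosis 10, and 10 effective trials from clustering.
z0, alpha, alpha_k = familywise_alpha(2.5 / math.sqrt(250), 1250, 10,
                                      skew=-3.0, kurt=10.0)
print(f"z0 = {z0:.2f}, alpha = {alpha:.2e}, alpha_K = {alpha_k:.2e}")
```

For small α the correction is close to multiplying the single-test p-value by E[K], which is why the estimate of the effective number of trials matters so much.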
API reference
RiskLabAI implements these in Python and Julia (signatures auto-generated from the package source):