Financial Backtesting and the Curse of Overfitting
Even if you manage to avoid all the above pitfalls, your backtesting may still lead to false positives due to multiple testing, selection bias, or overfitting. Overfitting happens when a strategy is tailored too closely to historical data, making it unlikely to perform well on new, unseen data.
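The multiple-testing effect can be illustrated with a small simulation. This is only a sketch under stated assumptions: the returns are synthetic (normally distributed, zero true mean, 1% daily volatility), and the function name `best_sharpe_by_trial_count` is hypothetical. No strategy here has any real edge, yet the best backtested Sharpe ratio tends to grow as more trials are compared.

```python
import numpy as np


def best_sharpe_by_trial_count(trial_counts, n_days=252, seed=0):
    """Best annualized Sharpe ratio among the first n zero-skill strategies."""
    rng = np.random.default_rng(seed)
    max_trials = max(trial_counts)
    # Synthetic daily returns with zero true mean: pure noise, no skill.
    returns = rng.normal(0.0, 0.01, size=(n_days, max_trials))
    sharpe = returns.mean(axis=0) / returns.std(axis=0, ddof=1) * np.sqrt(252)
    # Picking the winner among more trials can only raise the reported maximum.
    return {n: float(sharpe[:n].max()) for n in trial_counts}


best = best_sharpe_by_trial_count([1, 10, 100, 1000])
for n_trials, value in best.items():
    print(f"{n_trials:>5} trials -> best Sharpe {value:.2f}")
```

Because the winner is selected after the fact, its Sharpe ratio reflects the number of trials as much as any genuine signal.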
A Technical Solution: Combinatorially Symmetric Cross-Validation (CSCV)
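CSCV (López de Prado, 2018) takes a matrix of performance observations with one column per trial (strategy configuration) and splits its rows into an even number of partitions. Every combination of half the partitions is joined into a training (in-sample) set, and the remaining half forms the matching test (out-of-sample) set. For each combination, the best in-sample strategy is identified and its relative rank is checked out of sample. The probability of backtest overfitting (PBO) is the share of combinations in which that strategy ranks below the out-of-sample median.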
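from itertools import combinations
from typing import Callable, Optional, Tuple

import numpy as np
from joblib import Parallel, delayed

# sharpe_ratio and performance_evaluation are not shown here; they are
# assumed to be defined alongside this function.


def probability_of_backtest_overfitting(
    performances: np.ndarray,
    n_partitions: int = 16,
    risk_free_return: float = 0.0,
    metric: Optional[Callable] = None,
    n_jobs: int = 1,
) -> Tuple[float, np.ndarray]:
    if n_partitions % 2 == 1:
        raise ValueError("Number of partitions must be even.")
    if metric is None:
        metric = sharpe_ratio

    # Rows are observations; columns are strategies (trials).
    _, n_strategies = performances.shape
    partitions = np.array_split(performances, n_partitions)
    partition_indices = range(n_partitions)
    # Every way of choosing half of the partitions as the training set.
    partition_combinations_indices = list(
        combinations(partition_indices, n_partitions // 2)
    )

    # Evaluate each split; the test set is the complement of the training set.
    results = Parallel(n_jobs=n_jobs)(
        delayed(performance_evaluation)(
            np.concatenate([partitions[i] for i in train_indices], axis=0),
            np.concatenate(
                [partitions[i] for i in partition_indices if i not in train_indices],
                axis=0,
            ),
            n_strategies,
            metric,
            risk_free_return,
        )
        for train_indices in partition_combinations_indices
    )

    results = np.array(results)
    # Column 0 is averaged into the PBO; column 1 holds the logit values.
    pbo = results[:, 0].mean(axis=0)
    logit_values = results[:, 1]
    return pbo, logit_values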
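The function above relies on two helpers, `sharpe_ratio` and `performance_evaluation`, that are not shown. The following is a minimal sketch of what such helpers might look like, not the library's actual implementation. The names `sketch_sharpe_ratio` and `sketch_performance_evaluation` are hypothetical, and the return layout (an overfit indicator followed by a logit) is an assumption inferred from how `results[:, 0]` and `results[:, 1]` are used above.

```python
import numpy as np


def sketch_sharpe_ratio(returns, risk_free_return=0.0):
    """Per-period Sharpe ratio of each column (one column per strategy)."""
    excess = returns - risk_free_return
    return excess.mean(axis=0) / excess.std(axis=0, ddof=1)


def sketch_performance_evaluation(train, test, n_strategies, metric, risk_free_return):
    """Return (is_overfit, logit) for one in-sample/out-of-sample split.

    Assumed behavior, not taken from the library source.
    """
    train_scores = metric(train, risk_free_return)
    test_scores = metric(test, risk_free_return)
    # Strategy that would have been selected on the in-sample data.
    best = int(np.argmax(train_scores))
    # Out-of-sample rank of that strategy: 1 = worst, n_strategies = best.
    rank = 1 + int(np.sum(test_scores < test_scores[best]))
    # Relative rank; dividing by n + 1 keeps it strictly inside (0, 1).
    omega = rank / (n_strategies + 1)
    logit = float(np.log(omega / (1.0 - omega)))
    # A logit of zero or less means the winner was no better than the median.
    return float(logit <= 0.0), logit


# Three strategies, three observations; strategy 0 is the in-sample winner.
train = np.array([[0.02, 0.00, -0.01],
                  [0.03, 0.01, 0.00],
                  [0.01, 0.00, -0.02]])
test_reversed = train[:, ::-1]  # strategy 0 becomes the worst out of sample

flag_bad, logit_bad = sketch_performance_evaluation(
    train, test_reversed, 3, sketch_sharpe_ratio, 0.0
)
flag_good, logit_good = sketch_performance_evaluation(
    train, train, 3, sketch_sharpe_ratio, 0.0
)
print(flag_bad, logit_bad)
print(flag_good, logit_good)
```

Averaging the first returned value over all splits gives the share of splits in which the in-sample winner underperformed out of sample, which matches how `pbo` is computed above.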
Implementations are available in both Python and Julia in the RiskLabAI library.
Mathematical Formula for CSCV
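To calculate the total number of train/test combinations for S partitions, use:

$$\binom{S}{S/2} = \frac{S!}{\left((S/2)!\right)^{2}}$$

With the default of 16 partitions, this gives 12,870 combinations.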
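As a quick numerical check of how fast this count grows, here is a minimal sketch using the standard-library `math.comb`. The helper name `n_cscv_splits` is hypothetical and only mirrors the `combinations(partition_indices, n_partitions // 2)` call in the function above.

```python
from math import comb


def n_cscv_splits(n_partitions: int) -> int:
    """Number of ways to choose half of the partitions as the training set."""
    if n_partitions % 2 == 1:
        raise ValueError("Number of partitions must be even.")
    return comb(n_partitions, n_partitions // 2)


split_counts = {s: n_cscv_splits(s) for s in (4, 8, 16)}
for s, count in split_counts.items():
    print(f"{s:>2} partitions -> {count} train/test splits")
```

Since each split triggers one call to the evaluation helper, the partition count directly drives runtime, which is presumably why the function exposes `n_jobs` for parallel execution.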
References
- López de Prado, M. (2018). Advances in Financial Machine Learning. John Wiley & Sons.
- López de Prado, M. (2020). Machine Learning for Asset Managers. Cambridge University Press.