Backtesting through Cross-Validation (CPCV)

This chapter contrasts the three primary methods for backtesting a quantitative strategy. It argues that the most common method (Walk-Forward) is flawed and easily overfit, while standard Cross-Validation has its own drawbacks. It concludes by introducing a new, more robust method called Combinatorial Purged Cross-Validation (CPCV).

The Walk-Forward (WF) Method

What It Is: A standard historical simulation. It trains on data from [0, t] and tests on data at t+1, moving forward in time. This is the most common form of "backtesting."
Advantages:
1. Has a clear, intuitive historical interpretation.
2. Guarantees no information leakage (if purging is used correctly) because the training set always predates the testing set.
Disadvantages (Critical):
1. Tests a Single Path: It only tests the one historical scenario that happened, which is easily overfit.
2. Path-Dependent Overfitting: The model's performance is highly dependent on the sequence of historical events (e.g., a backtest run over 2007-2017 in chronological order can give a different result from one run over the same observations in reverse order).
3. Uneven Information: Decisions at the beginning of the backtest are based on much less data than decisions at the end, making results inconsistent.
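The expanding-window logic described above can be sketched in a few lines. This is a minimal illustration, not the RiskLabAI implementation: the function name `walk_forward_splits` and its `gap` parameter are assumptions made for the example. Each test block is preceded only by earlier observations, and the printed training sizes show the "uneven information" problem directly.

```python
from typing import Iterator, List, Tuple


def walk_forward_splits(
    n_samples: int, n_splits: int, gap: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists for an expanding-window walk-forward.

    Hypothetical helper: the last `n_splits` blocks of equal size are used as
    test sets, and everything before a test block (minus `gap` observations)
    is the training set.
    """
    test_size = n_samples // (n_splits + 1)
    if test_size == 0:
        raise ValueError("not enough samples for the requested number of splits")
    for i in range(n_splits):
        test_start = n_samples - (n_splits - i) * test_size
        train_end = max(test_start - gap, 0)  # drop `gap` observations before the test block
        yield list(range(train_end)), list(range(test_start, test_start + test_size))


wf_splits = list(walk_forward_splits(n_samples=12, n_splits=3, gap=1))
for train, test in wf_splits:
    # The training set grows from split to split, so early decisions use less data.
    print(f"train size={len(train):2d}  test={test}")
```

Note that the three splits together still form a single chronological path, which is the first disadvantage listed above.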

The Cross-Validation (CV) Method

What It Is: This method tests a model's performance on "stress scenarios." It splits data into $k$ sets, then trains on $k-1$ sets and tests on the one held-out set, repeating until each set has been held out once. For example, it might train on 2009-2017 data and then test on the 2008 crisis.
Goal: The goal is not historical accuracy, but to see how a model (trained on "normal" data) would perform under an unknown stress event.
Advantages:
1. Tests $k$ different scenarios, not just the single historical path.
2. Every decision is made using an equal amount of training data.
3. Uses the entire dataset for testing (no warm-up period).
Disadvantages:
1. Still only produces a single backtest path (by stitching the $k$ tests together).
2. Leakage is a high risk because the training set can contain future data. Requires purging and embargoing (from Ch. 7).
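To make the purging and embargo requirement concrete, here is a simplified sketch on integer indices. It assumes every label for observation `j` is built from observations `j` through `j + horizon`; that assumption, the function name `purged_kfold_splits`, and the integer `embargo` are illustrative choices, not the library's API (the signatures listed later use event start and end times and a fractional embargo).

```python
from typing import Iterator, List, Tuple


def purged_kfold_splits(
    n_samples: int, n_splits: int, horizon: int, embargo: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists for k-fold CV with purging and embargo.

    Assumption for this sketch: the label of observation j spans [j, j + horizon].
    """
    bounds = [n_samples * i // n_splits for i in range(n_splits + 1)]
    for start, stop in zip(bounds[:-1], bounds[1:]):
        test = list(range(start, stop))
        # Last observation touched by any test label.
        last_label_end = min(stop - 1 + horizon, n_samples - 1)
        # Purge: keep earlier observations only if their label ends before the test set starts.
        before = [j for j in range(start) if j + horizon < start]
        # Purge + embargo: skip observations overlapping test labels, plus an extra buffer.
        after = [j for j in range(stop, n_samples) if j > last_label_end + embargo]
        yield before + after, test


cv_splits = list(purged_kfold_splits(n_samples=20, n_splits=4, horizon=2, embargo=1))
for train, test in cv_splits:
    print(f"test={test[0]:2d}..{test[-1]:2d}  train size={len(train)}")
```

Unlike walk-forward, the training set here can include observations that come after the test fold, which is why the purge is applied on both sides.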

The Combinatorial Purged Cross-Validation (CPCV) Method

This is the author's novel method, designed to fix the flaws of WF and CV by testing multiple paths to generate a distribution of performance metrics, not just a single number.

What It Is:
1. The data is split into $N$ groups.
2. A test-set size of $k$ groups is chosen (where $k \le N/2$ ).
3. The algorithm then generates all possible combinations of training/testing splits. For each split, it trains on $N-k$ groups and tests on $k$ groups.
4. All training sets are purged (and embargoed) to prevent leakage.
5. This combinatorial process generates $\varphi$ unique, full-length backtest paths.
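The combinatorial steps above can be sketched with the standard library. This is an illustration under stated assumptions: it works on group labels only, omits the purging and embargo of step 4, and uses hypothetical variable names (`n_groups`, `n_test_groups`, `paths`). Each group is assigned to a path in the order it appears as a test group, so every path ends up covering all groups exactly once.

```python
from itertools import combinations
from math import comb

n_groups, n_test_groups = 6, 2  # N and k in the text

# Step 3: every way of choosing k test groups out of N.
splits = []
for test_groups in combinations(range(n_groups), n_test_groups):
    train_groups = tuple(g for g in range(n_groups) if g not in test_groups)
    splits.append((train_groups, test_groups))

# Step 5: assemble full-length paths. The j-th time a group is used as a
# test group, its out-of-sample forecasts are assigned to path j.
paths = {}  # path index -> {group: index of the split supplying its forecasts}
times_tested = [0] * n_groups
for split_id, (_, test_groups) in enumerate(splits):
    for g in test_groups:
        paths.setdefault(times_tested[g], {})[g] = split_id
        times_tested[g] += 1

print(f"{len(splits)} splits, {len(paths)} paths")
for path_id, groups in paths.items():
    print(path_id, [groups[g] for g in range(n_groups)])
```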
Number of Paths ( $\varphi$ ): The number of unique backtest paths generated is:
$\varphi[N, k]=\frac{k}{N}\binom{N}{N-k}=\frac{\prod_{i=1}^{k-1}(N-i)}{(k-1)!}$
- Example: Using $k=2$ (testing on 2 groups at a time) is a powerful "sweet spot." It generates $\varphi[N, 2] = N-1$ paths while keeping the training set size large.
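As a quick numerical check of the path-count formula, the sketch below evaluates both forms and confirms they agree, including the $\varphi[N, 2] = N-1$ case. The function names are illustrative only.

```python
from math import comb, factorial, prod


def n_paths(n_groups: int, n_test_groups: int) -> int:
    """phi[N, k] = (k / N) * C(N, N - k), computed in exact integer arithmetic."""
    return n_test_groups * comb(n_groups, n_groups - n_test_groups) // n_groups


def n_paths_product(n_groups: int, n_test_groups: int) -> int:
    """phi[N, k] = prod_{i=1}^{k-1} (N - i) / (k - 1)!"""
    numerator = prod(n_groups - i for i in range(1, n_test_groups))
    return numerator // factorial(n_test_groups - 1)


for n in (6, 8, 10):
    for k in (1, 2, 3):
        assert n_paths(n, k) == n_paths_product(n, k)
        print(f"N={n:2d} k={k}  splits={comb(n, k):3d}  paths={n_paths(n, k):2d}")
```

With $k=1$ the formula returns a single path, which is the ordinary CV case described earlier.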
How It Solves Overfitting:
- WF and CV produce a single Sharpe Ratio (SR), $y_i$ . This $y_i$ has a high variance ( $\sigma^2(y_i)$ ) and is easily "cherry-picked" (selection bias).
- CPCV generates $\varphi$ different SRs for the same strategy. It produces a distribution of performance, allowing us to analyze the mean SR, $\mu_i$ .
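- The variance of this mean is $\sigma^{2}(\mu_{i}) = \varphi^{-1} \sigma_{i}^{2}\left(1+(\varphi-1) \bar{\rho}_{i}\right)$, where $\sigma_i^2$ is the variance of the SRs across paths and $\bar{\rho}_i$ is their average pairwise correlation. It is lower than the variance of a single backtest whenever $\bar{\rho}_i < 1$, and the reduction is largest when the paths are weakly correlated.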
- Because the variance is so much lower, it is much harder to find a "false discovery." CPCV defeats backtest overfitting by forcing the strategy to prove its profitability across many different scenarios (paths), not just the single historical one.
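The variance formula above is easy to evaluate for a few values of the average correlation. The sketch below uses made-up inputs (a unit variance and five paths) purely to show how the benefit shrinks as the paths become more correlated; the function name is an assumption for the example.

```python
def variance_of_mean_sr(sigma2: float, n_paths: int, avg_corr: float) -> float:
    """sigma^2(mu) = sigma^2 * (1 + (phi - 1) * rho_bar) / phi"""
    return sigma2 * (1 + (n_paths - 1) * avg_corr) / n_paths


# Illustrative inputs only: sigma^2 = 1 and phi = 5 paths.
for rho_bar in (0.0, 0.5, 1.0):
    var_mean = variance_of_mean_sr(sigma2=1.0, n_paths=5, avg_corr=rho_bar)
    print(f"rho_bar={rho_bar:.1f}  variance of mean SR={var_mean:.2f}")
```

With uncorrelated paths the variance falls by a factor of $\varphi$; with perfectly correlated paths there is no reduction at all.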

Cross-Validator Design in `RiskLabAI`

To make these complex cross-validation strategies easy to use and interchangeable, we implement a Factory and Controller design pattern.

CrossValidator (Interface): An abstract base class that defines the common API all validators must implement (`split`, `backtest_paths`, `backtest_predictions`). This ensures that any validator can be used in the same way.
CrossValidatorFactory: A simple factory class that constructs the correct validator instance (e.g., 'purgedkfold' or 'combinatorialpurged') based on a string name.
CrossValidatorController: A high-level controller that acts as the main user-facing class. It uses the factory to create and hold a validator instance, simplifying the workflow.

This design allows a user to switch from a WalkForward backtest to a CombinatorialPurged backtest by changing only one line of code (the validator_type string), promoting rapid and robust experimentation.
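The pattern can be illustrated with a self-contained toy. Everything in this sketch is hypothetical: the class names (`Validator`, `ToyKFold`, `ToyWalkForward`, `ValidatorFactory`, `ValidatorController`), the registry keys, and the reduced single-method interface are stand-ins chosen for the example and are not the RiskLabAI classes or their signatures.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterator, List, Tuple, Type

Split = Tuple[List[int], List[int]]


class Validator(ABC):
    """Hypothetical common interface: every validator exposes `split`."""

    def __init__(self, n_splits: int) -> None:
        self.n_splits = n_splits

    @abstractmethod
    def split(self, n_samples: int) -> Iterator[Split]:
        ...


class ToyKFold(Validator):
    def split(self, n_samples: int) -> Iterator[Split]:
        bounds = [n_samples * i // self.n_splits for i in range(self.n_splits + 1)]
        for a, b in zip(bounds[:-1], bounds[1:]):
            train = [j for j in range(n_samples) if j < a or j >= b]
            yield train, list(range(a, b))


class ToyWalkForward(Validator):
    def split(self, n_samples: int) -> Iterator[Split]:
        bounds = [n_samples * i // (self.n_splits + 1) for i in range(self.n_splits + 2)]
        for a, b in zip(bounds[1:-1], bounds[2:]):
            yield list(range(a)), list(range(a, b))  # train strictly precedes test


class ValidatorFactory:
    """Maps a string name to a validator class (keys are made up for this sketch)."""

    _registry: Dict[str, Type[Validator]] = {
        "kfold": ToyKFold,
        "walkforward": ToyWalkForward,
    }

    @classmethod
    def create(cls, validator_type: str, **kwargs) -> Validator:
        if validator_type not in cls._registry:
            raise ValueError(f"unknown validator type: {validator_type!r}")
        return cls._registry[validator_type](**kwargs)


class ValidatorController:
    """User-facing wrapper that builds and holds one validator."""

    def __init__(self, validator_type: str, **kwargs) -> None:
        self.validator = ValidatorFactory.create(validator_type, **kwargs)

    def split(self, n_samples: int) -> List[Split]:
        return list(self.validator.split(n_samples))


validator_type = "walkforward"  # the only line that changes when swapping methods
wf = ValidatorController(validator_type, n_splits=3).split(12)
kf = ValidatorController("kfold", n_splits=3).split(12)
print(len(wf), len(kf))
```

The calling code never names a concrete validator class, which is what keeps the swap down to a single string.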

API reference

RiskLabAI implements these in Python and Julia (signatures auto-generated from the package source):

Python	Julia
`class KFold(CrossValidator):`	`struct KFoldCV n_splits::Int shuffle::Bool rng::AbstractRNG end KFoldCV(n_splits::Integer; shuffle::Bool = false, random_seed = nothing) = KFoldCV( n_splits, shuffle, random_seed === nothing ? default_rng() : MersenneTwister(random_seed), )`
`class PurgedKFold(CrossValidator):`	`struct PurgedKFoldCV n_splits::Int event_starts::Vector event_ends::Vector embargo::Float64 end PurgedKFoldCV( n_splits::Integer, event_starts::AbstractVector, event_ends::AbstractVector; embargo::Real = 0.0, ) = PurgedKFoldCV(n_splits, collect(event_starts), collect(event_ends), Float64(embargo))`
`class CombinatorialPurged(PurgedKFold):`	`struct CombinatorialPurgedCV n_splits::Int n_test_groups::Int event_starts::Vector event_ends::Vector embargo::Float64 end function CombinatorialPurgedCV( n_splits::Integer, n_test_groups::Integer, event_starts::AbstractVector, event_ends::AbstractVector; embargo::Real = 0.0, )`
`class WalkForward(KFold):`	`struct WalkForwardCV n_splits::Int max_train_size::Union{Nothing,Int} gap::Int end WalkForwardCV(n_splits::Integer; max_train_size = nothing, gap::Integer = 0) =`
`def bagging_classifier_accuracy(N: int, p: float) -> float:`	`function bagging_classifier_accuracy(N::Integer, p::Real)`
`class BaggingClassifierAccuracy: """ Evaluates a bagging classifier's accuracy using different weighting schemes based on decision tree c_i scores. Methods: - fit: Fits the bagging classifier. - calculate_c_i: Calculates the c_i score for each tree. - calculate_weights: Computes weights (uniform, c_i, 1-c_i^2). - predict: Predicts class labels using specified weights. - evaluate_all_schemes: Gets accuracy for all weighting schemes. """ def __init__( self, n_estimators: int = 1000, max_samples: int = 100, max_features: float = 1.0, random_state: Optional[int] = None, ):`	`function fit_bagging( x::AbstractMatrix{<:Real}, y::AbstractVector; n_estimators::Integer = 1000, max_samples::Integer = 100, max_features::Integer = 1, random_state = nothing, )`
`def calculate_bootstrap_accuracy( clf: BaggingClassifier, X: pd.DataFrame, y: pd.Series, n_bootstraps: int = 1000 ) -> tuple[np.ndarray, float, float]:`	`function calculate_bootstrap_accuracy( trees, classes, x::AbstractMatrix{<:Real}, y::AbstractVector; weights::AbstractVector{<:Real} = fill(1.0 / length(trees), length(trees)), n_bootstraps::Integer = 1000, random_state = nothing, )`
`def backtest_predictions( self, estimator: Union[Estimator, dict[str, Estimator]], data: Union[pd.DataFrame, dict[str, pd.DataFrame]], labels: Union[pd.Series, dict[str, pd.Series]], sample_weights: Optional[Union[np.ndarray, dict[str, np.ndarray]]] = None, predict_probability: bool = False, n_jobs: int = 1, ) -> Union[dict[str, np.ndarray], dict[str, dict[str, np.ndarray]]]:`	`function cross_val_score( cv, x::AbstractMatrix{<:Real}, y::AbstractVector; n_trees::Integer = 100, n_subfeatures::Integer = -1, max_depth::Integer = -1, scoring::Symbol = :accuracy, random_state::Integer = 0, )`

Full source: Python · Julia