Backtest Statistics Categories

Backtest statistics are essential for evaluating the efficacy of investment strategies. These metrics fall into different categories:

General Features: Includes metrics like Time range, Average AUM, Capacity, and Leverage.
Performance Metrics: Such as PnL, annualized rate of return, hit ratio, etc.

backtest_statistics.py

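import numpy as np
import pandas as pd


def bet_timing(target_positions: pd.Series) -> pd.Index:
    """
    Find the timestamps at which bets end (positions flatten or flip).

    :param target_positions: Series of target positions indexed by time.
    :return: Index of bet timestamps.
    """
    # Timestamps where a non-zero position is flattened to zero.
    zero_positions = target_positions[target_positions == 0].index
    lagged_non_zero_positions = target_positions.shift(1)
    lagged_non_zero_positions = lagged_non_zero_positions[lagged_non_zero_positions != 0].index
    bets = zero_positions.intersection(lagged_non_zero_positions)

    # Timestamps where the position flips sign (product of consecutive positions is negative).
    position_products = target_positions.iloc[1:] * target_positions.iloc[:-1].values
    bets = bets.union(position_products[position_products < 0].index).sort_values()

    # The last timestamp always closes the final bet.
    if target_positions.index[-1] not in bets:
        bets = bets.append(target_positions.index[-1:])

    return bets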

The time-weighted rate of return (TWRR) is a method for calculating returns that adjusts for external cash flows. For a single sub-period it is defined in terms of the following quantities:

$r_{i, t}$: TWRR for portfolio $i$ over the interval $[t-1, t]$
$\pi_{i, t}$: Mark-to-market profit or loss for portfolio $i$ at time $t$
$K_{i, t}$: Market value of the assets managed by portfolio $i$ over sub-period $t$

r_{i, t} =\frac{\pi_{i, t}}{K_{i, t}}
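As a rough illustration, the sub-period formula can be applied and the results chained as in the sketch below. The profit and capital figures are made up, and the geometric linking of sub-period returns is a common convention for time-weighted returns that is assumed here rather than stated above.

```python
# Hypothetical mark-to-market PnL and managed capital for three sub-periods.
pnl = [10.0, -5.0, 20.0]             # pi_{i,t}
capital = [1000.0, 1000.0, 2000.0]   # K_{i,t}; the jump mimics an external inflow

# Sub-period TWRR: r_{i,t} = pi_{i,t} / K_{i,t}
sub_period_returns = [p / k for p, k in zip(pnl, capital)]

# Assumption: chain the sub-period returns geometrically to get a total return.
growth = 1.0
for r in sub_period_returns:
    growth *= 1.0 + r
total_twrr = growth - 1.0

print(sub_period_returns)
print(total_twrr)
```

Because each sub-period return is scaled by the capital managed in that sub-period, the inflow in the third period does not distort the result.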

backtest_statistics.py

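def calculate_holding_period(target_positions: pd.Series) -> tuple:
    """
    Estimate the average holding period (in days) of a strategy.

    :param target_positions: Series of target positions indexed by datetime.
    :return: Tuple of the per-exit holding periods (dT) with weights (w), and their weighted mean.
    """
    hold_period, time_entry = pd.DataFrame(columns=['dT', 'w']), 0.0
    position_difference = target_positions.diff()
    # Elapsed time since the first observation, in days.
    time_difference = (target_positions.index - target_positions.index[0]) / np.timedelta64(1, 'D')

    for i in range(1, target_positions.shape[0]):
        if position_difference.iloc[i] * target_positions.iloc[i - 1] >= 0:
            # Position increased or unchanged: update the position-weighted average entry time.
            if target_positions.iloc[i] != 0:
                time_entry = (time_entry * target_positions.iloc[i - 1] + time_difference[i] * position_difference.iloc[i]) / target_positions.iloc[i]
        else:
            if target_positions.iloc[i] * target_positions.iloc[i - 1] < 0:
                # Position flipped sign: close the whole previous position and reset the entry time.
                hold_period.loc[target_positions.index[i], ['dT', 'w']] = (time_difference[i] - time_entry, abs(target_positions.iloc[i - 1]))
                time_entry = time_difference[i]
            else:
                # Position reduced: record the holding period of the part that was closed.
                hold_period.loc[target_positions.index[i], ['dT', 'w']] = (time_difference[i] - time_entry, abs(position_difference.iloc[i]))

    # Weight each holding period by the size of the position that was closed.
    if hold_period['w'].sum() > 0:
        mean_holding_period = (hold_period['dT'] * hold_period['w']).sum() / hold_period['w'].sum()
    else:
        mean_holding_period = np.nan

    return hold_period, mean_holding_period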

Performance statistics that are not risk-adjusted include:

PnL: Total dollars earned over the backtest.
PnL from Long Positions: The portion of PnL earned by long holdings only.
Annualized Rate of Return: Annual rate of total return, including all forms of earnings and expenses.
Hit Ratio: Fraction of bets that were profitable.

Strategy returns often arrive in "runs", uninterrupted sequences of returns with the same sign, either positive or negative. Understanding how concentrated these runs are, and how they affect risk measures such as drawdowns and time under water, is essential for assessing a strategy's viability.

Consider a time series of bet returns, $r_t$ , with a length $T$ . We can split these returns into positive and negative subsets, $r^+$ and $r^-$ . Two weight series, $w^+$ and $w^-$ , can be defined as:

w^+ = \frac{r^+}{\sum r^+} \quad \text{and} \quad w^- = \frac{r^-}{\sum r^-}

We define the Herfindahl-Hirschman Index (HHI)-based concentration of positive returns ( $h^+$ ) and negative returns ( $h^-$ ), where $\|w^+\|$ and $\|w^-\|$ denote the number of positive and negative returns, as:

h^+ = \frac{\sum (w^+)^2 - 1/\|w^+\|}{1 - 1/\|w^+\|}

h^- = \frac{\sum (w^-)^2 - 1/\|w^-\|}{1 - 1/\|w^-\|}
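To get a feel for the scale of this measure, the sketch below applies the formula to two made-up sets of positive returns. The helper name `hhi_concentration` is hypothetical, and the guard for very short inputs is an assumption borrowed from the `calculate_hhi` listing on this page.

```python
def hhi_concentration(returns):
    """Normalized HHI of same-sign returns: 0 for uniform weights, 1 if one return dominates."""
    n = len(returns)
    if n <= 2:
        return float("nan")  # too few observations for a meaningful value
    total = sum(returns)
    weights = [r / total for r in returns]
    raw_hhi = sum(w * w for w in weights)
    # Rescale so that the result lies between 0 and 1.
    return (raw_hhi - 1.0 / n) / (1.0 - 1.0 / n)


uniform = [0.01, 0.01, 0.01, 0.01]        # hypothetical evenly spread gains
concentrated = [0.97, 0.01, 0.01, 0.01]   # hypothetical gains dominated by one bet

print(hhi_concentration(uniform))
print(hhi_concentration(concentrated))
```

The evenly spread series gives a value of essentially zero, while the series dominated by a single return gives roughly 0.92.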

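Desirable strategy characteristics include:

High Sharpe ratio
Many bets per year
High hit ratio (relatively few negative returns, i.e. a low count $\|w^-\|$ )
Low $h^+$ (gains not concentrated in a few bets)
Low $h^-$ (losses not concentrated in a few bets)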

HHI Concentration Functions

backtest_statistics.py

def calculate_hhi_concentration(returns: pd.Series) -> tuple:
    """
    Calculate the HHI concentration measures.

    :param returns: Series of returns.
    :return: Tuple containing positive returns HHI, negative returns HHI, and time-concentrated HHI.
    """
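    # Concentration of gains and of losses.
    returns_hhi_positive = calculate_hhi(returns[returns >= 0])
    returns_hhi_negative = calculate_hhi(returns[returns < 0])
    # Concentration of bets over time: HHI of the number of bets per calendar month.
    # Note: pandas 2.2+ prefers the alias 'ME' over 'M' for month-end frequency.
    time_concentrated_hhi = calculate_hhi(returns.groupby(pd.Grouper(freq='M')).count())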

    return returns_hhi_positive, returns_hhi_negative, time_concentrated_hhi

These functionalities are available in both Python and Julia in the RiskLabAI library.

Drawdown and Time Under Water

Drawdown (DD) is the maximum loss suffered between two consecutive high watermarks (HWMs), while Time under Water (TuW) is the time elapsed between an HWM and the moment the PnL exceeds that previous maximum.
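A minimal sketch of these two definitions on a made-up cumulative PnL path is shown below. It uses a running maximum as the HWM and is only meant to illustrate the idea; it does not break the series into individual drawdown episodes.

```python
import pandas as pd

# Hypothetical cumulative PnL path in dollars.
pnl = pd.Series(
    [100.0, 110.0, 105.0, 95.0, 112.0, 108.0, 120.0],
    index=pd.date_range("2024-01-01", periods=7, freq="D"),
)

hwm = pnl.cummax()         # running high watermark
drawdown = hwm - pnl       # dollar loss relative to the current HWM
under_water = pnl < hwm    # True while the series sits below its HWM

max_drawdown = drawdown.max()
days_under_water = int(under_water.sum())

print(max_drawdown)
print(days_under_water)
```

Here the largest dollar drawdown is 15 (from the HWM of 110 down to 95), and the series spends three observations below a previous HWM.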

DD and TuW Functions

backtest_statistics.py

def compute_drawdowns_time_under_water(series: pd.Series, dollars: bool = False) -> tuple:
    """
    Compute the drawdown and time-under-water series.

    :param series: Series of returns or dollar performance, indexed by datetime.
    :param dollars: If True, report drawdowns in dollars; otherwise as a fraction of the HWM.
    :return: Tuple of drawdowns, time under water (in years), and the per-HWM analysis frame.
    """
    series_df = series.to_frame('PnL').reset_index(names='Datetime')
    series_df['HWM'] = series.expanding().max().values

    def process_groups(group):
        # A high watermark that is surpassed immediately has no drawdown.
        if len(group) <= 1:
            return None

        min_pnl = group['PnL'].min()
        return {
            'Start': group['Datetime'].iloc[0],
            'Stop': group['Datetime'].iloc[-1],
            'HWM': group['HWM'].iloc[0],
            'Min': min_pnl,
            'Min. Time': group['Datetime'][group['PnL'] == min_pnl].iloc[0],
        }

    # DataFrame.append was removed in pandas 2.0, so collect the rows first.
    rows = [process_groups(group) for _, group in series_df.groupby('HWM')]
    drawdown_analysis = pd.DataFrame(
        [row for row in rows if row is not None],
        columns=['Start', 'Stop', 'HWM', 'Min', 'Min. Time'],
    )

    if dollars:
        drawdown = drawdown_analysis['HWM'] - drawdown_analysis['Min']
    else:
        drawdown = 1 - drawdown_analysis['Min'] / drawdown_analysis['HWM']

    drawdown.index = drawdown_analysis['Start']
    drawdown.index.name = 'Datetime'

    # A year is an ambiguous timedelta unit, so convert via days (365.2425 days per year).
    time_under_water = ((drawdown_analysis['Stop'] - drawdown_analysis['Start']) / np.timedelta64(1, 'D') / 365.2425).values
    time_under_water = pd.Series(time_under_water, index=drawdown_analysis['Start'])

    return drawdown, time_under_water, drawdown_analysis

Key metrics for runs statistics:

HHI index on positive returns and on negative returns
HHI index on the time between bets
95th percentile of Drawdown (DD) and of Time under Water (TuW)

These metrics help you understand how concentrated the portfolio's returns are and how much risk is involved.

backtest_statistics.py

def calculate_hhi(bet_returns: pd.Series) -> float:
    """
    Calculate the Herfindahl-Hirschman Index (HHI) concentration measure.

    :param bet_returns: Series of bet returns.
    :return: Calculated HHI value.
    """
    if bet_returns.shape[0] <= 2:
        return np.nan

    weight = bet_returns / bet_returns.sum()
    hhi_ = (weight ** 2).sum()
    hhi_ = (hhi_ - bet_returns.shape[0] ** -1) / (1.0 - bet_returns.shape[0] ** -1)

    return hhi_

Implementation Failure Metrics

Investment strategies often fail in live trading because execution costs were underestimated. Key metrics for detecting this risk:

Broker fees per turnover
Average slippage per turnover
Dollar performance per turnover
Return on execution costs

These metrics help you understand how your portfolio could be affected by hidden costs.
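The sketch below shows one plausible way to compute these four ratios from backtest totals. All figures are hypothetical, and the exact definitions (for example, whether dollar performance is measured net of costs) are assumptions that should be checked against the conventions of your own backtester.

```python
# Hypothetical backtest totals, in dollars except for the turnover count.
dollar_performance = 50_000.0   # assumed to be net of fees and slippage
broker_fees = 4_000.0
slippage_costs = 6_000.0
portfolio_turnovers = 20.0      # number of times the portfolio was turned over

broker_fees_per_turnover = broker_fees / portfolio_turnovers
average_slippage_per_turnover = slippage_costs / portfolio_turnovers
dollar_performance_per_turnover = dollar_performance / portfolio_turnovers

# Assumption: execution costs are the sum of broker fees and slippage.
return_on_execution_costs = dollar_performance / (broker_fees + slippage_costs)

print(broker_fees_per_turnover, average_slippage_per_turnover)
print(dollar_performance_per_turnover, return_on_execution_costs)
```

A return on execution costs well above 1 suggests the strategy could survive worse-than-expected execution; a value close to 1 leaves little margin.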

Efficiency Metrics

Sharpe Ratio (SR)

This ratio measures performance by dividing the average excess return ( $\mu$ ) by the standard deviation of those returns ( $\sigma$ ).

\text{SR} = \frac{\mu}{\sigma}
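A small sketch of this ratio on made-up excess returns follows. The function name and the `periods_per_year` parameter are hypothetical, and the square-root-of-time annualization assumes returns are independent and identically distributed, which real strategy returns often are not.

```python
import math
import statistics


def sharpe_ratio(excess_returns, periods_per_year=1):
    """Sharpe ratio of a list of excess returns, optionally annualized."""
    mu = statistics.mean(excess_returns)
    sigma = statistics.stdev(excess_returns)  # sample standard deviation
    # Assumption: IID returns, so the ratio scales with the square root of time.
    return mu / sigma * math.sqrt(periods_per_year)


daily_excess_returns = [0.01, -0.005, 0.007, 0.002, -0.001]  # hypothetical data

print(sharpe_ratio(daily_excess_returns))
print(sharpe_ratio(daily_excess_returns, periods_per_year=252))
```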

Probabilistic Sharpe Ratio (PSR)

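This metric adjusts the estimated Sharpe ratio for the effects of a short track record and non-normal returns. Here $Z$ is the standard normal CDF, $T$ the number of observed returns, $\hat{\gamma}_{3}$ the skewness and $\hat{\gamma}_{4}$ the kurtosis of the returns, and $SR^{*}$ the benchmark Sharpe ratio: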

\widehat{PSR}[SR^{*}] = Z\left[\frac{(\widehat{SR}-SR^{*})\sqrt{T-1}}{\sqrt{1-\hat{\gamma}_{3}\widehat{SR}+\frac{\hat{\gamma}_{4}-1}{4}\widehat{SR}^{2}}}\right]
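The formula translates almost directly into code. The sketch below uses only the standard library; the function and parameter names are hypothetical, and the Sharpe ratios are assumed to be expressed at the same frequency as the observations (not annualized).

```python
import math
from statistics import NormalDist


def probabilistic_sharpe_ratio(sr_hat, sr_benchmark, n_obs, skew=0.0, kurt=3.0):
    """Probability that the true Sharpe ratio exceeds sr_benchmark."""
    numerator = (sr_hat - sr_benchmark) * math.sqrt(n_obs - 1)
    denominator = math.sqrt(1.0 - skew * sr_hat + (kurt - 1.0) / 4.0 * sr_hat ** 2)
    return NormalDist().cdf(numerator / denominator)


# Hypothetical inputs: per-period Sharpe ratio of 0.1 over 101 observations.
psr_normal = probabilistic_sharpe_ratio(0.1, 0.0, 101)
psr_neg_skew = probabilistic_sharpe_ratio(0.1, 0.0, 101, skew=-1.0, kurt=5.0)

print(psr_normal)
print(psr_neg_skew)
```

With the same estimated Sharpe ratio, negative skewness and fat tails lower the PSR, which reflects the extra uncertainty around the estimate.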

Deflated Sharpe Ratio (DSR)

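This is an extension of PSR in which the benchmark $SR^{*}$ is raised to account for the number of trials performed to obtain the Sharpe ratio. Here $V[\{\widehat{SR}_{n}\}]$ is the variance of the estimated Sharpe ratios across trials, $N$ the number of independent trials, $Z^{-1}$ the inverse of the standard normal CDF, and $\gamma$ the Euler-Mascheroni constant: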

SR^{*} = \sqrt{V[\{\widehat{SR}_{n}\}]}\left((1-\gamma)Z^{-1}[1-\frac{1}{N}]+\gamma Z^{-1}[1-\frac{1}{N}e^{-1}]\right)
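As an illustration, the benchmark above can be computed with the standard library alone. The function name `expected_max_sharpe` is hypothetical, and the inputs are made up; the result would then be used as the benchmark in the PSR formula.

```python
import math
from statistics import NormalDist

EULER_MASCHERONI = 0.5772156649015329


def expected_max_sharpe(sr_variance, n_trials):
    """Benchmark SR* implied by the number of trials; requires n_trials > 1."""
    z_inv = NormalDist().inv_cdf
    return math.sqrt(sr_variance) * (
        (1.0 - EULER_MASCHERONI) * z_inv(1.0 - 1.0 / n_trials)
        + EULER_MASCHERONI * z_inv(1.0 - 1.0 / (n_trials * math.e))
    )


# Hypothetical inputs: unit variance across trial Sharpe ratios.
print(expected_max_sharpe(1.0, 10))
print(expected_max_sharpe(1.0, 100))
```

The benchmark grows with the number of trials, so a Sharpe ratio found after many attempts has to clear a higher bar.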

Other Efficiency Metrics

Annualized Sharpe Ratio
Information Ratio
Probabilistic Sharpe Ratio (PSR)
Deflated Sharpe Ratio (DSR)

Classification Scores

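Metrics for evaluating the performance of machine learning algorithms in trading strategies are built from the counts of true positives ( $TP$ ), true negatives ( $TN$ ), false positives ( $FP$ ), and false negatives ( $FN$ ). They include: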

Accuracy:
$\text{Accuracy} = \frac{TP+TN}{TP+TN+FP+FN}$
Precision:
$\text{Precision} = \frac{TP}{TP+FP}$
Recall:
$\text{Recall} = \frac{TP}{TP+FN}$
F1 Score:
$F1 = 2\frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$
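The four scores can be computed directly from the confusion-matrix counts, as in the sketch below. The function name and the example counts are hypothetical; undefined ratios (zero denominators) are returned as NaN by choice.

```python
def classification_scores(tp, tn, fp, fn):
    """Return accuracy, precision, recall, and F1 from confusion-matrix counts."""
    nan = float("nan")
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else nan
    precision = tp / (tp + fp) if (tp + fp) else nan
    recall = tp / (tp + fn) if (tp + fn) else nan
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = nan  # also reached when precision or recall is NaN
    return accuracy, precision, recall, f1


# Hypothetical counts from a bet/no-bet classifier.
accuracy, precision, recall, f1 = classification_scores(tp=40, tn=30, fp=10, fn=20)
print(accuracy, precision, recall, f1)
```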

These metrics help you gauge how accurately your machine learning model is performing in real trading scenarios.

References

López de Prado, M. (2018). Advances in financial machine learning. John Wiley & Sons.
López de Prado, M. (2020). Machine learning for asset managers. Cambridge University Press.