Financial Data Weighting
 Authors
 Tails Azimuth
Table of Contents
 The Challenge of Non-IID Data in Finance
 When and Why Does the IID Assumption Fail?
 Defining Concurrency in Financial Labels
 Measuring Label Uniqueness
 Overlapping Outcomes Problem in Bootstrapping
 Solving with Sequential Bootstrapping
 Index Matrix Calculation
 Calculating Average Uniqueness
 Sequential Bootstrap Sampling
 Monte Carlo Verification of Method Effectiveness
 Generating Random Timestamps
 Monte Carlo Simulation for Sequential Bootstraps
 Multiple Iterations
 Results and Figures
 Weighting Returns in Machine Learning Models
 Calculating Sample Weight with Return Attribution
 Time-Decay of Sample Weights
 Cases to Consider for Time-Decay
 References
The Challenge of Non-IID Data in Finance
You might have noticed that many financial models rely on the assumption that data points are independent and identically distributed (IID). However, this is often not the case in real-world financial applications. This blog post will show you how to leverage sample weights to address these challenges.
When and Why Does the IID Assumption Fail?
Financial labels, such as returns, are often based on overlapping time intervals. This overlapping nature makes these labels non-IID. While some machine learning applications can manage without the IID assumption, most financial models struggle without it. Let's explore some techniques to mitigate this issue.
Defining Concurrency in Financial Labels
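We say that two labels, $y_i$ and $y_j$, are concurrent at time $t$ if both depend on at least one common return, $r_{t-1,t}$. To quantify this, we use an indicator function, $\mathbb{I}_{t, i}$, where $[t_{i,0}, t_{i,1}]$ is the interval spanned by label $i$, defined as:
$$\mathbb{I}_{t, i} = \begin{cases} 1 & \text{if } [t_{i,0}, t_{i,1}] \text{ overlaps } [t-1, t], \\ 0 & \text{otherwise.} \end{cases}$$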
The number of labels that are concurrent at time $t$ is represented by $c_t = \sum_{i=1}^{I} \mathbb{I}_{t, i}$.
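
The count $c_t$ can be sketched in a few lines of NumPy. This is a minimal illustration and not RiskLabAI's implementation: the function name `count_concurrent`, the integer bar indices, and the convention that a label covers both its first and last bar are all assumptions made for the example.

```python
import numpy as np

def count_concurrent(starts, ends, n_bars):
    """Return c_t, the number of labels active at each bar t = 0..n_bars-1.

    Assumption: label i spans bars starts[i]..ends[i], both inclusive.
    """
    c = np.zeros(n_bars, dtype=int)
    for t0, t1 in zip(starts, ends):
        c[t0:t1 + 1] += 1  # every bar covered by the label gains one count
    return c

# Three hypothetical labels spanning bars 0-2, 2-3 and 4-5.
starts = [0, 2, 4]
ends = [2, 3, 5]
concurrency = count_concurrent(starts, ends, n_bars=6)
print(concurrency)  # [1 1 2 1 1 1]
```

Only bar 2 is shared by two labels, so it is the only bar with a count above one.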
Measuring Label Uniqueness
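Next, we introduce a function to measure label uniqueness at a given time $t$. This function, denoted as $u_{t, i}$, is defined as:
$$u_{t, i} = \frac{\mathbb{I}_{t, i}}{c_t}$$
The average uniqueness of a label $i$ is the mean of $u_{t, i}$ over the part of the $T$ time steps in which the label is active:
$$\bar{u}_i = \frac{\sum_{t=1}^{T} u_{t, i}}{\sum_{t=1}^{T} \mathbb{I}_{t, i}}$$
RiskLabAI's Python and Julia libraries both provide functions for estimating label uniqueness.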
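
As a worked example under assumed toy data (three hypothetical labels A, B and C spanning bars 0 to 2, 2 to 3 and 4 to 5), take uniqueness at a bar to be the reciprocal of the number of labels active there, and average uniqueness to be its mean over the bars a label covers. Only bar 2 is shared, so the concurrency counts are $(1, 1, 2, 1, 1, 1)$ and:

$$\bar{u}_A = \frac{1 + 1 + \tfrac{1}{2}}{3} = \frac{5}{6}, \qquad \bar{u}_B = \frac{\tfrac{1}{2} + 1}{2} = \frac{3}{4}, \qquad \bar{u}_C = \frac{1 + 1}{2} = 1$$

Label C overlaps with nothing and is fully unique, while B, which shares half of its lifespan with A, is the least unique.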
Overlapping Outcomes Problem in Bootstrapping
When bootstrapping, we draw $I$ items with replacement from a set of $I$ items, so some items are selected more than once. The probability of never selecting a particular item is $(1 - 1/I)^I$, which converges to $e^{-1}$ as $I$ grows. As a result, only about $1 - e^{-1} \approx 2/3$ of the observations in a bootstrap sample are unique. When the labels also overlap in time, the drawn observations become even more redundant, which makes standard bootstrapping inefficient.
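
The two-thirds figure is easy to check numerically. The sketch below compares the closed-form expectation with one simulated bootstrap draw; the function name `expected_unique_fraction`, the sample size and the seed are arbitrary choices for illustration.

```python
import numpy as np

def expected_unique_fraction(n_items):
    """Expected share of distinct items when drawing n_items with replacement."""
    return 1.0 - (1.0 - 1.0 / n_items) ** n_items

rng = np.random.default_rng(0)
n_items = 1_000
draws = rng.integers(0, n_items, size=n_items)  # one standard bootstrap sample
simulated = np.unique(draws).size / n_items     # share of distinct items drawn

print(expected_unique_fraction(n_items))  # close to 1 - exp(-1), about 0.632
print(simulated)
```

Both numbers should land near 0.63, which is the redundancy that exists even before any overlap between labels is taken into account.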
Solving with Sequential Bootstrapping
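Sequential bootstrapping draws observations with probabilities that are updated after every draw, so that observations overlapping with those already selected become less likely to be picked. Let $\varphi^{(m-1)}$ denote the sequence of observations drawn before step $m$. The probability of selecting observation $i$ at step $m$ is calculated using:
$$\delta_i^{(m)} = \frac{\bar{u}_i^{(m)}}{\sum_{k=1}^{I} \bar{u}_k^{(m)}}$$
where $u_{t, i}^{(m)} = \mathbb{I}_{t, i} \left(1 + \sum_{k \in \varphi^{(m-1)}} \mathbb{I}_{t, k}\right)^{-1}$ is the uniqueness of observation $i$ at time $t$ given the previous draws, and $\bar{u}_{i}^{(m)} = \left(\sum_{t=1}^{T} u_{t, i}^{(m)}\right) \left(\sum_{t=1}^{T} \mathbb{I}_{t, i}\right)^{-1}$ is its average over the lifespan of $i$. The first draw is uniform, $\delta_i^{(1)} = 1/I$. This approach reduces the chance of selecting overlapping outcomes.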
Index Matrix Calculation
Both the Python and Julia libraries in RiskLabAI offer functions to calculate the index matrix. In Python, it's `index_matrix`, and in Julia, it's `indexMatrix`.
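
To make the idea concrete, here is a minimal sketch of such a matrix, assuming it holds one row per bar and one column per label, with a 1 wherever the label is active. The helper `build_indicator_matrix` and its arguments are hypothetical and do not reproduce the signature of RiskLabAI's `index_matrix`.

```python
import numpy as np

def build_indicator_matrix(starts, ends, n_bars):
    """Binary matrix of shape (n_bars, n_labels).

    Entry (t, i) is 1 when label i is active at bar t.
    Assumption: label i spans bars starts[i]..ends[i], both inclusive.
    """
    starts = np.asarray(starts)
    ends = np.asarray(ends)
    bars = np.arange(n_bars)[:, None]  # column vector, broadcast against labels
    return ((bars >= starts) & (bars <= ends)).astype(int)

# Three hypothetical labels spanning bars 0-2, 2-3 and 4-5.
ind_matrix = build_indicator_matrix([0, 2, 4], [2, 3, 5], n_bars=6)
print(ind_matrix)
```

Summing the matrix along each row gives the concurrency count per bar, and summing along each column gives the lifespan of each label.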
Calculating Average Uniqueness
Both libraries also offer functions to calculate the average uniqueness of the samples. In Python, it's `averageUniqueness`, and in Julia, it's also `averageUniqueness`.
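
A possible way to compute average uniqueness from a bars-by-labels indicator matrix is sketched below. The function `average_uniqueness` here is an illustrative stand-in, not the library's `averageUniqueness`, and it assumes every label is active on at least one bar.

```python
import numpy as np

def average_uniqueness(ind_matrix):
    """Average uniqueness of each column (label) of a binary indicator matrix."""
    ind_matrix = np.asarray(ind_matrix, dtype=float)
    concurrency = ind_matrix.sum(axis=1, keepdims=True)  # labels active per bar
    # Uniqueness per bar and label; bars with no active label stay at zero.
    uniqueness = np.divide(ind_matrix, concurrency,
                           out=np.zeros_like(ind_matrix),
                           where=concurrency > 0)
    lifespan = ind_matrix.sum(axis=0)  # bars covered by each label
    return uniqueness.sum(axis=0) / lifespan

# Hypothetical labels spanning bars 0-2, 2-3 and 4-5 (rows are bars).
ind_matrix = np.array([[1, 0, 0],
                       [1, 0, 0],
                       [1, 1, 0],
                       [0, 1, 0],
                       [0, 0, 1],
                       [0, 0, 1]])
avg_u = average_uniqueness(ind_matrix)
print(avg_u)
```

For this toy matrix the values are 5/6, 3/4 and 1: the third label shares no bar with the others, so it is fully unique.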
Sequential Bootstrap Sampling
Finally, for sequential bootstrap sampling, the Python function is `SequentialBootstrap`, and in Julia, it's `sequentialBootstrap`.
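
The sampling loop itself can be sketched as follows. This is a simplified illustration of the technique rather than RiskLabAI's code: all function names are hypothetical, the indicator matrix is assumed to have one row per bar and one column per label, and no attempt is made at efficiency.

```python
import numpy as np

def _average_uniqueness(ind_matrix):
    """Average uniqueness of each column of a binary indicator matrix."""
    concurrency = ind_matrix.sum(axis=1, keepdims=True)
    uniqueness = np.divide(ind_matrix, concurrency,
                           out=np.zeros_like(ind_matrix),
                           where=concurrency > 0)
    return uniqueness.sum(axis=0) / ind_matrix.sum(axis=0)

def sequential_bootstrap_probabilities(ind_matrix, drawn):
    """Draw probabilities for the next pick, given the labels already drawn."""
    n_labels = ind_matrix.shape[1]
    avg_u = np.empty(n_labels)
    for j in range(n_labels):
        # Uniqueness of candidate j if it were appended to the current sample.
        candidate = ind_matrix[:, drawn + [j]]
        avg_u[j] = _average_uniqueness(candidate)[-1]
    return avg_u / avg_u.sum()

def sequential_bootstrap(ind_matrix, sample_length=None, rng=None):
    """Return a list of label indices drawn by sequential bootstrap."""
    rng = np.random.default_rng() if rng is None else rng
    n_labels = ind_matrix.shape[1]
    sample_length = n_labels if sample_length is None else sample_length
    drawn = []
    while len(drawn) < sample_length:
        probs = sequential_bootstrap_probabilities(ind_matrix, drawn)
        drawn.append(int(rng.choice(n_labels, p=probs)))
    return drawn

# Hypothetical labels spanning bars 0-2, 2-3 and 4-5 (rows are bars).
ind_matrix = np.array([[1, 0, 0], [1, 0, 0], [1, 1, 0],
                       [0, 1, 0], [0, 0, 1], [0, 0, 1]], dtype=float)
probs_after_label_1 = sequential_bootstrap_probabilities(ind_matrix, [1])
sample = sequential_bootstrap(ind_matrix, rng=np.random.default_rng(0))
print(probs_after_label_1)
print(sample)
```

If the second label (index 1) is drawn first, the next-draw probabilities become 5/14, 3/14 and 6/14: the label just drawn is the least likely to be picked again, and the label that overlaps with nothing is the most likely.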
These functionalities are available in both Python and Julia in the RiskLabAI library. For more details, you can visit the GitHub repositories for each language.
Monte Carlo Verification of Method Effectiveness
Our goal is to assess the performance of different bootstrapping techniques. We focus on comparing the Sequential Bootstrap method with the Standard Bootstrap. We accomplish this through Monte Carlo experiments that utilize random timestamps and various other parameters.
Generating Random Timestamps
We generate random timestamps for each observation within the given parameters. The function `randomTimestamp` does this job in both the Python and Julia libraries of RiskLabAI.
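
One plausible way to generate such timestamps is to give each observation a random start bar and a random holding period. The sketch below does exactly that; the function name `random_label_spans` and the parameters `n_obs`, `n_bars` and `max_holding` are assumptions for illustration and are not taken from `randomTimestamp`.

```python
import numpy as np

def random_label_spans(n_obs, n_bars, max_holding, rng=None):
    """Random (start, end) bar indices for n_obs labels, sorted by start.

    Assumptions: starts are uniform on 0..n_bars-1 and each label lasts
    between 1 and max_holding bars beyond its start.
    """
    rng = np.random.default_rng() if rng is None else rng
    starts = rng.integers(0, n_bars, size=n_obs)
    holding = rng.integers(1, max_holding + 1, size=n_obs)
    ends = starts + holding
    order = np.argsort(starts, kind="stable")  # chronological order
    return starts[order], ends[order]

starts, ends = random_label_spans(n_obs=10, n_bars=100, max_holding=5,
                                  rng=np.random.default_rng(0))
print(list(zip(starts.tolist(), ends.tolist())))
```

Because an end can fall up to `max_holding` bars after the last start, any indicator matrix built from these spans needs `ends.max() + 1` rows.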
Monte Carlo Simulation for Sequential Bootstraps
We run Monte Carlo simulations to compare the Sequential Bootstrap with the Standard Bootstrap using the `monteCarloSimulationforSequentionalBootstraps` function.
Multiple Iterations
For a more robust assessment, we run the Monte Carlo simulation in multiple iterations for both Sequential and Standard Bootstraps using `SimulateSequentionalVsStandardBootstrap`.
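
The whole experiment can be sketched end to end in a self-contained script. Everything here is an assumption made for illustration (function names, parameter values, the seed and the number of iterations); it follows the procedure described above rather than the RiskLabAI functions themselves.

```python
import numpy as np

def indicator_matrix(starts, ends):
    """Bars-by-labels binary matrix; label i spans starts[i]..ends[i]."""
    bars = np.arange(ends.max() + 1)[:, None]
    return ((bars >= starts) & (bars <= ends)).astype(float)

def average_uniqueness(ind_matrix):
    """Average uniqueness of each column of the indicator matrix."""
    concurrency = ind_matrix.sum(axis=1, keepdims=True)
    uniqueness = np.divide(ind_matrix, concurrency,
                           out=np.zeros_like(ind_matrix),
                           where=concurrency > 0)
    return uniqueness.sum(axis=0) / ind_matrix.sum(axis=0)

def sequential_draw(ind_matrix, rng):
    """Draw as many labels as there are columns, by sequential bootstrap."""
    n_labels = ind_matrix.shape[1]
    drawn = []
    for _ in range(n_labels):
        avg_u = np.array([average_uniqueness(ind_matrix[:, drawn + [j]])[-1]
                          for j in range(n_labels)])
        drawn.append(int(rng.choice(n_labels, p=avg_u / avg_u.sum())))
    return drawn

def one_trial(n_obs, n_bars, max_holding, rng):
    """Mean average uniqueness of a standard and a sequential sample."""
    starts = rng.integers(0, n_bars, size=n_obs)
    ends = starts + rng.integers(1, max_holding + 1, size=n_obs)
    ind_matrix = indicator_matrix(starts, ends)
    standard = rng.integers(0, n_obs, size=n_obs)  # plain bootstrap indices
    sequential = sequential_draw(ind_matrix, rng)
    return (average_uniqueness(ind_matrix[:, standard]).mean(),
            average_uniqueness(ind_matrix[:, sequential]).mean())

rng = np.random.default_rng(42)
results = np.array([one_trial(10, 100, 5, rng) for _ in range(200)])
print("standard  :", results[:, 0].mean())
print("sequential:", results[:, 1].mean())
```

Each row of `results` holds one iteration, so a histogram of either column shows the distribution of average uniqueness for that method. Under these settings one would expect the sequential column to sit noticeably higher than the standard one.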
Results and Figures
The Monte Carlo tests reveal differences between Standard and Sequential Bootstraps.
We also examine the histogram of the average uniqueness for both bootstrapping techniques. This gives us insights into how unique each sample is, allowing for better analysis.
Weighting Returns in Machine Learning Models
In machine learning for finance, it's critical to weigh data properly. Returns with high absolute values should have more weight than those with low absolute returns. The uniqueness of an observation also plays a role in determining its weight.
Calculating Sample Weight with Return Attribution
In RiskLabAI, we offer functions to handle this weight assignment. The Julia function `sampleWeight` and the Python function `mpSampleWeightAbsoluteReturn` both serve this purpose.
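
A common way to combine both ideas, following the return-attribution approach in De Prado (2018), is to split each bar's return equally among the labels active at that bar, sum the attributed returns over each label's lifespan, take the absolute value, and rescale so the weights sum to the number of labels. The sketch below assumes integer bar indices with inclusive end bars; the function name is hypothetical and is not the signature of either library function.

```python
import numpy as np

def return_attribution_weights(returns, starts, ends):
    """Sample weights from absolute attributed returns, scaled to sum to n_labels.

    Assumption: label i spans bars starts[i]..ends[i], both inclusive,
    and returns[t] is the return realised over bar t.
    """
    returns = np.asarray(returns, dtype=float)
    concurrency = np.zeros(returns.size)
    for t0, t1 in zip(starts, ends):
        concurrency[t0:t1 + 1] += 1
    # Each bar's return is shared equally among the labels active at that bar.
    raw = np.array([abs((returns[t0:t1 + 1] / concurrency[t0:t1 + 1]).sum())
                    for t0, t1 in zip(starts, ends)])
    return raw * len(raw) / raw.sum()

# Toy returns for six bars and three hypothetical labels.
returns = np.array([0.01, -0.02, 0.03, 0.01, -0.01, 0.02])
weights = return_attribution_weights(returns, [0, 2, 4], [2, 3, 5])
print(weights)
```

For this toy input the weights come out at roughly 0.375, 1.875 and 0.75. The first label covers the most bars, yet its returns largely cancel and it shares bar 2 with the second label, so it ends up with the smallest weight.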
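Time-Decay of Sample Weights
Over time, older market data becomes less relevant, so a time-decay factor is applied to the sample weights. The decay is controlled by a user-specified parameter $c \in (-1, 1]$. The decay factor is a piecewise-linear function of cumulative uniqueness, $x \in \left[0, \sum_{i=1}^{I} \bar{u}_i\right]$, and follows the formula:
$$d[x] = \max\{0, a + bx\}$$
where $a$ and $b$ are calculated based on boundary conditions and $c$. In particular, the newest observation always receives a factor of $d\left[\sum_{i=1}^{I} \bar{u}_i\right] = 1$.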
Again, RiskLabAI has built-in functions for this. The Julia function `TimeDecay` and its Python equivalent handle weight adjustments based on time.
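Cases to Consider for Time-Decay
 $c=1$ implies no decay.
 $0 < c < 1$ implies linear decay, with all observations still receiving a strictly positive weight.
 $c=0$ implies weights that decay linearly to zero as observations age.
 $-1 < c < 0$ implies that the oldest fraction $|c|$ of the observations, measured in cumulative uniqueness, gets zero weight, with linear decay across the rest.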
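
The four cases can be reproduced with a small sketch of a piecewise-linear decay applied to cumulative average uniqueness, in the spirit of De Prado (2018). The function `time_decay_factors` is a hypothetical illustration, not the library's `TimeDecay`; it assumes observations are ordered from oldest to newest and that $c$ lies in $(-1, 1]$.

```python
import numpy as np

def time_decay_factors(avg_uniqueness, c=1.0):
    """Linear decay factors; the newest observation always gets a factor of 1.

    avg_uniqueness: average uniqueness per observation, oldest first.
    c: decay parameter, assumed to lie in (-1, 1].
    """
    x = np.cumsum(np.asarray(avg_uniqueness, dtype=float))
    total = x[-1]
    if c >= 0:
        slope = (1.0 - c) / total          # factor would equal c at x = 0
    else:
        slope = 1.0 / ((c + 1.0) * total)  # factor hits 0 at x = -c * total
    intercept = 1.0 - slope * total        # factor equals 1 at x = total
    return np.maximum(0.0, intercept + slope * x)

uniq = np.ones(5)  # five equally unique hypothetical observations
no_decay = time_decay_factors(uniq, c=1.0)
mild = time_decay_factors(uniq, c=0.5)
to_zero = time_decay_factors(uniq, c=0.0)
cutoff = time_decay_factors(uniq, c=-0.5)
for name, d in [("c=1", no_decay), ("c=0.5", mild),
                ("c=0", to_zero), ("c=-0.5", cutoff)]:
    print(name, d)
```

With five equally unique observations, $c=0.5$ yields factors rising from about 0.6 to 1, $c=0$ yields factors from about 0.2 to 1, and $c=-0.5$ zeroes out the two oldest observations. The final sample weight would be the product of these factors and the return-attribution weight.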
References
 De Prado, M. L. (2018). Advances in Financial Machine Learning. John Wiley & Sons.
 De Prado, M. L. (2020). Machine Learning for Asset Managers. Cambridge University Press.