Microstructural Features

This chapter explores how to derive predictive features from market microstructure data (like FIX messages). This data reveals how participants trade, which often exposes informational asymmetries that ML algorithms can exploit. The literature is divided into three generations of models.


First Generation: Price Sequences

These models use only price information to infer market properties like liquidity.

  • The Tick Rule: A simple algorithm to infer the aggressor side (buy or sell) of a trade. A buy is 1, a sell is -1.
    $$b_{t}= \begin{cases}1 & \text{if } \Delta p_{t}>0 \\ -1 & \text{if } \Delta p_{t}<0 \\ b_{t-1} & \text{if } \Delta p_{t}=0\end{cases}$$
  • The Roll Model: Estimates half the effective bid-ask spread ($c$) from the serial covariance of price changes. This is useful for illiquid assets without a visible order book.
    $$\sigma\left(\Delta p_{t}, \Delta p_{t-1}\right)=-c^{2} \quad \implies \quad c=\sqrt{\max \left\{0,-\sigma\left(\Delta p_{t}, \Delta p_{t-1}\right)\right\}}$$
  • Corwin and Schultz Estimator: Estimates the spread ($S_t$) using only daily (or bar) high and low prices, based on the principle that H/L ratios capture both volatility and the spread. The volatility component grows with the sampling horizon while the spread component does not, so comparing one-bar and two-bar ranges separates the two.
    • The spread is estimated as:
      $$S_{t}=\frac{2\left(e^{\alpha_{t}}-1\right)}{1+e^{\alpha_{t}}}$$
    • Where $\alpha_t$ is derived from $\beta_t$ (the sum of squared log H/L ratios over two consecutive bars, averaged over a rolling window) and $\gamma_t$ (the squared log ratio of the two-bar high to the two-bar low).
      $$\alpha_{t}=\frac{\sqrt{2 \beta_{t}}-\sqrt{\beta_{t}}}{3-2 \sqrt{2}}-\sqrt{\frac{\gamma_{t}}{3-2 \sqrt{2}}}$$
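The tick rule and the Roll estimate translate almost directly into code. The following is a minimal sketch, assuming trade prices arrive as a plain sequence; the function names `tick_rule` and `roll_spread` and the `initial_side` parameter are illustrative choices, not part of any library discussed here.

```python
import numpy as np


def tick_rule(prices, initial_side=1):
    """Aggressor side per trade: +1 on an uptick, -1 on a downtick,
    and the previous side when the price is unchanged."""
    prices = np.asarray(prices, dtype=float)
    sides = np.empty(len(prices), dtype=int)
    last = initial_side  # assumption: side used before any price change is observed
    for i in range(len(prices)):
        if i > 0:
            change = prices[i] - prices[i - 1]
            if change > 0:
                last = 1
            elif change < 0:
                last = -1
        sides[i] = last
    return sides


def roll_spread(prices):
    """Roll estimate c = sqrt(max(0, -cov(dp_t, dp_{t-1})))."""
    dp = np.diff(np.asarray(prices, dtype=float))
    # Sample covariance between each price change and the one before it.
    cov = np.cov(dp[1:], dp[:-1])[0, 1]
    return float(np.sqrt(max(0.0, -cov)))
```

A price path that bounces between two levels produces a negative serial covariance and therefore a positive estimate, while a steadily trending path produces zero, which is why the `max` guard appears in the formula.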

Implementation in RiskLabAI

In our RiskLabAI library, we provide a direct implementation of this estimator in the features.microstructural_features.corwin_schultz module. The code is modular, with each function corresponding to a component of the formula.

The core logic is broken into:

  • beta_estimates: Calculates $\beta_t$, the two-day sum of squared log H/L ratios, averaged over a window_span.
  • gamma_estimates: Calculates $\gamma_t$, the squared log-ratio of the two-day high and low.
  • alpha_estimates: Uses $\beta_t$ and $\gamma_t$ to solve for $\alpha_t$.

The main function, corwin_schultz_estimator, orchestrates these steps to return the final spread $S_t$.
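To make the decomposition concrete, here is a minimal sketch of the same three-step structure written against pandas. It is not the RiskLabAI source: the `*_sketch` function names are hypothetical, and flooring negative $\alpha_t$ values at zero is an assumption made here so that the spread stays non-negative.

```python
import numpy as np
import pandas as pd


def beta_sketch(high: pd.Series, low: pd.Series, window_span: int) -> pd.Series:
    # Squared log H/L per bar, summed over two consecutive bars,
    # then averaged over a rolling window.
    log_hl_sq = np.log(high / low) ** 2
    return log_hl_sq.rolling(window=2).sum().rolling(window=window_span).mean()


def gamma_sketch(high: pd.Series, low: pd.Series) -> pd.Series:
    # Squared log ratio of the two-bar high to the two-bar low.
    high_2 = high.rolling(window=2).max()
    low_2 = low.rolling(window=2).min()
    return np.log(high_2 / low_2) ** 2


def alpha_sketch(beta: pd.Series, gamma: pd.Series) -> pd.Series:
    d = 3 - 2 * np.sqrt(2)
    alpha = (np.sqrt(2 * beta) - np.sqrt(beta)) / d - np.sqrt(gamma / d)
    return alpha.clip(lower=0)  # assumption: negative alphas are floored at zero


def spread_sketch(high: pd.Series, low: pd.Series, window_span: int = 20) -> pd.Series:
    alpha = alpha_sketch(beta_sketch(high, low, window_span), gamma_sketch(high, low))
    return 2 * (np.exp(alpha) - 1) / (1 + np.exp(alpha))
```

A useful sanity check is the zero-volatility case: if every bar has the same high $H$ and low $L$, then $\alpha_t = \log(H/L)$ and the formula reduces to $S_t = 2(H-L)/(H+L)$, the quoted spread relative to the midpoint.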

Bekker-Parkinson Volatility Estimator

As an extension, we also implement the Bekker-Parkinson volatility estimator, which adjusts the standard Parkinson volatility by incorporating the Corwin-Schultz spread components.

Methodology

The estimator is calculated as:

$$k_2 = \sqrt{8 / \pi}$$
$$d = 3 - 2\sqrt{2}$$
$$\sigma = \frac{\left(2^{-1/2} - 1\right) \sqrt{\beta}}{k_2 d} + \sqrt{\frac{\gamma}{k_2^2 d}}$$

Where $\beta$ and $\gamma$ are the same components calculated for the Corwin-Schultz estimator.

Implementation

In our RiskLabAI library, we implement this in the features.microstructural_features.bekker_parkinson_volatility_estimator module.

The main function bekker_parkinson_volatility_estimates is a convenience wrapper that first computes $\beta$ and $\gamma$ (by calling beta_estimates and gamma_estimates from the corwin_schultz module) and then passes them to the sigma_estimates function, which contains the core logic for the $\sigma$ formula.


Second Generation: Strategic Trade Models

These models incorporate volume to measure illiquidity and price impact, modeling trade as a strategic interaction.

  • Kyle's Lambda ($\lambda$): Models price impact as the result of a game between an informed trader and a market maker. $\lambda$ is an inverse measure of liquidity. It can be estimated via a simple regression:

    $$\Delta p_{t}=\lambda\left(b_{t} V_{t}\right)+\varepsilon_{t}$$

    where $b_t V_t$ is the signed volume.

  • Amihud's Lambda ($\lambda$): Measures illiquidity as the absolute price response per dollar of trading volume.

    $$\left|\Delta \log \left(\tilde{p}_{\tau}\right)\right| = \lambda \sum_{t \in B_{\tau}}\left(p_{t} V_{t}\right)+\varepsilon_{\tau}$$
  • Hasbrouck's Lambda ($\lambda$): A Bayesian model that estimates price impact using the square-root of dollar volume, which is often found to be a more accurate specification.

    $$\log \left(\tilde{p}_{i, \tau}\right)-\log \left(\tilde{p}_{i, \tau-1}\right)=\lambda_{i} \sum_{t \in B_{i, \tau}}\left(b_{i, t} \sqrt{p_{i, t} V_{i, t}}\right)+\varepsilon_{i, \tau}$$
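All three lambdas are slopes of a regression without an intercept, so they can be sketched with one helper. The code below is an illustration under stated assumptions: the function names and the `price`, `volume`, `side`, `bar` column layout are hypothetical, and Hasbrouck's lambda is fitted here by ordinary least squares as a simplification rather than with the Bayesian estimation mentioned above.

```python
import numpy as np
import pandas as pd


def slope_through_origin(x, y) -> float:
    """Least-squares slope of y on x with no intercept, skipping NaN pairs."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    mask = np.isfinite(x) & np.isfinite(y)
    return float(np.dot(x[mask], y[mask]) / np.dot(x[mask], x[mask]))


def kyle_lambda(prices, volumes, sides) -> float:
    """Regress tick-level price changes on signed volume b_t * V_t."""
    prices = np.asarray(prices, dtype=float)
    signed_volume = np.asarray(sides, dtype=float) * np.asarray(volumes, dtype=float)
    return slope_through_origin(signed_volume[1:], np.diff(prices))


def bar_lambdas(trades: pd.DataFrame):
    """Amihud and (OLS-simplified) Hasbrouck lambdas from trades grouped into bars.

    Assumed columns: price, volume, side (+1/-1), bar (bar identifier)."""
    dollar_volume = trades["price"] * trades["volume"]
    grouped = trades.assign(
        dollar_volume=dollar_volume,
        signed_root=trades["side"] * np.sqrt(dollar_volume),
    ).groupby("bar")
    # Change in the log of each bar's last traded price; NaN for the first bar.
    d_log_close = np.log(grouped["price"].last()).diff()
    amihud = slope_through_origin(grouped["dollar_volume"].sum(), d_log_close.abs())
    hasbrouck = slope_through_origin(grouped["signed_root"].sum(), d_log_close)
    return amihud, hasbrouck
```

In all three cases a larger slope means that less volume is needed to move the price, which is why each lambda serves as an inverse measure of liquidity.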

Third Generation: Sequential Trade Models

These models focus on asymmetric information and the strategic, sequential nature of trading.

  • PIN (Probability of Informed Trading): A foundational model where the bid-ask spread is the premium a market maker charges for the risk of being adversely selected. The probability of informed trading is a function of:
    • $\alpha$: Probability of a new information event.
    • $\mu$: Arrival rate of informed traders.
    • $\varepsilon$: Arrival rate of uninformed (noise) traders.
    $$PIN_{t}=\frac{\alpha_{t} \mu}{\alpha_{t} \mu+2 \varepsilon}$$
  • VPIN (Volume-Synchronized PIN): A high-frequency, practical estimate of PIN that uses volume bars.
    • Key Insight: In a volume bar of size $V$, the total volume is $V = \alpha\mu + 2\varepsilon$ and the expected imbalance is $\mathrm{E}(|V_{\tau}^{B}-V_{\tau}^{S}|) \approx \alpha \mu$.
    • VPIN Equation: The fraction of volume that is from informed traders is the ratio of imbalance to total volume.
      $$VPIN_{\tau}=\frac{\sum_{\tau=1}^{n}\left|V_{\tau}^{B}-V_{\tau}^{S}\right|}{n V}$$
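Both ratios are short enough to sketch directly. The example below assumes that each volume bar's buy and sell volumes have already been classified (for instance with the tick rule); the function names `pin` and `vpin` are illustrative, not a library API.

```python
import numpy as np
import pandas as pd


def pin(alpha: float, mu: float, epsilon: float) -> float:
    """PIN = alpha * mu / (alpha * mu + 2 * epsilon)."""
    return alpha * mu / (alpha * mu + 2 * epsilon)


def vpin(buy_volume, sell_volume, n: int) -> pd.Series:
    """Rolling VPIN over the last n volume bars.

    With equal-size volume bars, the rolling sum of total volume in the
    denominator equals n * V, matching the formula above."""
    buy = pd.Series(np.asarray(buy_volume, dtype=float))
    sell = pd.Series(np.asarray(sell_volume, dtype=float))
    imbalance = (buy - sell).abs()
    total = buy + sell
    return imbalance.rolling(window=n).sum() / total.rolling(window=n).sum()
```

For four bars of size 100 with buy volumes 60, 50, 90, 20 and `n=2`, the rolling imbalances are 20, 80 and 140 over a denominator of 200, so VPIN rises from 0.1 to 0.4 to 0.7 as the flow becomes more one-sided.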

Additional Microstructural Features

  • Order Size Distribution: Round numbers (e.g., 10, 50, 100 contracts) indicate human "GUI traders," while randomized sizes indicate "silicon traders" (algorithms). A shift in this distribution can be predictive.
  • Cancellations & Order Types: High cancellation rates or specific patterns can reveal predatory algorithms (quote stuffers, squeezers, pack hunters).
  • TWAP Algorithm Footprints: Institutional (TWAP) algorithms often execute at regular time intervals, creating detectable volume spikes (e.g., at the beginning of every minute).
  • Serial Correlation of Signed Order Flow: High persistence in order flow (e.g., many buy orders in a row) is attributed to informed traders splitting large orders.
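Two of these features are simple enough to compute in a few lines. The sketch below is only an illustration: the function names are hypothetical, and treating any multiple of 10 as a "round" order size is an assumed threshold that would need tuning per instrument.

```python
import numpy as np


def signed_flow_autocorr(sides, lag: int = 1) -> float:
    """Correlation between the trade-sign series (+1/-1) and its lagged copy."""
    s = np.asarray(sides, dtype=float)
    return float(np.corrcoef(s[lag:], s[:-lag])[0, 1])


def round_size_share(sizes, round_multiple: int = 10) -> float:
    """Fraction of orders whose size is a multiple of round_multiple."""
    sizes = np.asarray(sizes)
    return float(np.mean(sizes % round_multiple == 0))
```

Tracked over rolling windows, a rising autocorrelation points to more persistent one-sided flow, and a falling share of round sizes points to a larger algorithmic presence.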

Author's Definition: What is Microstructural Information?

The chapter proposes a novel definition of information, not as "news," but as a measure of market maker predictability.

  • Concept: "Information" is high when market makers' models are failing. This failure is measured by the cross-entropy loss ($L_{\tau}$) of the market maker's own classifier.
  • Feature: The "microstructural information" feature $\phi_{\tau}$ is the Cumulative Distribution Function (CDF) of this loss.
    $$\phi_{\tau}=F\left(-L_{\tau}\right)$$
  • Interpretation: When $\phi_{\tau}$ is high, the market maker's model is experiencing high losses, signaling the presence of informed traders and a high probability of adverse selection. This was evident during the 2010 Flash Crash.

API reference

RiskLabAI implements these in Python and Julia (signatures auto-generated from the package source):

Python:

def beta_estimates(
    high_prices: pd.Series, low_prices: pd.Series, window_span: int
) -> pd.Series:
def gamma_estimates(high_prices: pd.Series, low_prices: pd.Series) -> pd.Series:
def alpha_estimates(beta: pd.Series, gamma: pd.Series) -> pd.Series:
def corwin_schultz_estimator(
    high_prices: pd.Series, low_prices: pd.Series, window_span: int = 20
) -> pd.Series:
def sigma_estimates(beta: pd.Series, gamma: pd.Series) -> pd.Series:
def bekker_parkinson_volatility_estimates(
    high_prices: pd.Series, low_prices: pd.Series, window_span: int = 20
) -> pd.Series:

Julia:

function beta_estimates(
    high_prices::AbstractVector{<:Real},
    low_prices::AbstractVector{<:Real},
    window_span::Integer,
)
function gamma_estimates(
    high_prices::AbstractVector{<:Real},
    low_prices::AbstractVector{<:Real},
)
function alpha_estimates(beta::AbstractVector{<:Real}, gamma::AbstractVector{<:Real})
function corwin_schultz_estimator(
    high_prices::AbstractVector{<:Real},
    low_prices::AbstractVector{<:Real},
    window_span::Integer = 20,
)
function sigma_estimates(beta::AbstractVector{<:Real}, gamma::AbstractVector{<:Real})
function bekker_parkinson_volatility_estimates(
    high_prices::AbstractVector{<:Real},
    low_prices::AbstractVector{<:Real},
    window_span::Integer = 20,
)

Full source: Python · Julia