Microstructural Features
- Authors

- Name
- Tails Azimuth
Microstructural Features
This chapter explores how to derive predictive features from market microstructure data (like FIX messages). This data reveals how participants trade, which often exposes informational asymmetries that ML algorithms can exploit. The literature is divided into three generations of models.
First Generation: Price Sequences
These models use only price information to infer market properties like liquidity.
- The Tick Rule: A simple algorithm to infer the aggressor side (buy or sell) of a trade from the direction of the price change. A buy is 1, a sell is -1; a trade at an unchanged price inherits the previous label.
- The Roll Model: Estimates the effective bid-ask spread from the serial covariance of price changes. This is useful for illiquid assets without a visible order book.
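- Corwin and Schultz Estimator: Estimates the spread ($S$) using only daily (or bar) high and low prices, based on the principle that H/L ratios capture both volatility and the spread.
  - The spread is estimated as: $S_t = \frac{2\left(e^{\alpha_t} - 1\right)}{1 + e^{\alpha_t}}$
  - Where $\alpha_t = \frac{\sqrt{2\beta_t} - \sqrt{\beta_t}}{3 - 2\sqrt{2}} - \sqrt{\frac{\gamma_t}{3 - 2\sqrt{2}}}$ is derived from $\beta_t$ (the 2-bar sum of squared log H/L ratios, averaged over a rolling window) and $\gamma_t$ (the squared log H/L ratio over the 2-bar period).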
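As a rough illustration of the first two estimators in this list, the sketch below applies the tick rule and the Roll model to a plain price sequence. It assumes the standard Roll result that the effective spread equals $2\sqrt{\max(0, -\mathrm{cov}(\Delta p_t, \Delta p_{t-1}))}$. The function names `tick_rule` and `roll_spread`, and the `initial_side` parameter, are hypothetical and not taken from RiskLabAI.

```python
import numpy as np


def tick_rule(prices, initial_side=1):
    """Label each trade +1 (buy) or -1 (sell) from the sign of the price change."""
    prices = np.asarray(prices, dtype=float)
    sides = np.empty(len(prices), dtype=int)
    last = initial_side  # assumption: the first trade's side is arbitrary
    sides[0] = last
    for i in range(1, len(prices)):
        change = prices[i] - prices[i - 1]
        if change > 0:
            last = 1
        elif change < 0:
            last = -1
        # change == 0: carry the previous label forward
        sides[i] = last
    return sides


def roll_spread(prices):
    """Roll estimate of the effective spread from lag-1 covariance of price changes."""
    dp = np.diff(np.asarray(prices, dtype=float))
    cov = np.cov(dp[1:], dp[:-1])[0, 1]
    # A non-negative covariance has no spread interpretation, so floor at zero.
    return 2.0 * np.sqrt(max(0.0, -cov))


sides = tick_rule([10.0, 10.1, 10.1, 10.0, 10.0, 10.2])
```

The Roll estimate is only meaningful when the serial covariance is negative; flooring at zero is one common way to handle the other case.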
Implementation RiskLabAI
In our RiskLabAI library, we provide a direct implementation of this estimator in the features.microstructural_features.corwin_schultz module. The code is modular, with each function corresponding to a component of the formula.
The core logic is broken into:
- beta_estimates: Calculates $\beta$, the two-day sum of squared log H/L ratios, averaged over a window_span.
- gamma_estimates: Calculates $\gamma$, the squared log-ratio of the two-day high and low.
- alpha_estimates: Uses $\beta$ and $\gamma$ to solve for $\alpha$.
The main function, corwin_schultz_estimator, orchestrates these steps to return the final spread $S$.
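The decomposition into beta, gamma, and alpha steps can be sketched in plain NumPy as follows. This is not the RiskLabAI source: the names `cs_beta`, `cs_gamma`, `cs_alpha`, and `cs_spread` are hypothetical, the inputs are assumed to be equal-length sequences of bar highs and lows, and clipping negative $\alpha$ to zero is an assumed convention.

```python
import numpy as np


def cs_beta(high, low, window=1):
    """2-bar sum of squared log(H/L), averaged over `window` consecutive sums."""
    hl2 = np.log(np.asarray(high, float) / np.asarray(low, float)) ** 2
    two_bar = hl2[1:] + hl2[:-1]
    return np.convolve(two_bar, np.ones(window) / window, mode="valid")


def cs_gamma(high, low):
    """Squared log ratio of the 2-bar high to the 2-bar low."""
    high, low = np.asarray(high, float), np.asarray(low, float)
    high2 = np.maximum(high[1:], high[:-1])
    low2 = np.minimum(low[1:], low[:-1])
    return np.log(high2 / low2) ** 2


def cs_alpha(beta, gamma):
    den = 3.0 - 2.0 * np.sqrt(2.0)
    alpha = (np.sqrt(2.0 * beta) - np.sqrt(beta)) / den - np.sqrt(gamma / den)
    return np.maximum(alpha, 0.0)  # assumption: negative alphas are set to zero


def cs_spread(high, low, window=1):
    beta = cs_beta(high, low, window)
    gamma = cs_gamma(high, low)[window - 1:]  # align with the rolling mean
    alpha = cs_alpha(beta, gamma)
    return 2.0 * (np.exp(alpha) - 1.0) / (1.0 + np.exp(alpha))


# Bars with a constant 1% high-low range and no volatility.
spread = cs_spread([101.0] * 6, [100.0] * 6, window=2)
```

In this toy input the whole high-low range is attributed to the spread, so the estimate equals the range divided by the mid price.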
Bekker-Parkinson Volatility Estimator
As an extension, we also implement the Bekker-Parkinson volatility estimator, which adjusts the standard Parkinson volatility by incorporating the Corwin-Schultz spread components.
Methodology
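The estimator is calculated as:
$$\sigma_t = \frac{\left(2^{-1/2} - 1\right)\sqrt{\beta_t}}{\left(3 - 2\sqrt{2}\right)k_2} + \sqrt{\frac{\gamma_t}{k_2^2\left(3 - 2\sqrt{2}\right)}}, \qquad k_2 = \sqrt{\frac{8}{\pi}}$$
Where $\beta_t$ and $\gamma_t$ are the same components calculated for the Corwin-Schultz estimator.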
Implementation
In our RiskLabAI library, we implement this in the features.microstructural_features.bekker_parkinson_volatility_estimator module.
The main function bekker_parkinson_volatility_estimates is a convenience wrapper that first computes $\beta$ and $\gamma$ (by calling beta_estimates and gamma_estimates from the corwin_schultz module) and then passes them to the sigma_estimates function, which contains the core logic for the formula.
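A minimal sketch of the volatility step is shown below, assuming the commonly cited form of the estimator with $k_2 = \sqrt{8/\pi}$ and with negative values floored at zero. The function name `bekker_parkinson_sigma` is hypothetical and is not the library's sigma_estimates; it simply takes precomputed $\beta$ and $\gamma$ values.

```python
import numpy as np


def bekker_parkinson_sigma(beta, gamma):
    """Volatility from the Corwin-Schultz beta and gamma components."""
    k2 = np.sqrt(8.0 / np.pi)
    den = 3.0 - 2.0 * np.sqrt(2.0)
    sigma = (2.0 ** -0.5 - 1.0) * np.sqrt(beta) / (k2 * den)
    sigma = sigma + np.sqrt(gamma / (k2 ** 2 * den))
    return np.maximum(sigma, 0.0)  # assumption: negative estimates are floored at zero


hl2 = np.log(1.01) ** 2
# Two cases: the range is pure spread (gamma = hl2), or the 2-bar range widens (gamma = 4 * hl2).
sigmas = bekker_parkinson_sigma(np.array([2 * hl2, 2 * hl2]), np.array([hl2, 4 * hl2]))
```

When the two-bar range is no wider than the one-bar range, the estimator attributes nothing to volatility; a wider two-bar range produces a positive value.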
Second Generation: Strategic Trade Models
These models incorporate volume to measure illiquidity and price impact, modeling trade as a strategic interaction.
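Kyle's Lambda ($\lambda$): Models price impact as the result of a game between an informed trader and a market maker. $\lambda$ is an inverse measure of liquidity. It can be estimated via a simple regression:
$$\Delta p_t = \lambda \left(b_t V_t\right) + \varepsilon_t$$
where $b_t V_t$ is the signed volume ($b_t$ is the aggressor side and $V_t$ the traded volume).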
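The regression can be sketched as an ordinary least squares fit without an intercept, where the slope on signed volume is the estimate of $\lambda$. The function name `kyle_lambda` and the choice to omit the intercept are assumptions made for illustration; the aggressor sides are assumed to be already known (for example from the tick rule).

```python
import numpy as np


def kyle_lambda(prices, volumes, sides):
    """OLS slope (no intercept) of price changes on signed volume."""
    prices = np.asarray(prices, float)
    signed_volume = np.asarray(sides, float) * np.asarray(volumes, float)
    dp = np.diff(prices)
    x = signed_volume[1:]  # signed volume of the trade that produced each change
    return float(x @ dp / (x @ x))


# Synthetic data with a known impact coefficient of 0.002 per unit of volume.
rng = np.random.default_rng(0)
n = 5000
sides = rng.choice([-1.0, 1.0], size=n)
volumes = rng.integers(1, 100, size=n).astype(float)
changes = 0.002 * sides * volumes + rng.normal(0.0, 0.01, size=n)
prices = 100.0 + np.cumsum(changes)
estimate = kyle_lambda(prices, volumes, sides)
```

A larger estimate means each unit of signed volume moves the price more, which is the sense in which $\lambda$ is an inverse measure of liquidity.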
Amihud's Lambda ($\lambda$): Measures illiquidity as the absolute price response per dollar of trading volume.
Hasbrouck's Lambda ($\lambda$): A Bayesian model that estimates price impact using the square-root of dollar volume, which is often found to be a more accurate specification.
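Both measures can be illustrated with bar-level regressions. The sketch below assumes the usual bar formulations: the absolute log return regressed on the bar's dollar volume for Amihud, and the log return regressed on the bar's summed signed square-root dollar volume for Hasbrouck. It replaces the Bayesian estimation mentioned above with a plain least squares slope, which is a deliberate simplification, and all function names are hypothetical.

```python
import numpy as np


def _slope_through_origin(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(x @ y / (x @ x))


def amihud_lambda(bar_close, bar_dollar_volume):
    """Slope of |log return| on dollar volume, one observation per bar."""
    abs_ret = np.abs(np.diff(np.log(np.asarray(bar_close, float))))
    return _slope_through_origin(np.asarray(bar_dollar_volume, float)[1:], abs_ret)


def hasbrouck_lambda(bar_close, bar_signed_sqrt_dollar_volume):
    """Slope of log return on the bar's sum of b_t * sqrt(p_t * V_t)."""
    ret = np.diff(np.log(np.asarray(bar_close, float)))
    x = np.asarray(bar_signed_sqrt_dollar_volume, float)[1:]
    return _slope_through_origin(x, ret)


# Toy bars built so that the true coefficients are known.
dollar_volume = np.array([1e6, 2e6, 1e6, 3e6])
close_a = 50.0 * np.exp(np.cumsum([0.0, 2e-3, -1e-3, 3e-3]))  # |ret| = 1e-9 * dollar volume
signed_sqrt = np.array([0.0, 100.0, -200.0, 150.0])
close_h = 50.0 * np.exp(np.cumsum(1e-5 * signed_sqrt))  # ret = 1e-5 * signed sqrt volume
```

Calling `amihud_lambda(close_a, dollar_volume)` and `hasbrouck_lambda(close_h, signed_sqrt)` recovers the coefficients used to build the toy bars.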
Third Generation: Sequential Trade Models
These models focus on asymmetric information and the strategic, sequential nature of trading.
- PIN (Probability of Informed Trading): A foundational model where the bid-ask spread is the premium a market maker charges for the risk of being adversely selected. The probability of informed trading, $PIN = \frac{\alpha\mu}{\alpha\mu + 2\varepsilon}$, is a function of:
  - $\alpha$: Probability of a new information event.
  - $\mu$: Arrival rate of informed traders.
  - $\varepsilon$: Arrival rate of uninformed (noise) traders.
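- VPIN (Volume-Synchronized PIN): A high-frequency, practical estimate of PIN that uses volume bars.
  - Key Insight: In a volume bar of size $V$, the total volume is $V = \alpha\mu + 2\varepsilon$ and the expected imbalance is $\mathbb{E}\left[\left|V_\tau^B - V_\tau^S\right|\right] \approx \alpha\mu$.
  - VPIN Equation: The fraction of volume that is from informed traders is the ratio of imbalance to total volume, $VPIN = \frac{\alpha\mu}{\alpha\mu + 2\varepsilon} \approx \frac{\sum_{\tau=1}^{n}\left|V_\tau^B - V_\tau^S\right|}{nV}$.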
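The two ratios can be sketched in a few lines. The example assumes that buy and sell volume per volume bar have already been classified (for example with the tick rule) and computes VPIN over a rolling window of bars. The function names `pin` and `vpin` and the `window` parameter are hypothetical.

```python
def pin(alpha, mu, eps):
    """PIN from the information-event probability and the two arrival rates."""
    return alpha * mu / (alpha * mu + 2 * eps)


def vpin(buy_volumes, sell_volumes, window):
    """Rolling ratio of absolute order imbalance to total volume over `window` bars."""
    imbalances = [abs(b - s) for b, s in zip(buy_volumes, sell_volumes)]
    totals = [b + s for b, s in zip(buy_volumes, sell_volumes)]
    values = []
    for end in range(window, len(imbalances) + 1):
        start = end - window
        values.append(sum(imbalances[start:end]) / sum(totals[start:end]))
    return values


# Four volume bars of 100 contracts each.
buys = [60, 50, 90, 50]
sells = [40, 50, 10, 50]
rolling_vpin = vpin(buys, sells, window=2)
```

Dividing by the summed bar totals rather than by $nV$ gives the same value when every bar has exactly the same volume, and tolerates bars that are only approximately equal in size.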
Additional Microstructural Features
- Order Size Distribution: Round numbers (e.g., 10, 50, 100 contracts) indicate human "GUI traders," while randomized sizes indicate "silicon traders" (algorithms). A shift in this distribution can be predictive.
- Cancellations & Order Types: High cancellation rates or specific patterns can reveal predatory algorithms (quote stuffers, squeezers, pack hunters).
- TWAP Algorithm Footprints: Institutional (TWAP) algorithms often execute at regular time intervals, creating detectable volume spikes (e.g., at the beginning of every minute).
- Serial Correlation of Signed Order Flow: High persistence in order flow (e.g., many buy orders in a row) is attributed to informed traders splitting large orders.
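Two of these features are simple enough to sketch directly: the lag autocorrelation of the signed order flow and the share of orders with round sizes. The function names, the `lag` parameter, and the choice of multiples of 10 as the definition of a "round" size are assumptions made for illustration.

```python
import numpy as np


def signed_flow_autocorr(sides, lag=1):
    """Correlation between the trade sign series and its own lagged copy."""
    s = np.asarray(sides, float)
    return float(np.corrcoef(s[lag:], s[:-lag])[0, 1])


def round_size_fraction(sizes, base=10):
    """Share of orders whose size is an exact multiple of `base`."""
    sizes = np.asarray(sizes)
    return float(np.mean(sizes % base == 0))


persistent = [1] * 5 + [-1] * 5 + [1] * 5 + [-1] * 5  # long runs of same-side trades
alternating = [1, -1] * 10                            # sides flip on every trade
rho_persistent = signed_flow_autocorr(persistent)
rho_alternating = signed_flow_autocorr(alternating)
round_share = round_size_fraction([10, 50, 7, 100, 23])
```

Tracking these values over rolling windows, rather than once over the full sample, is what would make a shift in the distribution visible as a feature.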
Author's Definition: What is Microstructural Information?
The chapter proposes a novel definition of information, not as "news," but as a measure of market maker predictability.
- Concept: "Information" is high when market makers' models are failing. This failure is measured by the cross-entropy loss ($L_\tau$) of the market maker's own classifier.
- Feature: The "microstructural information" feature is the Cumulative Distribution Function (CDF) of this loss, $\phi_\tau = F[L_\tau]$.
- Interpretation: When $\phi_\tau$ is high, the market maker's model is experiencing high losses, signaling the presence of informed traders and a high probability of adverse selection. This was evident during the 2010 Flash Crash.
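The feature can be sketched as a per-observation log loss followed by a CDF transform. The example below uses an empirical CDF computed over the whole sample, which is a simplification (a fitted density could be used instead, and a live feature would need a CDF built only from past losses). The function names and the sample probabilities are hypothetical.

```python
import numpy as np


def log_loss_per_observation(y_true, p_buy, eps=1e-12):
    """Cross-entropy loss of each prediction that the next trade is a buy."""
    y = np.asarray(y_true, float)
    p = np.clip(np.asarray(p_buy, float), eps, 1.0 - eps)
    return -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))


def microstructural_information(losses):
    """Empirical CDF value of each loss: share of observations with loss <= it."""
    losses = np.asarray(losses, float)
    ordered = np.sort(losses)
    return np.searchsorted(ordered, losses, side="right") / len(losses)


outcomes = [1, 1, 0, 1]                 # 1 = buy, 0 = sell
predicted_buy = [0.9, 0.6, 0.2, 0.1]    # classifier's probability of a buy
losses = log_loss_per_observation(outcomes, predicted_buy)
phi = microstructural_information(losses)
```

The last observation, where the classifier assigned a 10% probability to a buy that then occurred, has the largest loss and therefore the highest feature value.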
API reference
RiskLabAI implements these features in Python and Julia.