Statistical Tests for Mean Reversion (Stationarity)

Posted January 27, 2026 at 1:30 pm

Tribhuvan Bisen
Quant Insider

ADF, KPSS, Variance Ratio, and the Practical Issues You Actually Run Into

1. Context and the Basic Distinction

Let me start with the obvious thing that people think they understand until they actually try to trade it: a random walk is very different from something that is mean reverting.

A random walk wanders. If it goes up for a while, it can stay up for a long time. If it crosses the mean again, that can just be chance. Importantly, being far from the mean does not create a stronger pull back.

A mean reverting process is centered around some level, and the further away it gets, the stronger the expected pull back. This is the difference between something that crosses the mean rarely and something that crosses it many times.

The classic econometric framing is: many of the statistics you want to use require stationarity. If something is a random walk, you typically difference it first. If it is already stationary, you can regress it, fit AR models, and so on, without everything falling apart.

Two canonical toy models:

Random walk (unit root).

$$y_t = y_{t-1} + \varepsilon_t, \qquad \varepsilon_t \sim (0, \sigma^2).$$

Mean reversion (AR(1), $|\rho| < 1$).

$$y_t = \rho\, y_{t-1} + \varepsilon_t, \qquad |\rho| < 1.$$

The whole point of the tests below is to distinguish “$\rho = 1$” from “$\rho < 1$” in a statistically disciplined way.
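To make the "crosses the mean rarely versus many times" distinction concrete, here is a minimal sketch that feeds the same shocks into both toy models and counts how often each path crosses its own sample mean. The seed, the sample length, the value rho = 0.9, and the helper name `mean_crossings` are illustrative assumptions, not part of the original text.

```python
import numpy as np

rng = np.random.default_rng(42)   # arbitrary seed, only for reproducibility
T = 2000
eps = rng.standard_normal(T)

# Random walk: y_t = y_{t-1} + eps_t
y_walk = np.cumsum(eps)

# AR(1): y_t = rho * y_{t-1} + eps_t, driven by the same shocks
rho = 0.9                          # illustrative choice with |rho| < 1
y_mr = np.zeros(T)
for t in range(1, T):
    y_mr[t] = rho * y_mr[t - 1] + eps[t]

def mean_crossings(y):
    """Count sign changes of the series around its own sample mean."""
    above = (y - y.mean()) > 0
    return int(np.sum(above[1:] != above[:-1]))

crossings_walk = mean_crossings(y_walk)
crossings_mr = mean_crossings(y_mr)
print("random walk crossings:", crossings_walk)
print("AR(1) crossings:      ", crossings_mr)
```

A crossing count is only a visual sanity check. It does not replace the formal tests that follow.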

2. Augmented Dickey–Fuller (ADF) Test

The ADF test is the standard workhorse. Conceptually it is very simple: we regress changes on lagged levels and ask whether the level term has a negative coefficient.

Start from an AR(1):

$$y_t = \rho\, y_{t-1} + \varepsilon_t.$$

Subtract $y_{t-1}$ from both sides:

$$\Delta y_t = (\rho - 1)\, y_{t-1} + \varepsilon_t.$$

Define $\beta := \rho - 1$, so

$$\Delta y_t = \beta\, y_{t-1} + \varepsilon_t.$$

If $\rho = 1$ then $\beta = 0$, and you have:

$$\Delta y_t = \varepsilon_t,$$

which is exactly the “increments are white noise” property of a random walk.

The augmented form adds lagged differences to mop up autocorrelation in the residuals:

$$\Delta y_t = \beta\, y_{t-1} + \sum_{i=1}^{p} \gamma_i\, \Delta y_{t-i} + \varepsilon_t.$$
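The augmented regression is ordinary least squares under the hood, so it can be sketched in a few lines of NumPy. The helper name `adf_regression`, the added constant term (which the equation above omits), and the simulated AR(1) input are assumptions for illustration. The returned t-ratio has to be compared against Dickey–Fuller critical values, not Student-t ones.

```python
import numpy as np

def adf_regression(y, p):
    """OLS fit of dy_t = c + beta*y_{t-1} + sum_i gamma_i*dy_{t-i} + e_t.

    Returns (beta_hat, t_ratio_of_beta). Hypothetical helper for illustration.
    """
    y = np.asarray(y, dtype=float)
    dy = np.diff(y)                          # dy[j] = y[j+1] - y[j]
    target = dy[p:]                          # rows where all p lags exist
    n = target.size
    cols = [np.ones(n), y[p:-1]]             # constant and lagged level y_{t-1}
    for i in range(1, p + 1):
        cols.append(dy[p - i : dy.size - i])  # lagged difference dy_{t-i}
    X = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ coef
    s2 = resid @ resid / (n - X.shape[1])    # residual variance
    cov = s2 * np.linalg.inv(X.T @ X)        # classical OLS covariance
    beta = coef[1]
    return beta, beta / np.sqrt(cov[1, 1])

# Simulated AR(1) with a known rho, so beta should land near rho - 1.
rng = np.random.default_rng(0)
T, rho_true = 5000, 0.9
e = rng.standard_normal(T)
y_sim = np.zeros(T)
for t in range(1, T):
    y_sim[t] = rho_true * y_sim[t - 1] + e[t]

beta_hat, t_beta = adf_regression(y_sim, p=2)
print("beta_hat:", beta_hat, " implied rho:", 1 + beta_hat)
print("t-ratio on beta:", t_beta)
```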

2.1. Hypotheses

ADF is usually written as:

$$H_0: \beta = 0 \ (\text{unit root / random walk}), \qquad H_1: \beta < 0 \ (\text{stationary / mean reverting}).$$

2.2. The part people get wrong: it is not a Student-t test

Even though the regression looks like a normal regression, the test statistic under $H_0$ does not follow a Student t distribution. It follows a Dickey–Fuller type distribution. That is why ADF uses special critical values.

If you treat it like a standard t-test, you are going to fool yourself.

2.3. Choosing the lag length p

You have to choose p. Different p gives you a different test. The standard practice is to pick p by minimizing an information criterion such as AIC or BIC.

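AIC.

$$\mathrm{AIC} = -2 \log L + 2k.$$

BIC.

$$\mathrm{BIC} = -2 \log L + k \log T.$$

Here $L$ is the maximized likelihood, $k$ is the number of estimated parameters, and $T$ is the sample size. Both criteria are “lack of fit plus a penalty,” and the preferred model is the one with the lowest value. Because $\log T > 2$ once $T \geq 8$, BIC penalizes extra lags more heavily than AIC and never picks a longer lag length from the same candidates. They are conceptually similar to how people think about cross-validation: you do not just maximize fit, you penalize complexity.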

You can pick lags visually using ACF/PACF rules of thumb. People do that. But if you’re running systematic tests, letting the routine pick p by AIC/BIC is a perfectly reasonable default.
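For intuition about what an automatic lag search does, here is a minimal sketch that scores each candidate p by AIC and BIC on a common sample. It uses a Gaussian log-likelihood evaluated at the OLS estimates. The helper names (`adf_design`, `info_criteria`), the cap `p_max=8`, and the simulated input are assumptions, and this is not a reproduction of any library's internal routine.

```python
import numpy as np

def adf_design(y, p, p_max):
    """Target and regressors for the ADF regression with p lagged differences,
    restricted to rows available for every p <= p_max so fits are comparable."""
    dy = np.diff(y)
    target = dy[p_max:]
    n = target.size
    cols = [np.ones(n), y[p_max:-1]]                 # constant, y_{t-1}
    for i in range(1, p + 1):
        cols.append(dy[p_max - i : dy.size - i])     # dy_{t-i}
    return target, np.column_stack(cols)

def info_criteria(y, p_max):
    """Return a list of (p, AIC, BIC) for p = 0..p_max."""
    y = np.asarray(y, dtype=float)
    rows = []
    for p in range(p_max + 1):
        target, X = adf_design(y, p, p_max)
        n, k = X.shape                               # k counts regression coefficients
        coef, *_ = np.linalg.lstsq(X, target, rcond=None)
        rss = np.sum((target - X @ coef) ** 2)
        loglik = -0.5 * n * (np.log(2 * np.pi) + np.log(rss / n) + 1.0)
        rows.append((p, -2 * loglik + 2 * k, -2 * loglik + k * np.log(n)))
    return rows

# Illustrative input: a unit-root series whose increments follow an AR(1),
# so one lagged difference is genuinely needed.
rng = np.random.default_rng(1)
T = 2000
shocks = rng.standard_normal(T)
d = np.zeros(T)
for t in range(1, T):
    d[t] = 0.5 * d[t - 1] + shocks[t]
y_demo = np.cumsum(d)

table = info_criteria(y_demo, p_max=8)
p_aic = min(table, key=lambda r: r[1])[0]
p_bic = min(table, key=lambda r: r[2])[0]
for p, aic, bic in table:
    print(f"p={p}  AIC={aic:.1f}  BIC={bic:.1f}")
print("AIC picks p =", p_aic, " BIC picks p =", p_bic)
```

Fixing the sample at `p_max` matters: if each p used a different number of rows, the likelihoods would not be comparable.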

2.4. Python example (ADF with autolag AIC)

import numpy as np
from statsmodels.tsa.stattools import adfuller

np.random.seed(0)

# Example 1: random walk
y_rw = np.cumsum(np.random.randn(2000))

# Example 2: AR(1) mean reverting
eps = np.random.randn(2000)
rho = 0.95
y_ar = np.zeros_like(eps)
for t in range(1, len(eps)):
    y_ar[t] = rho * y_ar[t-1] + eps[t]

for name, y in [("Random Walk", y_rw), ("AR(1) Mean Reverting", y_ar)]:
    stat, pval, used_lag, nobs, crit, icbest = adfuller(y, autolag="AIC")
    print("\n---", name, "---")
    print("ADF stat:", stat)
    print("p-value:", pval)
    print("lags used:", used_lag)
    print("nobs:", nobs)
    print("crit:", crit)

In trading I am not religious about the 5% line. I care whether the series is clearly mean reverting or basically a random walk. A p-value that barely tips under a threshold is not the same thing as a statistic that sits far on one side.

3. Why the t-statistic fails in time series unit-root regressions

In a textbook regression, you write:

$$t = \frac{\hat{\beta}}{\mathrm{SE}(\hat{\beta})}$$

and you compare to a Student-t distribution.

In unit root settings, the asymptotics are different. Under the unit root null, the process behaves like Brownian motion in the limit. The distribution of the statistic depends on functionals of Brownian motion, not a Student-t.

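One way to write the limiting idea (schematically) is that you get ratios of integrals involving standard Brownian motion $W(\cdot)$. In the simplest case (no constant, no lagged differences), the scaled coefficient estimate $T\hat{\beta}$ converges to

$$\frac{\int_0^1 W(r)\, dW(r)}{\int_0^1 W(r)^2\, dr},$$

and the t-ratio converges to the same expression with the square root of the denominator. Neither is the normal limit of the usual regression t-statistic.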

So: use the Dickey–Fuller critical values. That is the point of the test.
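A small Monte Carlo makes the point without any asymptotic theory. The sketch below simulates random walks (so the null is true), runs the Dickey–Fuller regression with a constant on each, and looks at where the t-ratios actually fall. The number of simulations, the path length, and the seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(123)
n_sims, T = 5000, 250

# Each row is one simulated random walk, so the unit-root null holds.
paths = np.cumsum(rng.standard_normal((n_sims, T)), axis=1)
y_lag = paths[:, :-1]
dy = np.diff(paths, axis=1)

# Including a constant is equivalent to demeaning both sides (Frisch-Waugh).
y_lag_c = y_lag - y_lag.mean(axis=1, keepdims=True)
dy_c = dy - dy.mean(axis=1, keepdims=True)

sxx = np.sum(y_lag_c ** 2, axis=1)
beta = np.sum(y_lag_c * dy_c, axis=1) / sxx          # OLS slope per path
resid = dy_c - beta[:, None] * y_lag_c
n = dy.shape[1]
s2 = np.sum(resid ** 2, axis=1) / (n - 2)            # two estimated parameters
t_stats = beta / np.sqrt(s2 / sxx)

q05 = np.quantile(t_stats, 0.05)
naive_reject = np.mean(t_stats < -1.645)             # one-sided 5% normal cutoff
print("simulated 5% quantile of the t-ratio:", q05)
print("rejection rate using the normal cutoff:", naive_reject)
```

The simulated 5% quantile should sit far to the left of the normal cutoff of -1.645, in the neighborhood of the tabulated Dickey–Fuller value of about -2.86 for the constant-only case. Using the normal cutoff would reject a true unit root far more often than 5% of the time.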

4. KPSS: Reversing the Null Hypothesis

KPSS is useful because it flips the null and the alternative compared to ADF.

ADF says: assume unit root, try to reject it.
KPSS says: assume stationarity, try to reject it.

The model can be written as:

$$y_t = \mu_t + \varepsilon_t, \qquad \mu_t = \mu_{t-1} + u_t.$$

If $\mathrm{Var}(u_t) = 0$, then $\mu_t$ is constant and $y_t$ is stationary around a mean. If $\mathrm{Var}(u_t) > 0$, then $\mu_t$ wanders and you have unit-root-type behavior.

Operationally, KPSS is built from residual partial sums. Regress $y_t$ on a constant (or constant + trend), take the residuals $e_t$, and define the partial sums:

$$S_t = \sum_{i=1}^{t} e_i.$$

Then the statistic is:

$$\mathrm{KPSS} = \frac{1}{T^2 \hat{\sigma}^2} \sum_{t=1}^{T} S_t^2,$$

where $\hat{\sigma}^2$ is an estimate of the long-run variance of $e_t$.
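The construction above can be sketched directly in NumPy for the constant-only case. The long-run variance here uses Bartlett (Newey–West) weights with a fixed truncation lag. The helper name `kpss_level_stat`, the choice `lags=12`, and the simulated inputs are assumptions, and library implementations may choose the bandwidth differently.

```python
import numpy as np

def kpss_level_stat(y, lags):
    """KPSS statistic for stationarity around a constant (hypothetical helper)."""
    y = np.asarray(y, dtype=float)
    T = y.size
    e = y - y.mean()                       # residuals from regressing on a constant
    S = np.cumsum(e)                       # partial sums S_t
    # Long-run variance: Bartlett-weighted sum of sample autocovariances
    lrv = e @ e / T
    for s in range(1, lags + 1):
        weight = 1.0 - s / (lags + 1.0)
        lrv += 2.0 * weight * (e[s:] @ e[:-s]) / T
    return np.sum(S ** 2) / (T ** 2 * lrv)

rng = np.random.default_rng(5)
T = 2000
shocks = rng.standard_normal(T)

walk = np.cumsum(shocks)                   # unit-root series
ar1 = np.zeros(T)                          # stationary AR(1), rho = 0.5
for t in range(1, T):
    ar1[t] = 0.5 * ar1[t - 1] + shocks[t]

stat_walk = kpss_level_stat(walk, lags=12)
stat_ar1 = kpss_level_stat(ar1, lags=12)
print("KPSS stat, random walk:", stat_walk)
print("KPSS stat, AR(1):      ", stat_ar1)
```

For reference, the tabulated 5% critical value for the constant-only case is 0.463. Larger values count against stationarity.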

4.1. Hypotheses

$$H_0: \text{stationary}, \qquad H_1: \text{unit root / non-stationary}.$$

4.2. Python example (KPSS)

from statsmodels.tsa.stattools import kpss

# Using the same y_rw and y_ar from earlier
for name, y in [("Random Walk", y_rw), ("AR(1) Mean Reverting", y_ar)]:
    stat, pval, lags, crit = kpss(y, regression="c", nlags="auto")
    print("\n---", name, "---")
    print("KPSS stat:", stat)
    print("p-value:", pval)
    print("lags used:", lags)
    print("crit:", crit)

How I actually use it.

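I like ADF and KPSS together because they are complementary. If ADF rejects a unit root and KPSS does not reject stationarity, the two tests agree that the series looks stationary. If ADF fails to reject a unit root and KPSS rejects stationarity, you are not looking at mean reversion. If the two disagree, or both sit near their thresholds, you are probably in the grey zone, and you should stop pretending a binary label is going to save you.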
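The joint reading of the two tests can be written down as a tiny decision rule. This is a sketch of one way to encode it. The function name `classify_stationarity`, the labels, and the default `alpha=0.05` are assumptions, and a hard threshold is exactly the kind of binary label that should be treated with suspicion near the boundary.

```python
def classify_stationarity(adf_pvalue, kpss_pvalue, alpha=0.05):
    """Combine ADF (null: unit root) and KPSS (null: stationary) p-values."""
    adf_rejects_unit_root = adf_pvalue < alpha
    kpss_rejects_stationarity = kpss_pvalue < alpha

    if adf_rejects_unit_root and not kpss_rejects_stationarity:
        return "stationary"      # both tests point to mean reversion
    if not adf_rejects_unit_root and kpss_rejects_stationarity:
        return "unit root"       # both tests point away from mean reversion
    return "grey zone"           # the tests conflict or neither is decisive

print(classify_stationarity(0.001, 0.10))   # stationary
print(classify_stationarity(0.60, 0.01))    # unit root
print(classify_stationarity(0.001, 0.01))   # grey zone
```

One practical caveat: the statsmodels `kpss` function interpolates its p-value from a table and reports it only within the range 0.01 to 0.10, so a reported 0.01 or 0.10 is a bound, not an exact value.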

5. Variance Ratio Test (Lo–MacKinlay)

Variance ratio is more intuitive than people give it credit for. It is based on the scaling properties of a random walk. If $y_t$ is a random walk, then:

$$\mathrm{Var}(y_{t+k} - y_t) = k\, \mathrm{Var}(y_{t+1} - y_t).$$

Define:

$$\mathrm{VR}(k) = \frac{\mathrm{Var}(y_{t+k} - y_t)}{k\, \mathrm{Var}(y_{t+1} - y_t)}.$$

Interpretation:

$$\mathrm{VR}(k) = 1: \text{random walk}, \qquad \mathrm{VR}(k) < 1: \text{mean reversion / negative autocorrelation}, \qquad \mathrm{VR}(k) > 1: \text{trend / momentum}.$$

The reason I like this test is that you can compute it across multiple horizons k and see the structure. It is a very visual diagnostic.
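The link between the variance ratio and autocorrelation can be made explicit. Assuming the increments $\Delta y_t$ are covariance-stationary, and writing $r_j$ for their lag-$j$ autocorrelation (a symbol introduced here, not to be confused with the AR coefficient $\rho$), expanding the variance of a sum of $k$ increments gives:

$$\mathrm{VR}(k) = 1 + 2 \sum_{j=1}^{k-1} \left(1 - \frac{j}{k}\right) r_j, \qquad \text{so in particular} \quad \mathrm{VR}(2) = 1 + r_1.$$

So $\mathrm{VR}(k)$ is one plus a weighted sum of the first $k-1$ autocorrelations of the increments, with linearly declining weights. Negative autocorrelation in changes pushes it below one, positive autocorrelation pushes it above one, and uncorrelated increments leave it at one.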

5.1. Python example (simple variance ratio across horizons)

import numpy as np

def variance_ratio(y, k):
    y = np.asarray(y)
    dy1 = np.diff(y)
    dyk = y[k:] - y[:-k]
    return np.var(dyk, ddof=1) / (k * np.var(dy1, ddof=1))

ks = [2, 5, 10, 20, 50]

for name, y in [("Random Walk", y_rw), ("AR(1) Mean Reverting", y_ar)]:
    print("\n---", name, "---")
    for k in ks:
        print(f"k={k:>3d}  VR={variance_ratio(y, k):.4f}")


In practice, I care less about whether $\mathrm{VR}(k)$ is “statistically significant” at 5% and more about how far it is from one. If it is barely below one, it is usually not worth trading. If it is meaningfully below one, you often see better risk-adjusted behavior. This is not magic: stronger negative autocorrelation is a stronger economic effect.
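If you do want a significance number next to the magnitude, a rough z-score can be sketched from the asymptotic variance that Lo and MacKinlay derive for overlapping differences under homoskedastic increments, 2(2k-1)(k-1)/(3k) divided by the number of one-period increments. The helper name `variance_ratio_z` and the simulated inputs are assumptions. The sketch skips the small-sample bias corrections and the heteroskedasticity-robust version, so treat the output as indicative only.

```python
import numpy as np

def variance_ratio_z(y, k):
    """Return (VR(k), approximate z-score of VR(k) - 1). Hypothetical helper."""
    y = np.asarray(y, dtype=float)
    dy1 = np.diff(y)
    dyk = y[k:] - y[:-k]                              # overlapping k-period changes
    vr = np.var(dyk, ddof=1) / (k * np.var(dy1, ddof=1))
    n = dy1.size
    asy_var = 2.0 * (2 * k - 1) * (k - 1) / (3.0 * k * n)
    return vr, (vr - 1.0) / np.sqrt(asy_var)

rng = np.random.default_rng(11)
T = 2000
shocks = rng.standard_normal(T)

walk = np.cumsum(shocks)                              # random walk
ar1 = np.zeros(T)                                     # stationary AR(1), rho = 0.5
for t in range(1, T):
    ar1[t] = 0.5 * ar1[t - 1] + shocks[t]

vr_walk, z_walk = variance_ratio_z(walk, k=5)
vr_ar1, z_ar1 = variance_ratio_z(ar1, k=5)
print(f"random walk: VR(5)={vr_walk:.3f}  z={z_walk:.2f}")
print(f"AR(1):       VR(5)={vr_ar1:.3f}  z={z_ar1:.2f}")
```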

6. Lag Length Choice: AIC, BIC, PACF (and why it matters)

In ADF, the whole “augmented” part is the lagged differences:

$$\sum_{i=1}^{p} \gamma_i\, \Delta y_{t-i}.$$

If you use too few lags, you leave autocorrelation in residuals and contaminate the test. If you use too many, you eat degrees of freedom and lose power.

There are multiple ways to pick p:

  • AIC/BIC: automated, objective-ish, and widely used in practice.
  • PACF: common visual heuristic for AR order selection; people do this.

I would not pretend there is a universal “correct” p. If your series is high frequency, your effective dynamics are different from daily. Use AIC/BIC as a default, and if you are building a strategy, treat p as a hyperparameter you stress-test.

7. Putting It Together: How I Would Actually Screen Mean Reversion

If you are screening a spread or a residual for mean reversion:

  1. Run ADF (unit root null). Check whether you can reject.
  2. Run KPSS (stationarity null). Check whether it rejects stationarity.
  3. Compute variance ratios across horizons. Ask: is it meaningfully below 1?

If the outputs disagree, that is information, not a nuisance. It usually means you are in a borderline case where you should not be overly confident.

8. Final Remarks (from trading reality)

A few points that matter more than people admit:

  • These tests tell you what the data looked like. They do not guarantee persistence.
  • A series can be mean reverting in one period and then stop. Mean reversion is often episodic.
  • Statistical significance is not the same thing as tradability. Transaction costs and turnover matter.
  • Do not treat a p-value as a trading signal. Use it as a filter, then backtest the whole pipeline.

And the last thing: the reason we do this is not because statistics make money. We do this because statistics stop us from fooling ourselves too easily.


Disclosure: Interactive Brokers Third Party

Information posted on IBKR Campus that is provided by third-parties does NOT constitute a recommendation that you should contract for the services of that third party. Third-party participants who contribute to IBKR Campus are independent of Interactive Brokers and Interactive Brokers does not make any representations or warranties concerning the services offered, their past or future performance, or the accuracy of the information provided by the third party. Past performance is no guarantee of future results.

This material is from Quant Insider and is being posted with its permission. The views expressed in this material are solely those of the author and/or Quant Insider and Interactive Brokers is not endorsing or recommending any investment or trading discussed in the material. This material is not and should not be construed as an offer to buy or sell any security. It should not be construed as research or investment advice or a recommendation to buy, sell or hold any security or commodity. This material does not and is not intended to take into account the particular financial conditions, investment objectives or requirements of individual customers. Before acting on this material, you should consider whether it is suitable for your particular circumstances and, as necessary, seek professional advice.
