IBKR Quant Blog


K-Means Clustering For Pair Selection In Python - Overview

In this series, we will cover what K-Means clustering is, how it can be used for solving the age-old problem of pair selection for Statistical Arbitrage, and the advantage of using K-Means for pair selection compared to using a brute force method. We will also create a Statistical Arbitrage strategy using K-Means for pair selection and implement the elbow technique to determine the value of K.

Let’s get started!

Part I – Life Without K-Means

To understand why we may want to use K-Means to solve the problem of pair selection, we will attempt to implement a Statistical Arbitrage strategy as if there were no K-Means. That is, we will develop a brute force solution to our pair selection problem and then apply that solution within our Statistical Arbitrage strategy.

Let’s take a moment to think about why K-Means could be used for trading. What’s the benefit of using K-Means to form subgroups of possible pairs? Couldn’t we just come up with the pairs ourselves?

This is a great question, and one you have undoubtedly wondered about. To better understand the strength of a technique like K-Means for Statistical Arbitrage, we’ll walk through trading a Statistical Arbitrage strategy as if there were no K-Means. I’ll be your ghost of trading past, so to speak.

First, let’s identify the key components of any Statistical Arbitrage trading strategy.

  1. We must identify assets that have a tradable relationship
  2. We must calculate the Z-Score of the spread of these assets, as well as the hedge ratio for position sizing
  3. We generate buy and sell decisions when the Z-Score exceeds some upper or lower bound
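The second and third components can be sketched in a few lines. This is a minimal illustration on synthetic prices, not the strategy built later in the series: the ordinary least squares hedge ratio, the full-sample Z-Score, and the entry threshold of 2.0 are all assumptions made for the example.

```python
import numpy as np

# Synthetic prices for illustration only (not real market data).
rng = np.random.default_rng(0)
x = 50 + np.cumsum(rng.normal(0, 1, 500))   # asset X: a random walk
y = 5 + 1.5 * x + rng.normal(0, 1, 500)     # asset Y: tied to X plus noise

# Hedge ratio: slope of an OLS regression of Y on X
# (np.polyfit returns the slope first, then the intercept).
hedge_ratio, intercept = np.polyfit(x, y, 1)

# Spread: what remains of Y after hedging out X.
spread = y - hedge_ratio * x

# Z-Score: how many standard deviations the spread sits from its mean.
zscore = (spread - spread.mean()) / spread.std()

# Signals: short the spread above the upper bound, go long below the lower bound.
entry = 2.0  # assumed threshold, chosen only for this sketch
signal = np.where(zscore > entry, -1, np.where(zscore < -entry, 1, 0))

print(f"hedge ratio: {hedge_ratio:.3f}")
print(f"bars with an active signal: {np.count_nonzero(signal)}")
```

Note that the Z-Score here uses the mean and standard deviation of the whole sample, which looks ahead in time. A tradable version would more likely compute them over a rolling window.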

To begin, we need some pairs to trade. But we can’t trade Statistical Arbitrage without knowing whether or not the pairs we select are cointegrated. Two price series are cointegrated when some linear combination of them (the spread) is stationary, meaning its statistical properties, such as its mean and variance, are stable over time. Even if the two assets each move randomly, we can count on the spread between them to revert toward its mean, at least most of the time.

Traditionally, in a world with no K-Means, pairs had to be found by brute force or trial and error. This was usually done by grouping together stocks that were merely in the same sector or industry. The idea was that if the companies were in similar industries, and thus had similar operations, their stocks should move similarly as well. But, as we shall see, this is not necessarily the case.
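A quick count shows why brute force gets expensive. The sketch below enumerates the candidate pairs in a small hand-picked group and then counts the pairs in a universe of 500 stocks. The four-ticker group is purely illustrative and says nothing about which pairs are actually tradable.

```python
from itertools import combinations
from math import comb

# A hand-picked group of retailers (illustrative only).
sector = ["WMT", "TGT", "COST", "KR"]

# Every unordered pair in the group would need its own cointegration test.
candidate_pairs = list(combinations(sector, 2))
print(len(candidate_pairs))   # 6

# The same count for a universe of 500 stocks: n * (n - 1) / 2.
universe_size = 500
total_pairs = comb(universe_size, 2)
print(total_pairs)            # 124750
```

Testing every one of those pairs is slow, and running that many tests also raises the chance of finding relationships that appear significant purely by luck.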

The first step is to think of some pairs of stocks that should yield a tradable relationship. We’ll use stocks in the S&P 500, but this process could be applied to the stocks in any index. How about Walmart and Target? They are both retailers and direct competitors. Surely they should be cointegrated and thus allow us to trade them in a Statistical Arbitrage strategy.

Let’s begin by importing the necessary libraries as well as the data that we will need. We will use 2014-2016 as our analysis period.

#importing necessary libraries

#data analysis/manipulation
import numpy as np
import pandas as pd

#plotting
import matplotlib.pyplot as plt

#importing pandas datareader to get our data
import pandas_datareader as pdr

#importing the Augmented Dickey Fuller Test to check for cointegration
from statsmodels.tsa.api import adfuller

Now that we have our libraries, let’s get our data.

#setting start and end dates
#importing Walmart and Target using pandas datareader
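The download code itself is not shown here. It would presumably call pandas datareader with each ticker and the two dates, which needs network access and a working data source. As a stand-in, the sketch below sets the dates and builds two synthetic frames with a `Close` column so the later steps have something to run on. The variable names `wmt` and `tgt`, the exact end date, the helper `synthetic_prices`, and the numbers it produces are all assumptions, not real Walmart or Target prices.

```python
from datetime import datetime

import numpy as np
import pandas as pd

# Analysis period (the exact end date is an assumption).
start = datetime(2014, 1, 1)
end = datetime(2016, 12, 31)


def synthetic_prices(seed, start_price):
    """Hypothetical stand-in for a download: a random-walk 'Close' column."""
    dates = pd.bdate_range(start, end)  # business days in the period
    rng = np.random.default_rng(seed)
    close = start_price + np.cumsum(rng.normal(0, 0.5, len(dates)))
    return pd.DataFrame({"Close": close}, index=dates)


# Placeholder frames standing in for the downloaded Walmart and Target data.
wmt = synthetic_prices(seed=1, start_price=75.0)
tgt = synthetic_prices(seed=2, start_price=60.0)

print(wmt.head())
```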

Before testing our two stocks for cointegration, let’s take a look at their performance over the period. We’ll create a plot of Walmart and Target.

#Creating a figure to plot on

#Creating WMT and TGT plots

plt.title('Walmart and Target Over 2014-2016')
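A self-contained version of the plotting step might look like the sketch below. It uses synthetic closing prices in place of the real data, and the figure size, axis labels, and the names `wmt_close` and `tgt_close` are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic closing prices for illustration only (not real market data).
dates = pd.bdate_range("2014-01-01", "2016-12-31")
rng = np.random.default_rng(0)
wmt_close = pd.Series(75 + np.cumsum(rng.normal(0, 0.5, len(dates))), index=dates)
tgt_close = pd.Series(60 + np.cumsum(rng.normal(0, 0.5, len(dates))), index=dates)

# Creating a figure to plot on
fig, ax = plt.subplots(figsize=(10, 8))

# Creating WMT and TGT plots
ax.plot(wmt_close.index, wmt_close.values, label="WMT")
ax.plot(tgt_close.index, tgt_close.values, label="TGT")

ax.set_title("Walmart and Target Over 2014-2016")
ax.set_xlabel("Date")
ax.set_ylabel("Closing price")
ax.legend()
```

In an interactive session, drop the `matplotlib.use("Agg")` line and call `plt.show()` to display the figure.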



In the above plot, the two stocks appear to move together slightly at the beginning of 2014. But this doesn’t really give us a clear idea of the relationship between Walmart and Target. To quantify the relationship between the two stocks, we’ll create a correlation heatmap.

To begin creating our correlation heatmap, we must first place the Walmart* and Target* prices in the same dataframe. Let’s create a new dataframe for our stocks.

#initializing newDF as a pandas dataframe
#adding WMT closing prices as a column to the newDF
#adding TGT closing prices as a column to the newDF
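The three steps described in the comments can be sketched as follows. The name `newDF` comes from the comments. The input frames `wmt` and `tgt`, their `Close` column, and the placeholder numbers are assumptions made so the example runs on its own.

```python
import pandas as pd

# Placeholder inputs standing in for the downloaded data (values are made up).
dates = pd.bdate_range("2014-01-01", periods=5)
wmt = pd.DataFrame({"Close": [10.0, 10.5, 10.2, 10.8, 11.0]}, index=dates)
tgt = pd.DataFrame({"Close": [20.0, 19.5, 19.8, 20.4, 20.1]}, index=dates)

# initializing newDF as a pandas dataframe
newDF = pd.DataFrame()

# adding WMT closing prices as a column to the newDF
newDF["WMT"] = wmt["Close"]

# adding TGT closing prices as a column to the newDF
newDF["TGT"] = tgt["Close"]

print(newDF.head())
```

Because both columns share the same date index, pandas aligns them row by row. If the two downloads covered different trading days, the missing dates would show up as NaN values.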

Now that we have created a new dataframe to hold our Walmart* and Target* stock prices, let’s take a look at it.



We can see that we have the prices of both our stocks in one place.

In the next post, we will create a correlation heatmap of the stocks and run some ADF tests.


*Disclaimer: All investments and trading in the stock market involve risk. Any decisions to place trades in the financial markets, including trading in stock or options or other financial instruments is a personal decision that should only be made after thorough research, including a personal risk and financial assessment and the engagement of professional assistance to the extent you believe necessary. The trading strategies or related information mentioned in this article is for informational purposes only.

If you want to learn more about K-Means Clustering for Pair Selection in Python, or to download the code, visit the QuantInsti website and the educational offerings in their Executive Programme in Algorithmic Trading (EPAT™).

This article is from QuantInsti and is being posted with QuantInsti’s permission. The views expressed in this article are solely those of the author and/or QuantInsti and IB is not endorsing or recommending any investment or trading discussed in the article. This material is for information only and is not and should not be construed as an offer to sell or the solicitation of an offer to buy any security. To the extent that this material discusses general market activity, industry or sector trends or other broad-based economic or political conditions, it should not be construed as research or investment advice. To the extent that it includes references to specific securities, commodities, currencies, or other instruments, those references do not constitute a recommendation by IB to buy, sell or hold such security. This material does not and is not intended to take into account the particular financial conditions, investment objectives or requirements of individual customers. Before acting on this material, you should consider whether it is suitable for your particular circumstances and, as necessary, seek professional advice.
