# IBKR Quant Blog

### Decision Tree For Trading Using Python - Part I

Decision Trees are a Supervised Machine Learning method used in Classification and Regression problems, also known as CART (Classification and Regression Trees).

Remember that a Classification problem tries to assign unknown elements to a class or category; the output is always a categorical variable (e.g. yes/no, up/down, red/blue/yellow).

A Regression problem tries to forecast a number, such as the return for the next day. It must not be confused with linear regression, which is used to study the relationship between variables.

Although classification and regression problems have different objectives, the trees have the same structure:

• The Root node is at the top and has no incoming pathways.
• Internal nodes or test nodes are in the middle, can sit at different levels or sub-spaces, and have both incoming and outgoing pathways.
• Leaf nodes or decision nodes are at the bottom and have incoming pathways but no outgoing pathways; this is where we find the expected outputs.
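To make the three node types concrete, here is a minimal sketch in plain Python. The `Node` class, the `predict` helper, and the `rsi` and `volatility` predictors with their thresholds are all hypothetical names and values chosen for illustration; Sklearn stores its trees differently.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class Node:
    """One node of a binary decision tree (hypothetical structure for illustration)."""
    feature: Optional[str] = None       # predictor tested at a root or internal node
    threshold: Optional[float] = None   # go left if value <= threshold, else right
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    output: Union[str, float, None] = None  # set only on leaf nodes

    def is_leaf(self) -> bool:
        return self.left is None and self.right is None


def predict(node: Node, sample: dict) -> Union[str, float, None]:
    """Walk from the root down to a leaf and return the leaf's output."""
    while not node.is_leaf():
        node = node.left if sample[node.feature] <= node.threshold else node.right
    return node.output


# Root node: no incoming pathway, tests a made-up "rsi" predictor.
root = Node(
    feature="rsi", threshold=30.0,
    left=Node(output="up"),                 # leaf node
    right=Node(                             # internal node
        feature="volatility", threshold=0.02,
        left=Node(output="up"),             # leaf node
        right=Node(output="down"),          # leaf node
    ),
)

print(predict(root, {"rsi": 25.0, "volatility": 0.05}))  # up
print(predict(root, {"rsi": 55.0, "volatility": 0.05}))  # down
```

A regression tree would have the same shape; only the leaf outputs would be numbers instead of class labels.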

Thanks to Python’s Sklearn library, the tree is created for us automatically, taking as a starting point the predictor variables that we hypothesize are responsible for the output we are looking for.

In this introductory post on decision trees, we will create a classification decision tree in Python to forecast whether the financial instrument we analyze will go up or down the next day.

We will also create a regression decision tree to forecast the actual return of the index for the next day.
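The two kinds of forecast differ only in the target they aim at. As a rough sketch with made-up closing prices (the variable names and the 1/-1 encoding are our own assumptions, not necessarily what the later posts use), both targets can be derived from the same series:

```python
closes = [100.0, 101.0, 100.5, 102.0, 101.0]  # made-up closing prices

# Regression target: next day's simple return, aligned with today's row.
next_day_return = [
    closes[i + 1] / closes[i] - 1.0 for i in range(len(closes) - 1)
]

# Classification target: 1 if the next day closes higher, otherwise -1.
next_day_direction = [1 if r > 0 else -1 for r in next_day_return]

print([round(r, 4) for r in next_day_return])
print(next_day_direction)  # [1, -1, 1, -1]
```

Note that the last day has no target, because its next-day close is not yet known.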

#### Preparing the Environment

Make sure the following software is available in order to follow the examples:

• Python 3.6
• Pandas library for data structures
• Numpy library for scientific and mathematical functions
• Quandl library to retrieve market data
• Ta-lib library to calculate technical indicators
• Sklearn ML library to build the trees and perform analysis (among many other things)
• Graphviz library to plot the tree
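A quick way to see which of these libraries are already installed is to look up their import names. This is only a convenience sketch; `missing_modules` is a helper name of our own, and the import names can differ from the names used to install the packages.

```python
import importlib.util

# Import names (not necessarily the install names) for the libraries above.
REQUIRED_MODULES = ["pandas", "numpy", "quandl", "talib", "sklearn", "graphviz"]


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
print("Missing:", missing if missing else "none")
```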

#### Building a Decision Tree

Building a classification decision tree and building a regression decision tree are very similar in the way we organize the input data and predictor variables. Then, by calling the corresponding functions, the tree is created for us automatically according to criteria that we must specify.
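One such criterion for classification trees is Gini impurity, which is the default in Sklearn's `DecisionTreeClassifier`. The sketch below is our own plain-Python illustration of how a candidate split can be scored with it, using made-up data; it is not how Sklearn implements the search internally.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions (0 means pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_impurity(values, labels, threshold):
    """Weighted Gini impurity after splitting on values <= threshold."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


# Made-up predictor values and next-day directions.
indicator = [10, 20, 30, 40, 50, 60]
direction = ["up", "up", "up", "down", "down", "down"]

print(gini(direction))                           # 0.5
print(split_impurity(indicator, direction, 30))  # 0.0
print(split_impurity(indicator, direction, 20))
```

The threshold with the lowest weighted impurity (here 30, which separates the classes perfectly) is the one a tree-building algorithm would prefer for this predictor.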

The main steps to build a decision tree are:

1. Retrieve market data for a financial instrument.
2. Introduce the Predictor variables (e.g. Technical indicators, Sentiment indicators, Breadth indicators, etc.)
3. Set up the Target variable or the desired output.
4. Split data between training and test data.
5. Generate the decision tree by training the model.
6. Test and analyze the model.
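As an illustration of step 2, the sketch below builds one very simple predictor by hand: whether the close is above its 3-day simple moving average. The indicator choice, the window, and the helper name `simple_moving_average` are assumptions made for this example; in practice a library such as Ta-lib would compute the indicators.

```python
def simple_moving_average(prices, window):
    """SMA aligned with the input; None until `window` prices are available."""
    sma = []
    for i in range(len(prices)):
        if i + 1 < window:
            sma.append(None)
        else:
            sma.append(sum(prices[i + 1 - window : i + 1]) / window)
    return sma


closes = [100.0, 102.0, 101.0, 103.0, 105.0]  # made-up closing prices
sma3 = simple_moving_average(closes, window=3)

# Predictor column: is today's close above its 3-day average? (None = not enough data)
above_sma = [None if m is None else int(c > m) for c, m in zip(closes, sma3)]

print(sma3)       # [None, None, 101.0, 102.0, 103.0]
print(above_sma)  # [None, None, 0, 1, 1]
```

Each value uses only prices up to and including that day, so the predictor does not leak future information into the model.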

The first four steps are common data-processing operations. If you are a newcomer to decision trees, the predictor and target variables may sound exotic. However, they are nothing more than additional columns in the data frame that contain some type of indicator. These indicators, or predictors, are used to predict the target variable: whether the financial instrument will go up or down for the classification model, or the next day's return for the regression model. Likewise, splitting the data is a mandatory task in any backtesting process (ML or not). The idea is to have one set of data to train the model and another set, not used in training, to test it.
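For time series, one common way to split is chronologically, so that the test set always comes after the training set. The following is a minimal sketch under that assumption; `chronological_split` and the 80/20 proportion are illustrative choices, not a prescription.

```python
def chronological_split(rows, train_fraction=0.8):
    """Split time-ordered rows into (train, test) without shuffling."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


# Ten made-up daily rows, oldest first: (day index, predictor, target).
rows = [(day, day % 3, 1 if day % 2 else -1) for day in range(10)]

train, test = chronological_split(rows, train_fraction=0.8)
print(len(train), len(test))      # 8 2
print(train[-1][0] < test[0][0])  # True
```

Shuffling before splitting would mix future rows into the training set, which tends to make test results look better than they would be in live trading.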

Steps 5 and 6 are specific to the ML algorithms for decision trees. As we will see, the implementation in Python is quite simple. However, it is fundamental to understand the parameterization and the analysis of the results well. This post is eminently practical; to go deeper into the underlying mathematics, we recommend reading the references at the bottom of the post.
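To give a feel for what steps 5 and 6 look like, here is a minimal sketch using Sklearn's `DecisionTreeClassifier` on synthetic data. The data, the split, and every parameter value are assumptions chosen for illustration; they are not the settings used later in this series.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-ins for predictors and an up/down target (not market data).
X = rng.normal(size=(500, 3))
y = np.where(X[:, 0] + 0.5 * rng.normal(size=500) > 0, 1, -1)

# Chronological split: first 70% of rows for training, the rest for testing.
cut = len(X) * 7 // 10
X_train, X_test = X[:cut], X[cut:]
y_train, y_test = y[:cut], y[cut:]

# Step 5: train the tree. The parameter values here are illustrative only.
clf = DecisionTreeClassifier(
    criterion="gini",      # split-quality measure
    max_depth=3,           # cap on tree depth, limits overfitting
    min_samples_leaf=10,   # minimum number of samples required in each leaf
    random_state=0,
)
clf.fit(X_train, y_train)

# Step 6: test on data the tree has not seen.
print("Depth:", clf.get_depth(), "Leaves:", clf.get_n_leaves())
print("Test accuracy:", clf.score(X_test, y_test))
```

Parameters such as `max_depth` and `min_samples_leaf` control how closely the tree can fit the training data, which is why understanding them matters as much as the few lines of code needed to call them.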

#### Getting the Data

The raw material for any algorithm is data. In our case, it is the time series of financial instruments such as indices and stocks, which usually contain details like the opening price, maximum, minimum, closing price and volume. This information is recorded at a certain frequency, such as minutes, hours, days or weeks, and forms a time series.

Here, we are going to work with more than twenty years of daily data from the E-mini S&P 500 that we will retrieve through Quandl.

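```python
import quandl

# Download the E-mini S&P 500 futures series used in this post.
df = quandl.get("CHRIS/CME_ES2")

df.tail()   # inspect the most recent rows
df.shape    # (5391, 8)
```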

We now have just over 21 years of E-mini S&P 500 data available. We will use the settle price as the closing price reference.
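Using the settle price as the close can be as simple as copying one column. The sketch below uses a small made-up data frame; the column name `Settle` is an assumption about how the downloaded table labels it, and `Close` is a name of our own choosing.

```python
import pandas as pd

# Made-up rows mimicking a futures table with a "Settle" column (assumed name).
df = pd.DataFrame(
    {
        "Open": [2900.0, 2910.0, 2905.0],
        "High": [2915.0, 2920.0, 2912.0],
        "Low": [2895.0, 2900.0, 2890.0],
        "Settle": [2910.0, 2904.0, 2895.0],
    },
    index=pd.to_datetime(["2019-01-02", "2019-01-03", "2019-01-04"]),
)

# Use the settle price as the closing-price reference.
df["Close"] = df["Settle"]

print(df[["Open", "High", "Low", "Close"]])
```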

In the next post, the author will discuss creating the variables.
