Building a Zipline Bundle for Yahoo CSV Files – Part I


Posted August 11, 2022 at 11:52 am
Mario Pisa
QuantInsti

Zipline is a fantastic tool for backtesting, and data is the main raw material for this kind of analysis. In this post, we are going to focus on how to load our own data files. Through an example, we will create a bundle to load data from csv files downloaded from Yahoo Finance.

We cover:

  • Zipline recap
  • A bundle overview
  • Creating a bundle for Yahoo csv daily data
  • Registering the bundle
  • Ingesting data into Zipline
  • Run a backtest with the new bundle

Zipline recap

As we saw in the last post, the Zipline library is a powerful tool for backtesting that lets us focus on the strategy, although only after we have made the effort to get the system ready.

Although Quantopian has stopped operations, we can still enjoy the great work they did with the Zipline library.

In this blog, we will see how to load data in Zipline from several sources such as Yahoo. The data will come from csv files for undated instruments such as:

  • Stocks,
  • ETFs,
  • CFDs,
  • FX, etc.

Before reading on, remember that if you want to simplify your life, you can use Blueshift, which provides historical data for backtesting as well as real-time data, with connections to several brokers so you can take your algorithm live with little effort. Otherwise, keep reading.

Zipline calls the process of loading data into its repository the ingest process. The connector that reads a data source and loads it into Zipline is the bundle script.

By default, the Zipline library comes with a few bundles to connect with, e.g., the Quandl Wiki DB and csv files. Yet we usually need to connect to other data sources with different formats, column names, etc.

For this reason, we need to create a bundle in order to be able to ingest the data and run backtests over them. That’s the topic we discuss here.

A bundle overview

A bundle is an ETL tool. Extraction, Transformation and Load (ETL) is a well-known process in data science. It means that the bundle Python script needs to connect to a data source (web, file or database) and then:

  • Extract the data and load it into memory in a convenient data structure such as a DataFrame.
  • Transform the data by cleaning and normalizing the NA values, column names, dates and times, etc.
  • Finally, load the normalized data into the Zipline data repository. By default, this is a local store (bcolz files for pricing data and SQLite for adjustments and asset metadata), although other formats can be used.
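The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample rows are made up, the function names extract, transform and load are hypothetical, and an in-memory SQLite table stands in for the Zipline repository. It does not use Zipline's own writers.

```python
import io
import sqlite3

import pandas as pd

# Hypothetical sample standing in for a downloaded csv file.
RAW_CSV = """Date,Open,High,Low,Close,Adj Close,Volume
2022-01-03,10.0,11.0,9.5,10.5,10.4,1000
2022-01-04,10.5,11.5,10.0,,10.9,1200
2022-01-05,11.0,12.0,10.8,11.8,11.7,900
"""


def extract(source):
    """Extract: read the raw csv into a DataFrame indexed by date."""
    return pd.read_csv(source, parse_dates=["Date"], index_col="Date")


def transform(frame):
    """Transform: drop rows with missing values and lower-case the column names."""
    return frame.dropna().rename(columns=str.lower).sort_index()


def load(frame, connection):
    """Load: write date, close and volume into an SQLite table, return the row count."""
    connection.execute("CREATE TABLE prices (date TEXT, close REAL, volume INTEGER)")
    rows = [
        (date.strftime("%Y-%m-%d"), float(close), int(volume))
        for date, close, volume in zip(frame.index, frame["close"], frame["volume"])
    ]
    connection.executemany("INSERT INTO prices VALUES (?, ?, ?)", rows)
    return connection.execute("SELECT COUNT(*) FROM prices").fetchone()[0]


# The in-memory database is an assumption made to keep the sketch self-contained.
connection = sqlite3.connect(":memory:")
rows_loaded = load(transform(extract(io.StringIO(RAW_CSV))), connection)
print(rows_loaded)
```

The row with the missing close price is dropped in the transform step, so only the complete rows reach the table.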

Although it may seem like an overwhelming task, we can use the available csvdir bundle as a template, which makes the bundle development a bit easier.


Creating a bundle for Yahoo csv daily data

Let’s assume we have a folder with daily data downloaded from Yahoo. Note that, by default, the csvdir.py script looks for the data inside folders named daily and minute, hence we need to include Yahoo’s csv files inside the daily folder.
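As a quick check of the folder layout, the following sketch lists the symbols found in the daily folder. It assumes one file per symbol named <symbol>.csv; the helper list_symbols and the sample symbols are hypothetical, and a temporary folder is used only to make the example runnable.

```python
import os
import tempfile


def list_symbols(csvdir, tframe="daily"):
    """Return the sorted symbols found as <symbol>.csv inside csvdir/tframe."""
    folder = os.path.join(csvdir, tframe)
    if not os.path.isdir(folder):
        raise ValueError("'{}' folder not found in '{}'".format(tframe, csvdir))
    return sorted(
        os.path.splitext(name)[0]
        for name in os.listdir(folder)
        if name.endswith(".csv")
    )


# Build a throwaway folder with the expected layout to try the helper.
with tempfile.TemporaryDirectory() as csvdir:
    daily = os.path.join(csvdir, "daily")
    os.makedirs(daily)
    for symbol in ("IBM", "AAPL"):
        with open(os.path.join(daily, symbol + ".csv"), "w") as handle:
            handle.write("Date,Open,High,Low,Close,Adj Close,Volume\n")
    symbols = list_symbols(csvdir)

print(symbols)
```

If the daily folder is missing, the helper raises an error instead of silently returning an empty list, which makes a misplaced file easier to spot before running the ingestion.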

The whole process in one line:

We need to read the data, transform them to the Zipline format and load them into the Zipline repository. This is the ETL process.

We will use the csvdir bundle included with the library as a template. The csvdir.py script is inside the following folder:

~/opt/miniconda3/envs/zipline35/lib/python3.5/site-packages/zipline/data/bundles

The first part of the path, up to and including the Conda environment name (zipline35 in this example), depends on your machine and on the environment you are using. Our customized bundle file must be in that folder too.

First, let’s copy csvdir.py to a file whose name reflects what we are going to do. Here, we will make a bundle for Yahoo data on instruments listed on the NYSE, so we will call it yahoo_NYSE.py.

Open the new yahoo_NYSE.py bundle in your favourite editor. We are going to edit it to adapt the Yahoo data to the Zipline data format so that we can use it in the ingestion process.

If we look inside the file, we find the functions, classes and methods needed to carry out the ETL process. In this post we won’t explain all the code; the API documentation covers that. Here we’ll look only at the parts we need to understand and change.

Change the name of the main function. I like to use the same name as the file, so the function will be called yahoo_NYSE.

This function accepts two input parameters. The first one is a list of the data frequencies: minute, daily or both. The second one is the folder where we have the data, in this case the Yahoo daily data. We don’t use these parameters at this point, but it is useful to be aware of them.

This function returns the ingest method of a class named CSVDIRBundle. Rename this class to, for example, Yahoo_NYSEBundle.
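The renaming described above can be pictured with a stripped-down stand-in. This is not the code in csvdir.py: the ingest body here is a placeholder that only reports the stored parameters, and the folder path is hypothetical. Only the names yahoo_NYSE and Yahoo_NYSEBundle come from the text.

```python
class Yahoo_NYSEBundle:
    """Simplified stand-in for the renamed CSVDIRBundle wrapper class."""

    def __init__(self, tframes=None, csvdir=None):
        self.tframes = tframes  # list of data frequencies, e.g. ["daily"]
        self.csvdir = csvdir    # folder that holds the csv files

    def ingest(self, *writer_args, **writer_kwargs):
        # Placeholder body (assumption): the method in the real file does the
        # ingestion work; here we only return the stored configuration.
        return {"tframes": self.tframes, "csvdir": self.csvdir}


def yahoo_NYSE(tframes=None, csvdir=None):
    """Return the callable that the ingest step would invoke."""
    return Yahoo_NYSEBundle(tframes, csvdir).ingest


# Hypothetical folder path, used only for illustration.
ingest = yahoo_NYSE(tframes=["daily"], csvdir="/path/to/yahoo/csv")
config = ingest()
print(config)
```

The point of the pattern is that the two parameters are captured once, when the function is called, and travel with the returned callable until the ingestion actually runs.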

At lines 92, 97 and 98 we need to change the bundle name, which is the name we pass to Zipline’s ingest command. Line 97 indicates the name under which the bundle will be registered inside Zipline.

Inside the function declared at line 98, we can see the data format expected by Zipline, along with some code that deals with the input parameters and works with metadata, splits, etc.

At line 161, we need to replace the CSVDIR market calendar with the generic market calendar for the NYSE.

The function we need to modify to adapt our data to the Zipline format is named _pricing_iter, at line 171. This function reads the csv files and loads them into the Zipline DB.

Here we can see the key part of the code:

It reads the csv files and, after that, we can inspect the content, modify the column names, drop the NA values or make any other change the data requires. For example, in line 188, we drop any duplicate dates.
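The kind of clean-up described here might look like the sketch below. It is not the author's code: the function name normalize_yahoo_frame and the sample rows are hypothetical. It assumes the usual Yahoo download header (Date, Open, High, Low, Close, Adj Close, Volume) and the lower-case open, high, low, close and volume columns that the csvdir bundle works with.

```python
import io

import pandas as pd

# Hypothetical sample with a duplicated date and a row of "null" values.
YAHOO_CSV = """Date,Open,High,Low,Close,Adj Close,Volume
2022-01-03,10.0,11.0,9.5,10.5,10.4,1000
2022-01-04,10.5,11.5,10.0,11.0,10.9,1200
2022-01-04,10.5,11.5,10.0,11.0,10.9,1200
2022-01-05,null,null,null,null,null,null
2022-01-06,11.0,12.0,10.8,11.8,11.7,900
"""

COLUMN_NAMES = {
    "Open": "open",
    "High": "high",
    "Low": "low",
    "Close": "close",
    "Volume": "volume",
}


def normalize_yahoo_frame(source):
    """Read a Yahoo csv and return clean, lower-case OHLCV data indexed by date."""
    frame = pd.read_csv(
        source, parse_dates=["Date"], index_col="Date", na_values=["null"]
    )
    frame = frame.rename(columns=COLUMN_NAMES)
    frame = frame[["open", "high", "low", "close", "volume"]]
    frame = frame.dropna()                                # drop rows with NA values
    frame = frame[~frame.index.duplicated(keep="first")]  # drop duplicate dates
    frame.index.name = "date"
    return frame.sort_index()


normalized = normalize_yahoo_frame(io.StringIO(YAHOO_CSV))
print(normalized)
```

The Adj Close column is simply left out in this sketch; whether to use adjusted or unadjusted prices is a separate decision that depends on how splits and dividends are handled.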

We can include as many print statements as needed to trace the code execution.

The key here is to align the csv data index with the NYSE market calendar. Line 207 needs the sessions variable to do that.

We create the session dates from the first date to the last date of our data. Include this line at line 154, after the time frame is defined.

Add the variable to the arguments of the write function call, at line 156.

And accept it in the input parameters of the _pricing_iter function.

Finally, comment out or delete the last line of code, because we want to use the NYSE calendar with these data files.
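To make the alignment step more concrete, here is a rough sketch of reindexing price data to a list of sessions. It makes a strong simplifying assumption: plain weekdays from pandas stand in for the NYSE sessions, whereas the real calendar also excludes market holidays. The helper align_to_sessions, the sample data and the choice to forward-fill prices and set the volume to zero on filled days are all illustrative, not the author's implementation.

```python
import pandas as pd

# Hypothetical, already normalized daily data with 2022-01-05 missing.
frame = pd.DataFrame(
    {
        "open": [10.0, 10.5, 11.0, 11.5],
        "high": [11.0, 11.5, 12.0, 12.5],
        "low": [9.5, 10.0, 10.8, 11.2],
        "close": [10.5, 11.0, 11.8, 12.0],
        "volume": [1000.0, 1200.0, 900.0, 1100.0],
    },
    index=pd.to_datetime(["2022-01-03", "2022-01-04", "2022-01-06", "2022-01-07"]),
)

# Assumption: weekdays between the first and last date stand in for the
# NYSE sessions. The market calendar would also remove holidays.
sessions = pd.bdate_range(frame.index[0], frame.index[-1])


def align_to_sessions(frame, sessions):
    """Reindex the data to the sessions, filling the gaps that appear."""
    aligned = frame.reindex(sessions)
    price_columns = ["open", "high", "low", "close"]
    aligned[price_columns] = aligned[price_columns].ffill()  # carry last prices forward
    aligned["volume"] = aligned["volume"].fillna(0)          # no trades on filled days
    return aligned


aligned = align_to_sessions(frame, sessions)
print(aligned)
```

After the reindex, the data has exactly one row per session, which is the property the alignment with the market calendar is meant to guarantee.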

Stay tuned for the next installment to learn how to test the new bundle.

Visit QuantInsti for additional insight on this topic: https://blog.quantinsti.com/zipline-bundle-yahoo/.

Disclosure: Interactive Brokers

Information posted on IBKR Campus that is provided by third-parties does NOT constitute a recommendation that you should contract for the services of that third party. Third-party participants who contribute to IBKR Campus are independent of Interactive Brokers and Interactive Brokers does not make any representations or warranties concerning the services offered, their past or future performance, or the accuracy of the information provided by the third party. Past performance is no guarantee of future results.

This material is from QuantInsti and is being posted with its permission. The views expressed in this material are solely those of the author and/or QuantInsti and Interactive Brokers is not endorsing or recommending any investment or trading discussed in the material. This material is not and should not be construed as an offer to buy or sell any security. It should not be construed as research or investment advice or a recommendation to buy, sell or hold any security or commodity. This material does not and is not intended to take into account the particular financial conditions, investment objectives or requirements of individual customers. Before acting on this material, you should consider whether it is suitable for your particular circumstances and, as necessary, seek professional advice.
