Faster Alternatives to Pandas

Background

If you’ve done any kind of data analysis in Python, you’ve almost certainly used pandas. It is widely used in the data world, but if you’ve run into memory or speed issues with it, you’re not alone. This post discusses several faster alternatives to pandas.

R’s data table in Python

If you’ve used R, you’re probably familiar with the data.table package. A port of this library is also available in Python. In this example, we show how you can read in a CSV file faster than using standard pandas. For our purposes, we’ll be using an open source dataset from the UCI repository.

import datatable
import time
 
start = time.time()
os_scan_data = datatable.fread("OS Scan_dataset.csv", header = None)
end = time.time()
 
print(end - start)

Using datatable, we can read in the CSV file in ~20 seconds. Reading the same file using pandas takes almost 76 seconds!
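The pandas side of that comparison isn't shown above, so here is a minimal sketch of how such a timing could be taken. The time_call helper and the small synthetic CSV are assumptions made for illustration. They are not the UCI dataset, so the numbers will differ from the ones quoted.

```python
import csv
import os
import tempfile
import time

import pandas as pd


def time_call(func, *args, **kwargs):
    """Run func once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


# Hypothetical stand-in data: 1,000 rows x 5 columns with no header row.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.csv")
with open(sample_path, "w", newline="") as handle:
    writer = csv.writer(handle)
    for row in range(1000):
        writer.writerow([row, row * 2, row * 3, row % 7, row / 10])

# Time the standard pandas reader on the stand-in file.
pandas_frame, pandas_seconds = time_call(pd.read_csv, sample_path, header=None)
print(pandas_frame.shape, f"{pandas_seconds:.4f} s")
```

The same helper can wrap the datatable call from above, for example time_call(datatable.fread, "OS Scan_dataset.csv", header=None), so both readers are measured the same way.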

Next, we can also sort faster with datatable.

start = time.time()
os_scan_data[0].sort()
end = time.time()
 
print(end - start)

In datatable, this takes ~0.002 seconds, but takes ~0.934 seconds in pandas.
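The pandas code behind that comparison isn't listed. A rough equivalent, assuming the first column was sorted with sort_values, might look like the sketch below. The random DataFrame is a stand-in for the real dataset.

```python
import time

import numpy as np
import pandas as pd

# Hypothetical stand-in data: 100,000 rows x 3 columns labeled 0, 1, 2.
rng = np.random.default_rng(0)
sample = pd.DataFrame(rng.random((100_000, 3)))

# Sort the first column, mirroring os_scan_data[0].sort() in datatable.
start = time.perf_counter()
sorted_column = sample[0].sort_values()
elapsed = time.perf_counter() - start

print(f"{elapsed:.4f} s")
```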

In a later article, we’ll go into more detail with datatable. In the meantime, see the datatable documentation.

The modin package

modin is another pandas alternative to speed up functions while keeping the syntax largely the same. modin works by utilizing the multiple cores available on a machine (like your laptop, for instance) to run pandas operations in parallel. Since most laptops have between four and eight cores, this means you can still have a performance boost even without using a more powerful server.

First, let’s install modin using pip. For this step, we’re going to install all the dependencies, which include dask and ray. These will not be installed if you leave out the “[all]” piece of the installation command.

pip install modin[all]

Next, we can get started by importing the package like below. We’ll also import the time package to compare runtimes.

import modin.pandas as pd
import time
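Because modin keeps the pandas syntax, one common pattern is to fall back to standard pandas when modin isn't installed. The sketch below shows that, along with a check of how many cores are available for parallel work. The backend and core_count names are just illustrative.

```python
import os

try:
    import modin.pandas as pd  # parallel drop-in replacement, if installed
    backend = "modin"
except ImportError:
    import pandas as pd  # fall back to standard pandas
    backend = "pandas"

# os.cpu_count() can return None on some platforms, so default to 1.
core_count = os.cpu_count() or 1

print(f"Using {backend} with {core_count} logical core(s) available")
```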

For this example, we’ll be using the same OS Scan dataset file as in the previous section.

os_scan_data = pd.read_csv("OS Scan_dataset.csv", header = None)

Also, we’re going to increase the size of the dataset artificially by simply duplicating the rows multiple times:

combined_data = pd.concat([os_scan_data, os_scan_data, os_scan_data])

This gives us a dataset with over 5 million rows and 115 columns.
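One thing to keep in mind when stacking copies this way is that concat keeps the original row labels, so the index repeats. The toy example below uses standard pandas and a hypothetical two-row frame to show the effect, plus the ignore_index option that renumbers the rows.

```python
import pandas as pd

# Hypothetical two-row frame standing in for the real dataset.
small = pd.DataFrame({0: [1.5, 2.5], 1: [10, 20]})

# Stacking three copies triples the row count but repeats the index labels.
stacked = pd.concat([small, small, small])

# ignore_index=True renumbers the rows from 0 to n - 1.
reindexed = pd.concat([small, small, small], ignore_index=True)

print(stacked.shape)
print(stacked.index.tolist())
print(reindexed.index.tolist())
```

Repeated labels don't matter for the timing test in this post, but they can cause surprises with label-based lookups later on.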

Next, let’s create a new column using the map function. Using modin, we’ll be able to generate the new field in around 0.03 seconds.

start = time.time()
combined_data["test"] = combined_data[9].map(lambda val: "above" if val > 3 else "below")
end = time.time()
 
print(end - start)

With standard pandas, the same operation takes ~1.34 seconds.
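For reference, here is the same map call written with standard pandas, next to a vectorized version using np.where. The vectorized version isn't part of the original comparison. It's a different technique that often helps in plain pandas because it avoids calling a Python function per row. The random data below is a stand-in for the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data: 1,000 rows x 10 columns of values in [0, 6).
rng = np.random.default_rng(0)
frame = pd.DataFrame(rng.uniform(0, 6, size=(1000, 10)))

# Row-by-row labeling, as in the modin example above.
mapped = frame[9].map(lambda val: "above" if val > 3 else "below")

# Vectorized alternative: one comparison over the whole column.
vectorized = np.where(frame[9] > 3, "above", "below")

frame["test"] = vectorized
print(frame["test"].value_counts())
```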

See the modin documentation to learn more.

The PandaPy library

PandaPy is another alternative to pandas. According to its documentation page, PandaPy is recommended as a potentially faster alternative to pandas when the data you’re dealing with has fewer than 50,000 rows, and possibly up to 500,000 rows, depending on the data. Another benefit of this package is that it often reduces the amount of memory needed to store datasets when you have mixed data types.

PandaPy can be downloaded via pip:

pip install pandapy

For this example, we’ll use a credit card dataset from Kaggle. Now, we can read in the data. In PandaPy, the dataset is read in as a structured array.

import pandapy as pp
 
# read in dataset
credit_data = pp.read("creditcard.csv")
 
# get descriptive stats
pp.describe(credit_data)

General column operations are similar – for example, we can divide two columns just like in pandas:

credit_data["V1"] / credit_data["V2"]

Similarly, we can get the mean of a column just like pandas:

credit_data["V1"].mean()
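These column operations look familiar because a structured array is a NumPy array with named fields. The sketch below builds a tiny one directly with NumPy rather than through PandaPy. The three rows and the Class field are made-up stand-ins, not values from the Kaggle file.

```python
import numpy as np

# Hypothetical records with named fields, each with its own data type.
records = np.array(
    [(1.0, 2.0, 0), (3.0, 4.0, 1), (5.0, 10.0, 0)],
    dtype=[("V1", "f8"), ("V2", "f8"), ("Class", "i8")],
)

# Selecting a field by name returns a regular NumPy array.
ratio = records["V1"] / records["V2"]
mean_v1 = records["V1"].mean()

print(records.dtype.names)
print(ratio)
print(mean_v1)
```

Each field is stored with a fixed-size type, which is one reason mixed-type data can take less memory in this layout.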

See the PandaPy documentation for more.

numpy

Several pandas functions can be implemented more efficiently using numpy. For example, if you want to calculate quantiles, such as the 90th or 95th percentile, you can use either pandas or numpy. However, numpy will generally be faster.

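import time
import numpy as np
 
# email_data is assumed to be a pandas DataFrame with a numeric column W
# (it is not defined elsewhere in this post)
 
# pandas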
start = time.time()
 
q = np.arange(0.05, 1, 0.05)
quantiles = [email_data.W.quantile(val) for val in q]
end = time.time()
 
print(end - start)
 
# numpy
start = time.time()
 
q = np.arange(0.05, 1, 0.05)
quantiles = [np.quantile(email_data.W, val) for val in q]
end = time.time()
 
print(end - start)
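Both versions above call the quantile function once per value. np.quantile also accepts an array of quantiles, so the whole grid can be computed in a single call. The sketch below uses a synthetic column as a stand-in for email_data.W and builds the 5% to 95% grid with np.linspace.

```python
import time

import numpy as np
import pandas as pd

# Hypothetical stand-in for email_data: one numeric column named W.
rng = np.random.default_rng(0)
email_data = pd.DataFrame({"W": rng.normal(size=200_000)})

# Quantile grid from 5% to 95% in 5% steps.
q = np.linspace(0.05, 0.95, 19)

# One np.quantile call per value, as in the loop above.
start = time.perf_counter()
loop_quantiles = [np.quantile(email_data.W, val) for val in q]
loop_seconds = time.perf_counter() - start

# A single call over the whole grid.
start = time.perf_counter()
vector_quantiles = np.quantile(email_data["W"].to_numpy(), q)
vector_seconds = time.perf_counter() - start

# pandas accepts an array of quantiles as well.
pandas_quantiles = email_data["W"].quantile(q)

print(f"loop: {loop_seconds:.4f} s, single call: {vector_seconds:.4f} s")
```

All three approaches use linear interpolation by default, so they return the same values.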

Conclusion

That’s all for this post! These are just a few of the alternatives to pandas. If you’d like to see tutorials on other alternatives, feel free to let me know. Also, if you enjoyed reading this article, make sure to share it with others! Check out my other Python posts by clicking here.

Visit TheAutomatic.net for additional insight on this article: https://theautomatic.net/2021/10/09/faster-alternatives-to-pandas/.

Disclosure: Interactive Brokers Third Party

Information posted on IBKR Campus that is provided by third-parties does NOT constitute a recommendation that you should contract for the services of that third party. Third-party participants who contribute to IBKR Campus are independent of Interactive Brokers and Interactive Brokers does not make any representations or warranties concerning the services offered, their past or future performance, or the accuracy of the information provided by the third party. Past performance is no guarantee of future results.

This material is from TheAutomatic.net and is being posted with its permission. The views expressed in this material are solely those of the author and/or TheAutomatic.net and Interactive Brokers is not endorsing or recommending any investment or trading discussed in the material. This material is not and should not be construed as an offer to buy or sell any security. It should not be construed as research or investment advice or a recommendation to buy, sell or hold any security or commodity. This material does not and is not intended to take into account the particular financial conditions, investment objectives or requirements of individual customers. Before acting on this material, you should consider whether it is suitable for your particular circumstances and, as necessary, seek professional advice.
