Excerpt
Tick data is the most granular high-frequency data available, and so is the most useful in market microstructure analysis. Unfortunately, tick data is also the most susceptible to data corruption and so must be cleaned and conditioned prior to being used for any type of analysis.
This article, written by Ryan Maxwell, examines how to identify and handle corrupt tick data (analysts unfamiliar with tick data may want to read an intro to tick data first).
Causes of data corruption
Tick data is especially vulnerable to data corruption because of the sheer volume of data: a tick data set for a high-volume stock such as MSFT (Microsoft) can easily amount to 100,000 ticks per day, making error detection very challenging. Typically, signal interruptions or signal delays cause either corrupted or out-of-sequence data.
Defining ‘Bad’ Data
Before building data filters, we first need to define what constitutes a bad tick. A common error is to make the test too restrictive and so eliminate valid data merely because it is inconsistent with neighbouring data points. In fact, these ticks are often the most useful in trading simulations, because they carry information on market direction or are trading opportunities themselves.
Thus, there is a need to balance the tradeoff between data completeness and data integrity based on how sensitive the analysis is to bad data.
What tools to use for data checking/cleaning
Unfortunately, there are very few off-the-shelf tools for cleaning time-series data, and Excel is not suitable because of its memory requirements and row limit (a worksheet holds at most 1,048,576 rows, which may be only a few weeks of tick data). Tools such as OpenRefine (formerly Google Refine) are typically better suited to structured data such as customer data.
Custom Python scripts are probably the most flexible and efficient method, and are the approach most commonly used in machine learning on time-series datasets.
Types of corrupt data and tests
There are numerous types of bad ticks, and each type will require a different test:
Zero or Negative Prices or Volumes
This is the simplest test – ticks with zero or negative prices or volumes are clearly errors and can be immediately discarded.
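As a minimal sketch of this test, the snippet below drops any tick whose price or volume is not strictly positive. The `Tick` layout (a millisecond timestamp, a price and a volume) and the sample values are assumptions made for illustration, not a format prescribed by the article.

```python
from typing import NamedTuple


class Tick(NamedTuple):
    """Hypothetical trade tick layout used only for this illustration."""
    ts_ms: int      # milliseconds since midnight (assumed convention)
    price: float
    volume: int


def drop_nonpositive(ticks):
    """Keep only ticks whose price and volume are both strictly positive."""
    return [t for t in ticks if t.price > 0 and t.volume > 0]


# Made-up sample data: two of the four ticks are clearly invalid.
sample = [
    Tick(34_200_001, 370.10, 100),
    Tick(34_200_002, 0.0, 200),     # zero price
    Tick(34_200_003, 370.12, -50),  # negative volume
    Tick(34_200_004, 370.11, 300),
]

clean = drop_nonpositive(sample)
```

Because the comparison is `> 0`, a missing price stored as NaN would also be dropped, since NaN fails every ordering comparison.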
Simultaneous Observations
Multiple ticks can often be observed for the same timestamp. Since ultra-high-frequency models of tick data typically require a single observation for each timestamp, some form of aggregation needs to be performed. For bid/ask tick data, we would use the highest bid and the lowest offer (provided the bid is still less than or equal to the offer) and aggregate the volumes for both the bid and the ask price.
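The quote aggregation described above could be sketched roughly as follows. The `Quote` layout and the sample values are hypothetical. The sketch also makes two assumptions the article does not spell out: sizes are summed across every quote sharing the timestamp, and timestamps whose highest bid exceeds the lowest offer are set aside for review rather than aggregated.

```python
from itertools import groupby
from typing import NamedTuple


class Quote(NamedTuple):
    """Hypothetical bid/ask tick layout used only for this illustration."""
    ts_ms: int
    bid: float
    bid_size: int
    ask: float
    ask_size: int


def aggregate_quotes(quotes):
    """Collapse quotes sharing a timestamp into one quote per timestamp.

    Returns (aggregated, crossed): the aggregated quotes, plus the
    timestamps where the highest bid exceeded the lowest offer.
    """
    aggregated, crossed = [], []
    ordered = sorted(quotes, key=lambda q: q.ts_ms)  # groupby needs sorted input
    for ts, group in groupby(ordered, key=lambda q: q.ts_ms):
        group = list(group)
        best_bid = max(q.bid for q in group)
        best_ask = min(q.ask for q in group)
        if best_bid > best_ask:
            crossed.append(ts)  # assumption: flag instead of aggregating
            continue
        aggregated.append(Quote(
            ts,
            best_bid, sum(q.bid_size for q in group),
            best_ask, sum(q.ask_size for q in group),
        ))
    return aggregated, crossed


# Made-up sample: timestamp 1 has two consistent quotes, timestamp 3 is crossed.
quotes = [
    Quote(1, 370.10, 100, 370.12, 200),
    Quote(1, 370.11, 50, 370.13, 100),
    Quote(2, 370.11, 75, 370.12, 80),
    Quote(3, 370.15, 10, 370.16, 10),
    Quote(3, 370.14, 10, 370.13, 10),
]

agg_quotes, crossed_ts = aggregate_quotes(quotes)
```

An alternative reading is to sum only the sizes quoted at the best bid and best offer; which is appropriate depends on how the data feed reports depth.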
Trade data is more problematic as it cannot be easily aggregated. We would normally favour aggregating the volumes and then using a single volume-weighted price.
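For trades, the volume-weighted approach mentioned above can be illustrated with a short sketch. The `Trade` layout and sample values are hypothetical, and the code assumes zero or negative volumes have already been removed, so the total volume in each group is positive.

```python
from itertools import groupby
from typing import NamedTuple


class Trade(NamedTuple):
    """Hypothetical trade tick layout used only for this illustration."""
    ts_ms: int
    price: float
    volume: int


def aggregate_trades(trades):
    """Merge trades sharing a timestamp into one volume-weighted trade."""
    merged = []
    ordered = sorted(trades, key=lambda t: t.ts_ms)  # groupby needs sorted input
    for ts, group in groupby(ordered, key=lambda t: t.ts_ms):
        group = list(group)
        total_volume = sum(t.volume for t in group)
        # Volume-weighted average price: sum(price * volume) / sum(volume).
        vwap = sum(t.price * t.volume for t in group) / total_volume
        merged.append(Trade(ts, vwap, total_volume))
    return merged


# Made-up sample: two trades at timestamp 1, one trade at timestamp 2.
trades = [
    Trade(1, 370.10, 100),
    Trade(1, 370.20, 300),
    Trade(2, 370.15, 50),
]

merged_trades = aggregate_trades(trades)
```

Here the two trades at timestamp 1 collapse into a single trade of 400 shares priced at roughly 370.175, which sits closer to the larger trade's price.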
Bid/Ask Bounce
This is the phenomenon of the price appearing to ‘bounce’ around when in fact the bid/ask quote remains the same: traders selling at the bid and buying at the offer give the impression of price movement in the trade tick data.
The bid/ask bounce is the major reason why many analysts use only the bid/ask tick sequence and ignore the trade tick data. However, if using the trade tick data is essential, one method of eliminating ‘bounces’ is to accept only trade ticks that move the price by more than the bid/ask spread from the prior tick (this is a major reason why it is necessary to have both bid/ask tick data and trade tick data).
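A rough sketch of this spread-based filter is shown below. It assumes each trade has already been matched with the bid and offer prevailing at that moment (the matching step is not shown), that prices are stored as integer cents to avoid floating-point comparison issues, and that the ‘prior tick’ means the last accepted trade. All three are assumptions made for illustration, as is the `TradeWithQuote` layout.

```python
from typing import NamedTuple


class TradeWithQuote(NamedTuple):
    """Hypothetical trade tick annotated with the prevailing quote."""
    ts_ms: int
    price: int  # integer cents (assumed convention)
    bid: int    # prevailing bid, in cents
    ask: int    # prevailing offer, in cents


def filter_bounce(trades):
    """Keep trades that move the price by more than the prevailing spread.

    The first trade is always kept; each later trade is compared with the
    last accepted trade (an assumption about what 'prior tick' means).
    """
    accepted = []
    for t in trades:
        spread = t.ask - t.bid
        if not accepted or abs(t.price - accepted[-1].price) > spread:
            accepted.append(t)
    return accepted


# Made-up sample: the quote sits at 10.00/10.02 while trades alternate
# between bid and offer, then the quote moves to 10.05/10.07.
bounce_sample = [
    TradeWithQuote(1, 1000, 1000, 1002),
    TradeWithQuote(2, 1002, 1000, 1002),  # bounce to the offer
    TradeWithQuote(3, 1000, 1000, 1002),  # bounce back to the bid
    TradeWithQuote(4, 1002, 1000, 1002),  # bounce to the offer
    TradeWithQuote(5, 1007, 1005, 1007),  # genuine move
]

debounced = filter_bounce(bounce_sample)
```

In this sample only the first and last trades survive: the three middle trades move the price by no more than the two-cent spread. Note that this filter is deliberately coarse and will also discard genuine price moves smaller than the spread, which ties back to the tradeoff between data completeness and data integrity.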
Visit Quantpedia to read the full article:
https://quantpedia.com/working-with-high-frequency-tick-data-cleaning-the-data/
Disclosure: Interactive Brokers
Information posted on IBKR Campus that is provided by third-parties does NOT constitute a recommendation that you should contract for the services of that third party. Third-party participants who contribute to IBKR Campus are independent of Interactive Brokers and Interactive Brokers does not make any representations or warranties concerning the services offered, their past or future performance, or the accuracy of the information provided by the third party. Past performance is no guarantee of future results.
This material is from Quantpedia and is being posted with its permission. The views expressed in this material are solely those of the author and/or Quantpedia and Interactive Brokers is not endorsing or recommending any investment or trading discussed in the material. This material is not and should not be construed as an offer to buy or sell any security. It should not be construed as research or investment advice or a recommendation to buy, sell or hold any security or commodity. This material does not and is not intended to take into account the particular financial conditions, investment objectives or requirements of individual customers. Before acting on this material, you should consider whether it is suitable for your particular circumstances and, as necessary, seek professional advice.