Applying Transformers to Financial Time Series

The article “Applying Transformers to Financial Time Series” was originally published on PredictNow.ai.

In the previous blog post, we gave a very simple example of how traders can use self-attention transformers as a feature selection method: in this case, to select which previous returns of a stock to use for predictions or optimizations. To be precise, the transformer assigns weights to the different transformed features for downstream applications. In this post, we will discuss how traders can incorporate different feature series from the same stock while adding a sense of time. The technique we discuss is based partly on Prof. Will Cong’s AlphaPortfolio paper.

Recall that in the simple example in a Poor Person’s Transformer, the input X is just an n-vector of previous returns X=[R(t), R(t-1), …, R(t-n+1)]^T. Some of you fundamental analysts will complain “What about the fundamentals of a stock? Shouldn’t they be part of the input?” Sure they should! Let’s say, following AlphaPortfolio, we add B/M, EPS, …, all 51 fundamental variables of a company as input features. Furthermore, just as for the returns, we want to know the n previous snapshots of these variables. So we expand X from 1 to 52 columns (including the returns column). For concreteness, let’s say we use n=12 snapshots, captured at monthly intervals, and regard R(t) as the monthly return from t-1 to t. X is now a 12 × 52 matrix.

Are we ready to use them as input to our transformer? Goodness, no! As we said in our previous blog post, raw heterogeneous features, i.e. features that are from different spaces such as returns vs EPS, can’t be mixed up in a transformer without normalization. In AlphaPortfolio, the authors normalize them cross-sectionally, i.e. for each snapshot of time, compute the mean and std of a feature across the universe of stocks and thus turn every feature into a z-score. In our case, we only have 1 stock, so we have to normalize temporally, i.e. compute the mean and std of a feature in a lookback period in order to compute z-scores. For simplicity, we might as well use 12 months as the lookback period.
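The temporal normalization described above can be sketched in a few lines of NumPy. This is only one possible reading: it assumes a trailing 12-month window that includes the current month, so 23 months of history are needed to produce 12 normalized snapshots. The function name `temporal_zscore` and the array `raw` are hypothetical, and the data is synthetic.

```python
import numpy as np

def temporal_zscore(features, lookback=12):
    """Z-score each column against its own trailing `lookback` rows.

    features: (T, F) array, rows ordered oldest to newest.
    Returns a (T - lookback + 1, F) array; row m is the z-score of
    features[m + lookback - 1] relative to the window ending at that row.
    """
    features = np.asarray(features, dtype=float)
    T, F = features.shape
    out = np.empty((T - lookback + 1, F))
    for m in range(T - lookback + 1):
        window = features[m : m + lookback]      # trailing window, includes current row
        mean = window.mean(axis=0)
        std = window.std(axis=0, ddof=1)
        std = np.where(std == 0.0, 1.0, std)     # guard against constant columns
        out[m] = (features[m + lookback - 1] - mean) / std
    return out

rng = np.random.default_rng(0)
raw = rng.normal(size=(23, 52))    # synthetic stand-in: 23 months x 52 features
X = temporal_zscore(raw, lookback=12)
print(X.shape)                     # (12, 52)
```

Note that the rows here run oldest to newest, whereas the article indexes lags from the current month backwards; reverse the rows if you want row 0 to be time t.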

After normalizing all these features into z-scores, are we ready to use them as input to our transformer? Goodness, no! We usually project (linearly transform) the features into a higher dimensional embedding space before feeding them to a transformer. In our case, we will project the 52 features into a 64-dimensional space (the embedding space dimension is usually a power of 2 and is larger than the original feature space):

X_embed(t) = X(t) W_embed(t)

where W_embed(t) is a 52 × 64 projection / embedding matrix, with values to be optimized based on the downstream objective function, and X_embed(t) is a 12 × 64 input matrix projected to the embedding space.
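As a quick shape check, the projection is a single matrix multiplication. In this sketch the weights are random placeholders; in practice W_embed would be learned together with the rest of the model, and `X` here is a synthetic stand-in for the z-scored input.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_features, d = 12, 52, 64

# Stand-in for the normalized 12 x 52 input matrix.
X = rng.normal(size=(n, n_features))

# Random initialization only; these values are trained in a real model.
W_embed = rng.normal(scale=n_features ** -0.5, size=(n_features, d))

X_embed = X @ W_embed      # (12 x 52) @ (52 x 64)
print(X_embed.shape)       # (12, 64)
```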

After this embedding, are we ready to use them as input to our transformer? Goodness, no! (Ernie has been reading too much Pete The Cat to his kids, hence the idiom.) Unlike an LSTM, our transformer does not know that feature X(t) comes after X(t-1) in time: there is no sense of time ordering. We need to apply “positional encoding”. This is done by adding a “positional encoding vector” PE to each input that encodes its position in the time series:

X_input(i, j) = X_embed(i, j) + PE(i, j)

where i is the lookback from 0 to 11 months with i=0 pointing to the current time t, and j is the feature index in the embedding space, and

PE(i, j) = sin(i / 10000^(2k/d)) for even j
PE(i, j) = cos(i / 10000^(2k/d)) for odd j

with

k = floor(j/2)

So k=0 to d/2-1, that is from 0 to 64/2-1. (Recall d=64 is the embedding dimension.)
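Here is a minimal sketch of sinusoidal positional encoding, assuming the standard layout from the original transformer paper (sine on even embedding indices, cosine on odd ones). The function name `sinusoidal_pe` is hypothetical and `X_embed` is a random stand-in.

```python
import numpy as np

def sinusoidal_pe(n_positions, d):
    """Return an (n_positions, d) matrix of sinusoidal positional encodings."""
    i = np.arange(n_positions)[:, None]     # lookback index, shape (n, 1)
    k = np.arange(d // 2)[None, :]          # k = 0 .. d/2 - 1, shape (1, d/2)
    angles = i / 10000 ** (2 * k / d)       # shape (n, d/2)
    pe = np.zeros((n_positions, d))
    pe[:, 0::2] = np.sin(angles)            # even columns j = 2k
    pe[:, 1::2] = np.cos(angles)            # odd columns j = 2k + 1
    return pe

rng = np.random.default_rng(0)
X_embed = rng.normal(size=(12, 64))         # stand-in for the embedded input
PE = sinusoidal_pe(12, 64)
X_input = X_embed + PE
print(X_input.shape)                        # (12, 64)
```

At i=0 every sine term is 0 and every cosine term is 1, so the first row of PE alternates 0, 1, 0, 1, and so on.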

What’s the intuition behind this formula? In general, if we have a time series Y(t) sampled at discrete intervals, we can always perform a Fourier decomposition:

Y(t) = Σ_j [a_j cos(2πjt/N) + b_j sin(2πjt/N)]

PE(i, j) looks just like each of these cosine or sine basis functions, with 1/10000^(2k/64) playing the role of the frequency (2πj/N). So by adding PE, we are adding each of these cosine and sine basis functions to the original X_embed(t), and hope that the transformer will treat X_input=X_embed+PE as a time series. After all, as discussed in the previous blog post, the transformer is going to add the different basis functions up with some coefficients via matrix multiplications and make them into Q, K, V matrices:

Q = X_input W_Q,  K = X_input W_K,  V = X_input W_V

So Q, K, V are just different time series that are transformed versions of X_input, each a projection to spaces with different dimensions. We find this not particularly mathematically rigorous. But hey, this is engineering, not science, and the final judge is whether it works. Also, there are alternatives to sinusoidal positional encoding. Just ask ChatGPT!

Once we have these Q, K, and V, the rest is standard transformer stuff, whether this is a time series or a sequence of words, and the output is a context matrix Z of dimension 12 × 64 in our case (the same dimension as X_input). Each row of Z still represents a different lagged set of mixed features.
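For readers who want to see the "standard transformer stuff" spelled out, below is a sketch of single-head scaled dot-product self-attention. It assumes one head with 64-dimensional Q, K, and V and random placeholder weights; AlphaPortfolio's actual architecture may differ in the number of heads, dimensions, and additional layers. The names `self_attention`, `W_Q`, `W_K`, and `W_V` are illustrative.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X_input, W_Q, W_K, W_V):
    """Single-head scaled dot-product self-attention."""
    Q = X_input @ W_Q
    K = X_input @ W_K
    V = X_input @ W_V
    scores = Q @ K.T / np.sqrt(K.shape[1])    # (12, 12) similarity between lags
    A = softmax(scores, axis=-1)              # each row sums to 1
    return A @ V, A

rng = np.random.default_rng(0)
d = 64
X_input = rng.normal(size=(12, d))            # stand-in for X_embed + PE
W_Q, W_K, W_V = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))

Z, A = self_attention(X_input, W_Q, W_K, W_V)
print(Z.shape, A.shape)                       # (12, 64) (12, 12)
```

Row i of the attention matrix A holds the weights that lag i places on every lag, which is where the "feature selection" interpretation from the previous post comes from.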

For ease of downstream processing, AlphaPortfolio flattens the Z matrix into a feature vector with dimension 768 × 1 (12 × 64 = 768), which they simply call r (not to be confused with the raw returns R(t)).
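The flattening step is a one-liner in NumPy. This sketch assumes row-major order (all 64 values for lag 0, then lag 1, and so on); the ordering AlphaPortfolio uses is not specified here, and `Z` is filled with placeholder values.

```python
import numpy as np

# Placeholder context matrix: 12 lags x 64 embedding dimensions.
Z = np.arange(12 * 64, dtype=float).reshape(12, 64)

r = Z.reshape(-1, 1)     # row-major flatten into a column vector
print(r.shape)           # (768, 1)
```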

As usual, you can use these transformed features downstream for supervised or reinforcement learning as you like. In the next blog post, we will talk about what happens when we have more than 1 stock we want to use as input.

Disclosure: Interactive Brokers Third Party

Information posted on IBKR Campus that is provided by third-parties does NOT constitute a recommendation that you should contract for the services of that third party. Third-party participants who contribute to IBKR Campus are independent of Interactive Brokers and Interactive Brokers does not make any representations or warranties concerning the services offered, their past or future performance, or the accuracy of the information provided by the third party. Past performance is no guarantee of future results.

This material is from PredictNow.ai and is being posted with its permission. The views expressed in this material are solely those of the author and/or PredictNow.ai and Interactive Brokers is not endorsing or recommending any investment or trading discussed in the material. This material is not and should not be construed as an offer to buy or sell any security. It should not be construed as research or investment advice or a recommendation to buy, sell or hold any security or commodity. This material does not and is not intended to take into account the particular financial conditions, investment objectives or requirements of individual customers. Before acting on this material, you should consider whether it is suitable for your particular circumstances and, as necessary, seek professional advice.
