
Reinforcement Learning in Trading

Posted June 17, 2025 at 12:08 pm

Ishan Shah
QuantInsti

The article “Reinforcement Learning in Trading” was originally posted on the QuantInsti blog.

Initially, AI research focused on simulating human thinking, only faster. Today, we’ve reached a point where AI “thinking” amazes even human experts. As a perfect example, DeepMind’s AlphaZero revolutionised chess strategy by demonstrating that winning doesn’t require preserving pieces—it’s about achieving checkmate, even at the cost of short-term losses.

This concept of “delayed gratification” in AI strategy sparked interest in exploring reinforcement learning for trading applications. This article explores how reinforcement learning can solve trading problems that might be impossible through traditional machine learning approaches.

Prerequisites

Before exploring the concepts in this blog, it’s important to build a strong foundation in machine learning, particularly in its application to financial markets.

Begin with Machine Learning Basics or Machine Learning for Algorithmic Trading in Python to understand the fundamentals, such as training data, features, and model evaluation. Then, deepen your understanding with the Top 10 Machine Learning Algorithms for Beginners, which covers key ML models like decision trees, SVMs, and ensemble methods.

Learn the difference between supervised techniques via Machine Learning Classification and regression-based price prediction in Predicting Stock Prices Using Regression.

Also, review Unsupervised Learning to understand clustering and anomaly detection, crucial for identifying patterns without labelled data.

This guide is based on notes from Deep Reinforcement Learning in Trading by Dr Tom Starke and is structured as follows.

  • What is Reinforcement Learning?
  • How to Apply Reinforcement Learning in Trading
  • How is Reinforcement Learning Different from Traditional ML?
  • Components of Reinforcement Learning
  • Putting It All Together
  • Q-Table and Q-Learning
  • Experience Replay and Advanced Techniques in RL
  • Challenges in Reinforcement Learning for Trading

What is Reinforcement Learning?

Despite sounding complex, reinforcement learning employs a simple concept we all understand from childhood. Remember receiving rewards for good grades or a scolding for misbehaviour? Those experiences shaped your behaviour through positive and negative reinforcement.

Like humans, RL agents learn for themselves to achieve successful strategies that lead to the greatest long-term rewards. This paradigm of learning by trial-and-error, solely from rewards or punishments, is known as reinforcement learning (RL).


How to Apply Reinforcement Learning in Trading

In trading, RL can be applied to various objectives:

  • Maximising profit
  • Optimising portfolio allocation

The distinguishing advantage of RL is its ability to learn strategies that maximise long-term rewards, even when it means accepting short-term losses.

Consider Amazon’s stock price, which remained relatively stable from late 2018 to early 2020, suggesting a mean-reverting strategy might work well.

[Chart: Amazon's stock price, late 2018 to early 2020. Source: Yahoo Finance]

However, from early 2020, the price began trending upward. Deploying a mean-reverting strategy at this point would have resulted in losses, causing many traders to exit the market.

[Chart: Amazon's stock price trending upward from early 2020. Source: Yahoo Finance]

An RL model, however, could recognise larger patterns from previous years (2017-2018) and continue holding positions for substantial future profits—exemplifying delayed gratification in action.


How is Reinforcement Learning Different from Traditional ML?

Unlike traditional machine learning algorithms, RL doesn’t require labels at each time step. Instead:

  • The RL algorithm learns through trial and error
  • It receives rewards only when trades are closed
  • It optimises strategy to maximise long-term rewards

Traditional ML requires labels at specific intervals (e.g., hourly or daily) and typically frames trading as regression (predicting the next candle's percentage return) or classification (predicting whether to buy or sell a stock). This makes the delayed gratification problem particularly difficult to solve with conventional ML approaches.


Components of Reinforcement Learning

This guide focuses on the conceptual understanding of Reinforcement Learning components rather than their implementation. If you’re interested in coding these concepts, you can explore the Deep Reinforcement Learning course on Quantra.

Actions

Actions define what the RL algorithm can do to solve a problem. For trading, actions might be Buy, Sell, and Hold. For portfolio management, actions would be capital allocations across asset classes.
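
As a minimal illustration, the action space for a single-asset trading agent could be encoded as a small enumeration. The names and numbering below are illustrative, not a specific library's API:

```python
from enum import IntEnum

class Action(IntEnum):
    """Possible actions for a single-asset trading agent (illustrative)."""
    HOLD = 0
    BUY = 1
    SELL = 2
```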

Policy

Policies help the RL model decide which actions to take:

  • Exploration policy: When the agent knows nothing, it decides actions randomly and learns from experiences. This initial phase is driven by experimentation—trying different actions and observing the outcomes.
  • Exploitation policy: The agent uses past experiences to map states to actions that maximise long-term rewards.

In trading, it is crucial to maintain a balance between exploration and exploitation. A simple mathematical expression that decays exploration over time while retaining a small exploratory chance can be written as:

εₜ = εₘᵢₙ + (1 − εₘᵢₙ) · e^(−k·t)

Here, εₜ is the exploration rate at trade number t, k controls the rate of decay, and εₘᵢₙ ensures we never stop exploring entirely.
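
A minimal sketch of this decaying ε-greedy policy is shown below, assuming the decay form given above; the function names and the default values of k and εₘᵢₙ are illustrative.

```python
import numpy as np

def exploration_rate(t, k=0.001, eps_min=0.05):
    # Decaying exploration rate with a floor: eps_t = eps_min + (1 - eps_min) * exp(-k * t)
    return eps_min + (1.0 - eps_min) * np.exp(-k * t)

def choose_action(q_values, t, rng=np.random.default_rng()):
    # Epsilon-greedy: explore with probability eps_t, otherwise exploit the best known action
    if rng.random() < exploration_rate(t):
        return int(rng.integers(len(q_values)))   # explore: pick a random action
    return int(np.argmax(q_values))               # exploit: pick the highest Q-value
```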

State

The state provides meaningful information for decision-making. For example, when deciding whether to buy Apple stock, useful information might include:

  • Technical indicators
  • Historical price data
  • Sentiment data
  • Fundamental data

All this information constitutes the state. For effective analysis, the data should be weakly predictive and weakly stationary (having constant mean and variance), as ML algorithms generally perform better on stationary data.
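
For instance, raw prices are non-stationary, but percent returns and bounded indicators such as RSI are much closer to weakly stationary. A rough sketch of building such a state with pandas follows; the simple RSI approximation and column names are assumptions for illustration.

```python
import pandas as pd

def build_state(close: pd.Series, rsi_period: int = 14) -> pd.DataFrame:
    # Percent returns are roughly stationary, unlike raw prices
    returns = close.pct_change()

    # Simple RSI approximation from rolling average gains and losses
    delta = close.diff()
    gain = delta.clip(lower=0).rolling(rsi_period).mean()
    loss = (-delta.clip(upper=0)).rolling(rsi_period).mean()
    rsi = 100 - 100 / (1 + gain / loss)

    return pd.DataFrame({"returns": returns, "rsi": rsi}).dropna()
```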

Rewards

Rewards represent the end objective of your RL system. Common metrics include:

  • Profit per tick
  • Sharpe Ratio
  • Profit per trade

In trading, using just the sign of the PnL (positive or negative) as the reward often works better, as the model learns faster. This binary reward structure lets the model focus on consistently making profitable trades rather than chasing larger but potentially riskier gains.
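
One way to express this binary reward is simply the sign of the realised PnL on a closed trade (a sketch, not the only possible choice):

```python
def sign_reward(closed_trade_pnl: float) -> float:
    # +1 for a profitable closed trade, -1 for a losing one, 0 otherwise
    if closed_trade_pnl > 0:
        return 1.0
    if closed_trade_pnl < 0:
        return -1.0
    return 0.0
```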

Environment

The environment is the world that allows the RL agent to observe states. When the agent applies an action, the environment processes that action, calculates rewards, and transitions to the next state.

RL Agent

The agent is the RL model that takes input features/state and decides which action to take. For instance, an RL agent might take RSI and 10-minute returns as input to determine whether to go long on Apple stock or close an existing position.


Putting It All Together


Let’s see how these components work together:

Step 1:

  • State & Action: Apple’s closing price was $92 on Jan 24, 2025. Based on the state (RSI and 10-day returns), the agent gives a buy signal.
  • Environment: The order is placed at the open on the next trading day (Jan 27) and filled at $92.
  • Reward: No reward is given as the trade is still open.

Step 2:

  • State & Action: The next state reflects the latest price data. On Jan 27, the price reached $94. The agent analyses this state and decides to sell.
  • Environment: A sell order is placed to close the long position.
  • Reward: A reward of 2.1% is given to the agent.
Date   | Closing price | Action | Reward (% returns)
Jan 24 | $92           | Buy    | –
Jan 27 | $94           | Sell   | 2.1
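
The two steps above can be mimicked with a toy environment in which the reward arrives only when the position is closed. Everything below (the class name, the interface, and the assumption that we are already long one share bought at the first price) is a simplified sketch, not a standard library API.

```python
import numpy as np

class SingleStockEnv:
    # Toy environment: the agent is long one share and can only hold (0) or sell (1)
    def __init__(self, prices):
        self.prices = np.asarray(prices, dtype=float)
        self.reset()

    def reset(self):
        self.t = 0
        self.entry = self.prices[0]   # position opened at the first price
        return self.t                 # the "state" here is just the time index

    def step(self, action):
        self.t += 1
        done = (action == 1) or (self.t == len(self.prices) - 1)
        reward = 0.0
        if done:
            # Reward (% return) is only realised when the trade is closed
            reward = (self.prices[self.t] - self.entry) / self.entry * 100
        return self.t, reward, done

env = SingleStockEnv([92.0, 94.0])
env.reset()
_, reward, _ = env.step(action=1)   # sell on the next day: reward ≈ 2% on the closed trade
```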

Q-Table and Q-Learning

At each time step, the RL agent needs to decide which action to take. The Q-table helps by showing which action will give the maximum reward. In this table:

  • Rows represent states (days)
  • Columns represent actions (hold/sell)
  • Values are Q-values indicating expected future rewards

Example Q-table:

Date       | Sell  | Hold
23-01-2025 | 0.954 | 0.966
24-01-2025 | 0.954 | 0.985
27-01-2025 | 0.954 | 1.005
28-01-2025 | 0.954 | 1.026
29-01-2025 | 0.954 | 1.047
30-01-2025 | 0.954 | 1.068
31-01-2025 | 0.954 | 1.090

On Jan 23, the agent would choose “hold” since its Q-value (0.966) exceeds the Q-value for “sell” (0.954).

Creating a Q-Table

Let’s create a Q-table using Apple’s price data from Jan 22-31, 2025:

Date       | Closing Price | % Returns | Cumulative Returns
22-01-2025 | 97.2          | –         | –
23-01-2025 | 92.8          | -4.53%    | 0.95
24-01-2025 | 92.6          | -0.22%    | 0.95
27-01-2025 | 94.8          | 2.38%     | 0.98
28-01-2025 | 93.3          | -1.58%    | 0.96
29-01-2025 | 95.0          | 1.82%     | 0.98
30-01-2025 | 96.2          | 1.26%     | 0.99
31-01-2025 | 106.3         | 10.50%    | 1.09

If we’ve bought one Apple share with no remaining capital, our only choices are “hold” or “sell.” We first create a reward table:

State/Action | Sell | Hold
22-01-2025   | 0    | 0
23-01-2025   | 0.95 | 0
24-01-2025   | 0.95 | 0
27-01-2025   | 0.98 | 0
28-01-2025   | 0.96 | 0
29-01-2025   | 0.98 | 0
30-01-2025   | 0.99 | 0
31-01-2025   | 1.09 | 1.09

Using only this reward table, the RL model would sell the stock immediately and collect a reward of 0.95. However, the price is expected to rise to about $106 by Jan 31, roughly a 9% gain over the Jan 22 price, so holding would be the better choice.

To represent this future information, we create a Q-table using the Bellman equation:

Q(s, a) = R(s, a) + γ · maxₐ′ Q(s′, a′)

Where:

  • s is the current state and a is the action taken at time t
  • s′ is the next state and a′ is a candidate action in that state
  • R is the reward table
  • Q is the state-action value table, which is updated iteratively
  • γ is the discount factor applied to future rewards

Starting with Jan 30’s Hold action:

  • The reward for this action (from R-table) is 0
  • Assuming γ = 0.98, the maximum Q-value for actions on Jan 31 is 1.09
  • The Q-value for Hold on Jan 30 is 0 + 0.98(1.09) = 1.068

Completing this process for all rows gives us our Q-table:

Date       | Sell | Hold
23-01-2025 | 0.95 | 0.966
24-01-2025 | 0.95 | 0.985
27-01-2025 | 0.98 | 1.005
28-01-2025 | 0.96 | 1.026
29-01-2025 | 0.98 | 1.047
30-01-2025 | 0.99 | 1.068
31-01-2025 | 1.09 | 1.090

The RL model will now select “hold” to maximise Q-value. This process of updating the Q-table is called Q-learning.
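
The backward pass through the Bellman equation can be reproduced in a few lines of Python. The arrays below simply restate the reward table, with γ = 0.98 as in the example; this is a sketch of the worked example, not a general Q-learning implementation.

```python
import numpy as np

dates  = ["23-01", "24-01", "27-01", "28-01", "29-01", "30-01", "31-01"]
r_sell = np.array([0.95, 0.95, 0.98, 0.96, 0.98, 0.99, 1.09])   # reward table, Sell column
r_hold = np.array([0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 1.09])   # reward table, Hold column
gamma  = 0.98                                                    # discount factor

q_sell = r_sell.copy()      # selling closes the trade, so there is no future term
q_hold = r_hold.copy()
for t in range(len(dates) - 2, -1, -1):                          # work backwards from 31-01
    q_hold[t] = r_hold[t] + gamma * max(q_sell[t + 1], q_hold[t + 1])

for d, s, h in zip(dates, q_sell, q_hold):
    print(d, round(float(s), 3), round(float(h), 3))             # reproduces the Q-table above
```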

In real-world scenarios with vast state spaces, building complete Q-tables becomes impractical. To overcome this, we can use Deep Q Networks (DQNs)—neural networks that learn Q-tables from past experiences and provide Q-values for actions when given a state as input.
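
As an illustration, a minimal Q-network in PyTorch might look like the following; the two-feature state (e.g., RSI and a recent return), the layer sizes, and the three actions are assumptions for this sketch, not a specific course implementation.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    # Maps a state vector to one estimated Q-value per action (Buy / Sell / Hold)
    def __init__(self, n_features: int = 2, n_actions: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

q_net = QNetwork()
state = torch.tensor([[55.0, 0.012]])            # e.g. [RSI, 10-day return]
greedy_action = int(q_net(state).argmax(dim=1))  # exploit: pick the highest Q-value
```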


Experience Replay and Advanced Techniques in RL

Experience Replay

  • Stores (state, action, reward, next_state) tuples in a replay buffer
  • Trains the network on random batches from this buffer
  • Benefits: breaks correlations between samples, improves data efficiency, stabilises training (a minimal buffer sketch follows below)
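
A replay buffer can be as simple as a fixed-length deque plus random sampling; the sketch below assumes nothing beyond the Python standard library.

```python
import random
from collections import deque

class ReplayBuffer:
    # Fixed-size store of (state, action, reward, next_state) transitions
    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size: int = 32):
        # Random batches break the correlation between consecutive market states
        return random.sample(self.buffer, batch_size)
```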

Double Q-Networks (DDQN)

  • Uses two networks: primary for action selection, target for value estimation
  • Reduces overestimation bias in Q-values
  • More stable learning and better policies (the target computation is sketched below)
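
The key difference from a plain DQN is how the training target is formed: the primary network chooses the next action and the target network evaluates it. A hedged PyTorch sketch of that target computation, with illustrative function and argument names (reward and done are assumed to be batch tensors):

```python
import torch

def ddqn_target(reward, next_state, done, gamma, q_primary, q_target):
    # Primary network selects the action, target network estimates its value
    with torch.no_grad():
        best_action = q_primary(next_state).argmax(dim=1, keepdim=True)
        next_value = q_target(next_state).gather(1, best_action).squeeze(1)
    # No bootstrapped value beyond terminal states
    return reward + gamma * next_value * (1.0 - done)
```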

Other Key Advancements

  • Prioritised Experience Replay: Samples important transitions more frequently
  • Dueling Networks: Separates state value and action advantage estimation
  • Distributional RL: Models the entire return distribution instead of just the expected value
  • Rainbow DQN: Combines multiple improvements for state-of-the-art performance
  • Soft Actor-Critic: Adds entropy regularisation for robust exploration

These techniques address fundamental challenges in deep RL, improving efficiency, stability, and performance across complex environments.


Challenges in Reinforcement Learning for Trading

Type 2 Chaos

While training, the RL model works in isolation without interacting with the market. Once deployed, we don’t know how it will affect the market. Type 2 chaos occurs when an observer can influence the situation they’re observing. Although difficult to quantify during training, we can assume the RL model will continue learning after deployment and adjust accordingly.

Noise in Financial Data

RL models might interpret random noise in financial data as actionable signals, leading to inaccurate trading recommendations. While methods exist to remove noise, we must balance noise reduction against a potential loss of important data.


Conclusion

We’ve introduced the fundamental components of reinforcement learning systems for trading. The next step would be implementing your own RL system to backtest and paper trade using real-world market data.

For a deeper dive into RL and to create your own reinforcement learning trading strategies, consider specialised courses in Deep Reinforcement Learning on Quantra.

References & Further Readings

  1. Once you’re comfortable with the foundational ML concepts, you can explore advanced reinforcement learning and its role in trading through more structured learning experiences. Start with the Machine Learning & Deep Learning in Trading learning track, which offers hands-on tutorials on AI model design, data preprocessing, and financial market modelling.
  2. For those looking for an advanced, structured approach to quantitative trading and machine learning, the Executive Programme in Algorithmic Trading (EPAT) is an excellent choice. This program covers classical ML algorithms (such as SVM, k-means clustering, decision trees, and random forests), deep learning fundamentals (including neural networks and gradient descent), and Python-based strategy development. You will also explore statistical arbitrage using PCA, alternative data sources, and reinforcement learning applied to trading.
  3. Once you have mastered these concepts, you can apply your knowledge in real-world trading using Blueshift. Blueshift is an all-in-one automated trading platform that offers institutional-grade infrastructure for investment research, backtesting, and algorithmic trading. It is a fast, flexible, and reliable platform, agnostic to asset class and trading style, helping you turn your ideas into investment-worthy opportunities.


Disclosure: Interactive Brokers Third Party

Information posted on IBKR Campus that is provided by third-parties does NOT constitute a recommendation that you should contract for the services of that third party. Third-party participants who contribute to IBKR Campus are independent of Interactive Brokers and Interactive Brokers does not make any representations or warranties concerning the services offered, their past or future performance, or the accuracy of the information provided by the third party. Past performance is no guarantee of future results.

This material is from QuantInsti and is being posted with its permission. The views expressed in this material are solely those of the author and/or QuantInsti and Interactive Brokers is not endorsing or recommending any investment or trading discussed in the material. This material is not and should not be construed as an offer to buy or sell any security. It should not be construed as research or investment advice or a recommendation to buy, sell or hold any security or commodity. This material does not and is not intended to take into account the particular financial conditions, investment objectives or requirements of individual customers. Before acting on this material, you should consider whether it is suitable for your particular circumstances and, as necessary, seek professional advice.
