{"id":225869,"date":"2025-06-17T12:08:54","date_gmt":"2025-06-17T16:08:54","guid":{"rendered":"https:\/\/ibkrcampus.com\/campus\/?p=225869"},"modified":"2025-06-17T12:07:03","modified_gmt":"2025-06-17T16:07:03","slug":"reinforcement-learning-in-trading-2","status":"publish","type":"post","link":"https:\/\/www.interactivebrokers.com\/campus\/ibkr-quant-news\/reinforcement-learning-in-trading-2\/","title":{"rendered":"Reinforcement Learning in Trading"},"content":{"rendered":"\n<p><em>The article &#8220;Reinforcement Learning in Trading&#8221; was originally posted on the <a href=\"https:\/\/blog.quantinsti.com\/reinforcement-learning-trading\/\">QuantInsti<\/a> blog.<\/em><\/p>\n\n\n\n<p>Initially, AI research focused on simulating human thinking, only faster. Today, we&#8217;ve reached a point where AI &#8220;thinking&#8221; amazes even human experts. As a perfect example, DeepMind&#8217;s AlphaZero revolutionised chess strategy by demonstrating that winning doesn&#8217;t require preserving pieces\u2014it&#8217;s about achieving checkmate, even at the cost of short-term losses.<\/p>\n\n\n\n<p>This concept of &#8220;delayed gratification&#8221; in AI strategy sparked interest in exploring reinforcement learning for trading applications. 
This article explores how reinforcement learning can solve trading problems that might be impossible through traditional machine learning approaches.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"prerequisites\"><strong>Prerequisites<\/strong><\/h3>\n\n\n\n<p>Before exploring the concepts in this blog, it\u2019s important to build a strong foundation in machine learning, particularly in its application to financial markets.<\/p>\n\n\n\n<p>Begin with<a href=\"https:\/\/blog.quantinsti.com\/machine-learning-basics\/\">&nbsp;Machine Learning Basics<\/a>&nbsp;or<a href=\"https:\/\/blog.quantinsti.com\/trading-using-machine-learning-python\/\">&nbsp;Machine Learning for Algorithmic Trading in Python<\/a>&nbsp;to understand the fundamentals, such as training data, features, and model evaluation. Then, deepen your understanding with the<a href=\"https:\/\/blog.quantinsti.com\/top-10-machine-learning-algorithms-beginners\/\">&nbsp;Top 10 Machine Learning Algorithms for Beginners<\/a>, which covers key ML models like decision trees, SVMs, and ensemble methods.<\/p>\n\n\n\n<p>Learn the difference between supervised techniques via<a href=\"https:\/\/blog.quantinsti.com\/machine-learning-classification\/\">&nbsp;Machine Learning Classification<\/a>&nbsp;and regression-based price prediction in<a href=\"https:\/\/blog.quantinsti.com\/machine-learning-trading-predict-stock-prices-regression\/\">&nbsp;Predicting Stock Prices Using Regression<\/a>.<\/p>\n\n\n\n<p>Also, review<a href=\"https:\/\/blog.quantinsti.com\/unsupervised-learning\/\">&nbsp;Unsupervised Learning<\/a>&nbsp;to understand clustering and anomaly detection, crucial for identifying patterns without labelled data.<\/p>\n\n\n\n<p>This guide is based on notes from&nbsp;<a href=\"https:\/\/quantra.quantinsti.com\/course\/deep-reinforcement-learning-trading\">Deep Reinforcement Learning in Trading by Dr Tom Starke<\/a>&nbsp;and is structured as follows.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>What is 
Reinforcement Learning?<\/li>\n\n\n\n<li>How to Apply Reinforcement Learning in Trading<\/li>\n\n\n\n<li>How is Reinforcement Learning Different from Traditional ML?<\/li>\n\n\n\n<li>Components of Reinforcement Learning<\/li>\n\n\n\n<li>Putting It All Together<\/li>\n\n\n\n<li>Q-Table and Q-Learning<\/li>\n\n\n\n<li>Experience Replay and Advanced Techniques in RL<\/li>\n\n\n\n<li>Challenges in Reinforcement Learning for Trading<\/li>\n<\/ul>\n\n\n\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" width=\"1000\" height=\"563\" data-src=\"https:\/\/www.interactivebrokers.com\/campus\/wp-content\/uploads\/sites\/2\/2025\/06\/quantinsti-reinforcement-learning.gif\" alt=\"Reinforcement Learning in Trading\" class=\"wp-image-225913 lazyload\" src=\"data:image\/svg+xml;base64,PHN2ZyB3aWR0aD0iMSIgaGVpZ2h0PSIxIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciPjwvc3ZnPg==\" style=\"--smush-placeholder-width: 1000px; aspect-ratio: 1000\/563;\" \/><\/figure>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"what-is-reinforcement-learning\"><strong>What is Reinforcement Learning?<\/strong><\/h2>\n\n\n\n<p>Despite sounding complex, reinforcement learning employs a simple concept we all understand from childhood. Remember receiving rewards for good grades or scolding for misbehavior? Those experiences shaped your behavior through positive and negative reinforcement.<\/p>\n\n\n\n<p>Like humans, RL agents learn for themselves to achieve successful strategies that lead to the greatest long-term rewards. 
This paradigm of learning by trial-and-error, solely from rewards or punishments, is known as reinforcement learning (RL).<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"how-to-apply-reinforcement-learning-in-trading\"><strong>How to Apply Reinforcement Learning in Trading<\/strong><\/h2>\n\n\n\n<p>In trading, RL can be applied to various objectives:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Maximising profit<\/li>\n\n\n\n<li>Optimising portfolio allocation<\/li>\n<\/ul>\n\n\n\n<p>The distinguishing advantage of RL is its ability to learn strategies that maximise long-term rewards, even when it means accepting short-term losses.<\/p>\n\n\n\n<p>Consider Amazon&#8217;s stock price, which remained relatively stable from late 2018 to early 2020, suggesting a mean-reverting strategy might work well.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" width=\"720\" height=\"360\" data-src=\"https:\/\/www.interactivebrokers.com\/campus\/wp-content\/uploads\/sites\/2\/2025\/06\/Amazon-share-price-quantinsti.png\" alt=\"Reinforcement Learning in Trading\n\" class=\"wp-image-225915 lazyload\" data-srcset=\"https:\/\/ibkrcampus.com\/campus\/wp-content\/uploads\/sites\/2\/2025\/06\/Amazon-share-price-quantinsti.png 720w, https:\/\/ibkrcampus.com\/campus\/wp-content\/uploads\/sites\/2\/2025\/06\/Amazon-share-price-quantinsti-700x350.png 700w, https:\/\/ibkrcampus.com\/campus\/wp-content\/uploads\/sites\/2\/2025\/06\/Amazon-share-price-quantinsti-300x150.png 300w\" data-sizes=\"(max-width: 720px) 100vw, 720px\" src=\"data:image\/svg+xml;base64,PHN2ZyB3aWR0aD0iMSIgaGVpZ2h0PSIxIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciPjwvc3ZnPg==\" style=\"--smush-placeholder-width: 720px; aspect-ratio: 720\/360;\" \/><\/figure>\n\n\n\n<p>Source: Yahoo Finance<\/p>\n\n\n\n<p>However, from early 2020, the price began trending upward. 
Deploying a mean-reverting strategy at this point would have resulted in losses, causing many traders to exit the market.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" width=\"720\" height=\"360\" data-src=\"https:\/\/www.interactivebrokers.com\/campus\/wp-content\/uploads\/sites\/2\/2025\/06\/Early-2020-share-price-Amazon-quantinsti.png\" alt=\"Reinforcement Learning in Trading\n\" class=\"wp-image-225916 lazyload\" data-srcset=\"https:\/\/ibkrcampus.com\/campus\/wp-content\/uploads\/sites\/2\/2025\/06\/Early-2020-share-price-Amazon-quantinsti.png 720w, https:\/\/ibkrcampus.com\/campus\/wp-content\/uploads\/sites\/2\/2025\/06\/Early-2020-share-price-Amazon-quantinsti-700x350.png 700w, https:\/\/ibkrcampus.com\/campus\/wp-content\/uploads\/sites\/2\/2025\/06\/Early-2020-share-price-Amazon-quantinsti-300x150.png 300w\" data-sizes=\"(max-width: 720px) 100vw, 720px\" src=\"data:image\/svg+xml;base64,PHN2ZyB3aWR0aD0iMSIgaGVpZ2h0PSIxIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciPjwvc3ZnPg==\" style=\"--smush-placeholder-width: 720px; aspect-ratio: 720\/360;\" \/><\/figure>\n\n\n\n<p>Source: Yahoo Finance<\/p>\n\n\n\n<p>An RL model, however, could recognise larger patterns from previous years (2017-2018) and continue holding positions for substantial future profits\u2014exemplifying delayed gratification in action.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"how-is-reinforcement-learning-different-from-traditional-ml\"><strong>How is Reinforcement Learning Different from Traditional ML?<\/strong><\/h2>\n\n\n\n<p>Unlike traditional machine learning algorithms, RL doesn&#8217;t require labels at each time step. 
Instead:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>The RL algorithm learns through trial and error<\/li>\n\n\n\n<li>It receives rewards only when trades are closed<\/li>\n\n\n\n<li>It optimises strategy to maximise long-term rewards<\/li>\n<\/ul>\n\n\n\n<p>Traditional ML requires labels at specific intervals (e.g., hourly or daily) and focuses on regression to predict the next candle&#8217;s percentage return or classification to predict whether to buy or sell a stock. This makes solving the delayed gratification problem particularly challenging with conventional ML approaches.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"components-of-reinforcement-learning\">Components of Reinforcement Learning<\/h2>\n\n\n\n<p>This guide focuses on the conceptual understanding of Reinforcement Learning components rather than their implementation. If you&#8217;re interested in coding these concepts, you can explore the&nbsp;<a href=\"https:\/\/quantra.quantinsti.com\/course\/deep-reinforcement-learning-trading\">Deep Reinforcement Learning course<\/a>&nbsp;on Quantra.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"actions\"><strong>Actions<\/strong><\/h3>\n\n\n\n<p>Actions define what the RL algorithm can do to solve a problem. For trading, actions might be Buy, Sell, and Hold. For portfolio management, actions would be capital allocations across asset classes.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"policy\"><strong>Policy<\/strong><\/h3>\n\n\n\n<p>Policies help the RL model decide which actions to take:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Exploration policy<\/strong>: When the agent knows nothing, it decides actions randomly and learns from experiences. 
This initial phase is driven by experimentation\u2014trying different actions and observing the outcomes.<\/li>\n\n\n\n<li><strong>Exploitation policy<\/strong>: The agent uses past experiences to map states to actions that maximise long-term rewards.<\/li>\n<\/ul>\n\n\n\n<p>In trading, it is crucial to maintain a balance between exploration and exploitation. A simple mathematical expression that decays exploration over time while retaining a small exploratory chance can be written as: <\/p>\n\n\n\n<figure class=\"wp-block-image size-full is-resized\"><img decoding=\"async\" width=\"128\" height=\"41\" data-src=\"https:\/\/www.interactivebrokers.com\/campus\/wp-content\/uploads\/sites\/2\/2025\/06\/reinforcement-learning-formula-quantinsti-1.png\" alt=\"Reinforcement Learning in Trading\" class=\"wp-image-225918 lazyload\" style=\"--smush-placeholder-width: 128px; aspect-ratio: 128\/41;width:128px;height:auto\" src=\"data:image\/svg+xml;base64,PHN2ZyB3aWR0aD0iMSIgaGVpZ2h0PSIxIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciPjwvc3ZnPg==\" \/><\/figure>\n\n\n\n<p>Here, \u03b5\u209c is the exploration rate at trade number t, k controls the rate of decay, and \u03b5\u2098\u1d62\u2099 ensures we never stop exploring entirely.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"state\"><strong>State<\/strong><\/h3>\n\n\n\n<p>The state provides meaningful information for decision-making. For example, when deciding whether to buy Apple stock, useful information might include:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Technical indicators<\/li>\n\n\n\n<li>Historical price data<\/li>\n\n\n\n<li>Sentiment data<\/li>\n\n\n\n<li>Fundamental data<\/li>\n<\/ul>\n\n\n\n<p>All this information constitutes the state. 
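<\/p>

<p>As a rough sketch, the state might be assembled from recent closing prices like this (illustrative only: the function name and feature choices are ours, not from the course; returns are used instead of raw prices to keep the features closer to stationary):<\/p>

```python
import math

def compute_state(prices, lookback=5):
    """Build a toy state vector (latest return, mean return, volatility)
    from a closing-price series, using the last `lookback` returns."""
    returns = [prices[i] / prices[i - 1] - 1 for i in range(1, len(prices))]
    window = returns[-lookback:]
    mean = sum(window) / len(window)
    vol = math.sqrt(sum((r - mean) ** 2 for r in window) / len(window))
    return (window[-1], mean, vol)

# Apple closing prices from the worked example below (Jan 22-31, 2025)
prices = [97.2, 92.8, 92.6, 94.8, 93.3, 95.0, 96.2, 106.3]
state = compute_state(prices)
```

<p>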
For effective analysis, the data should be weakly predictive and weakly stationary (having constant mean and variance), as ML algorithms generally perform better on stationary data.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"rewards\"><strong>Rewards<\/strong><\/h3>\n\n\n\n<p>Rewards represent the end objective of your RL system. Common metrics include:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Profit per tick<\/li>\n\n\n\n<li>Sharpe Ratio<\/li>\n\n\n\n<li>Profit per trade<\/li>\n<\/ul>\n\n\n\n<p>When it comes to trading, using just the sign of the PnL (positive\/negative) as the reward often works better than using the raw profit figure, because the model learns faster. This binary reward structure allows the model to focus on consistently making profitable trades rather than chasing larger but potentially riskier gains.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"environment\"><strong>Environment<\/strong><\/h3>\n\n\n\n<p>The environment is the world that allows the RL agent to observe states. When the agent applies an action, the environment processes that action, calculates rewards, and transitions to the next state.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"rl-agent\"><strong>RL Agent<\/strong><\/h3>\n\n\n\n<p>The agent is the RL model that takes input features\/state and decides which action to take. 
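<\/p>

<p>A minimal tabular agent tying these pieces together might look like the following sketch (class and parameter names are ours for illustration; it combines the \u03b5-decay idea from the Policy section, in one common functional form, with greedy selection over a Q-table):<\/p>

```python
import math
import random

class TabularAgent:
    """Toy epsilon-greedy agent over a Q-table (illustrative only)."""

    def __init__(self, actions, k=0.01, eps_min=0.05, seed=42):
        self.actions = actions
        self.k = k              # controls how fast exploration decays
        self.eps_min = eps_min  # ensures we never stop exploring entirely
        self.q = {}             # (state, action) -> expected future reward
        self.t = 0              # trade counter
        self.rng = random.Random(seed)

    def epsilon(self):
        # One common decay form: eps_t = max(eps_min, exp(-k * t))
        return max(self.eps_min, math.exp(-self.k * self.t))

    def act(self, state):
        self.t += 1
        if self.rng.random() < self.epsilon():
            return self.rng.choice(self.actions)               # explore
        return max(self.actions,
                   key=lambda a: self.q.get((state, a), 0.0))  # exploit
```

<p>Early on, \u03b5 is close to 1 and actions are mostly random; as trades accumulate, the agent increasingly exploits the Q-values it has learned.<\/p>

<p>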
For instance, an RL agent might take RSI and 10-minute returns as input to determine whether to go long on Apple stock or close an existing position.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"putting-it-all-together\"><strong>Putting It All Together<\/strong><\/h2>\n\n\n\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" width=\"1000\" height=\"563\" data-src=\"https:\/\/www.interactivebrokers.com\/campus\/wp-content\/uploads\/sites\/2\/2025\/06\/quantinsti-reinforcement-learning-2.gif\" alt=\"Reinforcement Learning in Trading\" class=\"wp-image-225920 lazyload\" src=\"data:image\/svg+xml;base64,PHN2ZyB3aWR0aD0iMSIgaGVpZ2h0PSIxIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciPjwvc3ZnPg==\" style=\"--smush-placeholder-width: 1000px; aspect-ratio: 1000\/563;\" \/><\/figure>\n\n\n\n<p>Let&#8217;s see how these components work together:<\/p>\n\n\n\n<p><strong>Step 1:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>State &amp; Action<\/strong>: Apple&#8217;s closing price was $92 on Jan 24, 2025. Based on the state (RSI and 10-day returns), the agent gives a buy signal.<\/li>\n\n\n\n<li><strong>Environment<\/strong>: The order is placed at the open on the next trading day (Jan 27) and filled at $92.<\/li>\n\n\n\n<li><strong>Reward<\/strong>: No reward is given as the trade is still open.<\/li>\n<\/ul>\n\n\n\n<p><strong>Step 2:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>State &amp; Action<\/strong>: The next state reflects the latest price data. On Jan 27, the price reached $94. 
The agent analyses this state and decides to sell.<\/li>\n\n\n\n<li><strong>Environment<\/strong>: A sell order is placed to close the long position.<\/li>\n\n\n\n<li><strong>Reward<\/strong>: A reward of 2.1% is given to the agent.<\/li>\n<\/ul>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><tbody><tr><td><strong>Date<\/strong><\/td><td><strong>Closing price<\/strong><\/td><td><strong>Action<\/strong><\/td><td><strong>Reward (% returns)<\/strong><\/td><\/tr><tr><td>Jan 24<\/td><td>$92<\/td><td>Buy<\/td><td>&#8211;<\/td><\/tr><tr><td>Jan 27<\/td><td>$94<\/td><td>Sell<\/td><td>2.1<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"q-table-and-q-learning\"><strong>Q-Table and Q-Learning<\/strong><\/h2>\n\n\n\n<p>At each time step, the RL agent needs to decide which action to take. The Q-table helps by showing which action will give the maximum reward. In this table:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Rows represent states (days)<\/li>\n\n\n\n<li>Columns represent actions (hold\/sell)<\/li>\n\n\n\n<li>Values are Q-values indicating expected future rewards<\/li>\n<\/ul>\n\n\n\n<p>Example Q-table:<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><tbody><tr><td><strong>Date<\/strong><\/td><td><strong>Sell<\/strong><\/td><td><strong>Hold<\/strong><\/td><\/tr><tr><td>23-01-2025<\/td><td>0.954<\/td><td>0.966<\/td><\/tr><tr><td>24-01-2025<\/td><td>0.954<\/td><td>0.985<\/td><\/tr><tr><td>27-01-2025<\/td><td>0.954<\/td><td>1.005<\/td><\/tr><tr><td>28-01-2025<\/td><td>0.954<\/td><td>1.026<\/td><\/tr><tr><td>29-01-2025<\/td><td>0.954<\/td><td>1.047<\/td><\/tr><tr><td>30-01-2025<\/td><td>0.954<\/td><td>1.068<\/td><\/tr><tr><td>31-01-2025<\/td><td>0.954<\/td><td>1.090<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>On Jan 23, the agent would choose &#8220;hold&#8221; since its Q-value (0.966) exceeds the 
Q-value for &#8220;sell&#8221; (0.954).<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"creating-a-q-table\"><strong>Creating a Q-Table<\/strong><\/h3>\n\n\n\n<p>Let&#8217;s create a Q-table using Apple&#8217;s price data from Jan 22-31, 2025:<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><tbody><tr><td><strong>Date<\/strong><\/td><td><strong>Closing Price<\/strong><\/td><td><strong>% Returns<\/strong><\/td><td><strong>Cumulative Returns<\/strong><\/td><\/tr><tr><td>22-01-2025<\/td><td>97.2<\/td><td>&#8211;<\/td><td>&#8211;<\/td><\/tr><tr><td>23-01-2025<\/td><td>92.8<\/td><td>-4.53%<\/td><td>0.95<\/td><\/tr><tr><td>24-01-2025<\/td><td>92.6<\/td><td>-0.22%<\/td><td>0.95<\/td><\/tr><tr><td>27-01-2025<\/td><td>94.8<\/td><td>2.38%<\/td><td>0.98<\/td><\/tr><tr><td>28-01-2025<\/td><td>93.3<\/td><td>-1.58%<\/td><td>0.96<\/td><\/tr><tr><td>29-01-2025<\/td><td>95.0<\/td><td>1.82%<\/td><td>0.98<\/td><\/tr><tr><td>30-01-2025<\/td><td>96.2<\/td><td>1.26%<\/td><td>0.99<\/td><\/tr><tr><td>31-01-2025<\/td><td>106.3<\/td><td>10.50%<\/td><td>1.09<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>If we&#8217;ve bought one Apple share with no remaining capital, our only choices are &#8220;hold&#8221; or &#8220;sell.&#8221; We first create a reward table:<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><tbody><tr><td><strong>State\/Action<\/strong><\/td><td><strong>Sell<\/strong><\/td><td><strong>Hold<\/strong><\/td><\/tr><tr><td>22-01-2025<\/td><td>0<\/td><td>0<\/td><\/tr><tr><td>23-01-2025<\/td><td>0.95<\/td><td>0<\/td><\/tr><tr><td>24-01-2025<\/td><td>0.95<\/td><td>0<\/td><\/tr><tr><td>27-01-2025<\/td><td>0.98<\/td><td>0<\/td><\/tr><tr><td>28-01-2025<\/td><td>0.96<\/td><td>0<\/td><\/tr><tr><td>29-01-2025<\/td><td>0.98<\/td><td>0<\/td><\/tr><tr><td>30-01-2025<\/td><td>0.99<\/td><td>0<\/td><\/tr><tr><td>31-01-2025<\/td><td>1.09<\/td><td>1.09<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>Using only this reward table, 
the RL model would sell the stock and get a reward of 0.95. However, the price is expected to increase to $106 on Jan 31, resulting in a 9% gain, so holding would be better.<\/p>\n\n\n\n<p>To represent this future information, we create a Q-table using the Bellman equation:<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" width=\"322\" height=\"36\" data-src=\"https:\/\/www.interactivebrokers.com\/campus\/wp-content\/uploads\/sites\/2\/2025\/06\/reinforcement-learning-formula-quantinsti-2.png\" alt=\"Reinforcement Learning\" class=\"wp-image-225922 lazyload\" data-srcset=\"https:\/\/ibkrcampus.com\/campus\/wp-content\/uploads\/sites\/2\/2025\/06\/reinforcement-learning-formula-quantinsti-2.png 322w, https:\/\/ibkrcampus.com\/campus\/wp-content\/uploads\/sites\/2\/2025\/06\/reinforcement-learning-formula-quantinsti-2-300x34.png 300w\" data-sizes=\"(max-width: 322px) 100vw, 322px\" src=\"data:image\/svg+xml;base64,PHN2ZyB3aWR0aD0iMSIgaGVpZ2h0PSIxIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciPjwvc3ZnPg==\" style=\"--smush-placeholder-width: 322px; aspect-ratio: 322\/36;\" \/><\/figure>\n\n\n\n<p>Where:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>s is the state<\/li>\n\n\n\n<li>a is the set of actions available at time t<\/li>\n\n\n\n<li>a&#8217; is a specific action in the next state<\/li>\n\n\n\n<li>R is the reward table<\/li>\n\n\n\n<li>Q is the state-action table that&#8217;s constantly updated<\/li>\n\n\n\n<li>\u03b3 is the discount factor, which weights future rewards<\/li>\n<\/ul>\n\n\n\n<p>Starting with Jan 30&#8217;s Hold action:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>The reward for this action (from R-table) is 0<\/li>\n\n\n\n<li>Assuming \u03b3 = 0.98, the maximum Q-value for actions on Jan 31 is 1.09<\/li>\n\n\n\n<li>The Q-value for Hold on Jan 30 is 0 + 0.98(1.09) = 1.068<\/li>\n<\/ul>\n\n\n\n<p>Completing this process for all rows gives us our Q-table:<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table 
class=\"has-fixed-layout\"><tbody><tr><td><strong>Date<\/strong><\/td><td><strong>Sell<\/strong><\/td><td><strong>Hold<\/strong><\/td><\/tr><tr><td>23-01-2025<\/td><td>0.95<\/td><td>0.966<\/td><\/tr><tr><td>24-01-2025<\/td><td>0.95<\/td><td>0.985<\/td><\/tr><tr><td>27-01-2025<\/td><td>0.98<\/td><td>1.005<\/td><\/tr><tr><td>28-01-2025<\/td><td>0.96<\/td><td>1.026<\/td><\/tr><tr><td>29-01-2025<\/td><td>0.98<\/td><td>1.047<\/td><\/tr><tr><td>30-01-2025<\/td><td>0.99<\/td><td>1.068<\/td><\/tr><tr><td>31-01-2025<\/td><td>1.09<\/td><td>1.090<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>The RL model will now select &#8220;hold&#8221; to maximise Q-value. This process of updating the Q-table is called Q-learning.<\/p>\n\n\n\n<p>In real-world scenarios with vast state spaces, building complete Q-tables becomes impractical. To overcome this, we can use Deep Q Networks (DQNs)\u2014neural networks that learn Q-tables from past experiences and provide Q-values for actions when given a state as input.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"experience-replay-and-advanced-techniques-in-rl\"><strong>Experience Replay and Advanced Techniques in RL<\/strong><\/h2>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"experience-replay\"><strong>Experience Replay<\/strong><\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Stores (state, action, reward, next_state) tuples in a replay buffer<\/li>\n\n\n\n<li>Trains the network on random batches from this buffer<\/li>\n\n\n\n<li>Benefits: breaks correlations between samples, improves data efficiency, stabilises training<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"double-q-networks-ddqn-\"><strong>Double Q-Networks (DDQN)<\/strong><\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Uses two networks: primary for action selection, target for value estimation<\/li>\n\n\n\n<li>Reduces overestimation bias in Q-values<\/li>\n\n\n\n<li>More stable learning and better 
policies<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"other-key-advancements\"><strong>Other Key Advancements<\/strong><\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Prioritised Experience Replay<\/strong>: Samples important transitions more frequently<\/li>\n\n\n\n<li><strong>Dueling Networks<\/strong>: Separates state value and action advantage estimation<\/li>\n\n\n\n<li><strong>Distributional RL<\/strong>: Models the entire return distribution instead of just the expected value<\/li>\n\n\n\n<li><strong>Rainbow DQN<\/strong>: Combines multiple improvements for state-of-the-art performance<\/li>\n\n\n\n<li><strong>Soft Actor-Critic<\/strong>: Adds entropy regularisation for robust exploration<\/li>\n<\/ul>\n\n\n\n<p>These techniques address fundamental challenges in deep RL, improving efficiency, stability, and performance across complex environments.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"challenges-in-reinforcement-learning-for-trading\"><strong>Challenges in Reinforcement Learning for Trading<\/strong><\/h2>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"type-2-chaos\"><strong>Type 2 Chaos<\/strong><\/h3>\n\n\n\n<p>While training, the RL model works in isolation without interacting with the market. Once deployed, we don&#8217;t know how it will affect the market. Type 2 chaos occurs when an observer can influence the situation they&#8217;re observing. Although difficult to quantify during training, we can assume the RL model will continue learning after deployment and adjust accordingly.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"noise-in-financial-data\"><strong>Noise in Financial Data<\/strong><\/h3>\n\n\n\n<p>RL models might interpret random noise in financial data as actionable signals, leading to inaccurate trading recommendations. 
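<\/p>

<p>One simple illustration of the trade-off is a moving-average filter (a sketch only; real pipelines use more sophisticated denoising):<\/p>

```python
def sma(series, window=3):
    """Simple moving average: a basic noise filter. Larger windows remove
    more noise but also lag the signal and shorten the usable series."""
    return [sum(series[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(series))]

noisy = [100, 104, 99, 103, 98, 102]
smoothed = sma(noisy)  # -> [101.0, 102.0, 100.0, 101.0]
```

<p>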
While methods exist to remove noise, we must balance noise reduction against a potential loss of important data.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"conclusion\">Conclusion<\/h3>\n\n\n\n<p>We&#8217;ve introduced the fundamental components of reinforcement learning systems for trading. The next step would be implementing your own RL system to backtest and paper trade using real-world market data.<\/p>\n\n\n\n<p>For a deeper dive into RL and to create your own reinforcement learning trading strategies, consider specialised courses in Deep Reinforcement Learning on Quantra.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"references-further-readings\">References &amp; Further Readings<\/h3>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Once you\u2019re comfortable with the foundational ML concepts, you can explore advanced reinforcement learning and its role in trading through more structured learning experiences. Start with the<a href=\"https:\/\/quantra.quantinsti.com\/learning-track\/machine-learning-deep-learning-trading-1\">&nbsp;Machine Learning &amp; Deep Learning in Trading<\/a>&nbsp;learning track, which offers hands-on tutorials on AI model design, data preprocessing, and financial market modelling.<\/li>\n\n\n\n<li>For those looking for an advanced, structured approach to quantitative trading and machine learning, the<a href=\"https:\/\/www.quantinsti.com\/epat\/\">&nbsp;Executive Programme in Algorithmic Trading (EPAT)<\/a>&nbsp;is an excellent choice. This program covers classical ML algorithms (such as SVM, k-means clustering, decision trees, and random forests), deep learning fundamentals (including neural networks and gradient descent), and Python-based strategy development. 
You will also explore statistical arbitrage using PCA, alternative data sources, and reinforcement learning applied to trading.<\/li>\n\n\n\n<li>Once you have mastered these concepts, you can apply your knowledge in real-world trading using<a href=\"https:\/\/www.quantinsti.com\/blueshift\/\">&nbsp;Blueshift<\/a>. Blueshift is an all-in-one automated trading platform that offers institutional-grade infrastructure for investment research, backtesting, and algorithmic trading. It is a fast, flexible, and reliable platform, agnostic to asset class and trading style, helping you turn your ideas into investment-worthy opportunities.<\/li>\n<\/ol>\n","protected":false},"excerpt":{"rendered":"<p>Like humans, RL agents learn for themselves to achieve successful strategies that lead to the greatest long-term rewards. <\/p>\n","protected":false},"author":517,"featured_media":168970,"comment_status":"open","ping_status":"closed","sticky":true,"template":"","format":"standard","meta":{"_acf_changed":true,"footnotes":""},"categories":[339,343,349,338,341],"tags":[632,7257,852,595,7258],"contributors-categories":[13654],"class_list":{"0":"post-225869","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-data-science","8":"category-programing-languages","9":"category-python-development","10":"category-ibkr-quant-news","11":"category-quant-development","12":"tag-ai","13":"tag-algorithmic-trading","14":"tag-machine-learning","15":"tag-python","16":"tag-reinforcement-learning","17":"contributors-categories-quantinsti"},"pp_statuses_selecting_workflow":false,"pp_workflow_action":"current","pp_status_selection":"publish","acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v26.9 (Yoast SEO v27.3) - https:\/\/yoast.com\/product\/yoast-seo-premium-wordpress\/ -->\n<title>Reinforcement Learning in Trading | IBKR Quant<\/title>\n<meta name=\"description\" content=\"Like humans, RL agents learn for 
"Reinforcement Learning in Trading" — by Ishan Shah, published June 17, 2025, on IBKR Campus (IBKR Quant News). Estimated reading time: 11 minutes.