\documentclass[a4paper, 14pt]{scrartcl}
\usepackage[utf8]{inputenc}
\usepackage[english]{babel}
\usepackage{parskip}
\usepackage{microtype}
\usepackage[margin=1in]{geometry}
\usepackage{hyperref}

\title{Stock Trading with Reinforcement Learning}
\author{Marcel Zinkel}

\begin{document}
\maketitle
\tableofcontents

\section{Introduction}
I want to build a reinforcement learning project about single-asset stock trading. First, I want to start with a simple environment that only offers the actions buy and sell. For the reward function, I also want to keep things simple at first and use the profit directly as the reward. In contrast to the algorithms we have already seen in the lecture, I have to use deep reinforcement learning algorithms because the price is a continuous variable. In theory, one could discretize the price at a fixed resolution into many states, but this quickly becomes impractical for classical reinforcement learning methods. In addition, deep reinforcement learning can recognize patterns and therefore act well in previously unseen states.

I want to try out different reinforcement learning algorithms to see which works best for the trading environment. First, I want to try the Deep Q-Network (DQN) algorithm. It approximates the Q-function with a neural network and uses epsilon-greedy exploration; I plan to try out different formulas for the epsilon decay. Because DQN often overestimates Q-values, I also want to try a variation of DQN called Double DQN. It uses two networks for the target computation: the online network selects the action with the highest Q-value, and the target network evaluates that action. This typically leads to more stable learning, and I will test whether Double DQN improves the results. Finally, I want to try the Proximal Policy Optimization (PPO) algorithm. After implementing these algorithms, I need to train them and compare the results.

I also find it very interesting whether providing the RL agent with additional information beyond the price positively impacts the results. For example, I can add technical indicators, market volume, or a news sentiment score about the company. The last one is probably the most difficult, because it requires an LLM that scores web-scraped articles according to how positive the news is for the company. After adding this information, I need to re-evaluate which algorithm performs best.

\section{Libraries and Tools}
The project will be implemented in Python using \texttt{gym-anytrading} to build the trading environment. For initial experiments, I will use the built-in datasets from \texttt{gym\_anytrading.datasets} such as \texttt{STOCKS\_GOOGL}, and later switch to real historical stock data via \texttt{yfinance}.

The reinforcement learning algorithms will be implemented using the \texttt{stable-baselines3} library. I will start with the standard DQN algorithm and experiment with different epsilon decay strategies. Since \texttt{stable-baselines3} does not directly support Double DQN, I plan to modify the DQN implementation myself. Specifically, I will adjust the target calculation so that the action is selected using the online network but evaluated using the target network, as required in Double DQN. This will allow me to better understand the internal workings of the algorithm and directly control its behavior. In addition to DQN and Double DQN, I will also train PPO using the standard implementation in \texttt{stable-baselines3}. After training, I will evaluate all models using backtesting and performance metrics such as total profit, Sharpe ratio, and maximum drawdown. The sketches below illustrate the planned environment setup, the Double DQN target computation, and the evaluation metrics.
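As a rough starting point, a minimal sketch of this setup could look as follows. It assumes recent, gymnasium-based versions of \texttt{gym-anytrading} and \texttt{stable-baselines3}; the hyperparameters and the data slice are placeholders, not tuned values. Note that \texttt{stable-baselines3} only offers a linear epsilon schedule out of the box, so other decay formulas will need extra code.

\begin{verbatim}
# Minimal sketch: gym-anytrading environment and
# a stable-baselines3 DQN agent.
import gymnasium as gym
import gym_anytrading  # registers the "stocks-v0" env
from gym_anytrading.datasets import STOCKS_GOOGL
from stable_baselines3 import DQN

# Observation: the last window_size steps of the data slice.
env = gym.make("stocks-v0", df=STOCKS_GOOGL,
               window_size=10, frame_bound=(10, 300))

# stable-baselines3 decays epsilon linearly from
# exploration_initial_eps to exploration_final_eps over
# the first exploration_fraction of the training budget.
model = DQN("MlpPolicy", env,
            learning_rate=1e-4,
            exploration_fraction=0.2,
            exploration_initial_eps=1.0,
            exploration_final_eps=0.05,
            verbose=1)
model.learn(total_timesteps=100_000)
model.save("dqn_stocks")
\end{verbatim}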
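The Double DQN modification only changes the target computation to $y = r + \gamma\, Q_{\theta^-}(s', \arg\max_{a'} Q_{\theta}(s', a'))$, where $\theta$ denotes the online network and $\theta^-$ the target network. A PyTorch-style sketch of this target (the function and variable names are my own and not the internal \texttt{stable-baselines3} API):

\begin{verbatim}
import torch

def double_dqn_target(q_online, q_target, next_obs,
                      rewards, dones, gamma=0.99):
    # Online network selects the greedy action,
    # target network evaluates it.
    with torch.no_grad():
        next_q = q_online(next_obs)
        best = next_q.argmax(dim=1, keepdim=True)
        eval_q = q_target(next_obs).gather(1, best)
        eval_q = eval_q.squeeze(1)
        # No bootstrap term for terminal transitions.
        return rewards + gamma * (1.0 - dones) * eval_q
\end{verbatim}

I expect this to be a small change inside the training step of the \texttt{stable-baselines3} DQN class.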
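For the evaluation, a simple sketch of the Sharpe ratio and the maximum drawdown, assuming a series of per-step returns and an equity curve produced by the backtest (the annualisation factor of 252 trading days per year is an assumption):

\begin{verbatim}
import numpy as np

def sharpe_ratio(returns, risk_free_rate=0.0,
                 periods_per_year=252):
    # Annualised mean excess return over annualised
    # standard deviation of the per-step returns.
    r = np.asarray(returns, dtype=float)
    excess = r - risk_free_rate / periods_per_year
    return (np.sqrt(periods_per_year)
            * excess.mean() / excess.std())

def max_drawdown(equity_curve):
    # Largest relative drop from a running peak,
    # returned as a negative fraction (-0.25 = -25%).
    equity = np.asarray(equity_curve, dtype=float)
    peak = np.maximum.accumulate(equity)
    return ((equity - peak) / peak).min()
\end{verbatim}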
Later, I plan to extend the observation space with technical indicators, volume data, or sentiment features. For technical indicators, I will use the \texttt{pandas-ta} library since it is easy to install, well integrated with \texttt{pandas}, and provides a wide range of indicators sufficient for prototyping and research. After adding these features, I will retrain the models and compare their performance again.

\section{Development Plan}
Depending on when exactly my presentation is scheduled, I have about 9--10 weeks of time.

\subsection{Weeks 1--3}
I want to integrate the DQN algorithm as a first example and already train it on historical data.

\subsection{Weeks 4--6}
I plan to implement the other RL algorithms and their variations and evaluate which works best. I also plan to experiment with the reward function.

\subsection{Week 7 to the presentation}
I will add the technical indicators and market volume to the environment. If there is enough time left, I will try the news sentiment analysis.

\section{Availability}
I am on vacation from 04.08 to 13.08. On the 15th I am at an event, but I have time on the 14th. From the 18th onwards, I am available for the next couple of weeks. I am looking forward to the presentation, and thank you for giving me the additional time.

\end{document}