As detailed by Deep Reinforcement Learning with Double Q-learning
A deep Q network (DQN) is a multi-layered neural network that for a given state s outputs a vector of action values Q(s, · ; θ), where θ are the parameters of the network. For an n-dimensional state space and an action space containing m actions, the neural network is a function from Rn to Rm.1 In DQN there are an online network and a target network. The target network has parameters θ−, which are the same as the online network's except that they are copied from the online network every τ steps and kept fixed on all other steps.1 The target value used by DQN is:
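(reproduced here from the paper's notation:)

$$Y_t^{DQN} \equiv R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a;\, \theta_t^{-})$$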
However, research scientist/professor/genius Hado van Hasselt found that the max operator in the above formula led to overestimation.
Namely, the max operator in standard Q-learning and DQN uses the same values both to select and to evaluate an action. This makes it more likely to select overestimated values, resulting in overoptimistic value estimates.1 Thus, to reduce this overestimation, Double Q-learning decomposes the max operation in the target value into an action selection and an action evaluation. Although not fully decoupled, the target network in the DQN architecture provides a natural candidate for the second value function, without having to introduce additional networks. We therefore propose to evaluate the greedy policy according to the online network, but using the target network to estimate its value.1 Hence, the target value becomes the following:
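(again reproduced from the paper: the online network selects the action, the target network evaluates it:)

$$Y_t^{DoubleDQN} \equiv R_{t+1} + \gamma\, Q\big(S_{t+1}, \operatorname{argmax}_{a} Q(S_{t+1}, a;\, \theta_t);\, \theta_t^{-}\big)$$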
The weights of the second network contain the weights of the target network θₜ⁻ for the evaluation of the current greedy policy. The update to the target network stays unchanged from DQN, and remains a periodic copy of the online network[2]. Two values are involved in learning: the TD Estimate - the predicted optimal 𝑸* for a given state s:
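(in the notation of the PyTorch Mario tutorial[2]:)

$$TD_e = Q^{*}_{online}(s, a)$$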
And the TD Target - the aggregation of the current reward and the estimated 𝑸* in the next state s':
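(again following the tutorial's notation, with the action chosen by the online network and evaluated by the target network:)

$$a' = \operatorname{argmax}_{a} Q_{online}(s', a), \qquad TD_t = r + \gamma\, Q^{*}_{target}(s', a')$$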
The algorithm also samples inputs from a replay buffer to compute TDₜ and TDₑ and backpropagates this loss down Qₒₙₗᵢₙₑ to update its parameters θₒₙₗᵢₙₑ (α is the learning rate) as follows:
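(schematically, as the tutorial writes it:)

$$\theta_{online} \leftarrow \theta_{online} + \alpha\, \nabla\big(TD_e - TD_t\big)$$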
Target weights do not get updated during backpropagation; instead, the weights from the online network are periodically assigned to the weights of the target network as follows:
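(in the tutorial's notation:)

$$\theta_{target} \leftarrow \theta_{online}$$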
The results of training the model for 10000 episodes can be found in the code section of: Train Mario 10000. After 10000 episodes, an average score of 334.503 was achieved (which is well below the scores documented by others using right movements for Mario). I decided to warm-start the model for another 100 episodes using the same memory max length and batch size of 32. Only parameters relating to: 1) the number of experiences added to the cache before training; 2) the number of experiences added to the cache before updating the weights of the online network; and 3) the frequency of synchronizing the model weights of the target network were changed. The values and code can be found at Warm-Start 100. Click the bottom image, if you dare, to see the DDQN in action in the Super Mario environment!!
All Actor-Critic Algorithms consist of two methods: an Actor/Policy and a Critic/Value. As detailed by Actor-Critic Algorithms
Actor-Critic methods aim at combining the strong points of Actor/Policy-only and Critic/Value-only methods. The Critic uses an approximation architecture and simulation to learn a Value function, which is then used to update the Actor's Policy parameters in a direction of performance improvement. Such methods, as long as they are gradient-based, may have desirable convergence properties, in contrast to Critic-only methods for which convergence is guaranteed in very limited settings. They hold the promise of delivering faster convergence (due to variance reduction), when compared to Actor-only methods.3
Since Actor-Critic methods rely on an Actor/Policy and a Critic/Value, you need to select both an Actor/Policy method and a Critic/Value method. In this case, the Actor/Policy method used was a Policy-Gradient method called the REINFORCE algorithm, and the Critic/Value method used was based on On-Policy Value learning. The two methods are then combined to create the Actor-Critic algorithm used in the Implementation portion.
Actor/Policy
The REINFORCE method is built upon trajectories instead of episodes because maximizing expected return over trajectories (instead of episodes) lets the method search for optimal policies for both episodic and continuing tasks. The method samples trajectories using the policy and then uses those trajectories only to estimate the gradient.4 The pseudocode for the REINFORCE algorithm is the following:
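(At its core, the algorithm repeatedly samples m trajectories with the current policy and updates θ using the standard REINFORCE gradient estimate:)

$$\nabla_\theta J(\theta) \approx \frac{1}{m}\sum_{i=1}^{m}\sum_{t=0}^{H} \nabla_\theta \log \pi_\theta\big(a_t^{(i)} \mid s_t^{(i)}\big)\, R\big(\tau^{(i)}\big)$$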
Critic/Value
An On-Policy Value Function gives the expected return if you start in a given state st and always act according to the policy π and is represented by the following formula:
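(in its standard discounted form:)

$$V^{\pi}(s_t) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k} \,\middle|\, s_t \right]$$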
Actor-Critic-Combination
By viewing the problem as a stochastic gradient problem on the Actor/Policy parameter space, represented by line (4) in the above pseudocode, the job of the critic is to compute an approximation of the projection of the policy π onto the subspace Ψ parameterized by θ [which leads to the two actor-critic algorithms described in the paper]3. The actor uses this approximation to update its policy in an approximate gradient direction.3 This is done by using the critic as a state-dependent baseline. To do so, an Advantage Function is calculated by subtracting the estimated value (i.e. V) from the expected return (i.e. G), and the following gradient is computed:
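(that is, the policy gradient with the critic's value estimate used as a baseline:)

$$\nabla_\theta J(\theta) \approx \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \big(G_t - V(s_t)\big)$$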
The implementation of the above Actor-Critic combination was taken directly from 6. To model the functions π and V, one shared neural network with one hidden layer of 128 units was used, as detailed by the following code snippet:
import tensorflow as tf
from tensorflow.keras import layers
from typing import Tuple

class ActorCritic(tf.keras.Model):
    def __init__(self, num_actions: int, num_hidden_units: int):
        super().__init__()
        self.common = layers.Dense(num_hidden_units, activation="relu")  # shared hidden layer
        self.actor = layers.Dense(num_actions)  # actor head: action logits
        self.critic = layers.Dense(1)  # critic head: state-value estimate

    def call(self, inputs: tf.Tensor) -> Tuple[tf.Tensor, tf.Tensor]:
        x = self.common(inputs)
        return self.actor(x), self.critic(x)
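As a quick illustration (dimensions chosen to match the Lunar Lander example used below, i.e. 8 state features and 4 discrete actions):

```python
model = ActorCritic(num_actions=4, num_hidden_units=128)
# Batch of one 8-dimensional state; returns action logits and a state-value estimate
logits, value = model(tf.zeros((1, 8)))
```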
Then, to train this Actor-Critic model, the gradient of the loss function has to be calculated and backpropagated.
Actor Loss
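(as formulated in the actor-critic tutorial6 that this implementation follows:)

$$L_{actor} = -\sum_{t=1}^{T} \log \pi_\theta(a_t \mid s_t)\, \big(G_t - V^{\pi}_\theta(s_t)\big)$$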
where G-V is the Advantage Function.
Critic Loss
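(again following the tutorial's formulation, with the Huber loss applied between the returns G and the value estimates V:)

$$L_{critic} = L_{\delta}\big(G, V^{\pi}_\theta\big)$$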
where L𝛿 is the Huber Loss.
Each loss is calculated independently and then combined to get the total loss. The following code snippet details this combination:
huber_loss = tf.keras.losses.Huber(reduction=tf.keras.losses.Reduction.SUM)

def compute_loss(action_probs: tf.Tensor, values: tf.Tensor, returns: tf.Tensor) -> tf.Tensor:
    """Computes the combined actor-critic loss."""
    advantage = returns - values  # G - V, the advantage
    action_log_probs = tf.math.log(action_probs)
    actor_loss = -tf.math.reduce_sum(action_log_probs * advantage)
    critic_loss = huber_loss(values, returns)  # Huber loss between V and G
    return actor_loss + critic_loss
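For context, a rough sketch of how this loss is typically used in a training step; the `run_episode` helper and the optimizer settings here are illustrative placeholders, not code from this repo:

```python
optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)

with tf.GradientTape() as tape:
    # run_episode is a placeholder that would roll out one episode with `model`
    # and return per-step action probabilities, value estimates, and returns.
    action_probs, values, returns = run_episode(model)
    loss = compute_loss(action_probs, values, returns)

grads = tape.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(grads, model.trainable_variables))
```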
I decided to test out the implementation of the A2C algorithm on the Lunar Lander environment as found here: Lunar Lander. This environment has two possible action spaces to choose from: one is continuous and the other is discrete. I chose the discrete environment this time, since it was very easy to set up.7 As detailed in the Lunar Lander documentation, the action, state, and reward spaces were the following:
There are four discrete actions available: do nothing, fire left orientation engine, fire main engine, fire right orientation engine. There are 8 states: the coordinates of the lander in x & y, its linear velocities in x & y, its angle, its angular velocity, and two booleans that represent whether each leg is in contact with the ground or not. The landing pad is always at coordinates (0,0). The coordinates are the first two numbers in the state vector. Rewards for moving from the top of the screen to the landing pad and zero speed are about 100 to 140 points. If the lander moves away from the landing pad it loses reward. The episode finishes if the lander crashes or comes to rest, receiving an additional -100 or +100 points. Each leg with ground contact is +10 points. Firing the main engine is -0.3 points each frame. Firing the side engine is -0.03 points each frame. Solved is 200 points.
I used all the default parameters from the tutorial and ended on the 10,000th episode with an episode reward of 221 (i.e. 🎮 numbers! Take that 🧑🚀 camp!) and an average reward of 115.82. And that was all for today.
As detailed by Dueling Network Architectures for Deep Reinforcement Learning
In this paper, we present a new neural network architecture for model-free reinforcement learning. Our dueling network represents two separate estimators: one for the state value function and one for the state-dependent action advantage function. The main benefit of this factoring is to generalize learning across actions without imposing any change to the underlying reinforcement learning algorithm.8
To bring this insight to fruition, we design a single Q-network architecture, as illustrated in Figure 1, which we refer to as the dueling network. The lower layers of the dueling network are convolutional as in the original DQNs (Mnih et al., 2015). However, instead of following the convolutional layers with a single sequence of fully connected layers, we instead use two sequences (or streams) of fully connected layers. The streams are constructed such that they have the capability of providing separate estimates of the value and advantage functions. Finally, the two streams are combined to produce a single output Q function. As in (Mnih et al., 2015), the output of the network is a set of Q values, one for each action.8
The formula that combines these two streams to output Q-values is the following:
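(reconstructed from the paper:)

$$Q(s, a;\, \theta, \alpha, \beta) = V(s;\, \theta, \beta) + \Big(A(s, a;\, \theta, \alpha) - \frac{1}{|\mathcal{A}|}\sum_{a'} A(s, a';\, \theta, \alpha)\Big)$$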
where V represents the value function and A represents the advantage function for each action.
The implementation of the above formula came from the book Deep Reinforcement Learning Hands-On: Apply modern RL methods to practical problems of chatbots, robotics, discrete optimization, web automation, and more, 2nd Edition by Maxim Lapan, and is shown in the following PyTorch code snippet:
import numpy as np
import torch
import torch.nn as nn

class DuelingDQNConv1D(nn.Module):  # class name is illustrative; the original snippet omits it
    def __init__(self, shape, actions_n, kernel=5):  # kernel size is an assumption; adjust to your data
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(shape[0], 128, kernel),
            nn.ReLU(),
            nn.Conv1d(128, 128, kernel),
            nn.ReLU(),
        )
        # Infer the flattened size of the conv output with a dummy forward pass
        o = self.conv(torch.zeros(1, *shape))
        out_size = int(np.prod(o.size()))
        # Value stream: V(s)
        self.fc_val = nn.Sequential(
            nn.Linear(out_size, 512),
            nn.ReLU(),
            nn.Linear(512, 1)
        )
        # Advantage stream: A(s, a)
        self.fc_adv = nn.Sequential(
            nn.Linear(out_size, 512),
            nn.ReLU(),
            nn.Linear(512, actions_n)
        )

    def forward(self, x):
        conv_out = self.conv(x).view(x.size()[0], -1)
        val = self.fc_val(conv_out)
        adv = self.fc_adv(conv_out)
        # Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)
        return val + adv - adv.mean(dim=1, keepdim=True)
In chapter 8 of 9, the author details a practical domain problem of reinforcement learning; namely, stock trading! The first thing that had to be done was to create the stock trading environment. The types of actions the agent could take in this environment were the following: 1) hold off on trading; 2) buy the given stock; and 3) sell the given stock. When the agent bought or sold the given stock, it had to pay a commission of 0.1%. The agent's state space consisted of the following items: 1) 5 past trading days of opening price data; 2) 5 past trading days of high, low, and closing prices in relation to the opening price; 3) volume for the current trading day; 4) whether the agent bought stock; and 5) the relative close of the stock for the current trading day. The agent's reward follows a two-part, "either/or" scheme, as detailed in 9: if the variable reward_on_close is True, the agent will receive a reward only on selling its stock position; otherwise the agent will receive a reward only when buying and holding its stock position (i.e. not selling). The first form amounts to the trading strategy of active investing, while the second form amounts to the trading strategy of passive investing.
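For reference, a minimal sketch of how this discrete action set can be encoded, roughly in the spirit of Lapan's implementation (the exact names in the book may differ):

```python
import enum

class Actions(enum.Enum):
    Skip = 0   # hold off on trading
    Buy = 1    # buy the given stock/ETF (pays 0.1% commission)
    Close = 2  # sell / close the position (pays 0.1% commission)
```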
In 9, the author uses stock data from the Russian stock market from the period 2015-2016 for the technology company Yandex. While that dataset contained over 130,000 rows of data, in which every row represented a single minute of price data, I decided to take a more general approach and chose for the agent to trade the SPY ETF. Each row in the dataset represented one trading day of the ETF, and the data ranged from 2005 to 2022. The years 2005-2014 were used for training and the years 2015-2020 were used for validation. The Source Code for the SPY Trading agent can be found here: SpyTradingAgent
The Y-axis is given in percentages and the X-axis is given by the number of steps executed.
Mean Value Reward per 100 Trading Windows
The Y-axis is given in percentages and the X-axis is given by the number of steps executed.
Mean Value Reward per 100 Trading Windows
We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a “surrogate” objective function using stochastic gradient ascent. Whereas standard policy gradient methods perform one gradient update per data sample, we propose a novel objective function that enables multiple epochs of minibatch updates. The new methods, which we call proximal policy optimization (PPO), have some of the benefits of trust region policy optimization (TRPO), but they are much simpler to implement, more general, and have better sample complexity (empirically). Our experiments test PPO on a collection of benchmark tasks, including simulated robotic locomotion and Atari game playing, and we show that PPO outperforms other online policy gradient methods, and overall strikes a favorable balance between sample complexity, simplicity, and wall-time.10
Algorithm
To train the Trading Agent the package Stable-Baselines3 was used. As stated in the docs:
Stable Baselines3 (SB3) is a set of reliable implementations of reinforcement learning algorithms in PyTorch. It is the next major version of Stable Baselines and stems from the paper Stable-Baselines3: Reliable Reinforcement Learning Implementations. The algorithms in this package will make it easier for the research community and industry to replicate, refine, and identify new ideas, and will create good baselines to build projects on top of. We expect these tools will be used as a base around which new ideas can be added, and as a tool for comparing a new approach against existing ones. We also hope that the simplicity of these tools will allow beginners to experiment with a more advanced toolset, without being buried in implementation details.11
Furthermore, to visualize the trading agent's observation space when trading, I used Adam King's brilliant implementation of a stock trading environment, as detailed here: Rendering elegant stock trading agents using Matplotlib and Gym.
Example 4: Continuous Action Space for Stock/ETF Trading Agent: Part II (:warning::warning::warning: :warning: WARNING!!!!!! Realize this environment is not realistic!!!! WARNING!!! WARNING!!!, Again WARNING!!!, OKAY, Time for the MIT software disclaimer, READ IT: :point_down:
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
This second stock/ETF environment is based on Adam King's article as found here: Create custom gym environments from scratch — A stock market example. Similar to the first stock trading environment, which was based on Maxim Lapan's implementation from chapter eight of his book Deep Reinforcement Learning Hands-On: Apply modern RL methods to practical problems of chatbots, robotics, discrete optimization, web automation, and more, 2nd Edition and implemented above in Example 3, the agent trades the SPY ETF; except in this environment the agent takes continuous actions rather than discrete actions, and is tasked with managing a trading account.
In this trading environment, the agent's reward is based on managing its trading account/balance. The agent's action has two parts: 1) whether to buy or sell the SPY ETF12 and 2) what percentage of the SPY ETF to buy or sell, which ranges over [0,1] (i.e. 0% to 100%). The agent's state/observation space consists of the following items: 1) 5 past trading days of high, low, and closing price data in relation to the opening price; 2) volume for the current trading day; 3) the agent's trading account/balance; 4) the number of shares held; 5) the number of shares sold; 6) the agent's net worth, which consists of the account balance plus the current price value of the shares held; 7) the value of the shares sold; and 8) the cost basis of buying new shares as compared to buying shares in the previous period. All of these items were normalized to the interval [-1,1]. The trading agent begins with 10,000 US dollars to trade with and can accumulate a maximum trading account/balance of 2,147,483,647 US dollars.
The PPO Agent was trained for 50 thousand steps on SPY data ranging from 2005 to mid-2017 and was tested on SPY data ranging from end-2017 to 2022. Furthermore, 4 parallel environments were used for the agent to gather samples/experience to train on. All of the default hyper-parameters from Stable-Baselines3 were used except for the number of epochs when optimizing the surrogate loss, which I set to 20. I also set the entropy coefficient to 0.01 and used Stable-Baselines3's limit on the KL divergence between updates, setting it to 0.2. Unfortunately, I did not keep track of the learning curves and/or other important metrics (i.e. total loss, value loss, etc.) that should be tracked during training. The video down below shows this under-trained trading agent in action 👇
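A rough sketch of that setup with Stable-Baselines3; the `StockTradingEnv` class and `train_df` variable are placeholders for the custom environment and data described above, not names from this repo:

```python
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

# 4 parallel copies of the custom SPY trading environment
vec_env = make_vec_env(lambda: StockTradingEnv(train_df), n_envs=4)

model = PPO(
    "MlpPolicy",
    vec_env,
    n_epochs=20,     # epochs over the surrogate loss per rollout
    ent_coef=0.01,   # entropy bonus coefficient
    target_kl=0.2,   # stop the epoch loop early if the approximate KL grows past this
    verbose=1,
)
model.learn(total_timesteps=50_000)
```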
| Top Subplot Legend | |
| --- | --- |
| Agent's trading balance/account | |
The Source Code for the Second Trading agent can be found here: Second Spy Trading Agent. The SPY data that the Trading agent used for training can be found here: SPY_train. And the SPY data that the Trading agent used for testing can be found here: SPY_test
Example 5: Multi-Discrete Action Space for Stock/ETF Trading Agent: Part III (:warning::warning::warning: :warning: WARNING!!!!!! Realize this environment is not realistic!!!! WARNING!!! WARNING!!!, Again WARNING!!!, OKAY, Time for the MIT software disclaimer, READ IT: :point_down:
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
This third stock trading environment is based on Adam King's articles as found here: Creating Bitcoin trading bots don't lose money and here: Optimizing deep learning trading bots using state-of-the-art techniques. Furthermore, the random offset in the reset method and the if/and control flow are based on Maxim Lapan's implementation as found in chapter eight of his book Deep Reinforcement Learning Hands-On: Apply modern RL methods to practical problems of chatbots, robotics, discrete optimization, web automation, and more, 2nd Edition.
Similar to the first and second stock trading environments, the agent is trading in the SPY ETF environment. Trading uses a Multi-Discrete action space of [3, 10], where 0 == buy, 1 == sell, and 2 == hold, and the values [0, .1, .2, .3, .4, .5, .6, .7, .8, .9] represent the fraction of shares bought/sold by the agent, i.e. 10%, 20%, etc. The observation space is a continuous observation space over [-inf, inf] (note: in the second stock trading environment this space ranged over [-1, 1]). Also, unlike the second stock trading environment, an additional observation was added to the agent's observation space: an account history/ledger of the agent's past net worth, balance, and shares from trading (note: this window is set by the variable LOOKBACK_WINDOW_SIZE and its default is 30 days). And to make it semi-realistic, a commission parameter is used in the cost and sales calculations (note: the default is 0.1%).
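As a minimal sketch, the action and observation spaces described above can be declared with Gym roughly like this (the `n_features` value is illustrative; it depends on LOOKBACK_WINDOW_SIZE and the features used):

```python
import numpy as np
from gym import spaces

# Action: [trade_type, amount_bucket] -> trade_type 0=buy, 1=sell, 2=hold;
# amount_bucket 0..9 maps to 0%, 10%, ..., 90% of shares to buy/sell
action_space = spaces.MultiDiscrete([3, 10])

# Unbounded continuous observations (prices, balance, net worth, account history, ...)
n_features = 42  # illustrative only
observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(n_features,), dtype=np.float32)
```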
Additionally, three different ways of calculating the agent's reward were added, namely:
- sortinoRewardRatio: $\frac{R_p - r_f}{\sigma_d}$ where $R_p$ is the actual or expected portfolio return, $r_f$ is the risk-free rate, and $\sigma_d$ is the standard deviation of the downside (a minimal sketch of this calculation follows this list).
- omegaRewardRatio: $\frac{\int_{\theta}^{\infty} 1 - F(R_p)\,dx}{\int_{-\infty}^{\theta} F(R_p)\,dx}$ where $F$ is the cumulative probability distribution of returns and $\theta$ is the target return threshold defining what is considered a gain versus a loss.
- Excess Return: a reward rule created by Franklin Allen and Risto Karjalainen in their paper Using genetic algorithms to find technical trading rules. This is the default reward function for the third trading environment.
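Here is a minimal sketch of the Sortino calculation above, assuming a NumPy array of per-step portfolio returns and ignoring annualization (the function and argument names are illustrative, not the exact code in the environment):

```python
import numpy as np

def sortino_ratio(returns, risk_free_rate=0.0):
    """(R_p - r_f) / sigma_d, where sigma_d is the std of the downside returns only."""
    returns = np.asarray(returns, dtype=np.float64)
    excess = returns.mean() - risk_free_rate
    downside = returns[returns < 0]
    sigma_d = downside.std() if downside.size > 0 else 0.0
    return excess / sigma_d if sigma_d > 0 else 0.0
```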
The Source Code for the Third Trading agent can be found here: Third Spy Trading Agent
The D4PG algorithm enhances DDPG by making the update of the critic distributional. Namely, as the authors of 14 detail:
DDPG Background
The DDPG algorithm adapts the traditional state-action value function by parameterizing the policy as $\pi_\theta$ and the value function as $Q_w$, and then uses the Bellman operator to form the temporal difference (TD) error:
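(schematically, this TD error has the usual DDPG form:)

$$\delta_t = r_t + \gamma\, Q_w\big(s_{t+1}, \pi_\theta(s_{t+1})\big) - Q_w(s_t, a_t)$$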
The video below is the result of running the DDPG algorithm for 1 million timesteps on the MinitaurBulletEnv, which models the Minitaur robot. The implementation comes from chapter 17 of Maxim Lapan's book Deep Reinforcement Learning Hands-On: Apply modern RL methods to practical problems of chatbots, robotics, discrete optimization, web automation, and more, 2nd Edition.
D4PG Theory
With this background in mind, the authors of 14 enhance the DDPG algorithm by first making the state-action value function and the Bellman operator distributional by introducing a parameterized random variable $Z_w$ (whose expectation recovers the value $Q_w$).
With this in place, the temporal difference (TD) error can then be made distributional:
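(a rough sketch of the paper's formulation, with $Z_w$ the parameterized return distribution, $\mathcal{T}_{\pi_{\theta'}}$ the distributional Bellman operator, and $d$ a distance between distributions:)

$$L(w) = \mathbb{E}\Big[\, d\big( (\mathcal{T}_{\pi_{\theta'}} Z_{w'})(s, a),\; Z_w(s, a) \big) \Big]$$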
D4PG Algorithm
This is the pseudocode for the algorithm:
Everything that follows (i.e., components, programming, etc.) comes from chapter 18 of Maxim Lapan's book Deep Reinforcement Learning Hands-On: Apply modern RL methods to practical problems of chatbots, robotics, discrete optimization, web automation, and more, 2nd Edition, in which a trained DDPG A2C policy is used to control a real-life robot to stand up and move forward. Similar to what was done in Sim-to-Real, but on a much smaller scale and budget.
Original Model
Modified Model (Version Zero)
- The xml file for this model can be found here: version zero model and is tested in Maxim Lapan's show model code/simulation env.
PyBoard
MicroPython pyboard lite v1.0:
- STM32F411RE microcontroller
- 96 MHz Cortex M4 CPU with hardware floating point
- 512KiB flash ROM and 128KiB RAM
- Micro USB connector for power and serial communication
- Micro SD card slot, supporting standard and high capacity SD cards
- 24 GPIO on left and right edges and 5 GPIO on bottom row, plus LED and switch GPIO available on bottom row
- 1x 12-bit analog to digital converter, available on 16 pins, 4 with analog ground shielding
- 4 LEDs (red, green, yellow and blue)
- 1 reset and 1 user switch
- On-board 3.3V LDO voltage regulator, capable of supplying up to 250mA, input voltage range 3.6V to 16V
- DFU bootloader in ROM for easy upgrading of firmware
Front View:
Servo Motors
IMU Sensor
- Mpu-9250 module (three axis gyroscope + three axis acceleration + three axis magnetic field)
- Gyroscope range: ± 250 500 1000 2000 ° / S
- Acceleration range: ± 2 ± 4 ± 8 ± 16g
- Magnetic field range: ± 4800ut
- Power supply: 3-5v (internal low voltage differential voltage stabilizing)
- Communication mode: Standard IIC / SPI communication protocol (chip built-in 16 bit AD converter, 16 bit data output)
Top View:
3D Printed Frame:
This 3D Frame was designed by Maxim Lapan and the .stl file can be found HERE. It was printed by xometry
Top View:
Side View:
Programming was done using MicroPython. Everything basically followed Maxim Lapan's implementation as found in chapter 18 of his book, except for programming the sensor, since I used a different sensor than what Max used. However, I was able to find its I2C driver, as detailed by Mika Tuupola, and was able to modify his code to work with Max's implementation. The directory structure on the SD card for the pyboard is the following:
sd/
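For reference, a minimal sketch of reading the MPU-9250 with Mika Tuupola's micropython-mpu9250 driver; the I2C bus number and wiring are assumptions, not the exact code used in this project:

```python
from machine import I2C
from mpu9250 import MPU9250  # Mika Tuupola's micropython-mpu9250 driver

i2c = I2C(1)           # I2C bus 1 on the pyboard; adjust to your wiring
sensor = MPU9250(i2c)

print(sensor.acceleration)  # accelerometer reading (3-tuple)
print(sensor.gyro)          # gyroscope reading (3-tuple)
print(sensor.magnetic)      # magnetometer reading (3-tuple)
```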
Standing using DDPG learned policy
Moving Forward using DDPG learned policy
Footnotes
- Deep Reinforcement Learning with Double Q-learning
- I tried the continuous environment at first, but the interface between NumPy and TensorFlow's graph was giving me some trouble when using TensorFlow's wrapper tf.numpy_function
- Dueling Network Architectures for Deep Reinforcement Learning
- Deep Reinforcement Learning Hands-On: Apply modern RL methods to practical problems of chatbots, robotics, discrete optimization, web automation, and more, 2nd Edition
- Note: even though this decision is discrete in nature, it is being modeled as a continuous action by treating values less than 0 as a buy action and values greater than or equal to 0 as a sell action
- Note: This video was captured after implementing A2C for continuous actions with a normal distribution, as detailed by Maxim Lapan in chapter 17