In this project, Proximal Policy Optimization (PPO) is used to train the agents. PPO was introduced by OpenAI.
From the OpenAI website:
Policy gradient methods are fundamental to recent breakthroughs in using deep neural networks for control, from video games, to 3D locomotion, to Go. But getting good results via policy gradient methods is challenging because they are sensitive to the choice of stepsize — too small, and progress is hopelessly slow; too large and the signal is overwhelmed by the noise, or one might see catastrophic drops in performance. They also often have very poor sample efficiency, taking millions (or billions) of timesteps to learn simple tasks.
PPO is a policy gradient method. Policy gradient methods work by computing an estimator of the policy gradient and plugging it into a stochastic gradient ascent algorithm.
One novelty in OpenAI's approach with PPO was to introduce the clipped surrogate objective function in addition to the probability ratio between the old and new policy, which is the surrogate objective optimized in Trust Region Policy Optimization (TRPO).
The main objective in PPO is thus (from agent.py):
ratio = (new_log_probs - old_log_probs).exp()
# surrogate objective: probability ratio between old and new policy
surr1 = ratio * advantage
# clipped surrogate objective
surr2 = torch.clamp(ratio, 1.0 - clip_param, 1.0 + clip_param) * advantage
loss = -torch.min(surr1, surr2).mean()
This clipping helps keep each policy update within a small range.
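To make the effect of the clipping concrete, here is a minimal per-sample sketch in plain Python. It is not taken from agent.py: the function name `clipped_surrogate` is hypothetical, and `clip_param=0.2` is an assumed value (the one suggested in the PPO paper), since this README does not state the value used.

```python
def clipped_surrogate(ratio, advantage, clip_param=0.2):
    """Per-sample PPO objective: min(ratio * A, clip(ratio) * A).

    clip_param=0.2 is an assumption for illustration only.
    """
    # Restrict the probability ratio to [1 - clip_param, 1 + clip_param].
    clipped_ratio = max(1.0 - clip_param, min(ratio, 1.0 + clip_param))
    # Taking the minimum makes the objective a pessimistic bound.
    return min(ratio * advantage, clipped_ratio * advantage)


# Positive advantage, ratio above the clip range: the gain is capped.
print(clipped_surrogate(ratio=1.5, advantage=2.0))
# Negative advantage, ratio below the clip range: the clipped term is used.
print(clipped_surrogate(ratio=0.5, advantage=-2.0))
# Ratio inside the clip range: clipping has no effect.
print(clipped_surrogate(ratio=1.1, advantage=2.0))
```

In the first case the objective stops growing once the ratio exceeds 1 + clip_param, so there is no incentive to move the new policy further away from the old one.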
Another innovation in OpenAI's approach is running the optimizer for several epochs over minibatches sampled from the same batch of collected experience.
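The idea of reusing one rollout for several optimization passes can be sketched as follows. This is an illustration, not the code from this repository: the helper `minibatch_indices` is hypothetical, and shuffling once per epoch is an assumption (an implementation may instead draw random minibatches). The sizes match the values passed to run_experiment further down.

```python
import random


def minibatch_indices(batch_size, mini_batch_size, ppo_epochs, rng=random):
    """Yield index lists; each epoch is one shuffled pass over the rollout."""
    for _ in range(ppo_epochs):
        indices = list(range(batch_size))
        rng.shuffle(indices)
        for start in range(0, batch_size, mini_batch_size):
            # Each slice selects the samples for one gradient step.
            yield indices[start:start + mini_batch_size]


batches = list(minibatch_indices(batch_size=2048, mini_batch_size=128, ppo_epochs=4))
print(len(batches))  # number of gradient steps performed on one rollout
```

Each yielded index list would be used to select states, actions, old log-probabilities, returns and advantages for one evaluation of the loss.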
More info on the details of the algorithm can be found in the paper.
In this implementation I am also using Generalized Advantage Estimation to train the critic network.
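For readers unfamiliar with GAE, the following is a minimal plain-Python sketch of the usual backward recursion. It is an illustration under assumptions, not the function used in this project: the name `compute_gae`, the `masks` convention (0.0 where an episode ended, 1.0 otherwise) and the choice to return `advantage + value` as the critic target are all assumed.

```python
def compute_gae(rewards, values, next_value, masks, gamma=0.99, tau=0.95):
    """Return (advantages, returns) for one rollout.

    values[t] is the critic's estimate for step t, next_value the estimate
    for the state after the last step, masks[t] is 0.0 if the episode
    ended at step t and 1.0 otherwise.
    """
    values = list(values) + [next_value]
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error.
        delta = rewards[t] + gamma * values[t + 1] * masks[t] - values[t]
        # Exponentially weighted sum of TD errors, reset at episode ends.
        gae = delta + gamma * tau * masks[t] * gae
        advantages[t] = gae
    # Critic regression targets (an assumed, but common, choice).
    returns = [adv + v for adv, v in zip(advantages, values[:-1])]
    return advantages, returns


advantages, returns = compute_gae(
    rewards=[1.0, 1.0], values=[0.0, 0.0], next_value=0.0,
    masks=[1.0, 1.0], gamma=0.5, tau=1.0,
)
print(advantages, returns)
```

With tau = 1 the estimate reduces to the discounted return minus the value estimate, and with tau = 0 it reduces to the one-step TD error, so tau trades variance against bias.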
The model consists of two networks: the actor network and the critic network.
The actor network takes a state tensor as input and outputs an action.
The critic network is not strictly required by the PPO algorithm: the original paper describes a policy network and a surrogate function based on the ratio of new to old action probabilities, so an actor alone would suffice. The critic is nevertheless very helpful for computing advantages, which require a value estimate for each state.
Actor network:
- 3 fully connected layers
- 33 input nodes: size of state vector
- 4 output nodes: size of action vector
- 256 hidden nodes in each layer
- ReLU activations, tanh on last layer
Critic network:
- 3 fully connected layers
- 33 input nodes: size of observation vector
- 1 output node
- 512 hidden nodes in each layer
- ReLU activations, no activation on last layer
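The two layer lists above could be written in PyTorch roughly as follows. This is a sketch under assumptions rather than the classes used in this repository: it uses `nn.Sequential` for brevity, the constant names are hypothetical, and it does not show how actions are sampled from the actor output (for example via a distribution around it), which the lists do not specify.

```python
import torch
import torch.nn as nn

STATE_SIZE, ACTION_SIZE = 33, 4  # sizes taken from the lists above

# Actor: 3 fully connected layers, 256 hidden units, tanh on the output.
actor = nn.Sequential(
    nn.Linear(STATE_SIZE, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, ACTION_SIZE), nn.Tanh(),
)

# Critic: 3 fully connected layers, 512 hidden units, linear output.
critic = nn.Sequential(
    nn.Linear(STATE_SIZE, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, 1),
)

states = torch.zeros(8, STATE_SIZE)  # dummy batch of 8 states
print(actor(states).shape, critic(states).shape)
```

The tanh keeps every action component in [-1, 1], while the critic output is left unbounded because state values are not restricted to a fixed range.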
Hyperparameters that can be tuned from main.py:
run_experiment(hidden_size=256, lr=1e-3, max_episodes=500, mini_batch_size=128,
nrmlz_adv=True, num_steps=2048, ppo_epochs=4, threshold_reward=30,
gamma=0.99, tau=0.95, clip_gradients=True)
- hidden_size: number of neurons for each layer of each network
- gamma: the discount rate (0.99)
- lr: the learning rate of both networks (1e-3)
- tau: "discount factor" for GAE (0.95)
- max_episodes: how long to train (500)
- num_steps: rollout length (2048)
- ppo_epochs: how many epochs to run PPO (4)
- clip_gradients: gradients will be clipped if set to True
- nrmlz_adv: when True, normalizes advantages
- threshold_reward: when the moving average of the last 100 episodes exceeds this value, the training is finished
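The stopping rule described for threshold_reward can be sketched with a fixed-size window. The function name `first_solved_episode` is hypothetical, and requiring a full window of 100 episodes before stopping is an assumption; the actual check in main.py may differ.

```python
from collections import deque


def first_solved_episode(episode_scores, threshold_reward=30, window=100):
    """Return the index of the first episode at which the moving average
    over the last `window` episodes exceeds threshold_reward, else None."""
    recent = deque(maxlen=window)  # old scores drop out automatically
    for episode, score in enumerate(episode_scores):
        recent.append(score)
        average = sum(recent) / len(recent)
        # Assumption: only stop once the window is completely filled.
        if len(recent) == window and average > threshold_reward:
            return episode
    return None


# Hypothetical score sequence: 50 poor episodes followed by 100 good ones.
print(first_solved_episode([0] * 50 + [40] * 100))
```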
It was extremely hard to get this agent to learn. I tried several different sets of hyperparameters, none of which helped at all. Even adding GAE did not help learning. The key ingredient that made the agent learn was advantage normalization. In case you would like to see this for yourself, I added a dedicated hyperparameter called nrmlz_adv. If you set
nrmlz_adv = False
you can watch your agents going nowhere at all.
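Advantage normalization usually means rescaling the advantages of a batch to zero mean and unit standard deviation before they enter the loss. A minimal plain-Python sketch follows; the function name, the `eps` term and the use of the population standard deviation are assumptions, and the code guarded by nrmlz_adv in this project may differ in these details.

```python
import statistics


def normalize_advantages(advantages, eps=1e-8):
    """Rescale advantages to zero mean and (approximately) unit std."""
    mean = statistics.fmean(advantages)
    std = statistics.pstdev(advantages)  # population std, an assumption
    # eps avoids division by zero when all advantages are equal.
    return [(a - mean) / (std + eps) for a in advantages]


print(normalize_advantages([1.0, 2.0, 3.0, 4.0]))
```

After this step the scale of the policy gradient no longer depends on the scale of the rewards, which is one plausible reason why it made such a difference here.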
With the right set of hyperparameters, the results are as follows:
Episode 0, Total score this episode: 0.4309999903663993, Last 0 average: 0.4309999903663993
...
Episode 55, Total score this episode: 30.905999309197068, Last 55 average: 13.403428271838598
Episode 121, Total score this episode: 37.10649917060509, Last 100 average: 30.123479326687754
Scores of last 100 episodes and moving average of all episodes.

To improve the agent's performance:
- train longer
- try different network architectures, maybe an LSTM
- investigate different initialization schemes for the neural networks
- try some ideas from PPO-CMA: Proximal Policy Optimization with Covariance Matrix Adaptation
OpenAI Spinning Up docs on PPO
Deep Reinforcement Learning in Action, chapter 'Policy Gradient Methods'

