Project report

Learning algorithm

In this project, Proximal Policy Optimization (PPO) is used to train the agents. PPO was introduced by OpenAI.

From the OpenAI blog post:

Policy gradient methods are fundamental to recent breakthroughs in using deep neural networks for control, from video games, to 3D locomotion, to Go. But getting good results via policy gradient methods is challenging because they are sensitive to the choice of stepsize — too small, and progress is hopelessly slow; too large and the signal is overwhelmed by the noise, or one might see catastrophic drops in performance. They also often have very poor sample efficiency, taking millions (or billions) of timesteps to learn simple tasks.

PPO is a policy gradient method. Policy gradient methods work by computing an estimator of the policy gradient and plugging it into a stochastic gradient ascent algorithm.

One novelty in OpenAI's approach with PPO was the clipped surrogate objective function. It builds on the probability ratio between the new and the old policy, which is the surrogate objective optimized in Trust Region Policy Optimization (TRPO):

TRPO objective

The main objective in PPO is thus

PPO objective
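
For reference, here are the two objectives in the notation of the PPO paper. This assumes the two figures above show the paper's standard forms; the symbols are the paper's, not identifiers from this repository.

```latex
% probability ratio between the new and the old policy
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}

% surrogate objective maximized by TRPO (subject to a KL constraint)
L^{CPI}(\theta) = \hat{\mathbb{E}}_t\left[ r_t(\theta)\,\hat{A}_t \right]

% clipped surrogate objective of PPO
L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[ \min\left( r_t(\theta)\,\hat{A}_t,\;
    \operatorname{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t \right) \right]
```

Here \hat{A}_t is the advantage estimate at timestep t, and \epsilon plays the role of clip_param in the code from agent.py.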

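From agent.py:

    # probability ratio r = pi_new(a|s) / pi_old(a|s), computed from log-probabilities
    ratio = (new_log_probs - old_log_probs).exp()
    # unclipped surrogate objective: ratio weighted by the advantage
    surr1 = ratio * advantage
    # clipped surrogate objective: ratio restricted to [1 - clip_param, 1 + clip_param]
    surr2 = torch.clamp(ratio, 1.0 - clip_param, 1.0 + clip_param) * advantage
    # negate the pessimistic minimum so that minimizing the loss maximizes the objective
    loss = - torch.min(surr1, surr2).mean()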

This clipping keeps each update to the policy within a small range.
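
To make the effect of clipping concrete, here is a small worked example in plain Python for a single sample. The function name clipped_surrogate and the value clip_param=0.2 are illustrative assumptions, not taken from agent.py.

```python
def clipped_surrogate(ratio, advantage, clip_param=0.2):
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    # restrict the ratio to the interval [1 - clip_param, 1 + clip_param]
    clipped_ratio = max(1.0 - clip_param, min(ratio, 1.0 + clip_param))
    # take the pessimistic (smaller) of the unclipped and clipped objectives
    return min(ratio * advantage, clipped_ratio * advantage)


# Good action (A > 0): once the ratio exceeds 1 + eps the objective stops growing,
# so there is no incentive to push the policy further in one update.
print(clipped_surrogate(1.5, advantage=2.0))   # capped at roughly 1.2 * 2.0
# Bad action (A < 0): once the ratio drops below 1 - eps the objective is capped too.
print(clipped_surrogate(0.5, advantage=-2.0))  # capped at roughly 0.8 * -2.0
# Bad action made more likely: the minimum keeps the unclipped, worse value,
# so the penalty is never hidden by the clip.
print(clipped_surrogate(1.5, advantage=-2.0))  # unclipped: 1.5 * -2.0
```

Inside the clip range the two terms coincide, so the objective behaves like the plain ratio-weighted advantage.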

Another innovation in OpenAI's approach is running the optimizer for several epochs over minibatches sampled from the same batch of collected experience.
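
As a rough sketch of that update schedule, the snippet below splits one rollout into shuffled minibatches and counts the optimizer steps taken over several epochs. The helper names are hypothetical, and it assumes each epoch is one shuffled pass without replacement; the sampling in agent.py may differ.

```python
import random


def minibatch_indices(batch_size, mini_batch_size, rng=random):
    """Yield shuffled index lists that together cover the rollout exactly once."""
    indices = list(range(batch_size))
    rng.shuffle(indices)
    for start in range(0, batch_size, mini_batch_size):
        yield indices[start:start + mini_batch_size]


def count_optimizer_steps(batch_size=2048, mini_batch_size=128, ppo_epochs=4):
    """Count how many gradient steps are taken on a single rollout."""
    steps = 0
    for _ in range(ppo_epochs):
        for batch in minibatch_indices(batch_size, mini_batch_size):
            # a real update would evaluate the clipped loss on `batch`
            # and call the optimizer here
            steps += 1
    return steps


print(count_optimizer_steps())
```

Reusing the same rollout for several epochs is what the clipped objective makes safe: the ratio to the old policy limits how far those repeated steps can move the policy.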

More info on the details of the algorithm can be found in the paper.

In this implementation, I also use Generalized Advantage Estimation (GAE) to train the critic network.
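
A minimal sketch of GAE in plain Python follows. It is not the code from agent.py: the function name, the argument order and the masks convention (0.0 where an episode ended, 1.0 otherwise) are assumptions for illustration. The tau argument is the parameter usually written as lambda in the GAE paper.

```python
def compute_gae(rewards, values, next_value, masks, gamma=0.99, tau=0.95):
    """Return (advantages, returns) for one rollout of a single agent."""
    values = list(values) + [next_value]  # bootstrap value for the final step
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        # one-step TD error; the mask drops the bootstrap term at episode ends
        delta = rewards[t] + gamma * values[t + 1] * masks[t] - values[t]
        # exponentially weighted sum of TD errors, reset at episode ends
        gae = delta + gamma * tau * masks[t] * gae
        advantages[t] = gae
    # advantage plus value estimate gives a return target for the critic
    returns = [adv + v for adv, v in zip(advantages, values)]
    return advantages, returns


advantages, returns = compute_gae(
    rewards=[1.0, 1.0, 1.0], values=[0.5, 0.5, 0.5], next_value=0.5,
    masks=[1.0, 1.0, 1.0],
)
print(advantages, returns)
```

In this sketch the returns serve as regression targets for the critic, while the advantages weight the policy update.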

Parameters and hyperparameters

Neural network architecture

The model consists of two networks: the actor network and the critic network.

The actor network takes a state tensor as input and outputs an action.

The critic network is not strictly required by the PPO algorithm: the original paper describes a policy network and a surrogate function based on the ratio of new to old action probabilities, so an actor alone would suffice. The critic is nevertheless very helpful for computing advantages, which require a value estimate for each state.

Actor network

  • 3 fully connected layers
  • 33 input nodes: size of state vector
  • 4 output nodes: size of action vector
  • 256 hidden nodes in each layer
  • ReLU activations, tanh on last layer

Critic network

  • 3 fully connected layers
  • 33 input nodes: size of observation vector
  • 1 output node
  • 512 hidden nodes in each layer
  • ReLU activations, no activation on last layer
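
The two architectures listed above could be sketched in PyTorch roughly as follows. This reads "3 fully connected layers" as three nn.Linear layers per network, which is an assumption; the builder names are hypothetical, and how the project turns the actor's tanh output into a stochastic policy is not shown here.

```python
import torch
from torch import nn

STATE_SIZE = 33   # size of the state vector
ACTION_SIZE = 4   # size of the action vector


def build_actor(hidden_size=256):
    """Three linear layers, ReLU in between, tanh on the last layer."""
    return nn.Sequential(
        nn.Linear(STATE_SIZE, hidden_size), nn.ReLU(),
        nn.Linear(hidden_size, hidden_size), nn.ReLU(),
        nn.Linear(hidden_size, ACTION_SIZE), nn.Tanh(),
    )


def build_critic(hidden_size=512):
    """Three linear layers, ReLU in between, no activation on the last layer."""
    return nn.Sequential(
        nn.Linear(STATE_SIZE, hidden_size), nn.ReLU(),
        nn.Linear(hidden_size, hidden_size), nn.ReLU(),
        nn.Linear(hidden_size, 1),
    )


actor, critic = build_actor(), build_critic()
states = torch.zeros(8, STATE_SIZE)  # a hypothetical batch of 8 states
print(actor(states).shape, critic(states).shape)
```

The tanh keeps every action component in [-1, 1], while the critic's unbounded output can represent any state value.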

Algorithm hyperparameters

Hyperparameters that can be tuned from main.py:

    run_experiment(hidden_size=256, lr=1e-3, max_episodes=500, mini_batch_size=128,
                   nrmlz_adv=True, num_steps=2048, ppo_epochs=4, threshold_reward=30,
                   gamma=0.99, tau=0.95, clip_gradients=True)

  • hidden_size: number of neurons in each layer of each network (256)
  • gamma: the discount rate (0.99)
  • lr: the learning rate of both networks (1e-3)
  • tau: the "discount factor" for GAE, usually written as lambda in the GAE paper (0.95)
  • max_episodes: maximum number of episodes to train for (500)
  • num_steps: rollout length (2048)
  • mini_batch_size: number of samples per minibatch (128)
  • ppo_epochs: number of PPO epochs to run per update (4)
  • clip_gradients: gradients are clipped if set to True
  • nrmlz_adv: advantages are normalized if set to True
  • threshold_reward: training finishes when the moving average of the last 100 episodes exceeds this value (30)
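
The stopping rule behind threshold_reward can be sketched as below. The function name is hypothetical, and requiring a full 100-episode window before declaring the environment solved is an assumption; main.py may handle the first episodes differently.

```python
from collections import deque


def first_solved_episode(episode_scores, threshold_reward=30.0, window=100):
    """Return (episode, moving_average) at the first solved episode, else None."""
    recent = deque(maxlen=window)  # keeps only the last `window` scores
    for episode, score in enumerate(episode_scores):
        recent.append(score)
        moving_average = sum(recent) / len(recent)
        # assumption: only stop once the window is completely filled
        if len(recent) == window and moving_average > threshold_reward:
            return episode, moving_average
    return None


# Hypothetical score curve: 50 episodes at 0, then a steady 40 per episode.
print(first_solved_episode([0.0] * 50 + [40.0] * 200))
```

Averaging over a window means a single lucky episode above the threshold is not enough to stop training.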

Results

It was extremely hard to get this agent to learn. I tried several different sets of hyperparameters, none of which helped at all. Even adding GAE did not help learning. The key ingredient that made the agent learn was advantage normalization. In case you'd like to see this for yourself, I added a special hyperparameter called nrmlz_adv. If you set

 nrmlz_adv = False

you can watch your agents going nowhere at all.
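
Advantage normalization usually means rescaling the advantages of a batch to zero mean and unit standard deviation before they enter the loss. A minimal plain-Python sketch follows; the function name, the eps constant and the use of the population standard deviation are assumptions, not details taken from agent.py.

```python
from statistics import mean, pstdev


def normalize_advantages(advantages, eps=1e-8):
    """Shift and scale advantages to roughly zero mean and unit variance."""
    mu = mean(advantages)
    sigma = pstdev(advantages)
    # eps guards against division by zero when all advantages are equal
    return [(a - mu) / (sigma + eps) for a in advantages]


print(normalize_advantages([1.0, 2.0, 3.0, 4.0]))
```

After normalization the magnitude of the policy gradient no longer depends on the raw reward scale, which is one plausible reason it mattered so much here.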

With the right set of hyperparameters, the results are as follows:

Episode 0, Total score this episode: 0.4309999903663993, Last 0 average: 0.4309999903663993
...
Episode 55, Total score this episode: 30.905999309197068, Last 55 average: 13.403428271838598
Episode 121, Total score this episode: 37.10649917060509, Last 100 average: 30.123479326687754

Environment solved in 121 episodes! Average Score: 30.12

Scores of the last 100 episodes and moving average of all episodes: [figure: results]

Next steps

To improve the agent's performance

Resources that helped me implement this project

higgsfield's RL-Adventure-2

OpenAI Blog post

PPO with Sonic the Hedgehog

OpenAI Spinning Up docs on PPO

Deep Reinforcement Learning in Action, chapter 'Policy Gradient Methods'