Recurrent PPO #53

Merged
araffin merged 60 commits into master from feat/ppo-lstm on May 30, 2022

@araffin

araffin commented Nov 29, 2021

Member

Description

Experimental version of PPO with LSTM policy.

Current status: usable but not polished, see #53 (comment)

Missing:

Known issue: if the model was trained on GPU and tested on CPU, a warning will be issued because the LSTM initial states cannot be unpickled. This is OK, as they will be reset anyway in setup_model(), and it doesn't affect prediction.

Context

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (update in the documentation)

Checklist:

  • I've read the CONTRIBUTION guide (required)
  • The functionality/performance matches that of the source (required for new training algorithms or training-related features).
  • I have updated the tests accordingly (required for a bug fix or a new feature).
  • I have included an example of using the feature (required for new features).
  • I have included baseline results (required for new training algorithms or training-related features).
  • I have updated the documentation accordingly.
  • I have updated the changelog accordingly (required).
  • I have reformatted the code using make format (required)
  • I have checked the codestyle using make check-codestyle and make lint (required)
  • I have ensured make pytest and make type both pass. (required)

Note: we are using a maximum length of 127 characters per line

@HamiltonWang

For the time being, is there any way we can feed the older data back to the trainer as input, to mimic a crude version of an LSTM? Is there any sample code to do that?

@araffin

araffin commented May 3, 2022

Member Author

For the time being, is there any way we can feed the older data back to the trainer as input, to mimic a crude version of an LSTM? Is there any sample code to do that?

Hello,
you can already use this PR if you need to use recurrent PPO (see install from source in our doc), otherwise you can use frame stacking or history wrapper (see code in the RL Zoo).
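
The frame stacking / history idea mentioned above can be sketched in a few lines. This is a minimal, library-free illustration, not the RL Zoo code: the class name `ObservationHistory` and its methods are hypothetical, and it assumes flat list observations of fixed length. In Stable-Baselines3 itself, `VecFrameStack` from `stable_baselines3.common.vec_env` provides frame stacking for vectorized environments.

```python
from collections import deque


class ObservationHistory:
    """Hypothetical helper: keeps the last `n_stack` observations, concatenated."""

    def __init__(self, n_stack):
        self.n_stack = n_stack
        # deque with maxlen drops the oldest observation automatically
        self.frames = deque(maxlen=n_stack)

    def reset(self, first_obs):
        """Start a new episode: pad the history with zeros, then add the first observation."""
        self.frames.clear()
        for _ in range(self.n_stack - 1):
            self.frames.append([0.0] * len(first_obs))
        self.frames.append(list(first_obs))
        return self.stacked()

    def step(self, obs):
        """Append the newest observation and return the stacked history."""
        self.frames.append(list(obs))
        return self.stacked()

    def stacked(self):
        """Flatten the history, oldest observation first."""
        return [value for frame in self.frames for value in frame]


history = ObservationHistory(n_stack=3)
print(history.reset([1.0, 2.0]))
print(history.step([3.0, 4.0]))
```

The stacked vector is what would be passed to the policy in place of the raw observation, so a feed-forward network sees a short window of the past. Unlike an LSTM, the window length is fixed and anything older is forgotten.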

@henrydeclety

when LSTM for A2C?

@araffin

araffin commented May 11, 2022

Member Author

when LSTM for A2C?

a2c is a special case of ppo ;) (cc @vwxyzjn )

@vwxyzjn

vwxyzjn commented May 11, 2022

Contributor

@henrydeclety see https://github.com/vwxyzjn/a2c_is_a_special_case_of_ppo. We have a paper coming out soon...

@vwxyzjn

vwxyzjn commented May 20, 2022

Contributor

The preprint of the paper is out at https://arxiv.org/abs/2205.09123 @henrydeclety :)
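
One piece of the intuition behind "a2c is a special case of ppo" can be checked numerically: when the current policy equals the policy that collected the data (probability ratio of 1, as on the first gradient step), the gradient of PPO's clipped surrogate matches the A2C policy-gradient term. The sketch below is only an illustration under assumptions: a toy two-action policy with a single parameter, hypothetical function names, and finite-difference gradients. It does not cover the other settings that must also match for the two algorithms to coincide.

```python
import math


def action_prob(theta, action):
    """Toy two-action policy with logits (theta, 0)."""
    p0 = 1.0 / (1.0 + math.exp(-theta))
    return p0 if action == 0 else 1.0 - p0


def ppo_surrogate(theta, theta_old, action, advantage, clip_range=0.2):
    """PPO clipped surrogate objective for a single sample."""
    ratio = action_prob(theta, action) / action_prob(theta_old, action)
    clipped_ratio = max(1.0 - clip_range, min(1.0 + clip_range, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


def a2c_objective(theta, action, advantage):
    """A2C policy-gradient objective for a single sample."""
    return math.log(action_prob(theta, action)) * advantage


def numerical_grad(f, x, eps=1e-6):
    """Central finite-difference estimate of df/dx."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)


theta_old = 0.3
action, advantage = 0, 1.5
g_ppo = numerical_grad(lambda th: ppo_surrogate(th, theta_old, action, advantage), theta_old)
g_a2c = numerical_grad(lambda th: a2c_objective(th, action, advantage), theta_old)
print(g_ppo, g_a2c)  # the two estimates agree up to numerical error
```

Clipping plays no role here because the ratio stays at 1 when evaluated at `theta_old`; it only starts to matter once the policy has moved away from the data-collecting policy over further gradient steps.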

@HamiltonWang

Hello, you can already use this PR if you need to use recurrent PPO (see install from source in our doc), otherwise you can use frame stacking or history wrapper (see code in the RL Zoo).

I’ll give it a try

@EloyAnguiano

How could I configure the maximum sequence length for the LSTM?

@philippkiesling

philippkiesling commented Mar 15, 2023

@EloyAnguiano As far as I could tell from the code, the implementation in SB3 does not have a sequence length parameter. Instead, it saves the hidden state between steps of your environment and feeds it back in as input at the next step. So the maximum sequence length for the LSTM would be the number of steps (n_steps) collected before you update your policy.

This way you only need to compute each input once, instead of refeeding it every new step.
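
The difference between carrying the hidden state and re-feeding the history can be shown with a toy recurrent cell. This is a sketch under assumptions, not the SB3 implementation: `rnn_cell` is a hypothetical scalar stand-in for an LSTM, and the weights are arbitrary. Carrying the state processes each input once, and gives the same result as recomputing from the start of the sequence at every step.

```python
import math


def rnn_cell(x, h, w_x=0.5, w_h=0.9):
    """Toy scalar recurrent cell standing in for an LSTM (hypothetical weights)."""
    return math.tanh(w_x * x + w_h * h)


def rollout_with_carried_state(inputs, episode_starts, h0=0.0):
    """Process each input once, carrying the hidden state between steps."""
    h = h0
    outputs = []
    for x, start in zip(inputs, episode_starts):
        if start:
            h = h0  # reset the hidden state at the start of an episode
        h = rnn_cell(x, h)
        outputs.append(h)
    return outputs


def recompute_from_history(inputs, t, h0=0.0):
    """Alternative: re-feed the whole history up to step t (t + 1 cell calls)."""
    h = h0
    for x in inputs[: t + 1]:
        h = rnn_cell(x, h)
    return h


inputs = [0.1, -0.4, 0.7, 0.2, -0.3]  # one rollout of 5 steps
starts = [True, False, False, False, False]
carried = rollout_with_carried_state(inputs, starts)
print(carried[-1], recompute_from_history(inputs, len(inputs) - 1))
```

In this sketch the longest dependency the cell can see is bounded by the rollout length (or by the last episode reset), which mirrors the point about n_steps above.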

Development

Successfully merging this pull request may close these issues.

recurrent policy implementation in ppo [feature-request]