In losses.py, I noticed that the code includes the following step before returning value targets and advantages:
advantages = (rewards + discount * (1 - termination) * vs_t_plus_1 - values) * truncation_mask
From what I understand, compute_vs_minus_v_xs should return the standard GAE result. Why do we perform an additional TD computation at the end?
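To make the question concrete, here is a minimal plain-Python sketch of how I read the two quantities. It is not the code from losses.py: the function names `gae` and `td_on_lambda_targets` and the `lam` parameter are hypothetical, and I assume no terminations or truncations. Under those assumptions the final advantage works out to delta_t + discount * A_{t+1}, so it matches GAE only when lam = 1.

```python
def gae(rewards, values, bootstrap_value, discount, lam):
    """Standard GAE recursion: A_t = delta_t + discount * lam * A_{t+1}."""
    next_values = values[1:] + [bootstrap_value]
    advantages = [0.0] * len(rewards)
    acc = 0.0
    # Walk the trajectory backwards, accumulating discounted TD errors.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + discount * next_values[t] - values[t]
        acc = delta + discount * lam * acc
        advantages[t] = acc
    return advantages


def td_on_lambda_targets(rewards, values, bootstrap_value, discount, lam):
    """One-step TD error against the value targets vs = values + GAE."""
    vs = [v + a for v, a in
          zip(values, gae(rewards, values, bootstrap_value, discount, lam))]
    # Shift the targets by one step; the last step bootstraps from the critic.
    vs_t_plus_1 = vs[1:] + [bootstrap_value]
    return [r + discount * v_next - v
            for r, v_next, v in zip(rewards, vs_t_plus_1, values)]


# Made-up trajectory, purely for illustration.
rewards = [1.0, 0.5, -0.2, 2.0]
values = [0.3, 0.1, 0.4, 0.2]
print(gae(rewards, values, 0.7, 0.99, 0.9))
print(td_on_lambda_targets(rewards, values, 0.7, 0.99, 0.9))
```

If this reading is right, the extra step drops the lam factor on the first bootstrapped term rather than recomputing GAE.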
Second, I was hoping to ask why the value loss includes an extra factor of 0.5:
v_loss = jnp.mean(v_error * v_error) * 0.5 * 0.5
Both of these choices seem non-standard. Did you find that they improved performance empirically?