Hello. Can you please explain why you are using mb_returns = mb_advs (the GAE estimates) + mb_values as the returns for computing the critic loss? Shouldn't the value function approximate the discounted sum of rewards instead? E.g., R = gamma * R + rewards[i]; value_loss = value_loss + 0.5 * (R - values[i]).pow(2).
If I understand correctly, the value function depends on the parameter γ and not on the parameter λ, based on the paper https://arxiv.org/pdf/1506.02438.pdf. However, if I use the GAE(γ, λ) advantages to compute the returns and then use those returns to train the critic, wouldn't the value function become V(γ, λ) instead of V(γ)? And if so, would the TD residual still be correctly computed with V(γ, λ)?
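For reference, here is a minimal sketch (not the repository's code; the names rewards, values, dones, last_value, gamma, and lam are hypothetical, assumed to be 1-D tensors / scalars for a single rollout of length T) contrasting the two critic targets being discussed: the GAE-based returns (advantages + values, i.e. the λ-return) versus the plain discounted sum of rewards.

```python
import torch

def gae_returns(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Critic target built as GAE(gamma, lam) advantages + values (the lambda-return)."""
    T = rewards.shape[0]
    advs = torch.zeros(T)
    next_adv, next_value = 0.0, last_value
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value * nonterminal - values[t]
        next_adv = delta + gamma * lam * nonterminal * next_adv
        advs[t] = next_adv
        next_value = values[t]
    return advs + values  # corresponds to the questioned mb_advs + mb_values

def discounted_returns(rewards, dones, last_value, gamma=0.99):
    """Critic target as the plain discounted sum of rewards, as suggested in the question."""
    T = rewards.shape[0]
    rets = torch.zeros(T)
    R = last_value
    for t in reversed(range(T)):
        R = rewards[t] + gamma * R * (1.0 - dones[t])
        rets[t] = R
    return rets
```

Note that the two coincide only for lam = 1; for lam < 1 the GAE-based target is a bootstrapped λ-return rather than the pure Monte Carlo discounted return, which is exactly the V(γ) vs. V(γ, λ) distinction raised above.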