Hello. Can you please explain why you are using mb_returns = mb_advs (the GAE estimates) + mb_values as the returns for computing the critic loss? Shouldn't the value function approximate the discounted sum of rewards instead? E.g., R = gamma * R + rewards[i]; value_loss = value_loss + 0.5 * (R - values[i]).pow(2).
If I understand correctly, the value function depends on the parameter γ and not on the parameter λ, based on the paper https://arxiv.org/pdf/1506.02438.pdf. However, if I use the GAE(γ, λ) advantages to compute the returns and then use those returns to train the critic, wouldn't the value function become V(γ, λ) instead of V(γ)? And if so, would the TD residual still be correctly computed with V(γ, λ)?
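For reference, here is a minimal sketch (not the repository's code; the names rewards, values, dones, last_value, gamma, and lam are hypothetical, assumed to be 1-D tensors / scalars for a single rollout of length T) contrasting the two critic targets being discussed: the GAE-based returns (advantages + values, i.e. the λ-return) versus the plain discounted sum of rewards.

```python
import torch

def gae_returns(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Critic target built as GAE(gamma, lam) advantages + values (the lambda-return)."""
    T = rewards.shape[0]
    advs = torch.zeros(T)
    next_adv, next_value = 0.0, last_value
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value * nonterminal - values[t]
        next_adv = delta + gamma * lam * nonterminal * next_adv
        advs[t] = next_adv
        next_value = values[t]
    return advs + values  # corresponds to the questioned mb_advs + mb_values

def discounted_returns(rewards, dones, last_value, gamma=0.99):
    """Critic target as the plain discounted sum of rewards, as suggested in the question."""
    T = rewards.shape[0]
    rets = torch.zeros(T)
    R = last_value
    for t in reversed(range(T)):
        R = rewards[t] + gamma * R * (1.0 - dones[t])
        rets[t] = R
    return rets
```

Note that the two coincide only for lam = 1; for lam < 1 the GAE-based target is a bootstrapped λ-return rather than the pure Monte Carlo discounted return, which is exactly the V(γ) vs. V(γ, λ) distinction raised above.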