Positive policy decay #2093

Open: wants to merge 3 commits into base: master


Naphthalin
Contributor

This is a reimplementation of #1173 and #1288 without the child visit boost part, aimed at supplementing CPuct scaling. Setting either PolicyDecayFactor or PolicyDecayExponent to 0 effectively turns the policy decay off, so it should be tuning-friendly.

@Naphthalin added the labels enhancement (New feature or request), rfc (Request for comments), and testing required (Feature/bug fix needs more testing. Implies not for merge.) on Dec 17, 2024
@Naphthalin
Contributor Author

Some words of explanation:

  • PUCT as AlphaZero uses it treats policy as a multiplicative factor in the U term, and that factor persists even at millions of nodes. The original reference, by contrast, treats policy as an additional "delaying" term that decays to 0 with more nodes.
  • This approach (the first version was a semi-April-Fools joke in 2020) picks up the general idea of decaying the policy's effect, but does so by modifying the multiplicative P% term, which keeps the nature of the PUCT formula intact.
  • It's crucial that policy decay (and, more generally, any modification to PUCT) is done so that a node's U term still shrinks after that node is visited and grows slightly after a sibling is visited. That is why "positive policy decay" is needed; otherwise search inconsistencies like those in #1150 (Add visit-based policy temperature decay) happen.
  • In effect, this removes the "policy sums up to 100%" condition; put differently, it effectively increases cpuct, especially in positions with multiple good moves, on top of cpuct scaling.
  • Mathematically, the formula converts policy into a logit and adds a logarithmically growing term. As N -> inf, every P% grows towards 100%, though the initial P% value determines how much this growth is delayed.
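
To make the last point concrete, here is a minimal Python sketch of one formula with the stated properties. It is not taken from the PR's code: the exact functional form, logit(P_eff) = logit(P) + exponent * log(1 + factor * N), is an assumption chosen only because it adds a logarithmically growing term to the logit and vanishes when either parameter is 0. The names `effective_policy`, `decay_factor` and `decay_exponent` are hypothetical stand-ins for PolicyDecayFactor and PolicyDecayExponent, and `n` is assumed to be a visit count.

```python
import math


def effective_policy(p, n, decay_factor, decay_exponent):
    """Illustrative policy decay in logit space (assumed form, not the PR's code).

    logit(p_eff) = logit(p) + decay_exponent * log(1 + decay_factor * n)
    """
    if p <= 0.0 or p >= 1.0:
        # The logit is undefined at the endpoints, so leave such values alone.
        return p
    logit = math.log(p / (1.0 - p))
    # Logarithmically growing term; it is exactly 0 if either parameter is 0.
    boost = decay_exponent * math.log1p(decay_factor * n)
    # Map back from logit space to a probability-like value in (0, 1).
    return 1.0 / (1.0 + math.exp(-(logit + boost)))


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only for illustration.
    for n in (0, 1_000, 100_000, 10_000_000):
        row = [effective_policy(p, n, 0.01, 0.5) for p in (0.01, 0.1, 0.5)]
        print(n, [round(x, 4) for x in row])
```

Under this assumed form, each value drifts towards 1 independently as `n` grows, so the values no longer sum to 100%, and a move with a small initial P% needs far more visits to reach a given level than a move with a large one.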

I expect this to work best together with a reduced CPuctFactor, with PolicyTemperature, CPuct and FpuValue optimized for STC. Tuning could therefore happen in two steps: first an STC tune (~1k npm) of those three parameters, then an LTC tune (>100k npm) of CPuctFactor, PolicyDecayExponent and PolicyDecayFactor.
