Positive policy decay #2093

Naphthalin · 2024-12-17T10:48:10Z

This is a reimplementation of #1173 and #1288 without the child visit boost part, aimed at supplementing CPuct scaling. Setting either PolicyDecayFactor or PolicyDecayExponent to 0 effectively turns the policy decay off, so it should be tuning friendly.


Naphthalin · 2024-12-17T23:32:40Z

Some words of explanation:

PUCT as AlphaZero uses it treats policy as a multiplicative factor on the U term, and that factor persists even at millions of nodes. The original reference, by contrast, treats policy as an additional "delaying" term that decays to 0 with more nodes.
This approach (the first version was a semi April Fools' joke in 2020) picks up the general idea of decaying the policy effect, but does so by modifying the multiplicative P% term, which keeps the nature of the PUCT formula intact.
It is crucial that policy decay (and generally any modification to PUCT) is done in a way that the U term still gets reduced after a visit and slightly increased after a sibling is visited. This is why "positive policy decay" is needed; otherwise search inconsistencies like those in #1150 (Add visit-based policy temperature decay) occur.
In effect, this removes the "policy sums up to 100%" condition. Put differently, it effectively increases cpuct on top of cpuct scaling, especially in positions with multiple good moves.
Mathematically, the formula converts policy into a logit and adds a logarithmically growing term. As N -> inf, all P% grow towards 100%, though the initial P% value decides how long this growth is delayed.
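
To make the logit description above concrete, here is a minimal Python sketch. The exact expression used in this PR is not quoted in the discussion, so the form `logit(P_eff) = logit(P) + PolicyDecayExponent * log(1 + PolicyDecayFactor * N)` is an assumption chosen only to match the stated properties: it is a no-op when either parameter is 0, and it pushes every P% towards 100% as N grows. The function names, the use of parent visits for N, and the constant `cpuct` (no cpuct scaling) are likewise illustrative and not taken from the lc0 code.

```python
import math


def effective_policy(p, parent_visits, decay_factor, decay_exponent):
    """Policy after an ASSUMED positive decay (not the PR's verified formula).

    Assumed form:
        logit(p_eff) = logit(p) + decay_exponent * log(1 + decay_factor * N)
    where N is taken to be the parent visit count.
    """
    if p <= 0.0 or p >= 1.0:
        return p  # logit is undefined at the boundaries; leave the value alone
    # exp(decay_exponent * log(1 + decay_factor * N)), written as a power.
    growth = (1.0 + decay_factor * parent_visits) ** decay_exponent
    # Algebraically identical to sigmoid(logit(p) + log(growth)).
    return p * growth / (p * growth + (1.0 - p))


def u_term(p, parent_visits, child_visits, cpuct, decay_factor, decay_exponent):
    """AlphaZero-style U term with the decayed policy plugged in for P."""
    p_eff = effective_policy(p, parent_visits, decay_factor, decay_exponent)
    return cpuct * p_eff * math.sqrt(parent_visits) / (1.0 + child_visits)


if __name__ == "__main__":
    # Illustrative parameter values only; these are not tuned settings.
    for n in (100, 10_000, 1_000_000):
        row = [round(effective_policy(p, n, 0.001, 0.5), 4) for p in (0.02, 0.1, 0.5)]
        print(n, row)
```

Under this assumed form, the two consistency requirements mentioned above can be checked directly: visiting the child itself (parent and child visits both increase by one) should lower its U term, while visiting a sibling (only parent visits increase) should raise it slightly.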

I expect this to work best together with a reduced CPuctFactor, with PolicyTemperature, CPuct and FpuValue optimized for STC. Tuning could therefore happen in two steps: first an STC tune (~1k npm) of those three parameters, then an LTC tune (>100k npm) of CPuctFactor, PolicyDecayExponent and PolicyDecayFactor.

Naphthalin added 2 commits December 17, 2024 00:16

Added PolicyDecayExponent and PolicyDecayFactor for the positive poli…

fc55ea1

…cy decay feature

corrected typo

ea2fff6

Naphthalin added the enhancement, rfc, and testing required labels on Dec 17, 2024

Naphthalin mentioned this pull request Dec 17, 2024

Policy Scaling with tunable exponent #1288

Closed

Merge branch 'master' into positive-policy-decay

caf62c3
