Ideas:

- Comparing by win/loss against other agents
  - Compare against an agent that selects actions uniformly at random
  - Compare against our previous best baseline agent
  - Compare against Stockfish (at varying search depths)
- Comparing output move distributions
  - Compare the agent's ranked list of moves to the list of best moves output by Stockfish, using a similarity metric. One option is the Footrule distance. The plain Footrule distance treats a disagreement at the top of the list as being just as important as one at the bottom, but we care most about the "best" moves and the lower entries matter less, so we should consider a rank-weighted variant (see the sketch below).
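A minimal sketch of one possible weighted variant, assuming both lists are UCI move strings ordered best-first. The harmonic `1/(rank+1)` weights and the handling of moves missing from one list are arbitrary choices here, not something we've settled on:

```python
def weighted_footrule(agent_moves, stockfish_moves):
    """Spearman footrule between two ranked move lists, discounted so
    disagreements near the top of Stockfish's list cost more.

    Moves absent from the agent's list are treated as ranked just past
    the end of the longer list (the worst possible rank).
    """
    n = max(len(agent_moves), len(stockfish_moves))
    agent_rank = {m: i for i, m in enumerate(agent_moves)}
    total = 0.0
    for i, move in enumerate(stockfish_moves):
        j = agent_rank.get(move, n)   # missing move -> worst rank
        total += abs(i - j) / (i + 1)  # assumed harmonic decay by Stockfish rank
    return total

# Identical rankings score 0; swapping the top two moves costs more
# than swapping two moves further down the list:
# weighted_footrule(["e2e4", "d2d4"], ["d2d4", "e2e4"]) == 1.5
```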
In our last software meeting, we discussed using a series of puzzles (https://database.lichess.org/#puzzles) to estimate the Elo of the agent. Each puzzle has a distinct best line for the "player" to make, so we can use it to quantify performance objectively. One caveat is that there may be puzzles with multiple mate-in-one moves, in which case any move leading to checkmate should be counted as a solution. We can select a subset of puzzles and run them in the training loop every n iterations to quantify model improvement/deterioration.
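A rough sketch of what that evaluator could look like, using python-chess on the lichess puzzle CSV dump (which has `FEN` and `Moves` columns; in that format the first move in `Moves` is the opponent's setup move, and the player answers from move two). The `agent.best_move(board)` interface is hypothetical:

```python
import csv
import chess

def solve_rate(agent, puzzle_csv_path, limit=1000):
    """Fraction of lichess puzzles where the agent finds the intended line.

    Per the caveat above, any move that delivers checkmate counts as a
    solution even if it differs from the puzzle's intended move.
    """
    solved = tried = 0
    with open(puzzle_csv_path) as f:
        for row in csv.DictReader(f):
            if tried >= limit:
                break
            board = chess.Board(row["FEN"])
            line = row["Moves"].split()
            board.push(chess.Move.from_uci(line[0]))  # opponent's setup move
            ok = True
            for ply, uci in enumerate(line[1:]):
                expected = chess.Move.from_uci(uci)
                if ply % 2 == 0:  # player to move: agent must find the solution
                    move = agent.best_move(board)
                    board.push(move)
                    if move != expected:
                        # Alternative mates still count as solved.
                        ok = board.is_checkmate()
                        break
                else:  # opponent's reply is forced by the puzzle line
                    board.push(expected)
            tried += 1
            solved += ok
    return solved / tried if tried else 0.0
```

Filtering the CSV by its `Rating` column would also let us bucket results by puzzle difficulty and back out a rough Elo estimate.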
Remember that tactics only arise because the position is good. If you don't know how to play positionally and set up tactics, they will never show up in your games.
I agree that it's not all about tactics, but even if it were, there's no reason these two ratings should be in sync with each other. They are completely different systems: one is the result of head-to-head competition, the other is a solo endeavor where the "rating" you get assigned is really quite arbitrary.
So maybe we should use puzzles as a cheap first pass, and only if the new AI can solve them do we evaluate it with self-play against the previous best model?
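That two-stage gate could be as simple as the sketch below, reusing `solve_rate` from above; `play_match` (hypothetical) would play N games between the two agents and return the candidate's score in [0, 1], and the thresholds are placeholders:

```python
def evaluate_candidate(candidate, best_model, puzzle_csv_path,
                       puzzle_threshold=0.6, games=100, win_threshold=0.55):
    """Cheap puzzle screen first; expensive self-play arena only if it passes."""
    if solve_rate(candidate, puzzle_csv_path) < puzzle_threshold:
        return False  # fails the first pass, skip self-play entirely
    return play_match(candidate, best_model, games) >= win_threshold
```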