Truth From Zero?
Gödel’s Lost Letter and P=NP 2018-03-12
How we might compare AlphaZero against standards of perfection
[Photo source: 2015 YouTube lecture]

David Silver is the lead author of the paper, “Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm,” which was posted twelve days ago on the arXiv. It announces an algorithm called AlphaZero that, given the rules of any two-player game of strategy and copious hardware, trains a deep neural network to play the game at skill levels approaching perfection.
Today I review what is known about AlphaZero and discuss how to compare it with known instances of perfect play.
AlphaZero is a generalization of AlphaGo Zero, which was announced last October 18 on the Google DeepMind website under the heading “Learning From Scratch.” A paper in Nature, with Silver as lead author, followed the next day. Unlike the original AlphaGo, whose victory over the human champion Lee Sedol we covered, AlphaGo Zero had no input other than the rules of Go and some symmetry properties of the board. From round-the-clock self-play it soon acquired as tutor the world’s best player—itself.
The achievements in Go and Shogi—the Japanese game whose greater depth in relation to Western chess we discussed three years ago—strike us as greater than AlphaZero’s score of 28 wins, 72 draws, and no losses against the champion Stockfish chess program. One measure of greatness comes from the difference in Elo ratings between the machine and the best human players. AlphaGo Zero’s measured rating of 5185 is over 1,500 points higher than that of the best human players on the scale used in Go. In Shogi, the paper shows AlphaZero zooming toward 4500, whereas the top human rating shown here as of 11/26/17 is 2703, again a difference of over 1,500. In chess, however, as shown in the graphs atop page 4 of the paper, AlphaZero stays under 3500, which is less than 700 points ahead of the best humans.
Bumping Against Perfection
Although AlphaZero’s 64–36 margin over Stockfish looks like a shellacking, it amounts to a difference of only about 100 points on the Elo scale. The scale was built around the idea that a 200-point difference corresponds to about 75% expectation for the stronger player—and this applies to all games. Higher gains become multiplicatively harder to achieve and maintain. This makes the huge margins in Go and Shogi all the more remarkable.
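For a concrete check, the logistic form of the Elo expectation says a rating difference of D gives the stronger player an expected score of

E = \frac{1}{1 + 10^{-D/400}},

so D = 200 yields E \approx 0.76, the roughly-75% benchmark. Inverting the formula for AlphaZero’s 64% score gives D = 400\log_{10}(0.64/0.36) \approx 100.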
There has been widespread criticism of the way Stockfish was configured for the match. Stockfish was given 1 minute per move regardless of whether the move was an obvious recapture or a critical decision. It played without its customary opening book or endgame tables of perfect play with 6 or fewer pieces. The 64 core threads it was given were ample hardware, but they communicated position evaluations via a hash table of only one gigabyte, a shortfall said to harm the accuracy of deeper searches. However hobbled, what stands out is that Stockfish still drew almost three-fourths of the games, including exactly half of the games in which it played Black.
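For illustration, here is how such a configuration might be reproduced with the python-chess library. The engine path and option values are my assumptions matching the match description, not DeepMind’s actual harness:

```python
import chess
import chess.engine

# A sketch of the reported match settings using python-chess.
# "stockfish" is assumed to be an executable on your PATH. 64 threads
# and a 1 GB hash mirror the paper's description; no opening book or
# tablebase path is configured, matching the criticized setup.
engine = chess.engine.SimpleEngine.popen_uci("stockfish")
engine.configure({"Threads": 64, "Hash": 1024})

board = chess.Board()
# A fixed one minute per move, regardless of the position's difficulty.
result = engine.play(board, chess.engine.Limit(time=60.0))
print(result.move)
engine.quit()
```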
I have fielded numerous queries these past two weeks about how this affects my estimate that perfect play in chess is rated no higher than 3500 or 3600, which many others consider low. Although the “rating of God” moniker is played up for popular attention, it really is a vital component in my model: it is the y-intercept of regressions of player skill versus model parameters and inputs. I’ve justified it intuitively by postulating that slightly randomized versions of today’s champion programs could score at least 10–15% against any strategy. I regard the ratings used for the TCEC championships as truer to the human scale than the CCRL ratings. TCEC currently rates the latest Stockfish version at 3226, then 3224 for Komodo and 3192 for the Houdini version that won the just-completed 10th TCEC championship. CCRL shows all of Houdini, Komodo, and an assembly-coded version of Stockfish above 3400. Using the TCEC ratings and the official Elo “p-table” implies that drawing 2 or 3 of every 20 games holds the stronger player to the 3500–3600 range.
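To see the arithmetic behind that ceiling, suppose a program rated 3226 scores fraction p against perfect play. Under the logistic rule, the rating R of perfect play satisfies

R \le 3226 + 400\log_{10}\frac{1-p}{p},

which is about 3226 + 301 \approx 3527 for p = 0.15 and 3226 + 382 \approx 3608 for p = 0.10. The official p-table, based on the normal curve rather than the logistic, gives slightly lower but similar figures.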
Of course, the difference from Go or Shogi owes to the prevalence of draws in chess. One ramification of a post I made a year ago is that the difference is not merely felt at the high end of skill. The linear regressions of Elo versus player error shown there are so sharp that the y-intercept is already well determined by the games of weaker players alone.
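As a schematic of that regression argument, with made-up but qualitatively similar numbers:

```python
import numpy as np

# Schematic of the intercept argument with invented data: if average
# error per move (x) predicts Elo rating (y) linearly, then the
# intercept at x = 0 estimates the rating of error-free play.
avg_error = np.array([0.30, 0.25, 0.20, 0.15, 0.10, 0.05])
elo       = np.array([1500, 1850, 2200, 2550, 2900, 3250])
slope, intercept = np.polyfit(avg_error, elo, 1)
print(round(intercept))   # -> 3600, the extrapolated "rating of God"
```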
Overall, I don’t know how the AlphaZero paper affects my estimates. The Dec. 5 paper is sketchy, and only 10 of the 100 games against Stockfish have been released, all hand-picked wins. I share some general scientific caveats voiced by the AI researcher and chess master Jose Camacho-Collados. I agree that two of AlphaZero’s ideas (21.Bg5!!, and 30.Bxg6!! followed by 32.f5!!, as featured here) were ethereal. There are, however, several other possible ways to tell how close AlphaZero comes to perfection.
Some Possible Experiments
One experiment is simply to give AlphaZero an old-fashioned examination on test positions for which the perfect answers are known. These could even be generated in a controlled fashion from chess endgames with 7 or fewer pieces on the board, for which perfect play was tabulated by Victor Zakharov and Vladimir Makhnichev using the Lomonosov supercomputer of Moscow State University. Truth in those tables is often incredibly deep—in some positions the win takes over 500 moves, many of which no current chess program (not equipped with the tables), let alone human player, would find. Or one can set checkmate-in-n problems that have stumped programs to varying degrees. The question is:
With what frequency can the trained neural network plus Monte Carlo Tree Search (MCTS) from the given position find the full truth in the position?
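The Lomonosov tables themselves are not publicly downloadable, but the freely available Syzygy tables (which also cover up to 7 pieces) support the same kind of examination. A minimal sketch with python-chess, assuming the table files have been fetched to a local directory:

```python
import chess
import chess.syzygy

# Probe a 7-piece-or-fewer endgame position against Syzygy tables.
# "./syzygy" is an assumed local directory holding the .rtbw/.rtbz files.
board = chess.Board("8/8/8/8/8/2k5/2P5/2K5 w - - 0 1")  # a KPvK example
with chess.syzygy.open_tablebase("./syzygy") as tb:
    wdl = tb.probe_wdl(board)   # +2 win, 0 draw, -2 loss (side to move)
    dtz = tb.probe_dtz(board)   # distance to a zeroing move (50-move rule)
print(wdl, dtz)
```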
The trained neural network supplies prior probabilities for each move in any given position s. AlphaZero plays games against itself using those probabilities and samples the results. It then adjusts parameters to enhance the probabilities of moves having the highest expectation in the sample, in a continuous and recursive manner for positions encountered in the search from s. The guiding principle can be simply stated as:
“Nothing succeeds like success.”
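To make the loop concrete, here is a deliberately tiny, tabular caricature in Python: the “network” is just a probability table over moves, the “game” is Nim (take 1 or 2 stones; whoever takes the last stone wins), and the update boosts whatever the eventual winner played. Everything here is invented for illustration; the real system uses a deep network, MCTS visit counts, and gradient updates.

```python
import random
from collections import defaultdict

MOVES = (1, 2)   # Nim: remove 1 or 2 stones; taking the last stone wins

def legal(pos):
    return [m for m in MOVES if m <= pos]

# "Network": a table of move probabilities per position (toy stand-in).
policy = defaultdict(lambda: {m: 1.0 / len(MOVES) for m in MOVES})

def self_play(pos):
    """Play one game from pos; reinforce the winner's moves."""
    history, sign = [], +1            # sign flips with the side to move
    while pos > 0:
        opts = legal(pos)
        m = random.choices(opts, [policy[pos][o] for o in opts])[0]
        history.append((pos, m, sign))
        pos -= m
        sign = -sign
    winner = -sign                    # whoever just took the last stone
    for p, m, s in history:
        if s == winner:               # "nothing succeeds like success"
            policy[p][m] *= 1.05
            total = sum(policy[p].values())
            for k in policy[p]:       # renormalize to keep a distribution
                policy[p][k] /= total

random.seed(0)
for _ in range(5000):
    self_play(random.randint(1, 9))

# Positions divisible by 3 are theoretically lost; elsewhere the winning
# move leaves a multiple of 3. See what the learned table prefers:
for pos in range(2, 10):
    print(pos, max(legal(pos), key=lambda m: policy[pos][m]))
```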
We must pause to reflect on how clarifying it is that this single heuristic suffices to master complex games—games that also represent a concrete face of asymptotic complexity insofar as their n-by-n generalizations are polynomial-space hard. A famous couplet by Piet Hein goes:
Problems worthy of attack prove their worth by hitting back.
It may be that we can heuristically solve some NP-type problems better by infusing an adversary—to make a PSPACE-type problem that hits back—and running AlphaZero.
As seekers of truth, however, we want to know how AlphaZero will serve as a guide to perfection. We can regard a statement of the form, “White can win in 15 moves” (that is, 29 plies counting both players’ moves), as a theorem for which we seek a proof. We can regard the standard alpha-beta search backbone as one proof principle and MCTS as another. Which ‘logic’ is more powerful and reliable in practice?
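To illustrate the theorem-and-proof view, here is a minimal AND-OR search in Python on the same toy Nim game as above: win(pos, plies) tries to prove that the side to move can force a win within the budget, while loss(pos, plies) tries to prove it must lose. Alpha-beta and proof-number search are refinements of this principle; MCTS, by contrast, only samples it. This is my own sketch, not anything from the paper.

```python
def win(pos, plies):
    """Prove: the side to move can force a win within `plies` plies."""
    if pos == 0:
        return False      # previous player took the last stone; we lost
    if plies == 0:
        return False      # proof budget exhausted: unproven, not refuted
    return any(loss(pos - m, plies - 1) for m in (1, 2) if m <= pos)

def loss(pos, plies):
    """Prove: the side to move loses within `plies` plies however it plays."""
    if pos == 0:
        return True
    if plies == 0:
        return False
    return all(win(pos - m, plies - 1) for m in (1, 2) if m <= pos)

# "The side to move at 7 stones can win within 5 plies" is a theorem:
print(win(7, 5))   # True: take 1, leaving the lost position 6
print(win(6, 9))   # False: 6 is a multiple of 3, hence lost
```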
A second way to test perfection is to take strategy games that are just small enough to solve entirely, yet large enough that stand-alone programs cannot play perfectly on-the-fly. One candidate I offer is a game playable with chess pawns or checkers on a board with 5 rows and n columns, where perhaps n can be set to achieve the small-enough/large-enough balance. I conceived this 35 years ago at Oxford, when a particular value of n seemed right for the computers of the day. The starting position is:

[diagram of the starting position]
Each player’s pawns move one square forward or may “hop” over an opposing piece straight forward or diagonally forward. If some hop move is legal then the player must make a hop move. The hopped-over piece remains on the board. If a pawn reaches the last row it becomes a king and thereupon moves or hops backwards. No piece is ever captured.
The goal is to make your opponent run out of legal moves. If a king reaches the player’s first row it can no longer move. This implies a fixed horizon on a player’s count of moves. The trickiest rules involve double-hops: If a single hop and double hop are available then the double hop is not mandatory, but if a pawn on the first row begins a double hop it must complete it. Upon becoming a king after a hop, however, making a subsequent return hop is optional, except that a king that makes the first leg of a returning double hop must make the second leg. A final optional rule is to allow a king to move one cell back diagonally as well as straight.
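To give a feel for how the forcing rule shapes play, here is a partial Python sketch of move generation for this game. It covers only pawns and the mandatory-hop rule; kings, the double-hop subtleties, and the optional diagonal retreat are omitted, and the board encoding and column count are my own assumptions.

```python
# Partial sketch of move generation for the hop game described above.
# Board: dict mapping (row, col) -> 'W' or 'B'; White moves up (+1 row).
# The hopped-over piece stays on the board, so a move only relocates
# the moving pawn.

ROWS, COLS = 5, 8          # COLS stands in for the unspecified n

def pawn_moves(board, side):
    fwd = 1 if side == 'W' else -1
    steps, hops = [], []
    for (r, c), piece in board.items():
        if piece != side:
            continue
        for dc in (-1, 0, 1):                  # straight or diagonal hop
            over = (r + fwd, c + dc)
            land = (r + 2 * fwd, c + 2 * dc)
            if (board.get(over) not in (None, side)   # opposing piece to hop
                    and land not in board              # empty landing square
                    and 0 <= land[0] < ROWS and 0 <= land[1] < COLS):
                hops.append(((r, c), land))
        ahead = (r + fwd, c)                   # ordinary one-square advance
        if ahead not in board and 0 <= ahead[0] < ROWS:
            steps.append(((r, c), ahead))
    return hops if hops else steps             # a legal hop is mandatory
```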
From the starting position, White can force Black to hop four times in four moves by moving to a3, a4, d3, and d4. Then White still has the initiative and can either make a king or force another hop; the latter, however, forces White to hop diagonally in return. This seemed like a powerful beginning, but the subsequent Black initiative also looked strong. My playing partners at Oxford and I found that positional considerations—making the opponent’s pieces block themselves—mattered as much as the move-count allowance. This made it challenging and fun enough for casual human play, but we knew that computers would make quick work of it.
The point of using this or some other solved game would be to compare the strategy produced by the AlphaZero procedure against perfection—and against programs that use traditional search methods.
Open Problems
What do you think is the significance of the AlphaZero breakthrough?