Is There Momentum in Chess?
Gödel’s Lost Letter and P=NP 2018-03-28
And if so, what units does it have?
Fabiano Caruana has just won the 2018 chess Candidates Tournament. This earns him the right to challenge Magnus Carlsen for the world championship next November in London. A devastating loss on Saturday to the previous challenger, Sergey Karjakin, had seemed to presage a repeat of his last-round failure in the 2016 Candidates. But Caruana reversed the mojo completely with powerful victories in Monday’s and Tuesday’s last two rounds to win the tournament by a full point.
Today we congratulate Caruana on the triumph and hail the first time an American will challenge for the world championship since Bobby Fischer in 1972. And we ask, is there really such a thing as being “in” or “out of” form in chess and similar pursuits?
Summer 2014 saw Caruana pull off a stunning streak that put him in a pantheon with Fischer and the past champions Mikhail Tal and José Capablanca. Following a 1.5-point margin of victory in only 7 rounds at the elite Dortmund Sparkassen event, he rattled off 7 straight wins in what remains the highest-rated tournament in history, the 2014 Sinquefield Cup in St. Louis. The streak included a win over Carlsen, and he had Carlsen dead to rights again before allowing the draw that stopped it. His performance rating for the fortnight was over 3100, high above Carlsen’s then-current rating of 2877 and his own 2801, while I measured the intrinsic quality of Caruana’s moves at “only” 2995 during the win streak and 2925 overall.
Caruana had had a head of steam coming into the 2016 Candidates from a strong tie for second behind Carlsen at the January 2016 Tata Steel annual classic in Wijk aan Zee, Netherlands. In that event he had been tied for the lead before losing to Karjakin in the last round. By contrast, his run-up to the 2018 Candidates was listless. A roundup of predictions at Dennis Monokroussos’s blog The Chess Mind included the opinion:
“…Caruana seems to be off form, given his rather dismal performance at Wijk aan Zee.”
But what is “form” anyway? Does it exist? That is our subject.
Hot and Cold
Players and teams have been said to “have momentum” for decades. Belief that human competitors innately run “hot” and “cold” is endemic. Only in recent decades have there been attempts to measure such attributes or tell whether they exist at all.
We have covered the “hot hand” question in basketball and there was further controversy a year ago. The arguments turn on whether the distributions of performance outcomes are statistically indistinguishable from distributions with identifiably “random” causes. A fair coin cannot “be hot,” and dice players say “the dice were hot” only in retrospect of streaks that enabled them to win.
However, being off form is certainly real when one is ill or otherwise distracted. Is there an obverse? Can one be “super well” (leaving aside performance-enhancing drugs)? Is being “in the groove” prompted by physiological conditions, ones that self-reinforce?
This leads into a further matter that reinforces the “it’s all random” interpretation but also feeds into my alternative view. Suppose being under the weather or concretely burdened happens once every five tournaments, in a way that drops your performance by 200 rating points. Since your published rating averages over all five, it follows that in the other four tournaments you’ve averaged playing 50 points higher than that rating. By the Elo expectation table, the extra 50 points gives a 57% expectation against equal-rated opponents. This translates to about two-thirds of a point (an extra draw, sometimes a win) over a 9-round tournament and is enough of a difference to be felt as “being on.”
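As a quick check on this arithmetic, here is a minimal sketch in Python using the standard logistic Elo expectancy formula; the 50-point edge and 9-round length are simply the figures above, not parameters of my model.

```python
# Standard logistic Elo expectancy: expected score against an opponent
# rated `diff` points lower.
def elo_expected(diff):
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

e = elo_expected(50)      # about 0.571, i.e. roughly 57%
extra = 9 * (e - 0.5)     # about 0.64 extra points over a 9-round tournament
print("expectation %.3f, extra points over 9 rounds %.2f" % (e, extra))
```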
Thus a finding that players were “in form” more often than random simulations would expect could be explained this way, if the simulations naturally took each player’s published rating as the baseline for their projections. This widens the range of outcomes that would be judged consistent with the random-effect hypothesis. Studies of game results from massive data could test this. I have not expressly done so with my full statistical model, and this comment in the last basketball item above asked whether it has been done in chess. I can, however, give a partial indication that tends toward “no hot hand” in chess.
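Before turning to that, here is a minimal simulation sketch of the baseline effect just described. The once-in-five frequency, 200-point dip, 50-point surplus, 9-round length, and equal-rated opposition are the illustrative figures from above, and each game is crudely scored win-or-lose with the Elo expectancy as the win probability (ignoring draws changes the variance, not the averages). This is not my model, only an illustration of how projections keyed to the published rating can make the healthy tournaments register as “in form.”

```python
import random

def elo_expected(diff):
    # Standard logistic Elo expectancy against an opponent `diff` points lower.
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

def tournament_score(edge, rounds=9):
    # Crude win-or-lose scoring of `rounds` games against equal-rated opponents.
    p = elo_expected(edge)
    return sum(1.0 for _ in range(rounds) if random.random() < p)

random.seed(2018)
healthy, off = [], []
for _ in range(100000):
    if random.random() < 0.2:              # one tournament in five is an "off" one
        off.append(tournament_score(-200))
    else:                                  # otherwise play 50 points above the rating
        healthy.append(tournament_score(50))

print("projection from published rating: 4.50 / 9")
print("average score when healthy:       %.2f" % (sum(healthy) / len(healthy)))
print("average score when off:           %.2f" % (sum(off) / len(off)))
```

The healthy majority of tournaments comes out roughly two-thirds of a point above the 4.5/9 that a published-rating projection gives, which is just the surplus computed above.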
A Simple Test
The Candidates Tournament was a traditional double round-robin with eight players and fourteen rounds. Chess tournaments using the Swiss System can have hundreds of players. Their open character and often-high prize funds make them the most pressing to screen for possible cheating. The top 10 or 20 or more boards are usually played on equipment that automatically records the moves, which can be broadcast live or with some minutes’ delay. All remaining games are preserved only on paper scoresheets, which both players are required to submit. Some tournament staffs painstakingly type these games as well into PGN files, but others do not. Files from the latter kind of event, which contain only the automatically recorded games, I distinguish by putting “Avail” into their filenames.
In the first round, the players on the top boards are those with the highest ratings, and their opponents come from the second or third quartile according to the pairing system used. There is no selection bias in those first-round top games. But in all succeeding rounds, the top-board players are those who have kept or earned their place by winning. The “Avail” files hence should be biased toward players who are “in form.”
My screening tests use simple counting metrics: the number of agreements with the chess-playing program(s) used for the tests, and the total error judged by the computer in cases of disagreement. The latter is averaged over all moves, agreeing or not. The games are analyzed in the programs’ quick “Single-PV” mode, which is also their playing mode, rather than the “Multi-PV” analysis mode used by my full model, which takes hours per processor core per game. If the screening test raises any concern, then the full model can be run on the particular games involved.
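For concreteness, here is a bare-bones sketch of such a screening pass, using the python-chess library to drive a UCI engine in its ordinary Single-PV search. The engine path, fixed depth, opening-move cutoff, mate cap, and raw centipawn error are placeholders and stand-ins, not the settings or scaling of my actual screening tests.

```python
import chess
import chess.engine
import chess.pgn

ENGINE_PATH = "stockfish"   # assumes a UCI engine binary on the PATH
DEPTH = 13                  # illustrative fixed search depth
SKIP_MOVES = 8              # skip early opening moves; this cutoff is a placeholder

def screen_game(game, engine):
    """Count matches with the engine's move and total the eval drop on disagreements."""
    board = game.board()
    agreements, moves_scored, total_error = 0, 0, 0.0
    for ply, played in enumerate(game.mainline_moves()):
        mover = board.turn
        if ply // 2 < SKIP_MOVES:
            board.push(played)
            continue
        info = engine.analyse(board, chess.engine.Limit(depth=DEPTH))
        best = info["pv"][0]
        best_cp = info["score"].pov(mover).score(mate_score=10000)
        board.push(played)
        after = engine.analyse(board, chess.engine.Limit(depth=DEPTH))
        played_cp = after["score"].pov(mover).score(mate_score=10000)
        if played == best:
            agreements += 1
        else:
            total_error += max(best_cp - played_cp, 0)
        moves_scored += 1
    return agreements, moves_scored, total_error

engine = chess.engine.SimpleEngine.popen_uci(ENGINE_PATH)
with open("games.pgn") as f:            # any PGN file of tournament games
    while True:
        game = chess.pgn.read_game(f)
        if game is None:
            break
        agree, n, err = screen_game(game, engine)
        if n:
            print("%s - %s: matched %.1f%% of moves, avg error %.1f cp" % (
                game.headers.get("White", "?"), game.headers.get("Black", "?"),
                100.0 * agree / n, err / n))
engine.quit()
```

Aggregating the two counts per player and per tournament file gives the kind of figures compared below.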
The screening test likewise does not support the computation of an “Intrinsic Performance Rating” (IPR) for the games, as I did for Marcel Duchamp and George Koltanowski recently using my full model. But the data is large enough—and the metrics concrete enough—that if there were an “in form” effect then I would expect to see distinctly higher values in the “Avail” files.
I do not. The values I get from the two strands of Open tournaments agree well within the error bars, for several hundred tournaments of each kind per year. The “Avail” figure has actually been a little lower each year since 2016, according to both the Stockfish and Komodo chess programs. The average ratings in the two kinds of tournaments, each tournament’s average weighted by the number of moves in its games, are also close. This is only a quick test, giving more an “absence of evidence” verdict than “evidence of absence.” But it speaks most particularly against the hypothesis that form carries from one game into the next day, so that it would be denominated in units of tournaments. Whether it might carry from move to move, so that the unit becomes “playing a good game,” may require my full model, owing to covariances between moves that only it can compensate for. We can, however, consider a different notion of “form” that brings human qualities inside the ‘random’ picture.
What Is Being In Form?
The famous baseball manager Earl Weaver once defused a question of whether his team had momentum by retorting:
“Momentum? Momentum is the next day’s starting pitcher.”
What he meant includes all of these: His hitters who were hot could be stopped by an excellent opposing pitcher. His hitters who had struck out were accustomed to sloughing off one bad day and being fresh for the next, when the pitches might be easier to hit. And his own team’s fortunes would most likely depend on its starting pitcher, who hadn’t pitched in four or five days.
In chess, however, losses are said to “stay with” players. They are rarer for the elite and harder to slough off. That’s what made Caruana’s two-win finish remarkable. It came under the same last-round pressure that had marked his defeat in 2016 and that had tangibly caused the shocking double-loss finish of the 2013 Candidates. His round-12 loss last Saturday was horrific, a “positional crush.” Not only did he fall into a tie with Karjakin after having led alone since round 7, he stood to lose any tiebreaker with Karjakin or the player next on their heels, Shakhriyar Mamedyarov. So he not only needed to right the ship, he needed to rev it.
If I were to simulate tournaments randomly based on these eight players’ ratings, I would expect to generate a fair share in which the winner finished loss-win-win; a sketch of such a simulation appears after the game figures below. Maybe there were random factors that pinged in Caruana’s brain at the right times, not to mention going “pong” in the brains of his two unfortunate victims. My model can tell “ping” from “pong,” modulo high error bars. Here are its results for the last three games, with the caveat that they come from a “version 2.5” that has been fitted but not fully vetted and is a way-station while issues in my intended “version 3.0” are still being grappled with.
- Karjakin 2800 ± 675, 1-0 in 48 moves over Caruana 2760 ± 450.
- Caruana 2585 ± 825, 1-0 in 39 moves over Levon Aronian 1740 ± 1170.
- Alexander Grischuk 2425 ± 535, 0-1 in 69 moves to Caruana 2575 ± 485.
Since the players were all rated near 2800 this suggests more “pong” than “ping,” notwithstanding the enormous two-sigma error bars for measuring just one game. However, the last game’s figures arguably mislead because Caruana after turn 40 had Grischuk in a mortal lock and took his time—as observed here and here—rather than rush in with quick kills seen by the programs, while Grischuk had to try to thrash about. If we cut off at turn 40 then the figures become:
- Grischuk 2755 ± 745, 0-1 to Caruana 2940 ± 540.
This has its share of “ping.” For the whole tournament, my model tabs Caruana right at 2800, with less-yawning error bars of ± 160, and his opponents at 2640 ± 205.
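Here is the random-simulation sketch promised above: a double round-robin among eight players, with game results drawn from the Elo expectancy plus a flat draw rate, counting how often the simulated winner finishes loss-win-win. The ratings are placeholders near the players’ March 2018 values, and the 50% draw rate is an arbitrary illustrative figure; none of this is my model.

```python
import random

RATINGS = [2809, 2800, 2799, 2794, 2784, 2769, 2767, 2763]  # approximate, for illustration
DRAW_RATE = 0.5                                             # arbitrary flat draw rate

def elo_expected(diff):
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

def game_result(ra, rb):
    """One game: split the Elo expectancy into win/draw/loss probabilities."""
    e = elo_expected(ra - rb)
    p_win = max(e - DRAW_RATE / 2.0, 0.0)
    r = random.random()
    if r < p_win:
        return 1.0, 0.0
    if r < p_win + DRAW_RATE:
        return 0.5, 0.5
    return 0.0, 1.0

def round_robin(n):
    """Circle-method schedule: n-1 rounds of n//2 pairings each."""
    players = list(range(n))
    rounds = []
    for _ in range(n - 1):
        rounds.append([(players[i], players[n - 1 - i]) for i in range(n // 2)])
        players = [players[0], players[-1]] + players[1:-1]   # rotate all but the first
    return rounds

def winner_finished_loss_win_win():
    schedule = round_robin(8) * 2          # double round-robin, 14 rounds
    scores = [0.0] * 8
    history = [[] for _ in range(8)]       # each player's round-by-round results
    for rnd in schedule:
        for a, b in rnd:
            sa, sb = game_result(RATINGS[a], RATINGS[b])
            scores[a] += sa
            scores[b] += sb
            history[a].append(sa)
            history[b].append(sb)
    winner = max(range(8), key=lambda i: scores[i])   # ties broken arbitrarily
    return history[winner][-3:] == [0.0, 1.0, 1.0]

random.seed(14)
trials = 20000
hits = sum(winner_finished_loss_win_win() for _ in range(trials))
print("winner finished loss-win-win in %.1f%% of simulated tournaments" % (100.0 * hits / trials))
```

Counting that fraction over many trials gives a rough sense of how “fair” a share such finishes get under pure chance.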
I regard the error bars in my model as reflecting an innate lower bound on human variability. “Pings” and “pongs” happen in a way analogous to quantum uncertainty. But my final point is, they are human pings and pongs, not those of coins or dice. The training of my model at Caruana’s level is based on the actual recorded performance of his human peers (no computer games) over the history of chess. Those pings are in the brain, and even in basketball the pings outside the brain are in well-conditioned muscles and smooth joints.
Thus the human factors need to be inside, not outside, the randomized models: inside where our humanity retains the credit for them. In the sporting terms that chess professionals recognize, Caruana showed true grit in coming back. He showed the most consistent command, from his round-1 win over fellow American Wesley So clear through to the end. He was not sick and he put distractions aside, including his loss to Karjakin. Maybe we say this only in retrospect, and maybe the negative results on momentum say that the next time the situation comes up after round 12 we should not bet heavily on either a roaring comeback or a collapse. But the victory and the strong play on the whole, coming from his brain, constitute his having been in form.
Open Problems
How should questions of “momentum” and “form” be formulated—and should the two be treated differently?
Answer to the leprechaun puzzle in the previous post: N = 1 in the count of times Neil’s words use that letter, and similarly P = 0. Those are the two sufficient arithmetical conditions for P=NP, which is Neil’s theorem of choice. (Cued by hints in the text including Neil’s nerdiness.)