Retroactively Validated Hype
Statistical Modeling, Causal Inference, and Social Science 2026-01-18
Gary Smith is an economist with a sideline in AI debunking. We’ve pointed to some of these on the blog:
Jun 2021: “AI promised to revolutionize radiology but so far it’s failing”
Jan 2022: “Chatbots: Still Dumb After All These Years”
Jan 2022: A chatbot challenge for Blaise Agüera y Arcas and Gary Smith
Aug 2022: From chatbots and understanding to appliance repair and statistical practice
Jun 2023: AI as Wile E. Coyote, and a funny thing about Worstfish
May 2024: More red meat for you AI skeptics out there
Jan 2025: Large Language Models (LLMs) Flunk the Word Game Connections
Most recently, this hilarious story:
LLMs are undeniably astonishingly good at using the text they trained on (aided by human fine tuners) to generate convincing prose. But they are really bad at distinguishing between truth and falsehoods and responding to prompts that are unlike or even slightly different from what they trained on. . . .
A few days ago, Doug Hofstadter forwarded an interesting LLM experiment with OpenAI o1 by Abhijit Mahabal . . .
Abhijit: I want to discuss a chess alternative that I call NB-inverted chess. This is characterized by two changes.
LLM: Thought about NB-inverted chess rules for 4 seconds. Sure! I’d be happy to explore NB-inverted chess with you. Could you describe the two main rule changes that define it? Once I know those details, we can dive into how they affect gameplay, strategy, and overall feel of the variant.
Abhijit: First is the position of the pieces. Instead of Rook-Knight-Bishop, this has Rook-Bishop-Knight. And Second, a bishop here moves like a knight in traditional chess, and a knight here moves like a traditional bishop.
Notice the LLM’s false enthusiasm and the words “thought about,” as if it were actually a human who was excited and thinking. It is neither. Anyone who is familiar with chess and really thinking would recognize that switching the positions of the knight and bishop, and then having the knight move like a bishop and the bishop move like a knight, has absolutely no effect on the play of the game. Players just need to remember that knights now move like bishops and vice versa.
Not understanding the situation, OpenAI’s model launched into an extremely verbose and confused description and concluded:
LLM: NB-inverted chess adds a fresh layer of complexity and mind-bending tactics to a familiar board….The best way to get a feel for it is to dive in and play a few games. You’ll quickly discover new pitfalls—and new opportunities—that simply don’t exist in regular chess. Have fun exploring!
Hofstadter suggested an additional change:
Abhijit: Let’s add another twist, and call this the NB-inverted/resculpted chess. In this variant, we will resculpt the knight to look like a bishop, and resculpt the bishop to look like the knight. What are your strategy recommendations for playing this game with the three modifications?
These three modifications return the board to its original position, with bishops and knights where they normally are, looking like they normally look, and moving like they normally move. The game is unchanged. OpenAI’s model recognized none of this, though it reported that it had “thought about” it for a remarkable 28 seconds, and generated a tedious and strikingly misleading 1,406-word response with this conclusion:
LLM: The real fun lies in embracing the unfamiliar dynamics, looking for brand-new tactics, and building up fresh “muscle memory” for these pieces that look (and move) entirely unlike their standard counterparts. Enjoy the chaos and creativity—it’s a great way to break out of standard chess patterns and see how adaptable your board vision truly is!
Intrigued, I [Smith] tried an even simpler manipulation:
Gary: I’d like to discuss a new version of tic-tac-toe called 90-degree tic-tac-toe.
LLM: That sounds intriguing! What’s the twist in “90-degree Tic-Tac-Toe”? Does it involve rotating the board, or is there some other mechanic that changes the gameplay? Let me know the rules or your idea, and we can brainstorm or refine it together!
Gary: It is still a 3×3 grid but the grid is rotated 90-degrees to the right.
LLM: Interesting concept! By rotating the grid 90 degrees to the right, you’re visually changing the perspective of the classic 3×3 Tic-Tac-Toe board. The gameplay itself might remain the same, but the rotated perspective could create a novel experience for players. [This nonsense was followed by 227 words of blah-blah.]
The LLM clearly lacked the ability to visualize the tic-tac-toe grid and recognize that a 90-degree rotation leaves the grid exactly as it was originally. I then asked:
Gary: Do you think it would be better to rotate the 3×3 grid to the right or to the left?
LLM: Rotating the grid either to the right or left is largely a matter of preference, but each direction could subtly influence how players perceive and adapt to the game. Here are some considerations for both options: [Another 391 words of blah-blah, including this conclusion:
Which Is Better?
• If you want the game to feel approachable and intuitive, rotating to the right might be better.
• If you’re aiming for a twist that makes players rethink the game and adds a subtle layer of complexity, rotating to the left could be the way to go.
Ultimately, it depends on your target audience and the experience you want to create. You could even experiment with both directions and see which feels more engaging during gameplay testing!]
I then asked:
Gary: Which will be more difficult for expert players?
This elicited 254 more words of blah-blah, with this conclusion:
LLM: To challenge expert players and make the game feel fresh and complex, a left rotation would likely be the better choice. It disrupts their ingrained patterns, encourages more active thinking, and raises the game’s difficulty level.
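Both of these equivalences can be checked mechanically, which makes the chatbots’ failure all the more striking. Here is a minimal Python sketch (the move-rule labels are my own illustrative stand-ins, not anything from Smith’s or Mahabal’s posts) verifying that NB-inverted chess puts the same move rule on every starting square as ordinary chess, and that a 90-degree rotation maps the empty 3×3 grid to itself:

```python
# NB-inverted chess: the knight and bishop swap starting squares AND swap
# movement rules, so the move rule sitting on each file is unchanged.
standard_layout = ["R", "N", "B", "Q", "K", "B", "N", "R"]  # back rank, files a-h
inverted_layout = ["R", "B", "N", "Q", "K", "N", "B", "R"]  # squares swapped

standard_moves = {"R": "rook", "N": "knight", "B": "bishop",
                  "Q": "queen", "K": "king"}
inverted_moves = {"R": "rook", "N": "bishop", "B": "knight",  # rules swapped too
                  "Q": "queen", "K": "king"}

assert ([standard_moves[p] for p in standard_layout]
        == [inverted_moves[p] for p in inverted_layout])
# Same move rule on every square: the "variant" is ordinary chess relabeled.

# 90-degree tic-tac-toe: rotating the empty 3x3 grid clockwise changes nothing.
grid = [["." for _ in range(3)] for _ in range(3)]
rotated = [list(row) for row in zip(*grid[::-1])]  # standard 90-degree rotation
assert rotated == grid
```

The point of the sketch is just that a dozen lines of bookkeeping settle what the chatbot spent hundreds of words not noticing: both “variants” are the original game under a relabeling.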
This all reminds me of a point we’ve discussed before (see here and here), which is that chatbots produce text that reads like well-read humans writing on autopilot. If you asked me that chess-piece question, and I wasn’t fully paying attention, or if it were about some game I don’t know so much about (cricket, maybe?), I could imagine giving a response similar to the chatbot’s: yeah, hmmm, this is interesting, there must be some subtleties, etc. This sort of thing can happen when seeing a student presentation, or reading some research article that someone sends me: I read it in a sort of shallow, “linguistic,” way, not attending to the details. Listening to the music without paying attention to the words, as it were.
It takes an extra effort to focus and try to figure out exactly what’s going on. It’s like playing soccer: I can run while barely thinking about it, but if I want to go at top speed I need to apply my consciousness to the task.
Indeed, it might be that if you go back to the chatbot in the above examples and respond, “Hey, chatbot! Focus on the question. Those chess pieces are just being re-labeled!”, then maybe it would give a more reasonable answer. I don’t really know. I do know that if I’m in a meeting and I’m suddenly motivated to focus, I can do so and up my level. Indeed, a colleague recently told me he prefers meeting with me in the morning rather than in the afternoon because in the morning I’m more focused.
One way to think about all of this is to think about intelligence as something you do rather than something you are.
Let’s forget about me for a moment and consider the example of a student answering an essay exam question.
A common mode of operation is that a student will read the question and then make a judgment of whether he or she can give a good answer or must resort to bullshit. Then the student will write the solution or spin out some b.s. of the appropriate length. The b.s. will be written in a way approximating the output of a chatbot: a stream of associations structured so as to look like past writings on the topic.
Here’s the point. A student who is bullshitting is still a sentient, intelligent person. But the bullshitting is not itself the product of an intelligent process. Or, maybe we could say, it’s a process that shows intelligence but not reflection.
I guess that employees at Google etc. are working hard on programming computers to do something like whatever it is that we do when we’re applying our intelligence and thinking, which is something different from, but maybe related to, what we do when we’re bullshitting.
But there’s also this, from Aug 2023:
Bob Carpenter thinks GPT-4 is awesome.
Here’s the story. We’ve seen lots and lots of AI hype. On the other hand, actual AI is improving so fast. So, in year T, the gang at Google or wherever will claim that AI can do X, but really it can’t. But then in year T+1, it can do X, or something like it, so the hype was retroactively validated.
The other key piece here is that lots of people are working on improving these programs. If someone at Google says their chatbot can understand math, or whatever, and then Gary Smith or someone comes up with an example showing that, at least for that example, the chatbot is not understanding anything, then Google has lots of smart people working to fix it. And, when I say “fix,” I don’t just mean to put in a patch so the chatbot no longer gives the wrong answer to the Monty Hall problem or whatever, but rather to program something more general that will take it somewhere closer to understanding. So, it’s all a rapidly moving target.
This is different from the usual stories of science hype (Subliminal smiley faces shift attitudes on immigration by 18 percentage points! Early childhood intervention increases eventual adult earnings by 42%! Attractive parents are 26% more likely to have daughters! etc. etc.) or business and government hype (Theranos! Hyperloop! A cancer cure by 2000! Nudge! etc. etc.) in that these are things that never worked (or did not work as advertised; to be fair, some cancers can be cured and some nudges do work) and were propped up by some combination of unrealistic assumptions, bad statistical analysis, uncritical promotion, and, at times, just plain lies. With AI hype, it often seems that some version of reality will catch up to it, in a way that I can’t see happening with the Bible code or the hyperloop or mind-body healing or whatever.