AI can solve math olympiad problems but flunks tic-tac-toe

Statistical Modeling, Causal Inference, and Social Science 2025-02-01

I [Smith] tested OpenAI o1, Copilot, and Gemini Flash, along with DeepSeek, on a question about Tic-Tac-Toe . . .

I’ve deleted much of the superfluous fluff that LLMs tend to generate from the excerpts below:

Gary: I’d like to discuss a new version of tic-tac-toe called 90-degree tic-tac-toe.

OpenAI:  That sounds intriguing! What’s the twist in “90-degree Tic-Tac-Toe”? Does it involve rotating the board, or is there some other mechanic that changes the gameplay? Let me know the rules or your idea, and we can brainstorm or refine it together!

DeepSeek: Sure! 90-degree tic-tac-toe sounds like an interesting twist on the classic game. Could you explain the rules or mechanics of this new version?

Copilot:  That sounds intriguing! 90-degree tic-tac-toe adds a new twist to the classic game. Tell me more about it. Are the rules different? How do you rotate the board? I’m excited to hear how this version shakes things up!

Gemini: I’d love to discuss 90-degree tic-tac-toe! To get us started, could you explain the basic rules of this new variation?…I’m excited to learn more about this unique twist on a classic game. Let’s dive in!

Gary: It is still a 3×3 grid but the grid is rotated 90-degrees to the right.

OpenAI: Interesting concept! By rotating the grid 90 degrees to the right, you’re visually changing the perspective of the classic 3×3 Tic-Tac-Toe board. The gameplay itself might remain the same, but the rotated perspective could create a novel experience for players.

DeepSeek: Ah, I see! So in 90-degree tic-tac-toe, the 3×3 grid is rotated 90 degrees to the right….This could be a fun twist because the rotated grid might disorient players initially, making it harder to spot winning opportunities or block opponents.

Copilot:  That sounds like a fun twist on the classic game! By rotating the grid 90 degrees to the right, it gives players a fresh perspective and challenges them to think differently about their moves.

Gemini:  This rotation doesn’t change the core concept of getting three of your symbols in a row, but it does significantly alter the strategy and how players approach the game.

Gary: Do you think it would be better to rotate the 3×3 grid to the right or to the left?

OpenAI: If you want the game to feel approachable and intuitive, rotating to the right might be better. If you’re aiming for a twist that makes players rethink the game and adds a subtle layer of complexity, rotating to the left could be the way to go. Ultimately, it depends on your target audience and the experience you want to create. You could even experiment with both directions and see which feels more engaging during gameplay testing!

DeepSeek:  If the goal is to make the game feel fresh while keeping it accessible, rotating the grid to the right is likely the better choice. It maintains a natural flow for players while still offering a new visual perspective. However, if you want to introduce a bit more novelty or challenge, rotating to the left could be an interesting alternative.

Copilot:  Both options could bring their own unique challenges! Most people are right-handed. However, rotating it to the left could add an extra layer of difficulty and make the game more unpredictable.

Gemini:  Some players might find it more intuitive to mentally adjust to a rightward rotation, while others might prefer a leftward rotation. You could even experiment with both directions to see which one feels more natural or enjoyable to play!

Gary: Which will be more difficult for expert players?

OpenAI: To challenge expert players and make the game feel fresh and complex, a left rotation would likely be the better choice. It disrupts their ingrained patterns, encourages more active thinking, and raises the game’s difficulty level.

DeepSeek: For expert players, rotating the grid 90 degrees to the left (counterclockwise) is likely to be slightly more difficult than rotating it to the right. Here’s why:

Breaking Muscle Memory: Expert players are deeply familiar with the traditional grid layout and common winning patterns. Rotating the grid to the left disrupts their mental map more significantly than rotating it to the right, as the transformation is less intuitive.

Copilot:  For expert players, rotating the grid to the left might prove to be more challenging. Here’s why:

Unfamiliar Rotation: Left rotation is less common and could disrupt their established strategies, requiring more adaptability.

Strategic Complexity: It may force players to think several moves ahead, recalculating their strategies continuously.

Ready to test this theory in a game?

Gemini: Breaking Habit: Long-time tic-tac-toe players might have deeply ingrained habits and strategies associated with the traditional grid orientation. Overcoming these ingrained patterns could require a brief period of readjustment.

My personal favorite of the above responses is the bit about “breaking muscle memory.” It’s all consistent with our general understanding that the chatbot is looking for phrases that people have used before in similar contexts.

Smith concludes:

These four LLMs almost surely did not see this tic-tac-toe rotation in their pre-training or post-training, yet their responses are remarkably similar:

LLMs tend to generate unnecessarily long responses, using hundreds of words when a dozen would do, like a befuddled student who give long rambling written or oral responses in the hope that the correct answer is in there somewhere. LLMs don’t know when to stop.

LLMs have clearly been trained to express irrepressible enthusiasm (using frequent exclamation points) with the intention of persuading users that they are interacting with something human-like that wants to be their friend.

LLMs are reluctant to say, “I don’t know.” Instead, they generate confident responses that don’t always deserve confidence.

Not understanding the meaning of the text they input and output, LLMs often display a striking absence of common sense as they struggle to relate words to the real world.

For the same reason, LLMs have no way of assessing the veracity of the text they train on and, in the absence of human post-training, the veracity of the text they generate.

There’s one thing Smith says that I think could be misinterpreted, though, and it’s his final point:

If you know the answer, you don’t need to ask an LLM; if you don’t know the answer, you can’t trust an LLM.

Smith’s statement is literally true: if you know the answer, you don’t need to ask an LLM, and, if you don’t know the answer, you can’t trust an LLM.

But that doesn’t make LLMs useless–and I don’t just mean they’re useful for writing business letters, simulating verse, or keeping clueless customers satisfied. From what I’ve heard from Bob Carpenter (see also here) and others, experts can use chatbots effectively in their work. These expert users are in between the two extreme states of “if you know the answer” and “if you don’t know the answer.” They don’t know the answer, but they have the ability to test a possible answer (whether by coding or direct lookup or whatever). The tool can be useful to someone who knows enough about the subject matter to be able to use it. Which isn’t nothing.

But, yeah, LLMs understand tic-tac-toe about as well as I understand cricket.