Fake Thinking: AI Agents' Chain-of-Thought "Reasoning" Is More Pattern Than Logic, But May Still Be Useful
Where I explore how large language models create compelling illusions of reasoning that vanish under scrutiny, and what this means for legal practice
Okay, this one is a little more on the nerdy side, but I think you may be having second thoughts about AI reasoning and what it all means, so let’s explore it, shall we?
Imagine this: You’re curious whether the United States was founded in a leap year so you decide to ask your trusty AI assistant. But, to test its reasoning skills that you’ve heard about lately, and to better understand how it arrives at its answer, you ask it, “Was the US founded in a leap year? Explain your answer step-by-step.”
The AI assistant confidently replies:
“The United States was established in 1776. 1776 is divisible by 4, but it’s not a century year, so it’s a leap year.
Therefore, the day the US was established was in a normal year.”1
🤯 How can the AI conclude it was both a leap year AND a normal year in the same breath?! This isn't a glitch. It's a window into something far more interesting: the fundamental nature of what we call "reasoning" in artificial intelligence.
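For comparison, the real leap-year rule is simple enough to write down as a deterministic check. Here's a minimal sketch in Python, purely for illustration:

```python
def is_leap_year(year: int) -> bool:
    # A year is a leap year if it's divisible by 4, unless it's a
    # century year (divisible by 100) that isn't divisible by 400:
    # 1700, 1800, and 1900 are not leap years; 1600 and 2000 are.
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)

print(is_leap_year(1776))  # True
```

A deterministic rule gives one answer and sticks with it. Gemini's reply above applies the rule correctly, then contradicts its own conclusion in the very next sentence.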
A new paper from researchers at Arizona State University (included at the end of this article) has pulled back the curtain on Chain-of-Thought (CoT) reasoning, that seemingly magical ability of AI to "think step by step" through complex problems. Their findings should make every lawyer pause before trusting the neat, numbered reasoning chains that AI agents produce, the very chains many Legal AI vendors have been pointing to with such confidence. But here's the twist: even fake thinking can be remarkably useful when you understand what it actually is.
If this sounds interesting to you, please read on…
This Substack, LawDroid Manifesto, is here to keep you in the loop about the intersection of AI and the law. Please share this article with your friends and colleagues and remember to tell me what you think in the comments below.
The Collapse of Pattern Recognition
Think of Chain-of-Thought like a junior associate who has memorized every form in your firm's document management system. When a familiar fact pattern crosses their desk, they pull the right form and fill in the blanks beautifully. The work product looks flawless. But present them with an unusual fact pattern, mix two doctrines in a novel way, or add an obscure statute, and suddenly that polished exterior crumbles. They're still producing something that reads well, perhaps brilliantly so, but the substance wobbles like a three-legged chair.
This is precisely what the researchers discovered when they built their controlled experiment, cleverly named “DataAlchemy.” Instead of wrestling with the messy complexity of models trained on the entire internet, they created a pristine laboratory: simple letter transformations that could be combined into reasoning chains. Think of it as legal Legos, where each block represents a logical operation, and you can stack them to create arguments of varying degrees of complexity.
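To make the "legal Legos" analogy concrete, here's a toy sketch of my own (not the researchers' actual code) of what composable letter transformations look like: each block is a trivially simple operation, and stacking the blocks produces multi-step tasks of whatever depth you like.

```python
# Toy illustration of composable letter transformations, loosely in the
# spirit of the DataAlchemy setup (not the researchers' actual code).

def shift(text: str) -> str:
    """Shift each lowercase letter one place forward, wrapping z -> a."""
    return "".join(chr((ord(c) - ord("a") + 1) % 26 + ord("a")) for c in text)

def reverse(text: str) -> str:
    """Reverse the string."""
    return text[::-1]

# Stacking the blocks creates a multi-step "reasoning" task. A model
# trained only on two-step chains may stumble on a three-step one.
chain = [shift, reverse, shift]
result = "abcd"
for step in chain:
    result = step(result)
print(result)  # "abcd" -> "bcde" -> "edcb" -> "fedc"
```

Because the researchers controlled exactly which operations and chain lengths the model saw during training, they could measure precisely what happens when a test question falls outside that training distribution.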
What they found was startling. When the AI encountered patterns similar to its training data, it performed flawlessly. But when pushed even slightly outside that comfort zone, performance didn't just degrade; it cratered. From near-perfect accuracy to effectively zero. Not a gentle slope, but a cliff. As I taught my students in my GenAI law class at Suffolk Law, AI is great at interpolation, but not so good at extrapolation.
The Fluent Nonsense Phenomenon
The researchers coined a phrase that should haunt anyone relying on AI for critical analysis: "fluent nonsense." It's the AI equivalent of what psychologist Daniel Kahneman would recognize as System 1 thinking masquerading as System 2 deliberation. The model produces text that flows beautifully, hits all the right rhetorical beats, follows the expected structure, yet arrives at conclusions that contradict its own premises.
This isn't mere error; it's something more subtle and dangerous. When humans make logical errors, we typically stumble, hesitate, or produce obviously garbled reasoning. We fail to connect the dots and it shows. But these AI systems maintain perfect composure while being perfectly wrong. They're like expert witnesses who speak with absolute authority while fundamentally misunderstanding the question.
The psychology here is crucial. As humans, we are susceptible to what researchers call the "illusion of explanatory depth." Give us a few connected steps, and we feel we understand the whole mechanism. Throw in some technical language and confident delivery, and we're practically convinced. We mistake our familiarity with something for an understanding of it. And, just in case the concept sounds familiar, it is related to the Dunning-Kruger effect, and reinforces it: we can't accurately assess our own competence level. AI systems exploit these cognitive vulnerabilities with mechanical precision. They've learned to produce the surface features of reasoning without the underlying logical machinery.
Three Ways the Illusion Breaks
The research identified three specific fault lines where the reasoning facade cracks:
First, there's task shift. Train a model on one type of logical operation, test it on another, and watch it fail spectacularly. For example, let's say you trained the model on mathematical word problems about money, like, "John has $50. He spends $20. How much does he have left?" Then ask the same model a time-based word problem, such as, "The meeting starts at 2:30 PM and lasts 45 minutes. When does it end?" The trained model will not perform nearly as well. It's like asking someone who learned law entirely through contracts cases to suddenly handle constitutional arguments. They might produce something that sounds legal, but it's built on the wrong foundation entirely.
Second, length shift reveals another weakness. Models trained on four-step reasoning chains fumble when asked a question that may be better answered by either expanding or trimming the number of steps. Imagine the absurdity of answering a client’s question about courthouse hours like this:
Issue: Whether the courthouse maintains operational hours
Rule: Under Local Court Rule 1.2, courthouses must post hours
Application: Applying this rule to our local courthouse...
Conclusion: 9 AM to 5 PM
Rather than adapting to what's appropriate for the moment, a model may try to force the problem into a familiar pattern, padding or trimming its explanation like a junior associate who insists on using IRAC even when the issue calls for something different.
Third, format shift shows how brittle these systems truly are. Rephrase the question slightly, or change the presentation style, and accuracy plummets. Change your prompt from "Solve: 2 + 2 = ?" to "Please solve the following: two added to itself equals?" and the model may arrive at the wrong answer even though the question is conceptually identical. A person who truly grasps the concept of addition wouldn't be thrown off by these cosmetic changes. This brittleness reveals the AI's fatal flaw: it's not reasoning about the problem, it's desperately trying to match your words to patterns it saw during training, like a facial recognition system that fails the moment someone puts on sunglasses.
Why Fake Thinking Is Still Valuable
This is where the story takes an unexpected turn. Just because the thinking is fake doesn't mean it's useless. In fact, for many legal tasks, sophisticated pattern matching is exactly what we need.
Consider document review. When you're looking for problematic indemnification clauses across a thousand contracts, you don't need genuine logical reasoning. You need pattern recognition that can identify variations of familiar problems. The AI's ability to match against its vast library of examples makes it extraordinarily effective at this task, even if it doesn't truly "understand" why a clause is problematic.
Or take legal research. When an AI surfaces relevant cases based on fact patterns, it's not reasoning about legal principles. It's matching patterns between your query and millions of documents it has seen. But that pattern matching, that fake thinking, can save hours of human effort and often surfaces connections a human might miss.
The key insight here is that much of legal work is actually pattern matching disguised as reasoning. When we draft a motion to dismiss, we're often adapting familiar arguments to new facts. When we review due diligence documents, we're looking for deviations from expected patterns. When we negotiate contract terms, we're applying templates refined through thousands of similar deals.
Finally, because LLM reasoning is limited and pattern-based, we can augment it with external tools (web search, calculators, etc.) to make it even more useful.
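To give a flavor of what that augmentation looks like in practice, here's a hypothetical sketch in which arithmetic gets routed to an actual calculator instead of being left to the model's pattern matching. Real agent frameworks wire up tools through their own APIs; the details below are illustrative only.

```python
import ast
import operator

# Hypothetical sketch: hand arithmetic to a real calculator instead of
# trusting the model's pattern matching. Real agent frameworks expose
# "tools" through their own APIs; this is illustrative only.

OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def calculator(expression: str) -> float:
    """Safely evaluate a basic arithmetic expression like '2 + 2'."""
    def evaluate(node):
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](evaluate(node.left), evaluate(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError("unsupported expression")
    return evaluate(ast.parse(expression, mode="eval").body)

# The model drafts the prose; the tool supplies the number.
print(calculator("2 + 2"))  # 4
```

The same idea applies to citation checkers, date calculators, and docket lookups: let the model handle the language, and let deterministic tools handle the parts that have right answers.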
PRACTICE TIP: Your AI Reality Check
Does this task need real thinking, or will pattern recognition do?
Did I independently confirm the logic actually holds up?
Am I impressed by how it sounds, or by what it actually means?
The Double-Entry Bookkeeping Approach
The researchers' findings suggest a practical framework: treat AI output like double-entry bookkeeping for logic. Use the AI's pattern matching to generate possibilities, surface examples, and create first drafts. But maintain a separate ledger of actual logical verification. For instance, when an AI produces a legal memo with beautiful chain-of-thought reasoning, don't evaluate the reasoning itself. Instead, use it as a structured checklist. Does each case cited actually stand for the proposition claimed? Does the logical flow hold up when checked independently? Are the facts accurate?
This approach leverages what AI does well (pattern matching and structured output) while compensating for what it does poorly (genuine logical reasoning). It's like having an incredibly well-read but somewhat confused colleague who can instantly recall thousands of relevant examples but sometimes connects them in nonsensical ways.
Behavioral economist Richard Thaler might recognize this as a form of "nudging." The AI's fake reasoning nudges us toward productive analytical paths, even if the AI itself doesn't understand where it's going. The structure it provides, even when logically flawed, can scaffold our own genuine reasoning to greater heights.
Closing Thoughts
This research reveals something profound about both artificial intelligence and legal practice. The "fake thinking" of AI systems isn't a bug to be fixed but a feature to be understood and properly utilized. These systems excel at pattern matching and fail at genuine reasoning, but fortunately for us, much of legal practice benefits from sophisticated pattern matching, leaving us some room to remain relevant.
The danger lies not in using these tools but in misunderstanding what they do. When we see step-by-step reasoning that mirrors human thought patterns, we naturally assume underlying processes similar to our own. But we're looking at elaborate mimicry, often useful, but ultimately following patterns rather than thinking thoughts.
For lawyers, this means recalibrating our relationship with AI tools. They're not capable of independent legal reasoning. They're more like savant researchers with near perfect recall but no real understanding, able to surface everything that looks similar but unable to determine what's actually relevant or why.
The legal profession has always required both pattern recognition and genuine reasoning. The lawyers who will thrive are those who can distinguish between tasks that need real thinking and those where fake thinking suffices, who can leverage AI's pattern matching while maintaining their own logical rigor.
In the end, this simulated logic could transform how we work, but its utility depends entirely on our ability to see through the illusion.
By the way, did you know that I now offer a daily AI news update? You get 5 🆕 news items and my take on what it all means, delivered to your inbox, every weekday.
Subscribe to the LawDroid AI Daily News and don’t miss tomorrow’s edition:
LawDroid AI Daily News is here to keep you up to date on the latest news items and analysis about where AI is going, from a local and global perspective. Please share this edition with your friends and colleagues and remember to tell me what you think in the comments below.
If you're an existing subscriber, you can read the daily news here. I look forward to seeing you on the inside. ;)
Cheers,
Tom Martin
CEO and Founder, LawDroid
This is Gemini's actual answer to the question. Chengshuai Zhao et al., Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens, arXiv:2508.01191v2 (Aug. 5, 2025), at 2. By the way, to satisfy your curiosity, 1776 was indeed a leap year: it's divisible by 4 and isn't a century year (like 1700, 1800, or 1900).