Research Shows How Large Language Models Fake Conceptual Mastery

A famous story about artificial intelligence involves a model being told to devise efficient methods for landing jet fighters onto aircraft carriers. As recounted by researcher Janelle Shane, the model discovered “that if it applied a *huge* force, it would overflow the program’s memory and would register instead as a very *small* force.” The pilot was dead, “but, hey, perfect score.”
Human developers assumed the AI “understood” that the primary goal was to land airplanes safely, not crash them. This gap between apparent and actual understanding is so widespread that academics from MIT, Harvard and the University of Chicago are introducing the phrase “potemkin understanding” to characterize how large language models that excel at conceptual benchmarks fail to grasp the ideas they seem to master.
Borrowing imagery from the sham settlements filled with fake happy villagers reputedly built by Grigory Potemkin to please 18th-century Russian Empress Catherine II, the researchers say that LLMs can produce convincing explanations of abstract concepts despite lacking the ability to apply those concepts meaningfully.
Potemkin understanding is “a failure mode of LLMs whereby apparent comprehension revealed by successful benchmark performance is undermined by nonhuman patterns of misunderstanding,” the authors write in a preprint paper.
Researchers Marina Mancoridis, Bec Weeks, Keyon Vafa and Sendhil Mullainathan say potemkin understanding is different from hallucination, a more commonly cited failure mode in AI. “Potemkins are to conceptual knowledge what hallucinations are to factual knowledge – hallucinations fabricate false facts; potemkins fabricate false conceptual coherence,” they wrote.
That brittleness, Vafa said, has real-world consequences, particularly in automation. “If LLMs are used to automate processes but don’t actually possess the necessary understanding to carry out these processes, it can result in bad or harmful performance,” he said in an email.
The paper cites an example in which OpenAI’s GPT-4o was asked to define the basic rhyming scheme known as ABAB. The model correctly responded: “An ABAB scheme alternates rhymes: first and third lines rhyme, second and fourth rhyme.” But when prompted to supply a rhyming word to complete the pattern in a partially written poem, the model inserted a word that didn’t rhyme at all. This pattern repeated across many tasks: models could explain concepts but failed when required to operationalize them. “It’s tempting to anthropomorphize when LLMs perform well on human benchmarks, but this can give a false sense of model competence,” said Vafa.
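The define-then-apply pattern described above can be probed with a few lines of scripting. The sketch below is illustrative only, not the authors’ evaluation code: `query_llm` is a hypothetical placeholder for whatever model API a reader would actually call, and the rhyme check is a crude suffix heuristic rather than real phonetics.

```python
# Minimal sketch of a define-then-apply probe, loosely modeled on the ABAB example.
# `query_llm` is a hypothetical stub; the rhyme check is deliberately crude.

def query_llm(prompt: str) -> str:
    """Hypothetical placeholder -- wire this to a real chat-completion API."""
    raise NotImplementedError("replace with an actual model call")

def crude_rhyme(a: str, b: str, n: int = 3) -> bool:
    """Rough heuristic: different words whose last n letters match."""
    a, b = a.lower().strip(".,!?"), b.lower().strip(".,!?")
    return a != b and a[-n:] == b[-n:]

def probe_abab(first_line_end: str, partial_poem: str) -> dict:
    # Keystone question: can the model state the rule?
    definition = query_llm("Define the ABAB rhyming scheme in one sentence.")
    # Application question: can the model use the rule it just stated?
    completion = query_llm(
        "Complete the missing line of this ABAB poem, ending with a word "
        f"that rhymes with '{first_line_end}':\n{partial_poem}"
    )
    words = completion.split()
    last_word = words[-1] if words else ""
    return {
        "definition": definition,
        "completion": completion,
        "applied_correctly": crude_rhyme(first_line_end, last_word),
    }
```

A model that answers the first prompt correctly but fails the second is exhibiting exactly the gap the researchers describe.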
The authors argue this gap has broad implications, particularly for AI evaluation. Benchmarks are widely used to evaluate and improve models, but if models can pass them without genuine understanding, then benchmark performance ceases to be a reliable indicator of understanding.
To test the prevalence of potemkins, the team developed new benchmarks focused on literary devices, game theory and psychological biases. They evaluated models including GPT-4o, Claude 3.5 Sonnet, Gemini 2.0 Flash, Llama 3 70B, DeepSeek-V3, DeepSeek-R1 and Qwen2-VL 72B.
Models correctly identified a concept in 94.2% of cases, but failed to classify instances of it 55% of the time and scored similarly poorly on generation and editing tasks involving the same concepts.
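Numbers like these come from tallying, for each concept, whether a model passed the “keystone” question and whether it then applied the concept correctly. The sketch below is a rough illustration of that bookkeeping, with made-up field names and data; it is not the authors’ scoring code.

```python
# Illustrative tally of a potemkin-style gap: among concepts a model can
# define or identify, how often does it fail to apply them? Data and field
# names are invented for the example.

from dataclasses import dataclass

@dataclass
class ConceptResult:
    concept: str
    keystone_correct: bool      # defined/identified the concept correctly
    application_correct: bool   # applied it correctly (classify/generate/edit)

def potemkin_style_rate(results: list[ConceptResult]) -> float:
    """Fraction of correctly defined concepts the model then fails to apply."""
    defined = [r for r in results if r.keystone_correct]
    if not defined:
        return 0.0
    failures = sum(1 for r in defined if not r.application_correct)
    return failures / len(defined)

# Toy data: the model defines both concepts but applies only one.
sample = [
    ConceptResult("ABAB rhyme scheme", True, False),
    ConceptResult("Shakespearean sonnet", True, True),
]
print(potemkin_style_rate(sample))  # 0.5
```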
The root cause, the researchers say, is that models rely on surface-level correlations that let them answer many related questions without acquiring the abstract schema needed to reason about novel instances. In one test involving Shakespearean sonnets, AI could describe the structural rules but frequently failed to detect or construct valid sonnets, suggesting that current LLMs aren’t learning the abstract schema behind the concepts they discuss.
The authors hypothesize that during training, LLMs optimize for token prediction in ways that overfit to benchmark formats rather than abstracting true conceptual knowledge. The result is models that can ace a test without generalizing their knowledge to new tasks.
The paper stops short of prescribing a solution, but the authors suggest that new evaluation frameworks, improved training regimes and deeper structural changes in model design may be needed to detect or prevent potemkin-style failures in models that otherwise appear competent.
“My main advice would be to not trust LLMs based solely on static benchmarks designed for humans,” Vafa said. “To feel most confident in an LLM’s ability, it’s important to evaluate it on the specific task that the company is interested in using it for.”
