To Push AI Reasoning Further, Architectures and Benchmarks Must Evolve

The era of “bigger is better” in artificial intelligence could be approaching its limits, at least when it comes to teaching machines how to reason. Large language models can draft contracts, summarize policies and pass coding exams, but their performance flatlines when they are asked to connect the dots, weigh options or tailor outputs to context.
Recent analysis from research group Epoch AI found that performance on reasoning benchmarks is showing early signs of plateauing, even as compute and model sizes grow.
Models struggle to capture user context and intent, said Ram Bala, associate professor of business analytics at Santa Clara University. “Good reasoning requires awareness of who is asking and why, but models often default to generic outputs, like summarizing clauses without adapting to a user’s goals,” he said. Bala studies business applications of AI and is an author of The AI-Centered Enterprise: Reshaping Organizations with Context-Aware AI.
The flattening curve is a structural issue. Perception tasks such as labeling an image or predicting the next token scale smoothly with more data and parameters, but reasoning does not.
“Scaling parameters improves fluency but not the structured inference and trade-off evaluation required for reasoning,” Bala said. Perception tasks are statistically dense and local, but reasoning is global and goal-dependent, where a single context error can derail the whole output, he said.
Sumeet Gupta, senior managing director and AI practice leader at FTI Consulting, said the scaling profile for reasoning looks different because it combines multiple techniques such as distillation, reinforcement learning from human feedback and supervised fine-tuning, each with its own bottleneck.
“Scaling laws will ultimately catch up to any compute architecture and technique, and this will not be different for reasoning performance,” he said. “The mix of techniques involved makes the curve less predictable than for LLM training, but the slowdown is real.”
The Problem of Architectural Limits
Many of today’s LLMs are built on transformer architectures, which excel at statistical pattern recognition, but are limited in planning and memory. “Transformers are limited by fixed-context attention and lack iterative planning loops, which hinders complex reasoning,” Bala said.
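As a rough illustration of the fixed-context constraint, the toy numpy sketch below runs standard scaled dot-product attention over a hard window: tokens that fall outside it simply cannot influence the output, no matter how relevant they are. The window size and random embeddings are illustrative assumptions, not any production model’s values.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Standard transformer attention: softmax(QK^T / sqrt(d)) @ V."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                    # pairwise token similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over visible positions
    return weights @ v

CONTEXT_WINDOW = 8  # toy fixed attention span

def attend_with_fixed_context(embeddings):
    """Anything earlier than the window is dropped before attention runs."""
    visible = embeddings[-CONTEXT_WINDOW:]
    return scaled_dot_product_attention(visible, visible, visible)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(20, 4))                # 20 token embeddings, dimension 4
print(attend_with_fixed_context(tokens).shape)   # (8, 4): only the last 8 tokens matter
```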
Those limits have sparked interest in memory-augmented and retrieval-based systems, which can surface external information during inference or maintain context across multi-step interactions. But their value depends on the application.
“There are inherent limits to any architecture,” Gupta said. “Memory doesn’t improve base performance itself but helps maintain AI agent context over a longer session of time. Similarly, [retrieval-augmented generation] access to external data improves business performance of specific applications but doesn’t necessarily improve core model reasoning performance.”
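The distinction is easy to make concrete. Below is a minimal retrieval-augmented generation sketch in plain Python: passages are ranked against the query and injected into the prompt at inference time, while the model’s weights, and whatever reasoning ability they encode, are untouched. The toy corpus, bag-of-words scoring and prompt format are assumptions for illustration; production systems use dense vector embeddings.

```python
import math
from collections import Counter

CORPUS = [
    "Clause 4.2 caps supplier liability at the contract value.",
    "Payment terms are net 30 from invoice receipt.",
    "Either party may terminate with 60 days written notice.",
]

def embed(text):
    """Toy bag-of-words 'embedding'; real RAG uses dense vector encoders."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, k=2):
    """Rank corpus passages by similarity to the query; keep the top k."""
    q = embed(query)
    return sorted(CORPUS, key=lambda doc: cosine(q, embed(doc)), reverse=True)[:k]

def build_prompt(query):
    """External knowledge enters through the prompt, not the weights."""
    context = "\n".join(retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

print(build_prompt("What is the liability cap?"))
```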
Techniques like chain-of-thought prompting, where the model is guided to “think aloud” in intermediate steps, have shown promise. But “without better perception and context alignment, longer reasoning traces often become verbose but irrelevant,” Bala said. “Most of the easy gains have been realized.”
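The technique itself is simple enough to sketch. In the minimal example below, the few-shot prompt demonstrates an explicit intermediate reasoning trace before the final answer, which biases the model toward answering the new question the same way. The arithmetic problem and wording are illustrative assumptions, and generate() is a hypothetical stand-in for any text-completion client, not a real API.

```python
# A chain-of-thought prompt in miniature: the worked example shows the model
# an explicit reasoning trace, so its completion tends to follow the pattern.
COT_PROMPT = """\
Q: A vendor quotes $120 per seat for 40 seats with a 10% volume discount. What is the total?
A: Let's think step by step.
   1. Base cost: 120 * 40 = 4800.
   2. Discount: 10% of 4800 = 480.
   3. Total: 4800 - 480 = 4320.
   The answer is $4320.

Q: {question}
A: Let's think step by step.
"""

def generate(prompt: str) -> str:
    """Hypothetical model call; wire in a real client here."""
    raise NotImplementedError

prompt = COT_PROMPT.format(
    question="Licenses cost $900 per year for 12 users with an 8% discount. What is the total?"
)
print(prompt)  # send this to the model; longer traces are not always better ones
```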
What’s Next?
As brute-force scale reaches its limits, researchers are turning to new architectures, hybrid models and more structured approaches. Bala said symbolic hybrids, where a neural model handles perception and a symbolic engine handles logic, are a promising direction. Agentic systems, which modularize perception, reasoning and learning, could also offer more control and interpretability.
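A minimal sketch of that division of labor, with loudly labeled stand-ins: a regex plays the role of the neural perception step (in a real hybrid, an LLM would extract the facts), and the symbolic layer is a pair of deterministic, inspectable rules. The contract scenario and thresholds are assumptions for illustration.

```python
import re

def perceive(clause: str) -> dict:
    """'Perception' stand-in: turn unstructured text into structured facts."""
    m = re.search(r"terminate with (\d+) days", clause)
    return {"notice_days": int(m.group(1)) if m else None}

def rule_notice_present(facts):
    return "notice period missing" if facts["notice_days"] is None else None

def rule_notice_length(facts):
    days = facts["notice_days"]
    return "notice period under 30 days" if days is not None and days < 30 else None

def reason(facts: dict) -> str:
    """Symbolic layer: every conclusion traces back to an explicit rule."""
    findings = [rule(facts) for rule in (rule_notice_present, rule_notice_length)]
    return "; ".join(f for f in findings if f) or "compliant"

print(reason(perceive("Either party may terminate with 60 days written notice.")))
# -> compliant
```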
“Curriculum learning that teaches models to move from perception to decision-making, coupled with feedback, also holds promise,” Bala said.
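A toy version of that idea simply stages training examples from perception-style tasks toward decision-making ones. The stages and examples below are assumptions; a real curriculum trainer would gate advancement on validation feedback rather than a fixed ordering.

```python
# Illustrative curriculum: serve easier perception tasks before harder
# decision-making ones. Stage numbers and tasks are made up for the sketch.
DATASET = [
    {"task": "recommend accept or renegotiate, with justification", "stage": 3},
    {"task": "label the clause type", "stage": 0},
    {"task": "flag conflicts between two clauses", "stage": 2},
    {"task": "summarize the clause", "stage": 1},
]

def curriculum(dataset):
    """Yield examples easiest-first, moving from perception to decisions."""
    yield from sorted(dataset, key=lambda example: example["stage"])

for example in curriculum(DATASET):
    print(example["stage"], example["task"])
```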
Gupta sees room for progress in both incremental refinement and longer-term architectural shifts. “A fundamental architectural innovation will yield the most core performance gain,” he predicted. Current systems may become more effective by combining with symbolic AI or being trained for specialized domains such as legal, finance or mathematics.
“For specialized use cases particularly requiring a high degree of response fidelity, compliance adherence, domain-specific complex problem solving and/or low latency performance, smaller and specialized models will be used,” he said.
General-purpose models still have a place, particularly for perception across unstructured data, but high-precision reasoning tasks may increasingly fall to systems purpose-built for domain understanding.
The plateau in reasoning performance has also prompted debate about how progress should be measured. Traditional metrics such as parameter count or token throughput reveal little about a model’s ability to reason within specific roles.
Bala called for new kinds of benchmarks – ones that test outputs in context. “Benchmarks should test reasoning in context, such as generating different outputs for procurement versus legal roles in the same scenario,” he said. He added that tasks requiring subgoal decomposition, multi-source input and real-time feedback are more aligned with reasoning than simple prompt completion.
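One way to operationalize that, sketched below with loudly hypothetical pieces: pose the same scenario under two roles and score each answer against role-specific criteria. The keyword rubric stands in for a real grading scheme or judge model, and answer() returns canned text where a benchmark harness would call the model under test.

```python
# Role-aware evaluation sketch: identical scenario, different asker, and the
# score rewards answers framed for that asker's goals.
SCENARIO = "Supplier requests a 15% price increase mid-contract."

ROLE_CRITERIA = {
    "procurement": ["cost", "alternative", "negotiat"],  # commercial framing
    "legal": ["clause", "amendment", "obligation"],      # contractual framing
}

def answer(role: str, scenario: str) -> str:
    """Canned stand-in for a model call conditioned on the asker's role."""
    canned = {
        "procurement": "Benchmark cost against alternatives and negotiate a phased increase.",
        "legal": "Review the price-adjustment clause; obligations change only by written amendment.",
    }
    return canned[role]

def score(role: str, text: str) -> float:
    hits = sum(keyword in text.lower() for keyword in ROLE_CRITERIA[role])
    return hits / len(ROLE_CRITERIA[role])

for role in ROLE_CRITERIA:
    print(role, round(score(role, answer(role, SCENARIO)), 2))
```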
Gupta said several domain-relevant benchmarks are already in use, such as HellaSwag, BBH or ARC-AGI for general-purpose reasoning and HumanEval for coding. Benchmarks that simulate the completion of real-life reasoning tasks are the best for tracking model improvement, he said.
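HumanEval-style scoring is straightforward to sketch: a model-generated completion counts as a pass only if it survives the task’s unit tests, so the metric tracks working behavior rather than fluent text. The single toy task below is an assumption standing in for the benchmark’s 164 hand-written problems.

```python
# Minimal pass/fail check in the spirit of HumanEval: execute the completion
# and run the task's tests. The prompt, completion and tests are made up.
TASK = {
    "prompt": "def add(a, b):\n",
    "completion": "    return a + b\n",  # pretend this came from the model
    "tests": [("add(2, 3)", 5), ("add(-1, 1)", 0)],
}

def passes(task) -> bool:
    namespace = {}
    try:
        exec(task["prompt"] + task["completion"], namespace)  # build the function
        return all(eval(expr, namespace) == want for expr, want in task["tests"])
    except Exception:
        return False  # crashes and wrong answers both count as failures

print("pass@1:", passes(TASK))  # True only if every test passes
```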
Breaking Through
If reasoning model performance continues to plateau even with 100 times more compute, what kind of bets should the industry be placing?
Research should focus on richer context infrastructure, such as dynamic knowledge graphs and orchestration frameworks that separate perception and reasoning, Bala said. Hybrid neuro-symbolic architectures and agentic control loops with feedback can break current limitations, he said.
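A sketch of that separation, with illustrative stand-ins throughout: a perception step writes facts into a small dynamic “knowledge graph” (here, a set of triples built by a regex where an LLM would do the extraction), and the reasoning step queries only that structured state, never the raw text.

```python
import re

GRAPH = set()  # (subject, relation, object) triples, updated as text arrives

def perceive(text: str) -> None:
    """Perception: turn unstructured input into graph triples."""
    for subject, obj in re.findall(r"(\w+) depends on (\w+)", text):
        GRAPH.add((subject, "depends_on", obj))

def affected(service: str, down: str) -> bool:
    """Reasoning: follow depends_on edges; the logic never sees raw text."""
    frontier, seen = {service}, set()
    while frontier:
        node = frontier.pop()
        if node == down:
            return True
        seen.add(node)
        frontier |= {o for s, _, o in GRAPH if s == node and o not in seen}
    return False

perceive("billing depends on auth. auth depends on database.")
print(affected("billing", "database"))  # True: reached via two hops in the graph
```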
Gupta said that transformative new architectures are in development, but there is still plenty of ground to cover with current models, especially by enhancing how they’re applied. New approaches include architectures that can train on multi-modal datasets, mimicking the rich sensory input the human brain experiences, and others that allow for self-adaptive tuning, he said. But while new architectures go through the research lifecycle, “variations, optimizations and hybrid combinations of current model architectures have a long runway ahead for practical business applications,” he said.
Scale alone isn’t enough, but it isn’t obsolete either. Experts point to smarter architectures, better benchmarks and more role-aware reasoning tasks as the way past the plateau.
