Approach Aims to Scale AI Models by Making Them Smarter Instead of Bigger

For years, it seemed obvious that the best way to scale up artificial intelligence models was to throw more upfront computing resources at them. The theory was that performance improvements would be proportional to increases in model size, dataset volume and computation. But the anticipated leap in performance isn’t materializing.
Instead, scaling AI models in size has hit a plateau. OpenAI’s Orion model demonstrated only modest improvements over its predecessor despite requiring about 10 times more compute resources than GPT-4, while Google’s development of its next-generation Gemini model also experienced slower-than-expected progress.
One possible solution to scaling AI models is test-time compute. The approach dynamically allocates extra computational resources during inference – or the thinking phase – to refine answers. Unlike older models that simply churn out the next word, newer “reasoning” approaches allow for the AI equivalent of reflection and refinement. A 2024 research paper written by Google DeepMind researchers showed that an adaptive “compute-optimal” strategy can improve performance roughly four-fold compared to traditional methods, sometimes allowing a smaller model to outperform one 14 times larger.
Instead of a fixed computational budget for every task, test-time compute lets AI models allocate resources based on the problem’s complexity. The dynamic allocation aims to make AI systems more efficient and better at handling complex, real-world challenges.
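To make the idea concrete, here is a minimal Python sketch of difficulty-based budget allocation. The `estimate_difficulty` and `generate` functions are hypothetical placeholders rather than any vendor’s API; a real system might use a learned difficulty predictor or the model’s own uncertainty instead of the crude length heuristic shown here.

```python
# Minimal sketch of the test-time compute idea: instead of a fixed inference
# budget, spend more "thinking" tokens on prompts that look harder.
# estimate_difficulty() and generate() are hypothetical placeholders.

def estimate_difficulty(prompt: str) -> float:
    """Return a score in [0, 1]. A real system might use a learned verifier
    or the model's own uncertainty; here we use a crude length-based proxy."""
    return min(len(prompt.split()) / 200.0, 1.0)

def generate(prompt: str, max_thinking_tokens: int) -> str:
    """Placeholder for a model call that may 'think' for up to
    max_thinking_tokens before producing its final answer."""
    return f"<answer after up to {max_thinking_tokens} reasoning tokens>"

def answer_with_dynamic_budget(prompt: str) -> str:
    difficulty = estimate_difficulty(prompt)
    # Easy prompts get a small budget; hard ones get an order of magnitude more.
    budget = int(256 + difficulty * 3840)
    return generate(prompt, max_thinking_tokens=budget)

print(answer_with_dynamic_budget("What is 2 + 2?"))
print(answer_with_dynamic_budget("Prove that the sum of two odd integers is even."))
```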
Test-time scaling in general refers to using additional computation to get more accurate answers. Scaling up parameters also consumes more computation, because a larger model costs more to run, but with test-time compute, models instead spend more tokens thinking through the answer before outputting it.
The Google research paper studied techniques that allow models to use more tokens at test time, and found that test-time scaling can outperform scaling model parameters. The additional “thinking time” allows models to explore multiple reasoning paths, making them particularly suited for tasks such as advanced code generation or multi-modal data analysis, said Charlie Snell, the paper’s lead author.
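One common way to spend those extra tokens is to sample several reasoning paths in parallel and keep the answer they agree on, often called self-consistency. The sketch below illustrates the pattern; `sample_reasoning_path` is a hypothetical stand-in for a real model call that returns a final answer from one sampled chain of thought.

```python
# Sketch of parallel test-time scaling via self-consistency: sample several
# reasoning paths and majority-vote over their final answers.
# sample_reasoning_path() is a hypothetical placeholder for an LLM call.

import random
from collections import Counter

def sample_reasoning_path(prompt: str, temperature: float = 0.8) -> str:
    """Placeholder: a real implementation would sample a chain of thought
    from the model and extract only the final answer."""
    return random.choice(["42", "42", "41"])  # toy answer distribution

def self_consistent_answer(prompt: str, n_paths: int = 8) -> str:
    answers = [sample_reasoning_path(prompt) for _ in range(n_paths)]
    # Majority vote over the sampled final answers.
    return Counter(answers).most_common(1)[0][0]

print(self_consistent_answer("What is 6 * 7?"))
```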
Research indicates that grouping – or binning – questions by difficulty and tailoring the extra compute accordingly can boost performance. “We used a pretty straightforward approach of binning questions by difficulty and selecting the test-time strategy which was most effective for a given FLOPs budget within that difficulty bin,” Snell said.
The compute-optimal strategy showed that iterative refinement can be particularly effective for easy-to-medium difficulty problems, while independent sampling or search methods may offer better results for more complex challenges.
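A rough sketch of that decision, assuming a fixed per-question budget of model calls, might look like the following. The `revise`, `sample` and `score` functions are hypothetical placeholders for an LLM revision call, an independent sample and a learned verifier, not the paper’s actual implementation.

```python
# Sketch of a compute-optimal choice between sequential revision and parallel
# sampling, given a difficulty bin and a fixed budget of model calls.
# revise(), sample() and score() are hypothetical placeholders.

def revise(prompt: str, draft: str) -> str:
    return draft + " (revised)"          # placeholder for an LLM revision call

def sample(prompt: str) -> str:
    return "candidate answer"            # placeholder for an independent sample

def score(prompt: str, answer: str) -> float:
    return float(len(answer))            # placeholder for a learned verifier

def compute_optimal_answer(prompt: str, difficulty_bin: str, budget: int = 8) -> str:
    if difficulty_bin in ("easy", "medium"):
        # Sequential refinement: keep improving a single draft.
        answer = sample(prompt)
        for _ in range(budget - 1):
            answer = revise(prompt, answer)
        return answer
    # Hard problems: draw independent samples, then pick the best under the verifier.
    candidates = [sample(prompt) for _ in range(budget)]
    return max(candidates, key=lambda a: score(prompt, a))

print(compute_optimal_answer("Solve x^2 = 4", "hard"))
```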
Memory is an important consideration. Increasing test-time compute can introduce new memory constraints during inference. Snell said that inference is “definitely” more memory-bound than training, which is a “little bit of a problem” because increasing hardware memory bandwidth tends to be harder than increasing FLOPs. But strategies such as speculative decoding or alternative architectures such as state space models, known as SSMs, might help mitigate some of this, he said.
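The intuition behind speculative decoding is that a small draft model proposes several tokens cheaply and the large target model verifies them in a single pass, rather than generating one token per pass. The toy, greedy-verification sketch below illustrates the shape of that loop; `draft_next_tokens` and `target_argmax_tokens` are hypothetical placeholders, and production implementations verify proposals probabilistically rather than by exact match.

```python
# Toy, greedy-verification sketch of speculative decoding: a cheap draft model
# proposes k tokens, and the large target model checks them in one pass,
# accepting the agreeing prefix. Both model functions are placeholders.

def draft_next_tokens(context: list[str], k: int) -> list[str]:
    """Placeholder: a small, fast model guesses the next k tokens."""
    return ["the", "quick", "brown"][:k]

def target_argmax_tokens(context: list[str], k: int) -> list[str]:
    """Placeholder: what the large model would emit at the same positions,
    computed in a single (parallel) forward pass."""
    return ["the", "quick", "fox"][:k]

def speculative_step(context: list[str], k: int = 3) -> list[str]:
    proposed = draft_next_tokens(context, k)
    checked = target_argmax_tokens(context, k)
    accepted = []
    for p, c in zip(proposed, checked):
        if p == c:
            accepted.append(p)   # draft token agrees with the target model
        else:
            accepted.append(c)   # first disagreement: take the target's token and stop
            break
    return context + accepted

print(speculative_step(["Once", "upon", "a", "time"]))
```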
But real-world deployment of test-time compute raises questions about generalizability. “I think there is an open question of how well these techniques can generalize beyond domains that are easily verifiable, like math and code. So for certain applications, it might not be as beneficial. But for more in-distribution tasks like math and code, I think practitioners can try out some of the recent reasoning models released by OpenAI, DeepSeek and Google and see if they help out for their use cases,” said Snell.
OpenAI’s Strawberry family of models engages in real-time reasoning during inference, and Microsoft CEO Satya Nadella has identified test-time compute as a new scaling law in AI development. Google is researching methods to optimize test-time computation by enabling models to generate and evaluate multiple solutions. Nvidia is developing hardware and software solutions to support dynamic inference processes, while Meta is investing in AI infrastructure that allows models to adjust their computational pathways during inference.
The timeline for widespread adoption of test-time compute varies. Jeremy Bron, AI director at Silamir Group, a data-AI-cyber consulting company headquartered in France, told Information Security Media Group that basic strategies could be implemented within months, particularly by teams with existing cloud-based GPU or TPU infrastructure. More advanced techniques like latent-space reasoning might take a year or more of dedicated research and development.