Llama 4 Scout and Maverick Face Scrutiny Over Benchmarking Practices

Meta released two new open-weight large language models that aim to scale artificial intelligence performance while lowering compute costs. Their debut drew criticism over how the company presented benchmark results, raising questions about transparency in AI model evaluation.
Dubbed Llama 4 Scout and Llama 4 Maverick, both models are built on a mixture-of-experts architecture.
The launch comes at a time when researchers and developers are scrutinizing benchmark scores as they try to evaluate increasingly complex models. While Meta pitched Llama 4 as a major step forward, the use of non-public “experimental” versions for testing triggered debate about how representative those scores are of models available to users.
Scout and Maverick each feature 17 billion active parameters but differ in how they route tasks to specialized parts of the model. Scout uses 16 experts – specialized neural network components that focus on specific tasks or data types – and is small enough to run on a single Nvidia H100 GPU, making it suitable for developers with limited resources. Maverick scales to 128 experts and is designed for larger, more complex workloads. Both are derived from Llama 4 Behemoth, an unreleased model with 288 billion active parameters and nearly two trillion total parameters that is still in training.
The mixture-of-experts design allows the model to activate only a small subset of experts for each input, offering efficiency gains over dense models, in which every parameter is used for every input. This structure can improve performance and lower the cost of inference, which could make deployment more practical across a range of enterprise use cases.
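To illustrate the general idea behind that routing step, the sketch below shows simplified top-k expert gating in Python. All layer sizes, weights and function names are invented for the example; this is not Meta's implementation, and production models route individual tokens inside much larger transformer layers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only; real Llama 4 models are vastly larger.
HIDDEN_DIM = 64      # width of the token representation
NUM_EXPERTS = 16     # e.g. a Scout-style expert count
TOP_K = 1            # how many experts each input is routed to

# Each "expert" here is just a small feed-forward weight matrix.
experts = [rng.standard_normal((HIDDEN_DIM, HIDDEN_DIM)) * 0.02
           for _ in range(NUM_EXPERTS)]
# The router scores every expert for a given input representation.
router_weights = rng.standard_normal((HIDDEN_DIM, NUM_EXPERTS)) * 0.02


def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()


def moe_layer(token_vec):
    """Route one input vector through its top-k experts and mix the outputs."""
    scores = softmax(token_vec @ router_weights)   # one score per expert
    top_idx = np.argsort(scores)[-TOP_K:]          # keep only the best-scoring experts
    # Only the selected experts run, which is where the compute savings come from.
    output = np.zeros_like(token_vec)
    for i in top_idx:
        output += scores[i] * (token_vec @ experts[i])
    return output, top_idx


token = rng.standard_normal(HIDDEN_DIM)
out, chosen = moe_layer(token)
print(f"Input routed to expert(s) {chosen.tolist()}; "
      f"only {TOP_K} of {NUM_EXPERTS} experts ran.")
```

A dense model would instead multiply the input through every expert's weights on every pass, which is why activating only a few experts reduces inference cost even when the total parameter count is very large.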
After the release, Llama 4 Maverick climbed to the second spot on LM Arena, a leaderboard that uses human preferences to compare AI models. But some researchers flagged that the version submitted to the leaderboard was not the same as the open-weight model, saying that Meta used an “experimental” chat version of Maverick for benchmarking – a version not available to the public.
The move drew criticism from the AI community, with researchers arguing that it undermines the purpose of benchmarking. Benchmarks are supposed to reflect the performance of models as they are released, not internally tuned variants that may not behave the same in real-world settings. Using non-public versions can give a distorted picture of quality, especially when rankings are used to drive developer interest and shape perceptions about model superiority.
Meta defended the process. The company’s vice president of generative AI, Ahmad Al-Dahle, posted on X, formerly Twitter, that it was “simply not true” that Meta had trained models on the test sets to artificially inflate results. He said performance could vary depending on which platform the models run on and that Meta was still tuning public deployments to match the quality levels seen internally.