Fable 5 Release Fuels Debate Over Whether Frontier Models Are Worth the Higher Cost

The security industry has for months treated frontier artificial intelligence models as a category apart: systems so capable that they needed to be rationed. Anthropic’s release of Fable 5, a Mythos-class model publicly available with cybersecurity capabilities routed to a less-powerful model, has pushed into focus the question of how much of that capability actually requires frontier access.
Frontier models that are part of Anthropic’s Project Glasswing and OpenAI’s Daybreak sit behind controlled-access programs, available only to vetted partners. The justification for the limited access is that the models can reason across enormous codebases, chain individual vulnerabilities into working exploits and surface flaws that survived decades of prior scrutiny.
Smaller language models, built to run on less data and fewer tokens, were considered unsuited for that kind of reasoning, but industry experts are now exploring whether they can close that gap, and under what conditions.
Finding a flaw in a specific section of code is straightforward once a model is pointed at the right place. Reasoning across an entire codebase to locate a vulnerability nobody has named before is a different problem entirely, and that is where smaller models struggle.
A cheaper model might succeed on a complex identification step 30% of the time, while the percentage for a frontier model will be 80%, said Philippe Dourassou, AI pen test lead at Aikido Security. When chained with exploitation and moving deeper into a compromised system, the probabilities multiply across each step. The smaller model completes the full sequence roughly 3% of the time, and the frontier model, close to half, he said. “The harder and longer the task becomes, the better the smart model will perform,” he told ISMG. Frontier models can hold more code and prior reasoning in their working memory simultaneously, and their training includes solving multi-step problems.
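The compounding Dourassou describes can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes a chain of three steps (identification, exploitation and moving deeper into the system), that every step has the same success rate as the identification step, and that steps succeed or fail independently. None of those assumptions is stated in his remarks, and the function name is hypothetical.

```python
def chain_success(per_step: float, steps: int) -> float:
    """Probability of completing every step in a chain, assuming each
    step succeeds independently with the same probability."""
    if not 0.0 <= per_step <= 1.0:
        raise ValueError("per_step must be between 0 and 1")
    if steps < 1:
        raise ValueError("steps must be at least 1")
    return per_step ** steps


# Assumption: three chained steps, each as hard as the identification step.
STEPS = 3

small = chain_success(0.30, STEPS)     # cheaper model, 30% per step
frontier = chain_success(0.80, STEPS)  # frontier model, 80% per step

print(f"cheaper model:  {small:.1%}")    # 2.7%
print(f"frontier model: {frontier:.1%}")  # 51.2%
```

Under those assumptions the figures land at roughly 3% and just over half, in line with the numbers quoted. Adding a fourth step would widen the gap further, which is the point of the quote that follows.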
Dipto Chakravarty, chief technology officer at Black Duck, said while small language models are limited in solving harder tasks, the key to success rates is understanding how often those tasks arise in practice. Most of what organizations need isn’t a novel exploit chain, but reliable detection and triage at volume, he said. And for that, the scaffolding around the model – the system governing what code gets examined, how many attempts the model gets and how findings get organized – matters as much as the model itself.
“When Anthropic runs Terminal-Bench at 1 million tokens per task, five retries and a ceiling of three times the compute, a candid principal engineer will surmise half of the observed capability delta belongs to the harness, not the weights,” he told ISMG. In other words, when a benchmark gives a model generous resources and multiple attempts, much of the performance gain comes from those conditions and not from the model being smarter.
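One way to see how much a harness can contribute is to look at what retries alone do to a success rate. The sketch below is not a model of Terminal-Bench or of any vendor's setup. It assumes attempts are independent, that the harness can reliably tell when an attempt has succeeded, and it borrows the 30% per-attempt figure from Dourassou's example purely for illustration. The function name is hypothetical.

```python
def success_within_attempts(per_attempt: float, attempts: int) -> float:
    """Probability that at least one of several independent attempts
    succeeds, assuming a reliable check identifies the successful one."""
    if not 0.0 <= per_attempt <= 1.0:
        raise ValueError("per_attempt must be between 0 and 1")
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    return 1.0 - (1.0 - per_attempt) ** attempts


# Assumption: a 30% per-attempt success rate, for illustration only.
for attempts in (1, 3, 5):
    rate = success_within_attempts(0.30, attempts)
    print(f"{attempts} attempt(s): {rate:.1%}")
```

With these inputs, a 30% single-attempt rate rises to about 83% over five attempts, with no change to the model itself. Real attempts are rarely fully independent, so the actual gain from retries is likely to be smaller.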
The argument is backed by Microsoft’s findings from an experiment dubbed MDash, which paired frontier models for complex reasoning with smaller distilled models – compressed versions trained to execute specific tasks efficiently – across a pipeline of more than 100 coordinated agents. On CyberGym, a benchmark of 1,507 real-world vulnerability reproduction tasks, MDash outscored both Mythos and GPT-5.5. The pipeline, not any single model, drove the result.
For the high-volume, repeatable work that fills most security teams’ days, such as matching findings to known vulnerability categories and correlating alerts across systems, domain-trained small models already outperform frontier ones on accuracy. IBM Research’s CyberPal 2.0, a family of security-specialized models, beats GPT-4o and o1 on core threat investigation tasks.
George Gerchow, chief security officer at Bedrock Data, said the results that small models replicate are not the ones that frontier models demonstrated. “The bugs lived for decades through every static analyzer, fuzzer and pattern-matcher we have pointed at the same code,” he said of the vulnerabilities Mythos surfaced. “They did not survive because nobody looked. They survived because finding them requires reasoning across thousands of lines of context to spot an interaction nobody knew existed. That is not a search problem. It is a reasoning problem, and the reasoning ceiling sits with the model.”
On most benchmarks, smaller models are handed the relevant code directly, he said. A genuine autonomous scan starts from an entire codebase and has to find that code first.
AI-generated vulnerability reports are already arriving faster than human reviewers can assess them. HackerOne paused its internet bug bounty program this year after AI submissions overwhelmed triage capacity. The cURL project shuttered its bug bounty program for the same reason. Gerchow said the reasoning capacity of frontier models is the hardest to replace, because it assesses not just whether a flaw exists, but whether it is reachable, exploitable and worth acting on.
Howie Koh, vice president of innovation at Forescout, said that both positions describe different layers of the same problem. Smaller models handle the continuous, cost-efficient sweeps, while frontier AI covers the periodic deep analysis when its reasoning depth is needed.
“That will result in multiple models inside a single harness and it will open a market gap for vendors who can optimize for outcomes and return on investment, rather than offering harnesses that conveniently work best with their own frontier models,” he told ISMG. The opportunity essentially belongs to vendors who have no model of their own to protect and market because they can pick the right tool for each task rather than defaulting to the most expensive one.
“All it takes is one zero-day finding that a frontier model surfaces, and a smaller model misses, to justify the investment of running that frontier model,” Gerchow said. The release of Fable 5 gives security teams another option, depending on what they are trying to solve – and how much they can afford to miss.
