Vulnerability Researchers: Start Tracking LLM Capabilities, Says Veteran Bug Hunter

A vulnerability researcher said large language models have taken a big step forward in their ability to help chase down code flaws.
Veteran London-based bug hunter Sean Heelan said he’s been reviewing frontier artificial intelligence models to see if they’ve got the chops to spot vulnerabilities, and found success with OpenAI’s o3 model, released in April. It discovered CVE-2025-37899, a remotely exploitable zero-day vulnerability in the Linux kernel’s implementation of the server message block protocol, which is used for sharing files, printers and other resources across a network.
“With o3, LLMs have made a leap forward in their ability to reason about code, and if you work in vulnerability research you should start paying close attention,” Heelan said in a blog post. “If you’re an expert-level vulnerability researcher or exploit developer, the machines aren’t about to replace you. In fact, it is quite the opposite: they are now at a stage where they can make you significantly more efficient and effective.”
As that statement suggests, multiple caveats apply. Heelan’s success at using o3 to find CVE-2025-37899 – “a use-after-free in the handler for the SMB ‘logoff’ command” – appears to trace back, in no small part, to his expertise as a bug hunter.
Notably, he’d already discovered the similar CVE-2025-37778 vulnerability in KSMBD, the SMB3 Kernel Server built into Linux, which involved a dangling authentication pointer that was never set to null.
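For readers unfamiliar with the bug class, the following is a minimal, hypothetical C sketch of the pattern: a handler frees a session's user object but never clears the pointer, leaving a later handler to dereference freed memory. The structure and function names are illustrative stand-ins, not the actual KSMBD code.

```c
/*
 * Hypothetical sketch of the dangling-pointer bug class (not the
 * actual KSMBD code): a teardown handler frees a session's user
 * object but never clears the pointer, so a later handler can
 * dereference freed memory.
 */
#include <stdio.h>
#include <stdlib.h>

struct user {
    int uid;
};

struct session {
    struct user *user;   /* authentication state for this session */
};

/* Frees the user object but forgets to null the pointer. */
static void handle_teardown(struct session *sess)
{
    free(sess->user);
    /* Missing: sess->user = NULL; -- the pointer now dangles. */
}

/* A later handler trusts the stale pointer and reads freed memory. */
static void handle_request(struct session *sess)
{
    if (sess->user)                          /* still non-null */
        printf("uid=%d\n", sess->user->uid); /* use-after-free */
}

int main(void)
{
    struct session sess = { .user = malloc(sizeof(struct user)) };
    sess.user->uid = 1000;

    handle_teardown(&sess);  /* frees the object, leaves pointer dangling */
    handle_request(&sess);   /* undefined behavior */
    return 0;
}
```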
Heelan set o3 loose on some of the KSMBD code to see how it “would perform were it the backend for a hypothetical vulnerability detection system,” as well as what code and instructions it would need to be shown to make that happen.
The LLM couldn’t be given access to the entire code base, due to “context window limitations and regressions in performance that occur as the amount of context increases.” He first needed to create a system to clearly describe what needed to be analyzed and how.
Properly cued, o3 found the vulnerability, using detailed reasoning to achieve this result. “Understanding the vulnerability requires reasoning about concurrent connections to the server, and how they may share various objects in specific circumstances,” Heelan said. The LLM “was able to comprehend this and spot a location where a particular object – that is not referenced counted – is freed while still being accessible by another thread. As far as I’m aware, this is the first public discussion of a vulnerability of that nature being found by a LLM.”
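A rough userspace analogue of the concurrent pattern Heelan describes might look like the sketch below, in which one thread frees a shared, non-reference-counted user object while a second connection bound to the same session is still dereferencing it. The names and handlers are hypothetical simplifications, not the real KSMBD code.

```c
/*
 * Hypothetical userspace analogue of the concurrent use-after-free
 * pattern (not the real KSMBD code): one session object, not
 * reference counted, is shared by two connections; one thread frees
 * it while the other is still dereferencing it.
 * Build with: cc -pthread sketch.c
 */
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

struct user    { int uid; };
struct session { struct user *user; };  /* no refcount, no lock */

/* Second connection bound to the same session, handling its own
 * requests and reading the shared user object. */
static void *bound_connection(void *arg)
{
    struct session *sess = arg;
    for (int i = 0; i < 1000; i++) {
        if (sess->user)
            (void)sess->user->uid;  /* may read freed memory */
        usleep(10);
    }
    return NULL;
}

/* Logoff handler on the first connection frees the shared object,
 * with nothing preventing the other thread from still using it. */
static void handle_logoff(struct session *sess)
{
    free(sess->user);  /* freed while still reachable from the other thread */
}

int main(void)
{
    struct session sess = { .user = malloc(sizeof(struct user)) };
    sess.user->uid = 1000;

    pthread_t t;
    pthread_create(&t, NULL, bound_connection, &sess);
    usleep(100);
    handle_logoff(&sess);   /* race: use-after-free across threads */
    pthread_join(t, NULL);
    return 0;
}
```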
LLM Benchmarks Highlight Improvements
The vulnerability o3 discovered is a good benchmark for testing what the LLM can do. “While it is not trivial, it is also not insanely complicated,” Heelan said. The researcher said he could “walk a colleague through the entire code-path in 10 minutes,” with no extra Linux kernel, SMB protocol or other knowledge required.
Sometimes o3 also proposed the correct fix for the problem, although in other cases it presented an erroneous solution. One quirk: Heelan had independently arrived at the same fix o3 first proposed, only later realizing it wouldn’t work because the SMB protocol allows “two different connections to ‘bind’ to the same session,” leaving no way to block an attacker from exploiting the flaw even with a fix that sets the dangling pointer to null.
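To make that concrete, here is a hedged, hypothetical sketch of why simply clearing the pointer inside the logoff handler is not enough once a second connection is bound to the same session: a handler on the other connection may already hold its own copy of the pointer. Again, the names are illustrative, not the actual kernel code or patch.

```c
/*
 * Hypothetical sketch (not the actual kernel patch) of why clearing
 * the pointer in the logoff handler does not close the hole: a
 * handler on a second connection bound to the same session may have
 * already loaded its own copy of the pointer before the free.
 * Build with: cc -pthread sketch.c
 */
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

struct user    { int uid; };
struct session { struct user *user; };  /* shared by two bound connections */

/* The "obvious" fix: free the object and null the pointer. */
static void handle_logoff(struct session *sess)
{
    free(sess->user);
    sess->user = NULL;  /* too late for any thread that already read it */
}

/* Handler on the other bound connection: it took a snapshot of the
 * pointer before the logoff ran, so nulling sess->user doesn't help. */
static void *bound_connection(void *arg)
{
    struct session *sess = arg;
    struct user *u = sess->user;  /* local copy taken before the free */
    usleep(200);                  /* logoff runs in this window */
    if (u)
        (void)u->uid;             /* still a use-after-free */
    return NULL;
}

int main(void)
{
    struct session sess = { .user = malloc(sizeof(struct user)) };
    sess.user->uid = 1000;

    pthread_t t;
    pthread_create(&t, NULL, bound_connection, &sess);
    usleep(50);
    handle_logoff(&sess);  /* clearing the pointer doesn't prevent the race */
    pthread_join(t, NULL);
    return 0;
}
```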
As that suggests, the vulnerability remediation process can’t yet be fully automated.
One challenge involves success rates. Heelan said he ran his test 100 times, each time using the same 12,000 lines of code – “combining the code for all of the handlers with the connection setup and teardown code, as well as the command handler dispatch routines” – which worked out to about 100,000 input tokens, o3’s maximum. A token is a fragment of a word used in natural language processing and averages about four characters.
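As a back-of-the-envelope aid, those figures suggest a simple way to check whether a chunk of code is likely to fit in a similar token budget: count its characters and divide by four. The sketch below applies that heuristic; the four-characters-per-token figure and the roughly 100,000-token budget are approximations taken from the reporting above, not exact model limits.

```c
/*
 * Rough context-budget check using the ~4-characters-per-token
 * heuristic mentioned above. The 100,000-token figure mirrors the
 * budget described in the article; both numbers are approximations.
 * Build with: cc -o estimate estimate.c, then: ./estimate file.c
 */
#include <stdio.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <source-file>\n", argv[0]);
        return 1;
    }

    FILE *f = fopen(argv[1], "rb");
    if (!f) {
        perror("fopen");
        return 1;
    }

    long chars = 0;
    for (int c; (c = fgetc(f)) != EOF; )
        chars++;
    fclose(f);

    long est_tokens = chars / 4;  /* ~4 characters per token */
    printf("%ld characters, roughly %ld tokens (budget ~100,000)\n",
           chars, est_tokens);
    return est_tokens > 100000;   /* nonzero exit if it likely won't fit */
}
```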
The total cost of his approximately 100 test runs: $116.
Heelan said o3 found the vulnerability in the benchmark in eight of those 100 runs, concluded there was no flaw in 66 runs and generated false positives in 23 runs. That was an improvement on tests he ran using Anthropic’s Claude Sonnet 3.7, released in February, which found the flaw in three of its 100 runs, while Claude 3.5 didn’t find it at all.
As those results demonstrate, one main takeaway is that “o3 is not infallible,” and “there’s still a substantial chance it will generate nonsensical results and frustrate you,” he said.
What’s new is that “the chance of getting correct results is sufficiently high enough that it is worth your time and your effort to try to use it on real problems,” he said. “If you have a problem that can be represented in fewer than 10k lines of code there is a reasonable chance o3 can either solve it, or help you solve it.”
Making Professionals More Efficient
Heelan’s finding that AI tools might actually enhance technology professionals’ ability to do their job isn’t an outlier.
Speaking last month at the RSAC Conference in San Francisco, Chris Wysopal, co-founder and chief security evangelist at Veracode, said developers on average ship 50% more code when they use AI-enhanced software development tools, with Google and Microsoft reporting a third of their new code is now AI-generated.
One wrinkle is that AI tools are trained on what real-world developers do, and so they produce code containing about as many vulnerabilities as classically built code. More code, more vulnerabilities.
Wysopal said the – obviously ironic – solution to this “is to use more AI.” Specifically, LLMs trained on secure code examples, so they recognize bad code and know how to fix it (see: Unpacking the Effect of AI on Secure Code Development).
Heelan’s research shows how the latest frontier AI models might be brought to bear on finding and fixing certain types of flaws, in part through experts building well-designed systems that give an LLM the context it needs to review a particular type of functionality, tool or protocol.
“My prediction is that well-designed systems end up being just as – if not more important – than the increased intelligence of the models,” said AI and cybersecurity researcher Daniel Miessler in his latest Unsupervised Learning newsletter on Wednesday, responding to Heelan’s research.
“Think of it this way: the more context/memory and guided structure a given AI has to solve a problem, the less smart it needs to be,” Miessler said. “So when o3 or whatever finds its first zero-day, that’s cool, but it’s nothing compared to what it could do with 100 times the context and a super clear description of the life and work and process of a security researcher who does that for a living.”