Open Source ‘Vulnhalla’ Promises ‘Up to 96% Reduction in False Positives’

Every profession has its impossible dream. For software development, it’s using artificial intelligence tools to automatically find and help remediate code flaws.
Experts have found AI bug hunting so far to be doable only in very narrow and specific circumstances. But a new open source tool for vulnerability hunting called Vulnhalla may bring the impossible dream closer to realization, according to first results from experienced security researchers.
Developed by researchers at CyberArk Labs, Vulnhalla marries automated analysis of code for security flaws with a large language model, using a process called “guided questioning” to help an experienced code reviewer more quickly identify and review potential flaws. Its name derives from Valhalla, the Norse mythological hall of the afterlife that only the true and brave may enter – another impossible dream from a different age.
Available on GitHub, Vulnhalla is designed to fetch code repositories from GitHub as well as any CodeQL databases for the repository. CodeQL is GitHub’s code security analysis engine, designed to allow users to analyze their code and see code-scanning alerts. Vulnhalla runs CodeQL queries on those databases “to detect security or code-quality issues” and post-processes the results using the GPT-4 LLM “to classify and filter issues,” CyberArk says.
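The pipeline CyberArk describes – run CodeQL queries against a fetched database, then post-process each alert with an LLM – can be sketched roughly as below. The function names, alert fields, and the stubbed-out classifier are illustrative assumptions, not Vulnhalla’s actual API; a real implementation would shell out to the CodeQL CLI and call GPT-4.

```python
import json

def run_codeql_queries(database_path: str) -> list[dict]:
    """Placeholder for running CodeQL queries on a fetched database.
    In practice this would invoke the CodeQL CLI; here it returns a
    canned alert so the sketch is self-contained."""
    return [
        {"rule": "cpp/unbounded-write", "file": "src/parse.c",
         "line": 42, "snippet": "memcpy(dst, src, len);"},
    ]

def classify_with_llm(alert: dict) -> str:
    """Stub for the GPT-4 post-processing step that labels each alert.
    A real implementation would send the alert plus surrounding code
    context to the model; here everything is labeled 'needs-review'."""
    return "needs-review"

def triage(database_path: str) -> list[dict]:
    """Run the queries, then attach an LLM verdict to each alert."""
    alerts = run_codeql_queries(database_path)
    for alert in alerts:
        alert["verdict"] = classify_with_llm(alert)
    return alerts

results = triage("ffmpeg-codeql-db")
print(json.dumps(results, indent=2))
```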
“Every researcher, immediately when the AI came inside our world, we were all wondering: can we use it to find vulnerabilities?” said Simcha Kosman, a senior security researcher at CyberArk.
The goal is to significantly improve the signal-to-noise ratio facing bug hunters, Kosman said. “Running CodeQL’s built-in queries on Redis gave me over 6,800 potential issues. Doable, maybe,” he told Information Security Media Group.
“But when I tried FFmpeg, I got over 51,000. That’s way too much for me. And how many of those are real vulnerabilities? Probably around 0.01%. The sheer number of false positives makes static code analysis impractical – who wants to manually sift through tens of thousands of results just to find a few actual security flaws?”
Early results with Vulnhalla are promising. Kosman said that with a budget of $80 and only two days of effort, his team found seven new vulnerabilities in widely used tools, ranging from flaws in the Linux kernel and game engine RetroArch to the aforementioned video and audio processing tool FFmpeg and application cache tool Redis. The team engaged in coordinated disclosure processes with all of the relevant software projects.
Early approaches to AI bug hunting, such as copying and pasting code into LLMs and asking them to spot the flaws, didn’t yield good results. The entirety of a code base typically won’t fit into an LLM, due to limited context windows. Processing time was slow. False positives abounded.
Kosman identified two main hurdles. The first is context, referring to giving the LLM what it needs to determine if something is a vulnerability. The second he calls the “focus challenge,” which refers to needing “to know where exactly in the code to focus in order to determine if that’s the vulnerability,” by tracing both the data flow and control flow, just like an experienced bug hunter would do.
To address these challenges, Vulnhalla is designed around the concept of guided questioning, which is how it’s meant to be used by a senior researcher.
“The key is to consider how you would explain the issue to a junior security researcher, what you would advise them to look for, and how you would phrase it so they understand the core idea,” the CyberArk researcher said. “Those same explanations should become the questions you ask the LLM.”
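One way to picture that advice in practice: encode the senior reviewer’s briefing as targeted questions attached to each alert and fold them into the prompt. The questions and helper below are an illustrative guess at the approach for a tainted-length `memcpy` alert, not Vulnhalla’s actual prompts.

```python
# Hypothetical questions a senior reviewer might hand to the LLM,
# phrased the way one would brief a junior researcher.
GUIDING_QUESTIONS = [
    "Is the length argument derived from untrusted input?",
    "Is the length checked against the destination buffer's size "
    "on every path that reaches this call?",
    "Can an attacker influence the source pointer or the offset?",
]

def build_prompt(snippet: str, questions: list[str]) -> str:
    """Combine the code under review with the reviewer's questions,
    asking for per-question answers plus an overall verdict."""
    numbered = "\n".join(f"{i}. {q}" for i, q in enumerate(questions, 1))
    return (
        "Review the following code for a potential buffer overflow.\n\n"
        f"```c\n{snippet}\n```\n\n"
        "Answer each question, then give a verdict of "
        "TRUE_POSITIVE or FALSE_POSITIVE:\n"
        f"{numbered}\n"
    )

prompt = build_prompt("memcpy(dst, src, len);", GUIDING_QUESTIONS)
print(prompt)
```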
Using the tool with this guided questioning approach, researchers reported seeing “up to 96% reduction in false positives for specific issue types, which significantly alleviates the manual review burden of static analysis at scale.”
How aggressively false positives are filtered out depends on whether a user sets the tool to operate in strict or non-strict mode. Strict mode attempts to deliver only true positives, with the tradeoff that some true positives will be discarded as false. Still, this can be useful for organizations so overwhelmed by false positives that they don’t do any static code analysis at all, Kosman said.
The alternative, non-strict mode, lets through more results that turn out to be false, but still drives down the false-positive rate, he said. For separating noise from signal in the sometimes onerous process of reviewing code security alerts and identifying which ones are true positives, “I’m not replacing a human in the loop here completely, but it does help,” he said.
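The strict/non-strict tradeoff amounts to choosing which classifier verdicts to surface. The verdict labels and filter below are an illustrative sketch of that precision-versus-recall choice, not the tool’s actual output format.

```python
# Hypothetical verdicts an LLM classifier might attach to alerts.
ALERTS = [
    {"id": 1, "verdict": "true_positive"},
    {"id": 2, "verdict": "likely_true_positive"},
    {"id": 3, "verdict": "false_positive"},
]

def filter_alerts(alerts: list[dict], strict: bool) -> list[dict]:
    """Strict mode surfaces only high-confidence true positives,
    discarding some real flaws along the way; non-strict mode also
    keeps the uncertain ones, trading precision for recall."""
    keep = {"true_positive"}
    if not strict:
        keep.add("likely_true_positive")
    return [a for a in alerts if a["verdict"] in keep]

print(len(filter_alerts(ALERTS, strict=True)))   # strict keeps 1 alert
print(len(filter_alerts(ALERTS, strict=False)))  # non-strict keeps 2
```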
Kosman said Vulnhalla will develop further, and that he’d like to see it improved by the software development community and deployed against any type of code repository or static analysis tool. Likewise, while the tool currently works only with C and C++ code, he wants to expand compatibility to more software languages, to help many more bug hunters more rapidly find and eradicate code flaws.
