Artificial Intelligence & Machine Learning
,
Next-Generation Technologies & Secure Development
GenAI Chooses Insecure Code Nearly Half the Time, Veracode Finds

Artificial intelligence may be writing more of today’s code, but it is also writing vulnerabilities into it. Large language models introduce vulnerabilities in nearly half of test cases when asked to complete security-relevant coding tasks, researchers say.
There’s been little improvement in how well AI models handle core security decisions, says a report from application security company Veracode.
AI models are getting better at writing syntactically correct code, but not at writing secure code, the report finds. “LLMs are fantastic tools for developing software, but blind faith is not the way to go,” Veracode CTO Jens Wessling told Information Security Media Group.
Veracode analyzed 80 curated coding tasks drawn from well-established Common Weakness Enumeration classifications, including SQL injection, cryptographic weaknesses, cross-site scripting and log injection, each representing risks ranked in the OWASP Top 10. The company tested more than 100 LLMs across these tasks, using static analysis to assess whether the generated code was secure. The results confirm what many security teams suspected: GenAI is transforming the pace of development, but it is not yet reliable when it comes to risk.
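Veracode hasn’t published its test prompts, but the SQL injection class gives a sense of what the static analysis was checking for. The sketch below is purely illustrative, assuming a hypothetical user-lookup task in Java, the study’s worst-performing language; the method names and schema are not from the report.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class UserLookup {

        // Insecure completion: concatenating user input into the query
        // string creates a classic SQL injection sink (CWE-89).
        static ResultSet findUserInsecure(Connection conn, String username) throws SQLException {
            Statement stmt = conn.createStatement();
            return stmt.executeQuery("SELECT * FROM users WHERE name = '" + username + "'");
        }

        // Secure completion: a parameterized query keeps the input as data,
        // the property a static analyzer looks for.
        static ResultSet findUserSecure(Connection conn, String username) throws SQLException {
            PreparedStatement stmt = conn.prepareStatement("SELECT * FROM users WHERE name = ?");
            stmt.setString(1, username);
            return stmt.executeQuery();
        }
    }

Both methods return the same results for benign input; the difference is a single design decision about how the input reaches the database, which is the kind of choice the study measured.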
Wessling said model size and complexity didn’t correlate with more secure output. “The mean difference in security performance between small, medium and large LLMs was less than 2%,” he said. That means the problem is not a scaling issue, but a systemic one, he added.
Java was the worst-performing language, with LLMs generating insecure code in more than 70% of cases. Wessling said Java’s long history is a likely contributor. “It has one of the largest volumes of publicly available training data, and a lot of that predates awareness of the CWEs we tested,” he said. Python, C# and JavaScript posted security failure rates between 38% and 45%.
The worst results came in weakness classes that require broader context. LLMs avoided cryptographic issues and SQL injection vulnerabilities around 80% of the time, but they failed badly on log injection and cross-site scripting, succeeding only about 10% of the time. “Knowing whether a log message is tainted requires understanding where the data comes from and what it contains,” Wessling said. That kind of context is hard for LLMs to reason about.
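To illustrate the taint reasoning Wessling describes, here is a hedged Java sketch of a log injection sink (CWE-117). The SLF4J logger and the method names are assumptions made for the example, not code from the study.

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class LoginAudit {
        private static final Logger log = LoggerFactory.getLogger(LoginAudit.class);

        // Insecure: if username arrives as "bob\nINFO Login succeeded for admin",
        // the attacker-controlled newline forges a second, fake log entry.
        static void recordFailureInsecure(String username) {
            log.info("Login failed for " + username);
        }

        // Safer: strip line breaks so untrusted input cannot start a new log
        // line. Knowing this is needed means knowing the value is tainted,
        // which is the context LLMs tend to miss.
        static void recordFailureSafer(String username) {
            String sanitized = username.replaceAll("[\\r\\n]", "_");
            log.info("Login failed for " + sanitized);
        }
    }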
Researchers tested the output of “vibe coding,” a practice in which developers rely on AI tools to produce code without explicitly defining design choices. Developers often don’t specify security constraints, which effectively delegates those decisions to the model. The study shows that models pick an insecure path 45% of the time.
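The report doesn’t list the individual tasks behind that figure, but cryptographic weaknesses, one of the tested CWE classes, show how an unspecified constraint becomes the model’s call. The Java sketch below is hypothetical: a bare “store the password” request can plausibly be completed with the unsalted fast hash in the first method, while the explicitly secure choice looks more like the PBKDF2 version.

    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.security.SecureRandom;
    import java.security.spec.InvalidKeySpecException;
    import java.util.HexFormat; // Java 17+
    import javax.crypto.SecretKeyFactory;
    import javax.crypto.spec.PBEKeySpec;

    public class PasswordStorage {

        // Insecure default: unsalted MD5 is a plausible completion when the
        // prompt never mentions security (CWE-327 / CWE-916).
        static String hashInsecure(String password) throws NoSuchAlgorithmException {
            MessageDigest md = MessageDigest.getInstance("MD5");
            return HexFormat.of().formatHex(md.digest(password.getBytes()));
        }

        // Explicit choice: a salted, iterated PBKDF2 hash, the decision a
        // developer delegates to the model when the prompt stays vague.
        static String hashSecure(char[] password, byte[] salt)
                throws NoSuchAlgorithmException, InvalidKeySpecException {
            PBEKeySpec spec = new PBEKeySpec(password, salt, 210_000, 256);
            SecretKeyFactory skf = SecretKeyFactory.getInstance("PBKDF2WithHmacSHA256");
            return HexFormat.of().formatHex(skf.generateSecret(spec).getEncoded());
        }

        // Fresh random salt per password.
        static byte[] newSalt() {
            byte[] salt = new byte[16];
            new SecureRandom().nextBytes(salt);
            return salt;
        }
    }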
Some vendors have suggested that prompting LLMs more carefully could improve results. Veracode did not test that claim. “We did not run those tests as part of this study,” Wessling said, “but explicitly telling an LLM to produce more secure code will, clearly, produce more secure results, the same way telling a developer would.”
Attackers can also use AI tools to identify and exploit vulnerabilities, the report says, especially since the barrier to entry for less-skilled attackers is dropping and the speed and sophistication of attacks are increasing.
But Wessling doesn’t see LLMs as the enemy. He said they’re a critical part of the future of secure development, if used responsibly. That includes AI-driven remediation tools that can catch what LLMs miss. The report focuses on vulnerabilities introduced by code generation, but also outlines some strategies for mitigating these risks, including pairing LLM usage with static analysis, software composition analysis and package firewalls integrated into agentic workflows. Veracode recommends enforcing secure coding standards through policy automation and embedding guardrails directly in development pipelines.
In Wessling’s view, this isn’t just about improving tools but about building the right security posture around them. “Having sound security policies around scanning and remediation is critical for any software enterprise, and LLM-generated code is no exception,” he said.
On whether unaugmented LLMs could ever be trusted to write secure code without oversight, Wessling said: “If the trend continues, unaugmented LLMs are unlikely to produce software that can be trusted without validation.”
