Researchers Examined 7 LLMs to Determine How They Dealt with Flawed Code

Large language models trained on flawed data tend to replicate those mistakes, researchers have found, raising concerns for enterprises relying on artificial intelligence-assisted software development.
A group of researchers examined seven LLMs to determine how they performed when dealing with flawed code snippets.
Using code from the Defects4J dataset, nine researchers from institutions including Beijing University of Chemical Technology and the Chinese Academy of Sciences assessed seven models, including OpenAI’s GPT-4o, Meta’s CodeLlama, Google’s Gemma and Salesforce’s CodeGen. The findings: GPT-4o copied known bugs 82.61% of the time, while GPT-3.5 followed with a 51.12% replication rate.
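The paper’s exact harness isn’t reproduced here, but a minimal sketch of this kind of evaluation, assuming a generic completion call and an exact-match comparison against the dataset’s historical bug and fix, might look like the following Python. The `BugCase` structure and `llm_complete` function are illustrative stand-ins, not the study’s code.

```python
from dataclasses import dataclass


@dataclass
class BugCase:
    context: str      # code preceding the masked line
    buggy_line: str   # historical buggy line recorded in the dataset
    fixed_line: str   # developer-corrected version of that line


def normalize(line: str) -> str:
    # Compare lines with whitespace collapsed so formatting doesn't matter.
    return " ".join(line.split())


def classify(completion: str, case: BugCase) -> str:
    if normalize(completion) == normalize(case.buggy_line):
        return "echoed_bug"   # identical to the historical bug
    if normalize(completion) == normalize(case.fixed_line):
        return "correct"
    return "other"


def bug_replication_rate(cases, llm_complete) -> float:
    # Fraction of cases where the model's completion reproduces the known bug.
    echoed = sum(
        classify(llm_complete(c.context), c) == "echoed_bug" for c in cases
    )
    return echoed / len(cases)
```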
“In bug-prone tasks, the likelihood of LLMs generating correct code is nearly the same as generating buggy code, and it is substantially lower than in normal code completion tasks,” researchers said. For GPT-4, the accuracy rate dropped from 29.85% on clean code to 12.27% on buggy snippets.
“On average, each model generates approximately 151 correct completions and 149 buggy completions, highlighting the increased difficulty of handling bug-prone contexts,” the report said.
The behavior shows that LLMs are prone to memorization rather than intelligent error correction. The researchers described it as “echoing errors,” in which models regurgitate past mistakes from their training data instead of applying reasoning or pattern recognition to correct them.
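As a hypothetical illustration, not an example from the study, an echoed error could be as simple as a model reproducing a historical off-by-one verbatim when completing a return statement:

```python
# Hypothetical illustration of an "echoed error" (not taken from the paper):
# the historical version of this function indexed one past the end of the
# list, and a model completing the return statement may reproduce that bug
# verbatim instead of the fix.

def last_element(items):
    if not items:
        return None
    # Echoed historical bug:  return items[len(items)]      # raises IndexError
    # Correct completion:     return items[len(items) - 1]
    return items[len(items) - 1]
```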
“To our surprise, on average, 44.44% of the bugs LLMs make are completely identical to the historical bugs,” the researchers said. For GPT-4o, the figure was as high as 82.61%.
The study also uncovered that LLMs struggle more with complex programming scenarios than with simpler syntax. Models exhibited higher error rates in tasks involving method invocation and return statements compared to variable declarations or conditional statements.
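A rough sketch of how such a per-construct breakdown might be computed, assuming simple regular-expression heuristics to tag each completion site by statement type; the patterns and the `error_rate_by_construct` helper are assumptions for illustration, not the paper’s method.

```python
import re

# Rough, hypothetical heuristics for tagging what kind of statement a masked
# line is, so error rates can be broken down by construct type (return
# statements, conditionals, variable declarations, method invocations).
PATTERNS = [
    ("return",      re.compile(r"^\s*return\b")),
    ("conditional", re.compile(r"^\s*(if|else if|switch)\b")),
    ("declaration", re.compile(r"^\s*[A-Za-z_][\w<>\[\]]*\s+\w+\s*(=|;)")),
    ("invocation",  re.compile(r"\w+\s*\([^)]*\)\s*;")),
]


def construct_type(line: str) -> str:
    for label, pattern in PATTERNS:
        if pattern.search(line):
            return label
    return "other"


def error_rate_by_construct(results):
    """results: iterable of (completed_line, is_buggy) pairs."""
    totals, bugs = {}, {}
    for line, is_buggy in results:
        kind = construct_type(line)
        totals[kind] = totals.get(kind, 0) + 1
        bugs[kind] = bugs.get(kind, 0) + int(is_buggy)
    return {kind: bugs[kind] / totals[kind] for kind in totals}
```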
Google’s Gemma-7B demonstrated a lower bug replication rate of 15%, suggesting that smaller, more specialized models may introduce fewer errors in certain contexts. But even models designed for reasoning, like DeepSeek’s R1, failed to significantly outperform their counterparts when it came to bug-prone code.
The researchers advised that enhancing LLMs’ understanding of programming semantics, integrating error detection mechanisms and applying rigorous post-processing checks could reduce the tendency to mirror past mistakes.
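One lightweight form such a post-processing check could take, sketched here as an assumption rather than anything the paper specifies, is a guard that refuses completions matching a known historical bug and re-samples the model; `llm_complete` and `known_bug_lines` are hypothetical stand-ins.

```python
# Hypothetical post-processing guard: before accepting a completion, reject
# anything that exactly reproduces a known historical bug and ask the model
# to try again; fall back to human review if it keeps echoing the bug.

def guarded_complete(context, known_bug_lines, llm_complete, max_retries=3):
    normalized_bugs = {" ".join(bug.split()) for bug in known_bug_lines}
    for _ in range(max_retries):
        completion = llm_complete(context)
        if " ".join(completion.split()) not in normalized_bugs:
            return completion
        # Completion echoed a known historical bug; re-sample the model.
    return None  # hand off to a human reviewer rather than accept a known bug
```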