LLMs Falter on Real-world Bugs, Even With Debugger Access: Microsoft

Artificial intelligence can code, but it can't debug, says Microsoft, after observing how large language models performed on a series of real-world software programming tests.
Researchers concluded that despite rapid advancements in code generation, most LLMs struggle to resolve software bugs, even when given access to traditional developer tools such as debuggers.
AI-powered coding assistants have become increasingly integrated into software development workflows, with tools like GitHub Copilot, Amazon CodeWhisperer and ChatGPT streamlining tasks such as code completion, documentation and boilerplate creation.
The team assessed nine popular models using a benchmark called SWE-bench Lite, which consists of 300 real Python issues drawn from GitHub repositories. Each issue includes a test case that fails until the model correctly patches the code. A second evaluation used a smaller set of 30 debugging tasks to examine how LLMs behave in more controlled scenarios.
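A simplified, hypothetical example illustrates the shape of such a task (the file and function names below are invented for illustration, not drawn from the benchmark): a repository contains a defect, and a test fails until the defect is patched.

```python
# calc.py -- hypothetical repository code containing the defect
def moving_average(values, window):
    """Return the average of the most recent `window` values."""
    recent = values[:window]          # bug: slices the oldest values, not the newest
    return sum(recent) / window

# test_calc.py -- the test case that keeps failing until the code is patched
def test_moving_average_uses_most_recent_values():
    assert moving_average([1.0, 2.0, 10.0, 20.0], 2) == 15.0
```

The model's job is to produce the patch, here changing the slice to `values[-window:]`, so that the failing test passes without breaking the rest of the suite.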
Even the best-performing models were unable to resolve a majority of the issues. Anthropic’s Claude 3 Sonnet achieved the highest accuracy at 48.4% on SWE-bench Lite among the models tested. OpenAI’s o1 and o3-mini scored 30.2% and 22.1%, respectively. Microsoft’s own Phi-2 model achieved 15.8% accuracy.
The study also tested whether providing access to Python’s built-in debugger pdb would help. On a smaller curated set of 30 problems, Claude 3 Sonnet improved its accuracy from 27% to 32% when the debugger was enabled. But most models saw little or no meaningful benefit.
Microsoft said it has built a new training and evaluation environment called Debug-Gym, designed to simulate interactive debugging by allowing models to interact with a real Python execution environment through a text interface. The system is built on OpenAI’s Gym toolkit and runs inside a Docker container. It exposes elements like source code, stack traces and failing test cases. Models can run the test suite, use debugging commands and apply code changes, receiving structured feedback after each action.
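The article's description leaves the exact interface open, but a conventional Gym-style text environment along these lines gives a sense of the setup. This is a rough sketch only; the class, tool names and observation format below are assumptions for illustration, not Debug-Gym's actual API.

```python
import subprocess
import gym


class TextDebugEnv(gym.Env):
    """Sketch of a text-interface debugging environment (not Debug-Gym's real API)."""

    def __init__(self, repo_dir):
        self.repo_dir = repo_dir

    def reset(self):
        # Initial observation: the failing test output the agent has to fix.
        return self._run_tests()

    def step(self, action):
        # `action` is a text command such as "run_tests" or "edit <unified diff>";
        # each action returns structured textual feedback, as the article describes.
        # A real environment would also route commands like "pdb <command>" to a
        # live debugger session.
        tool, _, arg = action.partition(" ")
        if tool == "run_tests":
            observation = self._run_tests()
        elif tool == "edit":
            observation = self._apply_patch(arg)
        else:
            observation = f"unknown tool: {tool}"
        # Crude check for the sketch: the episode ends once the test run reports no failures.
        done = tool == "run_tests" and "failed" not in observation
        reward = 1.0 if done else 0.0
        return observation, reward, done, {}

    def _run_tests(self):
        result = subprocess.run(
            ["python", "-m", "pytest", "-x", "--tb=short"],
            cwd=self.repo_dir, capture_output=True, text=True,
        )
        return result.stdout + result.stderr

    def _apply_patch(self, patch_text):
        result = subprocess.run(
            ["git", "apply", "-"], input=patch_text,
            cwd=self.repo_dir, capture_output=True, text=True,
        )
        return result.stderr or "patch applied"
```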
Debug-Gym helps AI systems learn sequential problem-solving strategies, Microsoft said. By mimicking how developers explore code using tools like pdb, the environment can help evaluate whether models can learn to fix bugs by inspecting runtime behavior, setting breakpoints and using feedback from failed tests to guide code edits.
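In practice, that workflow looks something like the following pdb session against the hypothetical example above (output trimmed; the exact commands a model issues inside the environment may differ):

```
$ python -m pdb -m pytest test_calc.py -x
(Pdb) b calc.py:4             # set a breakpoint on the suspect slice in moving_average
(Pdb) c                       # continue until the failing test hits the breakpoint
(Pdb) p values, window        # inspect the arguments: ([1.0, 2.0, 10.0, 20.0], 2)
(Pdb) n                       # execute the slice
(Pdb) p recent                # [1.0, 2.0] -- the wrong end of the list
(Pdb) q
```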
But even with the ability to step through execution and inspect values, models performed inconsistently. Researchers said that AI systems are often not trained with data that reflects how humans actually perform debugging. As a result, their use of tools like pdb does not always align with how a human developer would approach the same problem.
The models often issued debugging commands without a clear strategy or failed to modify their approach based on new information, limiting the effectiveness of their interaction with the environment.
LLMs have shown utility in tasks like code completion and generation, but debugging presents a different set of challenges: it requires a feedback-driven process that hinges on interpreting test failures, modifying code accordingly and reevaluating results.
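A minimal sketch of that feedback-driven loop, with `propose_patch` standing in for a model call (the helper names are illustrative, not any particular tool's API):

```python
import subprocess

def run_tests(repo_dir):
    """Run the test suite; return (passed, textual output) for the model to read."""
    result = subprocess.run(
        ["python", "-m", "pytest", "-x", "--tb=short"],
        cwd=repo_dir, capture_output=True, text=True,
    )
    return result.returncode == 0, result.stdout + result.stderr

def debug_loop(repo_dir, propose_patch, max_rounds=5):
    """Interpret a failure, apply a proposed edit, then re-evaluate by re-running tests."""
    for _ in range(max_rounds):
        passed, output = run_tests(repo_dir)
        if passed:
            return True                       # bug resolved
        patch = propose_patch(output)         # model turns the failure text into a diff
        subprocess.run(["git", "apply", "-"], input=patch,
                       cwd=repo_dir, text=True)
    return False                              # give up after max_rounds attempts
```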