mundophone

DIGITAL LIFE

AI models demonstrate great programming capabilities but are far from the power of the human brain

AI models from OpenAI and Anthropic demonstrate great programming skills, but according to a Microsoft study, they still struggle to solve software bugs that an experienced human developer can easily overcome.

Artificial intelligence companies such as OpenAI, Meta AI, Anthropic, Google, xAI and others have been showcasing the programming capabilities of their LLMs, especially "deep thinking" models such as the recent Gemini version and Claude. Google has already said that 25% of the company's new code is produced by AI.

According to a Microsoft study, despite the advances made in this area, advanced AI models still have limitations when it comes to resolving software bugs, something that an experienced human developer can overcome without difficulty. The study points out that models such as Anthropic's Claude 3.7 Sonnet and OpenAI's o3-mini failed to solve many of the problems in the SWE-bench Lite benchmark.

These results show that AI still has a long way to go before it matches experienced programmers.

Microsoft created debug-gym to help develop LLM agents in an interactive coding environment, bridging the gap between the current capabilities of LLMs and the requirements of large-scale code creation and bug fixing (debugging). This lightweight text-based environment features several useful tools, such as the Python debugger (pdb), designed to make it easier for AI agents to fix bugs.
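
To give a rough idea of what driving a debugger through text commands looks like, here is a minimal sketch that uses only Python's standard `pdb` module. It is not debug-gym's API: the `average` function, its deliberate bug, and the `run_scripted_pdb` helper are hypothetical names made up for this illustration, and a fixed list of commands stands in for whatever an agent would decide to type.

```python
import io
import pdb


def average(values):
    # Deliberate bug for the demo: divides by a hard-coded 2
    # instead of len(values).
    total = sum(values)
    return total / 2


def run_scripted_pdb(commands, func, *args):
    """Run func under pdb, feeding it a fixed list of debugger commands.

    Returns the function's result and the text the debugger printed.
    """
    stdin = io.StringIO("\n".join(commands) + "\n")
    stdout = io.StringIO()
    # readrc=False ignores any local .pdbrc; nosigint=True avoids
    # installing a SIGINT handler, so this also works off the main thread.
    debugger = pdb.Pdb(stdin=stdin, stdout=stdout, nosigint=True, readrc=False)
    result = debugger.runcall(func, *args)
    return result, stdout.getvalue()


# Scripted session: step over one line, inspect two values, then resume.
result, transcript = run_scripted_pdb(
    ["next", "p total", "p len(values)", "continue"],
    average,
    [2, 4, 6],
)
print(transcript)
print("returned:", result)
```

The transcript is the kind of evidence a debugging agent would read before proposing a fix: it shows the running total and the real length of the list, which is enough to notice that the divisor in `average` is wrong.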

In the Microsoft study, nine AI models were tested as the basis of a single-prompt agent that had access to several debugging tools, including the Python debugger. The test consisted of 300 software debugging tasks. The results show that even the latest models completed fewer than half of the tasks: Claude 3.7 Sonnet achieved a success rate of 48.4%, followed by OpenAI's o1 at 30.2% and o3-mini at 22.1%, TechCrunch reports.

The study's authors attribute the problem to a lack of training data that represents developers' sequential decision-making when resolving bugs. They believe that specialized data is needed to train the models, such as logs of agents interacting with a debugger to gather the necessary information before suggesting a fix.

mundophone

Tuesday, April 15, 2025