"“Tasks that seemed straightforward often took days rather than hours, with Devin getting stuck in technical dead-ends or producing overly complex, unusable solutions,” the researchers explain in their report. “Even more concerning was Devin’s tendency to press forward with tasks that weren’t actually possible.”
As an example, they cited how Devin, when asked to deploy multiple applications to the infrastructure deployment platform Railway, failed to understand this wasn’t supported and spent more than a day trying approaches that didn’t work and hallucinating non-existent features.
Of 20 tasks presented to Devin, the AI software engineer completed just three of them satisfactorily – the two cited above and a third challenge to research how to build a Discord bot in Python. Three other tasks produced inconclusive results, and 14 projects were outright failures.
The researchers said that Devin provided a polished user experience that was impressive when it worked.
“But that’s the problem – it rarely worked,” they wrote.
“More concerning was our inability to predict which tasks would succeed. Even tasks similar to our early wins would fail in complex, time-consuming ways. The autonomous nature that seemed promising became a liability – Devin would spend days pursuing impossible solutions rather than recognizing fundamental blockers.”"
https://www.theregister.com/2025/01/23/ai_developer_devin_poor_reviews/
#AI #GenerativeAI #AIAgents #Devin #Programming #SoftwareDevelopment
Devin must’ve been trained on enterprise code.
@remixtures@tldr.nettime.org So they are really going to take devs’ jobs!
If you mean to shit on how it works now, and tell idiot managers not to throw money at it, shout this from the rooftops.
If you mean to suggest it’s going to stay this dumb forever then you haven’t been paying attention.
“It rarely worked” acknowledges that it does sometimes work. That’s where quite a lot of money and effort will be expended. We’ve invented a program that writes other programs. You can just describe code into existence, based on high-level goals, in plain English. Even if human code forever remains the better option, it is now an option.
This whole evolutionary branch might still get snipped. LLMs hallucinate by design. We’ve proven you can get spooky results out of a neural network, a Library Genesis torrent, and a roomful of GPUs. We’re seeing a local maximum for this shape of network. The global maximum damn well ought to code better than us, for the same reasons and to the same degree a calculator can multiply better than us.
@remixtures@tldr.nettime.org not surprising at all!