Coding has emerged as genAI’s killer use case. But what if its benefits are a mirage?

Hello and welcome to Eye on AI…In this edition: Meta is going big on data centers…the EU publishes its code of practice for general purpose AI and OpenAI says it will abide by it…the U.K. AI Security Institute calls into question AI “scheming” research.


The big news at the end of last week was that OpenAI’s plans to acquire Windsurf, a startup that was making AI software for coding, for $3 billion fell apart. (My Fortune colleague Allie Garfinkle broke that bit of news.) Instead, Google announced that it was hiring Windsurf’s CEO Varun Mohan and cofounder Douglas Chen and a clutch of other Windsurf staffers, while also licensing Windsurf’s tech—a deal structured similarly to several other big tech-AI startup not-quite-acquihire acquihires, including Meta’s recent deal with Scale AI, Google’s deal with Character.ai last year, as well as Microsoft’s deal with Inflection and Amazon’s with Adept. Bloomberg reported that Google is paying about $2.4 billion for Windsurf’s talent and tech, while another AI startup, Cognition, swooped in to buy what was left of Windsurf for an undisclosed sum. Windsurf may have gotten less than OpenAI was offering, but OpenAI’s purchase reportedly fell apart after OpenAI and Microsoft couldn’t agree on whether Microsoft would have access to Windsurf’s tech.

The increasingly fraught relationship between OpenAI and Microsoft is worth a whole separate story. So too is the structure of these non-acquisition acquihires—which really do seem to blunt any legal challenges, either from regulators or the venture backers of the startups. But today, I want to talk about coding assistants. While a lot of people debate the return on investment from generative AI, the one thing seemingly everyone can agree on is that coding is the one clear killer use case for genAI. Right? I mean, that’s why Windsurf was such a hot property and why Anysphere, the startup behind the popular AI coding assistant Cursor, was recently valued at close to $10 billion. And GitHub Copilot is of course the star of Microsoft’s suite of AI tools, with a majority of customers saying they get value out of the product. Well, a trio of papers published this past week complicate this picture.

Experiment calls gains from AI coding assistants into question

METR, a nonprofit that benchmarks AI models, conducted a randomized controlled trial involving 16 developers earlier this year to see if using the code editor Cursor Pro, integrated with Anthropic’s Claude Sonnet 3.5 and 3.7 models, actually improved their productivity. METR surveyed the developers before the trial to see if they thought it would make them more efficient and by how much. On average, they estimated that using AI would allow them to complete the assigned coding tasks 24% faster. Then the researchers randomized 246 software coding tasks, either allowing them to be completed with AI or not. Afterwards, the developers were surveyed again on what impact they thought the use of Cursor had actually had on the average time to complete the tasks. They estimated that it made them on average 20% faster. (So maybe not quite as efficient as they had forecast, but still pretty good.) But, and now here’s the rub, METR found that when assisted by AI it actually took the coders 19% longer to finish tasks.

What’s going on here? Well, one issue was that the developers, who were all highly experienced, found that Cursor could not reliably generate code as good as theirs. In fact, they accepted fewer than 44% of the AI-generated code suggestions. And when they did accept them, three-quarters of the developers felt the need to still read over every line of AI-generated code to check it for accuracy, and more than half of the coders made major changes to the Cursor-written code to clean it up. This all took time—on average 9% of the developers’ time was spent reviewing and cleaning up AI-generated outputs. Many of the tasks in the METR experiment involved large code bases, sometimes consisting of over 100,000 lines of code, and the developers found that sometimes Cursor made strange changes in other parts of this code base that they had to catch and fix.

Is it just vibes all the way down?

But why did the developers think the AI was making them faster when in fact it was slowing them down? And why, when the researchers followed up with the developers after the experiment ended, did they discover that 69% of the coders were continuing to use Cursor?

Some of it seems to be that despite the time it took to edit the Cursor-generated code, the AI assistance did actually ease the cognitive burden for many of the coders. It was mentally easier to fix the AI-generated code than to have to puzzle out the right solution from scratch. So is the perceived ROI from “vibe coding” itself just vibes? Perhaps. That would actually square with what the Wall Street Journal noted about a different area of genAI use—lawyers using genAI copilots. The newspaper reported that a number of law firms found that given how long it took to fact-check AI-generated legal research, they were not sure lawyers were actually saving any time using the tools. But when they surveyed lawyers, especially junior lawyers, they all reported high satisfaction using the AI copilots and that they felt it made their jobs more enjoyable.

But a couple of other studies from last week suggest that maybe it all depends on exactly how you use AI coding assistance. A team from Harvard Business School and Microsoft looked at two years of observations of software developers using GitHub Copilot (which is a Microsoft product) and found that those using the tool spent more time on coding and less time on project management tasks, in part because GitHub Copilot allowed them to work independently instead of having to use large teams. It also allowed the coders to spend more time exploring possible solutions to coding problems and less time actually implementing the solutions. This too might explain why coders enjoy using these AI tools—because it allows them to spend more time on parts of the job they find intellectually interesting—even if it isn’t necessarily about overall time-savings.

Maybe the problem is coders just aren’t using enough AI?

Finally, let’s look at the third study, which is from researchers at Chinese AI startup Modelbest, Chinese universities BUPT and Tsinghua University, and the University of Sydney. They found that while individual AI software development tools often struggled to reliably complete complicated tasks, the results improved markedly when multiple large language models were prompted to each take on a specific role in the software development process and to pose clarifying questions to one another aimed at minimizing hallucinations. They called this architecture “ChatDev.”

So maybe there’s a case to be made that the problem with AI coding assistants is how we are using them, not anything wrong with the tech itself? Of course, building teams of AI agents to work in the way ChatDev suggests also uses up a lot more computing power, which gets expensive. So maybe we’re still facing that question: is the ROI here a mirage?

With that, here’s more AI news.

Jeremy Kahn
jeremy.kahn@fortune.com
@jeremyakahn

Before we get to the news, the U.S. paperback edition of my book, Mastering AI: A Survival Guide to Our Superpowered Future, is out from Simon & Schuster. Consider picking up a copy for your bookshelf.

Also, want to know more about how to use AI to transform your business? Interested in what AI will mean for the fate of companies, and countries? Then join me at the Ritz-Carlton, Millenia in Singapore on July 22 and 23 for Fortune Brainstorm AI Singapore. This year’s theme is The Age of Intelligence. We will be joined by leading executives from DBS Bank, Walmart, OpenAI, Arm, Qualcomm, Standard Chartered, Temasek, and our founding partner Accenture, plus many others, along with key government ministers from Singapore and the region, top academics, investors and analysts. We will dive deep into the latest on AI agents, examine the data center build-out in Asia, explore how to create AI systems that produce business value, and talk about how to ensure AI is deployed responsibly and safely. You can apply to attend here and, for loyal Eye on AI readers, I’m able to offer complimentary tickets to the event. Just use the discount code BAI100JeremyK when you check out.

Note: The essay above was written and edited by Fortune staff. The news items below were selected by the newsletter author, created using AI, and then edited and fact-checked.

This story was originally featured on Fortune.com
