The METR study is wild!
Its methodology is unlike any other. While previous studies went ‘wide’, the METR study went ‘deep’: it focused on just 16 developers (instead of hundreds or thousands), but analyzed the effect of AI on those 16 developers in a way ‘wide’ studies could not.
METR first identified 16 developers who were maintaining high-quality open source repositories. These repositories had an average of 22k stars on GitHub and ~1 million lines of code. 22k stars is probably 6-7 standard deviations above the GitHub average. To put it in football terms, these developers were maintaining the footballing equivalent of Real Madrid, while your average Enterprise app code base is Toa Payoh FC or Scunthorpe United.
These developers were then asked to provide a list of tasks to execute on these code bases. The tasks were purposely scoped to be under 2 hours long. Developers were asked to estimate how long each task would take, and to provide a self-reported proficiency level for the task. Tasks were then randomly assigned to be either ‘AI assisted’ or ‘Not AI assisted’. In the ‘AI assisted’ tasks developers could choose whether or not to use AI, while in the ‘Not AI assisted’ tasks, developers could not use any AI assistance.
Then, with the magic of statistical analysis, the study could determine the effect of AI on developer productivity by comparing the two buckets, while taking into account the developers’ own estimates for each task.
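To make the mechanics concrete, here’s a minimal sketch of how a randomized design like this can estimate an AI effect. This is not METR’s actual analysis code, and the numbers are entirely made up: the idea is simply to regress the log of actual completion time on an ‘AI allowed’ flag, using the developer’s own forecast as a difficulty control, and read the slowdown off the AI coefficient.

```python
# Minimal sketch of estimating an AI effect from randomized task assignments.
# Not METR's actual analysis; the task data below is invented for illustration.
import numpy as np

# Hypothetical per-task records: (forecast_hours, actual_hours, ai_allowed)
tasks = [
    (1.0, 1.2, 1), (0.5, 0.5, 0), (2.0, 2.3, 1), (1.5, 1.5, 0),
    (0.8, 1.0, 1), (1.2, 1.2, 0), (0.6, 0.7, 1), (1.0, 1.0, 0),
]

forecast = np.array([t[0] for t in tasks])
actual   = np.array([t[1] for t in tasks])
ai       = np.array([t[2] for t in tasks], dtype=float)

# Design matrix: intercept, AI flag, log(forecast) as a task-difficulty control.
X = np.column_stack([np.ones(len(tasks)), ai, np.log(forecast)])
y = np.log(actual)

# Ordinary least squares; beta[1] is the effect of AI on log completion time.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
ai_effect = np.exp(beta[1]) - 1  # multiplicative change in time when AI is allowed

print(f"Estimated change in task time with AI: {ai_effect:+.0%}")
# A positive number means AI-assigned tasks took longer; the METR study's
# headline figure was roughly +19%.
```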
The results were WILD!
The study not only determined that there was no increase in productivity; it actually found that AI tools made productivity worse. Tasks took 19% longer with AI than without.
This is crazy, because in the AI-assisted bucket developers could choose to use (or not use) AI. All of them chose to use AI, and all of them reported that AI made them more productive. But only one of them actually got a boost; the rest took a performance penalty from using these AI tools.
So how do we make sense of the study? Because developers everywhere are spending money on these tools … and they apparently have a negative impact. It’s like paying more money for a slower car.
Here are 5 explanations that can help us make sense of the results.
The Developers were not trained to use the tools properly
All but one of the developers had not been using these tools regularly before the study. And while the developers did receive some training (circa 1-2 hours), the rationale is that these tools take days, weeks, even months to master. There is a learning curve associated with AI-assisted tooling, and while you’re on that curve, you will experience a negative effect.
The contention is that these were above-average programmers, and 1-2 hours of training feels like plenty. Plus, the developers used the tools for a good number of hours during the tasks, and we should have seen a marked increase in productivity over the course of the study as the developers learnt more about how to use the tools.
The last piece here is that the developer who actually improved (by 20%) was the ONLY developer who had been using Cursor consistently before the study. So there are some good data points for this explanation.
Above average programmers working with highly familiar codebases is not a good use case
All the developers were maintaining highly popular code bases, which probably puts them in the top 1-2% of programmers worldwide. These were not run-of-the-mill developers. Plus, they were highly familiar with these code bases, because these were ‘THEIR’ code bases.
The rationale here is that these programmers, working on these code bases, required very little help from anyone (AI or human) to execute the tasks at hand. Hence adding a coding assistant actually slowed things down.
The analogy is that a car is much faster than walking. But if you’re going 50m to the nearby coffeeshop, walking might be faster than getting into your car and then trying to find a carpark at the other end. In some situations the AI assistants just aren’t very helpful.
The contention, of course, is that if the developers felt slower they would have stopped using AI in the optional scenarios. Why did they continue to feel that the AI was ‘speeding’ them up?
Also, in other scenarios, such as average developers working on crappy code bases, these assistants might deliver a significant benefit.
Coding Assistants are like Adderall
This explanation is the wildest!
Adderall is a drug used to treat Attention Deficit Hyperactivity Disorder (ADHD), and it works really well. But college students without ADHD sometimes buy Adderall to help them study, claiming it increases alertness and cognitive ability.
However, multiple studies conclude that Adderall doesn’t increase your ability in these areas, and some studies actually point to a reduction in cognitive ability. Turns out you can feel smarter and sharper, while actually being … kinda stupid!
This helps explain why developers continued to use the assistants even though they slowed them down. It also explains why most developers continue to buy these assistants, the same way college students buy Adderall. There are a lot of data points throughout the study that back this hypothesis, and it seems to be a tight fit.
Humans don’t really have a productivity sensor in our brains; we don’t even have a properly working clock. If you sit down for 3 hours to watch an intense movie like Lord of the Rings or Avatar, or (god forbid!) The Seven Samurai, that feels very different from watching 6 back-to-back episodes of Friends or Brooklyn 99. The same amount of time might have passed, but we’d have had very different experiences.
Asking someone how much time they spent on something is not a good measurement of anything. Similarly, if you’re just barking orders at an LLM without actually expending mental effort, you might underestimate the amount of time you’re spending, compared to when you’re in a state of flow and intensely focused on something. In both cases, mental effort warps our perception of time, and if time is how we measure productivity, that’s something to think about.
The contention, though, is that one developer really did get a boost (and not a small one), and while that might sound like a statistical anomaly, it’s just too big a number to be random.
Above average programmers have above average expectations
This is the one explanation that resonated with me the most.
Above average programmers have above average expectations; without those high quality standards they wouldn’t have been able to maintain these codebases for so many years and still roll out new features. Most Enterprise code bases calcify into stasis after 3-5 years, and then a ‘transformation’ is required.
In the real world, these above average programmers would receive code from actual people who have feelings, and might sometimes lower their standards on a pull request just to not be the bad guy. In this scenario the LLM has no feelings, so the high bar could be kept. These code bases were not ‘corporate’ code bases owned by employers; these were personal code bases, their magnum opuses, where high (even ridiculously unrealistic) quality standards are in place.
Effectively this means lots of mediocre code got refactored and rewritten, which in turn increased the time taken to produce it. The developers writing the code on their own wouldn’t need to do this, as the style of the code base is already imprinted in their memory.
In the real world, standards might not be so high (at least not for all code bases), and most of the code suggestions could be accepted. Productivity increases, perhaps at the expense of quality.
Conclusion
I think the final conclusion is that these reasons taken together account for the 19%. Which ones, and with how much weightage, is the actual question, and there won’t be a definitive answer. But there are some concrete takeaways.
First, it’s really hard to measure productivity. Previous attempts to measure productivity via PRs or code commits all show big improvements, but those numbers can easily (or accidentally) be fudged.
Second, don’t trust self-reports. Even though the results were definitive, all the developers reported that the AI assistants were actively increasing their productivity, even when they weren’t.
Third, there’s probably something above and beyond productivity. If the AI assistant can do the ‘stupid’ work of writing test cases, or the grunt work of writing repetitive code, maybe that’s enough to be useful, freeing developers to channel their real cognitive powers into something more valuable. Maybe that’s why it feels faster: you’re less bogged down in tedium.
Fourth, there’s no stopping the AI train, but there are hard lessons somewhere in here for all of us to learn. Just because you feel more productive doesn’t mean you are. And because it’s very hard to measure productivity, and even harder to measure code quality, all of this can be fudged to meet some requirement.