The Doing Got Cheap
Anthropic’s new report on AI building AI is mostly a story about which human jobs survive contact with it.
I have not written the first draft of a deal memo in nearly two years. A data room and a call transcript go in, a structured memo comes back with a score from a model trained on thousands of past deals, and then I sit with it. The writing took minutes. The sitting takes the rest of the day, because the sitting is where I decide whether the score means anything. Hold onto that ratio. Anthropic’s new piece on recursive self-improvement is, underneath the charts, a long argument about that ratio and what happens when the cheap half keeps getting cheaper.
First, the phrase. Recursive self-improvement is the point at which an AI system can design and build a better version of itself with no human in the loop. The model writes the next model’s code, runs the experiments, reads the results, and decides what to try next, on its own. Anthropic is careful to say we are not there, and that it is not guaranteed to arrive. What they are claiming is narrower and harder to wave away. They are saying the loop is closing one stage at a time, and they have internal numbers showing how far it has already closed.
Which numbers to believe
The headline most people will repeat is that Anthropic engineers now ship roughly eight times the code per person they did a couple of years ago. Ignore it. Anthropic itself flags that lines of code measures volume, not value, and that the real productivity gain is almost certainly smaller. A model that writes verbose code looks more productive on this chart while making the codebase worse. So does a model cleaning up four years of deferred junk in an afternoon. Treat the 8x as a mood, not a measurement. High confidence it overstates the truth.
The numbers worth your attention are the ones that are hard to game. The first is task length. A research group called METR measures the longest task a model can finish on its own with reasonable reliability, and that length has been roughly doubling every four months, faster than it used to. In March 2024 the best Claude could handle a software task that takes a person about four minutes. A year later, ninety-minute tasks. A year after that, twelve-hour tasks, with the unreleased internal model running past sixteen hours and bumping against the ceiling of what METR can even measure. Lines of code can lie. The clock cannot. If that curve holds, work that takes a skilled person a full week comes into range during 2027. That is the trend to watch, and I would put it at high confidence in the near term and moderate confidence past a year, because every exponential eventually meets a wall and we cannot see this one yet.
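The extrapolation above is simple enough to check by hand. Here is a minimal sketch of the arithmetic, assuming the doubling time stays fixed, which is exactly the assumption the wall would break. The function name and the two readings of "a full week" (40 working hours or 168 clock hours) are mine, not METR's or Anthropic's.

```python
import math


def months_until(current_hours: float, target_hours: float, doubling_months: float) -> float:
    """Months for the task horizon to grow from current_hours to target_hours,
    assuming it keeps doubling every doubling_months (a fixed-rate assumption)."""
    if current_hours <= 0 or target_hours <= 0 or doubling_months <= 0:
        raise ValueError("all arguments must be positive")
    if target_hours <= current_hours:
        return 0.0
    # Number of doublings needed, times the length of each doubling.
    return doubling_months * math.log2(target_hours / current_hours)


# Figures from the text: a twelve-hour horizon, doubling every four months.
work_week = months_until(12, 40, 4)        # "a week" read as 40 working hours (assumption)
calendar_week = months_until(12, 168, 4)   # "a week" read as 168 clock hours (assumption)
print(f"{work_week:.1f} months, {calendar_week:.1f} months")
```

The answer moves a good deal depending on what you count as a week and which starting point you pick, which is one more reason to hold the date loosely and the direction firmly.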
The second is the success rate on genuinely open-ended problems, meaning problems with no clear spec where even the engineer does not know what the answer looks like. Anthropic reports that rate hit 76 percent in May 2026, up fifty points in six months. The example they give is a live incident: tens of thousands of training jobs were crashing, and a model given little more than cluster access isolated one obscure debugging flag and shipped a fix in two hours, where a person would have spent days. That is not autocomplete. That is debugging under uncertainty, which is most of what senior engineering actually is.
The rest of the evidence rhymes. A standard software-engineering benchmark went from near zero to solved in two years. A benchmark for reproducing published research went from one in five to solved in fifteen months. More than four out of five lines merged into Anthropic’s own codebase are now written by Claude, up from low single digits before their coding agent launched in early 2025. Read those as confirmation, not as separate miracles. They all point the same way.
The job that is left
Here is the part that should hold the attention of anyone whose living comes from judgment rather than output, which includes my entire industry.
Anthropic is candid that the human role has narrowed to a single function at each stage. Humans no longer write the code, they review it. Humans no longer run the experiment, they choose which experiment is worth running. The doing has collapsed toward zero cost in human time. What remains is taste. Knowing which problem matters, which result to trust, when an approach is a dead end. They call it research taste and they name it as the last real moat between a capable assistant and a system that could replace its makers.
And then, in the same piece, they put that moat on the clock. They note that taste might be just another capability that models fail at for a while and then suddenly get good at, the way they eventually learned to explain why a joke lands or pass a theory-of-mind test. They have an early measurement to back the worry. On a set of real research sessions, they asked models to pick the next step. The best model available in November 2025 beat the human choice 51 percent of the time, and by April 2026 that figure had risen to 64 percent. You should discount this one, and the article tells you how. They deliberately chose moments where the human’s move had room to improve, so it is not a fair fight. On a control set where the human’s move was already strong, the models won only about a fifth of the time. So taste is not falling yet. But the people with the most information in the world just told you it is the next thing they expect to fall.
Sit with what that means if you price judgment for a living. The pitch of every active venture investor, every stock picker, every senior analyst, is some version of “I see what others miss.” That is a taste claim. If taste compresses the way coding did, the premium you charge for it compresses with it. I am not predicting that this year. Moderate confidence it starts to bite within three to five years for narrow, well-defined judgment, and low confidence on the open-ended, cross-domain judgment that separates a good seed investor from a lucky one. But the direction is not in dispute, and the people running the experiment are not the ones with an incentive to hype this particular finding.
Why I am not panicking, and why that is not reassuring
I have spent five years building a firm around the bet that the doing is cheap and the picking is overrated. The strategy is breadth. Cover selectively, but broadly, rather than agonize over three names, because at the earliest stage the variance in outcomes swamps anyone’s ability to forecast them. This article reads like a long external validation of that view. If implementation is nearly free and even expert judgment is unreliable enough to be beaten on a coin flip in cherry-picked cases, then conviction is worth less than coverage and speed.
That is the comfortable reading, so I should attack it. The uncomfortable version is that cheap doing and compressing taste do not protect a volume strategy, they commoditize it. If access and speed are my edge, and access and speed are exactly what an army of agents grants every other allocator too, then breadth stops being a moat the moment everyone can run it. The thing that survives is not a strategy at all. It is the stuff that did not get cheaper. The relationships that decide whose money a founder chooses to take. A brand that earns the first call. The willingness to act when the model says wait. None of that lives on the benchmark.
The law nobody quotes
The single most useful idea in the piece is buried near the end, and it is borrowed from computing. Amdahl’s law says that speeding up one part of a process only helps until the parts you did not speed up become the limit. Anthropic has already hit it inside their own walls. They taught the machines to write code faster than humans can review it, so review became the new bottleneck. The constraint did not vanish. It moved.
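For anyone who wants the law itself and not just the paraphrase, here is a minimal sketch of the standard formula: if a fraction p of the work is sped up by a factor s, the whole process speeds up by 1 / ((1 - p) + p / s). The 80/20 split between writing and review below is a made-up illustration, not a figure from the report.

```python
def amdahl_speedup(accelerated_fraction: float, factor: float) -> float:
    """Overall speedup when accelerated_fraction of the work runs factor times faster
    and the rest runs at its old speed (Amdahl's law)."""
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if factor <= 0:
        raise ValueError("factor must be positive")
    untouched = 1.0 - accelerated_fraction
    return 1.0 / (untouched + accelerated_fraction / factor)


# Hypothetical split: 80 percent of the effort is writing code, 20 percent is review.
for factor in (2, 8, 100, 10_000):
    print(f"writing {factor}x faster -> whole process {amdahl_speedup(0.8, factor):.2f}x faster")
```

However large the factor gets, the total never reaches 5x in this example, because the untouched 20 percent sets the ceiling at 1 / 0.2. That ceiling is the review queue.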
That migration is the whole investment thesis hiding in this report, and they nearly walk past it. When engineering goes to roughly free, value pools wherever the bottleneck lands next, and the bottleneck lands on everything that refused to get faster. Their own cyber program is the clean example. The internal Mythos model found more than ten thousand serious software vulnerabilities in its first weeks, and the binding constraint in defense flipped overnight from finding holes to patching them fast enough. Finding got automated. Patching, which touches deployment and humans and downtime, did not. If you want to know where the next decade of company-building money goes, stop looking at what AI makes cheap and start looking at what stays expensive next to it. Distribution. Trust. Regulatory throughput. Anything physical. Anything that requires a human to say yes.
The part where they ask for help they know they will not get
The closing section proposes that frontier labs keep the option to slow down or pause, and that Anthropic would do so if rivals verifiably did the same. Read the conditions and you have read the obituary of the idea. A training run is easier to hide than a missile silo, the inputs are general-purpose, and whoever keeps going while others stop inherits the lead. They cite the decades it took to stand up nuclear arms verification and then admit, in the next breath, that we do not have decades. The honest translation of the proposal is that a verifiable pause is close to impossible, a unilateral one only changes who finishes first, and so the default is that nobody pauses. I do not read that as cynicism on their part. I read it as the most useful sentence in the essay, said quietly so it does not spoil the mood.
So we are left with the ratio I started with. The afternoon I spend deciding whether the memo is right is the job now, and the report I just read is a careful, well-sourced argument that even that afternoon is on a timer. I do not think it runs out this year. I think anyone betting their career on judgment should assume the premium they charge for it is going to thin, and should be building the relationships and the brand and the nerve that no benchmark can saturate while the charge for cleverness still holds.
The doing got cheap. Be very clear-eyed about what you are selling now that it has.

