When Code Gets Cheap

21 min read

In February 2025, Andrej Karpathy posted that “the hottest new programming language is English.” A month later he gave the practice a name: vibe coding. Describe what you want in natural language, accept what comes back, iterate by feel. Don’t read the code, just run it and see what happens. The term spread instantly. Within weeks, non-technical founders were shipping working products over weekends. Jason Lemkin, who runs the SaaStr conference and had never written production software, built seven apps on Replit. Startup Twitter was euphoric. Writing code, as a human activity, appeared to be over.

Then the apps needed to scale. Lemkin, after his hundred-plus hours of building, documented ten mistakes he wished he’d avoided. Santiago Valdarrama, a software engineer, described meeting a non-technical person who had built a prototype and couldn’t distinguish between what they’d accomplished and what a professional developer would build. “Vibe coding has convinced many non-programmers they can build anything,” Valdarrama wrote. They could build demos. They could build prototypes. What they couldn’t build, reliably, was software that would survive contact with real users, real edge cases, and real security threats. The gap between a working demo and a production system is exactly the gap where software engineering lives.

The distance between what vibe coding promises and what it delivers is the story of every tool that has ever made code cheaper. Code is getting cheap. For certain categories of work it’s approaching free. But cheapness hasn’t eliminated the need for skilled practitioners. It has clarified, with new precision, what skilled practitioners were actually doing all along. And if you’re a lawyer, an analyst, a researcher, or anyone else whose work is mostly text and judgment, you’re watching a preview of your own future.

The canary

Software engineering is the first profession to go through this transformation at full speed. The reason is structural: the core artifacts of the job are text. Code, tickets, design documents, tests, logs, code reviews. All versioned, structured, machine-checkable. No other knowledge profession produces work this compatible with what language models do well. Developers are the largest single population of people actively collaborating with AI on professional tasks, and they’ve been doing it long enough to produce real data about what works and what breaks.

That makes software engineering the canary in every knowledge worker’s mine. The dynamics playing out in coding today, the productivity gains, the verification burdens, the skill shifts, the organizational bottlenecks, are not unique to software. They are the dynamics of any profession where AI can generate the core work product cheaply. Law firms drafting contracts, analysts building financial models, researchers writing literature reviews, clinicians producing clinical notes: each of these professions shares the same structural features that made software engineering vulnerable first. Their artifacts are text. Their workflows are structured. Their outputs can be partially checked against external criteria. What’s different is timing, not trajectory. Software is two to three years ahead. The rest of the cognitive economy is watching its future unfold in real time.

If you’re a partner at a law firm, a portfolio manager, a hospital administrator, or a research director, the developers in your organization have a few years’ head start on what’s coming for you.

The recurring lesson

In October 1968, fifty computer scientists gathered at a NATO conference center in Garmisch, Germany. Their subject was what they called a crisis. Software projects were chronically late, catastrophically over budget, and frequently unreliable. Hardware had been improving at geometric rates, but the software to harness it was falling behind. The conference coined the term “software engineering” and reached a conclusion that seems obvious now but wasn’t then: the hard part was systems complexity and coordination, not the act of writing instructions for the machine.

Two responses from that conference read like dispatches from 2026. Edsger Dijkstra argued for structured programming on the grounds that constraints and clarity were preconditions for correct software. His “Go To Statement Considered Harmful,” published in March 1968, was a polemic against the unstructured code that made programs impossible to reason about and a case for discipline over cleverness. Malcolm Douglas McIlroy argued for “mass-produced software components,” envisioning families of reusable routines that would form an industrial base for software construction. One pushed for better verification. The other pushed for better composition. Both were responding to the same underlying shift: writing code had become easy enough that the problems were no longer about the writing.

Frederick Brooks made the distinction permanent in his 1986 essay “No Silver Bullet.” He separated software difficulty into two kinds: essential complexity, which comes from specifying what the software should do in the context of messy organizational reality, and accidental complexity, which comes from fighting your tools. Syntax, boilerplate, manual wiring, build systems, deployment ceremonies. Every major advance in software engineering, from high-level languages to object-oriented programming to cloud infrastructure, reduced accidental complexity. None of them touched the essential kind. Deciding what to build, integrating it with existing systems, evolving it as requirements change, proving it’s secure: that work remained stubborn regardless of how much easier the typing became.

The 1980s and 1990s tested this distinction with CASE tools, and the episode is worth lingering on because the parallels to today’s AI coding wave are uncanny. Computer-aided software engineering promised orders-of-magnitude productivity gains. Generate code from visual diagrams. Let the machine handle tedium. Vendors sold the vision aggressively, and early adopters saw genuine benefits on well-scoped tasks. But the broader results were uneven. Firms that redesigned their development processes around the tools, rethinking roles, training people, changing how teams coordinated, saw real improvement. Firms that installed the tools and kept working the same way saw little. Academic researchers framed CASE adoption explicitly as organizational transformation rather than a purely technical upgrade. Practitioners noted a telling shift over the decade: marketing language moved from productivity promises to quality claims, a quieter and more defensible pitch. The initial excitement faded because the tools had automated a bottleneck that turned out to be smaller than everyone assumed. Generating code from diagrams was useful. Deciding what the diagrams should represent, coordinating across teams, and maintaining the generated code when requirements shifted: those problems persisted unchanged. It’s the electric motor wired into the shaft-and-belt layout, transplanted to software.

The Agile movement that followed offered a different kind of lesson entirely. When seventeen developers met at Snowbird, Utah, in 2001 and drafted the Agile Manifesto, they were not introducing a technology. They were advocating a reorganization: shorter cycles, tighter feedback loops, working software over comprehensive documentation, responding to change over following a rigid plan. What made Agile significant was that it redefined what “better” meant in software. Previous waves had focused on making developers produce code faster. Agile focused on making teams deliver the right software more reliably. The DevOps movement that grew from it reinforced the point: research tracking thousands of organizations over multiple years consistently found that performance gains came from culture and practice changes, not from tools. Continuous delivery, trunk-based development, small batch sizes, psychological safety, and fast feedback loops mattered more than any specific technology in the pipeline. Teams with mediocre tools and excellent processes outperformed teams with excellent tools and rigid processes. The performance was in the organization, not the instrument.

Each episode tells the same story. Someone makes code generation cheaper. The profession briefly believes the hard part is solved. Then the bottleneck surfaces somewhere else: in specification, in verification, in coordination, in the organizational capacity to absorb change. AI strips away accidental complexity faster than any previous tool. It can generate syntactically correct code, scaffold entire applications, write tests, and debug errors faster than any human. The essential complexity hasn’t moved. If anything, cheap generation has made it more visible, like draining a lake and discovering the rocks underneath.

The verification bottleneck

The evidence of rapid absorption is everywhere. Controlled experiments have shown large productivity gains: a Microsoft study of GitHub Copilot found roughly fifty-six percent faster completion on a defined programming task. More recent field experiments using randomized rollouts across thousands of developers at Microsoft, Accenture, and another large firm reported a meaningful average increase in completed tasks, about twenty-six percent in preferred pooled estimates, with larger gains and higher uptake among less experienced developers. The equalizer effect from the previous chapter is already visible in code: the tool compresses the performance distribution by giving novices access to patterns that experienced developers carry in their heads. For a junior developer, Copilot is a silent mentor whose suggestions encode the collective habits of millions of programmers.

At the same time, adoption is running ahead of trust. Stack Overflow’s 2025 developer survey reported high AI usage or planned usage across the profession, while favorable sentiment actually declined compared with prior years. Developers are using the tools and growing more skeptical simultaneously.

And the economic weight of the profession tilts toward exactly the work AI handles least reliably. Multiple empirical and survey-based sources estimate that maintenance dominates software lifecycle cost, often cited in the sixty to eighty percent range. Bug localization, dependency upgrades, database migrations, test repair, incident response: these tasks require the deepest context about why a system was built the way it was, which corners are load-bearing and which are vestigial. Making agents effective at this work is the economic prize. Making them reliable at it is the challenge that separates benchmarks from production.

The research frontier has accordingly shifted from measuring speed to measuring autonomy. SWE-bench, introduced in 2023 as a benchmark built from actual GitHub issues and pull request solutions, measured whether AI systems could resolve real software problems end to end. The tasks weren’t toy problems. They were drawn from popular open-source Python repositories: fixing a date parsing error in Django, resolving a regression in a plotting library, implementing a requested feature in a data analysis tool. Each came with a test suite the solution had to pass. Early results were sobering. Even strong models solved only a small fraction of tasks. Agent scaffolding lifted those results: SWE-agent reached about twelve and a half percent by equipping the model with tools to search files, run tests, and iterate through fixes. Subsequent systems pushed higher. But the benchmark ecosystem simultaneously produced a critical correction: some improvement was artificial. SWE-bench+ and other analyses found high scores sometimes relied on weak test suites, leaked solutions, or contaminated training data. After filtering, true success rates dropped meaningfully. OpenAI acknowledged the benchmark had become unreliable and recommended harder evaluations. When generation becomes easy, the hard part moves to proving the output is right.

That lesson is now the lived experience of working developers. Consider a typical day for a senior developer in 2026. She starts an agent on a feature request, and within minutes it produces a working implementation across four files: new API endpoints, database migrations, frontend components, and test coverage. The code compiles. The tests pass. The feature works when she demos it locally. She has saved perhaps a full day of typing and boilerplate. But now she needs to review four files of code she didn’t write, checking for edge cases the agent missed, security patterns it might have violated, performance characteristics it can’t reason about, and architectural choices that conflict with decisions the team made six months ago for reasons the agent doesn’t know. The review takes longer than it would have taken for code written by a colleague, because a colleague would have absorbed those constraints through osmosis. The agent started fresh.

A 2025 survey by SonarSource found that nearly all developers using AI coding tools spend significant effort reviewing, testing, and correcting the output. A sizable share reported that reviewing AI-generated code actually requires more effort than reviewing code written by humans, because AI produces larger volumes, fewer explanatory comments, and patterns drawn from a wider stylistic range. The cognitive load didn’t disappear. It migrated from “how do I write this?” to “is what the AI wrote correct, secure, and maintainable?” The previous chapter called this the measurement paradox. Here it operates at the scale of an entire profession.

Jeffrey Wang, an engineer at the AI search company Exa, documented how unevenly the gains distribute. Productivity multipliers of five to ten times for frontend and internal tooling. One and a half to two and a half times for full-stack product work. Barely anything for low-level systems programming. Negative returns for reliability and infrastructure work where mistakes compound. During one incident at Exa, an engineer spent the first five minutes vibe-coding a custom incident dashboard in Streamlit. “This is the kind of thing you’d never think to do before AI,” Wang wrote, “but is now the right thing to do.” Cheap code opens a category of disposable, ephemeral tooling that didn’t exist before. But the same cheapness applied to production security code or financial infrastructure creates risks that scale with the volume.

Alex Lieberman captured the divergence in a single observation: “A CEO vibe coding is trying to one-shot an internal version of Asana from scratch. An engineer vibe coding is an elaborate orchestration of sub-agents that are carefully navigated through studying, planning, executing, testing, hardening, and improving.” Same term, entirely different activity. The CEO is delegating judgment. The engineer is exercising it through a new interface.

Security sharpens the stakes. A controlled study titled “Asleep at the Keyboard?” evaluated the security of AI-generated code across scenarios designed to surface vulnerabilities and found a high rate of insecure outputs. Separate analyses of Copilot-attributed code in real repositories found substantial rates of security weaknesses in practice. The European Commission’s Cyber Resilience Act, with its staged timelines imposing vulnerability-reporting obligations and supply-chain accountability, creates regulatory pressure to close the gap. If you can’t trace what your agent produced and prove it’s secure, compliance frameworks will force the issue before production incidents do.

Recent experimental research on how AI delegation affects developer skill formation reports that heavy delegation impairs conceptual understanding, debugging fluency, and the ability to read code critically, unless the interactions are structured to preserve engagement. The “falling asleep at the wheel” dynamic from the previous chapter operates here at industrial scale. A developer who accepts every suggestion for months finds their capacity for independent reasoning has degraded. The tool works fine. They stopped practicing the skill it was supposed to augment.

Where the bottleneck moves

Engineering bottlenecks are moving toward product definition and testability. Teams that treat specifications, test plans, and architecture constraints as first-class artifacts can delegate wider scopes to agents with lower risk, because the agent has a measurable target and the organization has an automated way to reject bad work. As one developer put it: “It used to be how fast you could implement. Now it’s how precisely you can specify. Machines don’t carry tribal knowledge, so any ambiguity becomes a guess, and software guesses compound.” Enterprise conversations in early 2026 converge on the same point: code may now be free, but learning what to build next is still bottlenecked by customer feedback loops. The constraint has moved from typing speed to the rate at which organizations can generate the right specifications.

Code review is becoming the core high-skill surface. If agents produce more code, review quality and capacity determine throughput. Addy Osmani, a senior Chrome engineer, observed that developers with strong foundations in testing, documentation, and code review are having the most success with AI tools, because “these ‘boring’ foundations turn agents from chaos generators into productivity multipliers.” The pattern holds across enterprise teams: the best engineers love AI for coding because they can leverage existing expertise for vastly greater output. Most of the time spent building something goes into drudgery work. AI lets you skip to the parts that actually matter. Traditional engineering skills become accelerators rather than constraints. Review design, including guidelines, automated pre-checks, and ownership rules, has become strategically central in a way it never was when every line was human-written.

Maintenance is becoming scalable for the first time. Multiple sources estimate that maintenance consumes sixty to eighty percent of software lifecycle cost. Bug localization, dependency upgrades, database migrations, test repair, incident response: these are tasks historically throttled by risk and the need for deep context about why systems were built the way they were. Agents equipped with good context can absorb that institutional knowledge and operate across codebases that no single developer could hold in their head. Anton Osika documented how one mid-size Swedish technology company doubled its shipping velocity by using AI to build interactive end-to-end prototypes instead of writing product requirement documents, collapsing weeks of alignment meetings into a single day. The coordination cost of understanding what to build had been higher than the cost of building it.

Work is also expanding, not contracting. Cybersecurity is having its own “Jevons paradox moment”: better AI tooling for security increases demand for security talent, because autonomous vulnerability discovery automates the finding step but not the triage, remediation, and architectural responses that follow. The same logic applies across software engineering. AI generates more code. More code means more tests, more security surface, more dependencies, more documentation, more review. Enterprise teams adopting agents unanimously report working more, not less. AI isn’t creating leisure. It’s creating ambition. And the static view of engineering employment misses where that ambition leads. The tech industry accounts for only eight to fifteen percent of GDP. The anxiety about AI replacing engineers is a product of Silicon Valley’s self-referential perspective, where everyone’s mental model of “an engineer” is someone at Google or a Series B startup. But eighty-five percent of the economy has never had real access to engineering talent. When AI coding tools make software development accessible to non-tech industries, that changes. The CS graduate of 2030 may go to John Deere, Eli Lilly, or Caterpillar instead of Google. They won’t be building apps with buttons. They’ll be automating pharmaceutical research or building AI systems for precision agriculture. Total demand for engineering talent doesn’t shrink. It diffuses across the entire economy, into industries that have been starved of it for decades.

The venture capital firm Andreessen Horowitz observed: “We thought agents would map to existing workflows. The reverse is happening.” Projects once estimated in months now scope to days or weeks. Teams tackle ideas they would never have prioritized when every feature cost weeks of developer time. The pattern matches Chapter 1’s prediction: when a general-purpose technology makes production cheaper, it expands the frontier of what gets built.

None of these shifts will take longer to address than apprenticeship. If early-career coding tasks are increasingly automated, organizations that fail to create structured learning paths risk hollowing out their senior talent pipeline. An analysis from the Stanford Digital Economy Lab, using high-frequency payroll data, found that early-career workers in AI-exposed occupations experienced meaningful relative employment declines, while more experienced workers remained comparatively stable. Junior developers have traditionally learned by doing the routine work that agents now handle: writing CRUD endpoints, fixing straightforward bugs, building features from well-defined specifications. These weren’t just grunt work. They were the apprenticeship layer through which developers built mental models of how systems fit together. Each previous abstraction shift, from assembly to high-level languages, from servers to cloud platforms, changed what beginners practiced first, and the profession adapted by developing new on-ramps. The agent era compresses the adaptation timeline more aggressively than any prior shift. Without intentional program design, “learning by doing” becomes “shipping without understanding,” and the profession’s future capacity erodes beneath a surface of high near-term output.

Making any of this work reliably requires infrastructure that has no precedent in the pre-agent era.

Context engineering comes first: the discipline of structuring what information an agent receives so it has what it needs and nothing more. The longer a session runs, the more the AI quietly forgets. Context windows fill up, and critical decisions get compressed out. The prescription is pragmatic: maintain documentation that captures business context, architectural decisions, constraints, and known failure modes. When the agent drifts, don’t re-explain. Point it at the files. Then there is what practitioners call “legacy archeology,” the painstaking work of reverse-engineering implicit institutional knowledge into explicit, machine-readable specifications. Most production software is brownfield, held together by developers who know which parts you never touch on a Friday. Getting an agent to work effectively requires making the unwritten rules written.

Then there’s connectivity. The Model Context Protocol, introduced by Anthropic in November 2024 as an open standard, addresses the problem of wiring every model to every tool individually. GitHub launched a public MCP server, signaling that agent toolchains are moving from bespoke connectors toward standard interfaces. The analogy to the standardized electrical outlet is exact: individually mundane, collectively transformative.

Evaluation and guardrails are the piece most organizations still lack. Behavioral scenario engineering stores test specifications separately from the codebase so the AI can’t teach itself to the test. Digital twin environments, simulated versions of production services, let agents run integration tests safely. Economic compute management makes the cost of continuous agent execution visible to budget holders. Running a dark factory is expensive. Agents running continuously, parallel builds, full test suites firing on every change: this adds up fast. Give agents more autonomy only when you can prove their work is correct and diagnose it when it isn’t.

Beyond software

Everything happening in software engineering has direct parallels in professions that don’t think of themselves as being in the same situation. Software is ahead because its artifacts are machine-readable by default. But any field whose core work product is structured text is on the same trajectory. The timing differs. The dynamics don’t.

Consider law. A large firm that once assigned a team of associates to review forty thousand documents in a discovery dispute can now run the initial pass through an AI system in a fraction of the time. Draft contracts arrive formatted, sourced from real precedent, with clauses that look right. “Looks right” is the problem. The draft may have missed a jurisdiction-specific requirement, or included a liability cap that contradicts the client’s negotiating position, or cited a case that was subsequently overruled. Catching those errors requires exactly the legal training the tool was supposed to supplement. Junior associates have traditionally built that training by doing the document review and contract drafting that AI is now handling. Take away the apprenticeship layer and the pipeline of future partners thins.

In finance, the danger is less visible. An AI-generated valuation model looks professional. The numbers are internally consistent. The formatting is immaculate. It’s right often enough to erode the habit of checking the assumptions underneath. When it’s wrong, the error is in the framing: the model used the right method on the wrong comparable, or weighted a risk factor that was appropriate last quarter but not this one. The “falling asleep at the wheel” dynamic from the previous chapter may be most dangerous in a field where a single unchecked assumption can move millions of dollars.

Healthcare is further along than most people realize. The documentation burden in medicine is enormous, and automating clinical notes, literature search, and administrative overhead is a genuine relief. But the previous chapter described a study where radiologists given AI diagnostic predictions overrode correct suggestions with inferior ones and deferred to uncertain suggestions when their own judgment was better. The tool didn’t fail. The collaboration did. Nobody had trained the radiologists to integrate AI output with their own clinical reasoning. Medicine’s hardest work, reading a patient’s anxiety as a diagnostic clue, deciding when to deviate from a guideline because the patient in front of you doesn’t fit the population it was derived from, resists automation precisely because it depends on judgment that develops only through practice.

Scientific research is earlier in the curve. AI can draft literature reviews, generate hypotheses, and produce statistical analyses. The benchmark inflation problem from SWE-bench has a direct analogue: if AI-generated research passes superficial peer review but contains subtle statistical errors or fabricated citations, the verification infrastructure of science itself needs to adapt. Graduate students, like junior developers and junior associates, have traditionally built scientific judgment by doing the literature review and data analysis that AI now handles faster.

Verification risk is one reason these professions expand rather than contract. There’s a second, complementary mechanism: automation at one step doesn’t eliminate the steps downstream. It exposes them as the new constraint. A healthcare company that automates patient referrals doesn’t reduce the need for doctors. It surfaces the eighteen-month appointment backlog that administrative friction had been quietly hiding. A law firm that generates contracts with AI doesn’t need fewer attorneys. It needs more, because easy document generation feeds demand into the human-only steps that follow: filing, negotiation, court appearances, the judgment calls that require a license. The bottleneck shifts, and the shifted bottleneck is almost always labor-intensive work that requires human expertise. This is a different mechanism from verification risk. Verification is about catching errors in AI output. The bottleneck shift is about demand expansion, where removing friction at one stage reveals unmet demand at the next. Together they explain why cheap production tends to grow a profession’s total workload rather than shrink it.

The professions aren’t shrinking. They’re reorganizing around a different bottleneck, and the reorganization follows the same sequence everywhere. Production gets cheap. Verification becomes the constraint. The people who benefit most from cheap production are the least equipped to catch its failures. And the essential work of each profession, the judgment and context-sensitivity and tolerance for ambiguity, turns out to have been harder and more valuable than anyone assumed when production labor obscured it. Software got there first. The restructuring is far enough along to show everyone else the shape of what’s coming.

Brooks’s distinction between essential and accidental complexity was written about software in 1986. It describes every cognitive profession in 2026.

The recomposition

The trajectory points toward professions reorganized into systems where humans own the goal, the constraints, and the verification, and machines own much of the production. Software engineering has been through enough of these cycles to know the rhythm. Compilers did not eliminate programmers; they widened the space of feasible programs. Open-source components did not eliminate development; they shifted value to integration and architecture. DevOps did not eliminate operations; it redistributed responsibility around flow and reliability. Each time, the profession announced its own obsolescence and instead found a larger version of itself on the other side. The other professions in the pattern’s path, law, finance, medicine, research, don’t have this track record to draw on. They’re encountering cheap production for the first time. Software’s history is the closest thing to a field guide they have.

General-purpose technology research predicts J-curve dynamics: productivity gains lag until organizations invest in the intangible complements, process redesign, measurement systems, skills, and new organizational forms. Model quality matters, but the durable advantage comes from the socio-technical system that converts model capability into reliable throughput. The factory owners who captured the most value from electric motors were not the ones with the best motors. They were the ones who redesigned their floors. The same is true for software organizations, and it will be true for law firms, financial institutions, research labs, and hospitals. The organizations that capture the most value from AI won’t have the best models. They’ll have the best surrounding systems: tight specifications, evaluation that catches failures before users do, review that scales with volume, and training that develops the judgment machines can’t supply.

The best practitioners in every field will work this way. They won’t abandon their expertise. They’ll apply it through a different interface: steering, evaluating, deciding when to trust and when to override. They’ll delegate what is genuinely accidental complexity, the formatting, the assembly, the lookup, the boilerplate, and retain what is genuinely essential: deciding what should exist, whether it’s right, and whether it’s good.

Brooks was right in 1986. There is no silver bullet for essential complexity. Each tooling wave reduces accidental work and exposes what remains. AI is the biggest reduction yet. But the work that persists tells you what the profession actually is. When writing code was expensive, typing obscured everything else. The same was true across the cognitive professions: lawyers buried in document assembly, analysts wrestling spreadsheets cell by cell, researchers copying citations by hand. Production labor hid the essential work. As it falls away, what’s left turns out to be harder and more valuable than anyone who equated the profession with its production step had assumed.

Software engineers are living this transition first and fastest. What they’re learning about verification bottlenecks, productive delegation, dangerous offloading, organizational infrastructure, and apprenticeship design is a field guide for everyone who follows. The next chapter examines the discipline that sits behind all of it: the practice of structuring information, tools, and evaluation so that agents can work reliably at scale. Software engineers call it context engineering. It won’t keep that name for long.