Agentic Systems Infrastructure
The Demo-to-Production Gap
AI agent demos are impressive. Production AI agent systems are rare. Shreya Shekhar identifies the gap: “Today’s agentic systems are impressive prototypes, not production-grade systems.” As agents tackle incrementally harder problems, the cracks show: “incorrect results from broken tools, low visibility leads to inability to diagnose failures, low repeatability of success, excessive latency from ‘thinking’ loops which run up inference bills.”
The gap is structural. Traditional software provides guarantees around consistency, scalability, and fault tolerance. AI agents introduce three new challenges that violate those guarantees:
| Challenge | Description | Why it’s hard |
|---|---|---|
| Stochasticity | Same input produces different outputs | Cannot write deterministic tests |
| Action-intent gap | Agent takes action that doesn’t match intended goal | Hard to specify intent precisely |
| Subjectivity of results | "Correct" output is contextual, not binary | No ground truth for evaluation |
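The stochasticity row is worth making concrete. Because the same input can produce different outputs, exact-match tests are useless; one common workaround (a sketch, not something prescribed in the source) is to sample the agent repeatedly and gate on a pass rate over a property check rather than on equality. Here `flaky_agent` is a hypothetical stand-in for a real LLM call:

```python
import random

def flaky_agent(prompt: str) -> str:
    # Stand-in for a real LLM call: nondeterministic surface form.
    templates = ["Paris", "The capital is Paris.", "Paris, France"]
    return random.choice(templates)

def eval_pass_rate(agent, prompt, check, n=20):
    """Run the agent n times and return the fraction of outputs
    that satisfy a property check, instead of asserting equality."""
    passes = sum(1 for _ in range(n) if check(agent(prompt)))
    return passes / n

rate = eval_pass_rate(flaky_agent, "What is the capital of France?",
                      check=lambda out: "Paris" in out)
assert rate >= 0.9  # gate on a threshold, not an exact string match
```

The check is a property ("mentions Paris"), which sidesteps the surface-form variance; the subjectivity challenge in the third row is exactly what makes writing good `check` functions hard.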
Shekhar argues that “the missing piece isn’t just better models or reasoning, but better systems primitives.”
Infrastructure Requirements
Levie maps the infrastructure that agents will need as they scale to production:
- Identity systems: Agents need verifiable identities across platforms; current auth systems assume human users
- Persistent storage: File systems and databases for agent work products, session state, and accumulated knowledge
- Orchestration: When agents outnumber humans by orders of magnitude, coordination becomes the bottleneck — “when you have agents going out and doing work for you, the work just moved up a layer” to orchestration
- Financial infrastructure: Agents that transact need safe, auditable money management
- Sandboxed compute: Agents executing code need isolated environments
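To make the last requirement concrete, here is a minimal sketch of isolating agent-generated code in a separate process with CPU-time and memory caps. This assumes a POSIX system and uses only the standard library; a production deployment would reach for containers or microVMs rather than `resource` limits:

```python
import resource
import subprocess
import sys

def run_sandboxed(code: str, timeout_s: int = 5) -> subprocess.CompletedProcess:
    """Execute untrusted, agent-generated Python in a child process
    with a CPU-time cap and a 256 MB address-space cap (POSIX only)."""
    def limit():
        # Applied in the child just before exec.
        resource.setrlimit(resource.RLIMIT_CPU, (timeout_s, timeout_s))
        resource.setrlimit(resource.RLIMIT_AS, (256 * 2**20, 256 * 2**20))

    return subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True,
        timeout=timeout_s,        # wall-clock backstop
        preexec_fn=limit,
    )

result = run_sandboxed("print(2 + 2)")
```

Even this toy version illustrates the shape of the primitive: the agent gets compute, but bounded compute, with stdout/stderr captured for the observability layer.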
Workflow Redesign
Levie argues that deploying agents into existing workflows captures only a fraction of their value: “To get the full benefit of AI agents you often need to change your underlying workflows.” Agents work differently from humans — they’re fast but fragile, broad but shallow, tireless but context-limited. Workflows designed around human strengths and limitations don’t map cleanly onto agents.
The paradigm Levie sees emerging is “like if you had an army of really smart interns — they can do a lot, but you need to give them very clear instructions, check their work, and structure the work into manageable pieces.”
The Eval Gap
Yohei captures the current state of agent infrastructure in a “random rant”: the tooling for building, monitoring, and evaluating agents is immature relative to the ambitions of agent builders. The eval gap — the inability to reliably measure whether agents are performing well — is perhaps the most acute infrastructure deficit.
Without evals, organizations cannot:
- Measure the impact of prompt changes
- Compare agent architectures
- Detect regressions in agent behavior
- Justify the cost of agent deployment
The eval gap leaves Context Engineering partly guesswork: the build-measure-learn cycle of agent development has a broken “measure” step.
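A minimal version of the missing "measure" step is a fixed eval suite run against both the current and the candidate agent, with a pass-rate comparison to catch regressions. The agents and cases below are hypothetical stand-ins, chosen only to show the harness shape:

```python
def run_eval_suite(agent, cases):
    """Score an agent against fixed (prompt, check) cases
    and return the overall pass rate."""
    passed = sum(1 for prompt, check in cases if check(agent(prompt)))
    return passed / len(cases)

cases = [
    ("2+2?", lambda out: "4" in out),
    ("Capital of France?", lambda out: "Paris" in out),
]

# Canned stand-ins for two agent versions under comparison.
def baseline_agent(prompt):
    return {"2+2?": "4", "Capital of France?": "Paris"}[prompt]

def candidate_agent(prompt):
    return {"2+2?": "4", "Capital of France?": "Lyon"}[prompt]  # regressed

baseline = run_eval_suite(baseline_agent, cases)
candidate = run_eval_suite(candidate_agent, cases)
regressed = candidate < baseline  # flag before shipping the change
```

The same harness answers the other three bullets too: swap the prompt to measure a prompt change, swap the agent to compare architectures, and track the pass rate over time to justify deployment cost.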
Historical Parallel
The infrastructure gap mirrors early cloud computing. In 2006, AWS offered compute and storage but lacked monitoring, orchestration, security, and governance tools. The ecosystem took a decade to mature. Agent infrastructure is in a similar early stage — the core capabilities exist, but the surrounding ecosystem of observability, testing, governance, and ops tooling is nascent.
Related
- Context Engineering — The discipline that depends on these infrastructure primitives
- Software Design for the Agent Era — How software must change to support agent infrastructure
- Enterprise AI Adoption Lag — Infrastructure gaps contribute to enterprise failure rates
- The Two Machine Ages — Each machine age required new infrastructure (factories, then data centers, now agent platforms)