Agentic Systems Infrastructure

The Demo-to-Production Gap

AI agent demos are impressive. Production AI agent systems are rare. Shreya Shekhar identifies the gap: “Today’s agentic systems are impressive prototypes, not production-grade systems.” As agents tackle incrementally harder problems, the cracks show: “incorrect results from broken tools, low visibility leads to inability to diagnose failures, low repeatability of success, excessive latency from ‘thinking’ loops which run up inference bills.”

The gap is structural. Traditional software provides guarantees around consistency, scalability, and fault tolerance. AI agents introduce three new challenges that violate those guarantees:

Challenge                 Description                                              Why it's hard
Stochasticity             Same input produces different outputs                    Cannot write deterministic tests
Action-intent gap         Agent takes action that doesn't match intended goal      Hard to specify intent precisely
Subjectivity of results   "Correct" output is contextual, not binary               No ground truth for evaluation
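Stochasticity is why exact-match assertions break down: the same prompt can yield different outputs on each run. A common workaround is statistical testing, where you run the agent many times and assert a pass-rate threshold instead of a single output. A minimal sketch, where `run_agent` is a hypothetical stand-in for a real LLM call:

```python
import random

def run_agent(prompt: str) -> str:
    """Hypothetical stochastic agent: stands in for a real LLM call."""
    return random.choice(["4", "4", "4", "four"])  # usually right, sometimes not

def pass_rate(prompt: str, check, n: int = 50) -> float:
    """Run the agent n times; return the fraction of outputs that pass the check."""
    return sum(check(run_agent(prompt)) for _ in range(n)) / n

# Instead of asserting one exact output, assert a success threshold.
rate = pass_rate("What is 2 + 2?", lambda out: out.strip() == "4")
assert rate >= 0.5, f"agent regressed: pass rate {rate:.0%}"
```

The threshold (here 0.5) and sample size are design choices: they trade test cost against sensitivity to real regressions.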

Shekhar argues that “the missing piece isn’t just better models or reasoning, but better systems primitives.”

Infrastructure Requirements

Levie maps the infrastructure that agents will need as they scale to production:

  • Identity systems: Agents need verifiable identities across platforms. Current auth systems assume human users.
  • Persistent storage: File systems and databases for agent work products, session state, and accumulated knowledge
  • Orchestration: When agents outnumber humans by orders of magnitude, coordination becomes the bottleneck — “when you have agents going out and doing work for you, the work just moved up a layer” to orchestration
  • Financial infrastructure: Agents that transact need safe, auditable money management
  • Sandboxed compute: Agents executing code need isolated environments
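To make the last requirement concrete, here is a minimal sketch of sandboxed execution for agent-generated code, using only a subprocess with a timeout, a scratch directory, and a stripped environment. This is an illustration, not a real security boundary; production sandboxes add OS-level isolation (containers, seccomp, gVisor, microVMs).

```python
import subprocess
import sys
import tempfile

def run_sandboxed(code: str, timeout: float = 5.0) -> subprocess.CompletedProcess:
    """Execute untrusted agent-generated Python in a separate interpreter process.

    -I runs Python in isolated mode (ignores user site-packages and
    PYTHON* environment variables); a throwaway working directory and an
    empty environment limit what the code can see, and the timeout kills
    runaway loops before they burn compute.
    """
    with tempfile.TemporaryDirectory() as scratch:
        return subprocess.run(
            [sys.executable, "-I", "-c", code],
            cwd=scratch,            # no access to the caller's files via cwd
            env={},                 # no inherited secrets or tokens
            capture_output=True,
            text=True,
            timeout=timeout,
        )

result = run_sandboxed("print(2 + 2)")
```

`subprocess.run` raises `TimeoutExpired` if the code exceeds the limit, which the orchestrating layer can catch and report as a tool failure rather than hanging the agent loop.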

Workflow Redesign

Levie argues that deploying agents into existing workflows captures only a fraction of their value: “To get the full benefit of AI agents you often need to change your underlying workflows.” Agents work differently from humans — they’re fast but fragile, broad but shallow, tireless but context-limited. Workflows designed around human strengths and limitations don’t map cleanly onto agents.

The paradigm Levie sees emerging is “like if you had an army of really smart interns — they can do a lot, but you need to give them very clear instructions, check their work, and structure the work into manageable pieces.”

The Eval Gap

Yohei captures the current state of agent infrastructure in a “random rant”: the tooling for building, monitoring, and evaluating agents is immature relative to the ambitions of agent builders. The eval gap — the inability to reliably measure whether agents are performing well — is perhaps the most acute infrastructure deficit.

Without evals, organizations cannot:

  • Measure the impact of prompt changes
  • Compare agent architectures
  • Detect regressions in agent behavior
  • Justify the cost of agent deployment

The eval gap leaves Context Engineering partly guesswork: the build-measure-learn cycle of agent development has a broken “measure” step.

Historical Parallel

The infrastructure gap mirrors early cloud computing. In 2006, AWS offered compute and storage but lacked monitoring, orchestration, security, and governance tools. The ecosystem took a decade to mature. Agent infrastructure is in a similar early stage — the core capabilities exist, but the surrounding ecosystem of observability, testing, governance, and ops tooling is nascent.