Agentic Systems Infrastructure

The demo-to-production gap

AI agent demos are impressive. Production AI agent systems are rare. Shreya Shekhar identifies the gap: “Today’s agentic systems are impressive prototypes, not production-grade systems.” As agents tackle incrementally harder problems, the cracks show: “incorrect results from broken tools, low visibility leads to inability to diagnose failures, low repeatability of success, excessive latency from ‘thinking’ loops which run up inference bills.”

The gap is structural. Traditional software provides guarantees around consistency, scalability, and fault tolerance. AI agents introduce three new challenges that undermine those guarantees:

Challenge               | Description                                        | Why it’s hard
Stochasticity           | Same input produces different outputs              | Cannot write deterministic tests
Action-intent gap       | Agent takes action that doesn’t match intended goal | Hard to specify intent precisely
Subjectivity of results | “Correct” output is contextual, not binary         | No ground truth for evaluation

Shekhar argues that “the missing piece isn’t just better models or reasoning, but better systems primitives.”
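One way to make the stochasticity row concrete: when a single run cannot be asserted on, a test can assert on the pass rate across repeated runs instead. The sketch below is illustrative only. The names pass_rate and make_flaky_agent are hypothetical, and the "agent" is a seeded random stand-in, not a real model call.

```python
import random
from typing import Callable


def pass_rate(agent: Callable[[str], str], prompt: str,
              check: Callable[[str], bool], trials: int = 20) -> float:
    """Run the agent `trials` times on one prompt; return the fraction of outputs passing `check`."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    passes = sum(1 for _ in range(trials) if check(agent(prompt)))
    return passes / trials


def make_flaky_agent(seed: int, failure_rate: float) -> Callable[[str], str]:
    """Hypothetical stand-in for a real agent: answers "4" except on a random fraction of calls."""
    rng = random.Random(seed)

    def agent(prompt: str) -> str:
        # rng.random() is in [0.0, 1.0), so failure_rate=0.0 never fails and 1.0 always fails.
        return "unknown" if rng.random() < failure_rate else "4"

    return agent


if __name__ == "__main__":
    agent = make_flaky_agent(seed=0, failure_rate=0.1)
    rate = pass_rate(agent, "What is 2 + 2?", lambda out: out == "4", trials=200)
    # A statistical test asserts a threshold, not an exact output.
    print(f"pass rate over 200 trials: {rate:.2f}")
```

The threshold and trial count are judgment calls: more trials tighten the estimate but add to the inference cost.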

Infrastructure requirements

Aaron Levie maps the infrastructure that agents will need as they scale to production:

  • Identity systems: Agents need verifiable identities across platforms. Current auth systems assume human users.
  • Persistent storage: Agents need file systems and databases for work products, session state, and accumulated knowledge.
  • Orchestration: When agents outnumber humans by orders of magnitude, coordination becomes the bottleneck. As Levie puts it, “when you have agents going out and doing work for you, the work just moved up a layer” to orchestration.
  • Financial infrastructure: Agents that transact need safe, auditable money management.
  • Sandboxed compute: Agents executing code need isolated environments.
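To illustrate the sandboxed-compute item, here is a minimal sketch of running agent-generated Python in a separate process with a timeout and a throwaway working directory. The names run_untrusted and RunResult are hypothetical. This is an assumption-laden illustration, not a security boundary: a subprocess still shares the host's network and file system, so real isolation would need containers, VMs, or similar.

```python
import subprocess
import sys
import tempfile
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunResult:
    stdout: str
    stderr: str
    returncode: Optional[int]  # None when the run was killed for exceeding the timeout
    timed_out: bool


def run_untrusted(code: str, timeout_s: float = 5.0) -> RunResult:
    """Run a Python snippet in a child process, in a temporary directory, with a time limit."""
    with tempfile.TemporaryDirectory() as workdir:
        try:
            proc = subprocess.run(
                # -I runs the interpreter in isolated mode (ignores user site-packages and PYTHON* env vars).
                [sys.executable, "-I", "-c", code],
                cwd=workdir,
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            # subprocess.run kills the child before raising, so nothing is left running.
            return RunResult(stdout="", stderr="", returncode=None, timed_out=True)
        return RunResult(proc.stdout, proc.stderr, proc.returncode, timed_out=False)


if __name__ == "__main__":
    print(run_untrusted("print(2 + 2)"))
    print(run_untrusted("while True: pass", timeout_s=0.5))
```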

Workflow redesign

Levie argues that deploying agents into existing workflows captures only a fraction of their value: “To get the full benefit of AI agents you often need to change your underlying workflows.” Agents work differently from humans — they’re fast but fragile, broad but shallow, tireless but context-limited. Workflows designed around human strengths and limitations don’t map cleanly onto agents.

The paradigm Levie sees emerging is “like if you had an army of really smart interns — they can do a lot, but you need to give them very clear instructions, check their work, and structure the work into manageable pieces.”

The eval gap

Yohei captures the current state of agent infrastructure in a “random rant”: the tooling for building, monitoring, and evaluating agents is immature relative to the ambitions of agent builders. The eval gap — the inability to reliably measure whether agents are performing well — is perhaps the most acute infrastructure deficit.

Without evals, organizations cannot:

  • Measure the impact of prompt changes
  • Compare agent architectures
  • Detect regressions in agent behavior
  • Justify the cost of agent deployment

The eval gap leaves Context Engineering partly guesswork: the build-measure-learn cycle of agent development has a broken “measure” step.
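As a rough illustration of what a working “measure” step could look like, the sketch below runs a fixed set of cases against two agent versions and reports which cases regressed. EvalCase, run_suite, and regressions are hypothetical names, and the two "agents" are trivial stand-in functions. Real evals would also have to handle subjective outputs, which simple boolean checks do not capture.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    name: str
    prompt: str
    check: Callable[[str], bool]  # returns True when the output is acceptable


def run_suite(agent: Callable[[str], str], cases: list[EvalCase]) -> dict[str, bool]:
    """Run every case once and record whether its check passed."""
    return {case.name: case.check(agent(case.prompt)) for case in cases}


def regressions(baseline: dict[str, bool], candidate: dict[str, bool]) -> list[str]:
    """Names of cases that passed on the baseline but fail (or are missing) on the candidate."""
    return sorted(name for name, ok in baseline.items()
                  if ok and not candidate.get(name, False))


# Stand-in agents: v1 upper-cases its input, v2 (a "prompt change") returns it unchanged.
def agent_v1(prompt: str) -> str:
    return prompt.upper()


def agent_v2(prompt: str) -> str:
    return prompt


CASES = [
    EvalCase("shout", "hi", lambda out: out == "HI"),
    EvalCase("nonempty", "x", lambda out: bool(out)),
]

if __name__ == "__main__":
    before = run_suite(agent_v1, CASES)
    after = run_suite(agent_v2, CASES)
    print("regressed cases:", regressions(before, after))
```

Because agent outputs are stochastic, a single pass per case is a simplification; repeating each case and comparing pass rates would be more robust.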

Agentic security

AI agents create a new category of cybersecurity risk — and a corresponding infrastructure need. Levie identifies two compounding attack surfaces:

  1. Code-generation risk. AI generates “way more code than anybody’s ability to review that code.” Every new feature shipped by an agent is a potential vulnerability — “the AI could have written in, oh, we want to actually open up that port in the system because we need to do something, and maybe that was the wrong decision for the agent to go and do.”

  2. Offensive AI. Adversaries using AI (especially open models) can “find more vulnerabilities because they can scan across the internet far faster than before.”

Against these two new risk vectors, organizations have one countermeasure: agents that review code and detect vulnerabilities. Levie frames this as a self-referential dynamic: “Agents are the solution to the problem that agents have caused.”
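A reviewing agent's job can be approximated, very crudely, by a scanner that flags risky lines added in a diff, such as the open-port example above. The sketch below is an assumption-heavy illustration: RISK_PATTERNS and flag_added_lines are hypothetical names, and three regular expressions are nowhere near real vulnerability detection.

```python
import re

# Illustrative patterns only; a real reviewer would need far broader coverage.
RISK_PATTERNS = {
    "exposes a container port": re.compile(r"\bEXPOSE\s+\d+"),
    "binds to all network interfaces": re.compile(r"\b0\.0\.0\.0\b"),
    "disables TLS verification": re.compile(r"verify\s*=\s*False"),
}


def flag_added_lines(diff: str) -> list[tuple[int, str, str]]:
    """Return (diff line number, risk label, added text) for each risky line a unified diff adds."""
    findings = []
    for line_no, line in enumerate(diff.splitlines(), start=1):
        # Only inspect added lines; skip the "+++ b/file" header.
        if not line.startswith("+") or line.startswith("+++"):
            continue
        added = line[1:]
        for label, pattern in RISK_PATTERNS.items():
            if pattern.search(added):
                findings.append((line_no, label, added.strip()))
    return findings


SAMPLE_DIFF = "\n".join([
    "--- a/Dockerfile",
    "+++ b/Dockerfile",
    "@@ -1,2 +1,3 @@",
    " FROM python:3.12",
    "+EXPOSE 8080",
    '-CMD ["python", "app.py"]',
    '+CMD ["python", "app.py", "--host", "0.0.0.0"]',
])

if __name__ == "__main__":
    for line_no, label, text in flag_added_lines(SAMPLE_DIFF):
        print(f"diff line {line_no}: {label}: {text}")
```

A flag here is a prompt for review, not a verdict: opening a port may be exactly what the change needed.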

This creates demand for a new infrastructure category: agent observability and evaluation for security. Companies like Braintrust are building tools that monitor whether agents are producing correct outputs, a need that initially seemed limited to Silicon Valley agent builders but in practice extends to “everybody on the entire planet, if you’re putting agents into an enterprise workflow.” A lender, for example, needs to know if its agent “just stopped producing loan origination documents the right way.” An agent eval platform that works across multiple frontier labs becomes essential infrastructure, analogous to how monitoring tools (Datadog, New Relic) became essential infrastructure for cloud applications.
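The “just stopped producing documents the right way” scenario is a production monitoring problem: output checks run continuously, and an alert fires when the recent pass rate drops. The sketch below shows one simple way to do that with a rolling window. RollingPassRate and its parameters are hypothetical and do not reflect any vendor's API.

```python
from collections import deque


class RollingPassRate:
    """Track the pass rate of the most recent output checks and signal when it falls too low."""

    def __init__(self, window: int = 50, threshold: float = 0.9, min_samples: int = 10):
        self.results = deque(maxlen=window)  # oldest results drop off automatically
        self.threshold = threshold
        self.min_samples = min_samples

    def record(self, passed: bool) -> bool:
        """Record one check result; return True if an alert should fire."""
        self.results.append(passed)
        if len(self.results) < self.min_samples:
            return False  # too little data to judge
        rate = sum(self.results) / len(self.results)
        return rate < self.threshold


if __name__ == "__main__":
    monitor = RollingPassRate(window=5, threshold=0.8, min_samples=5)
    for passed in [True, True, True, True, True, False, False]:
        alert = monitor.record(passed)
        print(f"passed={passed} alert={alert}")
```

The window size trades detection speed against noise: a small window reacts quickly but alerts on ordinary run-to-run variation.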

Historical parallel

The infrastructure gap mirrors early cloud computing. In 2006, AWS offered compute and storage but lacked monitoring, orchestration, security, and governance tools. The ecosystem took a decade to mature. Agent infrastructure is in a similar early stage — the core capabilities exist, but the surrounding ecosystem of observability, testing, governance, and ops tooling is nascent.