Architecture is where AI agents succeed or fail long before users see them. The choices you make about orchestration patterns, state management, error handling, and system boundaries determine whether an agent is a reliable tool or an expensive demo.
These posts explore the structural decisions behind production AI systems: how to design tool-calling loops, when to use multi-agent orchestration vs. single-agent pipelines, how to handle failure gracefully, and what separates architectures that scale from those that collapse under real-world load. Every pattern is grounded in systems we’ve actually built and operated.
Topics range from single-agent tool-calling loops to multi-agent coordination, state persistence strategies, context window management, and the monitoring infrastructure needed to operate agent systems with confidence. Each architectural pattern is evaluated against real production constraints — latency budgets, cost targets, and failure recovery requirements.
MCP gives your agent hands. A2A gives your agents colleagues. That distinction is now load-bearing: on April 23, 2026, the Linux Foundation cut Agent2Agent Protocol v1.0 under the newly formed Agentic AI Foundation (AAIF), the same governance body that took over MCP earlier this spring. The AAIF launch, announced jointly by OpenAI, Anthropic, Google, Microsoft, AWS, and Block with roughly 150 member organizations, settles the political question that has haunted multi-agent infrastructure since 2025: there is now one protocol stack, with one steward, that everyone has staked their roadmap on.
Past a 10-step trajectory with read-heavy tools, your agent is bottlenecked on tool latency, not LLM throughput. Speculative tool execution fires the predicted next call while the model is still emitting tokens, then promotes or discards it on commit. PASTE (arXiv:2603.18897, Microsoft Research, March 2026) reports a 48.5% reduction in task completion time. The companion UMD/LLNL paper (arXiv:2512.15834) layers client-side and engine-side speculation for an additional 6 to 21%. The technique reduces to two design decisions: a predictor and an eligibility policy. Get either wrong and you ship a billing incident or a data-safety incident.
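To make that concrete, here is a minimal sketch of the client-side pattern — not PASTE's implementation. The `predictor`, `llm`, and `tools` objects stand in for whatever interfaces your stack exposes, and the eligibility policy is a hard allowlist of side-effect-free tools:

```python
import asyncio
from dataclasses import dataclass

@dataclass(frozen=True)
class ToolCall:
    name: str
    args: tuple  # frozen + hashable so a predicted call can be compared to the real one

READ_ONLY_TOOLS = {"search_orders", "get_customer", "check_inventory"}  # illustrative names

def is_eligible(call: ToolCall) -> bool:
    """Eligibility policy: speculate only on idempotent, side-effect-free tools."""
    return call.name in READ_ONLY_TOOLS

async def agent_step(history, llm, tools, predictor):
    # Fire the predicted next call while the model is still emitting tokens.
    guess = predictor.predict_next_call(history)
    speculative = None
    if guess is not None and is_eligible(guess):
        speculative = asyncio.create_task(tools.execute(guess))

    actual = await llm.next_tool_call(history)  # the model's real decision: the commit point

    if speculative is not None and guess == actual:
        result = await speculative              # promote: the tool latency was already paid
    else:
        if speculative is not None:
            speculative.cancel()                # discard: wrong guess, nothing to undo
        result = await tools.execute(actual)

    history.append((actual, result))
    return result
```

The allowlist is the part that prevents the billing incident: a mispredicted read costs one wasted call, a mispredicted write costs an apology.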
Conventional prompt-injection defenses—input classifiers, spotlighting, fine-tuned refusal heads—plateau somewhere around 95% detection. In application security, 95% is a failing grade: the remaining 5% is a repeatable exploit. CaMeL (Capabilities for Machine Learning), introduced by Debenedetti et al. at DeepMind in arXiv:2503.18813, does not try to push that number higher. It changes the shape of the problem. Split the model in two—a Privileged LLM that never reads untrusted data, and a Quarantined LLM that reads data but cannot call tools—and enforce an information-flow policy on every value that crosses the boundary. What you lose is seven points of utility on AgentDojo (84% undefended to 77% defended). What you gain is a security property you can prove, not just measure.
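A rough sketch of that split, assuming your own `q_llm` client and `policy` object rather than the paper's interpreter: untrusted text only ever enters the system wrapped in a provenance-carrying value, and every tool call is gated on where its arguments came from.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Value:
    """A piece of data plus the provenance of everything that influenced it."""
    data: str
    sources: frozenset = field(default_factory=frozenset)  # e.g. {"user"}, {"inbound_email"}

def quarantined_extract(q_llm, untrusted: Value, schema: str) -> Value:
    """Quarantined LLM: reads untrusted text, returns a constrained value, has no tool access."""
    parsed = q_llm.complete(f"Extract {schema} from the following text:\n{untrusted.data}")
    return Value(parsed, untrusted.sources)  # taint is preserved, never laundered away

def checked_tool_call(policy, tool, **kwargs: Value):
    """Gate every tool call on an information-flow policy over argument provenance."""
    provenance = frozenset().union(*(v.sources for v in kwargs.values()))
    if not policy.allows(tool.__name__, provenance):
        raise PermissionError(f"{tool.__name__} may not consume data from {set(provenance)}")
    return tool(**{k: v.data for k, v in kwargs.items()})
```

The Privileged LLM plans against `Value` handles and trusted instructions only; it never sees `untrusted.data`, so there is nothing in its context for an injected instruction to hijack.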
The “long context” number in your model card is marketing. Past roughly 80K tokens, your agent’s tool-calling accuracy falls off a cliff—and padding the window with more tokens is the most expensive way to get worse results. The engineering answer is not a bigger window. It is a structured compaction operation, borrowed from research published in late 2025 and early 2026 under names like FoldGRPO, AgentFold, and ACON, and now shipping as a first-class API primitive in Anthropic’s context-management-2026-01-12 beta. The common label for the technique is context folding: replace a settled segment of the trajectory with a learned summary, evict the raw tokens, and keep executing against the compressed artifact.
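Framework-agnostic, the operation looks roughly like the sketch below. The trigger, the turn count, and the `llm.complete` client are all placeholders to tune against your own accuracy cliff, not values from the papers or the Anthropic beta:

```python
FOLD_TRIGGER = 80_000   # fold once the trajectory approaches the accuracy cliff
KEEP_RAW_TURNS = 6      # the most recent turns stay verbatim; older ones are "settled"

def fold_context(messages, llm, count_tokens):
    """Replace the settled prefix of the trajectory with a compact summary."""
    if count_tokens(messages) < FOLD_TRIGGER:
        return messages

    settled, recent = messages[:-KEEP_RAW_TURNS], messages[-KEEP_RAW_TURNS:]
    summary = llm.complete(
        "Compress this agent trajectory into a compact state note. Preserve the task, "
        "decisions already made, results still needed downstream, and open questions. "
        "Drop raw tool output that has already been acted on.\n\n"
        + "\n".join(f"{m['role']}: {m['content']}" for m in settled)
    )
    # The agent keeps executing against the compressed artifact, not the raw tokens.
    return [{"role": "user", "content": f"[folded context]\n{summary}"}] + recent
```

Calling this between loop iterations keeps the prompt under the trigger while preserving what the rest of the task actually needs.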
The industry’s mental model of prompt injection is session-scoped: attacker crafts a malicious input, it executes in the current context, the session ends, and the attack ends with it. Defenses are designed around this model—input filtering, system prompt hardening, output validation. Every major framework has a story for it.
Memory-augmented agents break that model entirely.
When your agent writes to long-term memory—and most production agents running in 2026 do—a successful injection doesn’t need to execute immediately. It can plant a record, go dormant, and activate three sessions later when a semantically related query retrieves it. The attacker doesn’t need to be in the session at exploit time. The agent’s own reasoning, presented with a poisoned memory entry it trusts, does the rest. Forensics are brutal: the bad decision looks indistinguishable from the agent’s own learned behavior.
The majority of teams running AI agents in production have no automated quality gates. They deploy, manually check a few outputs, and hope nothing regressed. LangChain’s 2026 State of Agent Engineering report found that 57% of organizations now have agents in production — but quality remains the top barrier, cited by 32% of respondents. Google released a codelab this year explicitly titled “from vibe checks to data-driven agent evaluation.” The industry is collectively admitting that the testing story for agents is broken.
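An automated gate does not have to be sophisticated to beat a vibe check. A minimal sketch, assuming a fixed `eval_cases.json` file and a crude substring grader — your grading logic and threshold will differ:

```python
import json
import sys

PASS_RATE_FLOOR = 0.92   # illustrative threshold, taken from your current accepted baseline

def eval_pass_rate(agent, cases):
    """Run the agent over a fixed eval set and score each case with a simple substring grader."""
    passed = sum(
        1 for case in cases
        if case["expected_substring"].lower() in agent.run(case["input"]).lower()
    )
    return passed / len(cases)

def quality_gate(agent, cases_path="eval_cases.json"):
    """Exit non-zero when the pass rate regresses, so CI blocks the deploy."""
    rate = eval_pass_rate(agent, json.load(open(cases_path)))
    print(f"pass rate: {rate:.2%} (floor {PASS_RATE_FLOOR:.0%})")
    sys.exit(0 if rate >= PASS_RATE_FLOOR else 1)
```

Wired into CI, a regression blocks the deploy instead of surfacing in production.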
The most important skill in production AI in 2026 is not prompt engineering, model selection, or fine-tuning. It is context engineering — the discipline of designing everything the model sees at inference time. A weaker model with well-engineered context consistently outperforms a stronger model with bad context. Anthropic’s own evaluation showed that Claude Code with proper context engineering via MCP achieved an 80% quality improvement over the same model without it. LangChain’s 2026 State of Agent Engineering report confirms the pattern: context engineering is the top difficulty for 57% of organizations running agents in production.
In our previous posts, we broke down the individual components of production AI agents: how the tool-calling loop works, how system prompts govern behavior, how MCP connects agents to business systems, and how to configure extension points in practice. Each of those posts examined a single agent doing a single job.
This post is about what happens when one agent isn’t enough.
2025 was the year of single AI agents. 2026 is the year they start working together. The AI agent market is growing at 46% year over year, and Gartner projects that 40% of enterprise applications will feature task-specific AI agents by the end of this year — up from less than 5% in 2025. But Gartner also predicts that over 40% of agentic AI projects will be canceled by 2027, and the primary killers are cost overruns, coordination complexity, and inadequate governance.
In our previous posts, we broke down how system prompts govern agent behavior and how the tool-calling loop actually works. Both of those pieces assumed something that, in practice, is the hardest part of building a production AI agent: the agent can actually talk to your business systems.
That’s the integration problem. And until recently, it was brutal.
If you wanted an AI agent that could look up customer orders, check inventory, update a CRM record, and send a follow-up email, you needed four separate integrations — each with its own authentication flow, data format, error handling, and maintenance burden. Five AI platforms connecting to twenty business tools meant a hundred integration projects. Every new tool or model multiplied the work.
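MCP collapses that to one integration per tool, reusable by any MCP-capable agent. As a minimal sketch using the FastMCP helper from the official Python SDK (the in-memory ORDERS dict stands in for your real data access layer, and the tool name is illustrative):

```python
# One MCP server per business system; any MCP-capable agent platform can connect to it,
# so five platforms times twenty tools stops meaning a hundred bespoke integrations.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("orders")  # server name advertised to connecting clients

ORDERS = {"A-1001": {"status": "shipped", "items": ["espresso machine"], "eta": "2026-03-02"}}

@mcp.tool()
def look_up_order(order_id: str) -> dict:
    """Return status, items, and delivery estimate for a customer order."""
    return ORDERS.get(order_id, {"error": f"no order found with id {order_id}"})

if __name__ == "__main__":
    mcp.run()  # stdio transport by default, suitable for local agent hosts
```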
Most explanations of AI agents stop at “the LLM decides which tool to use.” That’s the easy part. The hard part is everything around it: how you define tools so the model actually picks the right one, how you handle failures mid-chain, and how you keep a stateless model acting like it has memory.
This post breaks down the core loop that powers every agent we build at Replyant.
The Agent Loop
Every AI agent, regardless of framework, runs the same fundamental cycle:
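Here is that cycle as a minimal sketch before we break down each step — the `llm` client and `tools` registry are placeholder interfaces, not a particular SDK:

```python
def run_agent(user_message, llm, tools, system_prompt, max_steps=10):
    """The core cycle: the model proposes, the runtime executes, results feed back in."""
    messages = [{"role": "system", "content": system_prompt},
                {"role": "user", "content": user_message}]

    for _ in range(max_steps):
        response = llm.chat(messages, tools=tools.schemas())  # model sees full history + tool definitions
        messages.append(response.message)

        if not response.tool_calls:            # no tool requested: the model is answering the user
            return response.message["content"]

        for call in response.tool_calls:       # execute every requested call, success or failure
            try:
                result = tools.execute(call.name, call.arguments)
            except Exception as exc:
                result = f"tool error: {exc}"  # errors go back to the model, not to the user
            messages.append({"role": "tool", "tool_call_id": call.id, "content": str(result)})

    return "Stopped: step budget exhausted before the task completed."
```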