The Architecture of an AI Orchestration Platform

A high-level view of a multi-agent orchestration platform - the conversation loop, the storage layers, the event bus, the approval gate, and the team of specialist agents all working in parallel

This post is a high-level architecture walkthrough of a production AI orchestration platform. The goal is to show, plainly, the eight systems that turn a language model into something a team can actually deploy.

If you are evaluating orchestration platforms, designing one or simply curious how agentic systems are built under the hood, this is the map.

1. The conversation store and the compaction engine

Every agent conversation in a production platform lives in a single durable document inside the operational database. That document holds the full history: every user message, every tool result, every model response, every internal plan state. It is the source of truth for the agent's context.

The problem is obvious: an unbounded conversation hits two ceilings at once. One is the model's context window, typically one to two hundred thousand tokens. The other is the operational database's per-document size limit, usually in the low megabytes. Letting either one overflow is a production incident waiting to happen.

Two strategies for the same problem

A production platform ships with two compaction strategies that the operator can switch between.

The first produces a single prose summary of everything before the compaction boundary. It is fast, space-efficient and a good fit for short-running or exploratory conversations where the exact wording of older messages does not matter.

The second is more sophisticated. It extracts three tiers of context.

The last block of messages, the most recent slice, stays verbatim. The agent needs to see exactly what was just said.
The next block is collapsed into prose summaries that preserve decision anchors and entity references - who decided what, who is talking to whom, which numbers matter.
Everything older is folded into a single root-level summary.

The three-tier approach is the default, and for good reason. It trades resolution against recency, which is what makes autonomous agents actually useful: full detail where it matters, summaries where it does not.
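The three-tier split can be sketched in a few lines of Python. This is a minimal illustration, not the platform's implementation: the tier sizes (`verbatim_n`, `summary_n`) and the `summarise` callable are assumptions, since the text does not say how large each block is or how the summaries are produced.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class TieredContext:
    root_summary: Optional[str]   # everything older, folded into one summary
    block_summary: Optional[str]  # the middle block, collapsed into prose
    verbatim: List[str]           # the most recent messages, untouched


def split_three_tiers(
    messages: List[str],
    summarise: Callable[[List[str]], str],
    verbatim_n: int = 4,   # hypothetical size of the verbatim tail
    summary_n: int = 8,    # hypothetical size of the middle block
) -> TieredContext:
    """Partition a history into root summary, block summary and verbatim tail."""
    tail_start = max(len(messages) - verbatim_n, 0)
    mid_start = max(tail_start - summary_n, 0)
    oldest = messages[:mid_start]
    middle = messages[mid_start:tail_start]
    return TieredContext(
        root_summary=summarise(oldest) if oldest else None,
        block_summary=summarise(middle) if middle else None,
        verbatim=messages[tail_start:],
    )
```

A short conversation never reaches the older tiers: with fewer than `verbatim_n` messages, both summaries stay empty and the whole history is kept verbatim.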

When compaction fires

A pressure check runs at the end of every loop iteration. It triggers when the context exceeds roughly three-quarters of the model's window, gated by a minimum floor in the low tens of thousands of tokens to avoid premature compaction. There is a short cooldown between compactions to stop the system thrashing. Token counts come from a real tokeniser with a safe character-based fallback when one is not available.
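A pressure check along these lines might look like the sketch below. The three-quarters ratio comes from the description above; the exact floor, the cooldown length and the characters-per-token estimate are placeholder assumptions.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


def count_tokens(text: str, tokenise: Optional[Callable[[str], list]] = None) -> int:
    """Use a real tokeniser when one is supplied, else a character-based estimate."""
    if tokenise is not None:
        return len(tokenise(text))
    # Assumption: 3 characters per token is deliberately pessimistic, so the
    # fallback errs towards compacting early rather than overflowing.
    return -(-len(text) // 3)  # ceiling division


@dataclass
class PressureCheck:
    window_tokens: int
    trigger_ratio: float = 0.75      # "roughly three-quarters" of the window
    min_floor_tokens: int = 20_000   # hypothetical floor
    cooldown_seconds: float = 60.0   # hypothetical cooldown
    last_compaction: Optional[float] = None

    def should_compact(self, context_tokens: int, now: Optional[float] = None) -> bool:
        """Return True and start the cooldown when compaction should fire."""
        now = time.monotonic() if now is None else now
        if context_tokens < self.min_floor_tokens:
            return False  # below the floor: never compact prematurely
        if context_tokens <= self.trigger_ratio * self.window_tokens:
            return False  # not enough pressure yet
        if self.last_compaction is not None and now - self.last_compaction < self.cooldown_seconds:
            return False  # still cooling down: avoid thrashing
        self.last_compaction = now
        return True
```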

What never gets summarised away

Critical state survives every compaction.

The agent's current plan, expressed as a list of active todos, is re-injected verbatim at the compaction boundary.
Pinned or canonical facts, things the operator has flagged as always relevant, are re-injected on every pass.
A manifest of what was compacted, with stable chunk identifiers, is emitted so the agent or a human can recover any earlier portion on demand. The model can ask for a piece of its own past back, by identifier, and the platform serves it.

Cumulative summarisation

A subtle but important detail: the compactor does not start from scratch every time. It feeds existing summaries forward into the new pass. The savings compound. The first compaction reduces the working context by roughly half. The second compaction, working from an already-summarised history, takes off another large fraction of the remainder. The result is a context that stays within the window indefinitely, without losing fidelity on what matters.

2. The interceptor pipeline for large data

Tool results in an agentic system are often enormous. A full web scrape. A database export. A media-generation response with embedded metadata. A raw CSV. You cannot throw any of that into the model's context verbatim, and you should not throw most of it away.

The interceptor architecture is the middleware that wraps every phase of the agent loop. It is a small, ordered set of hooks that can inspect, modify, allow or reject anything that passes through.

The hooks run at these points in the loop.

Before the request goes to the model. This is where relevant lessons, memory anchors and compacted context get injected.
After the model responds. This is where blank or cached responses are caught and the request is retried or surfaced as a hard failure.
Before a batch of tool calls executes. This is where each tool is checked against the approval rules and against the learning layer's risk score.
After each individual tool call completes. This is where large results get pre-compacted into a lightweight placeholder, with the full body offloaded to durable storage.
On any error. Failures propagate loudly. There is no silent swallowing.

Pre-compaction of large tool results

The most consequential hook for large data is the one that runs after a tool call. When the result is larger than a few kilobytes, the full body is offloaded to durable storage and the conversation document only stores a pointer plus a short preview. A recovery tool can pull the full body on demand, but only when the agent explicitly needs it, not on every iteration.

This is how an agent can process a multi-megabyte export without blowing past the context window. It sees the schema, the first few rows, and a pointer. If it needs the full dataset to answer the question, it fetches it explicitly. If it does not, the full body never enters the model's working memory.
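As a rough sketch of that hook, assuming an in-memory stand-in for the blob store, a 4 KB threshold and a 200-character preview (none of which are specified above):

```python
import uuid
from typing import Dict

PREVIEW_CHARS = 200           # hypothetical preview length
OFFLOAD_THRESHOLD = 4 * 1024  # "a few kilobytes"; the exact value is an assumption


class InMemoryBlobStore:
    """Stand-in for durable object storage."""

    def __init__(self) -> None:
        self._blobs: Dict[str, str] = {}

    def put(self, body: str) -> str:
        uri = f"blob://tool-results/{uuid.uuid4().hex}"
        self._blobs[uri] = body
        return uri

    def get(self, uri: str) -> str:
        return self._blobs[uri]


def after_tool_call(result: str, store: InMemoryBlobStore) -> dict:
    """Return what the conversation document should store for this result."""
    size = len(result.encode("utf-8"))
    if size <= OFFLOAD_THRESHOLD:
        return {"kind": "inline", "body": result}
    return {
        "kind": "pointer",
        "uri": store.put(result),         # full body goes to durable storage
        "bytes": size,
        "preview": result[:PREVIEW_CHARS],
    }


def recover_tool_result(entry: dict, store: InMemoryBlobStore) -> str:
    """The recovery tool: fetch the full body only when explicitly asked."""
    return entry["body"] if entry["kind"] == "inline" else store.get(entry["uri"])
```

The pointer entry never carries the full body, so the conversation document stays small no matter how large the tool result was.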

What the full chain enforces

In a typical deployment, the chain runs in this order.

A budget guard checks spend limits for the current iteration and the session as a whole, with configurable thresholds for warn, pause and stop.
A tool-approval hook checks the human-in-the-loop rules for each tool being called.
A learning hook scores the proposed call against the platform's accumulated lessons, returning a number that the platform maps to allow, warn or block.
An output guardrail validates the model's eventual output against configured constraints before it reaches the user or a downstream system.
An activity logger records the call to an audit trail.

Every hook is optional and configurable. None of them silently drops data. They either pass the data through or reject it with a clear, typed signal that the rest of the platform can reason about.
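The contract can be illustrated with a small ordered chain. This sketch models only two of the five hooks plus the audit trail, and every name in it (`Decision`, `run_chain`, `budget_guard`, `deny_list`) is hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional


class Verdict(Enum):
    ALLOW = "allow"
    REJECT = "reject"


@dataclass(frozen=True)
class Decision:
    """Typed result of the chain: never a silent drop."""
    verdict: Verdict
    hook: str
    reason: str = ""


@dataclass(frozen=True)
class ToolCall:
    name: str
    estimated_cost: float = 0.0


Hook = Callable[[ToolCall], Optional[Decision]]  # None means "pass through"


def budget_guard(limit: float) -> Hook:
    def hook(call: ToolCall) -> Optional[Decision]:
        if call.estimated_cost > limit:
            return Decision(Verdict.REJECT, "budget_guard", "over per-call limit")
        return None
    return hook


def deny_list(denied: set) -> Hook:
    def hook(call: ToolCall) -> Optional[Decision]:
        if call.name in denied:
            return Decision(Verdict.REJECT, "tool_approval", "tool denied by rule")
        return None
    return hook


def run_chain(call: ToolCall, hooks: List[Hook], audit: List[str]) -> Decision:
    """Run hooks in order; the first rejection wins and every call is logged."""
    decision = Decision(Verdict.ALLOW, "chain")
    for hook in hooks:
        result = hook(call)
        if result is not None and result.verdict is Verdict.REJECT:
            decision = result
            break
    audit.append(f"{call.name}:{decision.verdict.value}:{decision.hook}")
    return decision
```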

3. Layered storage

A production platform uses storage at three distinct layers, each chosen for what it is good at.

Layer 1 - the operational database

The conversation document lives here, in a document store, as a single record with embedded message history. New messages are appended atomically to the message array, which makes each write crash-safe: if the server dies mid-write, the document is either fully updated or fully untouched. The agent's next iteration resumes from the last persisted state without corruption.

The same atomicity applies to state transitions like approval status. The approval state field moves between pending, approved and denied in a single write.

Layer 2 - the blob store

Large tool results, compacted chunk bodies and media files all live in a durable, tenant-scoped blob store, typically backed by object storage. The store is addressable through stable URIs and the conversation document carries those URIs as the source of truth for large payloads. The full body is only fetched on demand.

Layer 3 - the vector index

Compacted chunks are embedded and indexed in a vector store, with tenant and workspace metadata attached to every record. Semantic search across the compacted history uses these embeddings. The vector client automatically attaches tenant filters to every query, so a workspace never sees another workspace's data.

The write contract

The critical design rule that holds the three layers together: a chunk body's URI is only committed to the operational document after both the blob write and the vector index write have succeeded. This is enforced in a single coordinated transaction with a short timeout. If the timeout is hit, or either write fails, the entire operation is rolled back and reported as an error. The agent or the human operator can retry the compaction. No partial state is ever visible to the model.
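A minimal sketch of that ordering, using in-memory stand-ins for both stores. The timeout is left out for brevity, the vector store here holds the raw body rather than an embedding, and the compensation step (deleting whatever already landed) is one assumed way to roll back.

```python
from typing import Dict, List


class WriteFailed(Exception):
    """Raised loudly when the coordinated write cannot complete."""


class FakeStore:
    """In-memory stand-in for the blob store or the vector index."""

    def __init__(self, fail: bool = False) -> None:
        self.data: Dict[str, str] = {}
        self.fail = fail

    def write(self, key: str, value: str) -> None:
        if self.fail:
            raise IOError("store unavailable")
        self.data[key] = value

    def delete(self, key: str) -> None:
        self.data.pop(key, None)


def commit_chunk(doc: dict, uri: str, body: str, blob: FakeStore, vectors: FakeStore) -> None:
    """Commit the URI to the document only after both downstream writes succeed."""
    written: List[FakeStore] = []
    try:
        for store in (blob, vectors):
            store.write(uri, body)
            written.append(store)
    except Exception as exc:
        for store in written:   # compensate: undo whatever already landed
            store.delete(uri)
        raise WriteFailed(f"chunk {uri} not committed") from exc
    # Only this final step is visible to the model.
    doc.setdefault("chunk_uris", []).append(uri)
```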

4. Atomic versus non-atomic operations

This distinction matters for understanding how a production platform stays reliable under load.

Atomic operations

The basic message append is atomic. A new message from the model, a tool result or a state transition becomes a single write to the operational document, and that write either fully lands or does not land at all. The agent loop gets its basic reliability from this: each iteration is crash-recoverable and the platform can restart, fail over or scale without corrupting the conversation.

Non-atomic operations

Context compaction is inherently non-atomic. It involves reading the full document, generating summaries, writing chunk bodies to the blob store, writing vector embeddings to the index and updating the operational document. Those steps span three different systems.

The platform handles this with a fail-loud transactional pattern. The writes are coordinated, and the operational document is updated only after both downstream writes confirm. If any step fails, the error is reported, the agent falls back to its pre-compaction state and keeps working, and the next pressure check re-triggers the compaction. The context window stays tight in the meantime, but the work continues.

Why this design holds

Atomic message append gives the loop its reliability primitive. Non-atomic compaction is acceptable because it is an optimisation, not a correctness primitive. The platform separates the operations that must be correct from the operations that are best-effort, and the system is built so that the best-effort operations can fail repeatedly without compromising the ones that must be correct.

5. Tool policies, approval and guardrails

Every tool call in a production platform goes through a policy engine before it reaches the external service. This is the governance layer and it is the difference between a demo and a deployment.

Discovery and routing

Tools are discovered through a catalog service that exposes the available tools and their schemas. The model emits a structured call request. The platform's tool router runs the batch concurrently, with a configurable upper bound on parallelism. Tools follow a consistent naming convention and a small set of built-in meta-tools - things like discovering, executing, retrying and rendering - bypass the external tool layer entirely because they are part of the platform itself.

The approval model

The approval system is a human-in-the-loop gate that pauses tool execution before it reaches the external service, then resumes asynchronously after the human decides.

The decision precedence is well defined.

A short, fixed list of always-approved built-in tools skips every check.
Each tool configuration declares an approval mode, one of allowed, manual or guard-gated.
User-level rules, stored in the operational database, allow the operator to set per-tool defaults like always-approve, ask, approve-exact or deny, scoped to the user and workspace.
A per-tool metadata flag overrides the default when set explicitly.
In the absence of any rule, the default is to ask. The system is fail-closed.
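One possible reading of that list, treated as a first-match-wins order, is sketched below. How the platform actually resolves a conflict between the tool mode, a user rule and the metadata flag is not spelled out here, so the ordering is an assumption. The built-in tool names are hypothetical, and the `guard-gated` and `approve-exact` modes are deliberately not modelled; they fall through to the fail-closed default.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    APPROVE = "approve"
    ASK = "ask"
    DENY = "deny"


# Hypothetical built-in tool names; the actual list is not given.
ALWAYS_APPROVED = frozenset({"discover_tools", "render_output"})

# Settings this sketch understands. Anything else falls through to ASK.
_KNOWN = {
    "allowed": Outcome.APPROVE,
    "always-approve": Outcome.APPROVE,
    "manual": Outcome.ASK,
    "ask": Outcome.ASK,
    "deny": Outcome.DENY,
}


def resolve_approval(
    tool: str,
    tool_mode: Optional[str] = None,
    user_rule: Optional[str] = None,
    metadata_flag: Optional[str] = None,
) -> Outcome:
    """Resolve one tool call to approve, ask or deny."""
    if tool in ALWAYS_APPROVED:
        return Outcome.APPROVE
    # Assumed order: the first setting with a recognised value decides.
    for setting in (tool_mode, user_rule, metadata_flag):
        if setting in _KNOWN:
            return _KNOWN[setting]
    return Outcome.ASK  # fail-closed: no rule means a human is asked
```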

Pause and resume as a signal, not an error

When a tool needs approval, the interceptor does not throw an exception. It returns a typed signal - call it an approval-required signal - and transitions the conversation state to pending. The user receives a notification and decides, and the platform resumes the loop asynchronously with the approved tool calls cleared to proceed.

This distinction matters. Errors abort permanently. Signals mean pause and resume. The agent retains its full state, the pending tool calls are preserved and the human can approve or deny without the agent losing its place. The agent never sees the pause. To it, the tools simply ran.
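The difference between a signal and an error is easy to see in code. In this sketch the pause is a returned value, not a raised exception, and the pending calls are kept on the conversation so nothing is lost. The class and function names are illustrative, and the whole batch is approved or denied together as a simplification.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class ApprovalRequired:
    """Typed pause signal: returned to the loop, never raised."""
    tool_calls: tuple


@dataclass
class Conversation:
    approval_state: Optional[str] = None   # None, "pending", "approved" or "denied"
    pending_calls: List[str] = field(default_factory=list)
    results: Dict[str, str] = field(default_factory=dict)


def run_batch(
    conv: Conversation,
    calls: List[str],
    needs_approval: Callable[[str], bool],
    execute: Callable[[str], str],
) -> Optional[ApprovalRequired]:
    """Execute the batch, or pause it and return a signal if any call is gated."""
    gated = tuple(c for c in calls if needs_approval(c))
    if gated:
        conv.approval_state = "pending"
        conv.pending_calls = list(calls)   # preserved so the loop can resume
        return ApprovalRequired(gated)
    conv.results.update({c: execute(c) for c in calls})
    return None


def resume(conv: Conversation, approved: bool, execute: Callable[[str], str]) -> None:
    """Called asynchronously once the human has decided."""
    calls, conv.pending_calls = conv.pending_calls, []
    conv.approval_state = "approved" if approved else "denied"
    for c in calls:
        conv.results[c] = execute(c) if approved else "denied by user"
```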

Output and budget guardrails

Two more guards sit alongside the approval layer. An output guardrail validates the model's eventual output against configured constraints before it reaches the user or a downstream system. A budget guard checks spend limits per iteration and per session, with configurable thresholds. Both use the same hook contract: inspect, allow, modify or reject. Neither silently swallows a failure.

6. Serial and parallel delegation

Delegation is the most architecturally distinctive feature of an orchestration platform. It is what turns a single-agent conversation into a team.

The delegation primitive

A built-in delegation tool is exposed to every agent. Its parameters name the target specialist, carry the task brief, declare the expected output shape and choose a strategy. The expected output shape is not optional. It is a formal schema that the platform forwards to the model provider as a native structured-output constraint. The specialist must return conforming output. The parent receives a validated payload it can trust.

Three pre-flight gates

Before delegation proceeds, three checks run.

Is the target specialist enabled to run as a sub-agent?
Is the agent trying to delegate to itself? That would create a loop, and it is rejected up front.
Does the target provider support the requested output schema natively? If not, the platform does not attempt the call. Silent failure downstream is worse than a clear error up front.
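The three gates translate naturally into a single function that returns a rejection reason or nothing. The field names and the provider capability table below are assumptions made for the sake of the example.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Specialist:
    name: str
    sub_agent_enabled: bool
    provider: str


@dataclass(frozen=True)
class DelegationRequest:
    caller: str
    target: Specialist
    task_brief: str
    output_schema: dict   # the expected output shape is not optional
    strategy: str         # "serial" or "parallel"


# Hypothetical capability table; real providers and their features will differ.
NATIVE_SCHEMA_PROVIDERS: FrozenSet[str] = frozenset({"provider-a"})


def preflight(req: DelegationRequest) -> Optional[str]:
    """Return None when delegation may proceed, else a clear rejection reason."""
    if not req.target.sub_agent_enabled:
        return "target specialist is not enabled as a sub-agent"
    if req.target.name == req.caller:
        return "an agent cannot delegate to itself"
    if req.target.provider not in NATIVE_SCHEMA_PROVIDERS:
        return "target provider lacks native structured-output support"
    return None
```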

Serial delegation

Serial is not really delegation in the team sense. The serial strategy switches the current conversation to take on the role of the target specialist at the next iteration boundary. No new context is created. No parallel work happens. It is useful for role-switching within the same conversation - the agent becomes a different specialist for a stretch, then can become something else.

Parallel delegation

Parallel is the real story. The strategy spawns a new child context - an isolated workspace with its own message history, its own tool set and its own memory scope, optionally able to read the parent's knowledge. The parent keeps a registry of the children it has spawned and tracks their progress through a heartbeat and a stale-context recovery mechanism.

The children are dispatched as independent work items on the event bus. Each one runs concurrently. The parent can dispatch many at once. The wall-clock benefit is significant: work that would be sequential in a single-agent context becomes concurrent. A research task, a draft task, an image generation and a data extraction can all run at the same time, each in its own isolated context.
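The concurrency pattern can be approximated in-process with `asyncio`. This sketch replaces the event bus with `asyncio.gather`, gives each child its own message history and bounds the fan-out with a semaphore. The `max_parallel` limit and the `fake_specialist` worker are assumptions for illustration.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Tuple


async def run_child(brief: str, work: Callable[[str], Awaitable[str]]) -> str:
    """Run one child in its own isolated message history."""
    history: List[str] = [brief]
    history.append(await work(brief))
    return history[-1]


async def delegate_parallel(
    tasks: Dict[str, str],
    work: Callable[[str], Awaitable[str]],
    max_parallel: int = 4,   # hypothetical bound on concurrent children
) -> Dict[str, str]:
    """Dispatch every child concurrently and collect results by specialist name."""
    gate = asyncio.Semaphore(max_parallel)

    async def bounded(name: str, brief: str) -> Tuple[str, str]:
        async with gate:
            return name, await run_child(brief, work)

    pairs = await asyncio.gather(*(bounded(n, b) for n, b in tasks.items()))
    return dict(pairs)


async def fake_specialist(brief: str) -> str:
    await asyncio.sleep(0.05)   # stands in for model and tool latency
    return f"done: {brief}"
```

Calling `asyncio.run(delegate_parallel({"research": "find sources", "draft": "write intro"}, fake_specialist))` runs both children at the same time, so the total wait is close to the slowest child, not the sum of both.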

Typed output contracts

The schema requirement is the load-bearing part. Because the specialist must return output matching the parent's declared schema, the parent's recovery decisions become precise. The platform can distinguish between schema-owned failures - the output failed validation - and execution failures - the underlying tool or model crashed. That distinction drives better retry, better error reporting and better composition of partial results.
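A small sketch makes the two failure classes concrete. The `validate` function is a toy stand-in for real schema validation (required keys with expected Python types), and the exception names are hypothetical.

```python
from typing import Any, Callable, Dict


class SchemaFailure(Exception):
    """The specialist ran, but its output did not match the declared schema."""


class ExecutionFailure(Exception):
    """The underlying tool or model crashed before producing output."""


def validate(payload: Any, schema: Dict[str, type]) -> Dict[str, Any]:
    """Toy validation: every schema key must be present with the expected type."""
    if not isinstance(payload, dict):
        raise SchemaFailure("payload is not an object")
    for key, expected in schema.items():
        if key not in payload:
            raise SchemaFailure(f"missing field: {key}")
        if not isinstance(payload[key], expected):
            raise SchemaFailure(f"wrong type for field: {key}")
    return payload


def call_specialist(run: Callable[[], Any], schema: Dict[str, type]) -> Dict[str, Any]:
    """Run a specialist and classify any failure for the parent."""
    try:
        raw = run()
    except Exception as exc:
        raise ExecutionFailure(str(exc)) from exc
    return validate(raw, schema)
```

A parent might retry an `ExecutionFailure` unchanged, but respond to a `SchemaFailure` by re-prompting with the validation message, since running the same call again is unlikely to fix a shape problem.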

7. Event streaming and context lifecycle

A production orchestration platform is event-driven throughout. There is no synchronous request-response chain for agent execution.

The event bus

All inter-service communication flows through a publish-subscribe bus. Subjects are namespaced by tenant, by user, by conversation, so no service can subscribe to another tenant's subjects. The tenant boundary is enforced at the bus level, not just at the application level. Tokens on the wire are encrypted, never plaintext.
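To illustrate the namespacing, here is a sketch that assumes dot-separated subjects with `*` and `>` wildcards, in the style of NATS. The text above does not name the bus, so the subject layout and both function names are assumptions.

```python
def subject(tenant: str, user: str, conversation: str, event: str) -> str:
    """Build a tenant-scoped subject, refusing tokens that could widen the scope."""
    for part in (tenant, user, conversation, event):
        if not part or any(ch in part for ch in ".*> \t"):
            raise ValueError(f"invalid subject token: {part!r}")
    return f"tenant.{tenant}.user.{user}.conv.{conversation}.{event}"


def may_subscribe(caller_tenant: str, pattern: str) -> bool:
    """Allow a subscription only if it is rooted in the caller's own tenant."""
    return pattern.startswith(f"tenant.{caller_tenant}.")
```

Because the check is on the literal tenant prefix, a wildcard in the tenant position (`tenant.*.user.u1.>`) or a bare `>` is rejected before it ever reaches the bus.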

The request flow

When a user sends a message, the chat service receives it, validates the user's identity and publishes a chat request event on the bus. The orchestration worker picks up the event, resolves the agent configuration and starts the loop. Every model delta, every tool start, every tool result, every thinking step is emitted as a streaming event back to the user's interface.

Streaming deltas

The user sees the agent working in real time, not waiting for a complete response. Text tokens stream in as the model produces them. Tool starts announce themselves. Tool results arrive in full when they complete. Reasoning, where the model exposes it, streams alongside the response. The interface renders these as they arrive, so the experience is a live view into the agent's work, not a black box.

The context lifecycle

Every conversation moves through a small set of states over its life.

Active - the agent is currently iterating on this conversation.
Paused - waiting for a human approval decision.
Completed - the agent has returned a final result, whether through a natural stop, a stop sequence or a refusal.
Archived - the compaction engine has offloaded older chunks and the document's message list has been trimmed.

The state is persisted after every iteration. If the server restarts mid-conversation, the agent resumes from the last persisted state. Unprocessed events remain on the bus until consumed. The architecture has no single point at which a request can be silently lost.

8. Learning and reflection

A production platform does not just execute work. It learns from outcomes, extracts patterns and applies them to future tasks. This is what makes the platform measurably better at its job the more you use it.

Three subsystems

Lessons are transferable patterns extracted from what worked and what did not. Each lesson has a unique name, a problem statement, a recommended approach, a recommended avoidance, a confidence score, a safety flag and a time-to-live.

Memory is durable, owner-scoped recall across sessions. The owner of a piece of memory is the team, if there is a team, otherwise the individual agent. Before the agent starts its work loop, the platform builds a pre-flight block: what the agent can recall from prior runs, plus the key elements of the current conversation.

Decision and outcome calibration records the platform's confidence predictions on important decisions and links outcomes back to them. The platform then computes per-agent prediction accuracy, feeding that signal back into future decisions.

The learning loop

Every agent run follows this cycle.

Before starting, the agent queries relevant lessons, ranked by similarity, with a minimum score threshold.
Before every iteration, the most relevant lessons are injected into the working context.
Before every tool batch, the learning layer scores each proposed call against the lessons, using a weighted signal: argument-pattern match, tool-name match, error-history match, keyword match, and semantic similarity. A high score warns or rejects. A low score allows.
After the run, or at significant milestones, the platform extracts new lessons from what happened. Extraction is fail-fast, never blocks the agent loop and is rate-limited per workspace.
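The weighted scoring step in the cycle above can be sketched as a weighted sum mapped to allow, warn or block. The five signal names follow the list in the text; the weights and the two thresholds are invented for the example.

```python
from typing import Dict

# Hypothetical weights; they sum to 1.0 so the score stays in [0, 1].
WEIGHTS: Dict[str, float] = {
    "argument_pattern": 0.30,
    "tool_name": 0.20,
    "error_history": 0.25,
    "keyword": 0.10,
    "semantic_similarity": 0.15,
}
WARN_AT, BLOCK_AT = 0.5, 0.8   # hypothetical thresholds


def risk_score(signals: Dict[str, float]) -> float:
    """Weighted sum of per-signal matches, each clamped to [0, 1]."""
    return sum(
        weight * min(max(signals.get(name, 0.0), 0.0), 1.0)
        for name, weight in WEIGHTS.items()
    )


def verdict(score: float) -> str:
    """Map a score to the platform's three outcomes."""
    if score >= BLOCK_AT:
        return "block"
    if score >= WARN_AT:
        return "warn"
    return "allow"
```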

Confidence decay and retirement

Lessons are not permanent. The platform uses two decay mechanisms. The first is an injection-score decay that gently down-weights older lessons with a multi-month half-life. The second is an effective confidence that combines the original confidence, time elapsed and the running success and failure counts. A background sweep soft-archives lessons whose time-to-live has expired or whose confidence has dropped below a floor. Lessons flagged as safety-critical are exempt from auto-retirement.
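One plausible way to express both mechanisms is sketched below. The 90-day half-life, the pseudo-count formula for effective confidence and the archive floor are all assumptions; the text only says which inputs are combined, not how.

```python
from dataclasses import dataclass


@dataclass
class Lesson:
    confidence: float          # original confidence in [0, 1]
    age_days: float
    ttl_days: float
    successes: int = 0
    failures: int = 0
    safety_critical: bool = False


def injection_weight(age_days: float, half_life_days: float = 90.0) -> float:
    """Exponential decay: the weight halves every half_life_days."""
    return 0.5 ** (age_days / half_life_days)


def effective_confidence(lesson: Lesson, half_life_days: float = 90.0,
                         prior_weight: float = 2.0) -> float:
    """Blend the original confidence with observed outcomes, then apply time decay.

    The original confidence acts as prior_weight pseudo-observations, so a lesson
    with no recorded outcomes keeps its original confidence before decay.
    """
    trials = lesson.successes + lesson.failures
    observed = (lesson.confidence * prior_weight + lesson.successes) / (prior_weight + trials)
    return observed * injection_weight(lesson.age_days, half_life_days)


def should_archive(lesson: Lesson, floor: float = 0.2) -> bool:
    """Background sweep rule: expired or low-confidence lessons are soft-archived."""
    if lesson.safety_critical:
        return False   # exempt from auto-retirement
    return lesson.age_days > lesson.ttl_days or effective_confidence(lesson) < floor
```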

Why this matters

Most AI tools start every task from scratch. Every lesson is new. Every tool call is blind. A platform with a real learning layer means the second time you run a similar task, the platform already knows the rate limits, the common error patterns, the argument shapes that worked before. Lessons compound across runs, across agents, across the team.

What ties it together

These eight systems are not independent modules. They share a common architecture.

One execution engine runs the agent loop regardless of where the request came from - a chat, a scheduled trigger, a webhook.
One context model - a single document that holds the current state - is the source of truth.
One interceptor chain enforces approvals, learning, budgets and output guardrails in a predictable order.
One event bus carries all requests, responses and lifecycle events.
One security model attaches tenant identity to every operation, so the database, the vector store and the bus all enforce isolation automatically.

The result is a platform where the agent can run unattended, remember what it learned, respect human boundaries, handle arbitrarily large tool results and coordinate a team of specialists - all within a consistent, auditable, crash-safe architecture.

Ready to see it run? Try UnaGo free and stand up your first agent team in under ten minutes, or book a demo to map the first workflow worth automating for your team.
