The Most Expensive Word in AI Agents Is “Parallel”
AI agents love searching everything in parallel. Your token budget probably doesn’t. A practical look at where the cost actually comes from in multi-agent systems and how to reduce it.
I was using a knowledge assistant connected to the systems my team actually works in: Slack, Confluence, Jira, GitHub, internal catalogs, monitoring. It answered questions well.
Then I looked at the token bill.
The assistant was launching parallel subagents by default: one per connected source for most questions. Slack search, Jira lookup, documentation retrieval, catalog queries, monitoring checks. Everything at once.
Latency looked reasonable because the work happened concurrently. Cost did not.
The expensive part is easy to miss: subagents are not cheap lookups. In many agent frameworks each one carries most of the original context with it: system prompts, MCP tool definitions, conversation history, instructions. Every branch becomes its own prompt execution.
So for a question whose answer clearly lives in one place, you might still pay for six retrieval paths. Five return nothing useful. One finds the answer.
This is where distributed systems intuition breaks down. In typical infrastructure, parallelism is often close to free if capacity exists. In LLM systems, parallelism improves latency but scales token cost almost linearly with the number of branches.
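A rough back-of-envelope sketch of that scaling. Every number here is made up for illustration (carried context size, output size, per-token prices); substitute your own:

```python
# All numbers are illustrative; adjust to your own prompts and pricing.
CONTEXT_TOKENS_PER_BRANCH = 8_000   # system prompt + tool schemas + history carried into each subagent
OUTPUT_TOKENS_PER_BRANCH = 500      # retrieval summary each branch produces
INPUT_PRICE = 3 / 1_000_000         # $/token, Sonnet-class input (see the pricing table below)
OUTPUT_PRICE = 15 / 1_000_000       # $/token

def fan_out_cost(branches: int) -> float:
    """Token cost of answering one question with `branches` parallel subagents."""
    return branches * (
        CONTEXT_TOKENS_PER_BRANCH * INPUT_PRICE
        + OUTPUT_TOKENS_PER_BRANCH * OUTPUT_PRICE
    )

print(f"1 routed branch:     ${fan_out_cost(1):.3f}")
print(f"6 parallel branches: ${fan_out_cost(6):.3f}")  # ~6x the cost, latency roughly unchanged
```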
The fork always invoices you. To be fair, if your KPI is “AI adoption,” this is amazing design. Tokens burn at industrial scale and the dashboards look fantastic. But if you care about cost, this is painful.
The assistant looked intelligent because it searched everywhere. In practice, it was paying for uncertainty. The highest-impact fix was adding a routing step before retrieval.
Before searching, classify the question:
- where is the answer most likely to live?
- which source should be queried first?
- when is confidence high enough to stop?
Most internal questions have an obvious home. Incident discussion belongs in Slack. Ownership data belongs in a catalog. Delivery state belongs in Jira.
That single change (classify first, search selectively, stop early) reduced costs more than prompt tuning or model upgrades.
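A minimal sketch of that routing step, assuming a placeholder `llm` callable for the cheap classification call and a `search(source, question)` function standing in for your retrieval layer; neither is the assistant's real API:

```python
# Sketch only: `llm` and `search` are placeholders for your own model client and retrieval layer.
SOURCES = ["slack", "jira", "confluence", "github", "catalog", "monitoring"]

ROUTER_PROMPT = (
    "Which single source most likely holds the answer to this question?\n"
    "Question: {question}\n"
    "Sources: {sources}\n"
    "Reply with one source name and a confidence between 0 and 1, separated by a space."
)

def route(question, llm):
    """One cheap classification call instead of one subagent per source."""
    source, confidence = llm(ROUTER_PROMPT.format(
        question=question, sources=", ".join(SOURCES))).split()
    return source, float(confidence)

def answer(question, llm, search, threshold=0.7):
    source, confidence = route(question, llm)
    hits = search(source, question)        # query the most likely source first
    if hits or confidence >= threshold:    # stop early when confidence is high enough
        return hits
    # Fall back to the remaining sources only if the routed one comes up empty.
    for other in SOURCES:
        if other != source:
            hits = search(other, question)
            if hits:
                return hits
    return []
```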
The second driver was quieter: MCP overhead.
In many implementations MCP server definitions and tool schemas are included in the prompt whether they are used or not. Every connected integration increases the baseline token cost of every conversation.
Unused tools become permanent prompt weight. One surprisingly effective optimization is simply disabling MCP servers you rarely use.
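One way to see that weight, as a rough sketch: serialize the tool schemas your client injects and count tokens. The schemas below are invented, and tiktoken's cl100k_base encoding only approximates any particular model's tokenizer, but the comparison across configurations is what matters:

```python
import json
import tiktoken  # pip install tiktoken

# Invented example schemas; substitute whatever your MCP client actually injects.
TOOL_SCHEMAS = [
    {
        "name": "jira_search",
        "description": "Search Jira issues by JQL or free text.",
        "parameters": {"query": "string", "max_results": "integer"},
    },
    {
        "name": "confluence_get_page",
        "description": "Fetch a Confluence page and return its body.",
        "parameters": {"page_id": "string"},
    },
]

# cl100k_base is an approximation, good enough for comparing baseline prompt weight.
enc = tiktoken.get_encoding("cl100k_base")
baseline = len(enc.encode(json.dumps(TOOL_SCHEMAS)))
print(f"~{baseline} tokens added to every conversation, used or not")
```

Run it against the full schema set your client sends versus a trimmed one, and the difference is the per-conversation saving from disabling unused servers.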
The third driver was model selection.
A lot of retrieval work does not require premium reasoning models. Searching Slack, finding Jira tickets, extracting summaries from Confluence: this is mostly retrieval and lightweight summarization.
Using Opus everywhere because it is the default gets expensive quickly:
| Model | Input | Output |
|---|---|---|
| Opus | ~$15 / 1M tokens | ~$75 / 1M tokens |
| Sonnet | ~$3 / 1M tokens | ~$15 / 1M tokens |
| Haiku | ~$0.25 / 1M tokens | ~$1.25 / 1M tokens |
The pricing gap becomes enormous at scale: Sonnet is roughly 12x more expensive than Haiku, while Opus is closer to 60x.
The practical approach is simple (sketched after the list):
- cheap models for retrieval,
- expensive models for reasoning,
- escalation only when necessary.
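A minimal sketch of that tiering. `call` stands in for whatever client library you use, and the model names are shorthand for the tiers in the table above, not exact API identifiers:

```python
# Sketch of tiered model selection with escalation; `call` is a placeholder client.
RETRIEVAL_MODEL = "haiku"    # cheap: search, extraction, light summarization
REASONING_MODEL = "sonnet"   # mid-tier: synthesis across sources
ESCALATION_MODEL = "opus"    # expensive: only when the cheaper tiers fail

def run_task(task: str, call, needs_reasoning: bool = False) -> str:
    model = REASONING_MODEL if needs_reasoning else RETRIEVAL_MODEL
    result = call(model=model, prompt=task)
    # Escalate only when the cheaper model explicitly signals it cannot answer,
    # instead of defaulting every task to the premium tier.
    if result.strip() == "ESCALATE":
        result = call(model=ESCALATION_MODEL, prompt=task)
    return result
```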
Other optimizations help too, briefly: avoiding full-document retrieval, trimming long conversations, caching repeated context, avoiding repeated fetches, and compressing retrieval output.
RTK-style compression was particularly useful for command-heavy workflows where bash output and logs can silently explode token usage.
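As an illustration of the idea (not RTK itself), a head/tail trim captures most of the value for long command output, since errors and summaries usually sit at the edges:

```python
def trim_output(text: str, head: int = 30, tail: int = 30, max_lines: int = 80) -> str:
    """Keep the first and last lines of long command output; drop the middle."""
    lines = text.splitlines()
    if len(lines) <= max_lines:
        return text
    omitted = len(lines) - head - tail
    return "\n".join(lines[:head] + [f"... [{omitted} lines omitted] ..."] + lines[-tail:])
```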
There are also projects like mcp-rtk trying to reduce MCP overhead further. I have not used them personally because the dominant cost drivers in my setup were elsewhere and the ecosystem is still early enough that operational maturity and security review matter.
But uncontrolled parallel retrieval was still the biggest cost driver by a large margin.
The interesting part is that the system was not behaving irrationally. Searching broadly feels like thoroughness. But in LLM systems breadth has a direct token cost. The real optimization problem is not “search harder.” It is “know where to look before you search.”
And importantly, every deployment ends up with different bottlenecks. Parallel retrieval dominated mine. Yours might be conversation length, retrieval strategy, oversized prompts, or model selection. So analyze your own usage first, then define cost-reduction actions relevant to your use case.