Inference for Agentic Workloads Is Different. Here's What That Means for Your Stack.

Who's going to win in inference? To answer that, it helps to look at what happened in databases.

Relational databases dominated for decades, right up until the workloads changed. Document stores, graph databases, time-series engines, vector databases: each emerged not because it beat relational databases outright, but because it was genuinely better suited to a specific class of problem. The question stopped being "which database wins?" and became "what is this workload actually shaped like?"

AI inference is heading the same way. Async inference (background agent inference, long-horizon agents, whatever you call it) is the first major split.

Synchronous inference is the model most engineers picture: a request comes in, the model responds, latency is everything. It's the right stack for chat, copilots, anything with a human reading the output.

Async inference is a different animal. Jobs are queued, batched, scheduled. The metrics that matter shift from p99 latency to throughput, cost per task, and performance across long agentic traces. The workloads look different too:

Background and coding agents
Long-running multi-step tasks
Multimodal processing
Data curation and labelling
Reporting and compliance pipelines
ETL

The infrastructure that's right for real-time chat is not the right infrastructure for a long-running agentic task. The "which platform wins" conversation will give way to a better question: what is this workload actually shaped like, and what does the stack need to do to serve it well?

What Makes Agentic Workloads Different

An agent doesn't make one request. It runs a trace: a directed sequence of model calls, tool executions, and reasoning steps where each step's output feeds the next.

Consider a coding agent fixing a bug. It reads the codebase, plans an approach, writes a patch, runs tests, interprets the output, revises the patch, runs tests again. Each step is a separate model call; together they form a single logical task. A straightforward bug might take 30 steps. A difficult one, 200.

A few properties of that shape matter for inference.

The context grows with every step

Each model call typically includes everything that came before: the original task, intermediate reasoning, tool outputs, and any other accumulated context. Prompt lengths that start at a few thousand tokens can grow to tens or hundreds of thousands over the course of a long trace. The infrastructure needs to handle this at the tail, not just the median.
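
A rough back-of-the-envelope sketch shows how quickly this adds up. It assumes a hypothetical trace whose prompt starts at base_tokens and grows by a fixed tokens_per_step on every call (real traces grow unevenly), and it compares total prefill work when every call recomputes the whole prompt against the case where the previous prefix is always served from cache. The function and parameter names are illustrative, not part of any particular serving stack.

```python
def prefill_tokens(steps: int, base_tokens: int, tokens_per_step: int) -> tuple[int, int]:
    """Return (tokens prefilled with no prefix reuse, tokens prefilled with full reuse)."""
    no_reuse = 0
    full_reuse = 0
    prompt_len = base_tokens
    cached_len = 0  # length of the prefix already held in cache
    for _ in range(steps):
        no_reuse += prompt_len                 # the whole prompt is recomputed
        full_reuse += prompt_len - cached_len  # only the new suffix is computed
        cached_len = prompt_len
        prompt_len += tokens_per_step          # assumed linear growth per step
    return no_reuse, full_reuse


if __name__ == "__main__":
    no_reuse, full_reuse = prefill_tokens(steps=200, base_tokens=4_000, tokens_per_step=500)
    print(no_reuse, full_reuse)  # 10750000 103500
```

Under these assumed numbers, a 200-step trace prefills roughly a hundred times more tokens without prefix reuse than with it, because the work without reuse grows quadratically in the number of steps.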

There are long idle periods between model calls

When an agent calls a tool (running a test suite, querying a database, fetching a file), the model is doing nothing. That gap might be two seconds or two minutes. If the serving system evicts the agent's cached context during that gap, the next model call pays full recompute cost. Across thousands of concurrent traces, this is expensive.

The call depth is unpredictable

An agent's trace length is determined at runtime by what the model decides to do, what the tools return, and how many retries are needed. Infrastructure that assumes a fixed request lifetime handles this poorly.

Failures need to be recoverable

A 200-step trace that fails at step 180 shouldn't require starting over. Memory from previous steps needs to be reused rather than re-calculated.
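
At the application level, one simple way to get this property is to checkpoint each step's output as it completes. The sketch below is a minimal illustration under that assumption: run_trace and the JSON checkpoint file are hypothetical, and it only covers step outputs, not the model's cached context, which is a serving-layer concern.

```python
import json
from pathlib import Path
from typing import Callable

# A step receives the outputs of all previous steps and returns its own output.
Step = Callable[[list[str]], str]


def run_trace(steps: list[Step], checkpoint: Path) -> list[str]:
    """Run steps in order, persisting outputs so a rerun resumes after the last completed step."""
    outputs: list[str] = json.loads(checkpoint.read_text()) if checkpoint.exists() else []
    # Skip the steps whose outputs were already checkpointed by an earlier run.
    for step in steps[len(outputs):]:
        outputs.append(step(outputs))
        checkpoint.write_text(json.dumps(outputs))  # persist after every completed step
    return outputs
```

If a step raises, calling run_trace again with the same checkpoint path picks up at the failed step rather than at step one.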

Where Standard Inference Infrastructure Falls Short

Requests are treated as independent

Standard schedulers see every incoming request as a fresh, unrelated job. For an agent, the 40th model call in a trace shares a long prefix with the previous 39 and belongs to the same logical task. A scheduler blind to that can't prioritize a nearly-complete trace, can't pin relevant cached state, can't coordinate a fan-out when a multi-agent system spawns subagents.

Configurations are static

Batch sizes, cache parameters, and routing weights are set at startup and held fixed. But agentic workloads aren't stationary. Context lengths drift as traces progress. Burst patterns shift when a job fans out. A configuration tuned for one traffic shape is wrong for another.

The cache is local to each worker

In a multi-worker deployment, each instance manages its own KV cache independently. Routing a request to a worker that doesn't hold the relevant prefix means recomputing it from scratch. With round-robin routing, this happens on most requests. The cache exists in theory; in practice its hit rate on agentic fleets hovers around 30–40%.

What Agentic Infrastructure Actually Needs

The trace as the unit of work

Rather than scheduling individual requests, inference for async AI needs to understand which requests belong to the same trace and make decisions at that level: pinning state, coordinating fan-outs, and prioritizing jobs close to completion.
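
As a rough illustration of scheduling at the trace level, the sketch below tags every request with a trace ID and serves the trace closest to completion first. Everything here is hypothetical: Request, TraceScheduler, and especially estimated_total_steps, which assumes the caller can supply a hint even though real trace lengths are only known at runtime.

```python
import heapq
import itertools
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Request:
    trace_id: str
    step: int                   # index of this model call within its trace
    estimated_total_steps: int  # hypothetical hint; real depth is unknown up front


@dataclass(order=True)
class _Entry:
    remaining_steps: int
    seq: int                    # arrival order, used to break ties
    request: Request = field(compare=False)


class TraceScheduler:
    """Toy scheduler that pops the request whose trace is nearest to finishing."""

    def __init__(self) -> None:
        self._heap: list[_Entry] = []
        self._seq = itertools.count()

    def submit(self, request: Request) -> None:
        remaining = max(request.estimated_total_steps - request.step, 0)
        heapq.heappush(self._heap, _Entry(remaining, next(self._seq), request))

    def next_request(self) -> Optional[Request]:
        return heapq.heappop(self._heap).request if self._heap else None
```

A request at step 38 of an estimated 40 is served before one at step 5 of 100, regardless of arrival order. A production scheduler would also need to guard against starving long traces.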

Context that survives idle periods

When an agent is waiting on a tool call, its KV cache shouldn't be evicted. A system that understands trace structure can offload that context to cheaper memory tiers during the gap and recall it before the trace resumes, instead of dropping it and forcing a full recompute.
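
The difference between dropping and offloading can be sketched with a toy two-tier store. This is a simplified model, not a description of any real serving system: ContextStore is hypothetical, it tracks only token counts rather than actual KV tensors, and it treats eviction during a tool call as an immediate drop, which is the worst case.

```python
class ContextStore:
    """Toy model of where a trace's cached context lives between model calls."""

    def __init__(self, offload_on_idle: bool) -> None:
        self.offload_on_idle = offload_on_idle
        self.gpu: dict[str, int] = {}   # trace_id -> cached tokens in fast memory
        self.host: dict[str, int] = {}  # trace_id -> cached tokens in a cheaper tier

    def on_tool_call(self, trace_id: str) -> None:
        """The trace goes idle: free fast memory, optionally keeping a copy in the cheaper tier."""
        cached = self.gpu.pop(trace_id, 0)
        if self.offload_on_idle and cached:
            self.host[trace_id] = cached

    def on_model_call(self, trace_id: str, prompt_tokens: int) -> int:
        """Return how many prompt tokens must be recomputed for this call."""
        # Use the fast-tier copy if present, otherwise recall from the cheaper tier.
        cached = self.gpu.pop(trace_id, 0) or self.host.pop(trace_id, 0)
        self.gpu[trace_id] = prompt_tokens
        return prompt_tokens - min(cached, prompt_tokens)
```

With offload_on_idle=True, a 4,500-token call that follows a 4,000-token call and a tool gap recomputes only the 500 new tokens; with it set to False, all 4,500 are recomputed. The sketch ignores transfer cost between tiers, which a real system would have to weigh against recompute cost.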

Cache-aware routing

The highest-leverage improvement in multi-worker agentic deployments is routing requests to workers that already hold the relevant cached state. This requires a fleet-wide view of what each worker holds, not just load-balancing against throughput.
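
A minimal sketch of the idea: send each request to the worker with the longest matching cached prefix, and fall back to load when no worker has a useful match. Worker, route, and max_load are hypothetical names, prompts are modeled as tuples of token IDs, and the linear prefix scan stands in for the block hashes or prefix trees a real router would more likely use.

```python
from dataclasses import dataclass, field

Prompt = tuple[int, ...]  # token IDs


def shared_prefix_len(a: Prompt, b: Prompt) -> int:
    """Number of leading tokens the two prompts have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


@dataclass
class Worker:
    name: str
    active_requests: int = 0
    cached_prompts: list[Prompt] = field(default_factory=list)

    def best_hit(self, prompt: Prompt) -> int:
        return max((shared_prefix_len(c, prompt) for c in self.cached_prompts), default=0)


def route(prompt: Prompt, workers: list[Worker], max_load: int = 8) -> Worker:
    """Pick the worker with the longest cached prefix; break ties by lowest load."""
    # Skip saturated workers unless every worker is saturated.
    candidates = [w for w in workers if w.active_requests < max_load] or workers
    chosen = max(candidates, key=lambda w: (w.best_hit(prompt), -w.active_requests))
    chosen.active_requests += 1  # a real router would decrement this on completion
    chosen.cached_prompts.append(tuple(prompt))
    return chosen
```

Successive calls from the same trace share a growing prefix, so they keep landing on the same worker until it saturates, while unrelated requests drift toward less loaded workers.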

Continuous adaptation

Inference needs to adapt because the workload mix changes, trace lengths vary, and burst patterns shift. The control loop needs to observe the fleet in real time and re-allocate scheduling, routing, and memory placement as conditions change, without restarts or manual intervention.
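
To make the idea concrete, here is a deliberately small sketch of one such loop: a controller that re-derives the batch size from recently observed prompt lengths instead of fixing it at startup. BatchSizeController, token_budget, and the p95 heuristic are all assumptions chosen for illustration; a real control loop would watch many more signals.

```python
import math
from collections import deque


class BatchSizeController:
    """Toy controller: size batches so the p95 prompt length fits a per-batch token budget."""

    def __init__(self, token_budget: int, window: int = 100,
                 min_batch: int = 1, max_batch: int = 64) -> None:
        self.token_budget = token_budget
        self.min_batch = min_batch
        self.max_batch = max_batch
        self.lengths: deque[int] = deque(maxlen=window)  # sliding window of prompt lengths

    def observe(self, prompt_tokens: int) -> None:
        self.lengths.append(prompt_tokens)

    def batch_size(self) -> int:
        if not self.lengths:
            return self.max_batch
        ordered = sorted(self.lengths)
        # Nearest-rank 95th percentile of the current window.
        p95 = ordered[math.ceil(0.95 * len(ordered)) - 1]
        fit = self.token_budget // max(p95, 1)
        return max(self.min_batch, min(self.max_batch, fit))
```

As traces lengthen and the window fills with larger prompts, batch_size() shrinks on its own; when short prompts return, it grows back, with no restart involved.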

The Broader Shift

Agentic inference infrastructure is still young. The gap between what production agentic workloads need and what general-purpose inference provides is large. Teams running agents at scale are either building custom solutions on top of existing frameworks or leaving performance and cost on the table.

At agent scale, the serving layer will need to continuously orchestrate compute, memory, and bandwidth across a workload that changes shape constantly. The systems that work will be the ones built around that reality. Document stores and time-series engines weren't better databases; they were the right tool for workloads that relational databases weren't shaped for.

Adaptive inference

What's needed is a dynamic inference platform built specifically for this class of problem, one that supports large-scale async inference for background agents, long-running tasks, and high-volume batch workloads. That is what Impala is built to be.

Under the hood, Impala operates as a compute fabric: an invisible layer that stitches models and machines together across heterogeneous GPU infrastructure. Any model, any hardware, any workload, unified and abstracted. Each workload has different token patterns, prompt shapes, and memory requirements; Impala's SLA-driven orchestration observes those patterns in real time and automatically adapts inference execution to match.

The result is inference that is fast, adaptive, and up to 10x cheaper than inference running on general-purpose stacks.

A few things that distinguish Impala:

Bring your own cloud, or don't: Impala deploys fully in your cloud or VPC, so no data leaves your environment and no rate limits are imposed by a shared API tier. You can also use Impala as a serverless option.

Serverless-like simplicity: Dedicated endpoints deployed on your infrastructure, without the operational burden of managing it yourself.

Built for scale: Impala is designed for the teams that don't do small. It offers limitless capacity, support for the world's largest models, and the highest throughput and lowest cost per token available.

For teams running background agents, data pipelines, or multimodal processing at any meaningful scale, the gap between a general-purpose inference stack and one built for async workloads is larger than it looks. Impala closes it at the infrastructure level.
