What is async agent inference?
Async agent inference is a fundamentally different mode of running AI from the chat interfaces most people picture.
Instead of a human waiting for a chat response, agents run autonomously in the background. They use tools, branch on results, spawn subagents, and loop through multi-step tasks until a goal is reached.
The result is a very different kind of inference workload, one that both strains and breaks existing inference setups. Instead of optimizing for interactivity, the measure of success for async agent inference is how much useful work gets done per dollar spent.
Why is a vocabulary (or glossary) needed?
As async agent inference goes mainstream, there needs to be a common way of discussing the challenges and the solutions. Terms like "agent," "step," "trace," and "action" get borrowed from adjacent fields and are used loosely: sometimes they mean the software artifact, sometimes the workload category, sometimes a runtime metric. That ambiguity is harmless until it slips into system design, SLO definitions, or cross-team communication. That’s why we set out to create a shared vocabulary, so we can communicate clearly about scheduling, routing, memory, and cost at scale.
Drumroll… the async agent inference glossary and its terms
The category and the systems in it
Interactive AI applications
The human-facing category of AI: chat, copilots, anywhere a person reads tokens as they stream. Measured by interactivity SLO metrics.
Async AI
The category of AI workloads that run without a human waiting on each token: background coding agents, research agents, multi-step automations. The optimization target is tasks per dollar. Used in this post to name the industry, never the software.
Agent / agentic system / harness
The actual running software: the loop that calls the model, executes tools, branches, retries, fans out, and decides when it is done. We use these terms for the artifact, never the category.
Subagent / fan-out
A child agent spawned by a parent, typically grounded in shared parent context and often running in parallel with its siblings.
The unit of work
Trace (a.k.a. rollout)
One full agent run: an ordered sequence of model calls, reasoning blocks, and tool executions where each step's output appends to the next step's input. The right unit for scheduling, routing, and memory decisions in async AI.
Step (a.k.a. turn)
A single round inside a trace: one model generation followed (typically) by one tool call.
Tool-call gap (tool-call latency)
The idle interval between issuing a tool call and receiving its result. Often dominates wall-clock time on long-horizon traces.
Prefix
The shared, growing input every step re-reads — system prompt, tool schemas, accumulated trace history. The single most read-heavy object in the stack.
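To make the relationships between trace, step, tool-call gap, and prefix concrete, here is a minimal Python sketch of a trace as an ordered list of steps with a growing prefix. It is illustrative only: the Step and Trace classes, and the model and tools objects they call, are stand-ins for whatever an agent harness actually uses, not any particular framework's API.

    import time
    from dataclasses import dataclass, field

    @dataclass
    class Step:
        """One round of a trace: a model generation followed (typically) by one tool call."""
        generation: str               # tokens the model produced this step
        tool_call: str | None         # the tool invocation it decided on, if any
        tool_result: str | None = None
        tool_call_gap_s: float = 0.0  # idle time spent waiting on the tool

    @dataclass
    class Trace:
        """One full agent run: each step's output appends to the next step's input."""
        system_prompt: str
        tool_schemas: str
        steps: list[Step] = field(default_factory=list)

        def prefix(self) -> str:
            """The shared, growing input every step re-reads."""
            history = "".join(s.generation + (s.tool_result or "") for s in self.steps)
            return self.system_prompt + self.tool_schemas + history

        def run_step(self, model, tools) -> Step:
            # model and tools are placeholder interfaces for this sketch.
            generation, tool_call = model.generate(self.prefix())
            step = Step(generation, tool_call)
            if tool_call is not None:
                started = time.monotonic()
                step.tool_result = tools.execute(tool_call)        # agent sits idle here
                step.tool_call_gap_s = time.monotonic() - started  # the tool-call gap
            self.steps.append(step)
            return step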
The metrics
Tasks per dollar (cost per completed task, $/task)
The economic SLO of async AI. Decomposes into economic value per token times throughput, normalized by compute spend.
Economic value per token
The dollar value of work each generated token actually advances. Pushed up by the right model and a runtime that does not waste tokens on recompute.
Throughput (tokens per second, fleet-wide)
How fast the cluster turns out work at scale. The fleet-side half of the economic SLO: multiplied by economic value per token, it is the runtime's contribution to tasks per dollar.
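As a back-of-the-envelope illustration of how these three metrics compose, here is a tiny calculation. Every number below is made up for the example; none of them come from a real fleet.

    # Illustrative figures only.
    fleet_throughput_tok_s = 50_000    # tokens per second across the fleet
    value_per_token_usd    = 0.00002   # economic value each generated token advances
    fleet_cost_usd_per_s   = 0.50      # compute spend per second

    value_per_second = fleet_throughput_tok_s * value_per_token_usd  # $/s of useful work done
    value_per_dollar = value_per_second / fleet_cost_usd_per_s       # work out per dollar in

    # If an average completed task is worth, say, $0.40, that maps back to tasks per dollar:
    avg_task_value_usd = 0.40
    tasks_per_dollar = value_per_dollar / avg_task_value_usd
    print(f"{value_per_dollar:.2f} dollars of work per dollar spent, "
          f"~{tasks_per_dollar:.1f} tasks per dollar")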
Job Completion Time (JCT) / Task Completion Time
Wall-clock time from agent task submission to completion. The user-felt SLO of async AI.
Interactivity SLO metrics
The bundled latency targets that interactive AI applications are measured against — TTFT, TPOT, ITL, and end-to-end per-request latency. Optimized for the moment a person reads each token. Still real for any user-facing surface; no longer load-bearing once a human is no longer reading.
TTFT / TPOT / ITL
The named members of the interactivity SLO metrics.
TTFT: time to first token. TPOT: time per output token. ITL: inter-token latency (often used interchangeably with TPOT).
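For concreteness, one conventional way to compute these from the timestamps at which output tokens arrive looks roughly like the sketch below. The function and field names are illustrative, and ITL is shown here as a per-gap tail statistic even though it is often used interchangeably with TPOT.

    def interactivity_metrics(request_sent_at: float, token_times: list[float]) -> dict:
        """Compute TTFT, TPOT, ITL, and end-to-end latency from token arrival timestamps.

        request_sent_at: wall-clock time the request was issued.
        token_times: wall-clock time each output token arrived, in order.
        """
        ttft = token_times[0] - request_sent_at                      # time to first token
        gaps = [b - a for a, b in zip(token_times, token_times[1:])]
        tpot = sum(gaps) / len(gaps) if gaps else 0.0                # avg time per output token
        itl_p99 = sorted(gaps)[int(0.99 * (len(gaps) - 1))] if gaps else 0.0  # tail inter-token gap
        e2e = token_times[-1] - request_sent_at                      # end-to-end latency
        return {"ttft": ttft, "tpot": tpot, "itl_p99": itl_p99, "e2e": e2e}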
SLO
Service-level objective: the latency or throughput target the runtime is paid to hit.
Runtime primitives
KV cache
The intermediate per-request state the model carries across tokens. Large, read-dominated, and the central object of any prefix-reuse strategy.
Prefix caching / prefix-aware routing
Reusing KV state across steps and traces that share an input prefix; routing requests to workers that already hold the relevant KV.
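A toy sketch of the routing half of the idea, assuming block-hashed prefixes and a simple longest-match score. Real engines differ in block size, hashing scheme, and how they trade cache hits off against worker load; everything named here is illustrative.

    import hashlib

    BLOCK_TOKENS = 256  # illustrative KV block size

    def block_hashes(prefix_tokens: list[int]) -> list[str]:
        """Hash the prefix block by block; a matching hash means that block's KV is reusable."""
        hashes, rolling = [], hashlib.sha256()
        full = len(prefix_tokens) - len(prefix_tokens) % BLOCK_TOKENS
        for i in range(0, full, BLOCK_TOKENS):
            rolling.update(str(prefix_tokens[i:i + BLOCK_TOKENS]).encode())
            hashes.append(rolling.hexdigest())  # each hash covers everything up to this block
        return hashes

    def pick_worker(prefix_tokens: list[int], workers: dict[str, set[str]]) -> str:
        """Route to the worker already holding the longest usable prefix.

        workers maps worker id -> set of KV block hashes currently resident on it.
        """
        wanted = block_hashes(prefix_tokens)

        def cached_blocks(held: set[str]) -> int:
            n = 0
            for h in wanted:          # prefix reuse stops at the first missing block
                if h not in held:
                    break
                n += 1
            return n

        return max(workers, key=lambda w: cached_blocks(workers[w]))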
Tiered memory (HBM → pinned DRAM → NVMe → RDMA pool)
The memory stack the runtime works with. Different KV blocks live on different tiers; the engine moves them between tiers on the timescale of a single tool-call gap.
KV offload
Demoting KV blocks from a faster tier to a slower one when the faster tier no longer has room to hold them.
KV migration
Moving a live trace's KV state from one worker to another in flight, when the routing score changes. The primitive that turns observation into adaptation.
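A rough sketch of how offload and migration might play out around a single tool-call gap: park the KV on a slower tier while the agent waits, then either promote it back locally or move it to a better-placed worker when the result lands. The memory and router objects and their methods are hypothetical placeholders for whatever interfaces a real runtime exposes.

    TIERS = ["hbm", "dram", "nvme", "rdma_pool"]  # fast -> slow

    def park_during_tool_call(kv_blocks, memory, expected_gap_s: float) -> str:
        """KV offload: demote a paused trace's KV to the slowest tier it can still be
        promoted back from before the tool result is expected."""
        for tier in reversed(TIERS):  # try the cheapest (slowest) tier first
            if memory.promote_latency_s(tier) < expected_gap_s and memory.has_room(tier, kv_blocks):
                memory.demote(kv_blocks, to=tier)
                return tier
        return "hbm"                  # no room anywhere slower; keep it hot

    def resume_after_tool_call(trace, current_worker, router, memory) -> str:
        """KV migration: when the tool result lands, re-score placement and move the
        live KV to another worker if that now beats promoting it locally."""
        best = router.best_worker(trace.prefix())
        if best != current_worker and router.migration_wins(trace, current_worker, best):
            memory.migrate(trace.kv_blocks, src=current_worker, dst=best)
            return best
        memory.promote(trace.kv_blocks, to="hbm")
        return current_worker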
Continuous batching / chunked prefill / disaggregated prefill-decode
Standard inference primitives that increase per-worker efficiency. Necessary, not sufficient, for async AI.
The shape of the problem
Dynamic inference
Inference where scheduling, routing, and memory placement are re-decided continuously against live signals from the fleet, in flight, with no operator in the loop. The alternative is decisions made once at deploy time and held until the next deployment.
Continuous resource allocation
The frame this post takes on serving: at cluster scale, the cluster is one machine, and inference is the continuous allocation of compute, bandwidth, and memory tiers across a workload that won't sit still.
That’s all, folks
These are the vocabulary terms I find most useful for the async agent inference discussion. Did I miss any? I’d love to know.
In any case, language shapes what a team can build. The moment an engineer can distinguish a trace from a step, a tool-call gap from TPOT, or KV offload from KV migration, they unlock a whole class of optimizations that were previously invisible. Not because the techniques were unavailable, but because there was no clean way to name the problem.
This glossary is a starting point, not a final word. The field is moving fast, and the terminology will sharpen as the systems do. What matters is that when your team talks about async AI, everyone is solving the same problem.
You can also drop us a line at Impala and we’ll show you what we’re up to.

