How to Cut Inference Costs for Background AI Agents and Batch Jobs

If your team is running AI in production, there's a good chance the majority of your inference spend isn't coming from users chatting with a copilot. It's coming from background agents: coding pipelines, data labelling workflows, and multimodal media processing jobs. These agents run in loops, call tools, branch, retry, and fan out to subagents.

This shift matters because the way you optimize inference for async, background workloads is fundamentally different from how you optimize for interactive chat.

This guide walks through what you can practically do to improve throughput and lower cost per task for async AI workloads. It then addresses whether you would be better off doing this with an async (or background) inference platform such as Impala.

Know Your Real Metric

Before optimizing anything, it's worth being clear on what you're actually trying to improve.

For interactive applications (chat, copilots, anything with a human reading tokens as they stream), the KPIs are interactivity metrics: time to first token (TTFT), time per output token (TPOT), and inter-token latency. Shaving 200ms off TTFT genuinely matters when a person is watching.

For async workloads, those targets are largely irrelevant. Nobody watches a data labelling job, a coding agent, or any other background agent stream tokens in real time. The metric that matters is tasks completed per dollar. It combines two things you want simultaneously: lower job completion time and higher fleet throughput. This means faster and cheaper, not faster or cheaper.
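As a rough illustration of the metric, the sketch below computes tasks per dollar from fleet cost. The function name, task counts, GPU-hours, and hourly price are all hypothetical, chosen only to show how finishing the same work in fewer GPU-hours moves the number.

```python
def tasks_per_dollar(tasks_completed: int, gpu_hours: float,
                     usd_per_gpu_hour: float) -> float:
    """Tasks completed per dollar of GPU spend."""
    cost = gpu_hours * usd_per_gpu_hour
    if cost <= 0:
        raise ValueError("cost must be positive")
    return tasks_completed / cost


# Hypothetical numbers: the same 12,000 tasks, before and after tuning.
baseline = tasks_per_dollar(tasks_completed=12_000, gpu_hours=96, usd_per_gpu_hour=2.50)
tuned = tasks_per_dollar(tasks_completed=12_000, gpu_hours=60, usd_per_gpu_hour=2.50)
print(f"baseline: {baseline:.1f} tasks/$, tuned: {tuned:.1f} tasks/$")
```

Tracking this one number per workload makes it easier to tell whether a change that improves latency also improves cost, or merely shifts it.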

This is the metric the rest of this guide optimizes against.

Understand What Async Workloads Actually Look Like

Background agent jobs have a very different shape from chat requests, and that shape determines where your bottlenecks are.

They're long-horizon and idle-heavy. A coding agent or data labelling pipeline doesn't make one model call; it makes dozens or hundreds, with tool calls in between. Production systems report a mean of around 20 tool steps per trace, with tails exceeding 200. Tool-call latencies can stretch into hundreds of seconds. That means the model is frequently idle, waiting while a tool executes, a file is read, or an API responds.

They're read-heavy. Coding agent traces routinely show prompt-to-completion ratios around 70:1. The system prompt, tool schemas, and accumulated context dominate input mass. The model is re-reading the same large prefix on almost every step. This has major implications for how you think about caching (more on this below).

They produce tokens in parallel. Multimodal media processing, large-scale data labelling, and multi-agent coding pipelines all fan out across many concurrent jobs. The fleet is handling many traces simultaneously, not serving one request at a time.

If your inference stack was designed around independent chat requests, it is almost certainly mismatched to this workload.

Practical Optimization Levers

1. Prefix Caching: Your Highest-Leverage Move

Given that async workloads re-read the same large context on every step, prefix caching (also called prompt caching) is the single most impactful optimization available.

The idea is simple: rather than recomputing the KV (key-value) attention cache for the shared prefix of every request, you compute it once and reuse it. For a 50K-token system prompt or document context that's included in every step of a 100-step trace, this is the difference between processing 5 million prefix tokens (50K × 100) and processing 50K tokens once.
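The arithmetic can be made concrete with a small sketch. It assumes an idealized cache in which every token seen on an earlier step of the trace is a hit; the function name and the 500-token per-step growth figure are illustrative assumptions, not measurements.

```python
def prefill_tokens(prefix_tokens: int, steps: int,
                   new_tokens_per_step: int = 0, cached: bool = False) -> int:
    """Prompt tokens the model has to process across one trace.

    Step i sends the shared prefix plus i * new_tokens_per_step tokens of
    accumulated context. With an ideal cache, only tokens that were not
    seen on an earlier step are processed.
    """
    total = 0
    seen = 0  # tokens already in the cache (only used when cached=True)
    for step in range(steps):
        prompt = prefix_tokens + step * new_tokens_per_step
        total += (prompt - seen) if cached else prompt
        if cached:
            seen = prompt
    return total


# The example from the text: a 50K-token prefix over a 100-step trace.
print(prefill_tokens(50_000, 100))               # 5000000
print(prefill_tokens(50_000, 100, cached=True))  # 50000

# Same trace, assuming the context also grows by 500 tokens per step.
print(prefill_tokens(50_000, 100, 500))               # 7475000
print(prefill_tokens(50_000, 100, 500, cached=True))  # 99500
```

Real hit rates will be lower than this ideal, because caches are finite and requests do not always land on the worker that holds the prefix.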

Most modern inference frameworks support prefix caching. The critical thing is to structure your prompts so that stable, shared content (system prompts, tool schemas, document context) sits at the front and anything that changes per step comes last, which keeps the cache hit rate as high as possible. Even a rough implementation can cut inference costs by 30–50% on read-heavy workloads.
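A minimal sketch of what "stable content at the front" means in practice. The prompt strings, function names, and timestamp format are hypothetical, and the comparison is done on characters for simplicity; real caches typically match on token blocks, so treat the numbers as directional only.

```python
import os

# Hypothetical stable content shared by every step of a trace.
SYSTEM_PROMPT = "You are a labelling agent. Follow the schema exactly.\n"
TOOL_SCHEMAS = '{"tools": [{"name": "read_file"}, {"name": "run_tests"}]}\n'


def cache_friendly_prompt(task: str, history: list[str]) -> str:
    # Stable content first, then the task, then append-only history.
    return SYSTEM_PROMPT + TOOL_SCHEMAS + task + "".join(history)


def cache_hostile_prompt(task: str, history: list[str], timestamp: str) -> str:
    # A per-request value at the very front changes the prefix on every call.
    return f"[{timestamp}] " + SYSTEM_PROMPT + TOOL_SCHEMAS + task + "".join(history)


def shared_prefix_chars(a: str, b: str) -> int:
    """Length of the longest common prefix of two prompts, in characters."""
    return len(os.path.commonprefix([a, b]))


task = "Label item 17.\n"
tool_output = ["tool result: 3 candidate labels\n"]

step1 = cache_friendly_prompt(task, [])
step2 = cache_friendly_prompt(task, tool_output)
friendly_shared = shared_prefix_chars(step1, step2)

bad1 = cache_hostile_prompt(task, [], "2024-01-01T00:00:00")
bad2 = cache_hostile_prompt(task, tool_output, "2024-01-01T00:00:05")
hostile_shared = shared_prefix_chars(bad1, bad2)

print(f"friendly layout shares {friendly_shared} of {len(step1)} chars")
print(f"hostile layout shares {hostile_shared} of {len(bad1)} chars")
```

With the friendly layout, the whole of step 1 is a prefix of step 2, so everything already processed is reusable. With the hostile layout, the prompts diverge inside the timestamp and nothing after it can be reused.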

2. Continuous Batching

If your inference server is processing requests sequentially, you're leaving GPU utilization on the table. Continuous batching (also called in-flight batching or iteration-level scheduling) keeps the GPU fed by replacing completed sequences with new ones at every decoding step, rather than waiting for an entire batch to finish before starting the next.

For async workloads with many concurrent jobs, this is essential infrastructure. The difference between naive sequential processing and a well-tuned continuous batching setup can be 5–10x in effective throughput at the same hardware cost. Most production-grade serving frameworks such as vLLM support this out of the box. If you're not using one, you should.
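To see why refilling slots matters, here is a toy simulation that counts decode steps under two policies. It is a sketch under strong assumptions: every request needs a known number of output tokens, each step produces one token per active slot, and prefill cost is ignored. The function names and request lengths are made up for illustration and are not how any particular serving framework is implemented.

```python
from collections import deque


def static_batching_steps(lengths: list[int], slots: int) -> int:
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), slots):
        steps += max(lengths[i:i + slots])
    return steps


def continuous_batching_steps(lengths: list[int], slots: int) -> int:
    """A freed slot is refilled from the queue before the next step."""
    queue = deque(lengths)
    active: list[int] = []  # remaining tokens for each in-flight request
    steps = 0
    while queue or active:
        while queue and len(active) < slots:
            active.append(queue.popleft())
        steps += 1
        # Every active request emits one token; drop the ones that finished.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


# Hypothetical mix: a few long generations among many short ones.
lengths = [50, 5, 5, 5] * 4
print("static batching steps:    ", static_batching_steps(lengths, slots=4))
print("continuous batching steps:", continuous_batching_steps(lengths, slots=4))
```

The gap between the two counts comes from slots that sit empty under static batching while the longest request in each batch finishes.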

3. Quantization

Quantization reduces the numerical precision of model weights, from 32-bit floats down to 16-bit, 8-bit, or 4-bit. This shrinks memory footprint, allows more concurrent requests on the same hardware, and reduces compute cost per token.
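A back-of-the-envelope sketch of the memory effect for model weights. The 70-billion-parameter model is a hypothetical example, and the calculation ignores quantization metadata (scales, zero points), the KV cache, and activations, all of which add to the real footprint.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


# Hypothetical 70B-parameter model at different weight precisions.
for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gb(70, bits):.0f} GB")
```

Memory freed from weights can go to KV cache instead, which is what allows more concurrent requests on the same hardware.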

The accuracy trade-off is real but often overstated. Quantization doesn't degrade quality uniformly across tasks. For most production use cases, particularly structured outputs like labelling, classification, and extraction, the quality impact is small relative to the gains in throughput and cost. Most workloads don't require maximum model accuracy, and quantization rarely compromises the quality bar they do require.

FP8 and methods like TurboQuant offer a practical middle ground: meaningful memory and compute savings with quality that holds up well across a broad range of tasks. Benchmark against your specific task distribution before assuming a trade-off exists that matters to you.

4. Right-Size Your Models With Tiered Routing

The most capable model in your fleet should not be handling every task. For async workloads that fan out into many subtasks, such as extracting metadata from images, classifying text, or generating structured outputs from templates, a smaller model can be entirely sufficient.

Build a tiered routing layer: classify incoming tasks by complexity and route straightforward subtasks to a smaller, cheaper model. Even a rough classifier can push 60–70% of your volume to a cheaper tier, with the larger model reserved for tasks that genuinely need it.
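A rule-based router is often enough to start with. The sketch below is one possible shape for it; the model identifiers, task kinds, and token threshold are all hypothetical and would need tuning against your own quality evaluations.

```python
from dataclasses import dataclass

# Hypothetical model identifiers for the two tiers.
SMALL_MODEL = "small-model"
LARGE_MODEL = "large-model"

# Task kinds assumed (for this sketch) to be safe for the smaller tier.
SIMPLE_TASK_KINDS = {"classify", "extract", "template_fill"}


@dataclass
class Task:
    kind: str
    prompt_tokens: int
    needs_tools: bool = False


def route(task: Task, max_small_prompt_tokens: int = 8_000) -> str:
    """Send simple, short, tool-free tasks to the cheaper tier."""
    if (task.kind in SIMPLE_TASK_KINDS
            and not task.needs_tools
            and task.prompt_tokens <= max_small_prompt_tokens):
        return SMALL_MODEL
    return LARGE_MODEL


tasks = [
    Task("classify", 1_200),
    Task("extract", 3_000),
    Task("code_review", 20_000, needs_tools=True),
    Task("classify", 40_000),
]
for t in tasks:
    print(t.kind, t.prompt_tokens, "->", route(t))
```

Logging the routing decision next to a sampled quality score makes it possible to widen or narrow the cheaper tier based on evidence rather than guesswork.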

For multimodal media processing pipelines in particular, different stages of the pipeline often have very different model requirements. Video segmentation, caption generation, and quality scoring don't all need the same model.

5. Prompt Efficiency

Every token in your prompt costs money. For background jobs running at scale, prompt bloat compounds fast.

Audit your system prompts and tool schemas for unnecessary verbosity. Instruct models to respond in structured formats (JSON with defined schemas, for example) to constrain output length. If your traces include accumulated conversation history, consider summarization strategies to compress older context rather than passing the full raw history indefinitely.
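One way to compress older context is to keep the most recent turns verbatim and replace everything before them with a summary. The sketch below assumes a `summarize` callable that you supply (for example, a call to a small model); the placeholder used here and the whitespace-based token estimate are stand-ins for illustration only.

```python
from typing import Callable, List


def approx_tokens(text: str) -> int:
    """Very rough token estimate: whitespace-separated words."""
    return len(text.split())


def compact_history(turns: List[str], keep_last: int,
                    summarize: Callable[[List[str]], str]) -> List[str]:
    """Replace all but the last `keep_last` turns with a single summary turn."""
    split = max(len(turns) - keep_last, 0)
    older, recent = turns[:split], turns[split:]
    if not older:
        return list(turns)
    return [summarize(older)] + recent


def placeholder_summary(older: List[str]) -> str:
    # Stand-in for a real summarizer.
    return f"[summary of {len(older)} earlier turns]"


history = [f"turn {i}: " + "detail " * 200 for i in range(30)]
compacted = compact_history(history, keep_last=5, summarize=placeholder_summary)

before = sum(approx_tokens(t) for t in history)
after = sum(approx_tokens(t) for t in compacted)
print(f"approx tokens before: {before}, after: {after}")
```

Rewriting earlier turns changes the prompt prefix, so any cached prefix past that point no longer matches. Compacting occasionally, in large chunks, limits how often that happens.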

For multimodal workloads, image and video token counts can be very large. Strategies like dynamic resolution scaling can significantly reduce per-task cost.

This Gets Hard at Scale

The optimizations above are real and worth implementing. But there's an honest complication: at cluster scale, with many concurrent agent traces running in parallel, the interactions between these levers become complex in ways that are genuinely difficult to manage.

Prefix cache hit rates depend on routing decisions. If a trace's requests get distributed across workers randomly, the cache is cold on every step.
Batching efficiency depends on workload mix. A scheduler that works well for short summarization tasks may create head-of-line blocking when a long reasoning generation enters the queue.
Memory management for long-running traces with tool-call gaps requires actively managing the KV cache across idle periods, not just evicting on LRU.
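The first of these interactions, routing versus cache hit rate, can be shown with a toy model. The sketch below compares round-robin routing with a simple prefix-affinity router that hashes the shared prefix to pick a worker. It assumes unlimited cache per worker and ignores load balancing; a production router would need to handle both, and all names here are illustrative.

```python
import hashlib
import itertools
from typing import Callable, List


def prefix_key(prefix: str) -> str:
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()


def affinity_router(prefix: str, num_workers: int) -> int:
    """Requests that share a prefix always land on the same worker."""
    return int(prefix_key(prefix), 16) % num_workers


def make_round_robin_router() -> Callable[[str, int], int]:
    counter = itertools.count()

    def router(prefix: str, num_workers: int) -> int:
        return next(counter) % num_workers

    return router


def count_cache_hits(prefixes: List[str], num_workers: int,
                     router: Callable[[str, int], int]) -> int:
    """Count requests whose prefix is already cached on the chosen worker."""
    cached = [set() for _ in range(num_workers)]
    hits = 0
    for prefix in prefixes:
        worker = router(prefix, num_workers)
        key = prefix_key(prefix)
        if key in cached[worker]:
            hits += 1
        cached[worker].add(key)
    return hits


# Three interleaved traces, five steps each, on four workers.
requests = ["trace-a prefix", "trace-b prefix", "trace-c prefix"] * 5
affinity_hits = count_cache_hits(requests, 4, affinity_router)
round_robin_hits = count_cache_hits(requests, 4, make_round_robin_router())
print(f"affinity: {affinity_hits}/{len(requests)} hits")
print(f"round-robin: {round_robin_hits}/{len(requests)} hits")
```

Plain modulo hashing reshuffles assignments whenever the worker count changes and can overload a single worker with one hot prefix, which is part of why this gets hard at scale.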

At sufficient scale, inference stops being a request-response service and becomes a continuous resource allocation problem across a distributed compute fabric.

How Impala Approaches This

Impala is built specifically for async AI workloads. These are the background agents, batch pipelines, and multi-agent systems that now account for the majority of enterprise inference spend.

The core insight behind Impala's architecture is that the standard inference stack was designed around independent chat requests, and that design is fundamentally mismatched to agent traces. Impala's runtime treats the trace, not the individual request, as the unit of work.

In practice, this means several things. The scheduler reconstructs trace identity from signals it already has, such as prefix-hash overlap, request-arrival cadence, and fan-out structure, and makes scheduling decisions at the trace level. It can pin a prefix whose resume is due in seconds, allocate fan-outs from a shared parent context, and rank a long reasoning generation differently from a short summarization. This alone can dramatically reduce job completion time by clearing the head-of-line blocking that causes traces to stall.

Routing is cache-aware. Given a 70:1 read-to-write ratio, routing a request to a cold worker means a full recompute of tens of thousands of tokens of prefix. Impala routes against a fleet-wide KV index so cache hit rates are consistently high rather than the 30–40% typical of round-robin routing.

Memory management treats KV cache as a hierarchy, not a budget. System prompts get effectively permanent retention. A parent context with active subagents reading it stays pinned until they drain. During tool-call gaps, blocks are demoted down the memory tier stack (from HBM to DRAM to NVMe) and recalled in time for the trace to resume, without requiring a full recompute. Shared prefixes are deduplicated fleet-wide so a common parent context is held once rather than replicated across every worker.
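To make the idea of a KV-cache hierarchy concrete, here is a generic sketch of a tier-placement rule. It is not Impala's actual policy, which the text does not specify. The `KVBlock` fields, the `choose_tier` function, and the time thresholds are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class KVBlock:
    trace_id: str
    is_system_prompt: bool = False
    active_readers: int = 0            # subagents still reading this context
    seconds_until_resume: float = 0.0  # estimated remaining tool-call gap


def choose_tier(block: KVBlock, hbm_window_s: float = 5.0,
                dram_window_s: float = 120.0) -> str:
    """Pick a memory tier for a block based on when it is next needed."""
    # Shared system prompts and contexts with active readers stay pinned.
    if block.is_system_prompt or block.active_readers > 0:
        return "HBM"
    # Blocks due to resume soon stay in GPU memory.
    if block.seconds_until_resume <= hbm_window_s:
        return "HBM"
    # Medium gaps are demoted to host memory, long gaps to disk.
    if block.seconds_until_resume <= dram_window_s:
        return "DRAM"
    return "NVMe"


blocks = [
    KVBlock("sys", is_system_prompt=True, seconds_until_resume=600),
    KVBlock("parent", active_readers=3, seconds_until_resume=300),
    KVBlock("t1", seconds_until_resume=2),
    KVBlock("t2", seconds_until_resume=45),
    KVBlock("t3", seconds_until_resume=900),
]
for b in blocks:
    print(b.trace_id, "->", choose_tier(b))
```

The hard part in practice is estimating when a trace will resume and recalling its blocks early enough, which a static rule like this one does not address.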

Critically, all of these decisions are made continuously against live fleet state, not against a static configuration. As workload mix shifts, context-length distributions drift, and burst structure changes, the runtime adapts in flight.

The result is an inference stack where faster and cheaper compound rather than trade off: lower job completion time and higher fleet throughput, at the same time, which is the contract that async AI actually requires.

If your team is running background agents, data labelling pipelines, or multimodal media processing at any meaningful scale, the gap between a general-purpose inference stack and one built for async workloads is likely larger than you'd expect. Impala's approach is to close that gap at the runtime level, so your engineering team doesn't have to.