What has changed

The first AI applications to gain widespread adoption were chat-centric. The primary metric was latency: users shouldn't wait too long for responses. The resulting KPIs were time to first token (TTFT) and time per output token (TPOT).
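For concreteness, here is a minimal sketch of how those two KPIs are typically computed from per-request timestamps (field names and values are illustrative):

```python
# Minimal sketch: computing both chat-serving KPIs from per-request
# timestamps (field names and values are illustrative).
from dataclasses import dataclass

@dataclass
class RequestTrace:
    sent_at: float         # request leaves the client (seconds)
    first_token_at: float  # first streamed token arrives
    done_at: float         # final token arrives
    output_tokens: int     # tokens generated in total

def ttft(t: RequestTrace) -> float:
    """Time to first token: how long the user waits for anything at all."""
    return t.first_token_at - t.sent_at

def tpot(t: RequestTrace) -> float:
    """Time per output token: average gap between streamed tokens."""
    return (t.done_at - t.first_token_at) / max(t.output_tokens - 1, 1)

t = RequestTrace(sent_at=0.0, first_token_at=0.42, done_at=6.1, output_tokens=180)
print(f"TTFT={ttft(t):.2f}s  TPOT={tpot(t) * 1000:.1f}ms/token")
# TTFT=0.42s  TPOT=31.7ms/token
```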

The inference infrastructure built to support that was static, stationary, and configuration-based, optimized to reduce latency at acceptable cost.

This made sense when AI was synonymous with chat. But the next wave of enterprise AI is different: it is built on long-running agentic workflows. That shift is changing (and breaking) inference as we know it.

AI agent workloads are fundamentally different

Async agentic workloads break every assumption underlying today’s inference approaches; they also operate at a much larger scale. 

Consider what a modern AI agent actually does: it receives a task, decomposes it into subtasks, reasons through dependencies, calls tools, waits for results, re-evaluates, and iterates. Only then does it produce an output. Long-running tasks, from document analysis to code generation with test-and-fix cycles, generate thousands of tokens at every step, and the iteration loop compounds them into a combinatorial explosion of tokens per task, as the sketch below illustrates.
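A self-contained simulation of why this compounds (the token counts are assumed for illustration, not measurements): each agent step re-sends the accumulated context as its prompt, so prompt tokens grow with every tool call and re-evaluation.

```python
# Illustrative simulation (token counts assumed, not measured): each agent
# step re-sends the accumulated context as its prompt, so prompt tokens
# grow with every tool call and re-evaluation.
def simulate_agent_tokens(steps: int, task_tokens: int = 500,
                          step_output_tokens: int = 800) -> int:
    context = task_tokens
    total = 0
    for _ in range(steps):
        total += context + step_output_tokens  # prompt + completion this step
        context += step_output_tokens          # reasoning/tool output is appended
    return total

for steps in (1, 5, 15):
    print(f"{steps:>2} steps -> ~{simulate_agent_tokens(steps):,} tokens")
#  1 steps -> ~1,300 tokens
#  5 steps -> ~14,500 tokens
# 15 steps -> ~103,500 tokens
```

Fifteen steps at these assumed sizes already lands around 100,000 tokens, even though each individual response stays modest.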

A reasoning model working through a complex problem isn't just "a longer chat." It's a qualitatively different computational workload with altogether different costs. 

Inference engineering is becoming complex

The consequence of running async, agentic workloads through infrastructure designed for interactive chat isn't just degraded performance. It's an engineering and business problem.

First, inference engineering becomes expensive and brittle. Because static configurations can't adapt to shifting workload profiles, teams end up in a constant re-tuning loop. 
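As an illustration (the knobs below are hypothetical, not any particular engine's schema), a configuration tuned for chat traffic encodes assumptions that a workload shift silently invalidates:

```python
# Hypothetical knobs (not any particular engine's schema) tuned for chat:
CHAT_TUNED_CONFIG = {
    "max_batch_size": 64,         # many short interactive requests
    "max_context_tokens": 4_096,  # chat turns rarely exceed this
    "kv_cache_fraction": 0.6,     # KV cache sized for short contexts
}

def fits(config: dict, avg_context_tokens: int) -> bool:
    return avg_context_tokens <= config["max_context_tokens"]

print(fits(CHAT_TUNED_CONFIG, avg_context_tokens=1_200))   # True: chat traffic
print(fits(CHAT_TUNED_CONFIG, avg_context_tokens=60_000))  # False: agent traffic
```

Every shift in the traffic mix sends someone back to re-derive those numbers by hand.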

Second, and more concretely, the economics break down.

  • Rate limits become a ceiling, not a guardrail. Systems hit hard throughput limits when async jobs flood in. Requests queue, latency climbs, and SLOs get missed: not because the hardware is insufficient, but because the scheduling logic was never designed for this traffic shape.
  • Token costs make ROI almost impossible to demonstrate. Agentic workloads that generate 50,000–200,000 tokens per task at static, per-token pricing quickly erode the business case for AI at scale (the back-of-envelope sketch after this list makes it concrete). When inference infrastructure can't distinguish between a low-priority background job and a latency-sensitive user-facing request, everything gets priced and scheduled the same way.
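A back-of-envelope sketch; the flat blended price below is assumed purely for illustration:

```python
# Back-of-envelope sketch; the flat blended price is assumed for illustration.
PRICE_PER_1K_TOKENS = 0.002  # USD, hypothetical flat rate

def daily_cost(tokens_per_task: int, tasks_per_day: int) -> float:
    return tokens_per_task / 1_000 * PRICE_PER_1K_TOKENS * tasks_per_day

chat = daily_cost(tokens_per_task=2_000, tasks_per_day=10_000)
agent = daily_cost(tokens_per_task=120_000, tasks_per_day=10_000)
print(f"chat: ${chat:,.0f}/day  agents: ${agent:,.0f}/day  ({agent / chat:.0f}x)")
# chat: $40/day  agents: $2,400/day  (60x)
```

Same request volume, sixty times the spend: flat pricing plus undifferentiated scheduling turns agent adoption into a budget problem.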

Treating inference as a high-performance computing problem

At Impala, we treat inference as a high-performance computing problem. The premise is that because workloads are both non-stationary and heterogeneous, optimization has to happen across the entire compute stack, by observing workloads in real time and adapting.

Most inference platforms are tuned ahead of time for a "happy medium". We approach this differently: we adapt across the entire stack, optimizing for throughput and cost per token.

How it works

The Impala inference engine operates as a real-time optimization system across the full compute stack.

At the observation layer, the system continuously monitors live workload signals: context lengths, token generation rates, batching characteristics, and traffic patterns. Rather than relying on pre-configured assumptions, it builds a live picture of what the system is actually doing at any given moment.
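A minimal sketch of that idea (class and field names are ours, not Impala's API): a rolling window of live signals replaces pre-configured assumptions about traffic.

```python
# Minimal sketch of the observation idea (class and field names are ours,
# not Impala's API): a rolling window of live signals replaces
# pre-configured assumptions about traffic.
from collections import deque
from statistics import mean

class WorkloadProfile:
    def __init__(self, window: int = 1000):
        self.context_lens: deque[int] = deque(maxlen=window)
        self.gen_rates: deque[float] = deque(maxlen=window)

    def observe(self, context_len: int, tokens_per_sec: float) -> None:
        self.context_lens.append(context_len)
        self.gen_rates.append(tokens_per_sec)

    def snapshot(self) -> dict:
        # The live picture: what the system is actually doing right now.
        if not self.context_lens:
            return {"avg_context": 0, "avg_gen_rate": 0.0}
        return {
            "avg_context": mean(self.context_lens),
            "avg_gen_rate": mean(self.gen_rates),
        }
```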

At the execution layer, the engine dynamically adjusts kernel selection, memory movement between SRAM and HBM, and token scheduling, in real time, without requiring operator intervention. When a workload shifts from short-context interactive requests to long-context reasoning jobs, the system detects the regime change and adapts execution accordingly.
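Conceptually, the execution layer can be thought of as a dispatch on the live profile from the observation layer (thresholds and policy names below are invented for illustration, not Impala's internals):

```python
# Conceptual dispatch on the live profile (thresholds and policy names are
# invented for illustration, not Impala's internals).
def choose_execution_policy(profile: dict) -> dict:
    if profile["avg_context"] > 32_000:
        # Long-context reasoning regime: throughput-oriented execution.
        return {"kernel": "long_context_attention", "batch_target": 8}
    # Short-context interactive regime: latency-oriented execution.
    return {"kernel": "short_context_attention", "batch_target": 64}
```

When the observed profile crosses the threshold, the policy flips without anyone editing a config file.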

At the infrastructure layer, requests are routed and scheduled across GPU fleets based on their actual computational profile, not a generic priority queue. High-latency-sensitivity requests get treated differently from background batch jobs, automatically, without manual configuration.
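A simplified sketch of profile-based routing (pool names and fields are hypothetical):

```python
# Simplified profile-based routing (pool names and fields are hypothetical).
from dataclasses import dataclass

@dataclass
class InferenceRequest:
    context_tokens: int
    latency_sensitive: bool

def route(req: InferenceRequest) -> str:
    if req.latency_sensitive:
        return "interactive-pool"   # tight TTFT SLO, reserved capacity
    if req.context_tokens > 32_000:
        return "long-context-pool"  # throughput-optimized for reasoning jobs
    return "batch-pool"             # background work, cost-optimized

print(route(InferenceRequest(context_tokens=1_500, latency_sensitive=True)))    # interactive-pool
print(route(InferenceRequest(context_tokens=80_000, latency_sensitive=False)))  # long-context-pool
```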

A continuous telemetry feedback loop closes the system: as traffic flows through, scheduling and execution decisions improve. Over time, the platform develops specialized execution paths for distinct workload shapes.
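The control rule below is an illustrative stand-in for that loop, not the actual controller: observed queue delay nudges a scheduling knob toward the throughput/latency trade-off the current traffic demands.

```python
# Illustrative stand-in for the feedback loop, not the actual controller:
# observed queue delay nudges a scheduling knob toward the trade-off the
# current traffic demands.
def update_batch_target(batch_target: int, queue_delay_ms: float,
                        slo_ms: float = 200.0) -> int:
    if queue_delay_ms > slo_ms:
        return max(batch_target // 2, 1)  # queues too deep: favor latency
    return min(batch_target + 4, 128)     # headroom: favor throughput
```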

This approach isn't theoretical. In benchmarking on DeepSeek R1, Impala achieves approximately 29% higher throughput than leading MLPerf submissions on equivalent hardware (read here).

But throughput at benchmark time is only part of the story. The more important metric is throughput consistency under production conditions. That's where static inference systems degrade, and where dynamic, full-stack optimization keeps systems operating near peak.

If you're running AI at scale, or building toward it, we'd like to show you what that looks like in practice.