Overview

This report presents Impala AI's performance results on the DeepSeek reasoning benchmark from MLPerf Inference v5.1, the industry-standard evaluation for large language model serving. The benchmark (run in the offline scenario on 8x B200) was designed to measure real-world throughput under production conditions using a state-of-the-art reasoning model on high-end GPU hardware.

Impala's MLPerf results demonstrate that its approach, which treats inference cost and scale as a hyperscale problem and explicitly assumes non-stationary, asynchronous workloads, delivers significantly better results. Because the benchmark runs on a predefined setup (model, hardware, and networking), the considerable performance increase Impala demonstrated shows that its Dynamic Inference Engine delivers more throughput at a lower cost.

Test Configuration

| Workload | mlperf_deepseek_r1 reasoning model dataset |
| -------- | ------------------------------------------ |
| Model    | Nvidia/DeepSeek-R1-NVFP4                    |
| Hardware | 8x B200 SXM (AWS P6-B200)                   |

Impala's platform orchestrated the benchmark across all 8 GPUs using its Dynamic Inference Engine, automatically adapting to the workload as it ran.

Key Results

Figure: Impala throughput vs. industry state-of-the-art results, in tokens/sec

At 29% higher than the Dell baseline and 18.2% higher than Nvidia's reported results, Impala showed that its Dynamic Inference Engine delivers more throughput, and therefore faster processing and a lower cost per token, than existing inference platforms, without model modifications, rate limits, or reliability trade-offs.
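Since the hardware and cluster cost are fixed across submissions, relative throughput maps directly onto relative cost per token. A minimal sketch of that arithmetic (only the reported percentage gains come from this report; the helper itself is generic):

```python
# At a fixed cluster cost, cost per token is inversely proportional to
# throughput: cost_per_token = cluster_cost_per_second / tokens_per_second.
# A 29% throughput gain on identical hardware therefore implies a
# 1 - 1/1.29 ~= 22.5% lower cost per token.

def cost_reduction(throughput_gain: float) -> float:
    """Fractional cost-per-token reduction implied by a fractional throughput gain."""
    return 1.0 - 1.0 / (1.0 + throughput_gain)

print(f"vs. Dell baseline:   {cost_reduction(0.29):.1%} lower cost per token")
print(f"vs. Nvidia reported: {cost_reduction(0.182):.1%} lower cost per token")
```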

Why These Results Matter

Most inference platforms are optimized for chat-speed latency, minimizing the time it takes to respond to a single user; an AI customer service agent is a typical example.

Impala is optimized for asynchronous, throughput-first workloads, where enterprises need to process millions of records rather than answer one question at a time. As the AI application landscape evolves, these workloads are gaining prominence, and they are precisely where most inference platforms and their optimizations break down.

Sample use cases where asynchronous workloads call for inference that optimizes throughput over latency include (a minimal sketch of the pattern follows the list):

  • Nightly ETL with AI-enriched transformations
  • Data curation and labeling pipelines (computer vision, NLP)
  • Compliance report generation (financial services, AML/CTF analysis)
  • Document processing and summarization at volume
  • Web scraping and content enrichment
  • MCP agent orchestration
  • Code review / analysis pipelines
  • Multi-step agentic workflows (planning, executing, evaluating, retrying)
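As a rough illustration of this pattern (the `generate` call below is a hypothetical stand-in, not Impala's API), a throughput-first pipeline submits many records concurrently and measures aggregate tokens/sec, rather than optimizing any single request's latency:

```python
import asyncio
import time

async def generate(record: str) -> int:
    """Hypothetical inference call; returns the number of tokens produced."""
    await asyncio.sleep(0.05)  # placeholder for model latency
    return len(record.split())

async def process_corpus(records: list[str], concurrency: int = 64) -> float:
    """Process every record concurrently and return aggregate tokens/sec."""
    sem = asyncio.Semaphore(concurrency)

    async def worker(rec: str) -> int:
        async with sem:  # cap in-flight requests, keep the engine saturated
            return await generate(rec)

    start = time.perf_counter()
    tokens = sum(await asyncio.gather(*(worker(r) for r in records)))
    return tokens / (time.perf_counter() - start)

if __name__ == "__main__":
    corpus = [f"record {i} with some text to enrich" for i in range(1000)]
    print(f"{asyncio.run(process_corpus(corpus)):.0f} tokens/sec")
```

The design point is that no single record's latency matters here; what matters is keeping the engine saturated so the corpus finishes sooner.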

Impala assumes that workloads aren't rigid: their variance presents a non-stationary online optimization problem. The workloads in the benchmark exemplify this observation, and Impala's results demonstrate that dynamically adapting inference to the actual workload yields superior throughput and lower costs.
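To make "non-stationary online optimization" concrete, here is a deliberately simplified sketch, not Impala's actual engine, of one such adaptation: a controller that keeps the recently best batch size but keeps probing, because the optimum drifts as the workload changes:

```python
import random

class AdaptiveBatcher:
    """Toy hill-climbing controller for batch size under a shifting workload.

    Illustration only: exploit what worked recently, but keep exploring,
    and decay old evidence so a stale optimum can be displaced.
    """

    def __init__(self, batch_size: int = 32, explore_prob: float = 0.1):
        self.batch_size = batch_size
        self.best_throughput = 0.0
        self.explore_prob = explore_prob

    def next_batch_size(self) -> int:
        if random.random() < self.explore_prob:
            # Occasionally probe a neighboring setting: the workload is
            # non-stationary, so the current best may be stale.
            return max(1, self.batch_size + random.choice([-8, 8]))
        return self.batch_size

    def observe(self, batch_size: int, throughput: float) -> None:
        # Decay the incumbent so old evidence loses weight over time.
        self.best_throughput *= 0.99
        if throughput >= self.best_throughput:
            self.best_throughput = throughput
            self.batch_size = batch_size

# Toy usage: the synthetic optimum sits at batch 64, then shifts to 96.
batcher = AdaptiveBatcher()
for step in range(500):
    bs = batcher.next_batch_size()
    optimum = 64 if step < 250 else 96
    batcher.observe(bs, 1000 - abs(bs - optimum) * 5)  # synthetic response
print(batcher.batch_size)  # tracks the drifting optimum
```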

The MLPerf benchmark reflects this distinction. The reasoning model dataset (mlperf_deepseek_r1) is a batch workload, containing large volumes of inputs processed in sequence. This is the core use case Impala is designed for: data classification, document processing, synthetic data generation, and content pipelines at scale.

About the Benchmark

MLPerf Inference v5.1 introduced DeepSeek-R1 671B as the benchmark suite's first dedicated reasoning-model workload, developed by MLCommons' Reasoning LLM Task Force to keep pace with the growing real-world deployment of advanced reasoning systems. The benchmark tests the model across five challenging open datasets covering advanced mathematics (AIME, MATH500), graduate-level science (GPQA-Diamond), expert knowledge (MMLU-Pro), and live code generation (LiveCodeBench). The maximum output length of 20,000 tokens, the highest ever used in MLPerf, allows the model to fully exercise its chain-of-thought reasoning.

Performance is measured in two scenarios: offline token-generation throughput, and server throughput constrained by latency thresholds (99th-percentile TTFT under 2 seconds and TPOT under 80 ms). Accuracy is evaluated via exact match for math and multiple-choice tasks and via code execution for programming tasks. The reference implementation supports three inference backends (vLLM, SGLang, and PyTorch), and the addition of this 671B-parameter mixture-of-experts model means MLPerf now covers the full spectrum of language-model benchmarking, from 7B to 671B parameters, across both reasoning and non-reasoning architectures.
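For reference, the server-scenario latency gates can be expressed as simple percentile checks over per-request timing logs. A minimal sketch (the input lists are assumed per-request measurements; MLPerf's own harness is more involved):

```python
import statistics

def p99(values: list[float]) -> float:
    """99th percentile via statistics.quantiles (n=100 yields 99 cut points)."""
    return statistics.quantiles(values, n=100)[98]

def meets_server_constraints(ttft_s: list[float], tpot_s: list[float]) -> bool:
    """Server scenario gates: p99 TTFT under 2 s and p99 TPOT under 80 ms."""
    return p99(ttft_s) < 2.0 and p99(tpot_s) < 0.080

# Example with synthetic per-request logs (seconds):
ttft = [0.4 + 0.001 * i for i in range(1000)]
tpot = [0.05 + 0.00001 * i for i in range(1000)]
print(meets_server_constraints(ttft, tpot))  # True: both p99s are in bounds
```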