Insights & Ideas
Discover stories, tips, and resources to inspire your next big idea.
A manifesto for dynamic inference, and how we made the same model run faster on the same GPUs.
We set out to create a shared vocabulary for async inference, so we can communicate clearly about scheduling, routing, memory, and cost at scale.
The next wave of enterprise AI is about long-running agentic workflows.
Together, Highrise and Impala provide inference that is both high-throughput and highly available.
Impala AI's performance results on the DeepSeek Reasoning benchmark in MLPerf Inference
A first-principles communication model for multi-node MoE serving, and why per-rank payloads decide when the wire becomes the bottleneck.
A history of how GPUs learned to speak Python, and why DeepSeek just made TileLang everyone’s problem.
How a single bug in vLLM left millions of prompts compromised
How lessons learned from the past solve AI's biggest bottlenecks
We’d love to hear from you. Reach out with questions, ideas, or just to say hello.