Think before you CUDA

A history of how GPUs learned to speak Python, and why DeepSeek just made TileLang everyone’s problem.

On September 29, 2025, a quiet earthquake shook the world of AI hardware. DeepSeek shipped its V3.2-Exp model with an unmissable tell in its repository: an inference path that literally runs import tilelang. That’s not marketing copy; it’s a line of code that signals a jailbreak from the single-vendor reality we’ve lived in for years.

With TileLang’s native support for Huawei’s Ascend backends, the same elegant kernel style can now land on NPUs, not just NVIDIA GPUs. It's a simple but revolutionary idea: “Own the kernels, swap the backend.”

Prologue: The CUDA Decade (And Why We Tolerated the Pain)

CUDA arrived in 2007, transforming GPUs from specialized shader boxes into general-purpose parallel computing engines. Then, in 2012, AlexNet detonated the deep learning boom, and an entire industry learned to suffer happily in C++/CUDA for the intoxicating privilege of speed. If you needed a custom fused operation, you earned it: wrestling with shared memory, calculating occupancy, untangling warp divergence, and staring at Nsight profiler tabs that looked like the Vegas skyline.

But from 2021 onward, Python won the mindshare war. Researchers and engineers wanted to write the kernel without the cognitive whiplash of switching languages. This demand birthed a new species of Python-native GPU languages. And NVIDIA, credit where it’s due, read the room.

The Plot Twist: Tiles, Compilers, and Python as the Control Plane

Suddenly, the ground shifted. A series of parallel innovations created a new reality:

  • Tiles as the new alphabet. NVIDIA began rolling out a tile programming model (cuTile / Tile IR), encouraging developers to think in arrays and tiles instead of threads and blocks—the very mental model that DSLs like Triton and TileLang had already normalized.
  • PyTorch normalized Triton. With torch.compile, the PyTorch Inductor backend now routinely generates and autotunes Triton kernels alongside established libraries like cuDNN and CUTLASS (see the sketch after this list). The era of the "custom op = lost weekend" is rapidly ending.
  • Portability is no longer cosplay. TileLang, built on TVM, now spans CUDA, HIP, and CPU, plus Metal, WebGPU, and Huawei Ascend via dedicated adapters. The same tile-based mental model now has multiple exit ramps off the NVIDIA highway.
  • CUDA goes Python-first. The official cuda-python stack exposes the runtime, JIT, and linker as simple pip-installable packages. The "two-language problem" is shrinking, with Python firmly established as the control plane.
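
To make the torch.compile point concrete, here is a minimal sketch, assuming a CUDA build of PyTorch 2.x; the function and shapes are illustrative, not taken from any particular model:

    import torch

    # An ordinary PyTorch function: an RMS-normalize-then-scale op written
    # as plain tensor math, with no custom kernel in sight.
    def rms_scale(x: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
        rms = torch.sqrt(torch.mean(x * x, dim=-1, keepdim=True) + 1e-6)
        return (x / rms) * weight

    # torch.compile hands the traced graph to Inductor, which on a CUDA
    # device fuses it into generated, autotuned Triton kernels.
    compiled = torch.compile(rms_scale)

    x = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
    w = torch.randn(4096, device="cuda", dtype=torch.float16)
    out = compiled(x, w)  # first call compiles; later calls reuse the kernels

The first invocation pays the compile cost; after that, what used to be a hand-fused custom op is just a compiled Python function.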

The Cast of Characters (And Why Each Exists)

Meet the new tools changing how we write GPU code.

Triton - Python Kernels That Bench Like Libraries

The original audacious promise from OpenAI was simple: write an FP16 matrix multiplication in a few dozen lines of Python and get cuBLAS-class speed. It routinely delivered 2x speedups over vanilla PyTorch ops. In 2025, NVIDIA and OpenAI jointly announced day-one Blackwell support for Triton. The translation is clear: “Write kernels that feel like NumPy, and run them like you bribed a CUTLASS wizard.”

  • Why it stuck: Triton’s block-based model automates the painful choreography of memory coalescing and shared memory, freeing you to focus on the high-level logic.
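
To show what that block-based model looks like on the page, here is a minimal vector-add kernel in the style of Triton’s own tutorials; the FP16 matmul from the original pitch is a few dozen lines more of the same idea.

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
        # Each program instance owns one BLOCK_SIZE-wide tile of the vectors.
        pid = tl.program_id(axis=0)
        offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        mask = offsets < n_elements          # guard the ragged final tile
        x = tl.load(x_ptr + offsets, mask=mask)
        y = tl.load(y_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, x + y, mask=mask)

    def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        out = torch.empty_like(x)
        n = out.numel()
        grid = (triton.cdiv(n, 1024),)       # one program per tile
        add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
        return out

    a = torch.randn(1 << 20, device="cuda")
    b = torch.randn(1 << 20, device="cuda")
    torch.testing.assert_close(add(a, b), a + b)

Notice what is absent: no explicit shared memory, no thread indices, no coalescing arithmetic. The compiler handles that choreography from the tile description.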

TileLang - Portable Tiles, Fewer Excuses

TileLang is a tile-centric Python DSL built on Apache TVM. You write concise, high-level kernels, and it targets everything from CUDA and HIP to Apple Metal and WebGPU. As of 2025, its tilelang-ascend adapter makes it the first truly credible "one DSL, many accelerators" solution.

  • Why it matters now: DeepSeek’s V3.2-Exp model uses tilelang in its public inference code. This provides a credible, battle-tested path to running top-tier models on Huawei NPUs, a useful option when geopolitics makes H100s feel mythical.
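
For flavor, here is a sketch of a tiled FP16 GEMM modeled on TileLang’s published matmul example. Treat the exact identifiers (T.Tensor, T.Kernel, T.Pipelined, and friends) as assumptions: names have shifted between releases, so check the current docs before copying this anywhere.

    import tilelang
    import tilelang.language as T

    # Sketch of a tiled GEMM in the style of the TileLang README example.
    # API details here are assumptions and may differ in your installed version.
    def matmul(M, N, K, block_M=128, block_N=128, block_K=32,
               dtype="float16", accum_dtype="float"):
        @T.prim_func
        def main(
            A: T.Tensor((M, K), dtype),
            B: T.Tensor((K, N), dtype),
            C: T.Tensor((M, N), dtype),
        ):
            # One kernel instance per output tile; thread mapping is handled
            # by the compiler for whichever backend you target.
            with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M),
                          threads=128) as (bx, by):
                A_shared = T.alloc_shared((block_M, block_K), dtype)
                B_shared = T.alloc_shared((block_K, block_N), dtype)
                C_local = T.alloc_fragment((block_M, block_N), accum_dtype)
                T.clear(C_local)
                # Software-pipelined sweep over the K dimension.
                for ko in T.Pipelined(T.ceildiv(K, block_K), num_stages=3):
                    T.copy(A[by * block_M, ko * block_K], A_shared)
                    T.copy(B[ko * block_K, bx * block_N], B_shared)
                    T.gemm(A_shared, B_shared, C_local)
                T.copy(C_local, C[by * block_M, bx * block_N])

        return main

The same tile-level description is what gets lowered to CUDA, HIP, or, with the Ascend adapter, an NPU target; the kernel text does not change, only the backend does.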

ThunderKittens - Adorable Name, Dangerous Throughput

Stanford's Hazy Research shipped ThunderKittens (TK), a tiny C++/CUDA embedded DSL where 16×16 tiles map cleanly onto the GPU’s warp/warpgroup/block hierarchy. You still write CUDA, but it feels like assembling LEGOs instead of untangling pointer spaghetti. The ICLR’25 paper showed the receipts: matching cuBLAS and FlashAttention-3 on key benchmarks while delivering massive speedups (8x-14x) on newer model classes like SSMs and linear attention.

  • Its niche: When the profiler is open and your coffee is cold, TK is the small, sharp tool for maximum performance with minimal boilerplate.

Mojo - Python++ with an MLIR Engine

Mojo looks like Python, compiles like C++, and is powered by the MLIR compiler framework. In 2025, it gained official GPU kernel support and deeper Python interop. An HPC paper found Mojo competitive with hand-written CUDA/HIP on memory-bound kernels on H100/MI300A GPUs. Not just a DSL, it’s a full-blown systems language wearing a Python hoodie.

  • The bet: Mojo aims to solve the two-language problem by being the one language.

Intermission: The Ghost of Assembly Past

In the late 20th century, the fastest code was hand-written assembly. Then hardware evolved with branch predictors, pipelines, and out-of-order execution. Compilers learned the dance better than humans could. By the Pentium era, a well-structured C program almost always beat "expert" assembly. Assembly didn’t die, but it retreated to the weird bits: bootloaders, ISRs, and cryptic inner loops.

We’re here again. CUDA C++ is our modern assembly: terrifying, beautiful, and still the performance ceiling. But Triton, TileLang, TK, and Mojo mean the compiler often sees farther than you can. Most days, you’ll write tiles in a high-level dialect and let the toolchain do the dirty work. And when you prove you need that last 7% of performance, you’ll pop open CUDA like a lockpick, but only for the part that needs it.

“So Why Are People Still Writing Raw CUDA?”

Because the absolute performance ceiling still lives there.

  1. Day-One Features: New tensor-core modes and memory tricks hit CUDA first. The other stacks follow. (The coordinated Blackwell-Triton release was a notable exception.)
  2. Last-Mile Control: When your performance bug hides in an esoteric warp shuffle or a memory layout no compiler can model yet, it’s just you, a profiler, and a lovingly placed __syncthreads().
  3. Industrial-Grade Libraries: cuBLAS, cuDNN, TensorRT, and the Nsight profiler are still the strongest pillars of the ecosystem—and now they’re cleanly scriptable from Python.

Rule of thumb: Start high-level. Drop to CUDA only when you prove you need the headroom.

Choose Your Weapon: The 2025 Hacker’s Guide

  • Start with Triton if you live in PyTorch and need fused kernels by dinnertime. It’s the path of least resistance, with the Blackwell wind at its back.
  • Bring in TileLang when your roadmap says NVIDIA plus AMD, Ascend, Metal, or WebGPU. DeepSeek just gave you the social proof.
  • Unsheathe ThunderKittens when you need tile-level control with minimal boilerplate and have paper receipts that show it beats the comfy baselines.
  • Prototype in Mojo if you’re betting on one language to rule them all. Expect some sharp edges, but the MLIR engine underneath is the real deal.

The Future

Tiles are the new lingua franca of GPU programming. CUDA won’t die; it will sink (deliberately) beneath higher-level programming models that most of us will write in. We’ll argue about autotuners instead of warps. We’ll pip install a tile DSL and target whatever chips geopolitics allows us to buy.

Think before you CUDA.