Part 1 of 3: a compact communication model for why Wide-EP serving becomes network-bound.
Why the wire shows up earlier than most people expect
Wide-EP failures rarely look like failures at first. The model still serves. Nothing obviously crashes. What changes is the scaling curve: adding more GPUs stops helping, GPU utilization falls, and the network starts looking busier than the tensor cores.
Part 1 is about making that behavior unsurprising. The payload model is simple, but it already explains most of the bad traces I care about:
- why DeepSeek-class MoE serving is a multi-node problem,
- why the key variable is not the headline batch but the per-rank slice,
- why the fastest way to lose scaling is to widen EP without growing the work seen by each rank.
In this part
- The source-backed model constants for the DeepSeek-V3 / R1 family.
- A notation set that is small enough to fit in your head.
- Dispatch/combine byte formulas, including the locality fraction that actually reaches the slow fabric.
- The first-order timing model linking communication, compute, and routing skew.
As in the flagship post, I will use three kinds of numbers:
- model constants from public configs and model cards,
- hardware specs from vendor docs,
- toy calculations used as sanity checks.
1) Why DeepSeek-class serving forces multi-node Wide-EP
DeepSeek-V3 and DeepSeek-R1 are a useful anchor because the public artifacts are unusually explicit. Between the model cards, the released config, and the published weight notes, the numbers you need are:
- 671B total parameters and 37B activated parameters per token.
- 61 hidden layers, 256 routed experts, and \(\text{top-}k = 8\).
- Hidden size \(d = 7168\).
The memory-side conclusion is immediate. NVIDIA advertises 80 GB HBM3 per H100 SXM, so an 8xH100 box gives you 640 GB total HBM. DeepSeek's own weight documentation describes 671B main-model parameters, while the Hugging Face release notes say the published package is 685B parameters including the MTP module. The exact in-memory footprint depends on packing, scaling tensors, and whether you load the MTP path. The qualitative point does not: this is already a multi-node serving regime if you want the model resident on GPU and still need room for KV cache and runtime buffers.
The second scale argument is about repetition. A routed decode step does not pay one expert-parallel collective. It pays dispatch and combine in every MoE layer. For the V3 / R1 family that is 61 dispatch phases and 61 combine phases along the active path. Small inefficiencies stop being small when you pay them 122 times per step.
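To make the repetition cost concrete, here is a toy Python calculation of the pure latency floor. The 61-layer count comes from the configs above; the 20 µs per-collective overhead is an illustrative assumption, not a measured number:

```python
# Toy latency-floor estimate for repeated MoE collectives.
# moe_layers comes from the DeepSeek-V3 / R1 configs; L_seconds is
# an ASSUMED fixed overhead per collective, purely for illustration.
moe_layers = 61
phases_per_layer = 2            # one dispatch + one combine per MoE layer
L_seconds = 20e-6               # assumed fixed overhead per collective

phases_per_step = moe_layers * phases_per_layer
latency_floor = phases_per_step * L_seconds

print(phases_per_step)          # 122 collective phases per decode step
print(latency_floor * 1e3)      # 2.44 ms of pure latency per step
```

Even before any bytes move, 122 collective phases put a hard latency floor under every decode step, which is why the fixed overhead \(L\) earns a spot in the model below.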
That is what makes Wide-EP feel different from ordinary distributed inference. You are not only sharding weights; you are repeatedly shipping token activations to where the experts live, under a schedule that is sensitive to both topology and skew.
2) The smallest model that still predicts the trace
I like to keep only the variables that survive contact with measurement:
- \(B\): total tokens participating in one MoE layer step across the EP group.
- \(P\): number of expert-parallel ranks for that layer.
- \(T = B / P\): balanced tokens per rank before routing skew.
- \(k\): routed experts per token.
- \(d\): hidden size of the routed activation.
- \(s_{\text{dispatch}}, s_{\text{combine}}\): bytes per activation element on dispatch and combine.
- \(b_{\text{side,dispatch}}, b_{\text{side,combine}}\): amortized sideband bytes per routed copy, covering routing indices, combine weights, source metadata, FP8 scales, padding, and kernel bookkeeping.
- \(L\): fixed effective overhead of one collective on the critical path.
- \(BW_{\text{eff}}\): achieved bandwidth on that same path.
- \(\gamma\): routed load on the hottest rank divided by mean routed load.
- \(\rho\): fraction of routed payload that leaves the fast locality domain and reaches the slower scale-out network.
The roofline view is still the right first picture: one collective costs roughly a fixed latency floor plus a bandwidth term,
$$t \approx L + \frac{V}{BW_{\text{eff}}}$$
so small payloads are latency-bound and large payloads are bandwidth-bound.
The two variables people usually underestimate are:
- \(T = B / P\), because the kernel does not care that the global batch was large if the per-rank slice became tiny.
- \(\rho\), because the NIC does not care about bytes that stayed on NVLink or xGMI.
Those two numbers - the per-rank slice and the scale-out fraction - do a surprising amount of explanatory work.
3) Dispatch/combine bytes
It helps to separate the dominant activation term from the full wire payload.
For dispatch,
$$V_{\text{dispatch}} \approx T\,k\,(d\,s_{\text{dispatch}} + b_{\text{side,dispatch}})$$
and for combine,
$$V_{\text{combine}} \approx T\,k\,(d\,s_{\text{combine}} + b_{\text{side,combine}})$$
The older shortcut
$$V_{\text{act,dir}} \approx T\,k\,d\,s_{\text{dir}}$$
is still useful as a leading term, but it is not a literal packet budget. DeepEP-style paths may also carry top-k indices, top-k weights, source-location metadata, FP8 scales, and padding.
If only a fraction \(\rho\) leaves the fast locality domain, and if the same locality split applies in both directions, then
$$V_{\text{layer}} \approx V_{\text{dispatch}} + V_{\text{combine}}, \qquad V_{\text{scaleout,layer}} \approx \rho\,V_{\text{layer}}$$
This is the deliberately clean version. Real systems are messier in two predictable ways:
- All-to-allv is tail-dominated, so completion time is set by the busiest destination rank, not the mean rank.
- The sideband terms are implementation-dependent, so \(b_{\text{side,dispatch}}\) and \(b_{\text{side,combine}}\) are not cosmetic. They change with routing format, padding rules, quantization scales, and kernel layout choices.
Sanity-check example
Take a DeepSeek-V3-class shape:
- \(B = 128{,}000\)
- \(P = 64\)
- \(k = 8\)
- \(d = 7168\)
- \(s_{\text{dispatch}} = 1\) byte for an FP8 dispatch activation term
- \(s_{\text{combine}} = 2\) bytes for a BF16 combine activation term
- ignore \(b_{\text{side,dispatch}}\) and \(b_{\text{side,combine}}\) on the first pass, then add them back conceptually as a correction.
Then the activation-dominant bytes per rank are
$$V_{\text{dispatch,act}} \approx \frac{128000}{64} \cdot 8 \cdot 7168 \cdot 1 \approx 114.7\ \text{MB}$$
and
$$V_{\text{combine,act}} \approx \frac{128000}{64} \cdot 8 \cdot 7168 \cdot 2 \approx 229.4\ \text{MB}$$
so one MoE layer already costs about
$$V_{\text{layer,act}} \approx 344.1\ \text{MB per rank}$$
before routing sideband.
If only \(\rho = 0.30\) of that leaves the locality domain, the scale-out-visible traffic is still
$$V_{\text{scaleout,layer}} \approx 0.30 \cdot 344.1\ \text{MB} \approx 103.2\ \text{MB}$$
per layer, per rank. Across 61 MoE layers that is about
$$61 \times 103.2\ \text{MB} \approx 6.3\ \text{GB}$$
per forward pass on the slow fabric. At 10 steps per second, that is roughly
$$63\ \text{GB/s}$$
which already overshoots a 400 Gb/s-class link before you count sideband, protocol tax, or imbalance.
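The whole sanity check collapses to a few lines of Python (MB and GB are decimal here, \(10^6\) and \(10^9\) bytes, matching the numbers above):

```python
# Reproduce the activation-dominant sanity check, sideband ignored.
B, P, k, d = 128_000, 64, 8, 7168
s_dispatch, s_combine = 1, 2          # FP8 dispatch, BF16 combine
rho, layers, steps_per_s = 0.30, 61, 10

T = B // P                            # 2000 tokens per rank
v_layer = T * k * d * (s_dispatch + s_combine)   # bytes per layer per rank
v_scaleout = rho * v_layer            # bytes that reach the slow fabric

print(v_layer / 1e6)                  # ~344.1 MB per layer per rank
print(layers * v_scaleout / 1e9)      # ~6.3 GB per forward pass
print(layers * v_scaleout * steps_per_s / 1e9)   # ~63.0 GB/s
```

A 400 Gb/s link moves at most 50 GB/s of payload, so the 63 GB/s figure is over the line before any sideband or protocol tax is counted.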
That is the important intuition. The network becomes first-order well before \(\rho\) gets anywhere near 1.
4) Communication time versus expert time
Once the payload is clear, the timing model is almost boring:
$$t_{\text{comm}} \approx L_{\text{dispatch}} + L_{\text{combine}} + \frac{V_{\text{dispatch,crit}}}{BW_{\text{eff,dispatch}}} + \frac{V_{\text{combine,crit}}}{BW_{\text{eff,combine}}}$$
where each crit term is the payload on the slowest relevant path in that direction. I only collapse this back to one \(L\) and one \(BW_{\text{eff}}\) when the same fabric and kernel family dominate both legs.
Expert compute on the critical rank can be modeled as
$$t_{\text{compute}} \approx \gamma \cdot \frac{B\,k}{P} \cdot c_{\text{tok}} = \gamma\,T\,k\,c_{\text{tok}}$$
where \(c_{\text{tok}}\) is expert compute time per routed token on a balanced rank.
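The two timing formulas combine into a step-time sketch. This is a minimal model, not an implementation: the `overlap` flag is my own addition to let you compare perfect communication/compute overlap against full serialization, which the text deliberately leaves open:

```python
def step_time(v_dispatch_crit, v_combine_crit,
              L_dispatch, L_combine, bw_dispatch, bw_combine,
              T, k, c_tok, gamma=1.0, overlap=False):
    """First-order time for one MoE layer step on the critical rank.

    Payloads in bytes, bandwidths in bytes/s, latencies and c_tok in
    seconds. gamma is hottest-rank routed load over mean routed load.
    overlap=True assumes perfect comm/compute overlap (max of the two);
    overlap=False assumes they serialize (sum). Reality sits between.
    """
    t_comm = (L_dispatch + L_combine
              + v_dispatch_crit / bw_dispatch
              + v_combine_crit / bw_combine)
    t_compute = gamma * T * k * c_tok
    return max(t_comm, t_compute) if overlap else t_comm + t_compute
```

Plugging in a \(\gamma\) of 1.3 stretches only the compute term, which is exactly the "slowest rank stretches the whole step" effect discussed next.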
This is where the Wide-EP tension becomes obvious:
- increasing \(P\) reduces compute per rank,
- but it also makes the collectives smaller and more latency-sensitive,
- while routing skew inflates the hottest rank through \(\gamma\).
If the hottest rank carries 30% more routed load than the mean, this part of the step scales by about 1.3x relative to the balanced case. I prefer stating it that way rather than saying '30% of compute is wasted,' because the real issue is that the slowest rank stretches the whole step.
The most common Wide-EP mistake is to widen EP and watch expert work per rank shrink faster than communication gets easier.
That is the entire reason DBO exists, which is where Part 2 starts.
Next in the series
In Part 2, I take this model and turn it into tuning decisions: when DBO actually buys you something, how to choose DeepEP low-latency versus high-throughput kernels, and why the real cliff is a locality-domain boundary.
References (for this part)
- DeepSeek-V3 model card: DeepSeek-V3
- DeepSeek-V3 config: DeepSeek-V3 config.json
- DeepSeek-V3 weight documentation: DeepSeek-V3 README_WEIGHTS.md
- DeepSeek-R1 model card: DeepSeek-R1
- NVIDIA H100 page: NVIDIA H100 GPU
- Collective background: MPI Alltoallv

