Run It Hot

A lot of money is going into AI data centers, split between installed infrastructure capex and recurring operating costs, with energy becoming the most visible operating constraint.

Installed infrastructure is the bill for building the factory: GPUs, servers, networking, power delivery, cooling, and facility capacity. Energy opex is the recurring electricity cost of running that factory. Power is the physical constraint that determines how hard the installed infrastructure can run. Goldman Sachs models around $7.6 trillion of cumulative AI infrastructure capex between 2026 and 2031, and the IEA expects data center electricity demand to roughly double to 950 TWh by 2030.

But focusing on installed infrastructure and energy supply alone misses the question that matters most: how much useful output can the installed base produce. Running hot can improve ROI when the same infrastructure and contracted power generate more sellable tokens. My argument is that throughput is a core lever that operators should think about when considering both capex amortization and energy productivity.

In this post we ask a single question: what actually maximizes ROI on AI data center spend? The instinct is to look for savings. Those matter, but they are not the whole equation. An AI data center can cost more to run per hour and still become more profitable if that hour's token output grows faster than its cost. We will therefore analyze $\text{PPMT}$: the fully loaded production cost per million tokens before provider margin, implied by total cost of ownership ($\text{TCO}$) and achieved throughput.

Let’s see how throughput matters, with math.

The Equation That Runs the Factory

Demand for AI is now measured in tokens. Alphabet noted on its Q1 2026 earnings call that its first-party models were processing more than 16 billion tokens per minute through direct API usage, up from 10 billion the prior quarter. For an inference provider, tokens delivered within latency, quality, and reliability targets are the factory output. Everything else is overhead in service of producing them.

Each generated token carries a share of the fully loaded cost of a provisioned GPU hour, $C$. If the GPU sustains $X$ tokens per second over that hour, it produces $3600 \cdot X$ tokens, so $\text{PPMT}$ is:

$$\text{PPMT}(X) = \frac{C \cdot k}{X},\quad\quad k = \frac{10^{6}}{3600}$$
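
As a minimal sketch, the formula translates directly into a few lines of Python. The cost of 2.00 dollars per GPU hour and the throughput figures are hypothetical values chosen for illustration, not numbers from any source cited here.

```python
# k converts "dollars per GPU hour at X tokens per second"
# into "dollars per million tokens": 1e6 tokens / 3600 seconds.
K = 1_000_000 / 3600


def ppmt(cost_per_gpu_hour: float, tokens_per_second: float) -> float:
    """Production cost per million tokens for one provisioned GPU."""
    if tokens_per_second <= 0:
        raise ValueError("throughput must be positive")
    return cost_per_gpu_hour * K / tokens_per_second


# Hypothetical inputs: same hourly cost, two throughput levels.
base = ppmt(2.00, 1000.0)
doubled = ppmt(2.00, 2000.0)
print(f"base throughput:    {base:.4f} dollars per million tokens")
print(f"doubled throughput: {doubled:.4f} dollars per million tokens")
```

Holding $C$ fixed, doubling $X$ halves the result, which is the whole argument in miniature.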

From Total Cost of Ownership to GPU Hours

Total cost of ownership is the full cost of making the data center available and productive. For an inference operator, the useful way to read it is not only as a large upfront budget, but as an hourly cost of capacity. A provisioned GPU hour carries a share of the installed infrastructure capex, a share of the facility capex, the energy opex needed to run it, and the recurring operating costs required to keep it available. Once those costs are expressed per GPU hour, throughput determines how many million-token units that hour produces.

The fully loaded hourly cost per provisioned GPU hour is:

$$C = C_{\text{gpu}} + C_{\text{net}} + C_{\text{fac}} + C_{\text{energy}} + C_{\text{op}}$$

Where:

$$C_{\text{gpu}} = \frac{P_{\text{sys}} \cdot \text{CRF}(i,L)}{8760}$$

$$C_{\text{net}} = \frac{N \cdot \text{CRF}(i,L)}{8760}$$

$$C_{\text{fac}} = \frac{f_{\text{fac}} \cdot \text{CRF}\left( i,L_{\text{f}} \right)}{8760}$$

$$C_{\text{energy}} = \text{PUE} \cdot \frac{{\text{Pwr}}_{\text{IT}}}{1000} \cdot p_{\text{e}}$$

$$C_{\text{op}} = \text{staff} + \text{maintenance} + \text{overhead}$$

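The first three terms convert installed infrastructure capex into an hourly charge. $C_{\text{gpu}}$ is the hourly share of GPU and server capex. $C_{\text{net}}$ is the hourly share of networking capex. $C_{\text{fac}}$ is the hourly share of facility capex allocated to that GPU. Here $P_{\text{sys}}$, $N$, and $f_{\text{fac}}$ are the upfront GPU and server, networking, and facility capex allocated to one GPU, $i$ is the cost of capital, and $L$ and $L_{\text{f}}$ are the useful lives of the IT equipment and the facility. The capital recovery factor, $\text{CRF}(i,L)$, spreads each upfront investment over its useful life while reflecting the cost of capital. The division by $8760$ converts an annualized cost into an hourly cost.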

Together, these capital charges form the capex component of the hourly cost stack:

$$C_{\text{capex}} = C_{\text{gpu}} + C_{\text{net}} + C_{\text{fac}}$$

$C_{\text{energy}}$ is different because it is an operating expense, not a capital charge. It starts with ${\text{Pwr}}_{\text{IT}}$, the power draw in watts, converts it to kilowatts, applies $\text{PUE}$ to include facility overhead, and multiplies by $p_{\text{e}}$, the electricity price per kilowatt hour. $C_{\text{op}}$ covers the recurring work around the installed base: staffing, maintenance, monitoring, repairs, and overhead.
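
The cost stack can be sketched in Python as follows. Two things here are assumptions rather than statements from the text: the capital recovery factor is taken to be the standard annuity form $i(1+i)^{L}/\left((1+i)^{L}-1\right)$, and every input value (capex per GPU, rate, lifetimes, operating cost) is a hypothetical placeholder.

```python
HOURS_PER_YEAR = 8760


def crf(rate: float, years: float) -> float:
    """Standard annuity capital recovery factor (assumed form):
    annual charge per dollar of upfront capex."""
    if rate == 0:
        return 1.0 / years
    growth = (1.0 + rate) ** years
    return rate * growth / (growth - 1.0)


def hourly_cost_stack(p_sys, n_net, f_fac, rate, life, life_fac,
                      pue, pwr_it_watts, price_kwh, c_op):
    """Per-GPU-hour cost components and their total, in dollars."""
    stack = {
        "gpu": p_sys * crf(rate, life) / HOURS_PER_YEAR,
        "net": n_net * crf(rate, life) / HOURS_PER_YEAR,
        "fac": f_fac * crf(rate, life_fac) / HOURS_PER_YEAR,
        # Watts -> kilowatts, scaled by PUE, priced per kWh.
        "energy": pue * (pwr_it_watts / 1000) * price_kwh,
        "op": c_op,
    }
    stack["total"] = sum(stack.values())
    return stack


# Hypothetical per-GPU inputs, for illustration only.
example = hourly_cost_stack(
    p_sys=30_000, n_net=4_000, f_fac=12_000,
    rate=0.10, life=5, life_fac=15,
    pue=1.25, pwr_it_watts=1275, price_kwh=0.087, c_op=0.10,
)
for name, value in example.items():
    print(f"{name:>6}: {value:.4f} dollars per GPU hour")
```

Returning the components separately, rather than only the total, makes it easy to see which term dominates under a given set of assumptions.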

That gives the production cost per million tokens:

$$\text{PPMT}(X) = \frac{k}{X} \cdot \left( C_{\text{capex}} + C_{\text{energy}} + C_{\text{op}} \right)$$

SemiAnalysis’ GPU server example puts capital at $1.203 per GPU hour and hosting at $0.321 per GPU hour, for a total of $1.524 per GPU hour. The same table assumes 10.2 kW for an eight-GPU server, 1.25 PUE, and $0.087 per kWh. That works out to about $0.14 of electricity per GPU hour, or roughly 9% of the full GPU hour cost, so energy is about a tenth of the cost stack.
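
The electricity share can be checked with a few lines of arithmetic. The inputs are the SemiAnalysis figures quoted above. The one assumption in this sketch is that the 1.524 dollars per GPU hour total is the right denominator for the share.

```python
# Figures quoted from the SemiAnalysis GPU server table.
server_watts = 10_200
gpus_per_server = 8
pue = 1.25
price_per_kwh = 0.087
total_per_gpu_hour = 1.524  # capital + hosting, dollars per GPU hour

# IT power per GPU in kilowatts, then facility-level energy cost per hour.
kw_per_gpu = server_watts / gpus_per_server / 1000
energy_per_gpu_hour = pue * kw_per_gpu * price_per_kwh
energy_share = energy_per_gpu_hour / total_per_gpu_hour

print(f"electricity: {energy_per_gpu_hour:.3f} dollars per GPU hour")
print(f"share of total GPU hour cost: {energy_share:.1%}")
```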

Here is where the asymmetry is strongest. If energy opex is 10% of the stack, cutting the electricity price in half lowers $\text{PPMT}$ by about 5%. Doubling throughput at the same hourly cost lowers $\text{PPMT}$ by 50%, before second-order power and lifetime effects.
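
A small sketch makes the comparison explicit. The helper `ppmt_ratio` is a hypothetical name introduced here. It assumes the non-energy part of the stack and the power draw stay fixed, which is exactly the first-order simplification described above.

```python
def ppmt_ratio(energy_share: float,
               price_factor: float = 1.0,
               throughput_factor: float = 1.0) -> float:
    """New PPMT divided by old PPMT.

    energy_share: fraction of the hourly cost stack that is electricity.
    price_factor: multiplier on the electricity price.
    throughput_factor: multiplier on tokens per second.
    Assumes non-energy costs and power draw do not change.
    """
    new_hourly_cost = (1.0 - energy_share) + energy_share * price_factor
    return new_hourly_cost / throughput_factor


half_price = ppmt_ratio(0.10, price_factor=0.5)
double_throughput = ppmt_ratio(0.10, throughput_factor=2.0)

print(f"halve electricity price: PPMT changes by {half_price - 1:+.0%}")
print(f"double throughput:       PPMT changes by {double_throughput - 1:+.0%}")
```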

Power Scarcity Matters, but Tokens Scale Faster Than Watts

“Fine,” you can say, “but the binding constraint isn’t capital anymore. It’s electricity. You can’t solve for a megawatt you can’t buy.”

It’s a fair objection, and it deserves a real answer rather than a hand wave. So let’s run the same move on power.

In the first phase of the AI data center build, the market focus was scarce accelerators, which set the pace of capacity. The next identified constraint is available power: energy supply, grid interconnection, and power delivery infrastructure.

That shift changes how we account for power scarcity. Earlier, we assumed the operating decision was made under a fixed hourly cost stack. But when contracted power itself is scarce, the relevant operating question becomes energy productivity: how many paid tokens the operator can produce per watt. The energy component of $\text{PPMT}$ captures this. It is the capex-free view of the unit economics.

The energy component of $\text{PPMT}$ is:

$${\text{PPMT}}_{\text{energy}}(X) = \text{PUE} \cdot p_{\text{e}} \cdot \frac{k}{1000} \cdot \frac{{\text{Pwr}}_{\text{IT}}(X)}{X} \propto \frac{1}{(\text{tokens per watt})(X)}$$

The energy cost per million tokens moves inversely with tokens per watt, meaning that understanding how throughput and power correlate for inference is crucial. Since this behavior depends on the underlying inference system, we need a way to describe the relationship between throughput and power draw. The useful object is the throughput power Pareto curve.

For a given model, hardware setup, serving policy, and service level objective (SLO), the throughput power Pareto curve describes the best attainable throughput at each level of IT power draw. Each point on the curve is an operating mode: a batching policy, scheduling policy, memory policy, routing policy, and utilization level that produces some throughput $X$ while drawing some power ${\text{Pwr}}_{\text{IT}}(X)$.

The curve matters because the operator does not care about watts in isolation. The operator cares about how many paid tokens those watts produce. A worse operating point burns power without enough output. A better operating point produces more tokens for the same contracted watt, or reaches the same token volume with less power.

In this framing, the energy component of $\text{PPMT}$ is not just a function of electricity price. It is a function of where the inference system sits on the throughput power Pareto curve.

If power draw rises by 5% while throughput rises by 30%, the operator is economically better off under a fixed power constraint because tokens per watt improved. If power draw falls by 10% but throughput falls by 40%, the operator saved electricity and destroyed output.
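
Both scenarios can be worked through numerically. This is a sketch of the arithmetic only. The function name is a hypothetical helper, and the percentages are the illustrative ones from the paragraph above.

```python
def tokens_per_watt_change(power_change: float,
                           throughput_change: float) -> float:
    """Fractional change in tokens per watt, given fractional changes
    in power draw and throughput (0.05 means +5%)."""
    return (1.0 + throughput_change) / (1.0 + power_change) - 1.0


scenarios = {
    "power +5%, throughput +30%": (0.05, 0.30),
    "power -10%, throughput -40%": (-0.10, -0.40),
}

for label, (d_power, d_throughput) in scenarios.items():
    tpw = tokens_per_watt_change(d_power, d_throughput)
    # Energy cost per million tokens moves inversely with tokens per watt.
    energy_ppmt = 1.0 / (1.0 + tpw) - 1.0
    print(f"{label}: tokens per watt {tpw:+.1%}, "
          f"energy cost per million tokens {energy_ppmt:+.1%}")
```

In the first case tokens per watt rise by roughly 24%. In the second they fall by about a third, even though the electricity bill went down.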

The next question is therefore not whether power rises when GPUs are pushed harder. It usually does. The right question is whether power rises proportionally to throughput. If throughput grows faster than power draw, the energy component of $\text{PPMT}$ falls and power constrained ROI improves.

At the device level, power has a floor and a ceiling. $P_{\text{idle}}$ is the idle platform draw that remains when the accelerator is lightly used. $P_{\text{TDP}}$ is the Thermal Design Power envelope, the level the platform and cooling design are built to sustain. Dynamic power sits between them and rises with switching activity, memory traffic, and achieved utilization:

$${\text{Pwr}}_{\text{IT}}(X) = P_{\text{idle}} + \left( P_{\text{TDP}} - P_{\text{idle}} \right) \cdot a(X)$$

$$0 \leq a(X) \leq 1$$

The activity term $a(X)$ summarizes workload intensity between the idle floor and the TDP envelope. When inference is far from the compute or memory roofline, $a(X)$ stays well below $1$, so power draw remains closer to $P_{\text{idle}}$ than to $P_{\text{TDP}}$.
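
The power model is simple enough to sketch directly. The idle and TDP values below are hypothetical placeholders, not measurements of any particular accelerator.

```python
def pwr_it(p_idle: float, p_tdp: float, activity: float) -> float:
    """IT power draw in watts for an activity level between 0 and 1."""
    if not 0.0 <= activity <= 1.0:
        raise ValueError("activity must be in [0, 1]")
    return p_idle + (p_tdp - p_idle) * activity


# Hypothetical platform: 200 W idle floor, 1000 W TDP envelope.
P_IDLE, P_TDP = 200.0, 1000.0

for a in (0.0, 0.3, 0.6, 1.0):
    watts = pwr_it(P_IDLE, P_TDP, a)
    print(f"activity {a:.1f}: {watts:.0f} W "
          f"({watts / P_TDP:.0%} of TDP)")
```

Doubling activity from 0.3 to 0.6 in this example raises power from 440 W to 680 W, well short of doubling, because the idle floor is paid either way.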

The mechanism is roofline utilization. Prefill work tends to be compute bound, so its ceiling is tied to model FLOPs utilization (MFU). Decode work tends to be memory bandwidth bound, so its ceiling is tied to model bandwidth utilization (MBU). MFU and MBU are achieved utilization ratios. Low MFU or MBU means compute or memory resources are underfilled. Power rises significantly above $P_{\text{idle}}$ toward $P_{\text{TDP}}$ only as the workload approaches those roofs. Inference often operates below them, which leaves room for throughput to grow faster than watts. In the bounds below, ${\text{FLOP}}_{\text{peak}}$ is peak compute throughput, $F_{\text{token}}$ is the compute required per token, $BW$ is peak memory bandwidth, and $M_{\text{token}}$ is the memory traffic per token.

$$X \leq \frac{\text{MFU} \cdot {\text{FLOP}}_{\text{peak}}}{F_{\text{token}}}$$

$$X \leq \frac{\text{MBU} \cdot BW}{M_{\text{token}}}$$
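
A hedged sketch of the two bounds follows. It assumes $F_{\text{token}}$ is measured in FLOPs per token and $M_{\text{token}}$ in bytes moved per token. All hardware and model figures are hypothetical round numbers chosen for illustration.

```python
def roofline_bounds(mfu, flop_peak, f_token, mbu, bw, m_token):
    """Upper bounds on tokens per second from the compute and
    memory rooflines, plus which of the two is binding."""
    compute = mfu * flop_peak / f_token   # tokens/s allowed by compute
    memory = mbu * bw / m_token           # tokens/s allowed by bandwidth
    binding = "compute" if compute < memory else "memory"
    return {"compute": compute, "memory": memory, "binding": binding}


# Hypothetical accelerator and workload.
bounds = roofline_bounds(
    mfu=0.40, flop_peak=1.0e15,   # 40% of 1 PFLOP/s achieved
    f_token=1.4e11,               # FLOPs per generated token
    mbu=0.60, bw=3.0e12,          # 60% of 3 TB/s achieved
    m_token=2.0e9,                # bytes moved per token
)
print(f"compute bound: {bounds['compute']:.0f} tokens/s")
print(f"memory bound:  {bounds['memory']:.0f} tokens/s")
print(f"binding roofline: {bounds['binding']}")
```

Raising MFU or MBU, or lowering the per-token work through better batching, lifts the corresponding ceiling on $X$.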

Serving changes that improve batching, KV cache movement, and decode path efficiency move the operating point along the throughput power Pareto curve. They raise $X$ by filling idle bubbles or moving work closer to the relevant roofline. The power response is smaller when the idle floor was already paid in watts and the platform remains below the TDP envelope. In other words: you are already paying for the idle draw. You may as well get tokens for it.

With the mechanism behind the throughput and power interaction in place, the next question is how the two scale together. The relevant measurement is power elasticity. Elasticity measures how responsive one variable is to a percentage change in another. Here, it measures how much power changes when throughput changes along the throughput power Pareto curve.

Move from one operating point on the curve to a nearby point. Throughput rises from $X$ to $X + \Delta X$. Power rises from ${\text{Pwr}}_{\text{IT}}(X)$ to ${\text{Pwr}}_{\text{IT}}(X) + \Delta{\text{Pwr}}_{\text{IT}}$. The percentage increase in throughput is $\Delta X/X$. The percentage increase in power is $\Delta{\text{Pwr}}_{\text{IT}}/{\text{Pwr}}_{\text{IT}}$. Power elasticity is the ratio between those two percentage changes:

$$\eta \approx \frac{\Delta{\text{Pwr}}_{\text{IT}}/{\text{Pwr}}_{\text{IT}}}{\Delta X/X}$$

For small moves on the curve, this becomes the local log slope:

$$\eta = \frac{d\ln\left( {\text{Pwr}}_{\text{IT}} \right)}{d\ln(X)}$$

This number has a direct operating interpretation. If $\eta = 0.3$, a 1% gain in throughput requires about 0.3% more power. If $\eta = 1$, power rises in proportion to throughput. If $\eta > 1$, the next throughput gain consumes power faster than it creates tokens.
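
In practice an operator would estimate $\eta$ from two measured operating points. A minimal sketch, using the log-difference form of the definition and hypothetical measurements:

```python
import math


def power_elasticity(x0: float, p0: float, x1: float, p1: float) -> float:
    """Arc estimate of d ln(power) / d ln(throughput) between two
    operating points (x in tokens per second, p in watts)."""
    if min(x0, x1, p0, p1) <= 0 or x0 == x1:
        raise ValueError("need positive values and distinct throughputs")
    return math.log(p1 / p0) / math.log(x1 / x0)


# Hypothetical measurements: throughput +30%, power +5%.
eta = power_elasticity(x0=1000.0, p0=600.0, x1=1300.0, p1=630.0)
print(f"estimated power elasticity: {eta:.3f}")
print("tokens scale faster than watts" if eta < 1
      else "power scales at least as fast as tokens")
```

The log-ratio form is used rather than the raw percentage ratio so that the estimate is symmetric: swapping the two points gives the same $\eta$.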

The metric matters because the energy component of $\text{PPMT}$ depends on watts per token, the inverse of tokens per watt, $X/{\text{Pwr}}_{\text{IT}}(X)$. Because tokens per watt is a ratio, its percentage change is the percentage change in throughput minus the percentage change in power:

$$\frac{d\ln(\text{tokens per watt})}{d\ln(X)} = \frac{d\ln\left( X/{\text{Pwr}}_{\text{IT}}(X) \right)}{d\ln(X)} = 1 - \eta$$
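
For readers who want the intermediate step, the identity above follows from the logarithm of a ratio and the definition of $\eta$, with no further assumptions:

$$\ln\left( \frac{X}{{\text{Pwr}}_{\text{IT}}(X)} \right) = \ln(X) - \ln\left( {\text{Pwr}}_{\text{IT}}(X) \right)$$

$$\frac{d}{d\ln(X)}\left\lbrack \ln(X) - \ln\left( {\text{Pwr}}_{\text{IT}}(X) \right) \right\rbrack = 1 - \frac{d\ln\left( {\text{Pwr}}_{\text{IT}} \right)}{d\ln(X)} = 1 - \eta$$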

When $\eta < 1$, throughput grows faster than power and tokens per watt rise. If throughput rises 1% and power rises 0.3%, tokens per watt improve by about 0.7%. The investment implication is robust: in an energy constrained market, throughput remains a $\text{PPMT}$ lever whenever the throughput power Pareto curve has power elasticity below one.
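
To see when $\eta < 1$ holds, the idle and TDP power model can be combined with the elasticity definition. The sketch below makes one assumption that the text does not state: activity is linear in throughput, $a(X) = X/X_{\max}$. Under that assumption $\eta$ equals the dynamic share of total power, so it stays below one whenever the idle floor is positive. All numbers are hypothetical.

```python
import math


def pwr_it(x, x_max, p_idle, p_tdp):
    """Power draw with activity assumed linear in throughput."""
    return p_idle + (p_tdp - p_idle) * (x / x_max)


def local_elasticity(x, x_max, p_idle, p_tdp, rel_step=1e-4):
    """Numerical d ln(power) / d ln(throughput) via a central
    difference in log space."""
    lo, hi = x * (1 - rel_step), x * (1 + rel_step)
    d_ln_p = math.log(pwr_it(hi, x_max, p_idle, p_tdp)
                      / pwr_it(lo, x_max, p_idle, p_tdp))
    return d_ln_p / math.log(hi / lo)


# Hypothetical platform: 200 W idle, 1000 W TDP, 1000 tokens/s at full activity.
X_MAX, P_IDLE, P_TDP = 1000.0, 200.0, 1000.0

for x in (250.0, 500.0, 750.0):
    eta = local_elasticity(x, X_MAX, P_IDLE, P_TDP)
    total = pwr_it(x, X_MAX, P_IDLE, P_TDP)
    dynamic_share = (total - P_IDLE) / total
    print(f"X = {x:.0f}: eta = {eta:.3f}, "
          f"dynamic share of power = {dynamic_share:.3f}")
```

A real curve need not be linear, so this illustrates the mechanism and says nothing definitive about any specific system. The larger the idle floor relative to total draw, the further $\eta$ sits below one.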

Accounting for energy scarcity changes the unit of ROI. Under capital scarcity, throughput amortizes capex over tokens. Under energy scarcity, throughput raises revenue per contracted watt and lowers the energy component of $\text{PPMT}$ whenever tokens scale faster than watts. In either scenario, throughput is the lever that relaxes the binding constraint, whether that is capital, power, or both.

Operating Implication: Run It Hot

The economics reduce to a single sentence: what matters is paid tokens per dollar of TCO and per contracted watt.

Higher throughput amortizes capex over more tokens and raises revenue per contracted megawatt, as long as tokens scale faster than watts. Counterintuitively, this matters most in a power constrained market: even as power prices rise, the value produced per megawatt rises faster, and ROI improves.

So run it hot.

Sources

Goldman Sachs, Tracking Trillions: The Assumptions Shaping the Scale of the AI Build Out. Baseline aggregate AI capex: about $7.6 trillion between 2026 and 2031; annual AI capex rising from $765 billion in 2026 to $1.6 trillion in 2031.
McKinsey & Company, The $7 trillion data center build out: How industrials can capture their share.
International Energy Agency, Key Questions on Energy and AI.
Electric Power Research Institute, Powering Intelligence 2026: Updated Scenarios of U.S. Data Center Electricity Use and Power Strategies; summary brief.
Alphabet Investor Relations, First quarter 2026 earnings call transcript.
SemiAnalysis, GPU Cloud Economics Explained. The simple GPU server table shows $7,025.9 per month of capital cost, $1,871.8 per month of hosting cost, $1.203 per GPU hour of capital cost, $0.321 per GPU hour of hosting cost, and $1.524 per GPU hour of total cost. The same table assumes 10,200 W of server power, 1.25 PUE, and $0.087 per kWh electricity.
Yuan et al., LLM Inference Unveiled: Survey and Roofline Model Insights, arXiv:2402.16363.
Wikipedia, Elasticity (economics).