Analysis · June 24, 2026 · 3 min read

Power costs 40% of AI factory OpEx. Here's how to cut it.

Power efficiency is now the primary constraint on AI factory profitability. NVIDIA and researchers outline full-stack optimizations that can cut energy per token without sacrificing speed or scale.

Our Take

NVIDIA's post conflates system architecture wins (liquid cooling, power allocation) with model-level choices (MoE, precision tuning), making it hard to isolate what operators actually control or what delivers measurable ROI on existing hardware.

Why it matters

Operators running at megawatt to gigawatt scale are hitting fixed power budgets from regional providers. Even 5% efficiency gains translate to millions in OpEx savings or new revenue capacity without infrastructure investment.

Do this week

Data center operators: benchmark your current tokens-per-watt on inference using NVIDIA TensorRT-LLM before and after switching to lower-precision formats (NVFP4 vs FP8) to quantify savings on your specific workloads.

Power becomes the binding constraint on token economics

Power costs account for roughly 40% of operating expenses in large-scale AI inference operations (per NVIDIA). Unlike compute or memory, power is capped by regional grid capacity. This forces a new optimization metric: tokens per watt, which directly maps to cost per token and, by extension, margin per token sold.
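The mapping from tokens per watt to cost per token can be made concrete with a few lines of arithmetic. The sketch below is illustrative only: the function names and the throughput, power, and electricity-price figures are assumptions, not numbers from NVIDIA's post, and it covers the energy cost alone rather than total cost per token.

```python
JOULES_PER_KWH = 3.6e6  # 1 kWh = 3.6 million joules


def tokens_per_joule(tokens_per_second: float, avg_power_watts: float) -> float:
    """Throughput divided by average power draw (tokens per watt-second)."""
    if avg_power_watts <= 0:
        raise ValueError("avg_power_watts must be positive")
    return tokens_per_second / avg_power_watts


def energy_cost_per_million_tokens(
    tokens_per_second: float, avg_power_watts: float, usd_per_kwh: float
) -> float:
    """Electricity cost (USD) of generating one million tokens."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    joules_per_token = avg_power_watts / tokens_per_second
    kwh_per_token = joules_per_token / JOULES_PER_KWH
    return kwh_per_token * usd_per_kwh * 1e6


if __name__ == "__main__":
    # Hypothetical server: 10,000 tokens/s at 8 kW average draw, $0.08 per kWh.
    print(tokens_per_joule(10_000, 8_000))
    print(energy_cost_per_million_tokens(10_000, 8_000, 0.08))
```

Doubling tokens per joule at the same electricity price halves the energy cost per token, which is why the metric tracks margin so closely.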

NVIDIA's engineering blog outlines three categories of levers:

  • Hardware: Direct-to-chip liquid cooling at a 45°C inlet temperature to improve (that is, lower) power usage effectiveness (PUE). The GB200 NVL72 system includes in-rack power smoothing to flatten current spikes and enable denser GPU deployment within the same power envelope.
  • Software orchestration: NVIDIA DSX, a facility-scale platform that performs real-time power reallocation, dynamic workload scheduling, and recovery of "stranded" power at the rack level. DSX MaxLPS operates within the data center; DSX Flex connects to grid signals.
  • Model and precision selection: Mixture-of-experts (MoE) models activate only a subset of parameters per token, lowering per-token compute cost relative to dense models. Lower-precision formats like NVFP4 deliver more throughput per watt than FP8 at equivalent accuracy (per NVIDIA benchmarks).
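For the hardware lever, it helps to remember that PUE is total facility power divided by IT power, so a lower value is better. As a rough sketch, the snippet below shows how much power reaches the IT load under a fixed facility budget. The function name and the PUE values are hypothetical and are not taken from NVIDIA's post.

```python
def it_power_mw(facility_budget_mw: float, pue: float) -> float:
    """Power available to IT equipment, given PUE = facility power / IT power."""
    if pue < 1.0:
        raise ValueError("PUE cannot be below 1.0")
    return facility_budget_mw / pue


if __name__ == "__main__":
    budget_mw = 100.0  # hypothetical fixed grid allocation
    for pue in (1.4, 1.25, 1.15):  # illustrative values only
        print(f"PUE {pue}: {it_power_mw(budget_mw, pue):.1f} MW for IT load")
```

Under these assumed numbers, every reduction in PUE frees power for GPUs without any change to the grid allocation.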

Across six GPU architecture generations, NVIDIA claims inference throughput per megawatt has improved 1,000,000x. The company also cites research from the ML.ENERGY Initiative at the University of Michigan showing that coordinated GPU clock tuning during training (lowering clock speeds on GPUs that are off the critical path while critical-path GPUs run at full speed) reduces total training energy by up to 25% without extending wall-clock time.

The efficiency problem is real; attribution is murkier

The underlying pressure is genuine. At megawatt to gigawatt scale, even single-digit percentage gains in tokens per watt unlock millions in margin or new capacity without buying new hardware. Inference directly drives revenue, so maximizing inference throughput per watt is a natural priority.
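A back-of-the-envelope calculation shows why single-digit gains matter at this scale. The sketch below assumes a hypothetical 100 MW load running all year at $0.08 per kWh. Neither figure comes from the source, and the function names are illustrative.

```python
HOURS_PER_YEAR = 8760  # non-leap year


def annual_energy_cost_usd(load_mw: float, usd_per_kwh: float) -> float:
    """Yearly electricity bill for a constant load."""
    return load_mw * 1_000 * HOURS_PER_YEAR * usd_per_kwh


def savings_at_fixed_output(annual_cost_usd: float, efficiency_gain: float) -> float:
    """Cost saved when serving the same tokens after a tokens-per-watt gain.

    Energy needed scales as 1 / (1 + gain), so a 5% gain saves about 4.8%.
    """
    return annual_cost_usd * (1 - 1 / (1 + efficiency_gain))


if __name__ == "__main__":
    cost = annual_energy_cost_usd(100, 0.08)     # roughly $70.1M per year
    saved = savings_at_fixed_output(cost, 0.05)  # roughly $3.3M per year
    print(f"annual energy cost: ${cost:,.0f}")
    print(f"saved by a 5% tokens-per-watt gain: ${saved:,.0f}")
```

The alternative reading is to hold power fixed instead: the same 5% gain then yields 5% more tokens to sell from the same grid allocation.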

The post gets slippery where it mixes claims of different kinds. The 1,000,000x improvement across six generations reflects Moore's Law and GPU architecture evolution, not a single product or technique. The 45°C liquid cooling, power smoothing, and dynamic reallocation are infrastructure wins. The MoE architecture and precision tuning are model selection wins. An operator cannot pull one lever and expect the gains from all three categories.

The ML.ENERGY training work is independent academic research, but the MoE and precision claims rely on NVIDIA's own benchmarks without third-party reproduction. The DeepSeek-R1 example shows that MoE can outperform dense models on intelligence per token, but this is an architectural advantage, not an NVIDIA-specific one.

Separate the infrastructure wins from the model wins

If your data center is power-constrained and you have capital, liquid cooling and dynamic power allocation (DSX) are real ROI levers that apply to any workload. If you are software-focused, precision tuning (NVFP4 vs FP8) on your inference engine (TensorRT-LLM) is a starting point, but it requires benchmarking on your exact workload mix. MoE selection is a model choice, not a facility choice, and trades inference latency for efficiency.

NVIDIA DSX is described as an "open platform" but is explicitly tied to NVIDIA compute and OEM partners. Operators locked into alternative accelerators or multi-vendor deployments will not see the full stack of gains claimed.

The most actionable claim is the training speed tuning: if you are running Megatron-LM training at scale, profiling your critical path and intentionally lowering clock speed on non-critical GPUs can recover 10–25% energy with no wall-clock penalty (per ML.ENERGY / NVIDIA research). This is a software knob, not a hardware buy.
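The idea behind clock tuning can be illustrated with a toy model. This is not the ML.ENERGY algorithm, which works on the full training computation graph. The sketch assumes a pipeline whose iteration time is set by its slowest stage and that stage time scales inversely with GPU clock, which is a simplification. The stage times and the clock list are hypothetical.

```python
from typing import List, Sequence


def pick_clocks(
    stage_times_s: Sequence[float],
    f_max_mhz: float,
    available_mhz: Sequence[float],
) -> List[float]:
    """For each stage, pick the lowest clock that still finishes within the
    critical (slowest) stage's time, measured at the maximum clock.

    Assumption: time at clock f equals time_at_f_max * f_max / f.
    """
    critical = max(stage_times_s)
    plan = []
    for t in stage_times_s:
        feasible = [
            f for f in available_mhz if t * f_max_mhz / f <= critical + 1e-12
        ]
        # Fall back to the maximum clock if nothing slower fits.
        plan.append(min(feasible) if feasible else f_max_mhz)
    return plan


if __name__ == "__main__":
    # Hypothetical per-stage times (seconds) measured at 1800 MHz.
    times = [1.0, 0.8, 0.5]
    print(pick_clocks(times, 1800, [900, 1200, 1500, 1800]))
    # prints [1800, 1500, 900]: the critical stage keeps full speed
```

Because the slowest stage is left untouched, the iteration time in this model does not change. The energy saved depends on how power scales with clock on the actual hardware, which has to be measured rather than assumed.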
