Our Take
The memory wall is structural, not a supply problem. Adding more silicon will not close the gap between how fast processors compute and how fast memory can feed them.
Why it matters
If memory constraints are fundamentally hard to solve, the economics of training and deploying large models shift. Teams building on LLMs need to plan for persistent, not temporary, constraints.
Do this week
Infrastructure teams: audit your model's memory bandwidth requirements against your hardware roadmap for the next 18 months and identify which workloads will starve first.
The Memory Problem Resists Hardware Scaling
The Wall Street Journal reports that memory constraints are emerging as a near-intractable bottleneck for large language models. The issue is not capacity alone but bandwidth: the speed at which data moves between compute cores and memory. As models grow, compute scales faster than the physical pathways that feed data to processors can support, creating a widening imbalance.
Unlike raw compute, which has benefited from decades of Moore's Law gains, memory bandwidth has improved far more slowly. Doubling a GPU's memory capacity does not double the rate at which that memory can deliver data. This asymmetry means that adding more processors to a training run does not yield proportional speedups once the workload is limited by data movement rather than arithmetic.
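The asymmetry can be sketched with a roofline-style estimate, in which achievable throughput is capped by either peak compute or by arithmetic intensity (FLOPs per byte moved) times peak bandwidth. The hardware figures and the intensity value below are hypothetical, chosen only for illustration, and do not describe any real product.

```python
def attainable_flops(intensity, peak_flops, peak_bandwidth):
    """Roofline estimate of the FLOP/s a kernel can reach.

    intensity      -- FLOPs performed per byte moved to or from memory
    peak_flops     -- peak compute rate in FLOP/s
    peak_bandwidth -- peak memory bandwidth in bytes/s
    """
    return min(peak_flops, intensity * peak_bandwidth)


# Hypothetical accelerator (assumed numbers, not a real product).
PEAK_FLOPS = 1.0e15      # 1 PFLOP/s
PEAK_BANDWIDTH = 3.0e12  # 3 TB/s

# Assumed intensity of a bandwidth-hungry kernel, in FLOPs per byte.
low_intensity = 10.0

before = attainable_flops(low_intensity, PEAK_FLOPS, PEAK_BANDWIDTH)
after = attainable_flops(low_intensity, 2 * PEAK_FLOPS, PEAK_BANDWIDTH)

print(f"attainable today:                {before:.2e} FLOP/s")
print(f"attainable with doubled compute: {after:.2e} FLOP/s")
```

Under these assumptions the two figures are identical: doubling peak compute changes nothing while the kernel sits under the bandwidth ceiling.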
The constraint affects both training and inference. During training, weights, activations, gradients, and optimizer state must be shuttled between memory and processors billions of times. During inference, the model's weights must be read from memory for every generated token, which creates similar bandwidth pressure. Neither workload scales cleanly.
This Reframes What "Scaling" Actually Costs
For the past five years, the dominant story in AI has been that bigger models perform better, and that more compute reliably buys bigger models. That narrative rests on the assumption that hardware will keep up. The memory bandwidth wall suggests hardware has stopped cooperating.
If memory bandwidth is the binding constraint, then the cost per useful FLOP stops improving, because processors spend more of their time idle, waiting for data. Teams will face a choice: train smaller, more efficient models; accept longer training times; or redesign workloads to reduce memory bandwidth pressure (through techniques like quantization or sparse computation). Each option trades something away.
This matters for purchase decisions, hiring, and roadmap planning. If memory bandwidth is the real bottleneck, investments in additional GPUs without matching memory infrastructure will see diminishing returns. Cloud providers and chip manufacturers now have an incentive to market memory solutions, but the physics constraining those solutions remains stubborn.
Audit Your Memory Bandwidth Assumptions Now
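If you are planning model training or inference pipelines, stop assuming linear scaling. Measure the memory bandwidth and compute throughput your workload actually achieves on your target hardware, and compare each to its theoretical peak. If achieved bandwidth sits near peak while compute utilization stays low, you are memory-bandwidth-bound, not compute-bound. If both are low, the bottleneck is likely elsewhere, such as kernel launch overhead or interconnect communication.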
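A minimal sketch of that comparison follows, assuming you already have achieved FLOP/s and bytes/s figures from a profiler. The function name, the 0.7 saturation threshold, and every number in the example are assumptions made for illustration, not measurements or vendor guidance.

```python
def diagnose(achieved_flops, achieved_bandwidth, peak_flops, peak_bandwidth,
             saturation=0.7):
    """Compare achieved rates to peaks and name the likely bottleneck.

    saturation is an assumed rule-of-thumb threshold: a resource running
    at or above this fraction of its peak is treated as saturated.
    """
    compute_util = achieved_flops / peak_flops
    bandwidth_util = achieved_bandwidth / peak_bandwidth

    if bandwidth_util >= saturation and bandwidth_util > compute_util:
        verdict = "memory-bandwidth-bound"
    elif compute_util >= saturation:
        verdict = "compute-bound"
    else:
        verdict = "neither saturated: check latency, launch overhead, communication"
    return compute_util, bandwidth_util, verdict


# Hypothetical profiler readings on a hypothetical accelerator.
compute_util, bandwidth_util, verdict = diagnose(
    achieved_flops=1.2e14,       # assumed measurement, FLOP/s
    achieved_bandwidth=2.6e12,   # assumed measurement, bytes/s
    peak_flops=1.0e15,           # assumed peak, FLOP/s
    peak_bandwidth=3.0e12,       # assumed peak, bytes/s
)
print(f"compute {compute_util:.0%}, bandwidth {bandwidth_util:.0%}: {verdict}")
```

The threshold is a judgment call. Real kernels rarely reach 100% of either peak, so the useful signal is which resource is much closer to its ceiling.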
For teams training models, this means revisiting batch size, precision (FP8 vs. FP16 vs. FP32), and distributed training strategies. Lower precision cuts the bytes moved per operation, and larger batches reuse each loaded weight across more samples, but lower precision can cost accuracy and very large batches can slow convergence and need more memory for activations. The tradeoff is worth modeling explicitly now rather than discovering it mid-run.
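One rough way to model this is to compute the arithmetic intensity (FLOPs per byte moved) of a single linear layer as batch size and precision change. The sketch below assumes every tensor is read or written exactly once, that inputs, weights, and outputs share one precision, and that the layer is 4096 by 4096. All of these are simplifications chosen for illustration.

```python
BYTES_PER_ELEMENT = {"FP32": 4, "FP16": 2, "FP8": 1}


def linear_layer_intensity(batch, d_in, d_out, precision):
    """FLOPs per byte for y = x @ W, with x (batch, d_in) and W (d_in, d_out).

    Assumes each tensor crosses the memory bus once, which ignores caching
    and kernel fusion.
    """
    size = BYTES_PER_ELEMENT[precision]
    flops = 2 * batch * d_in * d_out  # one multiply and one add per weight per sample
    bytes_moved = size * (d_in * d_out + batch * d_in + batch * d_out)
    return flops / bytes_moved


for precision in ("FP32", "FP16", "FP8"):
    for batch in (1, 64, 1024):
        intensity = linear_layer_intensity(batch, 4096, 4096, precision)
        print(f"{precision} batch={batch:<5} intensity={intensity:8.1f} FLOPs/byte")
```

Under these assumptions intensity rises with batch size, because each weight is reused across more samples, and halving the bytes per element doubles it. What the sketch cannot tell you is the accuracy or convergence cost of either change, which has to be measured on the actual model.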
For inference, quantization and batching strategies that reduce the number of memory loads per token become essential. If your current serving setup assumes you can linearly add GPUs to handle more tokens per second, you need to re-examine that math against your actual hardware's memory bandwidth.
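The serving math can be checked on the back of an envelope. The sketch below estimates decode throughput on a single device, assuming the weights are read once per decoding step and that each token costs roughly two FLOPs per parameter. It ignores KV-cache traffic, activations, and multi-GPU communication, uses the same compute peak for every precision, and all model and hardware numbers are hypothetical.

```python
def decode_tokens_per_second(params, bytes_per_param, batch,
                             peak_flops, peak_bandwidth):
    """Rough upper bound on decode throughput for one device.

    Each step reads all weights once (shared by the whole batch) and does
    about 2 FLOPs per parameter per token. The slower of the two sets the
    step time.
    """
    memory_time = params * bytes_per_param / peak_bandwidth
    compute_time = 2 * params * batch / peak_flops
    step_time = max(memory_time, compute_time)
    return batch / step_time


# Hypothetical model and accelerator (assumed numbers, not a real product).
PARAMS = 7e9
PEAK_FLOPS = 1.0e15      # FLOP/s
PEAK_BANDWIDTH = 3.0e12  # bytes/s

for label, bytes_per_param in (("FP16", 2), ("FP8", 1)):
    for batch in (1, 8, 64):
        rate = decode_tokens_per_second(
            PARAMS, bytes_per_param, batch, PEAK_FLOPS, PEAK_BANDWIDTH)
        print(f"{label} batch={batch:<3} ceiling = {rate:,.0f} tokens/s")
```

In this simplified model, batching and quantization both raise the ceiling because they cut the bytes loaded per generated token. Raising the compute peak alone would not change any of the printed figures, since every case shown is limited by memory time.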