Read PyTorch traces to spot where your model wastes GPU time

How to read a PyTorch profiler trace

Hugging Face published the second article in a series on profiling PyTorch models with torch.profiler. The piece walks through three progressively more complex workloads: a single nn.Linear layer, the same layer compiled with torch.compile, and a three-layer GeGLU MLP (gate projection, up projection, GELU, element-wise multiply, down projection).

The core finding: a single nn.Linear already dispatches to cuBLAS's fused addmm kernel, which computes the matrix multiply and the bias addition in one GPU kernel. The trace shows no separate add operation because the bias is applied as an epilogue, a small computation folded into the kernel's writeback phase. torch.compile does not remove any GPU kernels here; it only trims CPU dispatch overhead by computing strides at compile time and hard-coding them into the aten::addmm call.
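A minimal sketch of how such a trace might be collected, assuming a recent PyTorch build. This is not the article's 02_linear.py script: the layer and batch sizes are arbitrary placeholders, and the code falls back to CPU when no GPU is available, in which case the table lists CPU-side operators only.

```python
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

# Fall back to CPU so the sketch still runs without a GPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Sizes are arbitrary placeholders, not the article's settings.
layer = nn.Linear(512, 512, bias=True).to(device)
x = torch.randn(64, 512, device=device)

activities = [ProfilerActivity.CPU]
if device == "cuda":
    activities.append(ProfilerActivity.CUDA)

with torch.no_grad():
    layer(x)  # warm-up so one-time initialization stays out of the trace
    with profile(activities=activities) as prof:
        layer(x)
        if device == "cuda":
            torch.cuda.synchronize()  # wait for the GPU before the profiler stops

# Operator names recorded on the CPU side of the trace.
op_names = {evt.key for evt in prof.key_averages()}
print("aten::addmm recorded:", "aten::addmm" in op_names)
print(prof.key_averages().table(row_limit=15))
```

With a 2D input and a bias, the operator table should list aten::addmm under aten::linear rather than a matmul followed by a separate add.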

For the MLP, the profiler shows exactly five GPU kernels per forward pass: three GEMMs (one per linear layer) and two pointwise kernels (GELU and element-wise multiply). Each GEMM does an occupancy query before launch; pointwise ops launch directly. Metadata-only ops like transpose, reshape, and view show 0.000us of CUDA time because they do not launch kernels.
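As a rough illustration of where those five kernels come from, the sketch below defines a GeGLU MLP and counts the ATen operators in one forward pass. The class name GeGLUMLP, the dimensions, and bias=False are assumptions made for this example, not details taken from the article's scripts. It runs on CPU, so it counts ATen operators rather than GPU kernels; on a GPU, each of these five operators would correspond to one of the kernels described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.profiler import profile, ProfilerActivity


class GeGLUMLP(nn.Module):
    """Hypothetical GeGLU MLP: down(gelu(gate(x)) * up(x))."""

    def __init__(self, dim, hidden):
        super().__init__()
        # bias=False is an assumption made for this sketch.
        self.gate_proj = nn.Linear(dim, hidden, bias=False)
        self.up_proj = nn.Linear(dim, hidden, bias=False)
        self.down_proj = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        gate = F.gelu(self.gate_proj(x))  # GEMM 1, then pointwise GELU
        up = self.up_proj(x)              # GEMM 2
        return self.down_proj(gate * up)  # pointwise multiply, then GEMM 3


torch.manual_seed(0)
mlp = GeGLUMLP(dim=64, hidden=256)
x = torch.randn(8, 64)

with torch.no_grad(), profile(activities=[ProfilerActivity.CPU]) as prof:
    y = mlp(x)

# Number of calls per ATen operator in the profiled forward pass.
op_counts = {evt.key: evt.count for evt in prof.key_averages()}
for name in ("aten::mm", "aten::gelu", "aten::mul"):
    print(name, op_counts.get(name, 0))
```

View-type operators such as aten::t and aten::transpose also appear in op_counts, but they only rewrite tensor metadata and do no arithmetic.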

Compile works on composition, not on single ops

Many practitioners reflexively wrap a slow model in torch.compile. The article makes a critical point: compile has little to do when the model is already a stack of fused operations. The cuBLAS and CUTLASS libraries ship precompiled kernel binaries for each layout combination, labeled with suffixes like _tn_ (in BLAS terms, the first operand is transposed and the second is not). The dispatcher picks the right kernel based on strides. torch.compile cannot select a better kernel; it can only eliminate the CPU overhead of computing those strides.

torch.compile wins when several separate ops can be fused into a single kernel. For a single linear layer, that window is already closed. The real payoff comes from stacking operations (a full MLP, a residual path, a full transformer block) and letting Inductor (compile's backend) fuse across their boundaries.

The second-order implication: if your profile shows high CPU dispatch time but few GPU kernels, torch.compile may help by cutting that overhead. If you see many small GPU kernels, compile is more likely to fuse them. If you already see one large kernel, compile will not make that kernel faster.
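The rule of thumb above can be written down as a small triage function. This is only a sketch of the reasoning: ProfileSummary, compile_outlook, and the many_kernels threshold are hypothetical names and values invented for illustration, and the numbers you feed in would come from your own trace.

```python
from dataclasses import dataclass


@dataclass
class ProfileSummary:
    """Hypothetical per-step numbers read off a profiler trace."""
    cpu_dispatch_us: float  # CPU time spent dispatching ops
    gpu_busy_us: float      # total time spent inside GPU kernels
    num_gpu_kernels: int    # kernels launched per step


def compile_outlook(summary, many_kernels=8):
    """Rough guess at what torch.compile can do for this profile.

    The many_kernels threshold is an arbitrary placeholder.
    """
    if summary.num_gpu_kernels >= many_kernels:
        return "likely to fuse kernels"
    if summary.cpu_dispatch_us > summary.gpu_busy_us:
        return "may cut CPU dispatch overhead"
    return "little to gain"


# Few kernels, but the CPU side dominates the step.
print(compile_outlook(ProfileSummary(120.0, 30.0, 1)))
# Many small kernels: a fusion candidate.
print(compile_outlook(ProfileSummary(200.0, 150.0, 40)))
# One large kernel that already dominates the step.
print(compile_outlook(ProfileSummary(10.0, 900.0, 1)))
```

Treat the output as a prompt for where to look next in the trace, not as a prediction of speedup.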

Learn to read kernel names

The article emphasizes a practical habit: when comparing profiler traces, look at the kernel names in the GPU lane. If two runs show the same name (e.g., cutlass_80_wmma_tensorop_bf16_s161616gemm_bf16_32x32_32x1_tn_align8), the GPU is doing identical math. If the layout suffix changes (_tn_ vs _nn_) or the data type shifts (bf16 vs fp16), the dispatcher took a different branch and the GPU is doing different work.
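This comparison can be scripted once kernel names have been copied out of two traces. The sketch below uses only the standard library; the helper names kernel_signature and describe_change are hypothetical, the regular expressions cover only the layout and dtype tokens mentioned here, and the second kernel name is a made-up variant for illustration rather than a name observed in a real trace.

```python
import re

# Layout suffixes such as _tn_ or _nn_, and a few common dtype tokens.
LAYOUT_RE = re.compile(r"_(nn|nt|tn|tt)_")
DTYPE_RE = re.compile(r"(bf16|fp16|f16|tf32|fp32|f32)")


def kernel_signature(name):
    """Return (layout, dtype) parsed from a kernel name, or None where absent."""
    layout = LAYOUT_RE.search(name)
    dtype = DTYPE_RE.search(name)
    return (
        layout.group(1) if layout else None,
        dtype.group(1) if dtype else None,
    )


def describe_change(before, after):
    """Summarize how two kernel names differ."""
    if before == after:
        return "identical kernel"
    (b_layout, b_dtype), (a_layout, a_dtype) = kernel_signature(before), kernel_signature(after)
    changes = []
    if b_layout != a_layout:
        changes.append(f"layout {b_layout} -> {a_layout}")
    if b_dtype != a_dtype:
        changes.append(f"dtype {b_dtype} -> {a_dtype}")
    return ", ".join(changes) or "different kernel, same layout and dtype"


eager = "cutlass_80_wmma_tensorop_bf16_s161616gemm_bf16_32x32_32x1_tn_align8"
# Hypothetical variant, made up only to show a layout change.
other = eager.replace("_tn_", "_nn_")

print(kernel_signature(eager))
print(describe_change(eager, eager))
print(describe_change(eager, other))
```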

This is far more useful than raw cycle counts. A name match confirms your optimization changed only CPU overhead. A name change means the GPU kernel itself changed, and you need to understand why. The series provides scripts (02_linear.py, 03_simple_mlp.py, 03_kernels_mlp.py) that run on NVIDIA A100 hardware and emit traces you can inspect in Perfetto or via Hugging Face's trace-util utility.
