Analysis · June 11, 2026 · 3 min read

Read PyTorch traces to spot where your model wastes GPU time

Hugging Face shows how to profile nn.Linear and MLPs using PyTorch's built-in tracer. Learn why compile helps stacked ops but not single layers, and how to read kernel names.

Our Take

This is a working tutorial on profiler hygiene, not a performance claim. The takeaway: torch.compile needs multiple ops before it has anything to fuse, and most of the overhead visible in these traces is CPU dispatch, not GPU math.

Why it matters

Many practitioners reach for torch.compile when a model feels slow, but a single GEMM-with-bias (like nn.Linear) already runs as one fused cuBLAS kernel. Knowing what the profiler actually shows prevents wasted effort chasing fusions that do not exist.

Do this week

ML Engineer: run torch.profiler on your next model checkpoint before and after compile, compare the kernel names in the GPU lane (look for _tn_ vs _nn_ suffixes), and only then decide if compile is worth the compilation overhead.

How to read a PyTorch profiler trace

Hugging Face published the second in a series on profiling PyTorch models using torch.profiler. The piece walks through three progressively complex workloads: a single nn.Linear layer, the same layer compiled with torch.compile, and a three-layer GeGLU MLP (gate projection, up projection, GELU, element-wise multiply, down projection).

The core finding: a single nn.Linear already dispatches to cuBLAS's fused addmm kernel, which computes the matrix multiply and bias addition in one GPU kernel. There is no separate add operation in the trace because the bias is an epilogue, a small computation baked into the kernel's writeback phase. Compile does not remove any GPU kernels here; it only removes CPU dispatch overhead by pre-computing strides at compile time and hard-coding them into the aten::addmm call.
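A minimal sketch of the single-layer experiment, assuming a recent PyTorch install. The layer sizes, the warm-up count, and the `profile_linear` helper name are illustrative choices and are not taken from the article's scripts. The code falls back to CPU when no GPU is present, in which case the table contains no CUDA kernels.

```python
import torch
from torch import nn
from torch.profiler import profile, ProfilerActivity


def profile_linear(batch=64, in_features=256, out_features=256, steps=5):
    """Profile a few forward passes of one nn.Linear (sizes are arbitrary)."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    layer = nn.Linear(in_features, out_features, bias=True).to(device)
    x = torch.randn(batch, in_features, device=device)

    activities = [ProfilerActivity.CPU]
    if device == "cuda":
        activities.append(ProfilerActivity.CUDA)

    with torch.no_grad():
        # Warm up so one-time initialization does not pollute the trace.
        for _ in range(3):
            layer(x)
        if device == "cuda":
            torch.cuda.synchronize()

        with profile(activities=activities) as prof:
            for _ in range(steps):
                layer(x)
            if device == "cuda":
                torch.cuda.synchronize()
    return prof


prof = profile_linear()
# Operator names recorded on the CPU side of the trace.
op_names = {evt.key for evt in prof.key_averages()}
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```

With a 2D input and a bias, the operator list is where you would look for `aten::addmm` and confirm that no separate add follows it. Calling `prof.export_chrome_trace("linear_trace.json")` afterwards writes a trace file you can open in a trace viewer.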

For the MLP, the profiler shows exactly five GPU kernels per forward pass: three GEMMs (one per linear layer) and two pointwise kernels (GELU and element-wise multiply). Each GEMM does an occupancy query before launch; pointwise ops launch directly. Metadata-only ops like transpose, reshape, and view show 0.000us of CUDA time because they do not launch kernels.
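The five-kernel count can be read straight off the forward pass. The sketch below is one plausible way to write the GeGLU MLP described above. The class name, the layer sizes, and `bias=False` are assumptions, not details confirmed by the article. The final lines illustrate why metadata-only ops cost no GPU time: a transpose returns a view over the same storage.

```python
import torch
from torch import nn
import torch.nn.functional as F


class GeGLUMLP(nn.Module):
    """Hypothetical GeGLU MLP; sizes and bias=False are assumptions."""

    def __init__(self, hidden=128, intermediate=512):
        super().__init__()
        self.gate_proj = nn.Linear(hidden, intermediate, bias=False)
        self.up_proj = nn.Linear(hidden, intermediate, bias=False)
        self.down_proj = nn.Linear(intermediate, hidden, bias=False)

    def forward(self, x):
        gate = F.gelu(self.gate_proj(x))  # GEMM 1, then pointwise GELU
        up = self.up_proj(x)              # GEMM 2
        fused = gate * up                 # pointwise multiply
        return self.down_proj(fused)      # GEMM 3


mlp = GeGLUMLP()
x = torch.randn(4, 128)
with torch.no_grad():
    y = mlp(x)

# A transpose only rewrites shape and stride metadata: same storage, no copy.
w = mlp.gate_proj.weight
w_t = w.t()
print(y.shape, w_t.data_ptr() == w.data_ptr())
```

In eager mode each commented line corresponds to one of the kernels counted above, which makes it easier to match rows in the profiler table to lines of model code.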

Compile works on composition, not on single ops

A reflex many practitioners have is to wrap a slow model in torch.compile. The article makes a critical point: compile has little to do when an op already maps to a single fused library kernel. The cuBLAS and CUTLASS libraries ship precompiled kernel binaries for each layout combination, labeled with suffixes like _tn_, where each letter records whether one GEMM operand is transposed (t) or not (n). The dispatcher picks the right kernel based on strides. Compile cannot improve on that kernel; it can only eliminate the CPU overhead of computing those strides.

Where compile wins is when you have multiple separate ops that can be fused into a single kernel. For a single linear layer, that window is already closed. The real payoff comes from stacking operations (a full MLP, a residual path, a full transformer block) and letting Inductor (compile's backend) fuse across boundaries.

The second-order implication is a three-way triage. If your profile shows high CPU dispatch time but few GPU kernels, compile may help by trimming that dispatch overhead. If you see many small GPU kernels, compile has real fusion opportunities. If you see one large kernel that already dominates the runtime, compile will not improve it.
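That triage can be written down as a small rule of thumb. Everything below is a hypothetical sketch: the `ProfileSummary` fields, the function name, and both thresholds are illustrative assumptions, not numbers from the article, and you would tune them for your own workload.

```python
from dataclasses import dataclass


@dataclass
class ProfileSummary:
    """Hand-collected numbers from one profiled step (hypothetical schema)."""
    wall_time_us: float   # end-to-end time of the profiled region
    gpu_busy_us: float    # summed GPU kernel time in that region
    kernel_count: int     # number of GPU kernels launched


def compile_triage(summary, idle_threshold=0.5, many_kernels=20):
    """Classify a profile; both thresholds are arbitrary assumptions."""
    if summary.wall_time_us <= 0:
        raise ValueError("wall_time_us must be positive")
    gpu_idle_fraction = 1.0 - summary.gpu_busy_us / summary.wall_time_us

    if summary.kernel_count >= many_kernels:
        # Many small kernels: fusion has something to work with.
        return "fusion-candidate"
    if gpu_idle_fraction >= idle_threshold:
        # Few kernels, GPU mostly waiting: CPU dispatch dominates.
        return "dispatch-bound"
    # Few kernels that keep the GPU busy: little left for compile to do.
    return "gpu-bound"


print(compile_triage(ProfileSummary(1000.0, 200.0, 5)))
print(compile_triage(ProfileSummary(1000.0, 600.0, 50)))
print(compile_triage(ProfileSummary(1000.0, 950.0, 1)))
```

The ordering of the checks is a design choice: a high kernel count is treated as the strongest signal because fusion is where the article says compile pays off.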

Learn to read kernel names

The article emphasizes a practical habit: when comparing profiler traces, look at the kernel names in the GPU lane. If two runs show the same name (e.g., cutlass_80_wmma_tensorop_bf16_s161616gemm_bf16_32x32_32x1_tn_align8), the GPU is doing identical math. If the layout suffix changes (_tn_ vs _nn_) or the data type shifts (bf16 vs fp16), the dispatcher took a different branch and the GPU is doing different work.
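The name comparison is easy to automate. The sketch below splits a kernel name on underscores and pulls out the layout and data-type tokens. The token sets and helper names are assumptions for illustration; kernel naming varies across cuBLAS and CUTLASS versions, so treat a missing token as "unknown" rather than as proof of anything.

```python
# Assumed token vocabularies; extend them for the kernels you actually see.
LAYOUTS = {"nn", "nt", "tn", "tt"}
DTYPES = {"bf16", "fp16", "f16", "fp32", "f32", "tf32"}


def kernel_signature(name):
    """Return (layout token or None, set of dtype tokens) for a kernel name."""
    tokens = name.split("_")
    layout = next((t for t in tokens if t in LAYOUTS), None)
    dtypes = frozenset(t for t in tokens if t in DTYPES)
    return layout, dtypes


def compare_kernels(before, after):
    """Describe how two kernel names differ, if at all."""
    if before == after:
        return "identical kernel: only CPU-side overhead can differ"
    layout_b, dtypes_b = kernel_signature(before)
    layout_a, dtypes_a = kernel_signature(after)
    diffs = []
    if layout_b != layout_a:
        diffs.append(f"layout {layout_b} -> {layout_a}")
    if dtypes_b != dtypes_a:
        diffs.append(f"dtype {sorted(dtypes_b)} -> {sorted(dtypes_a)}")
    if not diffs:
        diffs.append("other name fields differ")
    return "different kernel: " + ", ".join(diffs)


baseline = "cutlass_80_wmma_tensorop_bf16_s161616gemm_bf16_32x32_32x1_tn_align8"
# Hypothetical variant, built only to exercise the comparison.
nn_variant = baseline.replace("_tn_", "_nn_")

print(compare_kernels(baseline, baseline))
print(compare_kernels(baseline, nn_variant))
```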

This is far more useful than raw cycle counts. A name match confirms your optimization changed only CPU overhead. A name change means the GPU kernel itself changed, and you need to understand why. The series provides scripts (02_linear.py, 03_simple_mlp.py, 03_kernels_mlp.py) that run on NVIDIA A100 hardware and emit traces you can inspect in Perfetto or via Hugging Face's trace-util utility.
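If you prefer a quick text summary over opening a viewer, an exported trace can be scanned directly. This sketch assumes the Chrome trace event layout that PyTorch profiler exports typically use: a top-level `traceEvents` list whose GPU kernel entries are complete events (`"ph": "X"`) with `"cat": "kernel"`. Verify the category name against your own trace file. The helper names and the second kernel name in the example data are hypothetical.

```python
import json
from collections import defaultdict


def load_trace(path):
    """Load a trace JSON file, e.g. one written by export_chrome_trace."""
    with open(path) as f:
        return json.load(f)


def summarize_kernels(trace):
    """Map kernel name -> (launch count, total duration in microseconds)."""
    totals = defaultdict(lambda: [0, 0.0])
    for event in trace.get("traceEvents", []):
        # Assumption: GPU kernels are complete events in category "kernel".
        if event.get("ph") == "X" and event.get("cat") == "kernel":
            entry = totals[event["name"]]
            entry[0] += 1
            entry[1] += float(event.get("dur", 0.0))
    return {name: (count, dur) for name, (count, dur) in totals.items()}


# Tiny synthetic trace so the example runs without a GPU or a trace file.
gemm = "cutlass_80_wmma_tensorop_bf16_s161616gemm_bf16_32x32_32x1_tn_align8"
example_trace = {
    "traceEvents": [
        {"ph": "X", "cat": "cpu_op", "name": "aten::addmm", "dur": 40.0},
        {"ph": "X", "cat": "kernel", "name": gemm, "dur": 10.0},
        {"ph": "X", "cat": "kernel", "name": gemm, "dur": 20.0},
        {"ph": "X", "cat": "kernel", "name": "hypothetical_pointwise", "dur": 5.0},
    ]
}

summary = summarize_kernels(example_trace)
for name, (count, total_us) in sorted(summary.items()):
    print(f"{count:3d} launches  {total_us:8.1f} us  {name}")
```

Running `summarize_kernels(load_trace(path))` on a before trace and an after trace gives two dictionaries whose keys you can compare to see whether any kernel name changed.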
