NVIDIA speeds BEV pooling 16–19x on RTX GPUs for autonomous vehicles

BEVPoolV3 cuts BEV pooling latency 16–19x on production GPUs

NVIDIA published an optimized kernel for BEV pooling, a core operation in autonomous vehicle and robotics perception pipelines. BEV pooling takes multicamera image features with depth information and scatters them into a shared top-down grid representation that downstream modules use for detection, occupancy prediction, and planning.

The new implementation, BEVPoolV3, reduces FP16 latency on RTX PRO 6000 Blackwell Max-Q from 274.0 µs with the prior BEVPoolV2 reference to 17.3 µs, a 15.84x speedup. On the same Blackwell platform, V3 also ships an FP8 variant at 16.4 µs (16.71x). On RTX A6000, V3 runs at 90.0 µs in FP16, a 19.31x speedup over BEVPoolV2. All figures are NVIDIA's own measurements.
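As a quick sanity check on the figures above, the Blackwell speedups can be recomputed from the quoted latencies. The sketch below assumes the 274.0 µs baseline applies to the Blackwell GPU (it reproduces both Blackwell ratios) and derives an implied A6000 baseline from the quoted 19.31x; that implied value is arithmetic on the quoted numbers, not a figure NVIDIA reported here.

```python
# Latencies in microseconds, as quoted in the article (NVIDIA's measurements).
V2_BLACKWELL_FP16_US = 274.0
V3_BLACKWELL_FP16_US = 17.3
V3_BLACKWELL_FP8_US = 16.4
V3_A6000_FP16_US = 90.0
A6000_QUOTED_SPEEDUP = 19.31


def speedup(baseline_us: float, optimized_us: float) -> float:
    """Return how many times faster the optimized kernel is."""
    return baseline_us / optimized_us


blackwell_fp16 = speedup(V2_BLACKWELL_FP16_US, V3_BLACKWELL_FP16_US)
blackwell_fp8 = speedup(V2_BLACKWELL_FP16_US, V3_BLACKWELL_FP8_US)

# Assumption: the A6000 baseline is not quoted, so back it out of the ratio.
implied_a6000_baseline_us = V3_A6000_FP16_US * A6000_QUOTED_SPEEDUP

print(f"Blackwell FP16: {blackwell_fp16:.2f}x")
print(f"Blackwell FP8:  {blackwell_fp8:.2f}x")
print(f"Implied A6000 V2 baseline: {implied_a6000_baseline_us:.1f} us")
```

The FP8 ratio matches the quoted 16.71x only when measured against the FP16 V2 baseline, which suggests that is how the figure was computed.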

The kernel removes four sources of redundant work: duplicate depth loads within each interval, inefficient memory packing that wastes aligned loads, runtime integer division in the inner loop, and scattered writes that required atomics. The production implementation uses five explicit arrays (ranks_depth, ranks_feat, ranks_bev, interval_starts, interval_lengths) instead of reconstructing indices at runtime.
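To make the five-array layout concrete, here is a minimal CPU reference sketch in plain Python. It is not NVIDIA's kernel: the function names, the list-based data layout, and the helper that derives intervals are illustrative assumptions. It only shows how points sorted by BEV cell can be grouped into intervals so that each output cell is written once, without atomics.

```python
def build_intervals(ranks_bev):
    """Group consecutive points that share a BEV cell (assumes ranks_bev is sorted)."""
    starts, lengths = [], []
    for i, cell in enumerate(ranks_bev):
        if i == 0 or cell != ranks_bev[i - 1]:
            starts.append(i)
            lengths.append(1)
        else:
            lengths[-1] += 1
    return starts, lengths


def bev_pool_reference(depth, feat, ranks_depth, ranks_feat, ranks_bev,
                       interval_starts, interval_lengths, num_cells, channels):
    """Depth-weighted sum of features per BEV cell, one write per interval."""
    out = [[0.0] * channels for _ in range(num_cells)]
    for start, length in zip(interval_starts, interval_lengths):
        acc = [0.0] * channels
        for i in range(start, start + length):
            weight = depth[ranks_depth[i]]  # read once per point, reused for all channels
            row = feat[ranks_feat[i]]
            for c in range(channels):
                acc[c] += weight * row[c]
        # Each interval owns exactly one cell, so a plain store is enough.
        out[ranks_bev[start]] = acc
    return out


# Toy example: three points, two feature rows, three BEV cells, two channels.
depth = [0.5, 0.25, 1.0]
feat = [[1.0, 2.0], [4.0, 8.0]]
ranks_depth = [0, 1, 2]
ranks_feat = [0, 1, 1]
ranks_bev = [0, 0, 2]
interval_starts, interval_lengths = build_intervals(ranks_bev)
pooled = bev_pool_reference(depth, feat, ranks_depth, ranks_feat, ranks_bev,
                            interval_starts, interval_lengths, 3, 2)
print(pooled)
```

In a GPU version, the outer loop over intervals would typically map to thread blocks or warps. The sketch keeps the property that matters here: because the index arrays are precomputed, the inner loop does no index reconstruction or integer division.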

Memory regime determines optimization strategy

The core insight is that the same BEV pooling operator requires fundamentally different tuning depending on GPU L2 cache size. The canonical real-world config from nuScenes has a 49 MB working set. On RTX A6000 (6 MB L2), this working set spills to DRAM, so optimization prioritizes byte reduction and cache-friendly output stores. On RTX PRO 6000 Blackwell Max-Q (128 MB L2), the same working set fits in L2 after initial fill, so the path shifts toward instruction efficiency, occupancy, and FP8 specialization.
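The regime check described above amounts to comparing the working set against L2 capacity. The following sketch uses the sizes quoted in the article; the function name, the simple "fits or not" threshold, and the returned labels are assumptions for illustration. Real behavior also depends on what else occupies the cache.

```python
MB = 1024 * 1024  # the article quotes sizes in MB; the exact unit does not change the outcome


def classify_memory_regime(working_set_bytes: int, l2_bytes: int) -> dict:
    """Return the memory regime and the tuning priorities the article associates with it."""
    if working_set_bytes <= l2_bytes:
        return {
            "regime": "l2-resident",
            "focus": ["instruction efficiency", "occupancy", "FP8 specialization"],
        }
    return {
        "regime": "dram-bound",
        "focus": ["byte reduction", "cache-friendly output stores"],
    }


working_set = 49 * MB  # nuScenes config cited in the article
a6000 = classify_memory_regime(working_set, 6 * MB)
blackwell = classify_memory_regime(working_set, 128 * MB)
print("RTX A6000:", a6000["regime"], a6000["focus"])
print("RTX PRO 6000 Blackwell Max-Q:", blackwell["regime"], blackwell["focus"])
```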

The performance results bear this out: smaller-L2 GPUs need different kernel code than large-L2 GPUs running the same logical operation. A practitioner who tunes on the wrong GPU class could apply the wrong optimization and miss the expected speedup.

Classify memory regime before touching kernel code

NVIDIA walked through a repeatable four-step workflow: classify whether the working set fits in L2 cache; profile with Nsight Compute to identify redundant scatter traffic; map the kernel launch shape to occupancy targets on the target GPU; and validate the bottleneck with Nsight Compute.

The implementation is exposed as a TensorRT IPluginV3 operator, so teams already using TensorRT for inference can swap in BEVPoolV3 without rewriting downstream perception modules. The plugin selects the appropriate kernel implementation based on GPU class and dtype.
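Dispatch by GPU class and dtype can be pictured as a lookup table. The sketch below is purely illustrative: the class labels, variant names, and function are hypothetical and are not part of the TensorRT API or NVIDIA's plugin. It reflects only what the article states, namely an FP16 path on both GPUs and an FP8 path on the Blackwell platform.

```python
# Hypothetical variant table; every key and name here is made up for illustration.
KERNEL_VARIANTS = {
    ("small_l2", "fp16"): "bevpool_v3_fp16_dram_tuned",
    ("large_l2", "fp16"): "bevpool_v3_fp16_l2_tuned",
    ("large_l2", "fp8"): "bevpool_v3_fp8_l2_tuned",
}


def select_kernel_variant(gpu_class: str, dtype: str) -> str:
    """Pick a kernel variant, failing loudly for unsupported combinations."""
    try:
        return KERNEL_VARIANTS[(gpu_class, dtype)]
    except KeyError:
        raise ValueError(
            f"no kernel variant for gpu_class={gpu_class!r}, dtype={dtype!r}"
        ) from None


print(select_kernel_variant("small_l2", "fp16"))
print(select_kernel_variant("large_l2", "fp8"))
```

Raising on an unknown combination, rather than silently falling back, makes a mismatch between the deployment GPU and the tuned variants visible early.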

For teams not using TensorRT, NVIDIA published the CUDA kernel structure (five-array scatter map, explicit index arrays, interval-owned writes) as a reference for applying the same techniques to other scatter-reduce workloads in robotics and spatial AI pipelines.
