Our Take
Standard quantization workflow packaged as a vendor library, but the attention mechanism handling and fine-grained layer control add practical value.
Why it matters
Practitioners need reliable quantization paths for consumer GPU deployment, and NVIDIA's approach addresses the common problem of accuracy degradation in attention layers.
Do this week
ML engineers: Test ModelOpt against your current quantization pipeline this week to compare accuracy retention on your specific models.
NVIDIA packages quantization tools in Model Optimizer
NVIDIA released Model Optimizer (ModelOpt), a Python library for post-training quantization that supports FP4, FP8, INT8, and INT4 formats. The library accepts Hugging Face, PyTorch, or ONNX models and includes algorithms like SmoothQuant, AWQ, SVDQuant, and Double Quantization.
NVIDIA demonstrated the workflow by quantizing a CLIP-ViT-L-14 model to FP8 format using 8,192 MS-COCO image-text pairs for calibration. The quantized model was evaluated on three benchmarks: CIFAR-100 zero-shot classification, ImageNet-1k zero-shot classification, and MS-COCO Captions zero-shot retrieval (company-reported results).
The library addresses attention mechanism quantization by registering custom quantized replacements for attention modules. For CLIP, this means intercepting scaled dot-product attention calls and inserting four quantizers around the fused kernel: three for Q/K/V tensors and one for the kernel output.
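In rough PyTorch terms, the pattern looks like the sketch below. This is a conceptual illustration, not ModelOpt's actual classes: quantizer_cls is a placeholder for whatever quantizer module the library supplies (nn.Identity keeps the sketch runnable on its own).

```python
import torch.nn as nn
import torch.nn.functional as F

class QuantizedSDPA(nn.Module):
    """Conceptual stand-in for a quantized attention replacement.

    Four quantizers bracket the fused scaled-dot-product-attention call:
    three on the Q/K/V inputs and one on the kernel output.
    """
    def __init__(self, quantizer_cls=nn.Identity):
        super().__init__()
        self.q_quant = quantizer_cls()
        self.k_quant = quantizer_cls()
        self.v_quant = quantizer_cls()
        self.out_quant = quantizer_cls()

    def forward(self, q, k, v, **kwargs):
        q, k, v = self.q_quant(q), self.k_quant(k), self.v_quant(v)
        out = F.scaled_dot_product_attention(q, k, v, **kwargs)
        return self.out_quant(out)
```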
Attention quantization remains the hard problem
Most quantization libraries struggle with attention because modern attention implementations dispatch to functional APIs, such as torch.nn.functional.scaled_dot_product_attention, that module walkers cannot intercept. NVIDIA's approach of registering quantized replacements for entire attention modules sidesteps this limitation.
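A minimal sketch of that replacement step, assuming a generic attention class and a factory that wraps it with quantizers (both names are hypothetical, not ModelOpt's API):

```python
import torch.nn as nn

def swap_attention_modules(model: nn.Module, attention_cls, make_quantized):
    """Replace every attention submodule with a quantized version.

    Because F.scaled_dot_product_attention is a plain function call, it never
    appears in model.named_modules(); swapping the parent attention module is
    the only place a library can insert quantizers around it.
    """
    for parent in model.modules():
        for name, child in parent.named_children():
            if isinstance(child, attention_cls):
                setattr(parent, name, make_quantized(child))
```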
The company's results show FP8 quantization preserving model quality, provided quantizers are disabled in the patch embedding layer. This selective approach lets practitioners keep accuracy while capturing memory and compute savings on consumer hardware like GeForce RTX GPUs.
The workflow follows standard fake quantization: insert quantizer modules, calibrate with representative data, simulate precision loss in floating point, then export to deployment frameworks like TensorRT for actual speedup.
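The core trick is easy to see in a toy symmetric INT8 round trip; this is a simplification of what per-tensor quantizers do, not ModelOpt's exact scheme:

```python
import torch

def fake_quantize_int8(x: torch.Tensor) -> torch.Tensor:
    """Quantize-dequantize in float: the values carry INT8 rounding error but
    keep their floating-point dtype, so existing eval code runs unchanged while
    accuracy reflects the lower precision."""
    scale = x.abs().max().clamp_min(1e-8) / 127.0   # symmetric per-tensor scale
    q = torch.clamp(torch.round(x / scale), -128, 127)
    return q * scale

x = torch.randn(4, 8)
print((x - fake_quantize_int8(x)).abs().max())  # small but nonzero error
```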
Six-stage workflow with granular control
ModelOpt implements a structured approach: prepare a quantization config, calibrate with a small data batch, apply fake quantization, evaluate accuracy, iterate on sensitive layers, then export for deployment. The library provides regex-based layer filtering to disable quantization selectively.
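Against NVIDIA's documented Python API, the first stages plus the layer filtering look roughly like this. mtq.quantize and mtq.FP8_DEFAULT_CFG come from the modelopt documentation; the name pattern for CLIP's patch embedding and the calib_loader are illustrative assumptions.

```python
import copy
import modelopt.torch.quantization as mtq

# Stage 1: pick a base config and disable quantization for sensitive layers
# by name pattern (the exact pattern depends on the model's module names).
config = copy.deepcopy(mtq.FP8_DEFAULT_CFG)
config["quant_cfg"]["*patch_embedding*"] = {"enable": False}

# Stages 2-3: calibration forward passes drive fake quantization in place.
def forward_loop(model):
    for batch in calib_loader:   # calib_loader yields the model's usual inputs
        model(**batch)

model = mtq.quantize(model, config, forward_loop)
```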
The calibration process requires forward passes through representative data to collect tensor statistics and derive scaling factors. NVIDIA used 8,192 MS-COCO pairs in batches of 512 samples for its CLIP example.
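Conceptually, calibration is just statistics gathering. A toy version tracks the running max magnitude over batches and turns it into an FP8-style scale (448 is the largest normal FP8 E4M3 value); real calibrators are more sophisticated, but the shape of the computation is the same.

```python
import torch

class AmaxCalibrator:
    """Toy calibrator: observe activations over calibration batches,
    then derive a fixed scaling factor from the running max magnitude."""
    def __init__(self):
        self.amax = torch.tensor(0.0)

    def observe(self, x: torch.Tensor):
        self.amax = torch.maximum(self.amax, x.abs().max())

    def scale(self, max_repr: float = 448.0):   # FP8 E4M3 max normal value
        return self.amax / max_repr

calib = AmaxCalibrator()
for _ in range(16):                       # 16 batches x 512 samples = 8,192
    activations = torch.randn(512, 768)   # stand-in for a CLIP layer's output
    calib.observe(activations)
print(calib.scale())
```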
For deployment, the fake-quantized weights are compressed to a true low-precision format and exported as checkpoints. The actual speedups and memory savings occur in downstream inference engines, not during the ModelOpt workflow itself.
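Assuming the save/restore helpers in modelopt.torch.opt behave as documented for the installed release, persisting the result for hand-off looks like the snippet below; the real low-precision kernels and memory savings only appear once the exported checkpoint reaches an engine such as TensorRT.

```python
import modelopt.torch.opt as mto

# Save the model weights together with the quantizer state ModelOpt inserted.
mto.save(model, "clip_vit_l14_fp8.pth")

# Restore later for evaluation or further iteration in PyTorch:
# model = mto.restore(model, "clip_vit_l14_fp8.pth")
```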