Tech News

SVDQuant+NVFP4: 4× Smaller, 3× Faster FLUX with 16-bit Quality on Blackwell GPUs


SVDQuant now supports NVFP4 on NVIDIA Blackwell GPUs, delivering a 3× speedup over BF16 and better image quality than INT4. Try our interactive demo at https://svdquant.mit.edu/! Our code is available at https://github.com/mit-han-lab/nunchaku.

With Moore's law slowing down, hardware vendors are shifting toward low-precision inference. NVIDIA's latest Blackwell architecture introduces a new 4-bit floating-point format, NVFP4, improving upon the previous MXFP4 format. NVFP4 features more precise scaling factors and a smaller microscaling group size (16 vs. 32), enabling it to maintain 16-bit model accuracy even at 4-bit precision while delivering 4× higher peak performance.
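
To make the format concrete, here is a minimal NumPy sketch of microscaled 4-bit fake quantization (quantize then dequantize) using NVFP4's group size of 16. It is illustrative only: real NVFP4 stores values in the E2M1 format with an FP8 (E4M3) scale per group plus a per-tensor scale, whereas this sketch keeps all scales in full precision.

```python
import numpy as np

# Representable magnitudes of E2M1, the 4-bit float format used by NVFP4/MXFP4.
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quant_nvfp4(x: np.ndarray, group_size: int = 16) -> np.ndarray:
    """Simulate NVFP4 quantize-then-dequantize on a 1-D tensor whose
    length is a multiple of `group_size` (16 for NVFP4 vs. 32 for MXFP4)."""
    groups = x.reshape(-1, group_size)
    # One scale per group, chosen so the group max maps to E2M1's max (6.0).
    # Real NVFP4 stores this scale in FP8 (E4M3); we keep it in float here.
    scale = np.abs(groups).max(axis=1, keepdims=True) / 6.0
    scale = np.where(scale == 0.0, 1.0, scale)  # guard all-zero groups
    mags = np.abs(groups) / scale
    # Round each scaled magnitude to the nearest representable E2M1 value.
    idx = np.abs(mags[..., None] - E2M1_GRID).argmin(axis=-1)
    dequant = np.sign(groups) * E2M1_GRID[idx] * scale
    return dequant.reshape(x.shape)

x = np.random.randn(4096).astype(np.float32)
err = np.abs(x - fake_quant_nvfp4(x)).max()
print(f"max abs quantization error: {err:.4f}")
```

The smaller group size is what drives the accuracy gain: with only 16 elements per scale, a single outlier corrupts far fewer neighboring values than in a 32-element MXFP4 group.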

In our previous blog, we shared a tutorial on setting up a 5090 workspace with the Blackwell architecture. In this blog, we’re excited to announce that SVDQuant now supports NVFP4 on the 5090 GPU, delivering better image quality and performance! Our code and demo are all publicly available!

SVDQuant: Absorbing Outliers via Low-Rank Branch

SVDQuant is a new 4-bit quantization paradigm. Unlike traditional methods that redistribute outliers between weights and activations, it employs a lightweight high-precision low-rank branch to absorb them.

As illustrated in the above figure, we first aggregate the outliers by migrating them from the activation \( \boldsymbol{X} \) to the weight \( \boldsymbol{W} \) via smoothing, yielding the updated activation \( \hat{\boldsymbol{X}} \) and weight \( \hat{\boldsymbol{W}} \). Then we apply Singular Value Decomposition (SVD) to the updated weight \( \hat{\boldsymbol{W}} \), decomposing it into a low-rank branch \( \boldsymbol{L}_1 \boldsymbol{L}_2 \) and a residual \( \hat{\boldsymbol{W}} - \boldsymbol{L}_1 \boldsymbol{L}_2 \). The low-rank branch remains in 16-bit precision, while only the residual, now with reduced outliers and lower magnitude, is quantized to 4 bits.
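
In code, the decomposition looks roughly like the following NumPy sketch. The smoothing factors `lam`, the rank, and the function names here are illustrative, and the 4-bit quantization of the residual is left abstract (`R_q`); see the nunchaku repository for the actual implementation.

```python
import numpy as np

def svdquant_decompose(W: np.ndarray, lam: np.ndarray, rank: int = 32):
    """Split a smoothed weight into a 16-bit low-rank branch and a
    4-bit-friendly residual, following the SVDQuant recipe.

    Smoothing uses X @ W = (X / lam) @ (lam[:, None] * W), so the
    per-input-channel factors `lam` migrate activation outliers
    into the weight before decomposition."""
    W_hat = lam[:, None] * W                      # smoothed weight (in, out)
    U, S, Vt = np.linalg.svd(W_hat, full_matrices=False)
    L1 = U[:, :rank] * S[:rank]                   # (in, rank), kept in 16-bit
    L2 = Vt[:rank]                                # (rank, out), kept in 16-bit
    R = W_hat - L1 @ L2                           # residual, quantized to 4 bits
    return L1, L2, R

def forward(x, lam, L1, L2, R_q):
    """Inference: 16-bit low-rank path plus 4-bit residual path."""
    x_hat = x / lam                               # smoothed activation
    return x_hat @ L1 @ L2 + x_hat @ R_q          # R_q: 4-bit quantized residual

# Toy usage with random shapes (real lam/rank come from calibration).
W = np.random.randn(64, 128)
lam = np.ones(64)
L1, L2, R = svdquant_decompose(W, lam, rank=16)
x = np.random.randn(8, 64)
out = forward(x, lam, L1, L2, R)                  # exact while R is unquantized
assert np.allclose(out, x @ W)
```

Because the top singular values carry most of the weight's energy, the residual that actually gets quantized has a much tighter value range, which is exactly what a 4-bit grid needs.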

Image Quality

| Model | Precision | Image Reward (↑) | LPIPS (↓) | PSNR (↑) |
|---|---|---|---|---|
| FLUX.1-dev | BF16 | 0.953 | — | — |
| FLUX.1-dev | INT4 | 0.908 | 0.322 | 18.5 |
| FLUX.1-dev | INT4+SVDQuant | 0.935 | 0.223 | 21.0 |
| FLUX.1-dev | NVFP4 | 0.926 | 0.242 | 20.4 |
| FLUX.1-dev | NVFP4+SVDQuant | 0.942 | 0.205 | 21.5 |
| FLUX.1-schnell | BF16 | 0.938 | — | — |
| FLUX.1-schnell | INT4 | 0.962 | 0.345 | 16.3 |
| FLUX.1-schnell | INT4+SVDQuant | 0.951 | 0.257 | 18.3 |
| FLUX.1-schnell | NVFP4 | 0.956 | 0.277 | 17.6 |
| FLUX.1-schnell | NVFP4+SVDQuant | 0.964 | 0.229 | 19.0 |
| SANA-1.6B | BF16 | 0.952 | — | — |
| SANA-1.6B | INT4 | 0.894 | 0.339 | 15.3 |
| SANA-1.6B | INT4+SVDQuant | 0.935 | 0.220 | 17.8 |
| SANA-1.6B | NVFP4 | 0.929 | 0.236 | 17.4 |
| SANA-1.6B | NVFP4+SVDQuant | 0.941 | 0.176 | 19.0 |
| PixArt-Sigma | Original | 0.944 | — | — |
| PixArt-Sigma | INT4 | -1.226 | 0.762 | 9.08 |
| PixArt-Sigma | INT4+SVDQuant | 0.878 | 0.323 | 17.6 |
| PixArt-Sigma | NVFP4 | 0.660 | 0.517 | 14.8 |
| PixArt-Sigma | NVFP4+SVDQuant | 0.940 | 0.271 | 18.5 |

The table above compares image quality across various datatypes on four popular text-to-image diffusion models using the MJHQ prompt set. Image Reward assesses overall image quality, while LPIPS and PSNR measure perceptual and numerical similarity between images generated by quantized and original models.
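
For intuition on the numerical metric, PSNR is just a log-scaled mean squared error between two images; a minimal sketch:

```python
import numpy as np

def psnr(ref: np.ndarray, test: np.ndarray, max_val: float = 255.0) -> float:
    """PSNR in dB between same-shaped images; higher means the quantized
    model's output is closer, pixel by pixel, to the 16-bit model's."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val**2 / mse)
```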

Across all models, NVFP4 outperforms INT4, particularly on the similarity metrics, thanks to Blackwell's native hardware support for the smaller microscaling group size. Additionally, SVDQuant consistently improves upon naive quantization, as its low-rank branch effectively absorbs outliers. Notably, combining SVDQuant with NVFP4 delivers the best results, achieving a PSNR of 21.5 on FLUX.1-dev and closely matching the image quality of the original 16-bit model.
