Diffusion models, such as Stable Diffusion, have achieved state-of-the-art results in generative image synthesis. However, they require hundreds of millions (or even billions) of parameters, making them computationally expensive and memory-heavy. This hinders deployment on resource-constrained devices like mobile phones or edge accelerators.

Objective
Explore and systematically evaluate quantization and compression methods tailored for diffusion models, aiming to reduce computational and memory requirements while preserving generative quality.
Background and Literature Review
- Survey recent advances in:
  - Standard diffusion models (e.g., Stable Diffusion, latent diffusion architectures)
  - Quantization techniques: INT8/INT4, mixed-precision, layer-wise bit allocation (a minimal post-training quantization sketch follows this list)
  - Model compression: pruning (structured/unstructured), low-rank factorization, knowledge distillation
  - Hybrid strategies and compression-aware training
- Identify gaps: there is still little work that directly evaluates these techniques on diffusion models.
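As a concrete baseline for the quantization item above, the sketch below applies PyTorch's built-in dynamic INT8 post-training quantization to the linear layers of a Stable Diffusion UNet and compares serialized model sizes. This is a minimal sketch, assuming the Hugging Face diffusers package is installed; the checkpoint id is only illustrative.

```python
# Minimal dynamic INT8 post-training quantization sketch.
# Assumes `diffusers` is installed; the checkpoint id below is illustrative.
import io
import torch
from diffusers import UNet2DConditionModel

def serialized_size_mb(model: torch.nn.Module) -> float:
    """Size of the model's state_dict when serialized with torch.save, in MB."""
    buffer = io.BytesIO()
    torch.save(model.state_dict(), buffer)
    return buffer.getbuffer().nbytes / 1e6

unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)

# Dynamic quantization stores nn.Linear weights as INT8 and quantizes activations
# on the fly; Conv2d layers are not covered and remain FP32.
quantized_unet = torch.ao.quantization.quantize_dynamic(
    unet, {torch.nn.Linear}, dtype=torch.qint8
)

print(f"FP32 UNet:    {serialized_size_mb(unet):.1f} MB")
print(f"Dynamic INT8: {serialized_size_mb(quantized_unet):.1f} MB")
```

Dynamic quantization only touches the linear (attention/projection) layers and runs on CPU backends, so it is a measurement baseline rather than a deployment recipe; static INT8/INT4 and mixed-precision schemes require calibration data or fine-tuning, which is what the research questions below target.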
Core Research Questions
- RQ1: How do quantization (e.g., INT8, mixed-precision) and compression methods affect diffusion model fidelity and inference performance?
- RQ2: Which architectural components are most or least sensitive to reduced precision and compression? (A sensitivity-probe sketch follows this list.)
- RQ3: Can compression-aware fine-tuning mitigate quality degradation better than naïve post-training quantization (PTQ) or quantization-aware training (QAT)?
- RQ4: What configurations yield optimal trade-offs between model size, speed, and image quality?
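For RQ2, one simple probe is to fake-quantize one module at a time and measure how far the model's output drifts from the FP32 reference. The sketch below is a generic, hedged version of that idea: it assumes `model(*sample_inputs)` returns a tensor (for a diffusers UNet you would pass latents, timestep, and text embeddings and read `.sample` from the output), and the helper names are hypothetical.

```python
# Hypothetical per-module sensitivity probe: fake-quantize one module at a time
# and measure output MSE against the FP32 reference.
# Assumes `model(*sample_inputs)` returns a tensor.
import torch

def fake_quantize_int8(weight: torch.Tensor) -> torch.Tensor:
    # Symmetric per-tensor INT8 simulation: quantize to [-128, 127], then dequantize.
    scale = weight.abs().max().clamp(min=1e-8) / 127.0
    return torch.round(weight / scale).clamp(-128, 127) * scale

@torch.no_grad()
def module_sensitivity(model, sample_inputs,
                       target_types=(torch.nn.Linear, torch.nn.Conv2d)):
    """Return {module name: output MSE vs. FP32} when only that module is quantized."""
    reference = model(*sample_inputs)
    scores = {}
    for name, module in model.named_modules():
        if not isinstance(module, target_types):
            continue
        fp32_weight = module.weight.data.clone()
        module.weight.data = fake_quantize_int8(fp32_weight)  # quantize this module only
        degraded = model(*sample_inputs)
        scores[name] = torch.mean((reference - degraded) ** 2).item()
        module.weight.data = fp32_weight                       # restore FP32 weights
    return scores
```

Ranking modules by this score gives a first-cut input to layer-wise bit allocation (and to the trade-off study in RQ4): keep the most sensitive modules at higher precision and push the rest to lower bit-widths.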
Prerequisites
- Strong proficiency with Python and PyTorch.
- Understanding of diffusion model architectures and generative model evaluation.
- Experience with, or a willingness to learn, quantization and compression techniques.
Contact
References
1. Diffusion Model Quantization: A Review
2. SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models