Accelerate Image Synthesis: Latent Adversarial Diffusion Distillation

March 18th, 2024

"Fast High-Resolution Image Synthesis with Latent Adversarial Diffusion Distillation" presents a novel distillation approach known as Latent Adversarial Diffusion Distillation (LADD). This approach is designed to address the limitations of existing diffusion models, particularly the challenge of slow inference speed, which hampers real-time applications. LADD enables high-resolution, multi-aspect ratio image synthesis by efficiently distilling large latent diffusion models (LDMs), significantly simplifying the training process and enhancing performance compared to previous methods.

We will summarize the key takeaways from this paper.

Introduction

Diffusion models have emerged as a powerful tool for image and video synthesis and editing, offering high-quality results. However, their iterative nature, requiring numerous network evaluations to transform noise into coherent images, has limited their practicality for real-time applications. Various strategies have been proposed to accelerate diffusion models. LADD introduces a new strategy, leveraging generative features from pretrained LDMs, allowing for efficient high-resolution image synthesis in a fraction of the steps required by traditional methods.

Background

The paper begins with an overview of diffusion models and their distillation. Diffusion models generate an image by gradually denoising a noise sample over many iterative steps, which makes sampling slow and computationally expensive. Distillation methods, including Adversarial Diffusion Distillation (ADD), aim to reduce the number of steps needed. However, ADD has two notable limitations: a fixed training resolution, and the need to decode to RGB space when distilling latent diffusion models, which can limit high-resolution training.
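The cost of iterative sampling can be made concrete with a toy sketch. Everything here is an assumption for illustration: `toy_velocity` stands in for one evaluation of a large network, `TARGET` is a made-up clean sample, and the uniform Euler loop is a generic sampler, not the one used in the paper. Because the toy velocity is exact, the step count changes only the cost here; a real learned model typically loses quality at low step counts unless it has been distilled.

```python
from typing import Callable, List, Tuple

TARGET = [0.5, -0.25]  # hypothetical "clean" sample the toy model points toward


def toy_velocity(x: List[float], t: float) -> List[float]:
    """Stand-in for one network evaluation (exact for a single target point)."""
    return [(xi - ti) / t for xi, ti in zip(x, TARGET)]


def euler_sample(
    velocity: Callable[[List[float], float], List[float]],
    x: List[float],
    num_steps: int,
) -> Tuple[List[float], int]:
    """Integrate from t=1 (noise) to t=0 (data) with uniform Euler steps."""
    calls = 0
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = 1.0 - i * dt
        v = velocity(x, t)  # one "network" evaluation per step
        calls += 1
        x = [xi - dt * vi for xi, vi in zip(x, v)]
    return x, calls


noise = [3.0, -1.0]
_, calls_many = euler_sample(toy_velocity, noise, 50)
_, calls_few = euler_sample(toy_velocity, noise, 4)
print(calls_many, calls_few)  # 50 4
```

The number of network evaluations equals the number of sampling steps, so cutting a sampler from dozens of steps to four reduces the dominant inference cost roughly in proportion.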

Methodology

LADD addresses these issues by distilling directly in latent space, which avoids decoding to pixel space and allows training at higher resolutions. Unlike ADD, which relies on a pretrained discriminator operating in pixel space, LADD unifies the discriminator and the teacher model: the discriminator operates directly on latents, using generative features from the pretrained teacher. This simplifies training and brings several advantages, including efficiency, noise-level-specific feedback, and support for Multi-Aspect Ratio (MAR) training.
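A rough sketch of how an adversarial loss on teacher features in latent space might look is given below. It is a toy under stated assumptions, not the authors' implementation: `teacher_features` is a hypothetical stand-in for the frozen teacher's intermediate activations, the linear interpolation in `add_noise`, the uniform noise-level sampling, the single linear head, and the hinge loss are all illustrative choices, and real and student latents share the same noise purely for simplicity.

```python
import random
from typing import List, Sequence, Tuple


def add_noise(latent: Sequence[float], noise: Sequence[float], t: float) -> List[float]:
    """Noise a latent to level t in [0, 1] (assumed linear interpolation)."""
    return [(1.0 - t) * z + t * e for z, e in zip(latent, noise)]


def teacher_features(noisy_latent: Sequence[float], t: float) -> List[float]:
    """Hypothetical stand-in for frozen teacher activations at noise level t."""
    n = len(noisy_latent)
    mean = sum(noisy_latent) / n
    second_moment = sum(v * v for v in noisy_latent) / n
    return [mean, second_moment, t]


def head_logit(features: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Small trainable discriminator head on top of the teacher features."""
    return sum(w * f for w, f in zip(weights, features)) + bias


def hinge_d_loss(real_logits: Sequence[float], fake_logits: Sequence[float]) -> float:
    """Discriminator hinge loss: push real logits above +1, fake below -1."""
    real_term = sum(max(0.0, 1.0 - r) for r in real_logits) / len(real_logits)
    fake_term = sum(max(0.0, 1.0 + f) for f in fake_logits) / len(fake_logits)
    return real_term + fake_term


def hinge_g_loss(fake_logits: Sequence[float]) -> float:
    """Student (generator) loss: raise the logits of its own samples."""
    return -sum(fake_logits) / len(fake_logits)


def ladd_style_losses(
    real_latents: Sequence[Sequence[float]],
    student_latents: Sequence[Sequence[float]],
    weights: Sequence[float],
    bias: float,
    rng: random.Random,
) -> Tuple[float, float]:
    """Compute both losses entirely in latent space; nothing is decoded to pixels."""
    real_logits, fake_logits = [], []
    for real, fake in zip(real_latents, student_latents):
        t = rng.uniform(0.0, 1.0)  # assumed noise-level distribution
        noise = [rng.gauss(0.0, 1.0) for _ in real]
        real_feat = teacher_features(add_noise(real, noise, t), t)
        fake_feat = teacher_features(add_noise(fake, noise, t), t)
        real_logits.append(head_logit(real_feat, weights, bias))
        fake_logits.append(head_logit(fake_feat, weights, bias))
    return hinge_d_loss(real_logits, fake_logits), hinge_g_loss(fake_logits)


rng = random.Random(0)
d_loss, g_loss = ladd_style_losses(
    real_latents=[[0.1, 0.2, -0.3]],
    student_latents=[[0.4, -0.5, 0.6]],
    weights=[1.0, 0.5, 0.0],
    bias=0.0,
    rng=rng,
)
print(d_loss, g_loss)
```

Because the features are computed on noised latents, the sampled noise level determines what kind of feedback the discriminator gives, which is one way to read the "noise-level-specific feedback" mentioned above. Since no fixed-resolution pixel-space network is involved, latents of different shapes can in principle pass through the same pipeline.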

Experiments and Results

The paper evaluates LADD extensively, demonstrating strong performance in synthesizing high-resolution images in only a few steps. Notably, applying LADD to Stable Diffusion 3 (SD3) yields a model dubbed SD3-Turbo, which achieves image quality comparable to state-of-the-art text-to-image generators in just four steps. The experiments also examine the impact of different teacher noise distributions, the use of synthetic data, latent distillation approaches, and the scaling behavior of LADD.

Comparison to State-of-the-Art

LADD's effectiveness is further underscored by a comparison with current leading methods in text-to-image and image-to-image synthesis. SD3-Turbo not only matches the performance of its teacher model (SD3) in image quality, but also demonstrates significant improvements over other baselines in terms of inference speed and image-text alignment.

Limitations and Future Directions

Despite its advancements, LADD is not without limitations. The authors note a trade-off between model capacity, prompt alignment, and inference speed, which could impact the model's ability to handle certain text-to-image synthesis challenges. Future research directions include exploring this trade-off more deeply and developing strategies to enhance control over the image and text guidance strengths.

Conclusion

"Fast High-Resolution Image Synthesis with Latent Adversarial Diffusion Distillation" introduces a new approach to image synthesis that significantly accelerates the generation of high-quality images from text prompts. By distilling large diffusion models directly in latent space, LADD moves few-step, high-resolution generation closer to real-time use and sets a strong reference point for efficiency and performance in image synthesis.