This project explores the application of diffusion models to generate and denoise images. The project is divided into two main parts:
In this part, we explore the capabilities of pre-trained diffusion models, implement sampling loops, and use the models for tasks such as inpainting and generating optical illusions.
The forward diffusion process adds Gaussian noise to an image over a series of timesteps \( t \), gradually corrupting the image. Mathematically, this is defined as:
\( x_t = \sqrt{\bar{\alpha}_t} \, x_0 + \sqrt{1 - \bar{\alpha}_t} \, \epsilon \),
where \( x_0 \) is the clean image, \( \epsilon \sim \mathcal{N}(0, I) \) is standard Gaussian noise, and \( \bar{\alpha}_t \) is the cumulative product of the noise-schedule coefficients, which shrinks as \( t \) grows.
Below, we show a clean test image and its noisy counterparts at timesteps \( t = 250, 500, 750 \):
As the timestep \( t \) increases, the image becomes progressively noisier; by \( t = 750 \) it is almost indistinguishable from pure noise.
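A minimal sketch of this forward process in PyTorch, assuming alphas_cumprod is a precomputed tensor holding \( \bar{\alpha}_t \) for every timestep:

```python
import torch

def forward(im, t, alphas_cumprod):
    """Add noise to a clean image to produce x_t (forward diffusion)."""
    alpha_bar = alphas_cumprod[t]                               # \bar{alpha}_t
    eps = torch.randn_like(im)                                  # eps ~ N(0, I)
    return alpha_bar.sqrt() * im + (1 - alpha_bar).sqrt() * eps
```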
In this part, we attempt to denoise the noisy images generated in the forward process (\( t = 250, 500, 750 \)) using classical Gaussian blur filtering. Gaussian blur reduces high-frequency noise by averaging nearby pixels, which is mathematically represented as:
\( G(x, y) = \sum_{i=-k}^{k} \sum_{j=-k}^{k} I(x+i, y+j) \cdot w(i, j) \),
where \( w(i, j) \) is the Gaussian kernel, and \( k \) is determined by the kernel size. Below, we show a comparison between the noisy images and their Gaussian-denoised counterparts for three timesteps:
Observations: While Gaussian blur effectively smooths high-frequency noise, it also removes fine details from the image. At lower noise levels (\( t = 250 \)), some recognizable features remain visible. However, at higher noise levels (\( t = 750 \)), the denoised images still resemble smoothed noise, highlighting the limitations of classical methods in denoising structured diffusion noise.
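For reference, a minimal sketch of the Gaussian-blur baseline used above, via torchvision; the kernel size and sigma are illustrative choices, not tuned values:

```python
import torchvision.transforms.functional as TF

def gaussian_denoise(noisy_im, kernel_size=5, sigma=2.0):
    """Classical baseline: average nearby pixels with a Gaussian kernel."""
    return TF.gaussian_blur(noisy_im, kernel_size=kernel_size, sigma=sigma)
```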
In this section, we use a pretrained diffusion model (DeepFloyd's stage_1.unet) to estimate and remove noise from noisy images. The process involves three steps: adding noise to a clean image with the forward process, estimating that noise with the pretrained UNet, and subtracting the estimated noise to recover an estimate of the original image.
The forward diffusion process adds Gaussian noise to an image, as defined by:
\( x_t = \sqrt{\bar{\alpha}_t} \, x_0 + \sqrt{1 - \bar{\alpha}_t} \, \epsilon \),
where, as before, \( x_0 \) is the clean image, \( \epsilon \sim \mathcal{N}(0, I) \) is Gaussian noise, and \( \bar{\alpha}_t \) is the cumulative noise-schedule coefficient.
The pretrained UNet estimates the noise \( \hat{\epsilon} \) in \( x_t \). Using this noise estimate, the original image \( x_0 \) is recovered by reversing the forward process:
\( x_0 \approx \frac{x_t - \sqrt{1 - \bar{\alpha}_t} \, \hat{\epsilon}}{\sqrt{\bar{\alpha}_t}} \).
Below, we visualize the original image, the noisy image at timesteps \( t = 250, 500, 750 \), and the recovered estimate \( x_0 \):
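A sketch of this one-step denoising; here unet(x_t, t, prompt_embeds) is assumed to be a thin wrapper around DeepFloyd's stage_1.unet that returns only the predicted noise:

```python
def one_step_denoise(x_t, t, unet, prompt_embeds, alphas_cumprod):
    """Estimate the clean image x_0 by inverting the forward process with the predicted noise."""
    alpha_bar = alphas_cumprod[t]
    eps_hat = unet(x_t, t, prompt_embeds)                       # estimated noise
    return (x_t - (1 - alpha_bar).sqrt() * eps_hat) / alpha_bar.sqrt()
```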
Observations:
In this section, we implement and evaluate an iterative denoising process using the pretrained DeepFloyd diffusion model. Iterative denoising progressively reduces noise by moving from a noisy image \( x_t \) at a timestep \( t \) to a less noisy image \( x_{t'} \), until a clean image \( x_0 \) is obtained. The timesteps \( t \) are selected from a list of strided timesteps, starting at \( t = 990 \) and ending at \( t = 0 \) in steps of 30.
The denoising process for a single step is governed by the formula:
\( x_{t'} = \frac{\sqrt{\bar{\alpha}_{t'}} \, \beta_t}{1 - \bar{\alpha}_t} \, x_0 + \frac{\sqrt{\alpha_t} \, (1 - \bar{\alpha}_{t'})}{1 - \bar{\alpha}_t} \, x_t + v_\sigma \),
where \( \alpha_t = \bar{\alpha}_t / \bar{\alpha}_{t'} \), \( \beta_t = 1 - \alpha_t \), \( x_0 \) is the current estimate of the clean image, and \( v_\sigma \) is random variance added by the add_variance function.
Below, we visualize the iterative denoising process, starting from \( t = 300 \) and gradually reducing noise until \( t = 0 \).
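A sketch of a single update of this loop, following the formula above; add_variance stands in for the step that adds \( v_\sigma \), and the helper names are illustrative:

```python
def iterative_denoise_step(x_t, t, t_prev, x0_hat, alphas_cumprod, add_variance):
    """Move from the noisier x_t at timestep t to the cleaner x_{t'} at t_prev."""
    ab_t, ab_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
    alpha = ab_t / ab_prev                                      # alpha_t
    beta = 1 - alpha                                            # beta_t
    x_prev = (ab_prev.sqrt() * beta / (1 - ab_t)) * x0_hat \
           + (alpha.sqrt() * (1 - ab_prev) / (1 - ab_t)) * x_t
    return add_variance(x_prev)                                 # adds v_sigma
```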
Key Observations:
In this section, we use the pretrained diffusion model to generate images from scratch. Starting with pure Gaussian noise, the model iteratively denoises the noise step-by-step until a coherent image is obtained. This process leverages the pretrained UNet's ability to map noise to the manifold of natural images, conditioned on the prompt "a high quality photo."
The process begins with pure noise \( x_T \) at the noisiest timestep \( t = 990 \). Iterative denoising then progressively reduces noise using the following formula:
\( x_{t'} = \frac{\sqrt{\bar{\alpha}_{t'}} \, \beta_t}{1 - \bar{\alpha}_t} \, x_0 + \frac{\sqrt{\alpha_t} \, (1 - \bar{\alpha}_{t'})}{1 - \bar{\alpha}_t} \, x_t + v_\sigma \),
where the terms are as defined in the iterative denoising section above.
Below, we display 5 sampled images generated by the model using this method:
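A sketch of the full sampling loop, reusing the unet wrapper and iterative_denoise_step sketched earlier; the strided_timesteps list runs from 990 down to 0 in steps of 30:

```python
import torch

def sample(unet, prompt_embeds, strided_timesteps, alphas_cumprod, add_variance,
           shape=(1, 3, 64, 64)):
    """Generate an image from pure noise by denoising along the strided timesteps."""
    x_t = torch.randn(shape)                                    # x_T ~ N(0, I)
    for t, t_prev in zip(strided_timesteps[:-1], strided_timesteps[1:]):
        ab_t = alphas_cumprod[t]
        eps_hat = unet(x_t, t, prompt_embeds)                   # "a high quality photo"
        x0_hat = (x_t - (1 - ab_t).sqrt() * eps_hat) / ab_t.sqrt()
        x_t = iterative_denoise_step(x_t, t, t_prev, x0_hat,
                                     alphas_cumprod, add_variance)
    return x_t
```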
In this section, we use Classifier-Free Guidance (CFG) to improve the quality of generated images. CFG combines both conditional and unconditional noise estimates to guide the generation process. The noise estimate is defined as:
\( \epsilon = \epsilon_u + \gamma (\epsilon_c - \epsilon_u) \),
where \( \epsilon_u \) is the unconditional noise estimate (obtained with an empty prompt), \( \epsilon_c \) is the conditional noise estimate (obtained with the text prompt), and \( \gamma \) is the guidance scale.
During the iterative denoising process, the UNet is run twice at each step: once with the text prompt and once with the unconditional (empty) prompt. The two noise estimates are then combined with the formula above before the denoising update is applied.
Below are 5 images generated using CFG with a scale of \( \gamma = 7 \):
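A sketch of the CFG noise estimate, assuming the unet wrapper from above and a precomputed embedding of the empty prompt:

```python
def cfg_noise_estimate(unet, x_t, t, cond_embeds, uncond_embeds, gamma=7.0):
    """Blend conditional and unconditional noise estimates with guidance scale gamma."""
    eps_c = unet(x_t, t, cond_embeds)                           # conditioned on the text prompt
    eps_u = unet(x_t, t, uncond_embeds)                         # conditioned on the empty prompt
    return eps_u + gamma * (eps_c - eps_u)
```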
Key Observations:
In this section, we explore the process of Image-to-Image Translation, where we take a real image, add varying levels of noise, and iteratively denoise it using the diffusion model. This technique, inspired by the SDEdit algorithm, allows the model to make edits to an image by forcing it back onto the manifold of natural images.
The forward process adds noise to the image:
\( x_t = \sqrt{\bar{\alpha}_t} \, x_0 + \sqrt{1 - \bar{\alpha}_t} \, \epsilon \),
where the terms are as defined in the forward process above; the timestep \( t \) controls how strongly the original image is corrupted, and therefore how large the resulting edit is.
The noisy image \( x_t \) is then denoised iteratively using Classifier-Free Guidance (CFG):
\( \epsilon = \epsilon_u + \gamma (\epsilon_c - \epsilon_u) \),
where \( \epsilon_u \), \( \epsilon_c \), and \( \gamma \) are as defined in the CFG section above.
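A sketch of the SDEdit-style edit, reusing the forward function from earlier; denoise_from is assumed to run the CFG-guided iterative denoising loop starting from a given index into the strided timesteps (a smaller index means more noise and a larger edit):

```python
def sdedit(im, i_start, strided_timesteps, alphas_cumprod, denoise_from):
    """Edit a real image: noise it to an intermediate level, then denoise it back."""
    t_start = strided_timesteps[i_start]
    x_t = forward(im, t_start, alphas_cumprod)                  # partially corrupt the image
    return denoise_from(x_t, i_start)                           # project back onto the image manifold
```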
Below, we show the progression of edits at noise levels \( t = [1, 3, 5, 7, 10, 20] \):
Observations:
In this section, we explore how the SDEdit algorithm can project non-realistic images, such as paintings, sketches, and web images, onto the natural image manifold. The method involves adding noise at specific levels (\( t = [1, 3, 5, 7, 10, 20] \)) and iteratively denoising the image using Classifier-Free Guidance (CFG).
The forward process adds noise using:
\( x_t = \sqrt{\bar{\alpha}_t} \, x_0 + \sqrt{1 - \bar{\alpha}_t} \, \epsilon \),
and the CFG-adjusted noise estimate for denoising is given by:
\( \epsilon = \epsilon_u + \gamma (\epsilon_c - \epsilon_u) \),
where \( \epsilon_u \), \( \epsilon_c \), and \( \gamma \) are as defined in the CFG section above.
The selected web image of "Minions" was processed using the SDEdit algorithm. Noise was added at timesteps \( t = [1, 3, 5, 7, 10, 20] \), and the model iteratively denoised the noisy image back to the natural image manifold. Below are the results:
Two hand-drawn images, "A Fish" and "Phineas," were created and processed using the same method. Noise levels \( t = [1, 3, 5, 7, 10, 20] \) were applied, and iterative denoising with CFG was used to refine the images. The results showcase how the diffusion model creatively transforms these inputs into realistic outputs while retaining artistic characteristics.
Observations:
In this section, we explore the capability of diffusion models for inpainting. Inpainting involves editing specific parts of an image while preserving other regions. Using a binary mask \( m \), where \( m = 1 \) represents the areas to edit and \( m = 0 \) represents areas to preserve, the model refines the noisy regions back to the natural image manifold.
The inpainting process follows the formula:
\( x_t \leftarrow m x_t + (1 - m) \text{forward}(x_{\text{orig}}, t) \)
This ensures that pixels outside the mask retain their original values with appropriate noise, while pixels inside the mask are denoised iteratively. The model progressively refines the image using the iterative denoising process.
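A sketch of the masking step applied after every denoising update, following the formula above and reusing the forward function from earlier:

```python
def apply_inpaint_mask(x_t, x_orig, mask, t, alphas_cumprod):
    """Keep denoised pixels inside the mask; reset pixels outside it to the re-noised original."""
    x_orig_noised = forward(x_orig, t, alphas_cumprod)          # forward(x_orig, t)
    return mask * x_t + (1 - mask) * x_orig_noised
```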
Below, we demonstrate inpainting on the test image with a rectangular mask that targets the top half of the image. The mask ensures that only the top portion is edited while the rest remains unchanged:
For the piano image, we applied a rectangular mask that spans the middle section of the image. This mask allows the model to focus its edits on that region, leaving the rest of the image untouched:
For the Rolex image, we used a circular mask centered in the middle of the image. The circular mask allows the model to refine the specific region within the circle, while preserving the outer area as is:
In this section, we use text-conditioned embeddings to guide the image projection process onto the natural image manifold. By changing the prompt from "a high quality photo" to a specific text description, we add control using language. For instance, we used the prompt "a rocket ship" to guide the denoising process.
The iterative denoising process is guided by the classifier-free guidance (CFG) scale \( \gamma \), which was set to \( \gamma = 7 \). This process was applied to the test image and two custom images, namely an orchestra and an F1 race.
The following results show the test image progressively edited at noise levels [1, 3, 5, 7, 10, 20]:
Below are the results of applying the same procedure on the orchestra image:
Finally, we applied the text-conditioned translation to an image of an F1 race:
In this section, we implemented visual anagrams using a diffusion model. A visual anagram is an image that appears as one subject when upright and as another subject when flipped upside down. This effect was achieved by conditioning the denoising process on two different text prompts and averaging the noise estimates from both orientations during iterative denoising.
Mathematically, the algorithm for noise estimation is:
\( \epsilon_1 = \text{UNet}(x_t, t, p_1) \), \( \epsilon_2 = \text{flip}(\text{UNet}(\text{flip}(x_t), t, p_2)) \),
\( \epsilon = \frac{\epsilon_1 + \epsilon_2}{2} \),
where \( p_1 \) and \( p_2 \) are the text prompt embeddings, and \( \text{flip}(\cdot) \) represents vertical flipping. The final denoising process uses this averaged noise estimate to iteratively refine the image.
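A sketch of the averaged noise estimate, assuming the unet wrapper from above; torch.flip along the height dimension implements the upside-down flip:

```python
import torch

def anagram_noise_estimate(unet, x_t, t, embeds_1, embeds_2):
    """Average the upright-prompt estimate with the flipped second-prompt estimate."""
    eps_1 = unet(x_t, t, embeds_1)                              # prompt p1 on the upright image
    eps_2 = torch.flip(unet(torch.flip(x_t, dims=[-2]), t, embeds_2),
                       dims=[-2])                               # prompt p2 on the flipped image
    return (eps_1 + eps_2) / 2
```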
This visual anagram appears as "an oil painting of people around a campfire" when upright and "an oil painting of an old man" when flipped upside down:
This visual anagram appears as "a photo of a hipster barista" when upright and "a photo of a man" when flipped upside down:
This visual anagram appears as "a lithograph of waterfalls" when upright and "a lithograph of a skull" when flipped upside down:
To create these illusions, the two prompt embeddings were fed to the UNet at every denoising step (one applied to the upright image, the other to the flipped image), the two noise estimates were averaged as described above, and the averaged estimate was used in the iterative denoising loop.
In this section, we created hybrid images that appear as one image from far away (low frequency) and a different image up close (high frequency). The results leverage Gaussian blur to isolate low-frequency components and combine them with the high-frequency details of another prompt.
The hybrid image creation process involves iterative denoising guided by two separate text prompts for the far and close views, blending their frequency components to create a visually striking effect. The formula used for combining the images is:
\( \text{Hybrid Image} = \text{Low Frequency Image} + (\text{High Frequency Image} - \text{Blurred High Frequency Image}) \)
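A sketch of this frequency blend using a Gaussian blur as the low-pass filter; the first input supplies the low frequencies and the second supplies the high frequencies, and the kernel size and sigma are illustrative:

```python
import torchvision.transforms.functional as TF

def hybrid_blend(low_src, high_src, kernel_size=33, sigma=2.0):
    """Combine the low frequencies of one input with the high frequencies of another."""
    low_pass = TF.gaussian_blur(low_src, kernel_size=kernel_size, sigma=sigma)
    high_pass = high_src - TF.gaussian_blur(high_src, kernel_size=kernel_size, sigma=sigma)
    return low_pass + high_pass
```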
Below are the results for three hybrid images:
To create these images, denoising was run with two separate text prompts (one for the far view, one for the close view); a Gaussian blur was used to isolate the low-frequency component of one result and the high-frequency component of the other, which were then combined using the formula above.
Each hybrid image effectively combines the visual properties of two different prompts. From far away, the low-frequency image dominates, creating the appearance of the first prompt. As you get closer, the high-frequency details become visible, revealing the second prompt.
In this part, we train a UNet-based diffusion model on the MNIST dataset. The process involves generating noisy images, training the model to denoise them, and evaluating the model’s performance.
To train the UNet denoiser, we generate noisy images \( z \) by adding Gaussian noise to clean MNIST images \( x \), as follows:
\( z = x + \sigma \epsilon, \quad \text{where } \epsilon \sim \mathcal{N}(0, I). \)
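A minimal sketch of this noising operation:

```python
import torch

def add_noise(x, sigma):
    """Create a noisy training input z = x + sigma * eps, with eps ~ N(0, I)."""
    return x + sigma * torch.randn_like(x)
```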
Below is a visualization of the noising process, showing how images are progressively corrupted as the noise level \( \sigma \) increases:
The UNet denoiser was trained to minimize the L2 loss:
\( L = \mathbb{E}_{z,x} [\| D_\theta(z) - x \|^2] \)
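A sketch of the corresponding training loop; the optimizer, learning rate, and the sample_sigma callable used to draw a noise level per batch are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def train_denoiser(model, loader, sample_sigma, epochs=5, lr=1e-4, device="cuda"):
    """Train the UNet denoiser D_theta to map noisy z back to clean x with an L2 loss."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    losses = []
    for _ in range(epochs):
        for x, _ in loader:                                     # MNIST images; labels unused
            x = x.to(device)
            z = x + sample_sigma() * torch.randn_like(x)        # noisy input for this batch
            loss = F.mse_loss(model(z), x)                      # || D_theta(z) - x ||^2
            opt.zero_grad()
            loss.backward()
            opt.step()
            losses.append(loss.item())
    return losses
```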
Training was conducted over 5 epochs, with random noise levels applied to the images during each batch. The following plot shows the training loss over steps:
After training, the denoiser was tested on unseen MNIST digits. Below are the results of the denoiser:
To evaluate the model's generalization, the denoiser was tested on a range of noise levels it was not explicitly trained on, \( \sigma = [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0] \). The results below show how the model performs as the noise level increases:
To train the time-conditioned UNet \( \epsilon_\theta(x_t, t) \), we minimize the L2 loss:
\( L = \mathbb{E}_{x_0, t, \epsilon} \| \epsilon_\theta(x_t, t) - \epsilon \|^2 \)
Here, \( x_0 \) is a clean MNIST image, \( t \) is a uniformly sampled timestep, \( \epsilon \sim \mathcal{N}(0, I) \) is the sampled noise, \( x_t \) is the corresponding noisy image from the forward process, and \( \epsilon_\theta(x_t, t) \) is the UNet's noise prediction.
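A sketch of one training step under this objective; normalizing the timestep to [0, 1] before passing it to the UNet is an implementation assumption, and the schedules are assumed to be tensors indexed by timestep:

```python
import torch
import torch.nn.functional as F

def tc_training_step(unet, x0, alphas_cumprod, T, opt):
    """One step of time-conditioned training: noise x0 to a random t and predict the noise."""
    t = torch.randint(1, T + 1, (x0.shape[0],), device=x0.device)   # random timestep per image
    ab = alphas_cumprod[t].view(-1, 1, 1, 1)                        # \bar{alpha}_t
    eps = torch.randn_like(x0)
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps                    # forward process
    loss = F.mse_loss(unet(x_t, t.float() / T), eps)                # predict eps from (x_t, t)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```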
Using the trained UNet, we generate images by iteratively denoising a pure noise image \( x_T \sim \mathcal{N}(0, I) \) through the reverse diffusion process. The equations for denoising are as follows:
\( x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon \)
At each timestep \( t \), the UNet predicts the noise \( \epsilon_\theta(x_t, t) \), which is used to compute the denoised image \( x_0 \):
\( x_0 = \frac{1}{\sqrt{\bar{\alpha}_t}} \left( x_t - \sqrt{1 - \bar{\alpha}_t} \epsilon_\theta(x_t, t) \right) \)
Then, the next timestep \( x_{t-1} \) is computed as:
\( x_{t-1} = \frac{\sqrt{\bar{\alpha}_{t-1}} \, \beta_t}{1 - \bar{\alpha}_t} x_0 + \frac{\sqrt{\alpha_t} \, (1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} x_t + \sqrt{\beta_t} z \)
Here, \( \beta_t = 1 - \alpha_t \) is the variance schedule, \( \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s \), and \( z \sim \mathcal{N}(0, I) \) is fresh noise added at each step (set to zero at the final step).
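A sketch of the sampling loop implementing these equations, assuming 1-indexed schedule tensors with \( \bar{\alpha}_0 = 1 \) and the normalized-timestep convention from the training sketch:

```python
import torch

@torch.no_grad()
def sample_tc(unet, alphas, alphas_cumprod, betas, T, shape=(16, 1, 28, 28), device="cuda"):
    """Reverse diffusion: start from pure noise and denoise step by step."""
    x_t = torch.randn(shape, device=device)                         # x_T ~ N(0, I)
    for t in range(T, 0, -1):
        ab_t, ab_prev = alphas_cumprod[t], alphas_cumprod[t - 1]
        t_in = torch.full((shape[0],), t / T, device=device)
        eps_hat = unet(x_t, t_in)
        x0_hat = (x_t - (1 - ab_t).sqrt() * eps_hat) / ab_t.sqrt()  # estimate of x_0
        z = torch.randn_like(x_t) if t > 1 else torch.zeros_like(x_t)
        x_t = (ab_prev.sqrt() * betas[t] / (1 - ab_t)) * x0_hat \
            + (alphas[t].sqrt() * (1 - ab_prev) / (1 - ab_t)) * x_t \
            + betas[t].sqrt() * z
    return x_t
```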
Below are the generated images after 5 and 20 epochs of training. These results demonstrate the improvement in the UNet's ability to generate clean samples as training progresses.
In this section, we extend the UNet to be class-conditioned, allowing it to generate MNIST digits conditioned on their class (digits 0-9). This involves introducing class-conditioning vectors alongside time-conditioning. The model is trained using a combination of the class and time signals, with dropout applied to class-conditioning 10% of the time to enable unconditional generation.
The class-conditioned UNet minimizes the noise prediction loss:
\( L = \mathbb{E}_{x_0, t, c, \epsilon} \| \epsilon - \epsilon_\theta(x_t, t, c) \|^2 \)
where \( x_t \) is generated as:
\( x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon \)
Training details: the class label \( c \) is encoded as a one-hot conditioning vector and is dropped (replaced with the zero vector) 10% of the time, so the model also learns an unconditional noise estimate; otherwise, training mirrors the time-conditioned setup above.
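A sketch of one class-conditioned training step with the 10% dropout; the one-hot encoding and normalized timestep are implementation assumptions:

```python
import torch
import torch.nn.functional as F

def cc_training_step(unet, x0, labels, alphas_cumprod, T, opt, p_uncond=0.1):
    """One training step that drops the class vector 10% of the time."""
    c = F.one_hot(labels, num_classes=10).float()                   # one-hot class vectors
    drop = (torch.rand(c.shape[0], device=c.device) < p_uncond).float().unsqueeze(1)
    c = c * (1 - drop)                                              # zero vector => unconditional
    t = torch.randint(1, T + 1, (x0.shape[0],), device=x0.device)
    ab = alphas_cumprod[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x0)
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps
    loss = F.mse_loss(unet(x_t, t.float() / T, c), eps)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```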
Below is the training loss curve, showing convergence over the course of training:
During sampling, we use classifier-free guidance to enhance the quality of class-conditioned results. The guided noise prediction is computed as:
\( \epsilon = \epsilon_u + \gamma (\epsilon_c - \epsilon_u) \)
where \( \epsilon_u = \epsilon_\theta(x_t, t, \varnothing) \) is the unconditional estimate (class vector dropped), \( \epsilon_c = \epsilon_\theta(x_t, t, c) \) is the class-conditioned estimate, and \( \gamma \) is the guidance scale.
The denoising process then follows the reverse diffusion formula:
\( x_0 = \frac{1}{\sqrt{\bar{\alpha}_t}} \left( x_t - \sqrt{1 - \bar{\alpha}_t} \epsilon \right) \)
\( x_{t-1} = \frac{\sqrt{\bar{\alpha}_{t-1}} \, \beta_t}{1 - \bar{\alpha}_t} x_0 + \frac{\sqrt{\alpha_t} \, (1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} x_t + \sqrt{\beta_t} z \)
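A sketch of the guided noise estimate used at each sampling step; the remainder of the update follows the reverse formula above:

```python
import torch

def cc_cfg_estimate(unet, x_t, t_in, c_onehot, gamma):
    """Classifier-free guidance: blend class-conditioned and unconditional predictions."""
    eps_c = unet(x_t, t_in, c_onehot)                               # class-conditioned estimate
    eps_u = unet(x_t, t_in, torch.zeros_like(c_onehot))             # unconditional (class dropped)
    return eps_u + gamma * (eps_c - eps_u)
```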
Sampling results for 4 instances of each digit (0-9) after 5 and 20 epochs of training are shown below:
The training loss curve indicates steady convergence over 20 epochs, suggesting that the UNet learns to predict noise effectively for both class-conditioned and unconditional generation. Sampling results show the improvement in generation quality and diversity after 20 epochs compared to 5 epochs.
Classifier-free guidance significantly enhances the quality of the generated images by leveraging both unconditional and class-conditioned predictions.