Fun With Diffusion Models!
Elana Ho
Overview

5A: Part 0

I first experimented by running an existing trained stable diffusion model, DeepFloyd IF. Given a test prompt as input and a random seed (I used $180$), the model outputs an image that accurately depicts the description. The images start off as white noise, and at each step the objects take form. As num_inference_steps increases, the images become more complex and less noisy.


num_inference_steps = 1

num_inference_steps = 5

num_inference_steps = 10

num_inference_steps = 20

5A: Part 1.1-1.4

Part 1.1 Forward process

Diffusion involves taking in a noisy image and denoising it to produce a clean image. This process is difficult, but the reverse (adding noise to a clean image) is very simple. Therefore, the first part of this project is implementing the forward process using the following equation:

$$x_t = \sqrt{\bar{\alpha}_t} \cdot x_0 + \sqrt{1 - \bar{\alpha}_t} \cdot \epsilon \quad \text{where} \quad \epsilon \sim N(0, 1)$$

Given the original image $x_0$ and timestep $t \in [0, T]$, the function forward noises the image. The image at $t = 0$ is the original image, while at $t = T$, the image is pure noise. Thus, $\bar{\alpha}_t$ is close to 1 for small $t$ and close to $0$ for large $t$.
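The forward process can be sketched as below. This is a minimal NumPy illustration (the project itself operates on PyTorch tensors), and the linear `alphas_cumprod` schedule is a toy stand-in for the model's actual noise schedule:

```python
import numpy as np

def forward(x0, t, alphas_cumprod, rng=np.random.default_rng(180)):
    """Noise x0 to timestep t: x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * eps."""
    a_bar = alphas_cumprod[t]
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps

# Toy linear schedule: a_bar is 1 at t = 0 (clean) and 0 at t = T (pure noise).
T = 1000
alphas_cumprod = np.linspace(1.0, 0.0, T + 1)
```

At $t = 0$ the function returns the image unchanged; as $t$ grows, the noise term dominates.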


campanile.png

t = 250

t = 500

t = 750

Part 1.2 Classical Denoising

In order to recover the original image, one option is to apply Gaussian blur filtering to remove the noise. This was done using torchvision.transforms.functional.gaussian_blur.


blurred t = 250

blurred t = 500

blurred t = 750

Part 1.3 One-Step Denoising

To achieve a better result, a pretrained diffusion model is used. Given a noisy image im_noisy and timestep $t$, the UNet estimates the amount of noise in the image. Then, this noise is removed from im_noisy to produce the estimated clean image.

$$x_t = \sqrt{\bar{\alpha}_t} \cdot x_0 + \sqrt{1 - \bar{\alpha}_t} \cdot \epsilon \quad \text{where} \quad \epsilon \sim N(0, 1)$$ $$x_t - \sqrt{1 - \bar{\alpha}_t} \cdot \epsilon = \sqrt{\bar{\alpha}_t} \cdot x_0$$ $$x_0 = \frac{x_t - \sqrt{1 - \bar{\alpha}_t} \cdot \epsilon}{ \sqrt{\bar{\alpha}_t} }$$
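The rearrangement above can be sketched as a small helper (NumPy for illustration; in the project, eps_hat comes from the pretrained UNet):

```python
import numpy as np

def estimate_x0(x_t, eps_hat, a_bar):
    """Invert the forward equation: x0 = (x_t - sqrt(1 - a_bar) * eps) / sqrt(a_bar)."""
    return (x_t - np.sqrt(1.0 - a_bar) * eps_hat) / np.sqrt(a_bar)

# Sanity check: with the exact noise, the clean image is recovered.
rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 8))
eps = rng.standard_normal((8, 8))
a_bar = 0.7
x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps
x0_hat = estimate_x0(x_t, eps, a_bar)
```

In practice the UNet's noise estimate is imperfect, which is why quality degrades at higher $t$.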

Compared to Gaussian blur filtering, the result is improved. The output is most accurate at lower $t$, and the quality decreases as $t$ increases, producing a more distorted campanile.


after one-step denoising
t = 250

after one-step denoising
t = 500

after one-step denoising
t = 750

Part 1.4 Iterative Denoising

Because one-step denoising performs worse as more noise is added, iterative denoising is used instead. At each timestep $t$, the function iterative_denoise uses a UNet to estimate the noise, then partially denoises the image $x_t$ to generate the less noisy image $x_{t'}$ at the earlier timestep $t' < t$.

$$x_{t'} = \frac{\sqrt{\bar{\alpha}_{t'}} \beta_t}{1 - \bar{\alpha}_t} x_0 + \frac{\sqrt{\alpha_t} (1 - \bar{\alpha}_{t'})}{1 - \bar{\alpha}_t} x_t + v_{\sigma}$$

where $\alpha_t = \bar{\alpha}_t / \bar{\alpha}_{t'}$, $\beta_t = 1 - \alpha_t$, $x_0$ is the current estimate of the clean image, and $v_\sigma$ is random noise.

While this is the general idea, it is not necessary to denoise at every timestep from $T$ to $0$. Instead, a list of timesteps strided_timesteps is used to skip steps, which accelerates the process. This list starts at $990$ and decreases with a stride of $30$, eventually reaching $0$.
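The schedule and one update step can be sketched as follows (NumPy for illustration; x0_hat stands in for the UNet's clean-image estimate, and the random noise term $v_\sigma$ is omitted for clarity):

```python
import numpy as np

# Strided schedule: 990, 960, ..., 30, 0 (34 timesteps instead of 1000).
strided_timesteps = list(range(990, -1, -30))

def denoise_step(x_t, x0_hat, a_bar_t, a_bar_tp):
    """One update from timestep t to the earlier t' (v_sigma omitted)."""
    alpha = a_bar_t / a_bar_tp   # per-step alpha between t and t'
    beta = 1.0 - alpha
    return (np.sqrt(a_bar_tp) * beta * x0_hat
            + np.sqrt(alpha) * (1.0 - a_bar_tp) * x_t) / (1.0 - a_bar_t)
```

A useful sanity check: with a perfect estimate and no noise, $x_t = \sqrt{\bar{\alpha}_t}\, x_0$ maps to $\sqrt{\bar{\alpha}_{t'}}\, x_0$, i.e. the same clean image at a lower noise level.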


noisy campanile.png

t = 690

t = 540

t = 390

t = 240

t = 90

after iterative denoising

after one-step denoising

after Gaussian blur filtering

5A: Part 1.5-1.6

Part 1.5 Diffusion Model Sampling

Beyond denoising a noisy image, iterative_denoise can also generate images from scratch. By setting i_start=0 and passing in random noise, the function effectively denoises pure noise. Given the prompt "a high quality photo", the result is an image with fairly discernible content.

Generated images:






Part 1.6 Classifier-Free Guidance

To improve the image quality (though at the cost of image diversity), classifier-free guidance (CFG) is utilized. This technique involves computing both a text-conditioned noise estimate $\epsilon_c$ and an unconditional noise estimate $\epsilon_u$. Both $\epsilon_c$ and $\epsilon_u$ are then combined into the final noise estimate $\epsilon = \epsilon_u + \gamma (\epsilon_c - \epsilon_u)$ where $\gamma$ controls the strength of CFG. When $\gamma > 1$, the quality of the generated images is significantly improved.
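The CFG combination is a one-liner; a minimal sketch:

```python
import numpy as np

def cfg_noise(eps_u, eps_c, gamma=7.0):
    """Classifier-free guidance: extrapolate past eps_c, away from eps_u."""
    return eps_u + gamma * (eps_c - eps_u)
```

With $\gamma = 1$ this reduces to the plain conditional estimate; $\gamma > 1$ pushes the estimate further in the direction the text prompt suggests.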

Generated images using $\gamma=7$:







5A: Part 1.7-1.8

Part 1.7 Image-to-Image Translation

Through iterative denoising, a noisy image is denoised, resulting in an output that roughly resembles the original. Following this concept, it is possible to make edits to existing images. By adding noise to an image and then denoising, the denoising procedure forces the noisy image back onto the manifold of natural images. As more noise is added, the resulting content is further removed from the initial input.

By using the SDEdit algorithm, a noised image is forced onto the image manifold without conditioning. As shown below, elements of the original content are preserved through the process, with more preserved at lower noise levels (higher i_start).
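The SDEdit control flow can be sketched as below. This is a toy skeleton: denoise_step is a stand-in for one step of iterative_denoise (the real version calls the UNet), and the schedule details are simplified:

```python
import numpy as np

strided_timesteps = list(range(990, -1, -30))

def sdedit(x_orig, i_start, alphas_cumprod, denoise_step, rng=np.random.default_rng(0)):
    """Noise x_orig to strided_timesteps[i_start], then iteratively denoise back to t = 0."""
    t = strided_timesteps[i_start]
    a_bar = alphas_cumprod[t]
    # Forward-noise the original image to the chosen starting level.
    x = np.sqrt(a_bar) * x_orig + np.sqrt(1.0 - a_bar) * rng.standard_normal(x_orig.shape)
    # Denoise along the remaining strided timesteps.
    for i in range(i_start, len(strided_timesteps) - 1):
        x = denoise_step(x, strided_timesteps[i], strided_timesteps[i + 1])
    return x
```

Higher i_start means a later (less noisy) starting point, so more of the original survives.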

Campanile


original campanile.png

i_start = 1

i_start = 3

i_start = 5

i_start = 7

i_start = 10

i_start = 20

Self-portrait


original elana.jpg

i_start = 1

i_start = 3

i_start = 5

i_start = 7

i_start = 10

i_start = 20

French toast


original toast.jpg

i_start = 1

i_start = 3

i_start = 5

i_start = 7

i_start = 10

i_start = 20

Part 1.7.1 Editing Hand-Drawn and Web Images

Image-to-image translation can be applied effectively to any input. Even nonrealistic images such as scribbles can be noised and then denoised to be projected onto the natural image manifold.

Pompompurin (from the web)


original pompompurin.png

i_start = 1

i_start = 3

i_start = 5

i_start = 7

i_start = 10

i_start = 20

Duck (hand-drawn)


original duck.png

i_start = 1

i_start = 3

i_start = 5

i_start = 7

i_start = 10

i_start = 20

Rabbit (hand-drawn)


original rabbit.png

i_start = 1

i_start = 3

i_start = 5

i_start = 7

i_start = 10

i_start = 20

Part 1.7.2 Inpainting

This procedure can be utilized to selectively modify sections of a given image, a technique called inpainting. Following the RePaint paper, inpainting is implemented. By applying a binary mask $m$ on an image $x_{\text{orig}}$, a new image is created with content from $x_{\text{orig}}$ where $m$ is 0 and new content where $m$ is 1. This is done by adding noise to the masked area and then isolating the denoising process there. Throughout the process, the unmasked $x_{\text{orig}}$ pixels are preserved by updating $x_t$ as follows:

$$x_t := m x_t + (1 - m) \cdot \text{forward}(x_{\text{orig}}, t)$$
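The per-step mask update can be sketched as follows (NumPy for illustration; forward is the forward-noising function from Part 1.1):

```python
import numpy as np

def inpaint_update(x_t, x_orig, m, t, forward):
    """Keep the denoised content where m == 1; re-noise the original elsewhere."""
    return m * x_t + (1 - m) * forward(x_orig, t)
```

Applying this after every denoising step pins the unmasked pixels to (appropriately noised copies of) the original image.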

Campanile


original campanile.png

campanile_mask.png

masked area to be replaced

inpainted result

Le Roy Avenue


original leroy.jpg

leroy_mask.png

masked area to be replaced

inpainted result

Mazda MX-5


original mx5.png

mx5_mask.png

masked area to be replaced

inpainted result

Part 1.7.3 Text-Conditioned Image-to-Image Translation

Image-to-image translation can modify the image randomly, or it can be guided with a text prompt. As the noise level decreases (higher i_start), the result resembles the original image more closely.

Campanile & Rocket

Text prompt: "a rocket ship"


original campanile.png

i_start = 1

i_start = 3

i_start = 5

i_start = 7

i_start = 10

i_start = 20

Self-portrait & Snowy mountains

Text prompt: "an oil painting of a snowy mountain village"


original elana.jpg

i_start = 1

i_start = 3

i_start = 5

i_start = 7

i_start = 10

i_start = 20

Luna (cat) & Dog

Text prompt: "a photo of a dog"


original luna.jpg

i_start = 1

i_start = 3

i_start = 5

i_start = 7

i_start = 10

i_start = 20

Part 1.8 Visual Anagrams

Using these tools, visual anagrams can be created. Visual anagrams are optical illusions in which one scene is shown in an image ("an oil painting of people around a campfire"), but when the image is flipped upside down, a different scene is shown ("an oil painting of an old man").

To create this effect, the denoising process is modified. At step $t$, $x_t$ is denoised as normal, guided by the prompt $p_1$ "an oil painting of people around a campfire" to obtain the estimate $\epsilon_1$. Then, $x_t$ is flipped and denoised with the second prompt $p_2$ "an oil painting of an old man" to produce $\epsilon_2$. To make it right-side up, $\epsilon_2$ is flipped back, and the two noise estimates are averaged.

$$ \epsilon_1 = \text{UNet}(x_t, t, p_1) $$ $$ \epsilon_2 = \text{flip}(\text{UNet}(\text{flip}(x_t), t, p_2))$$ $$ \epsilon = (\epsilon_1 + \epsilon_2) / 2 $$
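The three equations above can be sketched as one function (NumPy for illustration; unet is a stand-in for the prompt-conditioned noise estimator, and np.flipud plays the role of the vertical flip):

```python
import numpy as np

def anagram_noise(x_t, t, p1, p2, unet):
    """Average the upright estimate with the flipped estimate from the second prompt."""
    eps1 = unet(x_t, t, p1)
    eps2 = np.flipud(unet(np.flipud(x_t), t, p2))  # flip in, flip the estimate back out
    return 0.5 * (eps1 + eps2)
```

Because both estimates are expressed in the upright frame before averaging, the denoiser simultaneously satisfies $p_1$ upright and $p_2$ upside down.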

Old Man & Campfire


"an oil painting of
an old man"

"an oil painting of
people around a campfire"

Barista & Waterfalls


"a photo of
a hipster barista"

"a lithograph of
waterfalls"

Pencil & Amalfi Coast


"a pencil"

"a photo of
the amalfi cost"

Part 1.9 Hybrid Images

Hybrid images are images in which two scenes are overlaid, one which is visible when viewed up close, and the other which dominates when viewed from a distance. With the diffusion model, hybrid images can be created by using two different text prompts, and then combining low frequencies from one noise estimate with high frequencies of the other.

$$ \epsilon_1 = \text{UNet}(x_t, t, p_1) $$ $$ \epsilon_2 = \text{UNet}(x_t, t, p_2) $$ $$ \epsilon = f_\text{lowpass}(\epsilon_1) + f_\text{highpass}(\epsilon_2)$$
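The frequency split can be sketched as below. For a self-contained illustration, a simple box filter stands in for the low-pass; the actual implementation uses a Gaussian blur:

```python
import numpy as np

def lowpass(img, k=9):
    """Box-filter low-pass (a stand-in for the Gaussian blur used in practice)."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.zeros_like(img, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def hybrid_noise(eps1, eps2):
    """Low frequencies from eps1, high frequencies from eps2."""
    return lowpass(eps1) + (eps2 - lowpass(eps2))
```

The high-pass is computed as the residual eps2 - lowpass(eps2), so feeding the same estimate to both slots recombines it exactly.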

Skull & Waterfall

Text prompts: "a lithograph of a skull" and "a lithograph of waterfalls"

Man with a hat & People around a campfire

Text prompts: "a man wearing a hat" and "an oil painting of people around a campfire"

Man & Dog

Text prompts: "a photo of a man" and "a photo of a dog"


5B: Part 1 Training a Single-step Denoising UNet

Part 1.1 Implementing the UNet

First, the simple and composed operations of the UNet were implemented.

These operations were then used to construct the following architecture:

Part 1.2 Using the UNet to Train a Denoiser

The UNet is trained on the MNIST dataset to denoise noisy images of handwritten digits. For each image in the dataset, noise is added to varying degrees depending on the noise coefficient $\in [0, 1.0]$: at level $0$, the result is the original image, while at $1.0$ it is dominated by noise. Thus, as this coefficient increases, so does the amount of noise added.
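A minimal sketch of this noising step, assuming the additive scheme $z = x + \sigma \epsilon$ (NumPy for illustration; the project applies this to MNIST tensors in PyTorch):

```python
import numpy as np

def add_noise(x, sigma, rng=np.random.default_rng(0)):
    """Additive noising z = x + sigma * eps; sigma = 0 leaves the image unchanged."""
    return x + sigma * rng.standard_normal(x.shape)
```

The denoiser is then trained to map `add_noise(x, sigma)` back to `x`.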

The model is trained for $5$ epochs on dataset images noised at noise_level = 0.5, with batch_size=256. The Adam optimizer is used with a learning rate of $1e-4$.


training loss over $5$ epochs


results after $0$ epochs

results after $5$ epochs

The model is then evaluated on images from the test set, noised to varying degrees.


images with noise added at noise_level=0

denoised results

images with noise added at
noise_level=0.2

denoised results

images with noise added at
noise_level=0.4

denoised results

images with noise added at
noise_level=0.5

denoised results

images with noise added at
noise_level=0.6

denoised results

images with noise added at
noise_level=0.8

denoised results

images with noise added at
noise_level=1.0

denoised results

5B: Part 2 Training a Diffusion Model

2.1 Adding Time Conditioning to UNet

To improve model performance, the UNet is conditioned on the timestep $t$. To inject $t$ into the UNet, FCBlocks are used.


time-conditioned UNet

The following algorithm is implemented to train the UNet guided by $t$.


The model is trained with num_hiddens=64 and batch_size=128 for $20$ epochs. The Adam optimizer is used with an initial learning rate of $1e-3$ and an exponential learning rate decay scheduler ($\gamma = 0.1^{1 / \text{num\_epochs}}$), taking one step every epoch.
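The choice of decay factor can be checked with a few lines of arithmetic: stepping once per epoch multiplies the learning rate by $\gamma = 0.1^{1/\text{num\_epochs}}$, so after all $20$ epochs the rate has decayed by exactly a factor of $10$:

```python
# Exponential decay: one scheduler step per epoch multiplies lr by gamma.
num_epochs = 20
lr = 1e-3
gamma = 0.1 ** (1 / num_epochs)
for _ in range(num_epochs):
    lr *= gamma
# lr is now ~1e-4: a total decay of 0.1 spread evenly across the epochs
```

This mirrors what torch.optim.lr_scheduler.ExponentialLR does when stepped once per epoch.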

training loss over $20$ epochs


results after $0$ epochs

results after $5$ epochs

results after $20$ epochs