Part 1.1 Forward Process
Diffusion generates images by taking a noisy image and denoising it into a clean one. This direction is difficult, but the reverse, adding noise to a clean image, is very simple. Therefore, the first part of this project implements the forward process using the following equation:
$$x_t = \sqrt{\bar{\alpha}_t} \cdot x_0 + \sqrt{1 - \bar{\alpha}_t} \cdot \epsilon \quad \text{where} \quad \epsilon \sim N(0, 1)$$
- $x_0 =$ clean image
- $x_t =$ noisy image at timestep $t$
- $\bar{\alpha}_t =$ noise coefficient
- $\epsilon =$ Gaussian noise
Given the original image $x_0$ and a timestep $t \in [0, T]$, the function forward adds the corresponding amount of noise to the image. The image at $t = 0$ is the original image, while at $t = T$ it is pure noise. Thus, $\bar{\alpha}_t$ is close to $1$ for small $t$ and close to $0$ for large $t$.
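As a sketch, the forward equation is only a few lines. The example below uses NumPy for illustration (the project itself operates on PyTorch tensors), and the linear alphas_cumprod schedule is a made-up stand-in for the model's actual $\bar{\alpha}$ values:

```python
import numpy as np

def forward(x0, t, alphas_cumprod, rng=np.random.default_rng(0)):
    """Noise x0 to timestep t: x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * eps."""
    a_bar = alphas_cumprod[t]
    eps = rng.standard_normal(x0.shape)  # eps ~ N(0, 1)
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps

# toy schedule: a_bar close to 1 at t = 0, close to 0 at t = T
T = 1000
alphas_cumprod = np.linspace(0.9999, 1e-4, T + 1)

x0 = np.zeros((3, 8, 8))  # stand-in for a clean image
x_t = forward(x0, 250, alphas_cumprod)
```

At $t = 0$ the output is essentially $x_0$; at $t = T$ it is dominated by the Gaussian noise term.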
campanile.png
|
t = 250
|
t = 500
|
t = 750
|
Part 1.2 Classical Denoising
To recover the original image, one option is to apply Gaussian blur filtering to remove the noise. This was done using the function torchvision.transforms.functional.gaussian_blur.
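The project calls torchvision's built-in blur; purely for illustration, the underlying idea (a separable Gaussian filter applied to rows, then columns) can be sketched in NumPy:

```python
import numpy as np

def gaussian_kernel1d(size, sigma):
    """Normalized 1-D Gaussian weights centered on the kernel."""
    x = np.arange(size) - (size - 1) / 2.0
    k = np.exp(-x**2 / (2 * sigma**2))
    return k / k.sum()

def gaussian_blur(img, size=5, sigma=1.5):
    """Separable Gaussian blur: filter each row, then each column."""
    k = gaussian_kernel1d(size, sigma)
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, img)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, rows)

noisy = np.random.default_rng(0).standard_normal((32, 32))
smooth = gaussian_blur(noisy)
```

Blurring suppresses the high-frequency noise, but it removes high-frequency image detail along with it, which is why the results below remain poor.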
blurred t = 250
|
blurred t = 500
|
blurred t = 750
|
Part 1.3 One-Step Denoising
To achieve a better result, a pretrained diffusion model is used. Given a noisy image im_noisy and timestep $t$, the UNet estimates the amount of noise in the image. Then, this noise is removed from im_noisy to produce the estimated clean image.
$$x_t = \sqrt{\bar{\alpha}_t} \cdot x_0 + \sqrt{1 - \bar{\alpha}_t} \cdot \epsilon \quad \text{where} \quad \epsilon \sim N(0, 1)$$
$$x_t - \sqrt{1 - \bar{\alpha}_t} \cdot \epsilon = \sqrt{\bar{\alpha}_t} \cdot x_0$$
$$x_0 = \frac{x_t - \sqrt{1 - \bar{\alpha}_t} \cdot \epsilon}{ \sqrt{\bar{\alpha}_t} }$$
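The rearrangement above is the entire one-step estimator. A NumPy round trip (with the true noise standing in for the UNet's estimate, so recovery is exact) illustrates it:

```python
import numpy as np

def estimate_x0(x_t, eps_hat, a_bar):
    """Invert the forward equation given the (estimated) noise eps_hat."""
    return (x_t - np.sqrt(1.0 - a_bar) * eps_hat) / np.sqrt(a_bar)

# round trip: noise a known x0, then recover it using the true noise
rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 8))
a_bar = 0.5
eps = rng.standard_normal((8, 8))
x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps
x0_hat = estimate_x0(x_t, eps, a_bar)
```

In practice the UNet's noise estimate is imperfect, so the recovered $x_0$ degrades as $t$ grows.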
Compared to Gaussian blur filtering, the result is clearly improved. The output is most accurate at low $t$, and the quality decreases as $t$ grows, producing an increasingly distorted campanile.
after one-step denoising t = 250
|
after one-step denoising t = 500
|
after one-step denoising t = 750
|
Part 1.4 Iterative Denoising
Because one-step denoising performs worse as more noise is added, iterative denoising is used instead. At each timestep $t$, the function iterative_denoise uses the UNet to estimate the noise, then partially denoises the image $x_t$ to produce the less noisy image $x_{t'}$ at an earlier timestep $t' < t$.
$$x_{t'} = \frac{\sqrt{\bar{\alpha}_{t'}}\,\beta_t}{1 - \bar{\alpha}_t} x_0 + \frac{\sqrt{\alpha_t}\,(1 - \bar{\alpha}_{t'})}{1 - \bar{\alpha}_t} x_t + v_{\sigma}$$
- $x_t = $ image at timestep $t$
- $x_{t'} =$ noisy image at timestep $t'$ where $t' < t$ (less noisy)
- $\bar{\alpha}_t =$ noise coefficient
- $\alpha_t = \bar{\alpha}_t / \bar{\alpha}_{t'}$
- $\beta_t = 1 - \alpha_t$
- $x_0 = $ current estimate of the clean image (generated from one-step denoising)
- $v_{\sigma} =$ random noise calculated by the function add_variance
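Putting the definitions together, one step of this update rule can be sketched in NumPy (the $\bar{\alpha}$ values and image arrays here are made up; in the project they come from the model's schedule and the one-step estimate):

```python
import numpy as np

def denoise_step(x_t, x0_hat, abar_t, abar_tp, v_sigma=0.0):
    """One iterative-denoising update from timestep t (abar_t) to t' < t (abar_tp)."""
    alpha = abar_t / abar_tp          # alpha_t = abar_t / abar_t'
    beta = 1.0 - alpha                # beta_t
    coef_x0 = np.sqrt(abar_tp) * beta / (1.0 - abar_t)
    coef_xt = np.sqrt(alpha) * (1.0 - abar_tp) / (1.0 - abar_t)
    return coef_x0 * x0_hat + coef_xt * x_t + v_sigma

x_t = np.ones((4, 4))      # current noisy image
x0_hat = np.zeros((4, 4))  # one-step estimate of the clean image
x_tp = denoise_step(x_t, x0_hat, abar_t=0.5, abar_tp=0.8)
```

Note that $\bar{\alpha}_{t'} > \bar{\alpha}_t$ since $t'$ is the less noisy timestep.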
While this is the general idea, it is not necessary to denoise at every timestep from $T$ down to $0$. Instead, a list of timesteps strided_timesteps is used to skip steps and accelerate the process. The list starts at $990$, decreases with a stride of $30$, and ends at $0$.
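The stride schedule itself is simple to construct (a sketch; the exact construction in the project may differ):

```python
# 34 timesteps from 990 down to 0 in steps of 30, instead of all 1000
strided_timesteps = list(range(990, -1, -30))

# each denoising iteration maps x_t at one entry to x_t' at the next
pairs = list(zip(strided_timesteps[:-1], strided_timesteps[1:]))
```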
noisy campanile.png
|
t = 690
|
t = 540
|
t = 390
|
t = 240
|
t = 90
|
after iterative denoising
|
after one-step denoising
|
after Gaussian blur filtering
|
Part 1.5 Diffusion Model Sampling
Beyond denoising a noisy image, iterative_denoise can also generate images from scratch. By setting i_start=0 and passing in random noise, the function effectively denoises pure noise. Given the prompt "a high quality photo", the result is an image with fairly discernible content.
Generated images:
Part 1.6 Classifier Free Guidance
To improve the image quality (though at the cost of image diversity), classifier-free guidance (CFG) is utilized. This technique involves computing both a text-conditioned noise estimate $\epsilon_c$ and an unconditional noise estimate $\epsilon_u$. The two are then combined into the final noise estimate $\epsilon = \epsilon_u + \gamma (\epsilon_c - \epsilon_u)$, where $\gamma$ controls the strength of CFG. When $\gamma > 1$, the quality of the generated images is significantly improved.
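The CFG combination is a one-liner; a minimal sketch (with toy arrays standing in for the two UNet outputs):

```python
import numpy as np

def cfg_noise(eps_u, eps_c, gamma=7.0):
    """Extrapolate from the unconditional estimate toward the conditional one."""
    return eps_u + gamma * (eps_c - eps_u)

eps_u = np.zeros((4, 4))  # unconditional noise estimate
eps_c = np.ones((4, 4))   # text-conditioned noise estimate
eps = cfg_noise(eps_u, eps_c, gamma=7.0)
```

With $\gamma = 1$ this reduces to the conditional estimate alone, and with $\gamma = 0$ to the unconditional one; $\gamma > 1$ pushes the estimate past the conditional direction.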
Generated images using $\gamma=7$:
Part 1.7 Image-to-Image Translation
Through iterative denoising, a noisy image is denoised, resulting in an output that roughly resembles the original. Following this concept, it is possible to make edits to existing images. By adding noise to an image and then denoising, the denoising procedure forces the noisy image back onto the manifold of natural images. As more noise is added, the resulting content is further removed from the initial input.
Using the SDEdit algorithm, a noised image is forced back onto the image manifold without conditioning. As shown below, elements of the original content are preserved through the process, with more being preserved at lower noise levels (higher i_start).
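Concretely, i_start indexes into the strided timestep list from Part 1.4, so a higher i_start means starting from a later (less noisy) timestep; a sketch of the mapping:

```python
# i_start indexes into the strided timestep list from Part 1.4
strided_timesteps = list(range(990, -1, -30))
start_t = {i: strided_timesteps[i] for i in (1, 3, 5, 7, 10, 20)}
# i_start = 1 starts at t = 960 (heavy noise, large edits);
# i_start = 20 starts at t = 390 (light noise, small edits)
```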
Campanile
original campanile.png
|
i_start = 1
|
i_start = 3
|
i_start = 5
|
i_start = 7
|
i_start = 10
|
i_start = 20
|
Self-portrait
original elana.jpg
|
i_start = 1
|
i_start = 3
|
i_start = 5
|
i_start = 7
|
i_start = 10
|
i_start = 20
|
French toast
original toast.jpg
|
i_start = 1
|
i_start = 3
|
i_start = 5
|
i_start = 7
|
i_start = 10
|
i_start = 20
|
Part 1.7.1 Editing Hand-Drawn and Web Images
Image-to-image translation can be applied effectively to any input. Even nonrealistic images such as scribbles can be noised and then denoised to be projected onto the natural image manifold.
Pompompurin (from the web)
original pompompurin.png
|
i_start = 1
|
i_start = 3
|
i_start = 5
|
i_start = 7
|
i_start = 10
|
i_start = 20
|
Duck (hand-drawn)
original duck.png
|
i_start = 1
|
i_start = 3
|
i_start = 5
|
i_start = 7
|
i_start = 10
|
i_start = 20
|
Rabbit (hand-drawn)
original rabbit.png
|
i_start = 1
|
i_start = 3
|
i_start = 5
|
i_start = 7
|
i_start = 10
|
i_start = 20
|
Part 1.7.2 Inpainting
This procedure can also be used to selectively modify sections of an image, a technique called inpainting, implemented here following the RePaint paper. Given a binary mask $m$ applied to an image $x_{\text{orig}}$, the goal is a new image that keeps the content of $x_{\text{orig}}$ where $m$ is $0$ and generates new content where $m$ is $1$. To achieve this, after every denoising step the pixels outside the mask are reset to the appropriately noised original, so that only the masked region receives new content. That is, $x_t$ is updated as follows:
$$x_t := m x_t + (1 - m) \cdot \text{forward}(x_{\text{orig}}, t)$$
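The update can be sketched directly from the equation (NumPy stand-ins; the zeros array represents the precomputed result of forward(x_orig, t)):

```python
import numpy as np

def inpaint_update(x_t, x_orig_noised, m):
    """Keep new content where m == 1, restore the noised original where m == 0."""
    return m * x_t + (1 - m) * x_orig_noised

x_t = np.full((4, 4), 5.0)        # current denoising state
x_orig_noised = np.zeros((4, 4))  # stand-in for forward(x_orig, t)
m = np.zeros((4, 4))
m[:, 2:] = 1.0                    # inpaint only the right half
out = inpaint_update(x_t, x_orig_noised, m)
```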
Campanile
original campanile.png
|
campanile_mask.png
|
masked area to be replaced
|
inpainted result
|
Le Roy Avenue
original leroy.jpg
|
leroy_mask.png
|
masked area to be replaced
|
inpainted result
|
Mazda MX-5
original mx5.png
|
mx5_mask.png
|
masked area to be replaced
|
inpainted result
|
Part 1.7.3 Text-Conditioned Image-to-Image Translation
Image-to-image translation can modify the image randomly, or it can be guided with a text prompt. As the noise level decreases (higher i_start), the result resembles the original image more closely.
Campanile & Rocket
Text prompt: "a rocket ship"
original campanile.png
|
i_start = 1
|
i_start = 3
|
i_start = 5
|
i_start = 7
|
i_start = 10
|
i_start = 20
|
Self-portrait & Snowy mountains
Text prompt: "an oil painting of a snowy mountain village"
original elana.jpg
|
i_start = 1
|
i_start = 3
|
i_start = 5
|
i_start = 7
|
i_start = 10
|
i_start = 20
|
Luna (cat) & Dog
Text prompt: "a photo of a dog"
original luna.jpg
|
i_start = 1
|
i_start = 3
|
i_start = 5
|
i_start = 7
|
i_start = 10
|
i_start = 20
|
Part 1.8 Visual Anagrams
Using these tools, visual anagrams can be created. Visual anagrams are optical illusions in which one scene is shown in an image ("an oil painting of people around a campfire"), but when the image is flipped upside down, a different scene is shown ("an oil painting of an old man").
To create this effect, the denoising process is modified. At step $t$, $x_t$ is denoised as normal, guided by the prompt $p_1$ "an oil painting of people around a campfire", to obtain the estimate $\epsilon_1$. Then, $x_t$ is flipped upside down and denoised with the second prompt $p_2$ "an oil painting of an old man" to produce $\epsilon_2$. To bring it back right-side up, $\epsilon_2$ is flipped again, and the two noise estimates are averaged.
$$ \epsilon_1 = \text{UNet}(x_t, t, p_1) $$
$$ \epsilon_2 = \text{flip}(\text{UNet}(\text{flip}(x_t), t, p_2))$$
$$ \epsilon = (\epsilon_1 + \epsilon_2) / 2 $$
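The three equations above can be sketched as follows; fake_unet is a hypothetical stand-in for the real model (it simply echoes its input), used to show that the double flip restores orientation:

```python
import numpy as np

def anagram_noise(unet, x_t, t, p1, p2):
    """Average the upright estimate with the un-flipped estimate of the flipped image."""
    eps1 = unet(x_t, t, p1)
    eps2 = np.flipud(unet(np.flipud(x_t), t, p2))
    return (eps1 + eps2) / 2.0

# stand-in "UNet" that just echoes its input, ignoring t and the prompt
fake_unet = lambda x, t, p: x
x_t = np.arange(16, dtype=float).reshape(4, 4)
eps = anagram_noise(fake_unet, x_t, 0, "p1", "p2")
```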
Old Man & Campfire
"an oil painting of an old man"
|
"an oil painting of people around a campfire"
|
Barista & Waterfalls
"a photo of a hipster barista"
|
"a lithograph of waterfalls"
|
Pencil & Amalfi Coast
"a pencil"
|
"a photo of the amalfi cost"
|
Part 1.9 Hybrid Images
Hybrid images are images in which two scenes are overlaid, one which is visible when viewed up close, and the other which dominates when viewed from a distance. With the diffusion model, hybrid images can be created by using two different text prompts, and then combining low frequencies from one noise estimate with high frequencies of the other.
$$ \epsilon_1 = \text{UNet}(x_t, t, p_1) $$
$$ \epsilon_2 = \text{UNet}(x_t, t, p_2) $$
$$ \epsilon = f_\text{lowpass}(\epsilon_1) + f_\text{highpass}(\epsilon_2)$$
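A sketch of the frequency split, using a crude box blur as the low-pass filter (the project's actual filter choice may differ; the high-pass is simply the residual):

```python
import numpy as np

def lowpass(eps, k=9):
    """Crude low-pass: k-by-k box blur with edge padding (same output size)."""
    pad = k // 2
    p = np.pad(eps, pad, mode="edge")
    out = np.empty_like(eps)
    for i in range(eps.shape[0]):
        for j in range(eps.shape[1]):
            out[i, j] = p[i:i + k, j:j + k].mean()
    return out

def hybrid_noise(eps1, eps2, k=9):
    """Low frequencies of eps1 plus the high-frequency residual of eps2."""
    return lowpass(eps1, k) + (eps2 - lowpass(eps2, k))

rng = np.random.default_rng(0)
eps1 = rng.standard_normal((16, 16))
eps2 = rng.standard_normal((16, 16))
eps = hybrid_noise(eps1, eps2)
```

Because the high-pass is defined as the residual of the same low-pass, feeding the same estimate to both inputs returns it unchanged.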
Skull & Waterfall
Text prompts: "a lithograph of a skull" and "a lithograph of waterfalls"
Man with a hat & People around a campfire
Text prompts: "a man wearing a hat" and "an oil painting of people around a campfire"
Man & Dog
Text prompts: "a photo of a man" and "a photo of a dog"
Part 1.1 Implementing the UNet
First, the simple and composed operations of the UNet were implemented.
- Simple operations: Conv, UpConv, Flatten, Unflatten
- Composed operations: ConvBlock, DownBlock, UpBlock
These operations were then used to construct the following architecture:
Part 1.2 Using the UNet to Train a Denoiser
The UNet is trained on the MNIST dataset to denoise images of handwritten digits. For each image in the dataset, noise is added to a varying degree controlled by a noise coefficient in $[0, 1.0]$: at $0$ the result is the original image, while at $1.0$ the image is dominated by noise. Thus, as this coefficient increases, so does the amount of noise added.
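The noising here is presumably of the simple additive form $z = x + \sigma \epsilon$; a NumPy sketch under that assumption:

```python
import numpy as np

def add_noise(x, sigma, rng):
    """z = x + sigma * eps; sigma = 0 keeps the image, sigma = 1 buries it in noise."""
    return x + sigma * rng.standard_normal(x.shape)

rng = np.random.default_rng(0)
x = np.zeros((28, 28))  # stand-in for a 28x28 MNIST digit
z = add_noise(x, 0.5, rng)
```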
The model is trained for $5$ epochs on images with noise applied at noise_level = 0.5, using batch_size = 256 and the Adam optimizer with a learning rate of $1e-4$.
training loss over $5$ epochs
|
results after $0$ epochs
|
results after $5$ epochs
|
The model is then evaluated on images from the test set, noised to varying degrees.
images with noise added at noise_level=0
|
denoised results
|
images with noise added at noise_level=0.2
|
denoised results
|
images with noise added at noise_level=0.4
|
denoised results
|
images with noise added at noise_level=0.5
|
denoised results
|
images with noise added at noise_level=0.6
|
denoised results
|
images with noise added at noise_level=0.8
|
denoised results
|
images with noise added at noise_level=1.0
|
denoised results
|
Part 2.1 Adding Time Conditioning to UNet
To improve model performance, the UNet is conditioned on the timestep $t$, which is injected into the network using FCBlocks.
time-conditioned UNet
|
The following algorithm is implemented to train the UNet guided by $t$.
The model is trained with num_hiddens=64 and batch_size=128 for $20$ epochs. The Adam optimizer is used with an initial learning rate of $1e-3$ and an exponential learning rate decay scheduler ($\gamma = 0.1^{1/\text{num\_epochs}}$), taking one step every epoch.
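With this choice of $\gamma$, the learning rate falls by exactly a factor of $10$ over the run (in PyTorch this is presumably torch.optim.lr_scheduler.ExponentialLR); the arithmetic can be checked directly:

```python
# exponential decay: multiplying by gamma each epoch gives a 10x drop overall
num_epochs = 20
lr0 = 1e-3
gamma = 0.1 ** (1 / num_epochs)

# learning rate after each epoch (index 0 = initial rate)
lrs = [lr0 * gamma**e for e in range(num_epochs + 1)]
```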
training loss over $20$ epochs
|
results after $0$ epochs
|
results after $5$ epochs
|
results after $20$ epochs
|