This is an overview of my implementation of image super-resolution using generative adversarial networks, based on the paper “Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network” (Ledig et al., 2017).
I did this project to increase my understanding of how GANs are used in practice, and to get better at reading and implementing research papers.
Problem
- We want to be able to upscale a given image to 4x its resolution, so that the upscaled details look as realistic as possible
Overview of solution
- We will train two neural networks, a generator and a discriminator, against each other
- Generator network will be trained to upscale images to 4x their resolution such that the results fool the discriminator into classifying them as natural (“real”) high-resolution images
- Discriminator network will be trained to discriminate between two types of images: natural high-resolution images, and images upscaled by the generator
- Both an adversarial (cross-entropy-based) loss and a perceptual (VGG-based) loss will be used
Neural Network Architecture
Some differences in my implementation:
- 256 dense layer units in the discriminator, instead of 1024, to save memory
- Nearest-neighbor upscaling instead of sub-pixel convolutions in the generator, since I already had a nearest-neighbor upscaling layer to use
- Reflection padding instead of zero padding, in an attempt to avoid border artifacts (see the sketch after this list)
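As a rough illustration of the last two points, here is a minimal sketch in TensorFlow/Keras of an upsampling block that uses nearest-neighbor upscaling and a reflection-padded convolution. The function name and filter count are my own assumptions, not values from the paper:

```python
import tensorflow as tf

def upsample_block(x, filters=256):
    """Hypothetical upsampling block: nearest-neighbor 2x upscaling followed
    by a reflection-padded 3x3 convolution. The filter count is an assumption."""
    # Nearest-neighbor 2x upscaling in place of a sub-pixel convolution
    x = tf.keras.layers.UpSampling2D(size=2, interpolation="nearest")(x)
    # Reflection padding instead of zero padding, to reduce border artifacts
    x = tf.pad(x, [[0, 0], [1, 1], [1, 1], [0, 0]], mode="REFLECT")
    x = tf.keras.layers.Conv2D(filters, kernel_size=3, padding="valid")(x)
    return tf.keras.layers.PReLU(shared_axes=[1, 2])(x)
```

Nearest-neighbor upscaling followed by a convolution is a common alternative to sub-pixel convolutions and tends to produce fewer checkerboard artifacts.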
Definitions
- Discriminator D will output a value between 0 and 1 representing how confident it is that its input is a natural high-resolution image, with 0 meaning generated and 1 meaning natural
- $D(I_{HR})$ is the output of the discriminator for natural high-resolution image input $I_{HR}$
- $D(G(I_{LR}))$ is the output of the discriminator for super-resolved image $G(I_{LR})$ that was created using low-resolution image $I_{LR}$
Loss functions
Perceptual loss
As shown in “Perceptual Losses for Real-Time Style Transfer and Super-Resolution” (Johnson et al., 2016), we get more visually pleasing results when using a perceptual loss instead of pixel-wise MSE to measure the differences between two images.
$L_{perceptual}(I_{LR}, I_{HR}) = \frac{1}{n_H n_W n_C}\|a(I_{HR}) - a(G(I_{LR}))\|^2$
- $a$ is the activation map of the layer block5_conv4 of a pre-trained VGG19 network
- $n_H n_W n_C$ is the size of the activation map (height × width × channels)
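Here is a minimal sketch of how this perceptual loss can be computed with a pre-trained VGG19 in TensorFlow/Keras. The helper name and the assumption that images are scaled to [0, 1] are mine:

```python
import tensorflow as tf

# Feature extractor that outputs the block5_conv4 activations of a
# pre-trained (and frozen) VGG19
vgg = tf.keras.applications.VGG19(include_top=False, weights="imagenet")
feature_extractor = tf.keras.Model(
    inputs=vgg.input, outputs=vgg.get_layer("block5_conv4").output)
feature_extractor.trainable = False

def perceptual_loss(hr_images, sr_images):
    """MSE between the VGG19 activations of the natural high-resolution
    images and the super-resolved images (both assumed to be in [0, 1])."""
    # VGG19's preprocess_input expects pixel values in [0, 255]
    hr_features = feature_extractor(
        tf.keras.applications.vgg19.preprocess_input(hr_images * 255.0))
    sr_features = feature_extractor(
        tf.keras.applications.vgg19.preprocess_input(sr_images * 255.0))
    # reduce_mean averages over the activation map (and the batch),
    # matching the 1 / (n_H n_W n_C) factor
    return tf.reduce_mean(tf.square(hr_features - sr_features))
```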
Generator loss
The generator loss is the sum of the VGG-based perceptual loss and an adversarial loss. The perceptual, or content, loss has previously been used successfully in non-GAN image super-resolution networks. The adversarial loss helps the generator better capture the distribution of natural images.
$L_{G} = L_{perceptual}(I_{LR}, I_{HR}) - E_{I_{LR}}[\log(D(G(I_{LR})))]$
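A sketch of this generator loss in TensorFlow, reusing the perceptual_loss helper sketched above. The small epsilon and the assumption that the super-resolved batch has already been computed are mine:

```python
def generator_loss(hr_images, sr_images, discriminator):
    """Perceptual loss plus the adversarial term -E[log(D(G(I_LR)))],
    with sr_images = G(I_LR) computed beforehand."""
    eps = 1e-8  # avoids log(0)
    adversarial = -tf.reduce_mean(
        tf.math.log(discriminator(sr_images, training=True) + eps))
    return perceptual_loss(hr_images, sr_images) + adversarial
```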
Discriminator loss
This is the same cross-entropy-based adversarial loss function used in vanilla GANs.
$L_{D} = -E_{I_{HR}}[\log{D(I_{HR})}] - E_{I_{LR}}[\log(1 - D(G(I_{LR})))]$
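And a corresponding sketch of the discriminator loss, with the same caveats as above:

```python
def discriminator_loss(hr_images, sr_images, discriminator):
    """Standard GAN cross-entropy loss: push D(I_HR) toward 1 and
    D(G(I_LR)) toward 0."""
    eps = 1e-8  # avoids log(0)
    real_output = discriminator(hr_images, training=True)
    fake_output = discriminator(sr_images, training=True)
    return (-tf.reduce_mean(tf.math.log(real_output + eps))
            - tf.reduce_mean(tf.math.log(1.0 - fake_output + eps)))
```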
Training process
Here are some key details about my training process, and how it compares to the paper’s:
- High-resolution images $I_{HR}$: I used 96x96 resolution random crops from the MSCOCO-2014 dataset, because I already had that dataset downloaded. The paper used 96x96 resolution random crops of 350,000 images from ImageNet
- Low-resolution images $I_{LR}$: created from $I_{HR}$ downsampled by a factor of 4 using a bicubic kernel
- Optimizer: I used an Adam optimizer with a constant learning rate of $10^{-4}$. The paper used a learning rate of $10^{-4}$ for $10^5$ iterations and $10^{-5}$ for another $10^5$ iterations
- My generator was trained from scratch, whereas the paper employed a trained MSE-based SRResNet as initialization for the generator, to avoid undesired local optima. A rough sketch of a single training step is shown after this list.
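Here is a rough sketch of one training step under my setup, combining the loss sketches above. The crop size, downsampling factor, and optimizer settings follow the list above, but the helper names and overall structure are assumptions, not a verbatim copy of my code:

```python
import tensorflow as tf

gen_optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)
disc_optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)

@tf.function
def train_step(hr_images, generator, discriminator):
    """One update of the generator and discriminator on a batch of
    96x96 high-resolution crops scaled to [0, 1]."""
    # Low-resolution inputs: HR crops downsampled 4x (96 -> 24) with a bicubic kernel
    lr_images = tf.image.resize(hr_images, size=(24, 24), method="bicubic")

    with tf.GradientTape() as gen_tape, tf.GradientTape() as disc_tape:
        sr_images = generator(lr_images, training=True)
        g_loss = generator_loss(hr_images, sr_images, discriminator)
        d_loss = discriminator_loss(hr_images, sr_images, discriminator)

    gen_grads = gen_tape.gradient(g_loss, generator.trainable_variables)
    disc_grads = disc_tape.gradient(d_loss, discriminator.trainable_variables)
    gen_optimizer.apply_gradients(
        zip(gen_grads, generator.trainable_variables))
    disc_optimizer.apply_gradients(
        zip(disc_grads, discriminator.trainable_variables))
    return g_loss, d_loss
```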
Personal results
I implemented the algorithm described in the paper (with a few minor differences as described earlier) in Tensorflow, and was able to get similar results (as judged by my own visual inspection). As a post-processing step, I did histogram-matching on the generator output images after training to improve the color. I’m not sure if the paper’s authors did this.
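For the histogram-matching post-processing step, here is a minimal sketch using scikit-image's match_histograms. The choice of reference image is an assumption for illustration, not necessarily what I used:

```python
import numpy as np
from skimage.exposure import match_histograms

def match_colors(sr_image: np.ndarray, reference_image: np.ndarray) -> np.ndarray:
    """Match the per-channel color histograms of a super-resolved image to a
    reference image (e.g. the bicubic-upscaled low-resolution input; the
    choice of reference here is an assumption)."""
    return match_histograms(sr_image, reference_image, channel_axis=-1)
```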
I haven’t tested whether the results would be better with more training, but possibly they would be.
Here is an example of the results:
Random thoughts
To possibly think about and research later:
- What if we progressively grew the resolutions of the discriminator and generator during training, like in “Progressive Growing of GANs for Improved Quality, Stability, and Variation” (Karras et al., 2018)? Would the model train faster and yield better results?
- What if we replaced the cross-entropy-based adversarial loss with a more modern adversarial loss, such as the Wasserstein loss?
Acknowledgments
This project was based on the paper “Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network” (Ledig et al., 2017).