This is an overview of my implementation of image super-resolution using generative adversarial networks, based on the paper “Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network” (Ledig et al., 2017).

I did this project to increase my understanding of how GANs are used in practice, and to get better at reading and implementing research papers.

Problem

  • We want to be able to upscale a given image to 4x its resolution, so that the upscaled details look as realistic as possible

Overview of solution

  • We will train two neural networks, a generator and a discriminator, against each other
  • Generator network will be trained to upscale images to 4x their resolution, producing outputs that fool the discriminator into classifying them as natural (“real”) high-resolution images
  • Discriminator network will be trained to distinguish between two types of images: natural high-resolution images, and images upscaled by the generator
  • Both an adversarial (cross-entropy-based) loss and a perceptual (VGG-based) loss will be used

Neural Network Architecture

Diagram of the neural architecture, from the paper "Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network" (Ledig et al., 2017)

Some differences in my implementation:

  • 256 dense layer units in the discriminator, instead of 1024, to save memory
  • Nearest-neighbor upscaling instead of sub-pixel convolutions in the generator, since I already had a nearest-neighbor upscaling layer to use
  • Reflection padding instead of zero padding, in attempt to avoid border artifacts
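As a sketch, nearest-neighbor 4x upscaling just repeats each pixel along both spatial axes. The NumPy stand-in below is illustrative; my actual layer was a TensorFlow op:

```python
import numpy as np

def nearest_neighbor_upscale(image, factor=4):
    """Upscale an (H, W, C) image by repeating each pixel `factor`
    times along both spatial axes (nearest-neighbor interpolation)."""
    return np.repeat(np.repeat(image, factor, axis=0), factor, axis=1)

# A 2x2 single-channel image becomes 8x8 after 4x upscaling.
img = np.arange(4, dtype=np.float32).reshape(2, 2, 1)
up = nearest_neighbor_upscale(img)
print(up.shape)  # (8, 8, 1)
```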

Definitions

  • Discriminator D will output a value between 0 and 1 representing how confident it is that its input is a natural high-resolution image, with 0 meaning generated and 1 meaning natural
  • $D(I_{HR})$ is the output of the discriminator for natural high-resolution image input $I_{HR}$
  • $D(G(I_{LR}))$ is the output of the discriminator for super-resolved image $G(I_{LR})$ that was created using low-resolution image $I_{LR}$

Loss functions

Perceptual loss

As shown in “Perceptual Losses for Real-Time Style Transfer and Super-Resolution” (Johnson et al., 2016), we get more visually pleasing results when using a perceptual loss instead of pixel-wise MSE to measure the differences between two images.

$L_{perceptual}(I_{LR}, I_{HR}) = \frac{1}{n_H n_W n_C}||a(I_{HR}) - a(G(I_{LR}))||^2$

  • $a$ is the activation map of the layer block5_conv4 of a pre-trained VGG19 network
  • $n_H n_W n_C$ is the size of that activation map (height × width × channels)
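As a concrete sketch, once the VGG19 activation maps $a(I_{HR})$ and $a(G(I_{LR}))$ have been extracted (e.g., from the block5_conv4 layer of a pre-trained Keras VGG19), the loss is just their mean squared difference. The function name and dummy arrays below are illustrative:

```python
import numpy as np

def perceptual_loss(act_hr, act_sr):
    """MSE between two activation maps of shape (n_H, n_W, n_C);
    np.mean supplies the 1/(n_H * n_W * n_C) factor."""
    return np.mean((act_hr - act_sr) ** 2)

# Example with dummy activation maps.
a = np.ones((6, 6, 8))
b = np.zeros((6, 6, 8))
print(perceptual_loss(a, b))  # 1.0
```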

Generator loss

The generator loss is the sum of the VGG-based perceptual loss and an adversarial loss. The perceptual, or content, loss has previously been used successfully in non-GAN image super-resolution networks. The adversarial loss helps the generator better capture the distribution of natural images.

$L_{G} = L_{perceptual}(I_{LR}, I_{HR}) - E_{I_{LR}}[\log(D(G(I_{LR})))]$
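A sketch of this combined loss, with a mean-squared-difference perceptual term and the adversarial term $-E[\log D(G(I_{LR}))]$. Names are illustrative; `d_fake` holds the discriminator's outputs on a batch of super-resolved images:

```python
import numpy as np

def generator_loss(act_hr, act_sr, d_fake, eps=1e-8):
    """Perceptual (VGG-feature MSE) loss plus the adversarial term
    -E[log D(G(I_LR))]; eps guards against log(0)."""
    perceptual = np.mean((act_hr - act_sr) ** 2)
    adversarial = -np.mean(np.log(d_fake + eps))
    return perceptual + adversarial
```

When the discriminator is fully fooled (`d_fake` near 1) the adversarial term vanishes; when it confidently rejects the fakes (`d_fake` near 0) the term grows, pushing the generator toward more natural-looking outputs.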

Discriminator loss

This is the same cross-entropy-based adversarial loss function used in vanilla GANs.

$L_{D} = -E_{I_{HR}}[\log{D(I_{HR})}] - E_{I_{LR}}[\log(1 - D(G(I_{LR})))]$
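A sketch of the same loss (names illustrative; `d_real` and `d_fake` are the discriminator's outputs on natural and super-resolved batches, respectively):

```python
import numpy as np

def discriminator_loss(d_real, d_fake, eps=1e-8):
    """Binary cross-entropy pushing D toward 1 on natural images
    and toward 0 on generated ones; eps guards against log(0)."""
    return -np.mean(np.log(d_real + eps)) - np.mean(np.log(1 - d_fake + eps))
```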

Training process

Here are some key details about my training process, and how it compares to the paper’s:

  • High-resolution images $I_{HR}$: I used 96x96 resolution random crops from the MSCOCO-2014 dataset, because I already had that dataset downloaded. The paper used 96x96 resolution random crops of 350,000 images from ImageNet
  • Low-resolution images $I_{LR}$: created from $I_{HR}$ downsampled by a factor of 4 using a bicubic kernel
  • Optimizer: I used an Adam optimizer with a constant learning rate of $10^{-4}$. The paper used a learning rate of $10^{-4}$ for $10^5$ iterations and $10^{-5}$ for another $10^5$ iterations
  • My generator was trained from scratch, whereas the paper employed a trained MSE-based SRResNet as initialization for the generator, to avoid undesired local optima
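The data flow of a single training iteration can be sketched as below. The `generator` and `discriminator` here are trivial stand-ins (nearest-neighbor upscaling and a mean-based sigmoid score) just to show the structure; the real networks, the bicubic downsampling, and the Adam updates are not shown:

```python
import numpy as np

rng = np.random.default_rng(0)

def generator(lr_batch):
    # Stand-in for the real generator: plain nearest-neighbor 4x upscale.
    return np.repeat(np.repeat(lr_batch, 4, axis=1), 4, axis=2)

def discriminator(batch):
    # Stand-in for the real discriminator: a sigmoid of the mean pixel
    # value, giving one confidence in (0, 1) per image.
    return 1.0 / (1.0 + np.exp(-batch.mean(axis=(1, 2, 3))))

# One iteration's forward pass on a batch of 96x96 crops.
hr = rng.random((8, 96, 96, 3))   # high-resolution crops
lr = hr[:, ::4, ::4, :]           # stand-in for bicubic 4x downsampling

sr = generator(lr)                # super-resolved batch
d_real = discriminator(hr)
d_fake = discriminator(sr)

eps = 1e-8
d_loss = -np.mean(np.log(d_real + eps)) - np.mean(np.log(1 - d_fake + eps))
g_adv = -np.mean(np.log(d_fake + eps))  # adversarial part of the generator loss
```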

Personal results

I implemented the algorithm described in the paper (with a few minor differences as described earlier) in TensorFlow, and was able to get similar results (as judged by my own visual inspection). As a post-processing step, I applied histogram matching to the generator output images after training to improve the color. I’m not sure if the paper’s authors did this.
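The histogram-matching step can be sketched as a simple sort-based remapping. This version assumes source and reference have the same number of elements and would be applied per color channel; my actual post-processing may have differed in detail:

```python
import numpy as np

def match_histogram(source, reference):
    """Remap the values of `source` so that, in sorted order, they take
    on the reference's values, matching the empirical distributions."""
    src_flat = source.ravel()
    order = np.argsort(src_flat, kind="stable")
    matched = np.empty_like(src_flat)
    matched[order] = np.sort(reference.ravel())
    return matched.reshape(source.shape)
```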

I haven’t tested to see whether the results would be better with more training, but possibly they would be.

Here is an example of the results:

From left to right: 1) upscaled 4x using bicubic upsampling 2) upscaled 4x using trained super-resolution generator 3) true high-resolution image.
The GANs-based super-resolver is not perfect; there are still some noticeable artifacts and unrealistic details. But in my opinion, it's still impressive that the generator learned how to fill in so many fine details. It's also interesting to note that some areas that were blurry due to being out of focus in the original high-resolution image, are sharp in the super-resolved image. (original content image from Pixabay)

From left to right: 1) upscaled 4x using bicubic upsampling 2) upscaled 4x using trained super-resolution generator 3) true high-resolution image.
Here you can see the artifacts and increased sharpness in the super-resolved image, versus the true high-resolution image, more clearly.

Random thoughts

To possibly think about and research later:

  • What if we progressively grew the resolutions of the discriminator and generator during training, like in “Progressive Growing of GANs for Improved Quality, Stability, and Variation” (Karras et al., 2018)? Would the model train faster and yield better results?
  • What if we replaced the cross-entropy-based adversarial loss with a more modern adversarial loss, such as the Wasserstein loss?

Acknowledgments

This project was based on the paper “Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network” (Ledig et al., 2017).