This post visually demonstrates why the zero padding and transpose convolutions used in some CNNs, for example the one in “Perceptual Losses for Real-Time Style Transfer and Super-Resolution” (Johnson et al., 2016), are problematic. It also shows how some proposed solutions mitigate those problems.

How to address border and checkerboard artifacts in CNNs?

While reading “A Learned Representation For Artistic Style” (Dumoulin et al., 2016), I learned how the authors address the border and checkerboard artifacts sometimes seen in CNN outputs, particularly in neural style transfer. They made two changes:

1. Reflection padding instead of zero-padding

Without any padding, convolutions shrink the feature maps. Commonly, the edges of a feature map are padded with zeros to prevent this shrinkage, but the zeros bear no relation to the values at the edges and so can cause border artifacts. Reflection padding instead pads the feature map with mirrored copies of the values just inside its borders, which alleviates border artifacts.
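As a minimal illustration (using NumPy's `np.pad`, not the padding layers from the network discussed here), compare the two padding modes on a one-dimensional row of feature values:

```python
import numpy as np

# A hypothetical 1-D feature-map row; the same idea extends to 2-D maps.
row = np.array([4.0, 3.0, 2.0, 1.0])

# Zero padding: the pad values (0) bear no relation to the edge values.
zero_padded = np.pad(row, 1, mode="constant", constant_values=0.0)
# -> [0. 4. 3. 2. 1. 0.]

# Reflection padding: the pad values mirror the values just inside the edge.
reflect_padded = np.pad(row, 1, mode="reflect")
# -> [3. 4. 3. 2. 1. 2.]
```

The reflected values continue the local signal smoothly past the border, whereas the zeros introduce an abrupt drop that a convolution kernel then "sees" as an edge.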

2. Nearest-neighbor upsampling instead of transpose convolutions

Transpose convolutions overlap the output unevenly when the kernel size is not divisible by the stride: some output pixels receive contributions from more kernel positions than others, which produces checkerboard artifacts. Nearest-neighbor upsampling has no such uneven overlap, since each output pixel is copied from exactly one input pixel.
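The uneven overlap can be counted directly. The sketch below (a hypothetical 1-D helper, not code from this post's network) tallies how many kernel positions touch each output pixel of a transpose convolution with kernel size 3 and stride 2, and contrasts that with nearest-neighbor upsampling:

```python
import numpy as np

def transpose_conv_overlap(n_in, kernel_size, stride):
    # Each input pixel "stamps" the kernel onto a stride-spaced window of
    # the output; counting the stamps reveals the overlap pattern.
    n_out = (n_in - 1) * stride + kernel_size
    counts = np.zeros(n_out)
    for i in range(n_in):
        counts[i * stride : i * stride + kernel_size] += 1.0
    return counts

# Kernel size 3 is not divisible by stride 2 -> alternating overlap counts,
# i.e. a 1-D checkerboard: [1. 1. 2. 1. 2. 1. 2. 1. 1.]
overlap = transpose_conv_overlap(4, kernel_size=3, stride=2)

# Nearest-neighbor upsampling (factor 2) copies each input pixel to exactly
# two output pixels, so every output pixel is written exactly once.
nn_upsampled = np.repeat(np.ones(4), 2)  # -> [1. 1. 1. 1. 1. 1. 1. 1.]
```

With kernel size 4 and stride 2 the counts come out even, which is why "kernel size divisible by stride" is the usual rule of thumb for transpose convolutions.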

Visualizing the effects of reflection padding and nearest-neighbor upsampling

I think that visualization tests on CNNs are particularly useful for understanding and sanity-checking, especially when implementing custom layers. So I decided to visualize the effects of my implementations of reflection padding and nearest-neighbor upsampling.

The following are the 64x64 outputs of a very simple CNN with different randomly initialized weights, for different inputs and different layer configurations.

Notice that in each row of images, the first has both border and checkerboard artifacts, the second has only border artifacts, and the third has no artifacts.

Outputs of a convolutional layer with an input tensor of all 1s, when the layer uses 1) same padding and transpose convolutions, 2) same padding and nearest-neighbor upsampling, 3) reflection padding and nearest-neighbor upsampling

Outputs of a convolutional layer with a content image input, when the layer uses 1) same padding and transpose convolutions, 2) same padding and nearest-neighbor upsampling, 3) reflection padding and nearest-neighbor upsampling
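The border artifacts in the first two columns can be reproduced in miniature. The following sketch (a stand-in that uses a fixed 3x3 averaging kernel rather than the network's learned weights) runs a "same"-padded convolution over an all-ones input, once with zero padding and once with reflection padding:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

kernel = np.ones((3, 3)) / 9.0  # simple averaging kernel as a stand-in
x = np.ones((6, 6))             # all-ones input, like the first test image

def conv_same(image, pad_mode):
    # "Same" convolution: pad by 1 on each side, slide the 3x3 kernel.
    padded = np.pad(image, 1, mode=pad_mode)
    windows = sliding_window_view(padded, (3, 3))
    return np.einsum("ijkl,kl->ij", windows, kernel)

zero_out = conv_same(x, "constant")   # corners drop to 4/9, edges to 6/9
reflect_out = conv_same(x, "reflect") # stays 1.0 everywhere
```

With zero padding the output darkens toward the borders even though the input is constant, which is exactly the border artifact visible in the first two columns; reflection padding leaves the constant input unchanged.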

Acknowledgments

  • “A Learned Representation For Artistic Style” (Dumoulin et al., 2016)