This is an introduction to neural style transfer (NST) and an explanation of a few of the basic techniques it uses.

I was motivated to learn about neural style transfer because I thought it would be a fun way to get better at implementing research paper ideas in Tensorflow, and to learn more about image-processing neural networks.

What does neural style transfer do?

Given a content image and a style image, a neural style transfer algorithm outputs a new image whose contents (ex. a dog) are recognizably those of the content image, but whose “brush” strokes are those of the style image. Here is an example:

From left to right: the content image (from Pixabay), the style image, and the stylized image

What is content and style in this context?

You can consider the “content” to be the higher-level spatial relationships in the image, ex. the shape of a dog, rather than the texture of its fur.

You can consider the “style” to be like the “textures” in an image, rather than its higher-level spatial relationships.

A basic algorithm for neural style transfer

This section will review the style transfer algorithm described in the paper “A Neural Algorithm of Artistic Style” (Gatys et al., 2015). I think it’s a good introduction to neural style transfer and to backpropagation, because it’s fun yet relatively simple - it doesn’t even require updating any weights, only pixels.

1. How can we measure the difference in “content” of two images?

We could calculate the mean per-pixel squared difference. However, this value can be high even for two images that look very similar in content to us - for example, if one image is shifted by a barely perceptible number of pixels.

There is a more effective way of comparing the contents of two images, one closer to how our brains do it: we calculate the mean squared difference of the two images’ activations at a particular layer of a pre-trained convolutional image-processing network, such as VGG-19. The higher the chosen layer, the more it represents higher-level content and the less sensitive it is to exact pixel values.

Content loss = $J_{content}(C, G) = \frac{1}{n_H[L]n_W[L]n_C[L]}||a_L(C) - a_L(G)||^2$

  • $C$: the content image
  • $G$: the generated image
  • $a_L(x)$: the activation map at layer $L$ of a VGG network with input image $x$
  • $n_H[L], n_W[L], n_C[L]$: height, width, and number of channels of C’s or G’s activation map at VGG layer $L$ (the two maps must have the same shape). Their product is the total size of the activation map at layer $L$.
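
For concreteness, here is a minimal sketch of this content loss in TensorFlow, using a pre-trained VGG-19 as the feature extractor. The choice of `block4_conv2` as the content layer is an assumption for illustration, not the only valid choice; images are assumed to be float tensors of shape (1, H, W, 3) in the 0–255 range.

```python
import tensorflow as tf

# Assumed content layer; any mid-to-high VGG-19 conv layer can be used.
CONTENT_LAYER = "block4_conv2"

# Pre-trained VGG-19 used only as a fixed feature extractor (weights frozen).
vgg = tf.keras.applications.VGG19(include_top=False, weights="imagenet")
vgg.trainable = False
content_extractor = tf.keras.Model(
    inputs=vgg.input, outputs=vgg.get_layer(CONTENT_LAYER).output)

def content_loss(content_image, generated_image):
    """Mean squared difference of the VGG activations at the chosen layer.

    Taking the mean over all activation entries matches the
    1 / (n_H * n_W * n_C) normalization in the formula above.
    """
    a_c = content_extractor(tf.keras.applications.vgg19.preprocess_input(content_image))
    a_g = content_extractor(tf.keras.applications.vgg19.preprocess_input(generated_image))
    return tf.reduce_mean(tf.square(a_c - a_g))
```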

2. How can we measure the difference in “style” of two images?

We can represent the style of an image by calculating the correlations between the channels of its activations at certain layers of the pre-trained VGG. We calculate this across multiple layers so that we can capture texture information at multiple scales. Since higher layers of this network have larger receptive fields (they receive input from a larger area of the image) and combine more parameters, we notice larger and more complex “strokes” in the stylized image when we choose higher layers for capturing style.

Let $G[i]$ be an $n_C[i] \times n_C[i]$ matrix

  • $i$: index of the chosen layer in the VGG network
  • $n_C[i]$: number of filters (channels) in convolutional layer $i$ of the VGG network

Let each entry $G_{c,c'}[i]$ represent the correlation between a pair of channels $(c, c')$ of a single image’s activation map at layer $i$. We compute one such matrix for S and one for G, and compare them.

  • $G_{c,c'}[i] = \sum_{x=1}^{n_H[i]}\sum_{y=1}^{n_W[i]} a_{xyc}[i]\,a_{xyc'}[i]$. This is called the Gram matrix.
  • $a_{xyc}[i]$: the (x, y, c) entry of the image’s activation map at VGG layer i
  • $a_{xyc'}[i]$: the (x, y, c') entry of the same activation map

Style loss = $J_{style}(S, G) = \sum_{i}\frac{1}{(2n_H[i]n_W[i]n_C[i])^2}||G_S[i] - G_G[i]||^2$

  • $n_H[i], n_W[i], n_C[i]$: height, width, and number of channels of S’s or G’s activation map at VGG layer i (the two maps must have the same shape)
  • $S$: style image
  • $G$: generated image
  • $G_S[i]$: Gram matrix of the style image’s activations at VGG layer i
  • $G_G[i]$: Gram matrix of the generated image’s activations at VGG layer i
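
Continuing the sketch above, the Gram matrices and the style loss might look like this in TensorFlow. The choice of one conv layer per VGG block as the style layers is an assumption for illustration.

```python
# Assumed style layers: one convolutional layer per VGG-19 block.
STYLE_LAYERS = ["block1_conv1", "block2_conv1", "block3_conv1",
                "block4_conv1", "block5_conv1"]

style_extractor = tf.keras.Model(
    inputs=vgg.input,
    outputs=[vgg.get_layer(name).output for name in STYLE_LAYERS])

def gram_matrix(activations):
    """G[c, c'] = sum over spatial positions (x, y) of a[x, y, c] * a[x, y, c']."""
    # activations has shape (batch, height, width, channels)
    return tf.einsum("bxyc,bxyd->bcd", activations, activations)

def style_loss(style_image, generated_image):
    """Sum of squared Gram-matrix differences over the chosen style layers."""
    a_s = style_extractor(tf.keras.applications.vgg19.preprocess_input(style_image))
    a_g = style_extractor(tf.keras.applications.vgg19.preprocess_input(generated_image))
    loss = 0.0
    for act_s, act_g in zip(a_s, a_g):
        shape = tf.cast(tf.shape(act_s), tf.float32)
        n_h, n_w, n_c = shape[1], shape[2], shape[3]
        norm = (2.0 * n_h * n_w * n_c) ** 2  # normalization factor from the formula above
        loss += tf.reduce_sum(tf.square(gram_matrix(act_s) - gram_matrix(act_g))) / norm
    return loss
```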

How does this capture style?

  • In some convolutional layer used for style loss in the VGG, each filter responds differently to different features in the input image. For example, one filter may be excited by the color green, and another may be excited by hairy lines. The more that they are jointly excited, the higher their correlation entry in the Gram matrix, and the greater the presence of some higher-level texture (ex. green grass). If S and G excite the same filters, implying the presence of the same higher-level textures, they will have similar entries in their Gram matrices.
  • I stumbled upon a more comprehensive explanation for why the Gram matrix works to capture style in the paper “Demystifying Neural Style Transfer” (Li et al., 2017). But that is out of scope for this post.

3. The combined content and style loss between two images

$J(C, S, G) = \alpha J_{content}(C, G) + \beta J_{style}(S, G)$

  • $C$: content image
  • $S$: style image
  • $G$: generated image
  • $\alpha $: content loss weight
  • $\beta $: style loss weight
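
Putting the two losses together is a one-liner. The particular values of $\alpha$ and $\beta$ below are assumptions; the right ratio depends on the images and layers chosen.

```python
ALPHA = 1e4  # content loss weight (assumed value)
BETA = 1e-2  # style loss weight (assumed value)

def total_loss(content_image, style_image, generated_image):
    return (ALPHA * content_loss(content_image, generated_image)
            + BETA * style_loss(style_image, generated_image))
```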

4. How can we learn to generate an image that minimizes the content and style loss?

An extremely simple way to do this is to begin with a random noise image and progressively update its pixels using backpropagation, following the gradients of the loss function with respect to those pixels. Other than the pixels of that image, no parameters are learned in this method.
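
A minimal sketch of this optimization loop, building on the functions above (the learning rate, image size, and iteration count are assumptions):

```python
# The generated image's pixels are the only "parameters" being optimized.
generated = tf.Variable(tf.random.uniform((1, 512, 512, 3), 0.0, 255.0))
optimizer = tf.keras.optimizers.Adam(learning_rate=5.0)

@tf.function
def step(content_image, style_image):
    with tf.GradientTape() as tape:
        loss = total_loss(content_image, style_image, generated)
    # Gradients of the loss with respect to the pixels, not any network weights.
    grads = tape.gradient(loss, generated)
    optimizer.apply_gradients([(grads, generated)])
    generated.assign(tf.clip_by_value(generated, 0.0, 255.0))  # keep pixels valid
    return loss

# for _ in range(1000):
#     step(content_image, style_image)
```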

Faster style transfer using an image transformation network

One drawback of the basic style transfer method described above is that it takes a long time to stylize a single image (~2 minutes on my GPU for 1024x768 images).

We can use a different algorithm that performs style transfer fast enough to run in real time. We do this by training a separate image stylization network that is responsible for stylizing a given content image in the style of a fixed style image.

This section will explore the fast style transfer algorithm proposed in “Perceptual Losses for Real-Time Style Transfer and Super-Resolution” (Johnson et al., 2016), and its differences from the basic style transfer algorithm.

From the paper "Perceptual Losses for Real-Time Style Transfer and Super-Resolution" (Johnson et al., 2016)

1. What is the architecture of this image stylization network?

  • A deep residual convolutional neural network with encoder-decoder structure
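
A rough sketch of such a network in Keras follows. The filter counts, kernel sizes, and number of residual blocks here are assumptions for illustration; the architecture in Johnson et al. also includes normalization layers and a scaled tanh output, which are omitted here for brevity.

```python
def residual_block(x, filters=128):
    """Two 3x3 convolutions with a skip connection."""
    y = tf.keras.layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = tf.keras.layers.Conv2D(filters, 3, padding="same")(y)
    return tf.keras.layers.Add()([x, y])

def build_transform_net():
    inputs = tf.keras.Input(shape=(None, None, 3))
    # Encoder: strided convolutions downsample the image.
    x = tf.keras.layers.Conv2D(32, 9, padding="same", activation="relu")(inputs)
    x = tf.keras.layers.Conv2D(64, 3, strides=2, padding="same", activation="relu")(x)
    x = tf.keras.layers.Conv2D(128, 3, strides=2, padding="same", activation="relu")(x)
    # Residual blocks transform features at the reduced resolution.
    for _ in range(5):
        x = residual_block(x, 128)
    # Decoder: transposed convolutions upsample back to the original size.
    x = tf.keras.layers.Conv2DTranspose(64, 3, strides=2, padding="same", activation="relu")(x)
    x = tf.keras.layers.Conv2DTranspose(32, 3, strides=2, padding="same", activation="relu")(x)
    # Sigmoid output in [0, 1]; scale to [0, 255] before computing the VGG losses.
    outputs = tf.keras.layers.Conv2D(3, 9, padding="same", activation="sigmoid")(x)
    return tf.keras.Model(inputs, outputs)
```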

2. How to train this image transformation network for a fixed style image S?

  • Acquire an image dataset (ex. ImageNet, MS-COCO)
  • Input a batch of images into the image stylization network
  • Calculate the content loss between the output (stylized) images and the input images
  • Calculate the style loss between the output images and the fixed style image S (you can precalculate the Gram matrices of S)
  • Calculate the gradients of the composite loss with respect to the weights of the stylization network, and use them to update the weights
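
A sketch of one training step, reusing the loss functions and weights defined earlier (the optimizer settings are assumptions, and in practice you would precompute the style image's Gram matrices once rather than recomputing them every step as done here):

```python
transform_net = build_transform_net()
net_optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)  # assumed settings

@tf.function
def train_step(content_batch, style_image):
    with tf.GradientTape() as tape:
        # Scale the network's [0, 1] output to the [0, 255] range the losses expect.
        generated_batch = 255.0 * transform_net(content_batch)
        # Content loss against the inputs, style loss against the fixed style image.
        loss = (ALPHA * content_loss(content_batch, generated_batch)
                + BETA * style_loss(style_image, generated_batch))
    # Here the gradients update the stylization network's weights, not the pixels.
    grads = tape.gradient(loss, transform_net.trainable_variables)
    net_optimizer.apply_gradients(zip(grads, transform_net.trainable_variables))
    return loss
```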

3. Results

Stylizing the same input image takes 0.1 seconds with this method, versus ~2 minutes with the Gatys method, on my GPU. Of course, we can achieve lower transformation times for both methods by reducing the input image resolution.

Acknowledgments

  • “A Neural Algorithm of Artistic Style” (Gatys et al., 2015)
  • “Perceptual Losses for Real-Time Style Transfer and Super-Resolution” (Johnson et al., 2016)