Entropy-based loss functions, such as cross-entropy and Kullback-Liebler divergence, are commonly used in deep learning algorithms. This post is a summary of my understanding of them.
What is information?
Since the definition of entropy involves the concept of information (in information theory, according to Claude Shannon), I found it helpful to better understand what information is, in order to better understand entropy.
Information can be interpreted as a measure of “surprise”. How can we measure information of an event x, in terms of the probability of x?
Consider:
- If we already know that the probability of the event x (ex. that we will be hungry within 12 hours) is 100%, we will not be surprised at all if we are told the event x. That would convey 0 information.
- The lower the probability of event x, the more surprised we will be by knowing it and the more informative that would be. For example, we would generally be more surprised to be informed that there will be a space alien invasion here tomorrow, versus that it will be sunny here tomorrow.
- If we have an event that represents the joint occurrence of multiple independent events, its information is the sum of the individual information of those independent events
A measure of information that satisfies the above criteria is:
$I(x) = \log(\frac{1}{p(x)}) = -\log(p(x))$
- where $p(x)$ is the probability of the event $x$
That $log(n)$ term reminds me of the height of a tree of $n$ nodes. How does this relate to trees? We will find out in an upcoming aside.
What is entropy?
The entropy of a random variable X is simply the expected value of the information of X.
$H(X) = -\underset{x \in X}{\sum}p(x)\log(p(x)) = -E_{p(X)}[\log(p(X))]$
- where $p(x)$ is the probability of the independent event $x$
Aside: an interpretation of entropy and information under certain conditions
Consider a set of independent events $x_{1}$, $x_{2}$, $x_{3}$, $x_{4}$, with probabilities $\frac{1}{4}$, $\frac{1}{8}$, $\frac{1}{8}$ and $\frac{1}{2}$ respectively, and that we randomly draw one of those events. How do we determine which event we have drawn, using a series of questions that ask whether the event is a certain subset of our set or not?
We could first ask if it is $x_{1}$. If it’s not, we ask if it’s $x_{2}$. If it’s not, we ask if it’s $x_{3}$. Finally if it’s not, we know it’s $x_{4}$. On average, we have to ask $(\frac{1}{4})1 + (\frac{1}{8})2 + (\frac{1}{8})3 + (\frac{1}{2})3 = \frac{19}{8}$ questions. Note that the way we ask these questions is important. If we reverse the ordering in which these questions are asked, we will ask $(\frac{1}{2})1 + (\frac{1}{8})2 + (\frac{1}{8})3 + (\frac{1}{4})3 = \frac{15}{8}$ questions on average.
Can we find a more optimal way of asking these questions, to minimize the average number of questions asked? We don’t want to ask as many questions for higher-probability events, so we should ask about those first.
An optimal questioning procedure would be to first ask if it is $x_4$. Then ask if it is $x_1$. Then ask if it is $x_2$. The average number of questions in this case will be $(\frac{1}{2})1 + (\frac{1}{4})2 + (\frac{1}{8})3 + (\frac{1}{8})3 = \frac{14}{8}$ questions. Notice that this is equal to the entropy of X!
So, if we use base 2 for the log terms and probabilities are integer powers of 2: one way to interpret the entropy of a random variable X is as the lowest average number of binary questions (by using the optimal question tree) we need to ask to find out which event it is.
Then, under the same conditions, information of an event x can be interpreted as the number of binary questions we need to ask to find out whether the random variable X = x. In other words, it is the height of the question sub-tree that ends with the question that determines if the random variable is event x. And so 0 information is conveyed from being informed X = x, if 0 questions need to be asked to determine if X = x.
What is cross-entropy?
In summary, cross-entropy is a measure of how strongly the data $p$, supports an alternative conclusion $q$.
In other words, it is the average number of questions that we need to ask to know $x$, if we create the optimal question tree based on the predicted distribution $q(x)$.
$H(p, q) = -\underset{x \in X}{\sum}p(x)\log(q(x)) = -E_{p(X)}[\log(q(X))]$, where $p(x)$ is the true distribution of event $x$, and $q$ is the predicted distribution
- lower cross-entropy means that the data more strongly supports the alternative conclusion $q$
- if $p == q$, then cross-entropy is equal to entropy
- if $p \neq q$, then cross-entropy will be greater than entropy by some number of bits (this is the Kullback-Liebler divergence, which we will get to)
Cross-entropy is commonly used in loss functions for classification problems.
Aside: It works well in terms of providing a good gradient signal when the neural network’s final activation function is sigmoid. To see why:
- Let the network’s final output be $a(z) = sigmoid(z) = \frac{1}{1 + e^{-z}}$. We will refer to this as $a$ for notational simplicity.
- Let the (binary, for simplicity) cross-entropy cost for a single data point be $J(a) = - y\ln(a) - (1 - y)\ln(1 - a)$
- where $a$ is the network’s output and $y$ is the expected output
- $\frac{da}{dz} = \frac{0 - 1(-e^{-z})}{(1 + e^{-z})^2} = \frac{e^{-z}}{(1 + e^{-z})^2} = a(1 - a)$
- $\frac{dJ}{da} = -\frac{y}{a} - \frac{1 - y}{a - 1} = \frac{-y(a - 1)}{a(a - 1)} - \frac{(1 - y)(a)}{a(a - 1)} = \frac{-ya + y - a + ay}{a(a - 1)} = \frac{y - a}{a(a - 1)} = \frac{a - y}{a(1 - a)}$
- $\frac{dJ}{dz} = \frac{dJ}{da}\frac{da}{dz} = \frac{a - y}{a(1 - a)}a(1 - a) = a - y$
- $a - y$ is a good gradient signal because it is always decreases as “error” ($a - y$, the difference between actual and expected values) decreases, and increases as “error” increases. In some other combinations of final activation function and cost function (ex. sigmoid and MSE), we may not get this. Instead, the gradient could decrease (and learning could slow down) even as “error” gets larger.
Kullback-Liebler divergence
The Kullback-Liebler divergence (KL divergence) is also known as relative entropy. It is a measure of how one probability distribution differs from another.
$D_{KL}(P || Q) = \underset{x \in X}{\sum}P(x)\log(\frac{P(x)}{Q(x)}) = E_{p(X)}[-\log(q(X)) + \log(p(X))]$
- Measures the expected number of extra bits required to code samples from $p(x)$, when using a code based on $q(x)$ instead of $p(x)$
- Measures the average amount of “surprise” that you get from obtaining samples from the actual distribution $p$ after having assumed that the distribution is $q$
- This is $crossentropy(P, Q) - entropy(P)$
Note that KL divergence is:
- not symmetrical; $D_{KL}(P || Q) \neq D_{KL}(Q || P)$
- If $P(x_i) = 0$, and $Q(x_i) \neq 0$, then $P(x_i)log(\frac{P(x_i)}{Q(x_i)}) = 0$
- because $0*\log0 => 0$
- If $P(x_i) = 0$, and $Q(x_i) = 0$, then $P(x_i)\log(\frac{P(x_i)}{Q(x_i)}) = 0$
- because $0*\log\frac{0}{0} = 0*(\log0 - \log0) = 0$
- If $P(x_i) \neq 0$, but $Q(x_i) = 0$, then $P(x_i)\log(\frac{P(x_i)}{Q(x_i)}) = \infty$
- This happens when $P$ and $Q$ are disjoint/non-overlapping
The case in which $D_{KL}(Q || P) = \infty$ when $Q(x_i) \neq 0, P(x_i) = 0$ may be desirable if we’re using $D_{KL}(Q || P)$ to find the best $Q$ that matches the true distribution $P$, and we want to be sure that any sample drawn from $Q$ will also be drawable from $P$.
Aside: Minimizing KL divergence and cross-entropy is equivalent to maximizing likelihood
If P is the fixed, true data distribution and Q is the model distribution, then minimizing $D_{KL}(P || Q)$ is equivalent to minimizing the cross-entropy $H(P, Q)$. Why?
- $D_{KL}(P || Q) = \underset{x \in X}{\sum}P(x)\log(\frac{P(x)}{Q(x)}) = - \underset{x \in X}{\sum}P(x)\log(Q(x)) + \underset{x \in X}{\sum}P(x)\log({P(x)})$
- $\underset{x \in X}{\sum}P(x)\log({P(x)})$ has no gradient because P is fixed, so we only need to consider the term $- \underset{x \in X}{\sum}P(x)\log(Q(x))$, which is exactly the cross-entropy $H(P, Q)$, when calculating gradients of $D_{KL}(P || Q)$ for minimization
Minimizing $D_{KL}(P || Q)$ or $H(P, Q)$ is also equivalent to maximizing likelihood. Why?
- Let w the parameters of the model whose distribution is Q
- $w* = \underset{w}{\arg\min}(D_{KL}(P || Q_{w}))$
- $w* = \underset{w}{\arg\min}(-\underset{x \in X}{\sum}P(x)\log(Q(x; w)) + \underset{x \in X}{\sum}P(x)\log({P(x)}))$
- We can drop the term $\underset{x \in X}{\sum}P(x)\log({P(x)}))$ when trying to minimize by finding the optimal w, because it is fixed and independent of w. So we are effectively just trying to minimize $H(P, Q)$
- $w* = \underset{w}{\arg\min}(-\underset{x \in X}{\sum}P(x)\log(Q(x; w)))$
- $w* = \underset{w}{\arg\max}(\underset{x \in X}{\sum}P(x)\log(Q(x; w)))$
Jensen-Shannon divergence
$JSD(P || Q) = \frac{1}{2}D_{KL}(P || \frac{P+Q}{2}) + \frac{1}{2}D_{KL}(Q || \frac{P+Q}{2})$
This is like a symmetric, smoother version of the KL divergence.
- symmetrical; $JSD(P || Q) = JSD(Q || P)$
- smoother than KL divergence. No explosion in value or infinities
- always finite in value (no infinities, as in KL divergence)No such case in which $\frac{P(x_i)+Q(x_i)}{2} = 0$ but $P(x_i) \neq 0$ to create a $\log(\infty)$ term, since all probability values must be > 0