Mean-squared error (MSE) is commonly used as the loss function for regression problems. What justifies its use? What are we actually doing when we minimize the MSE? We will address these questions from a few different perspectives.

$MSE = \frac{1}{2N}\sum_{i=1}^{N}(y_{i} - \hat{y_{i}})^2$
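For reference, here is what that cost looks like in code (a minimal NumPy sketch using the same $\frac{1}{2N}$ convention; the extra factor of 1/2 is just a convenience that simplifies the derivative later and doesn't change where the minimum is):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean-squared error with the 1/(2N) convention used in this post."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sum((y_true - y_pred) ** 2) / (2 * len(y_true))

print(mse([1.0, 2.0, 3.0], [1.5, 1.5, 3.0]))  # (0.25 + 0.25 + 0.0) / 6 = 0.0833...
```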

A maximum likelihood perspective

Givens and assumptions:

  • $X$: set of inputs {$x_{i}$} for all $i$
  • $Y$: set of observed outputs {$y_{i}$} for all $i$
  • $y_{i} = h(x_{i}) + e_{i}$, where $h$ is some hypothesis function and $e_{i}$ is normally distributed noise

Aside: Why can we assume that the noise is normally distributed? We model it as the sum of many independent random effects, and by the central limit theorem the sum of many independent random variables is approximately normally distributed.
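A quick way to see the central limit theorem in action (a throwaway sketch; the uniform noise components and the sample sizes are arbitrary choices): sums of many independent uniform variables end up with roughly the zero skewness and zero excess kurtosis of a Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)

# Each "noise" value is the sum of 100 independent uniform variables on [-1, 1].
components = rng.uniform(-1, 1, size=(100_000, 100))
noise = components.sum(axis=1)

# A normal distribution has skewness 0 and excess kurtosis 0; these sums come close.
centered = noise - noise.mean()
skew = np.mean(centered ** 3) / noise.std() ** 3
excess_kurtosis = np.mean(centered ** 4) / noise.std() ** 4 - 3
print(f"skewness: {skew:.3f}, excess kurtosis: {excess_kurtosis:.3f}")
```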

Our goal is to find the hypothesis $h$ that is most probable given the set of observations $Y$. We can formulate this as a maximum a posteriori problem, which under the assumptions below reduces to maximum likelihood estimation.

$h^* = \underset{h}{\arg\max}P(h | Y)$

We will show that the hypothesis that maximizes the above posterior is one that minimizes the sum of squared errors.

  • $h^* = \underset{h}{\arg\max}\frac{P(Y | h)P(h)}{P(Y)}$
    • Because Bayes' Theorem states that P(h | Y) = $\frac{P(Y | h)P(h)}{P(Y)}$
  • $h^* = \underset{h}{\arg\max}P(Y | h)$
    • Because P(Y) is constant, and P(h) is the same for all h (in other words, P(h) is uniform), they do not affect argmax so we can drop them
  • $h^* = \underset{h}{\arg\max}\prod_{i=1}^{N} P(y_{i} | h)$
    • Assuming that the training examples are mutually independent
  • $h^* = \underset{h}{\arg\max}\prod_{i=1}^{N} P(e_{i} = y_{i} - h(x_{i}))$
    • Because $y_{i} = h(x_{i}) + e_{i}$, observing $y_{i}$ under hypothesis $h$ means the noise took the value $y_{i} - h(x_{i})$
  • $h^* = \underset{h}{\arg\max}\prod_{i=1}^{N}\frac{1}{\sqrt{2\pi\sigma^2}}e^{\frac{-(y_{i} - h(x_{i}))^2}{2\sigma^2}}$
  • $h^* = \underset{h}{\arg\max}(log(\prod_{i=1}^{N}\frac{1}{\sqrt{2\pi\sigma^2}}e^{\frac{-(y_{i} - h(x_{i}))^2}{2\sigma^2}}))$
    • The log function is monotonically increasing, so applying it doesn't change the argmax
  • $h^* = \underset{h}{\arg\max}\sum_{i=1}^{N}log(\frac{1}{\sqrt{2\pi\sigma^2}}e^{\frac{-(y_{i} - h(x_{i}))^2}{2\sigma^2}})$
    • Because $log_{b}(MN) = log_{b}M + log_{b}N$, the log of the product becomes a sum of logs
  • $h^* = \underset{h}{\arg\max}\sum_{i=1}^{N}(log(\frac{1}{\sqrt{2\pi\sigma^2}}) + \frac{-(y_{i} - h(x_{i}))^2}{2\sigma^2})$
    • Because $log(e^{z}) = z$ (taking the natural log)
  • $h^* = \underset{h}{\arg\max}\sum_{i=1}^{N}\frac{-(y_{i} - h(x_{i}))^2}{2\sigma^2}$
    • The $log(\frac{1}{\sqrt{2\pi\sigma^2}})$ terms sum to a constant that does not depend on $h$, so we can drop them
  • $h^* = \underset{h}{\arg\max}\sum_{i=1}^{N}-(y_{i} - h(x_{i}))^2$
    • $\frac{1}{2\sigma^2}$ is a positive constant, so it doesn't affect the argmax and we can drop it
  • $h^* = \underset{h}{\arg\min}\sum_{i=1}^{N}(y_{i} - h(x_{i}))^2$

This shows that from a maximum likelihood perspective, the “best” hypothesis for the data is the one that minimizes the least squares function, if the assumptions listed above are true.
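A small numerical illustration of this equivalence (a sketch with made-up data: a one-parameter model $h(x) = wx$, Gaussian noise with a known $\sigma = 1$, and a grid of candidate slopes): the slope that maximizes the Gaussian log-likelihood is exactly the one that minimizes the sum of squared errors, since the log-likelihood is a constant minus the scaled sum of squares.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 2x + Gaussian noise, so the "true" hypothesis has slope 2.
x = rng.uniform(-5, 5, size=200)
y = 2.0 * x + rng.normal(0.0, 1.0, size=200)

sigma = 1.0
candidates = np.linspace(0.0, 4.0, 2001)  # candidate slopes w for h(x) = w * x

residuals = y[None, :] - candidates[:, None] * x[None, :]   # shape (2001, 200)
sse = np.sum(residuals ** 2, axis=1)
log_likelihood = np.sum(
    -0.5 * np.log(2 * np.pi * sigma ** 2) - residuals ** 2 / (2 * sigma ** 2),
    axis=1,
)

# Both criteria pick the same slope.
print(candidates[np.argmax(log_likelihood)], candidates[np.argmin(sse)])
```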

A linear algebra perspective

What is the connection between the solution $x$ of $Ax = b$ and MSE?

Consider the equation $Ax = b$

  • $A$: given matrix representing input values
  • $b$: given vector representing the observed output values
  • $x$: the unknown weights of the linear regression model that maps the input values $A$ to best fit the observed outputs $b$. Our goal is to solve for this vector.

A common problem with trying to solve for $x$ in this equation is that if $b$ is not in the column space of $A$ (in other words, no combination of $A$'s columns can produce $b$), there is no exact solution for $x$. However, we can still find the value of $x$ that brings $Ax$ as close to $b$ as possible. To do that, let's reformulate our problem.

Let $p$ be the vector that is in the column space of $A$, and that is as close to $b$ as possible. In other words, let $p$ be the projection of $b$ onto the column space of $A$.

[Figure: the column space of $A$, drawn as a plane (not representative of its true dimensionality); $p$ is the vector in that space closest to $b$, and $e$ is the gap between $b$ and $p$ that we want to minimize.]

New equation to solve: $Ax = p$

  • $e = b - p$
  • $e = b - Ax$
  • $A^Te = 0$
    • Because $e$ is orthogonal to every column of $A$: since $p$ is the orthogonal projection of $b$ onto the column space, the error $e = b - p$ is perpendicular to that space
  • $A^T(b - Ax) = 0$
  • $A^Tb - A^TAx = 0$
  • $x = (A^TA)^{-1}A^Tb$
    • Assuming $A^TA$ is invertible (true when the columns of $A$ are linearly independent); we multiply both sides by $(A^TA)^{-1}$ rather than "dividing" by a matrix

This closed-form solution for $x$ minimizes the length of the error vector $e = b - Ax$, which is the same as minimizing the squared error $\|e\|^2$. In the next section, we will show that this solution is equivalent to one found by directly minimizing the least squares cost function.
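In NumPy, the normal-equations solution looks like the following (a minimal sketch with random data; in practice you would lean on np.linalg.lstsq or np.linalg.solve rather than forming an explicit inverse):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 3))          # 50 observations, 3 features
b = A @ np.array([1.0, -2.0, 0.5]) + rng.normal(0.0, 0.1, size=50)

# Normal equations: A^T A x = A^T b  (assumes A has linearly independent columns)
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# NumPy's least-squares solver finds the same x
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

print(np.allclose(x_normal, x_lstsq))  # True
```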

A matrix calculus perspective

We can also show through calculus that finding the $x$ that minimizes the least squares cost function leads to the same $x = (A^TA)^{-1}A^Tb$ that we just found above.

$J(x) = \frac{1}{2N}\sum_{i=1}^{N}(a_{i}x - b_{i})^2$

  • $i$ represents the $i^{th}$ row
  • $x$ is a column vector; values are unknown
  • $b$ is a column vector; values are given
  • $a_{i}$ represents the $i^{th}$ given datapoint as a row

Minimizing $J(x)$ is equivalent to minimizing the following:

  • $J(x) = (Ax - b)^T(Ax - b)$
    • We can ignore the constant $\frac{1}{2N}$ when minimizing
  • $J(x) = ((Ax)^T - b^T)(Ax - b)$
  • $J(x) = (Ax)^T(Ax) - (Ax)^Tb - b^T(Ax) + b^Tb$
    • Because $u^Tv = v^Tu$ for any two vectors (both give the same scalar), $- (Ax)^Tb - b^T(Ax) = - 2(Ax)^Tb$
  • $J(x) = (Ax)^T(Ax) - 2(Ax)^Tb + b^Tb$
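As a sanity check on this expansion (a throwaway numerical sketch with random values), the compact form and the expanded form evaluate to the same scalar:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 3))
x = rng.normal(size=3)
b = rng.normal(size=6)

compact = (A @ x - b) @ (A @ x - b)                      # (Ax - b)^T (Ax - b)
expanded = (A @ x) @ (A @ x) - 2 * (A @ x) @ b + b @ b   # (Ax)^T(Ax) - 2(Ax)^T b + b^T b
print(np.isclose(compact, expanded))  # True
```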

Now we take the derivative of the above expression with respect to $x$ and set it to 0. We will be using numerator layout notation for the rest of this section (to me, it slightly simplifies applying the chain rule).

  • $\frac{\partial{J(x)}}{\partial{x}} = 0 = \frac{\partial}{\partial{x}}((Ax)^T(Ax) - 2(Ax)^Tb + b^Tb)$

Recall how to get derivatives with vectors:

  • The derivative of a scalar $a$ with respect to a vector $b$ of dimension n x 1 is: $\frac{\partial{a}}{\partial{b}} = [\frac{\partial{a}}{\partial{b_1}}, \frac{\partial{a}}{\partial{b_2}}, … \frac{\partial{a}}{\partial{b_n}}]$
  • The derivative of a vector $a$ of dimension m x 1 with respect to a vector $b$ of dimension n x 1 is: $\frac{\partial{a}}{\partial{b}} = \begin{bmatrix}\frac{\partial{a_1}}{\partial{b_1}} & \frac{\partial{a_1}}{\partial{b_2}} & … & \frac{\partial{a_1}}{\partial{b_n}}\\\frac{\partial{a_2}}{\partial{b_1}} & \frac{\partial{a_2}}{\partial{b_2}} & … & \frac{\partial{a_2}}{\partial{b_n}}\\… & … & … & …\\\frac{\partial{a_m}}{\partial{b_1}} & \frac{\partial{a_m}}{\partial{b_2}} & … & \frac{\partial{a_m}}{\partial{b_n}}\end{bmatrix}$

First let’s compute $\frac{\partial{((Ax)^T(Ax))}}{\partial{x}}$

  • Let $y = Ax$
  • $\frac{\partial{(y^{T}y)}}{\partial{y}} = 2y^T$
  • $\frac{\partial{y}}{\partial{x}} = A$
  • $\frac{\partial{(y^{T}y)}}{\partial{x}} = \frac{\partial{(y^{T}y)}}{\partial{y}}\frac{\partial{y}}{\partial{x}} = 2y^TA$
    • Because of the chain rule
  • $\frac{\partial{((Ax)^T(Ax))}}{\partial{x}} = 2x^{T}A^{T}A$
    • Because $y = Ax$
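A quick finite-difference check of this result (a sketch with random values; in numerator layout the derivative is the 1 x n row vector $2x^{T}A^{T}A$, so we compare it component by component against numerical derivatives):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))
x = rng.normal(size=3)

def f(v):
    return (A @ v) @ (A @ v)   # (Av)^T (Av), a scalar

# Analytic derivative in numerator layout: the row vector 2 x^T A^T A
analytic = 2 * x @ A.T @ A

# Central finite differences, one coordinate at a time
eps = 1e-6
numeric = np.array([
    (f(x + eps * e) - f(x - eps * e)) / (2 * eps)
    for e in np.eye(3)
])

print(np.allclose(analytic, numeric))  # True
```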

Next let’s compute $\frac{\partial(-2(Ax)^Tb)}{\partial{x}}$

  • Let $y = Ax$
  • $\frac{\partial{(-2y^Tb)}}{\partial{y}} = -2b^T$
  • $\frac{\partial{y}}{\partial{x}} = A$
  • $\frac{\partial{(-2y^Tb)}}{\partial{x}} = \frac{\partial{(-2y^Tb)}}{\partial{y}}\frac{\partial{y}}{\partial{x}} = -2b^TA$
    • Because of the chain rule

Finally, we have $\frac{\partial{J(x)}}{\partial{x}} = 0 = 2x^{T}A^{T}A - 2b^TA$

  • $x^{T}A^{T}A = b^TA$
  • Now transpose both sides
  • $A^TAx = A^Tb$
  • $x = (A^TA)^{-1}A^Tb$ (again assuming $A^TA$ is invertible)

We can verify that this is a minimum by checking the second derivative of $J(x)$

  • $\frac{\partial}{\partial{x}}(\frac{\partial{J(x)}}{\partial{x}}) = \frac{\partial(2x^{T}A^{T}A - 2b^TA)}{\partial{x}} = 2A^TA$
  • $2A^TA$ is positive semidefinite (and positive definite when $A$'s columns are linearly independent), so $J(x)$ is convex and the $x$ we found earlier is a minimum of $J(x)$

This calculus-derived solution for the $x$ that minimizes the squared error is the same as the one we got from finding the best $x$ in the linear equation $Ax = b$.
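One more way to see the two perspectives agree (a minimal sketch; the step size and iteration count are arbitrary choices that happen to converge for this data): plain gradient descent using the gradient we just derived, written as the column vector $2A^TAx - 2A^Tb$, lands on the same $x$ as the normal equations.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(40, 3))
b = A @ np.array([3.0, -1.0, 0.5]) + rng.normal(0.0, 0.1, size=40)

# Closed-form solution from the normal equations
x_closed = np.linalg.solve(A.T @ A, A.T @ b)

# Gradient descent on J(x) = (Ax - b)^T (Ax - b), using gradient 2 A^T A x - 2 A^T b
x = np.zeros(3)
lr = 0.001
for _ in range(20_000):
    x -= lr * (2 * A.T @ A @ x - 2 * A.T @ b)

print(np.allclose(x, x_closed, atol=1e-6))  # True
```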

More properties of MSE to note

  • All squared errors are non-negative, so errors in opposite directions cannot cancel each other out
  • It is differentiable everywhere (unlike the absolute value function)
  • It heavily penalizes outliers: an error of 10 contributes 100 to the sum of squares, but only 10 to a sum of absolute errors (see the small comparison below)
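A toy comparison of the two penalties on the same residuals:

```python
errors = [1, 1, 1, 10]                  # three small errors and one outlier
print(sum(e ** 2 for e in errors))      # 103 -> the outlier contributes 100 of it
print(sum(abs(e) for e in errors))      # 13  -> the outlier contributes only 10
```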