A.2 Maximum likelihood
The maximum likelihood principle is illustrated with an example of a one-dimensional data set $\{x_i\}$, $i = 1,\dots,n$. We assume that the data originate from a Gaussian distribution $p(x)$ with parameters $\mu$ and $\sigma$,

$$ p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) . $$
According to the maximum likelihood principle, we choose the unknown parameters such that the given data are most likely under the resulting distribution. The likelihood $L$ of the given data set is

$$ L(\mu,\sigma) = \prod_{i=1}^{n} p(x_i) = \frac{1}{(\sqrt{2\pi}\,\sigma)^n}\, \exp\!\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2\right) . \qquad (A.4) $$
We want to find the $\mu$ and $\sigma$ that maximize $L$. Maximizing $L$ is equivalent to maximizing $\log L$, which is also called the log-likelihood $\ell$,

$$ \ell(\mu,\sigma) = \log L(\mu,\sigma) = -\,n \log \sigma - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2 + \mathrm{const} . \qquad (A.5) $$
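As an illustration, the following is a minimal Python sketch (NumPy assumed; the data set is hypothetical) that evaluates the log-likelihood (A.5), writing out the additive constant explicitly:

```python
import numpy as np

def log_likelihood(x, mu, sigma):
    """Log-likelihood (A.5) of i.i.d. 1-D data x under a Gaussian with
    mean mu and standard deviation sigma; the additive constant
    -(n/2) log(2 pi) is included explicitly."""
    n = len(x)
    return (-n * np.log(sigma)
            - np.sum((x - mu) ** 2) / (2.0 * sigma ** 2)
            - 0.5 * n * np.log(2.0 * np.pi))

x = np.array([1.2, 0.8, 1.5, 0.9, 1.1])   # hypothetical sample
print(log_likelihood(x, mu=1.0, sigma=0.3))
```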
To find the maximum, we compute the derivatives of the log-likelihood $\ell$ with respect to $\sigma$ and $\mu$ and set them to zero:

$$ \frac{\partial \ell}{\partial \sigma} = -\frac{n}{\sigma} + \frac{1}{\sigma^3}\sum_{i=1}^{n}(x_i-\mu)^2 = 0 , \qquad (A.6) $$

$$ \frac{\partial \ell}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i-\mu) = 0 . \qquad (A.7) $$
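In other words, multiplying (A.7) by $\sigma^2$ and (A.6) by $\sigma^3$, the two conditions state that the residuals sum to zero and that the mean squared residual equals $\sigma^2$:

$$ \sum_{i=1}^{n}(x_i-\mu) = 0 \qquad\text{and}\qquad \sum_{i=1}^{n}(x_i-\mu)^2 = n\sigma^2 . $$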
Thus, we obtain the values $\hat\mu$ and $\hat\sigma$ of the parameters:

$$ \hat\mu = \frac{1}{n}\sum_{i=1}^{n} x_i , \qquad \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i-\hat\mu)^2 . $$
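As a numerical cross-check, here is a minimal sketch (assuming NumPy and SciPy; the data are hypothetical) that maximizes the log-likelihood directly and compares the maximizer with the closed-form estimates above:

```python
import numpy as np
from scipy.optimize import minimize

x = np.array([1.2, 0.8, 1.5, 0.9, 1.1])   # hypothetical sample

def neg_log_likelihood(params):
    """Negative log-likelihood (A.5), dropping the additive constant."""
    mu, sigma = params
    if sigma <= 0.0:
        return np.inf                      # keep the optimizer in the valid range
    n = len(x)
    return n * np.log(sigma) + np.sum((x - mu) ** 2) / (2.0 * sigma ** 2)

res = minimize(neg_log_likelihood, x0=[0.0, 1.0], method="Nelder-Mead")
print("numerical maximizer:", res.x)                    # approx. (mu_hat, sigma_hat)
print("closed form:        ", x.mean(), x.std(ddof=0))  # the (1/n) estimates derived above
```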
The resulting $\hat\sigma^2$ is the variance of the distribution, and $\hat\mu$ is its center. The extremum of $\ell$ is indeed a local maximum, as can be seen by computing the Hesse matrix of $\ell$ and evaluating it at the extreme point $(\hat\mu,\hat\sigma)$:
$$ \frac{\partial^2 \ell}{\partial \sigma^2}\bigg|_{(\hat\mu,\hat\sigma)} = \frac{n}{\hat\sigma^2} - \frac{3}{\hat\sigma^4}\sum_{i=1}^{n}(x_i-\hat\mu)^2 = \frac{n}{\hat\sigma^2} - \frac{3n}{\hat\sigma^2} = -\frac{2n}{\hat\sigma^2} , \qquad (A.11) $$

$$ \frac{\partial^2 \ell}{\partial \sigma\,\partial \mu}\bigg|_{(\hat\mu,\hat\sigma)} = -\frac{2}{\hat\sigma^3}\sum_{i=1}^{n}(x_i-\hat\mu) = 0 , $$

$$ \frac{\partial^2 \ell}{\partial \mu^2}\bigg|_{(\hat\mu,\hat\sigma)} = -\frac{n}{\hat\sigma^2} . $$
It follows that the Hesse matrix at the extremum is negative definite: it is diagonal with the negative entries $-2n/\hat\sigma^2$ and $-n/\hat\sigma^2$. Therefore, the extremum is a local maximum. Moreover, it is also a global maximum. First, for finite parameters, no other extrema exist: $\ell$ is a smooth function, and $(\hat\mu,\hat\sigma)$ is its only stationary point. Second, $L$ is positive for all finite parameter values but approaches zero as the parameters go to infinity. Thus, the maximum must lie at finite parameter values.
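To make the curvature argument concrete, here is a minimal sketch (NumPy assumed; the data are again hypothetical) that evaluates the second derivatives at $(\hat\mu,\hat\sigma)$ and confirms that both eigenvalues of the Hesse matrix are negative:

```python
import numpy as np

x = np.array([1.2, 0.8, 1.5, 0.9, 1.1])   # hypothetical sample
n = len(x)
mu_hat = x.mean()
sigma_hat = np.sqrt(np.mean((x - mu_hat) ** 2))

# Second derivatives of the log-likelihood, evaluated at (mu_hat, sigma_hat)
d2_sigma_sigma = n / sigma_hat**2 - 3.0 * np.sum((x - mu_hat) ** 2) / sigma_hat**4  # -> -2n/sigma^2
d2_sigma_mu = -2.0 * np.sum(x - mu_hat) / sigma_hat**3                              # -> 0
d2_mu_mu = -n / sigma_hat**2

H = np.array([[d2_sigma_sigma, d2_sigma_mu],
              [d2_sigma_mu, d2_mu_mu]])
print(np.linalg.eigvalsh(H))   # both eigenvalues negative -> negative definite
```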
Heiko Hoffmann
2005-03-22