Foundations of computer programming in MATLAB — Statistics

MATLAB has a number of useful statistical functions, especially in the Statistics and Machine Learning Toolbox. We have already seen hist and histogram for plotting histograms, but that is only scratching the surface. We will highlight a few functions you will find especially useful.

Basic statistical functions

Mean

It is easy to compute the mean of an array with the mean function. For instance, if we have a column or row vector, we can compute its mean as follows.

x = randn(50,1);
m_x = mean(x);

Here m_x is the mean of the 50-by-1 array x. The mean function is more versatile than this, however. Say we have a two- or multi-dimensional array x, and we would like to take the mean along a particular dimension. We can specify the dimension with the second argument of mean, as follows.

x = randn(10,3);
m_x = mean(x,1);

This gives us the mean of each column of x. Try mean(x,2) to see what that gives.

If the array over which the mean is computed has a not-a-number value NaN, mean will return NaN. Sometimes we wish simply to ignore the NaN values and still compute the mean. The function nanmean, from the Statistics and Machine Learning Toolbox, can be used in these cases. Try out the following.

x = randn(50,1);
x(30:50) = NaN;
m_x = mean(x,1);
m_x_nan = nanmean(x,1);

Variance and standard deviation

Let us recall the definition of the unbiased sample variance, which gives an estimate of the population variance $\sigma^2$ from a finite set of $N$ measurements $x_i$ with mean $m$: \(s_{N-1}^2 = \frac{1}{N-1}\sum_{i=1}^{N} \left(x_i - m\right)^2.\)

The MATLAB function var gives this unbiased estimate by default, which is usually what you want, but the optional second argument of var can be used to return a biased estimate, which is normalized by N instead of N-1. Here is an example.

x = randn(50,1);
var_x = var(x);

Here var_x is the unbiased estimate of the variance of the population sampled as x. This function offers the same conveniences as mean for handling multi-dimensional arrays. For instance, see the following example.

x = randn(10,3);
var_x = var(x,0,1);

This gives an estimate of the variance of each column of x. The second argument of var is the normalization weight: passing 0 (the default) gives the unbiased estimator, whereas passing 1 normalizes by N instead of N-1. The third argument gives the dimension along which the variance is estimated.
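To connect var back to the formula above, we can check the estimates by hand. A minimal sanity check, with variable names of our own choosing:

x = randn(50,1);
N = numel(x);
m = mean(x);
s2_unbiased = sum((x - m).^2)/(N - 1);   % should match var(x), i.e. var(x,0)
s2_biased = sum((x - m).^2)/N;           % should match var(x,1)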

If the data have NaN values that we would like to ignore, we can pass the 'omitnan' flag to omit them.
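For example, something along these lines should work in recent MATLAB releases; the array and the position of the NaN are arbitrary.

x = randn(10,3);
x(2,1) = NaN;                     % plant a NaN in the first column
m_x = mean(x,1,'omitnan');        % column means, ignoring the NaN
var_x = var(x,0,1,'omitnan');     % unbiased column variances, ignoring the NaN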

Exercise

Estimate the mean and standard deviation of a sinusoidal wave of amplitude 5 added to Gaussian noise of standard deviation 3. Use 100 samples. Try again using 10^3 and 10^5 samples.
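If you want a starting point, a sketch along the following lines should work; the sample times and the frequency of the sinusoid are our own arbitrary choices.

N = 100;                            % try 1e3 and 1e5 as well
t = linspace(0,10,N)';
y = 5*sin(2*pi*t) + 3*randn(N,1);   % amplitude-5 sinusoid plus Gaussian noise, sigma = 3
m_y = mean(y)                       % sample mean
s_y = std(y)                        % sample standard deviation, the square root of var(y)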

Linear least-squares regression

We often find ourselves measuring a quantity $\overline{y}$ that can be approximately described by a mathematical function $y$ of an independent variable $x$. Usually $y$ depends on one or more parameters we would like to estimate from our measurement data. If those parameters enter $y$ linearly, we can use linear least-squares regression.

First, let’s see what it means for parameters $a_1$, $a_2$, …, $a_n$ to enter the function linearly. It means, given $n$ functions $f_1(x)$, $f_2(x)$, …, $f_n(x)$, that the function $y$ can be written as \(y(x) = a_1 f_1(x) + a_2 f_2(x) + \ldots + a_n f_n(x).\) Note that $y$ can still be a nonlinear function of $x$. The key is that the parameters must enter linearly.

Second, let’s say we have a system with $n = 3$ parameters; specifically, let’s choose it to be a quadratic function with $f_1(x) = 1$, $f_2(x) = x$, and $f_3(x) = x^2$. Let’s say we take $m = 3$ measurements of $y$ corresponding to measurements of the independent variable $x$. We will denote these measurements as $(\overline{x}_1,\overline{y}_1)$, $(\overline{x}_2, \overline{y}_2)$, and $(\overline{x}_3, \overline{y}_3)$. Then we have the following system of algebraic equations:

\begin{align} \overline{y}_1 &= a_1 + a_2 \overline{x}_1 + a_3 \overline{x}_1^2 \\ \overline{y}_2 &= a_1 + a_2 \overline{x}_2 + a_3 \overline{x}_2^2 \\ \overline{y}_3 &= a_1 + a_2 \overline{x}_3 + a_3 \overline{x}_3^2. \end{align}

The “unknowns” in this system are $a_1$, $a_2$, and $a_3$. Therefore, we have $3$ equations and $3$ unknowns! We can solve this using linear algebra by writing the system in matrix form. \begin{align} \begin{bmatrix} 1 & \overline{x}_1 & \overline{x}_1^2 \\ 1 & \overline{x}_2 & \overline{x}_2^2 \\ 1 & \overline{x}_3 & \overline{x}_3^2 \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \\ a_3 \end{bmatrix} = \begin{bmatrix} \overline{y}_1 \\ \overline{y}_2 \\ \overline{y}_3 \end{bmatrix}. \end{align} Assigning the obvious variables, we have \(\overline{\textbf{X}} \hat{\textbf{a}} = \overline{\textbf{y}}.\) Assuming $\overline{\textbf{X}}$ is invertible, we have the solution \(\hat{\textbf{a}} = \overline{\textbf{X}}^{\,-1} \overline{\textbf{y}}.\) This is our estimate of the parameters $a_1$, $a_2$, and $a_3$.
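In MATLAB, the backslash operator solves such a square linear system. A hypothetical example with made-up measurement values:

xbar = [0; 1; 2];
ybar = [1.1; 2.9; 7.2];
X = [ones(3,1), xbar, xbar.^2];   % 3-by-3 matrix built from the model functions
a_hat = X\ybar;                   % solves X*a_hat = ybar exactly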

But what if we take more measurements $(\overline{x}_i,\overline{y}_i)$? With $m = 8$ measurements, for instance, $\overline{\textbf{X}}$ is an $8 \times 3$ (i.e. $m \times n$) matrix, which cannot be inverted! Enter the Moore-Penrose pseudoinverse of $\overline{\textbf{X}}$, written $\overline{\textbf{X}}^{\dagger}$. It can be used to give an estimate of $a_1$, $a_2$, …, $a_n$ that minimizes the expression \(\left\lVert \overline{\textbf{X}}\hat{\textbf{a}} - \overline{\textbf{y}}\right\rVert_2.\) The pseudoinverse is defined as \(\overline{\textbf{X}}^{\dagger} = \left(\overline{\textbf{X}}^{*}\overline{\textbf{X}}\right)^{-1}\overline{\textbf{X}}^{*},\) where $*$ denotes the conjugate transpose. This exists even when $\overline{\textbf{X}}$ isn’t square (provided $\overline{\textbf{X}}$ has full column rank), and gives the least-squares estimate \(\hat{\textbf{a}} = \overline{\textbf{X}}^{\dagger} \overline{\textbf{y}}.\)
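In MATLAB, such a fit might look like the following sketch; the data here are synthetic. Note that backslash also returns the least-squares solution for rectangular systems (typically via a QR factorization) and is generally preferred to forming the pseudoinverse explicitly.

m = 8;                                              % number of measurements
xbar = linspace(0,1,m)';
ybar = 2 - 3*xbar + 0.5*xbar.^2 + 0.1*randn(m,1);   % noisy quadratic data
X = [ones(m,1), xbar, xbar.^2];                     % m-by-3 design matrix
a_hat = pinv(X)*ybar;                               % least-squares estimate via the pseudoinverse
a_hat_bs = X\ybar;                                  % same estimate via backslash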

Exercise

Download this data, unzip it, and load it into MATLAB. Write a function that can perform an $n$-th order polynomial fit and apply it to the data. A fifth-order fit should look pretty good.
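One possible shape for such a function, using the design-matrix approach above. The name and signature here are our own invention; MATLAB’s built-in polyfit does essentially the same job.

function a_hat = polyfit_ls(xbar, ybar, n)
% POLYFIT_LS Least-squares fit of an n-th order polynomial.
% Returns coefficients such that y is approximated by
% a_hat(1) + a_hat(2)*x + ... + a_hat(n+1)*x^n.
X = ones(numel(xbar), n+1);   % design matrix, one column per power of x
for k = 1:n
    X(:,k+1) = xbar(:).^k;
end
a_hat = X\ybar(:);            % least-squares solution via backslash
end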