Foundations of computer programming in MATLAB — Statistics
MATLAB has a number of useful statistical functions, especially in the Statistics Toolbox. We have already seen the use of hist and histogram for plotting histograms, but this only scratches the surface. Here we highlight a few functions you will find especially useful.
Basic statistical functions
Mean
It is easy to compute the mean of an array with the mean function. For instance, if we have a column or row vector, we can take its mean as follows.
x = randn(50,1);
m_x = mean(x);
Here m_x is the mean of the 50-by-1 array x. The mean function has more versatility, however. Say we have a two-dimensional or multidimensional array x, and we would like to take the mean along a particular dimension. We can use the second argument of mean as follows.
x = randn(10,3);
m_x = mean(x,1);
This gives us the mean of each column of x. Try mean(x,2) to see what that gives.
If the array over which the mean is computed has a not-a-number value NaN, mean will return NaN. Sometimes we wish to simply ignore the NaN values and still compute the mean. The function nanmean can be used in these cases. Try out the following.
x = randn(50,1);
x(30:50) = NaN; % overwrite the last 21 values with NaN
m_x = mean(x,1); % returns NaN because x contains NaN values
m_x_nan = nanmean(x,1); % ignores the NaN values
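Newer versions of MATLAB also support an omitnan flag for mean itself, which avoids the toolbox function nanmean. Here is a minimal sketch, assuming a release that accepts this flag.
x = randn(50,1);
x(30:50) = NaN;
m_x_omit = mean(x,1,'omitnan'); % ignores the NaN values, like nanmean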
Variance and standard deviation
Let us recall the definition of the unbiased sample variance estimate, which gives an estimate of the population variance $\sigma^2$ given a finite set of $N$ measurements $x_i$ with mean $m$: \(s_{N-1}^2 = \frac{1}{N-1}\sum_{i=1}^{N} \left(x_i - m\right)^2.\)
The MATLAB function var gives this estimate by default. This is the recommended method, but the optional second argument of var can be used to return a biased estimate, which is normalized by N instead of N-1. Here is an example.
x = randn(50,1);
var_x = var(x);
Here var_x is the unbiased estimate of the variance of the population sampled in x. This function also has the niceties of mean for handling multidimensional arrays. For instance, see the following example.
x = randn(10,3);
var_x = var(x,0,1);
This gives an estimate of the variance from each column of x. The second argument of var is the normalization “weight,” which we leave at its default value of 0 for an unbiased estimate. The third argument gives the dimension along which the variance is estimated.
If the data has NaN values that we would like to ignore, we can use the omitnan flag to omit them.
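For example, assuming a release in which var accepts the omitnan flag:
x = randn(10,3);
x(4,2) = NaN; % corrupt one entry
var_x = var(x,0,1,'omitnan'); % ignores the NaN when estimating column 2's variance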
Exercise
Estimate the mean and standard deviation of a sinusoidal wave of amplitude 5 added to Gaussian noise of standard deviation 3. Use 100 samples. Try again using 10^3 and 10^5 samples.
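One possible setup is sketched below; the time base and signal frequency here are arbitrary choices.
n = 100; % also try 1e3 and 1e5
t = linspace(0,10,n)'; % time vector (arbitrary duration)
y = 5*sin(2*pi*t) + 3*randn(n,1); % amplitude-5 sinusoid plus noise with standard deviation 3
m_y = mean(y); % estimate of the mean
s_y = std(y); % estimate of the standard deviation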
Linear least-squares regression
We often find ourselves measuring a quantity $\overline{y}$ that can be approximately described by a mathematical function $y$ of the independent variable $x$. Usually $y$ depends on one or more parameters we would like to estimate from our measurement data. If those parameters enter $y$ linearly, we can use linear least-squares regression.
First, let’s see what it means for parameters $a_1$, $a_2$, …, $a_n$ to enter the function linearly. It means, given $n$ functions $f_1(x)$, $f_2(x)$, …, $f_n(x)$, that the function $y$ can be written as \(y(x) = a_1 f_1(x) + a_2 f_2(x) + \ldots + a_n f_n(x).\) Note that $y$ can still be a nonlinear function of $x$; for instance, $y(x) = a_1 + a_2 \sin(x)$ is nonlinear in $x$ but linear in the parameters $a_1$ and $a_2$. The key is that the parameters must enter linearly.
Second, let’s say we have a system with $n = 3$ parameters; specifically, let’s choose it to be a quadratic function with $f_1(x) = 1$, $f_2(x) = x$, and $f_3(x) = x^2$. Let’s say we take $m = 3$ measurements of $y$ corresponding to measurements of the independent variable $x$. We will denote these measurements as $(\overline{x}_1,\overline{y}_1)$, $(\overline{x}_2, \overline{y}_2)$, and $(\overline{x}_3, \overline{y}_3)$. Then we have the following system of algebraic equations:
\begin{align} \overline{y}_1 &= a_1 + a_2 \overline{x}_1 + a_3 \overline{x}_1^2 \\ \overline{y}_2 &= a_1 + a_2 \overline{x}_2 + a_3 \overline{x}_2^2 \\ \overline{y}_3 &= a_1 + a_2 \overline{x}_3 + a_3 \overline{x}_3^2. \end{align}
The “unknowns” in this system are $a_1$, $a_2$, and $a_3$. Therefore, we have $3$ equations and $3$ unknowns! We can solve this using linear algebra by creating the following matrices. \begin{align} \begin{bmatrix} 1 & \overline{x}_1 & \overline{x}_1^2 \\ 1 & \overline{x}_2 & \overline{x}_2^2 \\ 1 & \overline{x}_3 & \overline{x}_3^2 \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \\ a_3 \end{bmatrix} = \begin{bmatrix} \overline{y}_1 \\ \overline{y}_2 \\ \overline{y}_3 \end{bmatrix}. \end{align} Assigning the obvious variables, we have \(\overline{\textbf{X}} \hat{\textbf{a}} = \overline{\textbf{y}}.\) Assuming $\overline{\textbf{X}}$ is invertible, we have the solution \(\hat{\textbf{a}} = \overline{\textbf{X}}^{\,-1} \overline{\textbf{y}}.\) This is our estimate of the parameters $a_1$, $a_2$, and $a_3$.
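In MATLAB, this solution might be computed as in the following sketch, where the measurement values are made up for illustration.
x_bar = [0.1; 0.5; 0.9]; % hypothetical measurements of x
y_bar = [1.2; 2.3; 4.1]; % hypothetical measurements of y
X = [ones(3,1), x_bar, x_bar.^2]; % basis functions evaluated at the measurements
a_hat = X\y_bar; % solves X*a_hat = y_bar (preferable to inv(X)*y_bar)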
But what if we take more measurements $(\overline{x}_i,\overline{y}_i)$? With $m = 8$ measurements, for instance, $\overline{\textbf{X}}$ is an $8 \times 3$ (i.e. $m \times n$) matrix, which cannot be inverted! Enter the Moore-Penrose pseudoinverse of $\overline{\textbf{X}}$, written $\overline{\textbf{X}}^{\dagger}$. It can be used to give an estimate of $a_1$, $a_2$, … $a_n$ that minimizes the expression \(\left\lVert \overline{\textbf{X}}\hat{\textbf{a}} - \overline{\textbf{y}}\right\rVert_2.\) The pseudoinverse is defined as \(\overline{\textbf{X}}^{\dagger} = \left(\overline{\textbf{X}}^{*}\overline{\textbf{X}}\right)^{-1}\overline{\textbf{X}}^{*},\) where $*$ denotes the conjugate transpose. This exists even when $\overline{\textbf{X}}$ isn’t square, provided its columns are linearly independent, and gives the least-squares estimate \(\hat{\textbf{a}} = \overline{\textbf{X}}^{\dagger} \overline{\textbf{y}}.\)
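MATLAB provides the pseudoinverse as pinv, and the backslash operator returns the same least-squares estimate. A minimal sketch with synthetic quadratic data:
rng(0); % seed the random number generator for reproducibility
m = 8; % number of measurements
x_bar = linspace(0,1,m)'; % measurement locations
y_bar = 1 + 2*x_bar + 3*x_bar.^2 + 0.1*randn(m,1); % noisy quadratic measurements
X = [ones(m,1), x_bar, x_bar.^2]; % 8-by-3 matrix of basis functions
a_hat = pinv(X)*y_bar; % least-squares estimate via the pseudoinverse
a_hat_bs = X\y_bar; % equivalent (and usually preferred) backslash solution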
Exercise
Download this data, unzip it, and load it into MATLAB. Write a function that can do an n-th order polynomial fit and apply it to the data. Fifth order should look pretty good.
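Such a function might be structured as in the following sketch; the name poly_ls_fit and its interface are our own choices (MATLAB’s built-in polyfit performs a similar fit), and the implicit expansion in x.^(0:n) assumes R2016b or later. Save it in a file named poly_ls_fit.m.
function a_hat = poly_ls_fit(x,y,n)
% POLY_LS_FIT Least-squares fit of an n-th order polynomial.
% a_hat(k) is the estimated coefficient of x^(k-1).
x = x(:); y = y(:); % force column vectors
X = x.^(0:n); % m-by-(n+1) matrix with columns 1, x, x.^2, ..., x.^n
a_hat = X\y; % least-squares solution
end
It can be exercised on synthetic data as follows.
x = linspace(-1,1,50)';
y = sin(3*x) + 0.05*randn(50,1); % data for the polynomial to approximate
a_hat = poly_ls_fit(x,y,5); % fifth-order fit
y_fit = (x.^(0:5))*a_hat; % evaluate the fitted polynomial
plot(x,y,'.',x,y_fit,'-')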