%% Exported from Jupyter Notebook
% Run each section by placing your cursor in it and pressing Ctrl+Enter
%% Markdown Cell:
% One must have it to give a lecture named for it.
% Confidence is used here in the everyday sense, although we do endow it with a mathematical definition to scare business majors, who turn out to be not scared, just indifferent.
% Approximately: if, under some reasonable assumptions (probabilistic model), we estimate the probability of some event to be $P\%$, we say we have $P\%$ confidence in it.
% I mean, business majors are all, "Supply and demand? Let's call that a 'law,'" so I think we're even.
%
% So we're back to computing probability from distributions—probability density functions (PDFs) and probability mass functions (PMFs).
% Usually we care most about estimating the mean of our distribution.
% Recall from the previous lecture that when several samples are taken, each with its own mean, the sample mean is itself a random variable, with a mean of its own, of course.
% [Meanception](http://ricopic.one/resources/inception.gif).
%
% But, more importantly (just kidding—equally so), the mean has a probability distribution of its own.
% The *central limit theorem* has as one of its implications that, as the sample size $N$ gets large, *regardless of the sample distributions*, [*this distribution of means approaches the Gaussian distribution*](http://ricopic.one/resources/mind_blown.gif).
%
% But I always worry I'm being lied to, so let's check.
%% Code Cell[106]:
clear; close all; % clear workspace, close figures
%% Code Cell[107]:
save_figures = true; % save flag
%% Markdown Cell:
% ### Generate some data to test the central limit theorem
%
% Data can be generated by constructing an array using a (seeded for consistency) random number generator.
% Let's use a uniform PDF between `0` and `1`.
%% Code Cell[108]:
N = 150; % sample size (number of measurements per sample)
M = 120; % number of samples
n = N*M; % total number of measurements
mu_pop = 0.5; % because it's a uniform PDF between 0 and 1
rng(11); % seed the random number generator
signal_a = rand(N,M); % uniform PDF
size(signal_a) % check the size
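%% Markdown Cell:
% As a quick sanity check on the generator (a sketch; the theoretical values for a uniform PDF on $[0,1]$ are mean $1/2$ and variance $1/12$):
%% Code Cell:

```matlab
% compare pooled sample statistics to theory for uniform on [0,1]
mean_all = mean(signal_a(:)) % pooled mean; theory: 0.5
var_all = var(signal_a(:)) % pooled variance; theory: 1/12 = 0.0833
```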
%% Markdown Cell:
% Let's take a look at the data by plotting the first ten samples (columns), as shown in \autoref{fig:confidence_raw}.
%% Code Cell[109]:
samples_to_plot = 10;
h = figure;
c = jet(samples_to_plot); % color array
for j=1:samples_to_plot
plot(signal_a(:,j),'o-',...
'Color',c(j,:),...
'MarkerFaceColor',c(j,:),...
'MarkerEdgeColor','none',...
'MarkerSize',3);
hold on;
end
hold off
xlabel('index');
ylabel('measurement');
hgsave(h,'figures/temp');
%% Code Cell[110]:
if save_figures
fn = 'confidence_raw.tex';
h = hgload('figures/temp',struct('Visible','Off'));
cleanfigure;
matlab2tikz(['figures/',fn],...
'figurehandle',h,...
'width','.9\linewidth',...
'height','.33\linewidth',...
'extraAxisOptions',...
[...
'scaled ticks=false,',...
'tick label style={/pgf/number format/fixed}'...
],...
'showInfo', false...
);
end
%% Markdown Cell:
% This is something like what we might see for continuous measurement data.
% Now, the histogram.
%% Code Cell[111]:
samples_to_plot = 10;
h = figure;
c = jet(samples_to_plot); % color array
for j=1:samples_to_plot
histogram(signal_a(:,j),...
30, ... % number of bins
'facecolor',c(j,:),...
'facealpha',.3,...
'normalization','probability'... % for PMF
);
hold on;
end
hold off;
xlim([-.05,1.05])
xlabel('measurement')
ylabel('probability')
hgsave(h,'figures/temp');
%% Markdown Cell:
% This isn't a great plot, but it shows roughly that each sample is fairly uniformly distributed.
%% Code Cell[112]:
if save_figures
fn = 'confidence_hist_1.tex';
h = hgload('figures/temp',struct('Visible','Off'));
cleanfigure;
matlab2tikz(['figures/',fn],...
'figurehandle',h,...
'width','.9\linewidth',...
'height','.33\linewidth',...
'extraAxisOptions',...
[...
'scaled ticks=false,',...
'tick label style={/pgf/number format/fixed}'...
],...
'showInfo', false...
);
end
%% Markdown Cell:
% ### Sample statistics
%% Markdown Cell:
% Now let's check out the sample statistics.
% We want the sample mean and standard deviation of each column.
% Let's use the built-in functions `mean` and `std`.
%% Code Cell[113]:
mu_a = mean(signal_a,1); % mean of each column
s_a = std(signal_a,0,1); % sample std of each column (0: normalize by N-1, along dim 1)
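%% Markdown Cell:
% A caution about `std`: its second argument is a normalization weight, not a dimension.
% `std(X,1)` normalizes by $N$ (population form), while `std(X,0,1)` normalizes by $N-1$ (sample form) along dimension 1.
% A quick sketch of the difference on the first column:
%% Code Cell:

```matlab
x = signal_a(:,1); % first sample (column)
s_pop = sqrt(sum((x - mean(x)).^2)/N); % normalize by N
s_samp = sqrt(sum((x - mean(x)).^2)/(N - 1)); % normalize by N-1
[std(x,1), s_pop] % these two should match
[std(x,0), s_samp] % and these two
```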
%% Markdown Cell:
% Now we can compute the mean statistics, both the mean of the mean $\overline{\overline{X}}$ and the standard deviation of the mean $s_{\overline{X}}$, which we don't strictly need for this part, but we're curious.
% We choose to use the direct estimate instead of the $s_X/\sqrt{N}$ formula, but they should be close.
%% Code Cell[114]:
mu_mu = mean(mu_a)
s_mu = std(mu_a)
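%% Markdown Cell:
% To check the claim that the direct estimate and the $s_X/\sqrt{N}$ formula should be close, we can compare both against the theoretical value $\sigma/\sqrt{N}$, where $\sigma = 1/\sqrt{12}$ for a uniform population on $[0,1]$ (a sketch; agreement won't be exact):
%% Code Cell:

```matlab
s_mu_formula = mean(s_a)/sqrt(N); % s_X/sqrt(N), averaged over samples
s_mu_theory = 1/sqrt(12)/sqrt(N); % sigma/sqrt(N) for uniform on [0,1]
[s_mu, s_mu_formula, s_mu_theory] % all three should be close
```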
%% Markdown Cell:
% ### The truth about sample means
%% Markdown Cell:
% It's the moment of truth.
% Let's look at the distribution.
%% Code Cell[115]:
h = figure;
histogram(mu_a,...
'normalization','probability'... % for PMF
);
hold off;
xlabel('measurement')
ylabel('probability')
hgsave(h,'figures/temp');
%% Markdown Cell:
% This looks like a Gaussian distribution about the mean of means, so I guess the central limit theorem is legit.
%% Code Cell[116]:
if save_figures
fn = 'confidence_hist_2.tex';
h = hgload('figures/temp',struct('Visible','Off'));
cleanfigure;
matlab2tikz(['figures/',fn],...
'figurehandle',h,...
'width','.9\linewidth',...
'height','.33\linewidth',...
'extraAxisOptions',...
[...
'scaled ticks=false,',...
'tick label style={/pgf/number format/fixed}'...
],...
'showInfo', false...
);
end
%% Markdown Cell:
% ### Gaussian and probability
%% Markdown Cell:
% We already know how to compute the probability $P$ that a value of a random variable $X$ lies in a certain interval from a PMF or PDF (the sum or the integral, respectively).
% This means that, for sufficiently large sample size $N$ such that we can assume from the central limit theorem that the sample means $\overline{x_i}$ are normally distributed, *the probability that a sample mean value* $\overline{x_i}$ *is in a certain interval* is given by integrating the Gaussian PDF.
% The Gaussian PDF for random variable $Y$ representing the sample means is
% \begin{align}
% f(y) &= \frac{1}{\sigma\sqrt{2\pi}} \exp\left(\frac{-(y-\mu)^2}{2\sigma^2}\right),
% \end{align}
% where $\mu$ is the population mean and $\sigma$ is the population standard deviation.
%
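%% Markdown Cell:
% As a sanity check, any PDF should integrate to one.
% A quick numeric sketch with `trapz` (the values of $\mu$ and $\sigma$ here are arbitrary):
%% Code Cell:

```matlab
mu = 0.5; sigma = 0.1; % arbitrary example parameters
f = @(y) exp(-(y - mu).^2/(2*sigma^2))/(sigma*sqrt(2*pi)); % Gaussian PDF
y = linspace(mu - 6*sigma,mu + 6*sigma,1e4);
trapz(y,f(y)) % should be very nearly 1
```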
%% Markdown Cell:
% The integral of $f$ over some interval is the probability a value will be in that interval.
% Unfortunately, that integral is uncool: it has no closed form in elementary functions.
% It gives rise to the definition of the *error function*, which, for the Gaussian random variable $Y$, is
% \begin{align}
% \text{erf}(y_b) = \frac{1}{\sqrt{\pi}} \int_{-y_b}^{y_b} e^{-t^2} dt.
% \end{align}
% This expresses the probability of a sample mean being in the interval $[-y_b,y_b]$ if $Y$ has mean $0$ and variance $1/2$.
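%% Markdown Cell:
% We can verify that interpretation numerically: for $Y$ with mean $0$ and variance $1/2$ (so $\sigma = 1/\sqrt{2}$), integrating the Gaussian PDF over $[-y_b,y_b]$ should reproduce $\textrm{erf}(y_b)$.
% A sketch with an arbitrary bound:
%% Code Cell:

```matlab
y_b = 1.5; % arbitrary interval bound
sigma = 1/sqrt(2); % variance 1/2
f = @(y) exp(-y.^2/(2*sigma^2))/(sigma*sqrt(2*pi)); % Gaussian PDF, mean 0
[integral(f,-y_b,y_b), erf(y_b)] % these should agree
```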
%% Markdown Cell:
% Matlab has this built-in as `erf`.
% Let's plot it.
%% Code Cell[147]:
y_a = linspace(0,3,100);
h = figure;
p1 = plot(y_a,erf(y_a));
p1.LineWidth = 2;
grid on
xlabel('interval bound $y_b$','interpreter','latex')
ylabel('error function $\textrm{erf}(y_b)$','interpreter','latex')
hgsave(h,'figures/temp');
%% Code Cell[148]:
if save_figures
fn = 'confidence_erf.tex';
h = hgload('figures/temp',struct('Visible','Off'));
cleanfigure;
matlab2tikz(['figures/',fn],...
'figurehandle',h,...
'width','.9\linewidth',...
'height','.33\linewidth',...
'extraAxisOptions',...
[...
'scaled ticks=false,',...
'tick label style={/pgf/number format/fixed}'...
],...
'showInfo', false...
);
end
%% Markdown Cell:
% We could deal directly with the error function, but most people don't, and we're weird enough as it is.
% Instead, people use the *Gaussian cumulative distribution function* $\Phi:\mathbb{R}\rightarrow\mathbb{R}$, which is defined as
% \begin{align}
% \Phi(z) = \frac{1}{2} \left(1+\textrm{erf}\left(\frac{z}{\sqrt{2}}\right)\right)
% \end{align}
% and which expresses the probability of a Gaussian random variable $Z$ with mean $0$ and standard deviation $1$ taking on a value in the interval $(-\infty,z]$.
%
% That's great and all, but occasionally we have Gaussian random variables with nonzero means and nonunity standard deviations.
% It turns out we can shift any Gaussian random variable by its mean and scale it by its standard deviation to make it have zero mean and unit standard deviation.
% We can then use $\Phi$ and interpret the results as being relative to the mean and standard deviation, using phrases like "the probability it is within two standard deviations of its mean."
% The transformed random variable $Z$ and its values $z$ are sometimes called the *z-score*.
% For a particular value $x$ of a random variable $X$, we can compute its $z$-score (or value $z$ of random variable $Z$) with the formula
% \begin{align}
% z = \frac{x - \mu_X}{\sigma_X}
% \end{align}
% and compute the probability of $X$ taking on a value within an interval, say, $x\in [x_{b-},x_{b+}]$, from tabulated values of $\Phi$.
% (Sample statistics $\overline{X}$ and $S_X$ are appropriate when population statistics are unknown.)
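%% Markdown Cell:
% For example, the familiar claim that a Gaussian random variable lands within two standard deviations of its mean about 95% of the time is just $\Phi(2)-\Phi(-2)$.
% A sketch using the erf form of $\Phi$ above:
%% Code Cell:

```matlab
Phi = @(z) (1 + erf(z/sqrt(2)))/2; % Gaussian CDF via erf
Phi(2) - Phi(-2) % probability within two standard deviations, about 0.9545
```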
%% Code Cell[149]:
z_a = linspace(-3,3,300);
threshold = 1.5;
a_pdf = @(z) (z <= threshold).*... % mask to shade the region z <= threshold
    exp(-z.^2/2)/sqrt(2*pi); % standard normal PDF