Probability Theory for Machine Learning

There is (almost) no machine learning algorithm that cannot be interpreted from a probabilistic perspective.

For some algorithms this is more obvious than for others, e.g.

  • Gaussian Mixture Models (Clustering, Regression)
  • Naive Bayes (Classification)
  • Linear Regression (Regression)
  • Logistic Regression (Classification)

Probability

Let \(A\) be the outcome of a random experiment. \(P(A)\) is its probability and

  • \(0 \leq P(A) \leq 1\)
  • \(P(A) = 1\) means the random experiment will always have the outcome \(A\)
  • \(P(A) = 0\) means the random experiment will never have the outcome \(A\)
  • \(P(A) = p \in (0, 1)\) means the random experiment will have the outcome \(A\) with probability \(p\)

Interpretations

Bayesian: \(P(A) = 0.9\) means we believe that the outcome of the random experiment is \(A\) with 90 % confidence.

Frequentist: \(P(A) = 0.9\) means if we repeated the random experiment infinitely often we would get the outcome \(A\) in 90 % of all cases.
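The frequentist reading can be illustrated with a short simulation (a hypothetical coin with \(P(\text{heads}) = 0.9\); the numbers are made up for illustration):

```python
import numpy as np

# Hypothetical experiment: a biased coin with P(heads) = 0.9.
rng = np.random.default_rng(0)
p = 0.9

# Repeat the experiment many times; the relative frequency of the
# outcome approaches its probability as the number of trials grows.
flips = rng.random(100_000) < p
frequency = flips.mean()

print(frequency)  # close to 0.9
```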

Joint Distribution

Probability of observing \(A\) and \(B\)

\[P(A, B) = P(A \cap B) = P(A|B)\, P(B)\]

This factorizes into \(P(A)\, P(B)\) only when \(A\) and \(B\) are independent (see below).

Conditional Distribution

Probability of observing \(A\) given \(B\) has been observed

\[P(A|B) = \frac{P(A \cap B)}{P(B)}\]
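A small worked example of this formula, using a fair six-sided die (a hypothetical setup, computed with exact fractions):

```python
from fractions import Fraction

# Hypothetical example: one roll of a fair six-sided die.
outcomes = range(1, 7)
A = {x for x in outcomes if x % 2 == 0}  # "roll is even"
B = {x for x in outcomes if x > 3}       # "roll is greater than 3"

def P(event):
    # Each of the 6 outcomes is equally likely.
    return Fraction(len(event), 6)

# P(A|B) = P(A ∩ B) / P(B) = (2/6) / (3/6)
p_A_given_B = P(A & B) / P(B)
print(p_A_given_B)  # 2/3
```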

Independence

Outcomes \(A\) and \(B\) (with probability greater than 0) are independent (\(A \perp B\)) iff (if and only if)

\[P(A|B) = P(A) \text{ which is the same as } P(A \cap B) = P(A) P(B)\]
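The product criterion can be checked directly. A perhaps surprising hypothetical example on a fair die: "roll is even" and "roll is at most 2" are independent:

```python
from fractions import Fraction

# Hypothetical check on a fair six-sided die.
outcomes = range(1, 7)
A = {x for x in outcomes if x % 2 == 0}  # "even"
B = {x for x in outcomes if x <= 2}      # "at most 2"

def P(event):
    return Fraction(len(event), 6)

# A ⊥ B iff P(A ∩ B) = P(A) P(B); here 1/6 == 1/2 * 1/3
independent = P(A & B) == P(A) * P(B)
print(independent)  # True
```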

Bayes' Theorem

\[P(B|A) = \frac{P(A|B) P(B)}{P(A)}\]
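A classic use of the theorem, with hypothetical numbers: \(B\) is "patient has a rare disease", \(A\) is "test is positive". \(P(A)\) is expanded via the law of total probability:

```python
# Hypothetical numbers for a rare-disease test.
p_B = 0.01              # P(B): prior probability of the disease
p_A_given_B = 0.99      # P(A|B): positive test given disease
p_A_given_not_B = 0.05  # false positive rate

# P(A) via the law of total probability
p_A = p_A_given_B * p_B + p_A_given_not_B * (1 - p_B)

# Bayes' theorem: P(B|A) = P(A|B) P(B) / P(A)
p_B_given_A = p_A_given_B * p_B / p_A
print(round(p_B_given_A, 3))  # 0.167 despite the accurate test
```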

Some Vocabularies

Let

  • \(\mathcal{D}\) be a set of observations (e.g. a training set),
  • \(H\) be a hypothesis

We call

  • \(P(H)\) the prior; how likely is the hypothesis \(H\) before seeing any observations?
  • \(P(\mathcal{D})\) the evidence; a normalizing constant that does not depend on \(H\)
  • \(P(\mathcal{D} | H)\)
    • probability of the observations for fixed \(H\)
    • likelihood of the hypothesis for fixed \(\mathcal{D}\), also \(P(\mathcal{D} | H) = \mathcal{L}(H | \mathcal{D})\)
  • \(P(H | \mathcal{D})\) the posterior

Bayesian Inference

\[P(H|\mathcal{D}) = \frac{P(\mathcal{D}|H) P(H)}{P(\mathcal{D})} = \frac{\mathcal{L}(H | \mathcal{D}) P(H)}{P(\mathcal{D})}\]

\[\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}}\]
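This update rule can be carried out by hand on a small grid of hypotheses. A hypothetical example: \(H\) is a coin's heads probability, restricted to three candidate values, and \(\mathcal{D}\) is 8 heads in 10 flips:

```python
import numpy as np

# Hypothetical grid of hypotheses for the coin's heads probability.
hypotheses = np.array([0.25, 0.5, 0.75])
prior = np.array([1 / 3, 1 / 3, 1 / 3])     # P(H): uniform prior

# D: 8 heads out of 10 flips; P(D|H) ∝ H^8 (1 - H)^2
likelihood = hypotheses**8 * (1 - hypotheses)**2

evidence = np.sum(likelihood * prior)       # P(D)
posterior = likelihood * prior / evidence   # P(H|D)

print(posterior)  # most mass on H = 0.75
```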

What Does Learning Mean?

Learning means optimizing the hypothesis with respect to a given objective, e.g.

Maximum a Posteriori Estimate (MAP)

\(H_{\text{MAP}} = \arg \max_H \mathcal{L}(H | \mathcal{D}) P(H)\)

Maximum Likelihood Estimate (MLE)

\(H_{\text{MLE}} = \arg \max_H \mathcal{L}(H | \mathcal{D})\)

Notes

  • the evidence \(P(\mathcal{D})\) does not change the maximum
  • often the log-likelihood \(\log \mathcal{L}\) will be optimized
  • MAP is often a regularized version of the MLE, i.e. it often generalizes better to unseen data
  • MLE is a special case of MAP with uniform prior
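Both estimates, and the notes above, can be seen in a small grid search (a hypothetical coin example: 8 heads in 10 flips, optimizing the log-likelihood; the Beta-shaped prior is an illustrative choice):

```python
import numpy as np

# Hypothetical data: 8 heads in 10 flips; H = P(heads) on a fine grid.
H = np.linspace(0.001, 0.999, 999)
log_likelihood = 8 * np.log(H) + 2 * np.log(1 - H)

# MLE: maximize the log-likelihood alone
H_mle = H[np.argmax(log_likelihood)]

# MAP: add a log-prior; here a Beta(5, 5)-shaped prior pulling toward 0.5
log_prior = 4 * np.log(H) + 4 * np.log(1 - H)
H_map = H[np.argmax(log_likelihood + log_prior)]

print(H_mle, H_map)  # MLE ≈ 0.8; MAP is pulled toward 0.5 (≈ 0.667)
```

With a uniform prior the `log_prior` term is constant, and the MAP estimate coincides with the MLE.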

Moments of random variables

Mode - the most probable value of a random variable (for a sample: the value that appears most often)

Mean - the expected (average) value of a random variable \(X\); for a discrete \(X\):

\[\mu_x = E(X) = \sum_x p(x) \cdot x\]

Lemma: \(E(a + b X) = a + b E(X)\)

Variance - average squared deviation from the mean of a random variable

\[Var(X) = E([X - E(X)]^2) = E(X^2) - \mu_x^2\]

Lemma: \(Var(a + b X) = b^2 Var(X)\)
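Both lemmas can be verified numerically on a sample, since they hold as algebraic identities for sample means and variances as well (the values of \(a\), \(b\) and the distribution of \(X\) are arbitrary choices):

```python
import numpy as np

# Monte Carlo check of E(a + bX) = a + b E(X) and Var(a + bX) = b^2 Var(X).
rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=3.0, size=100_000)
a, b = 5.0, -2.0

Y = a + b * X

print(np.mean(Y), a + b * np.mean(X))   # equal
print(np.var(Y), b**2 * np.var(X))      # equal
```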

Standard deviation - square root of the variance

Estimated Moments from Samples

In [1]:
import numpy as np
np.random.seed(0)
x = np.random.randn(1000) * np.random.randn() + np.random.randn()
In [2]:
estimated_mean = np.sum(x) / 1000.0
estimated_mean
Out[2]:
0.86731284696043887
In [3]:
np.mean(x)
Out[3]:
0.86731284696043887
In [4]:
estimated_var = np.sum((x - estimated_mean) ** 2) / 1000.0
estimated_var
Out[4]:
0.30113051335498248
In [5]:
np.var(x)
Out[5]:
0.30113051335498248
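Dividing by \(n\) gives the biased (maximum likelihood) variance estimate, which is what `np.var` computes by default; dividing by \(n - 1\) gives the unbiased estimate, exposed in NumPy via the `ddof` parameter:

```python
import numpy as np

np.random.seed(0)
x = np.random.randn(1000) * np.random.randn() + np.random.randn()

n = x.size
# Biased (MLE) estimate: divide by n — same as np.var(x)
biased = np.sum((x - np.mean(x)) ** 2) / n
# Unbiased estimate: divide by n - 1 — same as np.var(x, ddof=1)
unbiased = np.sum((x - np.mean(x)) ** 2) / (n - 1)

print(biased, unbiased)
```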
