Probability Theory for Machine Learning

There is (almost) no machine learning algorithm that cannot be interpreted from a probabilistic perspective.

For some algorithms this is more obvious than for others, e.g.

  • Gaussian Mixture Models (Clustering, Regression)
  • Naive Bayes (Classification)
  • Linear Regression (Regression)
  • Logistic Regression (Classification)

Probability

Let \(A\) be the outcome of a random experiment. \(P(A)\) is its probability and

  • \(0 \leq P(A) \leq 1\)
  • \(P(A) = 1\) means the random experiment will always have the outcome \(A\)
  • \(P(A) = 0\) means the random experiment will never have the outcome \(A\)
  • \(P(A) = p \in (0, 1)\) means the random experiment will have the outcome \(A\) with probability \(p\)

Interpretations

Bayesian: \(P(A) = 0.9\) means we believe that the outcome of the random experiment is \(A\) with 90 % confidence.

Frequentist: \(P(A) = 0.9\) means if we repeated the random experiment infinitely often we would get the outcome \(A\) in 90 % of all cases.
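The frequentist reading can be illustrated with a short simulation (a hypothetical coin with \(P(\text{heads}) = 0.9\); the numbers are made up for illustration):

```python
import numpy as np

# Hypothetical experiment: a biased coin with P(heads) = 0.9.
rng = np.random.default_rng(0)
p = 0.9

# Repeat the experiment many times; the relative frequency of the
# outcome approaches its probability as the number of trials grows.
flips = rng.random(100_000) < p
frequency = flips.mean()

print(frequency)  # close to 0.9
```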

Joint Distribution

Probability of observing \(A\) and \(B\)

\[P(A, B) = P(A \cap B) = P(A|B)\, P(B)\]

This factorizes into \(P(A)\, P(B)\) only when \(A\) and \(B\) are independent (see below).

Conditional Distribution

Probability of observing \(A\) given \(B\) has been observed

\[P(A|B) = \frac{P(A \cap B)}{P(B)}\]
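A small worked example of this formula, using a fair six-sided die (a hypothetical setup, computed with exact fractions):

```python
from fractions import Fraction

# Hypothetical example: one roll of a fair six-sided die.
outcomes = range(1, 7)
A = {x for x in outcomes if x % 2 == 0}  # "roll is even"
B = {x for x in outcomes if x > 3}       # "roll is greater than 3"

def P(event):
    # Each of the 6 outcomes is equally likely.
    return Fraction(len(event), 6)

# P(A|B) = P(A ∩ B) / P(B) = (2/6) / (3/6)
p_A_given_B = P(A & B) / P(B)
print(p_A_given_B)  # 2/3
```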

Independence

Outcomes \(A\) and \(B\) (with probability greater than 0) are independent (\(A \perp B\)) iff (if and only if)

\[P(A|B) = P(A) \text{ which is the same as } P(A \cap B) = P(A) P(B)\]
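The product criterion can be checked directly. A perhaps surprising hypothetical example on a fair die: "roll is even" and "roll is at most 2" are independent:

```python
from fractions import Fraction

# Hypothetical check on a fair six-sided die.
outcomes = range(1, 7)
A = {x for x in outcomes if x % 2 == 0}  # "even"
B = {x for x in outcomes if x <= 2}      # "at most 2"

def P(event):
    return Fraction(len(event), 6)

# A ⊥ B iff P(A ∩ B) = P(A) P(B); here 1/6 == 1/2 * 1/3
independent = P(A & B) == P(A) * P(B)
print(independent)  # True
```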

Bayes' Theorem

\[P(B|A) = \frac{P(A|B) P(B)}{P(A)}\]
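A classic use of the theorem, with hypothetical numbers: \(B\) is "patient has a rare disease", \(A\) is "test is positive". \(P(A)\) is expanded via the law of total probability:

```python
# Hypothetical numbers for a rare-disease test.
p_B = 0.01              # P(B): prior probability of the disease
p_A_given_B = 0.99      # P(A|B): positive test given disease
p_A_given_not_B = 0.05  # false positive rate

# P(A) via the law of total probability
p_A = p_A_given_B * p_B + p_A_given_not_B * (1 - p_B)

# Bayes' theorem: P(B|A) = P(A|B) P(B) / P(A)
p_B_given_A = p_A_given_B * p_B / p_A
print(round(p_B_given_A, 3))  # 0.167 despite the accurate test
```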

Some Vocabularies

Let

  • \(\mathcal{D}\) be a set of observations (e.g. a training set),
  • \(H\) be a hypothesis

We call

  • \(P(H)\) the prior; how likely is the hypothesis \(H\) before seeing any observations?
  • \(P(\mathcal{D})\) the evidence; a normalizing constant that does not depend on \(H\)
  • \(P(\mathcal{D} | H)\)
    • probability of the observations for fixed \(H\)
    • likelihood of the hypothesis for fixed \(\mathcal{D}\), also \(P(\mathcal{D} | H) = \mathcal{L}(H | \mathcal{D})\)
  • \(P(H | \mathcal{D})\) the posterior

Bayesian Inference

\[P(H|\mathcal{D}) = \frac{P(\mathcal{D}|H) P(H)}{P(\mathcal{D})} = \frac{\mathcal{L}(H | \mathcal{D}) P(H)}{P(\mathcal{D})}\]

\[\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}}\]
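This update rule can be carried out by hand on a small grid of hypotheses. A hypothetical example: \(H\) is a coin's heads probability, restricted to three candidate values, and \(\mathcal{D}\) is 8 heads in 10 flips:

```python
import numpy as np

# Hypothetical grid of hypotheses for the coin's heads probability.
hypotheses = np.array([0.25, 0.5, 0.75])
prior = np.array([1 / 3, 1 / 3, 1 / 3])     # P(H): uniform prior

# D: 8 heads out of 10 flips; P(D|H) ∝ H^8 (1 - H)^2
likelihood = hypotheses**8 * (1 - hypotheses)**2

evidence = np.sum(likelihood * prior)       # P(D)
posterior = likelihood * prior / evidence   # P(H|D)

print(posterior)  # most mass on H = 0.75
```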

What Does Learning Mean?

Learning means optimizing the hypothesis with respect to a given objective, e.g.

Maximum a Posteriori Estimate (MAP)

\(H_{\text{MAP}} = \arg \max_H \mathcal{L}(H | \mathcal{D}) P(H)\)

Maximum Likelihood Estimate (MLE)

\(H_{\text{MLE}} = \arg \max_H \mathcal{L}(H | \mathcal{D})\)

Notes

  • the evidence \(P(\mathcal{D})\) does not change the maximum
  • often the log-likelihood \(\log \mathcal{L}\) will be optimized
  • MAP is often a regularized version of the MLE, i.e. it often generalizes better to unseen data
  • MLE is a special case of MAP with uniform prior
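Both estimates, and the notes above, can be seen in a small grid search (a hypothetical coin example: 8 heads in 10 flips, optimizing the log-likelihood; the Beta-shaped prior is an illustrative choice):

```python
import numpy as np

# Hypothetical data: 8 heads in 10 flips; H = P(heads) on a fine grid.
H = np.linspace(0.001, 0.999, 999)
log_likelihood = 8 * np.log(H) + 2 * np.log(1 - H)

# MLE: maximize the log-likelihood alone
H_mle = H[np.argmax(log_likelihood)]

# MAP: add a log-prior; here a Beta(5, 5)-shaped prior pulling toward 0.5
log_prior = 4 * np.log(H) + 4 * np.log(1 - H)
H_map = H[np.argmax(log_likelihood + log_prior)]

print(H_mle, H_map)  # MLE ≈ 0.8; MAP is pulled toward 0.5 (≈ 0.667)
```

With a uniform prior the `log_prior` term is constant, and the MAP estimate coincides with the MLE.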

Moments of random variables

Mode - the most probable value of a random variable (for a sample: the value that appears most often)

Mean - the expected (average) value of a random variable \(X\); for a discrete \(X\):

\[\mu_x = E(X) = \sum_x p(x) \cdot x\]

Lemma: \(E(a + b X) = a + b E(X)\)

Variance - average squared deviation from the mean of a random variable

\[Var(X) = E([X - E(X)]^2) = E(X^2) - \mu_x^2\]

Lemma: \(Var(a + b X) = b^2 Var(X)\)
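Both lemmas can be verified numerically on a sample, since they hold as algebraic identities for sample means and variances as well (the values of \(a\), \(b\) and the distribution of \(X\) are arbitrary choices):

```python
import numpy as np

# Monte Carlo check of E(a + bX) = a + b E(X) and Var(a + bX) = b^2 Var(X).
rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=3.0, size=100_000)
a, b = 5.0, -2.0

Y = a + b * X

print(np.mean(Y), a + b * np.mean(X))   # equal
print(np.var(Y), b**2 * np.var(X))      # equal
```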

Standard deviation - square root of the variance

Estimated Moments from Samples

In [1]:
import numpy as np
np.random.seed(0)
x = np.random.randn(1000) * np.random.randn() + np.random.randn()
In [2]:
estimated_mean = np.sum(x) / 1000.0
estimated_mean
Out[2]:
0.86731284696043887
In [3]:
np.mean(x)
Out[3]:
0.86731284696043887
In [4]:
estimated_var = np.sum((x - estimated_mean) ** 2) / 1000.0
estimated_var
Out[4]:
0.30113051335498248
In [5]:
np.var(x)
Out[5]:
0.30113051335498248
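Dividing by \(n\) gives the biased (maximum likelihood) variance estimate, which is what `np.var` computes by default; dividing by \(n - 1\) gives the unbiased estimate, exposed in NumPy via the `ddof` parameter:

```python
import numpy as np

np.random.seed(0)
x = np.random.randn(1000) * np.random.randn() + np.random.randn()

n = x.size
# Biased (MLE) estimate: divide by n — same as np.var(x)
biased = np.sum((x - np.mean(x)) ** 2) / n
# Unbiased estimate: divide by n - 1 — same as np.var(x, ddof=1)
unbiased = np.sum((x - np.mean(x)) ** 2) / (n - 1)

print(biased, unbiased)
```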
