For some algorithms this will be more obvious than for others.
Let \(A\) be the outcome of a random experiment and \(P(A)\) its probability. There are two common interpretations:
Bayesian: \(P(A) = 0.9\) means we believe that the outcome of the random experiment is \(A\) with 90 % confidence.
Frequentist: \(P(A) = 0.9\) means if we repeated the random experiment infinitely often we would get the outcome \(A\) in 90 % of all cases.
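The frequentist reading can be illustrated with a short simulation (a minimal sketch using NumPy; the event probability 0.9 and the sample sizes are chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Frequentist reading: as the number of repetitions n grows, the
# relative frequency of outcome A approaches P(A) = 0.9.
for n in [100, 10_000, 1_000_000]:
    outcomes = rng.random(n) < 0.9   # True wherever the outcome is A
    print(n, outcomes.mean())
```

The printed relative frequencies get closer to 0.9 as \(n\) increases.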
Probability of observing \(A\) and \(B\) (joint probability)
\[P(A, B) = P(A \cap B) = P(A|B) P(B)\]
It factorizes as \(P(A) P(B)\) only when \(A\) and \(B\) are independent (see below).
Probability of observing \(A\) given \(B\) has been observed
\[P(A|B) = \frac{P(A \cap B)}{P(B)}\]
Outcomes \(A\) and \(B\) (with probability greater than 0) are independent (\(A \perp B\)) iff (if and only if)
\[P(A|B) = P(A) \text{ which is the same as } P(A \cap B) = P(A) P(B)\]
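A quick numerical check of these definitions, with a fair die and two events chosen for illustration (\(A\): the roll is even, \(B\): the roll is at most 4), which happen to be independent:

```python
import numpy as np

rng = np.random.default_rng(1)
die = rng.integers(1, 7, size=100_000)   # fair six-sided die rolls

A = (die % 2 == 0)   # event A: the roll is even,  P(A) = 1/2
B = (die <= 4)       # event B: the roll is <= 4,  P(B) = 2/3

# Estimate P(A|B) = P(A and B) / P(B) from relative frequencies
p_a_given_b = (A & B).mean() / B.mean()

# For this pair P(A|B) is close to P(A) = 1/2, so A and B are
# independent and the joint probability factorizes as P(A) P(B).
print(p_a_given_b, A.mean())
print((A & B).mean(), A.mean() * B.mean())
```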
Bayes' theorem follows directly from the definition of conditional probability:
\[P(B|A) = \frac{P(A|B) P(B)}{P(A)}\]
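A worked example of applying the theorem, with made-up numbers: a disease with 1 % prevalence, a test with 90 % sensitivity and a 5 % false-positive rate. We ask for the probability of being ill given a positive test.

```python
# Hypothetical numbers for illustration only
p_disease = 0.01                # P(B): prior probability of the disease
p_pos_given_disease = 0.90      # P(A|B): test sensitivity
p_pos_given_healthy = 0.05      # false-positive rate

# Evidence P(A) via the law of total probability
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Bayes' theorem: P(B|A) = P(A|B) P(B) / P(A)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(p_disease_given_pos)      # ~0.154: a positive test is far from conclusive
```

Because the disease is rare, most positive tests come from healthy people, which is why the posterior stays low despite the high sensitivity.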
Let \(H\) be a hypothesis and \(\mathcal{D}\) the observed data.
We call \(\mathcal{L}(H | \mathcal{D}) = P(\mathcal{D}|H)\) the likelihood of \(H\) given \(\mathcal{D}\). Bayes' theorem then reads
\[P(H|\mathcal{D}) = \frac{P(\mathcal{D}|H) P(H)}{P(\mathcal{D})} = \frac{\mathcal{L}(H | \mathcal{D}) P(H)}{P(\mathcal{D})}\]
\[\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}}\]
Learning means optimizing the hypothesis with respect to a given objective, e.g.
\(H_{\text{MAP}} = \arg \max_H \mathcal{L}(H | \mathcal{D}) P(H)\) (maximum a posteriori hypothesis)
\(H_{\text{MLE}} = \arg \max_H \mathcal{L}(H | \mathcal{D})\) (maximum likelihood estimate)
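Both objectives can be sketched for a simple coin-flip model. The data (7 heads in 10 flips) and the Beta(5, 5) prior favoring fair coins are assumed for illustration; the hypothesis \(H\) is the head probability \(\theta\), optimized by grid search:

```python
import numpy as np

# Hypothetical data: 7 heads in 10 flips; binomial likelihood up to a constant
heads, flips = 7, 10
theta = np.linspace(0.001, 0.999, 999)            # grid of hypotheses
likelihood = theta**heads * (1 - theta)**(flips - heads)

# Assumed prior favoring fair coins: Beta(5, 5), up to a constant
prior = theta**4 * (1 - theta)**4

theta_mle = theta[np.argmax(likelihood)]          # 0.7, i.e. heads/flips
theta_map = theta[np.argmax(likelihood * prior)]  # pulled toward 0.5 by the prior
print(theta_mle, theta_map)
```

With more data the likelihood dominates the prior and the two estimates converge.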
Mode - the value that appears most often in a dataset
Mean - the average (expected) value of a random variable \(X\)
\[\mu_x = E(X) = \sum_x p(x) \cdot x\]
Lemma: \(E(a + b X) = a + b E(X)\)
Variance - average squared deviation of a random variable from its mean
\[Var(X) = E([X - E(X)]^2) = E(X^2) - \mu_x^2\]
Lemma: \(Var(a + b X) = b^2 Var(X)\)
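Both lemmas can be verified empirically on samples from a standard normal distribution; the constants \(a\) and \(b\) are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.standard_normal(100_000)
a, b = 3.0, 2.0

# E(a + bX) = a + b E(X)
print(np.mean(a + b * x), a + b * np.mean(x))

# Var(a + bX) = b^2 Var(X): the constant shift a has no effect
print(np.var(a + b * x), b**2 * np.var(x))
```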
Standard deviation - square root of the variance
import numpy as np

np.random.seed(0)
# draw 1000 samples from a normal distribution with random scale and shift
x = np.random.randn(1000) * np.random.randn() + np.random.randn()

# estimate the mean by hand and compare with np.mean
estimated_mean = np.sum(x) / 1000.0
estimated_mean        # 0.86731284696043887
np.mean(x)            # 0.86731284696043887

# estimate the (biased) variance by hand and compare with np.var
estimated_var = np.sum((x - estimated_mean) ** 2) / 1000.0
estimated_var         # 0.30113051335498248
np.var(x)             # 0.30113051335498248