Approximate an unknown function $f$ that maps from $\mathbb{R}^D$ to $\mathbb{R}^F$.
Model:
- We assume that there is some latent function $f: \mathbb{R}^D \rightarrow \mathbb{R}^F$.
- We observe samples $(\boldsymbol{x}_n, \boldsymbol{y}_n)$ with $\boldsymbol{y}_n = f(\boldsymbol{x}_n) + \boldsymbol{\epsilon}_n$,
- where the noise $\boldsymbol{\epsilon}_n$ is i.i.d., for example $\boldsymbol{\epsilon}_n \sim \mathcal{N}(\boldsymbol{0}, \sigma^2 \boldsymbol{I}_F)$.
Overview¶
Algorithms
- Linear Regression
- Polynomial (Ridge) Regression
- Kernel (Ridge) Regression
- Gaussian Process Regression
- Support Vector Regression
- Neural Nets
Tips
- Bias vs. Variance
- Model Selection
Dataset¶
The set of all samples $\{(\boldsymbol{x}_1, \boldsymbol{y}_1), \ldots, (\boldsymbol{x}_N, \boldsymbol{y}_N)\}$ is called the dataset.
Here is an example.
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")
from sklearn.utils import check_random_state
def generate_data(n_samples, sigma, random_state=None):
    random_state = check_random_state(random_state)
    x = np.linspace(0, 1, n_samples)
    X = x[:, np.newaxis]
    y_true = np.cos(2 * np.pi * x)
    y = y_true + sigma * random_state.randn(n_samples)
    return x, X, y_true, y
x, X, y_true, y = generate_data(101, 0.3, 0)
plt.plot(x, y_true, label="Latent function")
plt.scatter(x, y, label="Samples")
plt.xlim((0, 1))
plt.legend()
We will only consider the case $F = 1$.
In this case, we can organize the whole dataset in a design matrix $\boldsymbol{X} \in \mathbb{R}^{N \times D}$, which contains the input $\boldsymbol{x}_n$ of one sample in each row, and a vector $\boldsymbol{y} \in \mathbb{R}^N$, which contains the output $y_n$ of one sample in each entry.
Design matrix
$$\boldsymbol{X} = \begin{pmatrix} \boldsymbol{x}_1^T\\ \boldsymbol{x}_2^T\\ \vdots\\ \boldsymbol{x}_N^T \end{pmatrix} = \overbrace{ \begin{pmatrix} x_{1,1}&x_{1,2}&\ldots&x_{1,D}\\ x_{2,1}&x_{2,2}&\ldots&x_{2,D}\\ \vdots&\vdots&\ddots&\vdots\\ x_{N,1}&x_{N,2}&\ldots&x_{N,D} \end{pmatrix} }^\textrm{Features}$$
Linear Regression¶
Model: We assume that there is some latent function $f: \mathbb{R}^D \rightarrow \mathbb{R}$. We observe samples $(\boldsymbol{x}_n, y_n)$ with $$f(\boldsymbol{x}_n) + \epsilon_n = y_n, \quad \epsilon_n \sim \mathcal{N}(0, \sigma^2).$$
We assume that $f$ is a linear function of the form $$f(\boldsymbol{x}_n) = \boldsymbol{w}^T \boldsymbol{x}_n$$ ($\boldsymbol{x}_n$ will be extended by a constant 1 so that $\boldsymbol{w}$ incorporates a bias term).
For the whole dataset: $$\boldsymbol{y} = \boldsymbol{X}\boldsymbol{w} + \boldsymbol{\epsilon}$$
We will learn by minimizing the sum of squared errors (SSE) $$\arg \min_\boldsymbol{w} \frac{1}{2} || \boldsymbol{y} - \boldsymbol{X} \boldsymbol{w} ||^2_2.$$
We can find the optimal $\boldsymbol{w}$ by setting the gradient of the objective function to zero and solving for $\boldsymbol{w}$.
\begin{eqnarray*} && \boldsymbol{0} = -\boldsymbol{X}^T \left( \boldsymbol{y} - \boldsymbol{X} \boldsymbol{w} \right) = -\boldsymbol{X}^T \boldsymbol{y} + \boldsymbol{X}^T \boldsymbol{X} \boldsymbol{w}\\ &\Leftrightarrow& \boldsymbol{X}^T \boldsymbol{X} \boldsymbol{w} = \boldsymbol{X}^T \boldsymbol{y}\\ &\Leftrightarrow& \boldsymbol{w} = \left( \boldsymbol{X}^T \boldsymbol{X} \right)^{-1} \boldsymbol{X}^T \boldsymbol{y} \end{eqnarray*}(Strictly speaking, we also have to check the second-order condition: the Hessian $\boldsymbol{X}^T \boldsymbol{X}$ is positive semidefinite, so the objective is convex and this is indeed a minimum.)
from sklearn.base import BaseEstimator, RegressorMixin
class LinearRegression(BaseEstimator, RegressorMixin):
    def fit(self, X, y):
        n_samples = X.shape[0]
        X_bias = np.hstack((np.ones((n_samples, 1)), X))
        self.w = np.linalg.inv(X_bias.T.dot(X_bias)).dot(X_bias.T).dot(y)
        return self

    def predict(self, X):
        n_samples = X.shape[0]
        X_bias = np.hstack((np.ones((n_samples, 1)), X))
        return X_bias.dot(self.w)
def plot_model(model):
    x, X, y_true, y = generate_data(101, 0.3, 0)
    plt.plot(x, y_true, label="Latent function")
    plt.scatter(x, y, label="Samples")
    plt.plot(x, model.predict(X), label="Prediction")
    plt.xlim((0, 1))
    plt.legend()
x, X, y_true, y = generate_data(101, 0.3, 0)
plot_model(LinearRegression().fit(X, y))
Problems:
- the model is only linear
- the complexity of the matrix inversion grows cubically with the number of features: $O(D^3)$
- alternative: we can use the iterative algorithm gradient descent, which requires multiple passes over the whole dataset (a minimal sketch follows)
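As a rough illustration of this alternative, here is a minimal batch gradient descent sketch for the SSE objective (the learning rate lr and the number of iterations are arbitrary, untuned choices):
def linear_regression_gd(X, y, lr=0.5, n_iter=1000):
    # Prepend a bias column, as in LinearRegression.fit above
    X_bias = np.hstack((np.ones((X.shape[0], 1)), X))
    w = np.zeros(X_bias.shape[1])
    for _ in range(n_iter):
        # The gradient of 0.5 * ||y - Xw||^2 with respect to w is -X^T (y - Xw)
        w += lr * X_bias.T.dot(y - X_bias.dot(w)) / X_bias.shape[0]
    return w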
Nonlinear Regression¶
We can extend the linear model to approximate nonlinear functions by
- using fixed nonlinear features (polynomial regression, radial basis function networks, ...)
- learning nonlinear features (neural nets)
- using nonlinear similarities (kernel methods)
- mixing multiple linear models (local regression)
Ingredients for Machine Learning¶
Machine learning algorithms often consist of
- a parameterized function (e.g. $\boldsymbol{X} \boldsymbol{w}$, $\boldsymbol{w}$ are the parameters)
- an objective function (e.g. $\frac{1}{2} || \boldsymbol{y} - \hat{\boldsymbol{y}} ||^2_2$)
- an optimization procedure
Polynomial Regression¶
We can approximate nonlinear functions with linear regression by generating nonlinear features $\phi(\boldsymbol{x})$.
For example: $$\left( x_1, x_2 \right) \rightarrow \left( x_1^3, x_2^3, x_1^2x_2, x_1 x_2^2, x_1^2, x_2^2, x_1 x_2, x_1, x_2, 1 \right)$$
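A quick way to generate such monomial features is sklearn's PolynomialFeatures (also used in the implementation below; note that its column ordering differs from the listing above):
from sklearn.preprocessing import PolynomialFeatures
# One sample with features (x_1, x_2) = (2, 3); the degree-3 expansion contains
# 1, x_1, x_2, x_1^2, x_1 x_2, x_2^2, x_1^3, x_1^2 x_2, x_1 x_2^2, x_2^3
print(PolynomialFeatures(degree=3).fit_transform(np.array([[2.0, 3.0]])))
# [[  1.   2.   3.   4.   6.   9.   8.  12.  18.  27.]]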
from sklearn.preprocessing import PolynomialFeatures
class PolynomialRegression(BaseEstimator, RegressorMixin):
    def __init__(self, degree):
        self.degree = degree

    def fit(self, X, y):
        self.poly = PolynomialFeatures(degree=self.degree).fit(X)
        X_poly = self.poly.transform(X)
        self.w = np.linalg.pinv(X_poly.T.dot(X_poly)).dot(X_poly.T).dot(y)
        return self

    def predict(self, X):
        X_poly = self.poly.transform(X)
        return X_poly.dot(self.w)
plot_model(PolynomialRegression(100).fit(X, y))
Problems:
- overfitting
- features have to be selected
- makes the problem of cubic complexity worse (more features)
Bias vs. Variance¶
def plot_bias_variance_error():
    plt.setp(plt.gca(), xticks=(), yticks=(), xlabel="Model Complexity", ylabel="Error")
    resolution = 100
    complexity = np.linspace(0, 4, resolution)
    noise_error = np.ones(resolution) * 0.1
    bias_error = np.exp(-complexity)
    variance_error = np.exp(complexity - np.max(complexity))
    plt.plot(complexity, bias_error, label="Bias")
    plt.plot(complexity, variance_error, label="Variance")
    plt.plot(complexity, noise_error, label="Noise")
    plt.plot(complexity, bias_error + variance_error + noise_error, label="Total")
    plt.legend(loc="best")

with plt.xkcd():
    plot_bias_variance_error()
def plot_bias_variance():
    plt.figure(figsize=(12, 12))

    def plot_target():
        t = np.linspace(0, 2 * np.pi, 100)
        plt.plot(np.cos(t), np.sin(t))
        plt.plot(0.67 * np.cos(t), 0.67 * np.sin(t))
        plt.plot(0.33 * np.cos(t), 0.33 * np.sin(t))
        plt.scatter(0, 0, color="black", s=100)

    plt.setp(plt.subplot(2, 2, 1, aspect="equal"), xticks=(), yticks=())
    plot_target()
    plt.title("Low Variance")
    plt.ylabel("Low Bias")
    plt.scatter((np.random.rand(10)-0.5)*0.2, (np.random.rand(10)-0.5)*0.2, s=50)

    plt.setp(plt.subplot(2, 2, 2, aspect="equal"), xticks=(), yticks=())
    plot_target()
    plt.title("High Variance")
    plt.scatter((np.random.rand(10)-0.5)*0.8, (np.random.rand(10)-0.5)*0.8, s=50)

    plt.setp(plt.subplot(2, 2, 3, aspect="equal"), xticks=(), yticks=())
    plot_target()
    plt.ylabel("High Bias")
    plt.scatter((np.random.rand(10)-0.5)*0.2+0.5, (np.random.rand(10)-0.5)*0.2+0.5, s=50)

    plt.setp(plt.subplot(2, 2, 4, aspect="equal"), xticks=(), yticks=())
    plot_target()
    plt.scatter((np.random.rand(10)-0.5)*0.8+0.5, (np.random.rand(10)-0.5)*0.8+0.5, s=50)

with plt.xkcd():
    plot_bias_variance()
Ridge Regression¶
To reduce overfitting, we minimize
$$\arg \min_\boldsymbol{w} \frac{1}{2} || \boldsymbol{y} - \boldsymbol{X} \boldsymbol{w} ||^2_2 + \frac{\gamma}{2} \boldsymbol{w}^T \boldsymbol{w},$$
which corresponds to placing a prior on $\boldsymbol{w}$.
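To make the prior interpretation explicit: with a Gaussian likelihood $p(\boldsymbol{y}|\boldsymbol{X}, \boldsymbol{w}) \propto \exp \left( -\frac{1}{2 \sigma^2} || \boldsymbol{y} - \boldsymbol{X} \boldsymbol{w} ||^2_2 \right)$ and a zero-mean Gaussian prior $p(\boldsymbol{w}) \propto \exp \left( -\frac{\tau}{2} \boldsymbol{w}^T \boldsymbol{w} \right)$, maximizing the posterior $p(\boldsymbol{w}|\boldsymbol{X}, \boldsymbol{y}) \propto p(\boldsymbol{y}|\boldsymbol{X}, \boldsymbol{w}) \, p(\boldsymbol{w})$ is equivalent to minimizing the objective above with $\gamma = \sigma^2 \tau$. Setting the gradient to zero as before yields the closed-form solution $$\boldsymbol{w} = \left( \boldsymbol{X}^T \boldsymbol{X} + \gamma \boldsymbol{I} \right)^{-1} \boldsymbol{X}^T \boldsymbol{y},$$ which the following code implements on polynomial features.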
class PolynomialRidgeRegression(BaseEstimator, RegressorMixin):
    def __init__(self, degree, gamma):
        self.degree = degree
        self.gamma = gamma

    def fit(self, X, y):
        self.poly = PolynomialFeatures(degree=self.degree).fit(X)
        X_poly = self.poly.transform(X)
        self.w = (np.linalg.inv(X_poly.T.dot(X_poly) + self.gamma *
                                np.eye(X_poly.shape[1])).dot(X_poly.T).dot(y))
        return self

    def predict(self, X):
        X_poly = self.poly.transform(X)
        return X_poly.dot(self.w)
plot_model(PolynomialRidgeRegression(100, 0.01).fit(X, y))
Problems:
- regularization coefficient has to be selected
- does not address the cubic complexity
Kernel Regression¶
We can express $\boldsymbol{w}$ as a weighted sum of the training data $$\boldsymbol{w} = \boldsymbol{X}^T \boldsymbol{\alpha},$$
which leads to the dual problem $$\arg \min_\boldsymbol{\alpha} \frac{1}{2} || \boldsymbol{y} - \boldsymbol{X} \boldsymbol{X}^T \boldsymbol{\alpha}||^2_2.$$
We call $\boldsymbol{K} = \boldsymbol{X}\boldsymbol{X}^T \in \mathbb{R}^{N \times N}$ the Gram matrix. Setting the gradient with respect to $\boldsymbol{\alpha}$ to zero (and assuming $\boldsymbol{K}$ is invertible) gives \begin{eqnarray*} && \boldsymbol{0} = \boldsymbol{K} \boldsymbol{y} - \boldsymbol{K}\boldsymbol{K} \boldsymbol{\alpha}\\ &\Leftrightarrow& \boldsymbol{K}\boldsymbol{K} \boldsymbol{\alpha} = \boldsymbol{K} \boldsymbol{y}\\ &\Leftrightarrow& \boldsymbol{\alpha} = \boldsymbol{K}^{-1} \boldsymbol{y} \end{eqnarray*}
Kernel Trick¶
Instead of the linear kernel $k(\boldsymbol{x}, \boldsymbol{x}') = \boldsymbol{x}^T \boldsymbol{x}'$, we can use any Mercer kernel to build the Gram matrix $\boldsymbol{K}$. The intuition is that each kernel defines a different notion of similarity between samples. Common choices (a small numerical check follows the list):
- Linear kernel: $k(\boldsymbol{x}, \boldsymbol{x}') = \boldsymbol{x}^T \boldsymbol{x}'$
- Radial basis function (RBF) kernel: $k(\boldsymbol{x}, \boldsymbol{x}') = \exp \left( - \gamma ||\boldsymbol{x} - \boldsymbol{x}'||^2 \right)$
- Polynomial kernel
- String kernels
- Graph kernels
- ...
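A small numerical check that the Gram matrix is just a matrix of pairwise similarities (three 1D points; pairwise_kernels is the same helper used in the implementation below):
from sklearn.metrics.pairwise import pairwise_kernels
A = np.array([[0.0], [0.5], [1.0]])
# The linear Gram matrix equals A A^T
print(pairwise_kernels(A, metric="linear") - A.dot(A.T))  # all zeros
# RBF entries are exp(-gamma ||x - x'||^2): the diagonal is 1 and
# the similarity decays with the squared distance
print(pairwise_kernels(A, metric="rbf", gamma=10.0))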
from sklearn.metrics.pairwise import pairwise_kernels
class KernelRegression(BaseEstimator, RegressorMixin):
    def __init__(self, kernel, **kernel_args):
        self.kernel = kernel
        self.kernel_args = kernel_args

    def fit(self, X, y):
        self.X = X
        K = pairwise_kernels(self.X, metric=self.kernel, **self.kernel_args)
        self.alpha = np.linalg.pinv(K).dot(y)
        return self

    def predict(self, X):
        K_star = pairwise_kernels(X, self.X, metric=self.kernel, **self.kernel_args)
        return K_star.dot(self.alpha)
plot_model(KernelRegression("linear").fit(X, y))
plot_model(KernelRegression("rbf", gamma=10.0).fit(X, y))
Problems:
- overfitting
- kernel has to be selected (often less critical than features)
- complexity with respect to number of samples is $O(N^3)$
Kernel Ridge Regression¶
We start from ridge regression
$$\arg \min_\boldsymbol{w} \frac{1}{2} || \boldsymbol{y} - \boldsymbol{X} \boldsymbol{w} ||^2_2 + \frac{\gamma}{2} \boldsymbol{w}^T \boldsymbol{w}$$
and again substitute $\boldsymbol{w} = \boldsymbol{X}^T \boldsymbol{\alpha}$, so that
$$\boldsymbol{w}^T \boldsymbol{w} = \left( \boldsymbol{X}^T \boldsymbol{\alpha} \right)^T \left( \boldsymbol{X}^T \boldsymbol{\alpha} \right) = \boldsymbol{\alpha}^T \boldsymbol{X} \boldsymbol{X}^T \boldsymbol{\alpha} = \boldsymbol{\alpha}^T \boldsymbol{K} \boldsymbol{\alpha},$$
which results in a modified dual problem
$$\arg \min_\boldsymbol{\alpha} \frac{1}{2} || \boldsymbol{y} - \boldsymbol{K} \boldsymbol{\alpha} ||^2_2 + \frac{\gamma}{2} \boldsymbol{\alpha}^T \boldsymbol{K} \boldsymbol{\alpha}.$$
Setting the gradient to zero yields \begin{eqnarray*} && \boldsymbol{0} = -\boldsymbol{K} \left( \boldsymbol{y} - \boldsymbol{K} \boldsymbol{\alpha} \right) + \gamma \boldsymbol{K} \boldsymbol{\alpha}\\ &\Leftrightarrow& \boldsymbol{K} \boldsymbol{y} = \boldsymbol{K} \left( \boldsymbol{K} + \gamma \boldsymbol{I} \right) \boldsymbol{\alpha}\\ &\Leftrightarrow& \boldsymbol{\alpha} = \left(\boldsymbol{K} + \gamma \boldsymbol{I} \right)^{-1} \boldsymbol{y} \end{eqnarray*}
class KernelRidgeRegression(BaseEstimator, RegressorMixin):
    def __init__(self, kernel, lmbda, **kernel_args):
        self.kernel = kernel
        self.lmbda = lmbda
        self.kernel_args = kernel_args

    def fit(self, X, y):
        self.X = X
        K = pairwise_kernels(self.X, metric=self.kernel, **self.kernel_args)
        self.alpha = np.linalg.pinv(K + self.lmbda * np.eye(X.shape[0])).dot(y)
        return self

    def predict(self, X):
        K_star = pairwise_kernels(X, self.X, metric=self.kernel, **self.kernel_args)
        return K_star.dot(self.alpha)
plot_model(KernelRidgeRegression("rbf", lmbda=1.0, gamma=10.0).fit(X, y))
Gaussian Process Regression¶
Kernels that are used for GPR (covariance functions) usually have more hyperparameters than the kernels we have looked at so far. However, GPR offers a principled way to optimize these hyperparameters by maximizing the marginal likelihood (see below), which usually works better than the model selection procedures of the other methods, at a higher computational cost.
- RBF kernel: $$k(\boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}) = \exp \left( - \gamma \lvert\lvert\boldsymbol{x}^{(1)} - \boldsymbol{x}^{(2)}\rvert\rvert^2 \right)$$
- Squared exponential kernel (one inverse length scale $l_i$ per dimension): $$k(\boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}) = \exp \left( - \sum_i l_i \left( x^{(1)}_i - x^{(2)}_i \right)^2 \right)$$
We maximize the marginal likelihood $p(\boldsymbol{y}|\boldsymbol{X})$, or equivalently the log marginal likelihood $$\log p(\boldsymbol{y}|\boldsymbol{X}) = - \frac{1}{2} \boldsymbol{y}^T \boldsymbol{K}^{-1} \boldsymbol{y} - \frac{1}{2} \log |\boldsymbol{K}| - \frac{N}{2} \log (2 \pi),$$ with respect to the kernel's hyperparameters $\boldsymbol{\theta}$.
The derivative can be computed as $$\frac{\partial}{\partial \theta_j} \log p(\boldsymbol{y}|\boldsymbol{X}) = \frac{1}{2} \operatorname{tr} \left( (\boldsymbol{\alpha} \boldsymbol{\alpha}^T - \boldsymbol{K}^{-1}) \frac{\partial \boldsymbol{K}}{\partial \theta_j} \right), \quad \text{where } \boldsymbol{\alpha} = \boldsymbol{K}^{-1} \boldsymbol{y}.$$
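A minimal sketch of evaluating this log marginal likelihood for the RBF kernel (the noise level sigma is an assumption added for numerical stability; a real implementation would also reuse a Cholesky decomposition of $\boldsymbol{K}$ instead of solving from scratch):
def log_marginal_likelihood(X, y, gamma, sigma=0.1):
    # K + sigma^2 I: RBF Gram matrix plus i.i.d. noise on the diagonal
    K = (pairwise_kernels(X, metric="rbf", gamma=gamma) +
         sigma ** 2 * np.eye(X.shape[0]))
    alpha = np.linalg.solve(K, y)  # alpha = K^-1 y
    # log p(y|X) = -0.5 y^T K^-1 y - 0.5 log|K| - N/2 log(2 pi)
    return (-0.5 * y.dot(alpha) - 0.5 * np.linalg.slogdet(K)[1] -
            0.5 * X.shape[0] * np.log(2.0 * np.pi))
Maximizing this quantity over gamma (and sigma) with a gradient-based optimizer yields the hyperparameters.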
from sklearn.gaussian_process import GaussianProcess
gpr = GaussianProcess(nugget=0.1).fit(X, y)
plot_model(gpr)
y_pred, y_mse = gpr.predict(X, eval_MSE=True)
plt.fill_between(x, y_pred - 2 * np.sqrt(y_mse), y_pred + 2 * np.sqrt(y_mse), alpha=0.2)
Problems:
- does not address the cubic complexity
Support Vector Regression¶
Support vector algorithms usually solve a constrained optimization problem. In this case, we want to optimize
$\arg \min_{\boldsymbol{w}} \frac{1}{2} ||\boldsymbol{w}||^2$ subject to $\begin{cases} \xi_i \leq \epsilon\\ \xi_i^* \leq \epsilon\\ \end{cases}$, where $\begin{array}{l}\xi_i = y_i - \boldsymbol{w}^T \boldsymbol{x}_i - b\\\xi_i^* = \boldsymbol{w}^T \boldsymbol{x}_i + b - y_i\end{array}$.
In comparison to kernel ridge regression, this will result in only a few support vectors that will be used for predictions instead of the whole training set.
from sklearn.svm import SVR
def plot_svr(svr):
    plt.scatter(x[svr.support_], y[svr.support_], color="r", s=100)
    plot_model(svr)
    plt.fill_between(x, svr.predict(X) - svr.epsilon,
                     svr.predict(X) + svr.epsilon, alpha=0.2)
epsilon = 0.3
svr = SVR("rbf", epsilon=epsilon, gamma=10.0).fit(X, y)
plot_svr(svr)
plt.figure(figsize=(14, 4))
plt.subplot(121)
plot_svr(svr)
svi = svr.support_[6]
svy = svr.predict([X[svi]])[0]
errori = y[svi] - svy
plt.plot([x[svi], x[svi]], [svy, svy + epsilon * np.sign(errori)], c="g")
plt.plot([x[svi], x[svi]], [svy + epsilon * np.sign(errori), y[svi]], c="r")
plt.setp(plt.gca(), xlim=(0.1, 0.3), ylim=(-0.5, 1.5))
square_loss = lambda d: d ** 2
epsilon_insensitive = lambda d, epsilon: np.maximum(0.0, np.abs(d) - epsilon)
plt.subplot(122)
d = np.linspace(-1, 1, 100)
plt.plot(square_loss(d), d, label="Square")
plt.plot(epsilon_insensitive(d, epsilon=0.3), d,
         label="$\epsilon$-insensitive ($\epsilon=0.3$)")
loss = epsilon_insensitive(errori, epsilon=0.3)
plt.scatter(loss, errori, color="r", s=100, zorder=10)
plt.plot([loss, loss], [0, epsilon * np.sign(errori)], c="g")
plt.fill_between([0, loss], [0, 0], [epsilon * np.sign(errori),
                                     epsilon * np.sign(errori)],
                 color="g", alpha=0.2)
plt.plot([loss, loss], [errori, epsilon * np.sign(errori)], c="r")
plt.fill_between([0, loss], [errori, errori], [epsilon * np.sign(errori),
                                               epsilon * np.sign(errori)],
                 color="r", alpha=0.2)
plt.gca().annotate("$\epsilon$", xy=(loss, -0.15), xytext=(1, -0.22),
                   fontsize=30, arrowprops=dict(facecolor="black", shrink=0.1))
plt.gca().annotate("$\\xi^*$", xy=(loss, -0.6), xytext=(1, -0.68),
                   fontsize=30, arrowprops=dict(facecolor="black", shrink=0.1))
plt.setp(plt.gca(), xlim=(-0.01, 2), ylim=(-1, 1),
         xlabel="Loss", ylabel="Sample - Prediction: $y_i - w^T x_i - b$")
plt.legend(loc="upper right")
Problems:
- epsilon and kernel have to be selected
Neural Nets¶
- A neural net is an interconnected assembly of neurons.
- Each neuron behaves like a linear model with a nonlinear activation function.
- Learning is complicated and involves optimizing a non-convex objective (incremental minimization of the error function by changing the weights).
Objective function: $$\arg \min_\boldsymbol{w} \frac{1}{2} || \boldsymbol{y} - \hat{\boldsymbol{y}} ||^2_2.$$
Because $\hat{\boldsymbol{y}}$ depends nonlinearly on the weights, we have to use a gradient-based optimizer to adjust the weights of the neural net.
Each neuron computes $$g(\boldsymbol{X}\boldsymbol{w}),$$ i.e. a weighted sum of its inputs followed by a nonlinear activation function $g$, e.g. $\tanh$.
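To make this concrete, here is a minimal numpy sketch of a forward pass through one hidden layer (the weight matrices W_hidden and w_out would come from training; this is purely illustrative, not OpenANN's implementation):
def forward(X, W_hidden, w_out):
    # Hidden layer: weighted sum of the inputs followed by tanh
    X_bias = np.hstack((np.ones((X.shape[0], 1)), X))
    H = np.tanh(X_bias.dot(W_hidden))
    # Linear output layer on top of the hidden activations
    H_bias = np.hstack((np.ones((H.shape[0], 1)), H))
    return H_bias.dot(w_out)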
from openann import Net, Activation, DataSet, LMA
class NeuralNet(BaseEstimator, RegressorMixin):
    def __init__(self, n_nodes, gamma=0.0):
        self.n_nodes = n_nodes
        self.gamma = gamma

    def fit(self, X, y):
        n_features = X.shape[1]
        self.net = Net()
        self.net.set_regularization(0.0, self.gamma, 0.0)
        self.net.input_layer(n_features)
        self.net.fully_connected_layer(self.n_nodes, Activation.LOGISTIC)
        self.net.output_layer(1, Activation.LINEAR)
        Y = y.reshape(len(X), 1)
        dataset = DataSet(X, Y)
        lma = LMA({"maximal_iterations": 300})
        lma.optimize(self.net, dataset)
        return self

    def predict(self, X):
        return self.net.predict(X).ravel()
plot_model(NeuralNet(100).fit(X, y))
plot_model(NeuralNet(100, gamma=1e-4).fit(X, y))
Problems:
- topology, regularization coefficients, optimization algorithm have to be selected
- non-convex optimization
- might require a lot of training time although prediction is fast
Comparison¶
def generate_data_1(n_samples, sigma, random_state=None):
    random_state = check_random_state(random_state)
    x = np.linspace(0, 1, n_samples)
    X = x[:, np.newaxis]
    y_true = np.cos(4 * np.pi * x) + 2 * x
    y = y_true + sigma * random_state.randn(n_samples)
    return x, X, y_true, y

def generate_data_2(n_samples, sigma, random_state=None):
    random_state = check_random_state(random_state)
    x = np.linspace(0, 1, n_samples)
    X = x[:, np.newaxis]
    y_true = (4 * x - 1.0) ** 2
    y = y_true + sigma * random_state.randn(n_samples)
    return x, X, y_true, y

def generate_data_3(n_samples, sigma, random_state=None):
    random_state = check_random_state(random_state)
    x = np.linspace(0, 1, n_samples)
    X = x[:, np.newaxis]
    y_true = 5.0 * x
    y_true[x < 0.5] -= 3.0
    y = y_true + sigma * random_state.randn(n_samples)
    return x, X, y_true, y
plt.figure(figsize=(13, 4))
for i, gen_data in enumerate([generate_data_1, generate_data_2, generate_data_3]):
    plt.subplot(1, 3, 1 + i)
    x, X, y_true, y = gen_data(51, 0.3, 0)
    plt.plot(x, y_true, label="Latent function")
    plt.scatter(x, y, label="Samples")
    plt.legend(loc="upper left")
def plot_model_on_datasets(model):
    plt.figure(figsize=(13, 4))
    for j, gen_data in enumerate([generate_data_1, generate_data_2, generate_data_3]):
        plt.subplot(1, 3, 1 + j)
        x, X, y_true, y = gen_data(51, 0.3, 0)
        plt.plot(x, y_true, label="Latent function")
        plt.scatter(x, y, label="Samples")
        model.fit(X, y)
        plt.plot(x, model.predict(X), label="Prediction")
        plt.legend(loc="upper left")
plot_model_on_datasets(LinearRegression())
plot_model_on_datasets(PolynomialRegression(100))
plot_model_on_datasets(PolynomialRidgeRegression(100, gamma=1e-5))
plot_model_on_datasets(KernelRegression("rbf", gamma=100.0))
plot_model_on_datasets(KernelRidgeRegression("rbf", lmbda=0.1, gamma=100.0))
plot_model_on_datasets(GaussianProcess(nugget=1e-12))
plot_model_on_datasets(SVR("rbf", epsilon=0.5, gamma=10.0, C=10.0))
plot_model_on_datasets(NeuralNet(50))
plot_model_on_datasets(NeuralNet(50, gamma=1e-4))
Model Selection¶
How can I evaluate my result?
Appropriate metrics for regression are, for example (a usage sketch follows the list):
- Mean squared error (MSE): $\frac{1}{N} \sum_n \left( \hat{y}_n - y_n \right)^2$
- Mean absolute error (MAE): $\frac{1}{N} \sum_n \lvert \hat{y}_n - y_n \rvert$
- Coefficient of determination ($R^2$)
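All three are available in sklearn.metrics. A usage sketch (evaluating on the training data only for brevity; in practice you would use held-out data):
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
y_pred = KernelRidgeRegression("rbf", lmbda=1.0, gamma=10.0).fit(X, y).predict(X)
print(mean_squared_error(y, y_pred))
print(mean_absolute_error(y, y_pred))
print(r2_score(y, y_pred))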
What is the best model for me?
That depends on...
- the number of samples
- the number of features
- the complexity of the latent function
- the kind of noise we expect
- whether we are learning online
- ...
How do we select appropriate parameters?
- Grid search + cross-validation (see the sketch below)
- Bayesian optimization + validation set
- Bayesian hyperparameter estimation (e.g. Gaussian process regression)
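A minimal grid search + cross-validation sketch for the SVR from above (the parameter grid is an arbitrary example; in newer scikit-learn versions GridSearchCV lives in sklearn.model_selection):
from sklearn.grid_search import GridSearchCV
param_grid = {"C": [0.1, 1.0, 10.0], "gamma": [1.0, 10.0, 100.0]}
search = GridSearchCV(SVR("rbf", epsilon=0.1), param_grid, cv=5).fit(X, y)
print(search.best_params_)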
Do I have enough samples? Is my model complex enough?
# Source: http://scikit-learn.org/stable/auto_examples/model_selection/plot_validation_curve.html
from sklearn.datasets import load_digits
from sklearn.svm import SVC
from sklearn.learning_curve import validation_curve
digits = load_digits()
X, y = digits.data, digits.target
param_range = np.logspace(-6, -1, 5)
train_scores, test_scores = validation_curve(
    SVC(), X, y, param_name="gamma", param_range=param_range,
    cv=10, scoring="accuracy", n_jobs=4)
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)
plt.title("Validation Curve with SVM")
plt.xlabel("$\gamma$")
plt.ylabel("Score")
plt.ylim(0.0, 1.1)
plt.semilogx(param_range, train_scores_mean, label="Training score", color="r")
plt.fill_between(param_range, train_scores_mean - train_scores_std,
                 train_scores_mean + train_scores_std, alpha=0.2, color="r")
plt.semilogx(param_range, test_scores_mean, label="Cross-validation score",
             color="g")
plt.fill_between(param_range, test_scores_mean - test_scores_std,
                 test_scores_mean + test_scores_std, alpha=0.2, color="g")
plt.legend(loc="best")
# Source: http://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html
from sklearn.learning_curve import learning_curve
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=4, train_sizes=np.linspace(.1, 1.0, 5)):
    """
    Generate a simple plot of the test and training learning curve.

    Parameters
    ----------
    estimator : object type that implements the "fit" and "predict" methods
        An object of that type which is cloned for each validation.
    title : string
        Title for the chart.
    X : array-like, shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples and
        n_features is the number of features.
    y : array-like, shape (n_samples) or (n_samples, n_features), optional
        Target relative to X for classification or regression;
        None for unsupervised learning.
    ylim : tuple, shape (ymin, ymax), optional
        Defines minimum and maximum y-values plotted.
    cv : integer, cross-validation generator, optional
        If an integer is passed, it is the number of folds (defaults to 3).
        Specific cross-validation objects can be passed, see the
        sklearn.cross_validation module for the list of possible objects.
    n_jobs : integer, optional
        Number of jobs to run in parallel (default 1).
    """
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")
    plt.legend(loc="best")
    return plt
from sklearn.datasets import load_digits
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn import cross_validation
digits = load_digits()
X, y = digits.data, digits.target
title = "Learning Curves (Naive Bayes)"
# Cross validation with 100 iterations to get smoother mean test and train
# score curves, each time with 20% data randomly selected as a validation set.
cv = cross_validation.ShuffleSplit(digits.data.shape[0], n_iter=100,
                                   test_size=0.2, random_state=0)
estimator = GaussianNB()
plot_learning_curve(estimator, title, X, y, ylim=(0.7, 1.01), cv=cv, n_jobs=4)
title = "Learning Curves (SVM, RBF kernel, $\gamma=0.001$)"
# SVC is more expensive so we do a lower number of CV iterations:
cv = cross_validation.ShuffleSplit(digits.data.shape[0], n_iter=10,
                                   test_size=0.2, random_state=0)
estimator = SVC(gamma=0.001)
plot_learning_curve(estimator, title, X, y, (0.7, 1.01), cv=cv, n_jobs=4)