Regression

Approximate an unknown function $f$ that maps from $\mathbb{R}^D$ to $\mathbb{R}^F$.

Model:

  • We assume that there is some latent function $f: \mathbb{R}^D \rightarrow \mathbb{R}^F$.
  • We observe samples $(\boldsymbol{x}_n, \boldsymbol{y}_n)$ with $f(\boldsymbol{x}_n) + \boldsymbol{\epsilon}_n = \boldsymbol{y}_n$, where the noise is i.i.d., for example $\boldsymbol{\epsilon}_n \sim \mathcal{N}(0, \sigma^2 \boldsymbol{I}_F)$.

Overview

Algorithms

  • Linear Regression
  • Polynomial (Ridge) Regression
  • Kernel (Ridge) Regression
  • Gaussian Process Regression
  • Support Vector Regression
  • Neural Nets

Tips

  • Bias vs. Variance
  • Model Selection

Dataset

All samples $\{(\boldsymbol{x}_1, \boldsymbol{y}_1), \ldots, (\boldsymbol{x}_N, \boldsymbol{y}_N)\}$ together are called the dataset.

Here is an example.

In [1]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")
In [2]:
from sklearn.utils import check_random_state


def generate_data(n_samples, sigma, random_state=None):
    random_state = check_random_state(random_state)
    x = np.linspace(0, 1, n_samples)
    X = x[:, np.newaxis]
    y_true = np.cos(2 * np.pi * x)
    y = y_true + sigma * random_state.randn(n_samples)
    return x, X, y_true, y
In [3]:
x, X, y_true, y = generate_data(101, 0.3, 0)
plt.plot(x, y_true, label="Latent function")
plt.scatter(x, y, label="Samples")
plt.xlim((0, 1))
plt.legend()
Out[3]:
<matplotlib.legend.Legend at 0x7f489b586b90>

We will only consider the case $F = 1$.

In this case, we can organize the whole dataset in a design matrix $\boldsymbol{X} \in \mathbb{R}^{N \times D}$ which contains the input $\boldsymbol{x}_n$ of one sample in each row and a vector $\boldsymbol{y} \in \mathbb{R}^N$ that contains the output $y_n$ of one sample in each entry.

Design matrix

$$\boldsymbol{X} = \begin{pmatrix} \boldsymbol{x}_1^T\\ \boldsymbol{x}_2^T\\ \vdots\\ \boldsymbol{x}_N^T \end{pmatrix} = \overbrace{ \begin{pmatrix} x_{1,1}&x_{1,2}&\ldots&x_{1,D}\\ x_{2,1}&x_{2,2}&\ldots&x_{2,D}\\ \vdots&\vdots&\ddots&\vdots\\ x_{N,1}&x_{N,2}&\ldots&x_{N,D} \end{pmatrix} }^\textrm{Features}$$
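
For the toy data above, a quick shape check (reusing generate_data; here $N = 101$ samples and $D = 1$ feature):

x, X, y_true, y = generate_data(101, 0.3, 0)
print(X.shape)  # (101, 1): one row per sample
print(y.shape)  # (101,): one output per sample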

Linear Regression

Model: We assume that there is some latent function $f: \mathbb{R}^D \rightarrow \mathbb{R}$. We observe samples $(\boldsymbol{x}_n, y_n)$ with $$f(\boldsymbol{x}_n) + \epsilon_n = y_n, \quad \epsilon_n \sim \mathcal{N}(0, \sigma^2).$$

We assume that $f$ is a linear function of the form $$f(\boldsymbol{x}_n) = \boldsymbol{w}^T \boldsymbol{x}_n$$ ($\boldsymbol{x}_n$ will be extended by a 1 so that $\boldsymbol{w}$ incorporates a bias).

For the whole dataset: $$\boldsymbol{y} = \boldsymbol{X}\boldsymbol{w} + \boldsymbol{\epsilon}$$

We will learn by minimizing the sum of squared errors (SSE) $$\arg \min_\boldsymbol{w} \frac{1}{2} || \boldsymbol{y} - \boldsymbol{X} \boldsymbol{w} ||^2_2.$$

We can find a solution for the vector $\boldsymbol{w}$ by setting the derivative of the objective function to 0 and solving for $\boldsymbol{w}$.

\begin{eqnarray*} && 0 = \boldsymbol{X}^T \left( \boldsymbol{y} - \boldsymbol{X} \boldsymbol{w} \right) = \boldsymbol{X}^T \boldsymbol{y} - \boldsymbol{X}^T \boldsymbol{X} \boldsymbol{w}\\ &\Leftrightarrow& \boldsymbol{X}^T \boldsymbol{X} \boldsymbol{w} = \boldsymbol{X}^T \boldsymbol{y}\\ &\Leftrightarrow& \boldsymbol{w} = \left( \boldsymbol{X}^T \boldsymbol{X} \right)^{-1} \boldsymbol{X}^T \boldsymbol{y} \end{eqnarray*}

(Strictly, we should also check that the Hessian $\boldsymbol{X}^T \boldsymbol{X}$ is positive semidefinite, so that this stationary point is indeed a minimum.)
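
As a side note, solving the normal equations with np.linalg.solve is usually numerically preferable to forming the inverse explicitly; a minimal sketch, reusing X and y from above:

X_bias = np.hstack((np.ones((X.shape[0], 1)), X))  # prepend bias column
# Solve X^T X w = X^T y without explicitly inverting X^T X
w = np.linalg.solve(X_bias.T.dot(X_bias), X_bias.T.dot(y))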

In [4]:
from sklearn.base import BaseEstimator, RegressorMixin


class LinearRegression(BaseEstimator, RegressorMixin):
    def fit(self, X, y):
        n_samples = X.shape[0]
        # Prepend a column of ones so that w[0] acts as the bias
        X_bias = np.hstack((np.ones((n_samples, 1)), X))
        # Normal equations: w = (X^T X)^{-1} X^T y
        self.w = np.linalg.inv(X_bias.T.dot(X_bias)).dot(X_bias.T).dot(y)
        return self

    def predict(self, X):
        n_samples = X.shape[0]
        X_bias = np.hstack((np.ones((n_samples, 1)), X))
        return X_bias.dot(self.w)
In [5]:
def plot_model(model):
    x, X, y_true, y = generate_data(101, 0.3, 0)
    plt.plot(x, y_true, label="Latent function")
    plt.scatter(x, y, label="Samples")
    plt.plot(x, model.predict(X), label="Prediction")
    plt.xlim((0, 1))
    plt.legend()
In [6]:
x, X, y_true, y = generate_data(101, 0.3, 0)
plot_model(LinearRegression().fit(X, y))

Problems:

  • the model is linear and cannot fit nonlinear latent functions
  • the complexity of the matrix inversion grows with the number of features: $O(D^3)$
  • alternative: gradient descent, an iterative algorithm that avoids the inversion but requires multiple passes over the whole dataset (see the sketch below)
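
A minimal gradient descent sketch for the SSE objective (the step size lr and the number of iterations are arbitrary, untuned choices):

def linear_regression_gd(X, y, lr=0.1, n_iter=1000):
    n_samples = X.shape[0]
    X_bias = np.hstack((np.ones((n_samples, 1)), X))
    w = np.zeros(X_bias.shape[1])
    for _ in range(n_iter):
        # Gradient of 0.5 * ||y - Xw||^2 with respect to w
        grad = -X_bias.T.dot(y - X_bias.dot(w))
        w -= lr * grad / n_samples
    return w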

Nonlinear Regression

We can extend the linear model to approximate nonlinear functions by

  • using fixed nonlinear features (polynomial regression, radial basis function networks, ...)
  • learning nonlinear features (neural nets)
  • using nonlinear similarities (kernel methods)
  • mixing multiple linear models (local regression)

Ingredients for Machine Learning

Machine learning algorithms often consist of

  • a parameterized function (e.g. $\boldsymbol{X} \boldsymbol{w}$, $\boldsymbol{w}$ are the parameters)
  • an objective function (e.g. $\frac{1}{2} || \boldsymbol{y} - \hat{\boldsymbol{y}} ||^2_2$)
  • an optimization procedure

Polynomial Regression

We can approximate nonlinear functions with linear regression by generating nonlinear features $\phi(\boldsymbol{x})$.

For example: $$\left( x_1, x_2 \right) \rightarrow \left( x_1^3, x_2^3, x_1^2x_2, x_1 x_2^2, x_1^2, x_2^2, x_1 x_2, x_1, x_2, 1 \right)$$
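
scikit-learn's PolynomialFeatures generates exactly this kind of expansion (in a different term order); a quick check:

from sklearn.preprocessing import PolynomialFeatures
print(PolynomialFeatures(degree=3).fit_transform([[2.0, 3.0]]))
# columns: 1, x1, x2, x1^2, x1*x2, x2^2, x1^3, x1^2*x2, x1*x2^2, x2^3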

In [7]:
from sklearn.preprocessing import PolynomialFeatures


class PolynomialRegression(BaseEstimator, RegressorMixin):
    def __init__(self, degree):
        self.degree = degree

    def fit(self, X, y):
        self.poly = PolynomialFeatures(degree=self.degree).fit(X)
        X_poly = self.poly.transform(X)
        # pinv instead of inv: X_poly^T X_poly is nearly singular for high degrees
        self.w = np.linalg.pinv(X_poly.T.dot(X_poly)).dot(X_poly.T).dot(y)
        return self

    def predict(self, X):
        X_poly = self.poly.transform(X)
        return X_poly.dot(self.w)
In [8]:
plot_model(PolynomialRegression(100).fit(X, y))

Problems:

  • overfitting
  • the features have to be selected manually
  • makes the cubic complexity worse, since the feature expansion increases $D$

Bias vs. Variance

In [9]:
def plot_bias_variance_error():
    plt.setp(plt.gca(), xticks=(), yticks=(), xlabel="Model Complexity", ylabel="Error")
    resolution = 100
    complexity = np.linspace(0, 4, resolution)
    noise_error = np.ones(resolution) * 0.1
    bias_error = np.exp(-complexity)
    variance_error = np.exp(complexity - np.max(complexity))
    plt.plot(complexity, bias_error, label="Bias")
    plt.plot(complexity, variance_error, label="Variance")
    plt.plot(complexity, noise_error, label="Noise")
    plt.plot(complexity, bias_error + variance_error + noise_error, label="Total")
    plt.legend(loc="best")
In [10]:
with plt.xkcd():
    plot_bias_variance_error()
In [11]:
def plot_bias_variance():
    plt.figure(figsize=(12, 12))
    def plot_target():
        t = np.linspace(0, 2 * np.pi, 100)
        plt.plot(np.cos(t), np.sin(t))
        plt.plot(0.67 * np.cos(t), 0.67 * np.sin(t))
        plt.plot(0.33 * np.cos(t), 0.33 * np.sin(t))
        plt.scatter(0, 0, color="black", s=100)
    plt.setp(plt.subplot(2, 2, 1, aspect="equal"), xticks=(), yticks=())
    plot_target()
    plt.title("Low Variance")
    plt.ylabel("Low Bias")
    plt.scatter((np.random.rand(10)-0.5)*0.2, (np.random.rand(10)-0.5)*0.2, s=50)
    plt.setp(plt.subplot(2, 2, 2, aspect="equal"), xticks=(), yticks=())
    plot_target()
    plt.title("High Variance")
    plt.scatter((np.random.rand(10)-0.5)*0.8, (np.random.rand(10)-0.5)*0.8, s=50)
    plt.setp(plt.subplot(2, 2, 3, aspect="equal"), xticks=(), yticks=())
    plot_target()
    plt.ylabel("High Bias")
    plt.scatter((np.random.rand(10)-0.5)*0.2+0.5, (np.random.rand(10)-0.5)*0.2+0.5, s=50)
    plt.setp(plt.subplot(2, 2, 4, aspect="equal"), xticks=(), yticks=())
    plot_target()
    plt.scatter((np.random.rand(10)-0.5)*0.8+0.5, (np.random.rand(10)-0.5)*0.8+0.5, s=50)
In [12]:
with plt.xkcd():
    plot_bias_variance()

Ridge Regression

To reduce overfitting, we minimize

$$\arg \min_\boldsymbol{w} \frac{1}{2} || \boldsymbol{y} - \boldsymbol{X} \boldsymbol{w} ||^2_2 + \frac{\gamma}{2} \boldsymbol{w}^T \boldsymbol{w},$$

which corresponds to setting a zero-mean Gaussian prior on $\boldsymbol{w}$.

\begin{eqnarray*} && 0 = -\boldsymbol{X}^T \left( \boldsymbol{y} - \boldsymbol{X} \boldsymbol{w} \right) + \gamma \boldsymbol{w}\\ &\Leftrightarrow& \boldsymbol{X}^T \boldsymbol{X} \boldsymbol{w} + \gamma \boldsymbol{w} = \boldsymbol{X}^T \boldsymbol{y}\\ &\Leftrightarrow& \boldsymbol{w} = \left( \boldsymbol{X}^T \boldsymbol{X} + \gamma \boldsymbol{I} \right)^{-1} \boldsymbol{X}^T \boldsymbol{y} \end{eqnarray*}
In [13]:
class PolynomialRidgeRegression(BaseEstimator, RegressorMixin):
    def __init__(self, degree, gamma):
        self.degree = degree
        self.gamma = gamma

    def fit(self, X, y):
        self.poly = PolynomialFeatures(degree=self.degree).fit(X)
        X_poly = self.poly.transform(X)
        self.w = (np.linalg.inv(X_poly.T.dot(X_poly) + self.gamma *
                                np.eye(X_poly.shape[1])).dot(X_poly.T).dot(y))
        return self

    def predict(self, X):
        X_poly = self.poly.transform(X)
        return X_poly.dot(self.w)
In [14]:
plot_model(PolynomialRidgeRegression(100, 0.01).fit(X, y))

Problems:

  • regularization coefficient has to be selected
  • does not address the cubic complexity

Kernel Regression

We can express $\boldsymbol{w}$ as a weighted sum of the training data $$\boldsymbol{w} = \boldsymbol{X}^T \boldsymbol{\alpha}$$

This leads to the dual problem $$\arg \min_\boldsymbol{\alpha} \frac{1}{2} || \boldsymbol{y} - \boldsymbol{X} \boldsymbol{X}^T \boldsymbol{\alpha}||^2_2.$$

We call $\boldsymbol{K} = \boldsymbol{X}\boldsymbol{X}^T \in \mathbb{R}^{N \times N}$ the Gram matrix so that we can find the minimum at \begin{eqnarray*} && 0 = \boldsymbol{K} \boldsymbol{y} - \boldsymbol{K}\boldsymbol{K} \boldsymbol{\alpha}\\ &\Leftrightarrow& \boldsymbol{K}\boldsymbol{K} \boldsymbol{\alpha} = \boldsymbol{K} \boldsymbol{y}\\ &\Leftrightarrow& \boldsymbol{\alpha} = \boldsymbol{K}^{-1} \boldsymbol{y} \end{eqnarray*}

Kernel Trick

Instead of the linear kernel $k(\boldsymbol{x}, \boldsymbol{x}') = \boldsymbol{x}^T \boldsymbol{x}'$, we can use any Mercer kernel to build the Gram matrix $\boldsymbol{K}$. The intuition is that choosing a different kernel amounts to choosing a different notion of similarity between samples.

  • Linear kernel: $k(\boldsymbol{x}, \boldsymbol{x}') = \boldsymbol{x}^T \boldsymbol{x}'$
  • Radial basis function (RBF) kernel: $k(\boldsymbol{x}, \boldsymbol{x}') = \exp \left( - \gamma ||\boldsymbol{x} - \boldsymbol{x}'||^2 \right)$
  • Polynomial kernel
  • String kernels
  • Graph kernels
  • ...
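
As a sanity check, the RBF Gram matrix can be computed directly from its definition and compared to pairwise_kernels (gamma = 10.0 is an arbitrary choice):

from sklearn.metrics.pairwise import pairwise_kernels

A = np.array([[0.0], [0.5], [1.0]])
K_manual = np.exp(-10.0 * (A - A.T) ** 2)  # k(x, x') = exp(-gamma ||x - x'||^2)
K_sklearn = pairwise_kernels(A, metric="rbf", gamma=10.0)
print(np.allclose(K_manual, K_sklearn))  # True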
In [15]:
from sklearn.metrics.pairwise import pairwise_kernels


class KernelRegression(BaseEstimator, RegressorMixin):
    def __init__(self, kernel, **kernel_args):
        self.kernel = kernel
        self.kernel_args = kernel_args

    def fit(self, X, y):
        self.X = X
        K = pairwise_kernels(self.X, metric=self.kernel, **self.kernel_args)
        # alpha = K^{-1} y (pinv for numerical robustness)
        self.alpha = np.linalg.pinv(K).dot(y)
        return self

    def predict(self, X):
        K_star = pairwise_kernels(X, self.X, metric=self.kernel, **self.kernel_args)
        return K_star.dot(self.alpha)
In [16]:
plot_model(KernelRegression("linear").fit(X, y))
In [17]:
plot_model(KernelRegression("rbf", gamma=10.0).fit(X, y))

Problems:

  • overfitting
  • kernel has to be selected (often less critical than features)
  • complexity with respect to number of samples is $O(N^3)$

Kernel Ridge Regression

We start from ridge regression

$$\arg \min_\boldsymbol{w} \frac{1}{2} || \boldsymbol{y} - \boldsymbol{X} \boldsymbol{w} ||^2_2 + \frac{\gamma}{2} \boldsymbol{w}^T \boldsymbol{w}$$

and again substitute $\boldsymbol{w} = \boldsymbol{X}^T \boldsymbol{\alpha}$, so that

$$\boldsymbol{w}^T \boldsymbol{w} = \left( \boldsymbol{X}^T \boldsymbol{\alpha} \right)^T \left( \boldsymbol{X}^T \boldsymbol{\alpha} \right) = \boldsymbol{\alpha}^T \boldsymbol{X} \boldsymbol{X}^T \boldsymbol{\alpha} = \boldsymbol{\alpha}^T \boldsymbol{K} \boldsymbol{\alpha},$$

which results in a modification of the dual problem

$$\arg \min_\boldsymbol{\alpha} \frac{1}{2} || \boldsymbol{y} - \boldsymbol{K} \boldsymbol{\alpha} ||^2_2 + \frac{\gamma}{2} \boldsymbol{\alpha}^T \boldsymbol{K} \boldsymbol{\alpha}.$$

The solution is \begin{eqnarray*} 0 &=& -\boldsymbol{K} \left( \boldsymbol{y} - \boldsymbol{K} \boldsymbol{\alpha} \right) + \gamma \boldsymbol{K} \boldsymbol{\alpha}\\ &=& -\boldsymbol{K} \boldsymbol{y} + \boldsymbol{K} \boldsymbol{K} \boldsymbol{\alpha} + \gamma \boldsymbol{K} \boldsymbol{\alpha}\\ &\Leftrightarrow& \boldsymbol{\alpha} = \left(\boldsymbol{K} + \gamma \boldsymbol{I} \right)^{-1} \boldsymbol{y} \end{eqnarray*}

In [18]:
class KernelRidgeRegression(BaseEstimator, RegressorMixin):
    def __init__(self, kernel, lmbda, **kernel_args):
        self.kernel = kernel
        self.lmbda = lmbda
        self.kernel_args = kernel_args

    def fit(self, X, y):
        self.X = X
        K = pairwise_kernels(self.X, metric=self.kernel, **self.kernel_args)
        # lmbda is the regularization coefficient (gamma in the math above;
        # the name gamma is reserved for the RBF kernel parameter here)
        self.alpha = np.linalg.pinv(K + self.lmbda * np.eye(X.shape[0])).dot(y)
        return self

    def predict(self, X):
        K_star = pairwise_kernels(X, self.X, metric=self.kernel, **self.kernel_args)
        return K_star.dot(self.alpha)
In [19]:
plot_model(KernelRidgeRegression("rbf", lmbda=1.0, gamma=10.0).fit(X, y))

Gaussian Process Regression

Kernels used for GPR (covariance functions) usually have more hyperparameters than the kernels we have seen so far. However, GPR optimizes its hyperparameters in a principled way, by maximizing the marginal likelihood, which usually works better than the alternatives (but is computationally more expensive).

  • RBF kernel: $$k(\boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}) = \exp \left( - \gamma \lvert\lvert\boldsymbol{x}^{(1)} - \boldsymbol{x}^{(2)}\rvert\rvert^2 \right)$$
  • Squared exponential kernel (with one length scale $l_i$ per dimension): $$k(\boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}) = \exp \left( - \sum_i l_i (x^{(1)}_i - x^{(2)}_i)^2 \right)$$

We maximize the marginal likelihood $p(\boldsymbol{y}|\boldsymbol{X})$.

Equivalently, we maximize the log marginal likelihood $$\log p(\boldsymbol{y}|\boldsymbol{X}) = - \frac{1}{2} \boldsymbol{y}^T \boldsymbol{K}^{-1} \boldsymbol{y} - \frac{1}{2} \log |\boldsymbol{K}| - \frac{N}{2} \log (2 \pi)$$ with respect to the kernel's hyperparameters $\boldsymbol{\theta}$.

The derivative can be computed as $$\frac{\partial}{\partial \theta_j} \log p(\boldsymbol{y}|\boldsymbol{X}) = \frac{1}{2} \mathrm{tr} \left( (\boldsymbol{\alpha} \boldsymbol{\alpha}^T - \boldsymbol{K}^{-1}) \frac{\partial \boldsymbol{K}}{\partial \theta_j} \right), \quad \text{where } \boldsymbol{\alpha} = \boldsymbol{K}^{-1} \boldsymbol{y}.$$
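
A minimal sketch that evaluates this log marginal likelihood for a fixed RBF kernel on the toy data (the observation noise is added to the diagonal here, analogous to the nugget parameter below):

from sklearn.metrics.pairwise import pairwise_kernels

def log_marginal_likelihood(X, y, gamma=10.0, noise=0.1):
    N = X.shape[0]
    K = pairwise_kernels(X, metric="rbf", gamma=gamma) + noise * np.eye(N)
    alpha = np.linalg.solve(K, y)  # alpha = K^{-1} y
    # -0.5 y^T K^{-1} y - 0.5 log|K| - N/2 log(2 pi)
    return (-0.5 * y.dot(alpha)
            - 0.5 * np.linalg.slogdet(K)[1]
            - 0.5 * N * np.log(2 * np.pi))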

In [20]:
from sklearn.gaussian_process import GaussianProcess


gpr = GaussianProcess(nugget=0.1).fit(X, y)
plot_model(gpr)
y_pred, y_mse = gpr.predict(X, eval_MSE=True)
plt.fill_between(x, y_pred - 2 * np.sqrt(y_mse), y_pred + 2 * np.sqrt(y_mse), alpha=0.2)
Out[20]:
<matplotlib.collections.PolyCollection at 0x7f4895b0c910>

Problems:

  • does not address the cubic complexity

Support Vector Regression

Support vector algorithms usually involve a constrained optimization problem. In this case, we want to solve

$\arg \min_{\boldsymbol{w}} \frac{1}{2} ||\boldsymbol{w}||^2$ subject to $\begin{cases} \xi_i \leq \epsilon\\ \xi_i^* \leq \epsilon\\ \end{cases}$, where $\begin{array}{l}\xi_i = y_i - \boldsymbol{w}^T \boldsymbol{x}_i - b\\\xi_i^* = \boldsymbol{w}^T \boldsymbol{x}_i + b - y_i\end{array}$.

In comparison to kernel ridge regression, this will result in only a few support vectors that will be used for predictions instead of the whole training set.

In [21]:
from sklearn.svm import SVR


def plot_svr(svr):
    plt.scatter(x[svr.support_], y[svr.support_], color="r", s=100)
    plot_model(svr)
    plt.fill_between(x, svr.predict(X) - svr.epsilon,
                     svr.predict(X) + svr.epsilon, alpha=0.2)

epsilon = 0.3
svr = SVR("rbf", epsilon=epsilon, gamma=10.0).fit(X, y)
plot_svr(svr)
In [22]:
plt.figure(figsize=(14, 4))
plt.subplot(121)
plot_svr(svr)
svi = svr.support_[6]
svy = svr.predict([X[svi]])[0]
errori = y[svi] - svy
plt.plot([x[svi], x[svi]], [svy, svy + epsilon * np.sign(errori)], c="g")
plt.plot([x[svi], x[svi]], [svy + epsilon * np.sign(errori), y[svi]], c="r")
plt.setp(plt.gca(), xlim=(0.1, 0.3), ylim=(-0.5, 1.5))

square_loss = lambda d: d ** 2
epsilon_insensitive = lambda d, epsilon: np.maximum(0.0, np.abs(d) - epsilon)

plt.subplot(122)
d = np.linspace(-1, 1, 100)
plt.plot(square_loss(d), d, label="Square")
plt.plot(epsilon_insensitive(d, epsilon=0.3), d,
         label="$\epsilon$-insensitive ($\epsilon=0.3$)")
loss = epsilon_insensitive(errori, epsilon=0.3)
plt.scatter(loss, errori, color="r", s=100, zorder=10)
plt.plot([loss, loss], [0, epsilon * np.sign(errori)], c="g")
plt.fill_between([0, loss], [0, 0], [epsilon * np.sign(errori),
                                     epsilon * np.sign(errori)],
                 color="g", alpha=0.2)
plt.plot([loss, loss], [errori, epsilon * np.sign(errori)], c="r")
plt.fill_between([0, loss], [errori, errori], [epsilon * np.sign(errori),
                                               epsilon * np.sign(errori)],
                 color="r", alpha=0.2)
plt.gca().annotate("$\epsilon$", xy=(loss, -0.15), xytext=(1, -0.22),
                   fontsize=30, arrowprops=dict(facecolor="black", shrink=0.1))
plt.gca().annotate("$\\xi^*$", xy=(loss, -0.6), xytext=(1, -0.68),
                   fontsize=30, arrowprops=dict(facecolor="black", shrink=0.1))
plt.setp(plt.gca(), xlim=(-0.01, 2), ylim=(-1, 1),
         xlabel="Loss", ylabel="Sample - Prediction: $y_i - w_i^T x_i - b$")
plt.legend(loc="upper right")
Out[22]:
<matplotlib.legend.Legend at 0x7f4895b222d0>

Problems:

  • epsilon and kernel have to be selected

Neural Nets

  • A neural net is an interconnected assembly of neurons.
  • Each neuron behaves like a linear model with a nonlinear activation function.
  • Learning is hard: it involves optimizing a non-convex objective by incrementally changing the weights to reduce the error function.

Objective function: $$\arg \min_\boldsymbol{w} \frac{1}{2} || \boldsymbol{y} - \hat{\boldsymbol{y}} ||^2_2.$$

Because $\hat{\boldsymbol{y}}$ depends nonlinearly on the weights, there is no closed-form solution; we have to use a gradient-based optimizer to adjust the weights of the neural net.

Each neuron computes $$g(\boldsymbol{X}\boldsymbol{w}).$$

(Weighted sum of inputs + nonlinear activation function, e.g. $\tanh$.)
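
A minimal numpy sketch of the forward pass through one hidden layer (the weights W1 and w2 are random placeholders here, not trained):

rng = np.random.RandomState(0)
W1 = rng.randn(X.shape[1] + 1, 10)  # input (+ bias) -> 10 hidden neurons
w2 = rng.randn(10 + 1)              # hidden (+ bias) -> output neuron

X_bias = np.hstack((np.ones((X.shape[0], 1)), X))
H = np.tanh(X_bias.dot(W1))                       # each hidden neuron: g(Xw)
H_bias = np.hstack((np.ones((H.shape[0], 1)), H))
y_hat = H_bias.dot(w2)                            # linear output neuron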

In [23]:
from openann import Net, Activation, DataSet, LMA


class NeuralNet(BaseEstimator, RegressorMixin):
    def __init__(self, n_nodes, gamma=0.0):
        self.n_nodes = n_nodes
        self.gamma = gamma

    def fit(self, X, y):
        n_features = X.shape[1]
        self.net = Net()
        self.net.set_regularization(0.0, self.gamma, 0.0)
        self.net.input_layer(n_features)
        self.net.fully_connected_layer(self.n_nodes, Activation.LOGISTIC)
        self.net.output_layer(1, Activation.LINEAR)
        Y = y.reshape(len(X), 1)
        dataset = DataSet(X, Y)
        lma = LMA({"maximal_iterations": 300})
        lma.optimize(self.net, dataset)
        return self

    def predict(self, X):
        return self.net.predict(X).ravel()
In [24]:
plot_model(NeuralNet(100).fit(X, y))
In [25]:
plot_model(NeuralNet(100, gamma=1e-4).fit(X, y))

Problems:

  • topology, regularization coefficients, optimization algorithm have to be selected
  • non-convex optimization
  • might require a lot of training time although prediction is fast

Comparison

In [26]:
def generate_data_1(n_samples, sigma, random_state=None):
    random_state = check_random_state(random_state)
    x = np.linspace(0, 1, n_samples)
    X = x[:, np.newaxis]
    y_true = np.cos(4 * np.pi * x) + 2 * x
    y = y_true + sigma * random_state.randn(n_samples)
    return x, X, y_true, y

def generate_data_2(n_samples, sigma, random_state=None):
    random_state = check_random_state(random_state)
    x = np.linspace(0, 1, n_samples)
    X = x[:, np.newaxis]
    y_true = (4 * x - 1.0) ** 2
    y = y_true + sigma * random_state.randn(n_samples)
    return x, X, y_true, y

def generate_data_3(n_samples, sigma, random_state=None):
    random_state = check_random_state(random_state)
    x = np.linspace(0, 1, n_samples)
    X = x[:, np.newaxis]
    y_true = 5.0 * x
    y_true[x < 0.5] -= 3.0
    y = y_true + sigma * random_state.randn(n_samples)
    return x, X, y_true, y
In [27]:
plt.figure(figsize=(13, 4))
for i, gen_data in enumerate([generate_data_1, generate_data_2, generate_data_3]):
    plt.subplot(1, 3, 1 + i)
    x, X, y_true, y = gen_data(51, 0.3, 0)
    plt.plot(x, y_true, label="Latent function")
    plt.scatter(x, y, label="Samples")
    plt.legend(loc="upper left")
In [28]:
def plot_model_on_datasets(model):
    plt.figure(figsize=(13, 4))
    for j, gen_data in enumerate([generate_data_1, generate_data_2, generate_data_3]):
        plt.subplot(1, 3, 1 + j)
        x, X, y_true, y = gen_data(51, 0.3, 0)
        plt.plot(x, y_true, label="Latent function")
        plt.scatter(x, y, label="Samples")
        model.fit(X, y)
        plt.plot(x, model.predict(X), label="Prediction")
        plt.legend(loc="upper left")
In [29]:
plot_model_on_datasets(LinearRegression())
In [30]:
plot_model_on_datasets(PolynomialRegression(100))
In [31]:
plot_model_on_datasets(PolynomialRidgeRegression(100, gamma=1e-5))
In [32]:
plot_model_on_datasets(KernelRegression("rbf", gamma=100.0))
In [33]:
plot_model_on_datasets(KernelRidgeRegression("rbf", lmbda=0.1, gamma=100.0))
In [34]:
plot_model_on_datasets(GaussianProcess(nugget=1e-12))
In [35]:
plot_model_on_datasets(SVR("rbf", epsilon=0.5, gamma=10.0, C=10.0))
In [36]:
plot_model_on_datasets(NeuralNet(50))
In [37]:
plot_model_on_datasets(NeuralNet(50, gamma=1e-4))

Model Selection

How can I evaluate my result?

Appropriate metrics for regression include, for example (see the sketch after this list):

  • Mean squared error (MSE): $\frac{1}{N} \sum_n \left( \hat{y}_n - y_n \right)^2$
  • Mean absolute error (MAE): $\frac{1}{N} \sum_n \lvert \hat{y}_n - y_n \rvert$
  • Coefficient of determination ($R^2$)
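
All three are available in scikit-learn; a minimal sketch, assuming the KernelRidgeRegression defined above as the model:

from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

x, X, y_true, y = generate_data(101, 0.3, 0)
y_pred = KernelRidgeRegression("rbf", lmbda=1.0, gamma=10.0).fit(X, y).predict(X)
print(mean_squared_error(y, y_pred))
print(mean_absolute_error(y, y_pred))
print(r2_score(y, y_pred))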

What is the best model for me?

That depends on...

  • the number of samples
  • the number of features
  • the complexity of the latent function
  • the kind of noise we expect
  • whether we are learning online
  • ...

How do we select appropriate parameters?

  • Grid search + cross-validation (see the sketch below)
  • Bayesian optimization + validation set
  • Bayesian hyperparameter estimation (e.g. Gaussian process regression)
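
A minimal grid search sketch on the toy data (the parameter grid is an arbitrary illustration; GridSearchCV lives in sklearn.model_selection in newer scikit-learn versions):

from sklearn.grid_search import GridSearchCV
from sklearn.svm import SVR

x, X, y_true, y = generate_data(101, 0.3, 0)
param_grid = {"gamma": [0.1, 1.0, 10.0, 100.0], "epsilon": [0.1, 0.3, 0.5]}
search = GridSearchCV(SVR("rbf"), param_grid, cv=5).fit(X, y)
print(search.best_params_)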

Do I have enough samples? Is my model complex enough?

In [38]:
# Source: http://scikit-learn.org/stable/auto_examples/model_selection/plot_validation_curve.html
from sklearn.datasets import load_digits
from sklearn.svm import SVC
from sklearn.learning_curve import validation_curve

digits = load_digits()
X, y = digits.data, digits.target

param_range = np.logspace(-6, -1, 5)
train_scores, test_scores = validation_curve(
    SVC(), X, y, param_name="gamma", param_range=param_range,
    cv=10, scoring="accuracy", n_jobs=4)
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)

plt.title("Validation Curve with SVM")
plt.xlabel("$\gamma$")
plt.ylabel("Score")
plt.ylim(0.0, 1.1)
plt.semilogx(param_range, train_scores_mean, label="Training score", color="r")
plt.fill_between(param_range, train_scores_mean - train_scores_std,
                 train_scores_mean + train_scores_std, alpha=0.2, color="r")
plt.semilogx(param_range, test_scores_mean, label="Cross-validation score",
             color="g")
plt.fill_between(param_range, test_scores_mean - test_scores_std,
                 test_scores_mean + test_scores_std, alpha=0.2, color="g")
plt.legend(loc="best")
Out[38]:
<matplotlib.legend.Legend at 0x7f488744df50>
In [39]:
# Source: http://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html
from sklearn.learning_curve import learning_curve


def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=4, train_sizes=np.linspace(.1, 1.0, 5)):
    """
    Generate a simple plot of the test and training learning curves.

    Parameters
    ----------
    estimator : object type that implements the "fit" and "predict" methods
        An object of that type which is cloned for each validation.

    title : string
        Title for the chart.

    X : array-like, shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples and
        n_features is the number of features.

    y : array-like, shape (n_samples) or (n_samples, n_features), optional
        Target relative to X for classification or regression;
        None for unsupervised learning.

    ylim : tuple, shape (ymin, ymax), optional
        Defines minimum and maximum yvalues plotted.

    cv : integer, cross-validation generator, optional
        If an integer is passed, it is the number of folds (defaults to 3).
        Specific cross-validation objects can be passed, see
        sklearn.cross_validation module for the list of possible objects

    n_jobs : integer, optional
        Number of jobs to run in parallel (default 1).
    """
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.legend(loc="best")
    return plt
In [40]:
from sklearn.datasets import load_digits
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn import cross_validation


digits = load_digits()
X, y = digits.data, digits.target


title = "Learning Curves (Naive Bayes)"
# Cross validation with 100 iterations to get smoother mean test and train
# score curves, each time with 20% data randomly selected as a validation set.
cv = cross_validation.ShuffleSplit(digits.data.shape[0], n_iter=100,
                                   test_size=0.2, random_state=0)

estimator = GaussianNB()
plot_learning_curve(estimator, title, X, y, ylim=(0.7, 1.01), cv=cv, n_jobs=4)

title = "Learning Curves (SVM, RBF kernel, $\gamma=0.001$)"
# SVC is more expensive so we do a lower number of CV iterations:
cv = cross_validation.ShuffleSplit(digits.data.shape[0], n_iter=10,
                                   test_size=0.2, random_state=0)
estimator = SVC(gamma=0.001)
plot_learning_curve(estimator, title, X, y, (0.7, 1.01), cv=cv, n_jobs=4)
Out[40]:
<module 'matplotlib.pyplot' from '/usr/local/lib/python2.7/dist-packages/matplotlib/pyplot.pyc'>

More Topics