Meta Learning

In [1]:
%pylab inline
from IPython.display import display, HTML, Image
import sys
sys.path.append("02_meta_learning")
Populating the interactive namespace from numpy and matplotlib

Data Flow

Application of the Week: Netflix Prize

  • $1M prize
  • 100,480,507 movie ratings from 480,189 users for 17,770 movies
  • for example:

|       | Harry Potter | Avatar | LOTR | Gladiator | Titanic | Glitter |
| :---- | ------------:| ------:| ----:| ---------:| -------:| -------:|
| Alice | ?            | 5      | 3    | ?         | 5       | ?       |
| Bob   | 4            | 5      | 5    | 4         | ?       | ?       |
| Carol | 3            | ?      | ?    | 2         | 5       | 3       |
| David | 3            | ?      | 4    | 5         | 1       | 1       |
| Eric  | 4            | ?      | 2    | ?         | ?       | 3       |
| Fred  | 1            | 1      | 5    | ?         | ?       | 1       |

  • How can we infer the "?"s? -- Collaborative filtering (see the sketch below)
  • How can we win the competition? -- Ensemble methods

    "The lesson here is that having lots of models is useful for the incremental results needed to win competitions, but practically, excellent systems can be built with just a few well-selected models."

Summary, Wikipedia
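As a rough illustration of the collaborative filtering idea, the following plain-numpy sketch factorizes the small rating matrix above into user and movie feature vectors with stochastic gradient descent and fills in the "?"s with their dot products. The feature dimension, learning rate, and number of epochs are arbitrary toy choices; this is not the method of any particular competition entry.

import numpy as np

# Rating matrix from the table above; nan marks the unknown "?" entries.
R = np.array([
    [np.nan, 5, 3, np.nan, 5, np.nan],   # Alice
    [4, 5, 5, 4, np.nan, np.nan],        # Bob
    [3, np.nan, np.nan, 2, 5, 3],        # Carol
    [3, np.nan, 4, 5, 1, 1],             # David
    [4, np.nan, 2, np.nan, np.nan, 3],   # Eric
    [1, 1, 5, np.nan, np.nan, 1],        # Fred
])

n_users, n_movies = R.shape
n_features = 2            # latent feature dimension (arbitrary)
learning_rate = 0.01
n_epochs = 5000

rng = np.random.RandomState(0)
U = rng.randn(n_users, n_features) * 0.1   # user features
M = rng.randn(n_movies, n_features) * 0.1  # movie features

known = ~np.isnan(R)
for _ in range(n_epochs):
    for i, j in zip(*np.nonzero(known)):
        # Gradient step on the squared error of this single known rating
        err = R[i, j] - U[i].dot(M[j])
        u_i = U[i].copy()
        U[i] += learning_rate * err * M[j]
        M[j] += learning_rate * err * u_i

predictions = U.dot(M.T)   # predicted ratings, including the former "?"s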

Question 1

  • What is the difference between bias and variance?
  • How do ensemble methods help?

Answer 1

Imagine we repeatedly draw different training sets and (hypothetically) observe the true error of the resulting models.
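For squared-error loss this thought experiment leads to the standard decomposition of the expected prediction error at a point \(x\) into three parts:

\(\mathbb{E}\left[(y - \hat{f}(x))^2\right] = \left(\mathbb{E}[\hat{f}(x)] - f(x)\right)^2 + \mathbb{E}\left[\left(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\right)^2\right] + \sigma^2\)

i.e. squared bias + variance + irreducible noise, where \(y = f(x) + \varepsilon\), the noise \(\varepsilon\) has variance \(\sigma^2\), and the expectations are taken over training sets.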

Bias vs. Variance

In [2]:
with xkcd():
    figure(figsize=(6, 6))
    def plot_target():
        # Draw a dartboard: three concentric circles around the bull's eye
        t = linspace(0, 2*pi, 100); plot(cos(t), sin(t)); plot(0.67*cos(t), 0.67*sin(t))
        plot(0.33*cos(t), 0.33*sin(t)); scatter(0, 0, color="black")
    # Each subplot scatters 10 "shots": their spread illustrates variance,
    # their offset from the bull's eye illustrates bias.
    setp(subplot(2, 2, 1), xticks=(), yticks=()); plot_target()
    title("Low Variance"); ylabel("Low Bias")
    scatter((random.rand(10)-0.5)*0.2, (random.rand(10)-0.5)*0.2)
    setp(subplot(2, 2, 2), xticks=(), yticks=()); plot_target()
    title("High Variance")
    scatter((random.rand(10)-0.5)*0.8, (random.rand(10)-0.5)*0.8)
    setp(subplot(2, 2, 3), xticks=(), yticks=()); plot_target()
    ylabel("High Bias")
    scatter((random.rand(10)-0.5)*0.2+0.5, (random.rand(10)-0.5)*0.2+0.5)
    setp(subplot(2, 2, 4), xticks=(), yticks=()); plot_target()
    scatter((random.rand(10)-0.5)*0.8+0.5, (random.rand(10)-0.5)*0.8+0.5)
In [3]:
with xkcd():
    # Schematic curves: bias falls and variance rises with model complexity,
    # while the irreducible noise stays constant.
    setp(gca(), xticks=(), yticks=(), xlabel="Complexity", ylabel="Error")
    resolution = 100
    complexity = linspace(0, 4, resolution)
    noise_error = ones(resolution) * 0.1
    bias_error = exp(-complexity)
    variance_error = exp(complexity - np.max(complexity))
    plot(complexity, bias_error, label="Bias")
    plot(complexity, variance_error, label="Variance")
    plot(complexity, noise_error, label="Noise")
    plot(complexity, bias_error + variance_error + noise_error, label="Total")
    legend(loc="best")

Examples

Example for Bias

  • We use a linear model to approximate a nonlinear function.
  • We use a decision stump to approximate a function that requires a hierarchy of decisions.
  • We use a polynomial of degree 3 to approximate a polynomial of degree 10.
  • In general: we use a simple model to approximate a more complex function (see the sketch below).
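As a minimal illustration of the last point (a toy experiment assuming only numpy, with the same noisy sine as the regression data set later in this notebook): a degree-1 polynomial misses the curve in roughly the same way for every training set, so its error is dominated by bias.

import numpy as np

np.random.seed(0)
x = np.linspace(0, 2*np.pi, 50)
y_true = np.sin(x)

# Fit an underpowered degree-1 polynomial to many noisy training sets.
fits = []
for _ in range(200):
    y_noisy = y_true + np.random.randn(len(x)) * 0.3
    coeffs = np.polyfit(x, y_noisy, 1)          # linear model
    fits.append(np.polyval(coeffs, x))
fits = np.array(fits)

bias_sq = np.mean((fits.mean(axis=0) - y_true) ** 2)  # squared bias, averaged over x
variance = np.mean(fits.var(axis=0))                  # variance, averaged over x
print("bias^2 = %g, variance = %g" % (bias_sq, variance))   # large bias, small variance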

Example for Variance

  • We use gradient descent to optimize a non-convex error function of a model (with many local minima).
  • We use an SVM with RBF kernel and \(C \rightarrow \infty\) and a small dataset.
  • In general: we use a complex model that is likely to overfit and learn the noise of the training set (see the sketch below).
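The same toy experiment with an overly flexible degree-10 polynomial behaves in the opposite way: each noisy training set yields a noticeably different fit. This sketch reuses x and y_true from the previous one.

# Same experiment with an overly flexible degree-10 polynomial.
fits = []
for _ in range(200):
    y_noisy = y_true + np.random.randn(len(x)) * 0.3
    coeffs = np.polyfit(x, y_noisy, 10)         # flexible model, also fits the noise
    fits.append(np.polyval(coeffs, x))
fits = np.array(fits)

bias_sq = np.mean((fits.mean(axis=0) - y_true) ** 2)
variance = np.mean(fits.var(axis=0))
print("bias^2 = %g, variance = %g" % (bias_sq, variance))   # bias near zero, variance clearly larger than before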

Meta Learning for Regression

Base Learner: Multilayer Neural Network

  • is a universal function approximator
  • can be used for classification and regression

To run this example you have to install the OpenANN library.

In [4]:
from openann import *

class NeuralNetwork(object):
    """Wrapper around OpenANN library."""
    def __init__(self, n_nodes):
        self.n_nodes = n_nodes
    def fit(self, X, y):
        # OpenANN expects a 2-D target matrix
        Y = y[:, newaxis]
        # One tanh hidden layer and a linear output unit for regression
        self.net = Net().input_layer(X.shape[1]) \
                        .fully_connected_layer(self.n_nodes, Activation.TANH) \
                        .output_layer(1, Activation.LINEAR)
        dataset = DataSet(X, Y)
        # Levenberg-Marquardt optimization
        optimizer = LMA({"maximal_iterations" : 50})
        optimizer.optimize(self.net, dataset)
    def predict(self, X):
        return self.net.predict(X)[:, 0]

Meta Learner: Bagging

  • each base learner is trained on a different subset of the training set, generated by sampling with replacement
  • for classification, predictions are combined by majority voting
  • Bagging requires unstable base learners (high variance error) and combines them into a single stable model (low variance error)
  • usually combines models of the same type

Since this is a regression example, we will instead average the predictions over all base learners!

In [5]:
class Bagging(object):
    """Bootstrap aggregating: train each model on a bootstrap sample, average predictions."""
    def __init__(self, models, bag_size):
        assert bag_size > 0.0 and bag_size < 1.0
        self.models = models
        self.bag_size = bag_size
    def fit(self, X, y):
        N = X.shape[0]
        for model in self.models:
            # Sampling with replacement: a bootstrap sample of size bag_size * N
            bag_indices = random.randint(0, N, int(N*self.bag_size))
            model.fit(X[bag_indices], y[bag_indices])
    def predict(self, X):
        # Average the predictions of all base learners (regression)
        return mean([m.predict(X) for m in self.models], axis=0)

Data Set

Sine function with normally distributed noise.

In [6]:
random.seed(0)
N = 100
X = linspace(0, 2*pi, N)[:, newaxis]
y = array(sin(X[:, 0]) + random.randn(N) * 0.3)

plot(X, y, "o")
r = xlim(0, 2*pi)
In [7]:
def eval_bagging(X, y, n_models=50, bag_size=0.2, n_nodes=10):
    models = [NeuralNetwork(n_nodes) for _ in xrange(n_models)]
    bagging = Bagging(models, bag_size)
    bagging.fit(X, y)
    h = bagging.predict(X)
    p = [m.predict(X) for m in models]
    p_err = [abs(pn-y) for pn in p]
    h_err = abs(h-y)
    return p, h, p_err, h_err
In [8]:
random.seed(0)
RandomNumberGenerator().seed(0)

p, h, p_err, h_err = eval_bagging(X, y)

figure(figsize=(10, 5))
# Plot dataset and model(s)
setp(subplot(1, 2, 1), xlabel="x", ylabel="y", xlim=(0, 2*pi), ylim=(-3, 3))
plot(X, y, "o")
for pn in p: plot(X, pn, "r-")
plot(X, h, "-", linewidth=5)
# Plot errors
setp(subplot(1, 2, 2), xlabel="x", ylabel="Error", xlim=(0, 2*pi))
plot(X, h_err, "g", label="Bagging Error")
plot(X, mean(p_err, axis=0), "b", label="Average Error of Base Learners")
l = legend(loc="best")

Question 2

  • What are the differences between Bagging and AdaBoost?

AdaBoost

  • assigns a weight to each model
  • requires learners of the same type
  • each base learner is an expert for a part of the training set

(There is another interesting type of boosting: Human Boosting)

AdaBoost Algorithm

  • training loop
    • train a weak classifier on the weighted dataset
    • assign it a weight according to its error on the training set
    • reweight the dataset to encourage the next classifier to become an expert for the part of the training set that has been classified incorrectly
  • prediction
    • weighted average of the predictions of the weak classifiers (see the sketch below)
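To make these steps concrete, here is a minimal plain-numpy sketch of the classic binary AdaBoost update rules, assuming labels in {-1, +1} and decision stumps as weak learners. It is only an illustration of the algorithm above, not the OpenANN implementation used in the next example.

import numpy as np

def train_stump(X, y, w):
    """Find the threshold split on one feature that minimizes the weighted error."""
    best = None
    for feature in range(X.shape[1]):
        for threshold in np.unique(X[:, feature]):
            for polarity in (1, -1):
                pred = polarity * np.sign(X[:, feature] - threshold)
                pred[pred == 0] = polarity
                err = np.sum(w[pred != y])
                if best is None or err < best[0]:
                    best = (err, feature, threshold, polarity)
    return best

def stump_predict(stump, X):
    _, feature, threshold, polarity = stump
    pred = polarity * np.sign(X[:, feature] - threshold)
    pred[pred == 0] = polarity
    return pred

def adaboost_train(X, y, n_rounds=10):
    N = X.shape[0]
    w = np.ones(N) / N                          # uniform initial sample weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = train_stump(X, y, w)            # train weak classifier on weighted data
        err = max(stump[0], 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)   # model weight from its training error
        pred = stump_predict(stump, X)
        # Reweight: increase the weight of misclassified samples
        w *= np.exp(-alpha * y * pred)
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    # Prediction: sign of the weighted sum of the weak classifiers' votes
    scores = sum(a * stump_predict(s, X) for s, a in zip(stumps, alphas))
    return np.sign(scores)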

Example: AdaBoost with Neural Nets

In [9]:
n_samples = 500

random.seed(0)

X = random.randn(n_samples, 2)
y = array([linalg.norm(x) > 1.0 for x in X], dtype=float64)
T = y[:, newaxis]

figure(figsize=(5, 5))
r = scatter(X[:, 0], X[:, 1], c=y)

To run this example OpenANN is required.

In [10]:
n_models = 5

from openann import *
from util import plot_classifier

# Train ensemble
RandomNumberGenerator().seed(0)
adaboost = AdaBoost()
nets = [Net().input_layer(2)
             .fully_connected_layer(2, Activation.LOGISTIC)
             .output_layer(1, Activation.LOGISTIC)
        for _ in xrange(n_models)]
for net in nets: adaboost.add_learner(net)
opt = LMA(stop={"maximal_iterations" : 10})
adaboost.set_optimizer(opt)
dataset = DataSet(X, T)
adaboost.train(dataset)
weights = adaboost.get_weights()
In [11]:
figure(figsize=(9, 6))
n_rows, n_cols = (2, 3)
for m in xrange(n_models):
    subplot(n_rows, n_cols, 1+m)
    plot_classifier(X, y, nets[m], "Net #%d, weight: %.2f" % (m+1, weights[m]), threshold=0.5)
subplot(n_rows, n_cols, n_models+1)
plot_classifier(X, y, adaboost, "AdaBoost", threshold=0.5)

Example: AdaBoost with Decision Stumps