Neuroevolution and Direct Policy Search

Continuous State, Continuous Action

  • Approximating \(Q(s, a)\) does not help: we cannot (simply) determine \(\arg\max_a Q(s, a)\) to infer a policy.
  • We can discretize \(A\).
  • Instead, we can approximate \(\pi_{\boldsymbol{w}}(s) = a\) directly (direct policy search).

Parametrizable Policy

  • Neuroevolution: an artificial neural network (ANN) maps states to actions
  • The weight vector \(\boldsymbol{w}\) is the parameter that can be modified
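
The idea can be sketched as follows: a one-hidden-layer tanh network that reads all of its weights from a single flat vector \(\boldsymbol{w}\), so a gradient-free optimizer can treat the policy as a black box over \(\boldsymbol{w}\). Layer sizes and names are illustrative, not from the original.

```python
import numpy as np

def make_policy(n_states, n_hidden, n_actions):
    """Return (policy, n_weights): a one-hidden-layer tanh network that
    maps a state s to an action a, reading its weights from a flat
    vector w that an optimizer can modify."""
    n_w1 = (n_states + 1) * n_hidden   # +1 for the bias input
    n_w2 = (n_hidden + 1) * n_actions
    def policy(w, s):
        W1 = w[:n_w1].reshape(n_hidden, n_states + 1)
        W2 = w[n_w1:].reshape(n_actions, n_hidden + 1)
        h = np.tanh(W1.dot(np.append(s, 1.0)))
        return np.tanh(W2.dot(np.append(h, 1.0)))
    return policy, n_w1 + n_w2

policy, n_weights = make_policy(n_states=4, n_hidden=8, n_actions=1)
action = policy(np.zeros(n_weights), np.ones(4))
```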

Optimizer

  • Gradient-free optimizer (we cannot compute an error gradient because the correct action is unknown)
  • Usually either
    • evolutionary algorithms (e.g. CMA-ES, CoSyNE, ...) or
    • RL algorithms (e.g. RWR, PoWER, \(PI^2\), REPS, ...)
  • Adjusts \(\boldsymbol{w}\) to maximize the return (sum of rewards)
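
The outer loop of direct policy search can be sketched with a toy stand-in for an episode rollout; the "return" below is a placeholder fitness (highest at \(\boldsymbol{w} = \boldsymbol{0}\)), not a real environment, and the simple hill-climbing search stands in for CMA-ES or the RL algorithms listed above.

```python
import numpy as np

def episode_return(w):
    # Stand-in for rolling out pi_w in an environment and summing rewards;
    # this toy fitness is maximal when all weights are zero.
    return -np.sum(w ** 2)

def policy_search(n_weights, popsize=20, n_generations=50, sigma=0.5, seed=0):
    """Minimal gradient-free search: sample weight vectors around the
    current best and keep the one with the highest return."""
    rng = np.random.RandomState(seed)
    best_w = np.ones(n_weights)
    best_f = episode_return(best_w)
    for _ in range(n_generations):
        for w in best_w + sigma * rng.randn(popsize, n_weights):
            f = episode_return(w)
            if f > best_f:
                best_w, best_f = w, f
        sigma *= 0.95  # shrink the search radius over time
    return best_w, best_f

w, f = policy_search(n_weights=3)
```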

Neuroevolution

CMA-ES

  • Covariance Matrix Adaptation Evolution Strategy
  • Derivative-free optimization algorithm, minimizes fitness functions that are
    • non-linear
    • non-separable
    • noisy
    • ill-conditioned
  • Should not be used
    • for a very small number of parameters
    • for a very large number of parameters (requires inversion of a covariance matrix)
    • if exact derivatives are available
In [1]:
%pylab inline
Populating the interactive namespace from numpy and matplotlib

In [2]:
def objective(x):
    return sqrt(x[0]**2+x[1]**2)
In [3]:
def plot_objective():
    x, y = meshgrid(arange(-1.6, 1.6, 0.1), arange(-1.6, 1.6, 0.1))
    z = [[objective([x[i, j], y[i, j]]) for j in range(x.shape[1])] for i in range(x.shape[0])]
    contourf(x, y, z, cmap=cm.Blues)
    setp(gca(), xticks=(), yticks=(), xlim=(-1.5, 1.5), ylim=(-1.5, 1.5))

figure(figsize=(5, 5))
plot_objective()
In [4]:
def equiprobable_ellipse(C, factor=1.0):
    """Source: http://stackoverflow.com/questions/12301071/multidimensional-confidence-intervals"""
    def eigsorted(C):
        vals, vecs = linalg.eigh(C)
        order = vals.argsort()[::-1]
        return vals[order], vecs[:,order]
    vals, vecs = eigsorted(C)
    angle = arctan2(*vecs[:,0][::-1])
    width, height = 2 * factor * np.sqrt(vals)
    return angle, width, height

def plot_ellipse(C, mean=zeros(2)):
    from matplotlib.patches import Ellipse
    angle, width, height = equiprobable_ellipse(C)
    e = Ellipse(xy=mean, width=width, height=height,
                angle=degrees(angle), ec="orange",
                fc="none", lw=3, ls="dashed")
    gca().add_artist(e)
In [5]:
__import__("sys").path.append("05_neuroevolution")
from barecmaes2 import Cmaes
opt = Cmaes(xstart=[1.0, 1.0], sigma=0.1, ftarget=1e-10, popsize=50)
In [6]:
def plot_generations(n_generations):
    figure(figsize=(n_generations*4, 4))
    for it in range(n_generations):
        X = asarray(opt.ask())
        F = [objective(x) for x in X]
        subplot(1, n_generations, it+1)
        plot_objective()
        plot_ellipse(C=opt.sigma*asarray(opt.C), mean=opt.xmean)
        weights = zeros_like(F)
        weights[argsort(F)[:len(opt.weights)]] = opt.weights
        scatter(X[:, 0], X[:, 1], c=weights, cmap=cm.gray)
        opt.tell(X, F)
In [7]:
plot_generations(n_generations=3)
In [8]:
plot_generations(n_generations=3)
In [9]:
plot_generations(n_generations=3)

Double Pole Balancing

In [10]:
from IPython.display import YouTubeVideo
YouTubeVideo("5y3UTEgFH2Q")
Out[10]:
In [11]:
print(YouTubeVideo("5y3UTEgFH2Q").src)
http://www.youtube.com/embed/5y3UTEgFH2Q

Octopus Arm

In [12]:
YouTubeVideo("AxeeHif0euY")
Out[12]:
In [13]:
print(YouTubeVideo("AxeeHif0euY").src)
http://www.youtube.com/embed/AxeeHif0euY

TORCS

In [14]:
YouTubeVideo("3JiNC6vw8zE")
Out[14]:
In [15]:
print(YouTubeVideo("3JiNC6vw8zE").src)
http://www.youtube.com/embed/3JiNC6vw8zE

This problem has also been solved with visual input.

Partially Observable Markov Decision Processes (POMDP)

Markov property:

\[P(s_{t+1}|a_t,s_t,\ldots,a_0,s_0) = P(s_{t+1}|a_t,s_t)\]

But we cannot observe \(s_t\) completely!

Question 1

Can you give examples for partially observable Markov decision processes?

Question 2

How can we deal with partial observability?

Answer 2

  • State estimation
    • Bayes filters (e.g. particle filter, Kalman filter)
    • Exponential smoothing
    • \(\alpha\)-\(\beta\) filter
    • ...
  • Use recurrent connections
  • We need some kind of memory
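
The simplest of the listed estimators is exponential smoothing, a one-line update whose memory is a single running estimate. A minimal sketch on toy data (the signal and smoothing factor are illustrative):

```python
import numpy as np

def exponential_smoothing(observations, alpha=0.2):
    """s_t = alpha * z_t + (1 - alpha) * s_{t-1}: a small alpha trusts the
    previous estimate, a large alpha trusts the new observation."""
    s = observations[0]
    estimates = [s]
    for z in observations[1:]:
        s = alpha * z + (1 - alpha) * s
        estimates.append(s)
    return np.asarray(estimates)

rng = np.random.RandomState(0)
signal = np.sin(np.linspace(0, 4 * np.pi, 100))
noisy = signal + 0.2 * rng.randn(100)
smoothed = exponential_smoothing(noisy, alpha=0.2)
```

Unlike the \(\alpha\)-\(\beta\) filter, this keeps no velocity estimate, so it lags behind trends in the signal.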

Smoothing Noisy Observations with \(\alpha\)-\(\beta\) Filters

In [16]:
class ABF(object):
    """Alpha-beta filter."""
    def __init__(self, g, delta):
        r = (4+g-sqrt(8*g+g**2))/4
        self.a, self.b = 1-r**2, 2*(1-r)**2
        self.delta = delta
        self.s, self.sd = None, None
    def step(self, s):
        if self.s is None: self.s, self.sd = s, zeros_like(s)
        se = self.s + self.delta * self.sd
        sde = self.sd
        r = s - se
        self.s = se + self.a * r
        self.sd = sde + self.b/self.delta * r
        return self.s, self.sd
In [17]:
random.seed(0)

steps = 100
X = linspace(0, numpy.pi*4, steps)
S = sin(X)
Sn = S + random.randn(*S.shape)*0.2

figure(figsize=(12, 5))
plot(Sn, label="Observation")
plot(S, label="Signal")
r = legend()
In [18]:
abf = ABF(g=0.05, delta=1.0)

SE = []
SDE = []
for s in range(steps):
    se, sd = abf.step(Sn[s])
    SE.append(se)
    SDE.append(sd)

figure(figsize=(12, 5))
plot(Sn, label="Observation")
plot(S, label="Signal")
plot(SE, label="Estimation")
r = legend()

Policy Search in Robotics

  • Often we use a policy of the form \(\pi(s_t) = s_{t+1}\), which maps a state to its successor.
  • \(\pi\) generates a trajectory.
  • We can use low-level controllers to generate an action that transfers from \(s_t\) to \(s_{t+1}\).
  • We can even work in task space (e.g. Cartesian end-effector coordinates):
    • We need inverse kinematics (IK) to determine joint angles
    • We could use inverse dynamics to determine torques
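
A minimal sketch of this scheme, under toy assumptions: the policy outputs the next desired state, a proportional low-level controller turns the desired change into an action, and the system dynamics are simply \(s' = s + a\). Policy, gain, and dynamics are all illustrative placeholders.

```python
import numpy as np

def pi(s):
    """Toy successor-state policy: move a fixed fraction of the way
    toward a goal state (here the origin)."""
    return 0.9 * s

def low_level_controller(s, s_next, gain=1.0):
    """P-controller: the action is proportional to the desired change."""
    return gain * (s_next - s)

# Roll out the policy; the trajectory converges toward the goal.
s = np.array([1.0, -2.0])
trajectory = [s]
for _ in range(20):
    a = low_level_controller(s, pi(s))
    s = s + a  # illustrative dynamics: the action is the state change
    trajectory.append(s)
```

In a real robot the controller would act in joint or torque space, with IK translating task-space targets into joint angles.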

Policy Representations in Robotics

General function approximators

  • Gaussian mixture models
  • Gaussian process regression
  • Radial basis functions
  • Neural networks

Nonlinear dynamical systems

  • Stable estimators of dynamical systems (SEDS)
  • Dynamical movement primitives (DMP)
  • Parametric dynamical movement primitives (PDMP)

Via points and splines

  • Probabilistic movement primitives (ProMP)
  • Bézier splines (B-splines)
  • ...
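
For the spline-based representations, a cubic Bézier curve is easy to sketch: the control points play the role of the policy parameters an optimizer would adjust. The control points below are made up for illustration.

```python
import numpy as np

def cubic_bezier(P, t):
    """Evaluate a cubic Bezier curve with control points P (4 x d) at
    phases t in [0, 1] via the Bernstein basis."""
    t = np.asarray(t).reshape(-1, 1)
    B = np.hstack([(1 - t) ** 3, 3 * t * (1 - t) ** 2,
                   3 * t ** 2 * (1 - t), t ** 3])
    return B.dot(P)

# Control points: start, two via points (optimizable parameters), goal.
P = np.array([[0.0, 0.0], [0.3, 1.0], [0.7, -1.0], [1.0, 0.0]])
trajectory = cubic_bezier(P, np.linspace(0, 1, 50))
```

The curve starts at the first control point and ends at the last, so start and goal states can be fixed while the search only adjusts the via points.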

Applications in Robotics

  • Ball throwing
  • Darts