Reinforcement Learning

Agent-environment interface

Question 1

What are the differences between supervised learning and reinforcement learning?

Answer 1

  • labels are replaced by (possibly delayed) rewards
  • decisions form a sequence, so future rewards must be taken into account

Classical RL Approach

  • We have a Markov decision process (MDP) with \((S, A, \mathcal{R}_{ss'}^a, \mathcal{P}_{ss'}^a)\)
  • \(S\) is the discrete set of states
  • \(A\) is the discrete set of actions
  • \(\mathcal{R}_{ss'}^a\) is the reward function
  • \(\mathcal{P}_{ss'}^a\) is the state transition distribution
  • Usually we also require initial state(s) and goal state(s).
  • Q-Learning tries to find an estimate of the value function \(Q(s, a)\).

Markov property

\[P(s_{t+1}|a_t,s_t,\ldots,a_0,s_0) = P(s_{t+1}|a_t,s_t)\]

Value Function

How good is it to be in state \(s\) and execute action \(a\)?

\[Q(s, a) = \mathbb{E}\left[ \sum_{k=0}^\infty \gamma^k r_{t+k+1} \vert s, a \right] = \mathbb{E}\left[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) \vert s, a \right]\]

\(Q\) is usually a table!

Return model: discounted infinite horizon

\(R_t = \sum_{i = 0}^\infty \gamma^i r_{t+1+i}\)

The discount factor \(\gamma \in (0, 1)\) should be close to 1!
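As a sanity check of the return model, the discounted sum can be computed directly. This standalone snippet (plain Python, outside the notebook's pylab session) shows how strongly \(\gamma\) discounts a reward that arrives a few steps in the future:

```python
def discounted_return(rewards, gamma):
    """R_t = sum_i gamma^i * r_{t+1+i} for a finite reward sequence."""
    return sum(gamma**i * r for i, r in enumerate(rewards))

rewards = [0, 0, 0, 1]  # the only reward arrives after four steps
print(discounted_return(rewards, gamma=0.95))  # ~0.857 (= 0.95**3)
print(discounted_return(rewards, gamma=0.5))   # 0.125
```

With \(\gamma = 0.95\) most of the delayed reward still propagates back; with \(\gamma = 0.5\) it is almost invisible, which is why \(\gamma\) should be close to 1 for tasks with long delays.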

Policy

  • We can infer the greedy policy: \[\pi_Q(s) = \arg \max_a Q(s, a)\] (optimal according to the current estimate of \(Q\))
  • During learning, we can encourage exploration with softmax action selection \[\pi_{\text{Softmax}}(s, a) = \frac{\exp(Q(s, a) / \tau)}{\sum_{a' \in A} \exp(Q(s, a') / \tau)}\] which will give you a probability for each action \(a\) according to a temperature \(\tau\). (A high temperature will result in more exploration, possible values are e.g. \(10^1, 10^0, 10^{-1}, 10^{-2}\))
  • A more intuitive but often worse exploration policy is called \(\epsilon\)-greedy: select a random action with probability \(\epsilon\) and otherwise take the optimal action (greedy policy)
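To see how the temperature trades off exploration against exploitation, here is a minimal standalone sketch (plain NumPy, with made-up Q-values for two actions):

```python
import numpy as np

def softmax_policy(q_values, tau):
    """Action probabilities from Q-values at temperature tau."""
    p = np.exp(np.asarray(q_values) / tau)
    return p / p.sum()

q = [1.0, 0.5]  # hypothetical Q-values for actions "left", "right"
print(softmax_policy(q, tau=10.0))  # almost uniform: ~[0.51, 0.49]
print(softmax_policy(q, tau=0.01))  # almost greedy:  ~[1.0, 0.0]
```

Note that for very small \(\tau\) or large Q-values the exponentials can overflow; subtracting \(\max_{a'} Q(s, a')\) from all Q-values before exponentiating avoids this without changing the probabilities.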

Q-Learning

Source: Sutton & Barto

Random Walk

Source: Sutton & Barto


In [1]:
%pylab inline
Populating the interactive namespace from numpy and matplotlib

In [2]:
A = ["left", "right"]
In [3]:
S = ["left_goal", "A", "B", "C", "D", "E", "right_goal"]
In [4]:
def state_transition(state, action): # Deterministic, unknown to the agent
    state_index = S.index(state)
    if action == "left": state_index -= 1
    elif action == "right": state_index += 1
    return S[state_index]
In [5]:
def reward(state, action, next_state): # Deterministic, unknown to the agent
    if next_state == "right_goal": return 1
    else: return 0
In [6]:
def initial_state():
    return "C"
In [7]:
def finished(state):
    return state in ["left_goal", "right_goal"]
In [8]:
class GreedyPolicy(object):
    def __init__(self, Q): self.Q = Q
    def __call__(self, state): return A[argmax(self.Q(state))]
In [9]:
class Softmax(object):
    def __init__(self, Q, tau):
        self.Q = Q
        self.tau = tau
    def __call__(self, state):
        p = exp(self.Q(state)/self.tau)
        p /= p.sum()
        # You could also use random.choice() in NumPy >= 1.7
        # for the following lines:
        x = random.rand()
        cumulative = p.cumsum()  # compute the CDF once, not per iteration
        for i in range(p.shape[0]):
            if cumulative[i] >= x: return A[i]
In [10]:
class EpsilonGreedyPolicy(object):
    def __init__(self, Q, epsilon):
        self.Q = Q
        self.epsilon = epsilon
    def __call__(self, state):
        if random.rand() < self.epsilon:
            return A[random.randint(0, len(A))]
        else:
            return A[argmax(self.Q(state))]
In [11]:
class QTable(object):
    def __init__(self, S, A):
        self.Q = zeros((len(S), len(A)))
    def __call__(self, state, action=None):
        if action is None:
            return self.Q[S.index(state)]
        else:
            return self.Q[S.index(state), A.index(action)]
    def correct(self, state, action, target, alpha):
        self.Q[S.index(state), A.index(action)] += alpha * (target - self(state, action))
In [12]:
def q_learning(alpha, gamma, epsilon, n_episodes, Q):
    for episode in range(n_episodes):
        state = initial_state()
        while not finished(state):
            action = EpsilonGreedyPolicy(Q, epsilon)(state)
            next_state = state_transition(state, action)
            r = reward(state, action, next_state)
            Q.correct(state, action, r + gamma * numpy.max(Q(next_state)), alpha)
            state = next_state

    return Q
In [13]:
figure(figsize=(10, 6))
for i in range(6):
    n_episodes = 100*i
    setp(subplot(3, 2, i+1), xticks=(), yticks=(), title="Episode #%d" % n_episodes,
         xlabel="State", ylabel="Action")
    # Note that epsilon = 0.4 is very high!
    imshow(q_learning(alpha=0.1, gamma=0.95, epsilon=0.4, n_episodes=n_episodes,
                      Q=QTable(S, A)).Q.T, cmap=cm.gray, interpolation="nearest")

Discrete Actions, Continuous States

  • Value function \(Q(s, a)\) cannot be represented in a table for either a continuous state space or a large number of states.
  • \(Q(s, a)\) will be represented by a function approximator, e.g.
    • linear with nonlinear features: coarse coding, tile coding, RBF, Kanerva coding, ...
    • nonlinear: ANN, SVR, GPR (Gaussian process regression), ...
  • In step \(t\), we will use the tuple \((s_t, a_t, r_t, s_{t+1})\) to update the value function.
  • The function approximator will predict \(\hat{y}^{(t)} = Q(s_t, a_t)\) and receives the target \(y^{(t)} = r_t + \gamma \max_{a'} Q(s_{t+1}, a')\).
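For a linear function approximator \(Q(s, a) = w_a^\top \phi(s)\), the stochastic gradient step on the squared error \(\tfrac{1}{2}\left(y^{(t)} - \hat{y}^{(t)}\right)^2\) works out to

\[w_{a_t} \leftarrow w_{a_t} + \alpha \left(y^{(t)} - \hat{y}^{(t)}\right) \phi(s_t).\]

The tabular update used earlier is the special case where \(\phi(s)\) is a one-hot indicator vector of the state.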

Example: Mountain Car

In [14]:
from IPython.display import YouTubeVideo
YouTubeVideo("0_xeAD6ZgVU")
Out[14]:
In [15]:
print YouTubeVideo("0_xeAD6ZgVU").src
http://www.youtube.com/embed/0_xeAD6ZgVU

Function Approximators

  • Use linear function approximators whenever possible! They are more stable than nonlinear FAs (RL FAQ).
  • Otherwise use local function approximators, e.g. radial basis functions (RBFs).
  • Global function approximators (e.g. MLPs) have the problem of catastrophic forgetting.

Radial Basis Functions

  • Remember: you can approximate arbitrary functions with linear regression and nonlinear features \(\phi(x)\).
  • With RBFs, changing a weight \(w_i\) that is associated with \(\phi_i(x)\) has only local effects.
  • An example for an RBF is \[\phi_i(x) = \exp \left(-\frac{(x-c_i)^2}{2 h_i} \right).\]
In [16]:
%pylab inline
random.seed(0)
Populating the interactive namespace from numpy and matplotlib

Centers \(c_i\) can be

  • uniform random distributed
  • equally spaced
  • determined by clustering (k-means, GMM-EM, ...)
In [17]:
n_rbfs = 5
centers = linspace(0, 1, n_rbfs)

Usually the widths \(h_i\) are all equal.

In [18]:
widths = diff(centers)
widths = 0.05*hstack((widths, widths[-1]))
In [19]:
def rbf_features(x):
    return exp(-0.5*(x-centers)**2/widths)
In [20]:
n_samples = 100
X = linspace(0, 1, n_samples)
Phi = array([rbf_features(x) for x in X])
w = random.randn(n_rbfs)
y = sum(Phi * w, axis=1)

figure(figsize=(10, 5))
subplot(1, 2, 1)
title("Radial Basis Functions")
for n in xrange(n_rbfs):
    plot(X, Phi[:, n])
subplot(1, 2, 2)
title("Weighted Sum of RBFs")
r = plot(X, y)
In [21]:
t = sin(2*pi*X) + random.randn(X.shape[0]) * 0.1
w = linalg.lstsq(Phi, t)[0]
y = sum(Phi * w, axis=1)
title("w = (" + (", ".join(["%.2f"] * len(w)) % tuple(w)) + ")")
plot(X, t, "o")
r = plot(X, y, lw=5)
In [22]:
t[50:] = 0
w = linalg.lstsq(Phi, t)[0]
y = sum(Phi * w, axis=1)
title("w = (" + (", ".join(["%.2f"] * len(w)) % tuple(w)) + ")")
plot(X, t, "o")
r = plot(X, y, lw=5)

Random Walk with Function Approximator

In [23]:
S = linspace(0, 1, 100)
In [24]:
def initial_state():
    return 0.5
def finished(state):
    return state <= 0.0 or state >= 1.0
In [25]:
def state_transition(state, action): # Stochastic, unknown to the agent
    if action == "left": state -= 0.15+random.rand()*0.05
    elif action == "right": state += 0.15+random.rand()*0.05
    return state
In [26]:
def reward(state, action, next_state): # Deterministic, unknown to the agent
    if next_state >= 1.0: return 1
    else: return 0
In [27]:
class QRBF(object):
    def __init__(self, n_features, S, A):
        self.c = linspace(0, max(S), n_features)
        self.h = 0.1*diff(self.c)
        self.h = hstack((self.h, self.h[-1]))
        self.W = zeros((len(A), n_features))
    def features(self, state):
        return exp(-0.5*(state-self.c)**2/self.h)
    def __call__(self, state, action=None):
        Q = self.W.dot(self.features(state))
        if action is None: return Q
        else: return Q[A.index(action)]
    def correct(self, state, action, target, alpha):
        phi = self.features(state)
        y = phi.dot(self.W[A.index(action)])
        self.W[A.index(action)] -= alpha * phi * (y - target)
In [28]:
Q = q_learning(alpha=0.02, gamma=0.9, epsilon=0.4, n_episodes=100,
               Q=QRBF(5, S, A))
In [29]:
figure(figsize=(12, 5))
Qtable = zeros((len(A), len(S)/5))
for i in range(0, len(S), 5):
    for j in range(len(A)):
        Qtable[j, i/5] = Q(S[i], A[j])
setp(gca(), xticks=(), yticks=())
r = imshow(Qtable, cmap=cm.gray, interpolation="nearest")

Question 1

  • Why can \(Q(s, a)\) take values greater than 1, even though no return in this task exceeds 1?

Answer 1

  • The target for online gradient descent is \(r_t + \gamma \max_{a'} Q(s_{t+1}, a')\).
  • \(Q(s_{t+1}, a')\) is itself only an estimate and not necessarily correct, so bootstrapped targets can temporarily overshoot the true values.

Neural Networks as Function Approximators

  • NNs require lots of tuning.
  • There is no perfect solution yet.
  • They are able to model highly nonlinear value functions.
  • Neural networks have the problem of catastrophic forgetting.

However, there are successful applications:

Tasks with Discrete Action Space

  • Games (chess, backgammon, ...)
  • Computer games (car racing, arcade games, ...)
  • Selection of high-level actions (open the door, move forward, ...)