Unit 8.0 - Introduction to Reinforcement Learning (RL)

Learning in the real world

Suppose we want to create the brain for a robot that helps us with everyday tasks at home. The robot is an “agent” that needs to learn in an “environment”. The robot uses sensory data to gather information about the state of the environment and of itself, and it uses actions to move and to affect the environment. All of this happens in a loop, just like in the image below.

An agent collects information about the environment in the following modalities:

imaging: cameras that collect 2D images, 3D cameras that can perceive depth
sound: hearing sounds, speech
proprioception: sensing the position of the agent’s degrees of freedom

An environment can be:

fully observable: like a chess or go board
partially observable: like a 3D first-person shooter (FPS) game, where the agent sees only part of the world at any moment

Learning in the real world involves learning a sequence of steps that eventually leads to solving a task. For example, many chess actions (moves) are necessary to win the game. Winning the game may give us a reward signal (happiness? money?) that we can use to learn how to get to the final winning state. See an example sequence of actions below:

\(action(0) \rightarrow action(1) \rightarrow \dots \rightarrow action(N) \rightarrow reward\)

Before each action, the agent senses the environment and then produces an action.

\(action(t+1) = \text{BrainFunction}(state(t), action(t), state(t-1), action(t-1), \dots)\)

How do we use a neural network to learn in this context?

In the field of RL, usually we use the following terminology:

action \(a\)
state \(s\)
reward \(r\)
brain function \(F\), sometimes called the “policy”: the function that maps what the robot has observed so far to its next actions

A neural network that works in RL needs to do the following: after observing the environment, use the brain function to create an action plan (a sequence of actions), while maintaining a history of previous states, actions and rewards.

\(a(t+1), a(t+2), \dots = F(s(t), s(t-1), \dots, a(t), a(t-1), \dots, r(t), r(t-1), \dots)\)
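The sense-act loop and the history that \(F\) receives can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: `ToyEnvironment`, `brain_function` and `run_episode` are hypothetical names, the environment is a made-up one-dimensional corridor, and the brain function here ignores its history and acts at random.

```python
import random
from collections import deque

ACTIONS = ["left", "right"]

class ToyEnvironment:
    """Hypothetical corridor: the agent starts at position 0 and is rewarded at position 3."""

    def __init__(self):
        self.position = 0

    def observe(self):
        # The state s(t) sensed by the agent; here it is just the position.
        return self.position

    def step(self, action):
        # Apply the action a(t) and return the reward that follows it.
        if action == "right":
            self.position += 1
        else:
            self.position = max(0, self.position - 1)
        return 1.0 if self.position == 3 else 0.0

def brain_function(history, rng):
    # Placeholder for F: a learned policy would use the history, this one acts at random.
    return rng.choice(ACTIONS)

def run_episode(max_steps=20, seed=0):
    rng = random.Random(seed)
    env = ToyEnvironment()
    history = deque(maxlen=10)  # most recent (state, action, reward) triples
    for _ in range(max_steps):
        state = env.observe()                  # sense
        action = brain_function(history, rng)  # decide
        reward = env.step(action)              # act
        history.append((state, action, reward))
        if reward > 0:
            break  # task solved, stop the episode
    return list(history)

trajectory = run_episode()
print(trajectory)
```

Replacing the random choice in `brain_function` with something that actually uses `history` is the learning problem the rest of this unit addresses.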

The question now is how to train a model to generate the right action sequence. One idea is to train a model with “imitation learning”. This model can learn sequences of actions leading to a reward directly, if such sequences are available. Such sequences of actions are available only in a narrow set of applications. Also, learning by imitation does not involve exploring new ideas or techniques, so the results are often limited by the training set.
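As a rough illustration of imitation learning and its limits, the sketch below copies an expert by counting which action the expert took most often in each state. The demonstrations, the state names and the helper `fit_imitation_policy` are all hypothetical and only serve the example; a real system would typically fit a neural network to the same kind of (state, action) pairs.

```python
from collections import Counter, defaultdict

# Hypothetical expert demonstrations as (state, action) pairs.
demonstrations = [
    ("door_closed", "open_door"), ("door_open", "walk"), ("in_kitchen", "pick_up_cup"),
    ("door_closed", "open_door"), ("door_open", "walk"), ("in_kitchen", "pick_up_cup"),
    ("door_closed", "knock"), ("door_open", "walk"), ("in_kitchen", "pick_up_cup"),
]

def fit_imitation_policy(pairs):
    """Return a dict mapping each demonstrated state to the expert's most frequent action."""
    counts = defaultdict(Counter)
    for state, action in pairs:
        counts[state][action] += 1
    return {state: c.most_common(1)[0][0] for state, c in counts.items()}

policy = fit_imitation_policy(demonstrations)

print(policy["door_closed"])     # the majority action among the demonstrations
print(policy.get("in_garden"))   # None: this state never appears in the training set
```

The second lookup shows the limitation mentioned above: the imitating policy has nothing to say about situations the demonstrations never covered.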

Reinforcement Learning

Reinforcement learning (RL) in the strict sense tries to learn a “brain function” model by trial and error in an online setting, that is, while operating in the real-life loop of sensing and actuation. RL supposes we have an agent and an environment, and that the agent gets access to a “reward” signal only. The agent has to learn to operate by exploring, trying actions, and maximizing the reward.

As you can imagine, this is more difficult than imitation learning (which is a form of supervised learning), but it can lead to better solutions, because the randomness of exploration can uncover and fine-tune trajectories that no training set contains.
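One common way to combine exploring with maximizing reward is an epsilon-greedy rule: with a small probability the agent tries a random action, otherwise it takes the action it currently believes is best. The sketch below is an assumption about how this could be coded, not the only option; the function name `epsilon_greedy` and the value estimates are made up for illustration.

```python
import random

def epsilon_greedy(value_estimates, epsilon, rng):
    """Pick a random action with probability epsilon, else the best-looking one.

    value_estimates: dict mapping each action to its current estimated value.
    """
    if rng.random() < epsilon:
        return rng.choice(list(value_estimates))  # explore
    best = max(value_estimates.values())
    best_actions = [a for a, v in value_estimates.items() if v == best]
    return rng.choice(best_actions)               # exploit, ties broken at random

rng = random.Random(0)
estimates = {"up": 0.1, "down": 0.5, "left": 0.0, "right": 0.2}  # hypothetical values
picks = [epsilon_greedy(estimates, epsilon=0.1, rng=rng) for _ in range(1000)]
share_greedy = picks.count("down") / len(picks)
print(f"'down' chosen in {share_greedy:.0%} of the trials")
```

A larger `epsilon` means more exploration and slower exploitation of what has been learned; a value of 0 never explores at all.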

Q-learning

Q-learning is a form of RL where the algorithm tries to learn a “quality” function, the Q-function, that scores each pair of state and action. The Q-function can then be used to select the action that brings the agent closer to a high-reward state.

Q-learning works with the following formula:

\(Q(S(t),A(t)) \leftarrow Q(S(t),A(t)) + \alpha \left[ R(t+1) + \gamma \max_{a} Q(S(t+1), a) - Q(S(t),A(t)) \right]\)

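\(Q\) is the quality function; the \(\max Q\) term is the highest \(Q\) value over the actions available in the next state \(S(t+1)\)
\(s, a, r\) are state, action, reward (written \(S(t), A(t), R(t)\) at time step \(t\))
\(\alpha\) is the learning rate and \(\gamma\) is the discount factor applied to future reward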
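To make the update rule concrete, here is a minimal sketch of a single Q-learning update on a dictionary-based Q-table. The function name `q_update`, the two states "A" and "B", and the values of the learning rate (0.1) and discount factor (0.9) are assumptions chosen for the example.

```python
def q_update(q_table, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Apply one Q-learning update in place and return the new Q(state, action)."""
    best_next = max(q_table[next_state].values())   # highest Q over actions in the next state
    td_target = reward + gamma * best_next          # R(t+1) + gamma * max Q(S(t+1), a)
    td_error = td_target - q_table[state][action]   # the bracketed term in the formula
    q_table[state][action] += alpha * td_error
    return q_table[state][action]

actions = ["up", "down", "left", "right"]
q_table = {s: {a: 0.0 for a in actions} for s in ["A", "B"]}
q_table["B"]["right"] = 1.0  # pretend we already learned that "right" is good in state B

# Moving from A to B with no immediate reward:
# Q(A, down) <- 0 + 0.1 * (0 + 0.9 * 1.0 - 0), which is approximately 0.09
new_value = q_update(q_table, "A", "down", reward=0.0, next_state="B")
print(new_value)
```

Even though the move from A earned no reward itself, its value rises because it leads to a state where a valuable action is available. This is how reward information propagates backwards through the table.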

Applying this formula over many trials in the environment fills in a Q-table, slowly learning which action maximizes the expected reward in each state.

Example application of Q-learning

Here is an example of Q-learning applied to a simple 3x3 grid world, where an agent learns to navigate from a start state (S) to a goal state (G) while avoiding obstacles (X).

Assuming the following grid world:

| S |   |   |
|   | X |   |
|   |   | G |

S: Start, X: Obstacle, G: Goal

The agent can move in four directions: up, down, left, and right. Actions that would move the agent into an obstacle or outside the grid are invalid.
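The grid world above could be encoded as follows. This is a sketch under several assumptions that the text does not fix: states are written as (row, column) counted from 1, an invalid move leaves the agent where it is, and the reward is 1 for reaching the goal and 0 otherwise. The names `step` and `valid_actions` are hypothetical helpers.

```python
START, GOAL, OBSTACLE = (1, 1), (3, 3), (2, 2)
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def is_free(cell):
    """True if the cell lies inside the 3x3 grid and is not the obstacle."""
    row, col = cell
    return 1 <= row <= 3 and 1 <= col <= 3 and cell != OBSTACLE

def valid_actions(state):
    """Actions that neither leave the grid nor run into the obstacle."""
    return [a for a, (dr, dc) in MOVES.items() if is_free((state[0] + dr, state[1] + dc))]

def step(state, action):
    """Return (next_state, reward, done) for one move."""
    d_row, d_col = MOVES[action]
    candidate = (state[0] + d_row, state[1] + d_col)
    next_state = candidate if is_free(candidate) else state  # assumption: invalid move = stay put
    reward = 1.0 if next_state == GOAL else 0.0              # assumption: reward only at the goal
    return next_state, reward, next_state == GOAL

print(valid_actions(START))
print(step((2, 3), "down"))
```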

Here’s how the Q-table might look during different iterations of Q-learning:

Initial Q-table

State	Up	Down	Left	Right
(1,1)	0	0	0	0
(1,2)	0	0	0	0
(1,3)	0	0	0	0
(2,1)	0	0	0	0
(2,2)	0	0	0	0
(2,3)	0	0	0	0
(3,1)	0	0	0	0
(3,2)	0	0	0	0
(3,3)	0	0	0	0

After Several Iterations of Q-learning

State	Up	Down	Left	Right
(1,1)	0	0.2	0.2	0.2
(1,2)	0.2	0	0.2	0.2
(1,3)	0.2	0.2	0	0.2
(2,1)	0	0.2	0.2	0.2
(2,2)	0.2	0.2	0.2	0.2
(2,3)	0.2	0.2	0.2	0
(3,1)	0.2	0	0.2	0
(3,2)	0.2	0	0	0.2
(3,3)	0	0	0.2	0.2

Here, each cell in the Q-table represents the value of taking a particular action from the corresponding state. The numbers above are only illustrative. Over many iterations of Q-learning, the values in the Q-table converge towards the optimal action-values, and the agent learns the optimal action to take in each state (the one with the highest value) to reach the goal while avoiding obstacles.
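The whole procedure can be put together in a short tabular Q-learning sketch for this grid. It rests on assumptions not given in the text: states are (row, column) pairs counted from 1, invalid moves leave the agent in place, the reward is 1 at the goal and 0 elsewhere, actions are chosen with an epsilon-greedy rule, and the hyperparameters are arbitrary choices. The numbers it produces will therefore differ from the illustrative table.

```python
import random

START, GOAL, OBSTACLE = (1, 1), (3, 3), (2, 2)
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
STATES = [(r, c) for r in range(1, 4) for c in range(1, 4)]

def step(state, action):
    d_row, d_col = MOVES[action]
    row, col = state[0] + d_row, state[1] + d_col
    inside = 1 <= row <= 3 and 1 <= col <= 3
    # Assumption: a move off the grid or into the obstacle leaves the agent in place.
    next_state = (row, col) if inside and (row, col) != OBSTACLE else state
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL

def train(episodes=2000, alpha=0.5, gamma=0.9, epsilon=0.2, max_steps=50, seed=0):
    rng = random.Random(seed)
    q_table = {s: {a: 0.0 for a in MOVES} for s in STATES}  # the initial all-zero Q-table
    for _ in range(episodes):
        state = START
        for _ in range(max_steps):
            if rng.random() < epsilon:
                action = rng.choice(list(MOVES))  # explore
            else:
                best = max(q_table[state].values())
                action = rng.choice([a for a, v in q_table[state].items() if v == best])
            next_state, reward, done = step(state, action)
            best_next = 0.0 if done else max(q_table[next_state].values())
            # Q-learning update
            q_table[state][action] += alpha * (reward + gamma * best_next - q_table[state][action])
            state = next_state
            if done:
                break
    return q_table

def greedy_path(q_table, max_steps=20):
    """Follow the highest-valued action from the start state."""
    path, state = [START], START
    for _ in range(max_steps):
        action = max(q_table[state], key=q_table[state].get)
        state, _, done = step(state, action)
        path.append(state)
        if done:
            break
    return path

q_table = train()
path = greedy_path(q_table)
print(path)
```

Under these assumptions, the learned value of the best action shrinks by a factor of `gamma` for every extra step away from the goal, which is what makes the greedy path head towards it.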

Q table in neural networks:

We can replace this Q-table with a neural network that approximates it. In this case, we can use a fully connected neural network with one hidden layer. The input layer will have 9 neurons (a one-hot encoding of the 9 states), the hidden layer can have any number of neurons, and the output layer will have 4 neurons (one Q-value for each of the four actions: Up, Down, Left, Right).

Let’s represent this neural network with 9 neurons in the input layer, 6 neurons in the hidden layer, and 4 neurons in the output layer (one for each action):

[1]:

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(9, 6)  # Input layer: 9 neurons (one-hot state), hidden layer: 6 neurons
        self.fc2 = nn.Linear(6, 4)  # Hidden layer: 6 neurons, output layer: 4 neurons (one Q-value per action)

    def forward(self, x):
        x = torch.relu(self.fc1(x))  # Applying ReLU activation function in the hidden layer
        x = self.fc2(x)  # Raw Q-values, no activation on the output
        return x

# Instantiate the network
model = QNetwork()
print(model)

QNetwork(
  (fc1): Linear(in_features=9, out_features=6, bias=True)
  (fc2): Linear(in_features=6, out_features=4, bias=True)
)

The weights between the input layer and the hidden layer form a 9×6 matrix, and the weights between the hidden layer and the output layer form a 6×4 matrix (PyTorch stores each of them transposed, as 6×9 and 4×6). We can initialize these weights randomly and update them during the training process using back-propagation and gradient descent.

This neural network takes the current state as input and outputs one Q-value for each action, which corresponds to one row of the Q-table.
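As a sketch of how such a network could be queried and trained, the example below assumes the common layout in which the state is fed in as a one-hot vector of length 9 and the network returns one Q-value per action. The state numbering (0 to 8, row by row), the action numbering (0=Up, 1=Down, 2=Left, 3=Right), the helper names `one_hot` and `td_step`, and the hyperparameters are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
N_STATES, N_ACTIONS = 9, 4

# Same shape as a 9 -> 6 -> 4 fully connected network, written with nn.Sequential.
q_net = nn.Sequential(nn.Linear(N_STATES, 6), nn.ReLU(), nn.Linear(6, N_ACTIONS))
optimizer = torch.optim.SGD(q_net.parameters(), lr=0.1)

def one_hot(state_index):
    """Encode a state index (0..8) as a 1x9 one-hot tensor."""
    x = torch.zeros(1, N_STATES)
    x[0, state_index] = 1.0
    return x

def td_step(state, action, reward, next_state, done, gamma=0.9):
    """One gradient step that moves Q(state, action) towards the Q-learning target."""
    q_pred = q_net(one_hot(state))[0, action]
    with torch.no_grad():  # the target is treated as a constant
        best_next = 0.0 if done else q_net(one_hot(next_state)).max().item()
    target = torch.tensor(reward + gamma * best_next)
    loss = nn.functional.mse_loss(q_pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

q_values = q_net(one_hot(0))               # Q-values of the four actions in the start state
greedy_action = int(q_values.argmax(dim=1))
# Transition from state 5, i.e. (2,3), moving Down into state 8, i.e. the goal (3,3).
loss = td_step(state=5, action=1, reward=1.0, next_state=8, done=True)
print(q_values.shape, greedy_action, loss)
```

The squared difference between the predicted Q-value and the target plays the role of the bracketed term in the Q-learning formula, with the optimizer's learning rate standing in for \(\alpha\).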