【Beginner】Explanation of Q learning and implementation in python ~ simple maze problem as an example ~

This time, I introduce Q-learning, a typical algorithm for reinforcement learning.

In short, Q-learning is a method that uses the maximum Q value of the transition destination state, and is a reinforcement learning method called the optimistic method.

I will explain carefully from now on.

What is Reinforcement Learning?

I explain about reinforcement learning using the following diagram.

Suppose the robot on the left learns with reinforcement learning.

The Objects that learn by reinforcement learning like this robot are called agents and reinforcement learning agents .

Agents do something about the environment.

They will receive the state and reward from the environment that have been changed by the action.

Based on the received status and reward, the agent learns the behavior pattern aiming at maximizing the cumulative reward .

In reinforcement learning, agents are rewarded instead of the correct answer data.

The reward is not a correct or incorrect answer, but a good action.

Therefore, it is called an intermediate method between supervised learning and unsupervised learning.

The details are summarized in the following article.

What is Q learning

Q learning is one of the typical reinforcement learning methods.

In Q learning, the Q value of each action \ (a \) for each state \ (s \) is stored Q table \ (Q (s, a) \) Holding a table (Left of figure)

In Q learning, the value of this Q table is updated by the following formula.

As you can see from this equation (red line), the main feature of Q learning is to learn using the maximum Q value of the transition destination state\(max_{a’} Q(s, a’)\).

In other words, it uses only the best part of the state to which it has transitioned.

Therefore, it is called an optimistic method.

By the way, \( r+\gamma \max_{a’}Q(s’,a’) -Q(s,a)\) is called TD error,
In Q learning, we will learn to reduce this error.

\(\alpha\) is a parameter that takes a value between 0 and 1, which is called learning rate.
Determines how much TD error is reflected.

\( \gamma\) is a discount rate, a parameter that determines how much the maximum Q value of the transition destination is used.

Learning Procedure

Now let’s look at the specific Q learning learning procedure (algorithm).

First, it is shown in following bullets.

  1. In the current state \(s\), select and execute an action \(a\) according to \(Q(s, a)\)
  2. Receive transition destination state \(s’\) and reward \(r\) from environment
  3. Update \(Q(s, a)\) based on the obtained transition destination state \(s’\) and reward \(r\)
  4. Repeat steps 1-3

I explain each step using a diagram.

1.In the current state \(s\), select and execute an action \(a\) according to \(Q(s, a)\)

First, the reinforcement learning agent selects an action in the current state \(s\) according to \(Q(s, a)\) in the Q table.

The red part in the figure is \(Q(s, a)\).

From here, the agent selects the action using ε-greedy action selection, softmax action selection, etc.

Basically, all methods preferentially select actions with high \(Q(s, a)\).

Here, it is assumed that action \(a\) is selected.

After selecting an action, actually execute the action.

2.Receive transition destination state \(s’\) and reward \(r\) from environment

Next, the agent receives the state \(s’\) and the reward \(r\), which have transitioned by the action \(a\), from the environment.

This is called observation in reinforcement learning.

3. Update \(Q(s, a)\) based on the obtained transition destination state \(s’\) and reward \(r\)

Then, update \(Q(s, a)\) using the Q value update formula introduced earlier.

As mentioned earlier, Q learning is characterized by using the maximum Q value \(\max_{a’}Q(s’, a’)\) of the transition destination state \(s’\).

4. Repeat Steps 1-3

After that, Q learning repeats steps 1-3 until \(Q(s, a)\) converges to some extent.

Implementation and Experiment

Problem Settings

This experiment uses the following grid world issues: It will be a pretty simple map.

The reinforcement learning agent is initially at the lower left.

Suppose the map has a goal in the upper right and a danger zone in the upper left.

From here, the problem is for the agent to learn the shortest path to the upper right house.

Detail Settings

The states, actions and rewards are as follows.

  • State: Cells (coordinates)
  • Action: Up, down, left, right
  • Reward: 100 for top right, -100 for top left, 0 for normal points, -1 for walls and borders

When the agent arrives at the top left or top right, the game is considered to be over,
Return the agent to the lower left.
(This is expressed in units called episodes)

Source Code

Implement using Python. Python 3 is used.
As in the previous case, it is implemented separately for the Q learning agent class and the environment (grid world) class.

Github: https://github.com/tocom242242/QLearning_in_GridWorld

Q Learning Agent

First is Q learning.
The code looks like this:

The agent selects an action by the act method in lines 33 and 41.
Here, the action is selected by the \(\epsilon\)-greedy method.

The agent observes the state and reward of the transition destination by the observe method on lines 43 to 56.
Here, agent learn only when the reward is observed.

The Q value is updated by the learn method on lines 58-65.

Grid World

In grid world class,
Receives actions from agents and returns destination states and rewards.

Overview of Source Code

Here is an overview of the source code.
Running the following code will plot the rewards obtained for each episode.

Experimental Result

The cumulative reward for each episode is plotted.
(One episode is from the start to the goal)


The horizontal axis is the episode, and the vertical axis is the cumulative reward.

You can see that you are getting a high reward for learning.