【Reinforcement Learning】 Epsilon-Greedy Action Selection

Overview of ε-greedy action selection

ε-greedy action selection is a method that selects a random action with probability ε and otherwise, with probability (1-ε), selects the action with the highest expected value.

Along with softmax action selection, it is one of the most commonly used action selection methods in reinforcement learning.

Example

I will explain using the problem shown in the figure below.
The player has three choices: A, B, and C.
The number in the square below each choice is the expected profit obtained by selecting it.

With ε-greedy action selection, an action is chosen as follows.

With probability ε

With probability ε, an action is selected uniformly at random.
This operation is called exploration, which is why ε is called the exploration rate.
Exploration means searching: trying new actions and updating existing knowledge.

With probability (1 – ε)

With probability (1-ε), the action with the highest expected value is selected.
This operation is called exploitation of knowledge.
It means choosing the most rational action given current knowledge.
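
The two cases combine into a single selection probability for each action. The sketch below is a minimal illustration, assuming the random pick is uniform over all actions, including the current best one (as in the sample code later in this article). The function name `epsilon_greedy_probabilities` is hypothetical and introduced only for this example.

```python
import numpy as np

def epsilon_greedy_probabilities(epsilon, values):
    """Return the probability of selecting each action under ε-greedy.

    Assumes a uniform random pick over all actions during exploration
    and a single greedy action (ties go to the first maximum, like np.argmax).
    """
    values = np.asarray(values, dtype=float)
    nb_values = len(values)
    # Every action can be picked by exploration: epsilon / n each.
    probs = np.full(nb_values, epsilon / nb_values)
    # The greedy action additionally receives the exploitation probability.
    probs[np.argmax(values)] += 1.0 - epsilon
    return probs

probs = epsilon_greedy_probabilities(0.1, [100, 50, 10])
print(probs)  # A gets 1 - ε + ε/3; B and C get ε/3 each
```

Note that the best action is also reachable through exploration, so its total probability is slightly higher than (1-ε).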

About parameter ε

This ε (epsilon) is called the exploration rate; it determines how often a random action is tried instead of the best known one.

In ε-greedy action selection, ε must be set or tuned appropriately.

If ε is too low, actions other than the current best one are rarely tried, so it is difficult to find the optimal behavior.
If it is too high, the behavior becomes almost random and the profit that can be earned becomes unstable, because the best known action is chosen less often.
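
To get a feel for the cost of a large ε, the following sketch computes the expected profit of a single selection as a function of ε. It assumes the expected values are already known and accurate (the values 100, 50, and 10 for A, B, and C are taken from the sample code in this article), so it only illustrates the price of exploring too much, not the benefit of exploring while the estimates are still wrong. `expected_profit` is a hypothetical helper name.

```python
import numpy as np

def expected_profit(epsilon, values):
    """Expected profit of one ε-greedy selection, assuming known values."""
    values = np.asarray(values, dtype=float)
    # Exploitation earns the best value; exploration earns the average value.
    return (1.0 - epsilon) * values.max() + epsilon * values.mean()

values = [100, 50, 10]  # expected profits of A, B, and C
for epsilon in (0.0, 0.1, 0.4, 1.0):
    profit = expected_profit(epsilon, values)
    print(f"epsilon={epsilon:.1f}  expected profit={profit:.2f}")
```

Under these assumptions the expected profit falls linearly as ε grows, from the best value at ε = 0 to the plain average of all values at ε = 1.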

Implementation

Below is sample code for ε-greedy action selection.

import numpy as np
import matplotlib.pyplot as plt

def epsilon_greedy_selection(epsilon, values):
    """Return the index of the action chosen by ε-greedy selection."""
    nb_values = len(values)

    if np.random.uniform() < epsilon:  # Exploration: uniform over all actions
        action = np.random.randint(0, nb_values)
    else:  # Exploitation: action with the highest expected value
        action = np.argmax(values)

    return action

nb_steps = 1000
labels = ["A", "B", "C"]
values = [100, 50, 10]   # Expected profit of A, B, and C
epsilon = 0.1            # Exploration rate
results = []

# Select an action nb_steps times
for _ in range(nb_steps):
    selected_action = epsilon_greedy_selection(epsilon, values)
    results.append(selected_action)

# Count how often each action was selected and plot the counts
counts = np.bincount(results, minlength=len(values))
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.bar(labels, counts)
ax.set_ylim(0, nb_steps)
plt.show()

This code performs ε-greedy action selection 1000 times and plots how often each action was selected.

The results when ε = 0.1 are shown below.

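Because ε is small, A, the option with the highest expected value, is selected almost every time. In expectation, A is chosen with probability 1 - ε + ε/3 ≈ 0.93, and B and C with probability ε/3 ≈ 0.03 each.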

The results when ε = 0.4 are shown below.

Because ε is larger than in the previous experiment, B and C are selected more often: in expectation each is now chosen with probability ε/3 ≈ 0.13, while A drops to about 0.73.
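
The same comparison can also be made numerically instead of visually. The sketch below is one possible way to do it, assuming NumPy's `default_rng` with a fixed seed (an arbitrary choice for reproducibility) and a hypothetical helper `selection_frequencies`. It draws all selections at once rather than in a loop and uses more steps than the sample code so that the frequencies are less noisy.

```python
import numpy as np

def selection_frequencies(epsilon, values, nb_steps, rng):
    """Return the fraction of steps in which each action was selected."""
    nb_values = len(values)
    explore = rng.random(nb_steps) < epsilon               # True where we explore
    random_actions = rng.integers(0, nb_values, nb_steps)  # uniform random picks
    greedy_action = int(np.argmax(values))                 # best known action
    actions = np.where(explore, random_actions, greedy_action)
    counts = np.bincount(actions, minlength=nb_values)
    return counts / nb_steps

rng = np.random.default_rng(0)  # seed 0 is an arbitrary choice
values = [100, 50, 10]          # expected profits of A, B, and C
freq_low = selection_frequencies(0.1, values, 100_000, rng)
freq_high = selection_frequencies(0.4, values, 100_000, rng)
print("epsilon=0.1:", freq_low)
print("epsilon=0.4:", freq_high)
```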
