Building on the Q-Learning algorithm introduced in the 1992 paper and combining it with the powerful image-processing capabilities of convolutional neural networks, this paper proposes an end-to-end strategy that learns control policies directly from high-dimensional pixel input.
This framework can handle high-dimensional input such as pixels, but it does not by itself solve two problems inherent in reinforcement learning: the strong correlation between consecutive samples and the instability of the data distribution. Following the 1993 paper [2], the authors address this with an experience replay mechanism, which stores past transitions and samples from them during training, so that the training distribution shifts gradually from the random behavior at the beginning toward the better-performing behavior of the current policy.
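As a concrete illustration, below is a minimal sketch of such a replay buffer in Python; the class name `ReplayBuffer`, the default capacity, and uniform random sampling are illustrative assumptions, not details taken from the papers.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer that stores transitions and samples them uniformly at random."""

    def __init__(self, capacity=100_000):
        # Oldest transitions are dropped automatically once capacity is reached.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        # Store one transition (s_t, a_t, r_t, s_{t+1}).
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform random sampling breaks the temporal correlation
        # between consecutive transitions.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```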
Q-Learning
Definition
The Q-function maps state-action pairs to an expected cumulative reward.
For a state $s$ and action $a$, the Q-function $Q(s,a)$ represents the expected cumulative reward obtained when the agent takes action $a$ in state $s$ and then follows some policy until the end of the task.
If we have the optimal function $Q^{*}(s,a)$, the agent obtains the largest expected reward by taking, in each state $s$, the action $a$ that maximizes $Q^{*}(s,a)$. Learning this optimal Q-function is the goal of Q-Learning.
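In the standard formulation, $Q^{*}$ satisfies the Bellman optimality equation, and the optimal policy simply acts greedily with respect to it:

\[
Q^{*}(s, a) = \mathbb{E}\left[\, r + \gamma \max_{a'} Q^{*}(s', a') \;\middle|\; s, a \,\right], \qquad \pi^{*}(s) = \arg\max_{a} Q^{*}(s, a)
\]

where $\gamma$ is the discount factor. The training target used below follows this same recursion.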
Algorithm Procedure
Initialization of the Q function
Represent it by a Q-table or Q-network.
Generate the initial values (a minimal sketch follows).
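A minimal sketch of the tabular case, assuming a discrete action space and zero initialization; for pixel input a Q-network would replace the table.

```python
from collections import defaultdict

NUM_ACTIONS = 4  # assumed size of the discrete action space

# Q-table: maps a (hashable) state to one value per action, initialized to zero.
Q = defaultdict(lambda: [0.0] * NUM_ACTIONS)
```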
Interaction between the agent and the environment.
The agent observes the current state $s_t$.
The agent selects an action $a_t$ according to a given behavior policy, such as the $\epsilon$-greedy policy.
Take action $a_t$, and the environment returns the reward $r_t$ and the next state $s_{t+1}$.
Experience $(s_t, a_t, r_t, s_{t+1})$ is stored in the replay pool (a sketch of this interaction step follows this list).
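The sketch below shows one such interaction step, reusing the Q-table and `ReplayBuffer` from the earlier sketches. The environment interface (`env.observe()` returning $s_t$ and `env.step(a)` returning $(r_t, s_{t+1})$) is hypothetical, used only to keep the example short.

```python
import random

def epsilon_greedy(Q, state, num_actions, epsilon=0.1):
    # With probability epsilon explore a random action,
    # otherwise exploit the current Q estimates (argmax over actions).
    if random.random() < epsilon:
        return random.randrange(num_actions)
    values = Q[state]
    return max(range(num_actions), key=lambda a: values[a])

def interact(env, Q, buffer, num_actions, epsilon=0.1):
    # One agent-environment interaction step: observe, act, store the transition.
    state = env.observe()                  # current state s_t (assumed interface)
    action = epsilon_greedy(Q, state, num_actions, epsilon)
    reward, next_state = env.step(action)  # assumed to return (r_t, s_{t+1})
    buffer.push(state, action, reward, next_state)
    return next_state
```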
Training
Sample a mini-batch of transitions $(s, a, r, s')$ from the experience replay pool.
Compute the target for each experience: \(y = r + \gamma \max_{a'} Q(s', a')\), where $\gamma$ is the discount factor, indicating the importance attached to future rewards.
Compute the TD error: \(e = y - Q(s, a)\).
Update the Q-function based on the TD error: \(Q(s, a) \leftarrow Q(s, a) + \alpha \, e\), where $\alpha$ is the learning rate.
Repeat steps 2 and 3 (interaction and training) until the Q-function converges or the given number of training steps is reached; a sketch of one training update is shown below.
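The following sketch implements one training update in the tabular case, using the replay buffer and Q-table from the earlier sketches; the batch size, $\gamma$, and $\alpha$ values are illustrative hyperparameters, not values from the paper.

```python
def train_step(Q, buffer, batch_size=32, gamma=0.99, alpha=0.1):
    # Skip training until enough experience has been collected.
    if len(buffer) < batch_size:
        return
    for state, action, reward, next_state in buffer.sample(batch_size):
        # Target: y = r + gamma * max_a' Q(s', a')
        target = reward + gamma * max(Q[next_state])
        # TD error: e = y - Q(s, a)
        td_error = target - Q[state][action]
        # Update: Q(s, a) <- Q(s, a) + alpha * e
        Q[state][action] += alpha * td_error
```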