A brief introduction to Deep Q-Learning

A review of the DQN paper

Introduction

This paper proposes an end-to-end control strategy that learns directly from high-dimensional pixel input. It builds on the Q-Learning algorithm introduced in the 1992 paper and combines it with the image-processing capabilities of convolutional neural networks.
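As a rough illustration of the kind of network involved, below is a sketch of a small convolutional Q-network in PyTorch. The layer sizes and input shape (a stack of four 84x84 frames, two convolutional layers, one fully connected layer) are illustrative assumptions in the spirit of the paper's architecture, not exact values taken from it.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a stack of preprocessed frames to one Q-value per action."""

    def __init__(self, n_actions, in_channels=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=8, stride=4),  # 4x84x84 -> 16x20x20
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2),           # -> 32x9x9
            nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256),
            nn.ReLU(),
            nn.Linear(256, n_actions),  # one Q-value per action
        )

    def forward(self, x):
        return self.head(self.features(x))

# Example: Q-values for a batch containing one 4-frame, 84x84 observation.
q_values = QNetwork(n_actions=6)(torch.zeros(1, 4, 84, 84))
```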

The framework above can handle high-dimensional inputs such as pixels, but two issues inherent to reinforcement learning remain: consecutive samples are strongly correlated, and the data distribution shifts as the policy changes. Following the 1993 paper [2], the authors address this with an experience replay mechanism: transitions are stored in a buffer and sampled at random during training, which breaks the temporal correlations and smooths the training distribution as it gradually shifts from the initial random behavior toward the improving policy.
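To make the mechanism concrete, here is a minimal sketch of a replay buffer. The class name, capacity, and batch size are illustrative choices, not details from the paper.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer that stores transitions and samples them uniformly."""

    def __init__(self, capacity=10000):
        # Old transitions are discarded automatically once capacity is reached.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        # Store one transition observed while interacting with the environment.
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform random sampling breaks the correlation between consecutive steps.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```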

Q-Learning

Definition
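Q-Learning is a model-free, off-policy algorithm that learns an action-value function $Q(s, a)$: the expected discounted return obtained by taking action $a$ in state $s$ and acting greedily afterwards. The optimal function satisfies the Bellman optimality equation $Q^*(s, a) = \mathbb{E}\left[ r + \gamma \max_{a'} Q^*(s', a') \mid s, a \right]$, and the procedure below estimates it from sampled transitions.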

Algorithm Procedure

  1. Initialization of the Q function
    • Represent it by a Q-table or Q-network.
    • Set its initial values, e.g., to zeros or small random values.
  2. Interaction between the agent and the environment.
    • The agent observes the current state $s_t$.
    • The agent selects an action $a_t$ according to the given policy, like the $\epsilon$-greedy policy.
    • Take action $a_t$, and the environment returns the reward $r_t$ and the next state $s_{t+1}$.
    • The transition $(s_t, a_t, r_t, s_{t+1})$ is stored in the replay buffer.
  3. Training
    • Sample a batch of transitions $(s, a, r, s')$ from the replay buffer.
    • Compute the target for each transition: $y = r + \gamma \max_{a'} Q(s', a')$, where $\gamma$ is the discount factor that weights future rewards.
    • Compute the TD error: $e = y - Q(s, a)$.
    • Update the Q-function using the TD error: $Q(s, a) \leftarrow Q(s, a) + \alpha \cdot e$, where $\alpha$ is the learning rate.
  4. Repeat steps 2 and 3 until the Q-function converges or the given number of training steps is reached. A minimal sketch of the full procedure is given below.
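To tie the steps together, here is a minimal tabular sketch of the procedure above, reusing the ReplayBuffer sketched in the introduction. The environment interface (reset/step), the hyperparameter values, and the use of a Q-table instead of a network are all illustrative assumptions.

```python
import random
from collections import defaultdict

def q_learning_with_replay(env, n_actions, episodes=500, gamma=0.99,
                           alpha=0.1, epsilon=0.1, batch_size=32):
    """Tabular Q-learning with uniform experience replay (illustrative sketch)."""
    # Step 1: initialize the Q-function (here a table defaulting to 0).
    Q = defaultdict(float)            # keyed by (state, action)
    buffer = ReplayBuffer()           # replay buffer sketched earlier

    for _ in range(episodes):
        state, done = env.reset(), False   # assumed environment interface
        while not done:
            # Step 2: epsilon-greedy action selection.
            if random.random() < epsilon:
                action = random.randrange(n_actions)
            else:
                action = max(range(n_actions), key=lambda a: Q[(state, a)])

            # Act, observe reward and next state, store the transition.
            next_state, reward, done = env.step(action)
            buffer.push(state, action, reward, next_state, done)
            state = next_state

            # Step 3: sample a batch and apply the TD update.
            if len(buffer) >= batch_size:
                for s, a, r, s_next, terminal in buffer.sample(batch_size):
                    best_next = max(Q[(s_next, a2)] for a2 in range(n_actions))
                    y = r if terminal else r + gamma * best_next   # target
                    Q[(s, a)] += alpha * (y - Q[(s, a)])           # TD-error update
    return Q
```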