# Deep Q-Learning (DQN)

## Overview
As an extension of Q-learning, DQN's main technical contribution is the use of a replay buffer and a target network, both of which help improve the stability of the algorithm.
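
To make the role of these two components concrete, below is a minimal, hypothetical sketch of a DQN update in PyTorch. The names (`q_network`, `target_network`, `replay_buffer`, `train_step`, `sync_target`), the layer sizes, and the hyperparameters are illustrative assumptions and do not necessarily match the variable names or settings used in the scripts documented here.

```python
import random
from collections import deque

import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sizes for a CartPole-like task: 4-dim observations, 2 actions.
q_network = nn.Sequential(nn.Linear(4, 120), nn.ReLU(), nn.Linear(120, 2))
target_network = nn.Sequential(nn.Linear(4, 120), nn.ReLU(), nn.Linear(120, 2))
target_network.load_state_dict(q_network.state_dict())  # start from identical weights
optimizer = torch.optim.Adam(q_network.parameters(), lr=1e-4)

replay_buffer = deque(maxlen=10_000)  # stores (obs, action, reward, next_obs, done) tuples
gamma = 0.99


def train_step(batch_size=128):
    """One gradient step on a mini-batch sampled from the replay buffer."""
    batch = random.sample(replay_buffer, batch_size)  # uniform sampling breaks temporal correlations
    obs, actions, rewards, next_obs, dones = (
        torch.as_tensor(x, dtype=torch.float32) for x in zip(*batch)
    )
    with torch.no_grad():
        # The frozen target network supplies the bootstrap value, so the
        # regression target does not shift with every gradient update.
        target_max = target_network(next_obs).max(dim=1).values
        td_target = rewards + gamma * target_max * (1.0 - dones)
    q_values = q_network(obs).gather(1, actions.long().unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(q_values, td_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()


def sync_target():
    # Periodically copy the online weights into the target network.
    target_network.load_state_dict(q_network.state_dict())
```

The replay buffer breaks temporal correlations by sampling past transitions uniformly at random, while the periodically synced target network keeps the regression target from chasing the online network's own updates.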
Original papers:

- [Playing Atari with Deep Reinforcement Learning](https://arxiv.org/abs/1312.5602)
- [Human-level control through deep reinforcement learning](https://www.nature.com/articles/nature14236)
## Implemented Variants
| Variants Implemented | Description |
|---|---|
| `dqn.py`, docs | For classic control tasks like `CartPole-v1`. |
| `dqn_atari.py`, docs | For playing Atari games. It uses convolutional layers and common Atari-based pre-processing techniques. |
Below are our single-file implementations of DQN:
## `dqn.py`

`dqn.py` has the following features (a short sketch of the corresponding Q-network follows the list):
- Works with the `Box` observation space of low-level features
- Works with the `Discrete` action space
- Works with envs like `CartPole-v1`
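
For such low-dimensional `Box` observations and `Discrete` actions, the Q-network can be a small fully connected network that maps the flattened observation to one Q-value per action. The layer widths below are illustrative assumptions rather than a verbatim copy of `dqn.py`:

```python
import numpy as np
import torch
import torch.nn as nn


class QNetwork(nn.Module):
    def __init__(self, observation_shape, n_actions):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(int(np.prod(observation_shape)), 120),  # flatten the Box observation
            nn.ReLU(),
            nn.Linear(120, 84),
            nn.ReLU(),
            nn.Linear(84, n_actions),  # one Q-value per discrete action
        )

    def forward(self, x):
        return self.network(x)


# Example: CartPole-v1 has a Box(4,) observation space and 2 discrete actions.
q_network = QNetwork(observation_shape=(4,), n_actions=2)
q_values = q_network(torch.zeros(1, 4))  # shape (1, 2)
```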
### Implementation details

`dqn.py` includes the 11 core implementation details:
## `dqn_atari.py`

`dqn_atari.py` has the following features (an illustrative network sketch follows the list):

- For playing Atari games. It uses convolutional layers and common Atari-based pre-processing techniques.
- Works with Atari's pixel `Box` observation space of shape `(210, 160, 3)`
- Works with the `Discrete` action space
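
As a sketch of what "convolutional layers and common Atari-based pre-processing techniques" typically means here, the network below follows the architecture described in (Mnih et al., 2015) and assumes the raw `(210, 160, 3)` frames have already been grayscaled, resized to 84×84, and stacked 4 deep by the pre-processing wrappers. Treat it as an approximation of `dqn_atari.py`, not a verbatim excerpt:

```python
import torch
import torch.nn as nn


class QNetwork(nn.Module):
    """Convolutional Q-network in the style of Mnih et al. (2015)."""

    def __init__(self, n_actions):
        super().__init__()
        self.network = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4),   # input: 4 stacked 84x84 grayscale frames
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(3136, 512),  # 64 channels * 7 * 7 spatial positions
            nn.ReLU(),
            nn.Linear(512, n_actions),
        )

    def forward(self, x):
        return self.network(x / 255.0)  # scale pixel values to [0, 1]


# Example forward pass with a batch of one stacked, pre-processed observation.
q_values = QNetwork(n_actions=4)(torch.zeros(1, 4, 84, 84))  # shape (1, 4)
```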
### Usage

```bash
poetry install -E atari
python cleanrl/dqn_atari.py --env-id BreakoutNoFrameskip-v4
python cleanrl/dqn_atari.py --env-id PongNoFrameskip-v4
```
### Implementation details

`dqn_atari.py` is based on (Mnih et al., 2015) but presents a few implementation differences:

1. `dqn_atari.py` uses slightly different hyperparameters. Specifically,
    - `dqn_atari.py` uses the more popular Adam optimizer with `--learning-rate=1e-4` as follows:

      ```python
      optim.Adam(q_network.parameters(), lr=1e-4)
      ```

      whereas (Mnih et al., 2015) (Extended Data Table 1) uses the RMSProp optimizer with `--learning-rate=2.5e-4`, gradient momentum `0.95`, squared gradient momentum `0.95`, and min squared gradient `0.01` as follows:

      ```python
      optim.RMSprop(
          q_network.parameters(),
          lr=2.5e-4,
          momentum=0.95,
          # ... PyTorch's RMSprop does not directly support
          # squared gradient momentum and min squared gradient,
          # so we are not sure what to put here.
      )
      ```

    - `dqn_atari.py` uses `--learning-starts=80000` whereas (Mnih et al., 2015) (Extended Data Table 1) uses `--learning-starts=50000`.
    - `dqn_atari.py` uses `--total-timesteps=10000000` (i.e., 10M timesteps = 40M frames because of frame-skipping) whereas (Mnih et al., 2015) uses `--total-timesteps=12500000` (i.e., 12.5M timesteps = 50M frames) (see "Training details" under "METHODS" on page 6).
    - `dqn_atari.py` uses `--end-e=0.01` (the final exploration epsilon; see the schedule sketch after this list) whereas (Mnih et al., 2015) (Extended Data Table 1) uses `--end-e=0.1`.
2. `dqn_atari.py` uses a self-contained evaluation scheme: `dqn_atari.py` simply reports the episodic returns obtained throughout training, whereas (Mnih et al., 2015) is trained with `--end-e=0.1` but reports episodic returns using a separate evaluation process with `--end-e=0.01` (see "Evaluation procedure" under "METHODS" on page 6).
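
For reference, `--end-e` controls the floor of a linearly decaying exploration epsilon. The sketch below shows a typical linear schedule and epsilon-greedy action selection; the `start_e` and `exploration_fraction` parameters and their defaults are illustrative assumptions, and the exact flag names and values in `dqn_atari.py` may differ:

```python
import random


def linear_schedule(start_e: float, end_e: float, duration: int, t: int) -> float:
    """Linearly anneal epsilon from start_e to end_e over `duration` steps, then hold at end_e."""
    slope = (end_e - start_e) / duration
    return max(slope * t + start_e, end_e)


def select_action(q_values, n_actions, global_step, total_timesteps,
                  start_e=1.0, end_e=0.01, exploration_fraction=0.10):
    # Illustrative defaults: decay over the first 10% of training down to end_e.
    epsilon = linear_schedule(start_e, end_e, int(exploration_fraction * total_timesteps), global_step)
    if random.random() < epsilon:
        return random.randrange(n_actions)  # explore: random action
    return int(q_values.argmax())           # exploit: greedy w.r.t. the Q-network output
```

Annealing epsilon to a small but nonzero floor keeps a trickle of exploration for the remainder of training, which is what the `--end-e` comparison above is about.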