A Deep Q-Network (DQN) combines Q-learning with a deep neural network. The network learns to estimate the value (Q-value) of taking each action in a given state.
Q(state, action) = Expected discounted future reward
Agent learns: "How good is action A in state S?"
Input (12) → [128, ReLU] → [64, ReLU] → Output (4)
↓
Q-values for each action

# Store transitions
memory = [(s, a, r, s', done), ...]
# Learn from random samples
batch = memory.sample(64)

Why? Breaks correlation between consecutive samples.
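
As a minimal sketch of the replay memory described above, the buffer can be built on a bounded `deque`. The class name `ReplayBuffer` and the default capacity are assumptions for illustration; only `push`, `sample`, and `len(memory)` appear in this document.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity=10000):
        # A deque with maxlen drops the oldest transition once it is full.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement mixes old and new transitions.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


memory = ReplayBuffer(capacity=5)
for t in range(8):
    memory.push(t, 0, 1.0, t + 1, False)
batch = memory.sample(3)
```

With a capacity of 5 and 8 pushes, only the 5 most recent transitions remain, which is the same trade-off the buffer-size setting controls at larger scale.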
policy_net  # Updated every step
target_net  # Updated every 10 episodes

Why? Stabilizes training by providing consistent targets.
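
The effect of the periodic copy can be illustrated without a real network. This is a toy sketch: the dictionaries stand in for network weights, and the "+ 1" update is a placeholder for a gradient step, not an actual learning rule.

```python
import copy

TARGET_UPDATE = 10  # episodes between hard updates, as stated above

# Hypothetical stand-in for network weights: one number per action.
policy_params = {"w": [0, 0, 0, 0]}
target_params = copy.deepcopy(policy_params)

snapshots = []
for episode in range(1, 31):
    # Placeholder for training: the policy weights change every episode.
    policy_params["w"] = [w + 1 for w in policy_params["w"]]
    if episode % TARGET_UPDATE == 0:
        # Hard update: the target becomes an independent copy of the policy.
        target_params = copy.deepcopy(policy_params)
    snapshots.append(target_params["w"][0])
```

The target weights stay frozen for ten episodes at a time while the policy weights keep moving. In PyTorch, the hard update is typically written as `target_net.load_state_dict(policy_net.state_dict())`.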
if random() < epsilon:
    action = random_action()  # Explore
else:
    action = best_q_action()  # Exploit

Epsilon: 1.0 → 0.01 over training
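
A runnable sketch of the selection rule, assuming the Q-values arrive as a plain list and epsilon shrinks by a multiplicative factor with a floor. The helper names and the default decay of 0.995 are illustrative choices, not taken from the solver's code.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return a random action with probability epsilon, else the argmax."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # Explore
    return max(range(len(q_values)), key=q_values.__getitem__)  # Exploit


def decay_epsilon(epsilon, decay=0.995, epsilon_min=0.01):
    """Multiplicative decay with a floor (assumed schedule)."""
    return max(epsilon_min, epsilon * decay)


q_values = [0.1, 0.7, 0.3, 0.2]
greedy_action = epsilon_greedy(q_values, epsilon=0.0)
random_actions = {epsilon_greedy(q_values, epsilon=1.0) for _ in range(200)}
```

With `epsilon=0.0` the call always returns the index of the largest Q-value; with `epsilon=1.0` every call is a uniform random draw.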
for episode in episodes:
    state = env.reset()
    for step in steps:
        # 1. Choose action
        action = epsilon_greedy(state)
        # 2. Take action
        next_state, reward, done = env.step(action)
        # 3. Store transition
        memory.push(state, action, reward, next_state, done)
        # 4. Learn from batch
        if len(memory) > batch_size:
            batch = memory.sample(batch_size)
            loss = compute_td_loss(batch)
            optimize(loss)
        state = next_state
        # Stop the episode once the environment signals termination
        if done:
            break
    # 5. Update target network (once per episode, not once per step)
    if episode % 10 == 0:
        target_net.copy(policy_net)

# Current Q-value
Q_current = policy_net(state)[action]
# Target Q-value (terminal states do not bootstrap from next_state)
Q_target = reward + gamma * max(target_net(next_state)) * (1 - done)
# Loss
loss = (Q_current - Q_target)²

Learning rate:
Low (0.0001): Stable, slow
Medium (0.001): Balanced ✓
High (0.01): Fast, unstable

Discount factor (gamma):
0.9: Short-term focus
0.99: Long-term planning ✓
0.999: Very long-term (can be unstable)

Epsilon decay:
Fast (0.99): Quick exploitation
Medium (0.995): Balanced ✓
Slow (0.999): Extended exploration

Batch size:
Small (32): Noisy updates, faster
Medium (64): Balanced ✓
Large (128): Stable, slower

Replay buffer size:
Small (1000): Recent experience only
Medium (10000): Good memory ✓
Large (100000): Diverse experience, more RAM
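
To make the epsilon decay options concrete, the sketch below computes how many decay steps it takes to fall from 1.0 to 0.01. It assumes epsilon is multiplied by the decay factor once per episode, which this document does not state explicitly; `episodes_to_reach` is a hypothetical helper.

```python
import math


def episodes_to_reach(epsilon_min, decay, epsilon_start=1.0):
    """Smallest n with epsilon_start * decay**n <= epsilon_min."""
    return math.ceil(math.log(epsilon_min / epsilon_start) / math.log(decay))


horizons = {d: episodes_to_reach(0.01, d) for d in (0.99, 0.995, 0.999)}
print(horizons)  # {0.99: 459, 0.995: 919, 0.999: 4603}
```

Under that assumption, a 500-episode run with decay 0.995 ends with epsilon near 0.08 rather than 0.01, so the decay rate and the episode budget are worth choosing together.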
# Not just sparse goal reward
reward = base + distance_improvement + exploration - time_penalty

layers = [
    Linear(128),
    ReLU(),
    Dropout(0.1),  # Prevents overfitting
    ...
]

loss.backward()
clip_grad_norm_(parameters, max_norm=1.0)  # Prevents exploding gradients
optimizer.step()

# Check if Q-values changing
print(f"Q-values: {q_values.mean():.2f}")
# Verify loss decreasing
print(f"Loss: {loss.item():.4f}")
# Ensure exploration happening
print(f"Epsilon: {epsilon:.3f}")

- Lower learning rate: lr = 0.0005
- Increase target update interval (update less often): freq = 20
- Smaller batch size: batch = 32

- Add more dropout: p = 0.2
- Increase replay buffer: size = 20000
- Use regularization: weight_decay = 1e-5

- Use GPU: device = 'cuda'
- Reduce network size: hidden = [64, 32]
- Smaller buffer: size = 5000
✅ Proven method - works on many domains
✅ Sample efficient - reuses experiences
✅ Stable training - target network + replay
✅ Continuous learning - online updates
✅ GPU acceleration - fast with CUDA
❌ Overestimation bias - Q-values often too high
❌ Brittle - sensitive to hyperparameters
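❌ Exploration challenge - ε-greedy explores at random, which struggles when rewards are sparse
❌ Discrete actions only - can't handle continuous action spaces directly
❌ Correlation issues - replay reduces sample correlation but does not eliminate it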
# Use policy net to select the best next action
best_action = policy_net(next_state).argmax()
# Use target net to evaluate that action
Q_target = reward + gamma * target_net(next_state)[best_action] * (1 - done)

Benefit: Reduces overestimation
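
The difference between the two targets is easiest to see with numbers. In this sketch the Q-values are made up, the function names are illustrative, and the `(1 - done)` factor is included so that terminal transitions do not bootstrap.

```python
def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


def dqn_target(reward, done, target_q_next, gamma=0.99):
    """Standard DQN: the target net both selects and evaluates."""
    return reward + gamma * max(target_q_next) * (1 - done)


def double_dqn_target(reward, done, policy_q_next, target_q_next, gamma=0.99):
    """Double DQN: the policy net selects, the target net evaluates."""
    best_action = argmax(policy_q_next)
    return reward + gamma * target_q_next[best_action] * (1 - done)


# Made-up Q-values for a single next_state
policy_q_next = [1.0, 2.0, 0.5, 0.0]
target_q_next = [3.0, 1.5, 0.2, 0.1]

standard = dqn_target(1.0, 0, target_q_next)                      # 1 + 0.99 * 3.0
double = double_dqn_target(1.0, 0, policy_q_next, target_q_next)  # 1 + 0.99 * 1.5
```

Because the evaluated action is chosen by a different network, the Double DQN target can never exceed the standard one for the same inputs, which is where the reduced overestimation comes from.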
# Split network into value and advantage streams
V(s) = state_value_stream(features)
A(s,a) = advantage_stream(features)
Q(s,a) = V(s) + (A(s,a) - mean(A(s,:)))

Benefit: Better state value estimation
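
The aggregation step can be checked by hand with a small sketch. The function name `dueling_q` and the input numbers are illustrative; in a real network `value` and `advantages` would come from the two streams.

```python
def dueling_q(value, advantages):
    """Q(s,a) = V(s) + (A(s,a) - mean(A(s,:)))."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + (a - mean_adv) for a in advantages]


q = dueling_q(value=2.0, advantages=[1.0, 3.0, 0.0, 0.0])
shifted = dueling_q(value=2.0, advantages=[11.0, 13.0, 10.0, 10.0])
print(q)  # [2.0, 4.0, 1.0, 1.0]
```

Subtracting the mean advantage makes the Q-values average to V(s), and adding a constant to every advantage leaves the result unchanged, so the two streams cannot trade offsets with each other.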
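# Sample high-error transitions more often
# (epsilon here is a small constant that keeps every priority above zero,
# not the exploration rate)
priority = abs(td_error) + epsilon
batch = memory.sample(batch_size, priorities)

Benefit: Learns from important experiences faster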
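
A simplified sketch of proportional prioritized sampling follows. The helper names and the `alpha` exponent are assumptions for illustration, and the importance-sampling correction used in full prioritized replay is left out. Note that `random.choices` samples with replacement.

```python
import random
from collections import Counter


def priorities_from_td(td_errors, eps=1e-3):
    # eps keeps zero-error transitions sampleable
    return [abs(e) + eps for e in td_errors]


def sample_prioritized(transitions, priorities, batch_size, alpha=0.6, rng=random):
    # alpha = 0 gives uniform sampling; alpha = 1 is fully proportional
    weights = [p ** alpha for p in priorities]
    return rng.choices(transitions, weights=weights, k=batch_size)


transitions = ["a", "b", "c"]
priorities = priorities_from_td([0.0, 0.0, 5.0])
rng = random.Random(0)
counts = Counter(sample_prioritized(transitions, priorities, 1000, rng=rng))
```

Here transition `"c"` has by far the largest TD error, so it dominates the sampled batch while `"a"` and `"b"` can still appear occasionally.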
from dqn_solver import DQNMazeSolver
from env.maze_env import MazeEnv
# Setup
env = MazeEnv()
solver = DQNMazeSolver(env, log_dir='logs/dqn')
# Train
solver.train(num_episodes=500, verbose=True)
# Evaluate
results = solver.evaluate(num_episodes=10)
# Visualize
solver.visualize_training()

solver.gamma = 0.95  # Less long-term
solver.epsilon_decay = 0.99  # Exploit sooner
solver.lr = 0.005  # Learn faster

solver.gamma = 0.99  # More planning
solver.epsilon_decay = 0.995  # Explore longer
solver.target_update = 20  # Less frequent updates

solver.epsilon_min = 0.05  # Keep exploring
solver.epsilon_decay = 0.999  # Decay slower

- Training becomes unstable
- Q-values explode or collapse
- Overfits to recent experiences
- Forgets important lessons
- If rewards > 100: normalize
- If sparse: add shaping
- High LR causes oscillation
- Reduce to 0.0001 or lower
- Gets stuck in local optima
- Increase epsilon_min or decay slower
# Should increase
avg_reward = mean(episode_rewards[-100:])
# Should decrease
loss = mean(losses[-100:])
# Should decrease
epsilon = current_epsilon
# Should increase
success_rate = wins / total_episodes

- Reward trending upward
- Loss trending downward
- Success rate improving
- Q-values stabilizing
- Reward not improving after 200 episodes
- Loss staying high or increasing
- Q-values exploding (>1000)
- No successful episodes
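
The warning signs above can be turned into a rough automated check. This is a sketch under stated assumptions: the function `training_warnings` is hypothetical, the 200-episode patience and the 1000 Q-value limit come from the list above, and the 100-episode window mirrors the averaging used for the metrics.

```python
from statistics import mean


def training_warnings(episode_rewards, losses, q_values, successes,
                      window=100, patience=200, q_limit=1000.0):
    """Return a list of warning strings for the signs listed above."""
    warnings = []
    if len(episode_rewards) >= patience + window:
        recent = mean(episode_rewards[-window:])
        earlier = mean(episode_rewards[-(patience + window):-patience])
        if recent <= earlier:
            warnings.append("reward not improving")
    if len(losses) >= 2 * window:
        if mean(losses[-window:]) >= mean(losses[-2 * window:-window]):
            warnings.append("loss not decreasing")
    if q_values and max(abs(q) for q in q_values) > q_limit:
        warnings.append("Q-values exploding")
    if successes == 0:
        warnings.append("no successful episodes")
    return warnings


healthy = training_warnings(
    episode_rewards=[float(i) for i in range(300)],
    losses=[1.0 / (i + 1) for i in range(200)],
    q_values=[1.0, 2.0],
    successes=5,
)
stalled = training_warnings(
    episode_rewards=[1.0] * 300,
    losses=[1.0] * 200,
    q_values=[5000.0],
    successes=0,
)
```

The healthy run returns an empty list, while the stalled run triggers all four warnings. The thresholds are starting points and likely need tuning per environment.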
- Original DQN Paper
- Rainbow DQN - Combines improvements
- OpenAI Spinning Up