Rishabhmannu/reinforcement-learning-experiments

Training RL Agents on a MacBook — From CartPole to Atari

A hands-on reinforcement learning project where I trained agents across 4 environments using 3 algorithms, all on an Apple M4 Pro MacBook. This was a learning exercise in building RL pipelines with proper ML engineering practices — multi-seed evaluation, experiment tracking, reward shaping, and reproducible configs.

| CartPole-v1 | ALE/Breakout-v5 | CarRacing-v3 |
| --- | --- | --- |
| A2C: 499.9 avg reward | A2C: 31.4 avg reward | PPO: 688.9 avg reward |

What I Learned

  • How PPO, A2C, and DQN actually work — not just the theory, but tuning them on different environments and seeing where each one shines
  • Why multi-seed evaluation matters — a single seed can be misleading (PPO solves CartPole in 4/5 seeds but fails on 1)
  • How reward shaping can make or break learning — same algorithm, same environment, 7x performance difference just from changing the reward function
  • The practical side of training CNN policies on Apple Silicon (MPS) — when it helps and when CPU is actually faster
  • Building a clean, modular codebase that separates config, training, evaluation, and visualization

Results

9 experiments, 4 environments, 3 algorithms. All evaluated with deterministic policies across multiple seeds.

| Experiment | Algorithm | Environment | Mean Reward | 95% CI | Median | IQM |
| --- | --- | --- | --- | --- | --- | --- |
| cartpole_a2c | A2C | CartPole-v1 | 499.9 ± 1.4 | [499.8, 500.1] | 500.0 | 500.0 |
| cartpole_ppo | PPO | CartPole-v1 | 438.6 ± 140.8 | [319.2, 558.0] | 500.0 | 500.0 |
| cartpole_dqn | DQN | CartPole-v1 | 360.6 ± 128.6 | [252.5, 468.6] | 362.0 | 372.3 |
| breakout_a2c | A2C | Breakout-v5 | 31.4 ± 9.8 | [28.4, 34.4] | 30.0 | 30.0 |
| breakout_ppo | PPO | Breakout-v5 | 28.1 ± 8.4 | [24.9, 31.3] | 28.0 | 27.4 |
| carracing_ppo | PPO | CarRacing-v3 | 688.9 ± 243.8 | [680.8, 697.0] | 819.4 | 768.8 |
| gridnav_shaped | PPO | GridNav-v0 | 0.70 ± 0.36 | [0.66, 0.74] | 0.63 | 0.65 |
| gridnav_sparse | PPO | GridNav-v0 | 0.10 ± 0.30 | [0.07, 0.13] | 0.00 | 0.00 |
| gridnav_dense | PPO | GridNav-v0 | -35.9 ± 32.9 | [-39.7, -32.0] | -25.6 | -29.8 |

Evaluation setup: 100 episodes/seed for CartPole & GridNav (5 seeds each), 30 episodes/seed for Breakout & CarRacing (3 seeds each).
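
The gap between the mean and the median/IQM in the cartpole_ppo row is what a few collapsed episodes look like in summary statistics. Below is a minimal sketch of how such a row could be computed from a list of episode returns using only the standard library. It assumes a normal-approximation confidence interval and a simple truncating IQM; the repository's `scripts/evaluate.py` may aggregate differently (for example per seed, or with a bootstrap), and the returns here are made up for illustration.

```python
import math
import statistics


def interquartile_mean(values):
    """Mean of the middle 50% of the sorted values (IQM)."""
    ordered = sorted(values)
    cut = len(ordered) // 4  # drop this many values from each tail (truncating)
    middle = ordered[cut:len(ordered) - cut]
    return statistics.fmean(middle)


def summarize(returns, z=1.96):
    """Mean, sample std, normal-approximation 95% CI, median, and IQM."""
    n = len(returns)
    mean = statistics.fmean(returns)
    std = statistics.stdev(returns) if n > 1 else 0.0
    half_width = z * std / math.sqrt(n)
    return {
        "mean": mean,
        "std": std,
        "ci95": (mean - half_width, mean + half_width),
        "median": statistics.median(returns),
        "iqm": interquartile_mean(returns),
    }


# Hypothetical returns: most episodes hit the CartPole cap, one collapses.
episode_returns = [500.0, 500.0, 500.0, 500.0, 500.0, 500.0, 500.0, 120.0]
summary = summarize(episode_returns)
print(summary)
```

The single bad episode pulls the mean down and widens the interval, while the median and IQM stay at the cap. That is why the table reports all of them side by side.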

Environments

| Environment | Type | Observation | Actions | What I Explored |
| --- | --- | --- | --- | --- |
| CartPole-v1 | Classic Control | 4D float | Discrete(2) | Algorithm comparison (PPO vs A2C vs DQN) |
| ALE/Breakout-v5 | Atari | (4, 84, 84) frames | Discrete(4) | CNN policies, frame stacking, MPS training |
| CarRacing-v3 | Continuous Control | 96x96 RGB | Box(3) | Continuous action spaces, high-variance evaluation |
| GridNav-v0 | Custom (built from scratch) | 68D float | Discrete(4) | Reward shaping: sparse vs dense vs potential-based |

Algorithms

  • PPO (Proximal Policy Optimization): on-policy, clipped surrogate objective. The most versatile of the three, and the only one I ran on all 4 environments.
  • A2C (Advantage Actor-Critic): on-policy, synchronous actor-critic updates with advantage estimates. Fastest to converge on CartPole and Breakout at moderate training budgets.
  • DQN (Deep Q-Network): off-policy with a replay buffer. More sample-efficient in theory, but it had the lowest mean reward on CartPole (360.6 ± 128.6) and varied widely across seeds.
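
To make the "clipped surrogate objective" concrete, here is a small sketch of the per-sample PPO objective in plain Python. It is an illustration of the formula, not the Stable-Baselines3 implementation. The function name and the example numbers are made up, and `clip_eps=0.2` is just a commonly used value.

```python
def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """PPO's per-sample objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)


# ratio = pi_new(a|s) / pi_old(a|s); the values below are illustrative only.
good_action_overshoot = clipped_surrogate(ratio=1.5, advantage=1.0)
bad_action_overcorrected = clipped_surrogate(ratio=0.5, advantage=-1.0)
inside_clip_range = clipped_surrogate(ratio=1.1, advantage=2.0)
print(good_action_overshoot, bad_action_overcorrected, inside_clip_range)
```

In the first two calls the ratio has already moved past the clip range in the direction the advantage favors, so the objective is capped and stops rewarding further movement. In the third call the ratio is inside the range and the objective is the plain `ratio * advantage`.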

Reward Shaping Experiment

One of the more interesting parts of this project. I built a custom 8x8 grid navigation environment with 3 reward functions and trained the same algorithm (PPO) with the same hyperparameters on each:

| Reward Mode | Mean Reward | What Happened |
| --- | --- | --- |
| Shaped (potential-based) | 0.70 | Agent reliably reaches the goal (~70% of episodes). Consistent with Ng et al. (1999). |
| Sparse (+1 at goal only) | 0.10 | Agent rarely stumbles on the reward by chance: the classic exploration problem. |
| Dense (distance penalty per step) | -35.9 | Agent learns to stay still to avoid penalties. Naive reward engineering backfires. |

Same algorithm, same environment, same hyperparameters — but the reward function alone determines success or failure.
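
For readers unfamiliar with potential-based shaping, the idea from Ng et al. (1999) is to add F(s, s') = γΦ(s') − Φ(s) to the environment reward, which leaves the optimal policy unchanged. The sketch below shows one way this could look on an 8x8 grid. The goal cell, the discount factor, and the choice of Φ as negative Manhattan distance are all assumptions made for illustration, not details taken from the GridNav-v0 source.

```python
GOAL = (7, 7)   # assumed goal cell on the 8x8 grid
GAMMA = 0.99    # assumed discount factor


def potential(cell):
    """Phi(s): negative Manhattan distance to the goal (an assumed choice)."""
    return -(abs(GOAL[0] - cell[0]) + abs(GOAL[1] - cell[1]))


def shaped_reward(cell, next_cell, gamma=GAMMA):
    """Sparse goal reward plus the shaping term F = gamma * Phi(s') - Phi(s)."""
    sparse = 1.0 if next_cell == GOAL else 0.0
    return sparse + gamma * potential(next_cell) - potential(cell)


toward = shaped_reward((0, 0), (0, 1))   # one step closer to the goal
away = shaped_reward((0, 1), (0, 0))     # one step farther from the goal
print(toward, away)
```

Moving toward the goal earns a positive shaping bonus and moving away earns a negative one, so the agent gets a learning signal long before it ever reaches the +1.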

Project Structure

├── configs/               # YAML experiment configs (one per experiment)
├── src/                   # Importable Python package
│   ├── config.py          # ExperimentConfig dataclass + YAML loader
│   ├── environments/      # Custom GridNavigationEnv
│   ├── training/          # Trainer, callbacks (W&B + CSV logging)
│   ├── evaluation/        # Multi-seed eval, statistics, GIF recording
│   └── visualization/     # Learning curves, comparisons, distributions
├── scripts/               # CLI entry points
│   ├── train.py           # Train agents from config
│   ├── evaluate.py        # Evaluate with 95% CI, IQM
│   ├── tune.py            # Optuna hyperparameter search
│   └── compare.py         # Cross-algorithm comparison
├── notebooks/             # Analysis notebooks (load pre-computed results)
│   ├── 01_cartpole_fundamentals.ipynb
│   ├── 02_atari_breakout.ipynb
│   ├── 03_car_racing.ipynb
│   ├── 04_custom_environment.ipynb
│   └── 05_algorithm_comparison.ipynb
├── videos/                # Agent demo GIFs
├── tests/                 # pytest tests for custom environment
└── results/               # Figures, metrics CSVs, eval summaries
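
As a rough picture of what an `ExperimentConfig` dataclass with a loader might look like, here is a hedged sketch. Every field name and default below is hypothetical and not the schema in `src/config.py`. The loader takes an already-parsed mapping so the example runs without PyYAML.

```python
from dataclasses import dataclass, field, fields


@dataclass
class ExperimentConfig:
    # Every field name below is illustrative, not the repository's schema.
    name: str
    algorithm: str
    env_id: str
    total_timesteps: int = 100_000
    seeds: list = field(default_factory=lambda: [0, 1, 2, 3, 4])
    hyperparams: dict = field(default_factory=dict)

    @classmethod
    def from_dict(cls, raw):
        """Build a config from a parsed YAML mapping, rejecting unknown keys."""
        known = {f.name for f in fields(cls)}
        unknown = set(raw) - known
        if unknown:
            raise ValueError(f"unknown config keys: {sorted(unknown)}")
        return cls(**raw)


# In practice `raw` would come from yaml.safe_load(...) with PyYAML installed.
raw = {"name": "cartpole_ppo", "algorithm": "PPO", "env_id": "CartPole-v1"}
config = ExperimentConfig.from_dict(raw)
print(config)
```

Rejecting unknown keys up front is a design choice: a misspelled hyperparameter fails loudly instead of silently falling back to a default.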

Quick Start

# 1. Create conda environment
conda create -n rl-env python=3.12 -y
conda activate rl-env
pip install -r requirements.txt

# 2. Run tests
pytest tests/ -v

# 3. Train CartPole with PPO (~5 min)
python scripts/train.py --config configs/cartpole_ppo.yaml

# 4. Quick smoke test (2K steps, no W&B)
python scripts/train.py --config configs/cartpole_ppo.yaml --no-wandb --seeds 42 --timesteps 2000

# 5. Evaluate trained model
python scripts/evaluate.py --config configs/cartpole_ppo.yaml --episodes 100

# 6. Compare algorithms
python scripts/compare.py --configs configs/cartpole_ppo.yaml configs/cartpole_dqn.yaml configs/cartpole_a2c.yaml

# 7. Watch the agent play live
python scripts/evaluate.py --config configs/cartpole_ppo.yaml --render

# 8. Hyperparameter tuning with Optuna
python scripts/tune.py --config configs/cartpole_ppo.yaml --n-trials 50
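
The flags used with `scripts/train.py` above (`--config`, `--no-wandb`, `--seeds`, `--timesteps`) suggest a small argparse interface. The following is a sketch of how such a parser could be wired, not the script's actual code. In particular, accepting several values after `--seeds` and treating the overrides as optional are assumptions.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Train agents from a YAML config")
    parser.add_argument("--config", required=True, help="path to experiment YAML")
    parser.add_argument("--no-wandb", action="store_true", help="disable W&B logging")
    # Assumption: one or more seeds may be listed; None means "use the config".
    parser.add_argument("--seeds", type=int, nargs="+", default=None,
                        help="override the seeds listed in the config")
    parser.add_argument("--timesteps", type=int, default=None,
                        help="override the training budget from the config")
    return parser


# Parse the smoke-test command from step 4.
args = build_parser().parse_args(
    ["--config", "configs/cartpole_ppo.yaml", "--no-wandb",
     "--seeds", "42", "--timesteps", "2000"]
)
print(args)
```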

Training All Experiments

# CartPole (~5 min each, 5 seeds)
python scripts/train.py --config configs/cartpole_ppo.yaml
python scripts/train.py --config configs/cartpole_dqn.yaml
python scripts/train.py --config configs/cartpole_a2c.yaml

# GridNav reward shaping (~10 min each, 5 seeds)
python scripts/train.py --config configs/gridnav_ppo_dense.yaml
python scripts/train.py --config configs/gridnav_ppo_sparse.yaml
python scripts/train.py --config configs/gridnav_ppo_shaped.yaml

# Atari Breakout (~1-2 hours each, 3 seeds, MPS)
python scripts/train.py --config configs/breakout_a2c.yaml
python scripts/train.py --config configs/breakout_ppo.yaml

# Car Racing (~1-2 hours, 3 seeds, MPS)
python scripts/train.py --config configs/carracing_ppo.yaml

Hardware

Trained entirely on an Apple M4 Pro MacBook Pro.

  • CPU for MLP policies (CartPole, GridNav) — GPU data transfer overhead exceeds benefit for small networks
  • MPS for CNN policies (Breakout, CarRacing) — ~2-3x speedup over CPU for convolutional workloads
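
The rule of thumb above can be captured in a small helper. This is a sketch under the assumption that the policy is identified by its Stable-Baselines3 name (`"MlpPolicy"` or `"CnnPolicy"`); the function `pick_device` is hypothetical. It checks `torch.backends.mps.is_available()` when PyTorch is installed and falls back to CPU otherwise.

```python
def pick_device(policy, mps_available=None):
    """Return "mps" for CNN policies when MPS is usable, otherwise "cpu"."""
    if mps_available is None:
        try:
            import torch
            mps_available = torch.backends.mps.is_available()
        except ImportError:
            mps_available = False
    if policy == "CnnPolicy" and mps_available:
        return "mps"
    return "cpu"  # small MLPs: transfer overhead tends to outweigh GPU gains


print(pick_device("MlpPolicy", mps_available=True))
print(pick_device("CnnPolicy", mps_available=True))
```

The returned string can be passed as the `device` argument when constructing a Stable-Baselines3 algorithm.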

Tech Stack

Python 3.12 · PyTorch 2.10 · Stable-Baselines3 2.7 · Gymnasium 1.2 · Weights & Biases · Optuna · Seaborn · Matplotlib

Future Improvements

  • Train Atari agents for longer (10M+ steps) — PPO typically surpasses A2C with more training budget
  • Try additional algorithms (SAC, TD3) for continuous control environments
  • Run full Optuna sweeps across all experiments, not just CartPole
  • Add more complex custom environments (multi-agent, partial observability)
  • Deploy a trained agent as an interactive web demo

For learning and for fun.
