A hands-on reinforcement learning project where I trained agents across 4 environments using 3 algorithms, all on an Apple M4 Pro MacBook. This was a learning exercise in building RL pipelines with proper ML engineering practices — multi-seed evaluation, experiment tracking, reward shaping, and reproducible configs.
| CartPole-v1 | ALE/Breakout-v5 | CarRacing-v3 |
|---|---|---|
| A2C: 499.9 avg reward | A2C: 31.4 avg reward | PPO: 688.9 avg reward |
- How PPO, A2C, and DQN actually work — not just the theory, but tuning them on different environments and seeing where each one shines
- Why multi-seed evaluation matters — a single seed can be misleading (PPO solves CartPole in 4/5 seeds but fails on 1)
- How reward shaping can make or break learning — same algorithm, same environment, 7x performance difference just from changing the reward function
- The practical side of training CNN policies on Apple Silicon (MPS) — when it helps and when CPU is actually faster
- Building a clean, modular codebase that separates config, training, evaluation, and visualization
9 experiments, 4 environments, 3 algorithms. All evaluated with deterministic policies across multiple seeds.
| Experiment | Algorithm | Environment | Mean Reward | 95% CI | Median | IQM |
|---|---|---|---|---|---|---|
| cartpole_a2c | A2C | CartPole-v1 | 499.9 ± 1.4 | [499.8, 500.1] | 500.0 | 500.0 |
| cartpole_ppo | PPO | CartPole-v1 | 438.6 ± 140.8 | [319.2, 558.0] | 500.0 | 500.0 |
| cartpole_dqn | DQN | CartPole-v1 | 360.6 ± 128.6 | [252.5, 468.6] | 362.0 | 372.3 |
| breakout_a2c | A2C | ALE/Breakout-v5 | 31.4 ± 9.8 | [28.4, 34.4] | 30.0 | 30.0 |
| breakout_ppo | PPO | ALE/Breakout-v5 | 28.1 ± 8.4 | [24.9, 31.3] | 28.0 | 27.4 |
| carracing_ppo | PPO | CarRacing-v3 | 688.9 ± 243.8 | [680.8, 697.0] | 819.4 | 768.8 |
| gridnav_shaped | PPO | GridNav-v0 | 0.70 ± 0.36 | [0.66, 0.74] | 0.63 | 0.65 |
| gridnav_sparse | PPO | GridNav-v0 | 0.10 ± 0.30 | [0.07, 0.13] | 0.00 | 0.00 |
| gridnav_dense | PPO | GridNav-v0 | -35.9 ± 32.9 | [-39.7, -32.0] | -25.6 | -29.8 |
Evaluation setup: 100 episodes/seed for CartPole & GridNav (5 seeds each), 30 episodes/seed for Breakout & CarRacing (3 seeds each).
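The summary columns above can be reproduced from a flat list of episode returns. Here is a minimal sketch using only the standard library. It assumes a normal-approximation 95% CI and an IQM that drops the bottom and top quarter of the sorted returns; the actual `evaluate.py` may compute these differently, and the sample numbers are made up for illustration.

```python
import math
import statistics


def interquartile_mean(values):
    """Mean of the middle 50% of the sorted values (IQM)."""
    ordered = sorted(values)
    cut = len(ordered) // 4  # number of values dropped from each end
    middle = ordered[cut:len(ordered) - cut]
    return statistics.mean(middle)


def summarize(values, z=1.96):
    """Mean, sample std, normal-approximation 95% CI, median, and IQM."""
    n = len(values)
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (n - 1)
    half_width = z * std / math.sqrt(n)
    return {
        "mean": mean,
        "std": std,
        "ci95": (mean - half_width, mean + half_width),
        "median": statistics.median(values),
        "iqm": interquartile_mean(values),
    }


# Illustrative numbers only (not the project's data): four good runs, one failure.
returns = [500.0, 500.0, 500.0, 500.0, 200.0]
summary = summarize(returns)
print(summary)
```

With one failed run in five, the mean drops well below the median and IQM, which is why the table reports all three.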
| Environment | Type | Observation | Actions | What I Explored |
|---|---|---|---|---|
| CartPole-v1 | Classic Control | 4D float | Discrete(2) | Algorithm comparison (PPO vs A2C vs DQN) |
| ALE/Breakout-v5 | Atari | (4, 84, 84) frames | Discrete(4) | CNN policies, frame stacking, MPS training |
| CarRacing-v3 | Continuous Control | 96x96 RGB | Box(3) | Continuous action spaces, high-variance evaluation |
| GridNav-v0 | Custom (built from scratch) | 68D float | Discrete(4) | Reward shaping — sparse vs dense vs potential-based |
- PPO (Proximal Policy Optimization): on-policy, clipped surrogate objective. The most versatile of the three here, used across all 4 environments.
- A2C (Advantage Actor-Critic): on-policy, the synchronous variant of A3C. Fastest to converge on CartPole and Breakout at moderate training budgets.
- DQN (Deep Q-Network): off-policy with a replay buffer. More sample efficient in theory, but it had the lowest CartPole score here, with high variance (360.6 ± 128.6).
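To make "clipped surrogate objective" concrete, here is a small per-sample sketch of the PPO objective in plain Python. The `clip_range` of 0.2 is a commonly used default and an assumption on my part, not a value taken from this project's configs; real implementations apply the same formula to batched tensors.

```python
def clipped_surrogate(ratio, advantage, clip_range=0.2):
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A).

    ratio     -- pi_new(a|s) / pi_old(a|s)
    advantage -- advantage estimate for the sampled action
    """
    clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    return min(ratio * advantage, clipped_ratio * advantage)


# A large ratio with a positive advantage is capped at (1 + eps) * A ...
print(clipped_surrogate(1.5, 1.0))
# ... but with a negative advantage the unclipped (worse) term is kept.
print(clipped_surrogate(1.5, -1.0))
```

Taking the minimum makes the objective pessimistic: the policy gains nothing from moving the ratio further outside the clip range in the direction that looks beneficial.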
One of the more interesting parts of this project. I built a custom 8x8 grid navigation environment with 3 reward functions and trained the same algorithm (PPO) with the same hyperparameters on each:
| Reward Mode | Mean Reward | What Happened |
|---|---|---|
| Shaped (potential-based) | 0.70 | Agent reliably reaches the goal (~70% of episodes), consistent with the policy-invariance result of Ng et al. (1999). |
| Sparse (+1 at goal only) | 0.10 | Agent rarely stumbles on the reward by chance — the classic exploration problem. |
| Dense (distance penalty per step) | -35.9 | Agent learns to stay still to avoid penalties. Naive reward engineering backfires. |
Same algorithm, same environment, same hyperparameters — but the reward function alone determines success or failure.
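The three reward modes can be sketched as follows. This is an illustration of the idea, not the code in `GridNavigationEnv`: the goal cell, the discount factor, the penalty scale, and the choice of negative Manhattan distance as the potential are all assumptions.

```python
GOAL = (7, 7)   # hypothetical goal cell on the 8x8 grid
GAMMA = 0.99    # assumed discount factor


def manhattan(cell, goal=GOAL):
    """Manhattan distance from a (row, col) cell to the goal."""
    return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])


def potential(cell):
    """Phi(s): less negative closer to the goal, exactly 0 at the goal."""
    return -float(manhattan(cell))


def sparse_reward(next_cell):
    """+1 only when the agent lands on the goal."""
    return 1.0 if next_cell == GOAL else 0.0


def dense_reward(next_cell, scale=0.1):
    """Naive per-step penalty proportional to the remaining distance."""
    return -scale * manhattan(next_cell)


def shaped_reward(cell, next_cell, gamma=GAMMA):
    """Sparse reward plus the shaping term F(s, s') = gamma * Phi(s') - Phi(s)."""
    shaping = gamma * potential(next_cell) - potential(cell)
    return sparse_reward(next_cell) + shaping


# One step toward the goal versus one step away from it.
print(shaped_reward((0, 0), (0, 1)))
print(shaped_reward((0, 1), (0, 0)))
```

The shaping term pays for progress and charges for retreat, so the agent gets a learning signal on every step. Because it is a difference of potentials, Ng et al. (1999) show it leaves the optimal policy unchanged.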
```
├── configs/                  # YAML experiment configs (one per experiment)
├── src/                      # Importable Python package
│   ├── config.py             # ExperimentConfig dataclass + YAML loader
│   ├── environments/         # Custom GridNavigationEnv
│   ├── training/             # Trainer, callbacks (W&B + CSV logging)
│   ├── evaluation/           # Multi-seed eval, statistics, GIF recording
│   └── visualization/        # Learning curves, comparisons, distributions
├── scripts/                  # CLI entry points
│   ├── train.py              # Train agents from config
│   ├── evaluate.py           # Evaluate with 95% CI, IQM
│   ├── tune.py               # Optuna hyperparameter search
│   └── compare.py            # Cross-algorithm comparison
├── notebooks/                # Analysis notebooks (load pre-computed results)
│   ├── 01_cartpole_fundamentals.ipynb
│   ├── 02_atari_breakout.ipynb
│   ├── 03_car_racing.ipynb
│   ├── 04_custom_environment.ipynb
│   └── 05_algorithm_comparison.ipynb
├── videos/                   # Agent demo GIFs
├── tests/                    # pytest tests for custom environment
└── results/                  # Figures, metrics CSVs, eval summaries
```
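As a rough picture of what a config dataclass like the one in `src/config.py` might look like, here is a hedged sketch. Every field name and value below is hypothetical and may not match the real `ExperimentConfig`. To stay dependency-free, the example builds the config from a plain dict, which is the shape a YAML parser such as PyYAML's `yaml.safe_load` would return for a simple mapping.

```python
from dataclasses import dataclass, field


@dataclass
class ExperimentConfig:
    # All field names here are hypothetical, chosen for illustration.
    name: str
    algorithm: str
    env_id: str
    total_timesteps: int
    seeds: list = field(default_factory=lambda: [42])
    device: str = "cpu"

    def __post_init__(self):
        # Fail early on typos instead of partway through a long training run.
        if self.algorithm not in {"PPO", "A2C", "DQN"}:
            raise ValueError(f"unsupported algorithm: {self.algorithm}")
        if self.total_timesteps <= 0:
            raise ValueError("total_timesteps must be positive")


def config_from_dict(raw):
    """Build a config from the mapping a YAML parser would return."""
    return ExperimentConfig(**raw)


# Made-up values standing in for the parsed contents of a YAML file.
raw = {
    "name": "cartpole_ppo",
    "algorithm": "PPO",
    "env_id": "CartPole-v1",
    "total_timesteps": 100_000,
    "seeds": [0, 1, 2, 3, 4],
}
config = config_from_dict(raw)
print(config)
```

Validating in `__post_init__` keeps the training scripts free of config checks, and an unknown key in the file fails loudly because the dataclass constructor rejects unexpected arguments.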
```bash
# 1. Create conda environment
conda create -n rl-env python=3.12 -y
conda activate rl-env
pip install -r requirements.txt
# 2. Run tests
pytest tests/ -v
# 3. Train CartPole with PPO (~5 min)
python scripts/train.py --config configs/cartpole_ppo.yaml
# 4. Quick smoke test (2K steps, no W&B)
python scripts/train.py --config configs/cartpole_ppo.yaml --no-wandb --seeds 42 --timesteps 2000
# 5. Evaluate trained model
python scripts/evaluate.py --config configs/cartpole_ppo.yaml --episodes 100
# 6. Compare algorithms
python scripts/compare.py --configs configs/cartpole_ppo.yaml configs/cartpole_dqn.yaml configs/cartpole_a2c.yaml
# 7. Watch the agent play live
python scripts/evaluate.py --config configs/cartpole_ppo.yaml --render
# 8. Hyperparameter tuning with Optuna
python scripts/tune.py --config configs/cartpole_ppo.yaml --n-trials 50
```

```bash
# CartPole (~5 min each, 5 seeds)
python scripts/train.py --config configs/cartpole_ppo.yaml
python scripts/train.py --config configs/cartpole_dqn.yaml
python scripts/train.py --config configs/cartpole_a2c.yaml
# GridNav reward shaping (~10 min each, 5 seeds)
python scripts/train.py --config configs/gridnav_ppo_dense.yaml
python scripts/train.py --config configs/gridnav_ppo_sparse.yaml
python scripts/train.py --config configs/gridnav_ppo_shaped.yaml
# Atari Breakout (~1-2 hours each, 3 seeds, MPS)
python scripts/train.py --config configs/breakout_a2c.yaml
python scripts/train.py --config configs/breakout_ppo.yaml
# Car Racing (~1-2 hours, 3 seeds, MPS)
python scripts/train.py --config configs/carracing_ppo.yaml
```

Trained entirely on an Apple M4 Pro MacBook Pro.
- CPU for MLP policies (CartPole, GridNav) — GPU data transfer overhead exceeds benefit for small networks
- MPS for CNN policies (Breakout, CarRacing) — ~2-3x speedup over CPU for convolutional workloads
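The rule of thumb above is easy to encode. The sketch below is one way to do it, not the project's actual logic: `pick_device` and `mps_available` are hypothetical helper names, while `"MlpPolicy"` and `"CnnPolicy"` are the Stable-Baselines3 policy names and `torch.backends.mps.is_available()` is PyTorch's check for the MPS backend.

```python
def mps_available():
    """True if PyTorch is installed and reports a usable MPS backend."""
    try:
        import torch
    except ImportError:
        return False
    return torch.backends.mps.is_available()


def pick_device(policy, has_mps):
    """CNN policies go to MPS when it is present; everything else stays on CPU."""
    if policy == "CnnPolicy" and has_mps:
        return "mps"
    return "cpu"


for policy in ("MlpPolicy", "CnnPolicy"):
    print(policy, "->", pick_device(policy, mps_available()))
```

The returned string can be passed as the `device` argument when constructing a Stable-Baselines3 model. Keeping the availability check separate from the decision makes the rule testable on machines without Apple Silicon.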
Python 3.12 · PyTorch 2.10 · Stable-Baselines3 2.7 · Gymnasium 1.2 · Weights & Biases · Optuna · Seaborn · Matplotlib
- Train Atari agents for longer (10M+ steps) — PPO typically surpasses A2C with more training budget
- Try additional algorithms (SAC, TD3) for continuous control environments
- Run full Optuna sweeps across all experiments, not just CartPole
- Add more complex custom environments (multi-agent, partial observability)
- Deploy a trained agent as an interactive web demo
For learning and for fun.


