An implementation of the Q-Learning algorithm to solve the classic CliffWalking reinforcement learning environment.
This repository is a practical demonstration of Q-Learning, a fundamental model-free reinforcement learning technique. The goal is to train an agent to navigate "CliffWalking", a classic grid-world environment from OpenAI Gym. The agent must learn an optimal policy that reaches a designated goal state while avoiding the "cliff" cells, which reset its progress. The project is intended as an educational resource for the core mechanics of Q-Learning: state-action value estimation, the exploration-exploitation trade-off under an epsilon-greedy policy, and policy convergence.
- Q-Learning Algorithm Implementation: A clear and concise implementation of the Q-Learning algorithm.
- CliffWalking Environment: Utilizes the popular CliffWalking-v1 environment from OpenAI Gym for a standardized problem setup.
- Epsilon-Greedy Policy: Incorporates an epsilon-greedy strategy to balance exploration of new actions and exploitation of known optimal actions.
- Hyperparameter Tuning: Demonstrates configurable learning rate (alpha), discount factor (gamma), and exploration rate (epsilon).
- Policy Visualization: Tracks and visualizes the agent's performance (e.g., total rewards per episode) during training.
- Optimal Policy Derivation: Learns and displays the optimal policy discovered by the agent to navigate the cliff-walking grid safely.
Follow these steps to set up the project and run the Q-Learning simulation.
- Python 3.7+
- pip (Python package installer)
- Clone the repository

      git clone https://github.com/Mayank-Kumar-Maurya/CliffWalking-Reinforcement-Q-Learning.git
      cd CliffWalking-Reinforcement-Q-Learning

- Install dependencies

  It's highly recommended to use a virtual environment to manage dependencies.

      # Create a virtual environment
      python -m venv venv

      # Activate the virtual environment
      # On macOS/Linux:
      source venv/bin/activate
      # On Windows:
      .\venv\Scripts\activate

      # Install the required packages
      pip install numpy gym jupyter

- Start Jupyter Notebook

  After installing dependencies and activating the virtual environment, start the Jupyter Notebook server:

      jupyter notebook

- Open and Run the Notebook

  Your web browser should open a new tab with the Jupyter interface. Navigate to and open Q_Learning.ipynb. You can then execute the cells sequentially to observe the Q-Learning agent's training process, policy convergence, and results.
CliffWalking-Reinforcement-Q-Learning/
├── Q_Learning.ipynb # Main Jupyter Notebook containing the Q-Learning implementation
└── README.md # Project README file
The Q-Learning algorithm's behavior is governed by several hyperparameters, which are defined and can be adjusted within the Q_Learning.ipynb notebook:
- alpha (Learning Rate): Determines how much new information overrides old information. (e.g., 0.1)
- gamma (Discount Factor): Controls the importance of future rewards. (e.g., 0.99)
- epsilon (Exploration Rate): The probability of taking a random action instead of the greedy action. This value typically decays over time. (e.g., initial 1.0, decay factor 0.999)
- n_episodes: The total number of episodes for which the agent will be trained. (e.g., 5000)
Experimenting with these parameters can significantly impact the agent's learning speed and the quality of the learned policy.
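To get a feel for how the exploration rate interacts with the episode count, the sketch below plugs in the example values listed above. It assumes epsilon is multiplied by the decay factor once per episode; that schedule, and the helper name epsilon_at, are illustrative assumptions rather than a description of the notebook's code.

```python
# Example values from the list above; the notebook's actual settings may differ.
alpha = 0.1            # learning rate
gamma = 0.99           # discount factor
epsilon_start = 1.0    # initial exploration rate
epsilon_decay = 0.999  # multiplicative decay factor
n_episodes = 5000      # number of training episodes


def epsilon_at(episode, start=epsilon_start, decay=epsilon_decay):
    """Exploration rate after `episode` decay steps.

    Assumes one multiplicative decay per episode (hypothetical schedule).
    """
    return start * decay ** episode


for ep in (0, 1000, 2500, n_episodes):
    print(f"episode {ep:>4}: epsilon = {epsilon_at(ep):.4f}")
```

Under this assumed schedule, epsilon falls below 0.01 by the final episode, so the agent acts almost purely greedily late in training. A slower decay or a lower bound on epsilon keeps some exploration alive for longer.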
Q-Learning is an off-policy temporal-difference control algorithm that learns the optimal action-value (Q-value) function. The Q-value, Q(s, a), is the expected cumulative discounted reward obtained by taking action a in state s and following the optimal policy thereafter.
The update rule for Q-Learning is:
Q(s, a) ← Q(s, a) + α [r + γ max_a' Q(s', a') - Q(s, a)]
Where:
- s: Current state
- a: Current action
- r: Immediate reward received after taking action a in state s
- s': Next state
- a': Possible actions in the next state s'
- α: Learning rate (alpha)
- γ: Discount factor (gamma)
- max_a' Q(s', a'): The maximum Q-value for the next state s' over all possible actions a'
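The update rule can be sketched as a single function. This is a minimal illustration, not the notebook's code: the function name q_update, the list-of-lists Q-table (used instead of a NumPy array so the example runs without dependencies), and the terminal flag that zeroes the bootstrap term at episode end are all assumptions made here.

```python
def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99, terminal=False):
    """Apply one Q-Learning update to Q[s][a] in place and return the TD error.

    Q is a table indexed as Q[state][action]. The `terminal` flag is an
    assumption of this sketch: when the episode has ended there is no
    next state to bootstrap from, so the max term is treated as zero.
    """
    best_next = 0.0 if terminal else max(Q[s_next])
    td_target = r + gamma * best_next   # r + γ max_a' Q(s', a')
    td_error = td_target - Q[s][a]      # bracketed term in the update rule
    Q[s][a] += alpha * td_error
    return td_error


# Tiny worked example: 2 states, 2 actions, Q initialised to zero,
# except that state 1 already has one action valued at 10.
Q = [[0.0, 0.0], [0.0, 10.0]]
q_update(Q, s=0, a=0, r=-1.0, s_next=1)
print(round(Q[0][0], 4))
```

In the worked example the target is -1 + 0.99 × 10 = 8.9, so Q[0][0] moves one tenth of the way from 0 toward 8.9.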
To balance exploration and exploitation, this implementation uses an epsilon-greedy policy. At each step, the agent chooses:
- A random action with probability epsilon (exploration).
- The action with the highest Q-value for the current state with probability (1 - epsilon) (exploitation).
The epsilon value typically starts high and decays over time, allowing the agent to explore more initially and then exploit its learned knowledge as training progresses.
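The selection rule described above can be sketched in a few lines. The function name epsilon_greedy and the use of Python's random module are choices made for this example; the notebook may implement the same idea differently (for instance with NumPy).

```python
import random


def epsilon_greedy(q_row, epsilon, rng=random):
    """Pick an action index from one row of the Q-table.

    q_row holds the Q-values of every action in the current state.
    With probability epsilon a uniformly random action is returned;
    otherwise the first action with the highest Q-value is returned.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))  # explore
    return q_row.index(max(q_row))        # exploit


q_row = [-3.0, -1.5, -7.0, -2.0]  # hypothetical Q-values for 4 actions
print(epsilon_greedy(q_row, epsilon=0.0))  # always greedy
print(epsilon_greedy(q_row, epsilon=1.0))  # always random
```

Ties are broken in favour of the lowest action index here. Breaking ties at random is a common alternative when many Q-values are still equal early in training.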
Contributions are welcome! If you have suggestions for improvements, bug fixes, or new features, please feel free to:
- Fork the repository.
- Create a new branch (git checkout -b feature/your-feature).
- Make your changes.
- Commit your changes (git commit -m 'Add new feature').
- Push to the branch (git push origin feature/your-feature).
- Open a Pull Request.
- OpenAI Gym: For providing the CliffWalking-v1 environment, a valuable tool for reinforcement learning research and education.
- NumPy: An essential library for numerical operations in Python.
- 🐛 Issues: GitHub Issues
⭐ Star this repo if you find it helpful!
Made with ❤️ by Mayank Kumar Maurya