Commit ede4941 ("module 1 notes")
Parent: bb69ba6
9 files changed: 1092 additions & 5 deletions
Gemfile.lock

Lines changed: 0 additions & 4 deletions

```diff
@@ -110,9 +110,6 @@ GEM
       jekyll (>= 3.8, < 5.0)
     jekyll-sitemap (1.4.0)
       jekyll (>= 3.7, < 5.0)
-    jekyll-toc (0.19.0)
-      jekyll (>= 3.9)
-      nokogiri (~> 1.12)
     jekyll-watch (2.2.1)
       listen (~> 3.0)
     json (2.12.2)
@@ -225,7 +222,6 @@ DEPENDENCIES
   jekyll-include-cache
   jekyll-seo-tag
   jekyll-sitemap
-  jekyll-toc
   just-the-docs (~> 0.10.0)

BUNDLED WITH
```
Lines changed: 149 additions & 0 deletions

---
title: "What is CRISP-DM?"
parent: "Module 1: Introduction to Machine Learning"
nav_order: 4
---
# What is CRISP-DM?

> These notes are based on the video [ML Zoomcamp 1.4 - CRISP-DM](https://youtu.be/dCa3JvmJbr0?si=QixEZxWzDeCnSvCq)

<iframe width="560" height="315" src="https://www.youtube.com/embed/dCa3JvmJbr0?si=QixEZxWzDeCnSvCq" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>

CRISP-DM stands for **Cross-Industry Standard Process for Data Mining**, a methodology for organizing machine learning projects. Although it was developed in the 1990s, it has stood the test of time and remains relevant for modern ML projects with minimal modifications: it gives you a repeatable process for understanding the problem, the data, and the model, and for building a model that is accurate and reliable.

This methodology structures the entire ML workflow, from problem understanding to deployment, through six key steps:
1. Business Understanding
2. Data Understanding
3. Data Preparation
4. Modeling
5. Evaluation
6. Deployment

Throughout this lesson, we'll use a spam detection example to illustrate how each step applies in practice.
## The Spam Detection Example

Our example involves creating a system that identifies spam emails:

- The system receives an email
- Features are extracted from the email
- A model processes these features
- The model outputs a score (e.g., the probability that the email is spam)
- If the score exceeds a threshold (e.g., 50%), the email goes to the spam folder; otherwise, it goes to the inbox
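This flow can be sketched in a few lines of Python. Note that `extract_features` and the scoring function here are hypothetical stand-ins, not part of any real library:

```python
# Sketch of the spam-detection flow: features -> score -> threshold -> folder.
# extract_features and score_fn are hypothetical stand-ins.

def extract_features(email: str) -> list[float]:
    """Turn a raw email into numeric features."""
    return [
        float("deposit" in email.lower()),  # suspicious word present?
        float(len(email)),                  # email length
    ]

def classify(email: str, score_fn, threshold: float = 0.5) -> str:
    """Route an email based on the model's spam score."""
    score = score_fn(extract_features(email))  # P(spam), in [0, 1]
    return "spam" if score >= threshold else "inbox"
```

The model itself is just a function from features to a score; the rest of the lesson is about how to build it well.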
Now let's explore each step of CRISP-DM using this example.

## Business Understanding

The first step involves identifying and understanding the problem we want to solve. Key activities include:

- **Problem identification**: For our spam example, we need to understand why spam detection matters. Are users complaining about spam? How many users? How severe is the problem?

- **Solution approach assessment**: We must determine whether machine learning is the right tool for this problem or whether simpler approaches (like rule-based systems) would suffice.

- **Success metric definition**: It's crucial to establish measurable goals. Instead of vaguely saying "reduce spam," we should specify "reduce spam messages by 50%." This concrete metric helps us evaluate success later.

The business understanding step ensures we're solving the right problem with appropriate tools and have clear success criteria.
## Data Understanding

Once we've defined the problem, we need to understand what data is available to solve it. This step involves:

- **Data availability assessment**: For spam detection, we might have data from users clicking a "mark as spam" button.

- **Data quality evaluation**: We need to verify whether the data is reliable. Do we consistently record when users mark emails as spam? Are there cases where users incorrectly mark legitimate emails as spam?

- **Data volume assessment**: Is the dataset large enough for machine learning? If we only have 10 records, we might need to collect more data before proceeding.

This step might reveal issues that require revisiting the business understanding step. For example, if we discover that our data tracking is unreliable, we might need to redefine our approach.
## Data Preparation

After confirming we have sufficient, reliable data, we transform it into a format suitable for machine learning algorithms:

- **Data cleaning**: Remove noise and errors, such as instances where users accidentally marked legitimate emails as spam.

- **Feature extraction**: Convert raw data into features that algorithms can process. For spam detection, we might extract features like:
  - Sender information
  - Presence of specific words (e.g., "deposit")
  - Email length
  - Number of recipients
  - Other relevant characteristics

- **Pipeline building**: Create a sequence of transformations that convert raw data into a clean, tabular format with features (X) and target variables (y).

The goal is to produce data in the standard format we discussed previously: a feature matrix X and a target vector y.
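As a rough sketch of what this step produces (with made-up records and feature names), the output might look like this in pandas:

```python
import pandas as pd

# Hypothetical raw records: email fields plus the user's "mark as spam" label.
raw = pd.DataFrame({
    "sender": ["promo@bank.example", "alice@example.com"],
    "body": ["Claim your deposit now!", "Lunch tomorrow?"],
    "marked_spam": [1, 0],
})

# Feature extraction: turn raw text into numeric columns.
features = pd.DataFrame({
    "has_deposit": raw["body"].str.lower().str.contains("deposit").astype(int),
    "body_length": raw["body"].str.len(),
})

X = features            # feature matrix
y = raw["marked_spam"]  # target vector
```

In a real pipeline these transformations would be applied consistently to both training data and incoming emails.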
## Modeling

This is where actual machine learning happens:

- **Model selection**: Try different algorithms (logistic regression, decision trees, neural networks, etc.) to see which performs best on our data.

- **Model training**: Train these models on our prepared data.

- **Model comparison**: Compare their performance to select the most effective one.

Often during this step, we discover that our features are insufficient or that there are data issues, requiring us to return to the data preparation step. This iterative process helps refine our approach.
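A minimal sketch of this try-and-compare loop, using scikit-learn (which we install later in the course) on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for our prepared X and y.
X, y = make_classification(n_samples=200, n_features=5, random_state=1)

# Train a few candidate models and compare cross-validated accuracy.
models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=1),
}
scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in models.items()
}
best_model = max(scores, key=scores.get)
```

The specific models and scoring metric would depend on the problem; the point is that model comparison is a loop over candidates against a shared evaluation.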
## Evaluation

After selecting the best model, we need to evaluate whether it meets our business goals:

- **Goal assessment**: Return to the business understanding step and check whether our model achieves the metrics we established (e.g., reducing spam by 50%).

- **Success determination**: If we aimed for a 50% reduction but only achieved 30%, we need to decide whether this is acceptable or requires further iteration.

- **Project viability**: Based on the results, we might continue improving the model, revise our goals, or determine that the project isn't viable.

In modern ML workflows, evaluation often happens alongside deployment through online testing with real users.
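The success-determination check itself is simple arithmetic. With hypothetical counts matching the 50%-goal, 30%-achieved example above:

```python
# Hypothetical measurements for the evaluation step.
spam_before = 1000  # spam emails reaching inboxes per day, before the model
spam_after = 700    # after introducing the model

reduction = (spam_before - spam_after) / spam_before  # 0.30
meets_goal = reduction >= 0.50  # goal set during business understanding

# A 30% reduction against a 50% goal: iterate further or revisit the goal.
```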
## Deployment

The final step involves putting the model into production:

- **Gradual rollout**: Often, we first deploy to a small percentage of users (e.g., 5%) to evaluate performance before full deployment.

- **Engineering focus**: While previous steps emphasized machine learning, deployment focuses on engineering aspects:
  - Monitoring
  - Maintainability
  - Service quality
  - Reliability
  - Scalability

This ensures our model works reliably in real-world conditions.
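One common way to implement a gradual rollout is to bucket users deterministically by hashing their id, so each user consistently sees either the old or the new model. This is a sketch of the idea, not a scheme prescribed by the course:

```python
import hashlib

def in_rollout(user_id: str, fraction: float = 0.05) -> bool:
    """Deterministically route roughly `fraction` of users to the new model."""
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = digest[0] / 256  # map the first hash byte to [0, 1)
    return bucket < fraction
```

Because the decision is a pure function of the user id, the same user stays in the same group across requests, which keeps the online evaluation clean.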
## Iteration: The Continuous Cycle

Machine learning projects don't end with deployment. The CRISP-DM process is cyclical:

- After deployment, we learn from real-world performance
- We return to the business understanding step with new insights
- We refine our goals and approach
- We go through the cycle again to improve our solution

A best practice is to start simple:

1. Complete a quick iteration with a simple model
2. Deploy and learn from this initial version
3. Return to business understanding with new insights
4. Gradually increase complexity in subsequent iterations

This approach delivers value quickly while allowing for continuous improvement.
## Summary

The CRISP-DM methodology provides a structured approach to machine learning projects through six key steps:

1. **Business Understanding**: Define measurable goals and determine whether ML is appropriate
2. **Data Understanding**: Assess the available data for quality, reliability, and sufficiency
3. **Data Preparation**: Clean the data and extract features in a format suitable for ML
4. **Modeling**: Train and select the best-performing model
5. **Evaluation**: Verify whether the model meets the business goals
6. **Deployment**: Roll out the model to users with proper engineering practices

The process is iterative, with each cycle building on lessons from previous iterations. Starting simple and gradually increasing complexity allows for faster delivery of value while maintaining a path for continuous improvement.

In the next lesson, we'll dive deeper into the modeling step to explore how to select and evaluate different machine learning models.
Lines changed: 131 additions & 0 deletions

---
title: "GitHub Codespaces"
parent: "Module 1: Introduction to Machine Learning"
nav_order: 6
---
# GitHub Codespaces

> These notes are based on the video [ML Zoomcamp 1.6 - GitHub Codespaces](https://youtu.be/pqQFlV3f9Bo?si=dJUqRaIH8nlQHDwf)

<iframe width="560" height="315" src="https://www.youtube.com/embed/pqQFlV3f9Bo?si=dJUqRaIH8nlQHDwf" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>

GitHub Codespaces is a cloud-based development environment that requires minimal configuration. It provides a remote environment with most of the tools needed for the Machine Learning Zoomcamp course. The main advantages include:

- Almost no configuration required
- A remote environment with pre-installed tools
- Seamless integration with GitHub
- Accessible from anywhere
## Setting Up a Repository with Codespaces

### Creating a New Repository

1. Create a new repository on GitHub
   - Name it appropriately (e.g., "machine-learning-zoomcamp-homework")
   - Add a README file
   - Make it public
   - Add a Python .gitignore file
   - Click "Create repository"

### Launching Codespaces

1. Navigate to the repository
2. Click the "Code" button
3. Select the "Codespaces" tab
4. Click "Create codespace on main"
This will create a Visual Studio Code instance within your browser. You can either:

- Use it directly in the browser
- Open it in VS Code desktop by clicking the "Open in VS Code Desktop" button in the corner

## Working with Codespaces

### Basic Operations

- The environment feels like local development
- Files can be created and edited as usual
- The terminal is accessible via:
  - Ctrl+` (Control + backtick)
  - The View > Terminal menu
### Terminal Tips

For a cleaner terminal prompt, you can use:

```bash
PS1="> "
```

This shortens the prompt to just a ">" sign, giving you more space to see your commands.
### Git Operations

Git comes pre-configured in Codespaces, so the usual commands work out of the box:

```bash
git status
git commit -am "message"
git push
```
## Installing Required Libraries

Install the necessary Python libraries using pip:

```bash
pip install jupyter numpy pandas scikit-learn seaborn
```

Additional libraries like XGBoost and TensorFlow can be installed the same way when needed.
## Using Jupyter Notebooks

### Starting Jupyter

Launch Jupyter Notebook:

```bash
jupyter notebook
```

Codespaces automatically detects the service running on port 8888 and forwards it to your local machine.

### Accessing Jupyter

1. In Codespaces, look for the "Ports" tab
2. Find the forwarded port 8888
3. Click the link to open Jupyter in your browser
4. If prompted, copy the token (or the full URL) from the terminal and paste it into the browser
### Working with Notebooks

1. Create folders for organization (e.g., "01-intro")
2. Create new notebooks
3. Import libraries and start working:

```python
import pandas as pd

df = pd.read_csv('file.csv')
```
## Completing and Submitting Homework

1. Create and complete your homework notebook
2. Rename files as needed (this can be done directly in VS Code)
3. Commit and push your changes:

```bash
git add .
git commit -m "homework"
git push
```

4. Submit the GitHub repository URL in the course homework submission form
## Additional Tips

- Install the VS Code Python extension for better Python support
- When you first launch VS Code desktop, it will prompt you to install the Codespaces extension
- If you are not prompted, you can install it manually:
  1. Go to Extensions
  2. Search for "GitHub Codespaces"
  3. Install the extension
## Conclusion

GitHub Codespaces provides a convenient, pre-configured environment for the Machine Learning Zoomcamp course. It eliminates most setup issues and lets you focus on learning machine learning concepts rather than configuring an environment.