Commit f71fafc

chore: update jupyter notebooks

1 parent f07e806

3 files changed: 450 additions & 17 deletions

File tree

.gitignore (1 addition & 0 deletions)

```diff
@@ -78,6 +78,7 @@ __pycache__
 .pytest_cache
 venv
 .venv
+venv3.12
 .pytest_cache
 *.pyc
 *.pyo
```

jupyter/House_Prices_Regression_Demo.ipynb (139 additions & 16 deletions)

```diff
@@ -3,12 +3,24 @@
  {
   "cell_type": "markdown",
   "metadata": {},
-  "source": "# \ud83d\udcd3 House Prices Regression Demo using Kaggle Dataset\n"
+  "source": [
+   "# 📓 House Prices Regression Demo using Kaggle Dataset\n"
+  ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
-  "source": "## Step 1: Setup Kaggle API and Download Dataset\n\nWe'll use the Kaggle API to download the House Prices dataset. To do this, you need to upload your `kaggle.json` file, which contains your API credentials.\n\n- Go to [https://www.kaggle.com/account](https://www.kaggle.com/account)\n- Scroll down to the \"API\" section\n- Click \u201cCreate New API Token\u201d\n- Save the downloaded `kaggle.json` file\n- Upload it when prompted below"
+  "source": [
+   "## Step 1: Setup Kaggle API and Download Dataset\n",
+   "\n",
+   "We'll use the Kaggle API to download the House Prices dataset. To do this, you need to upload your `kaggle.json` file, which contains your API credentials.\n",
+   "\n",
+   "- Go to [https://www.kaggle.com/account](https://www.kaggle.com/account)\n",
+   "- Scroll down to the \"API\" section\n",
+   "- Click “Create New API Token”\n",
+   "- Save the downloaded `kaggle.json` file\n",
+   "- Upload it when prompted below"
+  ]
  },
  {
   "cell_type": "code",
```
```diff
@@ -18,86 +30,197 @@
   "source": [
    "# Skip uploading kaggle.json. Assume it already exists at ~/.kaggle/kaggle.json\n",
    "!pip install -q kaggle\n",
+   "\n",
+   "\n",
+   "# Upload the kaggle.json file (from your local computer)\n",
+   "from google.colab import files\n",
+   "files.upload() # Choose kaggle.json when prompted\n",
+   "\n",
+   "# Move the file to the right location\n",
+   "!mkdir -p /root/.kaggle\n",
+   "!cp kaggle.json /root/.kaggle/\n",
+   "!chmod 600 /root/.kaggle/kaggle.json\n",
+   "\n",
+   "# Download the House Prices dataset from Kaggle\n",
    "!kaggle competitions download -c house-prices-advanced-regression-techniques\n",
-   "!unzip -q house-prices-advanced-regression-techniques.zip -d house_prices"
+   "!unzip -q house-prices-advanced-regression-techniques.zip -d house_prices\n",
+   "\n"
   ]
  },
```
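The credential-placement steps in the added cell (`mkdir -p`, `cp`, `chmod 600`) can be mirrored in pure Python; this sketch uses a throwaway temporary directory instead of the real `~/.kaggle`, and a placeholder file instead of actual credentials:

```python
import os
import stat
import tempfile

# Demonstrate the permission step against a throwaway directory,
# not the real ~/.kaggle, so nothing on the machine is touched.
with tempfile.TemporaryDirectory() as tmp:
    kaggle_dir = os.path.join(tmp, ".kaggle")
    os.makedirs(kaggle_dir, exist_ok=True)
    cred = os.path.join(kaggle_dir, "kaggle.json")
    with open(cred, "w") as f:
        f.write("{}")  # placeholder for the real credentials file
    # chmod 600: read/write for the owner only
    os.chmod(cred, stat.S_IRUSR | stat.S_IWUSR)
    mode = stat.S_IMODE(os.stat(cred).st_mode)
```

The permissions step matters because the Kaggle CLI warns when the credentials file is readable by other users.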
```diff
  {
   "cell_type": "markdown",
   "metadata": {},
-  "source": "## Step 2: Load and Inspect the Data\n\nNow that we have the dataset, let's load it into a pandas DataFrame and take a quick look at the structure.\n\nWe'll use the `train.csv` file, which includes both the input features and the target variable (`SalePrice`)."
+  "source": [
+   "## Step 2: Load and Inspect the Data\n",
+   "\n",
+   "Now that we have the dataset, let's load it into a pandas DataFrame and take a quick look at the structure.\n",
+   "\n",
+   "We'll use the `train.csv` file, which includes both the input features and the target variable (`SalePrice`)."
+  ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
-  "source": "import pandas as pd\n\ndf = pd.read_csv(\"house_prices/train.csv\")\nprint(\"Shape of dataset:\", df.shape)\ndf.head()"
+  "source": [
+   "import pandas as pd\n",
+   "\n",
+   "df = pd.read_csv(\"house_prices/train.csv\")\n",
+   "print(\"Shape of dataset:\", df.shape)\n",
+   "df.head()"
+  ]
  },
```
```diff
  {
   "cell_type": "markdown",
   "metadata": {},
-  "source": "## Step 3: Preprocess the Data\n\nTo keep this demo simple, we'll do the following:\n\n1. Keep only numeric features (to avoid complex encoding for now).\n2. Drop columns with missing values.\n3. Separate our input features (`X`) and the target variable (`y`)."
+  "source": [
+   "## Step 3: Preprocess the Data\n",
+   "\n",
+   "To keep this demo simple, we'll do the following:\n",
+   "\n",
+   "1. Keep only numeric features (to avoid complex encoding for now).\n",
+   "2. Drop columns with missing values.\n",
+   "3. Separate our input features (`X`) and the target variable (`y`)."
+  ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
-  "source": "# Keep only numeric columns\ndf_numeric = df.select_dtypes(include=[\"number\"])\n\n# Drop columns with missing values\ndf_clean = df_numeric.dropna(axis=1)\n\n# Separate features (X) and target (y)\nX = df_clean.drop(\"SalePrice\", axis=1)\ny = df_clean[\"SalePrice\"]"
+  "source": [
+   "# Keep only numeric columns\n",
+   "df_numeric = df.select_dtypes(include=[\"number\"])\n",
+   "\n",
+   "# Drop columns with missing values\n",
+   "df_clean = df_numeric.dropna(axis=1)\n",
+   "\n",
+   "# Separate features (X) and target (y)\n",
+   "X = df_clean.drop(\"SalePrice\", axis=1)\n",
+   "y = df_clean[\"SalePrice\"]"
+  ]
  },
```
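On synthetic data it is easy to see exactly which columns each preprocessing step removes; the column names below are hypothetical stand-ins for the Kaggle schema:

```python
import numpy as np
import pandas as pd

# Tiny synthetic frame: one clean numeric column, one numeric column
# with a missing value, one non-numeric column, and the target.
df = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "GarageYrBlt": [2003.0, np.nan, 2001.0],  # has a missing value
    "Street": ["Pave", "Pave", "Grvl"],       # non-numeric
    "SalePrice": [208500, 181500, 223500],
})

df_numeric = df.select_dtypes(include=["number"])  # drops "Street"
df_clean = df_numeric.dropna(axis=1)               # drops "GarageYrBlt"
X = df_clean.drop("SalePrice", axis=1)
y = df_clean["SalePrice"]
```

Only `LotArea` survives as a feature here, which makes the cost of step 2 visible: dropping whole columns discards every value in a column that has even one missing entry.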
```diff
  {
   "cell_type": "markdown",
   "metadata": {},
-  "source": "## Step 4: Train-Test Split\n\nTo evaluate our model fairly, we'll split the data into training and testing sets. \nThis means the model will learn from one part and be tested on another, unseen part."
+  "source": [
+   "## Step 4: Train-Test Split\n",
+   "\n",
+   "To evaluate our model fairly, we'll split the data into training and testing sets. \n",
+   "This means the model will learn from one part and be tested on another, unseen part."
+  ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
-  "source": "from sklearn.model_selection import train_test_split\n\nX_train, X_test, y_train, y_test = train_test_split(\n    X, y, test_size=0.2, random_state=42\n)"
+  "source": [
+   "from sklearn.model_selection import train_test_split\n",
+   "\n",
+   "X_train, X_test, y_train, y_test = train_test_split(\n",
+   "    X, y, test_size=0.2, random_state=42\n",
+   ")"
+  ]
  },
```
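With `test_size=0.2`, the split reserves 20% of the rows for testing; a quick self-contained check on 50 synthetic rows:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 50 rows of dummy features and targets
X = np.arange(100).reshape(50, 2)
y = np.arange(50)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# 20% of 50 rows -> 10 test rows, 40 train rows
```

Fixing `random_state` makes the split reproducible, so reruns of the notebook evaluate on the same held-out rows.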
```diff
  {
   "cell_type": "markdown",
   "metadata": {},
-  "source": "## Step 5: Train a Linear Regression Model\n\nWe'll use **Linear Regression**, one of the simplest and most interpretable machine learning models for regression tasks."
+  "source": [
+   "## Step 5: Train a Linear Regression Model\n",
+   "\n",
+   "We'll use **Linear Regression**, one of the simplest and most interpretable machine learning models for regression tasks."
+  ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
-  "source": "from sklearn.linear_model import LinearRegression\n\nmodel = LinearRegression()\nmodel.fit(X_train, y_train)"
+  "source": [
+   "from sklearn.linear_model import LinearRegression\n",
+   "\n",
+   "model = LinearRegression()\n",
+   "model.fit(X_train, y_train)"
+  ]
  },
```
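On a toy dataset with an exactly linear relationship, the fitted model recovers the slope and intercept, which is a handy sanity check of the same `fit` workflow:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Perfectly linear toy data: y = 3x + 1
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 3.0 * X.ravel() + 1.0

model = LinearRegression().fit(X, y)
coef = model.coef_[0]        # should recover the slope, 3
intercept = model.intercept_  # should recover the intercept, 1
```

The learned `coef_` values are also what makes linear regression interpretable: each coefficient is the model's estimated price change per unit change in that feature.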
```diff
  {
   "cell_type": "markdown",
   "metadata": {},
-  "source": "## Step 6: Evaluate the Model\n\nAfter training the model, we want to check how well it's performing.\n\nWe'll use:\n- **Root Mean Squared Error (RMSE)**: how far predictions are from actual prices\n- **R\u00b2 Score**: how much of the variance in house prices is explained by our features"
+  "source": [
+   "## Step 6: Evaluate the Model\n",
+   "\n",
+   "After training the model, we want to check how well it's performing.\n",
+   "\n",
+   "We'll use:\n",
+   "- **Root Mean Squared Error (RMSE)**: how far predictions are from actual prices\n",
+   "- **R² Score**: how much of the variance in house prices is explained by our features"
+  ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
-  "source": "from sklearn.metrics import mean_squared_error, r2_score\n\ny_pred = model.predict(X_test)\n\nrmse = mean_squared_error(y_test, y_pred, squared=False)\nr2 = r2_score(y_test, y_pred)\n\nprint(f\"RMSE: {rmse:.2f}\")\nprint(f\"R\u00b2 Score: {r2:.2f}\")"
+  "source": [
+   "from sklearn.metrics import mean_squared_error, r2_score\n",
+   "\n",
+   "y_pred = model.predict(X_test)\n",
+   "\n",
+   "rmse = mean_squared_error(y_test, y_pred, squared=False)\n",
+   "r2 = r2_score(y_test, y_pred)\n",
+   "\n",
+   "print(f\"RMSE: {rmse:.2f}\")\n",
+   "print(f\"R² Score: {r2:.2f}\")"
+  ]
  },
```
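One portability caveat about this cell: the `squared=False` flag of `mean_squared_error` was deprecated in scikit-learn 1.4 in favor of `root_mean_squared_error`, so taking the square root of the MSE yourself is the form that works across versions. A minimal sketch with hand-checkable numbers:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual vs. predicted prices, chosen so the math is easy
y_test = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 310.0])

# Portable RMSE: sqrt of the MSE, avoiding the deprecated squared=False flag
rmse = float(np.sqrt(mean_squared_error(y_test, y_pred)))
r2 = r2_score(y_test, y_pred)
```

Each residual here is ±10, so the MSE is 100 and the RMSE is exactly 10.0.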
```diff
  {
   "cell_type": "markdown",
   "metadata": {},
-  "source": "## Step 7: Visualize Predictions\n\nA scatter plot of predicted prices vs. actual prices helps us visually assess model performance. \nIf predictions are perfect, points will lie along the diagonal."
+  "source": [
+   "## Step 7: Visualize Predictions\n",
+   "\n",
+   "A scatter plot of predicted prices vs. actual prices helps us visually assess model performance. \n",
+   "If predictions are perfect, points will lie along the diagonal."
+  ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
-  "source": "import matplotlib.pyplot as plt\n\nplt.figure(figsize=(8, 6))\nplt.scatter(y_test, y_pred, alpha=0.5)\nplt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')\nplt.xlabel(\"Actual Sale Price\")\nplt.ylabel(\"Predicted Sale Price\")\nplt.title(\"Predicted vs. Actual House Prices\")\nplt.grid(True)\nplt.show()"
+  "source": [
+   "import matplotlib.pyplot as plt\n",
+   "\n",
+   "plt.figure(figsize=(8, 6))\n",
+   "plt.scatter(y_test, y_pred, alpha=0.5)\n",
+   "plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')\n",
+   "plt.xlabel(\"Actual Sale Price\")\n",
+   "plt.ylabel(\"Predicted Sale Price\")\n",
+   "plt.title(\"Predicted vs. Actual House Prices\")\n",
+   "plt.grid(True)\n",
+   "plt.show()"
+  ]
  },
```
```diff
  {
   "cell_type": "markdown",
   "metadata": {},
-  "source": "## \u2705 Summary\n\nIn this notebook, we:\n- Downloaded real-world housing data from Kaggle\n- Cleaned and prepared the data\n- Trained a basic regression model using Linear Regression\n- Evaluated and visualized the results\n\nThis is just a starting point. You can improve the model by:\n- Handling categorical variables (e.g., one-hot encoding)\n- Filling in missing values instead of dropping them\n- Trying other models like Decision Trees or XGBoost\n- Performing feature selection and engineering"
+  "source": [
+   "## ✅ Summary\n",
+   "\n",
+   "In this notebook, we:\n",
+   "- Downloaded real-world housing data from Kaggle\n",
+   "- Cleaned and prepared the data\n",
+   "- Trained a basic regression model using Linear Regression\n",
+   "- Evaluated and visualized the results\n",
+   "\n",
+   "This is just a starting point. You can improve the model by:\n",
+   "- Handling categorical variables (e.g., one-hot encoding)\n",
+   "- Filling in missing values instead of dropping them\n",
+   "- Trying other models like Decision Trees or XGBoost\n",
+   "- Performing feature selection and engineering"
+  ]
  }
 ],
 "metadata": {
```
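The first improvement the summary cell suggests, handling categorical variables with one-hot encoding, can be sketched with `pd.get_dummies`; the `Street` column here is a hypothetical stand-in for the dataset's categorical features:

```python
import pandas as pd

# One categorical column alongside a numeric one
df = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave"],
    "LotArea": [8450, 9600, 11250],
})

# One-hot encode only the categorical column; numeric columns pass through
encoded = pd.get_dummies(df, columns=["Street"])
cols = sorted(encoded.columns)
# "Street" becomes two indicator columns: Street_Grvl and Street_Pave
```

This keeps categorical information available to models like linear regression instead of discarding those columns in the `select_dtypes` step.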
