chore: add jupyter notebooks

lpm0073 · lpm0073 · commit f07e80683c53 · 2025-07-14T12:28:57.000-06:00
diff --git a/docs/smarter-codebase.xlsx b/docs/smarter-codebase.xlsx
diff --git a/jupyter/House_Prices_Regression_Demo.ipynb b/jupyter/House_Prices_Regression_Demo.ipynb
@@ -0,0 +1,116 @@
+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": "# \ud83d\udcd3 House Prices Regression Demo using Kaggle Dataset\n"
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": "## Step 1: Set Up the Kaggle API and Download the Dataset\n\nWe'll use the Kaggle API to download the House Prices dataset. The Kaggle CLI reads your API credentials from a `kaggle.json` file, which the next cell expects to find at `~/.kaggle/kaggle.json`.\n\n- Go to [https://www.kaggle.com/account](https://www.kaggle.com/account)\n- Scroll down to the \"API\" section\n- Click \u201cCreate New API Token\u201d\n- Save the downloaded `kaggle.json` file to `~/.kaggle/kaggle.json`\n- Accept the competition rules on the competition's Kaggle page before running the download"
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "# Skip uploading kaggle.json. Assume it already exists at ~/.kaggle/kaggle.json\n",
+        "!pip install -q kaggle\n",
+        "!kaggle competitions download -c house-prices-advanced-regression-techniques\n",
+        "!unzip -q -o house-prices-advanced-regression-techniques.zip -d house_prices"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": "## Step 2: Load and Inspect the Data\n\nNow that we have the dataset, let's load it into a pandas DataFrame and take a quick look at the structure.\n\nWe'll use the `train.csv` file, which includes both the input features and the target variable (`SalePrice`)."
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": "import pandas as pd\n\ndf = pd.read_csv(\"house_prices/train.csv\")\nprint(\"Shape of dataset:\", df.shape)\ndf.head()"
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": "## Step 3: Preprocess the Data\n\nTo keep this demo simple, we'll do the following:\n\n1. Keep only numeric features (to avoid complex encoding for now).\n2. Drop columns with missing values.\n3. Separate our input features (`X`) and the target variable (`y`)."
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": "# Keep only numeric columns\ndf_numeric = df.select_dtypes(include=[\"number\"])\n\n# Drop columns with missing values\ndf_clean = df_numeric.dropna(axis=1)\n\n# Separate features (X) and target (y); Id is a row identifier, not a feature\nX = df_clean.drop(columns=[\"SalePrice\", \"Id\"])\ny = df_clean[\"SalePrice\"]"
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": "## Step 4: Train-Test Split\n\nTo evaluate our model fairly, we'll split the data into training and testing sets.  \nThis means the model will learn from one part and be tested on another, unseen part."
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": "from sklearn.model_selection import train_test_split\n\nX_train, X_test, y_train, y_test = train_test_split(\n    X, y, test_size=0.2, random_state=42\n)"
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": "## Step 5: Train a Linear Regression Model\n\nWe'll use **Linear Regression**, one of the simplest and most interpretable machine learning models for regression tasks."
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": "from sklearn.linear_model import LinearRegression\n\nmodel = LinearRegression()\nmodel.fit(X_train, y_train)"
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": "## Step 6: Evaluate the Model\n\nAfter training the model, we want to check how well it performs on the held-out test set.\n\nWe'll use:\n- **Root Mean Squared Error (RMSE)**: the typical size of the prediction error, in the same units as `SalePrice` (lower is better)\n- **R\u00b2 Score**: the proportion of the variance in house prices that the model explains (closer to 1 is better)"
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": "from sklearn.metrics import mean_squared_error, r2_score\n\ny_pred = model.predict(X_test)\n\n# Take the square root manually: the squared=False argument was removed in newer scikit-learn releases\nrmse = mean_squared_error(y_test, y_pred) ** 0.5\nr2 = r2_score(y_test, y_pred)\n\nprint(f\"RMSE: {rmse:.2f}\")\nprint(f\"R\u00b2 Score: {r2:.2f}\")"
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": "## Step 7: Visualize Predictions\n\nA scatter plot of predicted prices vs. actual prices helps us visually assess model performance.  \nIf predictions are perfect, points will lie along the diagonal."
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": "import matplotlib.pyplot as plt\n\nplt.figure(figsize=(8, 6))\nplt.scatter(y_test, y_pred, alpha=0.5)\nplt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')\nplt.xlabel(\"Actual Sale Price\")\nplt.ylabel(\"Predicted Sale Price\")\nplt.title(\"Predicted vs. Actual House Prices\")\nplt.grid(True)\nplt.show()"
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": "## \u2705 Summary\n\nIn this notebook, we:\n- Downloaded real-world housing data from Kaggle\n- Cleaned and prepared the data\n- Trained a basic regression model using Linear Regression\n- Evaluated and visualized the results\n\nThis is just a starting point. You can improve the model by:\n- Handling categorical variables (e.g., one-hot encoding)\n- Filling in missing values instead of dropping them\n- Trying other models like Decision Trees or XGBoost\n- Performing feature selection and engineering"
+    }
+  ],
+  "metadata": {
+    "kernelspec": {
+      "display_name": "Python 3",
+      "language": "python",
+      "name": "python3"
+    },
+    "language_info": {
+      "name": "python",
+      "version": ""
+    }
+  },
+  "nbformat": 4,
+  "nbformat_minor": 5
+}
diff --git a/jupyter/k-means.ipynb b/jupyter/k-means.ipynb
@@ -0,0 +1 @@
+{"nbformat":4,"nbformat_minor":0,"metadata":{"colab":{"provenance":[],"authorship_tag":"ABX9TyOUYrjEYAVUpeNCP4+c07lo"},"kernelspec":{"name":"python3","display_name":"Python 3"},"language_info":{"name":"python"}},"cells":[{"cell_type":"markdown","source":["# K-means Clustering on the Iris Dataset\n","\n","This notebook demonstrates how to perform K-means clustering on the classic Iris dataset using Python, scikit-learn, and Google Colab.  \n","K-means is an unsupervised machine learning algorithm that groups data into clusters based on feature similarity.\n","\n","**In this notebook, you will:**\n","- Load the Iris dataset directly from the UCI Machine Learning Repository\n","- Apply K-means clustering to group the data into clusters\n","- Visualize the resulting clusters\n","\n","No prior setup or downloads are required—simply run each cell to see the results!"],"metadata":{"id":"4gKAjgYiDMQn"}},{"cell_type":"code","source":["# Step 1: Import libraries\n","import pandas as pd\n","from sklearn.cluster import KMeans\n","import matplotlib.pyplot as plt\n","\n","# Step 2: Load the Iris dataset from UC Irvine.\n","url = \"https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data\"\n","cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']\n","df = pd.read_csv(url, header=None, names=cols)\n","\n","# Step 3: Prepare data (drop species column for clustering)\n","X = df.drop('species', axis=1)\n","\n","# Step 4: Run K-means clustering\n","kmeans = KMeans(n_clusters=3, random_state=42)\n","df['cluster'] = kmeans.fit_predict(X)\n","\n","# Step 5: Show results\n","print(df.head())\n","\n","# Step 6: Visualize clusters (using first two features)\n","plt.scatter(df['sepal_length'], df['sepal_width'], c=df['cluster'])\n","plt.xlabel('Sepal Length')\n","plt.ylabel('Sepal Width')\n","plt.title('K-means Clusters on Iris Dataset')\n","plt.show()\n"],"metadata":{"id":"Ca52XxqOEC78"},"execution_count":null,"outputs":[]}]}
