Commit 9bb4a06

chore: add anomaly detection Jupyter notebook
1 parent 6e3235b commit 9bb4a06

File tree

1 file changed

+30
-27
lines changed


jupyter/anomaly_detection_creditcard.ipynb

Lines changed: 30 additions & 27 deletions
Original file line number · Diff line number · Diff line change
@@ -20,7 +20,9 @@
2020
"Thus, you can run this notebook either from the Azure AI Machine Learning web console or locally, assuming that you've created and activated the [Python virtual environment](https://realpython.com/python-virtual-environments-a-primer/) provided in the course [GitHub repository](https://github.com/FullStackWithLawrence/azureml-example) for Python 3.9.\n",
2121
"\n",
2222
"\n",
23-
"## Step 1: import the PyPi packages\n"
23+
"## Workflow\n",
24+
"\n",
25+
"### Step 1: Import the PyPI packages\n"
2426
]
2527
},
2628
{
@@ -43,27 +45,27 @@
4345
"cell_type": "markdown",
4446
"metadata": {},
4547
"source": [
46-
"## Step 2: Load the Credit Card Fraud Dataset from Azure ML\n",
48+
"### Step 2: Load the Credit Card Fraud Dataset from Azure ML\n",
4749
"\n",
4850
"Retrieve the dataset from our existing workspace and set it up for use with pandas.\n",
4951
"\n",
5052
"**IMPORTANT: be mindful of the size of the dataset that you're working with. For example, if you run this notebook locally, be aware that you're downloading around 150 MiB from your Azure workspace, and this snippet will take approximately 4 minutes to run.**\n",
5153
"\n",
52-
"### What This Code Does\n",
54+
"#### What This Code Does\n",
5355
"\n",
5456
"- **`Workspace.from_config()`** connects to your Azure ML workspace using the `config.json` file (you should already have this if you followed earlier lectures).\n",
5557
"- **`Dataset.get_by_name(...)`** loads the dataset you previously uploaded and registered in the Azure ML web interface.\n",
5658
"- **`.to_pandas_dataframe()`** converts the Azure Dataset into a standard pandas DataFrame so you can explore and manipulate it with Python.\n",
5759
"- **`df.head()`** shows the first 5 rows of the data — this is just a quick preview to confirm that the dataset loaded correctly.\n",
5860
"\n",
59-
"### Why This Matters\n",
61+
"#### Why This Matters\n",
6062
"\n",
6163
"This is the standard pattern you’ll use throughout Azure ML when working with registered datasets in notebooks. It keeps your workflow consistent and lets you:\n",
6264
"- Avoid re-uploading data every time.\n",
6365
"- Ensure reproducibility across experiments and pipelines.\n",
6466
"- Easily switch to remote compute environments without changing your code.\n",
6567
"\n",
66-
"### Console output\n",
68+
"#### Console output\n",
6769
"\n",
6870
"You will most likely see a few console output messages. This is expected; they come from Azure’s background systems for logging and monitoring.\n",
6971
"Unless you see an actual `ERROR` or `Traceback`, you can **safely ignore** any of the following.\n",
@@ -338,11 +340,11 @@
338340
"cell_type": "markdown",
339341
"metadata": {},
340342
"source": [
341-
"## Step 3: Prepare the Data\n",
343+
"### Step 3: Prepare the Data\n",
342344
"\n",
343345
"We're going to normalize the distribution of the transaction `Amount` column, which helps the model treat transaction amounts on the same scale as the other features (which are already normalized).\n",
344346
"\n",
345-
"### What This Code Does\n",
347+
"#### What This Code Does\n",
346348
"\n",
347349
"- **Standardizes the `Amount` column**: \n",
348350
" We scale the `Amount` feature so that it has a mean of 0 and a standard deviation of 1. \n",
@@ -353,7 +355,7 @@
353355
"\n",
354356
"- We also drop the `Time` column since it doesn't contribute meaningfully to anomaly detection in this context.\n",
355357
"\n",
356-
"### Why This Matters\n",
358+
"#### Why This Matters\n",
357359
"\n",
358360
"Many machine learning algorithms — including Isolation Forest — perform better when numeric features are on a similar scale. \n",
359361
"Also, splitting the data into `X` and `y` is a standard step that prepares it for training and evaluation."
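The preparation steps above can be sketched as follows. This is a minimal, self-contained sketch on synthetic data; the column names `Time`, `Amount`, and the `Class` label follow the credit card dataset's schema, and the real notebook operates on the DataFrame loaded in Step 2.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the credit card DataFrame loaded in Step 2.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "Time": rng.uniform(0, 172_000, 1_000),
    "V1": rng.normal(size=1_000),
    "Amount": rng.lognormal(mean=3, sigma=1.5, size=1_000),
    "Class": rng.choice([0, 1], size=1_000, p=[0.998, 0.002]),
})

# Standardize Amount to mean 0, std 1 so it sits on the same scale as V1..V28.
df[["Amount"]] = StandardScaler().fit_transform(df[["Amount"]])

# Drop Time, then split into features X and label y.
df = df.drop(columns=["Time"])
X = df.drop(columns=["Class"])
y = df["Class"]
```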
@@ -374,7 +376,7 @@
374376
"cell_type": "markdown",
375377
"metadata": {},
376378
"source": [
377-
"## Train the model\n",
379+
"### Step 4: Train the Model\n",
378380
"\n",
379381
"The **Isolation Forest** algorithm is a popular unsupervised method for **detecting anomalies** in high-dimensional datasets. Instead of learning what “normal” looks like, it works by **isolating outliers** — rare points that are easier to separate from the rest of the data. It does this by randomly splitting the dataset using decision trees and measuring how quickly a data point can be isolated. The idea is that **anomalies require fewer splits to isolate**, because they are different from everything else. Isolation Forest is widely used in **fraud detection**, **network security**, and **industrial monitoring** because it is **fast, efficient**, and handles **large datasets** with many features. In our code, we set the `contamination` parameter to roughly match the known fraction of fraud cases in the dataset."
380382
]
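The training step can be sketched like this. Synthetic features stand in for the prepared matrix `X` from Step 3, and the `contamination` value of 0.002 is assumed to approximate the dataset's fraud rate (492 of 284,807 transactions).

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 5))  # stand-in for the prepared feature matrix

# contamination ~ expected fraction of anomalies (the known fraud rate).
model = IsolationForest(contamination=0.002, random_state=0)
model.fit(X)

# predict() returns +1 for inliers and -1 for anomalies.
y_pred = model.predict(X)
```

Because the algorithm is unsupervised, no labels are passed to `fit`; `contamination` only sets the score threshold used to decide how many points get flagged.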
@@ -406,30 +408,25 @@
406408
"| **F1-Score** | A balance between precision and recall — like a combined performance score |\n",
407409
"| **Support** | The number of examples in each group (normal or fraud) in the real data |\n",
408410
"\n",
409-
"### Results Summary\n",
411+
"#### Results Summary\n",
410412
"\n",
411413
"| Class | Description | Precision | Recall | F1-Score | Support |\n",
412414
"|-------|------------------------|-----------|--------|----------|---------|\n",
413415
"| `0` | Normal transactions | **1.00** | **1.00** | **1.00** | 284,315 |\n",
414416
"| `1` | Fraudulent transactions| **0.29** | **0.28** | **0.28** | 492 |\n",
415417
"\n",
416-
"### Interpretation (In Simple Terms)\n",
418+
"#### Interpretation (In Simple Terms)\n",
417419
"\n",
418420
"- The model is **excellent at recognizing normal transactions** — it almost never makes a mistake with those.\n",
419421
"- However, it **struggles to correctly catch fraud**:\n",
420422
" - When it says a transaction is fraud, it’s **only right 29% of the time**.\n",
421423
" - It **only finds 28% of the real fraud cases** — it misses most of them.\n",
422424
"\n",
423-
"### Overall Accuracy\n",
425+
"#### Overall Accuracy\n",
424426
"\n",
425427
"- The model is **99.9% accurate**, but this is misleading.\n",
426428
"- Because **fraud cases are very rare**, the model can look “perfect” just by saying everything is normal.\n",
427-
"- That’s why we look at **precision**, **recall**, and **F1-score** for a fuller picture.\n",
428-
"\n",
429-
"### Conclusion\n",
430-
"\n",
431-
"- Our model is great at recognizing normal behavior.\n",
432-
"- But we need to **improve how it detects fraud** — maybe by tuning parameters or using a different method like autoencoders or SMOTE (for balancing the data)."
429+
"- That’s why we look at **precision**, **recall**, and **F1-score** for a fuller picture.\n"
433430
]
434431
},
435432
{
@@ -458,13 +455,19 @@
458455
"print(classification_report(y, y_pred))"
459456
]
460457
},
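One detail worth calling out: `IsolationForest.predict` returns `+1`/`-1`, while the `Class` label uses `0`/`1`, so the predictions must be remapped before `classification_report` can compare them. A minimal sketch on synthetic data (the notebook presumably performs an equivalent mapping before the `print` call above):

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import classification_report

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))
y = rng.choice([0, 1], size=500, p=[0.99, 0.01])  # 0 = normal, 1 = fraud

model = IsolationForest(contamination=0.01, random_state=1).fit(X)

# Map IsolationForest output (+1 inlier, -1 anomaly) onto the label scheme (0, 1).
y_pred = np.where(model.predict(X) == -1, 1, 0)

print(classification_report(y, y_pred, zero_division=0))
```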
458+
{
459+
"cell_type": "markdown",
460+
"metadata": {},
461+
"source": [
462+
"### Step 6 (Optional): Register the Model\n"
463+
]
464+
},
461465
{
462466
"cell_type": "code",
463467
"execution_count": null,
464468
"metadata": {},
465469
"outputs": [],
466470
"source": [
467-
"# Step 6 (Optional): Register the Model\n",
468471
"joblib.dump(model, 'isolation_forest.pkl')\n",
469472
"Model.register(model_path='isolation_forest.pkl',\n",
470473
" model_name='creditcard_if_model',\n",
@@ -484,14 +487,14 @@
484487
" - `1` means the model thinks the transaction is **fraud** or **anomalous**.\n",
485488
"- **Y-axis**: The total number of transactions in each category.\n",
486489
"\n",
487-
"### How to Interpret This Chart\n",
490+
"#### How to Interpret This Chart\n",
488491
"\n",
489492
"- You will (hopefully) see a **very tall bar for `0`** and a **very short bar for `1`**.\n",
490493
"- This is because **fraud is rare** in the dataset (only 492 out of 284,807 transactions).\n",
491494
"- The model is trained to detect outliers, so it **flags a small number of transactions as anomalies** (which is expected).\n",
492495
"- If the number of predicted frauds is **close to the actual number** (around 500), that’s a good sign that the model is well-calibrated.\n",
493496
"\n",
494-
"### Why This Matters\n",
497+
"#### Why This Matters\n",
495498
"\n",
496499
"- This simple chart gives a **quick health check** of how aggressive or conservative the model is in flagging anomalies.\n",
497500
"- If the model predicts **too many anomalies**, it might be overreacting.\n",
@@ -542,20 +545,20 @@
542545
" - `1` = predicted fraud/anomaly\n",
543546
"- **Y-axis**: The dollar **amount** of each transaction (standardized)\n",
544547
"\n",
545-
"### How to Interpret This Chart\n",
548+
"#### How to Interpret This Chart\n",
546549
"\n",
547550
"- Each box shows how transaction amounts are distributed for each prediction class.\n",
548551
"- The **line in the middle** of each box is the **median** transaction amount.\n",
549552
"- The **height of the box** shows where most transaction amounts fall.\n",
550553
"- **Dots outside the box** are **outliers** — unusual values far from the average.\n",
551554
"\n",
552-
"### What This Tells Us\n",
555+
"#### What This Tells Us\n",
553556
"\n",
554557
"- You may see that predicted frauds (`1`) tend to have **more extreme** or **variable amounts**.\n",
555558
"- This could suggest that the model is flagging **unusually high or low transaction amounts** as suspicious.\n",
556559
"- If the fraud predictions have a **much wider range**, it means the model may be reacting to extreme values — which is common in anomaly detection.\n",
557560
"\n",
558-
"### Usefulness\n",
561+
"#### Usefulness\n",
559562
"\n",
560563
"This chart helps you:\n",
561564
"- Understand what kinds of amounts the model thinks are suspicious.\n",
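A minimal sketch of such a boxplot, using synthetic standardized amounts in place of the notebook's own `df` and `y_pred`:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove when running interactively
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
amounts = rng.normal(size=2_000)  # stand-in for the standardized Amount column
y_pred = rng.choice([0, 1], size=2_000, p=[0.98, 0.02])
plot_df = pd.DataFrame({"Amount": amounts, "prediction": y_pred})

# One box per predicted class; dots beyond the whiskers are outliers.
ax = plot_df.boxplot(column="Amount", by="prediction")
ax.set_xlabel("Prediction (0 = normal, 1 = anomaly)")
ax.set_ylabel("Amount (standardized)")
plt.suptitle("")
plt.tight_layout()
```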
@@ -594,7 +597,7 @@
594597
"\n",
595598
"The beeswarm plot below is generated using **SHAP** (SHapley Additive exPlanations). It helps explain **which features influenced the model's decisions**, and **how strongly**. We only analyze the first 100 transactions here in order to keep the visualization fast and readable.\n",
596599
"\n",
597-
"### How to Read the SHAP Beeswarm Plot\n",
600+
"#### How to Read the SHAP Beeswarm Plot\n",
598601
"\n",
599602
"- **Each dot** represents a single transaction.\n",
600603
"- **Each row** is one feature (like `V1`, `V2`, `Amount`, etc.).\n",
@@ -605,14 +608,14 @@
605608
" - Dots farther to the right **push the model toward predicting fraud**.\n",
606609
" - Dots farther to the left **push the model toward predicting normal**.\n",
607610
"\n",
608-
"### What This Tells Us\n",
611+
"#### What This Tells Us\n",
609612
"\n",
610613
"- The **topmost features** are the most important ones in the model’s decisions.\n",
611614
"- For example, if `V14` is at the top and its red dots are far right, it means:\n",
612615
" - High values of `V14` increase the chance that the model flags a transaction as fraud.\n",
613616
"- This plot helps us understand **why** the model flagged certain transactions as anomalies.\n",
614617
"\n",
615-
"### Why Use SHAP?\n",
618+
"#### Why Use SHAP?\n",
616619
"\n",
617620
"- SHAP adds transparency to the model, even for complex algorithms like Isolation Forest.\n",
618621
"- Helps **build trust**, especially in sensitive tasks like fraud detection.\n",
