adding comments to make the notebook more understandable

CaesarGhazi · WuorBhang · commit 4c9d70f75373 · 2025-12-05T10:48:59.000+03:00
diff --git a/4_data_analysis/resource_demand.ipynb b/4_data_analysis/resource_demand.ipynb
@@ -35,6 +35,16 @@
     "sns.set_palette(\"husl\")"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "804371b4",
+   "metadata": {},
+   "source": [
+    "## Load Data\n",
+    "Load the processed patient treatment dataset (`teds_ml_ready.csv`) and check its basic structure.\n",
+    "The dataset contains individual patient records with demographics, substance use patterns, and treatment information."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 6,
@@ -45,6 +55,16 @@
     "df = pd.read_csv(\"1_datasets/processed/teds_ml_ready.csv\")"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "e461f05f",
+   "metadata": {},
+   "source": [
+    "## Initial Data Quality Checks\n",
+    "Inspect the data structure, missing values, and data types.\n",
+    "The percentage of missing values per column informs our imputation strategy (high/medium/low missingness).\n"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 7,
@@ -74,6 +94,17 @@
     "print(missing_pct[missing_pct > 0].head(10))\n"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "aad309cb",
+   "metadata": {},
+   "source": [
+    "## Data Cleaning & Type Conversion\n",
+    "Handle missing values and convert problematic columns from object to numeric.\n",
+    "Numeric columns are filled with the median; categorical columns are filled with the mode.\n",
+    "This produces clean data for feature engineering."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 8,
@@ -103,6 +134,17 @@
     "        df_clean[col] = converted.fillna(0)"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "d08bd2d0",
+   "metadata": {},
+   "source": [
+    "## Pre-Aggregation One-Hot Encoding\n",
+    "Convert categorical demographics to binary columns BEFORE aggregating.\n",
+    "This preserves distributions (e.g., % Female) instead of losing information by collapsing each group to its mode().\n",
+    "When aggregated by mean, these binary columns become proportions (0.0-1.0) for each facility group (state, service_type)."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 9,
@@ -117,6 +159,18 @@
     ")"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "887728c5",
+   "metadata": {},
+   "source": [
+    "## Aggregate by State & Service Type\n",
+    "Group individual patient records by (state, service_type).\n",
+    "Target: Count of patients = total_admissions.\n",
+    "Features: Mean of binary indicators = prevalence rates for that facility type.\n",
+    "This is where we CREATE the training dataset."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 10,
@@ -135,26 +189,50 @@
     "df_grouped.rename(columns={\"patient_id\": \"total_admissions\"}, inplace=True)\n"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "786b8e51",
+   "metadata": {},
+   "source": [
+    "## Feature Engineering - Complexity Score\n",
+    "Create a composite metric: complexity_score.\n",
+    "Weights reflect the severity of each condition (chronic treatment and injection use, both 2.0, are weighted highest).\n",
+    "Since we aggregated by mean, the input values are 0.0-1.0 (prevalence rates).\n",
+    "The score reflects the average complexity of a facility's patient population."
+   ]
+  },
   {
    "cell_type": "code",
-   "execution_count": 11,
+   "execution_count": null,
    "id": "168ff8d3",
    "metadata": {},
    "outputs": [],
    "source": [
-    "# Complexity Score (now percentages since we aggregated by mean)\n",
+    "# Complexity Score\n",
     "df_grouped[\"complexity_score\"] = (\n",
     "    df_grouped.get(\"is_polysubstance\", 0) * 1.5\n",
     "    + df_grouped.get(\"is_chronic_treatment\", 0) * 2.0\n",
     "    + df_grouped.get(\"has_mental_health_disorder\", 0) * 1.8\n",
     "    + df_grouped.get(\"is_homeless\", 0) * 1.5\n",
     "    + df_grouped.get(\"is_injection_user\", 0) * 2.0\n",
     ")\n",
+    "# One-hot encode remaining categorical features\n",
     "df_final = pd.get_dummies(\n",
     "    df_grouped, columns=[\"state\", \"service_type\"], drop_first=True\n",
     ")"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "08cd2488",
+   "metadata": {},
+   "source": [
+    "## Train-Test Split\n",
+    "Split before imputation/scaling to prevent leakage.\n",
+    "The imputer and scaler are fit only on the training data.\n",
+    "This ensures that test data statistics don't influence preprocessing."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 12,
@@ -198,6 +276,17 @@
     "print(f\"Target Skew (Log): {y_log.skew():.2f}\")"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "588ec192",
+   "metadata": {},
+   "source": [
+    "## Imputation & Scaling (Fit on Train Only)\n",
+    "The imputer and scaler learn statistics ONLY from the training data.\n",
+    "Test data is transformed using the training statistics (simulating a true production scenario).\n",
+    "This gives an honest estimate of how the model performs on unseen data."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 13,
@@ -218,6 +307,17 @@
     "X_test_scaled = pd.DataFrame(scaler.transform(X_test_imputed), columns=X_test.columns)"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "0014a72e",
+   "metadata": {},
+   "source": [
+    "## Train Multiple Models\n",
+    "Compare three algorithms: Ridge, Random Forest, and Gradient Boosting.\n",
+    "Evaluate each on the TEST data (the 20% held out for final assessment).\n",
+    "Use CROSS-VALIDATION on the TRAIN data (5-fold, more robust than a single split)."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 38,
@@ -285,6 +385,17 @@
     "print(f\"{name} CV R²: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})\")\n"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "7a8476da",
+   "metadata": {},
+   "source": [
+    "## Retrain Best Model on Full Data (Production)\n",
+    "After selecting the best model, retrain it on ALL data (no more train/test split).\n",
+    "This captures all available signal for production predictions.\n",
+    "Use an imputer and scaler fitted on the full data, not the train-only ones."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
@@ -311,9 +422,36 @@
     "# Retrain Best Model on Full Log Data\n",
     "y_full_log = np.log1p(y_raw)\n",
     "best_model_prod = clone(best_model)\n",
-    "best_model_prod.fit(X_full_scaled, y_full_log)\n",
-    "\n",
-    "# Calculate Bias Correction (Fix Log Transformation Under-prediction)\n",
+    "best_model_prod.fit(X_full_scaled, y_full_log)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9cc69d17",
+   "metadata": {},
+   "source": [
+    "## Generate Predictions\n",
+    "Use the production model to predict admissions for all facilities.\n",
+    "Clip negative predictions to 0 (admissions cannot be negative).\n",
+    "Calculate the bias correction factor, which compensates for the under-prediction introduced by the log transformation."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "725331ec",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Bias Correction Factor: 1.0014\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Calculate Bias Correction\n",
     "train_preds_log = best_model_prod.predict(X_full_scaled)\n",
     "train_preds_raw = np.maximum(np.expm1(train_preds_log), 0)\n",
     "\n",
@@ -324,6 +462,15 @@
     "print(f\"Bias Correction Factor: {correction_factor:.4f}\")"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "ee72bda7",
+   "metadata": {},
+   "source": [
+    "## Generate the Final Values\n",
+    "Generate final predictions by converting the model’s log-scale outputs back to the original scale using `expm1`, then applying the bias correction factor and clipping negatives to zero. The resulting values are stored in `df_grouped[\"predicted_admissions\"]`.\n"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 28,
@@ -340,6 +487,18 @@
     "df_grouped[\"predicted_admissions\"] = final_preds"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "53f5b9bd",
+   "metadata": {},
+   "source": [
+    "## Calculate Resource Requirements\n",
+    "Convert predicted admissions into actionable resource recommendations:\n",
+    "- Beds: assume 12 patients per bed (standard occupancy rate).\n",
+    "- Staff: assume 50 patients per staff member, adjusted by facility complexity.\n",
+    "- High-demand flag: identifies facilities above the median volume for prioritization."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 29,
@@ -363,6 +522,17 @@
     ").astype(int)"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "ab50c55a",
+   "metadata": {},
+   "source": [
+    "## Feature Importance Analysis\n",
+    "Identify which features drive the model's predictions.\n",
+    "The top features show which patterns the model found most predictive.\n",
+    "Use this for model interpretability and validation."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 32,
@@ -405,6 +575,16 @@
     ")"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "d17d2302",
+   "metadata": {},
+   "source": [
+    "## Resource Allocation Report\n",
+    "Display sample predictions and top high-demand facilities.\n",
+    "This is the actionable output for resource planning teams."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 36,
@@ -460,6 +640,19 @@
     "print(top_demand.to_string(index=False))\n"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "68e413de",
+   "metadata": {},
+   "source": [
+    "## Visualization Dashboard\n",
+    "Create a 4-panel visualization to understand model performance and recommendations:\n",
+    "- Panel 1: Actual vs Predicted (assess accuracy, colored by complexity)\n",
+    "- Panel 2: Complexity distribution (understand patient case mix)\n",
+    "- Panel 3: Admissions vs Beds (resource scaling relationship)\n",
+    "- Panel 4: Model performance comparison (train vs test)"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 34,