---
title: "Ridge Regression (L2 Regularization)"
sidebar_label: Ridge Regression
description: "Mastering L2 regularization to prevent overfitting and handle multicollinearity in regression models."
tags: [machine-learning, supervised-learning, regression, ridge, l2-regularization]
---

**Ridge Regression** is an extension of linear regression that adds a regularization term to the cost function. It is specifically designed to handle **overfitting** and issues caused by **multicollinearity** (when input features are highly correlated).

## 1. The Mathematical Objective

In standard OLS (Ordinary Least Squares), the model only cares about minimizing the prediction error. In Ridge Regression, we add a "penalty" proportional to the square of the magnitude of the coefficients ($\beta$).

The cost function becomes:

$$
Cost = \text{MSE} + \alpha \sum_{j=1}^{p} \beta_j^2
$$

* **MSE (Mean Squared Error):** The standard loss (prediction error).
* **$\alpha$ (Alpha):** The complexity parameter. It controls how strongly the size of the coefficients is penalized.
* **$\sum \beta_j^2$:** The squared L2 norm of the coefficient vector. Squaring keeps the coefficients small but rarely drives them exactly to zero.

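The objective above also has a well-known closed-form solution, $\hat{\beta} = (X^\top X + \alpha I)^{-1} X^\top y$. A minimal NumPy sketch (the toy data is invented for illustration, and centering the data lets us ignore the intercept):

```python
import numpy as np

# Toy data: 5 samples, 2 features; centered so no intercept term is needed
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = np.array([3.0, 3.0, 7.0, 7.0, 10.0])
X = X - X.mean(axis=0)
y = y - y.mean()

alpha = 1.0
p = X.shape[1]

# Closed-form ridge solution: beta = (X^T X + alpha * I)^(-1) X^T y
beta_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)

# alpha = 0 recovers ordinary least squares
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

print("Ridge:", beta_ridge, "norm:", np.linalg.norm(beta_ridge))
print("OLS:  ", beta_ols, "norm:", np.linalg.norm(beta_ols))
```

Note that the ridge solution always has a smaller L2 norm than the OLS solution, which is exactly the "shrinkage" effect the penalty buys.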
## 2. Why use Ridge?

### A. Preventing Overfitting
When a model has too many features or the features are highly correlated, the coefficients ($\beta$) can become very large. This makes the model extremely sensitive to small fluctuations in the training data. Ridge "shrinks" these coefficients, making the model more stable.

### B. Handling Multicollinearity
If two variables are nearly identical (e.g., height in inches and height in centimeters), standard regression might assign one a massive positive weight and the other a massive negative weight. Ridge forces the weights to be distributed more evenly and kept small.

```mermaid
graph LR
    subgraph RIDGE["Ridge Regression Coefficient Shrinkage"]
        A["$$\\alpha = 0$$"] --> B["$$w_1, w_2, w_3$$ (OLS Coefficients)"]
        B --> C["$$\\alpha \\uparrow$$"]
        C --> D["$$|w_i| \\downarrow$$ (Shrinkage)"]
        D --> E["$$w_i \\to 0$$ (Never Exactly Zero)"]
        E --> F["$$\\text{Reduced Model Variance}$$"]
    end

    subgraph OBJ["Optimization View"]
        L["$$\\min \\sum (y - \\hat{y})^2 + \\alpha \\sum w_i^2$$"]
        L --> G["$$\\text{Penalty Grows with } \\alpha$$"]
        G --> H["$$\\text{Stronger Pull Toward Origin}$$"]
    end

    C -.->|"$$\\text{Controls Strength}$$"| G
    F -.->|"$$\\text{Bias–Variance Tradeoff}$$"| H
```

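The multicollinearity behavior described above can be demonstrated directly. This is an illustrative sketch on synthetic data (the feature construction and seed are arbitrary): two nearly identical features destabilize OLS, while ridge splits the weight between them:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + 1e-4 * rng.normal(size=100)   # nearly identical copy of x1
X = np.column_stack([x1, x2])
y = 3 * x1 + 0.1 * rng.normal(size=100)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("OLS coefficients:  ", ols.coef_)    # may be large and opposite-signed
print("Ridge coefficients:", ridge.coef_)  # small and nearly equal
```

Ridge distributes the true weight of 3 almost evenly across the two correlated columns, so each coefficient lands near 1.5 instead of exploding.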
## 3. The Alpha ($\alpha$) Trade-off

Choosing the right $\alpha$ is a balancing act between **Bias** and **Variance**:

* **$\alpha = 0$:** Equivalent to standard Linear Regression (High variance, Low bias).
* **As $\alpha \to \infty$:** The penalty dominates. Coefficients approach zero, and the model becomes a flat line (Low variance, High bias).

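The trade-off can be seen by sweeping $\alpha$ and watching the norm of the coefficient vector shrink. A sketch on synthetic data (the alpha grid and `make_regression` settings are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

alphas = [0.01, 1.0, 100.0, 10_000.0]
norms = []
for alpha in alphas:
    model = Ridge(alpha=alpha).fit(X, y)
    norms.append(np.linalg.norm(model.coef_))

# The L2 norm of the coefficients shrinks monotonically as alpha grows
for alpha, norm in zip(alphas, norms):
    print(f"alpha={alpha:>8}: ||coef|| = {norm:.2f}")
```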
## 4. Implementation with Scikit-Learn

```python
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

# 1. Scaling is MANDATORY for Ridge
# Because the penalty is based on the size of coefficients,
# features with larger scales would otherwise be penalized unfairly.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # reuse the training statistics

# 2. Initialize and Train
# alpha=1.0 is the default
ridge = Ridge(alpha=1.0)
ridge.fit(X_train_scaled, y_train)

# 3. Predict
y_pred = ridge.predict(X_test_scaled)
```

## 5. Ridge vs. Lasso: A Summary

| Feature | Ridge Regression ($L2$) | Lasso Regression ($L1$) |
| --- | --- | --- |
| **Penalty Term** | $\alpha \sum_{j=1}^{p} \beta_j^2$ | $\alpha \sum_{j=1}^{p} \vert \beta_j \vert$ |
| **Mathematical Goal** | Minimizes the **square** of the weights. | Minimizes the **absolute value** of the weights. |
| **Coefficient Shrinkage** | Shrinks coefficients asymptotically toward zero, but they rarely reach exactly zero. | Can shrink coefficients **exactly to zero**, effectively removing the feature. |
| **Feature Selection** | **No.** Keeps all predictors in the final model, though some may have tiny weights. | **Yes.** Acts as an automated feature selector by discarding unimportant variables. |
| **Model Complexity** | Produces a **dense** model (uses all features). | Produces a **sparse** model (uses a subset of features). |
| **Ideal Scenario** | When you have many features that all contribute a small amount to the output. | When you have many features, but only a few are actually significant. |
| **Handling Correlated Data** | Very stable; handles multicollinearity by distributing weights across correlated features. | Less stable; if features are highly correlated, it may randomly pick one and zero out the others. |

```mermaid
graph LR
    subgraph L2["L2 Regularization Constraint (Ridge)"]
        O1["$$w_1^2 + w_2^2 \leq t$$"] --> C1["$$\text{Circle (L2 Ball)}$$"]
        C1 --> E1["$$\text{Smooth Boundary}$$"]
        E1 --> S1["$$\text{Rarely touches axes}$$"]
        S1 --> R1["$$w_1, w_2 \neq 0$$"]
    end

    subgraph L1["L1 Regularization Constraint (Lasso)"]
        O2["$$|w_1| + |w_2| \leq t$$"] --> D1["$$\text{Diamond (L1 Ball)}$$"]
        D1 --> C2["$$\text{Sharp Corners}$$"]
        C2 --> A1["$$\text{Corners lie on axes}$$"]
        A1 --> Z1["$$w_1 = 0 \ \text{or}\ w_2 = 0$$"]
    end

    R1 -.->|"$$\text{Geometry Explains Behavior}$$"| Z1
```

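The sparsity contrast from the table and geometry above is easy to verify empirically. In this sketch (synthetic data where only 3 of 20 features are truly informative; the alpha values are arbitrary), Lasso zeroes out most coefficients while Ridge keeps all of them:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 20 features, but only 3 actually drive the target
X, y = make_regression(n_samples=200, n_features=20, n_informative=3,
                       noise=5.0, random_state=0)

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)

print("Ridge zero coefficients:", np.sum(ridge.coef_ == 0))  # dense model
print("Lasso zero coefficients:", np.sum(lasso.coef_ == 0))  # sparse model
```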
## 6. RidgeCV: Finding the Best Alpha

Finding the perfect alpha manually is tedious. Scikit-Learn provides `RidgeCV`, which uses built-in cross-validation to find the optimal alpha for your specific dataset automatically.

```python
from sklearn.linear_model import RidgeCV

# Define a list of alphas to test
alphas = [0.1, 1.0, 10.0, 100.0]

# RidgeCV finds the best one automatically
ridge_cv = RidgeCV(alphas=alphas)
ridge_cv.fit(X_train_scaled, y_train)

print(f"Best Alpha: {ridge_cv.alpha_}")
```

## References for More Details

* **[Scikit-Learn: Linear Models](https://scikit-learn.org/stable/modules/linear_model.html#ridge-regression):** Technical details on the solvers used (like 'cholesky' or 'sag').