Commit 0b7573c: Merge pull request #167 from codeharborhub/dev-1
done supervised learn...
2 parents b785db8 + e2c9e1e

3 files changed: +355 −0 lines changed
---
title: Elastic Net Regression
sidebar_label: Elastic Net
description: "Combining L1 and L2 regularization for the ultimate balance in feature selection and model stability."
tags: [machine-learning, supervised-learning, regression, elastic-net, regularization]
---

**Elastic Net** is a regularized regression method that linearly combines the $L1$ and $L2$ penalties of the [Lasso](./lasso) and [Ridge](./ridge) methods.

It was developed to overcome the limitations of Lasso, particularly when dealing with highly correlated features or situations where the number of features exceeds the number of samples.
## 1. The Mathematical Objective

Elastic Net adds both penalties to the loss function, using a mixing ratio to determine how much of each penalty to apply.

The cost function is:

$$
Cost = \text{MSE} + \alpha \cdot \rho \sum_{j=1}^{p} |\beta_j| + \frac{\alpha \cdot (1 - \rho)}{2} \sum_{j=1}^{p} \beta_j^2
$$

* **$\alpha$ (Alpha):** The overall regularization strength.
* **$\rho$ (L1 Ratio):** Controls the mix between Lasso and Ridge.
  * If $\rho = 1$, it is pure **Lasso**.
  * If $\rho = 0$, it is pure **Ridge**.
  * If $0 < \rho < 1$, it is a **combination** of the two.
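The mixing behavior of $\rho$ can be sketched directly in NumPy (a toy illustration of the penalty term only, not scikit-learn's internals; the coefficient vector `beta` is made up for the example):

```python
import numpy as np

def elastic_net_penalty(beta, alpha=1.0, rho=0.5):
    """Elastic Net penalty: alpha * (rho * L1 + (1 - rho) / 2 * L2^2)."""
    l1 = np.sum(np.abs(beta))      # Lasso part
    l2 = np.sum(beta ** 2)         # Ridge part
    return alpha * (rho * l1 + (1 - rho) / 2 * l2)

beta = np.array([1.0, -2.0, 0.5])

# rho = 1 -> pure Lasso penalty: |1| + |-2| + |0.5| = 3.5
print(elastic_net_penalty(beta, rho=1.0))   # 3.5
# rho = 0 -> pure Ridge penalty: (1 + 4 + 0.25) / 2 = 2.625
print(elastic_net_penalty(beta, rho=0.0))   # 2.625
```

Sliding `rho` between 0 and 1 interpolates smoothly between these two extremes.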
## 2. Why use Elastic Net?

### A. Overcoming Lasso's Limitations

Lasso tends to pick one variable from a group of highly correlated variables and ignore the others. Thanks to its Ridge component, Elastic Net is more likely to keep the whole group in the model (the "grouping effect").

### B. High-Dimensional Data

When the number of features ($p$) is greater than the number of observations ($n$), Lasso can select at most $n$ variables. Elastic Net can select more than $n$ variables if necessary.

### C. Maximum Flexibility

Because you can tune the ratio, you can "slide" your model anywhere on the spectrum between Ridge and Lasso to find the exact point that minimizes validation error.
```mermaid
graph LR
    subgraph RIDGE["Ridge (L2) Coefficient Path"]
        R0["$$\\alpha = 0$$"] --> R1["$$w_1, w_2, w_3$$"]
        R1 --> R2["$$\\alpha \\uparrow$$"]
        R2 --> R3["$$|w_i| \\downarrow$$ (Smooth Shrinkage)"]
        R3 --> R4["$$w_i \\neq 0$$"]
    end

    subgraph LASSO["Lasso (L1) Coefficient Path"]
        L0["$$\\alpha = 0$$"] --> L1["$$w_1, w_2, w_3$$"]
        L1 --> L2["$$\\alpha \\uparrow$$"]
        L2 --> L3["$$|w_i| \\downarrow$$ (Linear Shrinkage)"]
        L3 --> L4["$$w_j = 0$$ for some j"]
    end

    subgraph ENET["Elastic Net (L1 + L2) Coefficient Path"]
        E0["$$\\alpha = 0$$"] --> E1["$$w_1, w_2, w_3$$"]
        E1 --> E2["$$\\alpha \\uparrow$$"]
        E2 --> E3["$$\\text{Mixed Shrinkage}$$"]
        E3 --> E4["$$\\text{Grouped Selection + Stability}$$"]
    end

    R4 -.->|"$$\\text{No Sparsity}$$"| L4
    L4 -.->|"$$\\text{Pure Sparsity}$$"| E4
```
## 3. Key Hyperparameters in Scikit-Learn

* **`alpha`**: Constant that multiplies the penalty terms. Higher values mean stronger regularization.
* **`l1_ratio`**: The $\rho$ parameter. Scikit-Learn uses `l1_ratio=0.5` by default, giving equal weight to $L1$ and $L2$.
## 4. Implementation with Scikit-Learn

```python
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler

# 1. Scaling is mandatory, because the penalty depends on coefficient size
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 2. Initialize and train
# l1_ratio=0.5 means 50% Lasso, 50% Ridge
model = ElasticNet(alpha=1.0, l1_ratio=0.5)
model.fit(X_scaled, y)

# 3. View the results
print(f"Coefficients: {model.coef_}")
```
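For a fully self-contained version of this pipeline, here is a sketch on synthetic data (the dataset sizes and `alpha` are made up for illustration); note that the scaler is fitted on the training split only and then reused on the test split:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data: 100 samples, 20 features, only 5 carry real signal
X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Fit the scaler on the training set only, then apply it to the test set
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = ElasticNet(alpha=0.5, l1_ratio=0.5)
model.fit(X_train_scaled, y_train)

print(f"R^2 on test set: {model.score(X_test_scaled, y_test):.3f}")
print(f"Non-zero coefficients: {np.sum(model.coef_ != 0)} of {len(model.coef_)}")
```

Fitting the scaler on the combined data would leak test-set statistics into training, so the split comes first.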
## 5. Decision Matrix: Which one to use?

| Scenario | Recommended Model |
| --- | --- |
| Most features are useful and have small effects. | **Ridge** |
| You suspect only a few features are actually important. | **Lasso** |
| You have many features that are highly correlated with each other. | **Elastic Net** |
| The number of features is much larger than the number of samples ($p \gg n$). | **Elastic Net** |
## 6. Automated Tuning with ElasticNetCV

Like Ridge and Lasso, Scikit-Learn provides a cross-validation version that tests multiple `alpha` and `l1_ratio` values to find the best combination for you.

```python
from sklearn.linear_model import ElasticNetCV

# Search for the best alpha and l1_ratio via 5-fold cross-validation
model_cv = ElasticNetCV(l1_ratio=[.1, .5, .7, .9, .95, .99, 1], cv=5)
model_cv.fit(X_scaled, y)

print(f"Best Alpha: {model_cv.alpha_}")
print(f"Best L1 Ratio: {model_cv.l1_ratio_}")
```
## References for More Details

* **[Scikit-Learn ElasticNet Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html):** Understanding technical parameters like `tol` (tolerance) and `max_iter`.

---

**You've now covered all the primary linear regression models! But what if your goal isn't to predict a number, but to group similar data points together?** Head over to the [Clustering](/tutorial/category/clustering) section to explore techniques like K-Means and DBSCAN!
---
title: "Lasso Regression (L1 Regularization)"
sidebar_label: Lasso Regression
description: "Understanding L1 regularization, sparse models, and automated feature selection."
tags: [machine-learning, supervised-learning, regression, lasso, l1-regularization]
---

**Lasso Regression** (Least Absolute Shrinkage and Selection Operator) is a type of linear regression that uses **L1 Regularization**.

While standard Linear Regression minimizes only the error, Lasso adds a penalty equal to the absolute value of the magnitude of the coefficients. This forces the model to be not only accurate but also as simple as possible.
## 1. The Mathematical Objective

Lasso minimizes the following cost function:

$$
Cost = \text{MSE} + \alpha \sum_{j=1}^{p} |\beta_j|
$$

Where:

* **MSE (Mean Squared Error):** Keeps the model accurate.
* **$\alpha$ (Alpha):** The tuning parameter that controls the strength of the penalty.
* **$|\beta_j|$:** The absolute value of the coefficients.
## 2. Feature Selection: The Power of Zero

The most significant difference between Lasso and its sibling, [Ridge Regression](./ridge), is that Lasso can shrink coefficients **exactly to zero**.

When a coefficient becomes zero, that feature is effectively removed from the model. This makes Lasso an excellent tool for:

1. **Automated Feature Selection:** Identifying the most important variables in a dataset with hundreds of features.
2. **Model Interpretability:** Creating "sparse" models that are easier for humans to understand.
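This zeroing behavior is easy to verify on synthetic data (a minimal sketch; the dataset sizes and `alpha` are chosen arbitrarily):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# 50 features, but only 5 carry signal
X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=5.0, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

lasso = Lasso(alpha=1.0)
lasso.fit(X_scaled, y)

# Many of the 45 pure-noise features are driven exactly to zero
n_selected = np.sum(lasso.coef_ != 0)
print(f"Features kept: {n_selected} of {X.shape[1]}")
```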
```mermaid
graph LR
    subgraph RIDGE["Ridge Regression (L2)"]
        A1["$$\\alpha = 0$$"] --> B1["$$w_1, w_2, w_3$$ (OLS Coefficients)"]
        B1 --> C1["$$\\alpha \\uparrow$$"]
        C1 --> D1["$$|w_i| \\downarrow$$ (Smooth Shrinkage)"]
        D1 --> E1["$$w_i \\neq 0$$ for all i"]
        E1 --> F1["$$\\text{No Feature Elimination}$$"]
    end

    subgraph LASSO["Lasso Regression (L1)"]
        A2["$$\\alpha = 0$$"] --> B2["$$w_1, w_2, w_3$$ (OLS Coefficients)"]
        B2 --> C2["$$\\alpha \\uparrow$$"]
        C2 --> D2["$$|w_i| \\downarrow$$ (Linear Shrinkage)"]
        D2 --> E2["$$w_j = 0$$ for some j"]
        E2 --> F2["$$\\text{Automatic Feature Selection}$$"]
    end

    F1 -.->|"$$\\text{Shrinkage Path Comparison}$$"| F2
```
## 3. Choosing the Alpha ($\alpha$) Parameter

* **If $\alpha = 0$:** The penalty is removed, and the result is standard Ordinary Least Squares (OLS).
* **As $\alpha$ increases:** More coefficients are pushed to zero, leading to a simpler, more biased model.
* **If $\alpha$ is too high:** All coefficients become zero, and the model predicts only the mean (underfitting).
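The extreme case is easy to demonstrate: with a very large alpha, every coefficient collapses to zero and the model falls back to its intercept, which is just the mean of `y` (a toy sketch; the data and alpha are made up):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=100, n_features=10, noise=1.0, random_state=0)

lasso = Lasso(alpha=1e6)  # absurdly strong penalty
lasso.fit(X, y)

print(np.all(lasso.coef_ == 0))               # True: every feature eliminated
print(np.isclose(lasso.intercept_, y.mean())) # True: predictions collapse to the mean
```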
## 4. Implementation with Scikit-Learn

```python
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# 1. Scaling is REQUIRED for Lasso, since the penalty depends on coefficient size
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# 2. Initialize and Train
# 'alpha' is the regularization strength
lasso = Lasso(alpha=0.1)
lasso.fit(X_train_scaled, y_train)

# 3. Check which features were selected (non-zero)
# 'feature_names' is the list of column names for X_train
importance = pd.Series(lasso.coef_, index=feature_names)
print(importance[importance != 0])
```
## 5. Lasso vs. Ridge

| Feature | Ridge ($L2$) | Lasso ($L1$) |
| --- | --- | --- |
| **Penalty** | Square of coefficients | Absolute value of coefficients |
| **Coefficients** | Shrink towards zero, but never reach it | Can shrink exactly to **zero** |
| **Use Case** | When most features are useful | When you have many "noisy" or useless features |
| **Model Type** | Dense (all features kept) | Sparse (some features removed) |
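The table's central claim, dense vs. sparse, can be checked side by side (a sketch on synthetic data; the alpha values are arbitrary):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       noise=5.0, random_state=1)
X_scaled = StandardScaler().fit_transform(X)

ridge = Ridge(alpha=1.0).fit(X_scaled, y)
lasso = Lasso(alpha=1.0).fit(X_scaled, y)

# Ridge keeps every feature (no exact zeros); Lasso discards many
print(f"Ridge zero coefficients: {np.sum(ridge.coef_ == 0)}")
print(f"Lasso zero coefficients: {np.sum(lasso.coef_ == 0)}")
```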
## 6. Limitations of Lasso

1. **Correlated Features:** If two features are highly correlated, Lasso tends to arbitrarily pick one and discard the other, which can lead to instability.
2. **Sample Size:** If $p > n$ (more features than samples), Lasso can select at most $n$ features.
## References for More Details

* **[Scikit-Learn Lasso Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html):** Exploring `LassoCV`, which automatically finds the best alpha using cross-validation.
---
title: "Ridge Regression (L2 Regularization)"
sidebar_label: Ridge Regression
description: "Mastering L2 regularization to prevent overfitting and handle multicollinearity in regression models."
tags: [machine-learning, supervised-learning, regression, ridge, l2-regularization]
---

**Ridge Regression** is an extension of linear regression that adds a regularization term to the cost function. It is specifically designed to handle **overfitting** and issues caused by **multicollinearity** (when input features are highly correlated).
## 1. The Mathematical Objective

In standard OLS (Ordinary Least Squares), the model cares only about minimizing the error. Ridge Regression adds a "penalty" proportional to the square of the magnitude of the coefficients ($\beta$).

The cost function becomes:

$$
Cost = \text{MSE} + \alpha \sum_{j=1}^{p} \beta_j^2
$$

* **MSE (Mean Squared Error):** The standard loss (prediction error).
* **$\alpha$ (Alpha):** The complexity parameter. It controls how much you want to penalize the size of the coefficients.
* **$\beta_j^2$:** The L2 norm. Squaring the coefficients keeps them small, but they rarely hit exactly zero.
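Unlike Lasso, this objective has a closed-form solution, $\hat{\beta} = (X^\top X + \alpha I)^{-1} X^\top y$, which can be checked against scikit-learn (a sketch on made-up data; note that scikit-learn penalizes the residual *sum* of squares rather than the mean, which is what this formula matches, and the intercept is omitted for simplicity):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

alpha = 1.0

# Closed form: beta = (X^T X + alpha * I)^(-1) X^T y
beta = np.linalg.solve(X.T @ X + alpha * np.eye(3), X.T @ y)

# scikit-learn's Ridge (no intercept, to match the formula exactly)
ridge = Ridge(alpha=alpha, fit_intercept=False)
ridge.fit(X, y)

print(np.allclose(beta, ridge.coef_))  # True: the two solutions agree
```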
## 2. Why use Ridge?

### A. Preventing Overfitting

When a model has too many features, or the features are highly correlated, the coefficients ($\beta$) can become very large, making the model extremely sensitive to small fluctuations in the training data. Ridge "shrinks" these coefficients, making the model more stable.

### B. Handling Multicollinearity

If two variables are nearly identical (e.g., height in inches and height in centimeters), standard regression might assign one a massive positive weight and the other a massive negative weight. Ridge forces the weights to be distributed more evenly and kept small.
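This weight-sharing behavior can be seen directly by duplicating a column (a toy sketch; the data are synthetic): Ridge splits the total effect roughly equally between the two identical copies instead of playing them against each other.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(42)
x = rng.normal(size=(200, 1))
X = np.hstack([x, x])  # two perfectly correlated (identical) features
y = 3.0 * x[:, 0] + rng.normal(scale=0.1, size=200)

ridge = Ridge(alpha=1.0)
ridge.fit(X, y)

# The total effect (~3.0) is split evenly across the duplicated columns
print(ridge.coef_)
```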
```mermaid
graph LR
    subgraph RIDGE["Ridge Regression Coefficient Shrinkage"]
        A["$$\\alpha = 0$$"] --> B["$$w_1, w_2, w_3$$ (OLS Coefficients)"]
        B --> C["$$\\alpha \\uparrow$$"]
        C --> D["$$|w_i| \\downarrow$$ (Shrinkage)"]
        D --> E["$$w_i \\to 0$$ (Never Exactly Zero)"]
        E --> F["$$\\text{Reduced Model Variance}$$"]
    end

    subgraph OBJ["Optimization View"]
        L["$$\\min \\sum (y - \\hat{y})^2 + \\alpha \\sum w_i^2$$"]
        L --> G["$$\\text{Penalty Grows with } \\alpha$$"]
        G --> H["$$\\text{Stronger Pull Toward Origin}$$"]
    end

    C -.->|"$$\\text{Controls Strength}$$"| G
    F -.->|"$$\\text{Bias–Variance Tradeoff}$$"| H
```
## 3. The Alpha ($\alpha$) Trade-off

Choosing the right $\alpha$ is a balancing act between **bias** and **variance**:

* **$\alpha = 0$:** Equivalent to standard Linear Regression (high variance, low bias).
* **As $\alpha \to \infty$:** The penalty dominates. Coefficients approach zero, and the model becomes a flat line (low variance, high bias).
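The trade-off shows up directly in the size of the learned weights: as `alpha` grows, the coefficient norm shrinks monotonically (a quick sketch on synthetic data; the alpha grid is arbitrary):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=100, n_features=5, noise=1.0, random_state=0)

# The coefficient norm shrinks steadily as alpha grows
norms = []
for alpha in [0.01, 1.0, 100.0, 10000.0]:
    r = Ridge(alpha=alpha).fit(X, y)
    norms.append(np.linalg.norm(r.coef_))

print([round(n, 2) for n in norms])
```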
## 4. Implementation with Scikit-Learn

```python
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

# 1. Scaling is MANDATORY for Ridge
# Because the penalty is based on the size of coefficients,
# features with larger scales would be penalized unfairly.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # reuse the fitted scaler on the test set

# 2. Initialize and Train
# alpha=1.0 is the default
ridge = Ridge(alpha=1.0)
ridge.fit(X_train_scaled, y_train)

# 3. Predict
y_pred = ridge.predict(X_test_scaled)
```
## 5. Ridge vs. Lasso: A Summary

| Feature | Ridge Regression ($L2$) | Lasso Regression ($L1$) |
| --- | --- | --- |
| **Penalty Term** | $\alpha \sum_{j=1}^{p} \beta_j^2$ | $\alpha \sum_{j=1}^{p} \vert \beta_j \vert$ |
| **Mathematical Goal** | Minimizes the **square** of the weights. | Minimizes the **absolute value** of the weights. |
| **Coefficient Shrinkage** | Shrinks coefficients asymptotically toward zero, but they rarely reach exactly zero. | Can shrink coefficients **exactly to zero**, effectively removing the feature. |
| **Feature Selection** | **No.** Keeps all predictors in the final model, though some may have tiny weights. | **Yes.** Acts as an automated feature selector by discarding unimportant variables. |
| **Model Complexity** | Produces a **dense** model (uses all features). | Produces a **sparse** model (uses a subset of features). |
| **Ideal Scenario** | When you have many features that all contribute a small amount to the output. | When you have many features, but only a few are actually significant. |
| **Handling Correlated Data** | Very stable; handles multicollinearity by distributing weights across correlated features. | Less stable; if features are highly correlated, it may arbitrarily pick one and zero out the others. |
```mermaid
graph LR
    subgraph L2["L2 Regularization Constraint (Ridge)"]
        O1["$$w_1^2 + w_2^2 \\leq t$$"] --> C1["$$\\text{Circle (L2 Ball)}$$"]
        C1 --> E1["$$\\text{Smooth Boundary}$$"]
        E1 --> S1["$$\\text{Rarely touches axes}$$"]
        S1 --> R1["$$w_1, w_2 \\neq 0$$"]
    end

    subgraph L1["L1 Regularization Constraint (Lasso)"]
        O2["$$|w_1| + |w_2| \\leq t$$"] --> D1["$$\\text{Diamond (L1 Ball)}$$"]
        D1 --> C2["$$\\text{Sharp Corners}$$"]
        C2 --> A1["$$\\text{Corners lie on axes}$$"]
        A1 --> Z1["$$w_1 = 0 \\ \\text{or}\\ w_2 = 0$$"]
    end

    R1 -.->|"$$\\text{Geometry Explains Behavior}$$"| Z1
```
## 6. RidgeCV: Finding the Best Alpha

Finding the perfect $\alpha$ manually is tedious. Scikit-Learn provides `RidgeCV`, which uses built-in cross-validation to automatically find the optimal alpha for your specific dataset.

```python
from sklearn.linear_model import RidgeCV

# Define a list of alphas to test
alphas = [0.1, 1.0, 10.0, 100.0]

# RidgeCV finds the best one automatically
ridge_cv = RidgeCV(alphas=alphas)
ridge_cv.fit(X_train_scaled, y_train)

print(f"Best Alpha: {ridge_cv.alpha_}")
```
## References for More Details

* **[Scikit-Learn: Linear Models](https://scikit-learn.org/stable/modules/linear_model.html#ridge-regression):** Technical details on the solvers used (like 'cholesky' or 'sag').
