Commit b785db8

Merge pull request #166 from codeharborhub/dev-1
added ml content
2 parents ad8191f + 2348619 commit b785db8

2 files changed: +331 −0 lines changed
---
title: Linear Regression
sidebar_label: Linear Regression
description: "Mastering the fundamentals of predicting continuous values using lines, slopes, and intercepts."
tags: [machine-learning, supervised-learning, regression, linear-regression, ordinary-least-squares]
---

**Linear Regression** is a supervised learning algorithm used to predict a continuous numerical output based on one or more input features. It assumes that there is a linear relationship between the input variables ($X$) and the single output variable ($y$).

## 1. The Mathematical Model

The goal of linear regression is to find the "Line of Best Fit." Mathematically, this line is represented by the equation:

$$
y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \dots + \beta_nx_n + \epsilon
$$

Where:

* **$y$**: The dependent variable (Target).
* **$x$**: The independent variables (Features).
* **$\beta_0$**: The **Intercept** (where the line crosses the y-axis).
* **$\beta_1, \beta_2, \dots, \beta_n$**: The **Coefficients** or Slopes (representing the weight of each feature).
* **$\epsilon$**: The error term (Residual).

## 2. Ordinary Least Squares (OLS)

How does the model find the "best" line? It uses a method called **Ordinary Least Squares**.

The algorithm calculates the vertical distance between every actual data point and the corresponding predicted point on the line. It then squares these distances (so positive and negative errors do not cancel out) and sums them up. The "best" line is the one that minimizes this **Sum of Squared Errors (SSE)**.

```mermaid
graph LR
    subgraph LR["Linear Regression Model"]
        X["$$x$$ (Input Feature)"] --> H["$$\hat{y} = wx + b$$"]
    end

    subgraph ERR["Residuals"]
        Y["$$y$$ (Actual Value)"]
        H --> R["$$r = y - \hat{y}$$"]
        Y --> R
    end

    subgraph SSE["Sum of Squared Errors"]
        R --> S1["$$r^2 = (y - \hat{y})^2$$"]
        S1 --> S2["$$\text{SSE} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$"]
        S2 --> S3["$$\text{Loss to Minimize}$$"]
    end

    X -.->|"$$\text{Best Fit Line}$$"| Y
```

In this diagram:

* The input feature ($x$) is fed into the linear model to produce a predicted value ($\hat{y}$).
* The residual ($r$) is calculated as the difference between the actual value ($y$) and the predicted value ($\hat{y}$).
* The squared residuals are summed up to compute the SSE, which the model aims to minimize.
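
To make the SSE objective concrete, the residuals and their squared sum for one candidate line can be computed directly with NumPy. The data points and the slope/intercept below are illustrative, not from any real dataset:

```python
import numpy as np

# Illustrative data: inputs and actual target values
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

# A candidate line: y_hat = w*x + b
w, b = 2.0, 0.0
y_hat = w * x + b              # predicted values on the line
residuals = y - y_hat          # signed distances from line to data
sse = np.sum(residuals ** 2)   # Sum of Squared Errors to minimize

print(f"Residuals: {residuals}")
print(f"SSE: {sse:.4f}")
```

OLS searches over all possible $(w, b)$ pairs for the one that makes this SSE as small as possible.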

## 3. Simple vs. Multiple Linear Regression

* **Simple Linear Regression:** Uses only one feature to predict the target (e.g., predicting house price based *only* on square footage).
* **Multiple Linear Regression:** Uses two or more features (e.g., predicting house price based on square footage, number of bedrooms, and age of the house).
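
A minimal sketch of the multiple-regression case, using made-up house-price data with two features (square footage and bedroom count); the numbers are purely illustrative:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: [square footage, bedrooms] -> price (in $1000s)
X = np.array([
    [1000, 2],
    [1500, 3],
    [1800, 3],
    [2400, 4],
    [3000, 4],
])
y = np.array([200, 280, 320, 410, 490])

model = LinearRegression().fit(X, y)
print("Coefficients:", model.coef_)     # one learned weight per feature
print("Intercept:", model.intercept_)
print("Prediction for 2000 sqft, 3 bed:", model.predict([[2000, 3]])[0])
```

The fitted model exposes one coefficient per feature, which is what makes multiple linear regression easy to interpret.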

## 4. Key Assumptions

For Linear Regression to be effective and reliable, the data should ideally meet these criteria:

1. **Linearity:** The relationship between $X$ and $y$ is a straight line.
2. **Independence:** Observations are independent of each other.
3. **Homoscedasticity:** The variance of the residual errors is constant across all levels of the independent variables.
4. **Normality:** The residuals (errors) of the model are normally distributed.
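
These assumptions can be spot-checked on a fitted model's residuals. A minimal sketch on synthetic data, using `scipy.stats.shapiro` for the normality check (the data and thresholds here are illustrative; in practice residual plots are also used):

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression

# Synthetic data with a roughly linear trend plus Gaussian noise
rng = np.random.default_rng(0)
X = np.linspace(1, 10, 50).reshape(-1, 1)
y = 3 * X.ravel() + 5 + rng.normal(0, 1, 50)

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

# Normality (assumption 4): Shapiro-Wilk test on the residuals
stat, p_value = stats.shapiro(residuals)
print(f"Shapiro-Wilk p-value: {p_value:.3f}")  # large p -> no evidence against normality

# Homoscedasticity (assumption 3): residual spread should be similar across X
half = len(X) // 2
print("Residual std (low X): ", residuals[:half].std())
print("Residual std (high X):", residuals[half:].std())
```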

## 5. Implementation with Scikit-Learn

```python title="Linear Regression with Scikit-Learn"
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# --------------------------------------------------
# 1. Create a sample dataset
# --------------------------------------------------
# Example: Predict salary based on years of experience

np.random.seed(42)

X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).reshape(-1, 1)  # Feature
y = np.array([30, 35, 37, 42, 45, 50, 52, 56, 60, 65])        # Target

# --------------------------------------------------
# 2. Split the data into training and testing sets
# --------------------------------------------------
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# --------------------------------------------------
# 3. Initialize the Linear Regression model
# --------------------------------------------------
model = LinearRegression()

# --------------------------------------------------
# 4. Train the model
# --------------------------------------------------
model.fit(X_train, y_train)

# --------------------------------------------------
# 5. Make predictions
# --------------------------------------------------
y_pred = model.predict(X_test)

# --------------------------------------------------
# 6. Inspect learned parameters
# --------------------------------------------------
print(f"Intercept (β₀): {model.intercept_}")
print(f"Coefficient (β₁): {model.coef_[0]}")

# --------------------------------------------------
# 7. Evaluate the model
# --------------------------------------------------
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error (MSE): {mse}")
print(f"R² Score: {r2}")

# --------------------------------------------------
# 8. Compare actual vs predicted values
# --------------------------------------------------
results = pd.DataFrame({
    "Actual": y_test,
    "Predicted": y_pred
})

print("\nPrediction Results:")
print(results)
```

```bash title="Output"
Intercept (β₀): 26.025862068965512
Coefficient (β₁): 3.836206896551725
Mean Squared Error (MSE): 0.9994426278240237
R² Score: 0.9936035671819262

Prediction Results:
   Actual  Predicted
0      60  60.551724
1      35  33.698276
```

## 6. Evaluating Regression

Unlike classification (where we use accuracy), we evaluate regression using error metrics:

* **Mean Squared Error (MSE):** The average of the squared differences between predicted and actual values.
* **Root Mean Squared Error (RMSE):** The square root of MSE (brings the error back to the original units).
* **R-Squared ($R^2$):** The proportion of the variance in $y$ explained by the model (usually between 0 and 1; it can be negative when the model fits worse than simply predicting the mean).

```python title="Evaluating Linear Regression Model"
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Calculate evaluation metrics
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)  # Root Mean Squared Error
r2 = r2_score(y_test, y_pred)

# Display results
print("Model Evaluation Metrics")
print("------------------------")
print(f"Mean Squared Error (MSE): {mse:.4f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.4f}")
print(f"R-Squared (R²): {r2:.4f}")
```

```bash title="Output"
Model Evaluation Metrics
------------------------
Mean Squared Error (MSE): 0.9994
Root Mean Squared Error (RMSE): 0.9997
R-Squared (R²): 0.9936
```

## 7. Pros and Cons

| Advantages | Disadvantages |
| --- | --- |
| **Highly Interpretable:** You can see exactly how much each feature influences the result. | **Sensitive to Outliers:** A single extreme value can significantly tilt the line. |
| **Fast:** Requires very little computational power. | **Assumption Heavy:** Fails if the underlying relationship is non-linear. |
| **Baseline Model:** Excellent starting point for any regression task. | **Multicollinearity:** Highly correlated features make the coefficients unstable and can lead to overfitting. |

## References for More Details

* **[Scikit-Learn Linear Models](https://scikit-learn.org/stable/modules/linear_model.html):** Technical details on OLS and alternative solvers.
---
title: "Polynomial Regression: Beyond Straight Lines"
sidebar_label: Polynomial Regression
description: "Learning to model curved relationships by transforming features into higher-degree polynomials."
tags: [machine-learning, supervised-learning, regression, polynomial-features, non-linear]
---

**Polynomial Regression** is a form of regression analysis in which the relationship between the independent variable $x$ and the dependent variable $y$ is modelled as an $n^{th}$ degree polynomial.

While it fits a non-linear curve to the data, as a statistical estimation problem it is still considered **linear**, because the regression function is linear in the unknown parameters ($\beta$) estimated from the data.

## 1. Why use Polynomial Regression?

Linear regression requires a straight-line relationship. However, real-world data often follows curves, such as:

* **Growth Rates:** Biological growth or interest rates.
* **Physics:** The path of a projectile or the relationship between speed and braking distance.
* **Economics:** Diminishing returns on investment.

## 2. The Mathematical Equation

In a simple linear model, we have:

$$
y = \beta_0 + \beta_1x_1
$$

In Polynomial Regression, we add higher-degree terms of the same feature:

$$
y = \beta_0 + \beta_1x + \beta_2x^2 + \beta_3x^3 + \dots + \beta_nx^n + \epsilon
$$

Where:

* **$y$**: The dependent variable (Target).
* **$x$**: The independent variable (Feature).
* **$\beta_0$**: The Intercept.
* **$\beta_1, \beta_2, ..., \beta_n$**: The Coefficients for each polynomial term.
* **$\epsilon$**: The error term (Residual).

By treating $x^2, x^3, ...$ as distinct features, we allow the model to "bend" to fit the data points.

## 3. The Danger of Degree: Overfitting

Choosing the right **degree** ($n$) is the most critical part of Polynomial Regression:

* **Underfitting (Degree 1):** A straight line that fails to capture the curve in the data.
* **Optimal Fit (Degree 2 or 3):** A smooth curve that captures the general trend.
* **Overfitting (Degree 10+):** A wiggly line that passes through every single data point but fails to predict new data because it has captured the noise instead of the signal.

```mermaid
graph LR
    subgraph UF["Underfitting (Low Degree)"]
        X1["$$x$$"] --> L1["$$\hat{y} = w_1x + b$$"]
        L1 --> U1["$$\text{High Bias}$$"]
        U1 --> U2["$$\text{Misses Data Pattern}$$"]
        U2 --> U3["$$\text{High Train Error}$$"]
    end

    subgraph OFIT["Optimal Fit (Medium Degree)"]
        X2["$$x$$"] --> M1["$$\hat{y} = w_1x + w_2x^2 + b$$"]
        M1 --> O1["$$\text{Balanced Bias–Variance}$$"]
        O1 --> O2["$$\text{Captures True Trend}$$"]
        O2 --> O3["$$\text{Low Train \& Test Error}$$"]
    end

    subgraph OVF["Overfitting (High Degree)"]
        X3["$$x$$"] --> H1["$$\hat{y} = \sum_{k=1}^{d} w_k x^k$$"]
        H1 --> V1["$$\text{Low Bias}$$"]
        V1 --> V2["$$\text{High Variance}$$"]
        V2 --> V3["$$\text{Fits Noise}$$"]
        V3 --> V4["$$\text{Poor Generalization}$$"]
    end

    U3 -.->|"$$\text{Increase Degree}$$"| O3
    O3 -.->|"$$\text{Too Complex}$$"| V4
```
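
The three regimes above can be sketched by comparing training and test error across degrees. The data here is synthetic (a quadratic trend plus noise), so the exact numbers will vary, but the pattern should hold: training error keeps falling as the degree grows, while test error bottoms out near the true degree:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Synthetic data: a quadratic trend plus Gaussian noise
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3, 3, 40)).reshape(-1, 1)
y = 0.5 * X.ravel() ** 2 + rng.normal(0, 0.5, 40)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

train_mse, test_mse = {}, {}
for degree in (1, 2, 10):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    train_mse[degree] = mean_squared_error(y_tr, model.predict(X_tr))
    test_mse[degree] = mean_squared_error(y_te, model.predict(X_te))
    print(f"degree={degree:2d}  "
          f"train MSE={train_mse[degree]:.3f}  test MSE={test_mse[degree]:.3f}")
```

Degree 1 underfits (high error everywhere), degree 2 matches the generating process, and degree 10 drives training error down further while generalizing no better.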

## 4. Implementation with Scikit-Learn

In Scikit-Learn, we perform Polynomial Regression by using a **Transformer** to generate new features and then passing them to a standard `LinearRegression` model.

```python title="Polynomial Regression with Scikit-Learn"
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# 1. Generate data (Example: a parabola with some noise)
rng = np.random.default_rng(42)
X = np.linspace(-3, 3, 50).reshape(-1, 1)
y = X.ravel() ** 2 + rng.normal(0, 0.5, 50)

# 2. Create a pipeline that:
#    a) Generates polynomial terms (x^2)
#    b) Fits a linear regression to those terms
degree = 2
poly_model = make_pipeline(PolynomialFeatures(degree), LinearRegression())

# 3. Train the model
poly_model.fit(X, y)

# 4. Predict
y_pred = poly_model.predict(X)
```

## 5. Feature Scaling is Mandatory

When you square or cube features, the range of values expands drastically.

* If $x = 10$, then $x^2 = 100$ and $x^3 = 1000$.
* If $x = 100$, then $x^2 = 10000$ and $x^3 = 1000000$.

Because of this explosive growth, you should **always scale your features** (e.g., using `StandardScaler`) before or after applying polynomial transformations to prevent numerical instability.
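
One common arrangement is to place `StandardScaler` after `PolynomialFeatures` in the pipeline, so the expanded terms ($x$, $x^2$, $x^3$) are brought onto a comparable scale before the regression sees them. A minimal sketch with illustrative data:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression

# Scale AFTER expansion: x, x^2, x^3 each get zero mean and unit variance
model = make_pipeline(
    PolynomialFeatures(degree=3),
    StandardScaler(),
    LinearRegression(),
)

# Illustrative data: y = 0.01 * x^2, with x spanning two orders of magnitude
X = np.linspace(1, 100, 50).reshape(-1, 1)
y = 0.01 * X.ravel() ** 2

model.fit(X, y)
print(model.predict(np.array([[50.0]])))  # should be close to 0.01 * 50^2 = 25
```

The pipeline applies the same transformation chain at predict time, so the scaling statistics learned during `fit` are reused automatically.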

## 6. Pros and Cons

| Advantages | Disadvantages |
| --- | --- |
| Can model complex, non-linear relationships. | Extremely sensitive to outliers. |
| A broad range of curved functions can be approximated by choosing the degree. | High risk of overfitting if the degree is too high. |
| Fits within the standard linear regression framework (OLS still applies). | Becomes computationally expensive with many features. |

## References for More Details

* **[Interactive Polynomial Regression Demo](https://phet.colorado.edu/sims/html/least-squares-regression/latest/least-squares-regression_en.html):** Visualizing how adding degrees changes the line of best fit in real time.
* **[Scikit-Learn: Polynomial Features](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html):** Understanding how the `interaction_only` parameter works for multiple variables.

---

**Polynomial models can easily become too complex and overfit. How do we keep the model's weights in check?**
