
Commit 2ad5a8f

Merge pull request #170 from codeharborhub/dev-1
ml docs add
2 parents b995045 + 947e892 commit 2ad5a8f

File tree: 3 files changed, +343 −0 lines changed
Lines changed: 120 additions & 0 deletions
@@ -0,0 +1,120 @@
---
title: K-Fold Cross-Validation
sidebar_label: K-Fold Cross-Validation
description: "Mastering robust model evaluation by rotating training and testing sets to maximize data utility."
tags: [machine-learning, model-evaluation, cross-validation, k-fold, generalization]
---

While a [Train-Test Split](./train-test-split) is a great starting point, it has a major weakness: your results can vary significantly depending on which specific rows end up in the test set.

**K-Fold Cross-Validation** solves this by repeating the split process multiple times and averaging the results, ensuring every single data point gets to be part of the "test set" exactly once.

## 1. How the Algorithm Works

The process follows a simple rotation logic (a minimal sketch in code follows the list):

1. **Split** the data into **K** equal-sized "folds" (usually $K=5$ or $K=10$).
2. **Iterate:** For each fold $i$:
   * Treat Fold $i$ as the **Test Set**.
   * Treat the remaining $K-1$ folds as the **Training Set**.
   * Train the model and record the score.
3. **Aggregate:** Calculate the mean and standard deviation of all $K$ scores.

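To make the rotation concrete, here is the fold logic in plain NumPy. The helper `k_fold_indices` is hypothetical, written only for illustration; in practice you would use scikit-learn's `KFold` (shown in Section 4):

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=42):
    """Yield (train_idx, test_idx) pairs for k rotating folds (illustrative only)."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)   # shuffle once up front
    folds = np.array_split(indices, k)     # k (nearly) equal-sized folds
    for i in range(k):
        test_idx = folds[i]                                    # fold i is the test set
        train_idx = np.concatenate(folds[:i] + folds[i + 1:])  # remaining k-1 folds
        yield train_idx, test_idx

# Every index lands in exactly one test fold
for fold, (train_idx, test_idx) in enumerate(k_fold_indices(10, k=5)):
    print(f"Fold {fold}: train={train_idx}, test={test_idx}")
```
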
## 2. Visualizing the Process

```mermaid
graph TB
    TITLE["$$\text{K-Fold Cross-Validation}$$"]

    %% Dataset
    TITLE --> DATA["$$\text{Full Dataset}$$"]

    %% Folds
    DATA --> F1["$$\text{Fold 1}$$"]
    DATA --> F2["$$\text{Fold 2}$$"]
    DATA --> F3["$$\text{Fold 3}$$"]
    DATA --> Fk["$$\text{Fold } k$$"]

    %% Iterations
    F1 --> I1["$$\text{Iteration 1}$$<br/>$$\text{Validation: Fold 1}$$<br/>$$\text{Training: Others}$$"]
    F2 --> I2["$$\text{Iteration 2}$$<br/>$$\text{Validation: Fold 2}$$<br/>$$\text{Training: Others}$$"]
    F3 --> I3["$$\text{Iteration 3}$$<br/>$$\text{Validation: Fold 3}$$<br/>$$\text{Training: Others}$$"]
    Fk --> Ik["$$\text{Iteration } k$$<br/>$$\text{Validation: Fold } k$$<br/>$$\text{Training: Others}$$"]

    %% Model Training & Evaluation
    I1 --> M1["$$\text{Train Model}$$"]
    I2 --> M2["$$\text{Train Model}$$"]
    I3 --> M3["$$\text{Train Model}$$"]
    Ik --> Mk["$$\text{Train Model}$$"]

    M1 --> S1["$$\text{Score}_1$$"]
    M2 --> S2["$$\text{Score}_2$$"]
    M3 --> S3["$$\text{Score}_3$$"]
    Mk --> Sk["$$\text{Score}_k$$"]

    %% Final Result
    S1 --> AVG["$$\text{Average Score}$$"]
    S2 --> AVG
    S3 --> AVG
    Sk --> AVG

    AVG --> PERF["$$\text{Cross-Validated Performance}$$"]
```

## 3. Why Use K-Fold?

### A. Reliability (Reducing Variance)

By averaging $K$ different test scores, you get a much more stable estimate of how the model will perform on new data. It removes the "luck of the draw" of a single split.

### B. Maximum Data Utility

In a standard split, 20% of your data is never used for training. In K-Fold, every data point is used for training $K-1$ times and for testing exactly once. This is especially vital for small datasets.

### C. Hyperparameter Tuning

K-Fold is the foundation for **Grid Search**. It helps you find the best settings for your model (like the depth of a tree) without overfitting to one specific validation set.

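As an illustration of that last point, here is a minimal sketch of K-Fold powering a grid search via scikit-learn's `GridSearchCV`. The parameter grid and the synthetic dataset are assumptions chosen for the example:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Every candidate depth is scored with 5-fold CV, not a single validation set
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"max_depth": [3, 5, 10]},  # illustrative values
    cv=5,
    scoring="accuracy",
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```
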
## 4. Implementation with Scikit-Learn

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# 1. Initialize model and data (synthetic data shown so the example runs as-is)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
model = RandomForestClassifier(random_state=42)

# 2. Define the K-Fold strategy
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# 3. Perform Cross-Validation
# This returns an array of 5 scores, one per fold
scores = cross_val_score(model, X, y, cv=kf, scoring='accuracy')

print(f"Scores for each fold: {scores}")
print(f"Mean Accuracy: {scores.mean():.4f}")
print(f"Standard Deviation: {scores.std():.4f}")
```

## 5. Variations of Cross-Validation

Several variants adapt the basic rotation to special situations (scikit-learn selectors for each are sketched after the list):

* **Stratified K-Fold:** Used for imbalanced data. It ensures each fold has the same percentage of samples from each class as the whole dataset.
* **Leave-One-Out (LOOCV):** An extreme case where $K$ equals the total number of samples ($N$). Extremely computationally expensive, but it uses the most training data possible.
* **Time-Series Split:** Unlike random K-Fold, this respects the chronological order of the data (training on the past, testing on the future).

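Each variant is a drop-in replacement for the `cv=` argument in scikit-learn. A minimal sketch (the split counts are illustrative):

```python
from sklearn.model_selection import StratifiedKFold, LeaveOneOut, TimeSeriesSplit

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)  # preserves class ratios per fold
loo = LeaveOneOut()                                               # K = N: one sample per test set
tss = TimeSeriesSplit(n_splits=5)                                 # no shuffling; trains on past, tests on future

# Any of these can be passed to cross_val_score(model, X, y, cv=...)
```
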
## 6. Pros and Cons

| Advantages | Disadvantages |
| --- | --- |
| **Robustness:** Provides a more accurate measure of model generalization. | **Computationally Expensive:** Training the model $K$ times takes roughly $K$ times longer. |
| **Confidence:** The standard deviation tells you how "stable" the model is. | **Not for Big Data:** If your model takes 10 hours to train, doing it 10 times is often impractical. |

## References

* **Scikit-Learn:** [Cross-Validation Guide](https://scikit-learn.org/stable/modules/cross_validation.html)
* **StatQuest:** [K-Fold Cross-Validation Explained](https://www.youtube.com/watch?v=fSytzGwwBVw)

---

**Now that you have a robust way to validate your model, how do you handle data where the classes are heavily skewed (e.g., 99% vs. 1%)?**
Lines changed: 129 additions & 0 deletions
@@ -0,0 +1,129 @@
---
title: "Leave-One-Out Cross-Validation (LOOCV)"
sidebar_label: LOOCV
description: "The most exhaustive validation technique: training on N-1 samples and testing on a single observation."
tags: [machine-learning, model-evaluation, loocv, cross-validation, small-data]
---

**Leave-One-Out Cross-Validation (LOOCV)** is an extreme case of [K-Fold Cross-Validation](./k-fold-cross-validation). Instead of splitting the data into 5 or 10 groups, LOOCV sets $K$ equal to $N$, the total number of data points in your set.

In each iteration, the model is trained on every data point except **one**, which is used as the test set.

## 1. How the Algorithm Works

If you have a dataset with $n$ samples (a direct code translation follows these steps):

1. **Select** the first sample to be the test set.
2. **Train** the model on the remaining $n-1$ samples.
3. **Evaluate** the model on the single test sample and record the error.
4. **Repeat** this process $n$ times, so that each sample serves as the test set exactly once.
5. **Average** the $n$ resulting errors to get the final performance metric.

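A minimal sketch translating those steps into code, using the same toy data as the scikit-learn example in Section 4:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4]])
y = np.array([2, 3.9, 6.1, 8.2])

errors = []
for i in range(len(X)):                    # each sample is the test set exactly once
    mask = np.arange(len(X)) != i          # leave observation i out
    model = LinearRegression().fit(X[mask], y[mask])  # train on the other n-1 samples
    pred = model.predict(X[~mask])[0]
    errors.append((y[i] - pred) ** 2)      # squared error on the held-out sample

print(f"LOOCV estimate: {np.mean(errors):.4f}")
```
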
## 2. Mathematical Representation

The LOOCV estimate of the test error is the average of these $n$ test errors:

$$
CV_{(n)} = \frac{1}{n} \sum_{i=1}^{n} Err_i
$$

Where $Err_i$ is the error (e.g., squared error or misclassification) calculated on the $i^{th}$ observation when the model was fit using all data except that observation.

```mermaid
graph TB
    TITLE["$$\text{Leave-One-Out Cross-Validation (LOOCV)}$$"]

    %% Dataset
    TITLE --> DATA["$$\text{Dataset with } n \text{ Observations}$$"]

    %% Leaving One Out
    DATA --> L1["$$\text{Hold Out Observation } 1$$"]
    DATA --> L2["$$\text{Hold Out Observation } 2$$"]
    DATA --> Li["$$\text{Hold Out Observation } i$$"]
    DATA --> Ln["$$\text{Hold Out Observation } n$$"]

    %% Training
    L1 --> T1["$$\text{Train on } n-1 \text{ samples}$$"]
    L2 --> T2["$$\text{Train on } n-1 \text{ samples}$$"]
    Li --> Ti["$$\text{Train on } n-1 \text{ samples}$$"]
    Ln --> Tn["$$\text{Train on } n-1 \text{ samples}$$"]

    %% Error Computation
    T1 --> E1["$$Err_1$$"]
    T2 --> E2["$$Err_2$$"]
    Ti --> Ei["$$Err_i$$"]
    Tn --> En["$$Err_n$$"]

    %% Averaging Errors
    E1 --> AVG["$$CV_{(n)} = \frac{1}{n}\sum_{i=1}^{n} Err_i$$"]
    E2 --> AVG
    Ei --> AVG
    En --> AVG

    AVG --> EST["$$\text{Estimated Test Error}$$"]
```

## 3. When to Use LOOCV?

### Small Datasets

When you only have 20 or 50 samples, a standard 80/20 split would leave you with very little data for training. LOOCV allows you to use $n-1$ samples for training, maximizing the model's ability to learn the underlying patterns.

### Bias vs. Variance

* **Low Bias:** Since we use almost all the data for training in each step, the model behaves very similarly to how it would if trained on the full dataset.
* **High Variance:** Because the training sets in each iteration are almost identical (overlapping in $n-2$ samples), the fitted models are highly correlated. This can lead to a higher variance in the final error estimate compared to K-Fold.

## 4. Implementation with Scikit-Learn

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# 1. Initialize data and model
X = np.array([[1], [2], [3], [4]])
y = np.array([2, 3.9, 6.1, 8.2])
model = LinearRegression()

# 2. Initialize LOOCV
loo = LeaveOneOut()

# 3. Perform Cross-Validation
# This runs 4 times because we have 4 samples
scores = cross_val_score(model, X, y, cv=loo, scoring='neg_mean_squared_error')

print(f"MSE for each iteration: {np.abs(scores)}")
print(f"Average MSE: {np.abs(scores).mean():.4f}")
```

## 5. LOOCV vs. K-Fold Cross-Validation

| Feature | LOOCV | K-Fold ($K=10$) |
| --- | --- | --- |
| **Computations** | $N$ (total samples) | 10 |
| **Computational Cost** | Very High | Moderate |
| **Bias** | Extremely Low | Higher than LOOCV |
| **Variance** | High | Low |
| **Best For** | Small datasets ($N < 100$) | Large/standard datasets |

## 6. The "Shortcut" for Linear Regression

For certain models like **Linear Regression**, you don't actually have to train the model $n$ times. There is a mathematical identity that allows you to calculate the LOOCV error with a single model fit:

$$
CV_{(n)} = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat{y}_i}{1 - h_i} \right)^2
$$

Where $h_i$ is the leverage of observation $i$ (the $i^{th}$ diagonal entry of the hat matrix). This makes LOOCV as fast as a single training session for linear models!

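A minimal sketch verifying the identity numerically on the toy data from Section 4. The hat matrix is computed by hand from the design matrix (with an intercept column added manually), since this is not a library call:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X = np.array([[1], [2], [3], [4]])
y = np.array([2, 3.9, 6.1, 8.2])

# Shortcut: one fit, then rescale each residual by its leverage h_i
model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)
X_design = np.column_stack([np.ones(len(X)), X])  # add the intercept column
H = X_design @ np.linalg.inv(X_design.T @ X_design) @ X_design.T
h = np.diag(H)                                    # leverage values h_i
cv_shortcut = np.mean((residuals / (1 - h)) ** 2)

# Explicit LOOCV for comparison
scores = cross_val_score(LinearRegression(), X, y,
                         cv=LeaveOneOut(), scoring="neg_mean_squared_error")

print(cv_shortcut, np.abs(scores).mean())  # the two values should match
```
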
## References

* **An Introduction to Statistical Learning (ISLR):** Chapter 5.1.2 covers LOOCV in depth.
* **Scikit-Learn:** [LeaveOneOut Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.LeaveOneOut.html)

---

**LOOCV is great for small data, but what if your classes are imbalanced (e.g., 99% vs. 1%)? Standard LOOCV might struggle to capture the minority class.**
Lines changed: 94 additions & 0 deletions
@@ -0,0 +1,94 @@
---
title: Train-Test Split
sidebar_label: Train-Test Split
description: "Mastering the data partitioning process to ensure unbiased model evaluation."
tags: [machine-learning, model-evaluation, training, testing, generalization]
---

The **Train-Test Split** is a technique used to evaluate the performance of a machine learning algorithm. It involves taking your primary dataset and partitioning it into two separate subsets: one to build the model and another to validate its predictions.

## 1. Why do we split data?

In Machine Learning, we don't care how well a model remembers the past; we care how well it predicts the **future**.

If we train our model on the *entire* dataset, we have no way of knowing whether the model actually learned the underlying patterns or simply memorized the noise in that specific data. Testing on the same data used for training is a "cardinal sin" that produces deceptively optimistic scores, a problem closely related to **Data Leakage**.

## 2. The Partitioning Logic

Typically, the data is split into two (or sometimes three) parts:

1. **Training Set (70-80%):** This is the data used by the algorithm to learn the relationships between features and targets.
2. **Test Set (20-30%):** This data is kept in a "vault." The model never sees it during training. It is used only at the very end to provide an unbiased evaluation.

```mermaid
graph TB
    TITLE["$$\text{Data Partitioning Logic}$$"]

    %% Full Dataset
    TITLE --> DATA["$$\text{Full Dataset (100\%)}$$"]

    %% Split
    DATA --> TRAIN["$$\text{Training Set}$$<br/>$$70\% \text{ to } 80\%$$"]
    DATA --> TEST["$$\text{Test Set}$$<br/>$$20\% \text{ to } 30\%$$"]

    %% Training Path
    TRAIN --> LEARN["$$\text{Model Learning}$$"]
    LEARN --> FIT["$$\text{Learns Patterns and Relationships}$$"]

    %% Test Path
    TEST --> VAULT["$$\text{Evaluation Vault}$$"]
    VAULT --> LOCK["$$\text{Never Seen During Training}$$"]
    LOCK --> EVAL["$$\text{Final Unbiased Evaluation}$$"]

    %% Emphasis
    FIT -.->|"$$\text{Training Only}$$"| TRAIN
    EVAL -.->|"$$\text{Used Once at the End}$$"| TEST
```

## 3. Important Considerations

### Randomness and Reproducibility

When splitting data, we use a random process. However, for scientific consistency, we set a **Random State** (seed). This ensures that every time you run your code, you get the exact same split, making your experiments reproducible.

### Stratification

If you are working with imbalanced classes (e.g., 90% "Healthy", 10% "Sick"), a simple random split might accidentally put nearly all the "Sick" cases in the training set and almost none in the test set. **Stratified Splitting** ensures that the proportion of classes is preserved in both the training and testing subsets.

## 4. Implementation with Scikit-Learn

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Assume X contains features and y contains the target
# (synthetic data shown here so the example runs as-is)
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,     # 20% for testing
    random_state=42,   # For reproducibility
    stratify=y         # Keep class proportions equal
)

print(f"Training samples: {len(X_train)}")
print(f"Testing samples: {len(X_test)}")
```

## 5. Pros and Cons

| Advantages | Disadvantages |
| --- | --- |
| **Simplicity:** Very easy to understand and implement. | **High Variance:** If the dataset is small, a different random split can lead to very different results. |
| **Speed:** Fast to compute, as the model is only trained once. | **Waste of Data:** A portion of your valuable data is never used to train the model. |
| **Standard Practice:** The universal starting point for any ML project. | **Not for Time-Series:** Random splitting ruins data where order matters (e.g., stock prices). |

## References

* **Scikit-Learn:** [train_test_split Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)
* **Google ML Crash Course:** [Splitting Data](https://developers.google.com/machine-learning/crash-course/training-and-test-sets/splitting-data)

---

**A single split is a good start, but what if your "random" test set happens to be particularly easy or hard? To solve this, we use a more robust technique.**
