---
title: Model Selection & Validation
sidebar_label: Model Selection
description: How to choose the right algorithm, split data correctly, and use Cross-Validation to ensure model reliability.
tags:
---
Model Selection is the process of selecting the most appropriate Machine Learning algorithm for a specific task. However, a model that performs perfectly on your training data might fail miserably in the real world. To prevent this, we use validation techniques to ensure our model generalizes.
In Scikit-Learn, every model (classifier or regressor) is an Estimator. They all share a consistent interface:

- Initialize: `model = RandomForestClassifier()`
- Train: `model.fit(X_train, y_train)`
- Predict: `y_pred = model.predict(X_test)`
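Putting the three calls together, here is a minimal sketch on synthetic data (`make_classification` is used here only to have something to fit; your own feature matrix and labels would take its place):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Illustrative synthetic dataset
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = RandomForestClassifier(random_state=42)  # 1. Initialize
model.fit(X_train, y_train)                      # 2. Train
y_pred = model.predict(X_test)                   # 3. Predict

print(model.score(X_test, y_test))  # mean accuracy on the test set
```

Because every estimator exposes the same `fit`/`predict` interface, you can swap `RandomForestClassifier` for any other Scikit-Learn model without changing the rest of the code.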
The "Golden Rule" of Machine Learning is to never evaluate your model on the same data it used for training. We use `train_test_split` to create a "hidden" set of data.
```python
from sklearn.model_selection import train_test_split

# Usually 80% for training and 20% for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
```

:::tip Why `stratify=y`?
For classification, this ensures the ratio of classes (e.g., 90% "No" and 10% "Yes") is identical in both the training and testing sets.
:::
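You can verify the effect of stratification directly. In this sketch (the 90/10 label array is illustrative), the class ratio is preserved exactly in both splits:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced labels: 90% class 0, 10% class 1 (illustrative)
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The 90/10 ratio is preserved in both the train and test sets
print(y_tr.mean(), y_te.mean())  # 0.1 0.1
```

Without `stratify=y`, a small test set could easily end up with too few (or zero) minority-class samples, making its metrics misleading.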
A single train-test split can be lucky or unlucky depending on which rows ended up in the test set. K-Fold Cross-Validation provides a more stable estimate of model performance.
How it works:

- Split the data into K equal parts (folds).
- Train the model K times. Each time, use 1 fold for testing and the remaining K-1 folds for training.
- Average the scores from all K rounds.
```python
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()

# Perform 5-Fold Cross-Validation
scores = cross_val_score(model, X, y, cv=5)

print(f"Mean Accuracy: {scores.mean():.2f}")
print(f"Standard Deviation: {scores.std():.2f}")
```

Model selection often involves running several candidates through the same validation pipeline to see which performs best.
| Algorithm | Strengths | Weaknesses |
|---|---|---|
| Logistic Regression | Fast, interpretable | Assumes linear relationships |
| Decision Trees | Easy to visualize | Prone to overfitting |
| Random Forest | Robust, handles non-linear data | Slower, "Black box" |
| SVM | Good for high dimensions | Memory intensive |
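The candidates in the table above can be compared by running each one through the same cross-validation loop. This sketch uses a synthetic dataset for illustration; the model names and hyperparameters (e.g., `max_iter=1000`) are assumptions, not tuned values:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=42)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "SVM": SVC(),
}

results = {}
for name, model in candidates.items():
    # Same 5-fold validation pipeline for every candidate
    scores = cross_val_score(model, X, y, cv=5)
    results[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Because every candidate is scored on identical folds, the mean accuracies are directly comparable.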
A Learning Curve plots the training and validation error against the number of training samples. It helps you identify:
- High Bias (Underfitting): Both training and validation errors are high.
- High Variance (Overfitting): Low training error but high validation error.
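Scikit-Learn computes these curves with `learning_curve`, which re-scores the model at increasing training-set sizes. A minimal sketch (the dataset and estimator here are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=400, n_features=10, random_state=42)

# Score the model at 5 increasing training-set sizes, with 5-fold CV each
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    cv=5, train_sizes=np.linspace(0.1, 1.0, 5),
)

for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"n={n:4d}  train={tr:.3f}  validation={va:.3f}")
```

If both columns stay low as `n` grows, suspect high bias; if the training score stays high while validation lags well behind, suspect high variance.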
```mermaid
graph TD
    Start[Load Data] --> Pre[Preprocess Data]
    Pre --> Split[Train-Test Split]
    Split --> Candidates[Try Multiple Algorithms]
    Candidates --> CV[K-Fold Cross-Validation]
    CV --> Best{Compare Scores}
    Best --> Tune[Fine-tune Hyperparameters]
    Best --> Fail[Revise Features/Data]
    style CV fill:#f3e5f5,stroke:#7b1fa2,color:#333
    style Best fill:#e1f5fe,stroke:#01579b,color:#333
```
- Scikit-Learn Model Evaluation: Learn about scoring metrics like F1-Score and ROC-AUC.
- Cross-Validation Guide: Advanced techniques like `StratifiedKFold` and `TimeSeriesSplit`.
Selecting the right model is only half the battle. Once you've chosen an algorithm, you need to "turn the knobs" to find its peak performance.