---
title: Data Preparation in Scikit-Learn
sidebar_label: Data Preparation
description: Transforming raw data into model-ready features using Scikit-Learn's preprocessing and imputation tools.
tags:
---
Before feeding data into an algorithm, it must be cleaned and transformed. Scikit-Learn provides a robust suite of Transformers: classes that follow a standard `.fit()` and `.transform()` API to automate this work.

Machine learning models cannot handle `NaN` (Not a Number) or null values. The `SimpleImputer` class helps fill these gaps.
```python
from sklearn.impute import SimpleImputer
import numpy as np

# Sample data with missing values
X = [[1, 2], [np.nan, 3], [7, 6]]

# strategy='mean', 'median', 'most_frequent', or 'constant'
imputer = SimpleImputer(strategy='mean')
X_filled = imputer.fit_transform(X)
```

Computers understand numbers, not words. If you have a column for "City" (New York, Paris, Tokyo), you must encode it.
`OneHotEncoder` creates a new binary column for each category. Best for data without a natural order.
```python
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(sparse_output=False)
cities = [['New York'], ['Paris'], ['Tokyo']]
encoded_cities = encoder.fit_transform(cities)
```

`OrdinalEncoder` converts categories into integers (0, 1, 2, ...). Use this when the order matters (e.g., Small, Medium, Large).
As discussed in our Data Engineering module, scaling ensures that features with large ranges (like Salary) don't overpower features with small ranges (like Age).
`StandardScaler` rescales data to have a mean of 0 and a standard deviation of 1.
`MinMaxScaler` rescales data to a fixed range, usually [0, 1].
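A minimal min-max scaling sketch (the sample array here is illustrative, not from the original):

```python
from sklearn.preprocessing import MinMaxScaler

X = [[1, 10], [2, 20], [3, 30]]
scaler = MinMaxScaler()  # default feature_range=(0, 1)
X_minmax = scaler.fit_transform(X)
# Each column is rescaled so its minimum becomes 0 and its maximum becomes 1
```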
```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_filled)
```

One of the most important concepts in Scikit-Learn is the distinction between these methods:
- `.fit()`: The transformer calculates the parameters (e.g., the mean and standard deviation of your data). Only do this on training data.
- `.transform()`: The transformer applies those calculated parameters to the data.
- `.fit_transform()`: Does both in one step.
```mermaid
graph TD
    Train[Training Data] --> Fit[Fit: Learn Mean/Std]
    Fit --> TransTrain[Transform Training Data]
    Fit --> TransTest[Transform Test Data]
    style Fit fill:#f3e5f5,stroke:#7b1fa2,color:#333
```
:::warning
Never fit on your Test data. This leads to Data Leakage, where your model "cheats" by seeing the distribution of the test set during training.
:::
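The safe pattern looks like this in code (the train/test arrays are illustrative, not from the original):

```python
from sklearn.preprocessing import StandardScaler

X_train = [[1.0], [2.0], [3.0]]
X_test = [[4.0], [5.0]]

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training data only
X_test_scaled = scaler.transform(X_test)        # reuse those parameters; never fit here
```

Calling `fit` (or `fit_transform`) on `X_test` would recompute the mean and standard deviation from the test set, which is exactly the leakage the warning describes.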
In real datasets, you have a mix of types: some columns need scaling, others need encoding, and some need nothing. `ColumnTransformer` allows you to apply different prep steps to different columns simultaneously.
```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['age', 'income']),
        ('cat', OneHotEncoder(), ['city', 'gender'])
    ])
# X_processed = preprocessor.fit_transform(df)
```

- Scikit-Learn Preprocessing Guide: Discovering advanced transformers like `PowerTransformer` or `PolynomialFeatures`.
- Imputing Missing Values: Learning about `IterativeImputer` (MICE) and `KNNImputer`.
Manual data preparation can get messy and hard to replicate. To solve this, Scikit-Learn uses a powerful tool to chain all these steps together into a single object.
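That chaining tool is Scikit-Learn's `Pipeline`. A minimal sketch combining two of the steps above (the sample data is illustrative, not from the original):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Chain imputation and scaling into a single object
pipe = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='mean')),
    ('scale', StandardScaler()),
])

X = [[1, 2], [np.nan, 3], [7, 6]]
X_ready = pipe.fit_transform(X)  # impute, then scale, in one call
```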