A concise, structured quick-reference guide covering all major ML algorithms across Supervised and Unsupervised Learning.
- Supervised Learning
- Unsupervised Learning
- Comparison Tables
- Practical Model Selection Guide
- ML Pipeline Overview
Learns a mapping from inputs X → outputs y using labeled training data.
Predict a continuous numeric output.
- Description: Fits a straight line (or hyperplane) through data to minimize residual error.
- Mathematical Intuition: Minimizes the sum of squared differences between predicted and actual values, with $\hat{y} = w^\top x + b$ optimized via Ordinary Least Squares (OLS) or Gradient Descent.
- When to Use: Linear relationships between features and target; baseline model.
- Advantages: Fast, interpretable, no hyperparameters (OLS), well-understood statistically.
- Disadvantages: Assumes linearity; sensitive to outliers; poor with multicollinearity.
- Example Use Case: Predicting house prices from square footage.
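As a minimal sketch of the OLS fit described above, the single-feature case has a closed form (slope = covariance / variance). The function name `fit_simple_ols` and the square-footage data are made up for illustration.

```python
def fit_simple_ols(xs, ys):
    """Closed-form OLS for one feature: returns (w, b) minimizing squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); intercept makes the line pass through the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w = sxy / sxx
    b = mean_y - w * mean_x
    return w, b

# Hypothetical data: square footage (in 1000s) vs. price (in $1000s), exactly linear
sqft = [1.0, 1.5, 2.0, 2.5, 3.0]
price = [200.0, 270.0, 340.0, 410.0, 480.0]
w, b = fit_simple_ols(sqft, price)
predicted = w * 2.2 + b   # prediction for a 2,200 sq ft house
```

With several features, the same idea generalizes to solving the normal equations or calling a linear-algebra least-squares routine.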
- Description: Extends linear regression by adding polynomial feature terms ($x^2$, $x^3$, …).
- Mathematical Intuition: $\hat{y} = w_0 + w_1 x + w_2 x^2 + \dots + w_n x^n$, which is still linear in the coefficients but nonlinear in the input space.
- When to Use: Non-linear relationships that are smooth and continuous.
- Advantages: Captures curvature; still interpretable.
- Disadvantages: Prone to overfitting at high degrees; sensitive to outliers.
- Example Use Case: Modeling population growth curves.
- Description: Linear regression with L2 regularization to penalize large weights.
- Mathematical Intuition: Minimizes $\|y - Xw\|_2^2 + \alpha\|w\|_2^2$. Shrinks coefficients toward zero but never exactly to zero.
- When to Use: When multicollinearity is present; many features with small effects.
- Advantages: Reduces overfitting; stable with correlated features.
- Disadvantages: Does not perform feature selection; all features retained.
- Example Use Case: Predicting stock returns with many correlated financial indicators.
- Description: Linear regression with L1 regularization that induces sparsity.
- Mathematical Intuition: Minimizes $\|y - Xw\|_2^2 + \alpha\|w\|_1$. Drives irrelevant feature weights exactly to zero.
- When to Use: High-dimensional data; automatic feature selection is desired.
- Advantages: Sparse model; built-in feature selection; interpretable.
- Disadvantages: Struggles when features are highly correlated; tends to keep one feature from a correlated group arbitrarily and drop the rest.
- Example Use Case: Genomics, e.g. selecting relevant genes from thousands.
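To make the difference between the L2 and L1 penalties concrete, here is a rough sketch for a single feature with no intercept, where both objectives can be minimized by hand. The names `ridge_1d` and `lasso_1d` and the toy data are illustrative assumptions, not library functions.

```python
def ridge_1d(xs, ys, alpha):
    """Minimizer of sum((y - w*x)^2) + alpha * w^2 for one feature, no intercept."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + alpha)          # shrinks w, but never to exactly zero

def lasso_1d(xs, ys, alpha):
    """Minimizer of sum((y - w*x)^2) + alpha * |w| (soft-thresholding)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    shrunk = max(abs(sxy) - alpha / 2.0, 0.0)   # zero once the penalty dominates
    sign = 1.0 if sxy >= 0 else -1.0
    return sign * shrunk / sxx

# Toy data with a weak signal: the unpenalized slope is 0.1
xs = [1.0, 2.0, 3.0]
ys = [0.1, 0.2, 0.3]
w_ridge = ridge_1d(xs, ys, alpha=4.0)   # small but non-zero
w_lasso = lasso_1d(xs, ys, alpha=4.0)   # exactly zero
```

The same penalty strength leaves the ridge weight small but non-zero while the lasso weight is clipped to zero, which is the sparsity behavior described above.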
- Description: Combines L1 (Lasso) and L2 (Ridge) penalties.
- Mathematical Intuition: $\text{Loss} + \alpha\rho\|w\|_1 + \tfrac{\alpha(1-\rho)}{2}\|w\|_2^2$, where $\rho$ balances L1/L2.
- When to Use: Many correlated features; want both sparsity and stability.
- Advantages: Best of Ridge + Lasso; handles correlated groups of features.
- Disadvantages: Two hyperparameters ($\alpha$, $\rho$) to tune.
- Example Use Case: Text regression with highly correlated word features.
- Description: Recursively partitions feature space into regions, predicting the mean of each region.
- Mathematical Intuition: Splits chosen to minimize variance (MSE) within child nodes at each step.
- When to Use: Non-linear data; need interpretability; quick prototyping.
- Advantages: No scaling needed; handles mixed feature types; interpretable.
- Disadvantages: High variance; prone to overfitting; unstable.
- Example Use Case: Predicting insurance claims based on age, region, and history.
- Description: Ensemble of many decision trees trained on random subsets of data and features (bagging).
- Mathematical Intuition: Final prediction = average of all tree predictions. Variance reduced by averaging uncorrelated trees.
- When to Use: Non-linear relationships; tabular data; need robustness.
- Advantages: Robust to overfitting; handles missing data well; feature importance scores.
- Disadvantages: Less interpretable than single tree; slower training; large memory footprint.
- Example Use Case: Predicting energy consumption across buildings.
- Description: Sequentially trains trees where each corrects the residual errors of its predecessor.
- Mathematical Intuition: $F_m(x) = F_{m-1}(x) + \eta \cdot h_m(x)$, where $h_m$ fits the negative gradient of the loss.
- When to Use: High-accuracy tabular regression; can handle mixed feature types.
- Advantages: State-of-the-art accuracy on tabular data; flexible loss functions.
- Disadvantages: Slow training; many hyperparameters; can overfit small datasets.
- Example Use Case: Predicting sales revenue with complex interactions.
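The additive update above can be sketched with one-split regression stumps on a single feature. This is a simplified illustration assuming squared loss (so the negative gradient is just the residual); `fit_stump`, `gradient_boost`, and the data are hypothetical, and real libraries add many refinements.

```python
def fit_stump(xs, residuals):
    """Best single-threshold split on a 1-D feature, minimizing squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:      # skip the max so the right side is non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def gradient_boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """F_m(x) = F_{m-1}(x) + eta * h_m(x), with h_m fit to the current residuals."""
    base = sum(ys) / len(ys)                    # F_0: constant prediction
    stumps = []
    preds = [base] * len(xs)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]   # negative gradient of 1/2 (y - F)^2
        h = fit_stump(xs, residuals)
        stumps.append(h)
        preds = [p + learning_rate * h(x) for p, x in zip(preds, xs)]
    return lambda x: base + learning_rate * sum(h(x) for h in stumps)

def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 4.0, 9.0, 16.0, 25.0, 36.0]
mean_y = sum(ys) / len(ys)
baseline_mse = mse(lambda x: mean_y, xs, ys)
boosted_mse = mse(gradient_boost(xs, ys), xs, ys)
```

Each round can only lower the training error, which is also why the learning rate and number of rounds need validation-based tuning to avoid overfitting.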
- Description: Optimized, regularized gradient boosting with second-order gradient approximation and tree pruning.
- Mathematical Intuition: Adds L1/L2 regularization to the boosting objective; uses Taylor expansion for efficient optimization.
- When to Use: Competitions; large tabular datasets; need speed and accuracy.
- Advantages: Fast; handles sparse data natively; built-in regularization; parallel computation.
- Disadvantages: Many hyperparameters; less interpretable; memory-intensive for very large datasets.
- Example Use Case: Click-through rate prediction in advertising.
- Description: Gradient boosting using histogram-based splitting and leaf-wise tree growth.
- Mathematical Intuition: Grows tree leaf-wise (choosing the leaf with max loss reduction) rather than level-wise, enabling faster convergence.
- When to Use: Very large datasets; need fast training; high-cardinality categoricals.
- Advantages: Typically among the fastest boosting libraries; low memory; handles categorical features natively.
- Disadvantages: Can overfit small datasets; leaf-wise growth can be unstable.
- Example Use Case: Real-time fraud scoring on millions of transactions.
- Description: Gradient boosting optimized for categorical features using ordered boosting to prevent target leakage.
- Mathematical Intuition: Uses ordered statistics to encode categoricals without leakage; symmetric decision trees for efficiency.
- When to Use: Datasets with many categorical features; minimal preprocessing desired.
- Advantages: Handles categoricals automatically; robust to overfitting; minimal tuning needed.
- Disadvantages: Slower training than LightGBM; higher memory usage.
- Example Use Case: Predicting customer lifetime value from CRM data.
- Description: Finds a function within an $\varepsilon$-tube of the training data while maximizing the margin.
- Mathematical Intuition: Minimizes $\tfrac{1}{2}\|w\|^2$ subject to $|y_i - f(x_i)| \le \varepsilon$. Uses kernel trick for nonlinear regression.
- When to Use: Small-to-medium datasets; nonlinear relationships; robust to outliers (with $\varepsilon$-insensitive loss).
- Advantages: Effective in high dimensions; kernel flexibility; outlier-robust.
- Disadvantages: Does not scale well ($O(n^2)$ to $O(n^3)$); sensitive to feature scaling; hard to interpret.
- Example Use Case: Predicting protein concentrations from spectroscopic data.
- Description: Predicts the output as the average of the k nearest neighbors in feature space.
- Mathematical Intuition: $\hat{y} = \frac{1}{k}\sum y_i$ for the k closest training points (by Euclidean or other distance).
- When to Use: Small datasets; no strong parametric assumptions; local structure matters.
- Advantages: Simple; no training phase; naturally handles multi-output regression.
- Disadvantages: Slow at inference (O(n)); sensitive to irrelevant features and scale; suffers in high dimensions.
- Example Use Case: Recommending product prices based on similar past transactions.
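A brute-force version of the neighbor average above fits in a few lines. The helper name `knn_regress` and the transaction data are invented for illustration; production code would normally use an indexed nearest-neighbor search and scaled features.

```python
import math

def knn_regress(train_X, train_y, query, k=3):
    """Average the targets of the k training points closest to `query` (Euclidean)."""
    distances = [(math.dist(x, query), y) for x, y in zip(train_X, train_y)]
    distances.sort(key=lambda pair: pair[0])
    nearest = distances[:k]
    return sum(y for _, y in nearest) / len(nearest)

# Hypothetical past transactions: (size, age) -> price
train_X = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (5.0, 5.0), (5.1, 4.9)]
train_y = [10.0, 12.0, 11.0, 50.0, 52.0]
estimate = knn_regress(train_X, train_y, query=(1.0, 1.0), k=3)
```

Every prediction scans the whole training set, which is the O(n) inference cost noted above.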
- Description: Places a prior distribution over model parameters and updates it using Bayes' theorem to obtain a posterior.
- Mathematical Intuition: $P(w \mid X, y) \propto P(y \mid X, w) \cdot P(w)$. Predictions are distributions, not point estimates.
- When to Use: Small data; need uncertainty quantification; incorporating prior knowledge.
- Advantages: Full uncertainty estimates; principled regularization via priors; robust to overfitting.
- Disadvantages: Computationally expensive; requires prior specification; harder to implement.
- Example Use Case: Clinical trial outcome prediction with small patient cohorts.
Predict a discrete class label from input features.
- Description: Linear model for binary (or multiclass) classification using a sigmoid activation.
- Core Idea: $P(y=1 \mid x) = \sigma(w^\top x + b)$; the decision boundary is a hyperplane, and the model is trained by maximizing log-likelihood.
- When to Use: Binary/multiclass classification; need probability outputs; interpretable baseline.
- Advantages: Fast; interpretable coefficients; calibrated probabilities; works well on linearly separable data.
- Disadvantages: Assumes linear decision boundary; poor on complex nonlinear data.
- Example Use Case: Email spam detection.
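The sigmoid model and its log-likelihood training can be sketched for one feature with plain gradient ascent. The function `fit_logistic`, the step size, and the data are assumptions chosen for illustration; library solvers use faster optimizers and regularization.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, learning_rate=0.1, n_steps=2000):
    """Gradient ascent on the mean log-likelihood of P(y=1|x) = sigmoid(w*x + b)."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(n_steps):
        # Log-likelihood gradient: sum((y - p) * x) for w, sum(y - p) for b
        errors = [y - sigmoid(w * x + b) for x, y in zip(xs, ys)]
        w += learning_rate * sum(e * x for e, x in zip(errors, xs)) / n
        b += learning_rate * sum(errors) / n
    return w, b

# Toy data with overlapping classes, so the weights stay finite
xs = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
w, b = fit_logistic(xs, ys)
p_high = sigmoid(w * 2.0 + b)    # probability of class 1 at x = 2
p_low = sigmoid(w * -2.0 + b)    # probability of class 1 at x = -2
```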
- Description: Classifies a point based on majority vote of its k nearest neighbors.
- Core Idea: Non-parametric; stores all training data; no explicit training, just distance computation at inference.
- When to Use: Small datasets; non-linear boundaries; interpretable local decisions.
- Advantages: Simple; no assumptions about data distribution; naturally multi-class.
- Disadvantages: Slow inference; sensitive to irrelevant features and scale; struggles in high dimensions.
- Example Use Case: Handwritten digit recognition on small datasets.
- Description: Finds the hyperplane that maximally separates classes; uses kernel trick for non-linear boundaries.
- Core Idea: Maximize margin between support vectors. With kernel K(x,x'), maps to higher-dimensional space implicitly.
- When to Use: High-dimensional data; text classification; small-to-medium datasets; clear margin of separation.
- Advantages: Effective in high dimensions; memory-efficient (only support vectors stored); versatile kernels.
- Disadvantages: Slow on large datasets; sensitive to feature scale; no direct probability output (requires Platt scaling).
- Example Use Case: Image classification, bioinformatics (gene expression).
- Description: Recursively splits data using feature thresholds to create a tree of decision rules.
- Core Idea: At each node, chooses the split that maximizes information gain (entropy) or minimizes Gini impurity.
- When to Use: Need interpretability; mixed feature types; non-linear boundaries; quick baseline.
- Advantages: Interpretable (white-box); no feature scaling needed; handles categorical data.
- Disadvantages: Prone to overfitting; unstable (small data changes → different trees).
- Example Use Case: Loan approval decision systems.
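The split criterion above can be illustrated by scoring every candidate threshold on one numeric feature with weighted Gini impurity. This is a sketch of a single node only; `best_threshold` and the loan data are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_threshold(values, labels):
    """Threshold on one feature minimizing the weighted Gini of the two children."""
    best_score, best_t = float("inf"), None
    for t in sorted(set(values))[:-1]:          # skip the max so the right child is non-empty
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_score, best_t = score, t
    return best_t, best_score

# Hypothetical loan data: applicant income (in $1000s) and approval label
income = [20, 25, 30, 45, 50, 60]
approved = ["no", "no", "no", "yes", "yes", "yes"]
threshold, impurity = best_threshold(income, approved)
```

A full tree repeats this search over all features and recurses into each child until a stopping rule (depth, minimum samples, or zero impurity) is met.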
- Description: Ensemble of decorrelated decision trees via bagging + random feature subsets.
- Core Idea: Each tree votes; majority class wins. Diversity via random subsampling reduces variance.
- When to Use: General-purpose tabular classification; need feature importance; robust performance.
- Advantages: Resistant to overfitting; feature importance; handles missing values; parallelizable.
- Disadvantages: Black-box; slow for real-time inference; large memory footprint.
- Example Use Case: Medical diagnosis from patient records.
- Description: Sequentially builds trees, each correcting the errors of the previous ensemble.
- Core Idea: Minimizes a differentiable loss by fitting new trees to the negative gradient (pseudo-residuals).
- When to Use: Tabular data competitions; high accuracy requirements; complex feature interactions.
- Advantages: State-of-the-art on tabular data; flexible loss functions; handles mixed types.
- Disadvantages: Slow training; risk of overfitting; many hyperparameters.
- Example Use Case: Customer churn prediction.
- Description: Regularized, optimized gradient boosting with parallel tree construction and pruning.
- Core Idea: Uses second-order Taylor expansion of loss + L1/L2 regularization. Efficient sparse-aware split-finding.
- When to Use: Large datasets; high-accuracy classification; structured/tabular data.
- Advantages: Speed, accuracy, regularization, handles missing values natively.
- Disadvantages: Many hyperparameters; memory-heavy; less suited for unstructured data.
- Example Use Case: Kaggle competitions; credit scoring.
- Description: Leaf-wise gradient boosting with histogram-based feature binning.
- Core Idea: Grows the leaf with the greatest loss reduction; GOSS sampling and EFB bundling for speed.
- When to Use: Very large datasets; speed-critical applications; high-cardinality categoricals.
- Advantages: Typically among the fastest-training boosting algorithms; low memory; native categorical support.
- Disadvantages: Can overfit on small data; less stable than level-wise trees.
- Example Use Case: Real-time bidding and ad ranking.
- Description: Gradient boosting with ordered encoding for categorical features to avoid target leakage.
- Core Idea: Uses oblivious symmetric trees and ordered boosting statistics for categorical variables.
- When to Use: Datasets heavy in categorical variables; want minimal preprocessing.
- Advantages: Best-in-class for categorical data; minimal tuning; robust to overfitting.
- Disadvantages: Slower than LightGBM; higher RAM; slower inference vs. XGBoost.
- Example Use Case: E-commerce recommendation and ranking.
A family of probabilistic classifiers based on Bayes' theorem with feature independence assumptions.
- Description: Assumes continuous features follow a Gaussian (normal) distribution within each class.
- Core Idea: $P(x_i \mid y) = \mathcal{N}(\mu_y, \sigma_y^2)$. Estimates mean and variance per class per feature.
- When to Use: Continuous features; real-valued sensor data.
- Advantages: Fast; works well with small data; handles real-valued features.
- Disadvantages: Strong independence assumption; poor if features are correlated.
- Example Use Case: Medical diagnosis with continuous measurements.
- Description: Models feature counts (e.g., word frequencies); suited for discrete count data.
- Core Idea: $P(x_i \mid y) = \dfrac{\mathrm{count}(x_i, y) + \alpha}{\mathrm{total\_count}(y) + \alpha\,|V|}$, which uses Laplace smoothing.
- When to Use: Text classification with bag-of-words or TF features.
- Advantages: Extremely fast; works well for text; interpretable.
- Disadvantages: Assumes features are counts (non-negative integers).
- Example Use Case: Spam filtering, topic classification.
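The Laplace-smoothed estimate above can be sketched on a tiny tokenized corpus. The function names and the four example documents are made up; log probabilities are used to avoid underflow, and out-of-vocabulary words are simply skipped at prediction time (an assumption of this sketch).

```python
from collections import Counter
import math

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log P(word | class)."""
    vocab = sorted({w for doc in docs for w in doc})
    class_docs = Counter(labels)
    word_counts = {c: Counter() for c in class_docs}
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)
    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    log_likelihood = {}
    for c in class_docs:
        total = sum(word_counts[c].values())
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / (total + alpha * len(vocab)))
            for w in vocab
        }
    return log_prior, log_likelihood

def predict(doc, log_prior, log_likelihood):
    """Class with the highest log posterior; unknown words are ignored."""
    scores = {
        c: log_prior[c] + sum(log_likelihood[c][w] for w in doc if w in log_likelihood[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)

docs = [["win", "cash", "now"], ["cheap", "cash", "offer"],
        ["meeting", "at", "noon"], ["lunch", "meeting", "today"]]
labels = ["spam", "spam", "ham", "ham"]
log_prior, log_likelihood = train_multinomial_nb(docs, labels)
label = predict(["cash", "offer", "today"], log_prior, log_likelihood)
```

Here "cash" appears twice among the 6 spam tokens and the vocabulary has 10 words, so its smoothed probability under spam is (2 + 1) / (6 + 10).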
- Description: Designed for binary/boolean features (feature present or absent).
- Core Idea: $P(x_i \mid y) = P_i^{x_i} \cdot (1 - P_i)^{1 - x_i}$, which penalizes the absence of features, unlike Multinomial.
- When to Use: Binary feature vectors (e.g., word occurrence, not frequency).
- Advantages: Penalizes absence of features; good for short texts.
- Disadvantages: Discards frequency information.
- Example Use Case: Sentiment classification with binary word presence features.
- Description: Multi-layer feedforward network with nonlinear activation functions trained via backpropagation.
- Core Idea: Stacks linear transformations + nonlinearities: $h = \sigma(Wx + b)$. Universal function approximator.
- When to Use: Complex, high-dimensional data (images, text, audio); large datasets; non-tabular data.
- Advantages: Learns complex representations; state-of-the-art for unstructured data; scalable.
- Disadvantages: Requires large data; computationally expensive; black-box; sensitive to hyperparameters.
- Example Use Case: Image recognition, NLP, speech recognition.
- Description: Boosting algorithm that combines weak learners (stumps) by re-weighting misclassified samples.
- Core Idea: Each subsequent classifier focuses more on previously misclassified points; final prediction = weighted vote.
- When to Use: Binary classification; clean data (sensitive to noise/outliers); need interpretable ensemble.
- Advantages: Simple; reduces bias; less prone to overfitting than a single tree.
- Disadvantages: Sensitive to noisy data and outliers; slower than Random Forest.
- Example Use Case: Face detection (Viola-Jones framework).
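The re-weighting step above can be sketched for one boosting round. This assumes the classic binary formulation with labels in {-1, +1} and a weighted error strictly between 0 and 1; `adaboost_round` is an illustrative name.

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """One AdaBoost step: returns (learner_weight, renormalized sample weights)."""
    # Weighted error of the weak learner (assumed to satisfy 0 < error < 1)
    error = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    learner_weight = 0.5 * math.log((1.0 - error) / error)
    # Misclassified points (t * p == -1) are scaled up, correct ones scaled down
    updated = [w * math.exp(-learner_weight * t * p)
               for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(updated)
    return learner_weight, [w / total for w in updated]

weights = [0.2] * 5
y_true = [1, 1, -1, -1, 1]
y_pred = [1, 1, -1, -1, -1]      # the weak learner gets only the last point wrong
learner_weight, new_weights = adaboost_round(weights, y_true, y_pred)
```

After the update the misclassified point carries half of the total weight, so the next weak learner is pushed to get it right.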
Finds patterns and structure in unlabeled data.
Groups similar data points together without predefined labels.
- Description: Partitions data into K clusters by iteratively assigning points to the nearest centroid and updating centroids.
- Core Idea: Minimize within-cluster sum of squared distances (inertia). Uses Expectation-Maximization style updates.
- When to Use: Large datasets; roughly spherical, equal-sized clusters; known K.
- Advantages: Fast (O(nKd)); scalable; simple; widely available.
- Disadvantages: Requires K upfront; assumes spherical clusters; sensitive to outliers and initialization.
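The assign-then-update loop described above (Lloyd's algorithm) can be sketched as follows. The function signature and the fixed initial centroids are assumptions for a deterministic illustration; practical implementations use smarter initialization such as k-means++ and multiple restarts.

```python
import math

def kmeans(points, initial_centroids, n_iters=20):
    """Alternate nearest-centroid assignment and centroid recomputation."""
    centroids = list(initial_centroids)
    assignments = []
    for _ in range(n_iters):
        # Assignment step: index of the nearest centroid for every point
        assignments = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its assigned points
        new_centroids = []
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[j])   # leave an empty cluster's centroid in place
        if new_centroids == centroids:               # converged: nothing moved
            break
        centroids = new_centroids
    return centroids, assignments

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, assignments = kmeans(points, initial_centroids=[(0.0, 0.0), (10.0, 10.0)])
```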
- Description: Like K-Means but uses actual data points (medoids) as cluster centers.
- Core Idea: Minimize sum of dissimilarities to medoid. Medoid = data point minimizing within-cluster distance sum.
- When to Use: Non-Euclidean distances; need robust clustering; medoids should be real data points.
- Advantages: Robust to outliers; works with any distance metric.
- Disadvantages: Computationally expensive, $O(k(n-k)^2)$; slower than K-Means.
- Description: Builds a tree (dendrogram) of clusters by iteratively merging (agglomerative) or splitting (divisive) groups.
- Core Idea: Agglomerative: start with n clusters, merge closest pair at each step. Linkage criteria: single, complete, average, Ward.
- When to Use: Unknown K; want to explore cluster hierarchy; small-to-medium datasets.
- Advantages: No K required; produces dendrogram; works with any distance.
- Disadvantages: $O(n^2 \log n)$ time; cannot undo merges; sensitive to noise.
- Description: Density-Based Spatial Clustering of Applications with Noise. Groups dense regions, labels sparse points as noise.
- Core Idea: A point is a core point if at least MinPts neighbors are within $\varepsilon$. Clusters = connected regions of core points.
- When to Use: Arbitrary-shaped clusters; noise/outlier detection; unknown K.
- Advantages: Finds clusters of arbitrary shape; robust to outliers; no K needed.
- Disadvantages: Sensitive to $\varepsilon$ and MinPts; struggles with varying-density clusters; not scalable to very high dimensions.
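A brute-force sketch of the core-point expansion described above. It assumes the neighbor count includes the point itself and uses O(n²) distance checks; the function name `dbscan` and the sample points are illustrative, and real implementations rely on spatial indexes.

```python
import math

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    def neighbors(i):
        return [j for j in range(len(points)) if math.dist(points[i], points[j]) <= eps]

    labels = [None] * len(points)          # None = not yet visited
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:           # not a core point: provisionally noise
            labels[i] = -1
            continue
        cluster_id += 1
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:            # former noise becomes a border point
                labels[j] = cluster_id
            if labels[j] is not None:      # already assigned: do not expand again
                continue
            labels[j] = cluster_id
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:    # j is a core point: keep expanding
                queue.extend(j_neighbors)
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

The isolated point at (50, 50) has no dense neighborhood and is labeled noise, while the two dense groups become separate clusters without K being specified.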
- Description: Hierarchical extension of DBSCAN using a stability-based cluster extraction from the cluster hierarchy.
- Core Idea: Builds a hierarchy of clusters across all density levels; extracts persistent clusters by maximizing stability.
- When to Use: Varying-density clusters; robust noise detection; when DBSCAN requires too much parameter tuning.
- Advantages: Handles varying densities; single key parameter (min_cluster_size); soft clustering available.
- Disadvantages: Higher computational cost than DBSCAN; more complex implementation.
- Description: Non-parametric algorithm that shifts each data point toward the region of highest local density.
- Core Idea: Iteratively moves points toward the mean of points within a kernel bandwidth window until convergence.
- When to Use: Unknown K; blob-shaped clusters; image segmentation.
- Advantages: Automatically finds K; robust to outliers; no cluster shape assumption.
- Disadvantages: Slow, $O(n^2)$; bandwidth selection is critical; not scalable to large datasets.
- Description: Probabilistic model assuming data is generated from a mixture of K Gaussian distributions.
- Core Idea: Fits K Gaussians via Expectation-Maximization (E-step: compute soft assignments; M-step: update parameters).
- When to Use: Soft/probabilistic cluster assignments; elliptical clusters; density estimation.
- Advantages: Soft assignments; models cluster covariance; principled probabilistic framework.
- Disadvantages: Assumes Gaussian clusters; sensitive to initialization; requires K; can converge to local optima.
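The E-step and M-step above can be sketched for a one-dimensional mixture. The function `em_gmm_1d`, the unit starting variances, the variance floor, and the data are assumptions for illustration only.

```python
import math

def gaussian_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def em_gmm_1d(xs, initial_means, n_iters=50):
    """EM for a 1-D Gaussian mixture; returns (weights, means, variances, responsibilities)."""
    means = list(initial_means)
    k = len(means)
    weights = [1.0 / k] * k
    variances = [1.0] * k
    resp = []
    for _ in range(n_iters):
        # E-step: soft assignment of each point to each component
        resp = []
        for x in xs:
            dens = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate parameters from the responsibility-weighted data
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(xs)
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var, 1e-6)      # floor to avoid a collapsing component
    return weights, means, variances, resp

xs = [0.8, 1.0, 1.2, 4.8, 5.0, 5.2]
weights, means, variances, resp = em_gmm_1d(xs, initial_means=[0.0, 6.0])
```

Unlike K-Means, each row of `resp` is a probability vector, which is the soft assignment mentioned above.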
- Description: Uses eigenvalues of a graph Laplacian constructed from a similarity matrix to perform dimensionality reduction before clustering.
- Core Idea: Build affinity graph → compute Laplacian → take the k leading eigenvectors (those with the smallest Laplacian eigenvalues) → cluster with K-Means in eigenspace.
- When to Use: Non-convex clusters; graph/network data; manifold structure.
- Advantages: Can find non-convex clusters; uses global structure of data.
- Disadvantages: Expensive $O(n^3)$ eigendecomposition; requires K; large memory for affinity matrix.
- Description: Balanced Iterative Reducing and Clustering using Hierarchies; builds a CF-Tree (Clustering Feature Tree) for summarizing data.
- Core Idea: Incrementally builds a compact summary of data (CF-Tree); final clustering done on leaf nodes.
- When to Use: Very large datasets; streaming data; memory-constrained environments.
- Advantages: Single pass (O(n)); handles large datasets; incremental/online learning.
- Disadvantages: Assumes spherical clusters; sensitive to threshold parameter; not great for high-dimensional data.
- Description: Passes "responsibility" and "availability" messages between data points to identify exemplars.
- Core Idea: Every point is a potential exemplar; messages converge to identify which points best represent clusters.
- When to Use: Unknown K; want algorithm to determine number of clusters; small-to-medium datasets.
- Advantages: Automatically finds K; exemplar-based (real data points as centers).
- Disadvantages: $O(n^2)$ memory and time; slow; can produce too many clusters; sensitive to preference parameter.
Reduces the number of features while preserving important structure.
- Description: Projects data onto orthogonal axes of maximum variance, ordered by explained variance.
- Core Idea: Computes eigenvectors of the covariance matrix; projects data onto top-k eigenvectors (principal components).
- When to Use: Linear dimensionality reduction; visualization; noise reduction; preprocessing before ML.
- Advantages: Fast; deterministic; interpretable components; removes correlated features.
- Disadvantages: Linear only; components may not be interpretable semantically; loses non-linear structure.
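As a rough sketch of the eigenvector idea above, the first principal component can be found by power iteration on the sample covariance matrix. This assumes the starting vector is not orthogonal to the leading eigenvector; `first_principal_component` and the data are illustrative, and libraries typically use an SVD instead.

```python
import math

def first_principal_component(rows, n_iters=200):
    """Leading eigenvector of the sample covariance and its explained-variance ratio."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    v = [1.0] * d
    for _ in range(n_iters):
        # Multiply by the covariance matrix and renormalize
        v = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    eigenvalue = sum(v[a] * sum(cov[a][b] * v[b] for b in range(d)) for a in range(d))
    explained = eigenvalue / sum(cov[a][a] for a in range(d))
    return v, explained

# Points scattered close to the line y = x
rows = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.1)]
component, explained = first_principal_component(rows)
```

Because the points lie almost on a line, the first component points roughly along (1, 1) and explains nearly all of the variance.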
- Description: Extends PCA to nonlinear manifolds using the kernel trick.
- Core Idea: Implicitly maps data to a high-dimensional feature space via kernel K(x,x'), then applies PCA there.
- When to Use: Non-linear structure in data; manifold data; when standard PCA fails.
- Advantages: Captures non-linear variance; flexible kernels (RBF, poly, etc.).
- Disadvantages: $O(n^2)$ memory; no explicit inverse transform; kernel choice matters.
- Description: Supervised dimensionality reduction that maximizes between-class separability while minimizing within-class scatter.
- Core Idea: Finds projection axes that maximize $S_B / S_W$ (between-class to within-class scatter ratio).
- When to Use: Preprocessing for classification; want maximally class-discriminative features.
- Advantages: Supervised (uses labels); maximally class-separable; at most C-1 components (C = classes).
- Disadvantages: Assumes Gaussian class distributions; linear only; requires labeled data.
- Description: Non-linear dimensionality reduction optimized for visualizing high-dimensional data in 2D/3D.
- Core Idea: Preserves local neighborhoods by modeling pairwise similarities as probabilities; minimizes KL divergence between high-D and low-D distributions.
- When to Use: Visualization of high-dimensional data; exploring cluster structure.
- Advantages: Excellent at revealing local cluster structure; widely used for visualization.
- Disadvantages: No out-of-sample extension; slow, $O(n^2)$; non-deterministic; global structure not preserved.
- Description: Fast, scalable non-linear dimensionality reduction based on Riemannian geometry and fuzzy topology.
- Core Idea: Constructs a fuzzy topological representation of the high-D manifold and optimizes a low-D embedding to match it.
- When to Use: Visualization; preprocessing; faster alternative to t-SNE; preserves more global structure.
- Advantages: Faster than t-SNE; supports out-of-sample projection; preserves global structure better.
- Disadvantages: Hyperparameter sensitive (n_neighbors, min_dist); harder to interpret theoretically.
- Description: Decomposes signals into statistically independent non-Gaussian components (unlike PCA which finds uncorrelated components).
- Core Idea: $X = AS$; finds mixing matrix $A$ such that components $S$ are maximally non-Gaussian and independent.
- When to Use: Signal separation (blind source separation); feature extraction from mixed signals.
- Advantages: Finds statistically independent components; great for signal unmixing.
- Disadvantages: Cannot determine ordering or scaling of components; requires non-Gaussian data.
- Description: Neural network trained to compress input into a low-dimensional bottleneck (encoder) and reconstruct it (decoder).
- Core Idea: Minimize reconstruction loss $\|x - \mathrm{decoder}(\mathrm{encoder}(x))\|^2$. Bottleneck layer = compressed representation.
- When to Use: Non-linear DR; anomaly detection; generative modeling; image compression.
- Advantages: Captures highly non-linear structure; flexible architecture; can be extended (VAE, denoising).
- Disadvantages: Requires large data; computationally expensive; black-box; hyperparameter-heavy.
- Description: Statistical model assuming observed variables are linear combinations of latent (unobserved) factors plus noise.
- Core Idea: $X = \Lambda F + \varepsilon$, where $\Lambda$ = factor loadings, $F$ = latent factors, $\varepsilon$ = per-feature noise. Fits via EM.
- When to Use: Psychometrics; survey data; when noise is heteroskedastic per variable.
- Advantages: Models per-variable noise; interpretable factors; principled statistical framework.
- Disadvantages: Assumes linearity; factor rotation needed for interpretability; sensitive to model specification.
- Description: Factorizes a matrix into $X = U\Sigma V^\top$. Truncated SVD keeps top-k singular values for compression.
- Core Idea: Equivalent to PCA on centered data but also applicable to sparse/non-square matrices (e.g., TF-IDF matrices).
- When to Use: Text data (LSA); collaborative filtering; matrix approximation; PCA on sparse matrices.
- Advantages: Works on sparse matrices; powerful for text; foundation of many algorithms.
- Disadvantages: Linear; no out-of-sample extension without projection; memory-intensive for dense matrices.
| Algorithm | Type | Handles Nonlinearity | Interpretable | Scalable | Typical Use Case |
|---|---|---|---|---|---|
| Linear Regression | Parametric | No | High | Yes | Baseline numeric prediction |
| Polynomial Regression | Parametric | Limited | Medium | Partial | Smooth curve fitting |
| Ridge Regression | Parametric | No | High | Yes | Multicollinear features |
| Lasso Regression | Parametric | No | High | Yes | Feature selection |
| Elastic Net | Parametric | No | Medium | Yes | Correlated sparse features |
| Decision Tree | Non-parametric | Yes | High | Partial | Rule-based prediction |
| Random Forest | Ensemble | Yes | Partial | Yes | General tabular regression |
| Gradient Boosting | Ensemble | Yes | Partial | Partial | High-accuracy tabular data |
| XGBoost | Ensemble | Yes | Partial | Yes | Structured data, competitions |
| LightGBM | Ensemble | Yes | Partial | Very high | Very large datasets |
| CatBoost | Ensemble | Yes | Partial | Yes | Categorical-heavy data |
| SVR | Kernel | Yes | No | Partial | Small, high-dimensional data |
| KNN Regression | Instance-based | Yes | High | No | Local pattern data |
| Bayesian Regression | Probabilistic | No | High | Partial | Uncertainty quantification |
| Algorithm | Type | Handles Nonlinearity | Interpretable | Scalable | Typical Use Case |
|---|---|---|---|---|---|
| Logistic Regression | Parametric | No | High | Yes | Binary classification baseline |
| KNN | Instance-based | Yes | Medium | No | Small-scale classification |
| SVM | Kernel | Yes (kernel) | Partial | Partial | Text, high-dimensional data |
| Decision Tree | Non-parametric | Yes | High | Partial | Rule-based classification |
| Random Forest | Ensemble | Yes | Partial | Yes | General classification |
| Gradient Boosting | Ensemble | Yes | Partial | Partial | High-accuracy tabular |
| XGBoost | Ensemble | Yes | Partial | Yes | Competitions, credit scoring |
| LightGBM | Ensemble | Yes | Partial | Very high | Large-scale classification |
| CatBoost | Ensemble | Yes | Partial | Yes | Categorical-rich data |
| Gaussian NB | Probabilistic | No | High | Yes | Continuous feature classification |
| Multinomial NB | Probabilistic | No | High | Yes | Text classification |
| Bernoulli NB | Probabilistic | No | High | Yes | Binary feature classification |
| MLP (Neural Net) | Deep Learning | Yes (strong) | No | Very high | Images, text, complex tasks |
| AdaBoost | Ensemble | Yes | Partial | Partial | Binary classification |
| Algorithm | Type | Handles Nonlinearity | Requires K | Handles Noise | Scalable | Typical Use Case |
|---|---|---|---|---|---|---|
| K-Means | Partitional | No | Yes | No | Yes | Customer segmentation |
| K-Medoids | Partitional | No | Yes | Yes | Partial | Robust clustering |
| Hierarchical | Hierarchical | Partial | No | Partial | No | Taxonomy, biology |
| DBSCAN | Density-based | Yes | No | Yes | Partial | Geospatial clustering |
| HDBSCAN | Density-based | Yes | No | Yes (strong) | Partial | Variable density data |
| Mean Shift | Density-based | Yes | No | Partial | No | Image segmentation |
| GMM | Probabilistic | Partial | Yes | Partial | Partial | Soft probabilistic clusters |
| Spectral | Graph-based | Yes | Yes | Partial | No | Manifold/graph data |
| BIRCH | Hierarchical | No | No | Partial | Very high | Large-scale streaming |
| Affinity Propagation | Message passing | Partial | No | Partial | No | Unknown K, small data |
| Algorithm | Type | Handles Nonlinearity | Supervised | Scalable | Typical Use Case |
|---|---|---|---|---|---|
| PCA | Linear | No | No | Yes | General preprocessing |
| Kernel PCA | Kernel | Yes | No | No | Non-linear manifold data |
| LDA | Linear | No | Yes | Yes | Pre-classification reduction |
| t-SNE | Non-linear | Yes | No | No | 2D/3D visualization |
| UMAP | Non-linear | Yes | No | Yes | Visualization + preprocessing |
| ICA | Linear | No | No | Yes | Signal source separation |
| Autoencoders | Deep Learning | Yes (strong) | No | Yes | Complex non-linear DR |
| Factor Analysis | Probabilistic | No | No | Partial | Latent factor discovery |
| SVD | Linear | No | No | Yes | Text (LSA), recommendations |
- Relationship between features and target is approximately linear.
- Interpretability is critical (healthcare, finance, regulatory contexts).
- Dataset is small or medium-sized; training speed matters.
- Features are already well-engineered and informative.
- Use as a fast baseline before trying complex models.
- Algorithms: Linear/Logistic Regression, Ridge, Lasso, Elastic Net, LDA.
- Mixed feature types (numerical + categorical) with minimal preprocessing.
- Non-linear relationships and interactions between features.
- Need interpretable rules and feature importance.
- Robust performance without extensive hyperparameter tuning.
- Algorithms: Decision Tree, Random Forest.
- Maximum predictive accuracy on tabular/structured data is the goal.
- Large datasets with complex feature interactions.
- You can afford longer training times.
- Data is structured; no spatial/sequential patterns to exploit.
- Algorithms: Gradient Boosting, XGBoost, LightGBM, CatBoost, AdaBoost.
- Rule of thumb: XGBoost/LightGBM/CatBoost are the default choice for Kaggle-style tabular problems.
- Small-to-medium dataset; high-dimensional feature space.
- Need to capture non-linear boundaries without deep learning.
- Text or bioinformatics data where custom kernels are meaningful.
- Algorithms: SVM, SVR, Kernel PCA.
- Data is unstructured: images, text, audio, video.
- Very large datasets (10k+ samples) available.
- Feature engineering is impractical; let the network learn representations.
- State-of-the-art performance is required.
- Algorithms: MLP, CNN, RNN, Transformer (task-dependent).
- Caution: Avoid NNs for small tabular datasets; ensemble trees usually win.
Gather raw data from databases, APIs, sensors, or web scraping. Ensure coverage of all relevant conditions. Document data provenance and collection methodology.
- Handle missing values: Imputation (mean/median/mode/KNN) or removal.
- Remove duplicates and fix inconsistencies.
- Encode categoricals: Label encoding, one-hot encoding, target encoding.
- Scale features: StandardScaler (z-score), MinMaxScaler, RobustScaler (for outliers).
- Handle outliers: Clip, transform (log), or use robust algorithms.
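The scaling and clipping steps above reduce to short formulas. This sketch uses only the standard library; the function names are illustrative and mirror, but are not, the scikit-learn scalers named in the list. The z-score here uses the population standard deviation.

```python
import statistics

def standardize(values):
    """Z-score scaling: subtract the mean, divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def min_max_scale(values):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def clip_outliers(values, lower, upper):
    """Clamp every value into the range [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # mean 5, population std 2
z_scores = standardize(values)
scaled = min_max_scale(values)
clipped = clip_outliers(values, lower=3.0, upper=8.0)
```

In a real pipeline the mean, standard deviation, minimum, and maximum should be computed on the training split only and reused on validation and test data to avoid leakage.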
- Create new features from domain knowledge (ratios, interactions, aggregates).
- Feature selection: Filter methods (correlation, chi-squared), wrapper methods (RFE), embedded methods (Lasso, tree importance).
- Dimensionality reduction: PCA/UMAP for high-dimensional data.
- Goal: maximize signal, minimize noise and redundancy.
- Split data: Train / Validation / Test (e.g., 70/15/15) or use K-Fold Cross-Validation.
- Train model on training set; monitor validation performance.
- Track experiments (MLflow, W&B).
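The K-Fold option mentioned above can be sketched as an index generator. This version uses contiguous, unshuffled folds for clarity; `k_fold_indices` is a hypothetical helper, and in practice the data is usually shuffled (or stratified) first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size

folds = list(k_fold_indices(10, 3))
for train_idx, val_idx in folds:
    # Fit on train_idx and score on val_idx here; average the k scores afterwards
    pass
```

Every sample lands in exactly one validation fold, so each observation is used for both training and validation across the k runs.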
| Task | Primary Metrics |
|---|---|
| Regression | MAE, RMSE, $R^2$, MAPE |
| Binary Classification | Accuracy, F1-Score, AUC-ROC, Precision, Recall |
| Multi-class Classification | Macro/Micro F1, Confusion Matrix, Top-k Accuracy |
| Clustering | Silhouette Score, Davies-Bouldin, Calinski-Harabasz, Adjusted Rand Index |
| Ranking | NDCG, MAP, MRR |
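A few of the metrics in the table can be computed directly from their definitions. The sketch below covers MAE, RMSE, precision, recall, and F1 for binary labels in {0, 1}; the function names and sample values are illustrative.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE) for paired true and predicted values."""
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return mae, rmse

def binary_metrics(y_true, y_pred):
    """Return (precision, recall, F1) treating label 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

mae, rmse = regression_metrics([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
precision, recall, f1 = binary_metrics([1, 1, 1, 0, 0, 0, 1, 0], [1, 1, 0, 0, 0, 1, 1, 0])
```

RMSE penalizes the single error of 2 more heavily than MAE does, which is why the two can rank models differently when outliers are present.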
- Grid Search: Exhaustive; good for small search spaces.
- Random Search: Faster; samples randomly; often better than grid search.
- Bayesian Optimization (Optuna, Hyperopt): Learns from past trials; often the most sample-efficient option.
- Early Stopping: Prevents overfitting in iterative models (GBM, NNs).
- Always tune on validation set, never on test set.
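Random search, as described above, amounts to sampling configurations and keeping the best validation score. In this sketch `random_search`, the search space, and the `validation_score` objective are all hypothetical stand-ins; a real objective would train a model and evaluate it on the validation set.

```python
import random

def random_search(objective, search_space, n_trials=30, seed=0):
    """Sample each hyperparameter uniformly from its range; keep the best-scoring trial."""
    rng = random.Random(seed)              # seeded for reproducibility
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {name: rng.uniform(lo, hi) for name, (lo, hi) in search_space.items()}
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def validation_score(params):
    """Stand-in objective that peaks at learning_rate=0.1, subsample=0.8."""
    return -((params["learning_rate"] - 0.1) ** 2) - (params["subsample"] - 0.8) ** 2

search_space = {"learning_rate": (0.01, 0.3), "subsample": (0.5, 1.0)}
best_params, best_score = random_search(validation_score, search_space)
```

Learning rates are often sampled on a log scale rather than uniformly; uniform sampling is used here only to keep the sketch short.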
- Serialize model: `pickle`, `joblib`, or ONNX for cross-platform use.
- Serve predictions: REST API (FastAPI, Flask), batch inference, streaming (Kafka).
- Monitor: Track prediction drift, data drift, model performance over time.
- Retrain triggers: Set thresholds on performance degradation or data drift metrics.
- MLOps tools: MLflow, BentoML, SageMaker, Vertex AI, Seldon.
End of ML Algorithms Handbook, Version 1.0
Tip: Use this document as a quick reference. For any algorithm, experiment on your actual data β empirical results always trump theoretical recommendations.