|
| 1 | +# Module overview |
| 2 | + |
| 3 | +## What you will learn |
| 4 | + |
| 5 | +<!-- Give in plain English what the module is about --> |
| 6 | + |
| 7 | +This module gives an intuitive introduction to dimensionality reduction. |
| 8 | + |
| 9 | +We focus on Principal Component Analysis (PCA), the most widely used |
| 10 | +dimensionality reduction technique. PCA is simple enough to build geometric |
| 11 | +intuitions on, yet rich enough to raise non-trivial questions: how do you |
| 12 | +preprocess features before applying it, how many components should you keep, and |
| 13 | +what do you do when the standard heuristics break down, as they do for text |
| 14 | +data? |
| 15 | + |
| 16 | +The module builds on the supervised pipelines from the Linear Models and |
| 17 | +Selecting The Best Model modules, and on the unsupervised foundations from the |
| 18 | +Clustering module. We extend those ideas in two directions. First, we treat |
| 19 | +dimensionality reduction as a preprocessing step inside supervised and |
| 20 | +unsupervised pipelines, and show how to tune it. Second, we apply it to text |
| 21 | +data, where the feature space can have thousands of dimensions and the usual |
| 22 | +rules of thumb stop working. There we also step outside scikit-learn to compare |
| 23 | +PCA against non-linear techniques such as t-SNE and UMAP, which reveal cluster |
| 24 | +structure that linear projections compress away. |
| 25 | + |
| 26 | +## Before getting started |
| 27 | + |
| 28 | +<!-- Give the required skills for the module --> |
| 29 | + |
| 30 | +The technical skills required to follow this module are: |
| 31 | + |
| 32 | + |
| 33 | +- skills acquired during the "Selecting The Best Model" and "Linear Models" |
| 34 | + modules for basic concepts around hyperparameter stability. |
| 35 | + |
| 36 | +- skills acquired during the "Clustering" module for basic concepts in |
| 37 | + unsupervised learning and for text data preprocessing. |
| 38 | + |
| 39 | +<!-- Point to resources for learning these skills --> |
| 40 | + |
| 41 | +## Objectives and time schedule |
| 42 | + |
| 43 | +<!-- Give the learning objectives --> |
| 44 | + |
| 45 | +The objectives of this module are the following: |
| 46 | + |
| 47 | +- Build geometric intuitions on PCA |
| 48 | +- Understand why and how to scale features before applying PCA |
| 49 | +- Use heatmaps to interpret how original features contribute to each component |
| 50 | +- Tune `n_components` as a hyperparameter in a supervised pipeline |
| 51 | +- Choose `n_components` in the unsupervised case using explained variance |
| 52 | + curves, the Kaiser criterion, and silhouette scores, and understand when each |
| 53 | + criterion is appropriate |
| 54 | +- Understand why standard heuristics for choosing `n_components` break down for |
| 55 | + text data, and what practitioners use instead |
| 56 | +- Compare linear (PCA) and non-linear (t-SNE, UMAP) dimensionality reduction |
| 57 | + techniques and understand when each is most informative |
| 58 | + |
| 59 | +<!-- Give the investment in time --> |
| 60 | + |
| 61 | +The estimated time to go through this module is about 3 hours. |