Commit a1ac75b

authored
Add module on dimensionality reduction first iteration
WIP
2 parents 18a19cf + a48c986 commit a1ac75b

19 files changed

Lines changed: 6531 additions & 13 deletions

datasets/wiki_news.csv

Lines changed: 1251 additions & 0 deletions
Large diffs are not rendered by default.

jupyter-book/_toc.yml

Lines changed: 14 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -221,6 +221,20 @@ parts:
221221
- file: appendix/acknowledgement
222222
- file: appendix/notebook_timings
223223
- file: appendix/toc_redirect
224+
- caption: 🚧 Dimensionality reduction
225+
chapters:
226+
- file: dimensionality/dimred_module_intro
227+
- file: dimensionality/dimred_pca_index
228+
sections:
229+
- file: python_scripts/dimred_intuitions
230+
- file: python_scripts/dimred_preprocessing
231+
- file: python_scripts/dimred_ex_01
232+
- file: python_scripts/dimred_sol_01
233+
- file: python_scripts/dimred_components
234+
- file: python_scripts/dimred_text
235+
# - file: dimensionality/dimred_quiz_m8_01
236+
# - file: dimensionality/dimred_wrap_up_quiz
237+
- file: dimensionality/dimred_module_take_away
224238
- caption: 🚧 Feature selection
225239
chapters:
226240
- file: feature_selection/feature_selection_module_intro
Lines changed: 61 additions & 0 deletions
@@ -0,0 +1,61 @@
1+
# Module overview
2+
3+
## What you will learn
4+
5+
<!-- Give in plain English what the module is about -->
6+
7+
This module gives an intuitive introduction to dimensionality reduction.
8+
9+
We focus on Principal Component Analysis (PCA), one of the most widely used
10+
dimensionality reduction technique. PCA is simple enough to build geometric
11+
intuitions on, yet rich enough to raise non-trivial questions: how do you
12+
preprocess features before applying it, how many components should you keep, and
13+
what do you do when the standard heuristics break down, as they do for text
14+
data?
15+
16+
The module builds on the supervised pipelines from the Linear Models and
17+
Selecting The Best Model modules, and on the unsupervised foundations from the
18+
Clustering module. We extend those ideas in two directions. First, we treat
19+
dimensionality reduction as a preprocessing step inside supervised and
20+
unsupervised pipelines, and show how to tune it. Second, we apply it to text
21+
data, where the feature space can have thousands of dimensions and the usual
22+
rules of thumb stop working. There we also step outside scikit-learn to compare
23+
PCA against non-linear techniques such as t-SNE and UMAP, which reveal cluster
24+
structure that linear projections compress away.
25+
26+
## Before getting started
27+
28+
<!-- Give the required skills for the module -->
29+
30+
The technical skills required to carry out this module are:
31+
32+
33+
- skills acquired during the "Selecting The Best Model" and "Linear Models"
34+
modules for basic concepts around hyperparameter stability.
35+
36+
- skills acquired during the "Clustering" module for basic concepts in
37+
unsupervised learning and for text data preprocessing.
38+
39+
<!-- Point to resources to learning these skills -->
40+
41+
## Objectives and time schedule
42+
43+
<!-- Give the learning objectives -->
44+
45+
The objectives of this module are the following:
46+
47+
- Build geometric intuitions on PCA
48+
- Understand why and how to scale features before applying PCA
49+
- Use heatmaps to interpret how original features contribute to each component
50+
- Tune `n_components` as a hyperparameter in a supervised pipeline
51+
- Choose `n_components` in the unsupervised case using explained variance
52+
curves, the Kaiser criterion, and silhouette scores, and understand when each
53+
criterion is appropriate
54+
- Understand why standard heuristics for choosing `n_components` break down for
55+
text data, and what practitioners use instead
56+
- Compare linear (PCA) and non-linear (t-SNE, UMAP) dimensionality reduction
57+
techniques and understand when each is most informative
58+
59+
<!-- Give the investment in time -->
60+
61+
The estimated time to go through this module is about 3 hours.
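
The `n_components` objectives listed above can be sketched in a few lines of scikit-learn. This is only a minimal illustration, not code from the module notebooks: the wine dataset, the logistic regression classifier, the candidate grid, and the 90% variance threshold are all assumptions picked for the example.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumed example dataset: 178 samples, 13 numerical features, 3 classes.
X, y = load_wine(return_X_y=True)

# Supervised case: treat n_components as a pipeline hyperparameter and
# select it by cross-validation.
model = make_pipeline(StandardScaler(), PCA(), LogisticRegression(max_iter=1000))
param_grid = {"pca__n_components": [2, 4, 6, 8, 10, 13]}
search = GridSearchCV(model, param_grid, cv=5)
search.fit(X, y)
best_n = search.best_params_["pca__n_components"]

# Unsupervised case: fit a full PCA on scaled features and inspect the
# variance carried by each component.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA().fit(X_scaled)

# Explained variance curve: smallest number of components whose cumulative
# explained variance ratio reaches the (assumed) 90% threshold.
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_for_90 = int(np.searchsorted(cumulative, 0.90) + 1)

# Kaiser criterion: keep components whose variance exceeds 1, that is, more
# than the variance of a single standardized feature.
n_kaiser = int(np.sum(pca.explained_variance_ > 1.0))

print(f"cross-validation: {best_n}, 90% variance: {n_for_90}, Kaiser: {n_kaiser}")
```

The three criteria need not agree: cross-validation optimizes predictive accuracy, whereas the two unsupervised heuristics only look at the variance of the input features.
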
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
1+
# Intuitions on dimensionality reduction
2+
3+
```{tableofcontents}
4+
5+
```
Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
1+
# Main take-away
2+
3+
## Wrap-up
4+
5+
In this module, we presented the framework used in unsupervised learning with
6+
dimensionality reduction, focusing on PCA and how to choose its number of
7+
components.
8+
9+
We explored the concepts of explained variance and reconstruction error, and we
10+
saw how the distribution of a given feature determines how it behaves after
11+
scaling and how much it influences the resulting PC space.
12+
13+
We showed how PCA can be integrated into both supervised and unsupervised
14+
pipelines to reduce computing time and to ease data visualization.
15+
16+
Finally, we introduced t-SNE and UMAP as non-linear alternatives to PCA for
17+
visualization.
18+
19+
## To go further
20+
21+
You can refer to the following scikit-learn examples which are related to
22+
the concepts covered in this module:
23+
24+
- [Faces dataset decompositions](https://scikit-learn.org/stable/auto_examples/decomposition/plot_faces_decomposition.html)
25+
- [Manifold learning on handwritten digits: locally linear embedding, Isomap](https://scikit-learn.org/stable/auto_examples/manifold/plot_lle_digits.html)
26+
- [Comparison of Manifold Learning methods](https://scikit-learn.org/stable/auto_examples/manifold/plot_compare_methods.html)
27+
- [Dimensionality Reduction with Neighborhood Components Analysis](https://scikit-learn.org/stable/auto_examples/neighbors/plot_nca_dim_reduction.html)
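
The contrast between a linear and a non-linear 2D embedding mentioned in the wrap-up can be tried with scikit-learn alone. The sketch below is not taken from the module: the digits dataset, the 500-sample subset, the t-SNE settings, and the use of a label-based silhouette score as a rough separation measure are assumptions made for illustration. UMAP is left out because it requires the separate `umap-learn` package.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score

# Assumed example data: a 500-sample subset of the 64-pixel digits images,
# kept small so that t-SNE runs in a few seconds.
X, y = load_digits(return_X_y=True)
X, y = X[:500], y[:500]

# Linear projection onto the two directions of largest variance.
X_pca = PCA(n_components=2).fit_transform(X)

# Non-linear embedding that tries to preserve local neighborhoods.
X_tsne = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

# Rough measure of how well the known digit classes separate in each 2D
# space (higher is better, values lie in [-1, 1]).
score_pca = silhouette_score(X_pca, y)
score_tsne = silhouette_score(X_tsne, y)

print(f"silhouette (PCA): {score_pca:.2f}, silhouette (t-SNE): {score_tsne:.2f}")
```

Plotting `X_pca` and `X_tsne` as scatter plots colored by `y` gives a visual counterpart to the two scores. Keep in mind that distances between far-apart groups in a t-SNE plot are not directly interpretable.
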

notebooks/cross_validation_validation_curve.ipynb

Lines changed: 12 additions & 12 deletions
@@ -259,18 +259,18 @@
259259
"errors made during the data collection process (besides not measuring the\n",
260260
"unobserved input feature).\n",
261261
"\n",
262-
"One extreme case could happen if there where samples in the dataset with exactly\n",
263-
"the same input feature values but different values for the target variable. That\n",
264-
"is very unlikely in real life settings, but could be the case if all features\n",
265-
"are categorical or if the numerical features were discretized or rounded up\n",
266-
"naively. In our example, we can imagine two houses having the exact same\n",
267-
"features in our dataset, but having different prices because of the (unmeasured)\n",
268-
"seller's rush.\n",
269-
"\n",
270-
"Apart from this extreme case, it's hard to know for sure what should qualify or\n",
271-
"not as noise and which kind of \"noise\" as introduced above is dominating. But in\n",
272-
"practice, the best way to make our predictive models robust to noise is to\n",
273-
"avoid overfitting models by:\n",
262+
"One extreme case could happen if there where samples in the dataset with\n",
263+
"exactly the same input feature values but different values for the target\n",
264+
"variable. That is very unlikely in real life settings, but could be the case\n",
265+
"if all features are categorical or if the numerical features were discretized\n",
266+
"or rounded up naively. In our example, we can imagine two houses having the\n",
267+
"exact same features in our dataset, but having different prices because of the\n",
268+
"(unmeasured) seller's rush.\n",
269+
"\n",
270+
"Apart from this extreme case, it's hard to know for sure what should qualify\n",
271+
"or not as noise and which kind of \"noise\" as introduced above is dominating.\n",
272+
"But in practice, the best way to make our predictive models robust to noise\n",
273+
"is to avoid overfitting models by:\n",
274274
"\n",
275275
"- selecting models that are simple enough or with tuned hyper-parameters as\n",
276276
" explained in this module;\n",

notebooks/datasets_ames_housing.ipynb

Lines changed: 0 additions & 1 deletion
@@ -288,7 +288,6 @@
288288
"from sklearn.impute import SimpleImputer\n",
289289
"from sklearn.pipeline import make_pipeline\n",
290290
"\n",
291-
"\n",
292291
"numerical_features = [\n",
293292
" \"LotFrontage\",\n",
294293
" \"LotArea\",\n",
