1,251 changes: 1,251 additions & 0 deletions datasets/wiki_news.csv

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions environment-dev.yml
Original file line number Diff line number Diff line change
@@ -7,6 +7,7 @@ dependencies:
- matplotlib-base
- seaborn >= 0.13
- plotly >= 5.10
- skrub
- jupytext
- beautifulsoup4
- IPython
1 change: 1 addition & 0 deletions environment.yml
@@ -8,6 +8,7 @@ dependencies:
- pandas >= 1
- matplotlib-base
- seaborn >= 0.13
- skrub
- jupyterlab
- notebook
- plotly >= 5.10
14 changes: 14 additions & 0 deletions jupyter-book/_toc.yml
@@ -221,6 +221,20 @@ parts:
- file: appendix/acknowledgement
- file: appendix/notebook_timings
- file: appendix/toc_redirect
- caption: 🚧 Dimensionality reduction
chapters:
- file: dimensionality/dimred_module_intro
- file: dimensionality/dimred_pca_index
sections:
- file: python_scripts/dimred_intuitions
- file: python_scripts/dimred_preprocessing
- file: python_scripts/dimred_ex_01
- file: python_scripts/dimred_sol_01
- file: python_scripts/dimred_components
- file: python_scripts/dimred_text
# - file: dimensionality/dimred_quiz_m8_01
# - file: dimensionality/dimred_wrap_up_quiz
- file: dimensionality/dimred_module_take_away
- caption: 🚧 Feature selection
chapters:
- file: feature_selection/feature_selection_module_intro
61 changes: 61 additions & 0 deletions jupyter-book/dimensionality/dimred_module_intro.md
@@ -0,0 +1,61 @@
# Module overview

## What you will learn

<!-- Give in plain English what the module is about -->

This module gives an intuitive introduction to dimensionality reduction.

We focus on Principal Component Analysis (PCA), the most widely used
dimensionality reduction technique. PCA is simple enough to build geometric
intuitions on, yet rich enough to raise non-trivial questions: how do you
preprocess features before applying it, how many components should you keep, and
what do you do when the standard heuristics break down, as they do for text
data?

The module builds on the supervised pipelines from the Linear Models and
Selecting The Best Model modules, and on the unsupervised foundations from the
Clustering module. We extend those ideas in two directions. First, we treat
dimensionality reduction as a preprocessing step inside supervised and
unsupervised pipelines, and show how to tune it. Second, we apply it to text
data, where the feature space can have thousands of dimensions and the usual
rules of thumb stop working. There we also step outside scikit-learn to compare
PCA against non-linear techniques such as t-SNE and UMAP, which reveal cluster
structure that linear projections compress away.
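
As a minimal sketch of the first direction, the snippet below places PCA inside
a supervised pipeline and tunes `n_components` by cross-validation. The wine
dataset, the logistic regression classifier, and the candidate grid are
illustrative assumptions, not choices taken from the module notebooks.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative dataset (assumption): 13 numerical features, 3 classes.
X, y = load_wine(return_X_y=True)

# PCA sits between scaling and the classifier, so it is refit on each
# training fold and never sees the validation data.
pipeline = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("pca", PCA()),
        ("classifier", LogisticRegression(max_iter=1_000)),
    ]
)

# Candidate values are arbitrary; they only need to stay below 13 features.
param_grid = {"pca__n_components": [2, 4, 6, 8]}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X, y)

best_n_components = search.best_params_["pca__n_components"]
print(f"Best n_components: {best_n_components}")
print(f"Mean cross-validated accuracy: {search.best_score_:.3f}")
```

Because the number of components is selected by the grid search, it is chosen
on the basis of predictive performance rather than explained variance alone.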

## Before getting started

<!-- Give the required skills for the module -->

The technical skills required to follow this module are:


- skills acquired during the "Selecting The Best Model" and "Linear Models"
modules for basic concepts around hyperparameter stability.

- skills acquired during the "Clustering" module for basic concepts in
unsupervised learning and for text data preprocessing.

<!-- Point to resources to learning these skills -->

## Objectives and time schedule

<!-- Give the learning objectives -->

The objectives of this module are the following:

- Build geometric intuitions on PCA
- Understand why and how to scale features before applying PCA
- Use heatmaps to interpret how original features contribute to each component
- Tune `n_components` as a hyperparameter in a supervised pipeline
- Choose `n_components` in the unsupervised case using explained variance
curves, the Kaiser criterion, and silhouette scores, and understand when each
criterion is appropriate
- Understand why standard heuristics for choosing `n_components` break down for
text data, and what practitioners use instead
- Compare linear (PCA) and non-linear (t-SNE, UMAP) dimensionality reduction
techniques and understand when each is most informative
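
To make two of the unsupervised criteria above concrete, here is a hedged
sketch that computes the cumulative explained variance and applies the Kaiser
criterion (keep components whose eigenvalue exceeds 1 on standardized data).
The wine dataset and the 90% variance threshold are assumptions chosen for
illustration only.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Illustrative dataset (assumption); the target is not used here.
X, _ = load_wine(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Fit PCA with all components to inspect the full variance spectrum.
pca = PCA().fit(X_scaled)

# Explained variance curve: smallest number of components whose cumulative
# explained variance ratio reaches the (arbitrary) 90% threshold.
cumulative_variance = np.cumsum(pca.explained_variance_ratio_)
n_components_90 = int(np.argmax(cumulative_variance >= 0.90)) + 1

# Kaiser criterion: count components with eigenvalue above 1. Note that
# scikit-learn uses an (n - 1) denominator for explained_variance_, so the
# eigenvalues sum to slightly more than the number of features.
n_components_kaiser = int(np.sum(pca.explained_variance_ > 1.0))

print(f"Components for 90% of the variance: {n_components_90}")
print(f"Components kept by the Kaiser criterion: {n_components_kaiser}")
```

The two criteria need not agree, which is one reason to understand when each
of them is appropriate.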

<!-- Give the investment in time -->

The estimated time to go through this module is about 3 hours.
5 changes: 5 additions & 0 deletions jupyter-book/dimensionality/dimred_pca_index.md
@@ -0,0 +1,5 @@
# Intuitions on dimensionality reduction

```{tableofcontents}

```
27 changes: 27 additions & 0 deletions jupyter-book/dimensionality/dimred_take_away.md
@@ -0,0 +1,27 @@
# Main take-away

## Wrap-up

In this module, we presented the framework used in unsupervised learning with
dimensionality reduction, focusing on PCA and how to choose its number of
components.

We explored the concepts of explained variance and reconstruction error, and we
saw how the distribution of a given feature determines how it behaves after
scaling and how much influence it has on the resulting PC space.
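
The effect of scaling on the PC space can be illustrated with a small sketch
on synthetic data. The two independent Gaussian features and their scales are
assumptions made up for this example; they do not come from the module
datasets.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): two independent features whose standard
# deviations differ by a factor of 100.
rng = np.random.default_rng(0)
small_scale = rng.normal(scale=1.0, size=500)
large_scale = rng.normal(scale=100.0, size=500)
X = np.column_stack([small_scale, large_scale])

# Without scaling, the first component is driven almost entirely by the
# feature with the largest variance.
pca_raw = PCA(n_components=1).fit(X)
raw_ratio = pca_raw.explained_variance_ratio_[0]

# After standardization, both features contribute comparably, so the first
# component explains roughly half of the variance.
X_scaled = StandardScaler().fit_transform(X)
pca_scaled = PCA(n_components=1).fit(X_scaled)
scaled_ratio = pca_scaled.explained_variance_ratio_[0]

print(f"First component, raw data: {raw_ratio:.4f} of the variance")
print(f"First component, scaled data: {scaled_ratio:.4f} of the variance")
```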

We showed how PCA can be integrated into both supervised and unsupervised
pipelines to reduce computing time and to ease data visualization.

Finally, we introduced t-SNE and UMAP as non-linear alternatives to PCA for
visualization.

## To go further

You can refer to the following scikit-learn examples, which are related to
the concepts covered in this module:

- [Faces dataset decompositions](https://scikit-learn.org/stable/auto_examples/decomposition/plot_faces_decomposition.html)
- [Manifold learning on handwritten digits: locally linear embedding, Isomap](https://scikit-learn.org/stable/auto_examples/manifold/plot_lle_digits.html)
- [Comparison of Manifold Learning methods](https://scikit-learn.org/stable/auto_examples/manifold/plot_compare_methods.html)
- [Dimensionality Reduction with Neighborhood Components Analysis](https://scikit-learn.org/stable/auto_examples/neighbors/plot_nca_dim_reduction.html)
24 changes: 12 additions & 12 deletions notebooks/cross_validation_validation_curve.ipynb
@@ -255,18 +255,18 @@
"errors made during the data collection process (besides not measuring the\n",
"unobserved input feature).\n",
"\n",
"One extreme case could happen if there where samples in the dataset with exactly\n",
"the same input feature values but different values for the target variable. That\n",
"is very unlikely in real life settings, but could be the case if all features\n",
"are categorical or if the numerical features were discretized or rounded up\n",
"naively. In our example, we can imagine two houses having the exact same\n",
"features in our dataset, but having different prices because of the (unmeasured)\n",
"seller's rush.\n",
"\n",
"Apart from this extreme case, it's hard to know for sure what should qualify or\n",
"not as noise and which kind of \"noise\" as introduced above is dominating. But in\n",
"practice, the best way to make our predictive models robust to noise is to\n",
"avoid overfitting models by:\n",
"One extreme case could happen if there were samples in the dataset with\n",
"exactly the same input feature values but different values for the target\n",
"variable. That is very unlikely in real life settings, but could be the case\n",
"if all features are categorical or if the numerical features were discretized\n",
"or rounded up naively. In our example, we can imagine two houses having the\n",
"exact same features in our dataset, but having different prices because of the\n",
"(unmeasured) seller's rush.\n",
"\n",
"Apart from this extreme case, it's hard to know for sure what should qualify\n",
"or not as noise and which kind of \"noise\" as introduced above is dominating.\n",
"But in practice, the best way to make our predictive models robust to noise\n",
"is to avoid overfitting models by:\n",
"\n",
"- selecting models that are simple enough or with tuned hyper-parameters as\n",
" explained in this module;\n",
1 change: 0 additions & 1 deletion notebooks/datasets_ames_housing.ipynb
@@ -288,7 +288,6 @@
"from sklearn.impute import SimpleImputer\n",
"from sklearn.pipeline import make_pipeline\n",
"\n",
"\n",
"numerical_features = [\n",
" \"LotFrontage\",\n",
" \"LotArea\",\n",