probabl-ai
diff --git a/datasets/wiki_news.csv b/datasets/wiki_news.csv
Lines changed: 1251 additions & 0 deletions
diff --git a/jupyter-book/_toc.yml b/jupyter-book/_toc.yml
Lines changed: 14 additions & 0 deletions
diff --git a/jupyter-book/dimensionality/dimred_module_intro.md b/jupyter-book/dimensionality/dimred_module_intro.md
Lines changed: 61 additions & 0 deletions
diff --git a/jupyter-book/dimensionality/dimred_pca_index.md b/jupyter-book/dimensionality/dimred_pca_index.md
Lines changed: 5 additions & 0 deletions
diff --git a/jupyter-book/dimensionality/dimred_take_away.md b/jupyter-book/dimensionality/dimred_take_away.md
Lines changed: 27 additions & 0 deletions
diff --git a/notebooks/cross_validation_validation_curve.ipynb b/notebooks/cross_validation_validation_curve.ipynb
Lines changed: 12 additions & 12 deletions
diff --git a/notebooks/datasets_ames_housing.ipynb b/notebooks/datasets_ames_housing.ipynb
Lines changed: 0 additions & 1 deletion
@@ -221,6 +221,20 @@ parts:
   - file: appendix/acknowledgement
   - file: appendix/notebook_timings
   - file: appendix/toc_redirect
+- caption: 🚧 Dimensionality reduction
+  chapters:
+  - file: dimensionality/dimred_module_intro
+  - file: dimensionality/dimred_pca_index
+    sections:
+    - file: python_scripts/dimred_intuitions
+    - file: python_scripts/dimred_preprocessing
+    - file: python_scripts/dimred_ex_01
+    - file: python_scripts/dimred_sol_01
+    - file: python_scripts/dimred_components
+    - file: python_scripts/dimred_text
+  #   - file: dimensionality/dimred_quiz_m8_01
+  # - file: dimensionality/dimred_wrap_up_quiz
+  - file: dimensionality/dimred_module_take_away
 - caption: 🚧 Feature selection
   chapters:
   - file: feature_selection/feature_selection_module_intro
 
@@ -0,0 +1,61 @@
+# Module overview
+
+## What you will learn
+
+<!-- Give in plain English what the module is about -->
+
+This module gives an intuitive introduction to dimensionality reduction.
+
+We focus on Principal Component Analysis (PCA), the most widely used
+dimensionality reduction technique. PCA is simple enough to build geometric
+intuitions on, yet rich enough to raise non-trivial questions: how do you
+preprocess features before applying it, how many components should you keep, and
+what do you do when the standard heuristics break down, as they do for text
+data?
+
+The module builds on the supervised pipelines from the Linear Models and
+Selecting The Best Model modules, and on the unsupervised foundations from the
+Clustering module. We extend those ideas in two directions. First, we treat
+dimensionality reduction as a preprocessing step inside supervised and
+unsupervised pipelines, and show how to tune it. Second, we apply it to text
+data, where the feature space can have thousands of dimensions and the usual
+rules of thumb stop working. There we also step outside scikit-learn to compare
+PCA against non-linear techniques such as t-SNE and UMAP, which reveal cluster
+structure that linear projections compress away.
+
+## Before getting started
+
+<!-- Give the required skills for the module -->
+
+The technical skills required to carry out this module are:
+
+
+- skills acquired during the "Selecting The Best Model" and "Linear Models"
+  modules for basic concepts around hyperparameter stability.
+
+- skills acquired during the "Clustering" module for basic concepts in
+  unsupervised learning and for text data preprocessing.
+
+<!-- Point to resources to learning these skills -->
+
+## Objectives and time schedule
+
+<!-- Give the learning objectives -->
+
+The objectives of this module are the following:
+
+- Build geometric intuitions on PCA
+- Understand why and how to scale features before applying PCA
+- Use heatmaps to interpret how original features contribute to each component
+- Tune `n_components` as a hyperparameter in a supervised pipeline
+- Choose `n_components` in the unsupervised case using explained variance
+  curves, the Kaiser criterion, and silhouette scores, and understand when each
+  criterion is appropriate
+- Understand why standard heuristics for choosing `n_components` break down for
+  text data, and what practitioners use instead
+- Compare linear (PCA) and non-linear (t-SNE, UMAP) dimensionality reduction
+  techniques and understand when each is most informative
+
+<!-- Give the investment in time -->
+
+The estimated time to go through this module is about 3 hours.
@@ -0,0 +1,5 @@
+# Intuitions on dimensionality reduction
+
+```{tableofcontents}
+
+```
@@ -0,0 +1,27 @@
+# Main take-away
+
+## Wrap-up
+
+In this module, we presented the framework used in unsupervised learning with
+dimensionality reduction, focusing on PCA and how to choose its number of
+components.
+
+We explored the concepts of explained variance and reconstruction error, and we
+saw how the distribution of a given feature determines how it behaves after
+scaling, and the influence it has on the resulting PC space.
+
+We showed how PCA can be integrated into both supervised and unsupervised
+pipelines to reduce computing time and to ease data visualization.
+
+Finally, we introduced t-SNE and UMAP as non-linear alternatives to PCA for
+visualization.
+
+## To go further
+
+You can refer to the following scikit-learn examples, which are related to
+the concepts covered in this module:
+
+- [Faces dataset decompositions](https://scikit-learn.org/stable/auto_examples/decomposition/plot_faces_decomposition.html)
+- [Manifold learning on handwritten digits: locally linear embedding, Isomap](https://scikit-learn.org/stable/auto_examples/manifold/plot_lle_digits.html)
+- [Comparison of Manifold Learning methods](https://scikit-learn.org/stable/auto_examples/manifold/plot_compare_methods.html)
+- [Dimensionality Reduction with Neighborhood Components Analysis](https://scikit-learn.org/stable/auto_examples/neighbors/plot_nca_dim_reduction.html)
@@ -259,18 +259,18 @@
     "errors made during the data collection process (besides not measuring the\n",
     "unobserved input feature).\n",
     "\n",
-    "One extreme case could happen if there where samples in the dataset with exactly\n",
-    "the same input feature values but different values for the target variable. That\n",
-    "is very unlikely in real life settings, but could be the case if all features\n",
-    "are categorical or if the numerical features were discretized or rounded up\n",
-    "naively. In our example, we can imagine two houses having the exact same\n",
-    "features in our dataset, but having different prices because of the (unmeasured)\n",
-    "seller's rush.\n",
-    "\n",
-    "Apart from this extreme case, it's hard to know for sure what should qualify or\n",
-    "not as noise and which kind of \"noise\" as introduced above is dominating. But in\n",
-    "practice, the best way to make our predictive models robust to noise is to\n",
-    "avoid overfitting models by:\n",
+    "One extreme case could happen if there were samples in the dataset with\n",
+    "exactly the same input feature values but different values for the target\n",
+    "variable. That is very unlikely in real life settings, but could be the case\n",
+    "if all features are categorical or if the numerical features were discretized\n",
+    "or rounded up naively. In our example, we can imagine two houses having the\n",
+    "exact same features in our dataset, but having different prices because of the\n",
+    "(unmeasured) seller's rush.\n",
+    "\n",
+    "Apart from this extreme case, it's hard to know for sure what should qualify\n",
+    "or not as noise and which kind of \"noise\" as introduced above is dominating.\n",
+    "But in practice, the best way to make our predictive models robust to noise\n",
+    "is to avoid overfitting models by:\n",
     "\n",
     "- selecting models that are simple enough or with tuned hyper-parameters as\n",
     "  explained in this module;\n",
 
@@ -288,7 +288,6 @@
     "from sklearn.impute import SimpleImputer\n",
     "from sklearn.pipeline import make_pipeline\n",
     "\n",
-    "\n",
     "numerical_features = [\n",
     "    \"LotFrontage\",\n",
     "    \"LotArea\",\n",