Adding sphinx-gallery examples to user guide (#1991)

rcap107 · web-flow · commit 4c5d4ace05c2 · 2026-03-26T14:53:57.000+01:00
diff --git a/.gitignore b/.gitignore
@@ -59,6 +59,7 @@ skrub/datasets/data/*
 # Generated files for doc
 doc/_build
 doc/auto_examples
+doc/auto_tutorials
 doc/generated
 doc/generated_for_index
 doc/reference/generated
diff --git a/doc/Makefile b/doc/Makefile
@@ -53,5 +53,5 @@ linkcheck-noplot:
 	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
 
 clean:
-	rm -rf _build/ auto_examples/ generated/ generated_for_index/ reference/generated/
+	rm -rf _build/ auto_examples/ auto_tutorials/ generated/ generated_for_index/ reference/generated/
 	rm -f reference/*.rst
diff --git a/doc/conf.py b/doc/conf.py
@@ -474,8 +474,8 @@ def call_garbage_collector(gallery_conf, fname):
         # See https://sphinx-gallery.github.io/stable/configuration.html#link-to-documentation  # noqa
     },
     "filename_pattern": ".*",
-    "examples_dirs": "../examples",
-    "gallery_dirs": "auto_examples",
+    "examples_dirs": ["../examples", "tutorials"],
+    "gallery_dirs": ["auto_examples", "auto_tutorials"],
     "within_subsection_order": FileNameSortKey,
     "download_all_examples": False,
     "binder": {
diff --git a/doc/data_ops.rst b/doc/data_ops.rst
@@ -40,6 +40,7 @@ Data Ops basic concepts
 
    modules/data_ops/basics/what_are_data_ops
    modules/data_ops/basics/building_data_ops_plan
+   auto_tutorials/1110_data_ops_intro
    modules/data_ops/basics/using_previews
    modules/data_ops/basics/direct_access_methods
    modules/data_ops/basics/control_flow
diff --git a/doc/documentation.rst b/doc/documentation.rst
@@ -17,6 +17,7 @@ For class and function details, see the :ref:`API Reference <api_ref>`.
 .. toctree::
    :maxdepth: 3
 
+   auto_tutorials/0000_getting_started
    exploring_a_dataframe
    default_wrangling
    column_level_featurizing
diff --git a/doc/tutorials/0000_getting_started.py b/doc/tutorials/0000_getting_started.py
@@ -1,6 +1,6 @@
 """
-Getting Started
-===============
+Getting Started with skrub
+==========================
 
 This guide showcases some of the features of skrub.
 Much of skrub revolves around simplifying many of the tasks that are involved
@@ -54,11 +54,8 @@
 # %%
 # You can use the interactive display above to explore the dataset visually.
 #
-# It is also possible to tell skrub to replace the default pandas and polars
-# displays with |TableReport| by modifying the global config with
-# |set_config|.
-#
-# .. note::
+# .. admonition:: Additional examples
+#    :collapsible: closed
 #
 #    You can see a few more `example reports`_ online. We also
 #    provide an experimental online demo_ that allows you to select a CSV or
@@ -69,15 +66,15 @@
 #    .. _demo: https://skrub-data.org/skrub-reports/
 #
 # From the report above, we see that there are columns with date and time stored
-# as `object` dtype (cf. "Stats" tab of the report).
+# as ``object`` dtype (cf. "Stats" tab of the report).
 # Datatypes not being parsed correctly is a scenario that occurs commonly after
 # reading a table. We can use the |Cleaner| to address this.
 # In the next section, we show that this transformer does additional cleaning.
 
 # %%
 # Sanitizing data with the |Cleaner|
 # ----------------------------------
-# Here, we use the |Cleaner|, a transformer that sanitizing the
+# Here, we use the |Cleaner|, a transformer that sanitizes the
 # dataframe by parsing nulls and dates, and by dropping "uninformative" columns
 # (e.g., columns with too many nulls or that are constant).
 #
@@ -88,7 +85,7 @@
 TableReport(employees_df)
 
 # %%
-# We can see from the "Stats" tab that now the column `date_first_hired` has been
+# We can see from the "Stats" tab that now the column ``date_first_hired`` has been
 # parsed correctly as a Datetime.
 
 # %%
@@ -197,23 +194,6 @@
 # comparison between the different methods.
 #
 
-# %%
-# Assembling data
-# ---------------
-#
-# Skrub allows imperfect assembly of data, such as joining dataframes
-# on columns that contain typos. Skrub's joiners have ``fit`` and
-# ``transform`` methods, storing information about the data across calls.
-#
-# The |Joiner| allows fuzzy-joining multiple tables, where each row of
-# a main table will be augmented with values from the best match in the auxiliary table.
-# You can control how distant fuzzy-matches are allowed to be with the
-# ``max_dist`` parameter.
-#
-# Skrub also allows you to aggregate multiple tables according to various strategies.
-# You can see other ways to join multiple tables in
-# :ref:`user_guide_joining_dataframes`.
-
 # %%
 # Advanced use cases
 # ----------------------
diff --git a/doc/tutorials/1110_data_ops_intro.py b/doc/tutorials/1110_data_ops_intro.py
@@ -1,24 +1,7 @@
 """
-Introduction to wrangling pipelines for machine-learning skrub DataOps
+Tutorial: Using Data Ops to build a machine-learning pipeline
 =======================================================================
 
-This example shows data wrangling for machine learning using Skrub's
-:ref:`DataOps <user_guide_data_ops_index>`.
-
-The challenge of data-wrangling for machine learning is the need to
-apply the wrangling operations to new data, for prediction.
-
-Skrub's DataOps build pipelines that blend data wrangling and machine
-learning by recording all the operations involved in pre-processing data
-and training models. They result in an a full *learner* that starts from the
-raw data. We will also how show it can be saved, loaded back, and then used to make
-predictions on new, unseen data.
-
-This example is meant to be an introduction to Skrub DataOps, and as such it
-will not cover all the features. Further examples in the gallery
-:ref:`data_ops_examples_ref` go into more detail on Skrub DataOps
-for more complex tasks.
-
 .. currentmodule:: skrub
 
 .. |fetch_employee_salaries| replace:: :func:`datasets.fetch_employee_salaries`
@@ -27,6 +10,7 @@
 .. |skb.mark_as_X| replace:: :meth:`DataOp.skb.mark_as_X`
 .. |skb.mark_as_y| replace:: :meth:`DataOp.skb.mark_as_y`
 .. |TableVectorizer| replace:: :class:`TableVectorizer`
+.. |ToDatetime| replace:: :class:`ToDatetime`
 .. |skb.apply| replace:: :meth:`.skb.apply() <DataOp.skb.apply>`
 .. |HistGradientBoostingRegressor| replace::
    :class:`~sklearn.ensemble.HistGradientBoostingRegressor`
@@ -35,6 +19,41 @@
 .. |make_randomized_search| replace::
    :meth:`.skb.make_randomized_search <DataOp.skb.make_randomized_search>`
 
+This example shows how we can use skrub's
+:ref:`DataOps <user_guide_data_ops_index>` for building a machine learning pipeline.
+
+The challenge of preparing data for machine learning is the need to
+apply the same data preparation and wrangling operations to new data, for prediction.
+
+Skrub's DataOps build pipelines that blend data wrangling and machine
+learning by recording all the operations involved in pre-processing data
+and training models, as well as the state of the transformers and models used to
+make predictions.
+
+.. admonition:: What is a state?
+   :collapsible: closed
+
+   The state of a transformer or model refers to the internal parameters and
+   attributes that are learned or set during the fitting process. For example,
+   in a :class:`~sklearn.preprocessing.StandardScaler`, the state would include
+   the mean and standard deviation calculated from the training data.
+   In a pre-processing transformer like |ToDatetime|, the state would include the
+   inferred datetime format based on the data it was fitted on.
+   In a machine learning model like |HistGradientBoostingRegressor|, the state
+   would include the fitted parameters of the model after training on the data.
+
+The result of building a DataOps plan is a *learner*, an object with an interface
+similar to that of a scikit-learn estimator, but which contains all the steps in the
+data preparation and model training process, along with the state of all the
+transformers and models: this makes it possible to save the learner, load it
+back later, and use it to make predictions on new data.
+
+This example is meant to be an introduction to Skrub DataOps, and as such it
+will not cover all the features. Further examples in the gallery
+:ref:`data_ops_examples_ref` go into more detail on Skrub DataOps
+for more complex tasks.
+
+
 """
 
 # %%
diff --git a/doc/tutorials/GALLERY_HEADER.txt b/doc/tutorials/GALLERY_HEADER.txt
@@ -0,0 +1 @@
+examples
diff --git a/pyproject.toml b/pyproject.toml
@@ -311,6 +311,7 @@ ignore = [
 # It's fine not to put the import at the top of the file in the examples
 # folder.
 "examples/*" = ["E402"]
+"doc/tutorials/*" = ["E402"]
 "doc/conf.py" = ["E402"]
 # Long exception messages in docstrings
 "skrub/_clean_null_strings.py" = ["E501"]
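
The "What is a state?" admonition added in doc/tutorials/1110_data_ops_intro.py can be illustrated with a minimal sketch. It uses only scikit-learn's StandardScaler, which the admonition itself mentions, and is not skrub's DataOps API. The toy array and the variable names are made up for illustration, and pickle is assumed here as the save/load mechanism.

```python
import pickle

import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy training data (made up for illustration): a single numeric feature.
X_train = np.array([[1.0], [2.0], [3.0]])

# Fitting computes the statistics that make up the transformer's state.
scaler = StandardScaler().fit(X_train)
learned_mean = scaler.mean_    # per-feature mean of the training data
learned_scale = scaler.scale_  # per-feature standard deviation

# A save/load round trip keeps that state, so new data is scaled with the
# training statistics rather than with statistics of the new data.
restored = pickle.loads(pickle.dumps(scaler))
X_new = np.array([[4.0]])
X_new_scaled = restored.transform(X_new)
```

The same idea carries over to a learner as described in the tutorial text: the fitted state travels with the saved object, so the object loaded later applies exactly what was learned during training.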