Commit 4c5d4ac

Adding sphinx-gallery examples to user guide (#1991)
1 parent ce7cfd2 commit 4c5d4ac

9 files changed

Lines changed: 52 additions & 48 deletions

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -59,6 +59,7 @@ skrub/datasets/data/*
 # Generated files for doc
 doc/_build
 doc/auto_examples
+doc/auto_tutorials
 doc/generated
 doc/generated_for_index
 doc/reference/generated

doc/Makefile

Lines changed: 1 addition & 1 deletion
@@ -53,5 +53,5 @@ linkcheck-noplot:
 @$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
 
 clean:
-rm -rf _build/ auto_examples/ generated/ generated_for_index/ reference/generated/
+rm -rf _build/ auto_examples/ auto_tutorials/ generated/ generated_for_index/ reference/generated/
 rm -f reference/*.rst

doc/conf.py

Lines changed: 2 additions & 2 deletions
@@ -474,8 +474,8 @@ def call_garbage_collector(gallery_conf, fname):
 # See https://sphinx-gallery.github.io/stable/configuration.html#link-to-documentation # noqa
 },
 "filename_pattern": ".*",
-"examples_dirs": "../examples",
-"gallery_dirs": "auto_examples",
+"examples_dirs": ["../examples", "tutorials"],
+"gallery_dirs": ["auto_examples", "auto_tutorials"],
 "within_subsection_order": FileNameSortKey,
 "download_all_examples": False,
 "binder": {
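
The `doc/conf.py` change turns `examples_dirs` and `gallery_dirs` from single strings into lists. Sphinx-Gallery pairs the two lists by position, so each source directory is rendered into the output directory at the same index. The sketch below is only an illustration of that pairing, not the project's actual configuration: it reproduces just the two keys touched by this commit, and the `gallery_pairs` name is introduced here for demonstration.

```python
# Minimal sketch of the two keys changed in doc/conf.py; every other
# key of the real sphinx_gallery_conf is omitted here.
sphinx_gallery_conf = {
    "examples_dirs": ["../examples", "tutorials"],
    "gallery_dirs": ["auto_examples", "auto_tutorials"],
}

examples_dirs = sphinx_gallery_conf["examples_dirs"]
gallery_dirs = sphinx_gallery_conf["gallery_dirs"]

# The lists are matched by position, so they must have the same length.
assert len(examples_dirs) == len(gallery_dirs)

# gallery_pairs is a hypothetical helper name, used only for this illustration.
gallery_pairs = list(zip(examples_dirs, gallery_dirs))

for source_dir, output_dir in gallery_pairs:
    print(f"{source_dir} -> {output_dir}")
```

The new `auto_tutorials` output directory is generated at build time, which is presumably why the same commit adds it to `.gitignore` and to the `clean` target of `doc/Makefile`.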

doc/data_ops.rst

Lines changed: 1 addition & 0 deletions
@@ -40,6 +40,7 @@ Data Ops basic concepts
 
 modules/data_ops/basics/what_are_data_ops
 modules/data_ops/basics/building_data_ops_plan
+auto_tutorials/1110_data_ops_intro
 modules/data_ops/basics/using_previews
 modules/data_ops/basics/direct_access_methods
 modules/data_ops/basics/control_flow

doc/documentation.rst

Lines changed: 1 addition & 0 deletions
@@ -17,6 +17,7 @@ For class and function details, see the :ref:`API Reference <api_ref>`.
 .. toctree::
 :maxdepth: 3
 
+auto_tutorials/0000_getting_started
 exploring_a_dataframe
 default_wrangling
 column_level_featurizing
Lines changed: 7 additions & 27 deletions
@@ -1,6 +1,6 @@
 """
-Getting Started
-===============
+Getting Started with skrub
+==========================
 
 This guide showcases some of the features of skrub.
 Much of skrub revolves around simplifying many of the tasks that are involved
@@ -54,11 +54,8 @@
 # %%
 # You can use the interactive display above to explore the dataset visually.
 #
-# It is also possible to tell skrub to replace the default pandas and polars
-# displays with |TableReport| by modifying the global config with
-# |set_config|.
-#
-# .. note::
+# .. admonition:: Additional examples
+# :collapsible: closed
 #
 # You can see a few more `example reports`_ online. We also
 # provide an experimental online demo_ that allows you to select a CSV or
@@ -69,15 +66,15 @@
 # .. _demo: https://skrub-data.org/skrub-reports/
 #
 # From the report above, we see that there are columns with date and time stored
-# as `object` dtype (cf. "Stats" tab of the report).
+# as ``object`` dtype (cf. "Stats" tab of the report).
 # Datatypes not being parsed correctly is a scenario that occurs commonly after
 # reading a table. We can use the |Cleaner| to address this.
 # In the next section, we show that this transformer does additional cleaning.
 
 # %%
 # Sanitizing data with the |Cleaner|
 # ----------------------------------
-# Here, we use the |Cleaner|, a transformer that sanitizing the
+# Here, we use the |Cleaner|, a transformer that sanitizes the
 # dataframe by parsing nulls and dates, and by dropping "uninformative" columns
 # (e.g., columns with too many nulls or that are constant).
 #
@@ -88,7 +85,7 @@
 TableReport(employees_df)
 
 # %%
-# We can see from the "Stats" tab that now the column `date_first_hired` has been
+# We can see from the "Stats" tab that now the column ``date_first_hired`` has been
 # parsed correctly as a Datetime.
 
 # %%
@@ -197,23 +194,6 @@
 # comparison between the different methods.
 #
 
-# %%
-# Assembling data
-# ---------------
-#
-# Skrub allows imperfect assembly of data, such as joining dataframes
-# on columns that contain typos. Skrub's joiners have ``fit`` and
-# ``transform`` methods, storing information about the data across calls.
-#
-# The |Joiner| allows fuzzy-joining multiple tables, where each row of
-# a main table will be augmented with values from the best match in the auxiliary table.
-# You can control how distant fuzzy-matches are allowed to be with the
-# ``max_dist`` parameter.
-#
-# Skrub also allows you to aggregate multiple tables according to various strategies.
-# You can see other ways to join multiple tables in
-# :ref:`user_guide_joining_dataframes`.
-
 # %%
 # Advanced use cases
 # ----------------------
Lines changed: 37 additions & 18 deletions
@@ -1,24 +1,7 @@
 """
-Introduction to wrangling pipelines for machine-learning skrub DataOps
+Tutorial: Using Data Ops to build a machine-learning pipeline
 =======================================================================
 
-This example shows data wrangling for machine learning using Skrub's
-:ref:`DataOps <user_guide_data_ops_index>`.
-
-The challenge of data-wrangling for machine learning is the need to
-apply the wrangling operations to new data, for prediction.
-
-Skrub's DataOps build pipelines that blend data wrangling and machine
-learning by recording all the operations involved in pre-processing data
-and training models. They result in an a full *learner* that starts from the
-raw data. We will also how show it can be saved, loaded back, and then used to make
-predictions on new, unseen data.
-
-This example is meant to be an introduction to Skrub DataOps, and as such it
-will not cover all the features. Further examples in the gallery
-:ref:`data_ops_examples_ref` go into more detail on Skrub DataOps
-for more complex tasks.
-
 .. currentmodule:: skrub
 
 .. |fetch_employee_salaries| replace:: :func:`datasets.fetch_employee_salaries`
@@ -27,6 +10,7 @@
 .. |skb.mark_as_X| replace:: :meth:`DataOp.skb.mark_as_X`
 .. |skb.mark_as_y| replace:: :meth:`DataOp.skb.mark_as_y`
 .. |TableVectorizer| replace:: :class:`TableVectorizer`
+.. |ToDatetime| replace:: :class:`ToDatetime`
 .. |skb.apply| replace:: :meth:`.skb.apply() <DataOp.skb.apply>`
 .. |HistGradientBoostingRegressor| replace::
 :class:`~sklearn.ensemble.HistGradientBoostingRegressor`
@@ -35,6 +19,41 @@
 .. |make_randomized_search| replace::
 :meth:`.skb.make_randomized_search <DataOp.skb.make_randomized_search>`
 
+This example shows how we can use skrub's
+:ref:`DataOps <user_guide_data_ops_index>` for building a machine learning pipeline.
+
+The challenge of preparing data for machine learning is the need to
+apply the same data preparation and wrangling operations to new data, for prediction.
+
+Skrub's DataOps build pipelines that blend data wrangling and machine
+learning by recording all the operations involved in pre-processing data
+and training models, as well as the state of the transformers and models used to
+make predictions.
+
+.. admonition:: What is a state?
+:collapsible: closed
+
+The state of a transformer or model refers to the internal parameters and
+attributes that are learned or set during the fitting process. For example,
+in a :class:`~sklearn.preprocessing.StandardScaler`, the state would include
+the mean and standard deviation calculated from the training data.
+In a pre-processing transformer like |ToDatetime|, the state would include the
+inferred datetime format based on the data it was fitted on.
+In a machine learning model like |HistGradientBoostingRegressor|, the state
+would include the fitted parameters of the model after training on the data.
+
+The result of building a DataOps plan is a *learner*, an object with an interface
+similar to that of a scikit-learn estimator, but which contains all the steps in the
+data preparation and model training process, along with the state of all the
+transformers and models: this allows the learner to be saved, loaded back later,
+and used to make predictions on new data.
+
+This example is meant to be an introduction to Skrub DataOps, and as such it
+will not cover all the features. Further examples in the gallery
+:ref:`data_ops_examples_ref` go into more detail on Skrub DataOps
+for more complex tasks.
+
+
 """
 
 # %%
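
The "What is a state?" admonition added to the tutorial docstring describes state as whatever a transformer learns during fitting and later reuses on new data. The sketch below illustrates that idea with only the standard library. `TinyScaler` is a hypothetical stand-in written for this note; it is not a skrub or scikit-learn class, and it does not show how a skrub learner is actually saved. It only mirrors the mean and standard deviation example from the admonition.

```python
import json
import statistics


class TinyScaler:
    """Hypothetical stand-in for a scaler; not part of skrub or scikit-learn."""

    def fit(self, values):
        # The "state": statistics learned from the training data.
        self.mean_ = statistics.fmean(values)
        self.scale_ = statistics.pstdev(values)
        return self

    def transform(self, values):
        # New data is transformed with the stored state, never re-fitted.
        return [(v - self.mean_) / self.scale_ for v in values]


# Fit on training data, then save the learned state.
scaler = TinyScaler().fit([2.0, 4.0, 6.0])
saved_state = json.dumps(vars(scaler))

# Later: rebuild an unfitted object and load the saved state into it.
restored = TinyScaler()
vars(restored).update(json.loads(saved_state))

# The restored object transforms new data exactly like the fitted one.
print(restored.transform([4.0, 6.0]))
```

Because the state travels with the object, the restored scaler applies the training-time mean and scale to unseen values, which is the property the docstring attributes to a saved and reloaded learner.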

doc/tutorials/GALLERY_HEADER.txt

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+examples

pyproject.toml

Lines changed: 1 addition & 0 deletions
@@ -311,6 +311,7 @@ ignore = [
 # It's fine not to put the import at the top of the file in the examples
 # folder.
 "examples/*" = ["E402"]
+"doc/tutorials/*" = ["E402"]
 "doc/conf.py" = ["E402"]
 # Long exception messages in docstrings
 "skrub/_clean_null_strings.py" = ["E501"]
