Commit be1d1d9

Pushing the docs to dev/ for branch: main, commit b964ab004c7fda1a6466e48feafd8b0adcced931
1 parent 7ed7973 commit be1d1d9

184 files changed

Lines changed: 72033 additions & 33700 deletions


dev/_downloads/178044e2019750d5b43147013e440c43/10_expressions_intro.ipynb

Lines changed: 3 additions & 3 deletions
@@ -90,14 +90,14 @@
 },
 "outputs": [],
 "source": [
-"products = products[products[\"basket_ID\"].isin(baskets[\"ID\"])]\nproducts = products.assign(\n total_price=products[\"Nbr_of_prod_purchas\"] * products[\"cash_price\"]\n)\nproducts"
+"kept_products = products[products[\"basket_ID\"].isin(baskets[\"ID\"])]\nproducts_with_total = kept_products.assign(\n total_price=kept_products[\"Nbr_of_prod_purchas\"] * kept_products[\"cash_price\"]\n)\nproducts_with_total"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"We see previews of the output of intermediate results. For\nexample, the added ``\"total_price\"`` column is in the output above.\nThe \"Show graph\" dropdown at the top allows us to check the\nstructure of the pipeline and all the steps it contains.\n\nWith skrub, we do not need to specify a grid of hyperparameters separately\nfrom the pipeline. Instead, we replace a parameter's value with a skrub\n\"choice\" which indicates the range of values we consider during\nhyperparameter selection.\n\nSkrub choices can be nested arbitrarily. They are not restricted to\nparameters of a scikit-learn estimator, but can be anything: choosing\nbetween different estimators, arguments to function calls, whole sections of\nthe pipeline etc.\n\nIn-depth information about choices and hyperparameter/model selection is\nprovided in the `Tuning Pipelines example <example_tuning_pipelines>`.\n\nWe build a skrub ``TableVectorizer`` with different choices of:\nthe type of encoder for high-cardinality categorical or string columns, and\nthe number of components it uses.\n\n"
+"We see previews of the output of intermediate results. For\nexample, the added ``\"total_price\"`` column is in the output above.\nThe \"Show graph\" dropdown at the top allows us to check the\nstructure of the pipeline and all the steps it contains.\n\n<div class=\"alert alert-info\"><h4>Note</h4><p>We recommend to assign each new skrub expression to a new variable name,\n as is done above. For example ``kept_products = products[...]`` instead of\n reusing the name ``products = products[...]``. This makes it easy to\n backtrack to any step of the pipeline and change the subsequent steps, and\n can avoid ending up in a confusing state in jupyter notebooks when the\n same cell might be re-executed several times.</p></div>\n\nWith skrub, we do not need to specify a grid of hyperparameters separately\nfrom the pipeline. Instead, we replace a parameter's value with a skrub\n\"choice\" which indicates the range of values we consider during\nhyperparameter selection.\n\nSkrub choices can be nested arbitrarily. They are not restricted to\nparameters of a scikit-learn estimator, but can be anything: choosing\nbetween different estimators, arguments to function calls, whole sections of\nthe pipeline etc.\n\nIn-depth information about choices and hyperparameter/model selection is\nprovided in the `Tuning Pipelines example <example_tuning_pipelines>`.\n\nWe build a skrub ``TableVectorizer`` with different choices of:\nthe type of encoder for high-cardinality categorical or string columns, and\nthe number of components it uses.\n\n"
 ]
 },
 {
@@ -126,7 +126,7 @@
 },
 "outputs": [],
 "source": [
-"vectorized_products = products.skb.apply(vectorizer, exclude_cols=\"basket_ID\")"
+"vectorized_products = products_with_total.skb.apply(\n vectorizer, exclude_cols=\"basket_ID\"\n)"
 ]
 },
 {
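
Reviewer note on the renaming in this file: the added note recommends one new variable name per step. The sketch below is a toy illustration in plain Python of why reusing a name can leave a notebook in a confusing state when a cell is re-executed. The `Expr` class and the `run_cell_reusing_name` helper are hypothetical stand-ins invented for this note; they are not skrub's API and only mimic the idea that each operation returns a new expression.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Expr:
    """Hypothetical stand-in for a pipeline expression (not skrub's API)."""

    steps: tuple = ()

    def apply(self, step):
        # Return a new expression; the original is left untouched.
        return Expr(self.steps + (step,))


def run_cell_reusing_name(products):
    # Mimics a notebook cell containing: products = products.apply("filter")
    return products.apply("filter")


# Reusing the name: executing the same cell twice stacks the step twice.
products = Expr()
products = run_cell_reusing_name(products)
products = run_cell_reusing_name(products)
reused_steps = products.steps

# Distinct names: re-executing the cell rebuilds the same single-step result.
source = Expr()
kept_products = source.apply("filter")
kept_products = source.apply("filter")
fresh_steps = kept_products.steps
```

With distinct names, the input of each cell never changes, so running a cell again cannot silently add a second copy of the step.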

dev/_downloads/3b6391bc7f056e067dfac38156a1a3ee/13_choices.py

Lines changed: 20 additions & 19 deletions
@@ -19,7 +19,7 @@
 the range of possible values, by inserting it directly in place of the actual
 value. For example we can write:
 
-``RidgeClassifier(alpha=skrub.choose_from([0.1, 1.0, 10.0], name='α'))``
+``RidgeClassifier(alpha=skrub.choose_from([0.1, 1.0, 10.0], name='α'))``
 
 instead of:
 
@@ -101,18 +101,17 @@
 #
 # Note that ``skrub.choose_float()`` and ``skrub.choose_int()`` can be given a
 # ``log`` argument to sample in log scale.
+
 # %%
 X, y = skrub.X(texts), skrub.y(labels)
 
 encoder = skrub.MinHashEncoder(
     n_components=skrub.choose_int(5, 50, log=True, name="N components")
 )
-X = X.skb.apply(encoder)
-
 classifier = HistGradientBoostingClassifier(
     learning_rate=skrub.choose_float(0.01, 0.9, log=True, name="lr")
 )
-pred = X.skb.apply(classifier, y=y)
+pred = X.skb.apply(encoder).skb.apply(classifier, y=y)
 
 # %%
 # We can then obtain an estimator that performs the hyperparameter search with
@@ -161,13 +160,12 @@
 # %%
 X, y = skrub.X(texts), skrub.y(labels)
 
-X = X.assign(
+X.assign(
     length=skrub.choose_from(
         {"words": X["text"].str.count(r"\b\w+\b"), "chars": X["text"].str.len()},
         name="length",
     )
 )
-X
 
 # %%
 # ``choose_from`` can be given a dictionary if we want to provide
@@ -190,7 +188,7 @@
     },
     name="encoder",
 )
-X = X.skb.apply(encoder, cols="text")
+X.skb.apply(encoder, cols="text")
 
 # %%
 # In a similar vein, we might want to choose between a HGB classifier and a Ridge
@@ -206,7 +204,7 @@
 )
 ridge = RidgeClassifier(alpha=skrub.choose_float(0.01, 100, log=True, name="α"))
 classifier = skrub.choose_from({"hgb": hgb, "ridge": ridge}, name="classifier")
-pred = X.skb.apply(classifier, y=y)
+pred = X.skb.apply(encoder).skb.apply(classifier, y=y)
 print(pred.skb.describe_param_grid())
 
 # %%
@@ -245,17 +243,17 @@
 
 X, y = skrub.X(texts), skrub.y(labels)
 
-X = X.skb.apply(skrub.MinHashEncoder())
+vectorized_X = X.skb.apply(skrub.MinHashEncoder())
 
 estimator_kind = skrub.choose_from(["ridge", "HGB"], name="estimator kind")
 
 scaling = estimator_kind.match({"ridge": StandardScaler(), "HGB": "passthrough"})
-X = X.skb.apply(scaling)
+scaled_X = vectorized_X.skb.apply(scaling)
 
 classifier = estimator_kind.match(
     {"ridge": RidgeClassifier(), "HGB": HistGradientBoostingClassifier()}
 )
-pred = X.skb.apply(classifier, y=y)
+pred = scaled_X.skb.apply(classifier, y=y)
 print(pred.skb.describe_param_grid())
 
 # %%
@@ -276,20 +274,21 @@
 # of the text as a feature or not. Then, ``if_else()`` will assign the length
 # of the text to a new column ``length`` if the choice is ``True``, or do nothing
 # if the choice is ``False``.
+
 # %%
 X, y = skrub.X(texts), skrub.y(labels)
 
 add_length = skrub.choose_bool(name="add_length")
-X = add_length.if_else(X.assign(length=X["text"].str.len()), X).as_expr()
-X = X.skb.apply(skrub.MinHashEncoder(n_components=2), cols="text")
+with_length = add_length.if_else(X.assign(length=X["text"].str.len()), X).as_expr()
+vectorized_X = with_length.skb.apply(skrub.MinHashEncoder(n_components=2), cols="text")
 
 # Note: we can manually set the outcome of a choice when evaluating an
 # expression (or fitting an estimator)
 
-X.skb.eval({"add_length": False})
+vectorized_X.skb.eval({"add_length": False})
 
 # %%
-X.skb.eval({"add_length": True})
+vectorized_X.skb.eval({"add_length": True})
 
 # %%
 # Arbitrary logic depending on a choice
@@ -299,6 +298,7 @@
 # eager logic based on a choice we can resort to using ``skrub.deferred``. For
 # example the choice of adding the text length or not could also have been
 # written as:
+
 # %%
 X, y = skrub.X(texts), skrub.y(labels)
 
@@ -310,13 +310,14 @@ def extract_features(df, add_length):
     return df
 
 
-X = extract_features(X, skrub.choose_bool(name="add_length"))
-X = X.skb.apply(skrub.MinHashEncoder(n_components=2), cols="text")
+feat = extract_features(X, skrub.choose_bool(name="add_length")).skb.apply(
+    skrub.MinHashEncoder(n_components=2), cols="text"
+)
 
-X.skb.eval({"add_length": False})
+feat.skb.eval({"add_length": False})
 
 # %%
-X.skb.eval({"add_length": True})
+feat.skb.eval({"add_length": True})
 
 # %%
 # Concluding, we have seen how to use skrub's ``choose_from`` objects to tune
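
Reviewer note on the ``log`` argument mentioned in this file: the comment says ``skrub.choose_float()`` and ``skrub.choose_int()`` can sample in log scale. The sketch below shows one common meaning of that phrase, log-uniform sampling, using only the standard library. It is an assumption about the general technique, not skrub's implementation, and `sample_log_uniform` is a hypothetical helper named for this note. The bounds 0.01 and 0.9 are the ones used for the "lr" choice in the diff.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw a float whose logarithm is uniform on [log(low), log(high))."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)
low, high = 0.01, 0.9
samples = [sample_log_uniform(low, high, rng) for _ in range(1000)]

# The geometric mean splits a log-uniform range in half, so roughly
# half of the draws should fall below it.
geometric_mean = math.sqrt(low * high)
fraction_below = sum(s < geometric_mean for s in samples) / len(samples)
```

Sampling this way spends as many draws between 0.01 and 0.1 as between 0.1 and 0.9 (approximately), which is usually what is wanted for parameters such as a learning rate that span orders of magnitude.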

dev/_downloads/3c2e8206010663849b8c54b8fea90bcb/11_expressions_explained.ipynb

Lines changed: 1 addition & 1 deletion
@@ -281,7 +281,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"It is important to note that the original ``orders`` pipeline is not modified\nby the previous cells. Instead, in each cell a new expression is created that\nrepresents the result of the operation.\n\nThis is similar to how pandas and polars\nDataFrames work: when we call a method on a DataFrame, it returns a new\nDataFrame that represents the result of the operation, rather than modifying\nthe original DataFrame in place. However, while in pandas it is possible to\nwork on a DataFrame in place, skrub does not allow this.\n\nWe can check this by looking at the graph of ``orders`` (which remains unmodified),\nand ``new_orders`` (which instead contains the new steps):\n\n"
+"It is important to note that the original ``orders`` pipeline is not modified\nby the previous cells. Instead, in each cell a new expression is created that\nrepresents the result of the operation.\n\nThis is similar to how pandas and polars\nDataFrames work: when we call a method on a DataFrame, it returns a new\nDataFrame that represents the result of the operation, rather than modifying\nthe original DataFrame in place. However, while in pandas it is possible to\nwork on a DataFrame in place, skrub does not allow this.\n\nWe can check this by looking at the graph of ``orders`` (which remains unmodified),\nand ``new_orders`` (which instead contains the new steps):\n%%\n\n"
 ]
 },
 {
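
Reviewer note on the markdown cell in this hunk: it states that ``orders`` is not modified and that each operation creates a new expression. The sketch below models that behaviour with a toy immutable graph node in plain Python. The `Node` class and its `then` and `graph` methods are hypothetical names chosen for this note and do not correspond to skrub's API; the variable names `orders` and `new_orders` are borrowed from the cell text.

```python
from dataclasses import FrozenInstanceError, dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    """Hypothetical immutable node in an expression graph (not skrub's API)."""

    op: str
    parent: Optional["Node"] = None

    def then(self, op):
        # Build a new node that points back to this one.
        return Node(op, parent=self)

    def graph(self):
        # Walk back to the root and list the operations in order.
        ops = []
        node = self
        while node is not None:
            ops.append(node.op)
            node = node.parent
        return list(reversed(ops))


orders = Node("orders")
new_orders = orders.then("assign total")

# In-place modification is rejected because the dataclass is frozen.
try:
    orders.op = "changed"
    modified_in_place = True
except FrozenInstanceError:
    modified_in_place = False
```

Here `orders.graph()` still lists only the original step, while `new_orders.graph()` lists both, which mirrors the comparison of the two graphs described in the cell.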
