skrub-data
diff --git a/dev/_downloads/0da1186528765de7400f245b4d95af54/10_expressions_intro.zip b/dev/_downloads/0da1186528765de7400f245b4d95af54/10_expressions_intro.zip
1.04 KB
diff --git a/dev/_downloads/178044e2019750d5b43147013e440c43/10_expressions_intro.ipynb b/dev/_downloads/178044e2019750d5b43147013e440c43/10_expressions_intro.ipynb
Lines changed: 3 additions & 3 deletions
diff --git a/dev/_downloads/1beb4a6828e42ba7f09ae925e26e794b/02_text_with_string_encoders.zip b/dev/_downloads/1beb4a6828e42ba7f09ae925e26e794b/02_text_with_string_encoders.zip
0 Bytes
diff --git a/dev/_downloads/1edecd3a067d076f9dde43db0a6ad2ac/00_getting_started.zip b/dev/_downloads/1edecd3a067d076f9dde43db0a6ad2ac/00_getting_started.zip
0 Bytes
diff --git a/dev/_downloads/28079b3b8fa6a36780f883fc70c5a85b/01_encodings.zip b/dev/_downloads/28079b3b8fa6a36780f883fc70c5a85b/01_encodings.zip
0 Bytes
diff --git a/dev/_downloads/32f231520244e2e839b5494dd04fcd09/12_scikit_learn_estimators_in_expressions.zip b/dev/_downloads/32f231520244e2e839b5494dd04fcd09/12_scikit_learn_estimators_in_expressions.zip
10 Bytes
diff --git a/dev/_downloads/3b6391bc7f056e067dfac38156a1a3ee/13_choices.py b/dev/_downloads/3b6391bc7f056e067dfac38156a1a3ee/13_choices.py
Lines changed: 20 additions & 19 deletions
diff --git a/dev/_downloads/3c2e8206010663849b8c54b8fea90bcb/11_expressions_explained.ipynb b/dev/_downloads/3c2e8206010663849b8c54b8fea90bcb/11_expressions_explained.ipynb
Lines changed: 1 addition & 1 deletion
@@ -90,14 +90,14 @@
       },
       "outputs": [],
       "source": [
-        "products = products[products[\"basket_ID\"].isin(baskets[\"ID\"])]\nproducts = products.assign(\n    total_price=products[\"Nbr_of_prod_purchas\"] * products[\"cash_price\"]\n)\nproducts"
+        "kept_products = products[products[\"basket_ID\"].isin(baskets[\"ID\"])]\nproducts_with_total = kept_products.assign(\n    total_price=kept_products[\"Nbr_of_prod_purchas\"] * kept_products[\"cash_price\"]\n)\nproducts_with_total"
       ]
     },
     {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
-        "We see previews of the output of intermediate results. For\nexample, the added ``\"total_price\"`` column is in the output above.\nThe \"Show graph\" dropdown at the top allows us to check the\nstructure of the pipeline and all the steps it contains.\n\nWith skrub, we do not need to specify a grid of hyperparameters separately\nfrom the pipeline. Instead, we replace a parameter's value with a skrub\n\"choice\" which indicates the range of values we consider during\nhyperparameter selection.\n\nSkrub choices can be nested arbitrarily. They are not restricted to\nparameters of a scikit-learn estimator, but can be anything: choosing\nbetween different estimators, arguments to function calls, whole sections of\nthe pipeline etc.\n\nIn-depth information about choices and hyperparameter/model selection is\nprovided in the `Tuning Pipelines example <example_tuning_pipelines>`.\n\nWe build a skrub ``TableVectorizer`` with different choices of:\nthe type of encoder for high-cardinality categorical or string columns, and\nthe number of components it uses.\n\n"
+        "We see previews of the output of intermediate results. For\nexample, the added ``\"total_price\"`` column is in the output above.\nThe \"Show graph\" dropdown at the top allows us to check the\nstructure of the pipeline and all the steps it contains.\n\n<div class=\"alert alert-info\"><h4>Note</h4><p>We recommend to assign each new skrub expression to a new variable name,\n   as is done above. For example ``kept_products = products[...]`` instead of\n   reusing the name ``products = products[...]``. This makes it easy to\n   backtrack to any step of the pipeline and change the subsequent steps, and\n   can avoid ending up in a confusing state in jupyter notebooks when the\n   same cell might be re-executed several times.</p></div>\n\nWith skrub, we do not need to specify a grid of hyperparameters separately\nfrom the pipeline. Instead, we replace a parameter's value with a skrub\n\"choice\" which indicates the range of values we consider during\nhyperparameter selection.\n\nSkrub choices can be nested arbitrarily. They are not restricted to\nparameters of a scikit-learn estimator, but can be anything: choosing\nbetween different estimators, arguments to function calls, whole sections of\nthe pipeline etc.\n\nIn-depth information about choices and hyperparameter/model selection is\nprovided in the `Tuning Pipelines example <example_tuning_pipelines>`.\n\nWe build a skrub ``TableVectorizer`` with different choices of:\nthe type of encoder for high-cardinality categorical or string columns, and\nthe number of components it uses.\n\n"
       ]
     },
     {
@@ -126,7 +126,7 @@
       },
       "outputs": [],
       "source": [
-        "vectorized_products = products.skb.apply(vectorizer, exclude_cols=\"basket_ID\")"
+        "vectorized_products = products_with_total.skb.apply(\n    vectorizer, exclude_cols=\"basket_ID\"\n)"
       ]
     },
     {
 
@@ -19,7 +19,7 @@
 the range of possible values, by inserting it directly in place of the actual
 value. For example we can write:
 
- ``RidgeClassifier(alpha=skrub.choose_from([0.1, 1.0, 10.0], name='α'))``
+``RidgeClassifier(alpha=skrub.choose_from([0.1, 1.0, 10.0], name='α'))``
 
 instead of:
 
@@ -101,18 +101,17 @@
 #
 # Note that ``skrub.choose_float()`` and ``skrub.choose_int()`` can be given a
 # ``log`` argument to sample in log scale.
+
 # %%
 X, y = skrub.X(texts), skrub.y(labels)
 
 encoder = skrub.MinHashEncoder(
     n_components=skrub.choose_int(5, 50, log=True, name="N components")
 )
-X = X.skb.apply(encoder)
-
 classifier = HistGradientBoostingClassifier(
     learning_rate=skrub.choose_float(0.01, 0.9, log=True, name="lr")
 )
-pred = X.skb.apply(classifier, y=y)
+pred = X.skb.apply(encoder).skb.apply(classifier, y=y)
 
 # %%
 # We can then obtain an estimator that performs the hyperparameter search with
@@ -161,13 +160,12 @@
 # %%
 X, y = skrub.X(texts), skrub.y(labels)
 
-X = X.assign(
+X.assign(
     length=skrub.choose_from(
         {"words": X["text"].str.count(r"\b\w+\b"), "chars": X["text"].str.len()},
         name="length",
     )
 )
-X
 
 # %%
 # ``choose_from`` can be given a dictionary if we want to provide
@@ -190,7 +188,7 @@
     },
     name="encoder",
 )
-X = X.skb.apply(encoder, cols="text")
+X.skb.apply(encoder, cols="text")
 
 # %%
 # In a similar vein, we might want to choose between a HGB classifier and a Ridge
@@ -206,7 +204,7 @@
 )
 ridge = RidgeClassifier(alpha=skrub.choose_float(0.01, 100, log=True, name="α"))
 classifier = skrub.choose_from({"hgb": hgb, "ridge": ridge}, name="classifier")
-pred = X.skb.apply(classifier, y=y)
+pred = X.skb.apply(encoder).skb.apply(classifier, y=y)
 print(pred.skb.describe_param_grid())
 
 # %%
@@ -245,17 +243,17 @@
 
 X, y = skrub.X(texts), skrub.y(labels)
 
-X = X.skb.apply(skrub.MinHashEncoder())
+vectorized_X = X.skb.apply(skrub.MinHashEncoder())
 
 estimator_kind = skrub.choose_from(["ridge", "HGB"], name="estimator kind")
 
 scaling = estimator_kind.match({"ridge": StandardScaler(), "HGB": "passthrough"})
-X = X.skb.apply(scaling)
+scaled_X = vectorized_X.skb.apply(scaling)
 
 classifier = estimator_kind.match(
     {"ridge": RidgeClassifier(), "HGB": HistGradientBoostingClassifier()}
 )
-pred = X.skb.apply(classifier, y=y)
+pred = scaled_X.skb.apply(classifier, y=y)
 print(pred.skb.describe_param_grid())
 
 # %%
@@ -276,20 +274,21 @@
 # of the text as a feature or not. Then, ``if_else()`` will assign the length
 # of the text to a new column ``length`` if the choice is ``True``, or do nothing
 # if the choice is ``False``.
+
 # %%
 X, y = skrub.X(texts), skrub.y(labels)
 
 add_length = skrub.choose_bool(name="add_length")
-X = add_length.if_else(X.assign(length=X["text"].str.len()), X).as_expr()
-X = X.skb.apply(skrub.MinHashEncoder(n_components=2), cols="text")
+with_length = add_length.if_else(X.assign(length=X["text"].str.len()), X).as_expr()
+vectorized_X = with_length.skb.apply(skrub.MinHashEncoder(n_components=2), cols="text")
 
 # Note: we can manually set the outcome of a choice when evaluating an
 # expression (or fitting an estimator)
 
-X.skb.eval({"add_length": False})
+vectorized_X.skb.eval({"add_length": False})
 
 # %%
-X.skb.eval({"add_length": True})
+vectorized_X.skb.eval({"add_length": True})
 
 # %%
 # Arbitrary logic depending on a choice
@@ -299,6 +298,7 @@
 # eager logic based on a choice we can resort to using ``skrub.deferred``. For
 # example the choice of adding the text length or not could also have been
 # written as:
+
 # %%
 X, y = skrub.X(texts), skrub.y(labels)
 
@@ -310,13 +310,14 @@ def extract_features(df, add_length):
     return df
 
 
-X = extract_features(X, skrub.choose_bool(name="add_length"))
-X = X.skb.apply(skrub.MinHashEncoder(n_components=2), cols="text")
+feat = extract_features(X, skrub.choose_bool(name="add_length")).skb.apply(
+    skrub.MinHashEncoder(n_components=2), cols="text"
+)
 
-X.skb.eval({"add_length": False})
+feat.skb.eval({"add_length": False})
 
 # %%
-X.skb.eval({"add_length": True})
+feat.skb.eval({"add_length": True})
 
 # %%
 # Concluding, we have seen how to use skrub's ``choose_from`` objects to tune
 
@@ -281,7 +281,7 @@
       "cell_type": "markdown",
       "metadata": {},
       "source": [
-        "It is important to note that the original ``orders`` pipeline is not modified\nby the previous cells. Instead, in each cell a new expression is created that\nrepresents the result of the operation.\n\nThis is similar to how pandas and polars\nDataFrames work: when we call a method on a DataFrame, it returns a new\nDataFrame that represents the result of the operation, rather than modifying\nthe original DataFrame in place. However, while in pandas it is possible to\nwork on a DataFrame in place, skrub does not allow this.\n\nWe can check this by looking at the graph of ``orders`` (which remains unmodified),\nand ``new_orders`` (which instead contains the new steps):\n\n"
+        "It is important to note that the original ``orders`` pipeline is not modified\nby the previous cells. Instead, in each cell a new expression is created that\nrepresents the result of the operation.\n\nThis is similar to how pandas and polars\nDataFrames work: when we call a method on a DataFrame, it returns a new\nDataFrame that represents the result of the operation, rather than modifying\nthe original DataFrame in place. However, while in pandas it is possible to\nwork on a DataFrame in place, skrub does not allow this.\n\nWe can check this by looking at the graph of ``orders`` (which remains unmodified),\nand ``new_orders`` (which instead contains the new steps):\n%%\n\n"
       ]
     },
     {
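
Review note: the renames in this diff (``kept_products``, ``products_with_total``, ``vectorized_X``, ``scaled_X``) follow the recommendation to bind each new expression to a new name. The sketch below illustrates why that helps when a notebook cell is re-executed. It is plain Python, not skrub: the ``Expr`` class and its ``apply`` method are hypothetical stand-ins for an object where every operation returns a new expression recording one more step.

```python
from dataclasses import dataclass


# Hypothetical stand-in for an expression object (NOT skrub's actual class):
# each operation returns a new object recording the steps applied so far.
@dataclass(frozen=True)
class Expr:
    steps: tuple = ()

    def apply(self, step):
        # Return a new expression; the original is left untouched.
        return Expr(self.steps + (step,))


# Reusing one name: running the "cell" twice stacks the step twice.
products = Expr(("load",))
for _ in range(2):  # simulates re-executing the same notebook cell
    products = products.apply("filter")
reused_steps = products.steps

# Fresh name: re-running the cell always rebuilds the same expression.
products = Expr(("load",))
for _ in range(2):  # simulates re-executing the same notebook cell
    kept_products = products.apply("filter")
fresh_steps = kept_products.steps

print(reused_steps)  # ('load', 'filter', 'filter')
print(fresh_steps)   # ('load', 'filter')
```

With the reused name, the result depends on how many times the cell ran; with a fresh name, every earlier step stays reachable and re-running the cell has no cumulative effect.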