Add n_text_features and type_text features to Khiops sklearn supervised estimators by popescu-v · Pull Request #462 · KhiopsML/khiops-python

popescu-v · 2025-08-25T17:17:16Z

closes #39

TODO Before Asking for a Review

Rebase your branch to the latest version of dev (or main for release PRs)
Make sure all CI workflows are green
When adding a public feature/fix: Update the Unreleased section of CHANGELOG.md (no date)
Self-Review: Review "Files Changed" tab and fix any problems you find
API Docs (only if there are changes in docstrings, rst files or samples):
- Check the docs build without warning: see the log of the API Docs workflow
- Check that your changes render well in HTML: download the API Docs artifact and open index.html
- If there are any problems it is faster to iterate by building locally the API Docs

tramora

A few personal comments, mostly for my own understanding; probably not blocking at all.

tramora · 2025-09-01T14:06:00Z

doc/samples/samples_sklearn.rst

+    y_train = data_train_df["negativereason"]
+    y_test = data_test_df["negativereason"]
+
+    # Set Pandas StringDType on the "text" column


From the user's point of view, requiring this is a burden. If they forget to do it, does it fail?

Yes, we need to document it on the main website as well (see issue #39).

From the user's point of view, requiring this is a burden. If they forget to do it, does it fail?

If the user forgets to force this conversion, then the column would either:

- be of a different type (e.g. object), in which case it will be assigned the "Categorical" Khiops type;
- or just happen to be of the String dtype, in which case the heuristic would apply.

Hence, there is no failure, but this behavior needs to be thoroughly documented.
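As a rough illustration of the conversion discussed above, the sketch below sets the pandas string dtype on a free-text column and measures the longest entry. The toy dataframe and the variable names are assumptions made for this example and are not taken from the sample in the PR; only the `astype("string")` conversion reflects what the sample comment describes.

```python
import pandas as pd

# Toy data, invented for illustration only
df = pd.DataFrame(
    {
        "airline": ["A", "B"],
        "text": ["short tweet", "x" * 150],
    }
)

# Explicitly request the pandas string dtype for the free-text column.
# Without this step the column dtype depends on the pandas defaults
# (commonly `object` in pandas 2.x).
df["text"] = df["text"].astype("string")

# Check the resulting dtype and the longest entry of the column
is_string = isinstance(df["text"].dtype, pd.StringDtype)
max_length = int(df["text"].str.len().max())
print(is_string, max_length)
```

With the heuristic described in this thread, such a column would be a candidate for the "Text" Khiops type because its longest entry exceeds 100 characters.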

tramora · 2025-09-01T14:20:42Z

khiops/sklearn/dataset.py

+
+    .. note::
+        The "Text" Khiops type is inferred if the Numpy type is "string"
+        and the maximum length of the entries of that type is greater than 100.


Maybe worth defining a constant in the file (and referring to it here), because the value of 100 is arbitrary and could change in the future.

AFAIU, this value is identical to the Khiops Core value which serves the same purpose (as explained in the commit message).

tramora · 2025-09-01T14:20:56Z

khiops/sklearn/dataset.py

+                raise TypeError(
+                    type_error_message("max_type_size", max_type_size, int, np.int64)
+                )
+            if max_type_size > 100:


Maybe worth defining a constant in the file, because the value of 100 is arbitrary and could change in the future.

See above. I'm not sure, as it is only used once.
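For readers following this thread, here is a minimal sketch of what the named-constant variant of the heuristic could look like. The constant name `TEXT_MAX_LENGTH_THRESHOLD` and the function `infer_khiops_type` are hypothetical and do not appear in the PR, which hard-codes 100. The sketch also assumes that a string column at or below the threshold falls back to "Categorical", and it only accepts a plain Python `int`, whereas the PR code also accepts `np.int64`.

```python
# Hypothetical constant name; the PR itself hard-codes the value 100
TEXT_MAX_LENGTH_THRESHOLD = 100


def infer_khiops_type(is_string_dtype, max_type_size):
    """Return "Text" or "Categorical" for a string-like column (sketch)."""
    # bool is a subclass of int in Python, so it is rejected explicitly
    if isinstance(max_type_size, bool) or not isinstance(max_type_size, int):
        raise TypeError(
            f"'max_type_size' must be int, not {type(max_type_size).__name__}"
        )
    # "Text" needs both a string dtype and a longest entry above the threshold
    if is_string_dtype and max_type_size > TEXT_MAX_LENGTH_THRESHOLD:
        return "Text"
    # Assumption: every other string-like column is treated as Categorical
    return "Categorical"


print(infer_khiops_type(True, 150))
print(infer_khiops_type(True, 100))
print(infer_khiops_type(False, 150))
```

The comparison is strict, so a column whose longest entry is exactly 100 characters is not classified as "Text" in this sketch.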

tramora · 2025-09-01T14:27:08Z

khiops/sklearn/estimators.py

        self,
        n_features=100,
        n_trees=10,
+        n_text_features=10000,


Further in this PR, it is implied that an overhead can occur with this new parameter (unless n_text_features=0 is set). Should we warn users in the docstring about this performance change, and tell them to set the value to 0 if they do not handle any text column?

Good point. Indeed, each single column containing such text data would be internally converted by Khiops into a set of potentially numerous text variables. But I'm not sure what performance penalty this entails (if any). Setting n_text_features=0 just tells Khiops to proceed in the "standard" way, where each "stringy" column is Categorical.

khiops/sklearn/dataset.py

khiops/samples/samples_sklearn.py

CHANGELOG.md

folmos-at-orange

I'd merge the commit for the feature with that of the sample, otherwise LGTM.

popescu-v · 2025-09-01T15:43:12Z

I'd merge the commit for the feature with that of the sample, otherwise LGTM.

I'd keep it separate, as the feature is implemented in two commits (that I'd keep as such for clarity).

popescu-v linked an issue Aug 25, 2025 that may be closed by this pull request

Support text types in sklearn predictors #39

Closed

popescu-v requested review from folmos-at-orange and tramora September 1, 2025 13:32

popescu-v marked this pull request as ready for review September 1, 2025 13:32

popescu-v force-pushed the 39-support-text-types-in-sklearn-predictors branch 2 times, most recently from aa47498 to b221625 Compare September 1, 2025 14:10

tramora reviewed Sep 1, 2025

folmos-at-orange requested changes Sep 1, 2025

popescu-v force-pushed the 39-support-text-types-in-sklearn-predictors branch from b221625 to 8e70138 Compare September 1, 2025 15:14

folmos-at-orange self-requested a review September 1, 2025 15:22

folmos-at-orange approved these changes Sep 1, 2025

popescu-v force-pushed the 39-support-text-types-in-sklearn-predictors branch from 8e70138 to bd921c0 Compare September 1, 2025 15:35

popescu-v added 4 commits September 1, 2025 18:02

Add n_text_features and type_text features to Khiops sklearn supervised estimators

a4515af

Add "Text" Khiops type assignment heuristic

bad0ec2

Thus, "Text" is assigned to a dataframe input column if:

- its type is StringDType, and
- its length is > 100 (same heuristic as Khiops core when called through `api.build_dictionary_from_data_table`).

Add Text data Sklearn sample

382386d

Update CHANGELOG

4556cf5

popescu-v force-pushed the 39-support-text-types-in-sklearn-predictors branch from bd921c0 to 4556cf5 Compare September 1, 2025 16:02

popescu-v merged commit 9cec2f1 into dev Sep 2, 2025
19 checks passed

popescu-v deleted the 39-support-text-types-in-sklearn-predictors branch September 2, 2025 08:53

3 participants
