Skip to content

Add n_text_features and type_text features to Khiops sklearn supervised estimators#462

Merged
popescu-v merged 4 commits intodevfrom
39-support-text-types-in-sklearn-predictors
Sep 2, 2025
Merged

Add n_text_features and type_text features to Khiops sklearn supervised estimators#462
popescu-v merged 4 commits intodevfrom
39-support-text-types-in-sklearn-predictors

Conversation

@popescu-v
Copy link
Copy Markdown
Collaborator

@popescu-v popescu-v commented Aug 25, 2025

closes #39


TODO Before Asking for a Review

  • Rebase your branch to the latest version of dev (or main for release PRs)
  • Make sure all CI workflows are green
  • When adding a public feature/fix: Update the Unreleased section of CHANGELOG.md (no date)
  • Self-Review: Review "Files Changed" tab and fix any problems you find
  • API Docs (only if there are changes in docstrings, rst files or samples):
    • Check the docs build without warning: see the log of the API Docs workflow
    • Check that your changes render well in HTML: download the API Docs artifact and open index.html
    • If there are any problems it is faster to iterate by building locally the API Docs

@popescu-v popescu-v linked an issue Aug 25, 2025 that may be closed by this pull request
@popescu-v popescu-v marked this pull request as ready for review September 1, 2025 13:32
@popescu-v popescu-v force-pushed the 39-support-text-types-in-sklearn-predictors branch 2 times, most recently from aa47498 to b221625 Compare September 1, 2025 14:10
Copy link
Copy Markdown
Collaborator

@tramora tramora left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few personal comments for my understanding maybe not blocking at all

y_train = data_train_df["negativereason"]
y_test = data_test_df["negativereason"]

# Set Pandas StringDType on the "text" column
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the user point of view, requiring this is a burden. If he forgets to do this, does it fail ?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we need to document it on the main web site as well (see the issue #39).

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the user point of view, requiring this is a burden. If he forgets to do this, does it fail ?

If the user forgets to force this conversion, then the column would either:

  • be of a different type (e.g. object), in which case it will be assigned the "Categorical" Khiops type;
  • or just happen to be of the String dtype, in which case the heuristic would apply.

Hence, there is no failure, but this behavior needs to be thoroughly documented.


.. note::
The "Text" Khiops type is inferred if the Numpy type is "string"
and the maximum length of the entries of that type is greater than 100.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe worth defining a constant in the file (and referred here) because the value of 100 is arbitrary and could be changed in the future.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This value is identical to the Khiops Core value which servers the same purpose (as explained in the commit message) AFAIU.

raise TypeError(
type_error_message("max_type_size", max_type_size, int, np.int64)
)
if max_type_size > 100:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe worth defining a constant in the file because the value of 100 is arbitrary and could be changed in the future.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above. I'm not sure, as it is only used once.

self,
n_features=100,
n_trees=10,
n_text_features=10000,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Further in this PR, it is implied that a overhead can occur with this new parameter (unless n_text_features=0 is set), should we warn the user in the docstring about this performance change and if he does not manipulate any text colum he should set the value to 0 ?

Copy link
Copy Markdown
Collaborator Author

@popescu-v popescu-v Sep 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Indeed, each single-column with such text data would be internally converted by Khiops into a set of potentially numerous text variables. But I'm not sure of the performance penalty this entails (if any). Setting n_text_features=0 just tells Khiops to proceed in the "standard" way, where each "stringy" column is Categorical.

@popescu-v popescu-v force-pushed the 39-support-text-types-in-sklearn-predictors branch from b221625 to 8e70138 Compare September 1, 2025 15:14
@folmos-at-orange folmos-at-orange self-requested a review September 1, 2025 15:22
Copy link
Copy Markdown
Member

@folmos-at-orange folmos-at-orange left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd merge the commit for the feature with that of the sample, otherwise LGTM.

@popescu-v popescu-v force-pushed the 39-support-text-types-in-sklearn-predictors branch from 8e70138 to bd921c0 Compare September 1, 2025 15:35
@popescu-v
Copy link
Copy Markdown
Collaborator Author

I'd merge the commit for the feature with that of the sample, otherwise LGTM.

I'd keep it separate, as the feature is implemented in two commits (that I'd keep as such for clarity).

Thus, "Text" is assigned to a dataframe input column if:
- its type is StringDType, and
- its length is > 100 (same heuristic as Khiops core when called through
  `api.build_dictionary_from_data_table`).
@popescu-v popescu-v force-pushed the 39-support-text-types-in-sklearn-predictors branch from bd921c0 to 4556cf5 Compare September 1, 2025 16:02
@popescu-v popescu-v merged commit 9cec2f1 into dev Sep 2, 2025
19 checks passed
@popescu-v popescu-v deleted the 39-support-text-types-in-sklearn-predictors branch September 2, 2025 08:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support text types in sklearn predictors

3 participants