Conversation
aa47498 to
b221625
Compare
tramora
left a comment
There was a problem hiding this comment.
A few personal comments for my understanding maybe not blocking at all
doc/samples/samples_sklearn.rst
Outdated
| y_train = data_train_df["negativereason"] | ||
| y_test = data_test_df["negativereason"] | ||
|
|
||
| # Set Pandas StringDType on the "text" column |
There was a problem hiding this comment.
On the user point of view, requiring this is a burden. If he forgets to do this, does it fail ?
There was a problem hiding this comment.
Yes, we need to document it on the main web site as well (see the issue #39).
There was a problem hiding this comment.
On the user point of view, requiring this is a burden. If he forgets to do this, does it fail ?
If the user forgets to force this conversion, then the column would either:
- be of a different type (e.g.
object), in which case it will be assigned the "Categorical" Khiops type; - or just happen to be of the String dtype, in which case the heuristic would apply.
Hence, there is no failure, but this behavior needs to be thoroughly documented.
|
|
||
| .. note:: | ||
| The "Text" Khiops type is inferred if the Numpy type is "string" | ||
| and the maximum length of the entries of that type is greater than 100. |
There was a problem hiding this comment.
Maybe worth defining a constant in the file (and referred here) because the value of 100 is arbitrary and could be changed in the future.
There was a problem hiding this comment.
This value is identical to the Khiops Core value which servers the same purpose (as explained in the commit message) AFAIU.
khiops/sklearn/dataset.py
Outdated
| raise TypeError( | ||
| type_error_message("max_type_size", max_type_size, int, np.int64) | ||
| ) | ||
| if max_type_size > 100: |
There was a problem hiding this comment.
Maybe worth defining a constant in the file because the value of 100 is arbitrary and could be changed in the future.
There was a problem hiding this comment.
See above. I'm not sure, as it is only used once.
| self, | ||
| n_features=100, | ||
| n_trees=10, | ||
| n_text_features=10000, |
There was a problem hiding this comment.
Further in this PR, it is implied that a overhead can occur with this new parameter (unless n_text_features=0 is set), should we warn the user in the docstring about this performance change and if he does not manipulate any text colum he should set the value to 0 ?
There was a problem hiding this comment.
Good point. Indeed, each single-column with such text data would be internally converted by Khiops into a set of potentially numerous text variables. But I'm not sure of the performance penalty this entails (if any). Setting n_text_features=0 just tells Khiops to proceed in the "standard" way, where each "stringy" column is Categorical.
b221625 to
8e70138
Compare
folmos-at-orange
left a comment
There was a problem hiding this comment.
I'd merge the commit for the feature with that of the sample, otherwise LGTM.
8e70138 to
bd921c0
Compare
I'd keep it separate, as the feature is implemented in two commits (that I'd keep as such for clarity). |
Thus, "Text" is assigned to a dataframe input column if: - its type is StringDType, and - its length is > 100 (same heuristic as Khiops core when called through `api.build_dictionary_from_data_table`).
bd921c0 to
4556cf5
Compare
closes #39
TODO Before Asking for a Review
dev(ormainfor release PRs)Unreleasedsection ofCHANGELOG.md(no date)index.html