Description
Khiops 11 supports `Text` columns, which receive a specialized AutoML treatment, as opposed to normal strings (`Categorical`). The sklearn predictors should also support this type.
Questions/Ideas
- Do pandas and/or NumPy have a specialized `Text` type?
Option 1: Implement it as a `Dataset` property
- Add an optional field `text_columns` to the table specification tuple, holding the names of the text fields, or
- Add another field `table_text_columns` to the spec, indexed by table name, whose values are the names of the text columns (I prefer this one)

When creating the dictionary, the `Dataset` object will have all the information necessary to add the specified columns as `Text`. (The `Dataset` API is not exposed.)
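A minimal sketch of the second variant, assuming the multi-table spec is a plain `dict` with `main_table` and `tables` entries. The `table_text_columns` key is only the proposal from this issue, `resolve_text_columns` is a hypothetical helper, and the tables are stand-in dicts of columns rather than real dataframes:

```python
# Stand-in tables: column name -> values (a real spec would hold dataframes).
customers = {"CustomerId": [1, 2], "Name": ["Ann", "Bob"]}
reviews = {
    "CustomerId": [1, 1],
    "ReviewId": [10, 11],
    "Comment": ["Great product", "Broke after a day"],
}

# Proposed spec: `table_text_columns` maps a table name to its text columns.
spec = {
    "main_table": "Customers",
    "tables": {
        "Customers": (customers, "CustomerId"),
        "Reviews": (reviews, ["CustomerId", "ReviewId"]),
    },
    "table_text_columns": {"Reviews": ["Comment"]},  # proposed, optional
}


def resolve_text_columns(spec):
    """Return {table name: [text columns]}, validating names against the spec."""
    resolved = {name: [] for name in spec["tables"]}
    for table_name, columns in spec.get("table_text_columns", {}).items():
        if table_name not in spec["tables"]:
            raise ValueError(f"Unknown table in table_text_columns: {table_name!r}")
        table, _key = spec["tables"][table_name]
        missing = [column for column in columns if column not in table]
        if missing:
            raise ValueError(f"Columns {missing} not found in table {table_name!r}")
        resolved[table_name] = list(columns)
    return resolved


text_columns_by_table = resolve_text_columns(spec)
```

Because the mapping travels with the spec, existing specs without the key keep working, and the dictionary creation code can read the text columns from the same object that describes the tables.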
Option 2: Implement it as a `fit` parameter

As above, add `table_text_columns`, but as an optional `fit` parameter
- It works, but the fact that a column is `Text` is part of the description of the dataset

This parameter should be passed to the dictionary creation routine. (The `Dataset` API is not exposed.)
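A rough sketch of this option, assuming `fit` simply forwards the mapping to the dictionary creation routine. `SketchPredictor` and `build_dictionary` are hypothetical names, the input is reduced to a `{column: inferred type}` description per table, and the output is a simplified dictionary-like text rather than what khiops-python actually generates:

```python
def build_dictionary(table_name, column_types, text_columns=()):
    """Render a simplified Khiops-dictionary-like declaration for one table.

    Columns listed in `text_columns` are declared as Text instead of their
    inferred type (hypothetical routine, not the khiops-python implementation).
    """
    unknown = set(text_columns) - set(column_types)
    if unknown:
        raise ValueError(f"Unknown text columns for {table_name!r}: {sorted(unknown)}")
    lines = [f"Dictionary {table_name}", "{"]
    for column, inferred_type in column_types.items():
        khiops_type = "Text" if column in text_columns else inferred_type
        lines.append(f"\t{khiops_type} {column};")
    lines.append("};")
    return "\n".join(lines)


class SketchPredictor:
    """Toy estimator: only shows how `fit` would forward the parameter."""

    def fit(self, X, y=None, table_text_columns=None):
        table_text_columns = table_text_columns or {}
        # X maps a table name to its {column: inferred Khiops type} description.
        self.dictionaries_ = {
            table_name: build_dictionary(
                table_name, column_types, table_text_columns.get(table_name, ())
            )
            for table_name, column_types in X.items()
        }
        return self


X = {"Reviews": {"Stars": "Numerical", "Brand": "Categorical", "Comment": "Categorical"}}
model = SketchPredictor().fit(X, table_text_columns={"Reviews": ["Comment"]})
```

The call site shows the drawback mentioned above: the mapping describes the data, yet the caller has to pass it again at every `fit` call.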
- [later edit: 2025/08/01] Option 3: Add 2 extra parameters, `n_text_features` and `text_feature_type`, to:
  - Option 3.1: the `KhiopsPredictor` estimator initializer (`__init__` method)
  - Option 3.2: the `KhiopsPredictor`'s `fit` method
- Note: `text_columns` needs to be passed as well:
  - either to the `KhiopsPredictor` initializer
  - or to the estimator's `fit` method.
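For comparison, options 3.1 and 3.2 could look roughly as follows. scikit-learn's convention is that hyperparameters are plain `__init__` arguments stored unchanged, while data-dependent arguments go to `fit`. Both classes are hypothetical stand-ins for `KhiopsPredictor`, and the values `10` and `"words"` are placeholders, not documented defaults or valid Khiops values:

```python
class PredictorOption31:
    """Option 3.1: text settings are hyperparameters set in __init__."""

    def __init__(self, n_text_features=None, text_feature_type=None, text_columns=None):
        # Stored unchanged, as scikit-learn expects of constructor parameters.
        self.n_text_features = n_text_features
        self.text_feature_type = text_feature_type
        self.text_columns = text_columns

    def fit(self, X, y=None):
        # A real estimator would pass these settings to the training routine.
        self.text_settings_ = {
            "n_text_features": self.n_text_features,
            "text_feature_type": self.text_feature_type,
            "text_columns": list(self.text_columns or []),
        }
        return self


class PredictorOption32:
    """Option 3.2: text settings are passed at each call to fit."""

    def fit(self, X, y=None, n_text_features=None, text_feature_type=None,
            text_columns=None):
        self.text_settings_ = {
            "n_text_features": n_text_features,
            "text_feature_type": text_feature_type,
            "text_columns": list(text_columns or []),
        }
        return self


X = [["Great product"], ["Broke after a day"]]  # placeholder training data
y = [1, 0]

model_31 = PredictorOption31(
    n_text_features=10, text_feature_type="words", text_columns=["Comment"]
).fit(X, y)
model_32 = PredictorOption32().fit(
    X, y, n_text_features=10, text_feature_type="words", text_columns=["Comment"]
)
```

With option 3.1 the settings are part of the estimator's parameters, so they can be cloned or tuned like any other hyperparameter; with option 3.2 they must be supplied again at each `fit` call.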
- Note 2: Expose the `Dataset` API. It will have two init patterns:
  - Big constructor
  - Builder pattern
  - The big constructor uses the builder

The `dict` interface should also be maintained. (The `Dataset` API is not exposed.)
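One possible shape for the two init patterns, as a sketch only. `Dataset` here is a toy class, not the existing internal one, and `DatasetBuilder`, `add_table`, `validated_state` and `from_dict` are hypothetical names. The big constructor delegates to the builder, and `from_dict` keeps the `dict` interface:

```python
class DatasetBuilder:
    """Builder pattern: tables are added one at a time."""

    def __init__(self):
        self._main_table = None
        self._tables = {}
        self._text_columns = {}

    def add_table(self, name, table, key=None, text_columns=(), main=False):
        self._tables[name] = (table, key)
        self._text_columns[name] = list(text_columns)
        if main:
            self._main_table = name
        return self  # returning self allows chained calls

    def validated_state(self):
        if self._main_table is None:
            raise ValueError("A main table is required")
        return {
            "main_table": self._main_table,
            "tables": dict(self._tables),
            "table_text_columns": dict(self._text_columns),
        }

    def build(self):
        dataset = Dataset.__new__(Dataset)  # skip the big constructor
        dataset._set_state(self.validated_state())
        return dataset


class Dataset:
    def __init__(self, main_table, tables, table_text_columns=None):
        """Big constructor: the assembly is delegated to the builder."""
        table_text_columns = table_text_columns or {}
        builder = DatasetBuilder()
        for name, (table, key) in tables.items():
            builder.add_table(
                name,
                table,
                key=key,
                text_columns=table_text_columns.get(name, ()),
                main=(name == main_table),
            )
        self._set_state(builder.validated_state())

    def _set_state(self, state):
        self.main_table = state["main_table"]
        self.tables = state["tables"]
        self.table_text_columns = state["table_text_columns"]

    @classmethod
    def from_dict(cls, spec):
        """Keep the existing dict interface as a thin wrapper."""
        return cls(spec["main_table"], spec["tables"], spec.get("table_text_columns"))


reviews = {"ReviewId": [1], "Comment": ["Great product"]}  # stand-in table
built = DatasetBuilder().add_table(
    "Reviews", reviews, key="ReviewId", text_columns=["Comment"], main=True
).build()
from_spec = Dataset.from_dict(
    {
        "main_table": "Reviews",
        "tables": {"Reviews": (reviews, "ReviewId")},
        "table_text_columns": {"Reviews": ["Comment"]},
    }
)
```

Routing the constructor through the builder keeps the validation in one place, so both entry points and the `dict` interface produce the same object state.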
`StringDtype` should be used for columns whose Khiops type is `Text`; see https://pandas.pydata.org/docs/user_guide/text.html#working-with-text-data. `StringDtype`s should be used for `TextList` (to be clarified).
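If the pandas string dtype is used as the marker for `Text` columns, the text columns could be inferred from the dataframe itself. The sketch below assumes that convention; `infer_text_columns` is a hypothetical helper, while `pd.StringDtype` and the `"string"` dtype alias are part of pandas:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Stars": [5, 1],
        "Brand": pd.Series(["Acme", "Bolt"], dtype="category"),
        # Explicitly typed as pandas' string dtype to mark it as text.
        "Comment": pd.Series(["Great product", "Broke after a day"], dtype="string"),
    }
)


def infer_text_columns(dataframe):
    """Return the names of the columns whose dtype is pandas' StringDtype."""
    return [
        name
        for name, dtype in dataframe.dtypes.items()
        if isinstance(dtype, pd.StringDtype)
    ]


text_columns = infer_text_columns(df)
```

With this convention, an explicit `text_columns` or `table_text_columns` argument would only be needed for inputs that carry no dtype information, such as NumPy arrays.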