Merged
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -11,6 +11,8 @@
### Added
- (`core`) Dictionary API support for dictionary, variable and variable block
comments, and dictionary and variable block internal comments.
- (`core`) Dictionary `Rule` class and supporting API for adding and getting
rules to / from variables and variable blocks.
- (`sklearn`) `Text` Khiops type support at the estimator level.

### Fixed
10 changes: 8 additions & 2 deletions doc/samples/samples.rst
@@ -655,17 +655,23 @@ Samples
fold_index_variable.name = "FoldIndex"
fold_index_variable.type = "Numerical"
fold_index_variable.used = False
fold_index_variable.rule = "Ceil(Product(" + str(fold_number) + ", Random()))"
dictionary.add_variable(fold_index_variable)

# Create fold indexing rule and set it on `fold_index_variable`
dictionary.get_variable(fold_index_variable.name).set_rule(
kh.Rule("Ceil", kh.Rule("Product", fold_number, kh.Rule("Random()"))),
)

# Add variables that indicate if the instance is in the train dataset:
for fold_index in range(1, fold_number + 1):
is_in_train_dataset_variable = kh.Variable()
is_in_train_dataset_variable.name = "IsInTrainDataset" + str(fold_index)
is_in_train_dataset_variable.type = "Numerical"
is_in_train_dataset_variable.used = False
is_in_train_dataset_variable.rule = "NEQ(FoldIndex, " + str(fold_index) + ")"
dictionary.add_variable(is_in_train_dataset_variable)
dictionary.get_variable(is_in_train_dataset_variable.name).set_rule(
kh.Rule("NEQ", fold_index_variable, fold_index),
)

# Print dictionary with fold variables
print("Dictionary file with fold variables")
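The hunk above replaces hand-concatenated rule strings with nested rule objects. The following is a minimal sketch of that idea, not the khiops implementation: `SketchRule` and `rule` are hypothetical stand-ins written for illustration, and the plain-Python emulation at the end assumes that `Random()` draws uniformly from the unit interval.

```python
import math
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchRule:
    """Hypothetical stand-in for a rule object: a name plus ordered operands."""

    name: str
    operands: tuple = ()

    def __str__(self):
        # Nested rules render recursively; other operands use their str() form
        return f"{self.name}({', '.join(str(op) for op in self.operands)})"


def rule(name, *operands):
    """Hypothetical helper mirroring the call shape used in the sample."""
    return SketchRule(name, operands)


fold_number = 5

# Same nesting as the fold indexing rule in the sample
fold_index_rule = rule("Ceil", rule("Product", fold_number, rule("Random")))
rendered = str(fold_index_rule)
print(rendered)

# Plain-Python emulation of what the two rules compute (assumption: uniform draw)
rng = random.Random(0)
fold_indexes = [math.ceil(fold_number * rng.random()) for _ in range(1000)]

# NEQ(FoldIndex, 1): 1 when the instance is outside fold 1, i.e. in its train set
is_in_train_for_fold_1 = [int(fold_index != 1) for fold_index in fold_indexes]
```

Building the rule as a tree keeps operand values such as `fold_number` as typed Python objects until rendering, which avoids the `str(...)` concatenation used in the removed lines.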
15 changes: 15 additions & 0 deletions khiops/core/api.py
@@ -757,9 +757,11 @@ def train_predictor(
Maximum number of text features to construct.
text_features : str, default "words"
Type of the text features. Can be either one of:

- "words": sequences of non-space characters
- "ngrams": sequences of bytes
- "tokens": user-defined

max_trees : int, default 10
Maximum number of trees to construct.
max_pairs : int, default 0
@@ -788,8 +790,10 @@ def train_predictor(
Maximum number of variable parts produced by preprocessing methods. If equal
to 0 it is automatically calculated.
Special default values for unsupervised analysis:

- If ``discretization_method`` is "EqualWidth" or "EqualFrequency": 10
- If ``grouping_method`` is "BasicGrouping": 10

... :
See :ref:`core-api-common-params`.

@@ -1181,9 +1185,11 @@ def train_recoder(
Maximum number of text features to construct.
text_features : str, default "words"
Type of the text features. Can be either one of:

- "words": sequences of non-space characters
- "ngrams": sequences of bytes
- "tokens": user-defined

max_trees : int, default 10
Maximum number of trees to construct.
max_pairs : int, default 0
@@ -1210,13 +1216,16 @@ def train_recoder(
If ``True`` keeps initial numerical variables.
categorical_recoding_method : str
Type of recoding for categorical variables. Types available:

- "part Id" (default): An id for the interval/group
- "part label": A label for the interval/group
- "0-1 binarization": A 0's and 1's coding the interval/group id
- "conditional info": Conditional information of the interval/group
- "none": Keeps the variable as-is

numerical_recoding_method : str
Type of recoding for numerical variables. Types available:

- "part Id" (default): An id for the interval/group
- "part label": A label for the interval/group
- "0-1 binarization": A 0's and 1's coding the interval/group id
@@ -1226,13 +1235,16 @@ def train_recoder(
- "rank normalization": mean normalized rank (between 0 and 1) of the
instances
- "none": Keeps the variable as-is

pairs_recoding_method : str
Type of recoding for bivariate variables. Types available:

- "part Id" (default): An id for the interval/group
- "part label": A label for the interval/group
- "0-1 binarization": A 0's and 1's coding the interval/group id
- "conditional info": Conditional information of the interval/group
- "none": Keeps the variable as-is

discretization_method : str, default "MODL"
Name of the discretization method in case of unsupervised analysis.
Its valid values are: "MODL", "EqualWidth", "EqualFrequency" or "none".
@@ -1245,15 +1257,18 @@ def train_recoder(
Maximum number of variable parts produced by preprocessing methods. If equal
to 0 it is automatically calculated.
Special default values for unsupervised analysis:

- If ``discretization_method`` is "EqualWidth" or "EqualFrequency": 10
- If ``grouping_method`` is "BasicGrouping": 10

... :
See :ref:`core-api-common-params`.

Returns
-------
tuple
A 2-tuple containing:

- The path of the JSON file report of the process
- The path of the dictionary containing the recoding model
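To make the "part Id" and "0-1 binarization" options listed in the `train_recoder` docstring more concrete, here is a rough plain-Python sketch for a numerical variable that has already been split into intervals. It does not call khiops: the function names, the `I<n>` label format and the right-closed interval convention are all assumptions made for illustration.

```python
from bisect import bisect_left


def part_index(value, cut_points):
    """Return the 0-based index of the interval containing `value`.

    `cut_points` are sorted interior bounds; intervals are assumed to be
    right-closed, so a value equal to a cut point falls in the lower interval.
    """
    return bisect_left(cut_points, value)


def recode_part_id(value, cut_points):
    """Sketch of a "part Id" style recoding: one identifier per interval."""
    return f"I{part_index(value, cut_points) + 1}"  # hypothetical label format


def recode_binarization(value, cut_points):
    """Sketch of a "0-1 binarization" style recoding: one indicator per interval."""
    index = part_index(value, cut_points)
    return [int(i == index) for i in range(len(cut_points) + 1)]


cut_points = [10, 20]  # three intervals: up to 10, 10 to 20, above 20
for value in (5, 10, 15, 25):
    print(value, recode_part_id(value, cut_points), recode_binarization(value, cut_points))
```

Both recodings carry the same information, the index of the part; they differ only in whether it is encoded as a single categorical value or as a set of indicator columns.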
