Commit f978cdd

Added documentation
Signed-off-by: romit <romit@ibm.com>
1 parent 26359ac commit f978cdd

1 file changed

Lines changed: 2 additions & 1 deletion

docs/advanced-data-preprocessing.md

@@ -162,7 +162,8 @@ Each data handler has:
 - `sampling` (optional, float): The sampling ratio (0.0 to 1.0) with which to sample a dataset in case of interleaving.
 - `split` (optional, dict[str: float]): Defines how to split the dataset into training and validation sets. Requires both `train` and `validation` keys.
 - `data_handlers` (optional, list): A list of data handler configurations which preprocess the dataset.
-
+- `dataset_split_name` (optional, str): Name of the dataset split to load. This is useful for loading HuggingFace datasets whose split names differ from the standard ones (e.g., `train_sft` instead of `train`). If no `dataset_split_name` is provided, `train` is used.
+- `shuffle` (optional, bool): Whether the dataset should be shuffled when it is split into training and validation sets. Defaults to `True`. Use caution with this field: set it to `False` only when the dataset is already shuffled.

 We do provide some sample `data_configs` here, [predefined_data_configs](../tests/artifacts/predefined_data_configs/).

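To make the two new fields concrete, here is a minimal sketch of how the `dataset_split_name` default and the `split`/`shuffle` interaction described in this hunk might behave. It is not the library's implementation: the helper names `resolve_split_name` and `split_records`, the `seed` parameter, and the use of a plain Python list in place of a real dataset are all assumptions made for illustration.

```python
import random


def resolve_split_name(dataset_cfg):
    """Return the split to load, falling back to "train" when the key is absent."""
    return dataset_cfg.get("dataset_split_name", "train")


def split_records(records, split, shuffle=True, seed=42):
    """Split records into (train, validation) lists using the ratios in `split`.

    `seed` is a hypothetical parameter, added only to make the sketch reproducible.
    """
    if set(split) != {"train", "validation"}:
        raise ValueError("split requires both 'train' and 'validation' keys")
    items = list(records)
    if shuffle:
        # Shuffle a copy before slicing so both splits draw from the whole dataset.
        random.Random(seed).shuffle(items)
    n_train = int(len(items) * split["train"])
    n_val = int(len(items) * split["validation"])
    return items[:n_train], items[n_train:n_train + n_val]


# Hypothetical dataset entry using the fields documented above.
cfg = {
    "dataset_split_name": "train_sft",
    "split": {"train": 0.8, "validation": 0.2},
    "shuffle": False,  # only sensible if the data is already shuffled
}
train, validation = split_records(range(10), cfg["split"], shuffle=cfg["shuffle"])
```

With `shuffle` set to `False`, the validation set is simply the tail of the dataset, which is why the documentation advises disabling it only for data that is already shuffled.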