| 1 | +# Data Handlers |
| 2 | +Data handlers are routines that process a dataset using the [HF process frameworks](https://huggingface.co/docs/datasets/en/process), including map, filter, remove, select, and rename. |
| 3 | +All data handler routines are registered with our data preprocessor as a `k:func` mapping, where |
| 4 | +`k` is the name (`str`) of the data handler and `func` (`callable`) is the function that is called. |
| 5 | + |
| 6 | +In the data config, users select a data handler by the `name` under which it was registered |
| 7 | +and specify the appropriate `arguments`. Each data handler accepts two types of arguments via `DataHandlerArguments` (as defined in the data preprocessor [schema](./advanced-data-preprocessing.md#what-is-data-config-schema)), as shown below. |
| 8 | + |
| 9 | +```yaml |
| 10 | +datapreprocessor: |
| 11 | + ... |
| 12 | +datasets: |
| 13 | + - name: ... |
| 14 | + data_paths: |
| 15 | + - ... |
| 16 | + data_handlers: |
| 17 | + - name: str |
| 18 | + arguments: |
| 19 | + argument: object |
| 20 | + ... |
| 21 | + argument: object |
| 22 | + fn_kwargs: |
| 23 | + fn_kwarg: object |
| 24 | + ... |
| 25 | + fn_kwarg: object |
| 26 | + ... |
| 27 | +``` |
| 28 | + |
| 29 | +Arguments to the data handlers are of two types: `kwargs` for the underlying HF API and `fn_kwargs` for the handler routine itself. |
| 30 | + |
| 31 | +Each data handler is a routine passed to an underlying HF API, so the `kwargs` supported by that API can be passed via the `arguments` section of the data handler config. In our pre-existing handlers, the underlying API is one of: |
| 32 | + - [Map](https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.Dataset.map) |
| 33 | + - [Filter](https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.Dataset.filter) |
| 34 | + - [Rename](https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.Dataset.rename_columns) |
| 35 | + - [Select](https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.Dataset.select) |
| 36 | + - [Remove](https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.Dataset.remove_columns) |
| 37 | + |
| 38 | +The operations performed for pre-existing handlers can be found in the [Preexisting data handlers](#preexisting-data-handlers) section. |
| 39 | + |
| 40 | +For example, users can pass `batched` through `arguments` to enable [batched processing](https://huggingface.co/docs/datasets/en/about_map_batch) for the data handler. |
| 41 | + |
| 42 | +Users can also pass any number of keyword arguments required by each data handling routine as [`fn_kwargs`](https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.Dataset.map.fn_kwargs) inside `arguments`. |
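As a rough illustration of how the two kinds of arguments are consumed, the sketch below uses a plain-Python stand-in for `Dataset.map`. The names `toy_map` and `add_suffix` are hypothetical and exist only for this example. It is not the library's implementation, only a picture of which keys go to the API and which go to the handler.

```python
from typing import Any, Callable, Dict, List, Optional

Element = Dict[str, Any]


def toy_map(
    rows: List[Element],
    function: Callable[..., Element],
    fn_kwargs: Optional[Dict[str, Any]] = None,
    batched: bool = False,
) -> List[Element]:
    """Hypothetical stand-in for datasets.Dataset.map (non-batched path only)."""
    if batched:
        raise NotImplementedError("batched mode is omitted from this sketch")
    fn_kwargs = fn_kwargs or {}
    # Each row is updated with the dict returned by the handler.
    return [{**row, **function(row, **fn_kwargs)} for row in rows]


def add_suffix(element: Element, column: str, suffix: str) -> Element:
    """Hypothetical handler: receives the element plus its own fn_kwargs."""
    return {column: element[column] + suffix}


# Mirrors the `arguments` section of one data handler entry in the config:
# `batched` is consumed by the API, `fn_kwargs` is forwarded to the handler.
arguments = {
    "batched": False,
    "fn_kwargs": {"column": "text", "suffix": "</s>"},
}

rows = [{"text": "hello"}, {"text": "world"}]
result = toy_map(rows, add_suffix, **arguments)
```

Here `batched` never reaches `add_suffix`, while `column` and `suffix` never reach the API layer except as the opaque `fn_kwargs` dict.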
| 43 | + |
| 44 | + |
| 45 | +## Preexisting data handlers |
| 46 | +This library currently supports the following preexisting data handlers. These handlers can be requested by name, and users can look up the function arguments in the [data handlers source code](https://github.com/foundation-model-stack/fms-hf-tuning/blob/main/tuning/data/data_handlers.py): |
| 47 | + |
| 48 | + |
| 49 | +### `tokenize_and_apply_input_masking`: |
| 50 | +Tokenizes input text and applies masking to the labels for causal language modeling tasks; well suited to input/output datasets. |
| 51 | +By default this handler adds `EOS_TOKEN`, which can be disabled with a handler argument; see [this example config](https://github.com/foundation-model-stack/fms-hf-tuning/blob/main/tests/artifacts/predefined_data_configs/tokenize_and_apply_input_masking.yaml). |
| 52 | + |
| 53 | +Type: MAP |
| 54 | + |
| 55 | +Args: |
| 56 | + - `element`: the HF Dataset element. |
| 57 | + - `tokenizer`: Tokenizer to be used for tokenization. |
| 58 | + - `column_names`: Names of all the columns in the dataset. |
| 59 | + - `input_field_name`: Name of the input (instruction) field in the dataset. |
| 60 | + - `output_field_name`: Name of the output field in the dataset. |
| 61 | + - `add_eos_token`: Whether to append `tokenizer.eos_token` to the text. Defaults to `True`. |
| 62 | + |
| 63 | +Returns formatted Dataset element with `input_ids`, `labels` and `attention_mask` columns. |
| 64 | + |
| 65 | +### `add_tokenizer_eos_token`: |
| 66 | +Appends the tokenizer's EOS token to a specified dataset field. |
| 67 | + |
| 68 | +Type: MAP |
| 69 | + |
| 70 | +Args: |
| 71 | + - `element`: the HF Dataset element. |
| 72 | + - `tokenizer`: Tokenizer to be used for the EOS token, which will be appended when formatting the data into a single sequence. Defaults to empty. |
| 73 | + - `dataset_text_field`: Text column name of the dataset where EOS is to be added. |
| 74 | + |
| 75 | +Returns formatted Dataset element with EOS added to the `dataset_text_field` of the element. |
| 76 | + |
| 77 | +### `apply_custom_data_formatting_template`: |
| 78 | +Applies a custom template (e.g., Alpaca style) to format dataset elements. |
| 79 | +By default this handler adds `EOS_TOKEN`, which can be disabled with a handler argument; see [this example config](https://github.com/foundation-model-stack/fms-hf-tuning/blob/main/tests/artifacts/predefined_data_configs/apply_custom_template.yaml). |
| 80 | + |
| 81 | +Type: MAP |
| 82 | + |
| 83 | +Args: |
| 84 | + - `element`: the HF Dataset element. |
| 85 | + - `tokenizer`: Tokenizer to be used for the EOS token, which will be appended when formatting the data into a single sequence. Defaults to empty. |
| 86 | + - `dataset_text_field`: Text column name of the dataset where formatted text is saved. |
| 87 | + - `template`: Template to format data with. Features of Dataset should be referred to by their key. |
| 88 | + - `add_eos_token`: Whether to append `tokenizer.eos_token` to the text. Defaults to `True`. |
| 89 | + |
| 90 | +Returns formatted Dataset element by formatting dataset with template+tokenizer.EOS_TOKEN, saving the result to the `dataset_text_field` argument. |
| 91 | + |
| 92 | +### `apply_custom_jinja_template`: |
| 93 | +Applies a custom Jinja template (e.g., Alpaca style) to format dataset elements. |
| 94 | +By default this handler adds `EOS_TOKEN`, which can be disabled with a handler argument; see [this example config](https://github.com/foundation-model-stack/fms-hf-tuning/blob/main/tests/artifacts/predefined_data_configs/apply_custom_jinja_template.yaml). |
| 95 | + |
| 96 | +Type: MAP |
| 97 | + |
| 98 | +Args: |
| 99 | + - `element`: the HF Dataset element. |
| 100 | + - `tokenizer`: Tokenizer to be used for the EOS token, which will be appended |
| 101 | + when formatting the data into a single sequence. Defaults to empty. |
| 102 | + - `dataset_text_field`: Text column name of the dataset where formatted text is saved. |
| 103 | + - `template`: Template to format data with. Features of Dataset should be referred to by their key. |
| 104 | + - `add_eos_token`: Whether to append `tokenizer.eos_token` to the text. Defaults to `True`. |
| 105 | + |
| 106 | +Returns formatted HF Dataset element by formatting dataset with the provided Jinja template, saving the result to the `dataset_text_field` argument. |
| 107 | + |
| 108 | +### `apply_tokenizer_chat_template`: |
| 109 | +Uses a tokenizer's chat template to preprocess dataset elements, good for single/multi turn chat templates. |
| 110 | + |
| 111 | +Type: MAP |
| 112 | + |
| 113 | +Args: |
| 114 | + - `element`: the HF Dataset element. |
| 115 | + - `tokenizer`: Tokenizer to be used. |
| 116 | + - `dataset_text_field`: the field in which to store the rendered text. |
| 117 | + - `conversation_column`: column name where the chat template expects the conversation. |
| 118 | + |
| 119 | +Returns formatted HF Dataset element by formatting dataset with tokenizer's chat template, saving the result to the `dataset_text_field` argument. |
| 120 | + |
| 121 | +### `tokenize`: |
| 122 | +Tokenizes one column of the dataset passed as input `dataset_text_field`. |
| 123 | + |
| 124 | +Type: MAP |
| 125 | + |
| 126 | +Args: |
| 127 | + - `element`: the HF Dataset element. |
| 128 | + - `tokenizer`: Tokenizer to be used. |
| 129 | + - `dataset_text_field`: The dataset field to tokenize. |
| 130 | + - `truncation`: Truncation strategy to use; see the [padding and truncation guide](https://huggingface.co/docs/transformers/en/pad_truncation). |
| 131 | + - `max_length`: Max length to truncate the samples to. |
| 132 | + |
| 133 | +Returns the dataset element with the `dataset_text_field` field tokenized. |
| 134 | + |
| 135 | +### `duplicate_columns`: |
| 136 | +Duplicates one column of a dataset into a new column. |
| 137 | + |
| 138 | +Type: MAP |
| 139 | + |
| 140 | +Args: |
| 141 | + - `element`: the HF Dataset element. |
| 142 | + - `old_column`: Name of the column to be duplicated. |
| 143 | + - `new_column`: Name of the new column where the duplicated column is saved. |
| 144 | + |
| 145 | +Returns formatted HF dataset element with `new_column` where `old_column` content is deep copied. |
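The deep-copy behaviour described above can be pictured with a minimal sketch. The function name `duplicate_columns_sketch` is hypothetical and any validation the real handler performs is left out; only the copy semantics are shown.

```python
import copy
from typing import Any, Dict

Element = Dict[str, Any]


def duplicate_columns_sketch(
    element: Element, old_column: str, new_column: str
) -> Element:
    """Hypothetical sketch: copy `old_column` into `new_column` by value."""
    # deepcopy means later edits to the new column do not touch the original.
    return {**element, new_column: copy.deepcopy(element[old_column])}


sample = {"input_ids": [1, 2, 3]}
duplicated = duplicate_columns_sketch(sample, "input_ids", "labels")

# Mutating the duplicate leaves the source column unchanged.
duplicated["labels"].append(4)
```

A shallow assignment would instead make both columns share one list, so the `append` above would also change `input_ids`.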
| 146 | + |
| 147 | +### `skip_large_columns`: |
| 148 | +Skips elements in which the given column is longer than the passed max length. |
| 149 | + |
| 150 | +Type: FILTER |
| 151 | + |
| 152 | +Args: |
| 153 | + - `element`: the HF Dataset element. |
| 154 | + - `column_name`: Name of the column to be filtered. |
| 155 | + - `max_length`: Max allowed length of the column in either characters or tokens. |
| 156 | + |
| 157 | +Returns a filtered dataset which contains elements with length shorter than max length. |
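A FILTER-type handler returns a boolean per element rather than a modified element. The sketch below follows the description above (keep elements strictly shorter than `max_length`). The name `skip_large_columns_sketch` is hypothetical, and a list comprehension stands in for `Dataset.filter`.

```python
from typing import Any, Dict, List

Element = Dict[str, Any]


def skip_large_columns_sketch(
    element: Element, column_name: str, max_length: int
) -> bool:
    """Hypothetical sketch: True means the element is kept."""
    # len() counts characters for a string and tokens for a list of ids.
    return len(element[column_name]) < max_length


rows: List[Element] = [
    {"input_ids": [1, 2]},
    {"input_ids": [1, 2, 3, 4, 5]},
    {"input_ids": [7]},
]

# Stand-in for Dataset.filter: keep rows where the handler returns True.
kept = [
    row
    for row in rows
    if skip_large_columns_sketch(row, column_name="input_ids", max_length=3)
]
```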
| 158 | + |
| 159 | +### `remove_columns`: |
| 160 | +Directly calls [remove_columns](https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.Dataset.remove_columns) in the HF API. |
| 161 | + |
| 162 | +Type: REMOVE |
| 163 | + |
| 164 | +Args: |
| 165 | + - `column_names`: Names of columns to be removed from the dataset. |
| 166 | + |
| 167 | +Returns the dataset with the specified columns removed. |
| 168 | + |
| 169 | +### `select_columns`: |
| 170 | +Directly calls [select_columns](https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.Dataset.select_columns) in the HF API. |
| 171 | + |
| 172 | +Type: SELECT |
| 173 | + |
| 174 | +Args: |
| 175 | + - `column_names`: Names of columns to be retained in the new dataset. |
| 176 | + |
| 177 | +Returns a new dataset containing only the specified columns. |
| 178 | + |
| 179 | +### `rename_columns`: |
| 180 | +Directly calls [rename_columns](https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.Dataset.rename_columns) in the HF API. |
| 181 | + |
| 182 | +Type: RENAME |
| 183 | + |
| 184 | +Args: |
| 185 | + - `column_mapping`: Column names passed as a `str:str` mapping of `old_name:new_name`. |
| 186 | + |
| 187 | +Returns the dataset with columns renamed according to the provided column mapping. |
| 188 | + |
| 189 | + |
| 190 | +## Extra data handlers |
| 191 | +Users can also pass custom data handlers to the [`sft_trainer.py::train()`](https://github.com/foundation-model-stack/fms-hf-tuning/blob/d7f06f5fc898eb700a9e89f08793b2735d97889c/tuning/sft_trainer.py#L71) API call via the [`additional_data_handlers`](https://github.com/foundation-model-stack/fms-hf-tuning/blob/d7f06f5fc898eb700a9e89f08793b2735d97889c/tuning/sft_trainer.py#L89) argument. |
| 192 | + |
| 193 | +The argument expects users to pass a map similar to the existing data handlers, `k(str):func(callable)`, which will be registered with the data preprocessor via its [`register_data_handlers`](https://github.com/foundation-model-stack/fms-hf-tuning/blob/d7f06f5fc898eb700a9e89f08793b2735d97889c/tuning/data/data_processors.py#L65) API. |
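A minimal sketch of such a map is shown below. The handler `lowercase_column` and its `column_name` argument are hypothetical, and the by-name lookup at the end only imitates what a preprocessor might do after registration; it does not call the library.

```python
from typing import Any, Callable, Dict

Element = Dict[str, Any]


def lowercase_column(element: Element, column_name: str, **kwargs: Any) -> Element:
    """Hypothetical custom MAP-style handler: lowercases one text column."""
    # Extra keyword arguments are accepted and ignored (an assumption made
    # here so the handler tolerates fn_kwargs it does not use).
    return {column_name: element[column_name].lower()}


# Map of handler name -> callable, in the `k(str):func(callable)` shape.
additional_data_handlers: Dict[str, Callable[..., Element]] = {
    "lowercase_column": lowercase_column,
}

# Imitate a by-name lookup followed by a single-element map step.
handler = additional_data_handlers["lowercase_column"]
sample = {"text": "Hello World"}
updated = {**sample, **handler(sample, column_name="text")}
```

Per the text above, a dict of this shape would be passed as the `additional_data_handlers` argument of `train()`, and the data config would then request the handler by the name `lowercase_column`.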
| 194 | + |
| 195 | +## Examples |
| 196 | +To see typical use-cases and how handlers are linked together, see [data preprocessing recipes](./data-preprocessing-recipes.md). |