
Commit c77f6fd

willmj authored and dushyantbehl committed
docs: advanced data handlers doc
Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
1 parent 73d6eef commit c77f6fd

2 files changed

Lines changed: 199 additions & 76 deletions


docs/advanced-data-handlers.md

Lines changed: 196 additions & 0 deletions
@@ -0,0 +1,196 @@
1+
# Data Handlers
2+
Data handlers are routines which process a dataset using [HF process frameworks](https://huggingface.co/docs/datasets/en/process) including map, filter, remove, select, and rename.
3+
All data handler routines are registered with our data preprocessor as a `k:func` object where
4+
`k` is the name (`str`) of the data handler and `func` (`callable`) is the function which is called.
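
The registration described above can be pictured as a plain dictionary from name to callable. The following is a minimal sketch, not the library's implementation; `HANDLER_REGISTRY`, `register_handler`, and `uppercase_text` are hypothetical names used only for illustration.

```python
from typing import Any, Callable, Dict

# Hypothetical registry: handler name (str) -> handler function (callable).
HANDLER_REGISTRY: Dict[str, Callable[..., Dict[str, Any]]] = {}


def register_handler(name: str, func: Callable[..., Dict[str, Any]]) -> None:
    """Store `func` under `name`, rejecting duplicate names."""
    if name in HANDLER_REGISTRY:
        raise ValueError(f"handler {name!r} is already registered")
    HANDLER_REGISTRY[name] = func


def uppercase_text(element: Dict[str, Any], field: str) -> Dict[str, Any]:
    """Toy handler: return the element with one text field upper-cased."""
    return {**element, field: element[field].upper()}


register_handler("uppercase_text", uppercase_text)

# A data config refers to the handler by the name it was registered under.
handler = HANDLER_REGISTRY["uppercase_text"]
result = handler({"text": "hello"}, field="text")
```

Keeping the lookup key a plain string is what lets a YAML data config select a handler without importing any Python code.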
5+
6+
In the data config, users can request which data handler to apply by requesting the corresponding `name`
7+
with which the data handler was registered and specifying the appropriate `arguments`. Each data handler accepts two types of arguments via `DataHandlerArguments` (as defined in the data preprocessor [schema](./advanced-data-preprocessing.md#what-is-data-config-schema)), as shown below.
8+
9+
```yaml
10+
datapreprocessor:
11+
...
12+
datasets:
13+
- name: ...
14+
data_paths:
15+
- ...
16+
data_handlers:
17+
- name: str
18+
arguments:
19+
argument: object
20+
...
21+
argument: object
22+
fn_kwargs:
23+
fn_kwarg: object
24+
...
25+
fn_kwarg: object
26+
...
27+
```
28+
29+
Arguments to the data handlers are of two types: keyword arguments for the underlying HF API, and `fn_kwargs` for the handler function itself.
30+
31+
Each data handler is a routine passed to an underlying HF API so the `kwargs` supported by the underlying API can be passed via the `arguments` section of the data handler config. In our pre-existing handlers the supported underlying API is either:
32+
- [Map](https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.Dataset.map)
33+
- [Filter](https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.Dataset.filter)
34+
- [Rename](https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.Dataset.rename_columns)
35+
- [Select](https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.Dataset.select)
36+
- [Remove](https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.Dataset.remove_columns)
37+
38+
The operations performed for pre-existing handlers can be found in the [Preexisting data handlers](#preexisting-data-handlers) section.
39+
40+
For example, users can pass `batched` through `arguments` to ensure [batched processing](https://huggingface.co/docs/datasets/en/about_map_batch) of the data handler.
41+
42+
Users can also pass any number of keyword arguments required by each data handler function as [`fn_kwargs`](https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.Dataset.map.fn_kwargs) inside the arguments.
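
To make the split between the two kinds of arguments concrete, the sketch below mirrors one `data_handlers` entry as a Python dictionary and separates `fn_kwargs` from the remaining `arguments`. The helper `split_handler_arguments` and the sample values are assumptions for illustration and are not part of the library.

```python
from typing import Any, Dict, Tuple

# Python mirror of one `data_handlers` entry from the data config.
handler_config = {
    "name": "tokenize",
    "arguments": {
        "batched": True,  # forwarded to the underlying HF API (e.g. map)
        "fn_kwargs": {  # forwarded to the handler function itself
            "dataset_text_field": "text",
            "max_length": 512,
        },
    },
}


def split_handler_arguments(
    arguments: Dict[str, Any],
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Return (api_kwargs, fn_kwargs) without mutating the input."""
    api_kwargs = {k: v for k, v in arguments.items() if k != "fn_kwargs"}
    fn_kwargs = dict(arguments.get("fn_kwargs", {}))
    return api_kwargs, fn_kwargs


api_kwargs, fn_kwargs = split_handler_arguments(handler_config["arguments"])
```

Here `api_kwargs` holds what the underlying API would receive and `fn_kwargs` holds what the handler function would receive.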
43+
44+
45+
## Preexisting data handlers
46+
This library currently supports the following preexisting data handlers. These handlers can be requested by name, and users can look up the function arguments in the [data handlers source code](https://github.com/foundation-model-stack/fms-hf-tuning/blob/main/tuning/data/data_handlers.py):
47+
48+
49+
### `tokenize_and_apply_input_masking`:
50+
Tokenizes input text and applies masking to the labels for causal language modeling tasks, good for input/output datasets.
51+
By default this handler adds `EOS_TOKEN`, which can be disabled by a handler argument; see [this example config](https://github.com/foundation-model-stack/fms-hf-tuning/blob/main/tests/artifacts/predefined_data_configs/tokenize_and_apply_input_masking.yaml).
52+
53+
Type: MAP
54+
55+
Args:
56+
- `element`: the HF Dataset element.
57+
- `tokenizer`: Tokenizer to be used for tokenization.
58+
- `column_names`: Name of all the columns in the dataset.
59+
- `input_field_name`: Name of the input (instruction) field in dataset
60+
- `output_field_name`: Name of the output field in dataset
61+
 - `add_eos_token`: Whether to append `tokenizer.eos_token` to the text. Defaults to `True`.
62+
63+
Returns formatted Dataset element with input_ids, labels and attention_mask columns
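
The idea of input masking can be sketched in plain Python: the input and output token ids are concatenated, and the label positions that belong to the input are replaced with an ignore value so that only output tokens contribute to the loss. This is a minimal illustration, not the handler's code; `toy_tokenize`, `mask_input_tokens`, and the use of `-100` as the ignore value are assumptions made for the example.

```python
from typing import Dict, List

IGNORE_INDEX = -100  # assumed label value that the loss function skips


def toy_tokenize(text: str) -> List[int]:
    """Hypothetical stand-in tokenizer: one id (the word length) per word."""
    return [len(word) for word in text.split()]


def mask_input_tokens(input_text: str, output_text: str) -> Dict[str, List[int]]:
    """Concatenate input and output ids; mask the input part of the labels."""
    input_ids = toy_tokenize(input_text)
    output_ids = toy_tokenize(output_text)
    combined = input_ids + output_ids
    labels = [IGNORE_INDEX] * len(input_ids) + output_ids
    return {
        "input_ids": combined,
        "labels": labels,
        "attention_mask": [1] * len(combined),
    }


example = mask_input_tokens("Translate hello", "bonjour")
```

In `example`, the first two label positions are masked because they come from the input text, while the last position keeps its token id.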
64+
65+
### `add_tokenizer_eos_token`:
66+
Appends the tokenizer's EOS token to a specified dataset field.
67+
68+
Type: MAP
69+
70+
Args:
71+
- `element`: the HF Dataset element.
72+
- `tokenizer`: Tokenizer to be used for the EOS token, which will be appended when formatting the data into a single sequence. Defaults to empty.
73+
- `dataset_text_field`: Text column name of the dataset where EOS is to be added.
74+
75+
Returns formatted Dataset element with EOS added to dataset_text_field of the element.
76+
77+
### `apply_custom_data_formatting_template`:
78+
Applies a custom template (e.g., Alpaca style) to format dataset elements.
79+
By default this handler adds `EOS_TOKEN`, which can be disabled by a handler argument; see [this example config](https://github.com/foundation-model-stack/fms-hf-tuning/blob/main/tests/artifacts/predefined_data_configs/apply_custom_template.yaml).
80+
81+
Type: MAP
82+
83+
Args:
84+
- `element`: the HF Dataset element.
85+
- `tokenizer`: Tokenizer to be used for the EOS token, which will be appended when formatting the data into a single sequence. Defaults to empty.
86+
- `dataset_text_field`: Text column name of the dataset where formatted text is saved.
87+
- `template`: Template to format data with. Features of Dataset should be referred to by their key.
88+
 - `add_eos_token`: Whether to append `tokenizer.eos_token` to the text. Defaults to `True`.
89+
90+
Returns formatted Dataset element by formatting dataset with template+tokenizer.EOS_TOKEN, saving the result to dataset_text_field argument.
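
A template-based formatter of this kind can be sketched as follows. This is an illustration under stated assumptions, not the handler's implementation: the `{{key}}` placeholder syntax, the `EOS_TOKEN` string, and the function name `apply_template` are all hypothetical.

```python
import re
from typing import Any, Dict

EOS_TOKEN = "</s>"  # hypothetical EOS string; real tokenizers define their own


def apply_template(
    element: Dict[str, Any],
    template: str,
    dataset_text_field: str,
    add_eos_token: bool = True,
) -> Dict[str, str]:
    """Fill `{{key}}` placeholders from the element and store the result."""

    def lookup(match: "re.Match[str]") -> str:
        key = match.group(1).strip()
        if key not in element:
            raise KeyError(f"template refers to missing column {key!r}")
        return str(element[key])

    text = re.sub(r"\{\{(.+?)\}\}", lookup, template)
    if add_eos_token:
        text += EOS_TOKEN
    return {dataset_text_field: text}


formatted = apply_template(
    {"instruction": "Add 2 and 3", "response": "5"},
    template="### Input: {{instruction}}\n### Response: {{response}}",
    dataset_text_field="formatted_text",
)
```

Raising on a missing column, rather than leaving the placeholder in place, surfaces template typos early; whether the real handler behaves this way is not stated here.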
91+
92+
### `apply_custom_jinja_template`:
93+
Applies a custom jinja template (e.g., Alpaca style) to format dataset elements.
94+
By default this handler adds `EOS_TOKEN`, which can be disabled by a handler argument; see [this example config](https://github.com/foundation-model-stack/fms-hf-tuning/blob/main/tests/artifacts/predefined_data_configs/apply_custom_jinja_template.yaml).
95+
96+
Type: MAP
97+
98+
Args:
99+
- `element`: the HF Dataset element
100+
- `tokenizer`: Tokenizer to be used for the EOS token, which will be appended
101+
when formatting the data into a single sequence. Defaults to empty.
102+
 - `dataset_text_field`: Text column name of the dataset where formatted text is saved.
103+
- `template`: Template to format data with. Features of Dataset should be referred to by their key.
104+
 - `add_eos_token`: Whether to append `tokenizer.eos_token` to the text. Defaults to `True`.
105+
106+
Returns formatted HF Dataset element by formatting dataset with provided jinja template, saving the result to dataset_text_field argument.
107+
108+
### `apply_tokenizer_chat_template`:
109+
Uses a tokenizer's chat template to preprocess dataset elements, good for single/multi turn chat templates.
110+
111+
Type: MAP
112+
113+
Args:
114+
- `element`: the HF Dataset element.
115+
- `tokenizer`: Tokenizer to be used.
116+
- `dataset_text_field`: the field in which to store the rendered text.
117+
- `conversation_column`: column name where the chat template expects the conversation
118+
119+
Returns formatted HF Dataset element by formatting dataset with tokenizer's chat template, saving the result to dataset_text_field argument.
120+
121+
### `tokenize`:
122+
Tokenizes one column of the dataset passed as input `dataset_text_field`.
123+
124+
Type: MAP
125+
126+
Args:
127+
- `element`: the HF Dataset element.
128+
- `tokenizer`: Tokenizer to be used.
129+
- `dataset_text_field`: The dataset field to tokenize.
130+
 - `truncation`: Truncation strategy to use; see the [padding and truncation guide](https://huggingface.co/docs/transformers/en/pad_truncation).
131+
- `max_length`: Max length to truncate the samples to.
132+
133+
Returns tokenized dataset element field `dataset_text_field`
134+
135+
### `duplicate_columns`:
136+
Duplicates one column of a dataset to another new column.
137+
138+
Type: MAP
139+
140+
Args:
141+
- `element`: the HF Dataset element
142+
- `old_column`: Name of the column to be duplicated
143+
 - `new_column`: Name of the new column where the duplicated column is saved
144+
145+
Returns formatted HF dataset element with `new_column` where `old_column` content is deep copied.
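
The deep-copy behaviour matters when the column holds mutable values such as token id lists. Below is a minimal sketch of the idea; `duplicate_column` is a hypothetical function, and raising an error when `new_column` already exists is an assumption rather than documented behaviour.

```python
import copy
from typing import Any, Dict


def duplicate_column(
    element: Dict[str, Any], old_column: str, new_column: str
) -> Dict[str, Any]:
    """Return the element with `old_column` deep-copied into `new_column`."""
    if old_column not in element:
        raise ValueError(f"column {old_column!r} not found in element")
    if new_column in element:
        raise ValueError(f"column {new_column!r} already exists in element")
    return {**element, new_column: copy.deepcopy(element[old_column])}


row = {"input_ids": [1, 2, 3]}
duplicated = duplicate_column(row, "input_ids", "labels")
duplicated["labels"].append(4)  # mutating the copy leaves the original intact
```

After the append, `duplicated["input_ids"]` still has three items, which would not hold if the new column merely referenced the same list.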
146+
147+
### `skip_large_columns`:
148+
Skips elements in which the specified column is longer than the given max length.
149+
150+
Type: FILTER
151+
152+
Args:
153+
- `element`: HF dataset element.
154+
- `column_name`: Name of column to be filtered.
155+
 - `max_length`: Max allowed length of the column, in either characters or tokens.
156+
157+
Returns a filtered dataset which contains elements with length shorter than max length
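
A filter-type handler returns a boolean per element instead of a transformed element. The sketch below shows that shape with a plain list standing in for a dataset; `keep_element` is a hypothetical name, and treating an element of exactly `max_length` as allowed is an assumption.

```python
from typing import Any, Dict, List


def keep_element(element: Dict[str, Any], column_name: str, max_length: int) -> bool:
    """Return True when the column's length is within the allowed maximum."""
    return len(element[column_name]) <= max_length


rows: List[Dict[str, Any]] = [
    {"input_ids": [1, 2]},
    {"input_ids": [1, 2, 3, 4, 5]},
    {"input_ids": [7, 8, 9]},
]

# Stand-in for a dataset filter: keep only rows for which the predicate holds.
kept = [row for row in rows if keep_element(row, "input_ids", max_length=3)]
```

Because `len` works on both strings and lists, the same predicate covers raw text (characters) and tokenized columns (tokens).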
158+
159+
### `remove_columns`:
160+
Directly calls [remove_columns](https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.Dataset.remove_columns) in HF API
161+
162+
Type: REMOVE
163+
164+
Args:
165+
- `column_names`: Names of columns to be removed from dataset
166+
167+
Removes specified columns of dataset
168+
169+
### `select_columns`:
170+
Directly calls [select](https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.Dataset.select) in HF API
171+
172+
Type: SELECT
173+
174+
Args:
175+
- `column_names`: Names of columns to be retained in the new dataset
176+
177+
Returns a new dataset containing only the columns named in `column_names`.
178+
179+
### `rename_columns`:
180+
Directly calls [rename_columns](https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.Dataset.rename_columns) in HF API
181+
182+
Type: RENAME
183+
184+
Args:
185+
- `column_mapping`: Column names passed as `str:str` from `old_name:new_name`
186+
187+
Returns renamed columns in dataset using provided column mapping.
188+
189+
190+
## Extra data handlers
191+
Users are also allowed to pass custom data handlers using [`sft_trainer.py::train()`](https://github.com/foundation-model-stack/fms-hf-tuning/blob/d7f06f5fc898eb700a9e89f08793b2735d97889c/tuning/sft_trainer.py#L71) API call via the [`additional_data_handlers`](https://github.com/foundation-model-stack/fms-hf-tuning/blob/d7f06f5fc898eb700a9e89f08793b2735d97889c/tuning/sft_trainer.py#L89) argument.
192+
193+
The argument expects users to pass a map similar to the existing data handlers, `k(str):func(callable)`, which will be registered with the data preprocessor via its [`register_data_handlers`](https://github.com/foundation-model-stack/fms-hf-tuning/blob/d7f06f5fc898eb700a9e89f08793b2735d97889c/tuning/data/data_processors.py#L65) API.
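
A custom handler map of this shape might look like the sketch below. The handler `strip_whitespace` is a hypothetical example; only the `name -> callable` shape and the `additional_data_handlers` argument name come from the text above, and the `train()` call itself is omitted because its other arguments depend on the training setup.

```python
from typing import Any, Callable, Dict


def strip_whitespace(element: Dict[str, Any], field: str) -> Dict[str, Any]:
    """Hypothetical custom handler: trim surrounding whitespace in one field."""
    return {**element, field: element[field].strip()}


# Map of name (str) -> function (callable), the shape described above.
additional_data_handlers: Dict[str, Callable[..., Dict[str, Any]]] = {
    "strip_whitespace": strip_whitespace,
}

# The map would be passed to `train()` through its `additional_data_handlers`
# argument. Here the handler is called directly to show what it does.
cleaned = additional_data_handlers["strip_whitespace"]({"text": "  hi  "}, field="text")
```

Once registered, the handler would be requested in the data config by the key used in the map, in the same way as the preexisting handlers.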
194+
195+
## Examples
196+
To see typical use-cases and how handlers are linked together, see [data preprocessing recipes](./data-preprocessing-recipes.md).

docs/advanced-data-preprocessing.md

Lines changed: 3 additions & 76 deletions
@@ -164,84 +164,11 @@ Probably something like this:
164164
Additionally while loading the dataset, users can specify which columns to rename via `rename_columns` and which to retain via `retain_columns` arguments above.
165165
The order of application of these operations is *strictly rename followed by retain* so users should note that an old column name which is renamed will not be available in retain and hence should be careful while applying these operations. The code will throw a `ValueError` in case user specified a column requested to be renamed via rename argument in retain argument as well.
166166

167-
### How can users specify data handlers.
167+
### Data Handlers
168168

169-
Data handlers, as explained above, are routines which process the dataset using [HF map framework](https://huggingface.co/docs/datasets/en/process#map).
170-
All data handler routines are registered with our data preprocessor as a `k:func` object where
171-
`k` is the name (`str`) of the data handler and `func` (`callable`) is the function which is called.
169+
Data handlers, as explained above, are routines which process the dataset using [HF process frameworks](https://huggingface.co/docs/datasets/en/process) including map, filter, remove, select, and rename.
172170

173-
In the data config, users can request which data handler to apply by requesting the corresponding `name`
174-
with which the data handler was registered and specifying the appropriate `arguments`. Each data handler accepts two types of arguments via `DataHandlerArguments` (as defined in the above [schema](#what-is-data-config-schema)), as shown below.
175-
176-
```yaml
177-
DataHandler:
178-
type: object
179-
additionalProperties: false
180-
properties:
181-
name:
182-
type: string
183-
arguments:
184-
$ref: '#/definitions/DataHandlerArguments'
185-
required:
186-
- arguments
187-
- name
188-
title: DataHandler
189-
DataHandlerArguments:
190-
type: object
191-
additionalProperties: false
192-
properties:
193-
remove_columns:
194-
type: string
195-
batched:
196-
type: boolean
197-
fn_kwargs:
198-
$ref: '#/definitions/DataHandlerFnKwargs'
199-
required:
200-
- fn_kwargs
201-
- remove_columns
202-
title: DataHandlerArguments
203-
DataHandlerFnKwargs:
204-
type: object
205-
properties:
206-
str:
207-
type: str
208-
title: DataHandlerFnKwargs
209-
```
210-
211-
Arguments to the data handlers are of two types,
212-
213-
Each data handler is a routine passed to the underlying [HF Map API]((https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.Dataset.map)) so the `kwargs` supported by the underlying API can be passed via the `arguments` section of the data handler config.
214-
215-
For example, users can pass `remove_columns` to remove any columns from the dataset when executing the particular handler or they can use `batched` to ensure [batched processing](https://huggingface.co/docs/datasets/en/about_map_batch) of the data handler.
216-
217-
Users can also pass any number of `kwargs` arguments required for each data handling `routine` function as [`fn_kwargs`](https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.Dataset.map.fn_kwargs) inside the arguments.
218-
219-
#### Preexisting data handlers
220-
This library currently supports the following [preexisting data handlers](https://github.com/foundation-model-stack/fms-hf-tuning/blob/main/tuning/data/data_handlers.py#L156):
221-
- `add_tokenizer_eos_token`:
222-
Appends the tokenizer's EOS token to a specified dataset field.
223-
- `apply_custom_data_formatting_template`:
224-
Applies a custom template (e.g., Alpaca style) to format dataset elements.
225-
By default this handler adds `EOS_TOKEN` which can be disabled by a handler argument, [see](https://github.com/foundation-model-stack/fms-hf-tuning/blob/main/tests/artifacts/predefined_data_configs/apply_custom_template.yaml)
226-
- `tokenize_and_apply_input_masking`:
227-
Tokenizes input text and applies masking to the labels for causal language modeling tasks, good for input/output datasets.
228-
By default this handler adds `EOS_TOKEN` which can be disabled by a handler argument, [see](https://github.com/foundation-model-stack/fms-hf-tuning/blob/main/tests/artifacts/predefined_data_configs/tokenize_and_apply_input_masking.yaml)
229-
- `apply_custom_jinja_template`:
230-
Applies a custom jinja template (e.g., Alpaca style) to format dataset elements.
231-
By default this handler adds `EOS_TOKEN` which can be disabled by a handler argument, [see](https://github.com/foundation-model-stack/fms-hf-tuning/blob/main/tests/artifacts/predefined_data_configs/apply_custom_jinja_template.yaml)
232-
- `apply_tokenizer_chat_template`:
233-
Uses a tokenizer's chat template to preprocess dataset elements, good for single/multi turn chat templates.
234-
- `duplicate_columns`:
235-
Duplicates one column of the dataset to another column.
236-
- `tokenize`:
237-
Tokenizes one column of the dataset passed as input `dataset_text_field`.
238-
239-
These handlers could be requested by their same name and users can lookup the function args from [here](https://github.com/foundation-model-stack/fms-hf-tuning/blob/main/tuning/data/data_handlers.py)
240-
241-
#### Extra data handlers
242-
Users are also allowed to pass custom data handlers using [`sft_trainer.py::train()`](https://github.com/foundation-model-stack/fms-hf-tuning/blob/d7f06f5fc898eb700a9e89f08793b2735d97889c/tuning/sft_trainer.py#L71) API call via the [`additional_data_handlers`](https://github.com/foundation-model-stack/fms-hf-tuning/blob/d7f06f5fc898eb700a9e89f08793b2735d97889c/tuning/sft_trainer.py#L89) argument.
243-
244-
The argument expects users to pass a map similar to the existing data handlers `k(str):func(callable)` which will be registered with the data preprocessor via its [`register_data_handlers`](https://github.com/foundation-model-stack/fms-hf-tuning/blob/d7f06f5fc898eb700a9e89f08793b2735d97889c/tuning/data/data_processors.py#L65) api
171+
For a thorough explanation of data handlers and how to use them, see the [data handlers document](./advanced-data-handlers.md).
245172

246173
### Data Mixing
247174
Dataset mixing allows users to mix multiple datasets often with different `sampling ratios` to ensure the model is trained on a mix of some datasets in specific proportion.
