docs/advanced-data-preprocessing.md
+8 -58 (8 additions & 58 deletions)
@@ -4,7 +4,6 @@ Our library also supports a powerful data processing backend which can be used b
 1. Creating custom data processing pipeline for the datasets.
 1. Combining multiple datasets into one, even if they have different formats.
 1. Mixing datasets as required and sampling each dataset with different weights.
-1. Dynamically mixing datasets online based on training signals through fms_acceleration_odm plugin.

 These things are supported via what we call a [`data_config`](#data-config) which can be passed as an argument to sft trainer.

@@ -137,6 +136,14 @@ Users can create a data config file in any of YAML or JSON format they choose (w
 - `chat_template` (optional, str): pass `chat_template` via data_config for multi-turn data, replaces existing default chat template.
 - `odm` (optional): if `type` is odm, this field is required to be specific to provide configuration for online data mixing.

+Data handlers are customizable components within the data config that allow users to preprocess or manipulate individual datasets. We use [Hugging Face Map API](https://huggingface.co/docs/datasets/en/process#map) to apply these routines.
+These functions can process the dataset in any way users require and the `list` of data handlers specified for each dataset are applied in order.
+Each data handler has:
+- `name`: The handler's unique identifier.
+- `arguments`: A dictionary of parameters specific to the handler.
+
+#### Online data mixing section
+
 `odm`config has the following fields and is required when `datapreprocessor` `type` is `odm`.

 `odm`:
@@ -154,11 +161,6 @@ Users can create a data config file in any of YAML or JSON format they choose (w
 - `split` (optional, dict[str: float]): Defines how to split the dataset into training and validation sets. Requires both `train` and `validation` keys.
 - `data_handlers` (optional, list): A list of data handler configurations which preprocess the dataset.

-Data handlers are customizable components within the data config that allow users to preprocess or manipulate individual datasets. We use [Hugging Face Map API](https://huggingface.co/docs/datasets/en/process#map) to apply these routines.
-These functions can process the dataset in any way users require and the `list` of data handlers specified for each dataset are applied in order.
-Each data handler has:
-- `name`: The handler's unique identifier.
-- `arguments`: A dictionary of parameters specific to the handler.

 We do provide some sample `data_configs` here, [predefined_data_configs](../tests/artifacts/predefined_data_configs/).

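The data handler text moved by the hunks above says that the handlers listed for a dataset are applied in order. A minimal pure-Python sketch of why that order matters follows. It is not the library's implementation (which, per the text, goes through the Hugging Face Map API); the `HANDLERS` registry and the `add_prefix` and `lowercase` handlers are hypothetical names introduced only for illustration.

```python
from typing import Any, Callable

Record = dict[str, Any]


def add_prefix(record: Record, column: str, prefix: str) -> Record:
    """Hypothetical handler: prepend `prefix` to the text in `column`."""
    return {**record, column: prefix + record[column]}


def lowercase(record: Record, column: str) -> Record:
    """Hypothetical handler: lowercase the text in `column`."""
    return {**record, column: record[column].lower()}


# Hypothetical registry mapping a handler `name` to its function.
HANDLERS: dict[str, Callable[..., Record]] = {
    "add_prefix": add_prefix,
    "lowercase": lowercase,
}


def apply_handlers(records: list[Record], handler_configs: list[dict]) -> list[Record]:
    """Apply each configured handler to every record, in list order."""
    for cfg in handler_configs:
        fn = HANDLERS[cfg["name"]]
        fn_kwargs = cfg.get("arguments", {}).get("fn_kwargs", {})
        records = [fn(record, **fn_kwargs) for record in records]
    return records


prefix_cfg = {
    "name": "add_prefix",
    "arguments": {"fn_kwargs": {"column": "input", "prefix": "Q: "}},
}
lower_cfg = {"name": "lowercase", "arguments": {"fn_kwargs": {"column": "input"}}}

data = [{"input": "Hello"}]
prefix_then_lower = apply_handlers(data, [prefix_cfg, lower_cfg])
lower_then_prefix = apply_handlers(data, [lower_cfg, prefix_cfg])
```

Swapping the two handlers changes the result: the prefix is lowercased only when `lowercase` runs after `add_prefix`.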
@@ -203,58 +205,6 @@ We also allow users to pass a [`seed`](https://huggingface.co/docs/datasets/v3.2

 Note: If a user specifies data sampling they can expect the datasets to be mixed and individual samples in the dataset to not be broken unless the max_seq_len argument is smaller than the length of individual samples in the dataset

-### Online Data Mixing
-Dataset mixing can be dynamic in nature that adapts online during the training based on the training signals. We provide this feature through fms_acceleration_odm plugin and more details can be found [here](https://github.com/foundation-model-stack/fms-acceleration/tree/main/plugins/online-data-mixing).
-
-#### How to Use
-
-`dataprocessor` `type` has to be set to `odm` and then `odm` config should be provided in the `odm` section of the data config file. An example is shown below:
-
-```yaml
-dataprocessor:
-  type: odm
-  odm:
-    update_interval: 1 # update every step
-    sampling_interval: 1 # sample category for every sample
-    reward_type: validation_loss # uses eval loss of each dataset as reward
-    gamma: 0.1 # MAB hyper-parameter
-    eta: 0.2 # MAB hyper-parameter
-```
-
-Here `update_interval` is set to `1` which is to update MAB on every step with validation loss as reward across the datasets. `sampling_interval` is set to `1` which is to choose a dataset to sample for every sample. `reward_type` is set to `validation_loss` to use validation loss across datasets as a training signal to reward MAB decisions during training. Example `datasets` section can look like below:
-
-```yaml
-datasets:
-  - name: dataset_1
-    split:
-      train: 0.8
-      validation: 0.2
-    data_paths:
-      - "FILE_PATH"
-    data_handlers:
-      - name: tokenize_and_apply_input_masking
-        arguments:
-          remove_columns: all
-          batched: false
-          fn_kwargs:
-            input_column_name: input
-            output_column_name: output
-  - name: dataset_2
-    split:
-      train: 0.9
-      validation: 0.1
-    data_paths:
-      - "FILE_PATH"
-    data_handlers:
-      - name: tokenize_and_apply_input_masking
-        arguments:
-          remove_columns: all
-          batched: false
-          fn_kwargs:
-            input_column_name: input
-            output_column_name: output
-```
-As you notice, `validation` under `split` is provided for each of the datasets and is necessary to be provided since the `reward_type` is `validation_loss` which requires validation datasets to be available. Same applies to the following rewards: `validation_loss`, `entropy`, `entropy3_varent1`, and `entropy_last_token`. While reward_types `train_loss` and `gradnorm` do not require validation split.
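The last removed paragraph states a rule: some `reward_type` values need a `validation` split on every dataset, while `train_loss` and `gradnorm` do not. The rule can be sketched as a small check over an already-parsed data config. This is an illustration only, not part of the library: the function name `datasets_missing_validation` is hypothetical, and the sketch assumes the `odm` block is nested under `dataprocessor`.

```python
# Reward types that need a validation split, as listed in the removed docs text.
REWARDS_NEEDING_VALIDATION = {
    "validation_loss",
    "entropy",
    "entropy3_varent1",
    "entropy_last_token",
}


def datasets_missing_validation(data_config: dict) -> list[str]:
    """Return names of datasets that lack a positive `validation` split
    when the configured ODM reward type requires one.

    Hypothetical helper; assumes `odm` is nested under `dataprocessor`.
    """
    processor = data_config.get("dataprocessor", {})
    if processor.get("type") != "odm":
        return []
    reward_type = processor.get("odm", {}).get("reward_type")
    if reward_type not in REWARDS_NEEDING_VALIDATION:
        return []
    missing = []
    for dataset in data_config.get("datasets", []):
        split = dataset.get("split") or {}
        if not split.get("validation", 0) > 0:
            missing.append(dataset.get("name", "<unnamed>"))
    return missing


# The config as a plain dict, e.g. the result of parsing the YAML file.
config = {
    "dataprocessor": {"type": "odm", "odm": {"reward_type": "validation_loss"}},
    "datasets": [
        {"name": "dataset_1", "split": {"train": 0.8, "validation": 0.2}},
        {"name": "dataset_2", "split": {"train": 1.0}},
    ],
}
missing = datasets_missing_validation(config)
```

Here `missing` contains only `dataset_2`, since `dataset_1` defines a validation split. With `reward_type` set to `train_loss` or `gradnorm`, the check returns an empty list.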