**`docs/advanced-data-preprocessing.md`** (+219 lines)
Our library also supports a powerful data processing backend which can be used by users to perform custom data preprocessing.

These things are supported via what we call a [`data_config`](#data-config), which can be passed as an argument to the SFT trainer.
## Supported Data File Formats

We support the following file formats via the `--training_data_path` argument:

Data Format | Tested Support
------------|---------------
JSON | ✅
JSONL | ✅
PARQUET | ✅
ARROW | ✅
As noted above, we also support passing a HF dataset ID directly via the `--training_data_path` argument.
**NOTE**: Due to the variety of supported data formats and file types, `--training_data_path` is handled as follows:

- If `--training_data_path` ends in a valid file extension (e.g., `.json`, `.csv`), it is treated as a file.
- If `--training_data_path` points to a valid folder, it is treated as a folder.
- If neither of these is true, the data preprocessor tries to load `--training_data_path` as a Hugging Face (HF) dataset ID.
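The resolution order above can be sketched as follows (a minimal illustration of the documented behavior, not the library's actual code; the extension set shown is an assumption):

```python
from pathlib import Path

# Illustrative subset of extensions the preprocessor recognizes as data files.
VALID_EXTENSIONS = {".json", ".jsonl", ".parquet", ".arrow", ".csv"}

def resolve_training_data_path(path_str: str) -> str:
    """Mimic the documented resolution order for --training_data_path."""
    # 1. A recognized file extension means the path is treated as a file.
    if Path(path_str).suffix.lower() in VALID_EXTENSIONS:
        return "file"
    # 2. An existing directory is treated as a folder of data files.
    if Path(path_str).is_dir():
        return "folder"
    # 3. Otherwise, fall back to interpreting it as a HF dataset ID.
    return "hf_dataset_id"
```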
## Data Config

Data config is a configuration file which `sft_trainer.py` supports as an argument via the `--data_config_path` flag.

This can add extra backslashes to your chat template, causing it to become invalid.
We provide some example data configs [here](../tests/artifacts/predefined_data_configs/)
# Use cases supported via command line argument `training_data_path`

For basic users who want to pass command line arguments directly to our stack, refer to the following supported data formats.
### 1. Data formats with a single sequence and a specified response_template to use for masking on completion.
#### 1.1 Pre-process the dataset

Pre-process the dataset so that each data instance contains a single sequence combining input + response. The trainer is configured to expect a `response template` as a string. For example, preparing `alpaca` format data to feed into this trainer can be done with the following code.
```python
PROMPT_DICT = {
    "prompt_input": (
        "Below is an instruction that describes a task, paired with an input that provides further context. "
        "Write a response that appropriately completes the request.\n\n"
        "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
    ),
}
```

The `response template` corresponding to the above dataset and the `Llama` tokenizer is: `\n### Response:`.
The same approach can be applied to any dataset; more info can be found [here](https://huggingface.co/docs/trl/main/en/sft_trainer#format-your-input-prompts).

Once the data is converted using the formatting function, pass the `dataset_text_field` containing the single sequence to the trainer.
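A minimal sketch of such a formatting function (assuming the standard alpaca field names `instruction`, `input`, and `output`; the resulting string would be stored under the column passed as `dataset_text_field`):

```python
# Alpaca-style prompt; the rendered prompt plus the response forms one sequence.
ALPACA_PROMPT = (
    "Below is an instruction that describes a task, paired with an input that provides further context. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
)

def format_alpaca_fn(example: dict) -> dict:
    # Render the prompt and append the response, yielding one training sequence.
    return {"output": ALPACA_PROMPT.format_map(example) + " " + example["output"]}

example = {
    "instruction": "Classify the entity in the text.",
    "input": "Colorado is a state in USA",
    "output": "USA : Location",
}
formatted = format_alpaca_fn(example)["output"]
```

Note how the `\n### Response:` response template appears in the formatted sequence, which is what allows completion-only masking downstream.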
#### 1.2 Format the dataset on the fly

Pass a dataset and a `data_formatter_template` to apply the formatting function on the fly while tuning. The template should specify fields of the dataset with `{{field}}`. While tuning, the data will be converted to a single sequence using the template. Data fields can contain alphanumeric characters, spaces and the following special symbols: ".", "_", "-".

Formatting will happen on the fly while tuning, and the keys in the template should match fields in the dataset file. The `response template` corresponding to the above template will need to be supplied; in this case, `response template` = `\n## Label:`.
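As an illustration of how such a template renders (a sketch only, not the library's internal implementation; the `input`/`label` field names are hypothetical):

```python
import re

def render_formatter_template(template: str, example: dict) -> str:
    # Substitute each {{field}} placeholder with the matching dataset value.
    # Field names may contain alphanumerics, spaces, ".", "_" and "-".
    pattern = re.compile(r"\{\{([A-Za-z0-9 ._\-]+)\}\}")
    return pattern.sub(lambda m: str(example[m.group(1).strip()]), template)

template = "### Input: {{input}} \n## Label: {{label}}"
example = {"input": "Colorado is a state in USA", "label": "USA : Location"}
rendered = render_formatter_template(template, example)
```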
##### In conclusion, if using the response_template and single sequence, either the `data_formatter_template` argument or `dataset_text_field` needs to be supplied to the trainer.
### 2. Dataset with input and output fields (no response template)
Pass a [supported dataset](#supported-data-file-formats) containing fields `"input"` with source text and `"output"` with class labels. Pre-format the input as you see fit. The output field will simply be concatenated to the end of the input to create a single sequence, and the input will be masked.

The `"input"` and `"output"` field names are mandatory and cannot be changed.
Example: For a JSONL dataset like `Train.jsonl`

```
{"input": "### Input: Colorado is a state in USA ### Output:", "output": "USA : Location"}
{"input": "### Input: Arizona is also a state in USA ### Output:", "output": "USA : Location"}
```
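Conceptually, the concatenation and input masking behave like this sketch (token ids are illustrative, and `-100` follows the standard Hugging Face ignore-index convention; the library's actual implementation may differ):

```python
IGNORE_INDEX = -100  # HF convention: positions labeled -100 are excluded from the loss

def build_masked_labels(input_token_ids: list, output_token_ids: list):
    # Concatenate input and output into one sequence; mask the input portion
    # so the model only learns to produce the output.
    input_ids = input_token_ids + output_token_ids
    labels = [IGNORE_INDEX] * len(input_token_ids) + output_token_ids
    return input_ids, labels

ids, labels = build_masked_labels([101, 7592, 102], [2023, 3231])
```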
### 3. Chat Style Single/Multi turn datasets

Pass a dataset containing single- or multi-turn chat data. Your dataset could follow this format:

```
$ head -n 1 train.jsonl
{"messages": [{"content": "You are an AI language model developed by IBM Research. You are a cautious assistant. You carefully follow instructions. You are helpful and harmless and you follow ethical guidelines and promote positive behavior.", "role": "system"}, {"content": "Look up a word that rhymes with exist", "role": "user"}, {"content": "I found a word that rhymes with \"exist\":\n1\\. Mist", "role": "assistant"}], "group": "lab_extension", "dataset": "base/full-extension", "metadata": "{\"num_turns\": 1}"}
```
This format supports both single and multi-turn chat scenarios.
The chat template used to render the dataset will default to `tokenizer.chat_template` from the model's tokenizer configuration. This can be overridden using the `--chat_template <chat-template-string>` argument. For example, models like [ibm-granite/granite-3.0-8b-instruct](https://huggingface.co/ibm-granite/granite-3.0-8b-instruct), which include a [chat template](https://huggingface.co/ibm-granite/granite-3.0-8b-instruct/blob/e0a466fb25b9e07e9c2dc93380a360189700d1f8/tokenizer_config.json#L188) in their `tokenizer_config.json`, do not require users to provide a chat template to process the data.
Users do need to pass `--response_template` and `--instruction_template`, which are pieces of text representing the start of the `assistant` and `human` responses inside the formatted chat template.

For the [granite model above](https://huggingface.co/ibm-granite/granite-3.0-8b-instruct/blob/main/tokenizer_config.json#L188), for example, the values are as follows.

The code internally uses [`DataCollatorForCompletionOnlyLM`](https://github.com/huggingface/trl/blob/main/trl/trainer/utils.py#L93) to perform masking of text, ensuring the model learns only from the `assistant` responses for both single and multi-turn chat.
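Conceptually, completion-only masking keeps the loss only on the text between each response-template marker and the next instruction-template marker. A simplified character-level sketch (the marker strings here are illustrative; the actual collator works on token ids):

```python
def assistant_spans(text: str, response_marker: str, instruction_marker: str):
    # Collect (start, end) character spans that stay unmasked: everything from
    # a response marker up to the next instruction marker (or end of text).
    spans = []
    i = 0
    while True:
        start = text.find(response_marker, i)
        if start == -1:
            break
        start += len(response_marker)
        end = text.find(instruction_marker, start)
        if end == -1:
            end = len(text)
        spans.append((start, end))
        i = end
    return spans

chat = "<|user|>hi<|assistant|>hello<|user|>bye<|assistant|>later"
spans = assistant_spans(chat, "<|assistant|>", "<|user|>")
```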
#### Aligning dataset formats

In some cases the chat template might not be aligned with the data format of the dataset. For example, consider the following data sample, and suppose we want to use the list of contents associated with the `messages` key for our multi-turn training job.
```
{
  "messages": [
    {"content": "You are an AI...", "role": "system"},
    {"content": "Look up a word...", "role": "user"},
    {"content": "A word that rhymes is 'mist'", "role": "assistant"}
  ],
  "group": "lab_extension",
  "dataset": "base/full-extension",
  "metadata": "{\"num_turns\": 2}"
}
```
Different chat templates support different data formats, and the chat template might not always align with the data format of the dataset.
Here is an example of a chat template that iterates over the nested data sample by addressing the `messages` key via `for message in messages['messages']`:

```
{% for message in messages['messages'] %}\
...
{% if loop.last and add_generation_prompt %}{{ '<|assistant|>' }}\
{% endif %}\
{% endfor %}
```
While the above template might be suitable for certain data formats, not all chat templates access the nested contents in a data sample.
In the following example, notice the `for message in messages` line, which does not access any nested contents in the data and expects the nested content to be passed directly to the chat template.

When working with multi-turn datasets, it's often necessary to extract specific fields from the data depending on the format. For example, in many multi-turn datasets, conversations may be stored under a dedicated key (e.g., `conversations`, `messages`, etc.), and you may only need the content of that key for processing.
```
{
  "conversations": [
    {"content": "You are an AI...", "role": "system"},
    {"content": "Look up a word...", "role": "user"},
    {"content": "A word that rhymes is 'mist'", "role": "assistant"}
  ],
  "group": "lab_extension",
  "dataset": "base/full-extension",
  "metadata": "{\"num_turns\": 2}"
}
```
To extract and use the conversations field, pass the following flag when running:

```
--dataset_conversation_field "conversations"
```
*Note:* In most cases, users of `Granite 3.1+ Instruct` series models, which already contain a chat template, should pass `--dataset_conversation_field "messages"` when using multi-turn data on the command line, or use the `conversations_column` argument in the [data handler](https://github.com/foundation-model-stack/fms-hf-tuning/blob/30ceecc63f3e2bf3aadba2dfc3336b62187c240f/tests/artifacts/predefined_data_configs/mt_data_granite_3_1B_tokenize_and_mask_handler.yaml#L63) which processes the chat template.

We recommend inspecting the data and chat template to decide if you need to pass this flag.
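The effect of the flag can be sketched as a simple field extraction (a conceptual illustration, not the library's code):

```python
def get_conversation(sample, conversation_field=None):
    # With --dataset_conversation_field set, only that key's list of turns is
    # handed to the chat template; otherwise the whole sample is passed through.
    return sample[conversation_field] if conversation_field is not None else sample

sample = {
    "conversations": [
        {"content": "Look up a word that rhymes with exist", "role": "user"},
        {"content": "I found a word: mist", "role": "assistant"},
    ],
    "group": "lab_extension",
}
turns = get_conversation(sample, "conversations")
```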
### Guidelines

Depending on the scenario, users might need to decide how to use a chat template with their data, or which chat template to use for their use case.

The following flow chart summarizes our guidelines:

![guidelines_chat_template](images/guidelines_chat_template.jpg)

Here are some scenarios addressed in the flow chart:

1. Depending on the model, the tokenizer may or may not have a chat template.
2. If the template is available, the `json object schema` of the dataset might not match the chat template's `string format`.
3. There might be special tokens used in the chat template which the tokenizer is unaware of, for example `<|start_of_role|>`, which can cause issues during tokenization as it might not be treated as a single token.
#### Add Special Tokens

Working with multi-turn chat data might require the tokenizer to use a few new control tokens (e.g. `<|assistant|>`, `[SYS]`) as described above in the guidelines. These special tokens might not be present in the tokenizer's vocabulary if the user is using a base model.

Users can pass the `--add_special_tokens` argument, which adds the required tokens to the tokenizer's vocabulary.

For example, required special tokens used in `--instruction_template`/`--response_template` can be passed as follows:

Users can also pass a pretokenized dataset (containing `input_ids` and `labels` columns) via the `--training_data_path` argument, e.g.
At this time, the data preprocessor does not add EOS tokens to pretokenized datasets; users must ensure EOS tokens are included in their pretokenized data if needed.
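Since EOS is not added automatically for pretokenized data, users can append it themselves before tuning. A minimal sketch (the EOS token id `2` is purely illustrative; use your tokenizer's actual `eos_token_id`):

```python
def append_eos(example: dict, eos_token_id: int) -> dict:
    # Append EOS to input_ids and labels when missing; the preprocessor itself
    # will not do this for pretokenized datasets.
    if example["input_ids"] and example["input_ids"][-1] != eos_token_id:
        example["input_ids"] = example["input_ids"] + [eos_token_id]
        example["labels"] = example["labels"] + [eos_token_id]
    return example

row = append_eos({"input_ids": [5, 6, 7], "labels": [-100, 6, 7]}, eos_token_id=2)
```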
**`docs/ept.md`** (+22 −12 lines)
Let's say you have a `JSONL` data file which contains text to be trained on in each line.

Example dataset,
```
{"text":"I am one sample which doesn't exceed the max seq length"}
{"text":"I am also another sample which doesn't exceed the max seq length"}
...
```
Sample data config for the above use case.

```
dataprocessor:
  type: default
  streaming: false
datasets:
  - name: apply_custom_jinja_template
    data_paths:
      - "FILE_PATH"
    data_handlers:
      - name: apply_custom_jinja_template
        arguments:
          remove_columns: all
          batched: false
          fn_kwargs:
            formatted_text_column_name: "formatted_text"
            template: '{{element["text"]}}{{eos_token}}'
      - name: tokenize
        arguments:
          remove_columns: all
          batched: false
          fn_kwargs:
            text_column_name: "formatted_text"
            truncation: false
            max_length: 4096
```
And the command line passed to the library should include the following.

```
--data_config_path <path to the data config> --packing=True --max_seq_len 8192
```
Please note that for a non-tokenized dataset, our code adds `EOS_TOKEN` to the lines (e.g. the `text` column) before tokenizing and passing that as a dataset.
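The handler's Jinja template from the config above appends the EOS token to each sample's text. A sketch of how it renders (using the `jinja2` package directly; the `</s>` EOS string is illustrative, the real one comes from the tokenizer, and the actual handler internals may differ):

```python
from jinja2 import Template

# Same template string as in the data config above.
template = Template('{{element["text"]}}{{eos_token}}')
formatted = template.render(
    element={"text": "I am one sample which doesn't exceed the max seq length"},
    eos_token="</s>",  # illustrative; use tokenizer.eos_token in practice
)
```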
For each of the requested trackers, the code expects you to pass a config to the `sft_trainer.train` function, which can be specified through the `tracker_configs` argument [here](https://github.com/foundation-model-stack/fms-hf-tuning/blob/a9b8ec8d1d50211873e63fa4641054f704be8712/tuning/sft_trainer.py#L78), details of which are present below.
> Note: After installing, if you wish to use [FlashAttention](https://github.com/Dao-AILab/flash-attention), then you need to install these requirements:

```sh
pip install fms-hf-tuning[dev]
pip install fms-hf-tuning[flash-attn]
```
[FlashAttention](https://github.com/Dao-AILab/flash-attention) requires the [CUDA Toolkit](https://developer.nvidia.com/cuda-toolkit) to be pre-installed.

*Debug recommendation:* While training, if you encounter flash-attn errors such as `undefined symbol`, you can follow the steps below for a clean installation of the flash binaries. This may occur when multiple environments share the pip cache directory or when the torch version is updated.
```sh
pip uninstall flash-attn
pip cache purge
pip install fms-hf-tuning[flash-attn]
```
## Using FMS-Acceleration

`fms-acceleration` is a collection of plugins that accelerate fine-tuning / training of large models, as part of the `fms-hf-tuning` suite. For more details see [this document](./docs/tuning-techniques.md#fms-acceleration).

If you wish to use [fms-acceleration](https://github.com/foundation-model-stack/fms-acceleration), you need to install it.
```
pip install fms-hf-tuning[fms-accel]
```
## Using Experiment Trackers

Experiment tracking in fms-hf-tuning allows users to track their experiments with known trackers like [Aimstack](https://aimstack.io/), [MLflow Tracking](https://mlflow.org/docs/latest/tracking.html), [Clearml Tracking](https://clear.ml/), or custom trackers built into the code like `FileLoggingTracker`.
The code currently supports these trackers out of the box:

- `FileLoggingTracker`: A built-in tracker which supports logging training loss to a file. Since this is built in, there is no need to install anything.
- `Aimstack`: A popular open source tracker which can be used to track any metrics or metadata from the experiments. Install by running `pip install fms-hf-tuning[aim]`
- `MLflow Tracking`: Another popular open source tracker which stores metrics, metadata or even artifacts from experiments. Install by running `pip install fms-hf-tuning[mlflow]`
- `Clearml Tracking`: Another open source tracker which stores metrics, metadata or even artifacts from experiments. Install by running `pip install fms-hf-tuning[clearml]`
Note: All trackers expect some arguments or can be customized by passing command line arguments, which are described in our document on [experiment tracking](./experiment-tracking.md). For further details on enabling and using the trackers, refer to that document.
## Training Mamba Models

To train Mamba models, one needs the `mamba-ssm` package installed in a version compatible with fms-hf-tuning to ensure optimal training. Not using this package while training Mamba models can result in higher resource usage and suboptimal performance.