Commit bc39f95
feat: Restructure README (#598)
* restructure README
* split readme
* Update README.md
* update readme
* Update README.md

Signed-off-by: Dushyant Behl <dushyantbehl@in.ibm.com>
Signed-off-by: Dushyant Behl <dushyantbehl@users.noreply.github.com>
Co-authored-by: Praveen Jayachandran <praveenj83@users.noreply.github.com>
1 parent 7e261d2 commit bc39f95

8 files changed

Lines changed: 1224 additions & 1073 deletions

README.md

Lines changed: 49 additions & 1058 deletions

docs/advanced-data-preprocessing.md

Lines changed: 219 additions & 0 deletions
@@ -7,6 +7,23 @@ Our library also supports a powerful data processing backend which can be used b
These things are supported via what we call a [`data_config`](#data-config), which can be passed as an argument to the SFT trainer.

## Supported Data File Formats

We support the following file formats via the `--training_data_path` argument:

Data Format | Tested Support
------------|---------------
JSON | ✅
JSONL | ✅
PARQUET | ✅
ARROW | ✅

As noted above, we also support passing a HF dataset ID directly via the `--training_data_path` argument.

**NOTE**: Due to the variety of supported data formats and file types, `--training_data_path` is handled as follows:
- If `--training_data_path` ends in a valid file extension (e.g., `.json`, `.csv`), it is treated as a file.
- If `--training_data_path` points to a valid folder, it is treated as a folder.
- If neither of these is true, the data preprocessor tries to load `--training_data_path` as a Hugging Face (HF) dataset ID.
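The three resolution rules above can be sketched as a small helper. This is an illustrative sketch, not the library's actual code: the function name `resolve_training_data_path` and the extension set are assumptions for demonstration.

```python
import os

# Illustrative subset of extensions the preprocessor treats as data files.
KNOWN_EXTENSIONS = {".json", ".jsonl", ".csv", ".parquet", ".arrow"}

def resolve_training_data_path(path: str) -> str:
    """Classify a --training_data_path value as a file, folder, or HF dataset ID."""
    _, ext = os.path.splitext(path)
    if ext.lower() in KNOWN_EXTENSIONS:
        return "file"
    if os.path.isdir(path):
        return "folder"
    # Neither a known file extension nor an existing folder: fall back to
    # treating the string as a Hugging Face dataset ID.
    return "hf_dataset_id"

print(resolve_training_data_path("data/train.jsonl"))  # file
print(resolve_training_data_path("org/my-dataset"))    # hf_dataset_id
```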
## Data Config

Data config is a configuration file which `sft_trainer.py` accepts as an argument via the `--data_config_path` flag. In this
@@ -320,4 +337,206 @@ This can add extra backslashes to your chat template causing it to become invalid.

We provide some example data configs [here](../tests/artifacts/predefined_data_configs/).

# Use cases supported via command line argument `training_data_path`

For basic users who want to pass command line arguments directly to our stack, the following data formats are supported.

### 1. Data formats with a single sequence and a specified response_template to use for masking on completion

#### 1.1 Pre-process the dataset
Pre-process the dataset so that each data instance contains a single sequence combining input and response. The trainer expects a `response template` as a string. For example, preparing `alpaca`-format data for this trainer can be done with the following code.

```python
import datasets

PROMPT_DICT = {
    "prompt_input": (
        "Below is an instruction that describes a task, paired with an input that provides further context. "
        "Write a response that appropriately completes the request.\n\n"
        "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
    ),
    "prompt_no_input": (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        "### Instruction:\n{instruction}\n\n### Response:"
    ),
}

def format_alpaca_fn(example):
    prompt_input, prompt_no_input = PROMPT_DICT['prompt_input'], PROMPT_DICT['prompt_no_input']
    output = prompt_input.format_map(example) if example.get("input", "") != "" else prompt_no_input.format_map(example)
    output = f"{output} {example['output']}"
    return {"output": output}

ds = datasets.load_dataset('json', data_files='./stanford_alpaca/alpaca_data.json')

alpaca_ds = ds['train'].map(format_alpaca_fn, remove_columns=['instruction', 'input'])
alpaca_ds.to_json("sft_alpaca_data.json")
```

The `response template` corresponding to the above dataset and the `Llama` tokenizer is `\n### Response:`.

The same approach can be applied to any dataset; more info can be found [here](https://huggingface.co/docs/trl/main/en/sft_trainer#format-your-input-prompts).

Once the data is converted using the formatting function, pass the `dataset_text_field` containing the single sequence to the trainer.

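As a quick sanity check, the Alpaca formatting logic above can be exercised on a single record without loading any dataset. This standalone sketch re-declares the no-input prompt template for illustration:

```python
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:"
)

def format_alpaca_fn(example):
    # This record has no "input" field, so the no-input prompt is used.
    output = PROMPT_NO_INPUT.format_map(example)
    return {"output": f"{output} {example['output']}"}

record = {"instruction": "Name a primary color.", "output": "Red."}
formatted = format_alpaca_fn(record)["output"]
print(formatted.endswith("### Response: Red."))  # True
```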
#### 1.2 Format the dataset on the fly
Pass a dataset and a `data_formatter_template` to apply the formatting function on the fly while tuning. The template should reference fields of the dataset with `{{field}}`. While tuning, the data will be converted to a single sequence using the template. Data field names can contain alphanumeric characters, spaces and the following special symbols: ".", "_", "-".

Example `Train.json`:
`[{ "input" : <text>, "output" : <text> }, ... ]`

data_formatter_template: `### Input: {{input}} \n\n## Label: {{output}}`

Formatting will happen on the fly while tuning. The keys in the template should match fields in the dataset file. The `response template` corresponding to the above template must also be supplied; in this case, `response template` = `\n## Label:`.
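A minimal sketch of the substitution that `data_formatter_template` performs. This re-implements the `{{field}}` replacement for illustration only; the library's own handler may differ in details:

```python
import re

def apply_formatter_template(template: str, example: dict) -> str:
    """Replace each {{field}} placeholder with the matching dataset field."""
    def substitute(match: re.Match) -> str:
        return str(example[match.group(1)])
    # Field names may contain alphanumerics, underscores, spaces, '.' and '-'.
    return re.sub(r"\{\{([\w .\-]+)\}\}", substitute, template)

template = "### Input: {{input}} \n\n## Label: {{output}}"
row = {"input": "hello", "output": "greeting"}
print(apply_formatter_template(template, row))
```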
##### In conclusion, if using the response_template and single sequence, either the `data_formatter_template` argument or `dataset_text_field` needs to be supplied to the trainer.

### 2. Dataset with input and output fields (no response template)

Pass a [supported dataset](#supported-data-file-formats) containing fields `"input"` with source text and `"output"` with class labels. Pre-format the input as you see fit. The output field will simply be concatenated to the end of the input to create a single sequence, and the input will be masked.

The `"input"` and `"output"` field names are mandatory and cannot be changed.

Example: for a JSON Lines dataset like `Train.jsonl`

```
{"input": "### Input: Colorado is a state in USA ### Output:", "output": "USA : Location"}
{"input": "### Input: Arizona is also a state in USA ### Output:", "output": "USA : Location"}
```
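Conceptually, the trainer concatenates `input` and `output` into one sequence and masks the input positions in the labels so the loss is computed only on the output tokens. A toy sketch with made-up token ids (the real pipeline works on tokenizer output):

```python
IGNORE_INDEX = -100  # positions with this label are ignored by the loss

def build_masked_example(input_ids, output_ids):
    """Concatenate input and output ids, masking input positions in labels."""
    sequence = list(input_ids) + list(output_ids)
    labels = [IGNORE_INDEX] * len(input_ids) + list(output_ids)
    return sequence, labels

seq, labels = build_masked_example([101, 7592], [4067, 102])
print(seq)     # [101, 7592, 4067, 102]
print(labels)  # [-100, -100, 4067, 102]
```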
### 3. Chat Style Single/Multi turn datasets

Pass a dataset containing single- or multi-turn chat data. Your dataset could follow this format:

```
$ head -n 1 train.jsonl
{"messages": [{"content": "You are an AI language model developed by IBM Research. You are a cautious assistant. You carefully follow instructions. You are helpful and harmless and you follow ethical guidelines and promote positive behavior.", "role": "system"}, {"content": "Look up a word that rhymes with exist", "role": "user"}, {"content": "I found a word that rhymes with \"exist\":\n1\\. Mist", "role": "assistant"}], "group": "lab_extension", "dataset": "base/full-extension", "metadata": "{\"num_turns\": 1}"}
```

This format supports both single and multi-turn chat scenarios.

The chat template used to render the dataset defaults to `tokenizer.chat_template` from the model's tokenizer configuration. This can be overridden using the `--chat_template <chat-template-string>` argument. For example, models like [ibm-granite/granite-3.0-8b-instruct](https://huggingface.co/ibm-granite/granite-3.0-8b-instruct), which include a [chat template](https://huggingface.co/ibm-granite/granite-3.0-8b-instruct/blob/e0a466fb25b9e07e9c2dc93380a360189700d1f8/tokenizer_config.json#L188) in their `tokenizer_config.json`, do not require users to provide a chat template to process the data.

Users do need to pass `--response_template` and `--instruction_template`, which are the pieces of text marking the start of the `assistant` and `user` turns inside the formatted chat template.
For the [granite model above](https://huggingface.co/ibm-granite/granite-3.0-8b-instruct/blob/main/tokenizer_config.json#L188), for example, the values would be:
```
--instruction_template "<|start_of_role|>user<|end_of_role|>"
--response_template "<|start_of_role|>assistant<|end_of_role|>"
```

The code internally uses [`DataCollatorForCompletionOnlyLM`](https://github.com/huggingface/trl/blob/main/trl/trainer/utils.py#L93) to perform masking of the text, ensuring the model learns only from the `assistant` responses for both single- and multi-turn chat.

#### Aligning dataset formats

In some cases the chat template might not be aligned with the data format of the dataset. For example, consider the following data sample, and suppose we want to use the list of contents associated with the `messages` key for our multi-turn training job.

```
{
  "messages": [
    {"content": "You are an AI...", "role": "system"},
    {"content": "Look up a word...", "role": "user"},
    {"content": "A word that rhymes is 'mist'", "role": "assistant"}
  ],
  "group": "lab_extension",
  "dataset": "base/full-extension",
  "metadata": "{\"num_turns\": 2}"
}
```

Different chat templates support different data formats, and the chat template might not always align with the data format of the dataset.

Here is an example of a chat template that iterates over the nested data sample by addressing the `messages` key via `for message in messages['messages']`:
```
{% for message in messages['messages'] %}\
{% if message['role'] == 'user' %}{{ '<|user|>\n' + message['content'] + eos_token }}\
{% elif message['role'] == 'system' %}{{ '<|system|>\n' + message['content'] + eos_token }}\
{% elif message['role'] == 'assistant' %}{{ '<|assistant|>\n' + message['content'] + eos_token }}\
{% endif %}\
{% if loop.last and add_generation_prompt %}{{ '<|assistant|>' }}\
{% endif %}\
{% endfor %}
```

While the above template might be suitable for certain data formats, not all chat templates access the nested contents of a data sample.

In the following example, notice the `for message in messages` line, which does not access any nested contents in the data and expects the message list to be passed directly to the chat template:

```
{%- for message in messages %}\
{%- if message['role'] == 'system' %}\
{{- '<|system|>\n' + message['content'] + '\n' }}\
{%- elif message['role'] == 'user' %}\
{{- '<|user|>\n' + message['content'] + '\n' }}\
{%- elif message['role'] == 'assistant' %}\
{%- if not loop.last %}\
{{- '<|assistant|>\n' + message['content'] + eos_token + '\n' }}\
{%- else %}\
{{- '<|assistant|>\n' + message['content'] + eos_token }}\
{%- endif %}\
{%- endif %}\
{%- if loop.last and add_generation_prompt %}\
{{- '<|assistant|>\n' }}\
{%- endif %}\
{%- endfor %}
```
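A pure-Python sketch of what the flat template above produces for a short conversation. This only builds strings to illustrate the rendering logic; in practice the Jinja template is rendered by the tokenizer's chat-template machinery, and the `</s>` EOS string here is an assumption:

```python
def render_flat_template(messages, eos_token="</s>", add_generation_prompt=False):
    """Mimic the flat chat template: expects a plain list of message dicts."""
    parts = []
    for i, message in enumerate(messages):
        role, content = message["role"], message["content"]
        if role == "system":
            parts.append(f"<|system|>\n{content}\n")
        elif role == "user":
            parts.append(f"<|user|>\n{content}\n")
        elif role == "assistant":
            # The final assistant turn gets the EOS token with no trailing newline.
            suffix = eos_token if i == len(messages) - 1 else eos_token + "\n"
            parts.append(f"<|assistant|>\n{content}{suffix}")
    if add_generation_prompt:
        parts.append("<|assistant|>\n")
    return "".join(parts)

chat = [
    {"role": "user", "content": "Look up a word that rhymes with exist"},
    {"role": "assistant", "content": "Mist"},
]
print(render_flat_template(chat))
```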

When working with multi-turn datasets, it's often necessary to extract specific fields from the data depending on the format. For example, in many multi-turn datasets, conversations may be stored under a dedicated key (e.g., `conversations`, `messages`, etc.), and you may only need the content of that key for processing.

```
{
  "conversations": [
    {"content": "You are an AI...", "role": "system"},
    {"content": "Look up a word...", "role": "user"},
    {"content": "A word that rhymes is 'mist'", "role": "assistant"}
  ],
  "group": "lab_extension",
  "dataset": "base/full-extension",
  "metadata": "{\"num_turns\": 2}"
}
```

To extract and use the `conversations` field, pass the following flag when running:
```
--dataset_conversation_field "conversations"
```
502+

*Note:* In most cases, users of `Granite 3.1+ Instruct` series models, which already include a chat template, should pass `--dataset_conversation_field "messages"` when using multi-turn data on the command line, or use the `conversations_column` argument in the [data handler](https://github.com/foundation-model-stack/fms-hf-tuning/blob/30ceecc63f3e2bf3aadba2dfc3336b62187c240f/tests/artifacts/predefined_data_configs/mt_data_granite_3_1B_tokenize_and_mask_handler.yaml#L63) which processes the chat template.

We recommend inspecting the data and chat template to decide whether you need to pass this flag.

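A small sketch of the extraction the flag performs: pick the list of turns out of a sample whose conversations live under a dedicated key. The key name varies per dataset, which is exactly why the flag exists; the function name here is illustrative, not the library's API:

```python
def extract_conversation(sample: dict, conversation_field: str):
    """Return the list of chat turns stored under the given key."""
    if conversation_field not in sample:
        raise KeyError(f"sample has no '{conversation_field}' key")
    return sample[conversation_field]

sample = {
    "conversations": [
        {"role": "user", "content": "Look up a word..."},
        {"role": "assistant", "content": "Mist"},
    ],
    "group": "lab_extension",
}
turns = extract_conversation(sample, "conversations")
print(len(turns))  # 2
```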

### Guidelines

Depending on the scenario, users might need to decide how to use a chat template with their data, or which chat template to use for their use case.

The following flow chart summarizes our guidelines:
![guidelines for chat template](docs/images/chat_template_guide.jpg)

Here are some scenarios addressed in the flow chart:
1. Depending on the model, the tokenizer may or may not have a chat template.
2. If the template is available, the `json object schema` of the dataset might not match the chat template's `string format`.
3. There might be special tokens used in the chat template which the tokenizer is unaware of, for example `<|start_of_role|>`, which can cause issues during tokenization as it might not be treated as a single token.

#### Add Special Tokens

Working with multi-turn chat data might require the tokenizer to use a few new control tokens (e.g. `<|assistant|>`, `[SYS]`) as described in the guidelines above. These special tokens might not be present in the tokenizer's vocabulary if the user is using a base model.

Users can pass the `--add_special_tokens` argument, which adds the required tokens to the tokenizer's vocabulary.
For example, special tokens used in `--instruction_template`/`--response_template` can be added as follows:

```
python -m tuning.sft_trainer \
...
--add_special_tokens "<|start_of_role|>" "<|end_of_role|>" \
--instruction_template "<|start_of_role|>user<|end_of_role|>" \
--response_template "<|start_of_role|>assistant<|end_of_role|>"
```

### 4. Pre-tokenized datasets

Users can also pass a pre-tokenized dataset (containing `input_ids` and `labels` columns) via the `--training_data_path` argument, e.g.

At this time, the data preprocessor does not add EOS tokens to pre-tokenized datasets; users must ensure EOS tokens are included in their pre-tokenized data if needed.

```
python tuning/sft_trainer.py ... --training_data_path twitter_complaints_tokenized_with_maykeye_tinyllama_v0.arrow
```
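Since the preprocessor does not add EOS tokens to pre-tokenized data, users can append them before writing the dataset out. A hedged sketch as a plain-Python transform; the EOS id of `2` is an assumption (check your tokenizer's `eos_token_id`):

```python
EOS_TOKEN_ID = 2  # assumption: Llama-style tokenizers often use 2; verify for your model

def append_eos(example: dict) -> dict:
    """Append the EOS id to input_ids and labels if it is not already present."""
    if example["input_ids"] and example["input_ids"][-1] != EOS_TOKEN_ID:
        example["input_ids"] = example["input_ids"] + [EOS_TOKEN_ID]
        # EOS is appended to labels too, so the model learns to emit it.
        example["labels"] = example["labels"] + [EOS_TOKEN_ID]
    return example

row = {"input_ids": [5, 6, 7], "labels": [-100, 6, 7]}
print(append_eos(row))  # {'input_ids': [5, 6, 7, 2], 'labels': [-100, 6, 7, 2]}
```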

docs/ept.md

Lines changed: 22 additions & 12 deletions
@@ -18,26 +18,36 @@ Let's say you have a `JSONL` data file which contains text to be trained on in each line.
Example dataset,

```
-{"Tweet":"@HMRCcustomers No this is my first job","ID":0,"Label":2,"text_label":"no complaint","output":"### Text: @HMRCcustomers No this is my first job\n\n### Label: no complaint"}
-{"Tweet":"@KristaMariePark Thank you for your interest! If you decide to cancel, you can call Customer Care at 1-800-NYTIMES.","ID":1,"Label":2,"text_label":"no complaint","output":"### Text: @KristaMariePark Thank you for your interest! If you decide to cancel, you can call Customer Care at 1-800-NYTIMES.\n\n### Label: no complaint"}
+{"text":"I am one sample which doesn't exceed the max seq length"}
+{"text":"I am also another sample which doesn't exceed the max seq length"}
...
```

Sample data config for the above use case.
```
 dataprocessor:
     type: default
+    streaming: false
 datasets:
-  - name: non_tokenized_text_dataset
+  - name: apply_custom_jinja_template
     data_paths:
-      - "<path-to-the-jsonl-dataset>"
-    data_handlers:
-      - name: add_tokenizer_eos_token
-        arguments:
-          remove_columns: all
-          batched: false
-          fn_kwargs:
-            dataset_text_field: "dataset_text_field"
+      - "FILE_PATH"
+    data_handlers:
+      - name: apply_custom_jinja_template
+        arguments:
+          remove_columns: all
+          batched: false
+          fn_kwargs:
+            formatted_text_column_name: "formatted_text"
+            template: '{{element["text"]}}{{eos_token}}'
+      - name: tokenize
+        arguments:
+          remove_columns: all
+          batched: false
+          fn_kwargs:
+            text_column_name: "formatted_text"
+            truncation: false
+            max_length: 4096
```
4252

@@ -46,7 +56,7 @@
And the command line passed to the library should include the following.
```
--data_config_path <path to the data config> --packing=True --max_seq_len 8192
```

-Please note that for non tokenized dataset our code adds `EOS_TOKEN` to the lines, for e.g. `Tweet` column before passing that as a dataset.
+Please note that for a non-tokenized dataset our code adds `EOS_TOKEN` to the lines, e.g. the `text` column, before tokenizing and passing that as a dataset.

### Multiple Non Tokenized Datasets

docs/experiment-tracking.md

Lines changed: 0 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -18,9 +18,6 @@ sft_trainer.train(train_args=training_args,...)

For each of the requested trackers, the code expects you to pass a config to the `sft_trainer.train` function, which can be specified through the `tracker_configs` argument [here](https://github.com/foundation-model-stack/fms-hf-tuning/blob/a9b8ec8d1d50211873e63fa4641054f704be8712/tuning/sft_trainer.py#L78); details are given below.

## Tracker Configurations

## File Logging Tracker

docs/installation.md

Lines changed: 73 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,73 @@
# Table of Contents

- [Basic Installation](#basic-installation)
- [Installing FlashAttention](#using-flashattention)
- [Installing FMS-Acceleration](#using-fms-acceleration)
- [Installing Mamba Model Support](#training-mamba-models)
- [Installing Experiment Tracker Support](#using-experiment-trackers)

## Basic Installation

```
pip install fms-hf-tuning
```

## Using FlashAttention

> Note: After installing, if you wish to use [FlashAttention](https://github.com/Dao-AILab/flash-attention), you need to install these requirements:
```sh
pip install fms-hf-tuning[dev]
pip install fms-hf-tuning[flash-attn]
```
[FlashAttention](https://github.com/Dao-AILab/flash-attention) requires the [CUDA Toolkit](https://developer.nvidia.com/cuda-toolkit) to be pre-installed.

*Debug recommendation:* While training, if you encounter flash-attn errors such as `undefined symbol`, you can follow the steps below for a clean installation of the flash binaries. This may occur when multiple environments share the pip cache directory, or when the torch version is updated.

```sh
pip uninstall flash-attn
pip cache purge
pip install fms-hf-tuning[flash-attn]
```

## Using FMS-Acceleration

`fms-acceleration` is a collection of plugin packages that accelerate fine-tuning / training of large models, as part of the `fms-hf-tuning` suite. For more details see [this document](./docs/tuning-techniques.md#fms-acceleration).

If you wish to use [fms-acceleration](https://github.com/foundation-model-stack/fms-acceleration), you need to install it:
```
pip install fms-hf-tuning[fms-accel]
```

## Using Experiment Trackers

Experiment tracking in fms-hf-tuning allows users to track their experiments with known trackers like [Aimstack](https://aimstack.io/), [MLflow Tracking](https://mlflow.org/docs/latest/tracking.html), [ClearML Tracking](https://clear.ml/), or custom trackers built into the code like [FileLoggingTracker](./tuning/trackers/filelogging_tracker.py).

The code currently supports these trackers out of the box:
* `FileLoggingTracker`: a built-in tracker which supports logging training loss to a file.
  - Since this is built-in, there is no need to install anything.
* `Aimstack`: a popular open-source tracker which can be used to track any metrics or metadata from the experiments.
  - Install by running `pip install fms-hf-tuning[aim]`
* `MLflow Tracking`: another popular open-source tracker which stores metrics, metadata or even artifacts from experiments.
  - Install by running `pip install fms-hf-tuning[mlflow]`
* `ClearML Tracking`: another open-source tracker which stores metrics, metadata or even artifacts from experiments.
  - Install by running `pip install fms-hf-tuning[clearml]`

Note: all trackers expect some arguments or can be customized by passing command line arguments, which are described in our document on [experiment tracking](./experiment-tracking.md). For further details on enabling and using the trackers, see that document.

## Training Mamba Models

To train Mamba models, the `mamba-ssm` package (which is compatible with fms-hf-tuning) needs to be installed to ensure optimal training. Not using this package while training Mamba models can result in higher resource usage and suboptimal performance.

Install it with:
```
pip install fms-hf-tuning[mamba]
```
