Commit 9cd18ac

Merge pull request #95 from spzala/addexample
Add a simple example of running FMS on AIU
2 parents 7dcfb1d + 5197e71 commit 9cd18ac

4 files changed

Lines changed: 244 additions & 137 deletions

README.md

Lines changed: 3 additions & 137 deletions
@@ -139,144 +139,10 @@ export TORCH_SENDNN_LOG=CRITICAL
export DT_DEEPRT_VERBOSE=-1
```

## Example runs
## How to use Foundation Model Stack (FMS) on AIU hardware

The [scripts](https://github.com/foundation-model-stack/aiu-fms-testing-utils/tree/main/scripts) directory provides various scripts for using FMS on AIU hardware. These scripts offer robust support for passing the desired command line options when running encoder and decoder models, among other use cases. Refer to the documentation on [using different scripts](https://github.com/foundation-model-stack/aiu-fms-testing-utils/blob/main/scripts/README.md) for more details.

Tensor parallel execution is only supported on the AIU through the [Foundation Model Stack](https://github.com/foundation-model-stack/foundation-model-stack).

The `--nproc-per-node` command line option controls the number of AIUs to use (the number of parallel processes).

### Small Toy

The `small-toy.py` script is a slimmed-down version of the Big Toy model. Its purpose is to demonstrate how to run a tensor parallel model with FMS on AIU hardware.

```bash
cd ${HOME}/aiu-fms-testing-utils/scripts

# 1 AIU (sequential)
# Inductor (CPU) backend (default)
torchrun --nproc-per-node 1 ./small-toy.py
# AIU backend
torchrun --nproc-per-node 1 ./small-toy.py --backend aiu

# 2 AIUs (tensor parallel)
# Inductor (CPU) backend (default)
torchrun --nproc-per-node 2 ./small-toy.py
# AIU backend
torchrun --nproc-per-node 2 ./small-toy.py --backend aiu
```

Example Output

```console
shell$ torchrun --nproc-per-node 4 ./small-toy.py --backend aiu
------------------------------------------------------------
0 / 4 : Python Version : 3.11.7
0 / 4 : PyTorch Version : 2.2.2+cpu
0 / 4 : Dynamo Backend : aiu -> sendnn
0 / 4 : PCI Addr. for Rank 0 : 0000:bd:00.0
0 / 4 : PCI Addr. for Rank 1 : 0000:b6:00.0
0 / 4 : PCI Addr. for Rank 2 : 0000:b9:00.0
0 / 4 : PCI Addr. for Rank 3 : 0000:b5:00.0
------------------------------------------------------------
0 / 4 : Creating the model...
0 / 4 : Compiling the model...
0 / 4 : Running model: First Time...
0 / 4 : Running model: Second Time...
0 / 4 : Done
```

### Roberta

The `roberta.py` script is a simple version of the RoBERTa model. Its purpose is to demonstrate how to run a tensor parallel model with FMS on AIU hardware.

**Note**: We need to disable the Tensor Parallel `Embedding` conversion to avoid a `torch.distributed` interface that `gloo` does not support, namely `torch.ops._c10d_functional.all_gather_into_tensor`. The `roberta.py` script sets the following environment variable to avoid the problematic conversion. This workaround will be removed in a future PyTorch release.

```shell
export DISTRIBUTED_STRATEGY_IGNORE_MODULES=WordEmbedding,Embedding
```

```bash
cd ${HOME}/aiu-fms-testing-utils/scripts

# 1 AIU (sequential)
# Inductor (CPU) backend (default)
torchrun --nproc-per-node 1 ./roberta.py
# AIU backend
torchrun --nproc-per-node 1 ./roberta.py --backend aiu

# 2 AIUs (tensor parallel)
# Inductor (CPU) backend (default)
torchrun --nproc-per-node 2 ./roberta.py
# AIU backend
torchrun --nproc-per-node 2 ./roberta.py --backend aiu
```

Example Output

```console
shell$ torchrun --nproc-per-node 2 ./roberta.py --backend aiu
------------------------------------------------------------
0 / 2 : Python Version : 3.11.7
0 / 2 : PyTorch Version : 2.2.2+cpu
0 / 2 : Dynamo Backend : aiu -> sendnn
0 / 2 : PCI Addr. for Rank 0 : 0000:bd:00.0
0 / 2 : PCI Addr. for Rank 1 : 0000:b6:00.0
------------------------------------------------------------
0 / 2 : Creating the model...
0 / 2 : Compiling the model...
0 / 2 : Running model: First Time...
0 / 2 : Answer: (0.11509) Miss Piggy is a pig.
0 / 2 : Running model: Second Time...
0 / 2 : Answer: (0.11509) Miss Piggy is a pig.
0 / 2 : Done
```

### LLaMA/Granite

```bash
export DT_OPT=varsub=1,lxopt=1,opfusion=1,arithfold=1,dataopt=1,patchinit=1,patchprog=1,autopilot=1,weipreload=0,kvcacheopt=1,progshareopt=1

# run 194m on AIU
python3 inference.py --architecture=hf_pretrained --model_path=/home/senuser/llama3.194m --tokenizer=/home/senuser/llama3.194m --unfuse_weights --min_pad_length 64 --device_type=aiu --max_new_tokens=5 --compile --default_dtype=fp16 --compile_dynamic

# run 194m on CPU
python3 inference.py --architecture=hf_pretrained --model_path=/home/senuser/llama3.194m --tokenizer=/home/senuser/llama3.194m --unfuse_weights --min_pad_length 64 --device_type=cpu --max_new_tokens=5 --default_dtype=fp32

# run 7b on AIU
python3 inference.py --architecture=hf_pretrained --model_path=/home/senuser/llama2.7b --tokenizer=/home/senuser/llama2.7b --unfuse_weights --min_pad_length 64 --device_type=aiu --max_new_tokens=5 --compile --default_dtype=fp16 --compile_dynamic

# run 7b on CPU
python3 inference.py --architecture=hf_pretrained --model_path=/home/senuser/llama2.7b --tokenizer=/home/senuser/llama2.7b --unfuse_weights --min_pad_length 64 --device_type=cpu --max_new_tokens=5 --default_dtype=fp32

# run gpt_bigcode (granite) 3b on AIU
python3 inference.py --architecture=gpt_bigcode --variant=ibm.3b --model_path=/home/senuser/gpt_bigcode.granite.3b/*00002.bin --model_source=hf --tokenizer=/home/senuser/gpt_bigcode.granite.3b --unfuse_weights --min_pad_length 64 --device_type=aiu --max_new_tokens=5 --prompt_type=code --compile --default_dtype=fp16 --compile_dynamic

# run gpt_bigcode (granite) 3b on CPU
python3 inference.py --architecture=gpt_bigcode --variant=ibm.3b --model_path=/home/senuser/gpt_bigcode.granite.3b/*00002.bin --model_source=hf --tokenizer=/home/senuser/gpt_bigcode.granite.3b --unfuse_weights --min_pad_length 64 --device_type=cpu --max_new_tokens=5 --prompt_type=code --default_dtype=fp32
```

To try mini-batch, use `--batch_input`.
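
For example, a minimal sketch (assuming `--batch_input` is a simple on/off flag) adds it to the 194m AIU command shown above:

```bash
# Hedged sketch: the 194m AIU command from above with mini-batching enabled via --batch_input
python3 inference.py --architecture=hf_pretrained --model_path=/home/senuser/llama3.194m --tokenizer=/home/senuser/llama3.194m --unfuse_weights --min_pad_length 64 --device_type=aiu --max_new_tokens=5 --compile --default_dtype=fp16 --compile_dynamic --batch_input
```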

For the validation script, here are a few examples:

```bash
export DT_OPT=varsub=1,lxopt=1,opfusion=1,arithfold=1,dataopt=1,patchinit=1,patchprog=1,autopilot=1,weipreload=0,kvcacheopt=1,progshareopt=1

# Run a llama 194m model, grab the example inputs in the script, generate validation tokens on cpu, validate token equivalency:
python3 scripts/validation.py --architecture=hf_pretrained --model_path=/home/devel/models/llama-194m --tokenizer=/home/devel/models/llama-194m --unfuse_weights --batch_size=1 --min_pad_length=64 --max_new_tokens=10 --compile_dynamic

# Run a llama 194m model, grab the example inputs from a folder, generate validation tokens on cpu, validate token equivalency:
python3 scripts/validation.py --architecture=hf_pretrained --model_path=/home/devel/models/llama-194m --tokenizer=/home/devel/models/llama-194m --unfuse_weights --batch_size=1 --min_pad_length=64 --max_new_tokens=10 --prompt_path=/home/devel/aiu-fms-testing-utils/prompts/test/*.txt --compile_dynamic

# Run a llama 194m model, grab the example inputs from a folder, grab validation text from a folder, validate token equivalency (will only validate up to max(max_new_tokens, tokens_in_validation_file)):
python3 scripts/validation.py --architecture=hf_pretrained --model_path=/home/devel/models/llama-194m --tokenizer=/home/devel/models/llama-194m --unfuse_weights --batch_size=1 --min_pad_length=64 --max_new_tokens=10 --prompt_path=/home/devel/aiu-fms-testing-utils/prompts/test/*.txt --validation_files_path=/home/devel/aiu-fms-testing-utils/prompts/validation/*.txt --compile_dynamic

# Validate a reduced size version of llama 8b
python3 scripts/validation.py --architecture=hf_configured --model_path=/home/devel/models/llama-8b --tokenizer=/home/devel/models/llama-8b --unfuse_weights --batch_size=1 --min_pad_length=64 --max_new_tokens=10 --extra_get_model_kwargs nlayers=3 --compile_dynamic
```

To run a logits-based validation, pass `--validation_level=1` to the validation script. This checks that the logits output matches at every step of the model, using cross-entropy loss.
You can control the acceptable threshold with `--logits_loss_threshold`.
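
For instance, a sketch of a logits-based run built on the first validation command above; the threshold value here is purely illustrative:

```bash
# Hedged sketch: logits-level validation of the llama 194m model; 2.5 is an arbitrary example threshold
python3 scripts/validation.py --architecture=hf_pretrained --model_path=/home/devel/models/llama-194m --tokenizer=/home/devel/models/llama-194m --unfuse_weights --batch_size=1 --min_pad_length=64 --max_new_tokens=10 --compile_dynamic --validation_level=1 --logits_loss_threshold=2.5
```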

The [examples](https://github.com/foundation-model-stack/aiu-fms-testing-utils/tree/main/examples) directory provides small examples that illustrate the general workflow of running a model with FMS on AIU hardware.

## Common Errors

examples/README.md

Lines changed: 98 additions & 0 deletions
@@ -0,0 +1,98 @@

# Small examples of using Foundation Model Stack (FMS) on AIU hardware

The [scripts](https://github.com/foundation-model-stack/aiu-fms-testing-utils/tree/main/scripts) directory provides robust scripts that let users pass various command-line options, and you should use them according to your use case. However, because they are larger and more feature-rich, it can be difficult to follow their flow quickly. The examples provided here offer a short workflow that helps users quickly understand how to run FMS on AIU hardware.

We will walk through an example of running the IBM Granite model with the AIU backend.
The first step is to make sure that the required libraries are available, which includes [aiu-fms-testing-utils](https://github.com/foundation-model-stack/aiu-fms-testing-utils), [fms](https://github.com/foundation-model-stack/foundation-model-stack), [HF Transformers](https://huggingface.co/docs/hub/en/transformers), [torch](https://pytorch.org/get-started/locally/), and torch_sendnn. Depending on your use case, you may need other libraries as well.
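
One possible way to set up most of these dependencies is sketched below; the package names and install sources are assumptions (verify them for your environment), and `torch_sendnn` normally ships with the IBM AIU software stack rather than coming from PyPI.

```bash
# Assumed setup sketch -- verify package names against your environment.
# torch_sendnn is typically provided by the IBM AIU software stack, not installed from PyPI.
pip install torch transformers
pip install ibm-fms  # assumed PyPI name for the Foundation Model Stack
pip install git+https://github.com/foundation-model-stack/aiu-fms-testing-utils.git
```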

For our example code, we will use the following libraries.
```python
import math
import os
import torch

from aiu_fms_testing_utils.utils import warmup_model
from aiu_fms_testing_utils.utils.aiu_setup import dprint
from fms.models import get_model
from fms.utils.generation import generate, pad_input_ids
from torch_sendnn import torch_sendnn
from transformers import AutoTokenizer
```

Now, add the model setup and tokenizer details.
```python
# We provide our model as a variant, as below. If you have a model available locally, you can use the model_path variable instead of variant.
variant = "ibm-granite/granite-3.0-8b-base"  # or "ibm-ai-platform/micro-g3.3-8b-instruct-1b", etc.
model = get_model(
    architecture="hf_pretrained",
    variant=variant,
    device_type="cpu",
    data_type=torch.float16,
    fused_weights=False,
)
model.eval()
torch.set_grad_enabled(False)
model.compile(backend="sendnn")  # Compile with the AIU sendnn backend

# Tokenize
tokenizer = AutoTokenizer.from_pretrained(variant)
```

Also, since we are using a Granite decoder model, set the compilation mode to `offline_decoder` instead of using the default.
```python
os.environ.setdefault("COMPILATION_MODE", "offline_decoder")
```

Now, let's define the prompt.
```python
template = "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\n{}\n\n### Response:"
prompt = template.format("Provide a list of instructions for preparing chicken soup.")
input_ids = tokenizer.encode(prompt, return_tensors="pt")
input_ids, extra_generation_kwargs = pad_input_ids([input_ids.squeeze(0)], min_pad_length=math.ceil(input_ids.size(1)/64) * 64)
# only_last_token optimization
extra_generation_kwargs["only_last_token"] = True
# Set a desired number of new tokens to generate
max_new_tokens = 16
```

That's it! We are ready to generate the model response.
```python
warmup_model(model, input_ids, max_new_tokens=max_new_tokens, **extra_generation_kwargs)

# Generate model response
result = generate(
    model,
    input_ids,
    max_new_tokens=max_new_tokens,
    use_cache=True,
    max_seq_len=model.config.max_expected_seq_len,
    contiguous_cache=True,
    do_sample=False,
    extra_kwargs=extra_generation_kwargs,
)
```

Optionally, we will also print the output.
```python
# Print output
def print_result(result):
    output_str = tokenizer.convert_tokens_to_string(
        tokenizer.convert_ids_to_tokens(result)
    )
    dprint(output_str)
    print("...")

for i in range(result.shape[0]):
    print_result(result[i])
```

The output should be similar to the one below.
```
### Instruction:
Provide a list of instructions for preparing chicken soup.

### Response:
1. Gather ingredients: chicken, vegetables (carrots, celery, onions), herbs (parsley, thyme, bay leaves), salt, pep
...
```
You can find this code in `run_granite3.py` in this directory.
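
To try it end to end, you can run the script directly, assuming the repository root as your working directory and an AIU-enabled environment:

```bash
# Run the complete example script
python3 examples/run_granite3.py
```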

examples/run_granite3.py

Lines changed: 65 additions & 0 deletions
@@ -0,0 +1,65 @@

import math
import os
import torch

from aiu_fms_testing_utils.utils import warmup_model
from aiu_fms_testing_utils.utils.aiu_setup import dprint
from fms.models import get_model
from fms.utils.generation import generate, pad_input_ids
from torch_sendnn import torch_sendnn  # noqa: F401
from transformers import AutoTokenizer

# We will provide our model as a variant as below. If you have a model available locally, you can use model_path variable instead of variant.
variant = "ibm-granite/granite-3.0-8b-base"  # or "ibm-ai-platform/micro-g3.3-8b-instruct-1b" etc.
model = get_model(
    architecture="hf_pretrained",
    variant=variant,
    device_type="cpu",
    data_type=torch.float16,
    fused_weights=False,
)
model.eval()
torch.set_grad_enabled(False)
model.compile(backend="sendnn")  # Compile with the AIU sendnn backend

# Tokenize
tokenizer = AutoTokenizer.from_pretrained(variant)

os.environ.setdefault("COMPILATION_MODE", "offline_decoder")

template = "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\n{}\n\n### Response:"
prompt = template.format("Provide a list of instructions for preparing chicken soup.")
input_ids = tokenizer.encode(prompt, return_tensors="pt")
input_ids, extra_generation_kwargs = pad_input_ids(
    [input_ids.squeeze(0)], min_pad_length=math.ceil(input_ids.size(1) / 64) * 64
)
# only_last_token optimization
extra_generation_kwargs["only_last_token"] = True
# Set a desired number
max_new_tokens = 16

warmup_model(model, input_ids, max_new_tokens=max_new_tokens, **extra_generation_kwargs)
# Generate model response
result = generate(
    model,
    input_ids,
    max_new_tokens=max_new_tokens,
    use_cache=True,
    max_seq_len=model.config.max_expected_seq_len,
    contiguous_cache=True,
    do_sample=False,
    extra_kwargs=extra_generation_kwargs,
)


# Print output
def print_result(result):
    output_str = tokenizer.convert_tokens_to_string(
        tokenizer.convert_ids_to_tokens(result)
    )
    dprint(output_str)
    print("...")


for i in range(result.shape[0]):
    print_result(result[i])
