Commit 9cd18ac

Merge pull request #95 from spzala/addexample
Add a simple example of running FMS on AIU
2 parents 7dcfb1d + 5197e71 commit 9cd18ac

4 files changed

Lines changed: 244 additions & 137 deletions

README.md

Lines changed: 3 additions & 137 deletions
@@ -139,144 +139,10 @@ export TORCH_SENDNN_LOG=CRITICAL
export DT_DEEPRT_VERBOSE=-1
```

## Example runs
## How to use Foundation Model Stack (FMS) on AIU hardware

The [scripts](https://github.com/foundation-model-stack/aiu-fms-testing-utils/tree/main/scripts) directory provides various scripts for using FMS on AIU hardware. These scripts offer robust support for passing the desired command line options when running encoder and decoder models, among other use cases. Refer to the documentation on [using different scripts](https://github.com/foundation-model-stack/aiu-fms-testing-utils/blob/main/scripts/README.md) for more details.

Tensor parallel execution is only supported on the AIU through the [Foundation Model Stack](https://github.com/foundation-model-stack/foundation-model-stack).

The `--nproc-per-node` command line option controls the number of AIUs to use (the number of parallel processes).

### Small Toy

The `small-toy.py` script is a slimmed-down version of the Big Toy model. Its purpose is to demonstrate how to run a tensor parallel model with FMS on AIU hardware.

```bash
cd ${HOME}/aiu-fms-testing-utils/scripts

# 1 AIU (sequential)
# Inductor (CPU) backend (default)
torchrun --nproc-per-node 1 ./small-toy.py
# AIU backend
torchrun --nproc-per-node 1 ./small-toy.py --backend aiu

# 2 AIUs (tensor parallel)
# Inductor (CPU) backend (default)
torchrun --nproc-per-node 2 ./small-toy.py
# AIU backend
torchrun --nproc-per-node 2 ./small-toy.py --backend aiu
```

Example Output

```console
shell$ torchrun --nproc-per-node 4 ./small-toy.py --backend aiu
------------------------------------------------------------
0 / 4 : Python Version : 3.11.7
0 / 4 : PyTorch Version : 2.2.2+cpu
0 / 4 : Dynamo Backend : aiu -> sendnn
0 / 4 : PCI Addr. for Rank 0 : 0000:bd:00.0
0 / 4 : PCI Addr. for Rank 1 : 0000:b6:00.0
0 / 4 : PCI Addr. for Rank 2 : 0000:b9:00.0
0 / 4 : PCI Addr. for Rank 3 : 0000:b5:00.0
------------------------------------------------------------
0 / 4 : Creating the model...
0 / 4 : Compiling the model...
0 / 4 : Running model: First Time...
0 / 4 : Running model: Second Time...
0 / 4 : Done
```

### Roberta

The `roberta.py` script is a simple version of the RoBERTa model. Its purpose is to demonstrate how to run a tensor parallel model with FMS on AIU hardware.

**Note**: We need to disable the Tensor Parallel `Embedding` conversion to avoid a `torch.distributed` interface that `gloo` does not support, namely `torch.ops._c10d_functional.all_gather_into_tensor`. The `roberta.py` script sets the following environment variable to avoid the problematic conversion. This workaround will be removed in a future PyTorch release.

```shell
export DISTRIBUTED_STRATEGY_IGNORE_MODULES=WordEmbedding,Embedding
```

```bash
cd ${HOME}/aiu-fms-testing-utils/scripts

# 1 AIU (sequential)
# Inductor (CPU) backend (default)
torchrun --nproc-per-node 1 ./roberta.py
# AIU backend
torchrun --nproc-per-node 1 ./roberta.py --backend aiu

# 2 AIUs (tensor parallel)
# Inductor (CPU) backend (default)
torchrun --nproc-per-node 2 ./roberta.py
# AIU backend
torchrun --nproc-per-node 2 ./roberta.py --backend aiu
```

Example Output

```console
shell$ torchrun --nproc-per-node 2 ./roberta.py --backend aiu
------------------------------------------------------------
0 / 2 : Python Version : 3.11.7
0 / 2 : PyTorch Version : 2.2.2+cpu
0 / 2 : Dynamo Backend : aiu -> sendnn
0 / 2 : PCI Addr. for Rank 0 : 0000:bd:00.0
0 / 2 : PCI Addr. for Rank 1 : 0000:b6:00.0
------------------------------------------------------------
0 / 2 : Creating the model...
0 / 2 : Compiling the model...
0 / 2 : Running model: First Time...
0 / 2 : Answer: (0.11509) Miss Piggy is a pig.
0 / 2 : Running model: Second Time...
0 / 2 : Answer: (0.11509) Miss Piggy is a pig.
0 / 2 : Done
```

### LLaMA/Granite

```bash
export DT_OPT=varsub=1,lxopt=1,opfusion=1,arithfold=1,dataopt=1,patchinit=1,patchprog=1,autopilot=1,weipreload=0,kvcacheopt=1,progshareopt=1

# run 194m on AIU
python3 inference.py --architecture=hf_pretrained --model_path=/home/senuser/llama3.194m --tokenizer=/home/senuser/llama3.194m --unfuse_weights --min_pad_length 64 --device_type=aiu --max_new_tokens=5 --compile --default_dtype=fp16 --compile_dynamic

# run 194m on CPU
python3 inference.py --architecture=hf_pretrained --model_path=/home/senuser/llama3.194m --tokenizer=/home/senuser/llama3.194m --unfuse_weights --min_pad_length 64 --device_type=cpu --max_new_tokens=5 --default_dtype=fp32

# run 7b on AIU
python3 inference.py --architecture=hf_pretrained --model_path=/home/senuser/llama2.7b --tokenizer=/home/senuser/llama2.7b --unfuse_weights --min_pad_length 64 --device_type=aiu --max_new_tokens=5 --compile --default_dtype=fp16 --compile_dynamic

# run 7b on CPU
python3 inference.py --architecture=hf_pretrained --model_path=/home/senuser/llama2.7b --tokenizer=/home/senuser/llama2.7b --unfuse_weights --min_pad_length 64 --device_type=cpu --max_new_tokens=5 --default_dtype=fp32

# run gpt_bigcode (granite) 3b on AIU
python3 inference.py --architecture=gpt_bigcode --variant=ibm.3b --model_path=/home/senuser/gpt_bigcode.granite.3b/*00002.bin --model_source=hf --tokenizer=/home/senuser/gpt_bigcode.granite.3b --unfuse_weights --min_pad_length 64 --device_type=aiu --max_new_tokens=5 --prompt_type=code --compile --default_dtype=fp16 --compile_dynamic

# run gpt_bigcode (granite) 3b on CPU
python3 inference.py --architecture=gpt_bigcode --variant=ibm.3b --model_path=/home/senuser/gpt_bigcode.granite.3b/*00002.bin --model_source=hf --tokenizer=/home/senuser/gpt_bigcode.granite.3b --unfuse_weights --min_pad_length 64 --device_type=cpu --max_new_tokens=5 --prompt_type=code --default_dtype=fp32
```

To try mini-batch, use `--batch_input`.
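
For example, a minimal sketch (assuming `--batch_input` is a simple on/off flag) adds it to the 194m AIU command shown above:

```bash
# Hedged sketch: the 194m AIU command from above with mini-batching enabled via --batch_input
python3 inference.py --architecture=hf_pretrained --model_path=/home/senuser/llama3.194m --tokenizer=/home/senuser/llama3.194m --unfuse_weights --min_pad_length 64 --device_type=aiu --max_new_tokens=5 --compile --default_dtype=fp16 --compile_dynamic --batch_input
```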

For the validation script, here are a few examples:

```bash
export DT_OPT=varsub=1,lxopt=1,opfusion=1,arithfold=1,dataopt=1,patchinit=1,patchprog=1,autopilot=1,weipreload=0,kvcacheopt=1,progshareopt=1

# Run a llama 194m model, grab the example inputs in the script, generate validation tokens on cpu, validate token equivalency:
python3 scripts/validation.py --architecture=hf_pretrained --model_path=/home/devel/models/llama-194m --tokenizer=/home/devel/models/llama-194m --unfuse_weights --batch_size=1 --min_pad_length=64 --max_new_tokens=10 --compile_dynamic

# Run a llama 194m model, grab the example inputs from a folder, generate validation tokens on cpu, validate token equivalency:
python3 scripts/validation.py --architecture=hf_pretrained --model_path=/home/devel/models/llama-194m --tokenizer=/home/devel/models/llama-194m --unfuse_weights --batch_size=1 --min_pad_length=64 --max_new_tokens=10 --prompt_path=/home/devel/aiu-fms-testing-utils/prompts/test/*.txt --compile_dynamic

# Run a llama 194m model, grab the example inputs from a folder, grab validation text from a folder, validate token equivalency (will only validate up to max(max_new_tokens, tokens_in_validation_file)):
python3 scripts/validation.py --architecture=hf_pretrained --model_path=/home/devel/models/llama-194m --tokenizer=/home/devel/models/llama-194m --unfuse_weights --batch_size=1 --min_pad_length=64 --max_new_tokens=10 --prompt_path=/home/devel/aiu-fms-testing-utils/prompts/test/*.txt --validation_files_path=/home/devel/aiu-fms-testing-utils/prompts/validation/*.txt --compile_dynamic

# Validate a reduced size version of llama 8b
python3 scripts/validation.py --architecture=hf_configured --model_path=/home/devel/models/llama-8b --tokenizer=/home/devel/models/llama-8b --unfuse_weights --batch_size=1 --min_pad_length=64 --max_new_tokens=10 --extra_get_model_kwargs nlayers=3 --compile_dynamic
```

To run a logits-based validation, pass `--validation_level=1` to the validation script. This checks that the logits output matches at every step of the model, using cross-entropy loss.
You can control the acceptable threshold with `--logits_loss_threshold`.
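
For instance, a sketch of a logits-based run built on the first validation command above; the threshold value here is purely illustrative:

```bash
# Hedged sketch: logits-level validation of the llama 194m model; 2.5 is an arbitrary example threshold
python3 scripts/validation.py --architecture=hf_pretrained --model_path=/home/devel/models/llama-194m --tokenizer=/home/devel/models/llama-194m --unfuse_weights --batch_size=1 --min_pad_length=64 --max_new_tokens=10 --compile_dynamic --validation_level=1 --logits_loss_threshold=2.5
```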

The [examples](https://github.com/foundation-model-stack/aiu-fms-testing-utils/tree/main/examples) directory provides small examples that illustrate the general workflow of running a model with FMS on AIU hardware.

## Common Errors

examples/README.md

Lines changed: 98 additions & 0 deletions
@@ -0,0 +1,98 @@

# Small examples of using Foundation Model Stack (FMS) on AIU hardware

The [scripts](https://github.com/foundation-model-stack/aiu-fms-testing-utils/tree/main/scripts) directory provides robust scripts that let users pass various command-line options, and you should use them according to your use case. However, because they are larger and more feature-rich, it can be difficult to follow their flow quickly. The examples provided here offer a short workflow that helps users quickly understand how to run FMS on AIU hardware.

We will walk through an example of running the IBM Granite model with the AIU backend.
The first step is to make sure that the required libraries are available, which includes [aiu-fms-testing-utils](https://github.com/foundation-model-stack/aiu-fms-testing-utils), [fms](https://github.com/foundation-model-stack/foundation-model-stack), [HF Transformers](https://huggingface.co/docs/hub/en/transformers), [torch](https://pytorch.org/get-started/locally/), and torch_sendnn. Depending on your use case, you may need other libraries as well.
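
One possible way to set up most of these dependencies is sketched below; the package names and install sources are assumptions (verify them for your environment), and `torch_sendnn` normally ships with the IBM AIU software stack rather than coming from PyPI.

```bash
# Assumed setup sketch -- verify package names against your environment.
# torch_sendnn is typically provided by the IBM AIU software stack, not installed from PyPI.
pip install torch transformers
pip install ibm-fms  # assumed PyPI name for the Foundation Model Stack
pip install git+https://github.com/foundation-model-stack/aiu-fms-testing-utils.git
```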

For our example code, we will use the following libraries.
```python
import math
import os
import torch

from aiu_fms_testing_utils.utils import warmup_model
from aiu_fms_testing_utils.utils.aiu_setup import dprint
from fms.models import get_model
from fms.utils.generation import generate, pad_input_ids
from torch_sendnn import torch_sendnn
from transformers import AutoTokenizer
```

Now, add the model setup and tokenizer details.
```python
# We provide our model as a variant, as below. If you have a model available locally, you can use the model_path variable instead of variant.
variant = "ibm-granite/granite-3.0-8b-base"  # or "ibm-ai-platform/micro-g3.3-8b-instruct-1b", etc.
model = get_model(
    architecture="hf_pretrained",
    variant=variant,
    device_type="cpu",
    data_type=torch.float16,
    fused_weights=False,
)
model.eval()
torch.set_grad_enabled(False)
model.compile(backend="sendnn")  # Compile with the AIU sendnn backend

# Tokenize
tokenizer = AutoTokenizer.from_pretrained(variant)
```

Also, since we are using a Granite decoder model, set the compilation mode to `offline_decoder` instead of using the default.
```python
os.environ.setdefault("COMPILATION_MODE", "offline_decoder")
```

Now, let's define the prompt.
```python
template = "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\n{}\n\n### Response:"
prompt = template.format("Provide a list of instructions for preparing chicken soup.")
input_ids = tokenizer.encode(prompt, return_tensors="pt")
input_ids, extra_generation_kwargs = pad_input_ids([input_ids.squeeze(0)], min_pad_length=math.ceil(input_ids.size(1)/64) * 64)
# only_last_token optimization
extra_generation_kwargs["only_last_token"] = True
# Set a desired number of new tokens to generate
max_new_tokens = 16
```

That's it! We are ready to generate the model response.
```python
warmup_model(model, input_ids, max_new_tokens=max_new_tokens, **extra_generation_kwargs)

# Generate model response
result = generate(
    model,
    input_ids,
    max_new_tokens=max_new_tokens,
    use_cache=True,
    max_seq_len=model.config.max_expected_seq_len,
    contiguous_cache=True,
    do_sample=False,
    extra_kwargs=extra_generation_kwargs,
)
```

Optionally, we will also print the output.
```python
# Print output
def print_result(result):
    output_str = tokenizer.convert_tokens_to_string(
        tokenizer.convert_ids_to_tokens(result)
    )
    dprint(output_str)
    print("...")

for i in range(result.shape[0]):
    print_result(result[i])
```

The output should be similar to the one below.
```
### Instruction:
Provide a list of instructions for preparing chicken soup.

### Response:
1. Gather ingredients: chicken, vegetables (carrots, celery, onions), herbs (parsley, thyme, bay leaves), salt, pep
...
```
You can find this code in `run_granite3.py` in this directory.
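
To try it end to end, you can run the script directly, assuming the repository root as your working directory and an AIU-enabled environment:

```bash
# Run the complete example script
python3 examples/run_granite3.py
```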

examples/run_granite3.py

Lines changed: 65 additions & 0 deletions
@@ -0,0 +1,65 @@

import math
import os
import torch

from aiu_fms_testing_utils.utils import warmup_model
from aiu_fms_testing_utils.utils.aiu_setup import dprint
from fms.models import get_model
from fms.utils.generation import generate, pad_input_ids
from torch_sendnn import torch_sendnn  # noqa: F401
from transformers import AutoTokenizer

# We will provide our model as a variant as below. If you have a model available locally, you can use model_path variable instead of variant.
variant = "ibm-granite/granite-3.0-8b-base"  # or "ibm-ai-platform/micro-g3.3-8b-instruct-1b" etc.
model = get_model(
    architecture="hf_pretrained",
    variant=variant,
    device_type="cpu",
    data_type=torch.float16,
    fused_weights=False,
)
model.eval()
torch.set_grad_enabled(False)
model.compile(backend="sendnn")  # Compile with the AIU sendnn backend

# Tokenize
tokenizer = AutoTokenizer.from_pretrained(variant)

os.environ.setdefault("COMPILATION_MODE", "offline_decoder")

template = "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\n{}\n\n### Response:"
prompt = template.format("Provide a list of instructions for preparing chicken soup.")
input_ids = tokenizer.encode(prompt, return_tensors="pt")
input_ids, extra_generation_kwargs = pad_input_ids(
    [input_ids.squeeze(0)], min_pad_length=math.ceil(input_ids.size(1) / 64) * 64
)
# only_last_token optimization
extra_generation_kwargs["only_last_token"] = True
# Set a desired number
max_new_tokens = 16

warmup_model(model, input_ids, max_new_tokens=max_new_tokens, **extra_generation_kwargs)
# Generate model response
result = generate(
    model,
    input_ids,
    max_new_tokens=max_new_tokens,
    use_cache=True,
    max_seq_len=model.config.max_expected_seq_len,
    contiguous_cache=True,
    do_sample=False,
    extra_kwargs=extra_generation_kwargs,
)


# Print output
def print_result(result):
    output_str = tokenizer.convert_tokens_to_string(
        tokenizer.convert_ids_to_tokens(result)
    )
    dprint(output_str)
    print("...")


for i in range(result.shape[0]):
    print_result(result[i])
