Commit 60b1351

Qualcomm AI Engine Direct - [LLM Quantization] Support dataloader-based prefill (pytorch#20273)
### Summary

Calibration dataset:
- Replace HF AutoModel token generation with direct tokenization of curated corpus (llm eval tasks or JSON samples)
- Add default calibration samples: assets/samples/{text,vision,audio}.json
- Support Dataloader-based calibration

Architecture:
- Introduce PTQStrategy + DecoderInference as unified calibration forward-pass primitives; remove decoder_utils.graph_module_inference
- Refactor dataset.py into dataset/ package: builders, collators, config, datasets, loaders, preprocessors, schema

### Test plan

Test CI:
- ExampleLLMScript
- TestExampleMultimodalityScript
1 parent 3a9abee commit 60b1351

35 files changed

Lines changed: 2238 additions & 2578 deletions

backends/qualcomm/tests/test_qnn_delegate.py

Lines changed: 27 additions & 5 deletions
Original file line number | Diff line number | Diff line change
@@ -8199,7 +8199,6 @@ def test_static_llm_model(self): # noqa: C901
81998199
"1024",
82008200
"--max_context_len",
82018201
"1024",
8202-
"--skip_user_prompt_calibration",
82038202
]
82048203

82058204
match self.static_llm_eval_method:
@@ -8249,10 +8248,17 @@ def test_static_llm_model(self): # noqa: C901
82498248
]
82508249
)
82518250
case _:
8252-
cmds.remove("--skip_user_prompt_calibration")
82538251
logging.warning(
82548252
"No llm eval method chosen. Only generate model output."
82558253
)
8254+
cmds.extend(
8255+
[
8256+
"--calib_tasks",
8257+
"wikitext",
8258+
"--calib_limit",
8259+
"1",
8260+
]
8261+
)
82568262

82578263
if is_llama_model:
82588264
cmds.extend(
@@ -8425,6 +8431,10 @@ def test_codegen2_1b(self):
84258431
"128",
84268432
"--max_context_len",
84278433
"128",
8434+
"--calib_tasks",
8435+
"wikitext",
8436+
"--calib_limit",
8437+
"1",
84288438
]
84298439
self.add_default_cmds(cmds)
84308440

@@ -8486,6 +8496,10 @@ def test_llama_stories_260k(self):
84868496
"128",
84878497
"--max_context_len",
84888498
"128",
8499+
"--calib_tasks",
8500+
"wikitext",
8501+
"--calib_limit",
8502+
"1",
84898503
]
84908504
self.add_default_cmds(cmds)
84918505

@@ -8549,6 +8563,10 @@ def test_llama_stories_110m(self):
85498563
"128",
85508564
"--max_context_len",
85518565
"128",
8566+
"--calib_tasks",
8567+
"wikitext",
8568+
"--calib_limit",
8569+
"1",
85528570
]
85538571
if self.use_fp16:
85548572
cmds.append("--use_fp16")
@@ -8702,7 +8720,7 @@ class VLMSpecs(MLLMSpecs):
87028720
def setUp(self):
87038721
self.alm_specs = {
87048722
"granite_speech_3_3-2b": TestExampleMultimodalityScript.ALMSpecs(
8705-
max_seq_len=512,
8723+
max_seq_len=1024,
87068724
sm8650_token_rate=5,
87078725
sm8750_token_rate=8,
87088726
encoder_pte_size=900_000_000, # 900MB
@@ -8714,7 +8732,7 @@ def setUp(self):
87148732
}
87158733
self.vlm_specs = {
87168734
"smolvlm_500m_instruct": TestExampleMultimodalityScript.VLMSpecs(
8717-
max_seq_len=128,
8735+
max_seq_len=1024,
87188736
sm8650_token_rate=50,
87198737
sm8750_token_rate=55,
87208738
encoder_pte_size=110_000_000, # 110MB
@@ -8724,7 +8742,7 @@ def setUp(self):
87248742
golden_image_feature="city",
87258743
),
87268744
"internvl3_1b": TestExampleMultimodalityScript.VLMSpecs(
8727-
max_seq_len=320,
8745+
max_seq_len=1024,
87288746
sm8650_token_rate=11,
87298747
sm8750_token_rate=13,
87308748
encoder_pte_size=425_000_000, # 425MB
@@ -8776,6 +8794,8 @@ def test_static_asr(self):
87768794
"kv",
87778795
"--max_seq_len",
87788796
f"{alm_specs.max_seq_len}",
8797+
"--calib_samples",
8798+
"./examples/qualcomm/oss_scripts/llama/assets/samples/audio.json",
87798799
]
87808800
if self.compile_only:
87818801
cmds.extend(["--compile_only"])
@@ -8859,6 +8879,8 @@ def test_static_vlm(self):
88598879
"kv",
88608880
"--max_seq_len",
88618881
f"{vlm_specs.max_seq_len}",
8882+
"--calib_samples",
8883+
"./examples/qualcomm/oss_scripts/llama/assets/samples/vision.json",
88628884
]
88638885
if self.compile_only:
88648886
cmds.extend(["--compile_only"])
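Reviewer note: the test changes above follow one pattern. Text-only LLM tests now calibrate on an lm_eval task (`--calib_tasks wikitext --calib_limit 1`), while the audio and vision tests pass a samples JSON through `--calib_samples`. A minimal sketch of that pattern follows, assuming a hypothetical helper named `calibration_args` that is not part of this commit; only the flag names, values, and asset paths are taken from the diff.

```python
# Hypothetical helper, not part of the commit: it only mirrors the flag
# choices that the updated tests make for each kind of model.
SAMPLES_DIR = "./examples/qualcomm/oss_scripts/llama/assets/samples"


def calibration_args(modality):
    """Return the calibration flags the updated tests use for a modality."""
    if modality == "text":
        # Text-only LLM tests calibrate on a single wikitext sample.
        return ["--calib_tasks", "wikitext", "--calib_limit", "1"]
    if modality in ("audio", "vision"):
        # Multimodal tests point at the default sample files added by the commit.
        return ["--calib_samples", f"{SAMPLES_DIR}/{modality}.json"]
    raise ValueError(f"unknown modality: {modality!r}")


# Usage: extend an existing command list, as the tests do with cmds.extend(...).
cmds = ["--model_mode", "kv", "--max_seq_len", "1024"]
cmds.extend(calibration_args("vision"))
print(cmds)
```

Keeping the choice in one place like this would avoid repeating the same four list entries in each test, though the commit itself inlines them.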

examples/qualcomm/oss_scripts/llama/README.md

Lines changed: 78 additions & 30 deletions
@@ -130,12 +130,12 @@ python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL
130130
Default example using hybrid mode.
131131
```bash
132132
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --checkpoint consolidated.00.pth --params params.json --tokenizer_model tokenizer.model --decoder_model llama3_2-3b_instruct --model_mode hybrid --prefill_ar_len 128 --max_seq_len 1024 --prompt "I would like to learn python, could you teach me with a simple example?" --calib_tasks wikitext --calib_limit 1
133-
```
133+
134134

135135
#### Codegen2
136136
Default example using kv mode.
137137
```bash
138-
python examples/qualcomm/oss_scripts/llama/llama.py --build_folder build-android --device ${SERIAL_NUM} --soc_model ${SOC_MODEL} --decoder_model codegen2_1b --model_mode kv --max_seq_len 1024 --prompt "def hello_world():"
138+
python examples/qualcomm/oss_scripts/llama/llama.py --build_folder build-android --device ${SERIAL_NUM} --soc_model ${SOC_MODEL} --decoder_model codegen2_1b --model_mode kv --max_seq_len 1024 --prompt "def hello_world():" --calib_samples examples/qualcomm/oss_scripts/llama/assets/samples/text.json
139139
```
140140

141141
#### Gemma 2B
@@ -210,7 +210,17 @@ Default example using hybrid mode.
210210
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --decoder_model smollm3-3b --model_mode hybrid --prefill_ar_len 128 --max_seq_len 1024 --prompt "I would like to learn python, could you teach me with a simple example?" --calib_tasks wikitext --calib_limit 1
211211
```
212212

213-
## Multimodal Support
213+
#### Using custom calibration samples for LLMs
214+
215+
Instead of `--calib_tasks`, you can supply your own conversation JSON files via `--calib_samples`. The samples are fed into the quantization calibration pass to collect activation observer statistics — they do not affect the inference prompt. This is useful when you want to calibrate on domain-specific or instruct-format data rather than a generic lm_eval task.
216+
217+
```bash
218+
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --decoder_model smollm2_135m --model_mode hybrid --prefill_ar_len 128 --max_seq_len 1024 --prompt "I would like to learn python, could you teach me with a simple example?" --calib_samples examples/qualcomm/oss_scripts/llama/assets/samples/text.json
219+
```
220+
221+
You can also provide both `--calib_tasks` and `--calib_samples` at the same time; the pipeline concatenates both data sources for calibration.
222+
223+
214224

215225
### Overview
216226

@@ -268,7 +278,7 @@ pip install soundfile
268278

269279
Default example using hybrid mode.
270280
```bash
271-
python examples/qualcomm/oss_scripts/llama/llama.py --build_folder build-android --device ${SERIAL_NUM} --soc_model ${SOC_MODEL} --decoder_model granite_speech_3_3-2b --model_mode hybrid --prefill_ar_len 128 --max_seq_len 1024 --prompt "can you transcribe the speech into a written format?" --audio_path "https://huggingface.co/ibm-granite/granite-speech-3.3-2b/resolve/main/10226_10111_000000.wav?download=true"
281+
python examples/qualcomm/oss_scripts/llama/llama.py --build_folder build-android --device ${SERIAL_NUM} --soc_model ${SOC_MODEL} --decoder_model granite_speech_3_3-2b --model_mode hybrid --prefill_ar_len 128 --max_seq_len 1024 --prompt "can you transcribe the speech into a written format?" --audio_path "https://huggingface.co/ibm-granite/granite-speech-3.3-2b/resolve/main/10226_10111_000000.wav?download=true" --calib_samples examples/qualcomm/oss_scripts/llama/assets/samples/audio.json
272282
```
273283

274284
### Specifying Custom Audio
@@ -281,9 +291,6 @@ You can specify a custom audio file for ALM models using the `--audio_path` flag
281291
- **Local file paths**: Absolute or relative paths to `.wav` files on your system
282292
- Example: `"/path/to/your/audio.wav"`
283293
284-
**Default behavior:**
285-
If `--audio_path` is not specified, the system will automatically use the default audio file defined in the model's configuration file (`encoder/encoder_config.py`).
286-
287294
#### Audio Preprocessing
288295
289296
The audio encoder configuration is defined in `encoder/encoder_config.py`:
@@ -294,7 +301,6 @@ The audio encoder configuration is defined in `encoder/encoder_config.py`:
294301
class GraniteSpeechEncoder(AudioModalityConfig):
295302
encoder_class = GraniteSpeechCTCEncoderWrapper
296303
audio_seq_len = 171
297-
audio_url = "https://huggingface.co/ibm-granite/granite-speech-3.3-2b/resolve/main/10226_10111_000000.wav?download=true" # Default audio (content: "After his nap, ...")
298304
quant_recipe = GraniteSpeechEncoderQuantRecipe
299305
```
300306
@@ -351,13 +357,13 @@ Vision-Language Models (VLMs) combine computer vision and natural language proce
351357
#### SmolVLM 500M
352358
Default example using hybrid mode.
353359
```bash
354-
python examples/qualcomm/oss_scripts/llama/llama.py --build_folder build-android --device ${SERIAL_NUM} --soc_model ${SOC_MODEL} --decoder_model smolvlm_500m_instruct --model_mode hybrid --prefill_ar_len 16 --max_seq_len 1024 --prompt "Can you describe this image?" --image_path "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
360+
python examples/qualcomm/oss_scripts/llama/llama.py --build_folder build-android --device ${SERIAL_NUM} --soc_model ${SOC_MODEL} --decoder_model smolvlm_500m_instruct --model_mode hybrid --prefill_ar_len 16 --max_seq_len 1024 --prompt "Can you describe this image?" --image_path "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg" --calib_samples examples/qualcomm/oss_scripts/llama/assets/samples/vision.json
355361
```
356362
357363
#### InternVL 1B
358364
Default example using hybrid mode.
359365
```bash
360-
python examples/qualcomm/oss_scripts/llama/llama.py --build_folder build-android --device ${SERIAL_NUM} --soc_model ${SOC_MODEL} --decoder_model internvl3_1b --model_mode hybrid --prefill_ar_len 32 --max_seq_len 1024 --prompt "Can you describe this image?" --image_path "http://images.cocodataset.org/val2017/000000039769.jpg"
366+
python examples/qualcomm/oss_scripts/llama/llama.py --build_folder build-android --device ${SERIAL_NUM} --soc_model ${SOC_MODEL} --decoder_model internvl3_1b --model_mode hybrid --prefill_ar_len 32 --max_seq_len 1024 --prompt "Can you describe this image?" --image_path "http://images.cocodataset.org/val2017/000000039769.jpg" --calib_samples examples/qualcomm/oss_scripts/llama/assets/samples/vision.json
361367
```
362368
363369
### Specifying Custom Image
@@ -370,9 +376,6 @@ Take a example image of Statue-of-Liberty in New York Bay
370376
- **Local file paths**: Absolute or relative paths to image files on your system
371377
- Example: [`./examples/qualcomm/oss_scripts/llama/assets/samples/images/Statue-of-Liberty-Island-New-York-Bay.png`](assets/samples/images/Statue-of-Liberty-Island-New-York-Bay.png)
372378
373-
**Default behavior:**
374-
If `--image_path` is not specified, the system will automatically use the default image URL defined in the model's configuration file (`encoder/encoder_config.py`).
375-
376379
#### Image Preprocessing
377380
378381
Each VLM model has specific preprocessing requirements defined in its configuration:
@@ -385,7 +388,6 @@ class SmolVLMEncoder(VisionModalityConfig):
385388
img_seq_len = 64
386389
img_resized_h = 512
387390
img_resized_w = 512
388-
img_url = "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg" # Default image
389391
quant_recipe = SmolVLMEncoderQuantRecipe
390392
```
391393
@@ -427,7 +429,7 @@ PROMPT2="Answer the question: What's the main object in first image?"
427429
PROMPT3="<image>Caption this image."
428430
429431
# Execute the multi-turn conversation
430-
python examples/qualcomm/oss_scripts/llama/llama.py --build_folder build-android --device ${SERIAL_NUM} --soc_model ${SOC_MODEL} --decoder_model smolvlm_500m_instruct --model_mode kv --max_seq_len 2048 --prompt "$PROMPT1" "$PROMPT2" "$PROMPT3" --image_path "$IMAGE1_URL" "$IMAGE2_URL" "$IMAGE3_URL"
432+
python examples/qualcomm/oss_scripts/llama/llama.py --build_folder build-android --device ${SERIAL_NUM} --soc_model ${SOC_MODEL} --decoder_model smolvlm_500m_instruct --model_mode kv --max_seq_len 2048 --prompt "$PROMPT1" "$PROMPT2" "$PROMPT3" --image_path "$IMAGE1_URL" "$IMAGE2_URL" "$IMAGE3_URL" --calib_samples examples/qualcomm/oss_scripts/llama/assets/samples/vision.json
431433
```
432434
433435
**How it works:**
@@ -453,16 +455,19 @@ The VLM inference pipeline consists of:
453455
- Special tokens (e.g., `<image>`, `<|fake_token_around_image|>`, `<fake_token_around_image>`) mark modality boundaries (see [tokenizer.py](tokenizer.py))
454456

455457
```python
456-
# Special tokens for Vision-Language Model
457-
VLM_SPECIAL_TOKENS = {
458-
"smolvlm_500m_instruct": {
459-
"image_token": "<image>",
460-
"global_img": "<global-img>",
461-
"fake_wrap_start": "<fake_token_around_image>",
462-
"fake_wrap_end": "<fake_token_around_image>",
463-
},
464-
...
465-
}
458+
# Token fields on each encoder config subclass (encoder/encoder_config.py)
459+
@dataclass(init=False, frozen=True)
460+
class SmolVLMEncoder(VisionModalityConfig):
461+
img_token = "<image>"
462+
fake_wrap_start = "<fake_token_around_image>"
463+
fake_wrap_end = "<fake_token_around_image>"
464+
global_img_token = "<global-img>"
465+
466+
@dataclass(init=False, frozen=True)
467+
class InternVL3Encoder(VisionModalityConfig):
468+
img_token = "<IMG_CONTEXT>"
469+
fake_wrap_start = "<img>"
470+
fake_wrap_end = "</img>"
466471
```
467472
- Final fused sequence: `[batch, img_seq_len + text_seq_len, hidden_dim]`
468473

@@ -545,16 +550,13 @@ From the example script above, 1 wikitext sample is used to evaluate all 3 phase
545550
Example:
546551
```bash
547552
# 1st run to compile with --calib_limit 1
548-
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --prompt "I would like to learn python, could you teach me with a simple example?" --temperature 0 --model_mode kv --max_seq_len 1024 --decoder_model qwen2_5-0_5b --eval_methods tasks_eval --calib_tasks wikitext --calib_limit 1 --compile_only
553+
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --prompt "I would like to learn python, could you teach me with a simple example?" --temperature 0 --model_mode kv --max_seq_len 1024 --decoder_model qwen2_5-0_5b --eval_methods tasks_eval --calib_tasks wikitext --calib_limit 1 -a ${FOLDER_TO_PRE_GEN_PTE} --compile_only
549554
```
550555
```bash
551556
# 2nd run to perform QNN device execution with --eval_limit 3
552-
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --prompt "I would like to learn python, could you teach me with a simple example?" --temperature 0 --model_mode kv --max_seq_len 1024 --decoder_model qwen2_5-0_5b --eval_methods tasks_eval --eval_tasks wikitext --eval_limit 3 --pre_gen_pte ${PATH_TO_ARTIFACT_IN_1ST_RUN} --quant_attrs_path ${PATH_TO_ARTIFACT_IN_1ST_RUN}/kv_llama_qnn_quant_attrs.json
557+
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --prompt "I would like to learn python, could you teach me with a simple example?" --temperature 0 --model_mode kv --max_seq_len 1024 --decoder_model qwen2_5-0_5b --eval_methods tasks_eval --eval_tasks wikitext --eval_limit 3 --pre_gen_pte ${FOLDER_TO_PRE_GEN_PTE}
553558
```
554559

555-
#### Tasks quantization calibration
556-
If `--calib_tasks ${TASK}` is not provided, the program will use `--prompt ${PROMPT}` as the dataset for quantization calibration.
557-
`--calib_tasks` and `--eval_tasks` are independent flags. `--calib_tasks` controls which tasks are used for quantization calibration, while `--eval_tasks` controls which tasks are used for perplexity evaluation. They can be set to different tasks or limits as needed.
558560

559561
#### SQNR Evaluation
560562
To evaluate QNN's output logits against the golden logits from `nn.Module`, users can provide the flag `--sqnr_eval`. Please note that SQNR evaluation will only compare the logits of the user's prompt and will not compare the new tokens generated by the model.
@@ -563,6 +565,52 @@ Example:
563565
python examples/qualcomm/oss_scripts/llama/llama.py --build_folder build-android --device ${SERIAL_NUM} --soc_model ${SOC_MODEL} --prompt "I would like to learn python, could you teach me with a simple example?" --temperature 0 --model_mode kv --max_seq_len 1024 --decoder_model qwen2_5-0_5b --eval_methods sqnr_eval
564566
```
565567

568+
569+
570+
#### Quantization
571+
572+
The calibration data is independent from the runtime evaluation set, and only affects quantization quality, not the inference output.
573+
574+
Calibration data is required for compilation. There are two ways to supply it:
575+
576+
1. **`--calib_tasks`** — calibrate on one or more lm_eval tasks (tune with `--calib_limit` and `--calib_num_fewshot`). LLM-only.
577+
2. **`--calib_samples`** — calibrate on custom conversation samples provided as JSON files (see format below). Required for multimodal models (VLM/ALM).
578+
579+
For LLMs, provide at least one of the two; for multimodal models, `--calib_samples` is mandatory.
580+
581+
Calibration and runtime evaluation use separate flag sets and can target different tasks or limits as needed:
582+
583+
| Purpose | Flags |
584+
|---|---|
585+
| Calibration data (lm_eval tasks) | `--calib_tasks`, `--calib_limit`, `--calib_num_fewshot` |
586+
| Calibration data (custom samples) | `--calib_samples` (JSON files, HuggingFace message format) |
587+
588+
##### Custom calibration samples (`--calib_samples`)
589+
590+
`--calib_samples` accepts one or more JSON files. Each file is a flat list of sample objects. Each sample has a `messages` field following the HuggingFace chat template, and an optional `files` field for media inputs (local paths or URLs):
591+
592+
```json
593+
[
594+
{
595+
"files": ["path/or/url/to/files"],
596+
"messages": [
597+
{"role": "user", "content": "..." },
598+
{"role": "assistant", "content": "..."}
599+
]
600+
}
601+
]
602+
```
603+
604+
`files` is only required for multimodal models (VLM: image paths/URLs, ALM: audio paths/URLs). For LLM-only models, `files` can be omitted. `content` can be a plain string or a list of HuggingFace content blocks (e.g. `[{"type": "image"}, {"type": "text", "text": "..."}]` for vision inputs).
605+
606+
Ready-to-use examples for each model type are provided under `assets/samples/`:
607+
608+
| Model type | Example file |
609+
|---|---|
610+
| LLM | [assets/samples/text.json](assets/samples/text.json) |
611+
| ALM (audio) | [assets/samples/audio.json](assets/samples/audio.json) |
612+
| VLM (vision) | [assets/samples/vision.json](assets/samples/vision.json) |
613+
566614
#### Quantization Guidance
567615

568616
To automatically identify sensitive layers and generate a mixed-precision recipe suggestion, add the `--quant_recipe_suggestion` flag. During calibration, the analyzer compares FP32 and QDQ intermediate outputs layer-by-layer using SQNR, then writes two files to the working directory:
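
Reviewer note: the `--calib_samples` format added in the README hunk above (a flat list of objects, each with `messages` and an optional `files` list) can be checked before a long compile run. The sketch below is an illustration under stated assumptions, not code from this commit: `validate_samples`, the `ALLOWED_ROLES` set, and the output file name are all hypothetical, and the real loader in the `dataset/` package may accept or reject different inputs.

```python
import json
import tempfile
from pathlib import Path

# Assumption: the usual chat roles; the commit does not list accepted roles.
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_samples(samples, require_files=False):
    """Check samples against the format described in the README diff.

    Hypothetical helper. Set require_files=True for VLM/ALM samples,
    where the README says `files` is needed.
    """
    if not isinstance(samples, list):
        raise ValueError("top level must be a flat list of sample objects")
    for i, sample in enumerate(samples):
        if not isinstance(sample, dict):
            raise ValueError(f"sample {i}: must be a JSON object")
        messages = sample.get("messages")
        if not isinstance(messages, list) or not messages:
            raise ValueError(f"sample {i}: 'messages' must be a non-empty list")
        for msg in messages:
            if not isinstance(msg, dict) or msg.get("role") not in ALLOWED_ROLES:
                raise ValueError(f"sample {i}: unexpected or missing role")
            # content is either a plain string or a list of content blocks
            if not isinstance(msg.get("content"), (str, list)):
                raise ValueError(f"sample {i}: 'content' must be str or list")
        if require_files and not sample.get("files"):
            raise ValueError(f"sample {i}: 'files' is required for multimodal models")
    return len(samples)


# Build one text-only sample and write it where --calib_samples could read it.
text_samples = [
    {
        "messages": [
            {"role": "user", "content": "What does calibration do?"},
            {"role": "assistant", "content": "It collects activation statistics."},
        ]
    }
]
out_path = Path(tempfile.mkdtemp()) / "my_text_samples.json"
out_path.write_text(json.dumps(text_samples, indent=2))
count = validate_samples(json.loads(out_path.read_text()))
print(f"{count} sample(s) written to {out_path}")
```

The resulting path could then be passed as `--calib_samples <path>` in place of `assets/samples/text.json`; for vision or audio samples, each object would also carry a `files` list and the check would be run with `require_files=True`.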
