python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --checkpoint consolidated.00.pth --params params.json --tokenizer_model tokenizer.model --decoder_model llama3_2-3b_instruct --model_mode hybrid --prefill_ar_len 128 --max_seq_len 1024 --prompt "I would like to learn python, could you teach me with a simple example?" --calib_tasks wikitext --calib_limit 1
@@ -210,7 +210,17 @@ Default example using hybrid mode.
210
210
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --decoder_model smollm3-3b --model_mode hybrid --prefill_ar_len 128 --max_seq_len 1024 --prompt "I would like to learn python, could you teach me with a simple example?" --calib_tasks wikitext --calib_limit 1
211
211
```
212
212
213
-
## Multimodal Support
213
+
#### Using custom calibration samples for LLMs
214
+
215
+
Instead of `--calib_tasks`, you can supply your own conversation JSON files via `--calib_samples`. The samples are fed into the quantization calibration pass to collect activation observer statistics — they do not affect the inference prompt. This is useful when you want to calibrate on domain-specific or instruct-format data rather than a generic lm_eval task.
216
+
217
+
```bash
218
+
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --decoder_model smollm2_135m --model_mode hybrid --prefill_ar_len 128 --max_seq_len 1024 --prompt "I would like to learn python, could you teach me with a simple example?" --calib_samples examples/qualcomm/oss_scripts/llama/assets/samples/text.json
219
+
```
220
+
221
+
You can also provide both `--calib_tasks` and `--calib_samples` at the same time; the pipeline concatenates both data sources for calibration.
222
+
223
+
214
224
215
225
### Overview
216
226
@@ -268,7 +278,7 @@ pip install soundfile
268
278
269
279
Default example using hybrid mode.
270
280
```bash
271
-
python examples/qualcomm/oss_scripts/llama/llama.py --build_folder build-android --device ${SERIAL_NUM} --soc_model ${SOC_MODEL} --decoder_model granite_speech_3_3-2b --model_mode hybrid --prefill_ar_len 128 --max_seq_len 1024 --prompt "can you transcribe the speech into a written format?" --audio_path "https://huggingface.co/ibm-granite/granite-speech-3.3-2b/resolve/main/10226_10111_000000.wav?download=true"
281
+
python examples/qualcomm/oss_scripts/llama/llama.py --build_folder build-android --device ${SERIAL_NUM} --soc_model ${SOC_MODEL} --decoder_model granite_speech_3_3-2b --model_mode hybrid --prefill_ar_len 128 --max_seq_len 1024 --prompt "can you transcribe the speech into a written format?" --audio_path "https://huggingface.co/ibm-granite/granite-speech-3.3-2b/resolve/main/10226_10111_000000.wav?download=true" --calib_samples examples/qualcomm/oss_scripts/llama/assets/samples/audio.json
272
282
```
273
283
274
284
### Specifying Custom Audio
@@ -281,9 +291,6 @@ You can specify a custom audio file for ALM models using the `--audio_path` flag
281
291
- **Local file paths**: Absolute or relative paths to `.wav` files on your system
282
292
- Example: `"/path/to/your/audio.wav"`
283
293
284
-
**Default behavior:**
285
-
If `--audio_path` is not specified, the system will automatically use the default audio file defined in the model's configuration file (`encoder/encoder_config.py`).
286
-
287
294
#### Audio Preprocessing
288
295
289
296
The audio encoder configuration is defined in `encoder/encoder_config.py`:
@@ -294,7 +301,6 @@ The audio encoder configuration is defined in `encoder/encoder_config.py`:
294
301
class GraniteSpeechEncoder(AudioModalityConfig):
295
302
    encoder_class = GraniteSpeechCTCEncoderWrapper
296
303
    audio_seq_len = 171
297
-
    audio_url = "https://huggingface.co/ibm-granite/granite-speech-3.3-2b/resolve/main/10226_10111_000000.wav?download=true"  # Default audio (content: "After his nap, ...")
298
304
    quant_recipe = GraniteSpeechEncoderQuantRecipe
299
305
```
300
306
@@ -351,13 +357,13 @@ Vision-Language Models (VLMs) combine computer vision and natural language proce
If `--image_path` is not specified, the system will automatically use the default image URL defined in the model's configuration file (`encoder/encoder_config.py`).
375
-
376
379
#### Image Preprocessing
377
380
378
381
Each VLM model has specific preprocessing requirements defined in its configuration:
@@ -385,7 +388,6 @@ class SmolVLMEncoder(VisionModalityConfig):
@@ -453,16 +455,19 @@ The VLM inference pipeline consists of:
453
455
- Special tokens (e.g., `<image>`, `<|fake_token_around_image|>`, `<fake_token_around_image>`) mark modality boundaries (see [tokenizer.py](tokenizer.py))
454
456
455
457
```python
456
-
# Special tokens for Vision-Language Model
457
-
VLM_SPECIAL_TOKENS = {
458
-
    "smolvlm_500m_instruct": {
459
-
        "image_token": "<image>",
460
-
        "global_img": "<global-img>",
461
-
        "fake_wrap_start": "<fake_token_around_image>",
462
-
        "fake_wrap_end": "<fake_token_around_image>",
463
-
},
464
-
...
465
-
}
458
+
# Token fields on each encoder config subclass (encoder/encoder_config.py)
459
+
@dataclass(init=False, frozen=True)
460
+
class SmolVLMEncoder(VisionModalityConfig):
461
+
    img_token = "<image>"
462
+
    fake_wrap_start = "<fake_token_around_image>"
463
+
    fake_wrap_end = "<fake_token_around_image>"
464
+
    global_img_token = "<global-img>"
465
+
466
+
@dataclass(init=False, frozen=True)
467
+
class InternVL3Encoder(VisionModalityConfig):
468
+
    img_token = "<IMG_CONTEXT>"
469
+
    fake_wrap_start = "<img>"
470
+
    fake_wrap_end = "</img>"
466
471
```
467
472
- Final fused sequence: `[batch, img_seq_len + text_seq_len, hidden_dim]`
468
473
@@ -545,16 +550,13 @@ From the example script above, 1 wikitext sample is used to evaluate all 3 phase
545
550
Example:
546
551
```bash
547
552
# 1st run to compile with --calib_limit 1
548
-
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --prompt "I would like to learn python, could you teach me with a simple example?" --temperature 0 --model_mode kv --max_seq_len 1024 --decoder_model qwen2_5-0_5b --eval_methods tasks_eval --calib_tasks wikitext --calib_limit 1 --compile_only
553
+
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --prompt "I would like to learn python, could you teach me with a simple example?" --temperature 0 --model_mode kv --max_seq_len 1024 --decoder_model qwen2_5-0_5b --eval_methods tasks_eval --calib_tasks wikitext --calib_limit 1 -a ${FOLDER_TO_PRE_GEN_PTE} --compile_only
549
554
```
550
555
```bash
551
556
# 2nd run to perform QNN device execution with --eval_limit 3
552
-
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --prompt "I would like to learn python, could you teach me with a simple example?" --temperature 0 --model_mode kv --max_seq_len 1024 --decoder_model qwen2_5-0_5b --eval_methods tasks_eval --eval_tasks wikitext --eval_limit 3 --pre_gen_pte ${PATH_TO_ARTIFACT_IN_1ST_RUN} --quant_attrs_path ${PATH_TO_ARTIFACT_IN_1ST_RUN}/kv_llama_qnn_quant_attrs.json
557
+
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --prompt "I would like to learn python, could you teach me with a simple example?" --temperature 0 --model_mode kv --max_seq_len 1024 --decoder_model qwen2_5-0_5b --eval_methods tasks_eval --eval_tasks wikitext --eval_limit 3 --pre_gen_pte ${FOLDER_TO_PRE_GEN_PTE}
553
558
```
554
559
555
-
#### Tasks quantization calibration
556
-
If `--calib_tasks ${TASK}` is not provided, the program will use `--prompt ${PROMPT}` as the dataset for quantization calibration.
557
-
`--calib_tasks` and `--eval_tasks` are independent flags. `--calib_tasks` controls which tasks are used for quantization calibration, while `--eval_tasks` controls which tasks are used for perplexity evaluation. They can be set to different tasks or limits as needed.
558
560
559
561
#### SQNR Evaluation
560
562
To evaluate QNN's output logits against the golden logits from `nn.Module`, users can provide the flag `--sqnr_eval`. Please note that SQNR evaluation will only compare the logits of the user's prompt and will not compare the new tokens generated by the model.
@@ -563,6 +565,52 @@ Example:
563
565
python examples/qualcomm/oss_scripts/llama/llama.py --build_folder build-android --device ${SERIAL_NUM} --soc_model ${SOC_MODEL} --prompt "I would like to learn python, could you teach me with a simple example?" --temperature 0 --model_mode kv --max_seq_len 1024 --decoder_model qwen2_5-0_5b --eval_methods sqnr_eval
564
566
```
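For readers unfamiliar with the metric, the following is a minimal sketch of how an SQNR value (in dB) can be computed between golden and quantized outputs. It uses the common definition, 10·log10(signal power / noise power), on flat lists of floats. The function name `sqnr_db` is hypothetical, and the script's own implementation may differ (for example in how it reduces over tokens or vocabulary entries).

```python
import math


def sqnr_db(golden, quantized):
    """Signal-to-quantization-noise ratio in dB for two equal-length sequences.

    Hypothetical helper for illustration; not the script's actual code.
    """
    if len(golden) != len(quantized):
        raise ValueError("inputs must have the same length")
    # Power of the reference (golden) signal.
    signal_power = sum(g * g for g in golden)
    # Power of the error introduced by quantization.
    noise_power = sum((g - q) ** 2 for g, q in zip(golden, quantized))
    if noise_power == 0.0:
        return math.inf  # outputs are identical
    if signal_power == 0.0:
        return -math.inf  # all-zero reference with non-zero error
    return 10.0 * math.log10(signal_power / noise_power)


# Toy values standing in for logits of one prompt token.
golden_logits = [1.0, -2.0, 3.0]
quantized_logits = [1.1, -1.9, 3.2]
print(f"SQNR: {sqnr_db(golden_logits, quantized_logits):.2f} dB")
```

A higher value means the quantized logits track the golden logits more closely.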
565
567
568
+
569
+
570
+
#### Quantization
571
+
572
+
The calibration data is independent from the runtime evaluation set, and only affects quantization quality, not the inference output.
573
+
574
+
Calibration data is required for compilation. There are two ways to supply it:
575
+
576
+
1. **`--calib_tasks`** — calibrate on one or more lm_eval tasks (tune with `--calib_limit` and `--calib_num_fewshot`). LLM-only.
577
+
2. **`--calib_samples`** — calibrate on custom conversation samples provided as JSON files (see format below). Required for multimodal models (VLM/ALM).
578
+
579
+
For LLMs, provide at least one of the two; for multimodal models, `--calib_samples` is mandatory.
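The rule above can be written as a small pre-flight check. This is a sketch only: `check_calibration_args` is a hypothetical helper, not a function in `llama.py`, and it encodes just the two requirements stated here. Whether the script rejects `--calib_tasks` for multimodal models is not stated, so the sketch does not flag that case.

```python
def check_calibration_args(is_multimodal, calib_tasks=None, calib_samples=None):
    """Return an error message, or None if the flag combination is acceptable.

    Hypothetical helper mirroring the documented rule; not part of llama.py.
    """
    has_tasks = bool(calib_tasks)
    has_samples = bool(calib_samples)
    if is_multimodal and not has_samples:
        return "--calib_samples is mandatory for multimodal models (VLM/ALM)"
    if not is_multimodal and not (has_tasks or has_samples):
        return "provide at least one of --calib_tasks or --calib_samples"
    return None


# LLM calibrated on an lm_eval task only: acceptable.
print(check_calibration_args(False, calib_tasks=["wikitext"]))
# Multimodal model without custom samples: rejected.
print(check_calibration_args(True, calib_tasks=["wikitext"]))
```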
580
+
581
+
Calibration and runtime evaluation use separate flag sets and can target different tasks or limits as needed:
582
+
583
+
| Purpose | Flags |
584
+
|---|---|
585
+
| Calibration data (lm_eval tasks) |`--calib_tasks`, `--calib_limit`, `--calib_num_fewshot`|
`--calib_samples` accepts one or more JSON files. Each file is a flat list of sample objects. Each sample has a `messages` field following the HuggingFace chat template, and an optional `files` field for media inputs (local paths or URLs):
591
+
592
+
```json
593
+
[
594
+
{
595
+
"files": ["path/or/url/to/files"],
596
+
"messages": [
597
+
{"role": "user", "content": "..." },
598
+
{"role": "assistant", "content": "..."}
599
+
]
600
+
}
601
+
]
602
+
```
603
+
604
+
`files` is only required for multimodal models (VLM: image paths/URLs, ALM: audio paths/URLs). For LLM-only models, `files` can be omitted. `content` can be a plain string or a list of HuggingFace content blocks (e.g. `[{"type": "image"}, {"type": "text", "text": "..."}]` for vision inputs).
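As an illustration of the format, the sketch below builds a text-only sample list in Python and checks it against the structure described above before serializing it. The helper `validate_samples` and the sample conversation are hypothetical; the checks reflect only what this section states and may be looser or stricter than what the script enforces.

```python
import json


def validate_samples(samples, require_files=False):
    """Return a list of problems found in a calibration sample list.

    Hypothetical helper; an empty list means no problems were found.
    """
    if not isinstance(samples, list):
        return ["top level must be a flat list of sample objects"]
    problems = []
    for i, sample in enumerate(samples):
        messages = sample.get("messages") if isinstance(sample, dict) else None
        if not isinstance(messages, list) or not messages:
            problems.append(f"sample {i}: missing or empty 'messages'")
            continue
        for j, msg in enumerate(messages):
            if not isinstance(msg, dict) or "role" not in msg or "content" not in msg:
                problems.append(f"sample {i}, message {j}: needs 'role' and 'content'")
            elif not isinstance(msg["content"], (str, list)):
                problems.append(f"sample {i}, message {j}: 'content' must be a string or a list")
        # 'files' is only needed for multimodal (VLM/ALM) samples.
        if require_files and not sample.get("files"):
            problems.append(f"sample {i}: 'files' is required for multimodal models")
    return problems


# Made-up text-only sample for an LLM; 'files' is omitted.
text_samples = [
    {
        "messages": [
            {"role": "user", "content": "What is a Python list?"},
            {"role": "assistant", "content": "An ordered, mutable sequence of items."},
        ]
    }
]

print(validate_samples(text_samples))           # no problems expected for an LLM
print(json.dumps(text_samples, indent=2))       # write this text to a .json file
```

The resulting JSON text can be saved to a file and passed to `--calib_samples`.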
605
+
606
+
Ready-to-use examples for each model type are provided under `assets/samples/`:
To automatically identify sensitive layers and generate a mixed-precision recipe suggestion, add the `--quant_recipe_suggestion` flag. During calibration, the analyzer compares FP32 and QDQ intermediate outputs layer-by-layer using SQNR, then writes two files to the working directory: