For general Llama enablement, please see the Llama README page for complete details. This page contains Llama 2 specific instructions and information.
We have verified that Llama 2 7B runs efficiently in mobile applications on select devices, including the iPhone 15 Pro, iPhone 15 Pro Max, Samsung Galaxy S22 and S24, and OnePlus 12.
Since Llama 2 7B needs at least 4-bit quantization to fit even on some high-end phones, the results presented here correspond to a 4-bit groupwise post-training quantized model.
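To make the groupwise scheme concrete, here is a minimal sketch of symmetric 4-bit groupwise quantization in NumPy. It is illustrative only and is not the exact algorithm the exporter uses; the function names and the choice of symmetric scaling are assumptions for demonstration.

```python
# Illustrative sketch of 4-bit groupwise quantization (NOT the exact
# scheme used by the exporter; symmetric scaling is an assumption).
import numpy as np

def quantize_groupwise_4bit(weights, group_size=128):
    """Quantize a 1-D float array to signed 4-bit ints, one scale per group."""
    w = weights.reshape(-1, group_size)
    # Symmetric quantization: pick the scale so the max magnitude maps to 7.
    scales = np.abs(w).max(axis=1, keepdims=True) / 7.0
    scales[scales == 0] = 1.0  # avoid division by zero for all-zero groups
    q = np.clip(np.round(w / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize_groupwise(q, scales):
    """Recover approximate float weights from 4-bit values and per-group scales."""
    return (q.astype(np.float32) * scales).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(256).astype(np.float32)
q, s = quantize_groupwise_4bit(w, group_size=128)
w_hat = dequantize_groupwise(q, s)
max_err = float(np.abs(w - w_hat).max())
```

A larger group size (256 instead of 128) stores fewer scales, shrinking the model slightly at the cost of coarser per-group resolution, which matches the perplexity trade-off shown below.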
Llama 2 7B performance was measured on Samsung Galaxy S22, Galaxy S24, and OnePlus 12 devices. Performance is expressed in tokens per second and was measured using an adb binary-based approach.
| Device | Groupwise 4-bit (group size 128) | Groupwise 4-bit (group size 256) |
|---|---|---|
| Galaxy S22 | 8.15 tokens/second | 8.3 tokens/second |
| Galaxy S24 | 10.66 tokens/second | 11.26 tokens/second |
| OnePlus 12 | 11.55 tokens/second | 11.6 tokens/second |
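The metric in the table is decode throughput: tokens generated divided by wall-clock time. A minimal sketch (the timing numbers below are hypothetical, not taken from the measurements above):

```python
# Tokens-per-second throughput calculation (hypothetical example values).
def tokens_per_second(num_tokens, elapsed_seconds):
    """Decode throughput: generated tokens divided by wall-clock time."""
    return num_tokens / elapsed_seconds

# e.g. 128 tokens generated in 12.0 s of decode time
tps = tokens_per_second(128, 12.0)
```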
Below are the WikiText perplexity results, measured with LM Eval (max_seq_length 2048, limit 1000), for two different group sizes.
| Model | Baseline (FP32) | Groupwise 4-bit (group size 128) | Groupwise 4-bit (group size 256) |
|---|---|---|---|
| Llama 2 7B | 9.2 | 10.2 | 10.7 |
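Perplexity is the exponential of the average per-token negative log-likelihood, so lower is better. A minimal sketch of the relationship (the token NLL values are illustrative):

```python
# Perplexity = exp(mean negative log-likelihood per token).
import math

def perplexity(token_nlls):
    """Compute perplexity from a list of per-token negative log-likelihoods."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# A model assigning every token probability 1/e has NLL 1.0 per token,
# so its perplexity is e ~= 2.718.
ppl = perplexity([1.0, 1.0, 1.0])
```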
You can export and run the original Llama 2 7B model.
1. Download the Llama 2 pretrained parameters from Meta's official website or from Hugging Face.

2. Edit the `params.json` file: replace `"vocab_size": -1` with `"vocab_size": 32000`. This is a short-term workaround.

3. Export the model and generate a `.pte` file:

   ```
   python -m extension.llm.export.export_llm \
     base.checkpoint=<checkpoint.pth> \
     base.params=<params.json> \
     model.use_kv_cache=True \
     model.use_sdpa_with_kv_cache=True \
     backend.xnnpack.enabled=True \
     quantization.qmode="8da4w" \
     quantization.group_size=128 \
     model.dtype_override="fp32"
   ```

4. Create `tokenizer.bin`:

   ```
   python -m pytorch_tokenizers.tools.llama2c.convert -t <tokenizer.model> -o tokenizer.bin
   ```

   Pass the converted `tokenizer.bin` file instead of `tokenizer.model` in subsequent steps.
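The `params.json` edit in step 2 can also be scripted. A minimal sketch, assuming the released file contains `"vocab_size": -1` (the stand-in file below has fewer fields than the real `params.json`):

```python
# Sketch of automating the params.json vocab_size workaround (step 2).
import json
import os
import tempfile

def patch_vocab_size(path, vocab_size=32000):
    """Replace a missing/placeholder vocab_size in params.json with the real value."""
    with open(path) as f:
        params = json.load(f)
    if params.get("vocab_size", 0) <= 0:  # the released file uses -1 as a placeholder
        params["vocab_size"] = vocab_size
    with open(path, "w") as f:
        json.dump(params, f, indent=2)
    return params

# Demonstrate on a stand-in file (the real params.json has more fields).
fd, path = tempfile.mkstemp(suffix=".json")
with os.fdopen(fd, "w") as f:
    json.dump({"dim": 4096, "vocab_size": -1}, f)
patched = patch_vocab_size(path)
os.remove(path)
```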
Running the exported model follows the same steps described in the Llama README.