This tutorial covers the end-to-end workflow for building an Android demo app using Qualcomm AI accelerators on device. More specifically, it covers:
- Export and quantization of Llama models against the Qualcomm backend.
- Building and linking libraries that are required for on-device inference on the Android platform using Qualcomm AI accelerators.
- Building the Android demo app itself.
Verified on Linux CentOS, QNN SDK v2.26, Python 3.10, Android NDK r27b.
Phones verified: OnePlus 12, Samsung 24+, Samsung 23.
- Download and unzip QNN SDK v2.26
- Download and unzip Android NDK r27b
- Android phone with Snapdragon 8 Gen 3 (SM8650) or Gen 2 (SM8550). Gen 1 and lower SoC might be supported but not fully validated.
- Desired Llama model weights in .pth format. You can download them on Hugging Face (Example).
In this section, we set up the ExecuTorch repo with Conda environment management. Make sure Conda is available on your system (or follow the instructions to install it here). The commands below were run on Linux (CentOS).
Check out the ExecuTorch repo and sync submodules:
git clone -b viable/strict https://github.com/pytorch/executorch.git && cd executorch
Create either a Python virtual environment:
python3 -m venv .venv && source .venv/bin/activate && pip install --upgrade pip
Or a Conda environment:
conda create -n et_xnnpack python=3.10.0 && conda activate et_xnnpack
Install dependencies:
./install_executorch.sh
# Set these variables correctly for your environment
export ANDROID_NDK_ROOT=$HOME/android-ndk-r27b # Download the Android NDK and unzip it to your home directory
export QNN_SDK_ROOT=$HOME/Your-SDK-Root # Folder that contains lib
export EXECUTORCH_ROOT=$HOME/repos/executorch
export LD_LIBRARY_PATH=$QNN_SDK_ROOT/lib/x86_64-linux-clang/:$LD_LIBRARY_PATH
export PYTHONPATH=$EXECUTORCH_ROOT/..
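Because the build steps that follow depend on these variables, it can help to confirm that each one points to a real directory before starting. The sketch below is one way to do that; check_env_dirs is a hypothetical helper written for this tutorial, not part of ExecuTorch or the QNN SDK.

```shell
# Hypothetical helper (not part of ExecuTorch): report every named variable
# that is unset or does not point to an existing directory.
# The function body runs in a subshell, so its variables do not leak.
check_env_dirs() (
  status=0
  for name in "$@"; do
    # Indirect lookup: read the value of the variable whose name is in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name is not set" >&2
      status=1
    elif [ ! -d "$value" ]; then
      echo "missing: $name=$value is not a directory" >&2
      status=1
    fi
  done
  exit "$status"
)
```

For example, `check_env_dirs ANDROID_NDK_ROOT QNN_SDK_ROOT EXECUTORCH_ROOT` returns a non-zero status and prints one line per problem if any of the three is wrong.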
cp schema/program.fbs exir/_serialize/program.fbs
cp schema/scalar_type.fbs exir/_serialize/scalar_type.fbs
./backends/qualcomm/scripts/build.sh --release
cmake -DPYTHON_EXECUTABLE=python \
-DCMAKE_INSTALL_PREFIX=cmake-out \
-DEXECUTORCH_ENABLE_LOGGING=1 \
-DCMAKE_BUILD_TYPE=Release \
-DEXECUTORCH_BUILD_EXTENSION_MODULE=ON \
-DEXECUTORCH_BUILD_EXTENSION_DATA_LOADER=ON \
-DEXECUTORCH_BUILD_EXTENSION_TENSOR=ON \
-DEXECUTORCH_BUILD_QNN=ON \
-DQNN_SDK_ROOT=${QNN_SDK_ROOT} \
-DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON \
-DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \
-DEXECUTORCH_BUILD_KERNELS_LLM=ON \
-Bcmake-out .
cmake --build cmake-out -j16 --target install --config Release
Next we need to build and compile the Llama runner. This is similar to the requirements for running Llama with XNNPACK.
./examples/models/llama/install_requirements.sh
cmake -DPYTHON_EXECUTABLE=python \
-DCMAKE_INSTALL_PREFIX=cmake-out \
-DCMAKE_BUILD_TYPE=Release \
-DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \
-DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON \
-DEXECUTORCH_BUILD_KERNELS_LLM=ON \
-DEXECUTORCH_BUILD_EXTENSION_TENSOR=ON \
-DEXECUTORCH_BUILD_QNN=ON \
-Bcmake-out/examples/models/llama \
examples/models/llama
cmake --build cmake-out/examples/models/llama -j16 --config Release
The QNN backend currently supports exporting to these data types: fp32, int4/int8 with PTQ, and int4 with SpinQuant (Llama 3 only).
We also support export for different Qualcomm SoCs. We have verified SM8650 (V75) and SM8550 (V73). To export for a different SoC, add "--soc_model SM8550" to your export command. Without this flag, the export defaults to SM8650.
We support PTQ by default. The entire export may take ~20 minutes (Llama 3.1 8B). However, PTQ currently causes an accuracy regression, which we are working to reduce. 8B models might need 16GB RAM on the device to run.
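If you script exports for more than one phone, a small lookup can keep the SoC choice explicit. The sketch below covers only the two SoCs verified in this tutorial; soc_to_htp and the SOC_MODEL variable are hypothetical names used for illustration.

```shell
# Hypothetical helper: map a verified SoC model to the version tag this
# tutorial lists next to it. Any other SoC is reported as unverified.
soc_to_htp() {
  case "$1" in
    SM8650) echo "V75" ;;
    SM8550) echo "V73" ;;
    *) echo "unverified SoC: $1" >&2; return 1 ;;
  esac
}

# SM8650 is the export default when no SoC is given.
SOC_MODEL="${SOC_MODEL:-SM8650}"
echo "Exporting for $SOC_MODEL ($(soc_to_htp "$SOC_MODEL"))"
```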
Examples:
# 4-bit weight quantization (qnn_16a4w: 16-bit activations, 4-bit weights)
python -m extension.llm.export.export_llm base.checkpoint="${MODEL_DIR}/consolidated.00.pth" base.params="${MODEL_DIR}/params.json" model.use_kv_cache=True model.enable_dynamic_shape=False backend.qnn.enabled=True quantization.pt2e_quantize="qnn_16a4w" model.dtype_override="fp32" base.metadata='"{\"get_bos_id\":128000, \"get_eos_ids\":[128009, 128001]}"' export.output_name="test.pte"
If the model is very large, it may require sharding because the Qualcomm DSP is a 32-bit system with a 4GB size limit. For example, Llama 3 8B models need to be sharded into 4 parts, but ExecuTorch still packages them into a single PTE file. Here is an example:
# 8 bits quantization with 4 shards
python -m extension.llm.export.export_llm base.checkpoint="${MODEL_DIR}/consolidated.00.pth" base.params="${MODEL_DIR}/params.json" model.use_kv_cache=True model.enable_dynamic_shape=False backend.qnn.enabled=True quantization.pt2e_quantize="qnn_8a8w" model.dtype_override="fp32" backend.qnn.num_sharding=4 base.metadata='"{\"get_bos_id\":128000, \"get_eos_ids\":[128009, 128001]}"' export.output_name="test.pte"
Note: if you encounter the error below
[ERROR] [Qnn ExecuTorch]: Cannot Open QNN library libQnnHtp.so, with error: libc++.so.1: cannot open shared object file: No such file or directory
Resolve by:
- Install an older QNN SDK, such as 2.23 or below, and copy libc++.so.1 from ${QNN_SDK_ROOT}/lib/x86_64-linux-clang
- Install it yourself with apt-get
- Install it with the script at ${QNN_SDK_ROOT}/bin/check-linux-dependency.sh (refer to the QNN SDK documentation)
- Install it with Conda:
conda install -c conda-forge libcxx=14.0.0
After installation, check that libc++.so.1 is in your LD_LIBRARY_PATH or system library path. Refer to this PR for more detail.
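One way to check whether libc++.so.1 is reachable through LD_LIBRARY_PATH is to walk the path entries by hand. The sketch below does only that; find_lib_in_path is a hypothetical helper, and it does not search the system library directories.

```shell
# Hypothetical helper: print the first directory in a colon-separated search
# path that contains the given library file; exit status 1 if none does.
# Runs in a subshell so the IFS and globbing changes stay local.
find_lib_in_path() (
  lib_name=$1
  search_path=$2
  IFS=:
  set -f   # disable globbing while splitting the path
  for dir in $search_path; do
    [ -n "$dir" ] || continue
    if [ -e "$dir/$lib_name" ]; then
      echo "$dir"
      exit 0
    fi
  done
  exit 1
)
```

For example, `find_lib_in_path libc++.so.1 "${LD_LIBRARY_PATH:-}"` prints the matching directory. If it finds nothing, the library may still be installed system-wide, which on most Linux systems can be checked with `ldconfig -p | grep libc++`.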
You may also wonder what the "--metadata" flag (base.metadata in the commands above) does. It exports the model with the proper special tokens recorded, so that the runner can easily detect EOS tokens.
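The nested quoting in the metadata value is easy to get wrong when typing it by hand. As a sketch, the same string used in the export commands above can be assembled from variables with printf; BOS_ID, EOS_IDS, and METADATA are hypothetical variable names, and the token IDs are the ones from those commands.

```shell
# Token IDs taken from the export commands in this tutorial.
BOS_ID=128000
EOS_IDS="128009, 128001"

# Build the quoted JSON string. Inside the single-quoted format, \\" makes
# printf emit a literal backslash followed by a double quote.
METADATA=$(printf '"{\\"get_bos_id\\":%s, \\"get_eos_ids\\":[%s]}"' "$BOS_ID" "$EOS_IDS")

echo "base.metadata=$METADATA"
```

The result can then be passed to the export command as `base.metadata="$METADATA"` in place of the hand-written literal.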
Convert tokenizer for Llama 2
python -m pytorch_tokenizers.tools.llama2c.convert -t tokenizer.model -o tokenizer.bin
For Llama 3, rename the tokenizer with the command: mv tokenizer.model tokenizer.bin. We are updating the demo app to support the tokenizer in its original format directly.
We also support Llama 3 8B with SpinQuant, where the accuracy regression is minimal.
Deploying large language models like Llama 3 on-device presents the following challenges:
- The model size is too large to fit in device memory for inference.
- High model loading and inference time.
- Difficulty in quantization.
To address these challenges, we have implemented the following solutions:
- Using --pt2e_quantize qnn_16a4w to quantize activations and weights, thereby reducing the on-disk model size and alleviating memory pressure during inference.
- Using --num_sharding 8 to shard the model into sub-parts.
- Performing graph transformations to convert or decompose operations into more accelerator-friendly operations.
- Using --optimized_rotation_path <path_to_optimized_matrix> to apply R1 and R2 of SpinQuant to improve accuracy.
- Using --calibration_data "<|start_header_id|>system<|end_header_id|..." to ensure that during the quantization of Llama 3 8B Instruct, the calibration includes special tokens in the prompt template. For more details on the prompt template, refer to the model card of Meta Llama 3 Instruct.
To get the optimized matrix, please refer to SpinQuant on GitHub. You can download the optimized rotation matrices in the Quantized Models section. Please choose "LLaMA-3-8B/8B_W4A16KV16_lr_1.5_seed_0".
To export Llama 3 8B instruct with the Qualcomm AI Engine Direct Backend, ensure the following:
- The host machine has more than 100GB of memory (RAM + swap space).
- The entire process takes a few hours.
- 8B models might need 16GB RAM on the device to run.
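Since the export needs a large amount of host memory, it can be worth checking RAM plus swap before starting a run that takes hours. The sketch below reads a Linux meminfo-style file; total_mem_gib and has_enough_host_memory are hypothetical helpers, and treating the 100GB requirement as 100 GiB is an approximation.

```shell
# Hypothetical helper: sum MemTotal and SwapTotal (reported in kB) from a
# meminfo-style file and print the total in whole GiB.
total_mem_gib() {
  awk '/^(MemTotal|SwapTotal):/ { kb += $2 }
       END { printf "%d\n", kb / 1024 / 1024 }' "${1:-/proc/meminfo}"
}

# Succeed only if RAM + swap exceeds the threshold in GiB (default 100).
has_enough_host_memory() {
  [ "$(total_mem_gib "${1:-/proc/meminfo}")" -gt "${2:-100}" ]
}
```

On the export host, `has_enough_host_memory || echo "Add swap before exporting"` gives a quick go/no-go check.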
# Please note that calibration_data must include the prompt template for special tokens.
python -m extension.llm.export.export_llm base.tokenizer=<path_to_tokenizer.model> base.params=<path_to_params.json> base.checkpoint=<path_to_checkpoint_for_Meta-Llama-3-8B-Instruct> model.use_kv_cache=True backend.qnn.enabled=True quantization.pt2e_quantize="qnn_16a4w" model.enable_dynamic_shape=False backend.qnn.num_sharding=8 backend.qnn.calibration_tasks="wikitext" backend.qnn.calibration_limit=1 backend.qnn.calibration_seq_length=128 backend.qnn.optimized_rotation_path=<path_to_optimized_matrix> backend.qnn.calibration_data="<|start_header_id|>system<|end_header_id|>\n\nYou are a funny chatbot.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nCould you tell me about Facebook?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
Once you have the model and tokenizer ready, you can push them to the device before we start building the Android demo app.
adb shell mkdir -p /data/local/tmp/llama
adb push llama-exported.pte /data/local/tmp/llama
adb push tokenizer.bin /data/local/tmp/llama
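A missing file is easier to catch before pushing than after the app fails to load a model. As a sketch, the helper below prints the adb commands for the files that exist and fails if any file is missing; print_push_commands and DEVICE_DIR are hypothetical names, and nothing is sent to the device by this code.

```shell
# Target directory on the device, as used in this tutorial.
DEVICE_DIR=/data/local/tmp/llama

# Hypothetical helper: print the adb commands needed to stage the given
# files, and exit non-zero if any file is missing. It only prints commands.
print_push_commands() (
  status=0
  echo "adb shell mkdir -p $DEVICE_DIR"
  for f in "$@"; do
    if [ -f "$f" ]; then
      echo "adb push $f $DEVICE_DIR"
    else
      echo "not found: $f" >&2
      status=1
    fi
  done
  exit "$status"
)
```

For example, `print_push_commands llama-exported.pte tokenizer.bin` lists the commands to review and run once both files are present.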
- Open a terminal window and navigate to the root directory of ExecuTorch:
cd $EXECUTORCH_ROOT
- Set the following environment variables:
export ANDROID_NDK=<path_to_android_ndk>
export ANDROID_ABIS=arm64-v8a
export QNN_SDK_ROOT=<path_to_qnn_sdk>
Note: <path_to_android_ndk> is the root of the NDK, which is usually under ~/Library/Android/sdk/ndk/XX.Y.ZZZZZ on macOS, and contains NOTICE and README.md. We use <path_to_android_ndk>/build/cmake/android.toolchain.cmake for CMake to cross-compile.
- Create a directory to hold the AAR
mkdir -p aar-out
export BUILD_AAR_DIR=aar-out
- Run the following command to build the AAR:
sh scripts/build_android_library.sh
- Now go to the demo app root (containing the main README.md) and copy the AAR to the app:
cd $DEMO_APP
mkdir -p app/libs
cp $EXECUTORCH_ROOT/aar-out/executorch.aar app/libs/executorch.aar
The build script configures the required core ExecuTorch, Llama 2/3, and Android libraries and builds them into an AAR, which the cp command then copies into the app.
- Add QNN runtime dependency to Gradle:
implementation("com.qualcomm.qti:qnn-runtime:2.33.0")
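Two quick checks can save a failed build in the AAR steps above: that the NDK root contains the CMake toolchain file, and that the AAR exists before it is copied. The sketch below covers both; ndk_has_toolchain and aar_is_built are hypothetical helpers, and the paths are the ones named in this tutorial.

```shell
# Hypothetical helper: succeed if the NDK root contains the CMake toolchain
# file used for cross-compiling.
ndk_has_toolchain() {
  [ -f "$1/build/cmake/android.toolchain.cmake" ]
}

# Hypothetical helper: succeed if the build produced a non-empty AAR under
# the given ExecuTorch root.
aar_is_built() {
  [ -s "$1/aar-out/executorch.aar" ]
}
```

For example, run `ndk_has_toolchain "$ANDROID_NDK"` before the build script and `aar_is_built "$EXECUTORCH_ROOT"` before the copy step.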
First, make sure your Android phone's chipset version is compatible with this demo (SM8650, SM8550). You can find the Qualcomm chipset version here in the mapping.
Open Android Studio and select "Open an existing Android Studio project" to open LlamaDemo. Run the app (^R). This builds and launches the app on the phone.
Without the Android Studio UI, you can run Gradle directly to build the app. Set the Android SDK path and invoke Gradle:
export ANDROID_HOME=<path_to_android_sdk_home>
cd LlamaDemo
./gradlew :app:installDebug
If the app successfully runs on your device, you should see something like the screenshot below:
If you encounter any bugs or issues while following this tutorial, please file a bug/issue here on GitHub.
