Commit 985eeb7

docs: add Android LLM runner page and HuggingFace (#19611)
## Summary

1. New `docs/source/llm/run-on-android.md`, a Java reference for the `executorch-android` AAR runner. Same shape as `run-on-ios.md`. Covers `LlmModule`, the `LlmModuleConfig` builder, `LlmGenerationConfig`, the `LlmCallback` methods, `load`/`stop`/`resetContext`, and the image/audio prefill variants. Points at LlamaDemo.
2. Added `run-on-android` to the LLM toctree in `working-with-llms.md`, sitting between the Qualcomm page and iOS.
3. In `getting-started.md`, swapped the two GitHub example links for the in-docs Android and iOS pages so users stay in the docs.
4. Added a tip admonition to `using-executorch-export.md` under Model Preparation, sending HF Hub users to `export-llm-optimum.md` before the manual flow.
5. Cleaned up `export-llm-optimum.md`. Removed the leftover "Method 1" framing since only the CLI path is documented, bumped the orphaned subheadings up a level, and pointed the Running on Device links at the new Android page and the existing iOS page (sample apps kept inline).

Fixes #8790

cc @mergennachin @AlannaBurke @larryliu0820 @cccclai @helunwencser @jackzhxng @byjlw
1 parent 6ca2589 commit 985eeb7

5 files changed

Lines changed: 216 additions & 13 deletions

docs/source/llm/export-llm-optimum.md

Lines changed: 7 additions & 11 deletions
@@ -45,15 +45,11 @@ Optimum ExecuTorch supports a wide range of model architectures including decode
4545

4646
For the complete list of supported models, see the [Optimum ExecuTorch documentation](https://github.com/huggingface/optimum-executorch#-supported-models).
4747

48-
## Export Methods
48+
## CLI Export
4949

50-
Optimum ExecuTorch offers two ways to export models:
50+
The `optimum-cli` command is the recommended way to export Hugging Face models. It provides a single invocation that downloads the model from the Hub, applies the configured optimizations, and writes the resulting `.pte` file.
5151

52-
### Method 1: CLI Export
53-
54-
The CLI is the simplest way to export models. It provides a single command to convert models from Hugging Face Hub to ExecuTorch format.
55-
56-
#### Basic Export
52+
### Basic Export
5753

5854
```bash
5955
optimum-cli export executorch \
@@ -63,7 +59,7 @@ optimum-cli export executorch \
6359
--output_dir="./smollm2_exported"
6460
```
6561

66-
#### With Optimizations
62+
### With Optimizations
6763

6864
Add custom SDPA, KV cache optimization, and quantization:
6965

@@ -79,7 +75,7 @@ optimum-cli export executorch \
7975
--output_dir="./smollm2_exported"
8076
```
8177

82-
#### Available CLI Arguments
78+
### Available CLI Arguments
8379

8480
Key arguments for LLM export include `--model`, `--task`, `--recipe` (backend), `--use_custom_sdpa`, `--use_custom_kv_cache`, `--qlinear` (linear quantization), `--qembedding` (embedding quantization), and `--max_seq_len`.
8581

@@ -156,8 +152,8 @@ print(generated_text)
156152
After verifying your model works correctly, deploy it to device:
157153

158154
- [Running with C++](run-with-c-plus-plus.md) - Run exported models using ExecuTorch's C++ runtime
159-
- [Running on Android](https://github.com/meta-pytorch/executorch-examples/tree/main/llm/android) - Deploy to Android devices
160-
- [Running on iOS](https://github.com/meta-pytorch/executorch-examples/tree/main/llm/apple) - Deploy to iOS devices
155+
- [Running on Android](run-on-android.md) - Java APIs for the `executorch-android` AAR (sample app: [LlamaDemo](https://github.com/meta-pytorch/executorch-examples/tree/main/llm/android))
156+
- [Running on iOS](run-on-ios.md) - Objective-C and Swift APIs for the `ExecuTorchLLM` framework (sample app: [etLLM](https://github.com/meta-pytorch/executorch-examples/tree/main/llm/apple))
161157

162158
## Performance
163159

docs/source/llm/getting-started.md

Lines changed: 2 additions & 2 deletions
@@ -25,6 +25,6 @@ Deploying LLMs to ExecuTorch can be boiled down to a two-step process: (1) expor
2525

2626
### Running
2727
- [Running with C++](run-with-c-plus-plus.md)
28-
- [Running on Android (XNNPack)](https://github.com/meta-pytorch/executorch-examples/tree/main/llm/android)
28+
- [Running on Android](run-on-android.md)
2929
- [Running on Android (Qualcomm)](build-run-llama3-qualcomm-ai-engine-direct-backend.md)
30-
- [Running on iOS](https://github.com/meta-pytorch/executorch-examples/tree/main/llm/apple)
30+
- [Running on iOS](run-on-ios.md)

docs/source/llm/run-on-android.md

Lines changed: 202 additions & 0 deletions
@@ -0,0 +1,202 @@
1+
# Running LLMs on Android
2+
3+
ExecuTorch's LLM-specific runtime components provide an experimental Java interface around the core C++ LLM runtime, available through the `executorch-android` AAR.
4+
5+
## Prerequisites
6+
7+
Make sure you have a model and tokenizer files ready, as described in the prerequisites section of the [Running LLMs with C++](run-with-c-plus-plus.md) guide.
8+
9+
To add the `executorch-android` library to your app, see [Using ExecuTorch on Android](../using-executorch-android.md). The LLM runner classes are bundled inside the same AAR as the generic `Module` API.
10+
11+
## Runtime API
12+
13+
Once the `executorch-android` AAR is on your classpath, you can import the LLM runner classes from the `org.pytorch.executorch.extension.llm` package.
14+
15+
### Importing
16+
17+
```java
18+
import org.pytorch.executorch.extension.llm.LlmModule;
19+
import org.pytorch.executorch.extension.llm.LlmModuleConfig;
20+
import org.pytorch.executorch.extension.llm.LlmGenerationConfig;
21+
import org.pytorch.executorch.extension.llm.LlmCallback;
22+
```
23+
24+
### LlmModule
25+
26+
The `LlmModule` class provides a simple Java interface for loading a text-generation model, configuring its tokenizer, generating token streams, and stopping execution. It also supports multimodal models that accept image and audio inputs alongside a text prompt.
27+
28+
This API is experimental and subject to change.
29+
30+
#### Initialization
31+
32+
Create an `LlmModule` by specifying paths to your serialized model (`.pte`) and tokenizer files. For text-only models, the simple constructor is enough:
33+
34+
```java
35+
LlmModule module = new LlmModule(
36+
"/data/local/tmp/llama-3.2-instruct.pte",
37+
"/data/local/tmp/tokenizer.model",
38+
0.8f);
39+
```
40+
41+
For finer control (multimodal model type, BOS/EOS handling, supplementary data files, load mode), use `LlmModuleConfig` with the fluent builder:
42+
43+
```java
44+
LlmModuleConfig config = LlmModuleConfig.create()
45+
.modulePath("/data/local/tmp/llama-3.2-instruct.pte")
46+
.tokenizerPath("/data/local/tmp/tokenizer.model")
47+
.temperature(0.8f)
48+
.modelType(LlmModuleConfig.MODEL_TYPE_TEXT)
49+
.loadMode(LlmModuleConfig.LOAD_MODE_MMAP)
50+
.build();
51+
52+
LlmModule module = new LlmModule(config);
53+
```
54+
55+
Available load modes are `LOAD_MODE_FILE`, `LOAD_MODE_MMAP` (default), `LOAD_MODE_MMAP_USE_MLOCK`, and `LOAD_MODE_MMAP_USE_MLOCK_IGNORE_ERRORS`. Available model types are `MODEL_TYPE_TEXT`, `MODEL_TYPE_TEXT_VISION`, and `MODEL_TYPE_MULTIMODAL`.
56+
57+
Construction itself is lightweight and does not load the program data immediately.
58+
59+
#### Loading
60+
61+
Explicitly load the model before generation to avoid paying the load cost during your first `generate` call.
62+
63+
```java
64+
int status = module.load();
65+
if (status != 0) {
66+
// Handle load failure (status is an ExecuTorch runtime error code).
67+
}
68+
```
69+
70+
If you skip this step, the model is loaded lazily on the first `generate` call.
71+
72+
#### Generating
73+
74+
Generate tokens from a text prompt by passing an `LlmCallback` that receives each token as it is produced. The same callback also receives a JSON-encoded statistics string when generation completes.
75+
76+
```java
77+
LlmCallback callback = new LlmCallback() {
78+
@Override
79+
public void onResult(String token) {
80+
// Called once per generated token. Append to your UI buffer here.
81+
System.out.print(token);
82+
}
83+
84+
@Override
85+
public void onStats(String statsJson) {
86+
// Called once when generation finishes. See extension/llm/runner/stats.h
87+
// for the field definitions.
88+
System.out.println("\n" + statsJson);
89+
}
90+
91+
@Override
92+
public void onError(int errorCode, String message) {
93+
// Called if the runtime reports an error during generation.
94+
}
95+
};
96+
97+
module.generate("Once upon a time", callback);
98+
```
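
The callback above prints each token directly. In an app, the tokens usually need to be accumulated so that a UI can read the text generated so far. A minimal sketch of such a buffer follows. `TokenCollector` is a hypothetical helper, not part of the `executorch-android` AAR, and it assumes that tokens arrive on the thread running `generate()` while the text may be read from a different thread, which is why the methods are synchronized.

```java
/** Hypothetical helper; not part of the executorch-android AAR. */
final class TokenCollector {
    private final StringBuilder text = new StringBuilder();
    private int tokenCount;

    /** Call this from LlmCallback.onResult(String) for every generated token. */
    synchronized void onResult(String token) {
        text.append(token);
        tokenCount++;
    }

    /** Returns a snapshot of everything generated so far. */
    synchronized String text() {
        return text.toString();
    }

    synchronized int tokenCount() {
        return tokenCount;
    }

    /** Clears the buffer, for example before starting a new prompt. */
    synchronized void clear() {
        text.setLength(0);
        tokenCount = 0;
    }

    public static void main(String[] args) {
        TokenCollector collector = new TokenCollector();
        for (String token : new String[] {"Once", " upon", " a", " time"}) {
            collector.onResult(token);
        }
        System.out.println(collector.text() + " (" + collector.tokenCount() + " tokens)");
    }
}
```

Inside a real `LlmCallback`, `onResult` would simply forward to `collector.onResult(token)`.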
99+
100+
For full control over generation parameters, use `LlmGenerationConfig`:
101+
102+
```java
103+
LlmGenerationConfig genConfig = LlmGenerationConfig.create()
104+
.seqLen(2048)
105+
.temperature(0.8f)
106+
.echo(false)
107+
.build();
108+
109+
module.generate("Once upon a time", genConfig, callback);
110+
```
111+
112+
`LlmGenerationConfig` exposes `echo`, `maxNewTokens`, `seqLen`, `temperature`, `numBos`, `numEos`, and `warming`. Defaults match the C++ `GenerationConfig` documented in [Running LLMs with C++](run-with-c-plus-plus.md).
113+
114+
#### Stopping Generation
115+
116+
If you need to interrupt a long-running generation, call `stop()` from another thread (or from inside the `onResult` callback):
117+
118+
```java
119+
module.stop();
120+
```
121+
122+
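Note that `generate()` runs synchronously and blocks the calling thread until generation finishes, so invoke it off the main thread (for example, on a `HandlerThread` or via a `java.util.concurrent.Executor`). This is also why `stop()` has to be called from a different thread or from inside the callback.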
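
One way to arrange this is sketched below with a single-threaded `ExecutorService`. The `BackgroundGeneration` class and its `BlockingGenerator` interface are hypothetical stand-ins introduced so the example runs without the AAR. In an app, the two interface methods would delegate to `LlmModule.generate(...)` and `LlmModule.stop()`.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

/** Hypothetical wrapper; not part of the executorch-android AAR. */
final class BackgroundGeneration {
    /** Stand-in for the two LlmModule calls this sketch needs. */
    interface BlockingGenerator {
        void generate(String prompt, Consumer<String> onToken);
        void stop();
    }

    private final ExecutorService executor = Executors.newSingleThreadExecutor();
    private final BlockingGenerator generator;

    BackgroundGeneration(BlockingGenerator generator) {
        this.generator = generator;
    }

    /** Runs the blocking generate call on the worker thread and returns immediately. */
    Future<?> start(String prompt, Consumer<String> onToken) {
        return executor.submit(() -> generator.generate(prompt, onToken));
    }

    /** Intended to be called from another thread, for example a "Stop" button handler. */
    void cancel() {
        generator.stop();
    }

    void shutdown() {
        executor.shutdown();
    }

    public static void main(String[] args) throws Exception {
        // Fake generator that emits a fixed token sequence.
        BlockingGenerator fake = new BlockingGenerator() {
            @Override
            public void generate(String prompt, Consumer<String> onToken) {
                for (String token : new String[] {"Once", " upon", " a", " time"}) {
                    onToken.accept(token);
                }
            }

            @Override
            public void stop() {}
        };
        BackgroundGeneration background = new BackgroundGeneration(fake);
        background.start("ignored", System.out::print).get(); // block only for the demo
        background.shutdown();
    }
}
```

On Android, the `onToken` consumer would typically post each token back to the main thread before touching any views.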
123+
124+
#### Resetting
125+
126+
To clear the prefilled tokens from the KV cache and reset the start position to 0, call:
127+
128+
```java
129+
module.resetContext();
130+
```
131+
132+
This is the equivalent of `reset()` on the iOS runner and `reset()` on the C++ `IRunner`.
133+
134+
### Multimodal Inputs
135+
136+
For models declared as `MODEL_TYPE_TEXT_VISION` or `MODEL_TYPE_MULTIMODAL`, image and audio data are provided through dedicated prefill methods. After prefilling all modalities, call `generate()` with the text prompt to produce the response.
137+
138+
#### Images
139+
140+
Raw uint8 pixel data in CHW order can be supplied as an `int[]`, or as a direct `ByteBuffer` to avoid JNI array copies:
141+
142+
```java
143+
// As int[]
144+
int[] pixels = ...; // length == channels * height * width
145+
module.prefillImages(pixels, /*width=*/336, /*height=*/336, /*channels=*/3);
146+
147+
// As direct ByteBuffer (preferred for large images)
148+
ByteBuffer buffer = ByteBuffer.allocateDirect(3 * 336 * 336);
149+
buffer.put(rawBytes).rewind();
150+
module.prefillImages(buffer, 336, 336, 3);
151+
```
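
Android image APIs such as `Bitmap.getPixels` return packed ARGB values in row-major order rather than planar CHW data, so a conversion step is usually needed before prefill. The sketch below shows one way to do it. `ArgbToChw` is a hypothetical helper, and the R, G, B plane order is an assumption that should be checked against the preprocessing your model was exported with.

```java
/** Hypothetical helper; not part of the executorch-android AAR. */
final class ArgbToChw {
    private ArgbToChw() {}

    /**
     * Converts packed ARGB pixels (row-major, one int per pixel) into a planar
     * CHW array: all red values, then all green, then all blue. Alpha is dropped.
     */
    static int[] toChw(int[] argb, int width, int height) {
        int planeSize = width * height;
        if (argb.length != planeSize) {
            throw new IllegalArgumentException(
                    "expected " + planeSize + " pixels, got " + argb.length);
        }
        int[] chw = new int[3 * planeSize];
        for (int i = 0; i < planeSize; i++) {
            int pixel = argb[i];
            chw[i] = (pixel >> 16) & 0xFF;             // R plane
            chw[planeSize + i] = (pixel >> 8) & 0xFF;  // G plane
            chw[2 * planeSize + i] = pixel & 0xFF;     // B plane
        }
        return chw;
    }

    public static void main(String[] args) {
        // A 2x1 image: one pure-red pixel followed by one pure-blue pixel.
        int[] argb = {0xFFFF0000, 0xFF0000FF};
        // prints [255, 0, 0, 0, 0, 255]
        System.out.println(java.util.Arrays.toString(toChw(argb, 2, 1)));
    }
}
```

The returned array has length `3 * width * height` and can be passed to the `int[]` variant of `prefillImages` with `channels` set to 3.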
152+
153+
Pre-normalized float pixel data is also supported, both as a `float[]` and as a direct `ByteBuffer` in native byte order:
154+
155+
```java
156+
float[] normalized = ...; // length == channels * height * width
157+
module.prefillImages(normalized, 336, 336, 3);
158+
159+
ByteBuffer floatBuffer = ByteBuffer
160+
.allocateDirect(3 * 336 * 336 * Float.BYTES)
161+
.order(ByteOrder.nativeOrder());
162+
// fill floatBuffer with normalized values, then:
163+
module.prefillNormalizedImage(floatBuffer, 336, 336, 3);
164+
```
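
The comment "fill floatBuffer with normalized values" can be made concrete as follows. This is a sketch under stated assumptions: `NormalizedImageBuffer` is a hypothetical helper, pixels are scaled to the range 0 to 1 and then standardized per channel, and the `mean` and `std` values are placeholders that depend on how your model was trained.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical helper; not part of the executorch-android AAR. */
final class NormalizedImageBuffer {
    private NormalizedImageBuffer() {}

    /**
     * Converts uint8 CHW pixel values into a direct, native-order ByteBuffer of
     * floats computed as (value / 255 - mean[c]) / std[c] for each channel c.
     */
    static ByteBuffer normalize(
            int[] chw, int width, int height, int channels, float[] mean, float[] std) {
        int planeSize = width * height;
        if (chw.length != channels * planeSize
                || mean.length != channels
                || std.length != channels) {
            throw new IllegalArgumentException("array lengths do not match the image shape");
        }
        ByteBuffer buffer = ByteBuffer
                .allocateDirect(chw.length * Float.BYTES)
                .order(ByteOrder.nativeOrder());
        for (int c = 0; c < channels; c++) {
            for (int i = 0; i < planeSize; i++) {
                float scaled = chw[c * planeSize + i] / 255.0f;
                buffer.putFloat((scaled - mean[c]) / std[c]);
            }
        }
        buffer.rewind(); // reset the position so the consumer reads from the start
        return buffer;
    }

    public static void main(String[] args) {
        // One channel, 2x1 image, with placeholder mean and std of 0.5.
        ByteBuffer buffer =
                normalize(new int[] {0, 255}, 2, 1, 1, new float[] {0.5f}, new float[] {0.5f});
        // prints -1.0 1.0
        System.out.println(buffer.getFloat(0) + " " + buffer.getFloat(Float.BYTES));
    }
}
```

The resulting buffer has the same layout as `floatBuffer` in the snippet above, so it can be handed to the `ByteBuffer` variant in the same way.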
165+
166+
#### Audio
167+
168+
Preprocessed audio features (for example mel spectrograms produced by a Whisper preprocessor) can be supplied as `byte[]` or `float[]`:
169+
170+
```java
171+
module.prefillAudio(features, /*batchSize=*/1, /*nBins=*/128, /*nFrames=*/3000);
172+
```
173+
174+
Raw audio samples can be supplied with `prefillRawAudio`:
175+
176+
```java
177+
module.prefillRawAudio(samples, /*batchSize=*/1, /*nChannels=*/1, /*nSamples=*/16000);
178+
```
179+
180+
#### Generating with Multimodal Prefill
181+
182+
After prefilling each modality, run `generate()` with the text prompt as usual:
183+
184+
```java
185+
module.prefillImages(pixels, 336, 336, 3);
186+
module.generate("What's in this image?", callback);
187+
```
188+
189+
For text-vision models, a convenience overload accepts the image and prompt together:
190+
191+
```java
192+
module.generate(
193+
pixels, /*width=*/336, /*height=*/336, /*channels=*/3,
194+
"What's in this image?",
195+
/*seqLen=*/768,
196+
callback,
197+
/*echo=*/false);
198+
```
199+
200+
## Demo
201+
202+
See the [Llama Android demo app](https://github.com/meta-pytorch/executorch-examples/tree/main/llm/android/LlamaDemo) in `executorch-examples` for an end-to-end project that wires `LlmModule`, `LlmCallback`, and a `HandlerThread` into a chat UI.

docs/source/llm/working-with-llms.md

Lines changed: 1 addition & 0 deletions
@@ -15,5 +15,6 @@ export-llm-optimum
1515
export-custom-llm
1616
run-with-c-plus-plus
1717
build-run-llama3-qualcomm-ai-engine-direct-backend
18+
run-on-android
1819
run-on-ios
1920
```

docs/source/using-executorch-export.md

Lines changed: 4 additions & 0 deletions
@@ -45,6 +45,10 @@ Commonly used hardware backends are listed below. For mobile, consider using XNN
4545

4646
The export process takes in a standard PyTorch model, typically a `torch.nn.Module`. This can be a custom model definition, or a model from an existing source, such as TorchVision or HuggingFace. See [Getting Started with ExecuTorch](getting-started.md) for an example of lowering a TorchVision model.
4747

48+
:::{tip}
49+
Exporting a model from the [Hugging Face Hub](https://huggingface.co/models)? Use the [Optimum ExecuTorch](llm/export-llm-optimum.md) integration. It wraps the export and lowering steps below in a single CLI invocation and supports a wide range of decoder, encoder, multimodal, and seq2seq architectures out of the box.
50+
:::
51+
4852
Model export is done from Python. This is commonly done through a Python script or from an interactive Python notebook, such as Jupyter or Colab. The example below shows instantiation and inputs for a simple PyTorch model. The inputs are prepared as a tuple of torch.Tensors, and the model can run with these inputs.
4953

5054
```python
