|
| 1 | +# Running LLMs on Android |
| 2 | + |
| 3 | +ExecuTorch's LLM-specific runtime components provide an experimental Java interface around the core C++ LLM runtime, available through the `executorch-android` AAR. |
| 4 | + |
| 5 | +## Prerequisites |
| 6 | + |
| 7 | +Make sure you have a model and tokenizer files ready, as described in the prerequisites section of the [Running LLMs with C++](run-with-c-plus-plus.md) guide. |
| 8 | + |
| 9 | +To add the `executorch-android` library to your app, see [Using ExecuTorch on Android](../using-executorch-android.md). The LLM runner classes are bundled inside the same AAR as the generic `Module` API. |
| 10 | + |
| 11 | +## Runtime API |
| 12 | + |
| 13 | +Once the `executorch-android` AAR is on your classpath, you can import the LLM runner classes from the `org.pytorch.executorch.extension.llm` package. |
| 14 | + |
| 15 | +### Importing |
| 16 | + |
| 17 | +```java |
| 18 | +import org.pytorch.executorch.extension.llm.LlmModule; |
| 19 | +import org.pytorch.executorch.extension.llm.LlmModuleConfig; |
| 20 | +import org.pytorch.executorch.extension.llm.LlmGenerationConfig; |
| 21 | +import org.pytorch.executorch.extension.llm.LlmCallback; |
| 22 | +``` |
| 23 | + |
| 24 | +### LlmModule |
| 25 | + |
| 26 | +The `LlmModule` class provides a simple Java interface for loading a text-generation model, configuring its tokenizer, generating token streams, and stopping execution. It also supports multimodal models that accept image and audio inputs alongside a text prompt. |
| 27 | + |
| 28 | +This API is experimental and subject to change. |
| 29 | + |
| 30 | +#### Initialization |
| 31 | + |
| 32 | +Create an `LlmModule` by specifying paths to your serialized model (`.pte`) and tokenizer files. For text-only models, the simple constructor is enough: |
| 33 | + |
| 34 | +```java |
| 35 | +LlmModule module = new LlmModule( |
| 36 | + "/data/local/tmp/llama-3.2-instruct.pte", |
| 37 | + "/data/local/tmp/tokenizer.model", |
| 38 | + 0.8f); |
| 39 | +``` |
| 40 | + |
| 41 | +For finer control (multimodal model type, BOS/EOS handling, supplementary data files, load mode), use `LlmModuleConfig` with the fluent builder: |
| 42 | + |
| 43 | +```java |
| 44 | +LlmModuleConfig config = LlmModuleConfig.create() |
| 45 | + .modulePath("/data/local/tmp/llama-3.2-instruct.pte") |
| 46 | + .tokenizerPath("/data/local/tmp/tokenizer.model") |
| 47 | + .temperature(0.8f) |
| 48 | + .modelType(LlmModuleConfig.MODEL_TYPE_TEXT) |
| 49 | + .loadMode(LlmModuleConfig.LOAD_MODE_MMAP) |
| 50 | + .build(); |
| 51 | + |
| 52 | +LlmModule module = new LlmModule(config); |
| 53 | +``` |
| 54 | + |
| 55 | +Available load modes are `LOAD_MODE_FILE`, `LOAD_MODE_MMAP` (default), `LOAD_MODE_MMAP_USE_MLOCK`, and `LOAD_MODE_MMAP_USE_MLOCK_IGNORE_ERRORS`. Available model types are `MODEL_TYPE_TEXT`, `MODEL_TYPE_TEXT_VISION`, and `MODEL_TYPE_MULTIMODAL`. |
| 56 | + |
| 57 | +Construction itself is lightweight and does not load the program data immediately. |
| 58 | + |
| 59 | +#### Loading |
| 60 | + |
| 61 | +Explicitly load the model before generation to avoid paying the load cost during your first `generate` call. |
| 62 | + |
| 63 | +```java |
| 64 | +int status = module.load(); |
| 65 | +if (status != 0) { |
| 66 | + // Handle load failure (status is an ExecuTorch runtime error code). |
| 67 | +} |
| 68 | +``` |
| 69 | + |
| 70 | +If you skip this step, the model is loaded lazily on the first `generate` call. |
| 71 | + |
| 72 | +#### Generating |
| 73 | + |
| 74 | +Generate tokens from a text prompt by passing an `LlmCallback` that receives each token as it is produced. The same callback also receives a JSON-encoded statistics string when generation completes. |
| 75 | + |
| 76 | +```java |
| 77 | +LlmCallback callback = new LlmCallback() { |
| 78 | + @Override |
| 79 | + public void onResult(String token) { |
| 80 | + // Called once per generated token. Append to your UI buffer here. |
| 81 | + System.out.print(token); |
| 82 | + } |
| 83 | + |
| 84 | + @Override |
| 85 | + public void onStats(String statsJson) { |
| 86 | + // Called once when generation finishes. See extension/llm/runner/stats.h |
| 87 | + // for the field definitions. |
| 88 | + System.out.println("\n" + statsJson); |
| 89 | + } |
| 90 | + |
| 91 | + @Override |
| 92 | + public void onError(int errorCode, String message) { |
| 93 | + // Called if the runtime reports an error during generation. |
| 94 | + } |
| 95 | +}; |
| 96 | + |
| 97 | +module.generate("Once upon a time", callback); |
| 98 | +``` |
| 99 | + |
| 100 | +For full control over generation parameters, use `LlmGenerationConfig`: |
| 101 | + |
| 102 | +```java |
| 103 | +LlmGenerationConfig genConfig = LlmGenerationConfig.create() |
| 104 | + .seqLen(2048) |
| 105 | + .temperature(0.8f) |
| 106 | + .echo(false) |
| 107 | + .build(); |
| 108 | + |
| 109 | +module.generate("Once upon a time", genConfig, callback); |
| 110 | +``` |
| 111 | + |
| 112 | +`LlmGenerationConfig` exposes `echo`, `maxNewTokens`, `seqLen`, `temperature`, `numBos`, `numEos`, and `warming`. Defaults match the C++ `GenerationConfig` documented in [Running LLMs with C++](run-with-c-plus-plus.md). |
| 113 | + |
| 114 | +#### Stopping Generation |
| 115 | + |
| 116 | +If you need to interrupt a long-running generation, call `stop()` from another thread (or from inside the `onResult` callback): |
| 117 | + |
| 118 | +```java |
| 119 | +module.stop(); |
| 120 | +``` |
| 121 | + |
| 122 | +`generate()` runs synchronously and blocks the calling thread until generation finishes, so invoke it off the main thread (for example, on a `HandlerThread` or via a `java.util.concurrent.Executor`). |
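Because `generate()` blocks its caller, one way to keep it off the main thread is a single-threaded executor. The following is a minimal sketch, not the demo app's implementation: `BlockingGenerator` is a hypothetical stand-in interface for the blocking `generate()`/`stop()` pair, used so the example compiles without the AAR. In an app you would adapt it to wrap an `LlmModule` and its `LlmCallback`.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

final class BackgroundGeneration {
    /** Hypothetical stand-in for a runner whose generate() blocks until done. */
    interface BlockingGenerator {
        void generate(String prompt, Consumer<String> onToken);
        void stop();
    }

    // A single worker thread serializes generations, so two prompts never overlap.
    private final ExecutorService worker = Executors.newSingleThreadExecutor();
    private final BlockingGenerator generator;

    BackgroundGeneration(BlockingGenerator generator) {
        this.generator = generator;
    }

    /** Queues a generation on the worker thread and returns immediately. */
    Future<?> submit(String prompt, Consumer<String> onToken) {
        return worker.submit(() -> generator.generate(prompt, onToken));
    }

    /** Asks the running generation to stop; safe to call from any thread. */
    void cancel() {
        generator.stop();
    }

    /** Releases the worker thread once no more generations are needed. */
    void shutdown() {
        worker.shutdown();
    }
}
```

Note that `onToken` is invoked on the worker thread, so UI updates still need to be posted back to the main thread.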
| 123 | + |
| 124 | +#### Resetting |
| 125 | + |
| 126 | +To clear the prefilled tokens from the KV cache and reset the start position to 0, call: |
| 127 | + |
| 128 | +```java |
| 129 | +module.resetContext(); |
| 130 | +``` |
| 131 | + |
| 132 | +This is equivalent to `reset()` on both the iOS runner and the C++ `IRunner`. |
| 133 | + |
| 134 | +### Multimodal Inputs |
| 135 | + |
| 136 | +For models declared as `MODEL_TYPE_TEXT_VISION` or `MODEL_TYPE_MULTIMODAL`, image and audio data are provided through dedicated prefill methods. After prefilling all modalities, call `generate()` with the text prompt to produce the response. |
| 137 | + |
| 138 | +#### Images |
| 139 | + |
| 140 | +Raw uint8 pixel data in CHW order can be supplied as an `int[]`, or as a direct `ByteBuffer` to avoid JNI array copies: |
| 141 | + |
| 142 | +```java |
| 143 | +// As int[] |
| 144 | +int[] pixels = ...; // length == channels * height * width |
| 145 | +module.prefillImages(pixels, /*width=*/336, /*height=*/336, /*channels=*/3); |
| 146 | + |
| 147 | +// As direct ByteBuffer (preferred for large images); rawBytes is a byte[] of CHW pixel data |
| 148 | +ByteBuffer buffer = ByteBuffer.allocateDirect(3 * 336 * 336); |
| 149 | +buffer.put(rawBytes).rewind(); |
| 150 | +module.prefillImages(buffer, 336, 336, 3); |
| 151 | +``` |
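Android image sources usually produce interleaved pixels rather than CHW planes. As a minimal sketch, the helper below converts packed ARGB pixels (the layout `Bitmap.getPixels` fills in) into a planar CHW `int[]` with one 0-255 value per element. The class and method names are hypothetical, and the R, G, B plane order is an assumption; check the channel order your model was exported with.

```java
final class ChwPixels {
    /**
     * Converts row-major packed ARGB pixels (one int per pixel) into planar
     * CHW data: all red values, then all green, then all blue. Alpha is dropped.
     */
    static int[] argbToChw(int[] argb, int width, int height) {
        int plane = width * height;
        if (argb.length != plane) {
            throw new IllegalArgumentException(
                "expected " + plane + " pixels, got " + argb.length);
        }
        int[] chw = new int[3 * plane];
        for (int i = 0; i < plane; i++) {
            int p = argb[i];
            chw[i] = (p >> 16) & 0xFF;            // red plane
            chw[plane + i] = (p >> 8) & 0xFF;     // green plane
            chw[2 * plane + i] = p & 0xFF;        // blue plane
        }
        return chw;
    }
}
```

The result can then be passed as `module.prefillImages(chw, width, height, 3)`.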
| 152 | + |
| 153 | +Pre-normalized float pixel data is also supported, either as a `float[]` passed to `prefillImages` or as a direct `ByteBuffer` in native byte order passed to `prefillNormalizedImage`: |
| 154 | + |
| 155 | +```java |
| 156 | +float[] normalized = ...; // length == channels * height * width |
| 157 | +module.prefillImages(normalized, 336, 336, 3); |
| 158 | + |
| 159 | +ByteBuffer floatBuffer = ByteBuffer |
| 160 | + .allocateDirect(3 * 336 * 336 * Float.BYTES) |
| 161 | + .order(ByteOrder.nativeOrder()); |
| 162 | +// fill floatBuffer with normalized values, then: |
| 163 | +module.prefillNormalizedImage(floatBuffer, 336, 336, 3); |
| 164 | +``` |
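The "fill floatBuffer" step above can be done through a `FloatBuffer` view. The sketch below uses only `java.nio`; the class and method names are hypothetical. Writing through the view leaves the byte buffer's own position at 0, so no `rewind()` is needed afterwards.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class FloatImageBuffers {
    /** Copies normalized CHW floats into a direct, native-order ByteBuffer. */
    static ByteBuffer toDirectBuffer(float[] normalized) {
        ByteBuffer buffer = ByteBuffer
            .allocateDirect(normalized.length * Float.BYTES)
            .order(ByteOrder.nativeOrder());
        // The view inherits the byte order set above and has an independent
        // position, so the returned buffer still starts at position 0.
        buffer.asFloatBuffer().put(normalized);
        return buffer;
    }
}
```

The returned buffer can be passed as `module.prefillNormalizedImage(buffer, 336, 336, 3)` when `normalized` holds `3 * 336 * 336` values.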
| 165 | + |
| 166 | +#### Audio |
| 167 | + |
| 168 | +Preprocessed audio features (for example mel spectrograms produced by a Whisper preprocessor) can be supplied as `byte[]` or `float[]`: |
| 169 | + |
| 170 | +```java |
| 171 | +module.prefillAudio(features, /*batchSize=*/1, /*nBins=*/128, /*nFrames=*/3000); |
| 172 | +``` |
| 173 | + |
| 174 | +Raw audio samples can be supplied with `prefillRawAudio`: |
| 175 | + |
| 176 | +```java |
| 177 | +module.prefillRawAudio(samples, /*batchSize=*/1, /*nChannels=*/1, /*nSamples=*/16000); |
| 178 | +``` |
| 179 | + |
| 180 | +#### Generating with Multimodal Prefill |
| 181 | + |
| 182 | +After prefilling each modality, run `generate()` with the text prompt as usual: |
| 183 | + |
| 184 | +```java |
| 185 | +module.prefillImages(pixels, 336, 336, 3); |
| 186 | +module.generate("What's in this image?", callback); |
| 187 | +``` |
| 188 | + |
| 189 | +For text-vision models, a convenience overload accepts the image and prompt together: |
| 190 | + |
| 191 | +```java |
| 192 | +module.generate( |
| 193 | + pixels, /*width=*/336, /*height=*/336, /*channels=*/3, |
| 194 | + "What's in this image?", |
| 195 | + /*seqLen=*/768, |
| 196 | + callback, |
| 197 | + /*echo=*/false); |
| 198 | +``` |
| 199 | + |
| 200 | +## Demo |
| 201 | + |
| 202 | +See the [Llama Android demo app](https://github.com/meta-pytorch/executorch-examples/tree/main/llm/android/LlamaDemo) in `executorch-examples` for an end-to-end project that wires `LlmModule`, `LlmCallback`, and a `HandlerThread` into a chat UI. |