docs: add Android LLM runner page and HuggingFace (#19611)

omkar-334 · web-flow · commit 985eeb74f0af · 2026-05-18T10:01:48.000-07:00
## Summary

1. New `docs/source/llm/run-on-android.md`, a Java reference for the `executorch-android` AAR runner. Same shape as `run-on-ios.md`. Covers `LlmModule`, the `LlmModuleConfig` builder, `LlmGenerationConfig`, the `LlmCallback` methods, `load`/`stop`/`resetContext`, and the image/audio prefill variants. Points at LlamaDemo.
2. Added `run-on-android` to the LLM toctree in `working-with-llms.md`, sitting between the Qualcomm page and iOS.
3. In `getting-started.md`, swapped the two GitHub example links for the in-docs Android and iOS pages so users stay in the docs.
4. Added a tip admonition to `using-executorch-export.md` under Model Preparation, sending HF Hub users to `export-llm-optimum.md` before the manual flow.
5. Cleaned up `export-llm-optimum.md`. Removed the leftover "Method 1" framing since only the CLI path is documented, bumped the orphaned subheadings up a level, and pointed the Running on Device links at the new Android page and the existing iOS page (sample apps kept inline).

Fixes #8790

cc @mergennachin @AlannaBurke @larryliu0820 @cccclai @helunwencser @jackzhxng @byjlw
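
The new Android page states that `generate()` blocks the calling thread and that `stop()` may be called from another thread or from inside `onResult`. The sketch below illustrates that threading pattern in plain Java. It is not the ExecuTorch API: `FakeRunner`, `TokenSink`, and `OffMainThreadSketch` are hypothetical stand-ins invented for this example so that it runs without the AAR.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for a per-token callback; NOT the real LlmCallback.
interface TokenSink {
  void onToken(String token);
}

// Hypothetical stand-in for a blocking runner; NOT the real LlmModule.
class FakeRunner {
  private final AtomicBoolean stopRequested = new AtomicBoolean(false);

  // Blocks the calling thread until every token is emitted or stop() is called.
  void generate(String[] tokens, TokenSink sink) {
    stopRequested.set(false);
    for (String token : tokens) {
      if (stopRequested.get()) {
        return;
      }
      sink.onToken(token);
    }
  }

  // Safe to call from any thread, including from inside the callback.
  void stop() {
    stopRequested.set(true);
  }
}

class OffMainThreadSketch {
  // Runs the blocking generate() on a worker thread and stops after maxTokens tokens.
  static String run(String[] tokens, int maxTokens) {
    FakeRunner runner = new FakeRunner();
    StringBuilder out = new StringBuilder();
    int[] count = {0};
    ExecutorService worker = Executors.newSingleThreadExecutor();
    try {
      Future<?> done = worker.submit(() -> runner.generate(tokens, token -> {
        out.append(token);
        if (++count[0] >= maxTokens) {
          runner.stop(); // interrupt generation from inside the callback
        }
      }));
      // Future.get() makes the worker's writes to `out` visible to this thread.
      done.get(5, TimeUnit.SECONDS);
    } catch (Exception e) {
      throw new IllegalStateException("generation failed", e);
    } finally {
      worker.shutdown();
    }
    return out.toString();
  }

  public static void main(String[] args) {
    System.out.println(run(new String[] {"Once", " upon", " a", " time"}, 2));
  }
}
```

In an Android app the same shape would presumably apply with the real classes from the page: submit the `generate()` call to an executor or `HandlerThread`, and post each token back to the UI thread instead of appending to a local buffer.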
diff --git a/docs/source/llm/export-llm-optimum.md b/docs/source/llm/export-llm-optimum.md
@@ -45,15 +45,11 @@ Optimum ExecuTorch supports a wide range of model architectures including decode
 
 For the complete list of supported models, see the [Optimum ExecuTorch documentation](https://github.com/huggingface/optimum-executorch#-supported-models).
 
-## Export Methods
+## CLI Export
 
-Optimum ExecuTorch offers two ways to export models:
+The `optimum-cli` command is the recommended way to export Hugging Face models. It provides a single invocation that downloads the model from the Hub, applies the configured optimizations, and writes the resulting `.pte` file.
 
-### Method 1: CLI Export
-
-The CLI is the simplest way to export models. It provides a single command to convert models from Hugging Face Hub to ExecuTorch format.
-
-#### Basic Export
+### Basic Export
 
 ```bash
 optimum-cli export executorch \
@@ -63,7 +59,7 @@ optimum-cli export executorch \
     --output_dir="./smollm2_exported"
 ```
 
-#### With Optimizations
+### With Optimizations
 
 Add custom SDPA, KV cache optimization, and quantization:
 
@@ -79,7 +75,7 @@ optimum-cli export executorch \
     --output_dir="./smollm2_exported"
 ```
 
-#### Available CLI Arguments
+### Available CLI Arguments
 
 Key arguments for LLM export include `--model`, `--task`, `--recipe` (backend), `--use_custom_sdpa`, `--use_custom_kv_cache`, `--qlinear` (linear quantization), `--qembedding` (embedding quantization), and `--max_seq_len`.
 
@@ -156,8 +152,8 @@ print(generated_text)
 After verifying your model works correctly, deploy it to device:
 
 - [Running with C++](run-with-c-plus-plus.md) - Run exported models using ExecuTorch's C++ runtime
-- [Running on Android](https://github.com/meta-pytorch/executorch-examples/tree/main/llm/android) - Deploy to Android devices
-- [Running on iOS](https://github.com/meta-pytorch/executorch-examples/tree/main/llm/apple) - Deploy to iOS devices
+- [Running on Android](run-on-android.md) - Java APIs for the `executorch-android` AAR (sample app: [LlamaDemo](https://github.com/meta-pytorch/executorch-examples/tree/main/llm/android))
+- [Running on iOS](run-on-ios.md) - Objective-C and Swift APIs for the `ExecuTorchLLM` framework (sample app: [etLLM](https://github.com/meta-pytorch/executorch-examples/tree/main/llm/apple))
 
 ## Performance
 
diff --git a/docs/source/llm/getting-started.md b/docs/source/llm/getting-started.md
@@ -25,6 +25,6 @@ Deploying LLMs to ExecuTorch can be boiled down to a two-step process: (1) expor
 
 ### Running
 - [Running with C++](run-with-c-plus-plus.md)
-- [Running on Android (XNNPack)](https://github.com/meta-pytorch/executorch-examples/tree/main/llm/android)
+- [Running on Android](run-on-android.md)
 - [Running on Android (Qualcomm)](build-run-llama3-qualcomm-ai-engine-direct-backend.md)
-- [Running on iOS](https://github.com/meta-pytorch/executorch-examples/tree/main/llm/apple)
+- [Running on iOS](run-on-ios.md)
diff --git a/docs/source/llm/run-on-android.md b/docs/source/llm/run-on-android.md
@@ -0,0 +1,202 @@
+# Running LLMs on Android
+
+ExecuTorch's LLM-specific runtime components provide an experimental Java interface around the core C++ LLM runtime, available through the `executorch-android` AAR.
+
+## Prerequisites
+
+Make sure you have the model and tokenizer files ready, as described in the prerequisites section of the [Running LLMs with C++](run-with-c-plus-plus.md) guide.
+
+To add the `executorch-android` library to your app, see [Using ExecuTorch on Android](../using-executorch-android.md). The LLM runner classes are bundled inside the same AAR as the generic `Module` API.
+
+## Runtime API
+
+Once the `executorch-android` AAR is on your classpath, you can import the LLM runner classes from the `org.pytorch.executorch.extension.llm` package.
+
+### Importing
+
+```java
+import org.pytorch.executorch.extension.llm.LlmModule;
+import org.pytorch.executorch.extension.llm.LlmModuleConfig;
+import org.pytorch.executorch.extension.llm.LlmGenerationConfig;
+import org.pytorch.executorch.extension.llm.LlmCallback;
+```
+
+### LlmModule
+
+The `LlmModule` class provides a simple Java interface for loading a text-generation model, configuring its tokenizer, generating token streams, and stopping execution. It also supports multimodal models that accept image and audio inputs alongside a text prompt.
+
+This API is experimental and subject to change.
+
+#### Initialization
+
+Create an `LlmModule` by specifying paths to your serialized model (`.pte`) and tokenizer files. For text-only models, the simple constructor is enough:
+
+```java
+LlmModule module = new LlmModule(
+    "/data/local/tmp/llama-3.2-instruct.pte",
+    "/data/local/tmp/tokenizer.model",
+    0.8f);
+```
+
+For finer control (multimodal model type, BOS/EOS handling, supplementary data files, load mode), use `LlmModuleConfig` with the fluent builder:
+
+```java
+LlmModuleConfig config = LlmModuleConfig.create()
+    .modulePath("/data/local/tmp/llama-3.2-instruct.pte")
+    .tokenizerPath("/data/local/tmp/tokenizer.model")
+    .temperature(0.8f)
+    .modelType(LlmModuleConfig.MODEL_TYPE_TEXT)
+    .loadMode(LlmModuleConfig.LOAD_MODE_MMAP)
+    .build();
+
+LlmModule module = new LlmModule(config);
+```
+
+Available load modes are `LOAD_MODE_FILE`, `LOAD_MODE_MMAP` (default), `LOAD_MODE_MMAP_USE_MLOCK`, and `LOAD_MODE_MMAP_USE_MLOCK_IGNORE_ERRORS`. Available model types are `MODEL_TYPE_TEXT`, `MODEL_TYPE_TEXT_VISION`, and `MODEL_TYPE_MULTIMODAL`.
+
+Construction itself is lightweight and does not load the program data immediately.
+
+#### Loading
+
+Explicitly load the model before generation to avoid paying the load cost during your first `generate` call.
+
+```java
+int status = module.load();
+if (status != 0) {
+  // Handle load failure (status is an ExecuTorch runtime error code).
+}
+```
+
+If you skip this step, the model is loaded lazily on the first `generate` call.
+
+#### Generating
+
+Generate tokens from a text prompt by passing an `LlmCallback` that receives each token as it is produced. The same callback also receives a JSON-encoded statistics string when generation completes.
+
+```java
+LlmCallback callback = new LlmCallback() {
+  @Override
+  public void onResult(String token) {
+    // Called once per generated token. Append to your UI buffer here.
+    System.out.print(token);
+  }
+
+  @Override
+  public void onStats(String statsJson) {
+    // Called once when generation finishes. See extension/llm/runner/stats.h
+    // for the field definitions.
+    System.out.println("\n" + statsJson);
+  }
+
+  @Override
+  public void onError(int errorCode, String message) {
+    // Called if the runtime reports an error during generation.
+  }
+};
+
+module.generate("Once upon a time", callback);
+```
+
+For full control over generation parameters, use `LlmGenerationConfig`:
+
+```java
+LlmGenerationConfig genConfig = LlmGenerationConfig.create()
+    .seqLen(2048)
+    .temperature(0.8f)
+    .echo(false)
+    .build();
+
+module.generate("Once upon a time", genConfig, callback);
+```
+
+`LlmGenerationConfig` exposes `echo`, `maxNewTokens`, `seqLen`, `temperature`, `numBos`, `numEos`, and `warming`. Defaults match the C++ `GenerationConfig` documented in [Running LLMs with C++](run-with-c-plus-plus.md).
+
+#### Stopping Generation
+
+If you need to interrupt a long-running generation, call `stop()` from another thread (or from inside the `onResult` callback):
+
+```java
+module.stop();
+```
+
+`generate()` runs synchronously on the calling thread, so invoke it off the main thread (for example, on a `HandlerThread` or via a `java.util.concurrent.Executor`).
+
+#### Resetting
+
+To clear the prefilled tokens from the KV cache and reset the start position to 0, call:
+
+```java
+module.resetContext();
+```
+
+This is the equivalent of `reset()` on the iOS runner and `reset()` on the C++ `IRunner`.
+
+### Multimodal Inputs
+
+For models declared as `MODEL_TYPE_TEXT_VISION` or `MODEL_TYPE_MULTIMODAL`, image and audio data are provided through dedicated prefill methods. After prefilling all modalities, call `generate()` with the text prompt to produce the response.
+
+#### Images
+
+Raw uint8 pixel data in CHW order can be supplied as an `int[]`, or as a direct `ByteBuffer` to avoid JNI array copies:
+
+```java
+// As int[]
+int[] pixels = ...;       // length == channels * height * width
+module.prefillImages(pixels, /*width=*/336, /*height=*/336, /*channels=*/3);
+
+// As direct ByteBuffer (preferred for large images)
+ByteBuffer buffer = ByteBuffer.allocateDirect(3 * 336 * 336);
+buffer.put(rawBytes).rewind();
+module.prefillImages(buffer, 336, 336, 3);
+```
+
+Pre-normalized float pixel data is also supported, both as a `float[]` and as a direct `ByteBuffer` in native byte order:
+
+```java
+float[] normalized = ...;  // length == channels * height * width
+module.prefillImages(normalized, 336, 336, 3);
+
+ByteBuffer floatBuffer = ByteBuffer
+    .allocateDirect(3 * 336 * 336 * Float.BYTES)
+    .order(ByteOrder.nativeOrder());
+// fill floatBuffer with normalized values, then:
+module.prefillNormalizedImage(floatBuffer, 336, 336, 3);
+```
+
+#### Audio
+
+Preprocessed audio features (for example mel spectrograms produced by a Whisper preprocessor) can be supplied as `byte[]` or `float[]`:
+
+```java
+module.prefillAudio(features, /*batchSize=*/1, /*nBins=*/128, /*nFrames=*/3000);
+```
+
+Raw audio samples can be supplied with `prefillRawAudio`:
+
+```java
+module.prefillRawAudio(samples, /*batchSize=*/1, /*nChannels=*/1, /*nSamples=*/16000);
+```
+
+#### Generating with Multimodal Prefill
+
+After prefilling each modality, run `generate()` with the text prompt as usual:
+
+```java
+module.prefillImages(pixels, 336, 336, 3);
+module.generate("What's in this image?", callback);
+```
+
+For text-vision models, a convenience overload accepts the image and prompt together:
+
+```java
+module.generate(
+    pixels, /*width=*/336, /*height=*/336, /*channels=*/3,
+    "What's in this image?",
+    /*seqLen=*/768,
+    callback,
+    /*echo=*/false);
+```
+
+## Demo
+
+See the [Llama Android demo app](https://github.com/meta-pytorch/executorch-examples/tree/main/llm/android/LlamaDemo) in `executorch-examples` for an end-to-end project that wires `LlmModule`, `LlmCallback`, and a `HandlerThread` into a chat UI.
diff --git a/docs/source/llm/working-with-llms.md b/docs/source/llm/working-with-llms.md
@@ -15,5 +15,6 @@ export-llm-optimum
 export-custom-llm
 run-with-c-plus-plus
 build-run-llama3-qualcomm-ai-engine-direct-backend
+run-on-android
 run-on-ios
 ```
diff --git a/docs/source/using-executorch-export.md b/docs/source/using-executorch-export.md
@@ -45,6 +45,10 @@ Commonly used hardware backends are listed below. For mobile, consider using XNN
 
 The export process takes in a standard PyTorch model, typically a `torch.nn.Module`. This can be an custom model definition, or a model from an existing source, such as TorchVision or HuggingFace. See [Getting Started with ExecuTorch](getting-started.md) for an example of lowering a TorchVision model.
 
+:::{tip}
+Exporting a model from the [Hugging Face Hub](https://huggingface.co/models)? Use the [Optimum ExecuTorch](llm/export-llm-optimum.md) integration. It wraps the export and lowering steps below in a single CLI invocation and supports a wide range of decoder, encoder, multimodal, and seq2seq architectures out of the box.
+:::
+
 Model export is done from Python. This is commonly done through a Python script or from an interactive Python notebook, such as Jupyter or Colab. The example below shows instantiation and inputs for a simple PyTorch model. The inputs are prepared as a tuple of torch.Tensors, and the model can run with these inputs.
 
 ```python