| 1 | +# SmolLM2 -> Ethos-U Quickstart |
| 2 | + |
| 3 | +> **Heads-up:** This Ethos-U post-training quantization flow is still |
| 4 | +> experimental. The current recommended path is `w8a16` with |
| 5 | +> `quantization.quantize_scope=linear`, which places the linear layers on |
| 6 | +> Ethos-U while the remaining FP32 operators still run on the Corstone-320 FVP |
| 7 | +> host CPU. That hybrid setup is deliberate: it is the simplest path in this |
| 8 | +> example that still produces meaningful text. |
| 9 | +> |
| 10 | +> This example exports the base `HuggingFaceTB/SmolLM2-135M` checkpoint via |
| 11 | +> `base.model_class=smollm2`, so fetch the matching tokenizer from the same |
| 12 | +> model family. Do not mix this flow with the `SmolLM2-135M-Instruct` |
| 13 | +> tokenizer/checkpoint pair unless you intentionally change the exported model. |
| 14 | + |
| 15 | +This document focuses on one validated flow: |
| 16 | + |
| 17 | +1. Export one generation-ready full-logits `w8a16` PTE with a fixed sequence window of 32. |
| 18 | +2. Build one runner that embeds that PTE and uses semihosting for host-side |
| 19 | + input/output tensor exchange. |
| 20 | +3. Run a short prompt-generation smoke test on Corstone-320 FVP. |
| 21 | +4. Optionally evaluate Wikitext perplexity with the same full-logits artifact. |
| 22 | + |
| 23 | +In this example, semihosting is mainly a convenient FVP integration path for |
| 24 | +passing meaningful input tensors into the runner and reading output tensors back |
| 25 | +out. The Python host script does the tokenization and prompt preprocessing, then |
| 26 | +uses semihosting to provide the resulting input tensor to the model and collect |
| 27 | +the output logits. Embedding the PTE is a separate convenience that avoids |
| 28 | +copying the model file at runtime. On real silicon, the same preprocessing would |
| 29 | +more likely populate the model input buffer directly from software rather than |
| 30 | +via semihosting. |
| 31 | + |
| 32 | +The example uses a fixed sequence length of 32 because that is the current |
| 33 | +validated tradeoff for this branch on Corstone-320 FVP. Larger windows took |
| 34 | +longer to run and stalled in our experiments, while smaller windows were |
| 35 | +easier to validate early on but produced weaker prompts and less representative |
| 36 | +perplexity results. This branch also does not use KV-cache decoding, so every |
| 37 | +generated token recomputes attention across the whole window and larger sequence |
| 38 | +lengths become even more costly. If KV-cache support is added later, it should |
| 39 | +reduce the incremental decode cost, but it is not the direct reason seq32 was |
| 40 | +chosen here. |
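
To make the cost argument concrete, here is a rough back-of-the-envelope sketch. It counts only query-key score pairs per attention head and per generated token, and it ignores the linear layers entirely. The function names are hypothetical and are not part of this example's scripts; the KV-cache case is purely illustrative because this branch does not implement it.

```python
def pairs_per_token_no_kv(window: int) -> int:
    """Score pairs when each generated token re-runs the full fixed window."""
    # Every one of the `window` query rows is scored against `window` keys.
    return window * window


def pairs_per_token_with_kv(window: int) -> int:
    """Upper bound on score pairs if only the newest token attended to cached keys."""
    # Hypothetical: one query row scored against at most `window` cached keys.
    return window


for window in (8, 32, 64):
    print(window, pairs_per_token_no_kv(window), pairs_per_token_with_kv(window))
```

Under this simplified model, doubling the window from 32 to 64 quadruples the per-token attention work without a KV cache, which is consistent with larger windows becoming costly quickly.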
| 41 | + |
| 42 | +## 0. Prerequisites |
| 43 | + |
| 44 | +Run all commands from the repository root. |
| 45 | + |
| 46 | +Use an activated Python environment before running the setup commands below, |
| 47 | +because `examples/arm/setup.sh` installs Python packages into the active |
| 48 | +environment. A conda environment or Python `venv` both work; see |
| 49 | +[`docs/source/using-executorch-building-from-source.md`](../../../docs/source/using-executorch-building-from-source.md) |
| 50 | +for the general ExecuTorch environment setup. |
| 51 | + |
| 52 | +```bash |
| 53 | +cd /path/to/executorch |
| 54 | +source /path/to/venv/bin/activate |
| 55 | +``` |
| 56 | + |
| 57 | +Install the Arm Ethos-U dependencies and generate `setup_path.sh`: |
| 58 | + |
| 59 | +```bash |
| 60 | +examples/arm/setup.sh \ |
| 61 | + --i-agree-to-the-contained-eula \ |
| 62 | + --enable-ethos-u-deps |
| 63 | +``` |
| 64 | + |
| 65 | +Source the generated Arm setup: |
| 66 | + |
| 67 | +```bash |
| 68 | +source examples/arm/arm-scratch/setup_path.sh |
| 69 | +``` |
| 70 | + |
| 71 | +Install the helper Python packages used by this example: |
| 72 | + |
| 73 | +```bash |
| 74 | +pip install -U "huggingface_hub[cli]" datasets |
| 75 | +pip install -e ./extension/llm/tokenizers/ |
| 76 | +``` |
| 77 | + |
| 78 | +Build the ExecuTorch Arm libraries once so the runner wrappers can find the |
| 79 | +`executorch` package in `arm_test`: |
| 80 | + |
| 81 | +```bash |
| 82 | +bash backends/arm/scripts/build_executorch.sh |
| 83 | +``` |
| 84 | + |
| 85 | +If you want the broader Arm backend setup flow, see `examples/arm/README.md`. |
| 86 | + |
| 87 | +## 1. Tokenizer |
| 88 | + |
| 89 | +Download the tokenizer that matches the exported base SmolLM2 checkpoint: |
| 90 | + |
| 91 | +```bash |
| 92 | +mkdir -p data/tokenizers/smollm2 |
| 93 | +hf download HuggingFaceTB/SmolLM2-135M tokenizer.json \ |
| 94 | + --local-dir data/tokenizers/smollm2 |
| 95 | +``` |
| 96 | + |
| 97 | +## 2. Recommended configuration |
| 98 | + |
| 99 | +These are the settings used by the main flow in this README: |
| 100 | + |
| 101 | +- `quantization.pt2e_quantize=ethosu_16a8w` |
| 102 | +- `quantization.quantize_scope=linear` |
| 103 | +- `export.max_seq_length=32` |
| 104 | +- `export.max_context_length=32` |
| 105 | +- `quantization.calibration_seq_length=32` |
| 106 | +- `quantization.calibration_limit=62` |
| 107 | +- `backend.ethosu.target=ethos-u85-256` |
| 108 | +- `backend.ethosu.system_config=Ethos_U85_SYS_DRAM_High` |
| 109 | +- `backend.ethosu.memory_mode=Dedicated_Sram_512KB` |
| 110 | + |
| 111 | +Why these settings matter: |
| 112 | + |
| 113 | +- `linear` scope means only the linear layers are quantized and placed on |
| 114 | + Ethos-U. This is the current validated path for meaningful output in this example. |
| 115 | +- `max_seq_length=32` and `calibration_seq_length=32` are kept equal so the |
| 116 | + quantizer observes the same token-window shape that the runtime will execute. |
| 117 | + Keeping them aligned avoids calibrating a shape that the deployed runner never |
| 118 | + uses. |
| 119 | +- `calibration_limit=62` is the current fuller-calibration setting for this |
| 120 | + README. With the newer full-logits calibration path, larger limits are now |
| 121 | + practical enough to use by default. For quicker iteration, `calibration_limit=2` |
| 122 | + is the fast validation setting discussed later in this document. |
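
The alignment rule above can be checked mechanically before starting a long export. The following is a minimal sketch, assuming the settings are held in a plain dictionary keyed by the option names listed above; `misaligned_window_keys` is a hypothetical helper, not something the export scripts provide.

```python
# Settings copied from the recommended configuration in this README.
RECOMMENDED = {
    "export.max_seq_length": 32,
    "export.max_context_length": 32,
    "quantization.calibration_seq_length": 32,
    "quantization.calibration_limit": 62,
}


def misaligned_window_keys(config: dict) -> list:
    """Return window-related keys whose value differs from export.max_seq_length."""
    expected = config["export.max_seq_length"]
    window_keys = (
        "export.max_context_length",
        "quantization.calibration_seq_length",
    )
    return [key for key in window_keys if config[key] != expected]


print(misaligned_window_keys(RECOMMENDED))
```

An empty list means calibration and runtime use the same token-window shape.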
| 123 | + |
| 124 | +## 3. Export the generation artifact |
| 125 | + |
| 126 | +This command produces the full-logits PTE used for the generation smoke test and optional perplexity evaluation. Static non-KV calibration uses padded prefixes, so calibrated exports must produce full logits to let calibration select the last real token position instead of a padded position. |
| 127 | + |
| 128 | +```bash |
| 129 | +bash examples/arm/smollm2_example_ethos_u/export_smollm2_ethosu.sh \ |
| 130 | + --mode=w8a16 \ |
| 131 | + --max_seq_length=32 \ |
| 132 | + --max_context_length=32 \ |
| 133 | + --calibration_limit=62 \ |
| 134 | + --calibration_seq_length=32 \ |
| 135 | + --quantize_scope=linear |
| 136 | +``` |
| 137 | + |
| 138 | +What this command does: |
| 139 | + |
| 140 | +- `--mode=w8a16` selects the 16-bit activation, 8-bit weight Ethos-U quantizer. |
| 141 | +- By default the helper writes the exported `.pte` into the repository root, so |
| 142 | + the runner build commands below can reference the artifact by filename. |
| 143 | +- `--max_seq_length=32` fixes the deployed token window to 32 tokens. |
| 144 | +- `--max_context_length=32` keeps prompt context management consistent with that |
| 145 | + same fixed window. |
| 146 | +- `--calibration_limit=62` uses the fuller calibration setting now recommended |
| 147 | + for this example. |
| 148 | +- `--calibration_seq_length=32` calibrates on the same token length that the |
| 149 | + runtime will execute. |
| 150 | +- `--quantize_scope=linear` keeps the validated hybrid setup where linear layers |
| 151 | + run on Ethos-U and the rest of the graph remains FP32. |
| 152 | + |
| 153 | +The output artifact is named: |
| 154 | + |
| 155 | +```text |
| 156 | +smollm2_ethosu_seq32_w8a16_wikitext_full_logits.pte |
| 157 | +``` |
| 158 | + |
| 159 | +## 4. Build the semihosting runner |
| 160 | + |
| 161 | +Build one runner that embeds the generation artifact: |
| 162 | + |
| 163 | +```bash |
| 164 | +bash examples/arm/smollm2_example_ethos_u/build_executor_runner_semihosting.sh \ |
| 165 | + --pte=smollm2_ethosu_seq32_w8a16_wikitext_full_logits.pte \ |
| 166 | + --output=smollm2_ethosu_seq32_w8a16_wikitext_full_logits/cmake-out \ |
| 167 | + --method_pool_size=0x01000000 \ |
| 168 | + --scratch_pool_size=0x00400000 \ |
| 169 | + --input_file_pool_size=0x00100000 |
| 170 | +``` |
| 171 | + |
| 172 | +What this command does: |
| 173 | + |
| 174 | +- Builds a semihosting `arm_executor_runner` ELF so the host can pass |
| 175 | + preprocessed input tensors in and read output tensors back out easily on FVP. |
| 176 | + In this flow the PTE is embedded in that runner as a separate convenience. |
| 177 | +- Uses the validated `Ethos_U85_SYS_DRAM_High` and `Dedicated_Sram_512KB` |
| 178 | + defaults from the build helper, so you do not need to pass them explicitly in |
| 179 | + the common case. |
| 180 | +- Sets three allocator pool sizes that keep the embedded-PTE full-logits runner inside a |
| 181 | + practical Corstone-320 DDR budget. |
| 182 | + |
| 183 | +How to read the pool sizes: |
| 184 | + |
| 185 | +- `method_pool_size` stores long-lived runtime objects such as the loaded |
| 186 | + method and model state. |
| 187 | +- `scratch_pool_size` is temporary workspace used during execution. |
| 188 | +- `input_file_pool_size` is the buffer used to load semihosted input files such |
| 189 | + as `i0.bin`. |
| 190 | + |
| 191 | +These values are not universal tuning rules. They are simply the validated pool |
| 192 | +sizes for this example's seq32 embedded-PTE runner. Start with them unless you |
| 193 | +are actively changing the export shape or runtime integration. |
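
The hexadecimal pool sizes are easier to compare against a memory budget once converted to MiB. This small sketch only does that arithmetic for the three values used in the build command above; it does not query the runner or the FVP.

```python
MIB = 1024 * 1024

# Pool sizes passed to build_executor_runner_semihosting.sh in this README.
POOLS = {
    "method_pool_size": 0x01000000,
    "scratch_pool_size": 0x00400000,
    "input_file_pool_size": 0x00100000,
}


def pool_sizes_mib(pools: dict) -> dict:
    """Convert byte counts to MiB."""
    return {name: size / MIB for name, size in pools.items()}


sizes = pool_sizes_mib(POOLS)
for name, mib in sizes.items():
    print(f"{name}: {mib:g} MiB")
print(f"total: {sum(sizes.values()):g} MiB")
```

The three pools come to 16 MiB, 4 MiB, and 1 MiB, or 21 MiB in total, not counting the embedded PTE itself.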
| 194 | + |
| 195 | +## 5. Run a generation smoke test |
| 196 | + |
| 197 | +Use `generate_sampled.py` to tokenize the prompt on the host, write the input |
| 198 | +tensor file expected by the semihosting runner, launch FVP, read back the |
| 199 | +output logits, and decode the generated token IDs into text: |
| 200 | + |
| 201 | +```bash |
| 202 | +python examples/arm/smollm2_example_ethos_u/generate_sampled.py \ |
| 203 | + --fvp examples/arm/arm-scratch/FVP-corstone320/models/Linux64_GCC-9.3/FVP_Corstone_SSE-320 \ |
| 204 | + --runner smollm2_ethosu_seq32_w8a16_wikitext_full_logits/cmake-out/arm_executor_runner \ |
| 205 | + --embedded-pte \ |
| 206 | + --tokenizer data/tokenizers/smollm2/tokenizer.json \ |
| 207 | + --prompt "Once upon a time in a small village," \ |
| 208 | + --window 32 \ |
| 209 | + --max-new-tokens 2 \ |
| 210 | + --full-logits \ |
| 211 | + --temperature 0 \ |
| 212 | + --top-p 0.9 \ |
| 213 | + --repetition-penalty 1.1 |
| 214 | +``` |
| 215 | + |
| 216 | +How to interpret the main options: |
| 217 | + |
| 218 | +- `--embedded-pte` tells the script not to copy a separate `program.pte`, |
| 219 | + because the runner already contains the model. |
| 220 | +- `--window 32` must match the exported `max_seq_length`. If these differ, the |
| 221 | + runner will reject the input tensor shape. |
| 222 | +- `--max-new-tokens 2` keeps the smoke test short. The goal here is to show the |
| 223 | + end-to-end path works, not to benchmark long decoding. |
| 224 | +- `--full-logits` tells `generate_sampled.py` to select the last valid prompt |
| 225 | + row from the `[window, vocab]` output. This matches the calibrated static |
| 226 | + non-KV export path and avoids sampling from padded positions. |
| 227 | +- `--temperature 0` switches to greedy decoding, which is the most stable way |
| 228 | + to compare short smoke runs. |
| 229 | +- `--top-p 0.9` is kept for consistency with the broader sampling interface, |
| 230 | + but it has no effect here because `--temperature 0` selects greedy decoding. |
| 231 | +- `--repetition-penalty 1.1` still matters in greedy mode because it modifies |
| 232 | + the logits before `argmax`. |
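
The interaction between `--full-logits`, `--temperature 0`, and `--repetition-penalty` can be illustrated with a small pure-Python sketch. It is not the code in `generate_sampled.py`. It assumes the real prompt tokens occupy the first rows of the `[window, vocab]` output, and it uses the common penalty rule that divides positive logits and multiplies negative logits of already-seen tokens. All function names are hypothetical.

```python
def select_last_valid_row(logits_rows, prompt_len):
    """Pick the row for the last real token, skipping padded positions."""
    return logits_rows[prompt_len - 1]


def apply_repetition_penalty(row, seen_token_ids, penalty):
    """Assumed rule: make already-seen tokens less likely before argmax."""
    adjusted = list(row)
    for token_id in set(seen_token_ids):
        if adjusted[token_id] > 0:
            adjusted[token_id] /= penalty
        else:
            adjusted[token_id] *= penalty
    return adjusted


def greedy_next_token(logits_rows, token_ids, penalty=1.1):
    """Greedy decoding: argmax over the penalized last valid row."""
    row = select_last_valid_row(logits_rows, len(token_ids))
    adjusted = apply_repetition_penalty(row, token_ids, penalty)
    return max(range(len(adjusted)), key=adjusted.__getitem__)


# Toy output with window=4 and vocab=3; the prompt has two real tokens.
rows = [
    [0.1, 0.2, 0.3],
    [1.0, 2.05, 2.0],   # last valid row (index prompt_len - 1)
    [9.0, 0.0, 0.0],    # padded position, must be ignored
    [9.0, 0.0, 0.0],    # padded position, must be ignored
]
prompt = [0, 1]
print(greedy_next_token(rows, prompt, penalty=1.0))
print(greedy_next_token(rows, prompt, penalty=1.1))
```

With no penalty the sketch picks token 1. With a penalty of 1.1 the already-seen token 1 drops below token 2, so the greedy choice changes even though no sampling is involved.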
| 233 | + |
| 234 | +## 6. Optional: evaluate Wikitext perplexity |
| 235 | + |
| 236 | +The calibrated generation artifact already returns full logits for every token position in the 32-token window, so the same PTE and runner can be used for perplexity scoring. |
| 237 | + |
| 238 | +### 6.1 Build the matching runner |
| 239 | + |
| 240 | +```bash |
| 241 | +bash examples/arm/smollm2_example_ethos_u/build_executor_runner_semihosting.sh \ |
| 242 | + --pte=smollm2_ethosu_seq32_w8a16_wikitext_full_logits.pte \ |
| 243 | + --output=smollm2_ethosu_seq32_w8a16_wikitext_full_logits/cmake-out \ |
| 244 | + --method_pool_size=0x01000000 \ |
| 245 | + --scratch_pool_size=0x00400000 \ |
| 246 | + --input_file_pool_size=0x00100000 |
| 247 | +``` |
| 248 | + |
| 249 | +This is the same build command as in section 4, so you can skip it if that runner already exists. The full-logits artifact uses `--method_pool_size=0x01000000` (`16 MiB`). |
| 250 | + |
| 251 | +### 6.2 Run perplexity |
| 252 | + |
| 253 | +```bash |
| 254 | +python examples/arm/smollm2_example_ethos_u/eval_wikitext_perplexity.py \ |
| 255 | + --fvp examples/arm/arm-scratch/FVP-corstone320/models/Linux64_GCC-9.3/FVP_Corstone_SSE-320 \ |
| 256 | + --runner-w8a8 smollm2_ethosu_seq32_w8a16_wikitext_full_logits/cmake-out/arm_executor_runner \ |
| 257 | + --runner-w8a16 smollm2_ethosu_seq32_w8a16_wikitext_full_logits/cmake-out/arm_executor_runner \ |
| 258 | + --prompts-file outputs/$(date +%F)/wikitext_prompts_seq32.txt \ |
| 259 | + --num-prompts 100 \ |
| 260 | + --ppl-prompts 100 \ |
| 261 | + --min-prompt-tokens 32 \ |
| 262 | + --max-prompt-tokens 32 \ |
| 263 | + --max-tokens-per-prompt 32 \ |
| 264 | + --window 32 \ |
| 265 | + --timeout 36000 \ |
| 266 | + --refresh-prompts |
| 267 | +``` |
| 268 | + |
| 269 | +Why the prompt settings are all 32 here: |
| 270 | + |
| 271 | +- `--window 32` must match the export shape. |
| 272 | +- `--min-prompt-tokens 32` and `--max-prompt-tokens 32` force every prompt to |
| 273 | + fill exactly one scoring window, which makes the comparison easier to reason |
| 274 | + about. |
| 275 | +- `--max-tokens-per-prompt 32` keeps scoring aligned with that same fixed |
| 276 | + window. |
| 277 | +- `--num-prompts 100` builds a reusable prompt file with enough samples for a |
| 278 | + stable comparison. |
| 279 | +- `--ppl-prompts 100` then scores all prompts from that file. Lower this value |
| 280 | + when you want a quicker but noisier local check. |
| 281 | + |
| 282 | +The evaluator script compares two runners, which is why it asks for both |
| 283 | +`--runner-w8a8` and `--runner-w8a16`. In this simplified `w8a16`-only flow, it |
| 284 | +is acceptable to pass the same runner to both options when you only want one |
| 285 | +number from the validated artifact. |
| 286 | + |
| 287 | +## 7. Additional notes |
| 288 | + |
| 289 | +### Why padding is needed for full-logits evaluation |
| 290 | + |
| 291 | +The full-logits export returns one logits row per position in the fixed window. |
| 292 | +Short prompts therefore need padding so the runtime still receives a tensor with |
| 293 | +exactly 32 token slots. For perplexity, the evaluator right-pads the prompt so |
| 294 | +the real tokens stay at the front of the causal window and each target token is |
| 295 | +scored against the matching row. This preserves the usual left-to-right causal |
| 296 | +ordering even though the deployed runtime works with fixed-size inputs. |
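
The row-to-target alignment described above can be sketched as follows. This is an illustrative calculation, not the code in `eval_wikitext_perplexity.py`. It assumes the prompt is right-padded, that row `t` of the output predicts the token at position `t + 1`, and that padded rows are simply never scored. The function names are hypothetical.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over one logits row."""
    peak = max(row)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_sum for x in row]


def window_perplexity(logits_rows, token_ids):
    """Perplexity over the real tokens of one right-padded window.

    token_ids holds only the real tokens (no padding); rows past
    len(token_ids) - 1 belong to padding and are ignored.
    """
    if len(token_ids) < 2:
        raise ValueError("need at least two real tokens to score")
    total_nll = 0.0
    for t in range(len(token_ids) - 1):
        # Row t is scored against the next real token.
        total_nll -= log_softmax(logits_rows[t])[token_ids[t + 1]]
    return math.exp(total_nll / (len(token_ids) - 1))


# Uniform logits over a 4-token vocabulary give a perplexity of 4.
uniform_rows = [[0.0] * 4 for _ in range(8)]
print(window_perplexity(uniform_rows, [1, 2, 3]))
```

The uniform case is a quick sanity check: a model that spreads probability evenly over `V` tokens scores a perplexity of `V`.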
| 297 | + |
| 298 | +### What `full` quantization scope means |
| 299 | + |
| 300 | +`quantization.quantize_scope=full` asks the export stack to quantize more than |
| 301 | +just the linear layers. That path exists for experimentation, but it is not the |
| 302 | +validated path in this README because the linear-only setup is the one that |
| 303 | +currently produces the clearest end-to-end result on Ethos-U FVP. |
| 304 | + |
| 305 | +### Can calibration be faster? |
| 306 | + |
| 307 | +Yes. The quickest way to iterate is to lower `--calibration_limit`. The tradeoff |
| 308 | +is that you are collecting activation statistics from fewer samples, which can |
| 309 | +hurt perplexity and generation quality. Keep `--calibration_seq_length` aligned |
| 310 | +with `--max_seq_length`; if they differ, the calibration run is no longer |
| 311 | +measuring the same tensor shapes that the deployed model will execute. In the |
| 312 | +older non-KV path, calibration was especially slow because it often replayed |
| 313 | +many partial prefixes position by position. The newer full-logits path can |
| 314 | +observe a whole 32-token window in one pass, so larger limits are now much more |
| 315 | +practical. |
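
A rough way to see why the full-logits path scales better is to count forward passes. The sketch below assumes, as a simplification, that the older path ran one forward pass per prefix position of every calibration sample, while the full-logits path runs one pass per sample. The real scripts may differ, and the function names are hypothetical.

```python
def passes_prefix_replay(calibration_limit: int, seq_length: int) -> int:
    """Assumed older behaviour: one forward pass per prefix position."""
    return calibration_limit * seq_length


def passes_full_window(calibration_limit: int) -> int:
    """Assumed full-logits behaviour: one forward pass per sample."""
    return calibration_limit


for limit in (2, 62):
    print(limit, passes_prefix_replay(limit, 32), passes_full_window(limit))
```

Under these assumptions, `--calibration_limit=62` at a sequence length of 32 drops from 1984 forward passes to 62.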
| 316 | + |
| 317 | +In the saved seq32 runs in this branch, `--calibration_limit=62` is now |
| 318 | +practical as the fuller-calibration setting, while `--calibration_limit=2` |
| 319 | +remains the fast validation option. On the 100-prompt perplexity check, `2` |
| 320 | +scored best, but `62` was still competitive and is the more conservative |
| 321 | +default when export turnaround is less important than fuller calibration. |
| 322 | + |
| 323 | +### Historical seq8 artifacts |
| 324 | + |
| 325 | +Earlier experiments in this directory used smaller seq8 exports and separate |
| 326 | +included-PTE runners. They are useful as implementation history, but they are |
| 327 | +not the main path for this README because they add options without improving the |
| 328 | +clarity of the validated seq32 `w8a16` workflow. |
| 329 | + |
| 330 | +### Clean-checkout checklist |
| 331 | + |
| 332 | +If the example fails on a clean checkout, the most common missing pieces are: |
| 333 | + |
| 334 | +- `huggingface_hub[cli]` for the `hf download` command. |
| 335 | +- `datasets` for rebuilding Wikitext prompts in the perplexity script. |
| 336 | +- `pytorch_tokenizers`, installed from `./extension/llm/tokenizers/`. |
| 337 | +- `backends/arm/scripts/build_executorch.sh`, which populates the default |
| 338 | + `arm_test` build root used by the runner wrappers. |