Commit 07b8c20

Arm backend: add SmolLM2 Ethos-U export, generation and eval flow (pytorch#20063)
- semihosting and FVP runner build helpers
- sampled text generation from prompt files
- Wikitext full-logits perplexity evaluation on FVP
- example prompts and documentation for reproducing results

cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils @Sebastian-Larsson @robell @rascani

Signed-off-by: Xingguo Li <xingguo.li@arm.com>
1 parent 0881b22 commit 07b8c20

8 files changed

Lines changed: 1617 additions & 0 deletions

Lines changed: 338 additions & 0 deletions
@@ -0,0 +1,338 @@
# SmolLM2 -> Ethos-U Quickstart

> **Heads-up:** This Ethos-U post-training quantization flow is still
> experimental. The current recommended path is `w8a16` with
> `quantization.quantize_scope=linear`, which places the linear layers on
> Ethos-U while the remaining FP32 operators still run on the Corstone-320 FVP
> host CPU. That hybrid setup is deliberate: it is the simplest path in this
> example that still produces meaningful text.
>
> This example exports the base `HuggingFaceTB/SmolLM2-135M` checkpoint via
> `base.model_class=smollm2`, so fetch the matching tokenizer from the same
> model family. Do not mix this flow with the `SmolLM2-135M-Instruct`
> tokenizer/checkpoint pair unless you intentionally change the exported model.

This document focuses on one validated flow:

1. Export one generation-ready full-logits `w8a16` PTE with a fixed sequence window of 32.
2. Build one runner that embeds that PTE and uses semihosting for host-side
   input/output tensor exchange.
3. Run a short prompt-generation smoke test on Corstone-320 FVP.
4. Optionally evaluate Wikitext perplexity with the same full-logits artifact.

In this example, semihosting is mainly a convenient FVP integration path for
passing meaningful input tensors into the runner and reading output tensors back
out. The Python host script does the tokenization and prompt preprocessing, then
uses semihosting to provide the resulting input tensor to the model and collect
the output logits. Embedding the PTE is a separate convenience that avoids
copying the model file at runtime. On real silicon, the same preprocessing would
more likely populate the model input buffer directly from software rather than
via semihosting.

The example uses a fixed sequence length of 32 because that is the current
validated tradeoff for this branch on Corstone-320 FVP. Larger windows cost more
runtime and stalled in our experiments, while smaller windows were easier to
validate earlier but produced weaker prompts and less representative perplexity
results. This branch also does not use KV-cache decoding, so every generated
token recomputes attention across the whole window, and larger sequence lengths
become even more costly. If KV-cache support is added later, it should reduce
the incremental decode cost, but it is not the direct reason seq32 was chosen
here.
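
To make the cost of non-KV decoding concrete, here is a minimal sketch of a fixed-window generation loop. It is not the code in `generate_sampled.py`: `run_model`, `PAD_ID`, and the right-padding choice are assumptions made for illustration, and `toy_model` is a stand-in for the FVP runner. The point is only that every new token triggers a full forward pass over all 32 positions.

```python
from typing import Callable, List, Sequence, Tuple

WINDOW = 32  # matches export.max_seq_length in this README
PAD_ID = 0   # hypothetical pad token id, used only for this sketch

Logits = List[List[float]]  # shape [window][vocab]


def build_window(tokens: Sequence[int], window: int = WINDOW) -> Tuple[List[int], int]:
    """Keep the most recent `window` tokens and right-pad to a fixed length."""
    recent = list(tokens[-window:])
    n_valid = len(recent)
    return recent + [PAD_ID] * (window - n_valid), n_valid


def generate(prompt: Sequence[int], run_model: Callable[[List[int]], Logits],
             max_new_tokens: int, window: int = WINDOW) -> Tuple[List[int], int]:
    """Greedy non-KV decoding: one full-window forward pass per new token."""
    if not prompt:
        raise ValueError("prompt must contain at least one token")
    tokens = list(prompt)
    positions_computed = 0
    for _ in range(max_new_tokens):
        padded, n_valid = build_window(tokens, window)
        logits = run_model(padded)      # recomputes every position, no cache
        positions_computed += window
        last_row = logits[n_valid - 1]  # last real token, not a padded slot
        tokens.append(max(range(len(last_row)), key=last_row.__getitem__))
    return tokens, positions_computed


def toy_model(padded: List[int]) -> Logits:
    """Stand-in model: each position predicts (token + 1) mod 8."""
    rows = []
    for tok in padded:
        row = [0.0] * 8
        row[(tok + 1) % 8] = 1.0
        rows.append(row)
    return rows


tokens, positions = generate([1, 2, 3], toy_model, max_new_tokens=2)
print(tokens, positions)
```

In this sketch the work grows as `max_new_tokens * window` positions, which is why a larger window makes each generated token more expensive when no KV cache is available.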

## 0. Prerequisites

Run all commands from the repository root.

Use an activated Python environment before running the setup commands below,
because `examples/arm/setup.sh` installs Python packages into the active
environment. A conda environment or Python `venv` both work; see
[`docs/source/using-executorch-building-from-source.md`](../../../docs/source/using-executorch-building-from-source.md)
for the general ExecuTorch environment setup.

```bash
cd /path/to/executorch
source /path/to/venv/bin/activate
```

Install the Arm Ethos-U dependencies and generate `setup_path.sh`:

```bash
examples/arm/setup.sh \
  --i-agree-to-the-contained-eula \
  --enable-ethos-u-deps
```

Source the generated Arm setup:

```bash
source examples/arm/arm-scratch/setup_path.sh
```

Install the helper Python packages used by this example:

```bash
pip install -U "huggingface_hub[cli]" datasets
pip install -e ./extension/llm/tokenizers/
```

Build the ExecuTorch Arm libraries once so the runner wrappers can find the
`executorch` package in `arm_test`:

```bash
bash backends/arm/scripts/build_executorch.sh
```

If you want the broader Arm backend setup flow, see `examples/arm/README.md`.
86+
87+
## 1. Tokenizer
88+
89+
Download the tokenizer that matches the exported base SmolLM2 checkpoint:
90+
91+
```bash
92+
mkdir -p data/tokenizers/smollm2
93+
hf download HuggingFaceTB/SmolLM2-135M tokenizer.json \
94+
--local-dir data/tokenizers/smollm2
95+
```

## 2. Recommended configuration

These are the settings used by the main flow in this README:

- `quantization.pt2e_quantize=ethosu_16a8w`
- `quantization.quantize_scope=linear`
- `export.max_seq_length=32`
- `export.max_context_length=32`
- `quantization.calibration_seq_length=32`
- `quantization.calibration_limit=62`
- `backend.ethosu.target=ethos-u85-256`
- `backend.ethosu.system_config=Ethos_U85_SYS_DRAM_High`
- `backend.ethosu.memory_mode=Dedicated_Sram_512KB`

Why these settings matter:

- `linear` scope means only the linear layers are quantized and placed on
  Ethos-U. This is the current validated path for meaningful output in this
  example.
- `max_seq_length=32` and `calibration_seq_length=32` are kept equal so the
  quantizer observes the same token-window shape that the runtime will execute.
  Keeping them aligned avoids calibrating a shape that the deployed runner never
  uses.
- `calibration_limit=62` is the current fuller-calibration setting for this
  README. With the newer full-logits calibration path, larger limits are now
  practical enough to use by default. For quicker iteration, `calibration_limit=2`
  is the fast validation setting discussed later in this document.
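
The alignment rule above can be checked mechanically before starting an export. The following is a small sketch, not part of the example's scripts: the `check_window_alignment` helper is hypothetical, and the dictionary simply mirrors the setting names listed in this section.

```python
from typing import Dict

# Sequence-related settings copied from the list above.
SETTINGS: Dict[str, int] = {
    "export.max_seq_length": 32,
    "export.max_context_length": 32,
    "quantization.calibration_seq_length": 32,
}


def check_window_alignment(settings: Dict[str, int], window: int) -> Dict[str, int]:
    """Return every setting whose value differs from the intended window."""
    return {name: value for name, value in settings.items() if value != window}


mismatched = check_window_alignment(SETTINGS, window=32)
print("aligned" if not mismatched else f"mismatched: {mismatched}")
```

An empty result means calibration and runtime see the same token-window shape; any entry in the result names a setting to fix before exporting.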

## 3. Export the generation artifact

This command produces the full-logits PTE used for the generation smoke test and
optional perplexity evaluation. Static non-KV calibration uses padded prefixes,
so calibrated exports must produce full logits to let calibration select the
last real token position instead of a padded position.

```bash
bash examples/arm/smollm2_example_ethos_u/export_smollm2_ethosu.sh \
  --mode=w8a16 \
  --max_seq_length=32 \
  --max_context_length=32 \
  --calibration_limit=62 \
  --calibration_seq_length=32 \
  --quantize_scope=linear
```

What this command does:

- `--mode=w8a16` selects the 16-bit activation, 8-bit weight Ethos-U quantizer.
- By default the helper writes the exported `.pte` into the repository root, so
  the runner build commands below can reference the artifact by filename.
- `--max_seq_length=32` fixes the deployed token window to 32 tokens.
- `--max_context_length=32` keeps prompt context management consistent with that
  same fixed window.
- `--calibration_limit=62` uses the fuller calibration setting now recommended
  for this example.
- `--calibration_seq_length=32` calibrates on the same token length that the
  runtime will execute.
- `--quantize_scope=linear` keeps the validated hybrid setup where linear layers
  run on Ethos-U and the rest of the graph remains FP32.

The output artifact is named:

```text
smollm2_ethosu_seq32_w8a16_wikitext_full_logits.pte
```

## 4. Build the semihosting runner

Build one runner that embeds the generation artifact:

```bash
bash examples/arm/smollm2_example_ethos_u/build_executor_runner_semihosting.sh \
  --pte=smollm2_ethosu_seq32_w8a16_wikitext_full_logits.pte \
  --output=smollm2_ethosu_seq32_w8a16_wikitext_full_logits/cmake-out \
  --method_pool_size=0x01000000 \
  --scratch_pool_size=0x00400000 \
  --input_file_pool_size=0x00100000
```

What this command does:

- Builds a semihosting `arm_executor_runner` ELF so the host can pass
  preprocessed input tensors in and read output tensors back out easily on FVP.
  In this flow the PTE is embedded in that runner as a separate convenience.
- Uses the validated `Ethos_U85_SYS_DRAM_High` and `Dedicated_Sram_512KB`
  defaults from the build helper, so you do not need to pass them explicitly in
  the common case.
- Sets three allocator pool sizes that keep the embedded-PTE full-logits runner
  inside a practical Corstone-320 DDR budget.

How to read the pool sizes:

- `method_pool_size` stores long-lived runtime objects such as the loaded
  method and model state.
- `scratch_pool_size` is temporary workspace used during execution.
- `input_file_pool_size` is the buffer used to load semihosted input files such
  as `i0.bin`.

These values are not universal tuning rules. They are simply the validated pool
sizes for this example's seq32 embedded-PTE runner. Start with them unless you
are actively changing the export shape or runtime integration.
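
For readers who find the hexadecimal pool sizes hard to compare at a glance, this short sketch converts the three values from the build command into MiB. It covers only these three pools; the embedded PTE and the rest of the runner image also consume memory, so the total here is not the full DDR footprint.

```python
# Pool sizes exactly as passed to build_executor_runner_semihosting.sh above.
POOLS = {
    "method_pool_size": 0x01000000,
    "scratch_pool_size": 0x00400000,
    "input_file_pool_size": 0x00100000,
}

MIB = 1024 * 1024

sizes_mib = {name: size / MIB for name, size in POOLS.items()}
total_mib = sum(sizes_mib.values())

for name, mib in sizes_mib.items():
    print(f"{name}: {mib:g} MiB")
print(f"total of the three pools: {total_mib:g} MiB")
```

The three pools come to 16 MiB, 4 MiB, and 1 MiB respectively, 21 MiB in total.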
194+
195+
## 5. Run a generation smoke test
196+
197+
Use `generate_sampled.py` to tokenize the prompt on the host, write the input
198+
tensor file expected by the semihosting runner, launch FVP, read back the
199+
output logits, and decode the generated token IDs into text:
200+
201+
```bash
202+
python examples/arm/smollm2_example_ethos_u/generate_sampled.py \
203+
--fvp examples/arm/arm-scratch/FVP-corstone320/models/Linux64_GCC-9.3/FVP_Corstone_SSE-320 \
204+
--runner smollm2_ethosu_seq32_w8a16_wikitext_full_logits/cmake-out/arm_executor_runner \
205+
--embedded-pte \
206+
--tokenizer data/tokenizers/smollm2/tokenizer.json \
207+
--prompt "Once upon a time in a small village," \
208+
--window 32 \
209+
--max-new-tokens 2 \
210+
--full-logits \
211+
--temperature 0 \
212+
--top-p 0.9 \
213+
--repetition-penalty 1.1
214+
```
215+
216+
How to interpret the main options:
217+
218+
- `--embedded-pte` tells the script not to copy a separate `program.pte`,
219+
because the runner already contains the model.
220+
- `--window 32` must match the exported `max_seq_length`. If these differ, the
221+
runner will reject the input tensor shape.
222+
- `--max-new-tokens 2` keeps the smoke test short. The goal here is to show the
223+
end-to-end path works, not to benchmark long decoding.
224+
- `--full-logits` tells `generate_sampled.py` to select the last valid prompt
225+
row from the `[window, vocab]` output. This matches the calibrated static
226+
non-KV export path and avoids sampling from padded positions.
227+
- `--temperature 0` switches to greedy decoding, which is the most stable way
228+
to compare short smoke runs.
229+
- `--top-p 0.9` is kept for consistency with the broader sampling interface,
230+
but it does not affect greedy decoding when `--temperature 0`.
231+
- `--repetition-penalty 1.1` still matters in greedy mode because it modifies
232+
the logits before `argmax`.
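
The interaction between the last three options can be illustrated with a toy example. This is a sketch under assumptions, not the logic of `generate_sampled.py`: it uses the common convention of dividing positive logits and multiplying negative logits by the penalty for tokens already seen, and the function names are hypothetical.

```python
from typing import Iterable, List


def apply_repetition_penalty(logits: List[float], seen: Iterable[int],
                             penalty: float) -> List[float]:
    """Lower the score of already-seen tokens (assumed convention, see above)."""
    out = list(logits)
    for tok in set(seen):
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


def pick_next(logits: List[float], seen: Iterable[int], temperature: float,
              top_p: float, repetition_penalty: float) -> int:
    adjusted = apply_repetition_penalty(logits, seen, repetition_penalty)
    if temperature == 0:
        # Greedy: argmax over the penalized logits; top_p is never consulted.
        return max(range(len(adjusted)), key=adjusted.__getitem__)
    raise NotImplementedError("sampling branch omitted from this sketch")


logits = [2.0, 2.1, 0.5, -1.0]
plain = pick_next(logits, seen=[], temperature=0, top_p=0.9, repetition_penalty=1.1)
penalized = pick_next(logits, seen=[1], temperature=0, top_p=0.9, repetition_penalty=1.1)
print(plain, penalized)
```

Without any seen tokens the argmax is token 1. Once token 1 has been generated, its logit drops from 2.1 to roughly 1.91, so the greedy choice flips to token 0, while changing `top_p` leaves the result untouched.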

## 6. Optional: evaluate Wikitext perplexity

The calibrated generation artifact already returns full logits for every token
position in the 32-token window, so the same PTE and runner can be used for
perplexity scoring.

### 6.1 Build the matching runner

This is the same build command as in section 4. If you already built that
runner, you can reuse it and skip this step.

```bash
bash examples/arm/smollm2_example_ethos_u/build_executor_runner_semihosting.sh \
  --pte=smollm2_ethosu_seq32_w8a16_wikitext_full_logits.pte \
  --output=smollm2_ethosu_seq32_w8a16_wikitext_full_logits/cmake-out \
  --method_pool_size=0x01000000 \
  --scratch_pool_size=0x00400000 \
  --input_file_pool_size=0x00100000
```

The full-logits artifact uses `--method_pool_size=0x01000000` (`16 MiB`).

### 6.2 Run perplexity

```bash
python examples/arm/smollm2_example_ethos_u/eval_wikitext_perplexity.py \
  --fvp examples/arm/arm-scratch/FVP-corstone320/models/Linux64_GCC-9.3/FVP_Corstone_SSE-320 \
  --runner-w8a8 smollm2_ethosu_seq32_w8a16_wikitext_full_logits/cmake-out/arm_executor_runner \
  --runner-w8a16 smollm2_ethosu_seq32_w8a16_wikitext_full_logits/cmake-out/arm_executor_runner \
  --prompts-file outputs/$(date +%F)/wikitext_prompts_seq32.txt \
  --num-prompts 100 \
  --ppl-prompts 100 \
  --min-prompt-tokens 32 \
  --max-prompt-tokens 32 \
  --max-tokens-per-prompt 32 \
  --window 32 \
  --timeout 36000 \
  --refresh-prompts
```

Why the prompt settings are all 32 here:

- `--window 32` must match the export shape.
- `--min-prompt-tokens 32` and `--max-prompt-tokens 32` force every prompt to
  fill exactly one scoring window, which makes the comparison easier to reason
  about.
- `--max-tokens-per-prompt 32` keeps scoring aligned with that same fixed
  window.
- `--num-prompts 100` builds a reusable prompt file with enough samples for a
  stable comparison.
- `--ppl-prompts 100` then scores all prompts from that file. Lower this value
  when you want a quicker but noisier local check.

The evaluator script compares two runners, which is why it asks for both
`--runner-w8a8` and `--runner-w8a16`. In this simplified `w8a16`-only flow, it
is acceptable to pass the same runner to both options when you only want one
number from the validated artifact.

## 7. Additional notes

### Why padding is needed for full-logits evaluation

The full-logits export returns one logits row per position in the fixed window.
Short prompts therefore need padding so the runtime still receives a tensor with
exactly 32 token slots. For perplexity, the evaluator right-pads the prompt so
the real tokens stay at the front of the causal window and each target token is
scored against the matching row. This preserves the usual left-to-right causal
ordering even though the deployed runtime works with fixed-size inputs.
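
As an illustration of the row alignment described above, the sketch below right-pads a short prompt and computes perplexity from a `[window][vocab]` logits table. It assumes the usual causal convention that row `i` predicts token `i + 1`; `PAD_ID` and the helper names are hypothetical and not taken from `eval_wikitext_perplexity.py`.

```python
import math
from typing import List, Sequence

WINDOW = 32
PAD_ID = 0  # hypothetical pad token id, used only for this sketch


def right_pad(tokens: Sequence[int], window: int = WINDOW) -> List[int]:
    """Keep real tokens at the front and fill the remaining slots with padding."""
    if len(tokens) > window:
        raise ValueError("prompt is longer than the exported window")
    return list(tokens) + [PAD_ID] * (window - len(tokens))


def log_softmax(row: Sequence[float]) -> List[float]:
    peak = max(row)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_sum for x in row]


def perplexity(logits: Sequence[Sequence[float]], tokens: Sequence[int]) -> float:
    """Score only real tokens: row i is compared against tokens[i + 1]."""
    n_targets = len(tokens) - 1
    if n_targets < 1:
        raise ValueError("need at least two tokens to score")
    nll = 0.0
    for i in range(n_targets):
        nll -= log_softmax(logits[i])[tokens[i + 1]]
    return math.exp(nll / n_targets)


tokens = [3, 1, 2]
padded = right_pad(tokens, window=4)
uniform_logits = [[0.0] * 4 for _ in padded]  # toy model with no preference
ppl = perplexity(uniform_logits, tokens)
print(padded, ppl)
```

Rows that belong to padded slots are never read, which is why right-padding does not distort the score. With uniform logits over a 4-token vocabulary, the perplexity equals the vocabulary size.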

### What `full` quantization scope means

`quantization.quantize_scope=full` asks the export stack to quantize more than
just the linear layers. That path exists for experimentation, but it is not the
validated path in this README because the linear-only setup is the one that
currently produces the clearest end-to-end result on Ethos-U FVP.

### Can calibration be faster?

Yes. The quickest way to iterate is to lower `--calibration_limit`. The tradeoff
is that you are collecting activation statistics from fewer samples, which can
hurt perplexity and generation quality. Keep `--calibration_seq_length` aligned
with `--max_seq_length`; if they differ, the calibration run is no longer
measuring the same tensor shapes that the deployed model will execute. In the
older non-KV path, calibration was especially slow because it often replayed
many partial prefixes position by position. The newer full-logits path can
observe a whole 32-token window in one pass, so larger limits are now much more
practical.
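
A rough count shows why the full-logits path makes larger limits affordable. This sketch assumes, as a worst case, that the older path ran one forward pass per prefix length for every calibration sample, and that the newer path runs one pass per sample; the real scripts may replay fewer prefixes, so treat the first number as an upper bound.

```python
def calibration_passes(limit: int, seq_len: int, prefix_replay: bool) -> int:
    """Forward passes needed to calibrate on `limit` samples."""
    # Assumption: prefix replay runs one pass per prefix length (1..seq_len).
    return limit * seq_len if prefix_replay else limit


SEQ_LEN = 32
counts = {
    limit: (
        calibration_passes(limit, SEQ_LEN, prefix_replay=True),
        calibration_passes(limit, SEQ_LEN, prefix_replay=False),
    )
    for limit in (2, 62)
}

for limit, (old, new) in counts.items():
    print(f"calibration_limit={limit}: up to {old} passes before, {new} now")
```

Under these assumptions, `--calibration_limit=62` drops from as many as 1984 forward passes to 62.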

In the saved seq32 runs in this branch, `--calibration_limit=62` is now
practical as the fuller-calibration setting, while `--calibration_limit=2`
remains the fast validation option. On the 100-prompt perplexity check, `2`
scored best, but `62` was still competitive and is the more conservative
default when export turnaround is less important than fuller calibration.

### Historical seq8 artifacts

Earlier experiments in this directory used smaller seq8 exports and separate
included-PTE runners. They are useful as implementation history, but they are
not the main path for this README because they add options without improving the
clarity of the validated seq32 `w8a16` workflow.

### Clean-checkout checklist

If the example fails on a clean checkout, the most common missing pieces are:

- `huggingface_hub[cli]` for the `hf download` command.
- `datasets` for rebuilding Wikitext prompts in the perplexity script.
- `pytorch_tokenizers`, installed from `./extension/llm/tokenizers/`.
- `backends/arm/scripts/build_executorch.sh`, which populates the default
  `arm_test` build root used by the runner wrappers.
