
Commit cf46ff9

Update on "[ET Device Support] Schema changes: device info on Tensor and buffer-level device array"
This diff adds device placement information to the ExecuTorch schema so that it can represent tensor-level device type information, a prerequisite for the upcoming tensor_parser updates. This is part of the Phase 1 implementation to make ET device type work E2E without user-specified device placement.

Design doc: https://docs.google.com/document/d/1lwd9BlohmwkN5EEvRulO_b-XnZBwv1nMb5l2K3jfuwA/edit?tab=t.0#heading=h.o6anuvkix4bu

Differential Revision: [D93635657](https://our.internmc.facebook.com/intern/diff/D93635657/)

[ghstack-poisoned]
2 parents 59a28f2 + d062e75 commit cf46ff9

101 files changed

Lines changed: 3381 additions & 926 deletions

.claude/skills/building/SKILL.md

Lines changed: 211 additions & 11 deletions
@@ -1,23 +1,223 @@
11
---
22
name: building
3-
description: Build ExecuTorch runners or C++ libraries. Use when compiling runners for Llama, Whisper, or other models, or building the C++ runtime.
3+
description: Build ExecuTorch from source — Python package, C++ runtime, runners, cross-compilation, and backend-specific builds. Use when compiling anything in the ExecuTorch repo, diagnosing build failures, or setting up platform-specific builds.
44
---
55

6-
# Building
6+
# Building ExecuTorch
77

8-
## Runners (Makefile)
8+
## Step 1: Ensure Python environment (detect and fix automatically)
9+
10+
**Path A — conda (preferred):**
11+
```bash
12+
# Initialize conda for non-interactive shells (required in Claude Code / CI)
13+
eval "$(conda shell.bash hook 2>/dev/null)"
14+
15+
# Check if executorch conda env exists; create if not
16+
conda env list 2>/dev/null | grep executorch || \
17+
ls "$(conda info --base 2>/dev/null)/envs/" 2>/dev/null | grep executorch || \
18+
conda create -yn executorch python=3.12
19+
20+
# Activate
21+
conda activate executorch
22+
```
23+
24+
**Path B — no conda (fall back to venv):**
25+
```bash
26+
# Find a compatible Python (3.10–3.13). On macOS with only Homebrew Python 3.14+,
27+
# install a compatible version first: brew install python@3.12
28+
python3.12 -m venv .executorch-venv # or python3.11, python3.10, python3.13
29+
source .executorch-venv/bin/activate
30+
pip install --upgrade pip
31+
```
32+
33+
**Then verify (either path):**
34+
35+
Run `python --version` and `cmake --version`. Fix automatically:
36+
- **Python not 3.10–3.13**: recreate the env with a correct Python version.
37+
- **cmake missing or < 3.24**: run `pip install 'cmake>=3.24'` inside the env.
38+
- **cmake >= 4.0**: works in practice, no action needed.
39+
40+
Parallel jobs: `$(sysctl -n hw.ncpu)` on macOS, `$(nproc)` on Linux.
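
The verification rules above can be sketched as a small check script. This is a minimal sketch and not part of the ExecuTorch repo: the helper names `parse_version`, `python_ok`, `cmake_ok`, and `parallel_jobs` are hypothetical, and the version bounds are copied from the list above.

```python
import os
import re
import sys


def parse_version(text):
    """Extract the first dotted version number in text as a tuple of ints."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"no version found in {text!r}")
    # A missing patch component is treated as 0.
    return tuple(int(part) for part in match.groups(default="0"))


def python_ok(version):
    """True when (major, minor) falls in the supported 3.10 to 3.13 range."""
    return (3, 10) <= tuple(version[:2]) <= (3, 13)


def cmake_ok(version):
    """True when cmake is at least 3.24 (4.0 and later are accepted too)."""
    return tuple(version[:2]) >= (3, 24)


def parallel_jobs():
    """Portable stand-in for `sysctl -n hw.ncpu` / `nproc`."""
    return os.cpu_count() or 1


if __name__ == "__main__":
    print("python ok:", python_ok(sys.version_info))
    print("parallel jobs:", parallel_jobs())
```

To check cmake, pass the output of `cmake --version` through `parse_version` and hand the result to `cmake_ok`.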
41+
42+
## Step 2: Build
43+
44+
Route based on what the user asks for:
45+
- User mentions **Android** → skip to [Cross-compilation: Android](#cross-compilation)
46+
- User mentions **iOS** or **frameworks** → skip to [Cross-compilation: iOS](#cross-compilation)
47+
- User mentions a **model name** (llama, whisper, etc.) → skip to [LLM / ASR model runner](#llm--asr-model-runner-simplest-path-for-running-models)
48+
- User mentions **C++ runtime** or **cmake** → skip to [C++ runtime](#c-runtime-standalone)
49+
- Otherwise → default to **Python package** below
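
The routing rules above amount to a first-match keyword lookup. The sketch below illustrates that idea only; the `ROUTES` table and `route` function are hypothetical, and the model keywords are limited to the two names the list mentions.

```python
import re

# Ordered (keywords, section) pairs; the first matching entry wins.
ROUTES = [
    (("android",), "Cross-compilation: Android"),
    (("ios", "framework"), "Cross-compilation: iOS"),
    (("llama", "whisper"), "LLM / ASR model runner"),
    (("c++ runtime", "cmake"), "C++ runtime"),
]


def route(request):
    """Return the section name that a free-text build request maps to."""
    text = request.lower()
    for keywords, section in ROUTES:
        for keyword in keywords:
            # Match at a word start so "ios" does not fire on "scenarios".
            if re.search(r"\b" + re.escape(keyword), text):
                return section
    return "Python package"
```

Because the table is ordered, a request that mentions both Android and a model name is routed to the Android section.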
50+
51+
### Python package (default)
952
```bash
10-
make help # list all targets
11-
make llama-cpu # Llama
12-
make whisper-metal # Whisper on Metal
13-
make gemma3-cuda # Gemma3 on CUDA
53+
conda activate executorch
54+
./install_executorch.sh --editable # editable install from source
1455
```
56+
This handles everything: submodules, deps, C++ build, Python install. Takes ~10 min on Apple Silicon.
57+
58+
For subsequent rebuilds (deps already present): `pip install -e . --no-build-isolation`
59+
60+
For minimal install (skip example deps): `./install_executorch.sh --minimal`
61+
62+
Enable additional backends:
63+
```bash
64+
CMAKE_ARGS="-DEXECUTORCH_BUILD_COREML=ON -DEXECUTORCH_BUILD_MPS=ON" ./install_executorch.sh --editable
65+
```
66+
67+
Verify: `python -c "from executorch.exir import to_edge_transform_and_lower; print('OK')"`
68+
69+
### LLM / ASR model runner (simplest path for running models)
70+
71+
```bash
72+
conda activate executorch
73+
make <model>-<backend>
74+
```
75+
76+
Available targets (run `make help` for full list):
77+
78+
| Target | Backend | macOS | Linux |
79+
|--------|---------|-------|-------|
80+
| `llama-cpu` | CPU | yes | yes |
81+
| `llama-cuda` | CUDA || yes |
82+
| `llama-cuda-debug` | CUDA (debug) || yes |
83+
| `llava-cpu` | CPU | yes | yes |
84+
| `whisper-cpu` | CPU | yes | yes |
85+
| `whisper-metal` | Metal | yes ||
86+
| `whisper-cuda` | CUDA || yes |
87+
| `parakeet-cpu` | CPU | yes | yes |
88+
| `parakeet-metal` | Metal | yes ||
89+
| `parakeet-cuda` | CUDA || yes |
90+
| `voxtral-cpu` | CPU | yes | yes |
91+
| `voxtral-cuda` | CUDA || yes |
92+
| `voxtral-metal` | Metal | yes ||
93+
| `voxtral_realtime-cpu` | CPU | yes | yes |
94+
| `voxtral_realtime-cuda` | CUDA || yes |
95+
| `voxtral_realtime-metal` | Metal | yes ||
96+
| `gemma3-cpu` | CPU | yes | yes |
97+
| `gemma3-cuda` | CUDA || yes |
98+
| `sortformer-cpu` | CPU | yes | yes |
99+
| `sortformer-cuda` | CUDA || yes |
100+
| `silero-vad-cpu` | CPU | yes | yes |
101+
| `clean` || yes | yes |
15102

16103
Output: `cmake-out/examples/models/<model>/<runner>`
17104

18-
## C++ Libraries (CMake)
105+
### C++ runtime (standalone)
106+
107+
**With presets (recommended):**
108+
109+
| Platform | Command |
110+
|----------|---------|
111+
| macOS | `cmake -B cmake-out --preset macos` (uses Xcode generator — requires Xcode) |
112+
| Linux | `cmake -B cmake-out --preset linux -DCMAKE_BUILD_TYPE=Release` |
113+
| Windows | `cmake -B cmake-out --preset windows -T ClangCL` |
114+
115+
Then: `cmake --build cmake-out --config Release -j$(sysctl -n hw.ncpu)` (macOS) or `cmake --build cmake-out -j$(nproc)` (Linux)
116+
117+
**LLM libraries via workflow presets** (configure + build + install in one command):
118+
```bash
119+
cmake --workflow --preset llm-release # CPU
120+
cmake --workflow --preset llm-release-metal # Metal (macOS)
121+
cmake --workflow --preset llm-release-cuda # CUDA (Linux/Windows)
122+
```
123+
124+
**Manual CMake (custom flags):**
125+
```bash
126+
cmake -B cmake-out \
127+
-DCMAKE_BUILD_TYPE=Release \
128+
-DEXECUTORCH_BUILD_XNNPACK=ON \
129+
-DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \
130+
-DEXECUTORCH_BUILD_EXTENSION_MODULE=ON \
131+
-DEXECUTORCH_BUILD_EXTENSION_FLAT_TENSOR=ON \
132+
-DEXECUTORCH_BUILD_EXTENSION_NAMED_DATA_MAP=ON \
133+
-DEXECUTORCH_BUILD_EXTENSION_DATA_LOADER=ON \
134+
-DEXECUTORCH_BUILD_EXTENSION_TENSOR=ON
135+
cmake --build cmake-out --parallel "$(nproc 2>/dev/null || sysctl -n hw.ncpu)"
136+
```
137+
138+
Run `cmake --list-presets` to see all available presets.
139+
140+
### Cross-compilation
141+
142+
**iOS/macOS frameworks:**
143+
```bash
144+
./scripts/build_apple_frameworks.sh --coreml --mps --xnnpack
145+
```
146+
Link in Xcode with `-all_load` linker flag.
147+
148+
**Android:**
149+
150+
Requires the `ANDROID_NDK` environment variable to be set (typically by Android Studio or a standalone NDK install).
19151
```bash
20-
cmake --list-presets # list presets
21-
cmake --workflow --preset llm-release # LLM CPU
22-
cmake --workflow --preset llm-release-metal # LLM Metal
152+
# Verify NDK is available
153+
echo $ANDROID_NDK # must point to NDK root, e.g. ~/Library/Android/sdk/ndk/<version>
154+
export ANDROID_ABIS=arm64-v8a BUILD_AAR_DIR=aar-out
155+
mkdir -p $BUILD_AAR_DIR && sh scripts/build_android_library.sh
23156
```
157+
158+
## Key build options
159+
160+
Most commonly needed flags (full list: `CMakeLists.txt`):
161+
162+
| Flag | What it enables |
163+
|------|-----------------|
164+
| `EXECUTORCH_BUILD_XNNPACK` | XNNPACK CPU backend |
165+
| `EXECUTORCH_BUILD_COREML` | Core ML (macOS/iOS) |
166+
| `EXECUTORCH_BUILD_MPS` | MPS GPU (macOS/iOS) |
167+
| `EXECUTORCH_BUILD_METAL` | Metal compute (macOS, requires EXTENSION_TENSOR) |
168+
| `EXECUTORCH_BUILD_CUDA` | CUDA GPU (Linux/Windows, requires EXTENSION_TENSOR) |
169+
| `EXECUTORCH_BUILD_KERNELS_OPTIMIZED` | Optimized kernels |
170+
| `EXECUTORCH_BUILD_KERNELS_QUANTIZED` | Quantized kernels |
171+
| `EXECUTORCH_BUILD_EXTENSION_MODULE` | Module extension (requires DATA_LOADER + FLAT_TENSOR + NAMED_DATA_MAP) |
172+
| `EXECUTORCH_BUILD_EXTENSION_LLM` | LLM extension |
173+
| `EXECUTORCH_BUILD_TESTS` | Unit tests (`ctest --test-dir cmake-out --output-on-failure`) |
174+
| `EXECUTORCH_BUILD_DEVTOOLS` | DevTools (Inspector, ETDump) |
175+
| `EXECUTORCH_OPTIMIZE_SIZE` | Size-optimized build (`-Os`, no exceptions/RTTI) |
176+
| `CMAKE_BUILD_TYPE` | `Release` or `Debug` (5-10x slower). Some presets (e.g. `llm-release`) set this; others require it explicitly. |
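
The "requires" notes in the table can be made explicit when assembling a configure command. The following is a sketch under assumptions: `REQUIRES`, `expand_flags`, and `configure_command` are hypothetical helpers, the dependency map is transcribed from the table above, and whether the real CMake configuration enables these dependencies on its own is not something this sketch verifies.

```python
# Dependencies transcribed from the "What it enables" column above.
REQUIRES = {
    "EXECUTORCH_BUILD_METAL": ["EXECUTORCH_BUILD_EXTENSION_TENSOR"],
    "EXECUTORCH_BUILD_CUDA": ["EXECUTORCH_BUILD_EXTENSION_TENSOR"],
    "EXECUTORCH_BUILD_EXTENSION_MODULE": [
        "EXECUTORCH_BUILD_EXTENSION_DATA_LOADER",
        "EXECUTORCH_BUILD_EXTENSION_FLAT_TENSOR",
        "EXECUTORCH_BUILD_EXTENSION_NAMED_DATA_MAP",
    ],
}


def expand_flags(flags):
    """Return flags plus everything they require, in first-seen order."""
    enabled = []
    pending = list(flags)
    while pending:
        flag = pending.pop(0)
        if flag in enabled:
            continue
        enabled.append(flag)
        pending.extend(REQUIRES.get(flag, []))
    return enabled


def configure_command(flags, build_type="Release", build_dir="cmake-out"):
    """Build the argument list for a manual `cmake -B` configure step."""
    args = ["cmake", "-B", build_dir, f"-DCMAKE_BUILD_TYPE={build_type}"]
    args += [f"-D{flag}=ON" for flag in expand_flags(flags)]
    return args
```

For example, `configure_command(["EXECUTORCH_BUILD_CUDA"])` also emits `-DEXECUTORCH_BUILD_EXTENSION_TENSOR=ON`, and the resulting list can be passed to `subprocess.run`.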
177+
178+
## Troubleshooting
179+
180+
| Symptom | Fix |
181+
|---------|-----|
182+
| Missing headers / `CMakeLists.txt not found` in third-party | `git submodule sync --recursive && git submodule update --init --recursive` |
183+
| Mysterious failures after `git pull` or branch switch | `rm -rf cmake-out/ pip-out/ && git submodule sync && git submodule update --init --recursive` |
184+
| `conda env list` PermissionError | Use `CONDA_NO_PLUGINS=true conda env list` or check env dir directly |
185+
| CMake >= 4.0 | Works in practice despite `< 4.0` in docs; only fix if build actually fails |
186+
| `externally-managed-environment` / PEP 668 error | You're using system Python, not conda. Activate conda env first. |
187+
| pip conflicts with torch versions | Fresh conda env; or `./install_executorch.sh --use-pt-pinned-commit` |
188+
| Missing `Python.h` (Linux) | `sudo apt install python3.X-dev` |
189+
| Missing operator registrations at runtime | Link kernel libs with `-Wl,-force_load,<lib>` (macOS) or `-Wl,--whole-archive <lib> -Wl,--no-whole-archive` (Linux) |
190+
| `install_executorch.sh` fails on Intel Mac | No prebuilt PyTorch wheels; use `--use-pt-pinned-commit --minimal` |
191+
| XNNPACK build errors about cpuinfo/pthreadpool | Ensure `EXECUTORCH_BUILD_CPUINFO=ON` and `EXECUTORCH_BUILD_PTHREADPOOL=ON` (both ON by default) |
192+
| Duplicate kernel registration abort | Only link one `gen_operators_lib` per target |
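
The fix for missing operator registrations differs by platform, which a small helper can capture. This is an illustrative sketch: `whole_archive_flags` is a hypothetical function, and the library name used in the usage note is a placeholder rather than a real ExecuTorch target.

```python
def whole_archive_flags(lib, platform):
    """Return linker flags that force every object in a static lib to link.

    Static-initializer kernel registrations are otherwise dropped when no
    symbol in their object file is referenced directly.
    """
    if platform == "macos":
        return [f"-Wl,-force_load,{lib}"]
    if platform == "linux":
        return ["-Wl,--whole-archive", lib, "-Wl,--no-whole-archive"]
    raise ValueError(f"unsupported platform: {platform!r}")
```

Calling `whole_archive_flags("libmy_kernels.a", "linux")` yields the three-element Linux form from the table, with the placeholder library between the two `-Wl` switches.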
193+
194+
## Build output
195+
196+
**From `./install_executorch.sh` (Python package):**
197+
198+
| Artifact | Location |
199+
|----------|----------|
200+
| Python package | `site-packages/executorch` |
201+
202+
**From CMake builds** (`cmake --install` with `CMAKE_INSTALL_PREFIX=cmake-out`):
203+
204+
| Artifact | Location |
205+
|----------|----------|
206+
| Core runtime | `cmake-out/lib/libexecutorch.a` |
207+
| XNNPACK backend | `cmake-out/lib/libxnnpack_backend.a` |
208+
| executor_runner | `cmake-out/executor_runner` (Ninja/Make) or `cmake-out/Release/executor_runner` (Xcode) |
209+
| Model runners | `cmake-out/examples/models/<model>/<runner>` |
210+
211+
**From cross-compilation:**
212+
213+
| Artifact | Location |
214+
|----------|----------|
215+
| iOS frameworks | `cmake-out/*.xcframework` |
216+
| Android AAR | `aar-out/` |
217+
218+
## Tips
219+
- Always use `Release` for benchmarking; `Debug` is 5–10x slower
220+
- `ccache` is auto-detected if installed (`brew install ccache`)
221+
- `Ninja` is faster than Make (`-G Ninja`) — but `--preset macos` uses Xcode generator
222+
- For LLM workflows, `make <model>-<backend>` is the simplest path
223+
- After `git pull`, clean and re-init submodules before rebuilding

backends/arm/_passes/__init__.py

Lines changed: 1 addition & 0 deletions
@@ -136,6 +136,7 @@
136136
from .rewrite_le_lt_to_ge_gt_pass import RewriteLeLtToGeGtPass # noqa
137137
from .rewrite_matmul import RewriteMatmulPass # noqa
138138
from .rewrite_pad import RewritePadPass # noqa
139+
from .rewrite_slice import RewriteSlicePass # noqa
139140
from .rewrite_upsample import RewriteUpsamplePass # noqa
140141
from .scalars_to_attribute_pass import ScalarsToAttributePass # noqa
141142
from .size_adjust_input_pass import SizeAdjustInputPass # noqa

backends/arm/_passes/arm_pass_manager.py

Lines changed: 2 additions & 0 deletions
@@ -121,6 +121,7 @@
121121
RewriteLeLtToGeGtPass,
122122
RewriteMatmulPass,
123123
RewritePadPass,
124+
RewriteSlicePass,
124125
RewriteUpsamplePass,
125126
ScalarsToAttributePass,
126127
SizeAdjustInputPass,
@@ -374,6 +375,7 @@ def _tosa_pipeline(
374375
RewriteConvPass(exported_program),
375376
RewriteMatmulPass(),
376377
RewritePadPass(),
378+
RewriteSlicePass(),
377379
]
378380
)
379381

backends/arm/_passes/arm_pass_utils.py

Lines changed: 48 additions & 1 deletion
@@ -8,7 +8,7 @@
88

99
import traceback
1010
from inspect import isclass
11-
from typing import Optional, Sequence
11+
from typing import List, Optional, Sequence, Tuple
1212

1313
import torch
1414
import torch.fx
@@ -17,6 +17,10 @@
1717
from executorch.exir import ExportedProgram
1818
from executorch.exir.dialects._ops import ops as exir_ops
1919
from executorch.exir.dialects.edge._ops import EdgeOpOverload
20+
from executorch.exir.graph_module import (
21+
_get_control_flow_submodules,
22+
get_control_flow_submodules,
23+
)
2024

2125
from torch._export.utils import (
2226
get_buffer,
@@ -29,6 +33,7 @@
2933
from torch._ops import OpOverload
3034
from torch._subclasses.fake_tensor import FakeTensor
3135
from torch.export.graph_signature import InputKind
36+
from torch.fx import GraphModule, Node
3237

3338

3439
def is_submodule_node(node: torch.fx.Node):
@@ -284,3 +289,45 @@ def set_node_arg(node: torch.fx.Node, i: int | str, value):
284289
def get_output_dim_orders(graph_module):
285290
output_node = graph_module.graph.output_node()
286291
return [get_first_fake_tensor(node).dim_order() for node in output_node.args[0]]
292+
293+
294+
def is_nested_control_flow_graph(graph_module: GraphModule) -> bool:
295+
"""Returns True if graph_module is a nested control-flow graph."""
296+
297+
# Find all top-level control-flow submodules
298+
top_cf = get_control_flow_submodules(graph_module)
299+
# For each submodule, see if it itself has control-flow inside
300+
for _, submod, _ in top_cf:
301+
if get_control_flow_submodules(submod):
302+
return True
303+
return False
304+
305+
306+
def get_cond_while_submodules_nested(
307+
graph_module: GraphModule,
308+
apply_quantization: bool = False,
309+
) -> List[Tuple[str, GraphModule, Node]]:
310+
"""Recursively find cond/while_loop submodules in a GraphModule.
311+
312+
In nested control flow graphs, FX records the submodule functions
313+
(true/false or cond/body) in reverse order compared to top-level graphs. We
314+
must swap the indices when nested so that cond (first) and body/true_fn
315+
(second) are consistently identified across all nesting levels.
316+
317+
"""
318+
319+
# Determine arg indices based on nesting and whether only cond branch is needed
320+
nested = is_nested_control_flow_graph(graph_module)
321+
# cond: [true_fn, false_fn] or swapped if nested
322+
cond_indices = [2, 1] if nested else [1, 2]
323+
# while_loop: [cond_fn, body_fn] or swapped if nested
324+
while_indices = [1, 0] if nested else [0, 1]
325+
if apply_quantization:
326+
# only keep the cond_fn for while_loop (first index) when quantizing.
327+
while_indices = [while_indices[0]]
328+
mapping = {
329+
torch.ops.higher_order.cond: cond_indices,
330+
torch.ops.higher_order.while_loop: while_indices,
331+
}
332+
# collect cond/while submodules (using mapping indices)
333+
return _get_control_flow_submodules(graph_module, mapping)
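
The index selection in `get_cond_while_submodules_nested` can be examined in isolation, without torch. The sketch below mirrors only that logic; `select_submodule_indices` is a hypothetical name, and the string keys stand in for `torch.ops.higher_order.cond` and `torch.ops.higher_order.while_loop`.

```python
def select_submodule_indices(nested, apply_quantization=False):
    """Mirror the arg-index choice made for cond and while_loop nodes."""
    # cond: [true_fn, false_fn], swapped when the graph is nested.
    cond_indices = [2, 1] if nested else [1, 2]
    # while_loop: [cond_fn, body_fn], swapped when the graph is nested.
    while_indices = [1, 0] if nested else [0, 1]
    if apply_quantization:
        # Keep only the first selected index when quantizing.
        while_indices = while_indices[:1]
    return {"cond": cond_indices, "while_loop": while_indices}
```

For a top-level graph this returns `{"cond": [1, 2], "while_loop": [0, 1]}`; for a nested graph both lists are reversed.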

backends/arm/_passes/control_flow_const_inline.py

Lines changed: 27 additions & 8 deletions
@@ -5,11 +5,14 @@
55

66
from typing import Set, Type
77

8+
import torch
89
from executorch.backends.arm._passes.arm_pass import ArmPass
10+
from executorch.backends.arm._passes.arm_pass_utils import (
11+
get_cond_while_submodules_nested,
12+
is_submodule_node,
13+
)
914
from executorch.backends.transforms.utils import is_get_attr_node
1015
from executorch.exir.dialects._ops import ops as exir_ops
11-
from executorch.exir.graph_module import get_cond_while_submodules
12-
1316
from executorch.exir.pass_base import ExportPass, PassResult
1417
from torch.fx import GraphModule
1518

@@ -27,15 +30,23 @@ class ControlFlowConstInlinePass(ArmPass):
2730

2831
_passes_required_after: Set[Type[ExportPass]] = set()
2932

30-
def __init__(self, *args, **kwargs):
31-
super().__init__(*args, **kwargs)
33+
_targeted_ops = {
34+
torch.ops.higher_order.cond,
35+
torch.ops.higher_order.while_loop,
36+
}
3237

33-
def call(self, graph_module: GraphModule) -> PassResult:
38+
def _convert_getattr(self, graph_module):
3439
modified = False
35-
36-
for _, submodule, _ in get_cond_while_submodules(graph_module):
40+
for _, submodule, _ in get_cond_while_submodules_nested(graph_module):
3741
for submodule_node in submodule.graph.nodes:
38-
if is_get_attr_node(submodule_node):
42+
if submodule_node.target in self._targeted_ops:
43+
modified |= self._convert_getattr(submodule)
44+
45+
# For nested control flow, a "node" may actually be a GraphModule.
46+
# Ensure we are only checking for nodes here.
47+
if is_get_attr_node(submodule_node) and not is_submodule_node(
48+
submodule_node
49+
):
3950
val = getattr(
4051
submodule_node.graph.owning_module, submodule_node.target
4152
)
@@ -53,6 +64,14 @@ def call(self, graph_module: GraphModule) -> PassResult:
5364
submodule_node.replace_all_uses_with(const_node)
5465
submodule.graph.erase_node(submodule_node)
5566
modified = True
67+
return modified
68+
69+
def __init__(self, *args, **kwargs):
70+
super().__init__(*args, **kwargs)
71+
72+
def call(self, graph_module: GraphModule) -> PassResult:
73+
74+
modified = self._convert_getattr(graph_module)
5675

5776
if modified:
5877
graph_module.recompile()
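
A recursive pass like `_convert_getattr` has to report changes made at any nesting depth so the caller knows to recompile. The sketch below shows that propagation pattern on plain dictionaries; the `inline_pending` function and the `pending`/`inlined`/`children` keys are hypothetical stand-ins, not ExecuTorch data structures.

```python
def inline_pending(node):
    """Move each node's 'pending' items to 'inlined', recursing into children.

    Returns True if this node or any descendant was changed.
    """
    modified = False
    for child in node.get("children", []):
        # Propagate the child's result; dropping it would hide nested changes.
        modified |= inline_pending(child)
    if node.get("pending"):
        node["inlined"] = node.get("inlined", []) + node["pending"]
        node["pending"] = []
        modified = True
    return modified
```

A second call on the same tree returns `False`, since nothing is left to inline.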
