
Commit e6cd4a6

committed
Update on "[ET Device Support] DeviceAllocator interface and DeviceAllocatorRegistry"
This diff introduces the `DeviceAllocator` abstract interface and the `DeviceAllocatorRegistry` for device-specific memory allocation. This is a foundational abstraction that lets the runtime dispatch memory operations to non-CPU device backends (CUDA, etc.).

**DeviceAllocator interface provides:**
- `init_buffer()` - Initialize memory buffer pools for memory-planned tensors
- `get_offset_address()` - Get a pointer to an offset within a pre-allocated buffer
- `allocate()` / `deallocate()` - Dynamic device memory allocation
- `copy_host_to_device()` / `copy_device_to_host()` - Data transfer between host and device
- `device_type()` - Returns the device type this allocator handles

**DeviceAllocatorRegistry provides:**
- A singleton registry mapping DeviceType → DeviceAllocator
- `register_allocator()` / `get_allocator()` methods
- A fixed-size array indexed by device type (no dynamic allocation, embedded-friendly)

**Design notes:**
- The registry stores raw, non-owning pointers; allocators are expected to be singletons with static lifetime
- Follows ExecuTorch's embedded-first philosophy (no std::unique_ptr, no heap allocation in the registry)
- Convenience free functions `register_device_allocator()` and `get_device_allocator()` for ease of use

Differential Revision: [D93635656](https://our.internmc.facebook.com/intern/diff/D93635656/)

[ghstack-poisoned]
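The registry design described above (a fixed-size table indexed by device type, non-owning registration, module-level convenience functions) can be modeled in a short Python sketch. The real interface is C++ inside the ExecuTorch runtime; everything here — the `DeviceType` members, `FakeCudaAllocator`, and all method bodies — is illustrative only, not the actual API.

```python
# Hypothetical Python model of the DeviceAllocatorRegistry design described
# above; the real interface is C++ and none of these names are taken from it.
from enum import IntEnum


class DeviceType(IntEnum):
    CPU = 0
    CUDA = 1
    NUM_DEVICE_TYPES = 2


class DeviceAllocatorRegistry:
    """Fixed-size, non-owning table mapping DeviceType -> allocator."""

    def __init__(self):
        # Fixed-size slot array, no dynamic growth (embedded-friendly).
        self._allocators = [None] * DeviceType.NUM_DEVICE_TYPES

    def register_allocator(self, device_type, allocator):
        # Stores a reference only; the caller keeps the allocator alive
        # (static lifetime in the C++ design).
        self._allocators[device_type] = allocator

    def get_allocator(self, device_type):
        allocator = self._allocators[device_type]
        if allocator is None:
            raise RuntimeError(f"No allocator registered for {device_type!r}")
        return allocator


# Module-level singleton plus convenience free functions, mirroring the
# register_device_allocator() / get_device_allocator() pair in the diff.
_REGISTRY = DeviceAllocatorRegistry()


def register_device_allocator(device_type, allocator):
    _REGISTRY.register_allocator(device_type, allocator)


def get_device_allocator(device_type):
    return _REGISTRY.get_allocator(device_type)
```

Because lookup is an array index rather than a map, dispatch stays O(1) with no heap traffic, which is the embedded-friendly property the design notes call out.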
2 parents 6659f13 + 7f91860 commit e6cd4a6

132 files changed

Lines changed: 6232 additions & 2545 deletions


.ci/scripts/wheel/pre_build_script.sh

Lines changed: 19 additions & 12 deletions
@@ -9,34 +9,41 @@ set -euxo pipefail

 # This script is run before building ExecuTorch binaries

-if [[ "$(uname -m)" == "aarch64" ]]; then
-  # On some Linux aarch64 systems, the "atomic" library is not found during linking.
-  # To work around this, replace "atomic" with the literal ${ATOMIC_LIB} so the
-  # build system uses the full path to the atomic library.
-  file="extension/llm/tokenizers/third-party/sentencepiece/src/CMakeLists.txt"
-  sed 's/list(APPEND SPM_LIBS "atomic")/list(APPEND SPM_LIBS ${ATOMIC_LIB})/' \
-    "$file" > "${file}.tmp" && mv "${file}.tmp" "$file"
-
-  grep -n 'list(APPEND SPM_LIBS ${ATOMIC_LIB})' "$file" && \
-    echo "the file $file has been modified for atomic to use full path"
+# Initialize submodules here instead of during checkout so we can use OpenSSL
+# on Windows (schannel fails with SEC_E_ILLEGAL_MESSAGE on some gitlab hosts).
+UNAME_S=$(uname -s)
+if [[ $UNAME_S == *"MINGW"* || $UNAME_S == *"MSYS"* ]]; then
+  git -c http.sslBackend=openssl submodule update --init
+else
+  git submodule update --init
 fi

 # Clone nested submodules for tokenizers - this is a workaround for recursive
 # submodule clone failing due to path length limitations on Windows. Eventually,
 # we should update the core job in test-infra to enable long paths before
 # checkout to avoid needing to do this.
 pushd extension/llm/tokenizers
-UNAME_S=$(uname -s)
 if [[ $UNAME_S == *"MINGW"* || $UNAME_S == *"MSYS"* ]]; then
   git -c http.sslBackend=openssl submodule update --init
 else
   git submodule update --init
 fi
 popd

+if [[ "$(uname -m)" == "aarch64" ]]; then
+  # On some Linux aarch64 systems, the "atomic" library is not found during linking.
+  # To work around this, replace "atomic" with the literal ${ATOMIC_LIB} so the
+  # build system uses the full path to the atomic library.
+  file="extension/llm/tokenizers/third-party/sentencepiece/src/CMakeLists.txt"
+  sed 's/list(APPEND SPM_LIBS "atomic")/list(APPEND SPM_LIBS ${ATOMIC_LIB})/' \
+    "$file" > "${file}.tmp" && mv "${file}.tmp" "$file"
+
+  grep -n 'list(APPEND SPM_LIBS ${ATOMIC_LIB})' "$file" && \
+    echo "the file $file has been modified for atomic to use full path"
+fi
+
 # On Windows, enable symlinks and re-checkout the current revision to create
 # the symlinked src/ directory. This is needed to build the wheel.
-UNAME_S=$(uname -s)
 if [[ $UNAME_S == *"MINGW"* || $UNAME_S == *"MSYS"* ]]; then
   echo "Enabling symlinks on Windows"
   git config core.symlinks true

.github/workflows/build-wheels-windows.yml

Lines changed: 3 additions & 1 deletion
@@ -64,4 +64,6 @@ jobs:
       smoke-test-script: ${{ matrix.smoke-test-script }}
       trigger-event: ${{ github.event_name }}
       wheel-build-params: "--verbose"
-      submodules: true
+      # Submodules are initialized in pre_build_script.sh with OpenSSL to avoid
+      # schannel SSL errors on Windows when cloning from non-GitHub hosts.
+      submodules: false

.github/workflows/cuda.yml

Lines changed: 3 additions & 0 deletions
@@ -135,6 +135,9 @@ jobs:
           # Run CUDA backend Python tests
           python -m pytest backends/cuda/tests backends/cuda/passes/tests -v -o "addopts="

+          # Build Qwen3.5 MoE runner (ExecuTorch already built above)
+          cd examples/models/qwen3_5_moe && cmake --workflow --preset qwen3-5-moe-cuda
+
   export-model-cuda-artifact:
     name: export-model-cuda-artifact
     # Skip this job if the pull request is from a fork (HuggingFace secrets are not available)

CMakePresets.json

Lines changed: 2 additions & 1 deletion
@@ -152,7 +152,8 @@
         "llm-release"
       ],
       "cacheVariables": {
-        "EXECUTORCH_BUILD_CUDA": "ON"
+        "EXECUTORCH_BUILD_CUDA": "ON",
+        "CMAKE_CUDA_ARCHITECTURES": "native"
       },
       "condition": {
         "type": "inList",

Makefile

Lines changed: 11 additions & 1 deletion
@@ -91,7 +91,7 @@
 #
 # ==============================================================================

-.PHONY: voxtral-cuda voxtral-cpu voxtral-metal voxtral_realtime-cuda voxtral_realtime-cpu voxtral_realtime-metal whisper-cuda whisper-cuda-debug whisper-cpu whisper-metal parakeet-cuda parakeet-cuda-debug parakeet-cpu parakeet-metal parakeet-vulkan dinov2-cuda dinov2-cuda-debug sortformer-cuda sortformer-cpu silero-vad-cpu llama-cuda llama-cuda-debug llama-cpu llava-cpu gemma3-cuda gemma3-cpu clean help
+.PHONY: voxtral-cuda voxtral-cpu voxtral-metal voxtral_realtime-cuda voxtral_realtime-cpu voxtral_realtime-metal whisper-cuda whisper-cuda-debug whisper-cpu whisper-metal parakeet-cuda parakeet-cuda-debug parakeet-cpu parakeet-metal parakeet-vulkan dinov2-cuda dinov2-cuda-debug sortformer-cuda sortformer-cpu silero-vad-cpu llama-cuda llama-cuda-debug llama-cpu llava-cpu gemma3-cuda gemma3-cpu qwen3_5_moe-cuda clean help

 help:
 	@echo "This Makefile adds targets to build runners for various models on various backends. Run using \`make <target>\`. Available targets:"
@@ -121,6 +121,7 @@ help:
 	@echo " llava-cpu - Build Llava runner with CPU backend"
 	@echo " gemma3-cuda - Build Gemma3 runner with CUDA backend"
 	@echo " gemma3-cpu - Build Gemma3 runner with CPU backend"
+	@echo " qwen3_5_moe-cuda - Build Qwen3.5 MoE runner with CUDA backend"
 	@echo " clean - Clean build artifacts"

 voxtral-cuda:
@@ -362,6 +363,15 @@ gemma3-cpu:
 	@echo "✓ Build complete!"
 	@echo " Binary: cmake-out/examples/models/gemma3/gemma3_e2e_runner"

+qwen3_5_moe-cuda:
+	@echo "==> Building and installing ExecuTorch with CUDA..."
+	cmake --workflow --preset llm-release-cuda
+	@echo "==> Building Qwen3.5 MoE runner with CUDA..."
+	cd examples/models/qwen3_5_moe && cmake --workflow --preset qwen3-5-moe-cuda
+	@echo ""
+	@echo "✓ Build complete!"
+	@echo " Binary: cmake-out/examples/models/qwen3_5_moe/qwen3_5_moe_runner"
+
 clean:
 	rm -rf cmake-out \
 	extension/llm/tokenizers/build \

backends/arm/_passes/arm_pass.py

Lines changed: 29 additions & 0 deletions
@@ -11,6 +11,7 @@

 from executorch.backends.arm.constants import DISALLOW_TFA_META_KEY
 from executorch.backends.arm.tosa.mapping import TosaSpecialDtype
+from executorch.exir.dialects._ops import ops as exir_ops
 from executorch.exir.pass_base import ExportPass, NodeMetadata, ProxyValue
 from torch.fx import GraphModule
 from torch.fx.passes.infra.pass_base import PassResult
@@ -124,3 +125,31 @@ def call_shape_operator(
         shape_meta.data[TosaSpecialDtype.meta_key()] = TosaSpecialDtype.SHAPE
         # Call the super (ArmPass) call operator with updated meta
         return self.call_operator(op, args, kwargs, shape_meta, updated)
+
+    def call_scalar(self, value: int | float, meta: NodeMetadata | dict[str, Any]):
+        """Return a scalar value for the current pass stage.
+
+        In transform-for-annotation passes this returns the Python scalar
+        directly. In later passes it materializes a `(1,)` `aten.full` node
+        using the output dtype/device from `meta["val"]` when available.
+
+        """
+
+        if self.is_tfa_pass:
+            return value
+
+        kwargs = {}
+        if "val" in meta:
+            val = meta["val"]
+            if isinstance(val, tuple):
+                val = val[0]
+            kwargs = {"device": val.device, "dtype": val.dtype}
+
+        return ArmPass.call_operator(
+            self,
+            op=exir_ops.edge.aten.full.default,
+            args=((1,), value),
+            kwargs=kwargs,
+            meta=meta,
+            updated=True,
+        )
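The new `call_scalar` helper picks between two behaviors depending on the pass stage. A standalone sketch of that control flow, with a plain dict standing in for the emitted `aten.full` node — the `meta` handling mirrors the diff, but `FakeVal` and the dict node representation are invented for illustration:

```python
# Simplified model of the call_scalar behavior added above: in a
# transform-for-annotation (TFA) pass the Python scalar is returned as-is;
# otherwise a (1,)-shaped "full" op is emitted with dtype/device taken from
# meta["val"]. The dict returned here is a stand-in, not an ExecuTorch node.

def call_scalar(value, meta, is_tfa_pass):
    if is_tfa_pass:
        # Quantization annotation runs on plain scalars; no node is needed.
        return value

    kwargs = {}
    if "val" in meta:
        val = meta["val"]
        if isinstance(val, tuple):
            # Multi-output ops store a tuple of fake tensors; use the first.
            val = val[0]
        kwargs = {"device": val.device, "dtype": val.dtype}

    # Stand-in for emitting an aten.full.default node into the graph.
    return {"op": "aten.full.default", "args": ((1,), value), "kwargs": kwargs}
```

This dual behavior is what lets the decomposition passes moved into the TFA pipeline share one code path with their post-to-edge counterparts.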

backends/arm/_passes/arm_pass_manager.py

Lines changed: 6 additions & 6 deletions
@@ -557,6 +557,12 @@ def transform_for_annotation_pipeline(self, graph_module: GraphModule):
                 DecomposeDivTensorModePass(tfa_pass=True),
                 DecomposeWhereScalarOtherPass(tfa_pass=True),
                 RewriteInplaceArithmeticPass(tfa_pass=True),
+                DecomposeAddSubAlphaPass(tfa_pass=True),
+                DecomposeLeakyReLUPass(tfa_pass=True),
+                DecomposeGroupNormPass(tfa_pass=True),
+                DecomposeLayerNormPass(tfa_pass=True),
+                DecomposeVarPass(tfa_pass=True),
+                DecomposeMeanDimPass(graph_module, self.tosa_spec, tfa_pass=True),
             ]
         )

@@ -573,16 +579,10 @@ def transform_for_annotation_pipeline(self, graph_module: GraphModule):
         self.add_passes(
             [
                 NormalizeWhileInitialArgsPass(use_exir_clone=False, tfa_pass=True),
-                DecomposeAddSubAlphaPass(tfa_pass=True),
-                DecomposeGroupNormPass(tfa_pass=True),
-                DecomposeLayerNormPass(tfa_pass=True),
-                DecomposeVarPass(tfa_pass=True),
-                DecomposeMeanDimPass(graph_module, self.tosa_spec, tfa_pass=True),
                 DecomposeNotEqualPass(tfa_pass=True),
                 DecomposeCosineSimilarityPass(tfa_pass=True),
                 DecomposeGluPass(tfa_pass=True),
                 DecomposeDivPass(tfa_pass=True),
-                DecomposeLeakyReLUPass(tfa_pass=True),
                 DecomposeLinalgVectorNormPass(tfa_pass=True),
                 DecomposeSqrtPass(tfa_pass=True),
                 DecomposeAdaptiveAvgPool2dPass(tfa_pass=True),

backends/arm/_passes/arm_pass_utils.py

Lines changed: 57 additions & 0 deletions
@@ -14,6 +14,7 @@
 import torch.fx
 from executorch.backends.arm.common.debug import get_node_debug_info
 from executorch.backends.arm.common.type import ensure_type
+from executorch.backends.arm.tosa.mapping import TosaSpecialDtype
 from executorch.exir import ExportedProgram
 from executorch.exir.dialects._ops import ops as exir_ops
 from executorch.exir.dialects.edge._ops import EdgeOpOverload
@@ -172,6 +173,30 @@ def create_node(
     return node


+def create_shape_node(
+    graph: torch.fx.Graph,
+    op_target: EdgeOpOverload,
+    args: tuple = (),
+    kwargs: Optional[dict] = None,
+    from_node: Optional[torch.fx.Node] = None,
+):
+    """Adds a shape node to 'graph'.
+
+    graph.inserting_before/after() should be used before the call to decide
+    where to insert the node.
+
+    """
+    node = create_node(
+        graph=graph,
+        op_target=op_target,
+        args=args,
+        kwargs=kwargs,
+        from_node=from_node,
+    )
+    node.meta[TosaSpecialDtype.meta_key()] = TosaSpecialDtype.SHAPE
+    return node
+
+
 def insert_q_dq_pair(
     graph: torch.fx.Graph,
     anchor: torch.fx.Node,
@@ -211,6 +236,38 @@ def meta_without_qparams(meta: NodeMetadata) -> NodeMetadata:
     return NodeMetadata(plain_meta_dict)


+def insert_scalar(
+    graph: torch.fx.Graph,
+    value: int | float,
+    meta: NodeMetadata | dict,
+    from_node: torch.fx.Node,
+    is_tfa_pass: bool = False,
+) -> torch.fx.Node | int | float:
+    """Insert an `aten.full` scalar node for direct graph-rewrite passes."""
+
+    if is_tfa_pass:
+        return value
+
+    kwargs = {}
+    val = None
+    if "val" in meta:
+        val = meta["val"]
+        if isinstance(val, tuple):
+            val = val[0]
+        kwargs = {"device": val.device, "dtype": val.dtype}
+
+    scalar = create_node(
+        graph=graph,
+        op_target=exir_ops.edge.aten.full.default,
+        args=((1,), value),
+        kwargs=kwargs,
+        from_node=from_node,
+    )
+    if val is not None:
+        scalar.meta["val"] = torch.full((1,), value, **kwargs)
+    return scalar
+
+
 def get_first_fake_tensor(node: torch.fx.Node) -> FakeTensor:
     """Returns a FakeTensor from the meta field of 'node'.

backends/arm/_passes/decompose_add_sub_alpha_pass.py

Lines changed: 2 additions & 13 deletions
@@ -30,24 +30,20 @@ def _get_ops(op):
         if op is exir_ops.edge.aten.add.Tensor:
             return (
                 exir_ops.edge.aten.mul.Tensor,
-                exir_ops.edge.aten.full.default,
                 exir_ops.edge.aten.add.Tensor,
             )
         return (
             torch.ops.aten.mul.Tensor,
-            torch.ops.aten.full.default,
             torch.ops.aten.add.Tensor,
         )
     if op in _SUB_OPS:
         if op is exir_ops.edge.aten.sub.Tensor:
             return (
                 exir_ops.edge.aten.mul.Tensor,
-                exir_ops.edge.aten.full.default,
                 exir_ops.edge.aten.sub.Tensor,
             )
         return (
             torch.ops.aten.mul.Tensor,
-            torch.ops.aten.full.default,
             torch.ops.aten.sub.Tensor,
         )
     raise RuntimeError(f"Unsupported operator {op}")
@@ -72,19 +68,12 @@ def call_operator(self, op, args, kwargs, meta, updated: bool | None = False):
         if not _should_decompose(alpha):
             return super().call_operator(op, args, kwargs, meta, updated)

-        mul_op, full_op, binary_op = _get_ops(op)
+        mul_op, binary_op = _get_ops(op)
         lhs, rhs = args

-        alpha_full = super().call_operator(
-            full_op,
-            ((1,), float(alpha)),
-            {"device": meta["val"].device, "dtype": meta["val"].dtype},
-            meta,
-            updated=True,
-        )
         scaled_rhs = super().call_operator(
             mul_op,
-            (rhs, alpha_full),
+            (rhs, super().call_scalar(alpha, meta)),
             {},
             meta,
             updated=True,
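The pass above decomposes `add(lhs, rhs, alpha)` into `add(lhs, mul(rhs, alpha))` (and the analogous rewrite for `sub`). A pure-Python numeric check of that equivalence, with lists of floats standing in for tensors — an illustration of the arithmetic, not the pass's code:

```python
# Numeric sanity check of the alpha decomposition performed by
# DecomposeAddSubAlphaPass: add(lhs, rhs, alpha) == add(lhs, mul(rhs, alpha)).

def add_with_alpha(lhs, rhs, alpha):
    # Semantics of aten.add.Tensor with an alpha multiplier, elementwise.
    return [l + alpha * r for l, r in zip(lhs, rhs)]


def decomposed_add(lhs, rhs, alpha):
    # The rewritten form: scale rhs by alpha first, then a plain add.
    scaled_rhs = [alpha * r for r in rhs]
    return [l + s for l, s in zip(lhs, scaled_rhs)]


def decomposed_sub(lhs, rhs, alpha):
    # Same rewrite for aten.sub.Tensor: lhs - alpha * rhs.
    scaled_rhs = [alpha * r for r in rhs]
    return [l - s for l, s in zip(lhs, scaled_rhs)]
```

After the `call_scalar` change, the alpha constant itself is only materialized as a `full` node in post-annotation stages, so the rewrite stays usable in the TFA pipeline.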

backends/arm/_passes/decompose_asin_and_acos_pass.py

Lines changed: 5 additions & 14 deletions
@@ -42,7 +42,6 @@ def get_decomposition(op) -> tuple:
         exir_ops.edge.aten.gt.Scalar,
         exir_ops.edge.aten.lt.Scalar,
         exir_ops.edge.aten.sub.Tensor,
-        exir_ops.edge.aten.full_like.default,
         exir_ops.edge.aten.neg.default,
     )

@@ -79,15 +78,12 @@ def _build_polynomial(
         """Helper function to build polynomial from coefficients and
         variable.
         """
-        full_like_op, add_op, mul_op_scalar, mul_op = (
-            exir_ops.edge.aten.full_like.default,
+        add_op, mul_op_scalar, mul_op = (
             exir_ops.edge.aten.add.Tensor,
             exir_ops.edge.aten.mul.Scalar,
             exir_ops.edge.aten.mul.Tensor,
         )
-        result = super().call_operator(
-            full_like_op, (variable, coefficients[0]), {}, meta, True
-        )
+        result = super().call_scalar(coefficients[0], meta)
         for coeff in coefficients[1:]:
             result = super().call_operator(
                 add_op,
@@ -150,7 +146,6 @@ def call_operator(self, op, args, kwargs, meta):
             gt_op,
             lt_op,
             sub_op,
-            full_like_op,
             neg_op,
         ) = get_decomposition(op)

@@ -179,7 +174,7 @@ def call_operator(self, op, args, kwargs, meta):

         # Step 2: Compute the transformed approximation for large values
         # Calculate z = -0.5 * (|x| - 1)
-        tmp_ones = super().call_operator(full_like_op, (x_abs, one), {}, meta, True)
+        tmp_ones = super().call_scalar(one, meta)
         tmp = super().call_operator(sub_op, (x_abs, tmp_ones), {}, meta, True)
         z = super().call_operator(mul_op_scalar, (tmp, neg_half), {}, meta, True)

@@ -201,9 +196,7 @@ def call_operator(self, op, args, kwargs, meta):
         t2 = super().call_operator(mul_op_scalar, (t1, two), {}, meta, True)

         diff = super().call_operator(sub_op_scalar, (t2, pi_over_2), {}, meta, True)
-        tmp_neg_ones = super().call_operator(
-            full_like_op, (diff, neg_one), {}, meta, True
-        )
+        tmp_neg_ones = super().call_scalar(neg_one, meta)
         asin_large = super().call_operator(mul_op, (diff, tmp_neg_ones), {}, meta, True)

         asin_unsigned = self._combine_branches(
@@ -218,9 +211,7 @@ def call_operator(self, op, args, kwargs, meta):

         if op in edge_acos_op:
             # If x <= 0.5: acos(x) = pi/2 - asin(x)
-            const_tensor = super().call_operator(
-                full_like_op, (x, pi_over_2), {}, meta, True
-            )
+            const_tensor = super().call_scalar(pi_over_2, meta)
             acos_small = super().call_operator(
                 sub_op, (const_tensor, asin), {}, meta, True
             )
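The large-|x| branch touched here computes `z = -0.5 * (|x| - 1)` and then `pi/2 - 2*asin(sqrt(z))`, which rests on a standard half-angle identity; the acos branch uses `acos(x) = pi/2 - asin(x)`. A quick numeric check of both identities, using `math.asin` where the pass substitutes its polynomial approximation:

```python
import math

# Checks the identity behind the large-|x| branch of the asin/acos pass:
#   z = -0.5 * (|x| - 1) = (1 - |x|) / 2
#   asin(|x|) = pi/2 - 2 * asin(sqrt(z))
# This is standard trigonometry; the pass approximates the inner asin with a
# polynomial rather than calling a libm asin.

def asin_via_transform(x):
    z = -0.5 * (abs(x) - 1.0)                       # the "Step 2" transform
    unsigned = math.pi / 2 - 2.0 * math.asin(math.sqrt(z))
    return math.copysign(unsigned, x)               # restore the sign of x
```

Mapping |x| near 1 down to z near 0 is what makes a low-degree polynomial accurate there, where a direct asin expansion converges slowly.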
