
[ETVK] WebGPU runtime #18808

Merged
digantdesai merged 9 commits into main from wgpu on Apr 29, 2026
Conversation

@digantdesai (Contributor)

wgpu prototype

Copilot AI review requested due to automatic review settings April 10, 2026 04:18
@digantdesai digantdesai requested a review from SS-JIA as a code owner April 10, 2026 04:18
@pytorch-bot

pytorch-bot Bot commented Apr 10, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18808

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Apr 10, 2026
@github-actions

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Copilot AI review requested due to automatic review settings April 10, 2026 04:34
@digantdesai (Contributor, Author)

@claude review this PR

@claude

claude Bot commented Apr 10, 2026

Claude finished @digantdesai's task in 3m 52s.


Code Review: WebGPU Backend Prototype

  • Read all changed files and understand the architecture
  • Review build system (CMakeLists.txt, scripts)
  • Review runtime code (WebGPUBackend, WebGPUDevice, WebGPUGraph)
  • Review operator implementation (add op, registry)
  • Review tests
  • Review documentation and project structure

Overall this is a well-structured prototype that cleanly integrates a WebGPU runtime backend into ExecuTorch. The architecture of reusing the Vulkan serialization format and partitioner is pragmatic for bootstrapping. Below are my findings.


Critical Issues

1. Backend name collision with Vulkan — WebGPUBackend.cpp:133

The WebGPU backend registers itself as "VulkanBackend":

Backend backend{"VulkanBackend", &cls};

This means the Vulkan and WebGPU backends are mutually exclusive at link time — linking both into the same binary will cause a duplicate registration error. The TODO.md acknowledges this, but the code has no guard. At minimum, consider a static_assert or #ifdef guard that prevents both EXECUTORCH_BUILD_VULKAN and EXECUTORCH_BUILD_WEBGPU from being active simultaneously. Alternatively, register under a distinct name like "WebGPUBackend" and handle the delegate ID mapping in the partitioner/serialization layer.
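A minimal sketch of such a guard, assuming the CMake options are forwarded to the compiler as preprocessor definitions (that wiring is an assumption, not something this PR sets up):

#if defined(EXECUTORCH_BUILD_VULKAN) && defined(EXECUTORCH_BUILD_WEBGPU)
// Both backends would register under the name "VulkanBackend", so refuse
// to build a binary that links both.
#error "EXECUTORCH_BUILD_VULKAN and EXECUTORCH_BUILD_WEBGPU are mutually exclusive"
#endif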


2. std::vector heap allocations in WebGPUBackend::execute() (WebGPUBackend.cpp:100-118)

The execute() method is called on every inference. It creates two std::vectors (inputs, outputs) with heap allocations on every call:

std::vector<std::pair<const void*, size_t>> inputs;
...
std::vector<std::pair<void*, size_t>> outputs;

ExecuTorch's runtime is designed for environments where heap allocation during inference is undesirable. Consider using stack-allocated small buffers (e.g., executorch::runtime::Span or a fixed-size array with a reasonable max), or storing these vectors as members of WebGPUGraph to avoid per-call allocation.
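A sketch of the fixed-size-buffer variant, where kMaxIO is an illustrative name and limit rather than anything from the PR:

// Illustrative: bound the delegate I/O count at compile time so execute()
// performs no heap allocation.
constexpr size_t kMaxIO = 16; // assumed limit
std::pair<const void*, size_t> inputs[kMaxIO];
const size_t num_inputs = graph->input_ids().size();
ET_CHECK_MSG(num_inputs <= kMaxIO, "too many delegate inputs");
for (size_t i = 0; i < num_inputs; i++) {
  const auto& tensor = args[i]->toTensor();
  inputs[i] = {tensor.const_data_ptr(), tensor.nbytes()};
}
// copy_inputs would then accept a pointer + count (or a Span) instead of
// a std::vector.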


3. uniform_buffer leak in BinaryOp.cpp:159

The comment says // uniform_buffer is kept alive by the bind group, but in WebGPU, the bind group holds a reference to the buffer only while the bind group is alive. The uniform_buffer is never explicitly released — it is not tracked in any cleanup list. The WebGPUGraph destructor releases pipeline and bind_group per dispatch, but after bind_group is released, the uniform buffer's refcount drops to zero only if wgpu-native correctly tracks internal references. For safety, either:

  • Track uniform buffers in a std::vector<WGPUBuffer> on WebGPUGraph and release them in the destructor (sketched below), or
  • Add the buffer to the WebGPUDispatch struct.
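A sketch of the first option; the uniform_buffers_ member and track_uniform_buffer() helper are assumed additions to WebGPUGraph:

// Illustrative: WebGPUGraph owns every uniform buffer it creates.
void WebGPUGraph::track_uniform_buffer(WGPUBuffer buf) {
  uniform_buffers_.push_back(buf); // assumed std::vector<WGPUBuffer> member
}

WebGPUGraph::~WebGPUGraph() {
  for (WGPUBuffer buf : uniform_buffers_) {
    wgpuBufferRelease(buf);
  }
  // ... existing per-dispatch release of pipeline and bind_group ...
}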



Moderate Issues

4. WebGPUDelegateHeader is a near-copy of VulkanDelegateHeader

WebGPUDelegateHeader.h/.cpp is almost identical to VulkanDelegateHeader.h/.cpp — same magic bytes (VH00), same field layout, same parsing logic. The byte-decoding helper functions (getUInt64LE, getUInt32LE, getUInt16LE) are duplicated rather than reused from the Vulkan implementation (which exports them in its header). Consider either:

  • Reusing VulkanDelegateHeader directly (since you already depend on vulkan_schema), or
  • Extracting the shared parsing into a common utility.

5. No bounds checking on value ID access — WebGPUGraph.h:72-85

The get_tensor(), get_double(), get_int() methods perform no bounds checking:

WebGPUTensor& get_tensor(int id) {
    return tensors_[id];
}

An out-of-bounds id (from a corrupt or unsupported FlatBuffer) will cause undefined behavior. Consider at minimum a debug-mode assertion, or returning a Result type.
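For example, a bounds-checked variant (a sketch; throwing matches how this PR already signals malformed graphs, but the exact error-handling style is a choice):

WebGPUTensor& get_tensor(int id) {
  // Illustrative range check; ids come from the delegate FlatBuffer and
  // cannot be trusted.
  if (id < 0 || static_cast<size_t>(id) >= tensors_.size()) {
    throw std::runtime_error(
        "WebGPU backend: tensor id out of range: " + std::to_string(id));
  }
  return tensors_[id];
}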

6. Exception safety concerns with std::vector usage in runtime code

The runtime uses std::vector extensively (WebGPUGraph.h:128-140) and std::string (operator registry). ExecuTorch core avoids these due to heap allocation constraints in embedded environments. For a prototype this is reasonable, but it should be noted as a known divergence from ExecuTorch conventions.

7. is_available() always returns true (WebGPUBackend.cpp:40-42)

bool WebGPUBackend::is_available() const {
  return true;
}

This should attempt to check if a WebGPU device can actually be created (or at least if the global context exists). Returning true unconditionally means the runtime will attempt to use the backend even on machines without GPU support, leading to a confusing exception during init() instead of a clean fallback.
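A hedged sketch of a stronger check; try_get_global_context() is a hypothetical accessor standing in for whatever WebGPUDevice.h actually exposes:

bool WebGPUBackend::is_available() const {
  // Illustrative: report availability only if a WebGPU context can be
  // obtained. try_get_global_context() is hypothetical.
  try {
    return try_get_global_context() != nullptr;
  } catch (...) {
    return false;
  }
}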

8. wgpuDevicePoll blocking call in copy_outputs (WebGPUGraph.cpp:303)

wgpuDevicePoll(device_, true, nullptr) blocks until all GPU work completes. This is fine for correctness but means every inference call fully synchronizes. This is noted for awareness — future work on pipelining should revisit this.


Minor / Style Issues

9. EXECUTORCH_BUILD_WEBGPU_TEST not declared via define_overridable_option

EXECUTORCH_BUILD_WEBGPU is properly declared in tools/cmake/preset/default.cmake, but EXECUTORCH_BUILD_WEBGPU_TEST (used in backends/webgpu/CMakeLists.txt:91) is not. It works because CMake treats undefined variables as falsy, but for consistency with the codebase pattern it should be declared.

10. Hardcoded libwgpu_native.a (CMakeLists.txt:50)

if(NOT EXISTS "${WGPU_NATIVE_DIR}/lib/libwgpu_native.a")

This assumes a static archive with a Unix naming convention. It won't work on Windows (.lib) and prevents using shared libraries. Consider using find_library() or at least documenting that Windows is not currently supported.

11. Shader source duplication — binary_add.wgsl vs binary_add_wgsl.h

The WGSL shader exists as both a standalone .wgsl file and as a C++ string literal in a header. If one is updated without the other, they'll silently diverge. Consider either generating the header from the .wgsl at build time (like the Vulkan backend's ShaderLibrary.cmake does with SPIR-V), or removing the standalone .wgsl file.

12. conftest.py workaround — test/conftest.py

The LeafSpec monkey-patching workaround for "PyTorch 2.11" is concerning. This should have a version check so it doesn't apply on fixed versions, and ideally a tracking issue linked.

13. Missing __init__.py at test root level

backends/webgpu/test/ops/__init__.py exists (empty) but backends/webgpu/test/__init__.py does not. This may cause import issues depending on how tests are discovered.

14. TODO.md checked into the repo

Generally TODOs are tracked in issues rather than checked-in markdown files, which tend to go stale. Consider moving this content to a GitHub issue or discussion.


What Looks Good

  • Clean separation of concerns: device abstraction, graph execution, operator registry, and backend interface are well-factored.
  • The operator registration pattern (WEBGPU_REGISTER_OPERATORS / WEBGPU_REGISTER_OP) is ergonomic and mirrors Vulkan's approach.
  • Good test coverage for a prototype: Python export tests cover multiple tensor shapes/broadcasting, and the native test validates end-to-end GPU execution.
  • The WebGPUMemoryStats utility is a nice addition for debugging.
  • ShaderLibrary.cmake fix (gating glslc requirement on EXECUTORCH_BUILD_VULKAN) is the right approach to avoid breaking WebGPU-only builds.
  • Well-written README with clear architecture diagram and quick-start guide.

Copilot AI left a comment (Contributor)

Pull request overview

Adds a prototype WebGPU backend to ExecuTorch, enabling delegated GPU execution via wgpu-native while reusing the existing Vulkan delegate serialization format.

Changes:

  • Introduce EXECUTORCH_BUILD_WEBGPU and wire WebGPU backend into the top-level CMake build.
  • Add a new backends/webgpu runtime (backend interface, graph builder/executor, device/context setup) plus a single aten.add.Tensor operator implemented in WGSL.
  • Add initial tests and helper scripts for exporting a model and running a native (wgpu-native) end-to-end validation.

Reviewed changes

Copilot reviewed 24 out of 26 changed files in this pull request and generated 11 comments.

Summary per file:

  • tools/cmake/preset/default.cmake: Adds the EXECUTORCH_BUILD_WEBGPU build option.
  • CMakeLists.txt: Conditionally adds the WebGPU backend subdirectory and backend list entry.
  • backends/webgpu/CMakeLists.txt: Defines the webgpu_backend library, imports wgpu-native, and adds a native test target.
  • backends/webgpu/README.md: Documents the prototype backend, architecture, and quick start.
  • backends/webgpu/TODO.md: Captures prototype limitations and planned roadmap.
  • backends/webgpu/runtime/WebGPUBackend.h: Declares the backend interface implementation.
  • backends/webgpu/runtime/WebGPUBackend.cpp: Implements backend init/execute/destroy and registers the backend.
  • backends/webgpu/runtime/WebGPUDelegateHeader.h: Declares delegate header parsing for VH00/VK00 blobs.
  • backends/webgpu/runtime/WebGPUDelegateHeader.cpp: Implements VH00 header parsing/validation.
  • backends/webgpu/runtime/WebGPUDevice.h: Declares native WebGPU context creation and global default context APIs.
  • backends/webgpu/runtime/WebGPUDevice.cpp: Implements wgpu-native instance/adapter/device acquisition and teardown.
  • backends/webgpu/runtime/WebGPUGraph.h: Declares the graph structure (tensors, dispatches) and execution APIs.
  • backends/webgpu/runtime/WebGPUGraph.cpp: Implements VkGraph (VK00) parsing, buffer creation, dispatch recording, and execution.
  • backends/webgpu/runtime/ops/OperatorRegistry.h: Introduces a simple operator registry and registration macros.
  • backends/webgpu/runtime/ops/OperatorRegistry.cpp: Implements the registry lookup/registration singleton.
  • backends/webgpu/runtime/ops/add/BinaryOp.cpp: Implements aten.add.Tensor via a compute pipeline + uniform params.
  • backends/webgpu/runtime/ops/add/binary_add.wgsl: WGSL shader source for elementwise add with alpha.
  • backends/webgpu/runtime/ops/add/binary_add_wgsl.h: Embeds the WGSL shader as a C++ string constant.
  • backends/webgpu/scripts/setup-wgpu-native.sh: Downloads prebuilt wgpu-native binaries for native testing.
  • backends/webgpu/test/conftest.py: Adds a PyTorch LeafSpec workaround for test runs.
  • backends/webgpu/test/test_build_webgpu.sh: End-to-end script: pytest export, export .pte, build, run native test.
  • backends/webgpu/test/test_webgpu_native.cpp: Native test runner that loads a .pte and checks output correctness.
  • backends/webgpu/test/ops/__init__.py: Marks the ops test directory as a Python package.
  • backends/webgpu/test/ops/add/test_add.py: Python export tests using VulkanPartitioner and a helper to export a .pte.
  • backends/vulkan/cmake/ShaderLibrary.cmake: Adjusts the glslc presence check to be conditional on EXECUTORCH_BUILD_VULKAN.
  • .gitignore: Ignores backends/webgpu/third-party/ downloads.


Comment on lines +15 to +25
# Ensure vulkan_schema is available even when EXECUTORCH_BUILD_VULKAN is OFF.
# The WebGPU backend reuses the Vulkan FlatBuffer serialization format.
if(NOT TARGET vulkan_schema)
# We need the schema generation from the Vulkan backend. Build only the
# schema target by including the Vulkan CMakeLists.txt. The full Vulkan
# backend will only build if EXECUTORCH_BUILD_VULKAN is ON (which gates the
# vulkan_backend target), but vulkan_schema is unconditionally defined.
add_subdirectory(
${CMAKE_CURRENT_SOURCE_DIR}/../vulkan
${CMAKE_CURRENT_BINARY_DIR}/_vulkan_schema
)
Copilot AI Apr 10, 2026

add_subdirectory(../vulkan ...) pulls in the full Vulkan CMakeLists.txt, which currently unconditionally builds shader libraries and the vulkan_backend target (not just vulkan_schema). This makes EXECUTORCH_BUILD_WEBGPU=ON effectively require the full Vulkan toolchain (e.g., glslc) and can also introduce duplicate backend registration. Consider factoring vulkan_schema into a standalone CMake include, or adding a schema-only mode/guards in backends/vulkan/CMakeLists.txt so including it here does not build the full Vulkan backend.

Suggested change
# Ensure vulkan_schema is available even when EXECUTORCH_BUILD_VULKAN is OFF.
# The WebGPU backend reuses the Vulkan FlatBuffer serialization format.
if(NOT TARGET vulkan_schema)
# We need the schema generation from the Vulkan backend. Build only the
# schema target by including the Vulkan CMakeLists.txt. The full Vulkan
# backend will only build if EXECUTORCH_BUILD_VULKAN is ON (which gates the
# vulkan_backend target), but vulkan_schema is unconditionally defined.
add_subdirectory(
${CMAKE_CURRENT_SOURCE_DIR}/../vulkan
${CMAKE_CURRENT_BINARY_DIR}/_vulkan_schema
)
# WebGPU reuses the Vulkan FlatBuffer serialization format and therefore
# requires the vulkan_schema target to be defined before this file is
# processed. Do not pull in ../vulkan here with add_subdirectory(), because
# that imports the full Vulkan backend build and can introduce extra
# toolchain requirements (for example shader compilation tools) as well as
# duplicate backend registration side effects.
if(NOT TARGET vulkan_schema)
message(FATAL_ERROR
"webgpu_backend requires the vulkan_schema target, but it is not "
"available. Provide vulkan_schema before including "
"backends/webgpu/CMakeLists.txt. Do not use add_subdirectory(../vulkan) "
"from here; instead expose vulkan_schema via a schema-only Vulkan CMake "
"include or define vulkan_schema earlier in the build.")


namespace {
auto cls = WebGPUBackend();
Backend backend{"VulkanBackend", &cls};
Copilot AI Apr 10, 2026

This backend registers under the name "VulkanBackend", which collides with the real Vulkan backend (and register_backend() rejects duplicate names). If both are linked (including indirectly via the add_subdirectory(../vulkan ...) in this PR), one registration will fail and the delegate may run on the wrong backend or not be available. Please enforce mutual exclusion at CMake/config time, or register under a distinct backend name and update the delegate ID/export path accordingly.

Suggested change
Backend backend{"VulkanBackend", &cls};
Backend backend{"WebGPUBackend", &cls};

Comment on lines +56 to +75
// Parse header to locate flatbuffer and constant data
Result<WebGPUDelegateHeader> header =
WebGPUDelegateHeader::parse(processed->data());
if (!header.ok()) {
ET_LOG(Error, "WebGPUDelegateHeader may be corrupt");
return header.error();
}

const uint8_t* buffer_start =
reinterpret_cast<const uint8_t*>(processed->data());
const uint8_t* flatbuffer_data = buffer_start + header->flatbuffer_offset;
const uint8_t* constant_data = buffer_start + header->bytes_offset;

// Verify FlatBuffer identifier
if (!vkgraph::VkGraphBufferHasIdentifier(flatbuffer_data)) {
ET_LOG(
Error,
"WebGPU delegate FlatBuffer identifier mismatch (expected VK00)");
return Error::DelegateInvalidCompatibility;
}
Copilot AI Apr 10, 2026

WebGPUDelegateHeader offsets are used to compute flatbuffer_data / constant_data without checking they fall within processed->size(). A malformed/corrupt delegate blob could cause out-of-bounds reads (or FlatBuffers identifier checks on invalid memory). Please validate flatbuffer_offset + flatbuffer_size <= processed->size() and bytes_offset + bytes_size <= processed->size() (with overflow-safe arithmetic) before pointer arithmetic.
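A sketch of that overflow-safe validation, using the header fields named above (bytes_size is assumed to exist alongside bytes_offset):

// Illustrative: validate all regions in 64-bit arithmetic before any
// pointer math on processed->data().
const uint64_t total = processed->size();
const uint64_t fb_end = static_cast<uint64_t>(header->flatbuffer_offset) +
    static_cast<uint64_t>(header->flatbuffer_size);
const uint64_t bytes_end = static_cast<uint64_t>(header->bytes_offset) +
    static_cast<uint64_t>(header->bytes_size);
if (fb_end > total || bytes_end > total) {
  ET_LOG(Error, "WebGPU delegate header offsets out of range");
  return Error::DelegateInvalidCompatibility;
}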

Comment on lines +58 to +66
WGPUBufferDescriptor uniform_desc = {};
uniform_desc.size = sizeof(AddParams);
uniform_desc.usage = WGPUBufferUsage_Uniform | WGPUBufferUsage_CopyDst;
uniform_desc.mappedAtCreation = true;
WGPUBuffer uniform_buffer = wgpuDeviceCreateBuffer(device, &uniform_desc);
void* mapped = wgpuBufferGetMappedRange(uniform_buffer, 0, sizeof(AddParams));
std::memcpy(mapped, &params, sizeof(AddParams));
wgpuBufferUnmap(uniform_buffer);

Copilot AI Apr 10, 2026

The uniform buffer created for params is never released. Even if the bind group retains its own reference, the original uniform_buffer handle still needs wgpuBufferRelease() (typically after wgpuDeviceCreateBindGroup). Otherwise each graph build leaks a WebGPU buffer object.

Comment on lines +12 to +16

#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <string>
Copilot AI Apr 10, 2026

This file uses std::vector, std::min/std::max, and std::exception but does not include the corresponding headers (<vector>, <algorithm>, <exception>). This will fail to compile on standard toolchains that don't indirectly include them.

Suggested change
#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <string>
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <exception>
#include <string>
#include <vector>

Comment on lines +179 to +186
for (unsigned i = 0; i < fb_output_ids->size(); i++) {
int oid = static_cast<int>(fb_output_ids->Get(i));
output_ids_.push_back(oid);

// Create staging buffer for output readback
WGPUBufferDescriptor staging_desc = {};
staging_desc.size = tensors_[oid].nbytes > 0 ? tensors_[oid].nbytes : 4;
staging_desc.usage = WGPUBufferUsage_MapRead | WGPUBufferUsage_CopyDst;
Copilot AI Apr 10, 2026

oid from graph->output_ids() is used to index tensors_[oid] without validating that oid is in-range and refers to a tensor value. A corrupt/malicious FlatBuffer can cause out-of-bounds access or staging buffers sized from the wrong value type. Please validate 0 <= oid < num_vals and value_types_[oid] == Tensor before indexing/creating staging buffers.
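A sketch of that validation, reusing the accessors declared on WebGPUGraph (num_values(), get_value_type()):

// Illustrative: reject out-of-range or non-tensor output ids before
// sizing staging buffers.
if (oid < 0 || oid >= num_values() ||
    get_value_type(oid) != ValueType::Tensor) {
  throw std::runtime_error(
      "WebGPU backend: invalid output tensor id in delegate blob");
}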

Comment on lines +196 to +203
for (unsigned i = 0; i < chain->size(); i++) {
const auto* op_call = chain->Get(i);
std::string op_name = op_call->name()->str();

if (!webgpu_operator_registry().has_op(op_name)) {
throw std::runtime_error(
"WebGPU backend: unsupported op: " + op_name);
}
Copilot AI Apr 10, 2026

op_call->name()->str() assumes op_call and op_call->name() are non-null. FlatBuffers fields are optional; a malformed model could crash here. Please null-check op_call/name() and throw a descriptive error if missing.

if (flatbuffer_size == 0) {
return false;
}
if (bytes_offset < flatbuffer_offset + flatbuffer_size) {
Copilot AI Apr 10, 2026

is_valid() compares bytes_offset < flatbuffer_offset + flatbuffer_size using 32-bit fields; flatbuffer_offset + flatbuffer_size can overflow and incorrectly pass validation. Consider doing all offset/size arithmetic in uint64_t and (ideally) validating against the actual buffer length (e.g., pass the blob size into parse() or validate in the caller).

Suggested change
if (bytes_offset < flatbuffer_offset + flatbuffer_size) {
const uint64_t flatbuffer_end =
static_cast<uint64_t>(flatbuffer_offset) +
static_cast<uint64_t>(flatbuffer_size);
if (static_cast<uint64_t>(bytes_offset) < flatbuffer_end) {

Comment thread backends/webgpu/runtime/WebGPUGraph.cpp Outdated
Comment on lines +121 to +126
tensor.nbytes = numel * vk_datatype_size(vk_tensor->datatype());

// Create GPU buffer
WGPUBufferDescriptor buf_desc = {};
buf_desc.size = tensor.nbytes > 0 ? tensor.nbytes : 4;
buf_desc.usage =
Copilot AI Apr 10, 2026

tensor.nbytes is computed as numel * vk_datatype_size(...), but vk_datatype_size() returns 0 for unknown/unsupported dtypes. The current code then creates a 4-byte buffer and continues, which can mask delegate incompatibilities and lead to incorrect execution later. Since the WebGPU prototype only supports fp32 today, it would be safer to explicitly validate vk_tensor->datatype() == FLOAT32 (or at least vk_datatype_size(...) > 0) and throw an error when unsupported.
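A sketch of that check (vk_datatype_size() and the surrounding names follow the snippet above):

// Illustrative: fail loudly on unsupported dtypes instead of silently
// allocating a 4-byte placeholder buffer.
const size_t elem_size = vk_datatype_size(vk_tensor->datatype());
if (elem_size == 0) {
  throw std::runtime_error(
      "WebGPU backend: unsupported tensor dtype in delegate blob");
}
tensor.nbytes = numel * elem_size;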

Comment thread backends/webgpu/README.md
@@ -0,0 +1,113 @@
# WebGPU Backend

Run ExecuTorch models on the GPU via [WebGPU](https://www.w3.org/TR/webgpu/). The backend compiles delegated subgraphs into WGSL compute shaders executed natively through [wgpu-native](https://github.com/gfx-rs/wgpu-native) (Metal on macOS, Vulkan on Linux/Windows).
Copilot AI Apr 10, 2026

The README states wgpu-native runs on "Vulkan on Linux/Windows", but both the setup script (setup-wgpu-native.sh) and the CMake link logic only handle macOS/Linux (no Windows zip selection, and links dl m pthread in the non-APPLE branch). Either add Windows support or clarify in the README that Windows is not supported yet for this prototype.

Suggested change
Run ExecuTorch models on the GPU via [WebGPU](https://www.w3.org/TR/webgpu/). The backend compiles delegated subgraphs into WGSL compute shaders executed natively through [wgpu-native](https://github.com/gfx-rs/wgpu-native) (Metal on macOS, Vulkan on Linux/Windows).
Run ExecuTorch models on the GPU via [WebGPU](https://www.w3.org/TR/webgpu/). The backend compiles delegated subgraphs into WGSL compute shaders executed natively through [wgpu-native](https://github.com/gfx-rs/wgpu-native) (Metal on macOS, Vulkan on Linux). Windows is not supported yet in this prototype.

@SS-JIA left a comment (Contributor)

Stamp to land now and unblock iteration. As a follow-up I'll investigate how to extend Vulkan's ComputeGraph abstractions for other GPU APIs.

Commit summaries:

  • Parses the VH00/VK00 FlatBuffer envelope from the Vulkan partitioner to extract the serialized graph payload.
  • Operator registry with registration macros, WGSL binary-add shader (plus inline C++ header), and the aten.add.Tensor implementation that creates a compute pipeline and records dispatch.
  • Buffer management, pipeline creation, and compute dispatch. Parses the Vulkan FlatBuffer delegate blob and builds a runnable graph of compute passes.
  • BackendInterface implementation that wires init/execute into ExecuTorch. Registers as "VulkanBackend" to consume .pte files from the Vulkan partitioner directly.
  • CMake integration: backend library target, Vulkan FlatBuffer schema dependency, root build flags, and glslc guard fix.
  • Export tests verify fp32 torch.add models produce a .pte with a VulkanBackend delegate: 2D/3D/4D shapes, broadcasting, self-add, scalar add, and chained adds. Includes TODO with architecture notes and next steps.
  • WebGPUDevice wraps wgpu-native (Metal/Vulkan) behind a uniform C++ interface. Includes a setup script that downloads prebuilt wgpu-native binaries.
  • Wire wgpu-native into the CMake build and integrate WebGPUDevice into the compute graph for native Metal/Vulkan execution.
  • C++ test runner that loads a .pte and runs inference via wgpu-native. End-to-end build script that exports a model, builds the native runtime, and validates output.
Copilot AI review requested due to automatic review settings April 28, 2026 16:02
Copilot AI left a comment (Contributor)

Pull request overview

Copilot reviewed 24 out of 26 changed files in this pull request and generated 9 comments.


void OperatorRegistry::register_op(
const std::string& name,
const OpFunction& fn) {
table_.insert(std::make_pair(name, fn));
Copilot AI Apr 28, 2026

OperatorRegistry::register_op uses unordered_map::insert(), which silently ignores duplicate registrations for the same op name. This can hide accidental double-registration and leave an unexpected implementation active. Consider detecting duplicates and either overwriting intentionally or throwing/logging when an op is already registered.

Suggested change
table_.insert(std::make_pair(name, fn));
const auto [it, inserted] = table_.insert(std::make_pair(name, fn));
if (!inserted) {
throw std::runtime_error(
"WebGPU OperatorRegistry: duplicate operator registration: " + name);
}

Comment on lines +95 to +106
// Request adapter using AllowSpontaneous mode (fires during
// wgpuInstanceProcessEvents or any other API call).
AdapterResult adapter_result;
WGPURequestAdapterCallbackInfo adapter_cb = {};
adapter_cb.mode = WGPUCallbackMode_AllowSpontaneous;
adapter_cb.callback = on_adapter_request;
adapter_cb.userdata1 = &adapter_result;

wgpuInstanceRequestAdapter(ctx.instance, nullptr, adapter_cb);
while (!adapter_result.done) {
wgpuInstanceProcessEvents(ctx.instance);
}
Copilot AI Apr 28, 2026

create_webgpu_context() busy-waits on adapter/device requests with while (!*_result.done) { wgpuInstanceProcessEvents(...) } and has no timeout/failsafe. If the callback never fires (driver/runtime issue), this will hang indefinitely. Please add a bounded timeout and throw a clear error when exceeded.
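A sketch of a bounded wait; the ten-second budget is an arbitrary illustration:

#include <chrono>

// Illustrative: pump events with a deadline instead of spinning forever.
const auto deadline =
    std::chrono::steady_clock::now() + std::chrono::seconds(10);
while (!adapter_result.done) {
  wgpuInstanceProcessEvents(ctx.instance);
  if (std::chrono::steady_clock::now() > deadline) {
    throw std::runtime_error(
        "WebGPU: timed out waiting for adapter request callback");
  }
}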

namespace {
auto cls = WebGPUBackend();
Backend backend{"VulkanBackend", &cls};
static auto success_with_compiler = register_backend(backend);
Copilot AI Apr 28, 2026

WebGPU backend registers itself under the name "VulkanBackend". If EXECUTORCH_BUILD_VULKAN and EXECUTORCH_BUILD_WEBGPU are both enabled, register_backend() will reject the duplicate name and one backend will silently fail to register (the return value is currently ignored). Please enforce mutual exclusivity in CMake (or register under a distinct name and update export path), and/or check the register_backend() return value and surface a clear error.

Suggested change
static auto success_with_compiler = register_backend(backend);
static const bool success_with_compiler = []() {
const Error err = register_backend(backend);
if (err != Error::Ok) {
ET_LOG(
Error,
"Failed to register WebGPU backend under name '%s' (possible duplicate registration). Error code: 0x%x",
"VulkanBackend",
static_cast<unsigned int>(err));
return false;
}
return true;
}();

Comment on lines +97 to +119
const size_t num_inputs = graph->input_ids().size();
const size_t num_outputs = graph->output_ids().size();

// Copy inputs from EValue tensors to GPU buffers
std::vector<std::pair<const void*, size_t>> inputs;
inputs.reserve(num_inputs);
for (size_t i = 0; i < num_inputs; i++) {
const auto& tensor = args[i]->toTensor();
inputs.emplace_back(tensor.const_data_ptr(), tensor.nbytes());
}
graph->copy_inputs(inputs);

// Execute the compute graph
graph->execute();

// Copy outputs from GPU staging buffers to EValue tensor data pointers
std::vector<std::pair<void*, size_t>> outputs;
outputs.reserve(num_outputs);
for (size_t i = 0; i < num_outputs; i++) {
const size_t arg_idx = num_inputs + i;
auto& tensor = args[arg_idx]->toTensor();
outputs.emplace_back(tensor.mutable_data_ptr(), tensor.nbytes());
}
Copilot AI Apr 28, 2026

execute() assumes outputs start at index num_inputs (arg_idx = num_inputs + i). Other backends compute the output offset as args.size() - num_outputs, which is more robust if args includes extra values or input count differs from args layout. Please compute output_offset from args.size(), and validate args.size() >= num_inputs + num_outputs before indexing.
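A sketch of the more defensive layout handling; the macro usage follows ExecuTorch's ET_CHECK_OR_RETURN_ERROR convention, but treat the exact signature as an assumption:

// Illustrative: derive the output offset from the actual arg count and
// validate before indexing.
const size_t num_args = args.size();
ET_CHECK_OR_RETURN_ERROR(
    num_args >= num_inputs + num_outputs,
    InvalidArgument,
    "expected at least %zu args, got %zu",
    num_inputs + num_outputs,
    num_args);
const size_t output_offset = num_args - num_outputs;
for (size_t i = 0; i < num_outputs; i++) {
  auto& tensor = args[output_offset + i]->toTensor();
  outputs.emplace_back(tensor.mutable_data_ptr(), tensor.nbytes());
}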

Comment on lines +197 to +208
std::string op_name = op_call->name()->str();

if (!webgpu_operator_registry().has_op(op_name)) {
throw std::runtime_error("WebGPU backend: unsupported op: " + op_name);
}

const auto* fb_args = op_call->args();
std::vector<int> args;
if (fb_args) {
for (unsigned j = 0; j < fb_args->size(); j++) {
args.push_back(static_cast<int>(fb_args->Get(j)));
}
Copilot AI Apr 28, 2026

operator name is read via op_call->name()->str() without checking op_call->name() for null. FlatBuffers strings can be null in malformed/corrupt inputs, which would crash here. Please validate required fields (name, args) during build and return a compatibility error instead of crashing.

Suggested change
std::string op_name = op_call->name()->str();
if (!webgpu_operator_registry().has_op(op_name)) {
throw std::runtime_error("WebGPU backend: unsupported op: " + op_name);
}
const auto* fb_args = op_call->args();
std::vector<int> args;
if (fb_args) {
for (unsigned j = 0; j < fb_args->size(); j++) {
args.push_back(static_cast<int>(fb_args->Get(j)));
}
if (!op_call) {
throw std::runtime_error(
"WebGPU backend: incompatible graph: operator call is missing");
}
const auto* fb_name = op_call->name();
if (!fb_name) {
throw std::runtime_error(
"WebGPU backend: incompatible graph: operator name is missing");
}
std::string op_name = fb_name->str();
if (!webgpu_operator_registry().has_op(op_name)) {
throw std::runtime_error("WebGPU backend: unsupported op: " + op_name);
}
const auto* fb_args = op_call->args();
if (!fb_args) {
throw std::runtime_error(
"WebGPU backend: incompatible graph: args are missing for op: " +
op_name);
}
std::vector<int> args;
for (unsigned j = 0; j < fb_args->size(); j++) {
args.push_back(static_cast<int>(fb_args->Get(j)));

Comment on lines +48 to +56
echo "Downloading wgpu-native ${WGPU_VERSION} for ${PLATFORM}-${WGPU_ARCH}..."
TMPDIR_DL="$(mktemp -d)"
trap "rm -rf ${TMPDIR_DL}" EXIT

curl -sL "${URL}" -o "${TMPDIR_DL}/${ZIP_NAME}"

mkdir -p "${WGPU_DIR}"
unzip -qo "${TMPDIR_DL}/${ZIP_NAME}" -d "${WGPU_DIR}"

Copilot AI Apr 28, 2026

setup-wgpu-native.sh downloads and unzips a release artifact but doesn't check curl/unzip success (curl is run with -sL, which can still write an HTML error page, and unzip will still exit 0 in some cases if the file isn't a valid zip). Consider adding curl -f (fail on HTTP errors) and validating the expected output file (lib/libwgpu_native.a) exists after unzip to make failures actionable.

Comment on lines +71 to +119
// Access tensors by value ID (used by op implementations).
WebGPUTensor& get_tensor(int id) {
return tensors_[id];
}
const WebGPUTensor& get_tensor(int id) const {
return tensors_[id];
}

// Access scalar values stored during graph build.
double get_double(int id) const {
return doubles_[id];
}
int64_t get_int(int id) const {
return ints_[id];
}

WGPUDevice device() const {
return device_;
}
WGPUQueue queue() const {
return queue_;
}

void add_dispatch(WebGPUDispatch dispatch) {
dispatches_.push_back(dispatch);
}

void add_uniform_buffer_bytes(size_t bytes) {
uniform_buffer_bytes_ += bytes;
}

void set_instance(WGPUInstance instance) {
instance_ = instance;
}
void set_device(WGPUDevice device) {
device_ = device;
}

WebGPUMemoryStats memory_stats() const;

int num_values() const {
return static_cast<int>(value_types_.size());
}

enum class ValueType { Tensor, Int, Double, Bool, Null, String };

ValueType get_value_type(int id) const {
return value_types_[id];
}
Copilot AI Apr 28, 2026

get_tensor()/get_value_type()/get_int()/get_double() index internal vectors with operator[] and do not validate that the value id is in range. Since ids come from the delegate flatbuffer, malformed/corrupt programs could cause out-of-bounds access and a hard crash. Please add bounds checks (e.g., use at() or explicit range checks) and fail build/init gracefully (throw to be caught by init()).

Comment on lines +65 to +75
if(APPLE)
target_link_libraries(
webgpu_backend PRIVATE "-framework Metal" "-framework QuartzCore"
"-framework CoreGraphics" "-framework Foundation"
)
else()
target_link_libraries(webgpu_backend PRIVATE dl m pthread)
endif()

target_compile_options(webgpu_backend PRIVATE -fexceptions)

Copilot AI Apr 28, 2026

This CMakeLists.txt unconditionally adds GCC/Clang-only flags ("-fexceptions") and links POSIX-only libs (dl/m/pthread) under the non-APPLE branch. On WIN32/MSVC this will fail to configure/build if EXECUTORCH_BUILD_WEBGPU is enabled. Please add proper compiler/platform guards (e.g., /EHsc on MSVC, and appropriate Windows system libs) or explicitly disable WebGPU on unsupported platforms with a clear message.

Comment on lines +61 to +67
float max_error = 0.0f;
int check_count = std::min(size, 1024);
for (int i = 0; i < check_count; i++) {
float expected = a_data[i] + b_data[i];
float error = std::abs(out_data[i] - expected);
max_error = std::max(max_error, error);
}
Copilot AI Apr 28, 2026

test_webgpu_native.cpp uses std::min/std::max but doesn't include <algorithm>. This can fail to compile on some standard libraries (notably libc++) because these functions are only guaranteed to be declared via <algorithm>. Please add the missing include to avoid relying on transitive includes.

@digantdesai digantdesai merged commit 173c9e2 into main Apr 29, 2026
170 of 179 checks passed
@digantdesai digantdesai deleted the wgpu branch April 29, 2026 14:49

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.

3 participants