
[ET-VK][q8_ops] Add int8x4_buffer_to_nchw shader and refactor Int8x4Staging#17708

Merged
SS-JIA merged 1 commit into gh/SS-JIA/450/base from gh/SS-JIA/450/head
Feb 25, 2026

Conversation


@SS-JIA SS-JIA commented Feb 25, 2026

Stack from ghstack (oldest at bottom):

Renames Q8taStaging.cpp/h to Int8x4Staging.cpp/h and expands it to cover
the full staging lifecycle for kInt8x4 buffer tensors.

**Rename and split of the old prepack function:**
The old `add_staging_to_int8x4_buffer_node` (which used a static dispatch
node for prepacking TensorRef data into a packed int8x4 buffer) is renamed
to `add_prepack_int8x4_buffer_node` to clarify its role. Two new runtime
staging functions are added alongside it:

- `add_staging_to_int8x4_buffer_node`: reads NCHW data from a staging buffer
  into a kInt8x4 buffer tensor at execute time, using a `DynamicDispatchNode`
  wrapping the existing `nchw_to_int8x4_buffer` shader.
- `add_int8x4_buffer_to_staging_node`: writes packed int8x4 data back from a
  kInt8x4 buffer tensor to a contiguous NCHW staging buffer at execute time,
  using a new `int8x4_buffer_to_nchw` shader.

**New shader (int8x4_buffer_to_nchw.glsl):**
Implements the reverse of `nchw_to_int8x4_buffer`, with one thread per output
int32 in the NCHW staging buffer. Each thread decodes 4 NCHW-ordered
element indices, looks up each element's position in the packed int8x4 buffer
via `tensor4d_idx_to_buf_idx`, extracts the packed byte, and assembles the 4
bytes into a single output int32. Works for any GPUMemoryLayout.
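The per-thread work can be sketched on the CPU as follows. Here `buf_idx_of` stands in for the shader's `tensor4d_idx_to_buf_idx` lookup with a hypothetical flat signature (NCHW element index to element index in the packed buffer); the function name and signature are illustrative only:

```cpp
#include <cstdint>
#include <functional>

// Sketch of one int8x4_buffer_to_nchw "thread". `out_word_idx` selects one
// int32 in the NCHW staging buffer; `buf_idx_of` maps an NCHW element index
// to its element position in the packed int8x4 buffer.
int32_t gather_output_word(
    const int32_t* packed,
    uint32_t out_word_idx,
    const std::function<uint32_t(uint32_t)>& buf_idx_of) {
  uint32_t word = 0;
  for (uint32_t b = 0; b < 4; ++b) {
    const uint32_t nchw_idx = out_word_idx * 4 + b;  // NCHW element index
    const uint32_t buf_idx = buf_idx_of(nchw_idx);   // packed-buffer position
    // Each packed int32 holds 4 int8s: pick the word, then the byte lane.
    const uint32_t src = static_cast<uint32_t>(packed[buf_idx / 4]);
    const uint32_t byte = (src >> (8 * (buf_idx % 4))) & 0xFF;
    word |= byte << (8 * b);  // assemble the output int32
  }
  return static_cast<int32_t>(word);
}
```

Because the layout mapping is confined to `buf_idx_of`, the same gather loop works for any memory layout, which matches the "works for any GPUMemoryLayout" claim above.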

**Staging.cpp dispatch:**
`add_staging_to_tensor_node` and `add_tensor_to_staging_node` now both
dispatch to the int8x4-specific functions when the tensor dtype is kInt8x4.
`prepack_op` is updated to call `add_prepack_int8x4_buffer_node`.
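The shape of that dispatch can be sketched with stand-in types; the enum and function below are NOT the real ExecuTorch Vulkan API, only an illustration of the branching described above:

```cpp
#include <string>

// Hypothetical stand-ins for the graph's dtype and dispatch logic.
enum class Dtype { kFloat, kInt8, kInt8x4 };

std::string staging_to_tensor_dispatch(Dtype dtype) {
  // kInt8x4 buffer tensors take the int8x4-specific staging path;
  // every other dtype keeps the generic staging path.
  if (dtype == Dtype::kInt8x4) {
    return "add_staging_to_int8x4_buffer_node";
  }
  return "generic_nchw_to_tensor_path";
}
```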

**TestQ8taBinary.cpp** is updated to include Int8x4Staging.h and call
`add_prepack_int8x4_buffer_node`.

Differential Revision: [D94364640](https://our.internmc.facebook.com/intern/diff/D94364640/)


pytorch-bot Bot commented Feb 25, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17708

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 1c319ca with merge base 63f9724:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Feb 25, 2026
@github-actions

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@SS-JIA SS-JIA merged commit ee8a6a1 into gh/SS-JIA/450/base Feb 25, 2026
203 of 206 checks passed
@SS-JIA SS-JIA deleted the gh/SS-JIA/450/head branch February 25, 2026 19:12
@SS-JIA SS-JIA temporarily deployed to cherry-pick-bot February 25, 2026 19:12 — with GitHub Actions Inactive
SS-JIA pushed a commit that referenced this pull request Feb 25, 2026