[XPU] Fix RuntimeError in addcmul when tensor2 is a CPU scalar tensor by Copilot · Pull Request #3608 · intel/torch-xpu-ops

Copilot · 2026-05-11T04:01:16Z

torch.addcmul on XPU raised RuntimeError: iter.device(arg).is_xpu() whenever tensor2 was a CPU scalar tensor, because gpu_kernel asserts all operands are on XPU with no fallback for CPU scalars.

Changes

src/ATen/native/xpu/sycl/PointwiseOpsKernels.cpp
- Add AddcmulCpuScalarFunctor and AddcmulComplexCpuScalarFunctor: 2-argument functors that capture the CPU scalar value of tensor2 as a member when the functor is constructed on the host, so tensor2 can be removed from the iterator before dispatch.
- In addcmul_kernel, detect iter.is_cpu_scalar(3), extract the scalar via iter.scalar_value<>, call iter.remove_operand(3), then dispatch the 2-arg functor — matching the pattern used by opmath_gpu_kernel_with_scalars for binary CPU scalar handling.
- Use AT_DISPATCH_ALL_TYPES_AND2 (not _COMPLEX) in the non-complex CPU-scalar branch, since complex types are already handled by the preceding if block.
test/repro/test_addcmul_cpu_scalar.py: Reproducer covering all affected dtypes (float32, float64, complex64, complex128, int8, int16, int32, int64, uint8).

import torch

# Previously raised RuntimeError on XPU
a = torch.rand(3, 3, dtype=torch.float32, device="xpu")
b = torch.rand(3, 3, dtype=torch.float32, device="xpu")
c = torch.tensor(2.0, dtype=torch.float32, device="cpu")  # 0-dim CPU scalar tensor
out = torch.addcmul(a, b, c, value=0.5)  # now computes a + 0.5 * b * c on XPU

Test: test/xpu/test_torch_xpu.py::TestTorchDeviceTypeXPU::test_addcmul_use_cpu_scalar_True_xpu_*

Stonepia · 2026-05-11T06:30:55Z

@copilot fix the CI failure

Copilot · 2026-05-11T06:33:11Z

Fixed in 27ec115. Three lint issues were addressed:

- TESTOWNERS: Added copyright header and # Owner(s): ["module: intel"] to test/repro/test_addcmul_cpu_scalar.py
- TEST_HAS_MAIN: Rewrote the test as a TestCase-based class with if __name__ == "__main__": run_tests()
- CLANGFORMAT: Broke the long gpu_kernel(iter, AddcmulComplexCpuScalarFunctor<scalar_t>(alpha, c_val)) call across two lines in PointwiseOpsKernels.cpp

Stonepia · 2026-05-11T06:53:03Z

@copilot fix lint error

Copilot · 2026-05-11T06:54:35Z

Fixed in 5ccc543. The remaining lint error was a trailing blank line at the end of test/repro/test_addcmul_cpu_scalar.py (NEWLINE + FLAKE8 W391). Removed the extra newline so the file now ends with exactly one \n after run_tests().
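
For reference, the "exactly one trailing newline" rule can be expressed in a few lines. This is only a minimal sketch of the normalization, not the actual lint tooling; the function name normalize_trailing_newline is hypothetical.

```python
def normalize_trailing_newline(text: str) -> str:
    """Return text that ends with exactly one newline character.

    An empty (or newline-only) input is returned as an empty string,
    since a file holding a single blank line would still be flagged.
    """
    stripped = text.rstrip("\n")
    if not stripped:
        return ""
    return stripped + "\n"


# A file body that ends with an extra blank line after run_tests()
before = 'if __name__ == "__main__":\n    run_tests()\n\n'
after = normalize_trailing_newline(before)
print(repr(after))
```

The function is idempotent, so running it on an already clean file leaves the content unchanged.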

liangan1

Generally LGTM. No logic issues, but we need to double-check this against the CUDA implementation. @Stonepia

liangan1 · 2026-05-12T08:20:38Z

@copilot How does CUDA process this CPU input? Is this fix aligned with CUDA?

Copilot · 2026-05-12T08:22:14Z

The XPU fix is aligned with CUDA. Here's how CUDA handles the CPU scalar case (in PointwiseOpsKernel.cu):

- Detects iter.is_cpu_scalar(3) for tensor2
- Extracts the scalar: iter.scalar_value<scalar_t>(3)
- Removes the operand: iter.remove_operand(3)
- Dispatches a 2-argument kernel with the scalar captured (CUDA uses a helper function addcmul_cuda_scalar_tensor2_kernel; XPU captures it in a functor AddcmulCpuScalarFunctor / AddcmulComplexCpuScalarFunctor)
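
The four steps above can be illustrated with a toy model in plain Python. This is a hedged sketch, not the real TensorIterator or kernel code: the Operand class, the gpu_kernel stand-in, and the list-based operand layout are all assumptions made for illustration, and only the control flow mirrors the description.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Operand:
    """Toy stand-in for one iterator operand (hypothetical, not ATen)."""
    values: List[float]
    device: str  # "xpu" or "cpu"

    def is_cpu_scalar(self) -> bool:
        return self.device == "cpu" and len(self.values) == 1


def gpu_kernel(operands: List[Operand], fn: Callable[..., float]) -> None:
    """Toy elementwise launcher that rejects any non-device operand."""
    for op in operands:
        if op.device != "xpu":
            raise RuntimeError("iter.device(arg).is_xpu()")
    out, *inputs = operands
    out.values[:] = [fn(*args) for args in zip(*(op.values for op in inputs))]


def addcmul_kernel(operands: List[Operand], alpha: float) -> None:
    # Operand layout assumed here: [out, self, tensor1, tensor2].
    if operands[3].is_cpu_scalar():
        c = operands[3].values[0]   # step 2: extract the scalar value
        operands = operands[:3]     # step 3: drop tensor2 from the iterator
        # step 4: 2-argument kernel with the scalar captured by the closure
        gpu_kernel(operands, lambda a, b: a + alpha * (b * c))
    else:
        gpu_kernel(operands, lambda a, b, c: a + alpha * (b * c))


out = Operand([0.0, 0.0], "xpu")
a = Operand([1.0, 2.0], "xpu")
b = Operand([3.0, 4.0], "xpu")
c = Operand([2.0], "cpu")  # CPU scalar
addcmul_kernel([out, a, b, c], alpha=0.5)
print(out.values)  # [4.0, 6.0]
```

In this model, skipping the is_cpu_scalar branch reproduces the original failure mode: the CPU operand reaches the launcher and the device check raises.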

Both implementations:

- Use AT_DISPATCH_COMPLEX_TYPES for complex dtypes, AT_DISPATCH_ALL_TYPES_AND2(kHalf, kBFloat16, ...) for non-complex
- Apply the same math: a + alpha * (b * c) where c is the captured CPU scalar
- Use an accumulator type (at::acc_type<scalar_t, true> in CUDA, at::acc_type_device<scalar_t, kXPU> in XPU) for fp16/bf16 precision

The XPU fix matches the CUDA pattern exactly — the only structural difference is that CUDA extracts the logic into a named helper function while XPU uses captured-value functors, which is consistent with the rest of the XPU kernel style.
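
To show why the accumulator type matters for fp16, here is a small illustrative sketch in plain Python. It is not the kernel code: Python floats (double precision) stand in for the wider accumulator type, the struct "e" format emulates rounding to IEEE half precision, and the function names and input values are chosen purely for demonstration.

```python
import struct


def to_half(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def addcmul_half_intermediates(a: float, b: float, c: float, alpha: float) -> float:
    """Round to fp16 after every operation (no wider accumulator)."""
    prod = to_half(b * c)
    scaled = to_half(alpha * prod)
    return to_half(a + scaled)


def addcmul_wide_accumulator(a: float, b: float, c: float, alpha: float) -> float:
    """Compute a + alpha * (b * c) in a wider type, then round to fp16 once."""
    return to_half(a + alpha * (b * c))


eps = 2.0 ** -10  # spacing of fp16 values just above 1.0
a, b, c, alpha = -1026.0, 1.0 + eps, 1.0 + eps, 1024.0  # all exactly representable in fp16

naive = addcmul_half_intermediates(a, b, c, alpha)
wide = addcmul_wide_accumulator(a, b, c, alpha)
print(naive, wide)  # 0.0 0.0009765625
```

With these inputs the product b * c carries a 2^-20 term that fp16 cannot hold, so rounding the intermediate loses the entire result, while the wide path returns the exact value 2^-10.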

liangan1

LGTM

liangan1 · 2026-05-21T03:26:53Z

@copilot Are the CI failures related to the changes in this PR?

Copilot · 2026-05-21T03:28:17Z

The CI failures are not related to this PR's changes. The 21 new failures fall into these categories:

- AOT autograd / invoke subgraph (TestAOTJointWithDescriptors, TestInvokeSubgraphCompile): tests of autograd partitioning and subgraph annotation features
- Autograd transforms JVP/VJP (TestAutogradFunctionCPU/XPU in test_eager_transforms_xpu): tests of the save_for_jvp and save_for_vjp paths
- Custom ops autograd (TestCustomOpAPI::test_no_grad_skips_autograd)
- Flash attention meta test (TestMetaXPU::test_meta_outplace_torch_ops_aten__flash_attention_forward_xpu_float16)

None of these touch pointwise ops or the addcmul kernel. Our changes are limited to src/ATen/native/xpu/sycl/PointwiseOpsKernels.cpp (addcmul CPU scalar handling) and test/repro/test_addcmul_cpu_scalar.py. The flash attention meta failure likely comes from commit bb49876 (Enable Op _flash_attention_forward/_flash_attention_backward #3404) which was merged into this branch, not from our addcmul fix.

chuanqi129 · 2026-05-21T10:22:04Z

@copilot New UT failures detected in op_ut tests (6 total). Please check these new failures and analyze whether they are caused by the PR changes.

Important: Do NOT update the plan in the PR description directly. Use reply comments to update the status.

Job log: https://github.com/intel/torch-xpu-ops/actions/runs/26208134635
Full new failure report: https://github.com/intel/torch-xpu-ops/actions/runs/26208134635/artifacts/7133217738

New failures:

Category	Class name	Test name	Status	Message
op_ut	third_party.torch-xpu-ops.test.xpu.dynamo.test_functions_xpu.DefaultsTests	test_zip_strict	failed	torch._dynamo.exc.Unsupported: Observed exception
op_ut	third_party.torch-xpu-ops.test.xpu.functorch.test_control_flow_xpu.TestControlFlowTraced	test_while_loop_autograd_simple	failed	AssertionError: 'clas[152 chars] zeros: "i64[]" = torch.ops.aten.zeros.def[3514 chars]1)\n' != 'clas[152 chars] clone: "f32[3, 3]" = torch.ops.aten.clone[3636 chars]1)\n'
op_ut	third_party.torch-xpu-ops.test.xpu.quantization.core.test_quantized_tensor_xpu.TestQuantizedTensorXPU	test_qtensor_sub_byte_aligned_cols_xpu	failed	RuntimeError: setStorage: sizes [1, 4], strides [4, 1], storage offset 0, and itemsize 1 requiring a storage size of 4 are out of bounds for storage of size 2
op_ut	third_party.torch-xpu-ops.test.xpu.quantization.core.test_quantized_tensor_xpu.TestQuantizedTensorXPU	test_qtensor_sub_byte_not_aligned_cols_xpu	failed	RuntimeError: setStorage: sizes [1, 5], strides [5, 1], storage offset 0, and itemsize 1 requiring a storage size of 5 are out of bounds for storage of size 3
op_ut	third_party.torch-xpu-ops.test.xpu.quantization.core.test_quantized_tensor_xpu.TestQuantizedTensorXPU	test_qtensor_per_channel_load_save_xpu	failed	RuntimeError: setStorage: sizes [20, 10], strides [10, 1], storage offset 0, and itemsize 1 requiring a storage size of 200 are out of bounds for storage of size 100

... and 1 more failures. See full list in artifact.

Instructions:

- Download the full new failure report artifact for complete error messages
- Analyze whether the failures are caused by PR changes
- If caused by PR changes, fix those failures on this PR branch directly
- Add failing test cases with pytest commands in the PR description for the reproducer job to use
- Use reply comments to report analysis status; do NOT modify the PR description plan directly

Copilot · 2026-05-21T10:26:49Z

None of the 6 new failures are caused by this PR. Our changes are limited to two files:

- src/ATen/native/xpu/sycl/PointwiseOpsKernels.cpp: adds CPU scalar handling in addcmul_kernel
- test/repro/test_addcmul_cpu_scalar.py: reproducer test

The failing tests touch entirely different areas:

- test_zip_strict: dynamo tracing of Python zip(..., strict=True), unrelated to pointwise ops
- test_while_loop_autograd_simple: control flow autograd tracing, unrelated
- test_qtensor_sub_byte_* / test_qtensor_per_channel_load_save_xpu: quantized tensor sub-byte storage layout (setStorage bounds check), unrelated to addcmul or any code we touched

These failures are pre-existing issues or regressions from other commits merged into the branch (e.g. bb49876 flash attention, 9c719e9 TensorCompare kernels, etc.) and are not caused by this PR.

Copilot AI assigned Copilot and laifenxiawucha May 11, 2026

Copilot started work on behalf of laifenxiawucha May 11, 2026 04:29

Copilot AI linked an issue May 11, 2026 that may be closed by this pull request

[UT] "RuntimeError: iter.device(arg).is_xpu()" in test_torch_xpu.py #2560

Closed

7 tasks

Copilot AI changed the title ~~[WIP] Fix RuntimeError in XPU kernel for torch.addcmul with CPU scalar~~ [XPU] Fix RuntimeError in addcmul when tensor2 is a CPU scalar tensor May 11, 2026

Copilot AI requested a review from laifenxiawucha May 11, 2026 04:37

Copilot finished work on behalf of laifenxiawucha May 11, 2026 04:37

Copilot started work on behalf of Stonepia May 11, 2026 06:31

Copilot finished work on behalf of Stonepia May 11, 2026 06:33

Copilot AI requested a review from Stonepia May 11, 2026 06:33

Copilot started work on behalf of Stonepia May 11, 2026 06:53

Copilot finished work on behalf of Stonepia May 11, 2026 06:55

Stonepia added the ai_generated label May 12, 2026

liangan1 reviewed May 12, 2026

Copilot started work on behalf of liangan1 May 12, 2026 08:20

Copilot finished work on behalf of liangan1 May 12, 2026 08:22

Copilot AI requested a review from liangan1 May 12, 2026 08:22

liangan1 approved these changes May 12, 2026

Copilot started work on behalf of liangan1 May 21, 2026 03:27

liangan1 marked this pull request as ready for review May 21, 2026 03:27

github-actions Bot added the disable_e2e (Disable all e2e test jobs for the PR) and disable_distributed (Disable distributed UT test jobs for the PR) labels May 21, 2026

chuanqi129 marked this pull request as draft May 21, 2026 03:27

chuanqi129 marked this pull request as ready for review May 21, 2026 03:27

Copilot finished work on behalf of liangan1 May 21, 2026 03:28

Copilot AI requested a review from liangan1 May 21, 2026 03:28

liangan1 approved these changes May 21, 2026

Copilot AI and others added 4 commits May 21, 2026 13:51

Add CPU scalar support for tensor2 in addcmul_kernel

20eee9a

Agent-Logs-Url: https://github.com/intel/torch-xpu-ops/sessions/2eb174d7-463e-49b6-80b8-5c3ef0029f6d Co-authored-by: laifenxiawucha <184592536+laifenxiawucha@users.noreply.github.com>

Address code review feedback: fix dispatch macro and test assertions

9862ce9

Agent-Logs-Url: https://github.com/intel/torch-xpu-ops/sessions/2eb174d7-463e-49b6-80b8-5c3ef0029f6d Co-authored-by: laifenxiawucha <184592536+laifenxiawucha@users.noreply.github.com>

Fix CI lint failures: add ownership header, run_tests block, and clang-format fix

91c3396

Agent-Logs-Url: https://github.com/intel/torch-xpu-ops/sessions/c1669bef-ca65-42e3-bfa5-3a3de8924609 Co-authored-by: Stonepia <12094956+Stonepia@users.noreply.github.com>

Fix trailing newline in test_addcmul_cpu_scalar.py (NEWLINE/FLAKE8 W391)

81cafd9

Agent-Logs-Url: https://github.com/intel/torch-xpu-ops/sessions/c0ea05f8-8757-47a9-a0cb-d08cfd8a6b0b Co-authored-by: Stonepia <12094956+Stonepia@users.noreply.github.com>

chuanqi129 force-pushed the copilot/fix-runtimeerror-xpu-kernel branch from ac40dfd to 81cafd9 Compare May 21, 2026 05:51

Copilot started work on behalf of chuanqi129 May 21, 2026 10:22

Copilot finished work on behalf of chuanqi129 May 21, 2026 10:27

Copilot AI requested a review from chuanqi129 May 21, 2026 10:27

chuanqi129 merged commit db39431 into main May 21, 2026
23 of 24 checks passed

chuanqi129 deleted the copilot/fix-runtimeerror-xpu-kernel branch May 21, 2026 23:48

liangan1 added the agent:merged label (This PR is auto generated by copilot agent and merged after engineer reviewed.) May 22, 2026
