[XPU] Fix RuntimeError in addcmul when tensor2 is a CPU scalar tensor #3608

Merged
chuanqi129 merged 4 commits into main from copilot/fix-runtimeerror-xpu-kernel
May 21, 2026

Contributor

Copilot AI commented May 11, 2026

torch.addcmul on XPU raised RuntimeError: iter.device(arg).is_xpu() whenever tensor2 was a CPU scalar tensor, because gpu_kernel asserts all operands are on XPU with no fallback for CPU scalars.

Changes

  • src/ATen/native/xpu/sycl/PointwiseOpsKernels.cpp

    • Add AddcmulCpuScalarFunctor and AddcmulComplexCpuScalarFunctor: 2-argument functors that store the CPU scalar value of tensor2 as a member set at construction (a runtime value, not a compile-time constant), so tensor2 can be removed from the iterator before dispatch.
    • In addcmul_kernel, detect iter.is_cpu_scalar(3), extract the scalar via iter.scalar_value<>, call iter.remove_operand(3), then dispatch the 2-arg functor — matching the pattern used by opmath_gpu_kernel_with_scalars for binary CPU scalar handling.
    • Use AT_DISPATCH_ALL_TYPES_AND2 (not _COMPLEX) in the non-complex CPU-scalar branch, since complex types are already handled by the preceding if block.
  • test/repro/test_addcmul_cpu_scalar.py: Reproducer covering all affected dtypes (float32, float64, complex64, complex128, int8, int16, int32, int64, uint8).
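
The functor change can be modeled outside ATen. The following is a minimal Python sketch, not the SYCL code from this PR: the class names `ThreeArgAddcmul` and `CpuScalarAddcmul` are hypothetical stand-ins, and plain Python floats stand in for tensor elements. It only illustrates that storing the value of tensor2 in the functor gives a two-argument callable with the same result as the three-argument form.

```python
class ThreeArgAddcmul:
    """Hypothetical stand-in: all three inputs are read from the iterator."""

    def __init__(self, alpha):
        self.alpha = alpha

    def __call__(self, a, b, c):
        return a + self.alpha * (b * c)


class CpuScalarAddcmul:
    """Hypothetical stand-in: c is stored once when the functor is built."""

    def __init__(self, alpha, c):
        self.alpha = alpha
        self.c = c  # runtime value read from the CPU scalar tensor

    def __call__(self, a, b):
        return a + self.alpha * (b * self.c)


three_arg = ThreeArgAddcmul(alpha=0.5)
two_arg = CpuScalarAddcmul(alpha=0.5, c=2.0)

a_vals = [1.0, 2.0, 3.0]
b_vals = [4.0, 5.0, 6.0]

# Same math either way; only the source of c differs.
full = [three_arg(a, b, 2.0) for a, b in zip(a_vals, b_vals)]
captured = [two_arg(a, b) for a, b in zip(a_vals, b_vals)]
```

Because the stored value travels with the functor, the element-wise loop only ever touches operands that live on the device.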

import torch

# Previously raised RuntimeError on XPU
a = torch.rand(3, 3, dtype=torch.float32, device="xpu")
b = torch.rand(3, 3, dtype=torch.float32, device="xpu")
c = torch.tensor(2.0, dtype=torch.float32, device="cpu")  # CPU scalar
out = torch.addcmul(a, b, c, value=0.5)  # now works correctly

Test: test/xpu/test_torch_xpu.py::TestTorchDeviceTypeXPU::test_addcmul_use_cpu_scalar_True_xpu_*

Copilot AI linked an issue May 11, 2026 that may be closed by this pull request
7 tasks
Copilot AI changed the title from "[WIP] Fix RuntimeError in XPU kernel for torch.addcmul with CPU scalar" to "[XPU] Fix RuntimeError in addcmul when tensor2 is a CPU scalar tensor" May 11, 2026
Copilot AI requested a review from laifenxiawucha May 11, 2026 04:37
@Stonepia
Contributor

@copilot fix the CI failure

Contributor Author

Copilot AI commented May 11, 2026

@copilot fix the CI failure

Fixed in 27ec115. Three lint issues were addressed:

  1. TESTOWNERS: Added copyright header and # Owner(s): ["module: intel"] to test/repro/test_addcmul_cpu_scalar.py
  2. TEST_HAS_MAIN: Rewrote the test as a TestCase-based class with if __name__ == "__main__": run_tests()
  3. CLANGFORMAT: Broke the long gpu_kernel(iter, AddcmulComplexCpuScalarFunctor<scalar_t>(alpha, c_val)) call across two lines in PointwiseOpsKernels.cpp

Copilot AI requested a review from Stonepia May 11, 2026 06:33
@Stonepia
Contributor

@copilot fix lint error

Contributor Author

Copilot AI commented May 11, 2026

@copilot fix lint error

Fixed in 5ccc543. The remaining lint error was a trailing blank line at the end of test/repro/test_addcmul_cpu_scalar.py (NEWLINE + FLAKE8 W391). Removed the extra newline so the file now ends with exactly one \n after run_tests().
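
The end-of-file rule can be checked with a few lines of Python. This is a minimal sketch, assuming the rule is simply "exactly one trailing newline"; `trailing_newline_count` is a hypothetical helper and is not how the NEWLINE or FLAKE8 linters are implemented.

```python
def trailing_newline_count(text):
    """Return how many newline characters end the text."""
    return len(text) - len(text.rstrip("\n"))


# File tail as described above: one newline after run_tests().
good = 'if __name__ == "__main__":\n    run_tests()\n'
# Same tail with an extra blank line, the shape that W391 flags.
bad = good + "\n"

good_count = trailing_newline_count(good)
bad_count = trailing_newline_count(bad)
```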

Contributor

@liangan1 liangan1 left a comment

Generally LGTM. No logic issues, but we need to double-check this against the CUDA implementation. @Stonepia

@liangan1
Contributor

@copilot How does CUDA process this CPU input? Is this fix aligned with CUDA?

Contributor Author

Copilot AI commented May 12, 2026

@copilot How does CUDA process this CPU input? Is this fix aligned with CUDA?

The XPU fix is aligned with CUDA. Here's how CUDA handles the CPU scalar case (in PointwiseOpsKernel.cu):

  1. Detects iter.is_cpu_scalar(3) for tensor2
  2. Extracts the scalar: iter.scalar_value<scalar_t>(3)
  3. Removes the operand: iter.remove_operand(3)
  4. Dispatches a 2-argument kernel with the scalar captured (CUDA uses a helper function addcmul_cuda_scalar_tensor2_kernel; XPU captures it in a functor AddcmulCpuScalarFunctor / AddcmulComplexCpuScalarFunctor)
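
The four steps can be sketched with a toy iterator. This is a minimal Python model under stated assumptions: `ToyIter` and `addcmul_dispatch` are hypothetical, only the method names `is_cpu_scalar`, `scalar_value`, and `remove_operand` mirror the calls listed above, and Python lists stand in for device tensors. It also assumes operand 0 is the output and operands 1 to 3 are self, tensor1, and tensor2.

```python
class ToyIter:
    """Hypothetical stand-in for a tensor iterator; operands are (device, data) pairs."""

    def __init__(self, operands):
        self.operands = list(operands)

    def is_cpu_scalar(self, idx):
        device, data = self.operands[idx]
        return device == "cpu" and len(data) == 1

    def scalar_value(self, idx):
        return self.operands[idx][1][0]

    def remove_operand(self, idx):
        del self.operands[idx]


def addcmul_dispatch(it, alpha):
    if it.is_cpu_scalar(3):        # step 1: detect the CPU scalar
        c = it.scalar_value(3)     # step 2: read its value on the host
        it.remove_operand(3)       # step 3: drop it from the iterator

        def kernel(a, b):          # step 4: 2-argument kernel with c captured
            return a + alpha * (b * c)
    else:
        def kernel(a, b, c):
            return a + alpha * (b * c)

    # Model of the device check that produced the original error.
    inputs = it.operands[1:]
    if any(device != "xpu" for device, _ in inputs):
        raise RuntimeError("iter.device(arg).is_xpu()")

    out = [kernel(*vals) for vals in zip(*(data for _, data in inputs))]
    it.operands[0][1][:] = out     # write results into the output buffer
    return out


out_buf = [0.0, 0.0]
it = ToyIter([
    ("xpu", out_buf),
    ("xpu", [1.0, 2.0]),
    ("xpu", [3.0, 4.0]),
    ("cpu", [2.0]),
])
result = addcmul_dispatch(it, alpha=0.5)
```

In this model, skipping steps 1 to 3 would leave the CPU operand in `inputs` and trigger the `RuntimeError`, which is the failure mode the PR description reports.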

Both implementations:

  • Use AT_DISPATCH_COMPLEX_TYPES for complex dtypes, AT_DISPATCH_ALL_TYPES_AND2(kHalf, kBFloat16, ...) for non-complex
  • Apply the same math: a + alpha * (b * c) where c is the captured CPU scalar
  • Use an accumulator type (at::acc_type<scalar_t, true> in CUDA, at::acc_type_device<scalar_t, kXPU> in XPU) for fp16/bf16 precision
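
Why the accumulator type matters for half precision can be shown with the standard library alone. This is a minimal sketch, not the kernel code: `to_half` rounds through the IEEE 754 binary16 format via `struct`, a Python float plays the role of the wider accumulator, and the input values are chosen by hand so that rounding every intermediate loses the result.

```python
import struct


def to_half(x):
    """Round a Python float to the nearest IEEE 754 binary16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def addcmul_half_intermediates(a, b, c, alpha):
    # Every intermediate is rounded back to half precision.
    prod = to_half(b * c)
    scaled = to_half(alpha * prod)
    return to_half(a + scaled)


def addcmul_wide_accumulator(a, b, c, alpha):
    # Compute in the wider type and round once at the end.
    return to_half(a + alpha * (b * c))


# All four inputs are exactly representable in binary16.
a, b, c, alpha = -1026.0, 1.0 + 2.0**-10, 1.0 + 2.0**-10, 1024.0

narrow = addcmul_half_intermediates(a, b, c, alpha)
wide = addcmul_wide_accumulator(a, b, c, alpha)
```

Here the product `b * c` carries a `2**-20` term that binary16 cannot hold, so the narrow path cancels to zero while the wide path keeps the `2**-10` remainder.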

The XPU fix matches the CUDA pattern exactly — the only structural difference is that CUDA extracts the logic into a named helper function while XPU uses captured-value functors, which is consistent with the rest of the XPU kernel style.

Copilot AI requested a review from liangan1 May 12, 2026 08:22
Contributor

@liangan1 liangan1 left a comment

LGTM

@github-actions

This comment was marked as duplicate.

@liangan1
Contributor

@copilot Are the CI failures related to the changes in this PR?

@liangan1 liangan1 marked this pull request as ready for review May 21, 2026 03:27
@github-actions github-actions Bot added the disable_e2e (Disable all e2e test jobs for the PR) and disable_distributed (Disable distributed UT test jobs for the PR) labels May 21, 2026
@chuanqi129 chuanqi129 marked this pull request as draft May 21, 2026 03:27
@chuanqi129 chuanqi129 marked this pull request as ready for review May 21, 2026 03:27
Contributor Author

Copilot AI commented May 21, 2026

@copilot Are the CI failures related to the changes in this PR?

The CI failures are not related to this PR's changes. The 21 new failures fall into these categories:

  1. AOT autograd / invoke subgraph (TestAOTJointWithDescriptors, TestInvokeSubgraphCompile) — testing autograd partitioning and subgraph annotation features
  2. Autograd transforms JVP/VJP (TestAutogradFunctionCPU/XPU in test_eager_transforms_xpu) — testing save_for_jvp and save_for_vjp paths
  3. Custom ops autograd (TestCustomOpAPI::test_no_grad_skips_autograd)
  4. Flash attention meta test (TestMetaXPU::test_meta_outplace_torch_ops_aten__flash_attention_forward_xpu_float16)

None of these touch pointwise ops or the addcmul kernel. Our changes are limited to src/ATen/native/xpu/sycl/PointwiseOpsKernels.cpp (addcmul CPU scalar handling) and test/repro/test_addcmul_cpu_scalar.py. The flash attention meta failure likely comes from commit bb49876 (Enable Op _flash_attention_forward/_flash_attention_backward #3404) which was merged into this branch, not from our addcmul fix.

Copilot AI requested a review from liangan1 May 21, 2026 03:28
@chuanqi129 chuanqi129 force-pushed the copilot/fix-runtimeerror-xpu-kernel branch from ac40dfd to 81cafd9 Compare May 21, 2026 05:51
@chuanqi129
Contributor

@copilot New UT failures detected in op_ut tests (6 total). Please check these new failures and analyze whether they are caused by the PR changes.

Important: Do NOT update the plan in the PR description directly. Use reply comments to update the status.

Job log: https://github.com/intel/torch-xpu-ops/actions/runs/26208134635
Full new failure report: https://github.com/intel/torch-xpu-ops/actions/runs/26208134635/artifacts/7133217738

New failures:

| Category | Class name | Test name | Status | Message |
| --- | --- | --- | --- | --- |
| op_ut | third_party.torch-xpu-ops.test.xpu.dynamo.test_functions_xpu.DefaultsTests | test_zip_strict | failed | torch._dynamo.exc.Unsupported: Observed exception |
| op_ut | third_party.torch-xpu-ops.test.xpu.functorch.test_control_flow_xpu.TestControlFlowTraced | test_while_loop_autograd_simple | failed | AssertionError: 'clas[152 chars] zeros: "i64[]" = torch.ops.aten.zeros.def[3514 chars]1)\n' != 'clas[152 chars] clone: "f32[3, 3]" = torch.ops.aten.clone[3636 chars]1)\n' |
| op_ut | third_party.torch-xpu-ops.test.xpu.quantization.core.test_quantized_tensor_xpu.TestQuantizedTensorXPU | test_qtensor_sub_byte_aligned_cols_xpu | failed | RuntimeError: setStorage: sizes [1, 4], strides [4, 1], storage offset 0, and itemsize 1 requiring a storage size of 4 are out of bounds for storage of size 2 |
| op_ut | third_party.torch-xpu-ops.test.xpu.quantization.core.test_quantized_tensor_xpu.TestQuantizedTensorXPU | test_qtensor_sub_byte_not_aligned_cols_xpu | failed | RuntimeError: setStorage: sizes [1, 5], strides [5, 1], storage offset 0, and itemsize 1 requiring a storage size of 5 are out of bounds for storage of size 3 |
| op_ut | third_party.torch-xpu-ops.test.xpu.quantization.core.test_quantized_tensor_xpu.TestQuantizedTensorXPU | test_qtensor_per_channel_load_save_xpu | failed | RuntimeError: setStorage: sizes [20, 10], strides [10, 1], storage offset 0, and itemsize 1 requiring a storage size of 200 are out of bounds for storage of size 100 |

... and 1 more failures. See full list in artifact.

Instructions:

  1. Download the full new failure report artifact for complete error messages
  2. Analyze whether the failures are caused by PR changes
  3. If caused by PR changes, fix those failures on this PR branch directly
  4. Add failing test cases with pytest commands in the PR description for the reproducer job to use
  5. Use reply comments to report analysis status - do NOT modify the PR description plan directly

Contributor Author

Copilot AI commented May 21, 2026

@copilot New UT failures detected in op_ut tests (6 total). Please check these new failures and analyze whether they are caused by the PR changes.

**Important: Do NOT update the plan in the PR de...

None of the 6 new failures are caused by this PR. Our changes are limited to two files:

  • src/ATen/native/xpu/sycl/PointwiseOpsKernels.cpp — adds CPU scalar handling in addcmul_kernel
  • test/repro/test_addcmul_cpu_scalar.py — reproducer test

The failing tests touch entirely different areas:

  • test_zip_strict — dynamo tracing of Python zip(..., strict=True), unrelated to pointwise ops
  • test_while_loop_autograd_simple — control flow autograd tracing, unrelated
  • test_qtensor_sub_byte_* / test_qtensor_per_channel_load_save_xpu — quantized tensor sub-byte storage layout (setStorage bounds check), unrelated to addcmul or any code we touched

These failures are pre-existing issues or regressions from other commits merged into the branch (e.g. bb49876 flash attention, 9c719e9 TensorCompare kernels, etc.) and are not caused by this PR.

Copilot AI requested a review from chuanqi129 May 21, 2026 10:27
@chuanqi129 chuanqi129 merged commit db39431 into main May 21, 2026
23 of 24 checks passed
@chuanqi129 chuanqi129 deleted the copilot/fix-runtimeerror-xpu-kernel branch May 21, 2026 23:48
@liangan1 liangan1 added the agent:merged (This PR is auto generated by copilot agent and merged after engineer reviewed.) label May 22, 2026
Labels

  • agent:merged (This PR is auto generated by copilot agent and merged after engineer reviewed.)
  • ai_generated
  • disable_distributed (Disable distributed UT test jobs for the PR)
  • disable_e2e (Disable all e2e test jobs for the PR)

Development

Successfully merging this pull request may close these issues.

[UT] "RuntimeError: iter.device(arg).is_xpu()" in test_torch_xpu.py

5 participants