Bug: AK.accumulate!(...; inclusive=false) produces incorrect results for large arrays on both CUDA and Metal
Summary
AcceleratedKernels.accumulate! appears to produce incorrect results when using:
inclusive=false
for large 1D arrays.
The issue reproduces on both CUDA and Metal backends.
The corresponding inclusive scan (inclusive=true) is correct and matches both backend-native scan implementations and a CPU reference exactly.
The issue appears specific to the exclusive scan path.
Package Versions
julia> using Pkg
julia> Pkg.status(["AcceleratedKernels","CUDA","KernelAbstractions"])
Status `~/.julia/environments/v1.12/Project.toml`
[6a4ca0a5] AcceleratedKernels v0.4.3
[052768ef] CUDA v5.11.3
[63c18a36] KernelAbstractions v0.9.41
Julia version:
CUDA Reproducer
using CUDA
using AcceleratedKernels
const AK = AcceleratedKernels
n = 1_000_000 # or a smaller size such as 16384 (2^14)
clen = CuArray(rand(Int32(0):Int32(20), n))
cbgn = similar(clen)
AK.accumulate!(
    +,
    cbgn,
    clen;
    init=Int32(0),
    inclusive=false
)
CUDA.synchronize()
clen_h = Array(clen)
ref = similar(clen_h)
s = Int32(0)
for i in eachindex(clen_h)
    ref[i] = s
    s += clen_h[i]
end
maximum(abs.(ref .- Array(cbgn)))
Observed:
18
Expected:
0
Metal Reproducer
using Metal
using AcceleratedKernels
const AK = AcceleratedKernels
n = 1_000_000
clen = MtlArray(rand(Int32(0):Int32(20), n))
cbgn = similar(clen)
AK.accumulate!(
    +,
    cbgn,
    clen;
    init=Int32(0),
    inclusive=false
)
Metal.synchronize()
clen_h = Array(clen)
ref = similar(clen_h)
s = Int32(0)
for i in eachindex(clen_h)
    ref[i] = s
    s += clen_h[i]
end
maximum(abs.(ref .- Array(cbgn)))
Observed:
14
Expected:
0
Inclusive Scan Appears Correct
Using the same input:
cbgn = similar(clen)
AK.accumulate!(
    +,
    cbgn,
    clen;
    init=Int32(0),
    inclusive=true
)
and comparing against:
s = Int32(0)
for i in eachindex(clen_h)
    s += clen_h[i]
    ref[i] = s
end
gives:
maximum(abs.(ref .- Array(cbgn))) == 0
on both CUDA and Metal.
Thus the issue appears specific to:
inclusive=false
Independent Validation Against Backend-Native Scans
For CUDA:
CUDA.scan!(+, cbgn1, clen; dims=1)
For Metal:
Metal.scan!(+, cbgn1, clen; dims=1)
Both match a CPU reference exactly.
For example, to construct a 1-based exclusive scan:
using KernelAbstractions
@kernel function compute_cend_kernel!(cbgn, clen, ncell)
    cid = @index(Global)
    if cid <= ncell
        cbgn[cid] -= clen[cid] - Int32(1)
    end
end
Running:
CUDA.scan!(+, cbgn1, clen; dims=1)
# OR Metal.scan!(+, cbgn1, clen; dims=1)
kernel_cend = compute_cend_kernel!(backend, 256)
kernel_cend(cbgn1, clen, Int32(n); ndrange=n)
matches the CPU reference exactly:
maximum(abs.(cbgn_ref .- Array(cbgn1))) == 0
on both CUDA and Metal.
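The kernel above turns an inclusive scan into 1-based start offsets through the identity cbgn[i] = inclusive[i] - clen[i] + 1, which is the exclusive scan plus one. A minimal CPU-only sketch of that arithmetic, using a small made-up input rather than the arrays from the reproducer:

```julia
# Illustrative input; these values are an assumption, not data from the report.
clen = Int32[3, 0, 7, 2, 5]

# Inclusive scan on the CPU (Base.accumulate keeps the Int32 element type).
inclusive = accumulate(+, clen)             # [3, 3, 10, 12, 17]

# Same update the kernel applies per element: cbgn[i] -= clen[i] - 1
cbgn = inclusive .- (clen .- Int32(1))      # [1, 4, 4, 11, 13]

# Exclusive scan built by shifting the inclusive scan right by one element.
exclusive = [Int32(0); inclusive[1:end-1]]  # [0, 3, 3, 10, 12]

# The kernel output equals the exclusive scan shifted to 1-based offsets.
@assert cbgn == exclusive .+ Int32(1)
```

Because this path never calls an exclusive scan directly, its agreement with the CPU reference is consistent with the fault lying in the inclusive=false path.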
Additional Observations
For one test case, the first mismatch occurred at index 513, with:
a1[513] = 4942
a2[513] = 4939
and:
which initially suggested a block-boundary issue.
However, examining the error over the next block:
d = a1 .- a2
unique(d[513:1024])
produced values ranging from:
to
5
indicating that the discrepancy is not simply a constant carry offset.
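The inspection above can be packaged as a small helper that reports the first mismatching index and the spread of differences inside the block containing it. This is only a sketch: mismatch_report is a hypothetical name, and the default block size of 512 is an assumption inferred from the first mismatch landing at index 513, not a value taken from the AcceleratedKernels source.

```julia
# Hypothetical diagnostic helper; not part of AcceleratedKernels.
# `block` is an assumed block size used only to group indices for reporting.
function mismatch_report(expected::AbstractVector, actual::AbstractVector; block::Integer=512)
    d = expected .- actual
    first_bad = findfirst(!iszero, d)
    first_bad === nothing && return nothing   # the two results agree everywhere

    # 1-based index range of the block that contains the first mismatch.
    lo = div(first_bad - 1, block) * block + 1
    hi = min(lo + block - 1, length(d))

    return (first_bad = first_bad,
            block_range = lo:hi,
            diff_extrema = extrema(view(d, lo:hi)))
end

# Tiny synthetic example with a block size of 4.
expected = Int32.(1:10)
actual = copy(expected)
actual[5] -= Int32(3)   # difference of +3 at index 5
actual[7] += Int32(2)   # difference of -2 at index 7
report = mismatch_report(expected, actual; block=4)
```

A constant carry error would give equal minimum and maximum differences within a block; a range of values, as reported above, points to something other than a single missing offset.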
Expected Behavior
For:
AK.accumulate!(
    +,
    dst,
    src;
    init=Int32(0),
    inclusive=false
)
the output should satisfy:
dst[i] == sum(src[1:i-1])
for all valid indices.
The observed output does not satisfy this property for large arrays on either CUDA or Metal.
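For reference, the stated property can be checked against a sequential CPU implementation. The sketch below is an illustration only; exclusive_scan_ref is a hypothetical helper name and the input values are made up.

```julia
# Hypothetical reference helper; not part of AcceleratedKernels.
# Sequential exclusive scan: dst[i] holds the fold of src[1:i-1], starting from init.
function exclusive_scan_ref(op, src::AbstractVector{T}, init::T) where {T}
    dst = similar(src)
    acc = init
    for i in eachindex(src)
        dst[i] = acc            # value before src[i] is folded in
        acc = op(acc, src[i])
    end
    return dst
end

src = Int32[3, 0, 7, 2, 5]
dst = exclusive_scan_ref(+, src, Int32(0))   # [0, 3, 3, 10, 12]

# The expected-behavior property: dst[i] == sum(src[1:i-1]), with the empty sum equal to init.
@assert all(dst[i] == sum(src[1:i-1]; init=Int32(0)) for i in eachindex(src))
```

The same comparison applied to Array(dst) from the GPU call is what yields the nonzero maximum difference reported above.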
Conclusion
The issue appears specific to the exclusive scan implementation (inclusive=false) in AcceleratedKernels.
Inclusive scans (inclusive=true) are correct on both CUDA and Metal.
The discrepancy reproduces against:
- A CPU reference implementation
- CUDA native scan
- Metal native scan
Thank you for taking a look.