
Bug: AK.accumulate!(...; inclusive=false) produces incorrect results for large arrays on both CUDA and Metal #84

Description

@pankajpopli


Summary

AcceleratedKernels.accumulate! appears to produce incorrect results when using:

inclusive = false

for large 1D arrays.

The issue reproduces on both CUDA and Metal backends.

The corresponding inclusive scan (inclusive=true) is correct and matches both backend-native scan implementations and a CPU reference exactly.

The issue appears specific to the exclusive scan path.


Package Versions

julia> using Pkg

julia> Pkg.status(["AcceleratedKernels","CUDA","KernelAbstractions"])
Status `~/.julia/environments/v1.12/Project.toml`
  [6a4ca0a5] AcceleratedKernels v0.4.3
  [052768ef] CUDA v5.11.3
  [63c18a36] KernelAbstractions v0.9.41

Julia version:

julia> VERSION
v"1.12"

CUDA Reproducer

using CUDA
using AcceleratedKernels

const AK = AcceleratedKernels

n = 1_000_000 # smaller sizes such as 16384 (2^14) can be tried as well

clen = CuArray(rand(Int32(0):Int32(20), n))

cbgn = similar(clen)

AK.accumulate!(
    +,
    cbgn,
    clen;
    init=Int32(0),
    inclusive=false
)

CUDA.synchronize()

clen_h = Array(clen)

ref = similar(clen_h)

s = Int32(0)
for i in eachindex(clen_h)
    ref[i] = s
    s += clen_h[i]
end

maximum(abs.(ref .- Array(cbgn)))

Observed:

18

Expected:

0

Metal Reproducer

using Metal
using AcceleratedKernels

const AK = AcceleratedKernels

n = 1_000_000

clen = MtlArray(rand(Int32(0):Int32(20), n))

cbgn = similar(clen)

AK.accumulate!(
    +,
    cbgn,
    clen;
    init=Int32(0),
    inclusive=false
)

Metal.synchronize()

clen_h = Array(clen)

ref = similar(clen_h)

s = Int32(0)
for i in eachindex(clen_h)
    ref[i] = s
    s += clen_h[i]
end

maximum(abs.(ref .- Array(cbgn)))

Observed:

14

Expected:

0
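
To narrow down the array size at which the mismatch first appears, the comparison above could be wrapped in a small sweep. This is only a sketch: `exclusive_ref`, `sweep_sizes` and the `scan_fn` argument are hypothetical names introduced here, not part of AcceleratedKernels, and the example call uses the CPU reference itself as a stand-in so that it runs without a GPU.

```julia
# CPU reference: exclusive prefix sum, same loop as in the reproducers.
function exclusive_ref(src::AbstractVector{T}, init::T) where {T}
    ref = similar(src)
    s = init
    for i in eachindex(src)
        ref[i] = s
        s += src[i]
    end
    return ref
end

# `scan_fn` is a hypothetical callable: it takes a host Vector{Int32} and
# returns the exclusive scan as a host Array. For CUDA it could look like:
#   src -> begin
#       d = CuArray(src); out = similar(d)
#       AK.accumulate!(+, out, d; init=Int32(0), inclusive=false)
#       Array(out)
#   end
function sweep_sizes(scan_fn, sizes)
    results = Pair{Int,Int}[]
    for n in sizes
        src = rand(Int32(0):Int32(20), n)
        ref = exclusive_ref(src, Int32(0))
        err = maximum(abs.(ref .- scan_fn(src)))
        push!(results, n => Int(err))  # size => maximum absolute error
    end
    return results
end

# Sanity check with the CPU reference standing in for the GPU call.
sizes = [2^k for k in 8:14]
results = sweep_sizes(src -> exclusive_ref(src, Int32(0)), sizes)
```

Any size whose reported error is nonzero is a candidate for a minimal reproducer.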

Inclusive Scan Appears Correct

Using the same input:

cbgn = similar(clen)

AK.accumulate!(
    +,
    cbgn,
    clen;
    init=Int32(0),
    inclusive=true
)

and comparing against:

s = Int32(0)
for i in eachindex(clen_h)
    s += clen_h[i]
    ref[i] = s
end

gives:

maximum(abs.(ref .- Array(cbgn))) == 0

on both CUDA and Metal.

Thus the issue appears specific to:

inclusive = false

Independent Validation Against Backend-Native Scans

For CUDA:

CUDA.scan!(+, cbgn1, clen; dims=1)

For Metal:

Metal.scan!(+, cbgn1, clen; dims=1)

Both native inclusive scans match a CPU reference exactly.

For example, a 1-based exclusive scan can be built from the native inclusive scan by subtracting each input element and adding 1:

using KernelAbstractions

@kernel function compute_cend_kernel!(cbgn, clen, ncell)
    cid = @index(Global)

    if cid <= ncell
        cbgn[cid] -= clen[cid] - Int32(1)
    end
end

Running:

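cbgn1 = similar(clen)
backend = get_backend(clen)  # KernelAbstractions backend of the input array

CUDA.scan!(+, cbgn1, clen; dims=1)  # inclusive scan
# OR Metal.scan!(+, cbgn1, clen; dims=1)

kernel_cend = compute_cend_kernel!(backend, 256)
kernel_cend(cbgn1, clen, Int32(n); ndrange=n)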

matches the CPU reference cbgn_ref (the 1-based exclusive scan computed on the host) exactly:

maximum(abs.(cbgn_ref .- Array(cbgn1))) == 0

on both CUDA and Metal.
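
The arithmetic behind the kernel above can be checked on the CPU with a tiny input. This is a minimal sketch using only Base functions; the variable names are illustrative.

```julia
src = Int32[3, 1, 4, 1, 5]

# Inclusive scan. Base.accumulate with + keeps the Int32 element type.
inclusive = accumulate(+, src)       # [3, 4, 8, 9, 14]

# Subtracting each input element turns it into an exclusive scan
# (0-based start offsets). This identity relies on + having an inverse.
exclusive = inclusive .- src         # [0, 3, 4, 8, 9]

# Adding 1 gives the 1-based start offsets, which is what
# cbgn[cid] -= clen[cid] - Int32(1) computes in a single step.
starts = exclusive .+ Int32(1)       # [1, 4, 5, 9, 10]
```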


Additional Observations

For one test case, the first mismatch occurred at:

i = 513

with:

a1[513] = 4942
a2[513] = 4939

and:

sum(clen[1:512]) = 4941

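so a1[513] equals sum(clen[1:512]) + 1 while a2[513] is 3 lower. A first mismatch directly after index 512 initially suggested a block-boundary issue.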

However, examining the error over the next block:

d = a1 .- a2

unique(d[513:1024])

produced values ranging from:

-15

to

5

indicating that the discrepancy is not simply a constant carry offset.
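
The checks in this section could be collected into two small helpers that operate on host arrays. This is a sketch: `first_mismatch` and `blockwise_error_range` are hypothetical names, and the default block size of 512 is an assumption suggested by the first mismatch at index 513, not a confirmed property of the AcceleratedKernels implementation.

```julia
# First index where two host arrays differ, or `nothing` if they agree.
first_mismatch(expected, actual) = findfirst(expected .!= actual)

# (min, max) of expected .- actual within each consecutive block.
# block=512 is an assumed value, see the note above.
function blockwise_error_range(expected, actual; block::Int=512)
    d = expected .- actual
    n = length(d)
    return [extrema(view(d, lo:min(lo + block - 1, n))) for lo in 1:block:n]
end

# Tiny illustrative input with a block size of 2.
expected = Int32[1, 2, 3, 4, 5, 6]
actual   = Int32[1, 2, 3, 5, 5, 4]

idx    = first_mismatch(expected, actual)
ranges = blockwise_error_range(expected, actual; block=2)
```

A constant carry error would show up as identical minimum and maximum within each affected block; differing extrema within one block point to errors inside the block itself.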


Expected Behavior

For:

AK.accumulate!(
    +,
    dst,
    src;
    init=Int32(0),
    inclusive=false
)

the output should satisfy:

dst[i] == sum(src[1:i-1])

for all valid indices.

The observed output does not satisfy this property for large arrays on either CUDA or Metal.
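
The property above can be verified in a single pass over host copies of the arrays. This is a minimal sketch for `+` only; `is_exclusive_scan` is a hypothetical helper name, and it uses a running sum rather than recomputing `sum(src[1:i-1])` at every index.

```julia
# Returns true if dst[i] == init + sum(src[1:i-1]) for every index i.
function is_exclusive_scan(dst, src, init)
    acc = init
    for i in eachindex(dst, src)
        dst[i] == acc || return false
        acc += src[i]
    end
    return true
end

src = Int32[3, 1, 4, 1, 5]

ok_exclusive = is_exclusive_scan(Int32[0, 3, 4, 8, 9], src, Int32(0))
ok_inclusive = is_exclusive_scan(Int32[3, 4, 8, 9, 14], src, Int32(0))
```

Here `ok_exclusive` is `true`, and `ok_inclusive` is `false` because the second array is the inclusive scan.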


Conclusion

The issue appears specific to the exclusive scan implementation (inclusive=false) in AcceleratedKernels.

Inclusive scans (inclusive=true) are correct on both CUDA and Metal.

The discrepancy reproduces against:

  • A CPU reference implementation
  • CUDA native scan
  • Metal native scan

Thank you for taking a look.
