# Bug: `AK.accumulate!(...; inclusive=false)` produces incorrect results for large arrays on both CUDA and Metal

## Summary

`AcceleratedKernels.accumulate!` appears to produce incorrect results when using:

```julia
inclusive = false
```

for large 1D arrays.

The issue reproduces on both CUDA and Metal backends.

The corresponding inclusive scan (`inclusive=true`) is correct and matches both backend-native scan implementations and a CPU reference exactly.

The issue appears specific to the exclusive scan path.

---

## Package Versions

```julia
julia> using Pkg

julia> Pkg.status(["AcceleratedKernels","CUDA","KernelAbstractions"])
Status `~/.julia/environments/v1.12/Project.toml`
  [6a4ca0a5] AcceleratedKernels v0.4.3
  [052768ef] CUDA v5.11.3
  [63c18a36] KernelAbstractions v0.9.41
```

Julia version:

```julia
julia> VERSION
v"1.12"
```

---

## CUDA Reproducer

```julia
using CUDA
using AcceleratedKernels

const AK = AcceleratedKernels

n = 1_000_000  # also reproduces with smaller sizes such as 16_384 (2^14)

clen = CuArray(rand(Int32(0):Int32(20), n))

cbgn = similar(clen)

AK.accumulate!(
    +,
    cbgn,
    clen;
    init=Int32(0),
    inclusive=false
)

CUDA.synchronize()

clen_h = Array(clen)

ref = similar(clen_h)

s = Int32(0)
for i in eachindex(clen_h)
    ref[i] = s
    s += clen_h[i]
end

maximum(abs.(ref .- Array(cbgn)))
```

Observed:

```julia
18
```

Expected:

```julia
0
```

---

## Metal Reproducer

```julia
using Metal
using AcceleratedKernels

const AK = AcceleratedKernels

n = 1_000_000

clen = MtlArray(rand(Int32(0):Int32(20), n))

cbgn = similar(clen)

AK.accumulate!(
    +,
    cbgn,
    clen;
    init=Int32(0),
    inclusive=false
)

Metal.synchronize()

clen_h = Array(clen)

ref = similar(clen_h)

s = Int32(0)
for i in eachindex(clen_h)
    ref[i] = s
    s += clen_h[i]
end

maximum(abs.(ref .- Array(cbgn)))
```

Observed:

```julia
14
```

Expected:

```julia
0
```

---

## Inclusive Scan Appears Correct

Using the same input:

```julia
cbgn = similar(clen)

AK.accumulate!(
    +,
    cbgn,
    clen;
    init=Int32(0),
    inclusive=true
)
```

and comparing against:

```julia
s = Int32(0)
for i in eachindex(clen_h)
    s += clen_h[i]
    ref[i] = s
end
```

gives:

```julia
maximum(abs.(ref .- Array(cbgn))) == 0
```

on both CUDA and Metal.

Thus the issue appears specific to:

```julia
inclusive = false
```

---

## Independent Validation Against Backend-Native Scans

For CUDA:

```julia
CUDA.scan!(+, cbgn1, clen; dims=1)
```

For Metal:

```julia
Metal.scan!(+, cbgn1, clen; dims=1)
```

Both match a CPU reference exactly.

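For example, a 1-based exclusive scan can be derived from the native inclusive scan with a small correction kernel:

```julia
using KernelAbstractions

# Converts an inclusive scan (stored in `cbgn`) into a 1-based exclusive scan, in place:
# cbgn[i] = inclusive[i] - clen[i] + 1
@kernel function compute_cend_kernel!(cbgn, clen, ncell)
    cid = @index(Global)

    if cid <= ncell
        cbgn[cid] -= clen[cid] - Int32(1)
    end
end
```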

Running:

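```julia
cbgn1 = similar(clen)
backend = KernelAbstractions.get_backend(clen)  # CUDABackend() or MetalBackend()

CUDA.scan!(+, cbgn1, clen; dims=1)
# OR Metal.scan!(+, cbgn1, clen; dims=1)

kernel_cend = compute_cend_kernel!(backend, 256)
kernel_cend(cbgn1, clen, Int32(n); ndrange=n)
```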

matches the CPU reference exactly:

```julia
maximum(abs.(cbgn_ref .- Array(cbgn1))) == 0
```

on both CUDA and Metal.
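
The `cbgn_ref` used in that comparison is not defined in the snippets here. A minimal CPU sketch of how such a reference could be built is shown below, assuming `clen_h` is the host copy of `clen`. The helper name `one_based_exclusive` is hypothetical and not part of any package.

```julia
# Hypothetical helper: 1-based exclusive scan computed on the CPU.
# Mirrors the correction kernel: start = inclusive_sum - len + 1.
function one_based_exclusive(len::AbstractVector{T}) where {T<:Integer}
    # `accumulate(+, ...)` keeps the Int32 element type; `cumsum` would widen it to Int.
    incl = accumulate(+, len)
    return incl .- len .+ one(T)
end

clen_h = Int32[3, 0, 2, 5]               # toy input standing in for Array(clen)
cbgn_ref = one_based_exclusive(clen_h)   # Int32[1, 4, 4, 6]
```

Keeping the reference in `Int32` means the comparison uses the same integer width as the GPU result.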

---

## Additional Observations

For one test case, the first mismatch occurred at:

```julia
i = 513
```

with:

```julia
a1[513] = 4942
a2[513] = 4939
```

and:

```julia
sum(clen[1:512]) = 4941
```

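Here `a1[513]` equals the expected 1-based value `4941 + 1`, while `a2[513]` is 3 lower. A first mismatch immediately after index 512 initially suggested a block-boundary issue.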

However, examining the error over the next block:

```julia
d = a1 .- a2

unique(d[513:1024])
```

produced values ranging from `-15` to `5`, indicating that the discrepancy is not simply a constant carry offset.
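
The kind of inspection described here can be sketched with two small host-side helpers. Both names are hypothetical, and the default block size of 512 is only an assumption suggested by the first mismatch appearing at index 513; it is not a confirmed property of the AcceleratedKernels implementation.

```julia
# Hypothetical diagnostic helpers; `expected` and `actual` are host (CPU) vectors.

# Index of the first element where the two results differ, or `nothing` if they agree.
first_mismatch(expected, actual) = findfirst(expected .!= actual)

# (min, max) of `expected - actual` within each block of `blocksize` elements.
# `blocksize = 512` is an assumption, not a known kernel parameter.
function block_error_ranges(expected, actual; blocksize::Int = 512)
    d = expected .- actual
    return [extrema(@view d[i:min(i + blocksize - 1, lastindex(d))])
            for i in firstindex(d):blocksize:lastindex(d)]
end

# Toy data: the first block is correct, the second has a non-constant error.
expected = collect(Int32(1):Int32(8))
actual   = Int32[1, 2, 3, 4, 5, 9, 7, 6]

first_mismatch(expected, actual)                      # 6
block_error_ranges(expected, actual; blocksize = 4)   # [(0, 0), (-3, 2)]
```

A block whose minimum and maximum error are equal and nonzero would point to a constant carry offset, whereas a spread such as `(-3, 2)` indicates per-element errors inside the block.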

---

## Expected Behavior

For:

```julia
AK.accumulate!(
    +,
    dst,
    src;
    init=Int32(0),
    inclusive=false
)
```

the output should satisfy:

```julia
dst[i] == sum(src[1:i-1])
```

for all valid indices.

The observed output does not satisfy this property for large arrays on either CUDA or Metal.
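
The property above can also be checked directly on host copies of the arrays. The following is a minimal sketch using only Base Julia; `is_exclusive_scan` is a hypothetical helper, and it assumes the operator is `+` with `init` as the starting value.

```julia
# Hypothetical checker: true if `dst` is the exclusive `+` scan of `src`
# starting from `init`. Both arguments are host (CPU) vectors.
function is_exclusive_scan(dst::AbstractVector, src::AbstractVector;
                           init = zero(eltype(src)))
    length(dst) == length(src) || return false
    running = init
    for i in eachindex(src, dst)
        dst[i] == running || return false   # dst[i] must equal init + sum(src[1:i-1])
        running += src[i]
    end
    return true
end

src = Int32[3, 0, 2, 5]
is_exclusive_scan(Int32[0, 3, 3, 5], src)    # true
is_exclusive_scan(Int32[3, 3, 5, 10], src)   # false: this is the inclusive scan
```

With the reproducers in this report, the check would be applied as `is_exclusive_scan(Array(cbgn), Array(clen); init=Int32(0))`.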

---

## Conclusion

The issue appears specific to the exclusive scan implementation (`inclusive=false`) in AcceleratedKernels.

Inclusive scans (`inclusive=true`) are correct on both CUDA and Metal.

The discrepancy reproduces against:

- A CPU reference implementation
- CUDA native scan
- Metal native scan

Thank you for taking a look.
