DPCT 1007: Support for cub::Blockscan with pointer/array inputs

### Describe the bug

Have issues with porting `cub::BlockScan::ExclusiveSum` with pointer/array inputs and outputs. Reproducer is below. The port works fine with scalar inputs/outputs though. I was hoping if someone can take a look into and suggest porting strategy

```bash
DPCT1007:0: Migration of cub::BlockScan<int, 4>::ExclusiveSum<4>(int[4], int[4], int) is not supported.
```

Reproducer: 
```cuda
#include <cuda_runtime.h>
#include <cub/cub.cuh>
#include <stdio.h>

constexpr int n_threads = 4;

// CUDA kernel using BlockScan
__global__ void blockScanKernel(const int *input, int *output, int *block_aggregate) {
    // Specialize BlockScan for `n_threads` threads
    using BlockScan = cub::BlockScan<int, n_threads>;
    __shared__ typename BlockScan::TempStorage temp_storage;

    // Each thread has 4 items
    int thread_data[4];
    int exclusive_sum[4];
    // Load data for each thread
    for (int i = 0; i < 4; i++) {
        thread_data[i] = input[threadIdx.x * 4 + i];
    }

    int aggregated_block;
    // Perform block-wide exclusive sum scan
    BlockScan(temp_storage).ExclusiveSum(thread_data, exclusive_sum, aggregated_block);

    // Output results
    for (int i = 0; i < 4; i++) {
        output[threadIdx.x * 4 + i] = exclusive_sum[i];
    }
    // Only one thread sets the aggregate
    if (threadIdx.x == 0) *block_aggregate = aggregated_block;
}

int main() {
    int h_input[n_threads * 4] = {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16};
    int h_output[n_threads * 4] = {};
    int h_block_aggregate = 0;

    int *d_input, *d_output, *d_block_aggregate;
    cudaMalloc(&d_input, sizeof(h_input));
    cudaMalloc(&d_output, sizeof(h_output));
    cudaMalloc(&d_block_aggregate, sizeof(h_block_aggregate));
    cudaMemcpy(d_input, h_input, sizeof(h_input), cudaMemcpyHostToDevice);

    // Launch with n_threads and 1 block (minimal demo)
    blockScanKernel<<<1, n_threads>>>(d_input, d_output, d_block_aggregate);

    cudaMemcpy(h_output, d_output, sizeof(h_output), cudaMemcpyDeviceToHost);
    cudaMemcpy(&h_block_aggregate, d_block_aggregate, sizeof(h_block_aggregate), cudaMemcpyDeviceToHost);

    // Print the results
    printf("Exclusive sums:\n");
    for (int i = 0; i < n_threads * 4; i++) {
        printf("%d ", h_output[i]);
    }
    printf("\nBlock aggregate: %d\n", h_block_aggregate);

    cudaFree(d_input);
    cudaFree(d_output);
    cudaFree(d_block_aggregate);

    return 0;
}
```

### To reproduce

_No response_

### Environment

- OS: [e.g Windows/Linux]
- Target device and vendor: [e.g. Nvidia GPU]
- DPC++ version: [e.g. commit hash or output of `clang++ --version`]
- Dependencies version: [e.g. the output of `nvidia-smi` and `sycl-ls --verbose`]


### Additional context

_No response_

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

DPCT 1007: Support for cub::Blockscan with pointer/array inputs #2962

Describe the bug

To reproduce

Environment

Additional context

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

DPCT 1007: Support for cub::Blockscan with pointer/array inputs #2962

Description

Describe the bug

To reproduce

Environment

Additional context

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions