Skip to content

DPCT 1007: Support for cub::Blockscan with pointer/array inputs #2962

@abagusetty

Description

@abagusetty

Describe the bug

Have issues with porting cub::BlockScan::ExclusiveSum with pointer/array inputs and outputs. Reproducer is below. The port works fine with scalar inputs/outputs though. I was hoping if someone can take a look into and suggest porting strategy

DPCT1007:0: Migration of cub::BlockScan<int, 4>::ExclusiveSum<4>(int[4], int[4], int) is not supported.

Reproducer:

#include <cuda_runtime.h>
#include <cub/cub.cuh>
#include <stdio.h>

constexpr int n_threads = 4;

// CUDA kernel using BlockScan
__global__ void blockScanKernel(const int *input, int *output, int *block_aggregate) {
    // Specialize BlockScan for `n_threads` threads
    using BlockScan = cub::BlockScan<int, n_threads>;
    __shared__ typename BlockScan::TempStorage temp_storage;

    // Each thread has 4 items
    int thread_data[4];
    int exclusive_sum[4];
    // Load data for each thread
    for (int i = 0; i < 4; i++) {
        thread_data[i] = input[threadIdx.x * 4 + i];
    }

    int aggregated_block;
    // Perform block-wide exclusive sum scan
    BlockScan(temp_storage).ExclusiveSum(thread_data, exclusive_sum, aggregated_block);

    // Output results
    for (int i = 0; i < 4; i++) {
        output[threadIdx.x * 4 + i] = exclusive_sum[i];
    }
    // Only one thread sets the aggregate
    if (threadIdx.x == 0) *block_aggregate = aggregated_block;
}

int main() {
    int h_input[n_threads * 4] = {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16};
    int h_output[n_threads * 4] = {};
    int h_block_aggregate = 0;

    int *d_input, *d_output, *d_block_aggregate;
    cudaMalloc(&d_input, sizeof(h_input));
    cudaMalloc(&d_output, sizeof(h_output));
    cudaMalloc(&d_block_aggregate, sizeof(h_block_aggregate));
    cudaMemcpy(d_input, h_input, sizeof(h_input), cudaMemcpyHostToDevice);

    // Launch with n_threads and 1 block (minimal demo)
    blockScanKernel<<<1, n_threads>>>(d_input, d_output, d_block_aggregate);

    cudaMemcpy(h_output, d_output, sizeof(h_output), cudaMemcpyDeviceToHost);
    cudaMemcpy(&h_block_aggregate, d_block_aggregate, sizeof(h_block_aggregate), cudaMemcpyDeviceToHost);

    // Print the results
    printf("Exclusive sums:\n");
    for (int i = 0; i < n_threads * 4; i++) {
        printf("%d ", h_output[i]);
    }
    printf("\nBlock aggregate: %d\n", h_block_aggregate);

    cudaFree(d_input);
    cudaFree(d_output);
    cudaFree(d_block_aggregate);

    return 0;
}

To reproduce

No response

Environment

  • OS: [e.g Windows/Linux]
  • Target device and vendor: [e.g. Nvidia GPU]
  • DPC++ version: [e.g. commit hash or output of clang++ --version]
  • Dependencies version: [e.g. the output of nvidia-smi and sycl-ls --verbose]

Additional context

No response

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions