The cub::DeviceRadixSort algorithm consists of many kernels. On small problem sizes, the gaps between these kernels constitute a significant portion of the elapsed time.

Profile collected on A6000 Ada with:

nsys profile ./bin/cub.bench.radix_sort.keys.base --profile -a 'T{ct}=I32' -a 'OffsetT{ct}=I32' -a 'Elements{io}[pow2]=16'

We should try using PDL (Programmatic Dependent Launch) to accelerate small problem sizes.
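
For reference, here is a minimal sketch of what PDL looks like between two back-to-back kernels. It is not CUB code: `first_pass`, `second_pass`, and `run_pdl_chain` are hypothetical stand-ins for two consecutive radix sort kernels. The sketch assumes PDL is only usable on devices of compute capability 9.0 or newer and falls back to an ordinary stream-ordered launch otherwise.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <cstdio>
#include <vector>

#define CHECK(call) do { cudaError_t e_ = (call); if (e_ != cudaSuccess) { \
  std::printf("CUDA error: %s\n", cudaGetErrorString(e_)); return false; } } while (0)

// Hypothetical primary kernel (stands in for one sorting pass).
__global__ void first_pass(const int* in, int* tmp, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) tmp[i] = in[i] + 1;
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 900
  // Allow the dependent kernel to start launching before this grid fully drains.
  cudaTriggerProgrammaticLaunchCompletion();
#endif
}

// Hypothetical secondary kernel (stands in for the next pass).
__global__ void second_pass(const int* tmp, int* out, int n)
{
  // Work that does not read the primary kernel's output can go before the sync.
  int i = blockIdx.x * blockDim.x + threadIdx.x;
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 900
  // Wait until the primary grid has completed and its writes are visible.
  cudaGridDependencySynchronize();
#endif
  if (i < n) out[i] = tmp[i] * 2;
}

bool run_pdl_chain(int n)
{
  int device = 0;
  cudaDeviceProp prop;
  CHECK(cudaGetDevice(&device));
  CHECK(cudaGetDeviceProperties(&prop, device));
  const bool use_pdl = prop.major >= 9; // assumption: PDL needs CC 9.0+

  std::vector<int> h_in(n), h_out(n);
  for (int i = 0; i < n; ++i) h_in[i] = i;

  int *d_in = nullptr, *d_tmp = nullptr, *d_out = nullptr;
  CHECK(cudaMalloc(&d_in, n * sizeof(int)));
  CHECK(cudaMalloc(&d_tmp, n * sizeof(int)));
  CHECK(cudaMalloc(&d_out, n * sizeof(int)));
  CHECK(cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice));

  cudaStream_t stream;
  CHECK(cudaStreamCreate(&stream));

  const int threads = 256;
  const int blocks  = (n + threads - 1) / threads;
  first_pass<<<blocks, threads, 0, stream>>>(d_in, d_tmp, n);
  CHECK(cudaGetLastError());

  // Launch the secondary kernel; with PDL its launch may overlap the primary's tail.
  cudaLaunchAttribute attr[1];
  attr[0].id = cudaLaunchAttributeProgrammaticStreamSerialization;
  attr[0].val.programmaticStreamSerializationAllowed = 1;

  cudaLaunchConfig_t cfg = {};
  cfg.gridDim          = dim3(blocks);
  cfg.blockDim         = dim3(threads);
  cfg.dynamicSmemBytes = 0;
  cfg.stream           = stream;
  cfg.attrs            = use_pdl ? attr : nullptr;
  cfg.numAttrs         = use_pdl ? 1 : 0;
  CHECK(cudaLaunchKernelEx(&cfg, second_pass, static_cast<const int*>(d_tmp), d_out, n));

  CHECK(cudaStreamSynchronize(stream));
  CHECK(cudaMemcpy(h_out.data(), d_out, n * sizeof(int), cudaMemcpyDeviceToHost));

  bool ok = true;
  for (int i = 0; i < n; ++i) ok = ok && (h_out[i] == (i + 1) * 2);

  cudaStreamDestroy(stream);
  cudaFree(d_in); cudaFree(d_tmp); cudaFree(d_out);
  return ok;
}

int main()
{
  assert(run_pdl_chain(1 << 16));
  std::printf("done\n");
  return 0;
}
```

Whether this helps radix sort depends on how much of each kernel's prologue can run before `cudaGridDependencySynchronize()`, so the benchmark above would need to be rerun to measure any gain.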