[FEA] Dask Array support for Aggregation #385

@MPebworthEpana

I need to use Dask arrays and out-of-memory operations for all analysis, including pseudobulking.

But when I pass a Dask array to get.aggregate, I get the following error:

aggregated = rsc.get.aggregate(adata, by=["lvl_2", 'sample_id'], func=["sum", "count_nonzero"])
Traceback (most recent call last):
  File "", line 1, in
  File "/opt/mamba/envs/newrapids/lib/python3.12/site-packages/rapids_singlecell/get/_aggregated.py", line 419, in aggregate
    _check_gpu_X(data)
  File "/opt/mamba/envs/newrapids/lib/python3.12/site-packages/rapids_singlecell/preprocessing/_utils.py", line 277, in _check_gpu_X
    raise TypeError(
TypeError: The input is a DaskArray. Rapids-singlecell doesn't support DaskArray in this function, so your input must be a CuPy ndarray or a CuPy sparse matrix.

The major benefit of using RAPIDS is that I can run out-of-memory operations on the GPU. Otherwise, I might as well use a cheaper CPU instance and fully parallelize the pseudobulk operations.
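
To illustrate why this should be feasible out of memory: both "sum" and "count_nonzero" can be computed per chunk of cells and then merged, so the full matrix never has to be resident at once. Below is a minimal sketch of that idea in plain Python, with lists standing in for Dask/CuPy chunks. The names `aggregate_chunk`, `merge_partials`, and `chunked_pseudobulk` are hypothetical and are not part of rapids-singlecell; this is not how the library implements `get.aggregate`.

```python
def aggregate_chunk(rows, groups):
    """Per-group partial (sum, count_nonzero) vectors for one chunk of cells."""
    partial = {}
    for row, group in zip(rows, groups):
        if group not in partial:
            partial[group] = ([0.0] * len(row), [0] * len(row))
        sums, nonzero = partial[group]
        for j, value in enumerate(row):
            sums[j] += value
            nonzero[j] += int(value != 0)
    return partial


def merge_partials(partials):
    """Combine per-chunk partials; valid because both statistics are additive."""
    merged = {}
    for partial in partials:
        for group, (sums, nonzero) in partial.items():
            if group not in merged:
                merged[group] = ([0.0] * len(sums), [0] * len(nonzero))
            merged_sums, merged_nonzero = merged[group]
            for j in range(len(sums)):
                merged_sums[j] += sums[j]
                merged_nonzero[j] += nonzero[j]
    return merged


def chunked_pseudobulk(rows, groups, chunk_size):
    """Aggregate cells in row chunks so only one chunk is processed at a time."""
    partials = (
        aggregate_chunk(rows[i:i + chunk_size], groups[i:i + chunk_size])
        for i in range(0, len(rows), chunk_size)
    )
    return merge_partials(partials)


# Toy data: 4 cells x 3 genes, grouped by (lvl_2, sample_id) as in the call above.
X = [
    [1.0, 0.0, 2.0],
    [0.0, 3.0, 0.0],
    [4.0, 0.0, 0.0],
    [0.0, 0.0, 5.0],
]
groups = [("T", "s1"), ("B", "s1"), ("T", "s1"), ("B", "s2")]

result = chunked_pseudobulk(X, groups, chunk_size=2)
print(result[("T", "s1")])
```

The result does not depend on `chunk_size`, which is the property a Dask-backed aggregation would rely on.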

When will Rapids-singlecell support Dask Arrays fully?

Labels: enhancement (New feature or request)
