CUDA illegal memory access when running Segger on multi-GPU machine #12

@enric-bazz

Description

When running the Segger CLI to segment Xenium version 4.0.0 data on a machine with two separate GPUs, I encounter a cascade of CUDA illegal memory access errors whenever both GPUs are visible. This appears to be caused by automatic distributed process spawning: one process is started per visible GPU. Limiting execution to a single GPU avoids the issue.


Steps to reproduce

  1. Run the Segger CLI on a 2 × NVIDIA GeForce RTX 4090 machine with the default environment.
  2. Observe output:
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
...
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 2 processes
----------------------------------------------------------------------------------------------------
  3. Segger crashes with multiple errors such as:
RuntimeError: parallel_for: failed to synchronize: cudaErrorIllegalAddress
During handling of the above exception, another exception occurred:
RuntimeError: CUDA error: an illegal memory access was encountered
cupy_backends.cuda.api.driver.CUDADriverError: CUDA_ERROR_ILLEGAL_ADDRESS
  4. Limiting the session to a single GPU resolves the issue:
export CUDA_VISIBLE_DEVICES=0

Output:

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
...
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

Execution completes without errors.
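For anyone driving Segger from a Python script rather than a shell, the same single-GPU workaround can be sketched in-process. This is a minimal sketch, not part of Segger: `pin_to_single_gpu` is a hypothetical helper name, and it assumes the variable is set before any CUDA-using library (torch, cupy, the RAPIDS packages) initializes the driver, since CUDA reads `CUDA_VISIBLE_DEVICES` at initialization.

```python
import os
import sys


def pin_to_single_gpu(device_index: int = 0) -> str:
    """Hypothetical helper: expose only one GPU to this process and its children."""
    # If a CUDA-using library is already imported it may have initialized CUDA,
    # in which case changing the variable may no longer have any effect.
    already_loaded = [name for name in ("torch", "cupy") if name in sys.modules]
    if already_loaded:
        print(
            f"warning: {already_loaded} already imported; "
            "CUDA_VISIBLE_DEVICES may be ignored",
            file=sys.stderr,
        )
    os.environ["CUDA_VISIBLE_DEVICES"] = str(device_index)
    return os.environ["CUDA_VISIBLE_DEVICES"]


if __name__ == "__main__":
    # Call this first, then import torch / Segger afterwards.
    print("CUDA_VISIBLE_DEVICES =", pin_to_single_gpu(0))
```

Child processes inherit the environment, so any distributed workers spawned later should see only the selected device as well.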


Environment

  • Python: 3.11.14
  • Segger: 0.2.0
  • PyTorch: 2.5.0+cu121
  • Lightning: 2.6.0
  • CUDA: 12.2.0
  • NVIDIA Drivers: 535.247.01
  • GPU: 2 × NVIDIA GeForce RTX 4090

Relevant packages and versions:

Package          Version
torch_scatter    2.1.2+pt25cu121
cuml-cu12        25.4.0
cugraph-cu12     25.4.1
cuspatial-cu12   25.4.0
cudf-cu12        25.4.0
cupy-cuda12x     13.6.0
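To compare against the versions listed above on another machine, a small script along these lines could collect them. It is only a sketch using the standard library's `importlib.metadata`; `collect_versions` is a hypothetical helper, and the distribution names (in particular `segger`) are assumptions that may differ from how the packages were actually installed.

```python
from importlib import metadata


def collect_versions(names: list[str]) -> dict[str, str]:
    """Map each distribution name to its installed version, or 'not installed'."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


if __name__ == "__main__":
    # Distribution names are assumptions; adjust them to match `pip list`.
    packages = [
        "segger", "torch", "lightning", "torch_scatter",
        "cuml-cu12", "cugraph-cu12", "cuspatial-cu12",
        "cudf-cu12", "cupy-cuda12x",
    ]
    for name, version in collect_versions(packages).items():
        print(f"{name:<16} {version}")
```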
