Summary
Two related test-hygiene patterns cause spurious failures on environments where the test suite previously skipped cleanly:
Pattern A: No skipif on QdpEngine(0) instantiation. ~12 tests instantiate QdpEngine(0) (directly or via QdpBenchmark / QuantumDataLoader) without first checking that CUDA is available. On systems with CUDA_VISIBLE_DEVICES=, they error with CUDA_ERROR_NO_DEVICE rather than skipping.
Pattern B: Hardcoded device="cuda" without arch check. ~68 tests in testing/qdp/test_bindings.py use torch.tensor(..., device="cuda") or equivalent, then fail with cudaErrorNoKernelImageForDevice on GPUs whose compute capability isn't in the PyTorch wheel's compiled arch list. PR #1323 added a capability-aware fallback in qumat_qdp.loader._select_torch_device for production code, but tests don't use it.
Reproducer
On a host with a Pascal GPU (sm_61) and CUDA Toolkit installed (also requires the LD_PRELOAD workaround from Issue [N1] for the wheel quirk):
git checkout mahout-qumat-0.6.0-RC2
export LD_PRELOAD=$VIRTUAL_ENV/lib/python3.12/site-packages/nvidia/cuda_runtime/lib/libcudart.so.12
# Pattern A: CUDA masked → 12 tests fail instead of being skipped
CUDA_VISIBLE_DEVICES= uv run pytest -v # 12 failed, 719 passed, 142 skipped
# Patterns A+B combined: CUDA visible but arch incompatible → 80 tests fail
uv run pytest -v # 80 failed, 785 passed, 8 skipped
Specific failing tests (Pattern A, CUDA masked)
testing/qdp/test_benchmark_api.py::test_qdp_benchmark_run_throughput
testing/qdp/test_benchmark_api.py::test_qdp_benchmark_run_latency
testing/qdp/test_benchmark_api.py::test_qdp_benchmark_device_id_propagated
testing/qdp/test_bindings.py::test_encode
testing/qdp/test_bindings.py::test_dlpack_device
testing/qdp/test_bindings.py::test_dlpack_single_use
testing/qdp/test_bindings.py::test_dlpack_with_stream[stream_legacy]
testing/qdp/test_bindings.py::test_dlpack_with_stream[stream_per_thread]
testing/qdp/test_bindings.py::test_pytorch_integration
testing/qdp/test_bindings.py::test_precision[float32-complex64]
testing/qdp/test_bindings.py::test_precision[float64-complex128]
testing/qdp_python/test_quantum_data_loader.py::test_synthetic_loader_batch_count
All produce the same error: Failed to initialize CUDA device 0: DriverError(CUDA_ERROR_NO_DEVICE, "no CUDA-capable device is detected").
Pattern B failures (CUDA-visible) are mostly in testing/qdp/test_bindings.py::test_encode_cuda_tensor*, test_angle_encode_cuda_tensor*, test_basis_encode_*, etc.
Comparison to RC1
On the same hardware and setup, RC1 with CUDA_VISIBLE_DEVICES= produced 4 failures; RC2 produces 12. Most of the delta comes from tests added since RC1 (the test_benchmark_api.py set and a few new test_bindings.py cases) that inherited the no-skipif pattern.
Suggested fix
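Standardize on a module-level guard that matches the pattern (the helper names below are placeholders for functions that would still need to be written):
import pytest

# Pattern A guard: skip when no CUDA device is visible
pytestmark = pytest.mark.skipif(
    not _cuda_available(), reason="CUDA-capable device required"
)
# Pattern B guard: mirror qumat_qdp.loader._select_torch_device
pytestmark = pytest.mark.skipif(
    not _cuda_compatible_with_torch(), reason="GPU arch not supported by this PyTorch build"
)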
A shared fixture / helper in testing/conftest.py would avoid copy-pasting the check. The capability-list check from PR #1323 (get_device_capability ∩ get_arch_list) can be reused verbatim.
Environment
- OS: Ubuntu 24.04
- CUDA Toolkit: 12.4 (apt nvidia-cuda-toolkit)
- GPU: NVIDIA GeForce GTX 1060 with Max-Q Design (sm_61)
- Python: 3.12.12 (uv-managed)
- PyTorch: 2.9.0+cu128 (resolved by RC2 lockfile)