
Commit 73a1aa6

fix(pd): adapting code for hardware compatibility (#5047)
This pull request updates PaddlePaddle dependencies to nightly builds and refactors device handling throughout the codebase to support more flexible device selection (including XPU), improves compatibility, and adds better flag management for tests. The most significant changes are grouped below. **Dependency Updates:** * Updated PaddlePaddle and PaddlePaddle-GPU installation in `.github/workflows/test_python.yml` and `.github/workflows/test_cuda.yml` to use nightly builds (`3.3.0.dev20251204`) from new package URLs, improving access to the latest features and fixes. [[1]](diffhunk://#diff-6d72d7142742932ca8c930aa674e12e3cf6c528566c88ab43f5cdb3169075f2fL50-R50) [[2]](diffhunk://#diff-896a3c7f7278514eba7e2573f009d6739b3de6d4e1ece4b756e04cdbe5c3f3caL35-R35) **Device Handling Refactor:** * Replaced usage of `paddle.device.cuda.device_count()` and related CUDA-specific APIs with more general `paddle.device.device_count()` and `paddle.device.empty_cache()` in multiple locations, enabling support for devices beyond CUDA (e.g., XPU). [[1]](diffhunk://#diff-e3f56cd14511cf86a0db88d6d9ee5b08cf45374edfdef0625a0f519d94c58507L217-R217) [[2]](diffhunk://#diff-c42cc453489450e30747781035e34ce592843893004b24481df3802b4fd6fa34L39-R39) [[3]](diffhunk://#diff-c42cc453489450e30747781035e34ce592843893004b24481df3802b4fd6fa34L54-R54) [[4]](diffhunk://#diff-e678abb052b278f8a479f8d13b839a9ec0effd9923478a850bc13758f918e1e9L32-R35) [[5]](diffhunk://#diff-03ca05b7d964e1dd8ec22a81aff2d76b61b9f9b36111e384f177a04cc5a02f1eL9-R10) * Updated logic for setting and retrieving device information in `deepmd/pd/utils/env.py` to use `paddle.device.get_device()`, ensuring correct device assignment for both CPU and GPU/XPU scenarios. **Device Compatibility Improvements:** * Enhanced `get_generator` in `deepmd/pd/utils/utils.py` to support XPU devices and added a warning for unsupported device types, improving compatibility and error messaging. **Test Flag Management:** * Added explicit management of the `FLAGS_use_stride_kernel` Paddle flag in `source/tests/pd/test_multitask.py` to ensure proper test isolation and restore flag values after tests. [[1]](diffhunk://#diff-ad724907bbb8b6260857768d8f1fc7f0f2122b6b86c010efaf66f22f87c4170dR236-R242) [[2]](diffhunk://#diff-ad724907bbb8b6260857768d8f1fc7f0f2122b6b86c010efaf66f22f87c4170dR280-R288) * Set `FLAGS_use_stride_compute_kernel` environment variable to `0` in workflow files to control kernel usage during tests. [[1]](diffhunk://#diff-6d72d7142742932ca8c930aa674e12e3cf6c528566c88ab43f5cdb3169075f2fR64) [[2]](diffhunk://#diff-896a3c7f7278514eba7e2573f009d6739b3de6d4e1ece4b756e04cdbe5c3f3caR63) **Distributed Training Check:** * Improved NCCL initialization check in distributed training setup to handle cases where NCCL is not compiled, preventing assertion errors. <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit ## Release Notes * **New Features** * Added support for XPU device accelerators. * **Refactor** * Improved device detection and initialization to support non-CUDA backends and multiple device types with device-agnostic APIs. * **Chores** * Updated testing workflows and dependencies. * Enhanced test compatibility for non-GPU environments. <sub>✏️ Tip: You can customize this high-level summary in your review settings.</sub> <!-- end of auto-generated comment: release notes by coderabbit.ai -->
1 parent 0ad7cbf commit 73a1aa6

8 files changed

Lines changed: 43 additions & 10 deletions

File tree

* .github/workflows/test_cuda.yml
* .github/workflows/test_python.yml
* deepmd/pd/entrypoints/main.py
* deepmd/pd/utils/auto_batch_size.py
* deepmd/pd/utils/env.py
* deepmd/pd/utils/utils.py
* source/tests/pd/conftest.py
* source/tests/pd/test_multitask.py

.github/workflows/test_cuda.yml

Lines changed: 2 additions & 1 deletion
@@ -47,7 +47,7 @@ jobs:
       - run: |
           export PYTORCH_ROOT=$(python -c 'import torch;print(torch.__path__[0])')
           export TENSORFLOW_ROOT=$(python -c 'import importlib.util,pathlib;print(pathlib.Path(importlib.util.find_spec("tensorflow").origin).parent)')
-          pip install "paddlepaddle-gpu==3.0.0" -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
+          pip install --find-links "https://www.paddlepaddle.org.cn/packages/nightly/cu126/paddlepaddle-gpu/" --index-url https://pypi.org/simple "paddlepaddle-gpu==3.3.0.dev20251204"
           source/install/uv_with_retry.sh pip install --system -v -e .[gpu,test,lmp,cu12,torch,jax] mpi4py --reinstall-package deepmd-kit
         env:
           DP_VARIANT: cuda
@@ -61,6 +61,7 @@ jobs:
           # See https://jax.readthedocs.io/en/latest/gpu_memory_allocation.html
           XLA_PYTHON_CLIENT_PREALLOCATE: false
           XLA_PYTHON_CLIENT_ALLOCATOR: platform
+          FLAGS_use_stride_compute_kernel: 0
       - name: Convert models
         run: source/tests/infer/convert-models.sh
       - run: |

.github/workflows/test_python.yml

Lines changed: 2 additions & 1 deletion
@@ -32,7 +32,7 @@ jobs:
           export TENSORFLOW_ROOT=$(python -c 'import importlib.util,pathlib;print(pathlib.Path(importlib.util.find_spec("tensorflow").origin).parent)')
           export PYTORCH_ROOT=$(python -c 'import torch;print(torch.__path__[0])')
           source/install/uv_with_retry.sh pip install --system -e .[test,jax] mpi4py --group pin_jax
-          source/install/uv_with_retry.sh pip install --system --pre "paddlepaddle==3.0.0" -i https://www.paddlepaddle.org.cn/packages/stable/cpu/
+          source/install/uv_with_retry.sh pip install --system --find-links "https://www.paddlepaddle.org.cn/packages/nightly/cpu/paddlepaddle/" --index-url https://pypi.org/simple paddlepaddle==3.3.0.dev20251204
         env:
           # Please note that uv has some issues with finding
           # existing TensorFlow package. Currently, it uses
@@ -60,6 +60,7 @@ jobs:
       - run: pytest --cov=deepmd source/tests --splits 6 --group ${{ matrix.group }} --store-durations --clean-durations --durations-path=.test_durations --splitting-algorithm least_duration
         env:
           NUM_WORKERS: 0
+          FLAGS_use_stride_compute_kernel: 0
       - name: Test TF2 eager mode
         run: pytest --cov=deepmd --cov-append source/tests/consistent/io/test_io.py source/jax2tf_tests
         env:

deepmd/pd/entrypoints/main.py

Lines changed: 2 additions & 2 deletions
@@ -95,7 +95,7 @@ def get_trainer(
     # Initialize DDP
     world_size = dist.get_world_size()
     if world_size > 1:
-        assert paddle.version.nccl() != "0"
+        assert not paddle.core.is_compiled_with_nccl() or paddle.version.nccl() != "0"
         fleet.init(is_collective=True)


     def prepare_trainer_input_single(
@@ -214,7 +214,7 @@ def get_compute_device(self) -> str:

     def get_ngpus(self) -> int:
         """Get the number of GPUs."""
-        return paddle.device.cuda.device_count()
+        return paddle.device.device_count()

     def get_backend_info(self) -> dict:
         """Get backend information."""

deepmd/pd/utils/auto_batch_size.py

Lines changed: 2 additions & 2 deletions
@@ -36,7 +36,7 @@ def is_gpu_available(self) -> bool:
         bool
             True if GPU is available
         """
-        return paddle.device.cuda.device_count() > 0
+        return paddle.device.device_count() > 0

     def is_oom_error(self, e: Exception) -> bool:
         """Check if the exception is an OOM error.
@@ -51,6 +51,6 @@ def is_oom_error(self, e: Exception) -> bool:
         # (the meaningless error message should be considered as a bug in cusolver)
         if isinstance(e, MemoryError) and ("ResourceExhaustedError" in e.args[0]):
             # Release all unoccupied cached memory
-            paddle.device.cuda.empty_cache()
+            paddle.device.empty_cache()
             return True
         return False
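The two calls changed above are the device-agnostic counterparts of the old CUDA-only APIs: `paddle.device.device_count()` counts whatever accelerator the installed Paddle build targets, and `paddle.device.empty_cache()` releases its cached memory. A hedged sketch of how such checks could drive an OOM retry loop (the helper `run_with_oom_retry` is hypothetical, not DeePMD's actual auto-batch-size logic):

```python
import paddle


def run_with_oom_retry(fn, batch_size: int):
    """Hypothetical helper: halve the batch size whenever an OOM-style error occurs."""
    while batch_size >= 1:
        try:
            return fn(batch_size)
        except MemoryError as e:
            if "ResourceExhaustedError" not in str(e):
                raise
            if paddle.device.device_count() > 0:
                # Release unoccupied cached memory on whatever accelerator is present.
                paddle.device.empty_cache()
            batch_size //= 2
    raise RuntimeError("batch size reduced to zero without a successful run")
```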

deepmd/pd/utils/env.py

Lines changed: 2 additions & 2 deletions
@@ -29,10 +29,10 @@
 # Make sure DDP uses correct device if applicable
 LOCAL_RANK = int(os.environ.get("PADDLE_LOCAL_RANK", 0))

-if os.environ.get("DEVICE") == "cpu" or paddle.device.cuda.device_count() <= 0:
+if os.environ.get("DEVICE") == "cpu" or paddle.device.device_count() <= 0:
     DEVICE = "cpu"
 else:
-    DEVICE = f"gpu:{LOCAL_RANK}"
+    DEVICE = paddle.device.get_device()

 paddle.device.set_device(DEVICE)
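A quick way to see what the refactored selection resolves to on a given machine (a standalone check, not part of the commit; it assumes `paddle.device.get_device()` returns strings such as `"cpu"`, `"gpu:0"`, or `"xpu:0"`):

```python
import os

import paddle

n_dev = paddle.device.device_count()
if os.environ.get("DEVICE") == "cpu" or n_dev <= 0:
    selected = "cpu"  # forced CPU, or no accelerator visible
else:
    selected = paddle.device.get_device()  # e.g. "gpu:0" on CUDA, "xpu:0" on XPU
print(f"visible accelerators: {n_dev}, selected device: {selected}")
```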

deepmd/pd/utils/utils.py

Lines changed: 15 additions & 1 deletion
@@ -3,6 +3,7 @@
     annotations,
 )

+import warnings
 from contextlib import (
     contextmanager,
 )
@@ -345,8 +346,21 @@ def get_generator(
         generator = paddle.framework.core.default_cuda_generator(
             int(DEVICE.split("gpu:")[1])
         )
+    elif DEVICE == "xpu":
+        generator = paddle.framework.core.default_xpu_generator(0)
+    elif DEVICE.startswith("xpu:"):
+        generator = paddle.framework.core.default_xpu_generator(
+            int(DEVICE.split("xpu:")[1])
+        )
     else:
-        raise ValueError("DEVICE should be cpu or gpu or gpu:x")
+        # return None for compatibility across different devices
+        warnings.warn(
+            f"DEVICE is {DEVICE}, which is not supported. Returning None.",
+            category=UserWarning,
+            stacklevel=2,
+        )
+        return None
+        # raise ValueError("DEVICE should be cpu or gpu or gpu:x or xpu or xpu:x")
     generator.manual_seed(seed)
     return generator
 else:
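Since `get_generator` can now return `None` (with a warning) instead of raising on unsupported devices, callers should tolerate both outcomes. A hedged usage sketch; the `paddle.seed` fallback is hypothetical and not necessarily what the project does:

```python
import paddle

from deepmd.pd.utils.utils import get_generator

generator = get_generator(12345)  # seed argument; seeding already happens inside
if generator is None:
    # Unsupported accelerator: the warning was already emitted by get_generator;
    # fall back to global seeding so runs stay reproducible.
    paddle.seed(12345)
```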

source/tests/pd/conftest.py

Lines changed: 2 additions & 1 deletion
@@ -6,4 +6,5 @@
 @pytest.fixture(scope="package", autouse=True)
 def clear_cuda_memory(request):
     yield
-    paddle.device.cuda.empty_cache()
+    if paddle.device.get_device() != "cpu":
+        paddle.device.empty_cache()

source/tests/pd/test_multitask.py

Lines changed: 16 additions & 0 deletions
@@ -11,6 +11,7 @@
 )

 import numpy as np
+import paddle

 from deepmd.pd.entrypoints.main import (
     get_trainer,
@@ -232,8 +233,15 @@ def setUp(self) -> None:
         self.config["model"], self.shared_links = preprocess_shared_params(
             self.config["model"]
         )
+        if not paddle.device.is_compiled_with_cuda():
+            self.FLAGS_use_stride_kernel = paddle.get_flags("FLAGS_use_stride_kernel")[
+                "FLAGS_use_stride_kernel"
+            ]
+            paddle.set_flags({"FLAGS_use_stride_kernel": False})

     def tearDown(self) -> None:
+        if not paddle.device.is_compiled_with_cuda():
+            paddle.set_flags({"FLAGS_use_stride_kernel": self.FLAGS_use_stride_kernel})
         MultiTaskTrainTest.tearDown(self)


@@ -271,9 +279,17 @@ def setUp(self) -> None:
         self.config["model"], self.shared_links = preprocess_shared_params(
             self.config["model"]
         )
+        self.config["learning_rate"]["start_lr"] = 1e-5
         self.share_fitting = True
+        if not paddle.device.is_compiled_with_cuda():
+            self.FLAGS_use_stride_kernel = paddle.get_flags("FLAGS_use_stride_kernel")[
+                "FLAGS_use_stride_kernel"
+            ]
+            paddle.set_flags({"FLAGS_use_stride_kernel": False})

     def tearDown(self) -> None:
+        if not paddle.device.is_compiled_with_cuda():
+            paddle.set_flags({"FLAGS_use_stride_kernel": self.FLAGS_use_stride_kernel})
         MultiTaskTrainTest.tearDown(self)

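The setUp/tearDown additions follow a save-then-restore pattern so the flag change never leaks into other tests. A condensed, generic version of that pattern (illustrative sketch; the real tests additionally guard on `paddle.device.is_compiled_with_cuda()`):

```python
import unittest

import paddle


class FlagIsolatedTest(unittest.TestCase):
    def setUp(self) -> None:
        # Remember the current flag value so other tests are unaffected.
        self._old_flag = paddle.get_flags("FLAGS_use_stride_kernel")[
            "FLAGS_use_stride_kernel"
        ]
        paddle.set_flags({"FLAGS_use_stride_kernel": False})

    def tearDown(self) -> None:
        # Restore exactly what was set before this test ran.
        paddle.set_flags({"FLAGS_use_stride_kernel": self._old_flag})
```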
