
Commit 73a1aa6

fix(pd): adapting code for hardware compatibility (#5047)
This pull request updates PaddlePaddle dependencies to nightly builds and refactors device handling throughout the codebase to support more flexible device selection (including XPU), improves compatibility, and adds better flag management for tests. The most significant changes are grouped below. **Dependency Updates:** * Updated PaddlePaddle and PaddlePaddle-GPU installation in `.github/workflows/test_python.yml` and `.github/workflows/test_cuda.yml` to use nightly builds (`3.3.0.dev20251204`) from new package URLs, improving access to the latest features and fixes. [[1]](diffhunk://#diff-6d72d7142742932ca8c930aa674e12e3cf6c528566c88ab43f5cdb3169075f2fL50-R50) [[2]](diffhunk://#diff-896a3c7f7278514eba7e2573f009d6739b3de6d4e1ece4b756e04cdbe5c3f3caL35-R35) **Device Handling Refactor:** * Replaced usage of `paddle.device.cuda.device_count()` and related CUDA-specific APIs with more general `paddle.device.device_count()` and `paddle.device.empty_cache()` in multiple locations, enabling support for devices beyond CUDA (e.g., XPU). [[1]](diffhunk://#diff-e3f56cd14511cf86a0db88d6d9ee5b08cf45374edfdef0625a0f519d94c58507L217-R217) [[2]](diffhunk://#diff-c42cc453489450e30747781035e34ce592843893004b24481df3802b4fd6fa34L39-R39) [[3]](diffhunk://#diff-c42cc453489450e30747781035e34ce592843893004b24481df3802b4fd6fa34L54-R54) [[4]](diffhunk://#diff-e678abb052b278f8a479f8d13b839a9ec0effd9923478a850bc13758f918e1e9L32-R35) [[5]](diffhunk://#diff-03ca05b7d964e1dd8ec22a81aff2d76b61b9f9b36111e384f177a04cc5a02f1eL9-R10) * Updated logic for setting and retrieving device information in `deepmd/pd/utils/env.py` to use `paddle.device.get_device()`, ensuring correct device assignment for both CPU and GPU/XPU scenarios. **Device Compatibility Improvements:** * Enhanced `get_generator` in `deepmd/pd/utils/utils.py` to support XPU devices and added a warning for unsupported device types, improving compatibility and error messaging. **Test Flag Management:** * Added explicit management of the `FLAGS_use_stride_kernel` Paddle flag in `source/tests/pd/test_multitask.py` to ensure proper test isolation and restore flag values after tests. [[1]](diffhunk://#diff-ad724907bbb8b6260857768d8f1fc7f0f2122b6b86c010efaf66f22f87c4170dR236-R242) [[2]](diffhunk://#diff-ad724907bbb8b6260857768d8f1fc7f0f2122b6b86c010efaf66f22f87c4170dR280-R288) * Set `FLAGS_use_stride_compute_kernel` environment variable to `0` in workflow files to control kernel usage during tests. [[1]](diffhunk://#diff-6d72d7142742932ca8c930aa674e12e3cf6c528566c88ab43f5cdb3169075f2fR64) [[2]](diffhunk://#diff-896a3c7f7278514eba7e2573f009d6739b3de6d4e1ece4b756e04cdbe5c3f3caR63) **Distributed Training Check:** * Improved NCCL initialization check in distributed training setup to handle cases where NCCL is not compiled, preventing assertion errors. <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit ## Release Notes * **New Features** * Added support for XPU device accelerators. * **Refactor** * Improved device detection and initialization to support non-CUDA backends and multiple device types with device-agnostic APIs. * **Chores** * Updated testing workflows and dependencies. * Enhanced test compatibility for non-GPU environments. <sub>✏️ Tip: You can customize this high-level summary in your review settings.</sub> <!-- end of auto-generated comment: release notes by coderabbit.ai -->
1 parent 0ad7cbf commit 73a1aa6

8 files changed

Lines changed: 43 additions & 10 deletions

File tree

* .github/workflows/test_cuda.yml
* .github/workflows/test_python.yml
* deepmd/pd/entrypoints/main.py
* deepmd/pd/utils/auto_batch_size.py
* deepmd/pd/utils/env.py
* deepmd/pd/utils/utils.py
* source/tests/pd/conftest.py
* source/tests/pd/test_multitask.py

.github/workflows/test_cuda.yml

Lines changed: 2 additions & 1 deletion
@@ -47,7 +47,7 @@ jobs:
       - run: |
           export PYTORCH_ROOT=$(python -c 'import torch;print(torch.__path__[0])')
           export TENSORFLOW_ROOT=$(python -c 'import importlib.util,pathlib;print(pathlib.Path(importlib.util.find_spec("tensorflow").origin).parent)')
-          pip install "paddlepaddle-gpu==3.0.0" -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
+          pip install --find-links "https://www.paddlepaddle.org.cn/packages/nightly/cu126/paddlepaddle-gpu/" --index-url https://pypi.org/simple "paddlepaddle-gpu==3.3.0.dev20251204"
           source/install/uv_with_retry.sh pip install --system -v -e .[gpu,test,lmp,cu12,torch,jax] mpi4py --reinstall-package deepmd-kit
         env:
           DP_VARIANT: cuda
@@ -61,6 +61,7 @@ jobs:
           # See https://jax.readthedocs.io/en/latest/gpu_memory_allocation.html
           XLA_PYTHON_CLIENT_PREALLOCATE: false
           XLA_PYTHON_CLIENT_ALLOCATOR: platform
+          FLAGS_use_stride_compute_kernel: 0
       - name: Convert models
         run: source/tests/infer/convert-models.sh
       - run: |

.github/workflows/test_python.yml

Lines changed: 2 additions & 1 deletion
@@ -32,7 +32,7 @@ jobs:
           export TENSORFLOW_ROOT=$(python -c 'import importlib.util,pathlib;print(pathlib.Path(importlib.util.find_spec("tensorflow").origin).parent)')
           export PYTORCH_ROOT=$(python -c 'import torch;print(torch.__path__[0])')
           source/install/uv_with_retry.sh pip install --system -e .[test,jax] mpi4py --group pin_jax
-          source/install/uv_with_retry.sh pip install --system --pre "paddlepaddle==3.0.0" -i https://www.paddlepaddle.org.cn/packages/stable/cpu/
+          source/install/uv_with_retry.sh pip install --system --find-links "https://www.paddlepaddle.org.cn/packages/nightly/cpu/paddlepaddle/" --index-url https://pypi.org/simple paddlepaddle==3.3.0.dev20251204
         env:
           # Please note that uv has some issues with finding
           # existing TensorFlow package. Currently, it uses
@@ -60,6 +60,7 @@ jobs:
       - run: pytest --cov=deepmd source/tests --splits 6 --group ${{ matrix.group }} --store-durations --clean-durations --durations-path=.test_durations --splitting-algorithm least_duration
         env:
           NUM_WORKERS: 0
+          FLAGS_use_stride_compute_kernel: 0
       - name: Test TF2 eager mode
         run: pytest --cov=deepmd --cov-append source/tests/consistent/io/test_io.py source/jax2tf_tests
         env:

deepmd/pd/entrypoints/main.py

Lines changed: 2 additions & 2 deletions
@@ -95,7 +95,7 @@ def get_trainer(
     # Initialize DDP
     world_size = dist.get_world_size()
     if world_size > 1:
-        assert paddle.version.nccl() != "0"
+        assert not paddle.core.is_compiled_with_nccl() or paddle.version.nccl() != "0"
         fleet.init(is_collective=True)


     def prepare_trainer_input_single(
@@ -214,7 +214,7 @@ def get_compute_device(self) -> str:

     def get_ngpus(self) -> int:
         """Get the number of GPUs."""
-        return paddle.device.cuda.device_count()
+        return paddle.device.device_count()

     def get_backend_info(self) -> dict:
         """Get backend information."""

deepmd/pd/utils/auto_batch_size.py

Lines changed: 2 additions & 2 deletions
@@ -36,7 +36,7 @@ def is_gpu_available(self) -> bool:
         bool
             True if GPU is available
         """
-        return paddle.device.cuda.device_count() > 0
+        return paddle.device.device_count() > 0

     def is_oom_error(self, e: Exception) -> bool:
         """Check if the exception is an OOM error.
@@ -51,6 +51,6 @@ def is_oom_error(self, e: Exception) -> bool:
         # (the meaningless error message should be considered as a bug in cusolver)
         if isinstance(e, MemoryError) and ("ResourceExhaustedError" in e.args[0]):
             # Release all unoccupied cached memory
-            paddle.device.cuda.empty_cache()
+            paddle.device.empty_cache()
             return True
         return False
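The two calls changed above are the device-agnostic counterparts of the old CUDA-only APIs: `paddle.device.device_count()` counts whatever accelerator the installed Paddle build targets, and `paddle.device.empty_cache()` releases its cached memory. A hedged sketch of how such checks could drive an OOM retry loop (the helper `run_with_oom_retry` is hypothetical, not DeePMD's actual auto-batch-size logic):

```python
import paddle


def run_with_oom_retry(fn, batch_size: int):
    """Hypothetical helper: halve the batch size whenever an OOM-style error occurs."""
    while batch_size >= 1:
        try:
            return fn(batch_size)
        except MemoryError as e:
            if "ResourceExhaustedError" not in str(e):
                raise
            if paddle.device.device_count() > 0:
                # Release unoccupied cached memory on whatever accelerator is present.
                paddle.device.empty_cache()
            batch_size //= 2
    raise RuntimeError("batch size reduced to zero without a successful run")
```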

deepmd/pd/utils/env.py

Lines changed: 2 additions & 2 deletions
@@ -29,10 +29,10 @@
 # Make sure DDP uses correct device if applicable
 LOCAL_RANK = int(os.environ.get("PADDLE_LOCAL_RANK", 0))

-if os.environ.get("DEVICE") == "cpu" or paddle.device.cuda.device_count() <= 0:
+if os.environ.get("DEVICE") == "cpu" or paddle.device.device_count() <= 0:
     DEVICE = "cpu"
 else:
-    DEVICE = f"gpu:{LOCAL_RANK}"
+    DEVICE = paddle.device.get_device()

 paddle.device.set_device(DEVICE)
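A quick way to see what the refactored selection resolves to on a given machine (a standalone check, not part of the commit; it assumes `paddle.device.get_device()` returns strings such as `"cpu"`, `"gpu:0"`, or `"xpu:0"`):

```python
import os

import paddle

n_dev = paddle.device.device_count()
if os.environ.get("DEVICE") == "cpu" or n_dev <= 0:
    selected = "cpu"  # forced CPU, or no accelerator visible
else:
    selected = paddle.device.get_device()  # e.g. "gpu:0" on CUDA, "xpu:0" on XPU
print(f"visible accelerators: {n_dev}, selected device: {selected}")
```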

deepmd/pd/utils/utils.py

Lines changed: 15 additions & 1 deletion
@@ -3,6 +3,7 @@
     annotations,
 )

+import warnings
 from contextlib import (
     contextmanager,
 )
@@ -345,8 +346,21 @@ def get_generator(
         generator = paddle.framework.core.default_cuda_generator(
             int(DEVICE.split("gpu:")[1])
         )
+    elif DEVICE == "xpu":
+        generator = paddle.framework.core.default_xpu_generator(0)
+    elif DEVICE.startswith("xpu:"):
+        generator = paddle.framework.core.default_xpu_generator(
+            int(DEVICE.split("xpu:")[1])
+        )
     else:
-        raise ValueError("DEVICE should be cpu or gpu or gpu:x")
+        # return None for compatibility across different devices
+        warnings.warn(
+            f"DEVICE is {DEVICE}, which is not supported. Returning None.",
+            category=UserWarning,
+            stacklevel=2,
+        )
+        return None
+        # raise ValueError("DEVICE should be cpu or gpu or gpu:x or xpu or xpu:x")
     generator.manual_seed(seed)
     return generator
 else:
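Since `get_generator` can now return `None` (with a warning) instead of raising on unsupported devices, callers should tolerate both outcomes. A hedged usage sketch; the `paddle.seed` fallback is hypothetical and not necessarily what the project does:

```python
import paddle

from deepmd.pd.utils.utils import get_generator

generator = get_generator(12345)  # seed argument; seeding already happens inside
if generator is None:
    # Unsupported accelerator: the warning was already emitted by get_generator;
    # fall back to global seeding so runs stay reproducible.
    paddle.seed(12345)
```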

source/tests/pd/conftest.py

Lines changed: 2 additions & 1 deletion
@@ -6,4 +6,5 @@
 @pytest.fixture(scope="package", autouse=True)
 def clear_cuda_memory(request):
     yield
-    paddle.device.cuda.empty_cache()
+    if paddle.device.get_device() != "cpu":
+        paddle.device.empty_cache()

source/tests/pd/test_multitask.py

Lines changed: 16 additions & 0 deletions
@@ -11,6 +11,7 @@
 )

 import numpy as np
+import paddle

 from deepmd.pd.entrypoints.main import (
     get_trainer,
@@ -232,8 +233,15 @@ def setUp(self) -> None:
         self.config["model"], self.shared_links = preprocess_shared_params(
             self.config["model"]
         )
+        if not paddle.device.is_compiled_with_cuda():
+            self.FLAGS_use_stride_kernel = paddle.get_flags("FLAGS_use_stride_kernel")[
+                "FLAGS_use_stride_kernel"
+            ]
+            paddle.set_flags({"FLAGS_use_stride_kernel": False})

     def tearDown(self) -> None:
+        if not paddle.device.is_compiled_with_cuda():
+            paddle.set_flags({"FLAGS_use_stride_kernel": self.FLAGS_use_stride_kernel})
         MultiTaskTrainTest.tearDown(self)


@@ -271,9 +279,17 @@ def setUp(self) -> None:
         self.config["model"], self.shared_links = preprocess_shared_params(
             self.config["model"]
         )
+        self.config["learning_rate"]["start_lr"] = 1e-5
         self.share_fitting = True
+        if not paddle.device.is_compiled_with_cuda():
+            self.FLAGS_use_stride_kernel = paddle.get_flags("FLAGS_use_stride_kernel")[
+                "FLAGS_use_stride_kernel"
+            ]
+            paddle.set_flags({"FLAGS_use_stride_kernel": False})

     def tearDown(self) -> None:
+        if not paddle.device.is_compiled_with_cuda():
+            paddle.set_flags({"FLAGS_use_stride_kernel": self.FLAGS_use_stride_kernel})
         MultiTaskTrainTest.tearDown(self)

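The setUp/tearDown additions follow a save-then-restore pattern so the flag change never leaks into other tests. A condensed, generic version of that pattern (illustrative sketch; the real tests additionally guard on `paddle.device.is_compiled_with_cuda()`):

```python
import unittest

import paddle


class FlagIsolatedTest(unittest.TestCase):
    def setUp(self) -> None:
        # Remember the current flag value so other tests are unaffected.
        self._old_flag = paddle.get_flags("FLAGS_use_stride_kernel")[
            "FLAGS_use_stride_kernel"
        ]
        paddle.set_flags({"FLAGS_use_stride_kernel": False})

    def tearDown(self) -> None:
        # Restore exactly what was set before this test ran.
        paddle.set_flags({"FLAGS_use_stride_kernel": self._old_flag})
```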
