Commit e679b8d

wanghan-iapcm and Han Wang authored
fix(pt_expt): let deepmd.pt import errors propagate in comm op check (#5474)
## Summary

`_check_underlying_ops_loaded()` in `deepmd/pt_expt/utils/comm.py` wraps `import deepmd.pt` in a blanket `except Exception: pass`, then falls through to a generic `RuntimeError` telling the user to build `libdeepmd_op_pt.so`.

The problem: when the .so *is* built but loaded against a mismatched torch version, `import deepmd.pt` raises an `ImportError` with diagnostic detail (e.g. `undefined symbol: ...`), which is exactly the message the user needs. The current code hides it and tells them to rebuild a library that is already built.

This PR stops discarding the import error. The exception is captured and re-raised when the ops are still unregistered after the import attempt. An import failure that occurs after the ops were already registered (for example, from a broken third-party entry point) is ignored. The downstream `RuntimeError` still fires when the import succeeds but the ops are still not registered.

## Trade-off

External callers that previously caught `RuntimeError` when importing `comm.py` will now see the raw `ImportError` in the .so-mismatch case. No in-tree caller does this. The diagnostic gain outweighs the contract change.

## Test plan

- [x] Existing pt_expt tests (every consumer imports `comm.py`): happy path unchanged
- [ ] CI green

## Summary by CodeRabbit

* **Bug Fixes**
  * Improved initialization error reporting: when native registrations fail to load, the underlying import/ABI error is now preserved and surfaced instead of being masked by a generic message, making root causes clearer for troubleshooting.

---------

Co-authored-by: Han Wang <wang_han@iapcm.ac.cn>
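For external callers affected by the trade-off described in the commit message, one possible way to handle both failure modes is sketched here. The helper names `classify_load_failure` and `try_load` are hypothetical and not part of deepmd-kit. The sketch assumes the .so-mismatch case surfaces as an `ImportError`, which the commit describes as typical rather than guaranteed.

```python
from typing import Callable


def classify_load_failure(exc: Exception) -> str:
    """Map an exception from the op check to a coarse category.

    Hypothetical helper: the category strings are illustrative only.
    """
    if isinstance(exc, ImportError):
        # Typically an ABI mismatch between the built .so and the runtime torch.
        return "import-failed"
    if isinstance(exc, RuntimeError):
        # Import succeeded but the ops were never registered.
        return "ops-not-registered"
    return "unknown"


def try_load(importer: Callable[[], None]) -> str:
    """Run ``importer`` and report which failure mode occurred, if any.

    ``importer`` is a stand-in for whatever triggers the import of ``comm.py``.
    """
    try:
        importer()
    except (ImportError, RuntimeError) as exc:
        return classify_load_failure(exc)
    return "ok"
```

A caller that previously caught only `RuntimeError` would need the wider `except` clause shown here to keep catching the .so-mismatch case.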
1 parent 93f5580 commit e679b8d

1 file changed

Lines changed: 29 additions & 16 deletions

deepmd/pt_expt/utils/comm.py

@@ -50,25 +50,38 @@ def _check_underlying_ops_loaded() -> None:
     like DDP-spawned subprocesses that re-import modules from scratch
     and never see the test conftest's ``import deepmd.pt``.
     """
-    if not (
-        hasattr(torch.ops, "deepmd_export")
-        and hasattr(torch.ops.deepmd_export, "border_op")
-        and hasattr(torch.ops.deepmd_export, "border_op_backward")
-    ):
+
+    def _ops_registered() -> bool:
+        return (
+            hasattr(torch.ops, "deepmd_export")
+            and hasattr(torch.ops.deepmd_export, "border_op")
+            and hasattr(torch.ops.deepmd_export, "border_op_backward")
+        )
+
+    import_err: Exception | None = None
+    if not _ops_registered():
         # Triggers cxx_op.py which torch.ops.load_library's the .so.
         try:
             import deepmd.pt  # noqa: F401
-        except Exception:
-            # If deepmd.pt itself fails to import, fall through to the
-            # explicit RuntimeError below — clearer than re-raising a
-            # potentially-unrelated import error.
-            pass
-
-    if not (
-        hasattr(torch.ops, "deepmd_export")
-        and hasattr(torch.ops.deepmd_export, "border_op")
-        and hasattr(torch.ops.deepmd_export, "border_op_backward")
-    ):
+        except Exception as exc:
+            # ``deepmd/pt/__init__.py`` loads ``cxx_op`` (which registers
+            # the ops) before running ``load_entry_point("deepmd.pt")``.
+            # A broken third-party entry point can make the import raise
+            # *after* the ops were already registered, so only re-raise
+            # when the registration is still missing — that branch is the
+            # one where the error (typically an ``undefined symbol`` ABI
+            # mismatch against libdeepmd_op_pt.so) carries the diagnostic
+            # detail that the generic RuntimeError below would hide.
+            import_err = exc
+
+    if not _ops_registered():
+        if import_err is not None:
+            # Surface the raw import error (typically ``ImportError`` with
+            # ``undefined symbol`` ABI detail) instead of burying it in a
+            # generic message — that detail is what tells the user the
+            # mismatch is between libdeepmd_op_pt.so and the runtime torch,
+            # not a missing build.
+            raise import_err
         raise RuntimeError(
             "torch.ops.deepmd_export.{border_op,border_op_backward} "
             "are not registered. Build libdeepmd_op_pt.so and ensure "

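The control flow in this hunk can be exercised without torch or deepmd. The following is a minimal sketch, not the project's code: `check_ops_loaded`, `FakeRegistry`, and `outcome` are hypothetical stand-ins, with `ops_registered` playing the role of `_ops_registered()` and `try_import` playing the role of `import deepmd.pt`.

```python
from typing import Callable, Optional


def check_ops_loaded(
    ops_registered: Callable[[], bool],
    try_import: Callable[[], None],
) -> None:
    """Capture an import failure and re-raise it only if ops are still missing."""
    import_err: Optional[Exception] = None
    if not ops_registered():
        try:
            try_import()  # stand-in for ``import deepmd.pt``
        except Exception as exc:
            # Defer the decision: the import may have registered the ops
            # before it failed.
            import_err = exc

    if not ops_registered():
        if import_err is not None:
            raise import_err  # keep the original diagnostic detail
        raise RuntimeError("ops are not registered")


class FakeRegistry:
    """Hypothetical stand-in for the ``torch.ops`` registration state."""

    def __init__(self) -> None:
        self.loaded = False

    def registered(self) -> bool:
        return self.loaded


def outcome(registry: FakeRegistry, try_import: Callable[[], None]) -> str:
    """Return the raised exception's class name, or "ok" if nothing was raised."""
    try:
        check_ops_loaded(registry.registered, try_import)
    except Exception as exc:
        return type(exc).__name__
    return "ok"


def abi_mismatch() -> None:
    # Import fails before any op is registered.
    raise ImportError("undefined symbol (illustrative message)")


late = FakeRegistry()


def late_failure() -> None:
    # Ops get registered first, then something unrelated fails.
    late.loaded = True
    raise ImportError("broken entry point (illustrative message)")


print(outcome(FakeRegistry(), abi_mismatch))  # ImportError
print(outcome(late, late_failure))            # ok
print(outcome(FakeRegistry(), lambda: None))  # RuntimeError
```

The three calls correspond to the three branches the commit distinguishes: a failed import with no registration, a failed import after registration, and a successful import that registers nothing.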