feat(rocke): make torch optional, not load-bearing#8891
Draft
shumway wants to merge 1 commit into
Draft
Conversation
✅ All Checks Passed — Ready for Review
📖 Need help? See the Policy FAQ for details on every check and how to fix failures. |
|
🎉 All checks passed! This PR is ready for review. |
rocKE is consumed downstream by PyTorch, so depending on `torch` is circular. torch was already absent from declared deps (pyproject pins only numpy) and the IR/emission path was torch-free, but it remained load-bearing at runtime and in the test suite: `tests/run_all.py` was RED (33 failed) without torch. This makes the torch-free suite GREEN. Runtime resolver (hip_module.py, comgr.py): - The accidental dependency: rocKE relied on `import torch` to put ROCm's wheel-bundled libamd_comgr.so / libamdhip64.so on the loader path. A library must not import torch for a side effect. `_candidate_lib_paths` now discovers ROCm without torch: env override -> torch-bundled-if-already-imported -> $ROCM_PATH/$ROCM_HOME -> globbed /opt/rocm*/core-*/lib & /opt/rocm*/lib (newest first) -> bare soname. Eliminates 16 `cannot load` failures. - Fix latent bug in resolved_lib_rocm_version(): on packaged ROCm 7.2 it read a component .info/version (7.13) instead of the ROCm release (7.2), which drives LLVM-flavor selection (llvm20 vs llvm22). Tests (test_rocke.py): - Gate the 17 genuinely torch-needing tests behind importorskip/skipUnless so they skip torch-free but still run in torch/GPU CI lanes (no coverage lost). - Add a torch-free /dev/kfd GPU guard for 3 tests the resolver fix unmasked (they reached hipModuleLoadData and hung on a GPU-less box). - Rewrite test_dtype_to_ir_accepts_torch_aliases to exercise the string path so it runs torch-free. Docs (AGENTS.md, BUILD.md): - Document torch as OPTIONAL (the three torch-only lanes vs the torch-free majority), the 5-tier ROCm library discovery order, the `cannot load libamd_comgr.so` -> set ROCM_PATH fix, and the new hard rule. Verified torch-free: byte-identity gate GREEN (65/65), pytest 460 passed / 0 failed, relative-path guard PASS. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
71b5dba to
399dcb1
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Motivation
rocKE is consumed downstream by PyTorch, so depending on
torchis circular. torch was already absent from declared deps (pyproject pins only numpy) and the IR/emission path was torch-free, but it remained load-bearing at runtime and in the test suite:tests/run_all.pywas RED (33 failed) without torch. This makes the torch-free suite GREEN.Technical details
Runtime resolver (hip_module.py, comgr.py):
import torchto put ROCm's wheel-bundled libamd_comgr.so / libamdhip64.so on the loader path. A library must not import torch for a side effect._candidate_lib_pathsnow discoversROCm without torch: env override -> torch-bundled-if-already-imported ->
$ROCM_PATH/$ROCM_HOME -> globbed /opt/rocm*/core-/lib & /opt/rocm/lib (newest first) -> bare soname. Eliminates 16
cannot loadfailures.Tests (test_rocke.py):
Docs (AGENTS.md, BUILD.md):
cannot load libamd_comgr.so-> set ROCM_PATH fix, and the new hard rule.Test Plan
This is need prior to enabling automatic CI, so the test are just run local.
Setup a venv python from requirements.txt, which does not include
torch. Verify thatpython tests/run_all.pyruns correctly. It previously failed withimport torchin several files.Test Results
Verified torch-free: byte-identity gate GREEN (65/65), pytest 460 passed / 0 failed, relative-path guard PASS.
JIRA ID : AICK-1518