Skip to content

feat(rocke): make torch optional, not load-bearing#8891

Draft
shumway wants to merge 1 commit into
developfrom
users/shumway/rocke-remove-torch
Draft

feat(rocke): make torch optional, not load-bearing#8891
shumway wants to merge 1 commit into
developfrom
users/shumway/rocke-remove-torch

Conversation

@shumway

@shumway shumway commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

Motivation

rocKE is consumed downstream by PyTorch, so depending on torch is circular. torch was already absent from declared deps (pyproject pins only numpy) and the IR/emission path was torch-free, but it remained load-bearing at runtime and in the test suite: tests/run_all.py was RED (33 failed) without torch. This makes the torch-free suite GREEN.

Technical details

Runtime resolver (hip_module.py, comgr.py):

  • The accidental dependency: rocKE relied on import torch to put ROCm's wheel-bundled libamd_comgr.so / libamdhip64.so on the loader path. A library must not import torch for a side effect. _candidate_lib_paths now discovers
    ROCm without torch: env override -> torch-bundled-if-already-imported ->
    $ROCM_PATH/$ROCM_HOME -> globbed /opt/rocm*/core-/lib & /opt/rocm/lib (newest first) -> bare soname. Eliminates 16 cannot load failures.
  • Fix latent bug in resolved_lib_rocm_version(): on packaged ROCm 7.2 it read a component .info/version (7.13) instead of the ROCm release (7.2), which drives LLVM-flavor selection (llvm20 vs llvm22).

Tests (test_rocke.py):

  • Gate the 17 genuinely torch-needing tests behind importorskip/skipUnless so they skip torch-free but still run in torch/GPU CI lanes (no coverage lost).
  • Add a torch-free /dev/kfd GPU guard for 3 tests the resolver fix unmasked (they reached hipModuleLoadData and hung on a GPU-less box).
  • Rewrite test_dtype_to_ir_accepts_torch_aliases to exercise the string path so it runs torch-free.

Docs (AGENTS.md, BUILD.md):

  • Document torch as OPTIONAL (the three torch-only lanes vs the torch-free majority), the 5-tier ROCm library discovery order, the cannot load libamd_comgr.so -> set ROCM_PATH fix, and the new hard rule.

Test Plan

This is need prior to enabling automatic CI, so the test are just run local.

Setup a venv python from requirements.txt, which does not include torch. Verify that python tests/run_all.py runs correctly. It previously failed with import torch in several files.

Test Results

Verified torch-free: byte-identity gate GREEN (65/65), pytest 460 passed / 0 failed, relative-path guard PASS.

JIRA ID : AICK-1518

@therock-pr-bot

therock-pr-bot Bot commented Jun 29, 2026

Copy link
Copy Markdown

✅ All Checks Passed — Ready for Review

Check Status Details
🌿 Branch Name ✅ Pass
📝 PR Title/Description ✅ Pass
Forbidden Files ✅ Pass
🧪 Unit Test ✅ Pass
🔎 pre-commit ✅ Pass
🚫 Draft PR 🔜 To Be Enabled
🚩 Feature Flag 🔜 To Be Enabled
📊 Code Coverage 🔜 To Be Enabled
🤖 therock-pr-bot ✅ Pass

🎉 All checks passed! This PR is ready for review.

📖 Need help? See the Policy FAQ for details on every check and how to fix failures.

@therock-pr-bot

therock-pr-bot Bot commented Jun 29, 2026

Copy link
Copy Markdown

🎉 All checks passed! This PR is ready for review.

rocKE is consumed downstream by PyTorch, so depending on `torch` is
circular. torch was already absent from declared deps (pyproject pins only
numpy) and the IR/emission path was torch-free, but it remained load-bearing
at runtime and in the test suite: `tests/run_all.py` was RED (33 failed)
without torch. This makes the torch-free suite GREEN.

Runtime resolver (hip_module.py, comgr.py):
- The accidental dependency: rocKE relied on `import torch` to put ROCm's
  wheel-bundled libamd_comgr.so / libamdhip64.so on the loader path. A library
  must not import torch for a side effect. `_candidate_lib_paths` now discovers
  ROCm without torch: env override -> torch-bundled-if-already-imported ->
  $ROCM_PATH/$ROCM_HOME -> globbed /opt/rocm*/core-*/lib & /opt/rocm*/lib
  (newest first) -> bare soname. Eliminates 16 `cannot load` failures.
- Fix latent bug in resolved_lib_rocm_version(): on packaged ROCm 7.2 it read
  a component .info/version (7.13) instead of the ROCm release (7.2), which
  drives LLVM-flavor selection (llvm20 vs llvm22).

Tests (test_rocke.py):
- Gate the 17 genuinely torch-needing tests behind importorskip/skipUnless so
  they skip torch-free but still run in torch/GPU CI lanes (no coverage lost).
- Add a torch-free /dev/kfd GPU guard for 3 tests the resolver fix unmasked
  (they reached hipModuleLoadData and hung on a GPU-less box).
- Rewrite test_dtype_to_ir_accepts_torch_aliases to exercise the string path
  so it runs torch-free.

Docs (AGENTS.md, BUILD.md):
- Document torch as OPTIONAL (the three torch-only lanes vs the torch-free
  majority), the 5-tier ROCm library discovery order, the
  `cannot load libamd_comgr.so` -> set ROCM_PATH fix, and the new hard rule.

Verified torch-free: byte-identity gate GREEN (65/65), pytest 460 passed /
0 failed, relative-path guard PASS.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@shumway shumway force-pushed the users/shumway/rocke-remove-torch branch from 71b5dba to 399dcb1 Compare July 2, 2026 00:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant