Change rounding method in quantization process (triton backend fix for amd gpus)#54
Change rounding method in quantization process (triton backend fix for amd gpus)#54patientx wants to merge 1 commit into
Conversation
fails on ROCm:
AttributeError("module 'triton.language.extra.hip.libdevice' has no attribute 'rint'")
The generic from triton.language.extra import libdevice import resolves to triton.language.extra.hip.libdevice on AMD GPUs, which doesn't expose rint. This is the same gap as triton-lang#7135, where libdevice.round is declared but throws the same AttributeError when actually compiled on HIP — the documented workaround there is the same swap used below.
Repro: Load an INT8/ConvRot-quantized checkpoint with the native UNETLoader on an AMD GPU with the triton backend enabled (COMFY_KITCHEN_BACKEND=triton or equivalent). Fails on first call into int8_linear → triton_quantize_rowwise.
tl.floor(x + 0.5) is portable across both CUDA and HIP backends (no branching needed), and the rounding difference vs. rint's round-half-to-even is negligible for INT8 quantization scales.
Confirmed against current main (v0.2.12) — this is the only occurrence of libdevice.rint/libdevice.round in the codebase, so this single-line change should resolve ROCm compatibility for this path.
|
✅ All contributors have signed the CLA. Thank you! This PR is ready to be merged. |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: ASSERTIVE Plan: Pro Plus Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughThe Triton INT8 row-wise quantization kernel now rounds ChangesINT8 quantization rounding
🚥 Pre-merge checks | ✅ 2✅ Passed checks (2 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
✨ Simplify code
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
I have read and agree to the Contributor License Agreement |
|
Just loading a model converted to int8-convrot without this patch results in this log from a failed run : |
|
this PR #56 for eager works more reliably on amd rdna2 although triton still looks faster |
upgrading from |
there was a reason I kept using 3.6.0.post25 and I remembered that reason after trying the 3.7.0, yes this lets me use triton backend without a PATCH but it is still slower than node route -tried 8 gens on both- BUT more importantly ... That version + using native loader straight up stops generation in like 3rd or 4th time and after a few seconds whole PC freezes, I monitor vram etc. no temp raising, no gpu usage maxing etc. It just kills the comfy & windows. Only a hard reset solves it. |
fails on ROCm:
AttributeError("module 'triton.language.extra.hip.libdevice' has no attribute 'rint'") The generic from triton.language.extra import libdevice import resolves to triton.language.extra.hip.libdevice on AMD GPUs, which doesn't expose rint. This is the same gap as triton-lang#7135, where libdevice.round is declared but throws the same AttributeError when actually compiled on HIP — the documented workaround there is the same swap used below. Repro: Load an INT8/ConvRot-quantized checkpoint with the native UNETLoader on an AMD GPU with the triton backend enabled (COMFY_KITCHEN_BACKEND=triton or equivalent). Fails on first call into int8_linear → triton_quantize_rowwise.
tl.floor(x + 0.5) is portable across both CUDA and HIP backends (no branching needed), and the rounding difference vs. rint's round-half-to-even is negligible for INT8 quantization scales. Confirmed against current main (v0.2.12) — this is the only occurrence of libdevice.rint/libdevice.round in the codebase, so this single-line change should resolve ROCm compatibility for this path.