Add f32::max / f32::min support via libdevice fmax/fmin #62
Open
nyoki-mtl wants to merge 1 commit into
Rust's `f32::max` / `f32::min` (and the f64 forms) lower in MIR to `core::intrinsics::maximum_number_nsz_f32` / `minimum_number_nsz_f32` (and the f64 forms). The MIR importer did not recognize those `maximum_number_nsz_*` / `minimum_number_nsz_*` names, so the call sites fell out of the rustc float-math placeholder pipeline added in 61028e6 and propagated as unresolved intrinsics into mir-lower.

Wire them in as a thin extension of the existing `RustFloatMathIntrinsic` machinery (sqrt, sin, fma, …):

* `dialect-mir::rust_intrinsics`: four new placeholder constants (`CALLEE_MAXNUM_NSZ_F{32,64}`, `CALLEE_MINNUM_NSZ_F{32,64}`).
* `mir-importer::translator::terminator::intrinsics::float_math`: four new enum variants plus `from_core_path` arms for both `core::intrinsics::*` and `std::intrinsics::*` and `placeholder_callee` mappings.
* `mir-lower::convert::ops::call`: symmetric enum variants and the libdevice symbol table: `__nv_fmaxf` / `__nv_fmax` / `__nv_fminf` / `__nv_fmin`. `arg_count = 2`.

Add unit tests on both sides that lock the `from_core_path` ↔ `placeholder_callee` ↔ `libdevice_name` chain so a future rustc rename of `maximum_number_nsz_*` (or a typo in any single matcher) surfaces as a unit-test failure rather than a runtime "intrinsic not lowered" error after a long compile cycle. The `from_core_path` test explicitly rejects the NaN-propagating `maximumf*` / `minimumf*` family (backing `f32::maximum` / `f32::minimum`), which is deferred to a follow-up PR because libdevice exposes only the maxNum / minNum semantics directly.

Add an `examples/fmaxmin_smoke/` crate that exercises the full chain end-to-end. It mirrors the `primitive_stress` shape: the libdevice auto-detection picks the `__nv_fmax*` / `__nv_fmin*` calls up and `cuda_host::ltoir::load_kernel_module` finishes the build through libNVVM + nvJitLink. The NaN argument is passed in from the host rather than embedded as `f32::NAN` in the kernel so the example stays focused on the max/min lowering and does not depend on how cuda-oxide renders NaN constants in LLVM IR.

# NaN / signed-zero semantics

`f32::max` calls the `maximum_number_nsz_f32` intrinsic, i.e. IEEE-754 maxNum with the "no signed zero" relaxation:

- if exactly one operand is NaN, the non-NaN operand is returned;
- the distinction between `-0.0` and `+0.0` may be ignored.

libdevice `__nv_fmaxf` / `__nv_fminf` implement the same maxNum / minNum NaN rule. The `-0.0` vs `+0.0` relaxation that the `_nsz` suffix grants is a *permitted* slack, not a required behavior, so routing the relaxed intrinsics to the non-relaxed libdevice entry points is correctness-preserving.

# Test results

* `cargo oxide fmt --check`: clean.
* `cargo clippy -p dialect-mir -p mir-importer -p mir-lower --lib --tests -- -D warnings`: clean.
* `cargo test -p cuda-host -p cuda-macros -p dialect-llvm -p dialect-mir -p dialect-nvvm -p mir-lower -p mir-importer --lib --tests`: 72 passed, 13 suites. The new mir-importer and mir-lower unit tests are included.
* `cargo oxide build fmaxmin_smoke`: succeeds. The generated `fmaxmin_smoke.ll` contains 4 `declare` lines and 8 `call` sites for the expected `__nv_fmax{,f}` / `__nv_fmin{,f}` symbols, the libdevice auto-detector forces NVVM IR + nvJitLink, and the pipeline reaches a cubin.

Device launch on my local host fails with CUDA driver error 209 on `primitive_stress` and on this new `fmaxmin_smoke` alike (WSL2 + CUDA 13.1 driver / 12.9 toolkit on RTX 3070 Ti / sm_86); a reviewer with a working device path is welcome to confirm the smoke prints `SUCCESS`.

Signed-off-by: nyoki-mtl <charmer.popopo@gmail.com>
Summary
`f32::max` / `f32::min` (and the f64 forms) lower in MIR to `core::intrinsics::maximum_number_nsz_f32` / `minimum_number_nsz_f32` (and f64). `RustFloatMathIntrinsic::from_core_path` did not match those four names, so the calls fell out of the rustc float-math placeholder pipeline added in 61028e6 and propagated as unresolved intrinsics into mir-lower. Wire them in as a thin extension of the existing `sqrt` / `sin` / `fma` machinery:

| Rust method | MIR intrinsic | libdevice symbol |
| --- | --- | --- |
| `f32::max` | `core::intrinsics::maximum_number_nsz_f32` | `__nv_fmaxf` |
| `f64::max` | `core::intrinsics::maximum_number_nsz_f64` | `__nv_fmax` |
| `f32::min` | `core::intrinsics::minimum_number_nsz_f32` | `__nv_fminf` |
| `f64::min` | `core::intrinsics::minimum_number_nsz_f64` | `__nv_fmin` |

Implementation
Five small edits, mirroring `sqrt` / `fma` / `fabs`:

* `dialect-mir::rust_intrinsics`: four new placeholder constants.
* `mir-importer::translator::terminator::intrinsics::float_math`: four enum variants, `from_core_path` arms (both `core::` and `std::` aliases), `placeholder_callee` mappings.
* `mir-lower::convert::ops::call`: symmetric variants, `from_placeholder_callee` arms, libdevice symbol table entries, `arg_count = 2`.
* Unit tests on both sides lock the `from_core_path` ↔ `placeholder_callee` ↔ `libdevice_name` chain so a future rustc rename of `maximum_number_nsz_*` or a typo in any single matcher surfaces as a unit-test failure rather than a runtime "intrinsic not lowered" error after a long compile cycle. The importer test also explicitly rejects the NaN-propagating `maximumf*` / `minimumf*` family (backing `f32::maximum` / `f32::minimum`) so the deferred scope is enforced.
* `examples/fmaxmin_smoke/` exercises the full chain end-to-end. It mirrors the `primitive_stress` shape: the libdevice auto-detection picks the `__nv_fmax*` / `__nv_fmin*` calls up and `cuda_host::ltoir::load_kernel_module` finishes through libNVVM + nvJitLink.
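The three-stage chain described above can be sketched in a standalone form. This is only an illustration under stated assumptions: the `FloatMinMax` enum, the `resolve` helper, and the placeholder strings are hypothetical stand-ins, not the actual `RustFloatMathIntrinsic` definitions or `CALLEE_*` constant values from the PR. Only the intrinsic paths and libdevice symbol names are taken from the description.

```rust
/// Hypothetical stand-in for the importer/lowering enum; the real type in
/// the PR is `RustFloatMathIntrinsic` and carries many more variants.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FloatMinMax {
    MaxF32,
    MaxF64,
    MinF32,
    MinF64,
}

impl FloatMinMax {
    pub const ALL: [FloatMinMax; 4] = [
        FloatMinMax::MaxF32,
        FloatMinMax::MaxF64,
        FloatMinMax::MinF32,
        FloatMinMax::MinF64,
    ];

    /// Importer side: recognize the intrinsic under both path aliases.
    pub fn from_core_path(path: &str) -> Option<Self> {
        let name = path
            .strip_prefix("core::intrinsics::")
            .or_else(|| path.strip_prefix("std::intrinsics::"))?;
        match name {
            "maximum_number_nsz_f32" => Some(Self::MaxF32),
            "maximum_number_nsz_f64" => Some(Self::MaxF64),
            "minimum_number_nsz_f32" => Some(Self::MinF32),
            "minimum_number_nsz_f64" => Some(Self::MinF64),
            // NaN-propagating `maximumf*` / `minimumf*` are deliberately rejected.
            _ => None,
        }
    }

    /// Placeholder callee name shared by importer and lowering.
    /// These strings are invented for the sketch.
    pub fn placeholder_callee(self) -> &'static str {
        match self {
            Self::MaxF32 => "placeholder_maxnum_nsz_f32",
            Self::MaxF64 => "placeholder_maxnum_nsz_f64",
            Self::MinF32 => "placeholder_minnum_nsz_f32",
            Self::MinF64 => "placeholder_minnum_nsz_f64",
        }
    }

    /// Lowering side: map the placeholder back to a variant.
    pub fn from_placeholder_callee(callee: &str) -> Option<Self> {
        Self::ALL
            .iter()
            .copied()
            .find(|v| v.placeholder_callee() == callee)
    }

    /// libdevice symbol the call is finally lowered to.
    pub fn libdevice_name(self) -> &'static str {
        match self {
            Self::MaxF32 => "__nv_fmaxf",
            Self::MaxF64 => "__nv_fmax",
            Self::MinF32 => "__nv_fminf",
            Self::MinF64 => "__nv_fmin",
        }
    }

    /// All four intrinsics are binary.
    pub fn arg_count(self) -> usize {
        2
    }
}

/// Walk the whole chain: core path -> placeholder -> libdevice symbol.
pub fn resolve(path: &str) -> Option<&'static str> {
    let imported = FloatMinMax::from_core_path(path)?;
    let lowered = FloatMinMax::from_placeholder_callee(imported.placeholder_callee())?;
    Some(lowered.libdevice_name())
}
```

A test in this style fails as soon as any one of the three matchers drifts, which is the property the PR's unit tests are meant to lock in. For example, `resolve("core::intrinsics::maximum_number_nsz_f32")` yields `Some("__nv_fmaxf")`, while `resolve("core::intrinsics::maximumf32")` yields `None`.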
NaN / signed-zero semantics
`f32::max` calls the `maximum_number_nsz_f32` intrinsic, i.e. IEEE-754 maxNum with the no signed zero relaxation: if exactly one operand is NaN the non-NaN operand is returned, and the distinction between `-0.0` and `+0.0` may be ignored.

libdevice `__nv_fmaxf` / `__nv_fminf` implement the same maxNum / minNum NaN rule. The `-0.0` vs `+0.0` relaxation that the `_nsz` suffix grants is a permitted slack, not a required behavior, so routing the relaxed intrinsics to the non-relaxed libdevice entry points is correctness-preserving.

The NaN-propagating cousins (`f32::maximum` / `f32::minimum`, backed by `core::intrinsics::maximumf32` / `minimumf32`) intentionally remain unhandled here; libdevice does not expose a NaN-propagating maximum directly, so they need a different lowering. Splitting that out keeps this PR small and reviewable.
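The semantics argument above can be checked on the host with a small reference model. This is a sketch, not code from the PR: `max_num` and `same_up_to_zero_sign` are hypothetical helpers, and the comparison runs against the host `f32::max` rather than against libdevice.

```rust
/// Hypothetical reference model of maxNum: if exactly one operand is NaN,
/// return the other; if both are NaN, the result is NaN.
pub fn max_num(a: f32, b: f32) -> f32 {
    if a.is_nan() {
        return b;
    }
    if b.is_nan() {
        return a;
    }
    if a > b { a } else { b }
}

/// Equality under the "no signed zero" relaxation: `-0.0` and `+0.0`
/// compare equal (as they do under IEEE `==`), and NaN matches NaN.
pub fn same_up_to_zero_sign(x: f32, y: f32) -> bool {
    (x.is_nan() && y.is_nan()) || x == y
}

/// Compare the reference model with the host `f32::max` on one input pair.
pub fn agrees_with_std(a: f32, b: f32) -> bool {
    same_up_to_zero_sign(max_num(a, b), a.max(b))
}
```

Because the comparison ignores the sign of zero, any implementation that picks either `-0.0` or `+0.0` for `max(-0.0, +0.0)` passes, which is the slack the `_nsz` suffix permits.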
Verification
Device launch on my local box (WSL2 + CUDA 13.1 driver / 12.9 toolkit on RTX 3070 Ti / sm_86) fails with CUDA driver error 209 on `primitive_stress` and on this new `fmaxmin_smoke` alike, which is unrelated to this PR. A reviewer with a working device path is welcome to confirm the smoke prints `SUCCESS`.

Reproduction (before this PR)
MIR (`rustc --emit=mir -O`):

Before this PR, `RustFloatMathIntrinsic::from_core_path` returns `None` for `core::intrinsics::maximum_number_nsz_f32`, so no placeholder is emitted and the call propagates unresolved through mir-lower. After this PR the same MIR lowers to:

which the existing `module_uses_libdevice` auto-detection routes through NVVM IR + nvJitLink automatically.
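For readers who want to inspect the MIR themselves, a plain Rust function of the following kind is enough to produce `f32::max` / `f32::min` call sites. This is a hypothetical input, not the source used in the PR, and whether the intrinsic name appears directly in the dumped MIR depends on the rustc version and on MIR inlining.

```rust
/// Hypothetical helper: clamps `x` into [0.0, 1.0] using `f32::max` and
/// `f32::min`. A NaN input maps to 0.0, because `max` returns the non-NaN
/// operand when exactly one operand is NaN.
#[inline(never)]
pub fn clamp_to_unit(x: f32) -> f32 {
    x.max(0.0).min(1.0)
}
```

Compiling a crate containing this function with `rustc --emit=mir -O` and searching the output for `maximum_number_nsz_f32` is one way to confirm the lowering described above on a given toolchain.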
DCO
Single commit, signed off.