Add f32::max / f32::min support via libdevice fmax/fmin #62

Open
nyoki-mtl wants to merge 1 commit into NVlabs:main from nyoki-mtl:feat/fmaxmin-intrinsic

Add f32::max / f32::min support via libdevice fmax/fmin

Summary

f32::max / f32::min (and the f64 forms) lower in MIR to
core::intrinsics::maximum_number_nsz_f32 / minimum_number_nsz_f32
(and f64). RustFloatMathIntrinsic::from_core_path did not match those
four names, so the calls fell out of the rustc float-math placeholder
pipeline added in 61028e6 and propagated as unresolved intrinsics into
mir-lower. Wire them in as a thin extension of the existing
sqrt / sin / fma machinery:

Public API   MIR intrinsic                               libdevice
f32::max     core::intrinsics::maximum_number_nsz_f32    __nv_fmaxf
f64::max     core::intrinsics::maximum_number_nsz_f64    __nv_fmax
f32::min     core::intrinsics::minimum_number_nsz_f32    __nv_fminf
f64::min     core::intrinsics::minimum_number_nsz_f64    __nv_fmin

Implementation

Five small edits, mirroring sqrt / fma / fabs:

  • dialect-mir::rust_intrinsics: four new placeholder constants.
  • mir-importer::translator::terminator::intrinsics::float_math:
    four enum variants, from_core_path arms (both core:: and std::
    aliases), placeholder_callee mappings.
  • mir-lower::convert::ops::call: symmetric variants,
    from_placeholder_callee arms, libdevice symbol table entries,
    arg_count = 2.

Unit tests on both sides lock the from_core_path ↔
placeholder_callee ↔ libdevice_name chain so a future rustc rename
of maximum_number_nsz_* or a typo in any single matcher surfaces as a
unit-test failure rather than a runtime "intrinsic not lowered" error
after a long compile cycle. The importer test also explicitly rejects
the NaN-propagating maximumf* / minimumf* family (backing
f32::maximum / f32::minimum) so the deferred scope is enforced.
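
The chain described above can be sketched in isolation. This is only an illustration under stated assumptions, not the code in this PR: the enum name MinMaxIntrinsic, its variant names, the resolve helper, and the placeholder strings are all hypothetical. Only the intrinsic paths and the libdevice symbols come from the table in the Summary.

```rust
/// Illustrative stand-in for the four new variants (names are hypothetical).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MinMaxIntrinsic {
    MaxF32,
    MaxF64,
    MinF32,
    MinF64,
}

impl MinMaxIntrinsic {
    const ALL: [MinMaxIntrinsic; 4] =
        [Self::MaxF32, Self::MaxF64, Self::MinF32, Self::MinF64];

    /// Importer side: accept the intrinsic under the `core::` or `std::` alias.
    pub fn from_core_path(path: &str) -> Option<Self> {
        let name = path
            .strip_prefix("core::intrinsics::")
            .or_else(|| path.strip_prefix("std::intrinsics::"))?;
        match name {
            "maximum_number_nsz_f32" => Some(Self::MaxF32),
            "maximum_number_nsz_f64" => Some(Self::MaxF64),
            "minimum_number_nsz_f32" => Some(Self::MinF32),
            "minimum_number_nsz_f64" => Some(Self::MinF64),
            // The NaN-propagating maximumf32 / minimumf32 family falls through.
            _ => None,
        }
    }

    /// Placeholder symbol passed from importer to lowering.
    /// These strings are made up for the sketch.
    pub fn placeholder_callee(self) -> &'static str {
        match self {
            Self::MaxF32 => "placeholder.maxnum_nsz.f32",
            Self::MaxF64 => "placeholder.maxnum_nsz.f64",
            Self::MinF32 => "placeholder.minnum_nsz.f32",
            Self::MinF64 => "placeholder.minnum_nsz.f64",
        }
    }

    /// Lowering side: recover the variant from the placeholder symbol.
    pub fn from_placeholder_callee(callee: &str) -> Option<Self> {
        Self::ALL
            .iter()
            .copied()
            .find(|v| v.placeholder_callee() == callee)
    }

    /// libdevice entry point for this variant.
    pub fn libdevice_name(self) -> &'static str {
        match self {
            Self::MaxF32 => "__nv_fmaxf",
            Self::MaxF64 => "__nv_fmax",
            Self::MinF32 => "__nv_fminf",
            Self::MinF64 => "__nv_fmin",
        }
    }

    /// All four intrinsics are binary.
    pub fn arg_count(self) -> usize {
        2
    }
}

/// Walks the whole chain: core path, then placeholder, then libdevice symbol.
pub fn resolve(path: &str) -> Option<&'static str> {
    let imported = MinMaxIntrinsic::from_core_path(path)?;
    let lowered =
        MinMaxIntrinsic::from_placeholder_callee(imported.placeholder_callee())?;
    Some(lowered.libdevice_name())
}
```

A test in this style would assert that resolve maps core::intrinsics::maximum_number_nsz_f32 to __nv_fmaxf and returns None for core::intrinsics::maximumf32, so a typo in any one matcher breaks the assertion.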

examples/fmaxmin_smoke/ exercises the full chain end-to-end. It
mirrors the primitive_stress shape — the libdevice auto-detection
picks the __nv_fmax* / __nv_fmin* calls up and
cuda_host::ltoir::load_kernel_module finishes through libNVVM +
nvJitLink.

NaN / signed-zero semantics

f32::max calls the maximum_number_nsz_f32 intrinsic, i.e.
IEEE-754 maxNum with the "no signed zero" relaxation: if exactly
one operand is NaN the non-NaN operand is returned, and the
distinction between -0.0 and +0.0 may be ignored.

libdevice __nv_fmaxf / __nv_fminf implement the same maxNum /
minNum NaN rule. The -0.0 vs +0.0 relaxation that the _nsz
suffix grants is a permitted slack, not a required behavior, so
routing the relaxed intrinsics to the non-relaxed libdevice entry
points is correctness-preserving.
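
The host-side behavior of f32::max / f32::min gives a quick reference for what the device lowering is expected to match. The following is a minimal sketch in plain Rust (no kernel attributes); the helper names are hypothetical. The signed-zero check compares by value only, because the sign of the result is left unspecified.

```rust
/// maxNum / minNum NaN rule: with exactly one NaN operand, the other
/// operand is returned. `x` is assumed to be non-NaN.
pub fn nan_rule_holds(x: f32) -> bool {
    let nan = f32::NAN;
    x.max(nan) == x && nan.max(x) == x && x.min(nan) == x && nan.min(x) == x
}

/// With two NaN operands the result is NaN.
pub fn both_nan_is_nan() -> bool {
    f32::NAN.max(f32::NAN).is_nan() && f32::NAN.min(f32::NAN).is_nan()
}

/// Signed zeros: the result compares equal to zero, but which sign comes
/// back is not checked, since either is allowed.
pub fn zero_result_is_zero() -> bool {
    (-0.0f32).max(0.0) == 0.0 && (-0.0f32).min(0.0) == 0.0
}
```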

The NaN-propagating cousins (f32::maximum / f32::minimum, backed
by core::intrinsics::maximumf32 / minimumf32) intentionally
remain unhandled here; libdevice does not expose a NaN-propagating
maximum directly, so they need a different lowering. Splitting that
out keeps this PR small and reviewable.
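
To make the semantic gap concrete, here is one way a NaN-propagating maximum could be expressed on top of a maxNum-style operation. This is a hypothetical sketch for illustration only, not the lowering this PR or any follow-up uses; the function name is made up.

```rust
/// NaN-propagating maximum built from a maxNum-style `f32::max`.
/// Differences from maxNum: any NaN operand yields NaN, and +0.0 is
/// treated as greater than -0.0.
pub fn maximum_via_maxnum(a: f32, b: f32) -> f32 {
    if a.is_nan() || b.is_nan() {
        return f32::NAN;
    }
    if a == b {
        // Equal values differ at most in the sign of zero: prefer the
        // operand with the positive sign bit.
        return if a.is_sign_positive() { a } else { b };
    }
    a.max(b)
}
```

The extra NaN check and the signed-zero tie-break are exactly the pieces that a plain call to __nv_fmaxf does not provide.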

Verification

cargo oxide fmt --check                                 ✓ clean
cargo clippy -p dialect-mir -p mir-importer -p mir-lower
   --lib --tests -- -D warnings                         ✓ clean
cargo test -p cuda-host -p cuda-macros -p dialect-llvm
   -p dialect-mir -p dialect-nvvm -p mir-lower
   -p mir-importer --lib --tests                        ✓ 72 passed
cargo oxide build fmaxmin_smoke                         ✓ build succeeded
                                                          12 occurrences of
                                                          `__nv_fmax{,f}` /
                                                          `__nv_fmin{,f}` in
                                                          fmaxmin_smoke.ll
                                                          (4 declares + 8 calls)
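
The declare / call counts above can be reproduced with a small text scan over the emitted IR. The helper below is a hypothetical sketch, not part of the repository; it assumes one instruction per line, as in the IR excerpt shown in the Reproduction section.

```rust
/// The four libdevice symbols, with the opening parenthesis included so
/// that `@__nv_fmax(` does not also match `@__nv_fmaxf(`.
const SYMBOLS: [&str; 4] = [
    "@__nv_fmaxf(",
    "@__nv_fmax(",
    "@__nv_fminf(",
    "@__nv_fmin(",
];

/// Returns (declare lines, call sites) that mention one of the symbols.
pub fn count_libdevice_refs(ir: &str) -> (usize, usize) {
    let mut declares = 0;
    let mut calls = 0;
    for line in ir.lines() {
        let line = line.trim_start();
        if !SYMBOLS.iter().any(|s| line.contains(s)) {
            continue;
        }
        if line.starts_with("declare ") {
            declares += 1;
        } else if line.contains("call ") {
            calls += 1;
        }
    }
    (declares, calls)
}
```

Reading fmaxmin_smoke.ll into a string (for example with std::fs::read_to_string) and passing it to count_libdevice_refs should give (4, 8) if the build matches the numbers reported above.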

Device launch on my local box (WSL2 + CUDA 13.1 driver / 12.9 toolkit
on RTX 3070 Ti / sm_86) fails with CUDA driver error 209
(CUDA_ERROR_NO_BINARY_FOR_GPU) on primitive_stress and on this new
fmaxmin_smoke alike. Because the pre-existing example fails the same
way, the failure appears unrelated to this PR. A reviewer with a
working device path is welcome to confirm the smoke prints SUCCESS.

Reproduction (before this PR)

#[unsafe(no_mangle)]
pub fn probe_max(a: f32, b: f32) -> f32 { a.max(b) }

MIR (rustc --emit=mir -O):

_0 = maximum_number_nsz_f32(move _1, move _2) -> [...];

Before this PR, RustFloatMathIntrinsic::from_core_path returns
None for core::intrinsics::maximum_number_nsz_f32, so no
placeholder is emitted and the call propagates unresolved through
mir-lower. After this PR the same MIR lowers to:

declare float @__nv_fmaxf(float, float)
...
%v13 = call float @__nv_fmaxf(float %v6, float %v7)

which the existing module_uses_libdevice auto-detection routes
through NVVM IR + nvJitLink automatically.

DCO

Single commit, signed off.

Rust's `f32::max` / `f32::min` (and the f64 forms) lower in MIR to
`core::intrinsics::maximum_number_nsz_f32` / `minimum_number_nsz_f32`
(and the f64 forms). The MIR importer did not recognize those
`maximum_number_nsz_*` / `minimum_number_nsz_*` names, so the call sites
fell out of the rustc float-math placeholder pipeline added in 61028e6
and propagated as unresolved intrinsics into mir-lower.

Wire them in as a thin extension of the existing `RustFloatMathIntrinsic`
machinery (sqrt, sin, fma, …):

  * `dialect-mir::rust_intrinsics` — four new placeholder constants
    (`CALLEE_MAXNUM_NSZ_F{32,64}`, `CALLEE_MINNUM_NSZ_F{32,64}`).
  * `mir-importer::translator::terminator::intrinsics::float_math` — four
    new enum variants plus `from_core_path` arms for both
    `core::intrinsics::*` and `std::intrinsics::*` and
    `placeholder_callee` mappings.
  * `mir-lower::convert::ops::call` — symmetric enum variants and the
    libdevice symbol table: `__nv_fmaxf` / `__nv_fmax` / `__nv_fminf`
    / `__nv_fmin`. `arg_count = 2`.

Add unit tests on both sides that lock the `from_core_path` ↔
`placeholder_callee` ↔ `libdevice_name` chain so a future rustc
rename of `maximum_number_nsz_*` (or a typo in any single matcher)
surfaces as a unit-test failure rather than a runtime "intrinsic not
lowered" error after a long compile cycle. The `from_core_path` test
explicitly rejects the NaN-propagating `maximumf*` / `minimumf*`
family (backing `f32::maximum` / `f32::minimum`), which is deferred to
a follow-up PR because libdevice exposes only the maxNum / minNum
semantics directly.

Add an `examples/fmaxmin_smoke/` crate that exercises the full chain
end-to-end. It mirrors the `primitive_stress` shape: the libdevice
auto-detection picks the `__nv_fmax*` / `__nv_fmin*` calls up and
`cuda_host::ltoir::load_kernel_module` finishes the build through
libNVVM + nvJitLink. The NaN argument is passed in from the host
rather than embedded as `f32::NAN` in the kernel so the example stays
focused on the max/min lowering and does not depend on how cuda-oxide
renders NaN constants in LLVM IR.
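
The shape of that choice can be illustrated in plain Rust. The function below is a hypothetical host-side mirror of what such a smoke check might compute, not the example's actual kernel: NaN arrives as an ordinary argument, so the body never contains a NaN literal.

```rust
/// Computes max/min for a normal pair and for pairs involving a NaN that
/// is supplied by the caller instead of being written as `f32::NAN` here.
pub fn max_min_probe(a: f32, b: f32, nan: f32) -> [f32; 4] {
    [a.max(b), a.min(b), a.max(nan), nan.min(b)]
}

/// Caller side: `black_box` keeps the NaN opaque to the optimizer, loosely
/// imitating a value that only becomes known at launch time.
pub fn run_probe() -> [f32; 4] {
    let nan = std::hint::black_box(f32::NAN);
    max_min_probe(1.0, 2.0, nan)
}
```

With the maxNum / minNum rule, the last two entries are the non-NaN operands, so run_probe is expected to return [2.0, 1.0, 1.0, 2.0].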

# NaN / signed-zero semantics

`f32::max` calls the `maximum_number_nsz_f32` intrinsic, i.e. IEEE-754
maxNum with the "no signed zero" relaxation:
  - if exactly one operand is NaN, the non-NaN operand is returned;
  - the distinction between `-0.0` and `+0.0` may be ignored.

libdevice `__nv_fmaxf` / `__nv_fminf` implement the same maxNum /
minNum NaN rule. The `-0.0` vs `+0.0` relaxation that the `_nsz`
suffix grants is a *permitted* slack, not a required behavior, so
routing the relaxed intrinsics to the non-relaxed libdevice entry
points is correctness-preserving.

# Test results

  * `cargo oxide fmt --check` — clean.
  * `cargo clippy -p dialect-mir -p mir-importer -p mir-lower
     --lib --tests -- -D warnings` — clean.
  * `cargo test -p cuda-host -p cuda-macros -p dialect-llvm
     -p dialect-mir -p dialect-nvvm -p mir-lower -p mir-importer
     --lib --tests` — 72 passed, 13 suites. The new mir-importer and
    mir-lower unit tests are included.
  * `cargo oxide build fmaxmin_smoke` — succeeds. The generated
    `fmaxmin_smoke.ll` contains 4 `declare` lines and 8 `call` sites
    for the expected `__nv_fmax{,f}` / `__nv_fmin{,f}` symbols, the
    libdevice auto-detector forces NVVM IR + nvJitLink, and the
    pipeline reaches a cubin.

Device launch on my local host fails with CUDA driver error 209 on
`primitive_stress` and on this new `fmaxmin_smoke` alike (WSL2 + CUDA
13.1 driver / 12.9 toolkit on RTX 3070 Ti / sm_86); a reviewer with a
working device path is welcome to confirm the smoke prints `SUCCESS`.

Signed-off-by: nyoki-mtl <charmer.popopo@gmail.com>