Add f32::max / f32::min support via libdevice fmax/fmin by nyoki-mtl · Pull Request #62 · NVlabs/cuda-oxide

nyoki-mtl · 2026-05-14T15:46:41Z

Add `f32::max` / `f32::min` support via libdevice fmax/fmin

Summary

`f32::max` / `f32::min` (and the `f64` forms) lower in MIR to
`core::intrinsics::maximum_number_nsz_f32` / `minimum_number_nsz_f32`
(and their `f64` counterparts). `RustFloatMathIntrinsic::from_core_path` did not match those
four names, so the calls fell out of the rustc float-math placeholder
pipeline added in 61028e6 and propagated as unresolved intrinsics into
mir-lower. Wire them in as a thin extension of the existing
sqrt / sin / fma machinery:

| Public API | MIR intrinsic | libdevice |
|---|---|---|
| `f32::max` | `core::intrinsics::maximum_number_nsz_f32` | `__nv_fmaxf` |
| `f64::max` | `core::intrinsics::maximum_number_nsz_f64` | `__nv_fmax` |
| `f32::min` | `core::intrinsics::minimum_number_nsz_f32` | `__nv_fminf` |
| `f64::min` | `core::intrinsics::minimum_number_nsz_f64` | `__nv_fmin` |

Implementation

Small edits in three crates, mirroring sqrt / fma / fabs:

* `dialect-mir::rust_intrinsics`: four new placeholder constants.
* `mir-importer::translator::terminator::intrinsics::float_math`: four enum variants, `from_core_path` arms (both `core::` and `std::` aliases), and `placeholder_callee` mappings.
* `mir-lower::convert::ops::call`: symmetric variants, `from_placeholder_callee` arms, libdevice symbol table entries, and `arg_count = 2`.
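The three-crate split can be modeled as one self-contained program. This is a minimal sketch under stated assumptions, not the cuda-oxide code: `ImporterOp`, `LowerOp`, and `lower_path` are hypothetical names, and the placeholder string values are invented (only the `CALLEE_*` constant names come from the commit message). It shows the importer accepting both `core::` and `std::` paths and emitting a placeholder callee, and the lowering side mapping that callee back to a two-argument libdevice symbol.

```rust
// Hypothetical placeholder strings: the constant names come from the commit
// message, but these values are invented for illustration.
const CALLEE_MAXNUM_NSZ_F32: &str = "placeholder.maxnum_nsz_f32";
const CALLEE_MAXNUM_NSZ_F64: &str = "placeholder.maxnum_nsz_f64";
const CALLEE_MINNUM_NSZ_F32: &str = "placeholder.minnum_nsz_f32";
const CALLEE_MINNUM_NSZ_F64: &str = "placeholder.minnum_nsz_f64";

/// Importer side: hypothetical stand-in for the four new variants.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ImporterOp { MaxF32, MaxF64, MinF32, MinF64 }

impl ImporterOp {
    /// Match the MIR intrinsic path under either the `core::` or `std::` alias.
    fn from_core_path(path: &str) -> Option<Self> {
        let name = path
            .strip_prefix("core::intrinsics::")
            .or_else(|| path.strip_prefix("std::intrinsics::"))?;
        match name {
            "maximum_number_nsz_f32" => Some(Self::MaxF32),
            "maximum_number_nsz_f64" => Some(Self::MaxF64),
            "minimum_number_nsz_f32" => Some(Self::MinF32),
            "minimum_number_nsz_f64" => Some(Self::MinF64),
            // Everything else, including the deferred NaN-propagating
            // `maximumf32` / `minimumf32` family, is left unhandled.
            _ => None,
        }
    }

    /// Placeholder callee the importer emits for this op.
    fn placeholder_callee(self) -> &'static str {
        match self {
            Self::MaxF32 => CALLEE_MAXNUM_NSZ_F32,
            Self::MaxF64 => CALLEE_MAXNUM_NSZ_F64,
            Self::MinF32 => CALLEE_MINNUM_NSZ_F32,
            Self::MinF64 => CALLEE_MINNUM_NSZ_F64,
        }
    }
}

/// Lowering side: hypothetical stand-in for the symmetric variants.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum LowerOp { MaxF32, MaxF64, MinF32, MinF64 }

impl LowerOp {
    fn from_placeholder_callee(callee: &str) -> Option<Self> {
        match callee {
            CALLEE_MAXNUM_NSZ_F32 => Some(Self::MaxF32),
            CALLEE_MAXNUM_NSZ_F64 => Some(Self::MaxF64),
            CALLEE_MINNUM_NSZ_F32 => Some(Self::MinF32),
            CALLEE_MINNUM_NSZ_F64 => Some(Self::MinF64),
            _ => None,
        }
    }

    /// libdevice entry point, following the mapping table in the summary.
    fn libdevice_name(self) -> &'static str {
        match self {
            Self::MaxF32 => "__nv_fmaxf",
            Self::MaxF64 => "__nv_fmax",
            Self::MinF32 => "__nv_fminf",
            Self::MinF64 => "__nv_fmin",
        }
    }

    /// max and min are both binary.
    fn arg_count(self) -> usize { 2 }
}

/// Full chain: MIR intrinsic path -> placeholder callee -> libdevice symbol.
fn lower_path(path: &str) -> Option<&'static str> {
    let callee = ImporterOp::from_core_path(path)?.placeholder_callee();
    LowerOp::from_placeholder_callee(callee).map(LowerOp::libdevice_name)
}

fn main() {
    assert_eq!(lower_path("core::intrinsics::maximum_number_nsz_f32"), Some("__nv_fmaxf"));
    assert_eq!(lower_path("std::intrinsics::minimum_number_nsz_f64"), Some("__nv_fmin"));
    assert_eq!(lower_path("core::intrinsics::maximumf32"), None);
    assert_eq!(LowerOp::MaxF32.arg_count(), 2);
    println!("chain ok");
}
```

Because the importer returns `None` for anything outside the four `*_number_nsz_*` names, `maximumf32` never reaches the lowering side in this model, which is the boundary the unit tests are described as locking in.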

Unit tests on both sides lock the `from_core_path` ↔
`placeholder_callee` ↔ `libdevice_name` chain, so a future rustc rename
of `maximum_number_nsz_*` or a typo in any single matcher surfaces as a
unit-test failure rather than a runtime "intrinsic not lowered" error
after a long compile cycle. The importer test also explicitly rejects
the NaN-propagating `maximumf*` / `minimumf*` family (backing
`f32::maximum` / `f32::minimum`) so the deferred scope is enforced.

`examples/fmaxmin_smoke/` exercises the full chain end-to-end. It
mirrors the `primitive_stress` shape: the libdevice auto-detection
picks up the `__nv_fmax*` / `__nv_fmin*` calls, and
`cuda_host::ltoir::load_kernel_module` finishes the build through
libNVVM + nvJitLink.

NaN / signed-zero semantics

`f32::max` calls the `maximum_number_nsz_f32` intrinsic, i.e.
IEEE-754 maxNum with the "no signed zero" relaxation: if exactly
one operand is NaN, the non-NaN operand is returned, and the
distinction between `-0.0` and `+0.0` may be ignored.
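The maxNum / minNum NaN rule is observable in ordinary host Rust, which makes it a convenient reference for what the device path is expected to reproduce. This snippet runs on the CPU with the standard library only; it does not exercise cuda-oxide or libdevice.

```rust
fn main() {
    // Exactly one NaN operand: the non-NaN operand is returned.
    assert_eq!(1.0f32.max(f32::NAN), 1.0);
    assert_eq!(f32::NAN.min(-2.5f32), -2.5);
    assert_eq!(f64::NAN.max(4.0f64), 4.0);
    // Both operands NaN: the result is NaN.
    assert!(f32::NAN.max(f32::NAN).is_nan());
    assert!(f64::NAN.min(f64::NAN).is_nan());
    println!("host maxNum / minNum NaN rules hold");
}
```

The assertions deliberately avoid `-0.0` / `+0.0` pairs, since the relaxation leaves the sign of a zero result open.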

libdevice `__nv_fmaxf` / `__nv_fminf` implement the same maxNum /
minNum NaN rule. The `-0.0` vs `+0.0` relaxation that the `_nsz`
suffix grants is a *permitted* slack, not a required behavior, so
routing the relaxed intrinsics to the non-relaxed libdevice entry
points is correctness-preserving.
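One way to see why this routing is sound: treat the `_nsz` contract as a set of acceptable results, and note that a stricter implementation only ever picks one member of that set. The hypothetical checker `acceptable_max_nsz` below encodes that set for max on the host; it is an illustration, not a test that exists in the PR.

```rust
/// Hypothetical checker: is `got` an acceptable result of maxNum with the
/// no-signed-zero relaxation, for operands `a` and `b`?
fn acceptable_max_nsz(a: f32, b: f32, got: f32) -> bool {
    match (a.is_nan(), b.is_nan()) {
        // Both NaN: any NaN result is acceptable.
        (true, true) => got.is_nan(),
        // Exactly one NaN: the other operand must come back.
        (true, false) => got == b,
        (false, true) => got == a,
        // `==` treats +0.0 and -0.0 as equal, which is exactly the
        // latitude the `_nsz` suffix grants.
        (false, false) => got == if a > b { a } else { b },
    }
}

fn main() {
    // For (-0.0, +0.0), a strict and a relaxed implementation may differ;
    // both answers are accepted.
    assert!(acceptable_max_nsz(-0.0, 0.0, 0.0));
    assert!(acceptable_max_nsz(-0.0, 0.0, -0.0));
    // The NaN rule is not relaxed: returning NaN here would be wrong.
    assert!(acceptable_max_nsz(f32::NAN, 1.0, 1.0));
    assert!(!acceptable_max_nsz(f32::NAN, 1.0, f32::NAN));
    // Host `f32::max` results always land in the accepted set.
    for (a, b) in [(1.0f32, 2.0), (-0.0, 0.0), (f32::NAN, 3.0)] {
        assert!(acceptable_max_nsz(a, b, a.max(b)));
    }
}
```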

The NaN-propagating cousins (`f32::maximum` / `f32::minimum`, backed
by `core::intrinsics::maximumf32` / `minimumf32`) intentionally
remain unhandled here; libdevice does not expose a NaN-propagating
maximum directly, so they need a different lowering. Splitting that
work out keeps this PR small and reviewable.
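To illustrate why the deferred family needs its own lowering: IEEE 754-2019 `maximum` must return NaN when either operand is NaN and must order `-0.0` below `+0.0`, and maxNum guarantees neither. The host-side sketch below builds it from a NaN check, an explicit zero-sign rule, and an ordinary max. `maximum_propagating` is a hypothetical helper and not the design of any follow-up PR.

```rust
/// Hypothetical NaN-propagating maximum (IEEE 754-2019 `maximum`),
/// assembled from pieces a maxNum-only backend already has.
fn maximum_propagating(a: f32, b: f32) -> f32 {
    // Unlike maxNum, any NaN operand poisons the result.
    if a.is_nan() || b.is_nan() {
        return f32::NAN;
    }
    // Two zeros compare equal, so order them by sign: -0.0 < +0.0.
    if a == 0.0 && b == 0.0 {
        return if a.is_sign_positive() { a } else { b };
    }
    // No NaNs and no zero tie: an ordinary max gives the right value.
    a.max(b)
}

fn main() {
    // maxNum (`f32::max`) drops the NaN; `maximum` keeps it.
    assert_eq!(1.0f32.max(f32::NAN), 1.0);
    assert!(maximum_propagating(1.0, f32::NAN).is_nan());
    // Signed zeros are ordered.
    assert!(maximum_propagating(-0.0, 0.0).is_sign_positive());
    assert!(maximum_propagating(0.0, -0.0).is_sign_positive());
    assert!(maximum_propagating(-0.0, -0.0).is_sign_negative());
    assert_eq!(maximum_propagating(-3.0, 2.0), 2.0);
}
```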

Verification

* `cargo oxide fmt --check`: ✓ clean
* `cargo clippy -p dialect-mir -p mir-importer -p mir-lower --lib --tests -- -D warnings`: ✓ clean
* `cargo test -p cuda-host -p cuda-macros -p dialect-llvm -p dialect-mir -p dialect-nvvm -p mir-lower -p mir-importer --lib --tests`: ✓ 72 passed
* `cargo oxide build fmaxmin_smoke`: ✓ build succeeded; `fmaxmin_smoke.ll` contains 12 occurrences of `__nv_fmax{,f}` / `__nv_fmin{,f}` (4 declares + 8 calls)
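The 4 + 8 split can be rechecked from the emitted IR. A minimal sketch, assuming the `.ll` text has been read into a string (the exact output path of `fmaxmin_smoke.ll` is not given here): `tally` is a hypothetical helper that counts `declare` lines and call sites mentioning the four symbols. The matching is purely textual and only meant as a quick sanity check.

```rust
/// Hypothetical helper: count `declare` lines and call sites that reference
/// one of the four libdevice min/max symbols in LLVM IR text.
fn tally(ll: &str) -> (usize, usize) {
    // Matching `@name(` avoids counting unrelated symbols that share a prefix.
    const SYMBOLS: [&str; 4] = ["@__nv_fmaxf(", "@__nv_fmax(", "@__nv_fminf(", "@__nv_fmin("];
    let (mut declares, mut calls) = (0, 0);
    for line in ll.lines().map(str::trim) {
        if !SYMBOLS.iter().any(|s| line.contains(s)) {
            continue;
        }
        if line.starts_with("declare ") {
            declares += 1;
        } else if line.contains("call ") {
            calls += 1;
        }
    }
    (declares, calls)
}

fn main() {
    let sample = "declare float @__nv_fmaxf(float, float)\n\
                  declare double @__nv_fmin(double, double)\n\
                  %v13 = call float @__nv_fmaxf(float %v6, float %v7)\n\
                  %v14 = call double @__nv_fmin(double %v8, double %v9)\n";
    assert_eq!(tally(sample), (2, 2));
}
```

On the real `fmaxmin_smoke.ll`, the counts reported above would correspond to `(4, 8)`.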

Device launch on my local box (WSL2 + CUDA 13.1 driver / 12.9 toolkit
on RTX 3070 Ti / sm_86) fails with CUDA driver error 209 on both
`primitive_stress` and the new `fmaxmin_smoke`, so the failure is
unrelated to this PR. A reviewer with a working device path is welcome
to confirm that the smoke example prints `SUCCESS`.

Reproduction (before this PR)

```rust
#[unsafe(no_mangle)]
pub fn probe_max(a: f32, b: f32) -> f32 { a.max(b) }
```

MIR (`rustc --emit=mir -O`):

```
_0 = maximum_number_nsz_f32(move _1, move _2) -> [...];
```

Before this PR, `RustFloatMathIntrinsic::from_core_path` returned
`None` for `core::intrinsics::maximum_number_nsz_f32`, so no
placeholder was emitted and the call propagated unresolved through
mir-lower. After this PR the same MIR lowers to:

```llvm
declare float @__nv_fmaxf(float, float)
...
%v13 = call float @__nv_fmaxf(float %v6, float %v7)
```

which the existing `module_uses_libdevice` auto-detection routes
through NVVM IR + nvJitLink automatically.
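The declaration and call shapes in the lowered IR follow directly from `arg_count = 2` and the operand type. A hypothetical string formatter that reproduces those two lines is sketched below; `llvm_ty`, `declare_line`, and `call_line` are illustrative names, not taken from cuda-oxide, and this is not a claim about how cuda-oxide actually builds its IR.

```rust
/// Hypothetical: LLVM IR type name for an f32 / f64 operand.
fn llvm_ty(is_f64: bool) -> &'static str {
    if is_f64 { "double" } else { "float" }
}

/// `declare` line for a binary libdevice function (two operands, one result).
fn declare_line(symbol: &str, is_f64: bool) -> String {
    let ty = llvm_ty(is_f64);
    format!("declare {ty} @{symbol}({ty}, {ty})")
}

/// Call site for the same function with two SSA operands.
fn call_line(result: &str, symbol: &str, is_f64: bool, lhs: &str, rhs: &str) -> String {
    let ty = llvm_ty(is_f64);
    format!("{result} = call {ty} @{symbol}({ty} {lhs}, {ty} {rhs})")
}

fn main() {
    println!("{}", declare_line("__nv_fmaxf", false));
    println!("{}", call_line("%v13", "__nv_fmaxf", false, "%v6", "%v7"));
}
```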

DCO

Single commit, signed off.

Rust's `f32::max` / `f32::min` (and the f64 forms) lower in MIR to `core::intrinsics::maximum_number_nsz_f32` / `minimum_number_nsz_f32` (and the f64 forms). The MIR importer did not recognize those `maximum_number_nsz_*` / `minimum_number_nsz_*` names, so the call sites fell out of the rustc float-math placeholder pipeline added in 61028e6 and propagated as unresolved intrinsics into mir-lower. Wire them in as a thin extension of the existing `RustFloatMathIntrinsic` machinery (sqrt, sin, fma, …):

* `dialect-mir::rust_intrinsics`: four new placeholder constants (`CALLEE_MAXNUM_NSZ_F{32,64}`, `CALLEE_MINNUM_NSZ_F{32,64}`).
* `mir-importer::translator::terminator::intrinsics::float_math`: four new enum variants plus `from_core_path` arms for both `core::intrinsics::*` and `std::intrinsics::*` and `placeholder_callee` mappings.
* `mir-lower::convert::ops::call`: symmetric enum variants and the libdevice symbol table (`__nv_fmaxf` / `__nv_fmax` / `__nv_fminf` / `__nv_fmin`), with `arg_count = 2`.

Add unit tests on both sides that lock the `from_core_path` ↔ `placeholder_callee` ↔ `libdevice_name` chain so a future rustc rename of `maximum_number_nsz_*` (or a typo in any single matcher) surfaces as a unit-test failure rather than a runtime "intrinsic not lowered" error after a long compile cycle. The `from_core_path` test explicitly rejects the NaN-propagating `maximumf*` / `minimumf*` family (backing `f32::maximum` / `f32::minimum`), which is deferred to a follow-up PR because libdevice exposes only the maxNum / minNum semantics directly.

Add an `examples/fmaxmin_smoke/` crate that exercises the full chain end-to-end. It mirrors the `primitive_stress` shape: the libdevice auto-detection picks the `__nv_fmax*` / `__nv_fmin*` calls up and `cuda_host::ltoir::load_kernel_module` finishes the build through libNVVM + nvJitLink.
The NaN argument is passed in from the host rather than embedded as `f32::NAN` in the kernel, so the example stays focused on the max/min lowering and does not depend on how cuda-oxide renders NaN constants in LLVM IR.

# NaN / signed-zero semantics

`f32::max` calls the `maximum_number_nsz_f32` intrinsic, i.e. IEEE-754 maxNum with the "no signed zero" relaxation:

- if exactly one operand is NaN, the non-NaN operand is returned;
- the distinction between `-0.0` and `+0.0` may be ignored.

libdevice `__nv_fmaxf` / `__nv_fminf` implement the same maxNum / minNum NaN rule. The `-0.0` vs `+0.0` relaxation that the `_nsz` suffix grants is a *permitted* slack, not a required behavior, so routing the relaxed intrinsics to the non-relaxed libdevice entry points is correctness-preserving.

# Test results

* `cargo oxide fmt --check`: clean.
* `cargo clippy -p dialect-mir -p mir-importer -p mir-lower --lib --tests -- -D warnings`: clean.
* `cargo test -p cuda-host -p cuda-macros -p dialect-llvm -p dialect-mir -p dialect-nvvm -p mir-lower -p mir-importer --lib --tests`: 72 passed, 13 suites. The new mir-importer and mir-lower unit tests are included.
* `cargo oxide build fmaxmin_smoke`: succeeds. The generated `fmaxmin_smoke.ll` contains 4 `declare` lines and 8 `call` sites for the expected `__nv_fmax{,f}` / `__nv_fmin{,f}` symbols, the libdevice auto-detector forces NVVM IR + nvJitLink, and the pipeline reaches a cubin.

Device launch on my local host fails with CUDA driver error 209 on `primitive_stress` and on this new `fmaxmin_smoke` alike (WSL2 + CUDA 13.1 driver / 12.9 toolkit on RTX 3070 Ti / sm_86); a reviewer with a working device path is welcome to confirm the smoke prints `SUCCESS`.

Signed-off-by: nyoki-mtl <charmer.popopo@gmail.com>

nyoki-mtl mentioned this pull request on May 14, 2026 in #63, "Emit NaN float literals as hex bit patterns, not bare nan" (open).

nyoki-mtl wants to merge 1 commit into `NVlabs:main` from `nyoki-mtl:feat/fmaxmin-intrinsic`.
