Currently, some standard TIR math intrinsics on CUDA lower to CUDA fast-math device functions by default.
For example, `tir.exp` / `tirx.exp` on float32 lowers to `__expf(x)` instead of the precise CUDA math function `expf(x)`. This happens even when `--use_fast_math` is not passed to NVCC.
## Why this is a problem
`__expf`, `__logf`, `__sinf`, etc. are CUDA fast-math intrinsics. They trade accuracy for performance and can introduce visible precision loss in numerically sensitive kernels.

Users generally expect standard math intrinsics such as `T.exp`, `T.log`, `T.sin`, and `T.cos` to preserve normal CUDA math semantics unless fast math is explicitly requested.

Fast-math behavior should ideally be opt-in, for example through a target option, compiler flag, or an explicit fast-math intrinsic.
Standard TIR math intrinsics should lower to precise CUDA math functions by default:
| TIR op | Expected CUDA |
| --- | --- |
| `tirx.exp` | `expf` |
| `tirx.exp10` | `exp10f` |
| `tirx.log` | `logf` |
| `tirx.log2` | `log2f` |
| `tirx.log10` | `log10f` |
| `tirx.sin` | `sinf` |
| `tirx.cos` | `cosf` |
| `tirx.tan` | `tanf` |
Fast-math variants such as `__expf`, `__logf`, `__sinf`, and `__cosf` should only be emitted when fast math is explicitly enabled.
Suggested fix: use `CUDAMath` instead of `CUDAFastMath` for standard CUDA math intrinsic lowering:

```cpp
TVM_REGISTER_OP("tirx.exp")
    .set_attr<FLowerIntrinsic>("cuda.FLowerIntrinsic", DispatchPureExtern<CUDAMath>);
```
If fast-math lowering is desired, it would be better to gate it behind an explicit fast-math option rather than making it the default behavior for standard math intrinsics.
cc @Hzfengsy @junrushao @quic-sanirudh @shingjan