Commit 1a73c37
fix(simd_half): preserve MXCSR across F16C cast batches (codex P2)
Per codex review on PR #183: `cast_f32_to_f16_batch_f16c` and
`cast_f16_to_f32_batch_f16c` use F16C intrinsics that can raise
FP exceptions (#O / #U / #P / #I / #D) on edge inputs — setting
bits in the MXCSR status word. The scalar reference paths
(`F16::to_f32`, `F16::from_f32_rounded`) are pure bit
manipulation and never touch MXCSR, so the F16C fast path was
introducing observable FP control-state side effects.
Codex's proposed fix (`_mm256_cvtps_ph::<8>`, i.e. bit 3 set for
`_MM_FROUND_NO_EXC`) does not apply here: the Rust stdarch
intrinsic enforces `static_assert_uimm_bits!(IMM8, 3)`, so IMM8
is constrained to `0..=7`, and the underlying VCVTPS2PH IMM8
encoding has no SAE bit. Bits 1:0 pick the rounding mode, bit 2
selects MXCSR.RC instead of the immediate, and bit 3 carries no
meaning for this instruction (NO_EXC is an AVX-512 convention).
The useful IMM8 values for F16C `_mm256_cvtps_ph` are therefore
0..=3 (the four explicit rounding modes) and 4
(`_MM_FROUND_CUR_DIRECTION`, i.e. use MXCSR.RC).
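
To make the immediate concrete, here is a minimal sketch of an 8-lane downcast that passes an explicit rounding mode as the const generic. The function names `cast8_f32_to_f16_bits` and `cast8_f16c` are hypothetical and are not the helpers touched by this commit; the sketch assumes an x86_64 target and returns `None` where F16C is unavailable.

```rust
/// Convert eight f32 lanes to IEEE binary16 bit patterns with F16C,
/// rounding to nearest-even. Returns `None` when F16C is not available.
/// (Hypothetical helper for illustration; not the commit's code.)
pub fn cast8_f32_to_f16_bits(src: &[f32; 8]) -> Option<[u16; 8]> {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx") && is_x86_feature_detected!("f16c") {
            // SAFETY: the required CPU features were detected just above.
            return Some(unsafe { cast8_f16c(src) });
        }
    }
    None
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx,f16c")]
unsafe fn cast8_f16c(src: &[f32; 8]) -> [u16; 8] {
    use std::arch::x86_64::{
        __m128i, _mm256_cvtps_ph, _mm256_loadu_ps, _mm_storeu_si128,
        _MM_FROUND_TO_NEAREST_INT,
    };
    let mut out = [0u16; 8];
    unsafe {
        let v = _mm256_loadu_ps(src.as_ptr());
        // The immediate is 0 here (round to nearest-even). Passing 8
        // (`_MM_FROUND_NO_EXC`) would be rejected at compile time by the
        // intrinsic's 3-bit immediate check.
        let h = _mm256_cvtps_ph::<_MM_FROUND_TO_NEAREST_INT>(v);
        _mm_storeu_si128(out.as_mut_ptr() as *mut __m128i, h);
    }
    out
}
```

With the default (masked) MXCSR, out-of-range inputs such as `1e30` saturate to the binary16 infinity pattern and leave a sticky overflow flag behind, which is the side effect the commit addresses.
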
The actual fix: save MXCSR via STMXCSR before the SIMD region and
restore it via LDMXCSR after. This restores every bit of the
original control/status word: rounding mode, exception masks,
flush-to-zero, and, importantly, the exception flag bits, so any
flags the SIMD path set are discarded. Net effect: callers observe
no MXCSR change vs. the scalar path.
Implementation uses inline `asm!(stmxcsr/ldmxcsr)` rather than
`_mm_getcsr` / `_mm_setcsr` because those wrappers are deprecated
in `std::arch` (still the case on stable Rust 1.95): the compiler
assumes the default floating-point environment and does not model
MXCSR accesses made through them, and the deprecation note itself
says to use inline asm instead. Two ops per batch call: one
STMXCSR save at entry, one LDMXCSR restore at exit. Cost: ~5
cycles total, dwarfed by even a single 8-lane cvtps_ph chunk.
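
The save/restore pattern described above can be sketched as a small RAII guard. This is an illustration under stated assumptions, not the commit's actual code: `read_mxcsr`, `write_mxcsr`, `MxcsrGuard`, `inexact_div` and `guarded_inexact_div` are hypothetical names, and a scalar division that raises the precision flag stands in for the F16C batch loop.

```rust
use std::hint::black_box;

/// Bits 0..=5 of MXCSR are the sticky exception flags (IE, DE, ZE, OE, UE, PE).
pub const MXCSR_FLAGS: u32 = 0x3F;
/// Bit 5 is the precision (inexact) flag.
pub const MXCSR_PE: u32 = 1 << 5;

/// Read MXCSR with STMXCSR (hypothetical helper).
#[cfg(target_arch = "x86_64")]
pub fn read_mxcsr() -> u32 {
    let mut csr: u32 = 0;
    let p: *mut u32 = &mut csr;
    // SAFETY: STMXCSR only stores the 32-bit register to the given address.
    unsafe {
        std::arch::asm!("stmxcsr dword ptr [{p}]", p = in(reg) p,
                        options(nostack, preserves_flags));
    }
    csr
}

/// Write MXCSR with LDMXCSR (hypothetical helper).
///
/// # Safety
/// `csr` must come from `read_mxcsr()`, changing at most the flag bits;
/// setting reserved bits faults.
#[cfg(target_arch = "x86_64")]
pub unsafe fn write_mxcsr(csr: u32) {
    let p: *const u32 = &csr;
    // No `preserves_flags` here: this block may change MXCSR flag bits.
    unsafe {
        std::arch::asm!("ldmxcsr dword ptr [{p}]", p = in(reg) p,
                        options(nostack, readonly));
    }
}

// Fallbacks so the sketch still compiles on other targets.
#[cfg(not(target_arch = "x86_64"))]
pub fn read_mxcsr() -> u32 { 0 }
#[cfg(not(target_arch = "x86_64"))]
pub unsafe fn write_mxcsr(_csr: u32) {}

/// Saves MXCSR on construction and restores it on drop.
pub struct MxcsrGuard(u32);

impl MxcsrGuard {
    pub fn save() -> Self {
        MxcsrGuard(read_mxcsr())
    }
}

impl Drop for MxcsrGuard {
    fn drop(&mut self) {
        // SAFETY: the value was read from MXCSR by `save`.
        unsafe { write_mxcsr(self.0) }
    }
}

/// 1/3 is not exactly representable, so the division raises the precision flag.
pub fn inexact_div() -> f32 {
    black_box(black_box(1.0f32) / black_box(3.0f32))
}

/// Stand-in for a SIMD batch: run flag-raising work under the guard.
pub fn guarded_inexact_div() -> f32 {
    let _guard = MxcsrGuard::save();
    inexact_div()
    // `_guard` drops here, after the result is computed: MXCSR is restored.
}
```

A guard keeps the restore on every exit path, including early returns. Whether the real implementation uses a guard or two explicit calls is not stated in the commit message.
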
New test `f16c_cast_preserves_mxcsr` exercises the fix:
constructs input arrays containing 1e30 / -1e30 (overflow #O),
1e-30 (underflow / denormal #U / #D / #P), 1.0/3.0 (precision
#P), NaN, Inf, ±0, 1.0 — values designed to trigger every
relevant F16C exception. Snapshots MXCSR before, runs the cast,
snapshots after, asserts byte-equal. Same check for the upcast
direction with SNaN-encoded F16 inputs that trigger #I/#D in
`_mm256_cvtph_ps`. Both pass on this host (F16C + avx2 silicon).
Note: this fix does NOT prevent traps from firing on hosts where
the caller has unmasked FP exceptions before calling us. Trap
behaviour is the same as for any plain `a + b` of f32 that
overflows — fires from the SIMD ops themselves, not under our
control. Default MXCSR has all exception masks set (the
process-startup state on Linux/macOS/Windows), so this is the
common case and traps don't fire there.
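
As a rough illustration of the masked-by-default point, the helper below decodes the six exception mask bits of an MXCSR value. The name `unmasked_exceptions` is hypothetical. The bit positions (masks in bits 7..=12) and the default value `0x1F80` follow the documented MXCSR layout.

```rust
/// Default MXCSR value: all six exception masks set, flags clear,
/// round-to-nearest.
pub const MXCSR_DEFAULT: u32 = 0x1F80;

/// Mask bits IM, DM, ZM, OM, UM, PM live in bits 7..=12, in this order.
const MASK_NAMES: [&str; 6] = [
    "invalid",
    "denormal",
    "divide-by-zero",
    "overflow",
    "underflow",
    "precision",
];

/// Names of the exceptions that are unmasked in `csr`, i.e. those that
/// would trap instead of only setting a sticky flag.
/// (Hypothetical helper for illustration.)
pub fn unmasked_exceptions(csr: u32) -> Vec<&'static str> {
    let mut out = Vec::new();
    for (i, name) in MASK_NAMES.iter().enumerate() {
        // A cleared mask bit means the exception is unmasked.
        if csr & (1u32 << (7 + i)) == 0 {
            out.push(*name);
        }
    }
    out
}
```

For the default value the list is empty, so no cast can trap. A caller that clears bit 10 (OM) before calling in would see an overflowing lane trap inside the conversion, before any restore runs.
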
Verification:
* 22 simd_half tests pass (was 21 before, +1 new MXCSR-
preservation test).
* Full lib sweep: 2087 tests pass.
* cargo clippy -- -D warnings clean (no deprecation warning
from _mm_getcsr / _mm_setcsr — we use inline asm instead).
* cargo fmt --all --check clean.
https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u1
Parent: 5074048
1 file changed
Lines changed: 121 additions & 4 deletions