m132: AMDGPUCodeGenPrepare vector scalarizer composes the m103 i64 sdiv INT32_MIN / -1 narrowing per lane
Discovery method: code inspection. Sibling shape to m103 -- this is
the vector version: a v2i64 (or wider) sdiv with a per-lane
INT32_MIN numerator and a divisor splat of -1 triggers the same
i64-narrowing miscompile on each affected lane.
amdgpu/third_party/llvm-project/llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp:1488-1520
scalarizes vector div/rem operations:

```cpp
if (auto *VT = dyn_cast<FixedVectorType>(Ty)) {
  NewDiv = PoisonValue::get(VT);
  for (unsigned N = 0, E = VT->getNumElements(); N != E; ++N) {
    Value *NumEltN = Builder.CreateExtractElement(Num, N);
    Value *DenEltN = Builder.CreateExtractElement(Den, N);
    Value *NewElt;
    if (ScalarSize <= 32) {
      NewElt = expandDivRem32(Builder, I, NumEltN, DenEltN);
      ...
    } else {
      NewElt = shrinkDivRem64(Builder, I, NumEltN, DenEltN); // <-- composes m103
      if (!NewElt) {
        NewElt = Builder.CreateBinOp(Opc, NumEltN, DenEltN);
        if (auto *NewEltBO = dyn_cast<BinaryOperator>(NewElt))
          Div64ToExpand.push_back(NewEltBO);
      }
    }
    ...
    NewDiv = Builder.CreateInsertElement(NewDiv, NewElt, N);
  }
}
```

shrinkDivRem64 (line 1343) then performs the per-lane
getDivNumBits > 32 shrink that is the exact m103 bug. When
the divisor is a constant splat such as <i64 -1, i64 -1>,
divHasSpecialOptimization bails out of the IR-level expansion,
but the per-lane scalar sdiv i64 %elt, -1 survives into SDAG
where LowerSDIVREM
(AMDGPUISelLowering.cpp:2415-2430) applies the same buggy
narrowing per element.
For a lane whose numerator is sext(INT32_MIN) (33 sign bits) and
whose divisor is -1 (64 sign bits), both ComputeNumSignBits > 32
gates pass. The narrowed i32 op sdiv 0x80000000, -1 overflows (undefined behavior in LLVM IR);
lowering wraps to 0x80000000; the outer SIGN_EXTEND yields
0xFFFFFFFF_80000000. The well-defined i64 result is +2^31
(0x00000000_80000000).
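
The arithmetic above can be replayed on the host. This is a minimal sketch, not LLVM code: `reference_sdiv64` and `narrowed_sdiv64` are hypothetical names, and the i32 overflow is modeled as the two's-complement wrap to 0x80000000 that the paragraph above reports for the lowering.

```cpp
#include <cstdint>

// Reference: full-width i64 signed division. Callers must avoid a zero
// divisor and (INT64_MIN, -1), which are undefined in C++ as well.
constexpr int64_t reference_sdiv64(int64_t num, int64_t den) {
    return num / den;
}

// Model of the narrowed path (hypothetical helper, not LLVM code):
// truncate both operands to i32, divide, then sign-extend the quotient.
// INT32_MIN / -1 is undefined in C++, so that single case is modeled
// explicitly as the wrapped value 0x80000000 described in the text.
constexpr int64_t narrowed_sdiv64(int64_t num, int64_t den) {
    const int32_t n32 = static_cast<int32_t>(num);
    const int32_t d32 = static_cast<int32_t>(den);
    const int32_t q32 =
        (n32 == INT32_MIN && d32 == -1) ? INT32_MIN : n32 / d32;
    return static_cast<int64_t>(q32);  // the outer SIGN_EXTEND
}

constexpr int64_t kNum = static_cast<int64_t>(INT32_MIN);  // sext(INT32_MIN)

// Correct i64 quotient is +2^31.
static_assert(reference_sdiv64(kNum, -1) == INT64_C(0x0000000080000000),
              "reference quotient");
// Narrowed path sign-extends the wrapped i32 quotient instead.
static_assert(static_cast<uint64_t>(narrowed_sdiv64(kNum, -1)) ==
                  UINT64_C(0xFFFFFFFF80000000),
              "narrowed quotient");
// A lane such as sext(100) is unaffected.
static_assert(narrowed_sdiv64(100, -1) == reference_sdiv64(100, -1),
              "ordinary lane");
```

The `static_assert` lines are checked at compile time, so the block only builds if the two quotients differ exactly as described.
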
A v2i64 sdiv whose lanes are each sext from i32 and whose divisor
is a literal -1 splat:

```c
#include <stdint.h>

typedef int64_t v2i64 __attribute__((ext_vector_type(2)));

v2i64 bug(int32_t a, int32_t b) {
  v2i64 num = (v2i64){(int64_t)a, (int64_t)b};
  return num / (v2i64)(int64_t)-1;  // a == INT32_MIN -> lane 0 wrong
}
```

ComputeNumSignBits sees 33 on lane 0 (sext(INT32_MIN)) and 64 on the
splat -1, both gates pass per-lane, narrowing fires per-lane.
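
The sign-bit counts quoted above can be checked with a small host-side stand-in. This is a sketch under assumptions: `num_sign_bits` and `lane_passes_gate` are hypothetical helpers that operate on known constants, whereas LLVM's ComputeNumSignBits analyzes IR values and may return a conservative lower bound.

```cpp
#include <cstdint>

// Number of leading bits equal to the sign bit, counting the sign bit
// itself, for a known i64 constant.
constexpr unsigned num_sign_bits(int64_t v) {
    uint64_t u = static_cast<uint64_t>(v);
    if (v < 0) u = ~u;  // flip negatives so the leading run is zeros
    unsigned n = 0;
    for (uint64_t mask = UINT64_C(1) << 63; mask != 0 && (u & mask) == 0;
         mask >>= 1)
        ++n;
    return n;
}

// The "> 32" gate as described in the text, applied to a single lane.
constexpr bool lane_passes_gate(int64_t num, int64_t den) {
    return num_sign_bits(num) > 32 && num_sign_bits(den) > 32;
}

static_assert(num_sign_bits(static_cast<int64_t>(INT32_MIN)) == 33,
              "sext(INT32_MIN) has 33 sign bits");
static_assert(num_sign_bits(-1) == 64, "-1 has 64 sign bits");
// Both lanes of the example pass the gate, including the overflowing one.
static_assert(lane_passes_gate(static_cast<int64_t>(INT32_MIN), -1),
              "lane 0 passes");
static_assert(lane_passes_gate(100, -1), "lane 1 passes");
```

The gate cannot tell lane 0 from lane 1: both operands fit in i32, and only the quotient of lane 0 does not.
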
reduced.ll builds a v2i64 numerator with lane 0 = sext(INT32_MIN)
and lane 1 = sext(100), divides by literal <i64 -1, i64 -1>, and
stores the low and high halves of lane 0's quotient.
`bash known-miscompiles/run_ll_reproducer.sh known-miscompiles/m132-codegenprepare-vector-sdiv-int32min-narrowing/reduced.ll`:

```
[0] input=0x80000000 O0=0x80000000 O2=0x80000000 mismatch=false
[1] input=0x00000064 O0=0xffffffff O2=0x00000000 mismatch=true
any_mismatch=true
```

- Index `[0]` is the low32 of lane-0 `q` -- `0x80000000` either way: both the buggy narrowed lowering and the correct `0 - x` fold produce the same low half for INT32_MIN.
- Index `[1]` is the high32 of lane-0 `q` -- `0xFFFFFFFF` at O0 (buggy `SIGN_EXTEND` of the narrowed i32 sdiv) vs `0x00000000` at O2 (correct `sub i64 0, x` after InstCombine's `sdiv x, -1` fold).

True i64 result on the host: q[0] = +2147483648 = 0x00000000_80000000.

- At O2, InstCombine sees the vector `sdiv <2 x i64> %num, splat(-1)` and folds it to `sub <2 x i64> zeroinitializer, %num` before AMDGPUCodeGenPrepare scalarizes. No sdiv survives -- no narrowing.
- At O0, the literal-divisor `sdiv` reaches AMDGPUCodeGenPrepare intact. The scalarizer at 1488-1520 splits per-lane, then defers each `sdiv i64 %elt, -1` to SDAG (via `divHasSpecialOptimization` bail-out at line 1346). SDAG `LowerSDIVREM` then applies the m103 narrowing per scalar lane, mis-lowering lane 0.

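
To see why the folded form is safe and why only the high half differs, the following sketch models `sub i64 0, x` with unsigned arithmetic and splits the quotient into the two 32-bit words that the reproducer stores. The helper names are hypothetical, and the buggy constant is the value reported for O0 above, not something this code computes through LLVM.

```cpp
#include <cstdint>

// Model of `sub i64 0, x`; unsigned arithmetic keeps the wrap-around
// well-defined in C++.
constexpr uint64_t neg_wrap64(int64_t x) {
    return UINT64_C(0) - static_cast<uint64_t>(x);
}

// Low and high 32-bit words of a 64-bit quotient.
constexpr uint32_t lo32(uint64_t q) { return static_cast<uint32_t>(q); }
constexpr uint32_t hi32(uint64_t q) { return static_cast<uint32_t>(q >> 32); }

constexpr int64_t kLane0 = static_cast<int64_t>(INT32_MIN);  // sext(INT32_MIN)
constexpr uint64_t kFolded = neg_wrap64(kLane0);             // O2 path
constexpr uint64_t kBuggy = UINT64_C(0xFFFFFFFF80000000);    // O0 value from the text

// Index [0]: the low halves agree, so it cannot expose the bug.
static_assert(lo32(kFolded) == 0x80000000u && lo32(kBuggy) == 0x80000000u,
              "low halves match");
// Index [1]: the high halves differ.
static_assert(hi32(kFolded) == 0x00000000u, "correct high half");
static_assert(hi32(kBuggy) == 0xFFFFFFFFu, "buggy high half");
```

Any oracle for this shape therefore has to compare the high word (or the full 64-bit value); comparing only the low word reports no mismatch.
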
AMDGPUCodeGenPrepare runs in every codegen pipeline (both -O0
and -O2). The vector scalarizer (1488-1520) fires for any
v{2,3,4,...}i64 div/rem. Any source emitting a vector i64 sdiv
whose lanes can include (INT32_MIN, -1) after sign-extension --
e.g. HIP code dividing a long2/long4 of small ints by a small
constant divisor that simplifies to -1 -- gets a buggy lane.
The m103 NOTES.md already describes the scalar case. This entry
documents that the IR-level vector scalarizer in
AMDGPUCodeGenPrepare.cpp:1488-1520 compounds the bug: it manufactures
the per-lane scalar shape that m103 then mishandles, and it never
checks whether any single lane's (num, den) would be
(INT32_MIN, -1).
Fix m103 at its sources (AMDGPUISelLowering.cpp:2415-2430 and
AMDGPUCodeGenPrepare.cpp:1343-1372). The scalarizer at
1488-1520 is correct on its own -- it just propagates whatever
shrinkDivRem64 / SDAG LowerSDIVREM does. Tightening the m103
gate from > 32 to > 33, or adding the !(isPowerOfTwo(LHS) && isAllOnes(RHS)) exclusion described in m103's NOTES, fixes both the
scalar and the vector cases.
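
The two gate options can be compared on concrete values. This is a rough sketch with hypothetical helpers: `gate_tightened` assumes the `> 33` threshold applies to both operands, and `gate_with_exclusion` checks the concrete pair (INT32_MIN, -1) as a stand-in for the exclusion described in m103's NOTES rather than reproducing that check.

```cpp
#include <cstdint>

// Leading bits equal to the sign bit (host-side stand-in for
// ComputeNumSignBits on a known constant).
constexpr unsigned num_sign_bits(int64_t v) {
    uint64_t u = static_cast<uint64_t>(v);
    if (v < 0) u = ~u;
    unsigned n = 0;
    for (uint64_t mask = UINT64_C(1) << 63; mask != 0 && (u & mask) == 0;
         mask >>= 1)
        ++n;
    return n;
}

// Current gate as described in the text.
constexpr bool gate_current(int64_t num, int64_t den) {
    return num_sign_bits(num) > 32 && num_sign_bits(den) > 32;
}

// Option 1: tighten the threshold to > 33 (assumed for both operands).
constexpr bool gate_tightened(int64_t num, int64_t den) {
    return num_sign_bits(num) > 33 && num_sign_bits(den) > 33;
}

// Option 2: keep > 32 but reject the single overflowing pair.
constexpr bool gate_with_exclusion(int64_t num, int64_t den) {
    return gate_current(num, den) && !(num == INT32_MIN && den == -1);
}

constexpr int64_t kMin = static_cast<int64_t>(INT32_MIN);

static_assert(gate_current(kMin, -1), "current gate admits the bad pair");
static_assert(!gate_tightened(kMin, -1), "option 1 rejects it");
static_assert(!gate_with_exclusion(kMin, -1), "option 2 rejects it");
// Option 1 is conservative: it also stops narrowing for INT32_MAX / 3.
static_assert(!gate_tightened(INT32_MAX, 3) && gate_with_exclusion(INT32_MAX, 3),
              "options differ on 33-sign-bit operands");
```

Under these assumptions the `> 33` threshold is the simpler change but gives up narrowing for every operand with exactly 33 sign bits, while the exclusion keeps those cases and rejects only the overflowing pair.
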
| Toolchain | Result |
|---|---|
| LLVM HEAD with the local PR patches (`build/llvm-fuzzer`) | Reproduces (`q[0].hi = 0xFFFFFFFF` at O0, `0x00000000` at O2). |
| ROCm 7.1.1 (`/opt/rocm-7.1.1/lib/llvm/bin/clang-20`) | Same buggy lowering -- both `LowerSDIVREM` and the vector scalarizer have been in tree for many releases. |

- The current FuzzX i64 sdiv emitter rarely pairs a v2i64 numerator with lane 0 = `sext(INT32_MIN)` and a literal splat divisor of `-1`. Per `MEMORY.md` (Prefer-random-over-idioms), the right hook is to enrich the i32-constant pool with `INT32_MIN` and `-1` and let the random emitter pick them as lane fillers for a `sext`-fed v{2,4}i64 sdiv with a constant-splat divisor.
- Until m103 itself is fixed, both O0 and O2 pipelines apply the buggy narrowing for the non-literal divisor form (e.g. `sdiv <2 x i64> %num, %splat-of-loaded-(-1)`), so the O0-vs-O2 oracle collapses for that shape -- only the literal-`-1` divisor exposes the differential.