m132: `AMDGPUCodeGenPrepare` vector scalarizer composes the m103 i64 sdiv `INT32_MIN / -1` narrowing per lane

Discovery method: code inspection. Sibling shape to m103 -- this is the vector version: a v2i64 (or wider) sdiv with a per-lane INT32_MIN numerator and a divisor splat of -1 triggers the same i64-narrowing miscompile on each affected lane.

The bug

amdgpu/third_party/llvm-project/llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp:1488-1520 scalarizes vector div/rem operations:

if (auto *VT = dyn_cast<FixedVectorType>(Ty)) {
  NewDiv = PoisonValue::get(VT);

  for (unsigned N = 0, E = VT->getNumElements(); N != E; ++N) {
    Value *NumEltN = Builder.CreateExtractElement(Num, N);
    Value *DenEltN = Builder.CreateExtractElement(Den, N);

    Value *NewElt;
    if (ScalarSize <= 32) {
      NewElt = expandDivRem32(Builder, I, NumEltN, DenEltN);
      ...
    } else {
      NewElt = shrinkDivRem64(Builder, I, NumEltN, DenEltN);   // <-- composes m103
      if (!NewElt) {
        NewElt = Builder.CreateBinOp(Opc, NumEltN, DenEltN);
        if (auto *NewEltBO = dyn_cast<BinaryOperator>(NewElt))
          Div64ToExpand.push_back(NewEltBO);
      }
    }
    ...
    NewDiv = Builder.CreateInsertElement(NewDiv, NewElt, N);
  }
}

shrinkDivRem64 (line 1343) then performs the per-lane getDivNumBits > 32 shrink that is the exact m103 bug. When the divisor is a constant splat such as <i64 -1, i64 -1>, divHasSpecialOptimization bails out of the IR-level expansion, but the per-lane scalar sdiv i64 %elt, -1 survives into SDAG where LowerSDIVREM (AMDGPUISelLowering.cpp:2415-2430) applies the same buggy narrowing per element.

For a lane whose numerator is sext(INT32_MIN) (33 sign bits) and whose divisor is -1 (64 sign bits), both ComputeNumSignBits > 32 gates pass. The narrowed i32 operation sdiv 0x80000000, -1 overflows, which is undefined behavior in LLVM IR rather than a defined value; in practice the lowering wraps to 0x80000000, and the outer SIGN_EXTEND then yields 0xFFFFFFFF_80000000. The well-defined i64 result is +2^31 (0x00000000_80000000).
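The arithmetic in this paragraph can be checked on concrete values with a small host-side model. This is only a sketch: `numSignBits` and `narrowedSdivModel` are hypothetical helper names, not LLVM APIs. The real ComputeNumSignBits returns a lower bound for possibly unknown values, whereas this helper counts sign bits of a known constant, and the wraparound for the overflowing i32 division is modeled explicitly because that is the behavior these notes report.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: number of leading bits equal to the sign bit of a
// concrete 64-bit value (illustration only, not LLVM's ComputeNumSignBits).
int numSignBits(int64_t v) {
  uint64_t u = static_cast<uint64_t>(v);
  uint64_t sign = u >> 63;
  int n = 1; // the sign bit itself
  for (int bit = 62; bit >= 0 && ((u >> bit) & 1) == sign; --bit)
    ++n;
  return n;
}

// Hypothetical model of the narrowed path: truncate both operands to i32,
// divide with two's-complement wraparound for the -1 divisor, then
// sign-extend the 32-bit quotient back to 64 bits.
int64_t narrowedSdivModel(int64_t num, int64_t den) {
  int32_t n32 = static_cast<int32_t>(num);
  int32_t d32 = static_cast<int32_t>(den);
  assert(d32 != 0 && "division by zero is outside this model");
  int32_t q32;
  if (d32 == -1)
    // Negate in unsigned arithmetic so INT32_MIN wraps instead of being UB.
    q32 = static_cast<int32_t>(0u - static_cast<uint32_t>(n32));
  else
    q32 = n32 / d32;
  return static_cast<int64_t>(q32); // models the outer SIGN_EXTEND
}
```

Under this model, `narrowedSdivModel(INT32_MIN, -1)` comes back as -2147483648 (0xFFFFFFFF_80000000), while the plain 64-bit division gives +2147483648.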

How the buggy shape arises

A v2i64 sdiv whose lanes are each sext from i32 and whose divisor is a literal -1 splat:

#include <stdint.h>
typedef int64_t v2i64 __attribute__((ext_vector_type(2)));   // Clang vector extension
v2i64 bug(int32_t a, int32_t b) {
  v2i64 num = (v2i64){(int64_t)a, (int64_t)b};   // each lane is a sext from i32
  return num / (v2i64)(int64_t)-1;   // splat of -1; a == INT32_MIN -> lane 0 wrong
}

ComputeNumSignBits sees 33 on lane 0 (sext(INT32_MIN)) and 64 on the splat -1, both gates pass per-lane, narrowing fires per-lane.

Reproducer

reduced.ll builds a v2i64 numerator with lane 0 = sext(INT32_MIN) and lane 1 = sext(100), divides by literal <i64 -1, i64 -1>, and stores the low and high halves of lane 0's quotient.

bash known-miscompiles/run_ll_reproducer.sh known-miscompiles/m132-codegenprepare-vector-sdiv-int32min-narrowing/reduced.ll:

[0] input=0x80000000 O0=0x80000000 O2=0x80000000 mismatch=false
[1] input=0x00000064 O0=0xffffffff O2=0x00000000 mismatch=true
any_mismatch=true

Index [0] is the low32 of lane-0 q -- 0x80000000 either way: both the buggy narrowed lowering and the correct 0 - x fold produce the same low half for INT32_MIN.
Index [1] is the high32 of lane-0 q -- 0xFFFFFFFF at O0 (buggy SIGN_EXTEND of the narrowed i32 sdiv) vs 0x00000000 at O2 (correct sub i64 0, x after InstCombine's sdiv x, -1 fold).

True i64 result on the host: q[0] = +2147483648 = 0x00000000_80000000.
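To make the two index-[1] readings concrete, the following sketch reassembles lane 0's 64-bit quotient from the stored low and high words. `joinHalves` and `checkReportedQuotients` are hypothetical names introduced for illustration; they are not part of the reproducer script.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: rebuild a signed 64-bit value from its low and high
// 32-bit words, as stored by the reproducer for lane 0's quotient.
int64_t joinHalves(uint32_t lo, uint32_t hi) {
  uint64_t bits = (static_cast<uint64_t>(hi) << 32) | lo;
  // Modular conversion to signed (guaranteed since C++20).
  return static_cast<int64_t>(bits);
}

// Checks the two high-word values from the reproducer output against the
// quotients they imply.
void checkReportedQuotients() {
  // O0: high word 0xFFFFFFFF, i.e. the sign-extended narrowed result.
  assert(joinHalves(0x80000000u, 0xFFFFFFFFu) == -2147483648LL);
  // O2: high word 0x00000000, i.e. the true quotient +2^31.
  assert(joinHalves(0x80000000u, 0x00000000u) == 2147483648LL);
}
```

The low word is identical in both cases, which is why only index [1] shows a mismatch.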

Why O0 vs O2 cleanly mismatches

At O2, InstCombine sees the vector sdiv <2 x i64> %num, splat(-1) and folds it to sub <2 x i64> zeroinitializer, %num before AMDGPUCodeGenPrepare scalarizes. No sdiv survives -- no narrowing.
At O0, the literal-divisor sdiv reaches AMDGPUCodeGenPrepare intact. The scalarizer at 1488-1520 splits per-lane, then defers each sdiv i64 %elt, -1 to SDAG (via divHasSpecialOptimization bail-out at line 1346). SDAG LowerSDIVREM then applies the m103 narrowing per scalar lane, mis-lowering lane 0.
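The two paths can be modeled per lane on the host. This is a sketch under stated assumptions: `V2i64`, `foldedNegate`, and `narrowedPerLane` are hypothetical names, and the O0 model hard-codes the wraparound that these notes describe for the narrowed i32 division rather than deriving it from the actual lowering.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using V2i64 = std::array<int64_t, 2>; // hypothetical stand-in for <2 x i64>

// O2 model: sdiv x, -1 folded to sub 0, x, evaluated in full 64 bits.
V2i64 foldedNegate(const V2i64 &num) {
  V2i64 q{};
  for (std::size_t i = 0; i < num.size(); ++i)
    q[i] = static_cast<int64_t>(0 - static_cast<uint64_t>(num[i]));
  return q;
}

// O0 model: each lane is divided by -1 in 32 bits with wraparound, then the
// 32-bit quotient is sign-extended back to 64 bits.
V2i64 narrowedPerLane(const V2i64 &num) {
  V2i64 q{};
  for (std::size_t i = 0; i < num.size(); ++i) {
    // The narrowing gate only fires for lanes that fit in i32.
    assert(num[i] >= INT32_MIN && num[i] <= INT32_MAX);
    uint32_t n32 = static_cast<uint32_t>(num[i]);    // truncate to 32 bits
    int32_t q32 = static_cast<int32_t>(0u - n32);    // wrapping negate
    q[i] = static_cast<int64_t>(q32);                // sign-extend
  }
  return q;
}
```

For the reproducer's lanes {INT32_MIN, 100}, the two models agree on lane 1 (-100) and disagree on lane 0 (+2147483648 versus -2147483648), matching the single differing lane in the O0-vs-O2 comparison.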

Why this matters in the default pipeline

AMDGPUCodeGenPrepare runs in every codegen pipeline (both -O0 and -O2). The vector scalarizer (1488-1520) fires for any v{2,3,4,...}i64 div/rem. Any source emitting a vector i64 sdiv whose lanes can include (INT32_MIN, -1) after sign-extension -- e.g. HIP code dividing a long2/long4 of small ints by a small constant divisor that simplifies to -1 -- gets a buggy lane.

The m103 NOTES.md already describes the scalar case. This entry documents that the IR-level vector scalarizer in AMDGPUCodeGenPrepare.cpp:1488-1520 compounds the bug: it manufactures the per-lane scalar shape that m103 then mishandles, and it never checks whether any single lane's (num, den) would be (INT32_MIN, -1).

Suggested fix

Fix m103 at its sources (AMDGPUISelLowering.cpp:2415-2430 and AMDGPUCodeGenPrepare.cpp:1343-1372). The scalarizer at 1488-1520 is correct on its own -- it just propagates whatever shrinkDivRem64 / SDAG LowerSDIVREM does. Tightening the m103 gate from > 32 to > 33, or adding the !(isPowerOfTwo(LHS) && isAllOnes(RHS)) exclusion described in m103's NOTES, fixes both the scalar and the vector cases.
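The effect of moving the threshold from > 32 to > 33 can be illustrated on concrete operands. This is a minimal sketch, not the LLVM gate itself: `numSignBits` and `mayNarrow` are hypothetical helpers operating on known constants, whereas the real code queries ComputeNumSignBits on possibly unknown values.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: sign-bit count of a concrete 64-bit value.
int numSignBits(int64_t v) {
  uint64_t u = static_cast<uint64_t>(v);
  uint64_t sign = u >> 63;
  int n = 1;
  for (int bit = 62; bit >= 0 && ((u >> bit) & 1) == sign; --bit)
    ++n;
  return n;
}

// Hypothetical gate: narrow the i64 division to i32 only when both operands
// have strictly more than minSignBits sign bits.
bool mayNarrow(int64_t num, int64_t den, int minSignBits) {
  assert(minSignBits >= 1 && minSignBits < 64);
  return numSignBits(num) > minSignBits && numSignBits(den) > minSignBits;
}
```

With a threshold of 32, `mayNarrow(INT32_MIN, -1, 32)` is true, so the problematic pair is narrowed. With 33 it is false, because sext(INT32_MIN) has exactly 33 sign bits. Operands with at least 34 sign bits lie in [-2^30, 2^30 - 1], so their quotient always fits in i32. The cost of the tighter gate is that other 33-sign-bit numerators, such as INT32_MAX, would also stop being narrowed.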

Toolchain Results

| Toolchain | Result |
| --- | --- |
| LLVM HEAD with the local PR patches (`build/llvm-fuzzer`) | Reproduces (`q[0].hi = 0xFFFFFFFF` at O0, `0x00000000` at O2). |
| ROCm 7.1.1 (`/opt/rocm-7.1.1/lib/llvm/bin/clang-20`) | Same buggy lowering -- both `LowerSDIVREM` and the vector scalarizer have been in tree for many releases. |

Why the fuzzer hasn't caught it

The current FuzzX i64 sdiv emitter rarely pairs a v2i64 numerator with lane 0 = sext(INT32_MIN) and a literal splat divisor of -1. Per MEMORY.md (Prefer-random-over-idioms), the right hook is to enrich the i32-constant pool with INT32_MIN and -1 and let the random emitter pick them as lane fillers for a sext-fed v{2,4}i64 sdiv with a constant-splat divisor.
Until m103 itself is fixed, both O0 and O2 pipelines apply the buggy narrowing for the non-literal divisor form (e.g. sdiv <2 x i64> %num, %splat-of-loaded-(-1)), so the O0-vs-O2 oracle collapses for that shape -- only the literal -1 divisor exposes the differential.
