| 1 | +# ARM64: Spurious FixnumMult overflow side-exits due to broken smulh/cmp pattern match |
| 2 | + |
| 3 | +## Summary |
| 4 | + |
| 5 | +A bug in the ARM64 backend causes `FixnumMult` to emit spurious overflow |
| 6 | +side-exits. The JIT bails to the interpreter on nearly every function call, |
| 7 | +reducing `ratio_in_zjit` from ~70% to ~4%. The root cause is a fragile |
| 8 | +instruction pattern match in the emit pass that silently fails when a spill |
| 9 | +Store is inserted between `Mul` and `RShift` by `arm64_scratch_split`. |
| 10 | + |
| 11 | +The failure mode is **silent**: no crash, no assertion, no error. The `JoMul` |
| 12 | +conditional branch simply reads stale CPU condition flags from a prior |
| 13 | +instruction, and since those flags happen to say "not equal" in most cases, the |
| 14 | +JIT side-exits to the interpreter. This makes it a **performance cliff** that |
| 15 | +is invisible unless you check `--zjit-stats`. |
| 16 | + |
| 17 | +In theory, if stale flags happened to indicate "equal" when a real overflow |
| 18 | +occurred, the JIT would silently produce wrong results. We have not observed |
| 19 | +this in practice, but the possibility exists. |
| 20 | + |
| 21 | +## Minimal reproducer |
| 22 | + |
| 23 | +```ruby |
| 24 | +# frozen_string_literal: true |
| 25 | +# tmp/muloverflow.rb — two FixnumMult + getbyte + >>32 in a loop |
| 26 | +def repro(str) |
| 27 | + lo = 5381 |
| 28 | + hi = 0 |
| 29 | + i = 0 |
| 30 | + len = str.bytesize |
| 31 | + while i < len |
| 32 | + prod_lo = lo * 33 + str.getbyte(i) |
| 33 | + carry = prod_lo >> 32 |
| 34 | + lo = prod_lo & 0xFFFFFFFF |
| 35 | + hi = (hi * 5 + carry) & 0xFFFFFFFF |
| 36 | + i += 1 |
| 37 | + end |
| 38 | + lo |
| 39 | +end |
| 40 | + |
| 41 | +100.times { repro("hello world") } |
| 42 | +``` |
| 43 | + |
| 44 | +**Before fix:** |
| 45 | +``` |
| 46 | +$ ruby --zjit-stats tmp/muloverflow.rb 2>&1 | grep -E "mult_overflow|ratio_in_zjit" |
| 47 | + fixnum_mult_overflow: 71 (100.0%) |
| 48 | +ratio_in_zjit: 3.7% |
| 49 | +``` |
| 50 | + |
| 51 | +**After fix:** |
| 52 | +``` |
| 53 | +$ ruby --zjit-stats tmp/muloverflow.rb 2>&1 | grep -E "mult_overflow|ratio_in_zjit" |
| 54 | +side_exit_count: 0 |
| 55 | +ratio_in_zjit: 69.1% |
| 56 | +``` |
| 57 | + |
| 58 | +## Conditions required to trigger |
| 59 | + |
| 60 | +All of these must be present simultaneously: |
| 61 | + |
| 62 | +1. **Two `FixnumMult` instructions** in the same basic block |
| 63 | +2. **A cfunc call** (like `String#getbyte`) in between — this creates enough |
| 64 | + register pressure to cause the Mul output to be spilled to a stack slot |
| 65 | +3. **A right-shift by 32** (`>> 32`) — adds more live values, increasing spill |
| 66 | + pressure |
| 67 | + |
| 68 | +Removing any one of these conditions makes the bug disappear. |
| 69 | + |
| 70 | +## Root cause |
| 71 | + |
| 72 | +### ARM64 overflow detection for multiply |
| 73 | + |
| 74 | +ARM64 has no overflow flag for multiplication. The standard technique is: |
| 75 | + |
| 76 | +```asm |
| 77 | +smulh x16, x0, x1 ; signed multiply high: upper 64 bits |
| 78 | +mul x0, x0, x1 ; multiply: lower 64 bits |
| 79 | +asr x15, x0, #63 ; sign-extend the low result |
| 80 | +cmp x16, x15 ; if high bits != sign-extended low, overflow |
| 81 | +b.ne overflow_exit |
| 82 | +``` |
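The same check can be modelled in software. The sketch below is illustrative only: it works on plain 64-bit signed integers, ignores Ruby's fixnum tagging, and `mul_overflow_check` is a hypothetical name that does not exist in ZJIT.

```rust
/// Software model of the smulh/mul/asr/cmp sequence for a 64-bit signed multiply.
/// Returns the wrapped low 64 bits and whether the product overflowed i64.
fn mul_overflow_check(a: i64, b: i64) -> (i64, bool) {
    // smulh: upper 64 bits of the full 128-bit signed product
    let high = (((a as i128) * (b as i128)) >> 64) as i64;
    // mul: lower 64 bits, wrapping on overflow
    let low = a.wrapping_mul(b);
    // asr #63: all zeros or all ones, depending on the sign of the low half
    let sign = low >> 63;
    // cmp + b.ne: overflow iff the high half is not the sign extension of the low half
    (low, high != sign)
}
```

For any pair of inputs, the returned flag should agree with `a.checked_mul(b).is_none()` from the Rust standard library, which makes the model easy to cross-check.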
| 83 | + |
| 84 | +ZJIT implements this as a three-pass pipeline: |
| 85 | + |
| 86 | +1. **HIR → LIR lowering** (`codegen.rs`): Emits `Mul` + `JoMul` instructions |
| 87 | +2. **arm64_scratch_split** (`arm64/mod.rs`): Inserts `RShift` between `Mul` and |
| 88 | + `JoMul` to prepare the sign bit for the comparison |
| 89 | +3. **arm64_emit** (`arm64/mod.rs`): Pattern-matches `[Mul, RShift, JoMul]` to |
| 90 | + fuse them into the `smulh`+`mul`+`asr`+`cmp` sequence |
| 91 | + |
| 92 | +### The bug |
| 93 | + |
| 94 | +When the Mul output register is spilled (allocated to a stack slot by |
| 95 | +`alloc_regs`), `arm64_scratch_split` also inserts a `Store` instruction to |
| 96 | +write the result to the stack. The code inserts the Store **before** the |
| 97 | +RShift, producing: |
| 98 | + |
| 99 | +``` |
| 100 | +Mul x15, x15, x17 |
| 101 | +Store [x29 - 8], x15 ← spill |
| 102 | +RShift x15, x15, 63 ← sign bit extraction |
| 103 | +JoMul side_exit |
| 104 | +``` |
| 105 | + |
| 106 | +The emit pass checks `insns[idx+1]` and `insns[idx+2]` for `RShift` and |
| 107 | +`JoMul`: |
| 108 | + |
| 109 | +```rust |
| 110 | +match (insns.get(insn_idx + 1), insns.get(insn_idx + 2)) { |
| 111 | + (Some(Insn::RShift { .. }), Some(Insn::JoMul(_))) => { |
| 112 | + // emit smulh + mul + asr + cmp |
| 113 | + } |
| 114 | + _ => { |
| 115 | + mul(cb, out, left, right); // ← NO smulh, NO cmp |
| 116 | + } |
| 117 | +} |
| 118 | +``` |
| 119 | + |
| 120 | +With the Store in between, `insns[idx+1]` is `Store`, not `RShift`. The |
| 121 | +pattern match **falls through to the wildcard (`_`) arm**, which emits only `mul` |
| 122 | +without `smulh` or `cmp`. The `RShift` is then emitted as a standalone `asr`, |
| 123 | +and `JoMul` emits `b.ne` — but `cmp` was never executed, so `b.ne` reads |
| 124 | +**stale condition flags** from whatever instruction last set them. |
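The positional lookahead can be reproduced with a toy model. This is a sketch under stated assumptions: the `Insn` enum here is a hypothetical, operand-free stand-in for the real LIR type, and `mul_is_fused` only mirrors the shape of the pre-fix match rather than the actual ZJIT code.

```rust
/// Hypothetical, heavily simplified LIR; the real ZJIT `Insn` carries operands.
#[derive(Debug, Clone, PartialEq)]
enum Insn {
    Mul,
    RShift,
    Store,
    JoMul,
}

/// Mirrors the pre-fix check: only positions idx+1 and idx+2 are inspected,
/// so any instruction inserted after `Mul` defeats the match.
fn mul_is_fused(insns: &[Insn], idx: usize) -> bool {
    matches!(
        (insns.get(idx + 1), insns.get(idx + 2)),
        (Some(Insn::RShift), Some(Insn::JoMul))
    )
}
```

With `[Mul, RShift, JoMul]` the function returns `true`; with `[Mul, Store, RShift, JoMul]` it returns `false`, which corresponds to the path that emits a bare `mul`.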
| 125 | + |
| 126 | +### Consequence |
| 127 | + |
| 128 | +- **If stale flags = NE (common):** Spurious side-exit. The function drops to |
| 129 | + interpreter speed. ~20x perf regression on affected code. |
| 130 | +- **If stale flags = EQ during real overflow (rare):** Missed overflow. The |
| 131 | + mul result wraps without promoting to Bignum. **Silent wrong result.** |
| 132 | + |
| 133 | +## Fix |
| 134 | + |
| 135 | +The fix has two parts: |
| 136 | + |
| 137 | +### 1. `arm64_scratch_split`: Reorder Store to after RShift when JoMul follows |
| 138 | + |
| 139 | +When the next instruction is `JoMul`, emit the `RShift` immediately after |
| 140 | +`Mul`, and move the spill `Store` to after the `RShift`. The Store now writes |
| 141 | +from the Mul output register directly (rather than SCRATCH0, which was |
| 142 | +clobbered by the RShift): |
| 143 | + |
| 144 | +``` |
| 145 | +Mul x15, x15, x17 ← mul result in x15 |
| 146 | +RShift x15, x15, 63 ← sign bit (clobbers x15) |
| 147 | +Store [x29 - 8], x15 ← stores sign bit, NOT mul result |
| 148 | +JoMul side_exit |
| 149 | +``` |
| 150 | + |
| 151 | +Taken literally, this LIR order would store the sign bit rather than the mul |
| 152 | +result, so the emit pass must place the actual store before the `asr`. |
| 153 | + |
| 154 | +### 2. `arm64_emit`: Handle `[Mul, RShift, Store, JoMul]` pattern |
| 155 | + |
| 156 | +The emit pass now recognizes both patterns: |
| 157 | + |
| 158 | +- `[Mul, RShift, JoMul]` — original (no spill) |
| 159 | +- `[Mul, RShift, Store, JoMul]` — with spill |
| 160 | + |
| 161 | +For the spill case, the emitted ARM64 code is: |
| 162 | + |
| 163 | +```asm |
| 164 | +smulh x16, x0, x1 ; high 64 bits |
| 165 | +mul x15, x0, x1 ; low 64 bits |
| 166 | +stur x15, [x29, #-8] ; spill mul result BEFORE asr clobbers x15 |
| 167 | +asr x15, x15, #63 ; sign bit of low result |
| 168 | +cmp x16, x15 ; overflow check |
| 169 | +b.ne overflow_exit |
| 170 | +``` |
| 171 | + |
| 172 | +The `stur` is emitted between `mul` and `asr`, preserving the mul result in |
| 173 | +the spill slot before `asr` clobbers the register. The `smulh` result in X16 |
| 174 | +is safe because `stur` with a register source doesn't touch X16. |
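A minimal sketch of an emitter that accepts both shapes follows. It is a toy, not the patch itself: the operand-free `Insn` enum and `emit_mul` are hypothetical, mnemonics stand in for real encodings, and the branch is folded into the fused sequence for brevity (the write-up describes `JoMul` emitting `b.ne` itself).

```rust
/// Hypothetical, operand-free LIR used only for this illustration.
#[derive(Debug, Clone, PartialEq)]
enum Insn {
    Mul,
    RShift,
    Store,
    JoMul,
}

/// Returns the mnemonics emitted for the `Mul` at `idx` and the number of
/// LIR instructions consumed.
fn emit_mul(insns: &[Insn], idx: usize) -> (Vec<&'static str>, usize) {
    match &insns[idx..] {
        // No spill: the original fused sequence.
        [Insn::Mul, Insn::RShift, Insn::JoMul, ..] => {
            (vec!["smulh", "mul", "asr", "cmp", "b.ne"], 3)
        }
        // Spill: the store is emitted before `asr` clobbers the mul result.
        [Insn::Mul, Insn::RShift, Insn::Store, Insn::JoMul, ..] => {
            (vec!["smulh", "mul", "stur", "asr", "cmp", "b.ne"], 4)
        }
        // Plain multiply with no overflow check.
        [Insn::Mul, ..] => (vec!["mul"], 1),
        _ => panic!("emit_mul called on a non-Mul instruction"),
    }
}
```

The point of the sketch is the ordering in the spill arm: although `Store` follows `RShift` in the LIR, the emitted `stur` precedes `asr`.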
| 175 | + |
| 176 | +## Bisection table |
| 177 | + |
| 178 | +These tests were run on the pre-fix build to isolate the trigger conditions: |
| 179 | + |
| 180 | +| Variant | Overflows? | ratio_in_zjit | |
| 181 | +|---------|-----------|---------------| |
| 182 | +| `lo * 33`, no hi, no getbyte | No | 69% | |
| 183 | +| `lo * 33 + getbyte`, no hi | No | 68% | |
| 184 | +| `lo * 33 + getbyte`, `>> 32`, no hi multiply | No | 69% | |
| 185 | +| `lo * 33 + getbyte`, `>> 32`, `hi * 33 + carry` (full) | **Yes** | 3.7% | |
| 186 | +| Same but `carry` computed but unused | **Yes** | 3.7% | |
| 187 | +| Same but `(lo << 5) + lo` instead of `lo * 33` | No | 68% | |
| 188 | +| `lo * 33 + (i & 0xFF)` instead of getbyte, with hi | No | 69% | |
| 189 | +| `hi * 33 + constant` in isolation | No | 69% | |
| 190 | + |
| 191 | +## Lessons and recommendations |
| 192 | + |
| 193 | +### 1. Avoid fragile peephole pattern matching across passes |
| 194 | + |
| 195 | +The core issue is that `arm64_scratch_split` and `arm64_emit` communicate via |
| 196 | +an **implicit contract**: "RShift will be at idx+1 after Mul". When |
| 197 | +`scratch_split` inserts a Store, this contract is silently violated. Neither |
| 198 | +pass checks or asserts the invariant. |
| 199 | + |
| 200 | +**Recommendation**: Use an explicit fused instruction (`MulWithOverflowCheck`) |
| 201 | +in the LIR instead of relying on Mul+RShift+JoMul being contiguous. This is |
| 202 | +what other compilers do: |
| 203 | + |
| 204 | +- **LLVM**: Uses `llvm.smul.with.overflow` intrinsic — a single instruction |
| 205 | + that produces both the result and an overflow bit. |
| 206 | +- **Cranelift**: Uses `imul` + `trapif` where the overflow flag is an explicit |
| 207 | + operand, not a side effect. |
| 208 | +- **V8 TurboFan**: Uses `Int32MulWithOverflow` as a single node that lowers to |
| 209 | + the platform-specific sequence atomically. |
| 210 | +- **GCC**: The `-ftrapv` multiply overflow check is emitted as a single |
| 211 | + inseparable sequence during final code emission, never as separate |
| 212 | + matchable instructions. |
| 213 | + |
| 214 | +### 2. Reject unknown instruction sequences |
| 215 | + |
| 216 | +When the emit pass encounters `Mul` without the expected `RShift + JoMul` |
| 217 | +pattern, it falls through to a plain `mul`. But if there IS a `JoMul` later |
| 218 | +(just not at idx+2), the `JoMul` will execute with stale flags. The wildcard |
| 219 | +arm should either: |
| 220 | + |
| 221 | +- **Panic**: "Mul followed by JoMul but RShift not at expected position" |
| 222 | +- **Emit a safe fallback**: `smulh` + `mul` + `asr` + `cmp` unconditionally |
| 223 | + |
| 224 | +The current silent fallthrough is the worst option. |
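One way to make the contract explicit is a validation step that runs before emission. The sketch below assumes a hypothetical operand-free `Insn` enum and a hypothetical `check_jomul_invariant` helper; neither exists in ZJIT, and a real check would also need to compare operands.

```rust
/// Hypothetical, operand-free LIR used only for this illustration.
#[derive(Debug, Clone, PartialEq)]
enum Insn {
    Mul,
    RShift,
    Store,
    JoMul,
    Other,
}

/// Rejects any `JoMul` that is not immediately preceded by one of the
/// sequences the emitter knows how to fuse.
fn check_jomul_invariant(insns: &[Insn]) -> Result<(), String> {
    for (i, insn) in insns.iter().enumerate() {
        if *insn != Insn::JoMul {
            continue;
        }
        // Inspect the instructions that end right before the JoMul.
        let ok = match &insns[..i] {
            [.., Insn::Mul, Insn::RShift] => true,
            [.., Insn::Mul, Insn::RShift, Insn::Store] => true,
            _ => false,
        };
        if !ok {
            return Err(format!(
                "JoMul at index {} is not preceded by a fusable Mul sequence",
                i
            ));
        }
    }
    Ok(())
}
```

Whether the caller turns the `Err` into a panic or into the unconditional safe fallback is a design choice; either one replaces the silent miscompile with a visible failure or correct code.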
| 225 | + |
| 226 | +### 3. Add end-to-end overflow tests with register pressure |
| 227 | + |
| 228 | +The existing tests only exercise `FixnumMult` in simple functions with low |
| 229 | +register pressure (where the output isn't spilled). A test with two multiplies |
| 230 | +and a cfunc call would have caught this immediately. |
| 231 | + |
| 232 | +### 4. Consider separating "instruction lowering" from "instruction selection" |
| 233 | + |
| 234 | +The ARM64 backend conflates two concerns: |
| 235 | + |
| 236 | +- **Lowering**: Converting abstract LIR to concrete register operations, |
| 237 | + handling spills (scratch_split) |
| 238 | +- **Selection**: Fusing multiple LIR instructions into a single ARM64 sequence |
| 239 | + (emit) |
| 240 | + |
| 241 | +Other compilers (LLVM, Cranelift) keep these clearly separated, with selection |
| 242 | +happening before register allocation. Post-regalloc passes only handle |
| 243 | +mechanical concerns (encoding, addressing modes) and never need to pattern |
| 244 | +match across multiple instructions. |
| 245 | + |
| 246 | +## Files changed |
| 247 | + |
| 248 | +- `zjit/src/backend/arm64/mod.rs` — scratch_split Mul handling + emit Mul |
| 249 | + pattern match |
| 250 | + |
| 251 | +## Test results |
| 252 | + |
| 253 | +- 1454 unit tests: all pass |
| 254 | +- 27 integration tests, 7605 assertions: all pass |
| 255 | +- Reproducer: 0 side exits, 69.1% ratio_in_zjit (was 71 exits, 3.7%) |
| 256 | +DJB2 64-bit hash: results match the interpreter |