wamr_llvm_jit is a large slow outlier on i8x16.bitmask when a loop-varying negative byte is inserted via i8x16.replace_lane #4931

@gaaraw

Subject of the issue

wamr_llvm_jit is much slower than peer runtimes on a small i8x16.bitmask loop when the input vector is built with i8x16.replace_lane and the replaced lane is loop-varying in the negative-byte range 0x80..0xff.

Test case

The clearest minimized reproducer is:

(module
  (type (func (param i32)))
  (type (func))
  (import "wasi_snapshot_preview1" "proc_exit" (func (type 0)))
  (func (type 1)
    (local $i i64)
    (local $acc i32)
    (local.set $i (i64.const 1073741824))
    (local.set $acc (i32.const 1311768464))
    (loop $body
      (local.get $acc)
      v128.const i8x16 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
      local.get $i
      i32.wrap_i64
      i32.const 0x80
      i32.or
      i8x16.replace_lane 7
      i8x16.bitmask
      i32.xor
      (local.set $acc)
      (local.set $i (i64.sub (local.get $i) (i64.const 1)))
      (br_if $body (i64.ne (local.get $i) (i64.const 0)))
    )
    (i32.const 0)
    (local.get $acc)
    (i32.store)
    (call 0 (i32.const 0))
  )
  (memory 1)
  (export "_start" (func 1))
  (export "memory" (memory 0))
)

I also checked closely matched controls:

  • multilane_all_neg_splat: all lanes vary through i8x16.splat
  • sweep_const_80: replaced lane is constant negative
  • obs_extract_lane_s7_negvary_xor: keep replace_lane and negative varying lane, but observe with extract_lane_s instead of bitmask
  • cross_i16x8_bitmask_negvary: same high-level pattern translated to i16x8.bitmask

Your environment

  • wasmer: 6.1.0
  • WAMR: iwasm 2.4.4
  • wasmedge: 0.16.1-18-gc457fe30
  • wasmtime: 41.0.0 (4898322a4 2025-12-18)
  • wabt: 1.0.39
  • llvm: 21.1.5
  • Host OS: Ubuntu 22.04.5 LTS x64
  • CPU: 12th Gen Intel® Core™ i7-12700 × 20

Steps to reproduce

  1. Compile the testcase with wat2wasm reproducer.wat -o reproducer.wasm.
  2. Run the wasm file with wamr_llvm_jit and compare its wall-clock or task-clock time with other runtimes.
  3. Compare against the controls listed above.

Representative commands in my setup:

wat2wasm reproducer.wat -o reproducer.wasm

# WAMR LLVM JIT
/path/to/iwasm reproducer.wasm

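# peer runtimes (the trailing comment names the matching column in the timings below)
/path/to/wasmer run --llvm reproducer.wasm        # wasmer_llvm
/path/to/wasmedge --enable-jit reproducer.wasm    # wasmedge_jit
/path/to/wasmer run reproducer.wasm               # wasmer_cranelift
/path/to/wasmtime reproducer.wasm                 # wasmtime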
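
For repeatable wall-clock comparisons, a small harness along these lines can wrap the commands above. This is only a sketch: `time_command` is a hypothetical helper, the demo command is a stand-in for the `/path/to/...` invocations, and the measured time covers the whole process, including runtime startup and JIT compilation.

```python
import statistics
import subprocess
import sys
import time


def time_command(argv, repeats=5):
    """Run argv several times and return the median wall-clock time in seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        # Output is discarded; the exit status is ignored on purpose.
        subprocess.run(argv, check=False,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


if __name__ == "__main__":
    # Stand-in command; replace it with e.g. ["/path/to/iwasm", "reproducer.wasm"].
    demo = [sys.executable, "-c", "pass"]
    print(f"median: {time_command(demo, repeats=3):.5f} s")
```

Taking the median rather than the mean keeps a single noisy run from dominating the comparison.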

Expected and actual behavior

Expected behavior

For this small tight SIMD loop, I would expect wamr_llvm_jit to stay in the same rough range as the other major runtimes, or at least not to become a strong outlier only for this narrow i8x16.bitmask shape.

Actual behavior

On the reduced trigger, wamr_llvm_jit takes about 2.0 s, roughly 3x the time of wasmer_cranelift and wasmtime and about 6x the time of wasmer_llvm.

Representative timings (seconds):

variant                          wasmer_llvm  wasmedge_jit  wamr_llvm_jit  wasmer_cranelift  wasmtime
multilane_one_neg_lane7          0.31224      0.03303       2.00588        0.61651           0.63321
sweep_loop_or_80                 0.31067      0.03142       1.98314        0.62097           0.61535
multilane_all_neg_splat          0.31384      0.03224       0.31538        0.61624           0.62004
sweep_const_80                   0.30556      0.03335       0.32205        0.30358           0.33636
obs_extract_lane_s7_negvary_xor  ~0.38        ~0.27         ~0.35          ~0.69             ~0.68
cross_i16x8_bitmask_negvary      0.30917      0.03175       0.31183        0.69188           0.69385
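
One way to read these timings is as a ratio of wamr_llvm_jit to wasmer_llvm per variant. The sketch below copies the exact rows from the timings above (the approximate `~` row is left out) and uses wasmer_llvm as the baseline only because it is the other LLVM-based configuration; that choice of baseline is an assumption, not part of the measurements.

```python
# (wasmer_llvm, wamr_llvm_jit) in seconds, copied from the timings above.
timings = {
    "multilane_one_neg_lane7": (0.31224, 2.00588),
    "sweep_loop_or_80": (0.31067, 1.98314),
    "multilane_all_neg_splat": (0.31384, 0.31538),
    "sweep_const_80": (0.30556, 0.32205),
    "cross_i16x8_bitmask_negvary": (0.30917, 0.31183),
}

# Slowdown of wamr_llvm_jit relative to wasmer_llvm for each variant.
ratios = {name: wamr / wasmer for name, (wasmer, wamr) in timings.items()}

for name, ratio in ratios.items():
    print(f"{name:30s} {ratio:5.2f}x")
```

The two trigger variants come out above 6x, while the three controls stay within about 6% of wasmer_llvm.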

Important observations:

  • A single loop-varying negative byte inserted by i8x16.replace_lane is already sufficient to trigger the slowdown.
  • The slowdown does not reproduce when the same negative pattern is produced by i8x16.splat.
  • The slowdown does not reproduce when the replaced lane is constant negative instead of loop-varying.
  • The slowdown does not reproduce on the matched i16x8.bitmask version.
  • Replacing bitmask with extract_lane_s keeps some work alive but does not reproduce the large WAMR outlier.

So the strongest observed trigger condition is:

i8x16.bitmask consuming a vector built by i8x16.replace_lane, where at least one replaced lane is loop-varying and its effective byte stays in 0x80..0xff.
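
To make the trigger concrete, here is a scalar model of one loop iteration of the reproducer. It is a sketch of the Wasm semantics only, not of any runtime's implementation, and the helper names are hypothetical.

```python
def replace_lane(vec, lane, value):
    """Model of i8x16.replace_lane: only the low 8 bits of the i32 operand are kept."""
    out = list(vec)
    out[lane] = value & 0xFF
    return out


def bitmask(vec):
    """Model of i8x16.bitmask: bit k of the result is the sign bit of lane k."""
    result = 0
    for k, byte in enumerate(vec):
        result |= (byte >> 7) << k
    return result


def loop_body(i, acc):
    """One iteration of the reproducer loop for counter value i."""
    lane = (i & 0xFFFFFFFF) | 0x80          # i32.wrap_i64, then i32.or with 0x80
    vec = replace_lane([0] * 16, 7, lane)   # all-zero v128.const, lane 7 replaced
    return acc ^ bitmask(vec)               # i8x16.bitmask, then i32.xor into acc


# Every counter value yields the same mask, because lane 7 always has its top bit set.
masks = {bitmask(replace_lane([0] * 16, 7, (i & 0xFFFFFFFF) | 0x80))
         for i in range(1, 1025)}
print(masks)  # {128}
```

Under this model the mask is 0x80 on every iteration even though the lane value changes, so the result itself is not data-dependent; only the inserted byte is.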

Extra Info

I also exported WAMR low-level artifacts with wamrc --format=llvmir-unopt, --format=llvmir-opt, and --format=object.

For the slow i8x16 bitmask-shaped cases, optimized LLVM IR consistently contains a pattern like:

%new_vector = insertelement <16 x i8> ..., i8 %lane, i64 7
%isneg = icmp slt <16 x i8> %new_vector, zeroinitializer
%mask_bits = select <16 x i1> %isneg,
  <16 x i64> <1,2,4,...,32768>,
  zeroinitializer
%call = tail call i64 @llvm.vector.reduce.or.v16i64(<16 x i64> %mask_bits)
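
For readers less familiar with this IR shape, the sketch below models it lane by lane and checks it against a direct gather of the per-byte sign bits, which is what a pmovmskb-style instruction produces for a 128-bit vector. The function names are hypothetical and the model says nothing about how LLVM selects instructions; it only shows that the two formulations compute the same value.

```python
import random

WEIGHTS = [1 << k for k in range(16)]  # the <1, 2, 4, ..., 32768> constant vector


def to_signed(byte):
    """Interpret an unsigned byte value as a signed i8."""
    return byte - 256 if byte >= 0x80 else byte


def bitmask_select_reduce(vec):
    """Model of the IR pattern: icmp slt, select of per-lane weights, reduce.or."""
    isneg = [to_signed(b) < 0 for b in vec]                       # icmp slt vec, 0
    mask_bits = [w if n else 0 for w, n in zip(WEIGHTS, isneg)]   # select
    acc = 0
    for m in mask_bits:                                           # vector.reduce.or
        acc |= m
    return acc


def bitmask_sign_gather(vec):
    """Gather the top bit of each byte into the low 16 bits of the result."""
    return sum(((b >> 7) & 1) << k for k, b in enumerate(vec))


rng = random.Random(0)
for _ in range(1000):
    v = [rng.randrange(256) for _ in range(16)]
    assert bitmask_select_reduce(v) == bitmask_sign_gather(v)
```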

At object level, the hot loop becomes a long synthesized sequence with operations such as:

  • vpinsrb
  • vpcmpgtb
  • vpmovsxbw
  • vpmovzxbq / vpmovzxwq
  • vextracti128
  • repeated vpand / vpor

I did not observe a compact pmovmskb / vpmovmskb-style extraction sequence in these WAMR-generated object files.

By contrast:

  • the constant-negative control (sweep_const_80) is optimized down so the expensive bitmask path no longer stays alive in the hot loop;
  • the matched i16x8.bitmask control does not preserve a comparable heavy object-level sequence.

So the report above is based on both runtime measurements and WAMR low-level evidence.
