You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Copy file name to clipboardExpand all lines: PLAN.md
+32Lines changed: 32 additions & 0 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -110,6 +110,20 @@ Unreachable under the current test matrix. Each is small to fill if a new wasm o
110
110
| 13 |`vmulld` ✅ DONE (commit `c1f15514bc6c`; POWER10) |`mulInt64x2` 9-insn (P9) / 11-insn (P8) extract-mul-reassemble |**1 insn on POWER10** — drops 2 GPR scratches and the GPR↔FPR round-trip entirely |
111
111
| 14 |`paddi` (extended) ✅ DONE (commit `4aed8c8f4e8b`; POWER10) |`addPtr/subPtr/add64/sub64` with Imm32/ImmWord/Imm64 in 17-34-bit signed range — was `movePtr scratch + add/subf` (3 insns + GPR scratch on P9) |**1 prefixed insn on POWER10**, no scratch. Already used for `movePtr` (item #-/P10 table), now extended to scalar immediate add/sub helpers used by every JS int32/int64 closure-offset / IC-slot computation |
112
112
| 15 |`fcfidu` / `fcfidus` ✅ DONE (commit `ddee1e276088`; POWER7+) |`convertUInt64To{Double,Float32}`~9-insn manual sign-split (branch + halve via shift+OR + scalar convert + doubling fadd) |**2 insns flat** (`mtvsrd + fcfidu[s]`), no branch, no GPR scratch. The emitters were already wired but unused. |
113
+
| 16 |`rlwinm` (contig-mask) + `ori`/`oris`/`xori`/`xoris` pairs ✅ DONE (commits `5d68cf47ad8c`, `98356835229c`; POWER7+) |`and32`/`andPtr`/`or32`/`xor32`/`orPtr`/`xorPtr` with `Imm32` — was `move32 scratch + and/or/xor` (3 insns + GPR scratch) |**1-2 insns, no scratch** for the common cases: 16-bit-fit → `andi.`/`ori`/`xori`; contig-mask → `rlwinm`; mixed-bit unsigned-32 → `ori+oris` / `xori+xoris`|
114
+
| 17 |`rldicl`/`rldicr`/`rlwinm` + `ori`/`oris`/`xori`/`xoris` pairs ✅ DONE (commit `9d591766eace`; POWER7+) |`and64`/`or64`/`xor64` with `Imm64` — was `movePtr scratch + and/or/xor_` (5-6 insns + GPR scratch) |**1-2 insns, no scratch** when imm fits 16-bit (`andi.`), is a contig-mask 64-bit (`rldicl`/`rldicr`/`rlwinm`), or fits unsigned 32-bit (`ori+oris` / `xori+xoris`). Hot in tree: `and64(Imm64((1<<MantissaBits)-1))` for double-payload mask → 1 `rldicl`|
115
+
| 18 |`mulli` ✅ DONE (commit `ae8173c891fa`; POWER7+) |`mulBy3` (was 2 `add` + GPR scratch) and `mulPtr(ImmWord)` 16-bit-fit fast path (was `movePtr scratch + mulld`, 5+ insns) |**1 insn, no scratch** for `mulBy3` (handles `src==dest` cleanly via mulli's RA-read-before-RT-write) and `mulPtr(ImmWord)` when imm fits int16 |
116
+
| 19 |`plxv` / `pstxv` ⏳ CANDIDATE (POWER10) |`loadUnalignedSimd128(Address)` / `storeUnalignedSimd128(Address)` non-zero-offset path — currently `movePtr scratch + lxvx(dest, base, scratch)` (P9: 2 insns + GPR scratch; P8: same plus `xxpermdi` byteswap) |**1 prefixed insn, no scratch** when offset fits 34-bit signed. `as_plxv` already wired; `as_pstxv` (= `pstxv` 8LS prefixed store) needs new emitter + sim decoder. Hot on every wasm v128 load/store with a constant frame offset. |
117
+
| 20 |`subfic`-into-dest ⏳ CANDIDATE (POWER7+) |`rotateRight(Register count, Register input, Register dest)` and `rotateRight64(Register, …)` — currently `subfic(scratch, count, 32) + rlwnm(dest, input, scratch, 0, 31)` (1 GPR scratch) |**same insn count, drops 1 GPR scratch** when `dest != input` (the common case): `subfic dest, count, 32; rlwnm dest, input, dest, 0, 31`. Need a guard for the `dest == input` case (caller would lose `input` after the subfic write); fall back to scratch then. |
118
+
| 21 |`vspltisb` + dual-path ✅ DONE (commit `5733c96fecb8`; POWER7+) |`swizzleInt8x16` was 12 insns + GPR scratch + 1 red-zone slot on POWER9: `movePtr(0x10101010) + splatX4` for `splat(16)`, `vcmpgtub`, RedZoneStash mask, `movePtr(0x0F0F0F0F) + splatX4` for `splat(15)`, `vsububs`, `vperm`, RedZoneRestore, `xxland`| Reformulated `rhs<16` as `!(rhs>15)`; both splats now `vspltisb 15` (1 insn each, no GPR scratch). Two paths: **dest != rhs (7 insns, no red-zone, no GPR scratch — wasm baseline's `swizzleInt8x16(rsd, rs, rsd)` hits this)**; **dest == rhs (9 insns, 1 red-zone slot, no GPR scratch)**. Re-`vspltisb` rather than stashing splat(15) — cheaper than redzone round-trip. Verified on real P9: default + `MOZ_PPC64_FORCE_POWER8=1` both 13715 PASS / 0 FAIL. |
119
+
| 22 |`vcmpequd.` + `mfocrf` + CR6.LT extract ✅ DONE (commit `b60d47986c03`; POWER8+) |`anyTrueSimd128` was `mfvsrd + mfvsrld + or_ + subfic + subfe + neg` (6 insns + GPR scratch + 2 long-latency FPR↔GPR transfers; POWER8 needed an extra xxpermdi through ScratchSimd128 because mfvsrld is POWER9-only) |**5 insns flat, no GPR scratch, no FPR↔GPR round-trip.** Path: `ZeroSimd128(scratch); vcmpequd. scratch, src, scratch (CR6.LT=1 iff all-zero); mfocrf dest, cr6; rlwinm dest, dest, 25, 31, 31 (extract CR6.LT); xori dest, dest, 1`. Mirrors the existing `EmitAllTrueInt` template (SH=27 for CR6.EQ); we use SH=25 for CR6.LT plus the xori-flip. Real-P9 + sim full matrix verified (jit-test default + FORCE_POWER8 + sim FORCE_POWER{9,10,8}, jstests default + FORCE_POWER8). |
120
+
| 23 |`xsmaxjdp` + isel branchless clamp ✅ DONE (commit `2d4ff1cf8c68`; POWER9 fast path) |`clampDoubleToUint8` had two mispredictable FP-compare branches (`≤0/NaN → 0` and `≥255 → 255`) on the hot Uint8ClampedArray store path |**POWER9 path: 7 insns, no branches.**`zeroDouble fpscratch; xsmaxjdp fpscratch, input, fpscratch (max(input,0); NaN→0 via P9 Java-NaN-as-less semantic ISA v3.0B §7.6.1.7); fctid fpscratch, fpscratch; mfvsrd output, fpscratch; li max255, 255; cmpdi output, 255; isel output, max255, output, GT`. fctid saturates out-of-int64 to INT64_MAX which the upper clamp catches. POWER8 fallback unchanged (xsmaxjdp unavailable; fctid maps NaN to INT64_MAX → would clamp to 255 not 0, so keep NaN-filtering branches). Real-P9 + sim full matrix verified (jit-test default + FORCE_POWER8 + sim FORCE_POWER{9,10,8}, jstests default + FORCE_POWER8). |
121
+
| 24-27 | drop dead `mr`/`movePtr` before `stXcx.` ✅ DONE (commit `aad9b6e8e3f7`; POWER7+) |`CompareExchange` 4-byte path, `CompareExchange64`, `AtomicExchange` 4-byte path, `AtomicExchange64` — each had `xs_mr/movePtr scratch, value; stXcx scratch, ...`| Per Power ISA v3.0B §3.3.16 `stXcx.` reads RS without modifying it, so the move-to-scratch was dead. Replaced with `stXcx value, ...` directly. **1 insn saved per atomic-loop iteration in each helper, plus drops the `scratch2` GPR acquire entirely** (only used at the move site). Real-P9 + sim full matrix verified. |
122
+
| 28 | orphaned helper removal ✅ DONE (commit `aad9b6e8e3f7`) |`MacroAssemblerPPC64Compat::convertUInt64ToDouble(Register, FloatRegister)` — single-Register Compat-class form with the old branched signed-split path; orphaned by commit `ddee1e276088` (which collapsed the public 3-arg API to `mtvsrd + fcfidu`) | Deleted entirely. Zero callers anywhere in tree. Direct demonstration of the "Refactoring hygiene — remove orphaned helpers in the same commit" rule (this would have been caught at the time of `ddee1e276088`). |
123
+
| 29 |`plfd` / `plfs` for general memory FP loads ⏳ CANDIDATE (POWER10) |`loadDouble(Address)` / `loadFloat32(Address)` (h lines 1472, 1493) when `offset` doesn't fit 16-bit signed — currently `movePtr scratch + lfdx/lfsx` (2 insns + GPR scratch) |**1 prefixed insn, no scratch** when offset fits 34-bit signed. `as_plfd` / `as_plfs` already wired (used for `loadConstantDouble`/`Float32`); just plumb them into the `Address`-form fallback path. |
124
+
| 30 |`pld` for general 64-bit memory load ⏳ CANDIDATE (POWER10) |`loadPtr(Address)` (h line 1436) when offset doesn't fit DS-form (16-bit signed, 4-byte aligned) — currently `movePtr scratch + ldx` (2 insns + GPR scratch). Same shape for `load64(Address)` (which delegates to `loadPtr`). |**1 prefixed insn, no scratch** when offset fits 34-bit signed. `as_pld` already wired (used in `movePtr` for 33-34-bit-signed values). Hot on deep stack frames where slot offsets exceed 16-bit signed. |
125
+
| 31 | P10 prefixed-store emitters + plumbing ⏳ CANDIDATE (POWER10, needs emitter work) |`storeDouble(Address)` / `storeFloat32(Address)` / `storePtr(Address)` / `storeUnalignedSimd128(Address)` (across h and cpp) when offset doesn't fit DS-form — currently `movePtr scratch + stXx`|**1 prefixed insn each, no scratch** with `pstfd`/`pstfs`/`pstd`/`pstxv` (POWER10). Requires new emitters (`as_pstfd`, `as_pstfs`, `as_pstd`, `as_pstxv`) and matching sim decoders before the JIT can use them. ISA encoding is the 8LS prefix family (Type 0); same shape as `pld`/`plxv` already in tree. |
126
+
| 32 | drop orphaned `xs_mr` after #24/#26 atomic refactor ⏳ TBD | The `xs_mr scratch, …` removed by items #24 / #26 may leave `xs_mr` (defined in `Assembler-ppc64-inl.h` style) without callers if the atomic helpers were the only consumers. | After #24-#27 ship, run `git grep '\bxs_mr\b'` to confirm. If the only hit is the definition, drop it per the "Refactoring hygiene" rule. (Likely still has callers — `xs_mr` is the standard "move register" macro — but worth verifying after the atomic patches land.) |
113
127
114
128
**Notes on item #1 (`vinsertb`/`vinserth`, shipped):** re-introduces what `66fa9e6625` removed earlier. That removal was correct AT THE TIME — it predated the VR-namespace migration when Simd128 lived in VSR0-31; the V-form VRT field would have addressed the wrong physical register. Post-migration the encoding mismatch is gone. First sim-decoder pass had the byte-pair order flipped for `vinserth`; caught by running the same wasm test on real P9 silicon (rather than only the sim) and observing the byte-level mismatch. Verified across 10 sweeps (real P9 jit-test/jstests with default + `FORCE_POWER8=1`; sim jit-test/jstests with `FORCE_POWER9/10/8=1`).
115
129
@@ -309,6 +323,23 @@ Procedure:
309
323
4.**No subagent output enters PLAN.md in the same turn.** Draft, verify per-item, then commit.
310
324
5.**Measure, don't estimate.** Five minutes on the dev box beats arithmetic.
311
325
326
+
### Refactoring hygiene — remove orphaned helpers in the same commit
327
+
328
+
When a refactor replaces a multi-step helper sequence with a direct opcode (the dominant pattern in this branch's audit work), the old `static` helpers in `MacroAssembler-ppc64-inl.h` / `.cpp` often lose their callers. Drop them **in the same commit as the refactor**, not later.
329
+
330
+
Why this matters:
331
+
-`./mach build binaries` (JS-shell-only) does **not** run the unified-compilation `-Wunused-function` check that `./mach build` (full Firefox) runs. An orphan helper passes JS jit-test sweeps cleanly and only surfaces when someone (the user, CI, or a clobber-rebuild) does a full firefox build — sometimes weeks later.
332
+
- Concrete incident, 2026-04-30: `ExtractLow32x4Lanes` in `MacroAssembler-ppc64-inl.h` lost its only callers when commit `3832c823aa95` rewrote `convertInt32x4ToFloat64x2{,_u}` to use `vmrglw + xvcv{s,u}xwdp`. The unused helper sat for several commits before the user's full firefox build failed under `-Werror=unused-function`. Build-fix commit `1723d63b8eb3` dropped it.
333
+
- Same shape applies to: `RedZoneStashSimd128` slots that are no longer used after a redzone-elimination refactor; per-lane scalar fall-back helpers (`ExtractLaneToGPR` variants) that were the body of an unrolled loop now collapsed into a single VMX op; `loadConstantSimd128` mask arrays whose callers moved to a `vspltis*` / `xxspltib` direct path.
334
+
335
+
How to apply, every refactor:
336
+
1. After implementing the new direct-opcode path, run `git grep '<helper_name>'` over `js/src/jit/ppc64/`. If the only hit is the definition, delete it in the same diff.
337
+
2. Skim the file for nearby `static (inline|void) Foo(...)` declarations the refactor might have orphaned indirectly (e.g., a private utility used only by the helper you just removed).
338
+
3. If a helper is intentionally kept for a known near-term caller, add a one-line `// kept for <opcode>: <reason>` comment so the next audit doesn't delete it without reading.
339
+
4. JS-shell verification (`./mach build binaries -j20`) is necessary but **not sufficient**. Either run a full `./mach build -j20` periodically, or grep for `[[maybe_unused]]` / `__attribute__((unused))` / `static (inline)?` in the touched files as a cheap pre-commit check. The full-build catch is mechanical; the upstream MR will reject anything `-Werror=unused-function` flags.
340
+
341
+
Counterpart to `feedback_no_for_future_use_helpers.md` (don't add speculative helpers without a caller). Together: a helper exists iff it has at least one caller, in every commit.
- Use `python3 js/src/jit-test/jit_test.py` directly (not `./mach jit-test`) — we build with `--disable-tests`.
393
424
- Always save test output to a file; don't re-run slow tests to extract different information.
394
425
-**Do NOT use `obj-sm` for wasm testing** — wasm is disabled at runtime in `-O0` builds.
426
+
-**Always pass `-j20` to `./mach build` on the PPC64 box.** The default auto-parallelism picks `-j32` (32 cores × 64 GiB RAM heuristic) which saturates the box during lld linking and makes it SSH-unreachable. `./mach build -j20`, `./mach build binaries -j20`, etc. — every invocation.
395
427
-**Never run two heavy commands concurrently on the PPC64 box** (two builds, two jit-test runs, etc.) — sequence them. The sim host (`192.168.64.3`) is the same: a build that overwrites `dist/bin/js` while a test run is shelling it out gives a `PermissionError [Errno 13]` mid-suite (we hit this 2026-04-30).
396
428
-**Full-tree rsync before testing, not just `js/src/jit/ppc64/`.** Partial rsyncs leave the remote source out of sync with local in places that bite during build/test:
397
429
-`Cargo.lock` + `third_party/rust/<crate>/` after upstream merges (saw `glslopt 0.1.13/0.1.14` and `neqo v0.26.0/v0.26.1` mismatches → cbindgen step fails or "lock file required" errors)
0 commit comments