Skip to content

Commit 563dfa9

Browse files
committed
PLAN.md: mark items mozilla-firefox#24-mozilla-firefox#28 DONE (atomic helpers + orphan removal)
1 parent aad9b6e commit 563dfa9

1 file changed

Lines changed: 32 additions & 0 deletions

File tree

PLAN.md

Lines changed: 32 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -110,6 +110,20 @@ Unreachable under the current test matrix. Each is small to fill if a new wasm o
110110
| 13 | `vmulld` ✅ DONE (commit `c1f15514bc6c`; POWER10) | `mulInt64x2` 9-insn (P9) / 11-insn (P8) extract-mul-reassemble | **1 insn on POWER10** — drops 2 GPR scratches and the GPR↔FPR round-trip entirely |
111111
| 14 | `paddi` (extended) ✅ DONE (commit `4aed8c8f4e8b`; POWER10) | `addPtr/subPtr/add64/sub64` with Imm32/ImmWord/Imm64 in 17-34-bit signed range — was `movePtr scratch + add/subf` (3 insns + GPR scratch on P9) | **1 prefixed insn on POWER10**, no scratch. Already used for `movePtr` (item #-/P10 table), now extended to scalar immediate add/sub helpers used by every JS int32/int64 closure-offset / IC-slot computation |
112112
| 15 | `fcfidu` / `fcfidus` ✅ DONE (commit `ddee1e276088`; POWER7+) | `convertUInt64To{Double,Float32}` ~9-insn manual sign-split (branch + halve via shift+OR + scalar convert + doubling fadd) | **2 insns flat** (`mtvsrd + fcfidu[s]`), no branch, no GPR scratch. The emitters were already wired but unused. |
113+
| 16 | `rlwinm` (contig-mask) + `ori`/`oris`/`xori`/`xoris` pairs ✅ DONE (commits `5d68cf47ad8c`, `98356835229c`; POWER7+) | `and32`/`andPtr`/`or32`/`xor32`/`orPtr`/`xorPtr` with `Imm32` — was `move32 scratch + and/or/xor` (3 insns + GPR scratch) | **1-2 insns, no scratch** for the common cases: 16-bit-fit → `andi.`/`ori`/`xori`; contig-mask → `rlwinm`; mixed-bit unsigned-32 → `ori+oris` / `xori+xoris` |
114+
| 17 | `rldicl`/`rldicr`/`rlwinm` + `ori`/`oris`/`xori`/`xoris` pairs ✅ DONE (commit `9d591766eace`; POWER7+) | `and64`/`or64`/`xor64` with `Imm64` — was `movePtr scratch + and/or/xor_` (5-6 insns + GPR scratch) | **1-2 insns, no scratch** when imm fits 16-bit (`andi.`), is a contig-mask 64-bit (`rldicl`/`rldicr`/`rlwinm`), or fits unsigned 32-bit (`ori+oris` / `xori+xoris`). Hot in tree: `and64(Imm64((1<<MantissaBits)-1))` for double-payload mask → 1 `rldicl` |
115+
| 18 | `mulli` ✅ DONE (commit `ae8173c891fa`; POWER7+) | `mulBy3` (was 2 `add` + GPR scratch) and `mulPtr(ImmWord)` 16-bit-fit fast path (was `movePtr scratch + mulld`, 5+ insns) | **1 insn, no scratch** for `mulBy3` (handles `src==dest` cleanly via mulli's RA-read-before-RT-write) and `mulPtr(ImmWord)` when imm fits int16 |
116+
| 19 | `plxv` / `pstxv` ⏳ CANDIDATE (POWER10) | `loadUnalignedSimd128(Address)` / `storeUnalignedSimd128(Address)` non-zero-offset path — currently `movePtr scratch + lxvx(dest, base, scratch)` (P9: 2 insns + GPR scratch; P8: same plus `xxpermdi` byteswap) | **1 prefixed insn, no scratch** when offset fits 34-bit signed. `as_plxv` already wired; `as_pstxv` (= `pstxv` 8LS prefixed store) needs new emitter + sim decoder. Hot on every wasm v128 load/store with a constant frame offset. |
117+
| 20 | `subfic`-into-dest ⏳ CANDIDATE (POWER7+) | `rotateRight(Register count, Register input, Register dest)` and `rotateRight64(Register, …)` — currently `subfic(scratch, count, 32) + rlwnm(dest, input, scratch, 0, 31)` (1 GPR scratch) | **same insn count, drops 1 GPR scratch** when `dest != input` (the common case): `subfic dest, count, 32; rlwnm dest, input, dest, 0, 31`. Need a guard for the `dest == input` case (caller would lose `input` after the subfic write); fall back to scratch then. |
118+
| 21 | `vspltisb` + dual-path ✅ DONE (commit `5733c96fecb8`; POWER7+) | `swizzleInt8x16` was 12 insns + GPR scratch + 1 red-zone slot on POWER9: `movePtr(0x10101010) + splatX4` for `splat(16)`, `vcmpgtub`, RedZoneStash mask, `movePtr(0x0F0F0F0F) + splatX4` for `splat(15)`, `vsububs`, `vperm`, RedZoneRestore, `xxland` | Reformulated `rhs<16` as `!(rhs>15)`; both splats now `vspltisb 15` (1 insn each, no GPR scratch). Two paths: **dest != rhs (7 insns, no red-zone, no GPR scratch — wasm baseline's `swizzleInt8x16(rsd, rs, rsd)` hits this)**; **dest == rhs (9 insns, 1 red-zone slot, no GPR scratch)**. Re-`vspltisb` rather than stashing splat(15) — cheaper than redzone round-trip. Verified on real P9: default + `MOZ_PPC64_FORCE_POWER8=1` both 13715 PASS / 0 FAIL. |
119+
| 22 | `vcmpequd.` + `mfocrf` + CR6.LT extract ✅ DONE (commit `b60d47986c03`; POWER8+) | `anyTrueSimd128` was `mfvsrd + mfvsrld + or_ + subfic + subfe + neg` (6 insns + GPR scratch + 2 long-latency FPR↔GPR transfers; POWER8 needed an extra xxpermdi through ScratchSimd128 because mfvsrld is POWER9-only) | **5 insns flat, no GPR scratch, no FPR↔GPR round-trip.** Path: `ZeroSimd128(scratch); vcmpequd. scratch, src, scratch (CR6.LT=1 iff all-zero); mfocrf dest, cr6; rlwinm dest, dest, 25, 31, 31 (extract CR6.LT); xori dest, dest, 1`. Mirrors the existing `EmitAllTrueInt` template (SH=27 for CR6.EQ); we use SH=25 for CR6.LT plus the xori-flip. Real-P9 + sim full matrix verified (jit-test default + FORCE_POWER8 + sim FORCE_POWER{9,10,8}, jstests default + FORCE_POWER8). |
120+
| 23 | `xsmaxjdp` + isel branchless clamp ✅ DONE (commit `2d4ff1cf8c68`; POWER9 fast path) | `clampDoubleToUint8` had two mispredictable FP-compare branches (`≤0/NaN → 0` and `≥255 → 255`) on the hot Uint8ClampedArray store path | **POWER9 path: 7 insns, no branches.** `zeroDouble fpscratch; xsmaxjdp fpscratch, input, fpscratch (max(input,0); NaN→0 via P9 Java-NaN-as-less semantic ISA v3.0B §7.6.1.7); fctid fpscratch, fpscratch; mfvsrd output, fpscratch; li max255, 255; cmpdi output, 255; isel output, max255, output, GT`. fctid saturates out-of-int64 to INT64_MAX which the upper clamp catches. POWER8 fallback unchanged (xsmaxjdp unavailable; fctid maps NaN to INT64_MAX → would clamp to 255 not 0, so keep NaN-filtering branches). Real-P9 + sim full matrix verified (jit-test default + FORCE_POWER8 + sim FORCE_POWER{9,10,8}, jstests default + FORCE_POWER8). |
121+
| 24-27 | drop dead `mr`/`movePtr` before `stXcx.` ✅ DONE (commit `aad9b6e8e3f7`; POWER7+) | `CompareExchange` 4-byte path, `CompareExchange64`, `AtomicExchange` 4-byte path, `AtomicExchange64` — each had `xs_mr/movePtr scratch, value; stXcx scratch, ...` | Per Power ISA v3.0B §3.3.16 `stXcx.` reads RS without modifying it, so the move-to-scratch was dead. Replaced with `stXcx value, ...` directly. **1 insn saved per atomic-loop iteration in each helper, plus drops the `scratch2` GPR acquire entirely** (only used at the move site). Real-P9 + sim full matrix verified. |
122+
| 28 | orphaned helper removal ✅ DONE (commit `aad9b6e8e3f7`) | `MacroAssemblerPPC64Compat::convertUInt64ToDouble(Register, FloatRegister)` — single-Register Compat-class form with the old branched signed-split path; orphaned by commit `ddee1e276088` (which collapsed the public 3-arg API to `mtvsrd + fcfidu`) | Deleted entirely. Zero callers anywhere in tree. Direct demonstration of the "Refactoring hygiene — remove orphaned helpers in the same commit" rule (this would have been caught at the time of `ddee1e276088`). |
123+
| 29 | `plfd` / `plfs` for general memory FP loads ⏳ CANDIDATE (POWER10) | `loadDouble(Address)` / `loadFloat32(Address)` (h lines 1472, 1493) when `offset` doesn't fit 16-bit signed — currently `movePtr scratch + lfdx/lfsx` (2 insns + GPR scratch) | **1 prefixed insn, no scratch** when offset fits 34-bit signed. `as_plfd` / `as_plfs` already wired (used for `loadConstantDouble`/`Float32`); just plumb them into the `Address`-form fallback path. |
124+
| 30 | `pld` for general 64-bit memory load ⏳ CANDIDATE (POWER10) | `loadPtr(Address)` (h line 1436) when offset doesn't fit DS-form (16-bit signed, 4-byte aligned) — currently `movePtr scratch + ldx` (2 insns + GPR scratch). Same shape for `load64(Address)` (which delegates to `loadPtr`). | **1 prefixed insn, no scratch** when offset fits 34-bit signed. `as_pld` already wired (used in `movePtr` for 33-34-bit-signed values). Hot on deep stack frames where slot offsets exceed 16-bit signed. |
125+
| 31 | P10 prefixed-store emitters + plumbing ⏳ CANDIDATE (POWER10, needs emitter work) | `storeDouble(Address)` / `storeFloat32(Address)` / `storePtr(Address)` / `storeUnalignedSimd128(Address)` (across h and cpp) when offset doesn't fit DS-form — currently `movePtr scratch + stXx` | **1 prefixed insn each, no scratch** with `pstfd`/`pstfs`/`pstd`/`pstxv` (POWER10). Requires new emitters (`as_pstfd`, `as_pstfs`, `as_pstd`, `as_pstxv`) and matching sim decoders before the JIT can use them. ISA encoding is the 8LS prefix family (Type 0); same shape as `pld`/`plxv` already in tree. |
126+
| 32 | drop orphaned `xs_mr` after #24/#26 atomic refactor ⏳ TBD | The `xs_mr scratch, …` removed by items #24 / #26 may leave `xs_mr` (defined in `Assembler-ppc64-inl.h` style) without callers if the atomic helpers were the only consumers. | After #24-#27 ship, run `git grep '\bxs_mr\b'` to confirm. If the only hit is the definition, drop it per the "Refactoring hygiene" rule. (Likely still has callers — `xs_mr` is the standard "move register" macro — but worth verifying after the atomic patches land.) |
113127

114128
**Notes on item #1 (`vinsertb`/`vinserth`, shipped):** re-introduces what `66fa9e6625` removed earlier. That removal was correct AT THE TIME — it predated the VR-namespace migration when Simd128 lived in VSR0-31; the V-form VRT field would have addressed the wrong physical register. Post-migration the encoding mismatch is gone. First sim-decoder pass had the byte-pair order flipped for `vinserth`; caught by running the same wasm test on real P9 silicon (rather than only the sim) and observing the byte-level mismatch. Verified across 10 sweeps (real P9 jit-test/jstests with default + `FORCE_POWER8=1`; sim jit-test/jstests with `FORCE_POWER9/10/8=1`).
115129

@@ -309,6 +323,23 @@ Procedure:
309323
4. **No subagent output enters PLAN.md in the same turn.** Draft, verify per-item, then commit.
310324
5. **Measure, don't estimate.** Five minutes on the dev box beats arithmetic.
311325

326+
### Refactoring hygiene — remove orphaned helpers in the same commit
327+
328+
When a refactor replaces a multi-step helper sequence with a direct opcode (the dominant pattern in this branch's audit work), the old `static` helpers in `MacroAssembler-ppc64-inl.h` / `.cpp` often lose their callers. Drop them **in the same commit as the refactor**, not later.
329+
330+
Why this matters:
331+
- `./mach build binaries` (JS-shell-only) does **not** run the unified-compilation `-Wunused-function` check that `./mach build` (full Firefox) runs. An orphan helper passes JS jit-test sweeps cleanly and only surfaces when someone (the user, CI, or a clobber-rebuild) does a full firefox build — sometimes weeks later.
332+
- Concrete incident, 2026-04-30: `ExtractLow32x4Lanes` in `MacroAssembler-ppc64-inl.h` lost its only callers when commit `3832c823aa95` rewrote `convertInt32x4ToFloat64x2{,_u}` to use `vmrglw + xvcv{s,u}xwdp`. The unused helper sat for several commits before the user's full firefox build failed under `-Werror=unused-function`. Build-fix commit `1723d63b8eb3` dropped it.
333+
- Same shape applies to: `RedZoneStashSimd128` slots that are no longer used after a redzone-elimination refactor; per-lane scalar fall-back helpers (`ExtractLaneToGPR` variants) that were the body of an unrolled loop now collapsed into a single VMX op; `loadConstantSimd128` mask arrays whose callers moved to a `vspltis*` / `xxspltib` direct path.
334+
335+
How to apply, every refactor:
336+
1. After implementing the new direct-opcode path, run `git grep '<helper_name>'` over `js/src/jit/ppc64/`. If the only hit is the definition, delete it in the same diff.
337+
2. Skim the file for nearby `static (inline|void) Foo(...)` declarations the refactor might have orphaned indirectly (e.g., a private utility used only by the helper you just removed).
338+
3. If a helper is intentionally kept for a known near-term caller, add a one-line `// kept for <opcode>: <reason>` comment so the next audit doesn't delete it without reading.
339+
4. JS-shell verification (`./mach build binaries -j20`) is necessary but **not sufficient**. Either run a full `./mach build -j20` periodically, or grep for `[[maybe_unused]]` / `__attribute__((unused))` / `static (inline)?` in the touched files as a cheap pre-commit check. The full-build catch is mechanical; the upstream MR will reject anything `-Werror=unused-function` flags.
340+
341+
Counterpart to `feedback_no_for_future_use_helpers.md` (don't add speculative helpers without a caller). Together: a helper exists iff it has at least one caller, in every commit.
342+
312343
---
313344

314345
## Key regression tests
@@ -392,6 +423,7 @@ MOZ_PPC64_FORCE_POWER8=1 python3 js/src/jit-test/jit_test.py $JS --jitflags=none
392423
- Use `python3 js/src/jit-test/jit_test.py` directly (not `./mach jit-test`) — we build with `--disable-tests`.
393424
- Always save test output to a file; don't re-run slow tests to extract different information.
394425
- **Do NOT use `obj-sm` for wasm testing** — wasm is disabled at runtime in `-O0` builds.
426+
- **Always pass `-j20` to `./mach build` on the PPC64 box.** The default auto-parallelism picks `-j32` (32 cores × 64 GiB RAM heuristic) which saturates the box during lld linking and makes it SSH-unreachable. `./mach build -j20`, `./mach build binaries -j20`, etc. — every invocation.
395427
- **Never run two heavy commands concurrently on the PPC64 box** (two builds, two jit-test runs, etc.) — sequence them. The sim host (`192.168.64.3`) is the same: a build that overwrites `dist/bin/js` while a test run is shelling it out gives a `PermissionError [Errno 13]` mid-suite (we hit this 2026-04-30).
396428
- **Full-tree rsync before testing, not just `js/src/jit/ppc64/`.** Partial rsyncs leave the remote source out of sync with local in places that bite during build/test:
397429
- `Cargo.lock` + `third_party/rust/<crate>/` after upstream merges (saw `glslopt 0.1.13/0.1.14` and `neqo v0.26.0/v0.26.1` mismatches → cbindgen step fails or "lock file required" errors)

0 commit comments

Comments
 (0)