
Commit c8ded10

Merge pull request #56 from thiremani/codex/post-o3-scalar-unroll
[codex] Unroll post-O3 scalar recurrences
2 parents 9cdc24e + 577e194 commit c8ded10

3 files changed

Lines changed: 708 additions & 15 deletions


docs/Pluto ABI Optimization Plan.md

Lines changed: 47 additions & 15 deletions
Original file line number | Diff line number | Diff line change
@@ -1,7 +1,7 @@
11
# Pluto ABI Optimization Plan
22

3-
**Status:** Phase 1 scalar ABI shipped internally; remaining phases proposed
4-
**Scope:** Internal call lowering, scalar fast paths, tail recursion, external ABI stability
3+
**Status:** Phase 1 scalar ABI and post-O3 scalar recurrence unroll are implemented; remaining ABI phases proposed
4+
**Scope:** Internal call lowering, scalar fast paths, aggregate returns, external ABI stability, targeted post-LLVM loop cleanup
55

66
## 1. Problem
77

@@ -19,7 +19,9 @@ Simple and uniform, but costly for scalar-heavy code:
1919
- single-scalar outputs use `sret` instead of register return
2020
- self tail recursion lowers to recursive calls plus stack traffic instead of loops
2121

22-
The `fib_tail` benchmark exposes this clearly. LLVM `-O3` cannot recover from ABI choices baked into the function signature — it can promote local allocas and simplify CFGs, but it cannot fix pointer-based param ABI or `sret`-only returns. ABI classification and tail-recursion lowering must be done in Pluto.
22+
The `fib_tail` benchmark originally exposed this clearly. LLVM `-O3` cannot recover from ABI choices baked into the function signature — it can promote local allocas and simplify CFGs, but it cannot fix pointer-based param ABI or `sret`-only returns. ABI classification must be done in Pluto.
23+
24+
After Phase 1, `fib_tail` is no longer a strong argument for a Pluto-level tail-call optimization pass: with direct scalar params and returns, LLVM can already eliminate the simple self recursion and inline the helper into the benchmark loop. A Pluto TCO pass should be considered only for stack-safety or broader semantic reasons, not as the next performance optimization.
2325

2426
## 2. Key Principle: Separate Semantics from ABI
2527

@@ -81,25 +83,24 @@ This was the highest-value initial optimization because it benefits all scalar-h
8183

8284
**Benchmark target:** `fib`, `fib_tail`, and other call-heavy scalar code. Note: `sum` is not a useful target here — its optimized IR is already a call-free scalar loop; the current gap vs clang is loop optimization quality, not call ABI.
8385

84-
### Phase 2: Restricted self tail recursion (next highest priority)
86+
### Phase 2: Restricted self tail recursion (deprioritized)
8587

8688
Transform self-recursive calls into loops when all of:
8789

8890
- direct scalar params only
8991
- single direct scalar return
9092
- self call in tail position
91-
- no ownership-sensitive temporaries live across the tail call
92-
- no cleanup work required before return
93+
- cleanup for any owned locals can be emitted explicitly on the tail backedge
9394

94-
Intentionally narrow — ignore multi-output, strings, arrays, mutual recursion. This is medium difficulty because the restricted form avoids all the hard ownership/cleanup interactions.
95+
This is not currently worth pursuing as a performance phase. An experiment against `fib_tail` showed that optimized baseline IR already contains no recursive `FibAux` call after Phase 1, and standalone binary timings were within noise of a custom Pluto loop lowering. The compiler complexity is therefore not justified for this benchmark.
9596

96-
**Benchmark target:** `fib_tail` (eliminates stack growth entirely and removes the remaining recursive-call overhead after Phase 1).
97+
Keep this as a future stack-safety/language-semantics feature only. If revisited, the implementation should be a real analysis/lowering pass with explicit backedge cleanup, not a benchmark-specific special case.
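
For reference, the restricted rewrite this phase describes can be sketched in Python. This is only an illustration of the shape of the transformation, not Pluto's lowering: the function names are hypothetical stand-ins for the `FibAux` helper, and the comment marking the cleanup point is an assumption about where backedge cleanup would go.

```python
def fib_aux_recursive(n: int, a: int, b: int) -> int:
    """Tail-recursive form: the self call is the last action taken."""
    if n == 0:
        return a
    return fib_aux_recursive(n - 1, b, a + b)


def fib_aux_loop(n: int, a: int, b: int) -> int:
    """The same function with the tail call rewritten as a loop backedge."""
    while True:
        if n == 0:
            return a
        # Compute every new argument value before overwriting any parameter,
        # so that `a + b` still sees the old `a` and `b`.
        n, a, b = n - 1, b, a + b
        # Assumed cleanup point: owned locals would be released here,
        # just before control jumps back to the top of the loop.
```

The loop form uses constant stack depth, which is the stack-safety property that would motivate revisiting this phase.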
9798

9899
### Phase 3: Small POD aggregate returns
99100

100101
Support direct multi-output returns for plain-data aggregates (`{I64, I64}`, `{I64, F64}`) when the target ABI allows. Model as LLVM aggregate return and let target classification decide direct vs indirect.
101102

102-
This is the natural next ABI expansion after Phase 2 because the classifier/lowering split from Phase 1 is already in place. The remaining work is target-aware aggregate classification, not another structural refactor.
103+
This is the natural next ABI expansion because the classifier/lowering split from Phase 1 is already in place. The remaining work is target-aware aggregate classification, not another structural refactor.
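
As a rough sketch of what target-aware classification might decide, the following assumes a simple size rule resembling the x86-64 System V convention, where small aggregates of up to 16 bytes can be returned in registers. The table, the threshold, and the function name are illustrative assumptions, not Pluto's classifier, and other targets use different rules.

```python
# Assumed sizes in bytes for the plain-data scalar types named in this plan.
SCALAR_SIZES = {"I64": 8, "F64": 8}

# Assumed limit, modeled on the two-register return of x86-64 System V.
MAX_DIRECT_BYTES = 16


def classify_return(fields):
    """Return "direct" for small all-scalar aggregates, else "indirect"."""
    # Any field that is not a known plain-data scalar forces indirect return.
    if not all(f in SCALAR_SIZES for f in fields):
        return "indirect"
    total = sum(SCALAR_SIZES[f] for f in fields)
    return "direct" if total <= MAX_DIRECT_BYTES else "indirect"
```

Under these assumptions `classify_return(["I64", "F64"])` is direct, while a three-field `I64` aggregate falls back to the indirect path.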
103104

104105
### Phase 4: Generalized ABI classification
105106

@@ -116,8 +117,28 @@ for that benchmark is already a call-free scalar loop, but clang still produces
116117
more optimized kernel. That means `sum` should be treated as a loop/codegen quality
117118
benchmark, not as a primary validation target for the ABI phases above.
118119

119-
The target-metadata quick wins have already landed far enough to make this a
120-
different problem now. The next things to investigate here are:
120+
The first targeted loop/codegen step is implemented: after LLVM `default<O3>`,
121+
Pluto annotates small scalar recurrence loops with `llvm.loop.unroll.count = 4`
122+
and reruns only LLVM's function loop unroller. This is deliberately post-O3 so
123+
it can see loops created by LLVM inlining and tail-recursion elimination. It is
124+
not a global `--unroll-count=4` policy.
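
To make the effect of an unroll count of 4 concrete, here is a source-level Python sketch of a scalar recurrence before and after unrolling. It only models the idea; the real transformation happens on LLVM IR, and the explicit remainder loop below is an assumption about how leftover iterations are handled when the trip count is not a multiple of 4.

```python
def fib_rolled(n: int) -> int:
    """One recurrence step per loop iteration."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def fib_unrolled4(n: int) -> int:
    """Four recurrence steps per main-loop iteration, then a remainder loop."""
    a, b = 0, 1
    main_trips, leftover = divmod(n, 4)
    for _ in range(main_trips):
        # Four copies of the loop body share one loop-control check.
        a, b = b, a + b
        a, b = b, a + b
        a, b = b, a + b
        a, b = b, a + b
    for _ in range(leftover):
        # Handles the 0 to 3 iterations the main loop did not cover.
        a, b = b, a + b
    return a
```

Both functions compute the same values; the unrolled form simply amortizes loop overhead across four steps of the recurrence.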
125+
126+
This pass targets the post-inline `fib_tail` recurrence and intentionally leaves
127+
broader loop classes to LLVM's normal cost model:
128+
129+
- `fib_tail` benefits because the helper has become a small scalar recurrence loop
130+
- `sum` is rejected because the hot loop contains integer remainder (`% 17`)
131+
- `harmonic` is currently vectorized by LLVM on the benchmarked target, so its
132+
post-O3 IR has vector operations and existing loop metadata; a scalar `fdiv`
133+
recurrence remains eligible if LLVM does not vectorize it
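
The three cases above suggest an eligibility filter along the following lines. This Python sketch is a guess at the shape of such a filter based only on the rejections listed here: `LoopSummary`, `is_unroll_candidate`, the opcode lists, and the size threshold are all hypothetical, and the actual pass works on LLVM IR rather than on opcode strings.

```python
from dataclasses import dataclass
from typing import Tuple

# Integer remainder opcodes in LLVM IR; a loop containing one is rejected.
INTEGER_REMAINDER_OPS = {"urem", "srem"}

# Hypothetical bound on what counts as a "small" loop body.
MAX_BODY_INSTRUCTIONS = 16


@dataclass(frozen=True)
class LoopSummary:
    opcodes: Tuple[str, ...]         # opcodes in the loop body
    has_vector_ops: bool = False     # LLVM already vectorized the loop
    has_loop_metadata: bool = False  # loop already carries llvm.loop metadata


def is_unroll_candidate(loop: LoopSummary) -> bool:
    """Accept only small scalar loops that LLVM has not already annotated."""
    if loop.has_vector_ops or loop.has_loop_metadata:
        return False
    if any(op in INTEGER_REMAINDER_OPS for op in loop.opcodes):
        return False
    return len(loop.opcodes) <= MAX_BODY_INSTRUCTIONS


# Illustrative loop bodies; the opcode sequences are made up for the example.
fib_tail_loop = LoopSummary(("phi", "phi", "phi", "add", "add", "icmp", "br"))
sum_loop = LoopSummary(("phi", "phi", "srem", "add", "add", "icmp", "br"))
harmonic_vectorized = LoopSummary(("phi", "fdiv", "fadd", "br"),
                                  has_vector_ops=True, has_loop_metadata=True)
harmonic_scalar = LoopSummary(("phi", "phi", "sitofp", "fdiv", "fadd",
                               "add", "icmp", "br"))
```

With these inputs the filter accepts `fib_tail_loop` and `harmonic_scalar` and rejects `sum_loop` and `harmonic_vectorized`, matching the behavior described in the list.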
134+
135+
Regression coverage compiles a real Pluto `fib_tail` fixture, runs the same
136+
post-O3 pipeline in-process, and checks that metadata reaches the post-inline
137+
recurrence before LLVM's loop unroller runs. Broader A/B experiments should use
138+
temporary local compiler builds rather than permanent user-facing optimization
139+
knobs.
140+
141+
The next things to investigate for loop/codegen quality are:
121142

122143
- emit a more canonical counted-loop fast path for common `I64` ranges, especially `step == 1`
123144
- preserve affine-friendly loop structure so LLVM can unroll and strength-reduce more aggressively
@@ -128,7 +149,9 @@ These are worth treating as a separate optimization track because they improve
128149
call-free kernels like `sum` and still benefit range-heavy scalar code such as
129150
`harmonic`, without depending on more ABI surface area.
130151

131-
**Benchmark target:** `sum`, `harmonic`, and other call-free or loop-dominated integer kernels.
152+
**Benchmark target:** `fib_tail` for the scalar recurrence unroll pass; `sum`,
153+
`harmonic`, and other call-free or loop-dominated kernels for future loop
154+
canonicalization and strength-reduction work.
132155

133156
## 5. Rollout Strategy
134157

@@ -141,8 +164,17 @@ Each phase should:
141164

142165
## 6. Practical Recommendation
143166

144-
If only one remaining optimization can be prioritized: **Phase 2** (restricted self tail recursion). Phase 1 is already shipped, and tail recursion is now the clearest remaining win on the benchmark set.
167+
If only one remaining optimization can be prioritized: **make the implemented
168+
post-O3 scalar recurrence unroll pass boring and well-covered**. It has a clear
169+
`fib_tail` win, a bounded eligibility filter, and a real post-opt IR regression
170+
test. Do not expand it toward `sum`/`harmonic` without a separate measured
171+
hypothesis.
172+
173+
After that, prioritize **the counted-loop fast path from the non-ABI loop/codegen
174+
track**. Phase 1 is already shipped, and `fib_tail` no longer demonstrates a
175+
meaningful Pluto-level TCO win because LLVM already optimizes the simple tail
176+
recursion once scalar ABI lowering is in place.
145177

146-
If two: **Phase 2 + Phase 3** (restricted tail recursion + small POD aggregate returns). That extends the current ABI work with the best continuity and risk/reward ratio.
178+
If staying on ABI work, prioritize **Phase 3** (small POD aggregate returns) over custom TCO. That extends the current ABI work with clearer risk/reward and avoids adding a pass that current benchmarks do not justify.
147179

148-
If a parallel non-ABI effort is desired at the same time, prioritize the counted-loop fast path from the loop/codegen track rather than additional cache or tooling work.
180+
If a parallel non-ABI effort is desired at the same time, continue with the counted-loop fast path from the loop/codegen track rather than additional cache or tooling work.
