docs/Pluto ABI Optimization Plan.md
+47 -15 (47 additions & 15 deletions)
@@ -1,7 +1,7 @@
1
1
# Pluto ABI Optimization Plan
2
2
3
-
**Status:** Phase 1 scalar ABI shipped internally; remaining phases proposed
4
-
**Scope:** Internal call lowering, scalar fast paths, tail recursion, external ABI stability
3
+
**Status:** Phase 1 scalar ABI and post-O3 scalar recurrence unroll are implemented; remaining ABI phases proposed
4
+
**Scope:** Internal call lowering, scalar fast paths, aggregate returns, external ABI stability, targeted post-LLVM loop cleanup
5
5
6
6
## 1. Problem
7
7
@@ -19,7 +19,9 @@ Simple and uniform, but costly for scalar-heavy code:
19
19
- single-scalar outputs use `sret` instead of register return
20
20
- self tail recursion lowers to recursive calls plus stack traffic instead of loops
21
21
22
-
The `fib_tail` benchmark exposes this clearly. LLVM `-O3` cannot recover from ABI choices baked into the function signature — it can promote local allocas and simplify CFGs, but it cannot fix pointer-based param ABI or `sret`-only returns. ABI classification and tail-recursion lowering must be done in Pluto.
22
+
The `fib_tail` benchmark originally exposed this clearly. LLVM `-O3` cannot recover from ABI choices baked into the function signature — it can promote local allocas and simplify CFGs, but it cannot fix pointer-based param ABI or `sret`-only returns. ABI classification must be done in Pluto.
23
+
24
+
After Phase 1, `fib_tail` is no longer a strong argument for a Pluto-level tail-call optimization pass: with direct scalar params and returns, LLVM can already eliminate the simple self recursion and inline the helper into the benchmark loop. A Pluto TCO pass should be considered only for stack-safety or broader semantic reasons, not as the next performance optimization.
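As an illustration of the shape being discussed, here is a minimal sketch in Python rather than Pluto. `fib_aux` stands in for the `FibAux` helper (its exact signature is an assumption, since the Pluto source is not shown here), and `fib_loop` is the loop form that tail-recursion elimination effectively produces once parameters and the return value are direct scalars.

```python
def fib_aux(n: int, a: int, b: int) -> int:
    """Tail-recursive helper: the self call is the last action taken."""
    if n == 0:
        return a
    return fib_aux(n - 1, b, a + b)


def fib_loop(n: int, a: int = 0, b: int = 1) -> int:
    """Same computation with the tail call rewritten as a loop backedge."""
    while n != 0:
        # All parameters are updated together, then control returns to the top.
        n, a, b = n - 1, b, a + b
    return a


def fib_tail(n: int) -> int:
    """Entry point that seeds the accumulator pair (hypothetical name)."""
    return fib_aux(n, 0, 1)
```

With pointer-based params and `sret` returns, each "parameter update" would instead be a store through memory, which is the part LLVM cannot undo from the signature alone.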
23
25
24
26
## 2. Key Principle: Separate Semantics from ABI
25
27
@@ -81,25 +83,24 @@ This was the highest-value initial optimization because it benefits all scalar-h
81
83
82
84
**Benchmark target:** `fib`, `fib_tail`, and other call-heavy scalar code. Note: `sum` is not a useful target here. Its optimized IR is already a call-free scalar loop, so the current gap vs clang is loop optimization quality, not call ABI.
Transform self-recursive calls into loops when all of the following hold:
87
89
88
90
- direct scalar params only
89
91
- single direct scalar return
90
92
- self call in tail position
91
-
- no ownership-sensitive temporaries live across the tail call
92
-
- no cleanup work required before return
93
+
- cleanup for any owned locals can be emitted explicitly on the tail backedge
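The conditions above can be read as a single predicate over a candidate call site. The sketch below is in Python and is not the Pluto compiler's data model: `TailCallCandidate`, its fields, and `SCALAR_TYPES` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import List

# Only the scalar types named in this plan; a real classifier would know more.
SCALAR_TYPES = {"I64", "F64"}


@dataclass
class TailCallCandidate:
    caller: str
    callee: str
    param_types: List[str]
    return_types: List[str]
    in_tail_position: bool
    # True when cleanup for owned locals can be emitted on the backedge.
    backedge_cleanup_possible: bool


def eligible_for_loop_lowering(c: TailCallCandidate) -> bool:
    """Return True only if every listed condition holds."""
    return (
        c.caller == c.callee  # self recursion only
        and all(t in SCALAR_TYPES for t in c.param_types)
        and len(c.return_types) == 1
        and c.return_types[0] in SCALAR_TYPES
        and c.in_tail_position
        and c.backedge_cleanup_possible
    )
```

Keeping the check as one conjunction makes it easy to see that any single failing condition falls back to an ordinary call.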
93
94
94
-
Intentionally narrow — ignore multi-output, strings, arrays, mutual recursion. This is medium difficulty because the restricted form avoids all the hard ownership/cleanup interactions.
95
+
This is not currently worth pursuing as a performance phase. An experiment against `fib_tail` showed that optimized baseline IR already contains no recursive `FibAux` call after Phase 1, and standalone binary timings were within noise of a custom Pluto loop lowering. The compiler complexity is therefore not justified for this benchmark.
95
96
96
-
**Benchmark target:** `fib_tail` (eliminates stack growth entirely and removes the remaining recursive-call overhead after Phase 1).
97
+
Keep this as a future stack-safety/language-semantics feature only. If revisited, the implementation should be a real analysis/lowering pass with explicit backedge cleanup, not a benchmark-specific special case.
97
98
98
99
### Phase 3: Small POD aggregate returns
99
100
100
101
Support direct multi-output returns for plain-data aggregates (`{I64, I64}`, `{I64, F64}`) when the target ABI allows. Model as LLVM aggregate return and let target classification decide direct vs indirect.
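A rough sketch of what "direct vs indirect" could look like for the two example aggregates, written in Python. Everything here is an assumption for illustration: `classify_return`, `FIELD_SIZES`, and the 16-byte limit, which loosely mirrors the x86-64 System V rule for small aggregates and ignores alignment, padding, and per-target differences that real classification has to handle.

```python
from typing import Sequence

# Byte sizes for the plain-data field types named in this plan.
FIELD_SIZES = {"I64": 8, "F64": 8}


def classify_return(fields: Sequence[str], max_direct_bytes: int = 16) -> str:
    """Classify a return signature as void, direct, or indirect (sret)."""
    if not fields:
        return "void"
    if any(f not in FIELD_SIZES for f in fields):
        # Non-POD or unknown field types keep the existing indirect path.
        return "indirect"
    if len(fields) == 1:
        return "direct_scalar"
    total = sum(FIELD_SIZES[f] for f in fields)
    return "direct_aggregate" if total <= max_direct_bytes else "indirect"
```

Under these assumptions `{I64, I64}` and `{I64, F64}` both classify as direct aggregates, while a three-field `I64` tuple falls back to the indirect path.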
101
102
102
-
This is the natural next ABI expansion after Phase 2 because the classifier/lowering split from Phase 1 is already in place. The remaining work is target-aware aggregate classification, not another structural refactor.
103
+
This is the natural next ABI expansion because the classifier/lowering split from Phase 1 is already in place. The remaining work is target-aware aggregate classification, not another structural refactor.
103
104
104
105
### Phase 4: Generalized ABI classification
105
106
@@ -116,8 +117,28 @@ for that benchmark is already a call-free scalar loop, but clang still produces
116
117
more optimized kernel. That means `sum` should be treated as a loop/codegen quality
117
118
benchmark, not as a primary validation target for the ABI phases above.
118
119
119
-
The target-metadata quick wins have already landed far enough to make this a
120
-
different problem now. The next things to investigate here are:
120
+
The first targeted loop/codegen step is implemented: after LLVM `default<O3>`,
121
+
Pluto annotates small scalar recurrence loops with `llvm.loop.unroll.count = 4`
122
+
and reruns only LLVM's function loop unroller. This is deliberately post-O3 so
123
+
it can see loops created by LLVM inlining and tail-recursion elimination. It is
124
+
not a global `--unroll-count=4` policy.
125
+
126
+
This pass targets the post-inline `fib_tail` recurrence and intentionally leaves
127
+
broader loop classes to LLVM's normal cost model:
128
+
129
+
- `fib_tail` benefits because the helper has become a small scalar recurrence loop
130
+
- `sum` is rejected because the hot loop contains integer remainder (`% 17`)
131
+
- `harmonic` is currently vectorized by LLVM on the benchmarked target, so its
132
+
post-O3 IR has vector operations and existing loop metadata; a scalar `fdiv`
133
+
recurrence remains eligible if LLVM does not vectorize it
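The accept/reject behaviour described in this list can be sketched as a small filter. This is a Python model, not the actual pass: `LoopSummary`, `unroll_count_for`, the opcode lists, and the `MAX_BODY_INSTRUCTIONS` threshold are all hypothetical, and the real filter operates on post-O3 LLVM IR rather than opcode strings.

```python
from dataclasses import dataclass
from typing import List, Optional

REJECTED_OPCODES = {"urem", "srem"}  # integer remainder in the hot loop
MAX_BODY_INSTRUCTIONS = 16           # hypothetical meaning of "small"
UNROLL_COUNT = 4                     # value used for llvm.loop.unroll.count


@dataclass
class LoopSummary:
    name: str
    opcodes: List[str]
    has_vector_ops: bool = False
    has_loop_metadata: bool = False


def unroll_count_for(loop: LoopSummary) -> Optional[int]:
    """Return the unroll count to annotate, or None to leave the loop alone."""
    if loop.has_vector_ops or loop.has_loop_metadata:
        return None  # already handled by LLVM, e.g. a vectorized loop
    if len(loop.opcodes) > MAX_BODY_INSTRUCTIONS:
        return None  # not a small loop
    if "phi" not in loop.opcodes:
        return None  # no loop-carried value, so not a recurrence
    if REJECTED_OPCODES & set(loop.opcodes):
        return None  # remainder makes the body a poor unroll candidate
    return UNROLL_COUNT
```

Returning `None` rather than a smaller count keeps rejected loops entirely under LLVM's normal cost model, which matches the intent of not introducing a global unroll policy.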
134
+
135
+
Regression coverage compiles a real Pluto `fib_tail` fixture, runs the same
136
+
post-O3 pipeline in-process, and checks that metadata reaches the post-inline
137
+
recurrence before LLVM's loop unroller runs. Broader A/B experiments should use
138
+
temporary local compiler builds rather than permanent user-facing optimization
139
+
knobs.
140
+
141
+
The next things to investigate for loop/codegen quality are:
121
142
122
143
- emit a more canonical counted-loop fast path for common `I64` ranges, especially `step == 1`
123
144
- preserve affine-friendly loop structure so LLVM can unroll and strength-reduce more aggressively
@@ -128,7 +149,9 @@ These are worth treating as a separate optimization track because they improve
128
149
call-free kernels like `sum` and still benefit range-heavy scalar code such as
129
150
`harmonic`, without depending on more ABI surface area.
130
151
131
-
**Benchmark target:** `sum`, `harmonic`, and other call-free or loop-dominated integer kernels.
152
+
**Benchmark target:** `fib_tail` for the scalar recurrence unroll pass; `sum`,
153
+
`harmonic`, and other call-free or loop-dominated kernels for future loop
154
+
canonicalization and strength-reduction work.
132
155
133
156
## 5. Rollout Strategy
134
157
@@ -141,8 +164,17 @@ Each phase should:
141
164
142
165
## 6. Practical Recommendation
143
166
144
-
If only one remaining optimization can be prioritized: **Phase 2** (restricted self tail recursion). Phase 1 is already shipped, and tail recursion is now the clearest remaining win on the benchmark set.
167
+
If only one remaining optimization can be prioritized: **make the implemented
168
+
post-O3 scalar recurrence unroll pass boring and well-covered**. It has a clear
169
+
`fib_tail` win, a bounded eligibility filter, and a real post-opt IR regression
170
+
test. Do not expand it toward `sum`/`harmonic` without a separate measured
171
+
hypothesis.
172
+
173
+
After that, prioritize **the counted-loop fast path from the non-ABI loop/codegen
174
+
track**. Phase 1 is already shipped, and `fib_tail` no longer demonstrates a
175
+
meaningful Pluto-level TCO win because LLVM already optimizes the simple tail
176
+
recursion once scalar ABI lowering is in place.
145
177
146
-
If two: **Phase 2 + Phase 3** (restricted tail recursion + small POD aggregate returns). That extends the current ABI work with the best continuity and risk/reward ratio.
178
+
If staying on ABI work, prioritize **Phase 3** (small POD aggregate returns) over custom TCO. That extends the current ABI work with clearer risk/reward and avoids adding a pass that current benchmarks do not justify.
147
179
148
-
If a parallel non-ABI effort is desired at the same time, prioritize the counted-loop fast path from the loop/codegen track rather than additional cache or tooling work.
180
+
If a parallel non-ABI effort is desired at the same time, continue with the counted-loop fast path from the loop/codegen track rather than additional cache or tooling work.