Commit 096fe8e
IxVM kernel: cut Aiur FFT cost on reduction- and check-heavy shards (#438)
* IxVM: cheaper Expr.Share / universe lookups (list_drop + field index)
Expr.Share resolution and universe lookups walked the sharing/univ lists
with `list_lookup_u64` (per-step `u64_is_zero` + `relaxed_u64_pred`). Switch
to:
- universe lookups: `list_lookup(univs, flatten_u64(idx))` — cheap per-step
field subtraction instead of the U64 predecessor.
- Expr.Share (Convert convert_expr + Ingress deref_share):
`let ListNode.Cons(e, _) = load(list_drop(sharing, flatten_u64(idx)));`.
list_drop returns the sublist *pointer*; repeated Share lookups collapse
to shared memo entries, so the walk is near-free.
Measured on `lake exe ix check --ixe init.ixe --ixes init.ixes --shard 10`
(Init, owned=3, closure=676): total FFT 3.89G -> 2.11G. The merged
pointer-list walk drops from 1.93G (list_lookup) to 5.8M (list_drop, 0.27% of
the new total).
(Measurement run had in-circuit blake3 verification disabled — not committed.)
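A minimal Python sketch of why a pointer-returning `list_drop` memoizes well. It assumes a toy heap of cons cells and uses `functools.lru_cache` as a stand-in for Aiur's call memoization; `HEAP`, `cons`, `from_list` and the `steps` counter are hypothetical names for illustration, not the kernel's code.

```python
from functools import lru_cache

# Hypothetical heap: each cell is (head, tail_ptr); pointer 0 plays the role of Nil.
HEAP = [None]

def cons(head, tail_ptr):
    HEAP.append((head, tail_ptr))
    return len(HEAP) - 1

def from_list(items):
    ptr = 0
    for item in reversed(items):
        ptr = cons(item, ptr)
    return ptr

steps = 0  # counts list_drop bodies actually executed (cache misses)

@lru_cache(maxsize=None)
def list_drop(ptr, n):
    """Return the pointer to the sublist after dropping n cells."""
    global steps
    steps += 1
    if n == 0:
        return ptr
    _, tail = HEAP[ptr]
    return list_drop(tail, n - 1)

def deref_share(sharing, idx):
    # Mirrors `let ListNode.Cons(e, _) = load(list_drop(sharing, idx))`.
    head, _ = HEAP[list_drop(sharing, idx)]
    return head

sharing = from_list(["e0", "e1", "e2", "e3", "e4"])
first = deref_share(sharing, 3)
steps_after_first = steps
second = deref_share(sharing, 3)  # served from the memo table
```

In this toy model the second lookup of the same index executes no further `list_drop` steps; the arguments and the result are all small pointers or indices.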
* IxVM: hot/cold split address_eq on a limb-0 prefilter
The primitive-dispatch gauntlet in whnf compares each Const head against ~50
known primitive addresses, so address_eq is the single hottest circuit. The
full 32-limb compare ran at width 109 on every call.
Split it: `address_eq` now compares limb 0 only and rejects on a mismatch (the
common case, since a single differing limb proves inequality); on a match it
delegates the remaining 31-limb compare to a separate `address_eq_tail`. Because
Aiur charges a function's full width on every one of its rows, the cold path
must be a separate function (a nested match inside address_eq changes nothing;
measured identical). As its own circuit, address_eq_tail's height is only the
rare limb-0-match rows.
Measured on `lake exe ix check --ixe init.ixe --ixes init.ixes --shard 10`:
- address_eq: width 109 -> 80, FFT 509.8M -> 374.1M
- address_eq_tail (new): width 108, height 1683, FFT 1.9M
- total shard FFT: 2.106G -> 1.973G
Limb-0 is the sweet spot: hot height (259991) dominates tail height (1683), so
pulling a second limb forward (N=2) widens the hot circuit by more than it
saves in the tail (measured 1.976G, worse). Result is identical to a full
32-limb compare (sound).
(Measurement run had in-circuit blake3 verification disabled — not committed.)
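A minimal Python sketch of the limb-0 prefilter, assuming addresses are modeled as 32-element tuples. The `rows` counter is a hypothetical stand-in for circuit height; it does not model width or FFT cost.

```python
LIMBS = 32
rows = {"address_eq": 0, "address_eq_tail": 0}

def address_eq_tail(a, b):
    # Cold path: compares limbs 1..31, reached only when limb 0 matched.
    rows["address_eq_tail"] += 1
    return all(a[i] == b[i] for i in range(1, LIMBS))

def address_eq(a, b):
    # Hot path: a single differing limb already proves inequality.
    rows["address_eq"] += 1
    if a[0] != b[0]:
        return False
    return address_eq_tail(a, b)

head = (7,) + tuple(range(1, LIMBS))
near_miss = (7,) + (0,) * (LIMBS - 1)   # same limb 0, different tail
others = [(100 + k,) + (0,) * (LIMBS - 1) for k in range(48)]
candidates = others + [near_miss, head]

results = [address_eq(head, c) for c in candidates]
```

Of the 50 comparisons here, only the two whose limb 0 matches reach `address_eq_tail`, and the combined result agrees with a full tuple comparison.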
* IxVM: hot/cold split whnf_with_spine (Const + Proj head dispatch)
whnf_with_spine is the single hottest circuit on reduction-heavy consts: width
89 charged on every one of its 3.36M rows. The width was set by two arms that
only a minority of rows reach — the Const-head dispatch (delta/iota/quot/prim
gauntlet) and the Proj-head reduction.
Factor both into their own functions (`whnf_const_head`, `whnf_proj_head`).
Because Aiur charges a function's full width on every row, the ~76% of steps
that are App/Lam/Let now run at the narrow residual width instead of 89; the
wide dispatch only taxes its own (smaller) row count.
Measured on `lake exe ix check --ixe init.ixe --ixes init.ixes --shard 9`
(1 reduction-heavy owned const):
- whnf_with_spine: width 89 -> 34, FFT 6.48G -> 2.48G
- whnf_const_head (new): width 77, height 821490, FFT 1.24G
- whnf_proj_head (new): width 56, height 3452, FFT 0.002G
- total shard FFT: 45.54G -> 42.78G (-6.1%)
The Proj arm is rare (3452 rows) but was nearly half the width; extracting it
dropped whnf_with_spine 65 -> 34 for ~zero relocated cost. Pure refactor.
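The figures above can be cross-checked with a few lines of Python. The width-times-height product is only a rough proxy introduced here for illustration (it is not Aiur's FFT cost formula), and the sketch assumes whnf_with_spine keeps its ~3.36M rows after the split.

```python
# FFT figures (in G) copied from the measurements above.
fft_before = 6.48
fft_after = {"whnf_with_spine": 2.48, "whnf_const_head": 1.24, "whnf_proj_head": 0.002}
fft_saving = fft_before - sum(fft_after.values())
total_saving = 45.54 - 42.78

# Rough proxy: trace cells = width * height (assumption, not the real cost model).
hot_rows = 3_360_000  # rounded row count quoted above
cells_before = 89 * hot_rows
cells_after = 34 * hot_rows + 77 * 821_490 + 56 * 3_452

print(f"FFT saved in the three circuits: {fft_saving:.3f}G")
print(f"total shard FFT saved:           {total_saving:.2f}G")
print(f"cell-count ratio after/before:   {cells_after / cells_before:.3f}")
```

The per-circuit saving accounts for essentially the whole shard-level drop, which is consistent with the change being a pure refactor.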
* IxVM: non-tail collect_spine, single shared definition
Replace the tail-recursive `collect_spine_go(e, acc)` (and its verbatim copy
`collect_spine_simple_go`) with one non-tail `collect_spine(e)` in the
kernelTypes block. Repoint all callers (whnf, primitive, inductive_check,
def_eq); delete both old copies. Keyed on `e` alone, it now dedups shared
sub-spines (0 -> 2.19M cache hits).
Measured on `lake exe ix check --ixe init.ixe --ixes init.ixes --shard 9`:
- collect_spine: 5.38M rows / 2.65G -> 2.17M rows / 0.96G
- list_snoc.Ptr: +0.96G (new)
- total shard FFT: 42.78G -> 41.67G
* IxVM: keyed skip-set for sharded check (replace addr_in_list scan)
check_all_skipping tested each closure const against the assumption-leaf list
via addr_in_list, an O(N) linear address_eq scan — the single largest address_eq
source on sharded checks (1.81M calls on Init shard 9, more than all primitive
dispatch combined).
Build an RBTreeMap skip-set once (keyed on the first 4 address bytes), then test
membership with one rbtree lookup + one confirming address_eq. Key collisions
overwrite: sound because a missed skip only rechecks a frontier const, and the
confirming address_eq rules out false positives.
Measured on `lake exe ix check --ixe init.ixe --ixes init.ixes --shard 9`:
- address_eq: 2.07M rows / 3.48G -> 274K rows / 0.40G
- rbtree build + lookups: +~0.46G
- total shard FFT: 41.67G -> 37.84G
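A minimal Python sketch of the keyed skip-set, with a `dict` standing in for the RBTreeMap and 32-byte `bytes` values standing in for addresses. `build_skip_set` and `should_skip` are hypothetical names for illustration.

```python
def build_skip_set(leaves):
    table = {}
    for addr in leaves:
        table[addr[:4]] = addr  # key collisions overwrite the earlier entry
    return table

def should_skip(table, addr):
    candidate = table.get(addr[:4])      # one keyed lookup
    return candidate is not None and candidate == addr  # one confirming compare

a = bytes([1, 2, 3, 4]) + bytes(28)
b = bytes([1, 2, 3, 4]) + bytes([9]) * 28   # collides with a on the 4-byte key
c = bytes([5, 6, 7, 8]) + bytes(28)
stranger = bytes([5, 6, 7, 8]) + bytes([1]) * 28  # shares c's key, not a leaf

table = build_skip_set([a, b, c])
verdicts = {name: should_skip(table, addr)
            for name, addr in [("a", a), ("b", b), ("c", c), ("stranger", stranger)]}
```

Here `a` is overwritten by `b`, so it is reported as not skippable and would simply be rechecked; `stranger` is rejected by the confirming comparison, so the key collision never yields a false skip.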
* IxVM: collapse symbolic-base Nat.rec succ-tower to an offset
try_nat_linear_rec only folded `Nat.rec base (λ_ ih => succ ih) (Lit n)` when
the base was also a literal; a symbolic base declined and fell through to n
iota steps that materialize succ^n(base). Reduce the symbolic-base case
directly to the offset primitive `Nat.add base (Lit n)` instead.
This keeps the value in the same compact `base + n` form a literal already has,
so def-eq comparing it converges instead of descending n unary succ layers (the
UTF-8 `x + 0xC0` reduction class). Offset magnitude stays KLimbs; only the
nat_add top-position index is a (small) G.
Measured on a `x + D = Nat.rec x succ-step D` benchmark: the unary def-eq
descent collapses (nat_succ_of D -> ~constant), ~-15% total FFT at D=512. The
remaining cost is the in-whnf chain reduction (substitution), unchanged here.
* IxVM: keep Nat base+literal compact (stop succ-tower materialization)
Reducing `Nat.add x (Lit n)` (symbolic x) delta-unfolded into a succ^n(x)
tower, and the nat_succ dispatch then whnf-cascaded the whole chain — O(n)
substitution per reduction, the dominant cost on Nat-arithmetic proofs (the
UTF-8 codec class).
Two coordinated changes:
- whnf (try_nat_offset_stuck): leave `Nat.add base (Lit n)` (non-literal base,
nonzero n) stuck as a compact offset instead of unfolding it.
- def-eq (nat_offset_of + try_def_eq_nat): decompose each side to base + KLimbs
offset (reading `Nat.add base (Lit m)` in O(1), peeling the few succs whnf
exposes) and decide `base_a ≟ base_b` only when offsets are equal; otherwise
fall back. All magnitudes stay KLimbs (no Goldilocks collapse).
Measured on `x + D = Nat.rec x succ-step D` (depths 64..512): total FFT goes
from O(D) (328M at D=512) to a depth-independent ~28.5M (-91% at D=512). Full
`lake test -- --ignored ixvm` passes (297 tests; BAD cases still rejected);
a recursor-on-symbolic-offset completeness test reduces correctly.
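A minimal Python sketch of the base-plus-offset decomposition, assuming a toy expression encoding as tagged tuples. It borrows the names `nat_offset_of` and `try_def_eq_nat` but is not the kernel's implementation: Python integers stand in for KLimbs, and base comparison is plain structural equality.

```python
def nat_offset_of(e):
    """Split e into (base, offset): peel succs, then read one `add base (lit m)`."""
    offset = 0
    while e[0] == "succ":
        offset += 1
        e = e[1]
    if e[0] == "add" and e[2][0] == "lit":
        offset += e[2][1]
        e = e[1]
    return e, offset

def try_def_eq_nat(a, b):
    base_a, off_a = nat_offset_of(a)
    base_b, off_b = nat_offset_of(b)
    if off_a != off_b:
        return None           # decline; the caller falls back to the general path
    return base_a == base_b   # toy stand-in for the recursive def-eq on bases

x, y = ("var", "x"), ("var", "y")
compact = ("add", x, ("lit", 512))
peeled = ("succ", ("succ", ("add", x, ("lit", 510))))

same = try_def_eq_nat(compact, peeled)
different_base = try_def_eq_nat(compact, ("add", y, ("lit", 512)))
different_offset = try_def_eq_nat(compact, ("add", x, ("lit", 3)))
```

Both sides reduce to `(x, 512)` in a handful of steps regardless of the offset, which is the depth-independence the measurement reports.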
* IxVM: keep symbolic Nat.div/mod stuck (stop division-tower blowup)
`Nat.div x (Lit k)` / `Nat.mod x (Lit k)` with a symbolic base delta-unfolded
the division algorithm and expanded hugely, even though the result is
irreducible for a symbolic base. `Nat.shiftRight x k` unfolds to k nested
`Nat.div _ 2`, so a symbolic `>>>` materialized a deep, super-linear tower —
the dominant cost in the UTF-8 codec's `c >>> 6/12/18` encoding.
Generalize `try_nat_offset_stuck` (which already keeps `Nat.add base (Lit n)`
compact) to also leave `Nat.div`/`Nat.mod base (Lit k)` (k ≥ 2) stuck. Thresholds
keep `x/1 = x` and `x/0 = 0` reducing. Both `x >>> 18` and `(x >>> 9) >>> 9` then
reduce to the same nested-stuck-div form, so def-eq stays structural and the
identity still holds.
Measured on `x >>> 18 = (x >>> 9) >>> 9`: total FFT 15.65G -> 455M (-97%), now
depth-independent (matches `x >>> 6`). No regression on the Nat.add reproducers;
full `lake test -- --ignored ixvm` passes (297 tests).
* IxVM: narrow Ixon expr deserialization (Let split + telescope by &Expr)
Ingress deserialization (parsing every closure const's Ixon bytes into Expr
trees) is the dominant cost when checking large-closure constants. Two width
cuts on the hottest parsers:
- get_expr: the Let arm (three recursive get_expr calls) set the function width
though Let is rare. Factored into get_expr_let — width 102 -> 75.
- get_app_telescope: passed `func` by value (the wide Expr union) on every
recursion step. Pass it by `&Expr`, loaded only at the base case — width
95 -> 78.
Measured by profiling ingress of the `ByteArray.utf8DecodeChar?…` closure
(2000 consts): get_expr 766M -> 563M, get_app_telescope 477M -> 392M, total
ingress FFT 3.13G -> 2.85G. Full `lake test -- --ignored ixvm` passes (297).
* IxVM: lbr-trimmed context keys for whnf/infer/def-eq memoization
Key whnf / k_infer / k_is_def_eq memoization on the loose-bvar-reachable
suffix of the local context (ctx_trim / ctx_reachable), mirroring Rust's
whnf_key / infer_key / def_eq_ctx_key. A closed subterm (lbr 0) keys to
the empty context and shares its result across every binder depth; eager
substitution is unchanged. Fast paths (closed -> Nil, full-reach -> as-is)
keep the boundary cheap.
FFT (lake exe ix check <const> --stats-out): Nat.add_comm 58.28M -> 57.90M
(-0.66%); Int.emod_emod_of_dvd 5.090G -> 4.691G (-7.85%); Init shard 10
4.738G -> 4.721G (-0.37%). 297 ixvm tests pass.
* ix CLI: resolve private Lean.Name args in check/prove
parseName can't rebuild a private name (_private.M.0.foo) from its
display string -- the marker / scope-index components don't round-trip
through naive dot-splitting. Add toString fallbacks: resolveName for the
compiled-in env and resolveIxeAddr for a .ixe env, so check and prove
(which share forEachClaim) resolve names like
_private.Init.Data.Vector.Extract.0.Vector.extract_append._proof_1
across both the compiled-in and --ixe paths.
* IxVM: multi-arg beta with narrow substitution walk
whnf_apply_beta substituted one spine arg per expr_inst1, re-walking the
(shrinking) body for each, so a K-arg application `(λλλ.body) x y z` walked
the body K times. Mirror Rust whnf.rs:541-567: peel the whole telescope
(peel_beta) and substitute all consumed args in ONE expr_inst_many walk
(simul_subst semantics). Single-arg betas stay on the cheap expr_inst1 path,
so short telescopes don't pay the list overhead (why the earlier
unconditional simul_subst regressed). Hot/cold split expr_inst_many_walk's
BVar arm (list_length / list_lookup / expr_lift) into expr_inst_many_bvar so
the hot App/Lam walk stays narrow.
Substitution was 53% of FFT on omega-heavy proofs. FFT (ix check <const>):
_private.…Vector.extract_append._proof_1 56.64G -> 38.31G (-32.4%);
Nat.add_comm 57.90M -> 56.17M; Init shard 10 4.721G -> 4.668G (-1.1%);
shard 9 16.346G -> 16.314G. 297 ixvm tests pass.
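A minimal Python sketch of peeling a lambda telescope and substituting all consumed arguments in one walk. It assumes de Bruijn terms encoded as tagged tuples and closed arguments (so no lifting is needed, unlike the real BVar arm); `beta_many` and `beta_one_at_a_time` are hypothetical names for illustration.

```python
def inst_many(e, substs, depth=0):
    """Replace loose bvars 0..len(substs)-1 (relative to depth) in a single walk."""
    tag = e[0]
    if tag == "bvar":
        i = e[1]
        if i < depth:
            return e                      # bound inside the body
        if i - depth < len(substs):
            return substs[i - depth]      # closed arg: no lift required
        return ("bvar", i - len(substs))  # loose var above the telescope
    if tag == "app":
        return ("app", inst_many(e[1], substs, depth), inst_many(e[2], substs, depth))
    if tag == "lam":
        return ("lam", inst_many(e[1], substs, depth + 1))
    return e                              # const

def peel_beta(f, args):
    k = 0
    while f[0] == "lam" and k < len(args):
        f, k = f[1], k + 1
    return f, k

def beta_many(f, args):
    body, k = peel_beta(f, args)
    # bvar 0 is the innermost binder, i.e. the last consumed argument.
    result = inst_many(body, list(reversed(args[:k])))
    for extra in args[k:]:
        result = ("app", result, extra)
    return result

def beta_one_at_a_time(f, args):
    for a in args:
        f = inst_many(f[1], [a]) if f[0] == "lam" else ("app", f, a)
    return f

x, y, z, w = (("const", n) for n in "xyzw")
body = ("app", ("app", ("bvar", 2), ("bvar", 1)), ("bvar", 0))
f = ("lam", ("lam", ("lam", body)))
multi = beta_many(f, [x, y, z, w])
single = beta_one_at_a_time(f, [x, y, z, w])
```

On this example both strategies produce the same term, but `beta_many` walks the body once where the one-at-a-time version walks it three times.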
* IxVM: gate the primitive gauntlet behind a memoized prim_any_addr
whnf_const_head ran 5 try_* dispatchers (nat / str / bitvec / native /
decidable) in sequence on every Const head, each comparing the head address
against its family's primitive addresses — wasted on the common non-primitive
head, and worse, several try_* speculatively WHNF their args (to test for
literals / decidables), cascading into deep reductions that are then thrown
away. Factor the chain into try_address_primitives and gate it behind
prim_any_addr: a memoized (per head address) sum of the 43 head-dispatched
primitive address_eq checks — 1 only for a real primitive, so a 0 skips the
whole gauntlet straight to projection-definition / delta.
Validated by a differential assert (prim_any_addr must be 1 whenever the
gauntlet matched) run over the 297-test suite + Int.emod_emod_of_dvd +
Vector.extract_append._proof_1; it caught two missing addresses
(string_utf8_byte_size, punit_size_of_1), then passed clean across millions of
reductions. Assert removed after validation.
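A minimal Python sketch of the gate and of the differential check described above. The primitive families and head names are hypothetical placeholders, `functools.lru_cache` stands in for per-head memoization, and set membership stands in for the address_eq checks.

```python
from functools import lru_cache

# Hypothetical primitive families; the real gate sums 43 address_eq checks.
PRIM_FAMILIES = {
    "nat": frozenset({"Nat.add", "Nat.mul"}),
    "str": frozenset({"String.append"}),
}
dispatch_calls = 0

@lru_cache(maxsize=None)
def prim_any_addr(head):
    # 1 only for a real primitive, 0 otherwise (families are disjoint here).
    return sum(1 for family in PRIM_FAMILIES.values() if head in family)

def try_address_primitives(head):
    global dispatch_calls
    for name, family in PRIM_FAMILIES.items():
        dispatch_calls += 1
        if head in family:
            return name
    return None

def whnf_const_head(head):
    if prim_any_addr(head) == 0:
        return "delta"  # skip the whole gauntlet
    return try_address_primitives(head)

def differential_check(heads):
    # The gate must be 1 whenever the ungated gauntlet matches.
    for head in heads:
        if try_address_primitives(head) is not None:
            assert prim_any_addr(head) == 1, head

outcomes = [whnf_const_head(h) for h in ["List.map", "List.map", "Nat.add", "Foo.bar"]]
calls_with_gate = dispatch_calls
differential_check(["List.map", "Nat.add", "String.append", "Foo.bar"])
```

With the gate in place, only the one primitive head enters the dispatcher chain; a family missing from the gate's table would trip the assertion in `differential_check`.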
FFT (ix check <const>): _private.…Vector.extract_append._proof_1
38.31G -> 28.02G (-26.8%); Int.emod_emod_of_dvd 4.69G -> 3.97G (-15%);
Nat.add_comm 56.17M -> 56.06M; Init shard 9 16.314G -> 16.279G; shard 10
4.668G -> 4.659G. 297 ixvm tests pass.
1 parent 623be4b commit 096fe8e
12 files changed
Lines changed: 646 additions & 216 deletions