Commit 2d41a96
authored
Jcb/kernel sharding (#436)
* Sharding: out-of-circuit kernel profiler + balanced env partitioner
Add the measure -> partition -> manifest pipeline for splitting a compiled
`.ixe` environment into balanced, low-overhead shards, so large environments
(mathlib is ~631k blocks) can be proven in zero-knowledge as independent
conditional proofs rather than one infeasible monolith.
Two new CLI verbs; the kernel runs once to profile, then sharding is cheap to
re-tune for any shard count:
ix profile <env.ixe> -> env.ixesp (per-block heartbeats + delta graph)
ix shard <env.ixesp> -> env.ixes (N-shard manifest + what-if metrics)
Phase A -- out-of-circuit kernel profiler (gated; zero overhead when off):
- Record, per checked constant, its heartbeats (recursive fuel) and the set of
constants whose definition bodies it delta-unfolds. The *delta-unfold* graph,
not the static reference graph, is the real cross-shard cost: a shard must
ingress the body of any foreign block its members unfold.
- KEnv gains an optional ProfileSink (per-worker accumulator). Capture sites are
the delta-unfold commit points in whnf.rs and per-constant fuel accounting in
tc.rs; begin_const markers in check.rs/inductive.rs attribute work to the
right constant. Per-constant cache isolation makes recording sound (no unfolds
skipped by cross-constant memo) and faithful to in-circuit, un-memoized cost.
- src/ix/profile.rs: the `.ixesp` block-profile model (heartbeats + serialized
size + CSR delta graph), the ProfileSink accumulator, and explicit
little-endian (de)serialization (no serde dependency).
Phase B -- partitioner (src/ix/shard.rs, pure):
- Weighted hypergraph partitioning under the connectivity-1 (km1) metric:
vertices = blocks weighted by heartbeats (balance), nets = blocks weighted by
serialized size (ingress cost); objective = total cross-shard ingress bytes.
- Recursive bisection via greedy graph-growing + Fiduccia-Mattheyses refinement,
with cut-net splitting on recursion and a hub-net cap. Leaf-count allocation
distributes the N-shard budget by heartbeat mass clamped to [1, side size], so
any N works (not only powers of two) and every shard is non-empty when
#blocks >= N. A balance-weight cap (heartbeats capped at total/N) keeps splits
even while accepting the unavoidable imbalance from atomic mutual blocks.
- Parallelized with rayon across independent subtrees (deterministic: identical
result to serial), with live progress output so a slow run is never mistaken
for a stuck one.
Phase C -- manifest (src/ix/shard.rs):
- `.ixes` per-shard manifest: member blocks, heartbeat/own-size sums, foreign
(delta-dependency) block sets, delta-based assumption-tree roots
(merkle_root_canonical), and a what-if summary (balance, total cross-shard
ingress, atomic-block floor).
FFI + CLI:
- src/ffi/kernel.rs: profile_anon_ixe (runs the anon kernel over a `.ixe` with
recording, maps constants to home blocks, writes `.ixesp`) and shard_esp
(partition + manifest); rs_kernel_profile_anon / rs_shard_esp FFI.
- Ix/Cli/{ProfileCmd,ShardCmd}.lean, Ix/KernelCheck.lean externs, Main wiring.
No external graph-library dependency; the partitioner is self-contained. Kernel
changes are gated behind KEnv::profile_sink, so production checking is
unaffected. Validated end-to-end on initstd/lean/mathlib (64/128/256 shards,
0 empty shards). Follow-up: multilevel coarsening to make mathlib-scale
partitions run in seconds and recover cut quality.
* Sharding: multilevel coarsening for the env partitioner
Replace the flat greedy+FM body of `bisect()` with a multilevel V-cycle
(coarsen → partition the tiny coarsest graph → uncoarsen + refine), so large
environments partition in a fraction of the time *and* at markedly lower
cross-shard ingress. Self-contained to `src/ix/shard.rs`; recursion, leaf-count
allocation, cut-net splitting, rayon parallelism, and the balance-weight cap are
unchanged.
Each bisection now:
- coarsens the sub-hypergraph by heavy-edge matching (merge blocks that co-occur
in small, heavy delta-nets) under a cluster-weight cap, down to ~256
supervertices;
- decides the global cut once on that tiny graph (greedy graph-growing + FM to
convergence, from several diverse seeds, keeping the lowest balanced cut);
- uncoarsens, projecting the cut down and boundary-refining with FM at each
level.
Deciding global structure on the small graph and only ever refining locally on
the big ones is what improves both axes at once.
Benchmarks (profiled `.ixe` → partition; balance ±5%; flat baseline → multilevel;
0 empty shards, deterministic in every case):
env n partition time cross-shard ingress
init 64 5.0s → 2.7s 8.77M → 6.52M (-26%)
init 128 6.0s → 2.7s 12.44M → 9.83M (-21%)
init 256 6.0s → 3.0s 17.71M → 15.03M (-15%)
lean 64 8.0s → 3.4s 9.59M → 7.42M (-23%)
lean 128 8.0s → 3.5s 12.59M → 10.48M (-17%)
lean 256 7.0s → 3.6s 18.73M → 15.49M (-17%)
mathlib 64 99s → 30.8s 185.63M → 93.60M (-50%)
mathlib 128 101s → 31.4s 233.58M →131.84M (-44%)
mathlib 256 104s → 33.8s 294.85M →184.32M (-37%)
mathlib (631k blocks, 12.4M delta edges) partitions ~3.2x faster with roughly
half the cross-shard ingress; its heartbeat imbalance also drops (1.66x → 1.53x
at n=64). The non-empty-shard guarantee and determinism hold throughout.
Implementation notes — what real-env validation required beyond the textbook
scheme:
- Fallback pairing in matching: a vertex with no heavy-edge partner (delta-sparse
or only in hub nets) is merged with the next unmatched vertex. They share no
tracked net, so the merge is cut-neutral, but it keeps coarsening shrinking
~2x/pass instead of stalling far above the target (which had left an expensive
initial partition on a ~51k-vertex graph).
- Incremental FM gains: a move updates only the changed net's contribution to
each co-pin (O(Σ pins of the moved vertex's nets)) instead of recomputing
neighbour gains from scratch (O(Σ neighbour degrees)). Dominant speedup —
refining the finest level dropped from ~11s to ~0.9s on init.
- Degenerate-split rejection: graph-growing seeded at a light vertex can sweep an
entire sub onto one side (cut 0) when one atomic block holds more than half the
balance weight; the initial-partition selector now rejects empty-sided
candidates and prefers balanced-then-min-cut, so no shard is left empty.
- One boundary-refinement pass per uncoarsen level (measured 1.6–1.8x faster than
two for only ~3–6% more ingress — a clear win given the margin).
Removes the FM-skip heuristic (`FM_SKIP_VERTICES` / `FM_FULL_VERTICES`): full
refinement is now affordable at every size because it only ever runs on
already-good partitions. Adds unit tests for one-level contraction, cluster-cap
matching, large-cluster separation through the full V-cycle, determinism, and
the non-empty guarantee under a dominant block.
* clippy & fmt1 parent e2794c9 commit 2d41a96
13 files changed
Lines changed: 2701 additions & 0 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
| 1 | + | |
| 2 | + | |
| 3 | + | |
| 4 | + | |
| 5 | + | |
| 6 | + | |
| 7 | + | |
| 8 | + | |
| 9 | + | |
| 10 | + | |
| 11 | + | |
| 12 | + | |
| 13 | + | |
| 14 | + | |
| 15 | + | |
| 16 | + | |
| 17 | + | |
| 18 | + | |
| 19 | + | |
| 20 | + | |
| 21 | + | |
| 22 | + | |
| 23 | + | |
| 24 | + | |
| 25 | + | |
| 26 | + | |
| 27 | + | |
| 28 | + | |
| 29 | + | |
| 30 | + | |
| 31 | + | |
| 32 | + | |
| 33 | + | |
| 34 | + | |
| 35 | + | |
| 36 | + | |
| 37 | + | |
| 38 | + | |
| 39 | + | |
| 40 | + | |
| 41 | + | |
| 42 | + | |
| 43 | + | |
| 44 | + | |
| 45 | + | |
| 46 | + | |
| 47 | + | |
| 48 | + | |
| 49 | + | |
| 50 | + | |
| 51 | + | |
| 52 | + | |
| 53 | + | |
| 54 | + | |
| 55 | + | |
| 56 | + | |
| 57 | + | |
| 58 | + | |
| 59 | + | |
| 60 | + | |
| 61 | + | |
| 62 | + | |
| 63 | + | |
| 64 | + | |
| 65 | + | |
| 66 | + | |
| 67 | + | |
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
| 1 | + | |
| 2 | + | |
| 3 | + | |
| 4 | + | |
| 5 | + | |
| 6 | + | |
| 7 | + | |
| 8 | + | |
| 9 | + | |
| 10 | + | |
| 11 | + | |
| 12 | + | |
| 13 | + | |
| 14 | + | |
| 15 | + | |
| 16 | + | |
| 17 | + | |
| 18 | + | |
| 19 | + | |
| 20 | + | |
| 21 | + | |
| 22 | + | |
| 23 | + | |
| 24 | + | |
| 25 | + | |
| 26 | + | |
| 27 | + | |
| 28 | + | |
| 29 | + | |
| 30 | + | |
| 31 | + | |
| 32 | + | |
| 33 | + | |
| 34 | + | |
| 35 | + | |
| 36 | + | |
| 37 | + | |
| 38 | + | |
| 39 | + | |
| 40 | + | |
| 41 | + | |
| 42 | + | |
| 43 | + | |
| 44 | + | |
| 45 | + | |
| 46 | + | |
| 47 | + | |
| 48 | + | |
| 49 | + | |
| 50 | + | |
| 51 | + | |
| 52 | + | |
| 53 | + | |
| 54 | + | |
| 55 | + | |
| 56 | + | |
| 57 | + | |
| 58 | + | |
| 59 | + | |
| 60 | + | |
| 61 | + | |
| 62 | + | |
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
142 | 142 | | |
143 | 143 | | |
144 | 144 | | |
| 145 | + | |
| 146 | + | |
| 147 | + | |
| 148 | + | |
| 149 | + | |
| 150 | + | |
| 151 | + | |
| 152 | + | |
| 153 | + | |
| 154 | + | |
| 155 | + | |
| 156 | + | |
| 157 | + | |
| 158 | + | |
| 159 | + | |
| 160 | + | |
| 161 | + | |
| 162 | + | |
| 163 | + | |
| 164 | + | |
| 165 | + | |
| 166 | + | |
| 167 | + | |
| 168 | + | |
145 | 169 | | |
146 | 170 | | |
147 | 171 | | |
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
5 | 5 | | |
6 | 6 | | |
7 | 7 | | |
| 8 | + | |
8 | 9 | | |
| 10 | + | |
9 | 11 | | |
10 | 12 | | |
11 | 13 | | |
| |||
27 | 29 | | |
28 | 30 | | |
29 | 31 | | |
| 32 | + | |
30 | 33 | | |
| 34 | + | |
31 | 35 | | |
32 | 36 | | |
33 | 37 | | |
| |||
0 commit comments