Commit 2d41a96

Jcb/kernel sharding (#436)
* Sharding: out-of-circuit kernel profiler + balanced env partitioner

Add the measure -> partition -> manifest pipeline for splitting a compiled
`.ixe` environment into balanced, low-overhead shards, so large environments
(mathlib is ~631k blocks) can be proven in zero-knowledge as independent
conditional proofs rather than one infeasible monolith.

Two new CLI verbs; the kernel runs once to profile, then sharding is cheap to
re-tune for any shard count:

  ix profile <env.ixe>   -> env.ixesp  (per-block heartbeats + delta graph)
  ix shard   <env.ixesp> -> env.ixes   (N-shard manifest + what-if metrics)

Phase A -- out-of-circuit kernel profiler (gated; zero overhead when off):
- Record, per checked constant, its heartbeats (recursive fuel) and the set of
  constants whose definition bodies it delta-unfolds. The *delta-unfold* graph,
  not the static reference graph, is the real cross-shard cost: a shard must
  ingress the body of any foreign block its members unfold.
- KEnv gains an optional ProfileSink (per-worker accumulator). Capture sites
  are the delta-unfold commit points in whnf.rs and per-constant fuel
  accounting in tc.rs; begin_const markers in check.rs/inductive.rs attribute
  work to the right constant. Per-constant cache isolation makes recording
  sound (no unfolds skipped by cross-constant memo) and faithful to
  in-circuit, un-memoized cost.
- src/ix/profile.rs: the `.ixesp` block-profile model (heartbeats + serialized
  size + CSR delta graph), the ProfileSink accumulator, and explicit
  little-endian (de)serialization (no serde dependency).

Phase B -- partitioner (src/ix/shard.rs, pure):
- Weighted hypergraph partitioning under the connectivity-1 (km1) metric:
  vertices = blocks weighted by heartbeats (balance), nets = blocks weighted
  by serialized size (ingress cost); objective = total cross-shard ingress
  bytes.
- Recursive bisection via greedy graph-growing + Fiduccia-Mattheyses
  refinement, with cut-net splitting on recursion and a hub-net cap.
  Leaf-count allocation distributes the N-shard budget by heartbeat mass
  clamped to [1, side size], so any N works (not only powers of two) and every
  shard is non-empty when #blocks >= N. A balance-weight cap (heartbeats
  capped at total/N) keeps splits even while accepting the unavoidable
  imbalance from atomic mutual blocks.
- Parallelized with rayon across independent subtrees (deterministic:
  identical result to serial), with live progress output so a slow run is
  never mistaken for a stuck one.

Phase C -- manifest (src/ix/shard.rs):
- `.ixes` per-shard manifest: member blocks, heartbeat/own-size sums, foreign
  (delta-dependency) block sets, delta-based assumption-tree roots
  (merkle_root_canonical), and a what-if summary (balance, total cross-shard
  ingress, atomic-block floor).

FFI + CLI:
- src/ffi/kernel.rs: profile_anon_ixe (runs the anon kernel over a `.ixe` with
  recording, maps constants to home blocks, writes `.ixesp`) and shard_esp
  (partition + manifest); rs_kernel_profile_anon / rs_shard_esp FFI.
- Ix/Cli/{ProfileCmd,ShardCmd}.lean, Ix/KernelCheck.lean externs, Main wiring.

No external graph-library dependency; the partitioner is self-contained.
Kernel changes are gated behind KEnv::profile_sink, so production checking is
unaffected. Validated end-to-end on initstd/lean/mathlib (64/128/256 shards,
0 empty shards).

Follow-up: multilevel coarsening to make mathlib-scale partitions run in
seconds and recover cut quality.

* Sharding: multilevel coarsening for the env partitioner

Replace the flat greedy+FM body of `bisect()` with a multilevel V-cycle
(coarsen → partition the tiny coarsest graph → uncoarsen + refine), so large
environments partition in a fraction of the time *and* at markedly lower
cross-shard ingress. Self-contained to `src/ix/shard.rs`; recursion,
leaf-count allocation, cut-net splitting, rayon parallelism, and the
balance-weight cap are unchanged.

Each bisection now:
- coarsens the sub-hypergraph by heavy-edge matching (merge blocks that
  co-occur in small, heavy delta-nets) under a cluster-weight cap, down to
  ~256 supervertices;
- decides the global cut once on that tiny graph (greedy graph-growing + FM to
  convergence, from several diverse seeds, keeping the lowest balanced cut);
- uncoarsens, projecting the cut down and boundary-refining with FM at each
  level.

Deciding global structure on the small graph and only ever refining locally on
the big ones is what improves both axes at once.

Benchmarks (profiled `.ixe` → partition; balance ±5%; flat baseline →
multilevel; 0 empty shards, deterministic in every case):

  env       n    partition time    cross-shard ingress
  init      64   5.0s → 2.7s         8.77M →   6.52M (-26%)
  init     128   6.0s → 2.7s        12.44M →   9.83M (-21%)
  init     256   6.0s → 3.0s        17.71M →  15.03M (-15%)
  lean      64   8.0s → 3.4s         9.59M →   7.42M (-23%)
  lean     128   8.0s → 3.5s        12.59M →  10.48M (-17%)
  lean     256   7.0s → 3.6s        18.73M →  15.49M (-17%)
  mathlib   64   99s  → 30.8s      185.63M →  93.60M (-50%)
  mathlib  128   101s → 31.4s      233.58M → 131.84M (-44%)
  mathlib  256   104s → 33.8s      294.85M → 184.32M (-37%)

mathlib (631k blocks, 12.4M delta edges) partitions ~3.2x faster with roughly
half the cross-shard ingress; its heartbeat imbalance also drops (1.66x →
1.53x at n=64). The non-empty-shard guarantee and determinism hold throughout.

Implementation notes (what real-env validation required beyond the textbook
scheme):
- Fallback pairing in matching: a vertex with no heavy-edge partner
  (delta-sparse or only in hub nets) is merged with the next unmatched vertex.
  They share no tracked net, so the merge is cut-neutral, but it keeps
  coarsening shrinking ~2x/pass instead of stalling far above the target
  (which had left an expensive initial partition on a ~51k-vertex graph).
- Incremental FM gains: a move updates only the changed net's contribution to
  each co-pin (O(Σ pins of the moved vertex's nets)) instead of recomputing
  neighbour gains from scratch (O(Σ neighbour degrees)). This is the dominant
  speedup: refining the finest level dropped from ~11s to ~0.9s on init.
- Degenerate-split rejection: graph-growing seeded at a light vertex can sweep
  an entire sub onto one side (cut 0) when one atomic block holds more than
  half the balance weight; the initial-partition selector now rejects
  empty-sided candidates and prefers balanced-then-min-cut, so no shard is
  left empty.
- One boundary-refinement pass per uncoarsen level (measured 1.6–1.8x faster
  than two for only ~3–6% more ingress, a clear win given the margin).

Removes the FM-skip heuristic (`FM_SKIP_VERTICES` / `FM_FULL_VERTICES`): full
refinement is now affordable at every size because it only ever runs on
already-good partitions.

Adds unit tests for one-level contraction, cluster-cap matching, large-cluster
separation through the full V-cycle, determinism, and the non-empty guarantee
under a dominant block.

* clippy & fmt
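To make the Phase B objective concrete, here is a minimal sketch of the connectivity-1 (km1) cost the message describes: each net pays its weight once for every shard it touches beyond the first. The `Net` struct, the `km1_cost` name, and the flat `shard_of` array are illustrative assumptions, not the types used in `src/ix/shard.rs`.

```rust
use std::collections::HashSet;

/// One net of the hypergraph. In the commit's model a net corresponds to a
/// block, weighted by its serialized size; `pins` are the vertex ids that
/// touch it. (Hypothetical layout for illustration only.)
pub struct Net {
    pub weight: u64,
    pub pins: Vec<usize>,
}

/// Connectivity-1 (km1) cost of an assignment `shard_of[vertex] = shard`:
/// the sum over nets of weight * (lambda - 1), where lambda is the number
/// of distinct shards the net's pins fall into.
pub fn km1_cost(nets: &[Net], shard_of: &[usize]) -> u64 {
    nets.iter()
        .map(|net| {
            let shards: HashSet<usize> =
                net.pins.iter().map(|&v| shard_of[v]).collect();
            // A net confined to one shard (or with no pins) costs nothing.
            net.weight * (shards.len().saturating_sub(1) as u64)
        })
        .sum()
}
```

For example, a net of weight 10 whose pins land in two shards contributes 10, and in three shards contributes 20, which matches the reading of the objective as total cross-shard ingress bytes.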
1 parent e2794c9 commit 2d41a96

13 files changed

Lines changed: 2701 additions & 0 deletions

Ix/Cli/ProfileCmd.lean

Lines changed: 67 additions & 0 deletions
@@ -0,0 +1,67 @@
+/-
+`ix profile <path.ixe>`: run the Ix kernel out of circuit over a serialized
+`.ixe` environment, recording per-block heartbeats and the delta-unfold graph
+into a `.ixesp` sidecar. This is the cost model consumed by `ix shard`
+(see `plans/sharding.md`).
+
+Recording defaults to *cache-isolated* mode: the kernel clears its
+cross-constant reduction-memo caches between constants so every delta-unfold
+re-executes (sound recording) and recorded heartbeats reflect the in-circuit
+(un-memoized) cost. `--keep-caches` trades fidelity for speed.
+-/
+module
+public import Cli
+public import Ix.Common
+public import Ix.KernelCheck
+public import Std.Internal.UV.System
+
+public section
+
+open Ix.KernelCheck
+
+namespace Ix.Cli.ProfileCmd
+
+def runProfileCmd (p : Cli.Parsed) : IO UInt32 := do
+  let some pathArg := p.positionalArg? "path"
+    | p.printError "error: must specify <path> to a .ixe file"
+      return 1
+  let envPath := pathArg.as! String
+  let outPath : String :=
+    match p.flag? "out" with
+    | some flag => flag.as! String
+    | none => envPath ++ ".ixesp"
+  let isolate := !(p.flag? "keep-caches" |>.isSome)
+  let quiet := !(p.flag? "verbose" |>.isSome)
+
+  if let some flag := p.flag? "workers" then
+    let n := flag.as! Nat
+    if n == 0 then
+      p.printError "error: --workers must be > 0"
+      return 1
+    Std.Internal.UV.System.osSetenv "IX_KERNEL_CHECK_WORKERS" (toString n)
+
+  IO.println s!"Profiling {envPath}{outPath} (isolate={isolate})"
+  let start ← IO.monoMsNow
+  rsProfileAnonFFI envPath outPath isolate quiet
+  let elapsed := (← IO.monoMsNow) - start
+  IO.println s!"[profile] wrote {outPath} in {elapsed.formatMs}"
+  return 0
+
+end Ix.Cli.ProfileCmd
+
+open Ix.Cli.ProfileCmd in
+def profileCmd : Cli.Cmd := `[Cli|
+  "profile" VIA runProfileCmd;
+  "Profile a `.ixe` out of circuit → `.ixesp` (sharding cost + delta graph)"
+
+  FLAGS:
+    out : String; "Output .ixesp path (default: <path>.ixesp)"
+    "keep-caches"; "Keep cross-constant caches: faster, lower-fidelity, may under-record"
+    workers : Nat; "Parallel kernel workers (default: available_parallelism). Plumbs IX_KERNEL_CHECK_WORKERS."
+    verbose; "Log every constant (default: quiet)"
+
+  ARGS:
+    path : String; "Path to a serialized .ixe environment"
+]
+
+end
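
The `.ixesp` sidecar this command writes is described in the commit message as holding a CSR delta graph with explicit little-endian (de)serialization and no serde dependency. The sketch below shows what such hand-rolled encoding can look like. The `Csr` struct, the field widths, and the byte layout (count, then offsets, then count, then targets) are assumptions made for illustration; the actual `.ixesp` format in `src/ix/profile.rs` is not shown in this diff.

```rust
use std::convert::TryInto;

/// Compressed sparse row adjacency: the blocks unfolded by block `i` are
/// `targets[offsets[i] .. offsets[i + 1]]`. (Hypothetical layout.)
#[derive(Debug, Clone, PartialEq)]
pub struct Csr {
    pub offsets: Vec<u64>,
    pub targets: Vec<u32>,
}

/// Encode as: u64 LE offset count, offsets as u64 LE,
/// u64 LE target count, targets as u32 LE.
pub fn encode(g: &Csr) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(&(g.offsets.len() as u64).to_le_bytes());
    for o in &g.offsets {
        out.extend_from_slice(&o.to_le_bytes());
    }
    out.extend_from_slice(&(g.targets.len() as u64).to_le_bytes());
    for t in &g.targets {
        out.extend_from_slice(&t.to_le_bytes());
    }
    out
}

/// Read exactly N bytes at `pos`, advancing it; None if the buffer is short.
fn take<const N: usize>(buf: &[u8], pos: &mut usize) -> Option<[u8; N]> {
    let end = (*pos).checked_add(N)?;
    let bytes: [u8; N] = buf.get(*pos..end)?.try_into().ok()?;
    *pos = end;
    Some(bytes)
}

/// Decode the layout written by `encode`; None on truncated or trailing bytes.
pub fn decode(buf: &[u8]) -> Option<Csr> {
    let mut pos = 0;
    let n = u64::from_le_bytes(take::<8>(buf, &mut pos)?) as usize;
    let mut offsets = Vec::new();
    for _ in 0..n {
        offsets.push(u64::from_le_bytes(take::<8>(buf, &mut pos)?));
    }
    let m = u64::from_le_bytes(take::<8>(buf, &mut pos)?) as usize;
    let mut targets = Vec::new();
    for _ in 0..m {
        targets.push(u32::from_le_bytes(take::<4>(buf, &mut pos)?));
    }
    if pos == buf.len() {
        Some(Csr { offsets, targets })
    } else {
        None
    }
}
```

Fixing the byte order in the format keeps the file portable across hosts, and the decoder grows its vectors as it reads rather than trusting the length prefix for allocation. A production reader would likely also check that the offsets are non-decreasing and end at the target count.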

Ix/Cli/ShardCmd.lean

Lines changed: 62 additions & 0 deletions
@@ -0,0 +1,62 @@
+/-
+`ix shard <path.ixesp> --shards N`: partition a profiled environment into `N`
+shards via recursive balanced min-cut bisection, minimizing cross-shard
+delta-unfold ingress (see `plans/sharding.md`).
+
+Reads the `.ixesp` produced by `ix profile` (pure offline graph work, so `N`
+is cheap to re-tune without re-running the kernel). Writes a `.ixes` manifest
+and prints a what-if report (per-shard heartbeat balance + total cross-shard
+ingress). The partitioner is self-contained — no external graph-library
+dependency.
+-/
+module
+public import Cli
+public import Ix.KernelCheck
+
+public section
+
+open Ix.KernelCheck
+
+namespace Ix.Cli.ShardCmd
+
+def runShardCmd (p : Cli.Parsed) : IO UInt32 := do
+  let some pathArg := p.positionalArg? "path"
+    | p.printError "error: must specify <path> to a .ixesp file"
+      return 1
+  let espPath := pathArg.as! String
+  let numShards : Nat :=
+    match p.flag? "shards" with
+    | some flag => flag.as! Nat
+    | none => 1
+  let balancePct : Nat :=
+    match p.flag? "balance" with
+    | some flag => flag.as! Nat
+    | none => 5
+  let outPath : String :=
+    match p.flag? "out" with
+    | some flag => flag.as! String
+    | none => espPath ++ ".ixes"
+
+  IO.println s!"Sharding {espPath} into {numShards} shards (balance ±{balancePct}%)"
+  rsShardEspFFI espPath (toString numShards) (toString balancePct) outPath
+  if !outPath.isEmpty then
+    IO.println s!"[shard] wrote {outPath}"
+  return 0
+
+end Ix.Cli.ShardCmd
+
+open Ix.Cli.ShardCmd in
+def shardCmd : Cli.Cmd := `[Cli|
+  "shard" VIA runShardCmd;
+  "Partition a `.ixesp` into N balanced shards minimizing cross-shard delta"
+
+  FLAGS:
+    shards : Nat; "Number of shards N (default 1 = a single shard)"
+    balance : Nat; "Per-bisection balance tolerance, percent (default 5)"
+    out : String; "Output .ixes manifest path (default: <path>.ixes)"
+
+  ARGS:
+    path : String; "Path to a .ixesp produced by `ix profile`"
+]
+
+end
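
Recursive bisection into an arbitrary `N` needs a rule for splitting the shard budget between the two sides of each cut. The commit message describes this as leaf-count allocation by heartbeat mass, clamped to [1, side size]. A minimal sketch of one such rule follows. The function name `split_leaves`, its signature, and the extra clamp that keeps the opposite side within its own bounds are assumptions, not the code in `src/ix/shard.rs`.

```rust
/// Split a budget of `n` leaf shards between the two sides of a bisection.
/// `mass` is the heartbeat total per side, `size` the number of blocks per
/// side. Hypothetical helper for illustration.
pub fn split_leaves(n: usize, mass: (u64, u64), size: (usize, usize)) -> (usize, usize) {
    assert!(n >= 2 && size.0 >= 1 && size.1 >= 1 && size.0 + size.1 >= n);
    let total = (mass.0 as u128) + (mass.1 as u128);
    // Left side's proportional share of the budget, rounded to nearest.
    let ideal = if total == 0 {
        n / 2
    } else {
        (((n as u128) * (mass.0 as u128) + total / 2) / total) as usize
    };
    // Each side gets at least one leaf and at most one leaf per block.
    let lo = std::cmp::max(1, n.saturating_sub(size.1));
    let hi = std::cmp::min(size.0, n - 1);
    let left = ideal.clamp(lo, hi);
    (left, n - left)
}
```

With masses (100, 50) and a budget of 3, this yields (2, 1), so `N` need not be a power of two. A side holding a single heavy block, for example masses (1000, 1) and sizes (1, 5) with a budget of 4, is clamped to one leaf and the rest go to the other side, which is how every shard can stay non-empty when there are at least `N` blocks.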

Ix/KernelCheck.lean

Lines changed: 24 additions & 0 deletions
@@ -142,6 +142,30 @@ opaque rsCheckAnonFFI :
   @& String → -- fail-out path ("" = none)
   IO (Array (String × Option CheckError))
 
+/-- FFI: profile a `.ixe` out of circuit, writing a `.ixesp` sidecar with
+per-block heartbeats + the delta-unfold graph (the sharding cost model,
+see `plans/sharding.md`). Runs the anon kernel over every checkable target.
+`isolate` clears the kernel's reduction-memo caches between constants for
+sound/faithful recording; `quiet` suppresses per-constant progress. -/
+@[extern "rs_kernel_profile_anon"]
+opaque rsProfileAnonFFI :
+  @& String → -- .ixe path
+  @& String → -- .ixesp output path
+  @& Bool → -- isolate caches
+  @& Bool → -- quiet
+  IO Unit
+
+/-- FFI: partition a `.ixesp` into `numShards` shards, writing a `.ixes`
+manifest. `numShards` and `balancePct` are decimal strings (kept ABI-simple).
+Empty `outPath` skips the manifest. Prints a what-if report to stderr. -/
+@[extern "rs_shard_esp"]
+opaque rsShardEspFFI :
+  @& String → -- .ixesp path
+  @& String → -- num_shards (N)
+  @& String → -- balance percent
+  @& String → -- .ixes output path ("" = skip)
+  IO Unit
+
 end Ix.KernelCheck
 
 end
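
Because `rsShardEspFFI` passes `numShards` and `balancePct` as decimal strings, the receiving side has to parse and validate them before partitioning. The sketch below shows one way that step could look. `ShardArgs`, `parse_shard_args`, and the decision to reject a shard count of zero are assumptions for illustration; the real handling inside `rs_shard_esp` is not part of this diff.

```rust
/// Parsed form of the two numeric arguments that cross the FFI as strings.
/// (Hypothetical type, not taken from `src/ffi/kernel.rs`.)
#[derive(Debug, PartialEq)]
pub struct ShardArgs {
    pub num_shards: usize,
    pub balance_pct: u32,
}

/// Parse the decimal strings, returning a readable error instead of panicking.
pub fn parse_shard_args(num_shards: &str, balance_pct: &str) -> Result<ShardArgs, String> {
    let n: usize = num_shards
        .trim()
        .parse()
        .map_err(|e| format!("num_shards: {}", e))?;
    // Assumption: zero shards is meaningless, so reject it at the boundary.
    if n == 0 {
        return Err("num_shards must be > 0".to_string());
    }
    let b: u32 = balance_pct
        .trim()
        .parse()
        .map_err(|e| format!("balance_pct: {}", e))?;
    Ok(ShardArgs { num_shards: n, balance_pct: b })
}
```

Validating here matters because `runShardCmd`, as shown in this diff, forwards `--shards` without a zero check (unlike the `--workers` check in `runProfileCmd`), so a bad value would otherwise reach the partitioner unfiltered.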

Main.lean

Lines changed: 4 additions & 0 deletions
@@ -5,7 +5,9 @@ import Ix.Cli.CheckRsCmd
 import Ix.Cli.ClaimCmd
 import Ix.Cli.CompileCmd
 import Ix.Cli.IngressCmd
+import Ix.Cli.ProfileCmd
 import Ix.Cli.ProveCmd
+import Ix.Cli.ShardCmd
 import Ix.Cli.TreeCmd
 import Ix.Cli.ValidateCmd
 import Ix.Cli.VerifyCmd
@@ -27,7 +29,9 @@ def ixCmd : Cli.Cmd := `[Cli|
     checkRsCmd;
     claimCmd;
     treeCmd;
+    profileCmd;
     proveCmd;
+    shardCmd;
     verifyCmd;
     addrOfCmd;
     ingressCmd;
