feat: bytecode VM, CEK optimizations, version-gated FLAT decoding, and conformance fixes#47
Open
jonathanlim222 wants to merge 148 commits into
Open
feat: bytecode VM, CEK optimizations, version-gated FLAT decoding, and conformance fixes#47jonathanlim222 wants to merge 148 commits into
jonathanlim222 wants to merge 148 commits into
Conversation
…ormance tests - Add `LedgerValue` type with insert, lookup, union, contains, scale, valueData, unValueData - 9 new builtins: `insertCoin`, `lookupCoin`, `unionValue`, `valueContains`, `valueData`, `unValueData`, `scaleValue`, `bls12_381_G1_multiScalarMul`, `bls12_381_G2_multiScalarMul` - Case on constants (bool, unit, integer, list, pair) - Flat encode/decode for Value (tag 13) - Proper `ValueError` error types - Fix `listToArray` to one-argument cost model - Add `V` type param to `Machine` to avoid repeating `V: Eval<'a>` on every method Signed-off-by: rvcas <x@rvcas.dev>
Signed-off-by: rvcas <x@rvcas.dev>
Signed-off-by: KtorZ <matthias.benkort@gmail.com>
The MachineState enum was being allocated in the arena on every CEK step transition, despite being ephemeral (created, matched once, then replaced). This wasted bump allocator space and added unnecessary indirection through arena-allocated references. Changed MachineState to be returned by value (stack-local) from compute() and return_compute(), eliminating ~14-19% overhead on benchmark scripts. Before: auction_1-1 268µs, auction_1-2 765µs, auction_1-3 778µs After: auction_1-1 228µs, auction_1-2 634µs, auction_1-3 631µs
When encountering Force(Delay(body)), short-circuit directly to computing body instead of: allocating FrameForce → computing Delay → allocating VDelay → returning to FrameForce → force_evaluate → compute body. This eliminates 2 arena allocations and 2 state transitions per Force(Delay) occurrence, which is very common in UPLC programs (used for all polymorphic builtin instantiations). ~3-4% improvement on benchmark scripts.
When encountering Apply(Lambda(body), arg), skip the normal path of: 1. Push FrameAwaitFunTerm, compute Lambda 2. Create VLambda value, return 3. Push FrameAwaitArg, compute arg 4. apply_evaluate → extend env, compute body Instead, detect the pattern at the Apply step and use a new FrameAwaitArgForLambda frame that directly extends the environment when the argument value is ready. This eliminates VLambda allocation, FrameAwaitFunTerm allocation, and two state transitions. Apply(Lambda(...), arg) is the most common pattern in UPLC (every let-binding desugars to this), so this optimization has broad impact. ~9-13% improvement on benchmark scripts (cumulative with prior opts). Before: auction_1-1 228µs, auction_1-2 634µs, auction_1-3 626µs After: auction_1-1 198µs, auction_1-2 564µs, auction_1-3 562µs
Builtin functions have a max arity of 6, so using a heap-allocated BumpVec for accumulating args was unnecessary. Each partial application (push/force) was cloning the BumpVec, allocating new backing storage in the arena. Now uses a fixed [Option<&Value>; 6] array that's trivially Copy, eliminating clone overhead and bump allocator pressure for partial builtin applications. The Runtime struct is now Copy, making force() and push() simple struct copies with no heap interaction.
Callgrind profiling showed Env::lookup consuming 15% of total execution time. The linked-list environment required O(n) pointer-chasing for de Bruijn index n, which is cache-unfriendly. New approach uses a spine-based array: each Env node holds up to 8 values inline in a fixed-size array, with a pointer to a parent spine for overflow. Since most de Bruijn indices are small (1-4), the vast majority of lookups now resolve via a single array index into the current spine — O(1) with no pointer chasing. Push is still O(1): if the spine has room, copy the small array and append; if full, start a new spine with the current env as parent. ~20-23% improvement on benchmark scripts. Cumulative ~41% faster than original baseline. Before: auction_1-1 198µs, auction_1-2 564µs After: auction_1-1 158µs, auction_1-2 440µs
Add 31 tests covering FLAT encode/decode round-trips for all term types (constants, lambdas, builtins, constr/case, force/delay, error), including string/bytestring edge cases that weren't previously tested. The existing 818 conformance tests use text-based parsing and don't exercise the FLAT binary decoder at all — a deliberate bug in the byte-reading code passes all conformance tests but is caught by these new tests. Also includes tests that decode+re-encode all benchmark scripts, and a standalone profiling binary for callgrind/perf analysis.
Two major optimizations to the FLAT binary decoder: 1. 64-bit accumulator: Replace byte-at-a-time bit reading with a 64-bit accumulator that pre-fetches up to 56 bits. Eliminates per-bit bounds checks and byte-crossing logic. Position tracked via single bit_pos field; accumulator drained for byte-aligned reads. 2. Integer fast path: big_word()/integer() now decode into u64 first (covers ~99% of integers), only falling back to BigInt for values exceeding 63 bits. Avoids heap allocation for every integer literal. Decode-only benchmarks show 9-11% improvement on representative scripts. Since decode is ~20% of total benchmark time, this contributes ~2% end-to-end improvement. Also adds a decode-only benchmark (flat_decode) for measuring decode performance in isolation.
Profiling showed Env::push at 3.5% of total time, dominated by copying the spine array on each push. Reducing SPINE_SIZE from 8 to 4 halves the copy cost while still covering the most common de Bruijn indices (1-4) in a single array access. Benchmarked sizes 2, 4, 6, 8, 16 across multiple runs. Size 4 gives the best tradeoff: ~5% faster than 8 on auction scripts, with only marginally more parent-spine traversals. Before (spine=8): auction_1-1 ~158µs, auction_1-2 ~440µs After (spine=4): auction_1-1 ~149µs, auction_1-2 ~424µs
Polymorphic builtins like ifThenElse, fstPair, sndPair require 1-2 forces before application. Previously each force went through: compute Force → push FrameForce → compute Builtin → create VBuiltin → return to FrameForce → force_evaluate → create forced Runtime Now detected at compile time and short-circuited to create the forced Runtime directly, skipping frame allocations and state transitions. ~3-4% improvement on scripts heavy in polymorphic builtins.
Link-Time Optimization (lto=true) and single codegen unit (codegen-units=1) give the compiler full visibility across all crate boundaries for inlining, devirtualization, and dead code elimination. ~20-28% improvement across benchmark scripts. This is a pure compiler optimization with zero code changes. Before: auction_1-1 ~165µs, auction_1-2 ~419µs After: auction_1-1 ~118µs, auction_1-2 ~325µs
Two changes: 1. Enable LTO (lto=true) and single codegen unit (codegen-units=1) for release and bench profiles. Gives the compiler full cross-crate visibility for inlining and dead code elimination. ~20-25% improvement. 2. Replace arena-allocated Context linked list with a pre-allocated Vec<Frame> stack. Eliminates per-frame arena allocation, improves cache locality (contiguous memory), and simplifies MachineState (no longer carries Context reference, reducing enum size). Combined effect: uplc-turbo now ranks 3rd across all VM implementations, ahead of Scalus CEK, Plutuz, and both Chrysalis variants. Geo mean: 266µs → 201µs (24.6% faster on this benchmark run).
Add bytecode module with: - Opcode definitions (10 core + 4 superinstructions + 4 specialized constants) - Compiler that translates Term AST to flat bytecode with backpatching - 11 compiler tests verifying correct opcode generation for all patterns Superinstructions detected at compile time: - ForceDelay: Force(Delay(body)) → single opcode, body inline - ApplyLambda: Apply(Lambda(body), arg) → single opcode - ForceBuiltin: Force(Builtin(f)) → single opcode - Force2Builtin: Force(Force(Builtin(f))) → single opcode Specialized constants for Unit, true, false, small integers avoid constant pool lookup. VM execution loop is stubbed — needs Value type extension for bytecode closures (LambdaBC/DelayBC variants) before full implementation.
… tests Complete bytecode VM implementation: Compiler (compiler.rs): - Compiles Term AST to flat bytecode with backpatching - Detects and emits superinstructions for common patterns - Interns constants into a side-table pool - 11 compiler unit tests VM (vm.rs): - Tight dispatch loop over u8 opcodes - LambdaBC/DelayBC value variants for bytecode closures (body_ip + env) - Delegates builtin execution to existing Machine::call - Full CEK semantics: env, continuation stack, budget tracking Passing integration tests: - bc_eval_integer, bc_eval_unit, bc_eval_true - bc_eval_identity (lambda + apply) - bc_eval_force_delay - bc_eval_add_integer (builtin) - bc_eval_nested_apply (deep lambda nesting) - bc_eval_if_then_else (polymorphic builtin with force/delay)
Key fixes: - LambdaBC/DelayBC values now store original AST terms for discharge, enabling correct term reconstruction for output/error reporting - Compiler records lambda_info/delay_info mappings from body_ip → AST - Constr uses u8 tag (common case) with ConstrBig (u64) for large tags, keeping the hot path cache-friendly via separate opcodes - ForceBuiltin/Force2Builtin superinstructions now only emitted when the builtin actually requires forcing (fixes argExpected test) Test results: 1703 total passing - 818 conformance tests via bytecode VM (NEW - all passing) - 818 conformance tests via AST interpreter - 36 unit tests (compiler + VM integration) - 31 FLAT round-trip tests
- Replace HashMap lookups for lambda_info/delay_info with Vec indexing via compile-time u16 IDs embedded in the bytecode - Benchmark pre-compiles bytecode once, only measures execution (AOT) - ConstrBig opcode for large tags (u64), Constr stays u8 for common case - 818/818 conformance tests still passing
Replace guarded match patterns (op if op == Op::Xxx as u8 =>) with direct numeric literals (0x01 =>) in the bytecode dispatch loop. This allows the compiler to generate a jump table instead of chained comparisons. Results (AOT, pre-compiled): - auction_1-1: 120µs (matching AST interpreter's 118µs) - auction_1-2: 369µs (vs AST's 325µs — 14% gap) The bytecode VM shows dramatically better cache behavior (60% fewer L1 misses) and branch prediction (33% fewer misses) compared to AST, but executes ~19% more instructions due to opcode decoding overhead.
The use_cases benchmark now supports UPLC_BENCH_MODE env var: - "ast" (default): standard AST interpreter (decode + eval per iteration) - "bytecode": AOT bytecode VM (compile once, execute per iteration) Docker runner script passes the env var through to the benchmark binary.
Two major optimizations to the bytecode VM: 1. LambdaBC/DelayBC values now store only (body_ip, id, env) instead of (body_ip, env, parameter, body). The AST refs for discharge are looked up from CompiledProgram tables only when needed (final output). This cuts the value size significantly, reducing arena allocation pressure. 2. Frame::Constr and Frame::Cases no longer heap-allocate Vec<u32> for field/branch offsets. Instead they store (offsets_start, count) and read offsets directly from the bytecode array on demand. Bytecode VM now 25-44% faster than the AST interpreter: auction_1-1: 118µs → 82µs (1.44x) auction_1-2: 325µs → 259µs (1.25x) auction_1-3: 332µs → 257µs (1.29x) All 1703 tests passing (818 bytecode conformance + 818 AST + 36 unit + 31 FLAT).
Use slice try_into for read_u32/read_u16 instead of individual byte accesses. This lets the compiler elide per-byte bounds checks and generate a single aligned load instruction.
When Apply's function is a Var (variable lookup), the function evaluation is instant — just an array index into the environment. The ApplyVar superinstruction skips FrameAwaitFunTerm entirely, directly pushing FrameAwaitArg with the looked-up value. Captures ~19% of Apply opcodes (444 out of 2344 in auction_1-2). Saves 1 frame push + 1 frame pop + 1 Phase transition per occurrence. 818/818 conformance tests passing.
Force(Var(idx)) is extremely common (56% of all Force opcodes). ForceVar looks up the variable and forces it directly, skipping the FrameForce push/pop cycle entirely. Reduces total opcode count by ~8% (8493 → 7807 on auction_1-2). 818/818 conformance tests passing.
Instead of allocating a Value::Con wrapper in the arena for every Const opcode execution, pre-wrap all constant pool entries into Values once at the start of execute(). The Const opcode now does a simple array index instead of an arena allocation. ~10% improvement on auction scripts (120→107µs on auction_1-1). 818/818 conformance tests passing.
Specialized constant opcodes (ConstUnit, ConstTrue, ConstFalse) now return pre-allocated values instead of arena-allocating on each use. These values are created once per execute() call and shared across all occurrences.
All 92 builtin values are pre-allocated once per execute() call. The Builtin opcode now does a simple array lookup instead of arena-allocating a Runtime struct + Value wrapper per occurrence.
- Exclude crates/uplc-fuzz from workspace (broken dep after rebase) - Add bench pre-check to skip scripts that fail decode/eval - Add utility examples: eval_flat, eval_flat_bc, check_max_var, bc_dump, bench_loop (for profiling) Signed-off-by: Jonathan Lim <jonathan.lim.222@gmail.com>
Signed-off-by: Jonathan Lim <jonathan.lim.222@gmail.com>
Adds a fuzzing harness that generates random UPLC programs and tests them for correctness against expected evaluation behavior. Signed-off-by: Jonathan Lim <jonathan.lim.222@gmail.com>
Tests that dropList with a huge integer argument fails with an evaluation failure rather than silently overflowing the cost calculation. This test currently fails due to integer overflow in cost arithmetic. Signed-off-by: Jonathan Lim <jonathan.lim.222@gmail.com>
Prevents integer overflow panics when cost model inputs are very large (e.g. dropList with a huge index). All arithmetic in costing functions, budget subtraction, and ExBudget::Sub now uses saturating operations so overflow produces clamped values instead of panics, letting the out-of-budget check fire normally. Signed-off-by: Jonathan Lim <jonathan.lim.222@gmail.com>
Tests for case expressions on list constants: - case-list-empty-valid: empty list with 2 branches takes nil branch - case-list-nonempty-valid: non-empty list takes cons branch with head/tail - case-list-empty-too-many-branches: empty list with 4 branches errors These tests currently fail due to incorrect branch count validation in the case-on-constants implementation. Signed-off-by: Jonathan Lim <jonathan.lim.222@gmail.com>
Boolean: False accepts 1 or 2 branches, True requires exactly 2. List: Cons accepts 1 or 2 branches, Nil requires exactly 2. Any other branch count is an error. Previously the code was too permissive, allowing arbitrary branch counts for lists. Signed-off-by: Jonathan Lim <jonathan.lim.222@gmail.com>
Fuzzer-generated test exercising Case and Constr term discharge. This test currently fails because discharge does not handle Case and Constr terms. Signed-off-by: Jonathan Lim <jonathan.lim.222@gmail.com>
Discharge was missing Case and Constr variants, causing incorrect results when these terms appeared inside closures that needed environment substitution. Also adds Arena::alloc_slice helper used by the discharge implementation, and fixes trailing newline in fuzz_0017 budget expectation. Signed-off-by: Jonathan Lim <jonathan.lim.222@gmail.com>
Implement proper flat serialization for BLS12-381 G1/G2 elements (using compressed point format) and Data constants (via CBOR). Previously these variants returned errors or were unimplemented. Signed-off-by: Jonathan Lim <jonathan.lim.222@gmail.com>
XorByteString was incorrectly charged using OrByteString's cost model, causing budget divergence from the Haskell reference implementation. Signed-off-by: Jonathan Lim <jonathan.lim.222@gmail.com>
Integers beyond i128 range caused a panic instead of falling through to the bounds check and returning an evaluation error, as Haskell does. Signed-off-by: Jonathan Lim <jonathan.lim.222@gmail.com>
…rences These three divergences caused budget mismatches vs the Haskell reference. Haskell uses element count for lists, maxBound sentinel for pairs. Signed-off-by: Jonathan Lim <jonathan.lim.222@gmail.com>
The bytecode VM accepted branch counts that the CEK machine would reject, causing divergences on edge-case programs found by fuzzing. Signed-off-by: Jonathan Lim <jonathan.lim.222@gmail.com>
The previous code used with_env which didn't have access to bytecode-specific lambda/delay info tables, causing incorrect term reconstruction for bytecode closures. Signed-off-by: Jonathan Lim <jonathan.lim.222@gmail.com>
The pretty-printer emits forms like ((B #..), (I 1)) inside maps, which the parser couldn't previously round-trip. Signed-off-by: Jonathan Lim <jonathan.lim.222@gmail.com>
Recursive decode_term() overflows the default ~2MB test thread stack on deeply nested benchmark scripts. Signed-off-by: Jonathan Lim <jonathan.lim.222@gmail.com>
Adds conformance tests for edge cases discovered during the Haskell-vs-Rust divergence audit. These guard against regressions in consByteString bounds-checking, expModInteger with negative exponents, and case-on-constants branch validation. Signed-off-by: Jonathan Lim <jonathan.lim.222@gmail.com>
Signed-off-by: Jonathan Lim <jonathan.lim.222@gmail.com>
Signed-off-by: Jonathan Lim <jonathan.lim.222@gmail.com>
Extract parse_plutus_version() to deduplicate the version string parsing that was duplicated between main() and replay(). Replace the debug-specific summarize_top() function, which was hardcoded to Case/Constant patterns left over from investigating a specific divergence, with a generic term_variant_name() that returns the top-level variant name for any term. Import Outcome at the top of replay() to remove six fully-qualified path references. Change &PathBuf parameters to &Path for idiomatic Rust. Exit with code 1 when replay detects a divergence so scripts and CI can detect failures programmatically. Signed-off-by: Jonathan Lim <jonathan.lim.222@gmail.com>
Signed-off-by: Jonathan Lim <jonathan.lim.222@gmail.com>
Signed-off-by: Jonathan Lim <jonathan.lim.222@gmail.com>
Signed-off-by: Jonathan Lim <jonathan.lim.222@gmail.com>
Signed-off-by: Jonathan Lim <jonathan.lim.222@gmail.com>
81b5a24 to
f818fb1
Compare
Signed-off-by: Jonathan Lim <jonathan.lim.222@gmail.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This branch adds an AOT bytecode compiler and VM for UPLC, applies extensive CEK machine optimizations, introduces version-gated FLAT decoding, fixes multiple conformance bugs found via differential fuzzing, and adds a fuzzing harness.
Bytecode VM (crates/uplc/src/bytecode/)
CEK machine performance
FLAT codec
Conformance fixes (found via uplc-fuzz)
Differential fuzzing (crates/uplc-fuzz/)
New conformance tests