feat: bytecode VM, CEK optimizations, version-gated FLAT decoding, and conformance fixes #47

Open
jonathanlim222 wants to merge 148 commits into pragma-org:main from jonathanlim222:pi/bytecode

Contributor

@jonathanlim222 jonathanlim222 commented Apr 30, 2026

This branch adds an AOT bytecode compiler and VM for UPLC, applies extensive CEK machine optimizations, introduces version-gated FLAT decoding, fixes multiple conformance bugs found via differential fuzzing, and adds a fuzzing harness.

Bytecode VM (crates/uplc/src/bytecode/)

  • AOT compiler translates UPLC AST → compact bytecode with a constant pool, lambda/delay info tables, and superinstructions (ForceDelay, ApplyLambda, ForceBuiltin, Force2Builtin, ApplyVar, ForceVar, Apply2, Apply3, ConstUnit, ConstTrue, ConstFalse, ConstSmallInt)
  • Register-style dispatch loop with jump-table dispatch, step-countdown budget checking, and inner return-drain loop
  • Pre-creates builtin Runtime/Value objects and pre-wraps constant pool entries at execution start
  • Passes all 1,012+ conformance tests (AST and bytecode)
  • Includes conformance_bytecode.rs test harness, UPLC_BENCH_MODE=bytecode toggle for benchmarks, and profiling examples (bc_dump, bc_stats, bench_loop, eval_flat_bc, profile_bc)

CEK machine performance

  • Environment: replaced linked-list Env with spine-based array (configurable spine size = 4), replaced BumpVec with fixed-size array for builtin args
  • Fast paths: Force(Delay(body)), Apply(Lambda(body), arg), Force(Builtin), Force(Force(Builtin))
  • Frame size: boxed ConstrFrame to shrink Frame enum
  • Allocation: stack-local MachineState, pre-allocated Unit/True/False values
  • Context: replaced Context linked list with Vec stack
  • Build: LTO + single codegen unit for release/bench profiles

FLAT codec

  • Rewrote decoder with 64-bit accumulator and integer fast path (avoids per-bit bounds checks)
  • Added FLAT encoding for BLS12-381 elements and Data constants
  • Added round-trip test suite (flat_roundtrip.rs)
  • Version-gated decoding: decode(arena, bytes, plutus_version, protocol_version) rejects CONSTR/CASE terms, VALUE constant type, and unavailable builtins for pre-1.1.0 programs; decode_ungated(arena, bytes) preserves old permissive behavior

Conformance fixes (found via uplc-fuzz)

  • Case expressions on boolean values in V3
  • Case-on-constants branch validation matching Haskell semantics
  • Case-on-list branch validation in bytecode VM
  • XorByteString cost function (was using OrByteString's)
  • IndexByteString panic on huge indices (> i128 range)
  • Cost model parity: list memory = element count, pair memory = i64::MAX sentinel, saturating arithmetic in budget spending
  • CBOR encoding/decoding for negative big ints
  • Data parser accepting redundant parentheses from the pretty-printer
  • discharge handling of Case and Constr terms
  • Bytecode VM discharge using with_env_bc for Lambda/Delay closures

Differential fuzzing (crates/uplc-fuzz/)

  • New crate for differential fuzzing between Rust and Haskell UPLC evaluators
  • Builtin-aware program generation, mutation strategies, seed corpus
  • TUI dashboard, replay mode, debug flag
  • Regression tests in divergence_tests.rs documenting all found bugs

New conformance tests

  • case-on-boolean, case-on-list, constant-case (bool, integer, list, pair, unit), dropList-overflow, fuzz_0017, large_var

jonathanlim222 and others added 30 commits January 22, 2026 10:05
…ormance tests

- Add `LedgerValue` type with insert, lookup, union, contains, scale, valueData, unValueData
- 9 new builtins: `insertCoin`, `lookupCoin`, `unionValue`, `valueContains`, `valueData`, `unValueData`, `scaleValue`, `bls12_381_G1_multiScalarMul`, `bls12_381_G2_multiScalarMul`
- Case on constants (bool, unit, integer, list, pair)
- Flat encode/decode for Value (tag 13)
- Proper `ValueError` error types
- Fix `listToArray` to one-argument cost model
- Add `V` type param to `Machine` to avoid repeating `V: Eval<'a>` on every method

Signed-off-by: rvcas <x@rvcas.dev>
Signed-off-by: KtorZ <matthias.benkort@gmail.com>
The MachineState enum was being allocated in the arena on every CEK step
transition, despite being ephemeral (created, matched once, then
replaced). This wasted bump allocator space and added unnecessary
indirection through arena-allocated references.

Changed MachineState to be returned by value (stack-local) from compute()
and return_compute(), eliminating ~14-19% overhead on benchmark scripts.

Before: auction_1-1 268µs, auction_1-2 765µs, auction_1-3 778µs
After:  auction_1-1 228µs, auction_1-2 634µs, auction_1-3 631µs

When encountering Force(Delay(body)), short-circuit directly to
computing body instead of: allocating FrameForce → computing Delay →
allocating VDelay → returning to FrameForce → force_evaluate → compute
body. This eliminates 2 arena allocations and 2 state transitions per
Force(Delay) occurrence, which is very common in UPLC programs
(used for all polymorphic builtin instantiations).

~3-4% improvement on benchmark scripts.

When encountering Apply(Lambda(body), arg), skip the normal path of:
  1. Push FrameAwaitFunTerm, compute Lambda
  2. Create VLambda value, return
  3. Push FrameAwaitArg, compute arg
  4. apply_evaluate → extend env, compute body

Instead, detect the pattern at the Apply step and use a new
FrameAwaitArgForLambda frame that directly extends the environment
when the argument value is ready. This eliminates VLambda allocation,
FrameAwaitFunTerm allocation, and two state transitions.

Apply(Lambda(...), arg) is the most common pattern in UPLC (every
let-binding desugars to this), so this optimization has broad impact.

~9-13% improvement on benchmark scripts (cumulative with prior opts).
Before: auction_1-1 228µs, auction_1-2 634µs, auction_1-3 626µs
After:  auction_1-1 198µs, auction_1-2 564µs, auction_1-3 562µs

Builtin functions have a max arity of 6, so using a heap-allocated
BumpVec for accumulating args was unnecessary. Each partial application
(push/force) was cloning the BumpVec, allocating new backing storage
in the arena.

Now uses a fixed [Option<&Value>; 6] array that's trivially Copy,
eliminating clone overhead and bump allocator pressure for partial
builtin applications. The Runtime struct is now Copy, making force()
and push() simple struct copies with no heap interaction.
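
A minimal sketch of the fixed-size argument array described above. Only the `[Option<&Value>; 6]` shape and the `Copy` property come from the commit message; `Value`, the field names, and the method bodies are placeholders invented for illustration and will differ from the real `Runtime`.

```rust
/// Stand-in for the machine's value type (assumption; the real type is richer).
#[derive(Debug, PartialEq)]
pub enum Value {
    Int(i64),
    Unit,
}

/// Maximum builtin arity, as stated in the commit message.
pub const MAX_ARGS: usize = 6;

/// A partially applied builtin. Every field is `Copy`, so the struct is too.
#[derive(Clone, Copy)]
pub struct Runtime<'a> {
    args: [Option<&'a Value>; MAX_ARGS],
    len: u8,
    forces: u8,
}

impl<'a> Runtime<'a> {
    pub fn new() -> Self {
        Runtime { args: [None; MAX_ARGS], len: 0, forces: 0 }
    }

    /// Returns a copy with `arg` appended. The caller's original stays valid
    /// because `self` is passed by value and the type is `Copy`.
    pub fn push(mut self, arg: &'a Value) -> Self {
        assert!((self.len as usize) < MAX_ARGS, "too many builtin arguments");
        self.args[self.len as usize] = Some(arg);
        self.len += 1;
        self
    }

    /// Returns a copy with one more recorded force.
    pub fn force(mut self) -> Self {
        self.forces += 1;
        self
    }

    pub fn arg(&self, index: usize) -> Option<&'a Value> {
        self.args.get(index).copied().flatten()
    }

    pub fn len(&self) -> usize {
        self.len as usize
    }

    pub fn forces(&self) -> u8 {
        self.forces
    }
}
```

Because `push` and `force` take `self` by value, each partial application is a plain struct copy and earlier partial applications remain usable, which is the behavior the old BumpVec clone was providing at higher cost.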
Callgrind profiling showed Env::lookup consuming 15% of total execution
time. The linked-list environment required O(n) pointer-chasing for
de Bruijn index n, which is cache-unfriendly.

New approach uses a spine-based array: each Env node holds up to 8
values inline in a fixed-size array, with a pointer to a parent spine
for overflow. Since most de Bruijn indices are small (1-4), the vast
majority of lookups now resolve via a single array index into the
current spine — O(1) with no pointer chasing.

Push is still O(1): if the spine has room, copy the small array and
append; if full, start a new spine with the current env as parent.

~20-23% improvement on benchmark scripts. Cumulative ~41% faster than
original baseline.

Before: auction_1-1 198µs, auction_1-2 564µs
After:  auction_1-1 158µs, auction_1-2 440µs
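
The spine layout can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: `Rc` stands in for arena references, the type and method names are invented, and de Bruijn indices are taken to be 1-based with 1 naming the most recent binding. The spine width of 8 matches this commit; a later commit in this PR lowers it to 4.

```rust
use std::rc::Rc;

/// Spine width used by this commit.
pub const SPINE_SIZE: usize = 8;

/// One spine: up to SPINE_SIZE values stored inline, plus a link to the
/// spine holding older bindings.
#[derive(Clone)]
pub struct Env<V: Copy> {
    slots: [Option<V>; SPINE_SIZE],
    len: usize,
    parent: Option<Rc<Env<V>>>,
}

impl<V: Copy> Env<V> {
    pub fn empty() -> Self {
        Env { slots: [None; SPINE_SIZE], len: 0, parent: None }
    }

    /// O(1) push: copy the small array and append if there is room,
    /// otherwise start a fresh spine whose parent is the current env.
    pub fn push(env: &Rc<Self>, value: V) -> Rc<Self> {
        if env.len < SPINE_SIZE {
            let mut next = (**env).clone();
            next.slots[next.len] = Some(value);
            next.len += 1;
            Rc::new(next)
        } else {
            let mut slots: [Option<V>; SPINE_SIZE] = [None; SPINE_SIZE];
            slots[0] = Some(value);
            Rc::new(Env { slots, len: 1, parent: Some(Rc::clone(env)) })
        }
    }

    /// Looks up a 1-based de Bruijn index. Small indices resolve in the
    /// current spine; larger ones walk one parent link per full spine.
    pub fn lookup(&self, index: usize) -> Option<V> {
        if index == 0 {
            return None;
        }
        let mut env = self;
        let mut remaining = index;
        loop {
            if remaining <= env.len {
                return env.slots[env.len - remaining];
            }
            remaining -= env.len;
            env = env.parent.as_deref()?;
        }
    }
}
```

Pushing never mutates an existing node, so closures that captured an older environment keep seeing the bindings they were created with.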
Add 31 tests covering FLAT encode/decode round-trips for all term
types (constants, lambdas, builtins, constr/case, force/delay, error),
including string/bytestring edge cases that weren't previously tested.

The existing 818 conformance tests use text-based parsing and don't
exercise the FLAT binary decoder at all — a deliberate bug in the
byte-reading code passes all conformance tests but is caught by these
new tests.

Also includes tests that decode+re-encode all benchmark scripts,
and a standalone profiling binary for callgrind/perf analysis.

Two major optimizations to the FLAT binary decoder:

1. 64-bit accumulator: Replace byte-at-a-time bit reading with a 64-bit
   accumulator that pre-fetches up to 56 bits. Eliminates per-bit bounds
   checks and byte-crossing logic. Position tracked via single bit_pos
   field; accumulator drained for byte-aligned reads.

2. Integer fast path: big_word()/integer() now decode into u64 first
   (covers ~99% of integers), only falling back to BigInt for values
   exceeding 63 bits. Avoids heap allocation for every integer literal.

Decode-only benchmarks show 9-11% improvement on representative scripts.
Since decode is ~20% of total benchmark time, this contributes ~2%
end-to-end improvement.

Also adds a decode-only benchmark (flat_decode) for measuring decode
performance in isolation.
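
The accumulator idea from item 1 can be illustrated with a small MSB-first bit reader. This is a sketch under assumptions rather than the decoder in this PR: the struct and method names are made up, the refill policy (top up whole bytes while at most 56 bits are buffered) is one plausible choice, and position is tracked with a byte index plus a bit count instead of the single `bit_pos` field the commit mentions.

```rust
/// Minimal MSB-first bit reader backed by a 64-bit accumulator.
pub struct BitReader<'a> {
    bytes: &'a [u8],
    next_byte: usize, // index of the next byte to load into `acc`
    acc: u64,         // unread bits, left-aligned (next bit is bit 63)
    acc_bits: u32,    // number of valid bits currently in `acc`
}

impl<'a> BitReader<'a> {
    pub fn new(bytes: &'a [u8]) -> Self {
        BitReader { bytes, next_byte: 0, acc: 0, acc_bits: 0 }
    }

    /// Loads whole bytes while there is room for another one. This is the
    /// only place that touches the input slice.
    fn refill(&mut self) {
        while self.acc_bits <= 56 && self.next_byte < self.bytes.len() {
            let byte = self.bytes[self.next_byte] as u64;
            self.acc |= byte << (56 - self.acc_bits);
            self.next_byte += 1;
            self.acc_bits += 8;
        }
    }

    /// Reads `n` bits (at most 32) as an unsigned value, or returns `None`
    /// if the input is exhausted.
    pub fn bits(&mut self, n: u32) -> Option<u64> {
        assert!(n <= 32, "this sketch reads at most 32 bits at a time");
        if n == 0 {
            return Some(0);
        }
        if self.acc_bits < n {
            self.refill();
            if self.acc_bits < n {
                return None;
            }
        }
        let out = self.acc >> (64 - n);
        self.acc <<= n;
        self.acc_bits -= n;
        Some(out)
    }
}
```

Most calls to `bits` are satisfied from the accumulator with a shift and a subtraction; bounds checks happen only inside `refill`, once per loaded byte instead of once per bit.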
Profiling showed Env::push at 3.5% of total time, dominated by copying
the spine array on each push. Reducing SPINE_SIZE from 8 to 4 halves
the copy cost while still covering the most common de Bruijn indices
(1-4) in a single array access.

Benchmarked sizes 2, 4, 6, 8, 16 across multiple runs. Size 4 gives
the best tradeoff: ~5% faster than 8 on auction scripts, with only
marginally more parent-spine traversals.

Before (spine=8): auction_1-1 ~158µs, auction_1-2 ~440µs
After  (spine=4): auction_1-1 ~149µs, auction_1-2 ~424µs

Polymorphic builtins like ifThenElse, fstPair, sndPair require 1-2
forces before application. Previously each force went through:
  compute Force → push FrameForce → compute Builtin → create VBuiltin
  → return to FrameForce → force_evaluate → create forced Runtime

Now detected at compile time and short-circuited to create the forced
Runtime directly, skipping frame allocations and state transitions.

~3-4% improvement on scripts heavy in polymorphic builtins.

Link-Time Optimization (lto=true) and single codegen unit
(codegen-units=1) give the compiler full visibility across all crate
boundaries for inlining, devirtualization, and dead code elimination.

~20-28% improvement across benchmark scripts. This is a pure compiler
optimization with zero code changes.

Before: auction_1-1 ~165µs, auction_1-2 ~419µs
After:  auction_1-1 ~118µs, auction_1-2 ~325µs

Two changes:

1. Enable LTO (lto=true) and single codegen unit (codegen-units=1) for
   release and bench profiles. Gives the compiler full cross-crate
   visibility for inlining and dead code elimination. ~20-25% improvement.

2. Replace arena-allocated Context linked list with a pre-allocated
   Vec<Frame> stack. Eliminates per-frame arena allocation, improves
   cache locality (contiguous memory), and simplifies MachineState
   (no longer carries Context reference, reducing enum size).

Combined effect: uplc-turbo now ranks 3rd across all VM implementations,
ahead of Scalus CEK, Plutuz, and both Chrysalis variants.

Geo mean: 266µs → 201µs (24.6% faster on this benchmark run).

Add bytecode module with:
- Opcode definitions (10 core + 4 superinstructions + 4 specialized constants)
- Compiler that translates Term AST to flat bytecode with backpatching
- 11 compiler tests verifying correct opcode generation for all patterns

Superinstructions detected at compile time:
- ForceDelay: Force(Delay(body)) → single opcode, body inline
- ApplyLambda: Apply(Lambda(body), arg) → single opcode
- ForceBuiltin: Force(Builtin(f)) → single opcode
- Force2Builtin: Force(Force(Builtin(f))) → single opcode

Specialized constants for Unit, true, false, small integers avoid
constant pool lookup.

VM execution loop is stubbed — needs Value type extension for bytecode
closures (LambdaBC/DelayBC variants) before full implementation.
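
As a rough illustration of compile-time superinstruction detection, the sketch below picks an opcode for the top node of a term by looking one or two levels into its children. The `Term` and `Op` types here are simplified stand-ins, not the crate's definitions. The `forces_needed` field and its guards are an assumption that reflects a fix described later in this PR (ForceBuiltin/Force2Builtin are only emitted when the builtin actually requires forcing). Operand encoding and backpatching are omitted.

```rust
/// Simplified term shape (assumption; the real AST carries more data).
pub enum Term {
    Var(u32),
    Lambda(Box<Term>),
    Apply(Box<Term>, Box<Term>),
    Delay(Box<Term>),
    Force(Box<Term>),
    /// A builtin tagged with how many forces it expects (0, 1, or 2).
    Builtin { forces_needed: u8 },
}

/// Opcode kinds only; operands are left out of this sketch.
#[derive(Debug, PartialEq)]
pub enum Op {
    Var,
    Lambda,
    Apply,
    Delay,
    Force,
    Builtin,
    ForceDelay,
    ApplyLambda,
    ForceBuiltin,
    Force2Builtin,
}

/// Chooses the opcode for the top node of `term`, preferring a
/// superinstruction when the child pattern matches.
pub fn select_op(term: &Term) -> Op {
    match term {
        Term::Force(inner) => match &**inner {
            Term::Delay(_) => Op::ForceDelay,
            Term::Builtin { forces_needed } if *forces_needed >= 1 => Op::ForceBuiltin,
            Term::Force(innermost) => match &**innermost {
                Term::Builtin { forces_needed } if *forces_needed >= 2 => Op::Force2Builtin,
                _ => Op::Force,
            },
            _ => Op::Force,
        },
        Term::Apply(function, _) => match &**function {
            Term::Lambda(_) => Op::ApplyLambda,
            _ => Op::Apply,
        },
        Term::Var(_) => Op::Var,
        Term::Lambda(_) => Op::Lambda,
        Term::Delay(_) => Op::Delay,
        Term::Builtin { .. } => Op::Builtin,
    }
}
```

Anything that does not match a fused pattern falls back to the plain opcode, so the superinstructions are purely an optimization and never change which programs compile.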
… tests

Complete bytecode VM implementation:

Compiler (compiler.rs):
- Compiles Term AST to flat bytecode with backpatching
- Detects and emits superinstructions for common patterns
- Interns constants into a side-table pool
- 11 compiler unit tests

VM (vm.rs):
- Tight dispatch loop over u8 opcodes
- LambdaBC/DelayBC value variants for bytecode closures (body_ip + env)
- Delegates builtin execution to existing Machine::call
- Full CEK semantics: env, continuation stack, budget tracking

Passing integration tests:
- bc_eval_integer, bc_eval_unit, bc_eval_true
- bc_eval_identity (lambda + apply)
- bc_eval_force_delay
- bc_eval_add_integer (builtin)
- bc_eval_nested_apply (deep lambda nesting)
- bc_eval_if_then_else (polymorphic builtin with force/delay)

Key fixes:
- LambdaBC/DelayBC values now store original AST terms for discharge,
  enabling correct term reconstruction for output/error reporting
- Compiler records lambda_info/delay_info mappings from body_ip → AST
- Constr uses u8 tag (common case) with ConstrBig (u64) for large tags,
  keeping the hot path cache-friendly via separate opcodes
- ForceBuiltin/Force2Builtin superinstructions now only emitted when
  the builtin actually requires forcing (fixes argExpected test)

Test results: 1703 total passing
- 818 conformance tests via bytecode VM (NEW - all passing)
- 818 conformance tests via AST interpreter
- 36 unit tests (compiler + VM integration)
- 31 FLAT round-trip tests

- Replace HashMap lookups for lambda_info/delay_info with Vec indexing
  via compile-time u16 IDs embedded in the bytecode
- Benchmark pre-compiles bytecode once, only measures execution (AOT)
- ConstrBig opcode for large tags (u64), Constr stays u8 for common case
- 818/818 conformance tests still passing

Replace guarded match patterns (op if op == Op::Xxx as u8 =>) with
direct numeric literals (0x01 =>) in the bytecode dispatch loop.
This allows the compiler to generate a jump table instead of chained
comparisons.

Results (AOT, pre-compiled):
- auction_1-1: 120µs (matching AST interpreter's 118µs)
- auction_1-2: 369µs (vs AST's 325µs — 14% gap)

The bytecode VM shows dramatically better cache behavior (60% fewer
L1 misses) and branch prediction (33% fewer misses) compared to AST,
but executes ~19% more instructions due to opcode decoding overhead.
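
For readers unfamiliar with the pattern, here is a toy dispatch loop that matches on literal opcode bytes. The three opcodes and their encodings are hypothetical and unrelated to the real instruction set. Whether the match actually lowers to a jump table is up to the optimizer; literal patterns simply give it a dense, guard-free match to work with.

```rust
/// Runs a tiny stack program and returns the value on top of the stack at
/// Halt. Hypothetical opcodes, for illustration only:
///   0x01 PushSmall <i8>   0x02 Add   0x03 Halt
pub fn run(code: &[u8]) -> Option<i64> {
    let mut stack: Vec<i64> = Vec::new();
    let mut ip = 0usize;
    loop {
        let op = *code.get(ip)?;
        ip += 1;
        // Literal patterns with no `if` guards.
        match op {
            // PushSmall: one signed immediate byte follows the opcode.
            0x01 => {
                let imm = *code.get(ip)? as i8;
                ip += 1;
                stack.push(imm as i64);
            }
            // Add: pop two operands, push their checked sum.
            0x02 => {
                let b = stack.pop()?;
                let a = stack.pop()?;
                stack.push(a.checked_add(b)?);
            }
            // Halt: return the top of the stack.
            0x03 => return stack.pop(),
            // Unknown opcode.
            _ => return None,
        }
    }
}
```

The guarded form `op if op == Op::Xxx as u8 => ...` expresses the same logic, but each arm is then a catch-all pattern plus a runtime comparison, which is what the commit reports as compiling to chained comparisons.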
The use_cases benchmark now supports UPLC_BENCH_MODE env var:
- "ast" (default): standard AST interpreter (decode + eval per iteration)
- "bytecode": AOT bytecode VM (compile once, execute per iteration)

Docker runner script passes the env var through to the benchmark binary.

Two major optimizations to the bytecode VM:

1. LambdaBC/DelayBC values now store only (body_ip, id, env) instead of
   (body_ip, env, parameter, body). The AST refs for discharge are looked
   up from CompiledProgram tables only when needed (final output). This
   cuts the value size significantly, reducing arena allocation pressure.

2. Frame::Constr and Frame::Cases no longer heap-allocate Vec<u32> for
   field/branch offsets. Instead they store (offsets_start, count) and
   read offsets directly from the bytecode array on demand.

Bytecode VM now 25-44% faster than the AST interpreter:
  auction_1-1: 118µs → 82µs (1.44x)
  auction_1-2: 325µs → 259µs (1.25x)
  auction_1-3: 332µs → 257µs (1.29x)

All 1703 tests passing (818 bytecode conformance + 818 AST + 36 unit + 31 FLAT).

Use slice try_into for read_u32/read_u16 instead of individual byte
accesses. This lets the compiler elide per-byte bounds checks and
generate a single aligned load instruction.
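
A sketch of what slice-based operand reads can look like. The function names follow the commit message, but the signatures, the `Option` return type, and the little-endian byte order are assumptions made for this example.

```rust
// Already in the 2021 prelude; imported explicitly for older editions.
use std::convert::TryInto;

/// Reads a little-endian u32 at `ip`, or `None` if fewer than 4 bytes remain.
pub fn read_u32(code: &[u8], ip: usize) -> Option<u32> {
    // One range check covers all four bytes; `try_into` then turns the
    // 4-byte slice into a fixed-size array.
    let bytes: [u8; 4] = code.get(ip..ip.checked_add(4)?)?.try_into().ok()?;
    Some(u32::from_le_bytes(bytes))
}

/// Reads a little-endian u16 at `ip`, or `None` if fewer than 2 bytes remain.
pub fn read_u16(code: &[u8], ip: usize) -> Option<u16> {
    let bytes: [u8; 2] = code.get(ip..ip.checked_add(2)?)?.try_into().ok()?;
    Some(u16::from_le_bytes(bytes))
}
```

Compared with indexing `code[ip]`, `code[ip + 1]`, and so on, there is a single bounds check per operand, and the conversion to `[u8; N]` gives the compiler a fixed-size value it can load in one go.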
When Apply's function is a Var (variable lookup), the function
evaluation is instant — just an array index into the environment.
The ApplyVar superinstruction skips FrameAwaitFunTerm entirely,
directly pushing FrameAwaitArg with the looked-up value.

Captures ~19% of Apply opcodes (444 out of 2344 in auction_1-2).
Saves 1 frame push + 1 frame pop + 1 Phase transition per occurrence.

818/818 conformance tests passing.

Force(Var(idx)) is extremely common (56% of all Force opcodes).
ForceVar looks up the variable and forces it directly, skipping
the FrameForce push/pop cycle entirely.

Reduces total opcode count by ~8% (8493 → 7807 on auction_1-2).
818/818 conformance tests passing.

Instead of allocating a Value::Con wrapper in the arena for every
Const opcode execution, pre-wrap all constant pool entries into
Values once at the start of execute(). The Const opcode now does
a simple array index instead of an arena allocation.

~10% improvement on auction scripts (120→107µs on auction_1-1).
818/818 conformance tests passing.

Specialized constant opcodes (ConstUnit, ConstTrue, ConstFalse) now
return pre-allocated values instead of arena-allocating on each use.
These values are created once per execute() call and shared across
all occurrences.

All 92 builtin values are pre-allocated once per execute() call.
The Builtin opcode now does a simple array lookup instead of
arena-allocating a Runtime struct + Value wrapper per occurrence.

Quantumplation and others added 25 commits April 30, 2026 07:06
- Exclude crates/uplc-fuzz from workspace (broken dep after rebase)
- Add bench pre-check to skip scripts that fail decode/eval
- Add utility examples: eval_flat, eval_flat_bc, check_max_var,
  bc_dump, bench_loop (for profiling)

Signed-off-by: Jonathan Lim <jonathan.lim.222@gmail.com>
Adds a fuzzing harness that generates random UPLC programs and tests
them for correctness against expected evaluation behavior.

Signed-off-by: Jonathan Lim <jonathan.lim.222@gmail.com>
Tests that dropList with a huge integer argument fails with an
evaluation failure rather than silently overflowing the cost calculation.
This test currently fails due to integer overflow in cost arithmetic.

Signed-off-by: Jonathan Lim <jonathan.lim.222@gmail.com>
Prevents integer overflow panics when cost model inputs are very large
(e.g. dropList with a huge index). All arithmetic in costing functions,
budget subtraction, and ExBudget::Sub now uses saturating operations
so overflow produces clamped values instead of panics, letting the
out-of-budget check fire normally.
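
A minimal sketch of the saturating approach, assuming a two-field budget and a linear cost function. `ExBudget` and its `Sub` impl are named in the commit message; the field names, `linear_cost`, `spend`, and `OutOfBudget` are hypothetical names chosen for this example.

```rust
use std::ops::Sub;

/// Remaining or required execution budget (field names are assumptions).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct ExBudget {
    pub cpu: i64,
    pub mem: i64,
}

impl Sub for ExBudget {
    type Output = ExBudget;

    /// Saturating subtraction: clamps at the i64 bounds instead of panicking.
    fn sub(self, rhs: ExBudget) -> ExBudget {
        ExBudget {
            cpu: self.cpu.saturating_sub(rhs.cpu),
            mem: self.mem.saturating_sub(rhs.mem),
        }
    }
}

#[derive(Debug, PartialEq)]
pub struct OutOfBudget;

/// Hypothetical linear cost function: intercept + slope * size, clamped.
pub fn linear_cost(intercept: i64, slope: i64, size: i64) -> i64 {
    slope.saturating_mul(size).saturating_add(intercept)
}

/// Subtracts `cost` and reports exhaustion through the ordinary check.
pub fn spend(remaining: ExBudget, cost: ExBudget) -> Result<ExBudget, OutOfBudget> {
    let left = remaining - cost;
    if left.cpu < 0 || left.mem < 0 {
        Err(OutOfBudget)
    } else {
        Ok(left)
    }
}
```

With an enormous size argument the cost clamps to `i64::MAX`, the subtraction goes negative, and the caller sees a normal out-of-budget error where a debug build would previously have panicked on overflow.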

Signed-off-by: Jonathan Lim <jonathan.lim.222@gmail.com>
Tests for case expressions on list constants:
- case-list-empty-valid: empty list with 2 branches takes nil branch
- case-list-nonempty-valid: non-empty list takes cons branch with head/tail
- case-list-empty-too-many-branches: empty list with 4 branches errors

These tests currently fail due to incorrect branch count validation
in the case-on-constants implementation.

Signed-off-by: Jonathan Lim <jonathan.lim.222@gmail.com>
Boolean: False accepts 1 or 2 branches, True requires exactly 2.
List: Cons accepts 1 or 2 branches, Nil requires exactly 2. Any other
branch count is an error. Previously the code was too permissive,
allowing arbitrary branch counts for lists.
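
The rule above fits in a small lookup function. This sketch encodes only what the commit message states. The `Scrutinee` type and `select_branch` are illustrative names, and the branch indices (0 for False/Cons, 1 for True/Nil) are inferred from which constructors accept a single branch.

```rust
/// The constant being scrutinised, reduced to what branch selection needs.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Scrutinee {
    False,
    True,
    Cons,
    Nil,
}

/// Returns the index of the branch to take, or `None` when the branch count
/// is invalid for this scrutinee (an evaluation error in the machine).
pub fn select_branch(scrutinee: Scrutinee, branch_count: usize) -> Option<usize> {
    match (scrutinee, branch_count) {
        // False and Cons use the first branch; one or two branches are fine.
        (Scrutinee::False, 1) | (Scrutinee::False, 2) => Some(0),
        (Scrutinee::Cons, 1) | (Scrutinee::Cons, 2) => Some(0),
        // True and Nil use the second branch, so exactly two are required.
        (Scrutinee::True, 2) | (Scrutinee::Nil, 2) => Some(1),
        // Zero branches, three or more, or a missing second branch.
        _ => None,
    }
}
```

For instance, `select_branch(Scrutinee::Nil, 4)` yields `None`, matching the case-list-empty-too-many-branches conformance test added in the preceding commit.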

Signed-off-by: Jonathan Lim <jonathan.lim.222@gmail.com>
Fuzzer-generated test exercising Case and Constr term discharge.
This test currently fails because discharge does not handle
Case and Constr terms.

Signed-off-by: Jonathan Lim <jonathan.lim.222@gmail.com>
Discharge was missing Case and Constr variants, causing incorrect
results when these terms appeared inside closures that needed
environment substitution. Also adds Arena::alloc_slice helper
used by the discharge implementation, and fixes trailing newline
in fuzz_0017 budget expectation.

Signed-off-by: Jonathan Lim <jonathan.lim.222@gmail.com>
Implement proper flat serialization for BLS12-381 G1/G2 elements
(using compressed point format) and Data constants (via CBOR).
Previously these variants returned errors or were unimplemented.

Signed-off-by: Jonathan Lim <jonathan.lim.222@gmail.com>
XorByteString was incorrectly charged using OrByteString's cost model, causing budget divergence from the Haskell reference implementation.

Signed-off-by: Jonathan Lim <jonathan.lim.222@gmail.com>
Integers beyond i128 range caused a panic instead of falling through to the bounds check and returning an evaluation error, as Haskell does.

Signed-off-by: Jonathan Lim <jonathan.lim.222@gmail.com>
…rences

These three divergences caused budget mismatches vs the Haskell reference. Haskell uses element count for lists, maxBound sentinel for pairs.

Signed-off-by: Jonathan Lim <jonathan.lim.222@gmail.com>
The bytecode VM accepted branch counts that the CEK machine would reject, causing divergences on edge-case programs found by fuzzing.

Signed-off-by: Jonathan Lim <jonathan.lim.222@gmail.com>
The previous code used with_env which didn't have access to bytecode-specific lambda/delay info tables, causing incorrect term reconstruction for bytecode closures.

Signed-off-by: Jonathan Lim <jonathan.lim.222@gmail.com>
The pretty-printer emits forms like ((B #..), (I 1)) inside maps, which the parser couldn't previously round-trip.

Signed-off-by: Jonathan Lim <jonathan.lim.222@gmail.com>
Recursive decode_term() overflows the default ~2MB test thread stack on deeply nested benchmark scripts.

Signed-off-by: Jonathan Lim <jonathan.lim.222@gmail.com>
Adds conformance tests for edge cases discovered during the Haskell-vs-Rust divergence audit. These guard against regressions in consByteString bounds-checking, expModInteger with negative exponents, and case-on-constants branch validation.

Signed-off-by: Jonathan Lim <jonathan.lim.222@gmail.com>
Extract parse_plutus_version() to deduplicate the version string parsing that was duplicated between main() and replay(). Replace the debug-specific summarize_top() function, which was hardcoded to Case/Constant patterns left over from investigating a specific divergence, with a generic term_variant_name() that returns the top-level variant name for any term. Import Outcome at the top of replay() to remove six fully-qualified path references. Change &PathBuf parameters to &Path for idiomatic Rust. Exit with code 1 when replay detects a divergence so scripts and CI can detect failures programmatically.

Signed-off-by: Jonathan Lim <jonathan.lim.222@gmail.com>
@jonathanlim222 changed the title from "Fix Haskell divergences and improve uplc-fuzz" to "feat: Add uplc-fuzz utility, performance optimizations, and fixed Haskell divergences" Apr 30, 2026
@jonathanlim222 changed the title from "feat: Add uplc-fuzz utility, performance optimizations, and fixed Haskell divergences" to "feat: bytecode VM, CEK optimizations, version-gated FLAT decoding, and conformance fixes" Apr 30, 2026
6 participants