feat(gc): Phase C3b — minor GC trace skips old-gen (v0.5.227)
The time-win core of generational GC. New gc_collect_minor entry
runs the same root-mark phase but drains the worklist via a new
minor variant that skips old-gen objects' fields.
crates/perry-runtime/src/gc.rs:
  drain_trace_worklist_inner(worklist, valid_ptrs, minor_only)
    Common implementation. When minor_only is true, calls
    pointer_in_old_gen(user_addr) per popped header and `continue`s
    for old-gen objects: they stay marked (treated as black
    leaves) but their fields aren't recursively visited. Young
    children held by old-gen parents reach the worklist via the
    remembered set scan from C3a.
  drain_trace_worklist (full) and drain_trace_worklist_minor
    (specialized) wrap the inner function, passing
    minor_only = false / true.
  trace_marked_objects_minor builds the same worklist as
    trace_marked_objects but drains via the minor variant.
  gc_collect_minor(): full minor GC entry.
    Same mark phase as gc_collect_inner (stack + globals + 9
    registered scanners + RS), drains via minor trace, sweeps,
    clears RS. Sweep is unchanged: arena_reset_empty_blocks
    already restricts itself to the nursery, and the malloc-side
    sweep walks MALLOC_STATE objects regardless of generation.
  gen_gc_enabled(): OnceLock-cached PERRY_GEN_GC env-var gate.
    gc_collect_inner now tail-calls gc_collect_minor when the gate
    is on. Default OFF; opt-in until C4 lands.
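A minimal sketch of how an OnceLock-cached env-var gate of this kind can be written. The accepted values (`1` / `on` / `true`) are taken from the CLAUDE.md entry for this version; the `parse_gate` helper, the exact-match (case-sensitive) comparison, and the static's name are assumptions made for illustration, not the runtime's actual code.

```rust
use std::sync::OnceLock;

/// Hypothetical helper: decides whether a raw env-var value turns the gate on.
/// Kept separate from the cached accessor so it can be checked in isolation.
fn parse_gate(value: Option<&str>) -> bool {
    // Assumption: exact, case-sensitive match on the three documented values.
    matches!(value, Some("1") | Some("on") | Some("true"))
}

/// Reads PERRY_GEN_GC once per process and caches the answer.
fn gen_gc_enabled() -> bool {
    static ENABLED: OnceLock<bool> = OnceLock::new();
    *ENABLED.get_or_init(|| parse_gate(std::env::var("PERRY_GEN_GC").ok().as_deref()))
}
```

Because the value is cached on first use, changing the variable after the first collection has no effect in this sketch; the gate is effectively a per-process setting.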
In workloads with substantial old-gen working set, minor GC is
now O(young live + RS roots) instead of O(all live). Today the
win is unobservable on the JSON benches because OLD_ARENA stays
empty — Phase C4 will wire nursery→old promotion to fill it.
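The cost difference can be illustrated with a toy model. The sketch below is not the code in gc.rs: `Obj`, `mark_and_trace`, and `demo_heap` are hypothetical names, the heap is a plain vector indexed by position, and the young object passed as an extra root simply stands in for whatever the remembered-set scan contributes.

```rust
use std::collections::HashSet;

/// Toy heap object (hypothetical; Perry's real object headers differ).
struct Obj {
    old_gen: bool,
    children: Vec<usize>, // indices into the heap slice
}

/// Marks everything reachable from `roots` and returns the marked set plus
/// the number of objects whose fields were actually traced.
fn mark_and_trace(heap: &[Obj], roots: &[usize], minor_only: bool) -> (HashSet<usize>, usize) {
    let mut marked = HashSet::new();
    let mut worklist = Vec::new();
    for &r in roots {
        if marked.insert(r) {
            worklist.push(r);
        }
    }
    let mut traced = 0;
    while let Some(idx) = worklist.pop() {
        let obj = &heap[idx];
        if minor_only && obj.old_gen {
            // Old-gen object: stays marked, acts as a black leaf, fields skipped.
            continue;
        }
        traced += 1;
        for &child in &obj.children {
            if marked.insert(child) {
                worklist.push(child);
            }
        }
    }
    (marked, traced)
}

/// Two old-gen roots (0 and 1) and a young chain 2 -> 3 hanging off object 0.
fn demo_heap() -> Vec<Obj> {
    vec![
        Obj { old_gen: true, children: vec![2] },
        Obj { old_gen: true, children: vec![] },
        Obj { old_gen: false, children: vec![3] },
        Obj { old_gen: false, children: vec![] },
    ]
}
```

On `demo_heap()`, a full trace from roots `[0, 1]` visits the fields of all 4 objects, while a minor trace from roots `[0, 1, 2]` visits only the 2 young ones and still leaves all 4 marked. With an empty old generation the two counts coincide, which matches the unchanged bench numbers reported for this commit.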
2 new unit tests:
  test_gc_collect_minor_clears_rs: RS is empty after minor GC
  test_gc_collect_minor_runs_without_panic: mixed nursery+old
    heap, three sequential minor collections, no crash
Runtime tests 156 -> 158.
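The clear-after-collection invariant that the first test pins can be sketched in isolation. The per-thread `REMEMBERED_SET` with a `HashSet` and the `with(|s| s.borrow_mut().clear())` call follow the C3a changelog entry; the `usize` element type and the `record_old_parent`, `collect_minor_stub`, and `remembered_set_len` helpers are assumptions for illustration, and the mark, trace, and sweep steps are elided.

```rust
use std::cell::RefCell;
use std::collections::HashSet;

thread_local! {
    // Assumption: entries are raw addresses of old-gen parents.
    static REMEMBERED_SET: RefCell<HashSet<usize>> = RefCell::new(HashSet::new());
}

/// Hypothetical stand-in for what a write barrier records.
fn record_old_parent(parent_addr: usize) {
    REMEMBERED_SET.with(|s| {
        s.borrow_mut().insert(parent_addr);
    });
}

/// Stub collection: mark, trace, and sweep are elided; only the final
/// remembered-set clear is modeled.
fn collect_minor_stub() {
    REMEMBERED_SET.with(|s| s.borrow_mut().clear());
}

/// Number of entries currently held in this thread's remembered set.
fn remembered_set_len() -> usize {
    REMEMBERED_SET.with(|s| s.borrow().len())
}
```

A test in the same shape would record a couple of parents, run the collection, and assert that `remembered_set_len()` is back to zero.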
Regression:
  20/20 test_json_*.ts: 10 under default + 10 under
    PERRY_GEN_GC=1 PERRY_WRITE_BARRIERS=1
  bench_json_roundtrip best-of-5: 72-74 ms across all 4 mode
    combinations (default / GEN_GC / WB / GEN_GC+WB), within
    noise as expected (empty OLD_ARENA = no objects to skip)
Infrastructure layer complete. Phase C4 adds promotion (nursery
survivors → OLD_ARENA) which makes the skip-old-gen optimization
load-bearing for the bench_json_roundtrip RSS ≤70 MB ship
criterion.
CLAUDE.md: 2 additions & 1 deletion
@@ -8,7 +8,7 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
Perry is a native TypeScript compiler written in Rust that compiles TypeScript source code directly to native executables. It uses SWC for TypeScript parsing and LLVM for code generation.

-**Current Version:** 0.5.226
+**Current Version:** 0.5.227

## TypeScript Parity Status
@@ -148,6 +148,7 @@ First-resolved directory cached in `compile_package_dirs`; subsequent imports re
Keep entries to 1-2 lines max. Full details in CHANGELOG.md.
- **v0.5.205** — Fix #183: `perry compile --target web` on a real-world app (Bloom Jump built on the Bloom engine) produced a WASM binary the browser refused to load — `Compiling function #687 failed: expected 0 elements on the stack for fallthru, found 103` (count varies with engine state). Root cause in `crates/perry-codegen-wasm/src/emit.rs`: the four direct-`Call`-instruction code paths — `Expr::Call` FuncRef arm (~4302), `Expr::Call` ExternFuncRef arm (~4324), `Expr::New` user-class ctor (~5844), `Expr::SuperCall` parent-ctor (~5894), `Expr::StaticMethodCall` direct-static path (~5979) — each emit `emit_expr(arg)` per source arg and pad up with `TAG_UNDEFINED` when `args.len() < expected`, but had no matching drop-excess branch when `args.len() > expected`. WASM `call` consumes exactly the callee's declared param count, so when JS's "extra args evaluated for side effects, then silently ignored" semantics met Perry's WASM codegen, every extra evaluated arg leaked past the call and accumulated on the enclosing block's operand stack — 103 values by the time `_start`'s final `end` hit the validator. The shape that triggered it in jump/bloom was `bloom/src/core/colors.ts`'s `Colors = new __AnonShape_2(...24 PropertyGets...)` landing on a Phase-3-synthesized ctor with lower declared arity, multiplied across bloom's 10 submodules. Fix: after each existing `for _ in args.len()..expected { I64Const(TAG_UNDEFINED) }` pad-up loop, add the mirror `for _ in expected..args.len() { Drop }` — matches JS semantics (extras evaluated for side effects but discarded) and keeps the operand stack aligned with the callee's WASM type at every direct-Call site. 
Verified end-to-end against the exact issue repro cloned fresh from `github.com/Bloom-Engine/jump` + `github.com/Bloom-Engine/engine`: both path A `file:./vendor/bloom/` and path B `file:../engine/` now compile to a WebAssembly-validating `.wasm` (416,923 / 413,780 bytes respectively, 140 FFI imports intact, `WebAssembly.compile` resolves clean on node 20+); a synthetic `takesFive(mc(),mc(),1,2,3,4)` minimal case that previously failed `Compiling function #213 failed: ... found 1` also validates. `cargo test --release -p perry-runtime -p perry-hir -p perry-codegen-wasm -p perry`: 262/262 passed. Note: issue #183 also claimed path A found only 1 module and emitted 9 FFI imports — could not reproduce in a fresh clone (both paths find 10 modules identically); most likely an artifact of the reporter's local `vendor/bloom` snapshot predating the `exports` map, and the "runGame silently no-ops" symptom the user actually observed was the browser refusing to instantiate the invalid WASM with the surrounding JS glue swallowing the error — fixed here.
+
- **v0.5.227** — Gen-GC **Phase C3b**: minor GC trace skips old-gen objects. New `gc_collect_minor()` entry in `crates/perry-runtime/src/gc.rs` runs the same root-mark phase (stack + globals + 9 registered scanners + RS) but drains the worklist via the new `drain_trace_worklist_minor` variant which calls `pointer_in_old_gen(user_addr)` per popped header and `continue`s for old-gen objects without invoking `trace_object` / `trace_array` / `trace_closure` / etc. — they stay marked (treated as black leaves) but their fields aren't recursively visited. Young children held by old-gen parents reach the worklist exclusively via the remembered set, scanned by `mark_remembered_set_roots` (C3a). New `gen_gc_enabled()` helper reads `PERRY_GEN_GC` env var (cached via `OnceLock`); when set to `1` / `on` / `true`, every collection routes through `gc_collect_minor` instead of the standard mark-sweep. `gc_collect_inner` checks the flag at entry and tail-calls minor when the gate is on. Default OFF — opt-in until Phase C4 hits the bench_json_roundtrip ship criterion (RSS ≤70 MB direct path) and proves out across the full test corpus. **Trace specialization is the time-win core of generational GC**: in workloads with substantial old-gen working set, minor GC is now `O(young live + RS roots)` instead of `O(all live)` — but today the win is unobservable on the JSON benches because OLD_ARENA stays empty (Phase C4 will wire nursery→old promotion to actually fill it). Two new unit tests pin the C3b invariants: `test_gc_collect_minor_clears_rs` verifies RS is empty after `gc_collect_minor`, `test_gc_collect_minor_runs_without_panic` exercises a mixed nursery+old-gen heap through three sequential minor collections. Runtime tests 156 → **158**. Full regression sweep clean: 20/20 `test_json_*.ts` (10 default + 10 with `PERRY_GEN_GC=1 PERRY_WRITE_BARRIERS=1`); `bench_json_roundtrip` best-of-5 across all 4 mode combinations (default / GEN_GC / WB / GEN_GC+WB) all 72-74 ms — within noise. 
The infrastructure layer is now complete; Phase C4 lights it up by adding promotion (nursery survivors → OLD_ARENA via copying evacuation), which makes the "skip old-gen during minor" optimization meaningful.
- **v0.5.226** — Gen-GC **Phase C3a**: remembered-set roots flow into the GC mark phase + RS clears after every collection. New `mark_remembered_set_roots(valid_ptrs)` in `crates/perry-runtime/src/gc.rs` snapshots the per-thread `REMEMBERED_SET` (populated by the codegen-emitted write barriers from C2/C2-expansion) and re-marks each old-gen header as a `POINTER_TAG`-tagged value via the standard `try_mark_value` machinery. Wired into `gc_collect_inner` between `mark_registered_roots` and `trace_marked_objects`. RS cleared via `REMEMBERED_SET.with(|s| s.borrow_mut().clear())` immediately after `sweep()` returns, so the next collection cycle starts coherent — barrier emissions at C2 sites repopulate the RS as needed during the next allocation epoch. **Today this is correctness-equivalent to before** (the conservative C-stack scan + 9 root scanners already kept everything alive); the contribution is the **infrastructure point** — the RS now has a real consumer in the GC, validated end-to-end. C3b will add the actual generational specialization (skip old-gen objects during marking, scan only nursery from RS roots, gated `PERRY_GEN_GC=1`) which is where the time/RSS wins land. New unit test `test_remembered_set_cleared_after_full_gc` pins the clear-after-GC invariant: populate RS via barrier, run full GC, assert RS is empty. Runtime tests 155 → **156**. Full regression sweep clean: 10/10 `test_json_*.ts` match Node under default AND `PERRY_WRITE_BARRIERS=1` (where the RS actually fills with old→young entries during parse). `bench_json_roundtrip` best-of-5 WB-off 65 ms vs WB-on 65 ms — RS clear cost invisible (HashSet of <100 entries clearing in microseconds). Gap tests 25/28 (baseline). Phase C is now 3/3 sub-phases done at the infrastructure level: C1 (RS storage), C2 (codegen emission), C3a (GC consumes RS). C3b adds the generational mark-phase specialization that yields the bench_json_roundtrip RSS ≤70 MB ship criterion.
- **v0.5.225** — Gen-GC **Phase C2 expansion**: write-barrier emission extended to every remaining heap-store site. New `emit_write_barrier(ctx, parent_bits, child_bits)` helper at the top of `crates/perry-codegen/src/expr.rs` consolidates the gate-checked emit (gated `PERRY_WRITE_BARRIERS=1`, branchless at codegen via `OnceLock`-cached `write_barriers_enabled()`). Now wired at: (1) `Expr::PropertySet` generic `obj.x = y` path (refactored from inline emit at v0.5.224 to use the helper), (2) `Expr::IndexSet` array element path — both the local-without-slot fallback AND the no-local-id fallback at the runtime-call site, (3) `Expr::IndexSet` array element FAST path inside `lower_index_set_fast` — covers `arr[i] = v` for `arr` in `ctx.locals` (the most common shape), barrier emitted in the merge block after both inline-extend and realloc paths converge, (4) `Expr::IndexSet` string-literal-key path `obj["k"] = v`, (5) `Expr::IndexSet` runtime-string-key fallback path, (6) `Expr::LocalSet` closure capture write — both branches: boxed (parent = box ptr from `js_box_set`) AND non-boxed (parent = closure ptr from `js_closure_set_capture_f64`). NOT yet covered (deferred — separate codegen paths): class-field-set fast path with statically-typed receivers using direct field-index store. The 4-WB-site stress test (`/tmp/wb_all_sites.ts`: array indexed-set, two object key sets, one PropertySet) emits 4 `call void @js_write_barrier(...)` lines in IR and matches Node byte-for-byte. Regression sweep clean: 10/10 `test_json_*.ts` match Node under `PERRY_WRITE_BARRIERS=1`. `bench_json_roundtrip` best-of-5 WB-off 64 ms vs WB-on 64 ms — barrier overhead invisible because the bench's hot path is parse + stringify (no user-code field/element writes). The barrier becomes load-bearing once Phase C3 lands and minor GC consumes the remembered set — at that point the barrier replaces a full-arena scan with an RS-only scan, which is the actual time win the gen-GC plan promises.
- **v0.5.224** — Gen-GC **Phase C sub-phase 2**: codegen emits `js_write_barrier(parent_bits, child_bits)` after the generic `Expr::PropertySet` heap store in `crates/perry-codegen/src/expr.rs`. Gated behind `PERRY_WRITE_BARRIERS=1` env var (cached via `OnceLock` in new `crates/perry-codegen/src/codegen.rs::write_barriers_enabled()`) — default OFF, so no production-perf impact until Phase C3 lands and minor GC actually consumes the remembered set. Initial-scope coverage: the generic PropertySet path (used for `any`-typed receivers like `const h: any = {}; h.v = ...`). Class-field-set fast paths (statically-typed receivers using direct field-index lookup) are intentionally NOT instrumented in this sub-phase — those go through different codegen and will be wired in C2 follow-up. Verified end-to-end via `PERRY_SAVE_LL=<dir>`: a 3-field-write program emits exactly 3 `call void @js_write_barrier(i64 ..., i64 ...)` lines after the matching `js_object_set_field_by_name` calls, runs cleanly, output matches Node. **Regression sweep clean under WB on**: 10/10 `test_json_*.ts` match Node byte-for-byte; runtime tests 155/155 unchanged (sub-phase C1 6 tests still cover the runtime side); `bench_json_roundtrip` best-of-5 WB-off 66 ms vs WB-on 65 ms — within noise because the bench doesn't write object fields on the hot path (parse+stringify only). The barrier's per-call cost (one bitcast + one extern call + the runtime's O(blocks) old-vs-young range scan) is currently the dominant overhead per heap store; sub-phase C3's `GC_FLAG_YOUNG` bit-test will replace the range scan with a single conditional branch. **Next**: sub-phase C3 (minor GC implementation — scan precise roots + remembered set, evacuate survivors to old-gen, clear RS, gated behind `PERRY_GEN_GC=1`). C3 is where the precision and arena split actually become useful — bench_json_roundtrip direct-path RSS should drop to ≤70 MB per the Phase C ship criterion.