Skip to content

Commit 8a7ea99

Browse files
authored
bench: refresh all benchmark suites at v0.5.908 (2026-05-14) (#765)
Full rerun of polyglot, JSON polyglot, honest_bench, and suite/ microbenchmarks on an otherwise-idle machine. Confirms that yesterday's v0.5.891 sweep (#745 follow-up) was dominated by parallel cargo-build contamination — σ on Perry compute cells dropped from 25-57 ms to 0.3-2.2 ms. Key results: - Compute polyglot matches v0.5.585 historical numbers within 1-4 ms across all 9 cells (default + --fast-math); fast-math cleanly reproduces 8× / 3.6× / 2.9× speedups on loop_overhead / math_intensive / accumulate. - honest_bench: Perry slightly faster on all 3 workloads vs v0.5.891 (image_conv 365 → 354 ms; json_full 1155 → 1098 ms); 300/300 output-matched rows. - #745 partial fix verification: JSON polyglot RSS dropped 254 → 227 MB roundtrip and 411 → 309 MB iterate after v0.5.900's GC trigger-ratchet fix. Residual ~150 MB gap vs v0.5.279 baseline flagged on the issue. - suite/: method_calls back to 9 ms (yesterday's 25 ms was noise); closure/factorial regressions vs v0.5.173 persist as known follow-ups. Docs refreshed: top-level README, benchmarks/README, polyglot RESULTS{,_AUTO,_OPT}.md, honest_bench REPORT.md (+ regenerated charts), json_polyglot RESULTS.md (auto), suite/results/RESULTS.md (new). All with 2026-05-14 / v0.5.908 datestamps and historical deltas vs v0.5.891 and v0.5.279.
1 parent d2e27e7 commit 8a7ea99

14 files changed

Lines changed: 4064 additions & 813 deletions

File tree

README.md

Lines changed: 22 additions & 20 deletions
Original file line numberDiff line numberDiff line change
@@ -52,36 +52,38 @@ People are building real apps with Perry today. Here are some highlights:
5252

5353
> **As of v0.5.585, fast-math is opt-in.** Perry's default mode emits no `reassoc + contract` per-instruction FMF flags, so f64 arithmetic is bit-exact with Node. `--fast-math` (CLI), `PERRY_FAST_MATH=1` (env), or `"perry": { "fastMath": true }` in `package.json` re-enables the flags. See [`docs/src/cli/fast-math.md`](docs/src/cli/fast-math.md) for the discussion of when it does and doesn't matter. The numbers below are Perry's default mode unless noted.
5454
55-
Numbers below for Perry are from a 2026-05-06 sweep on macOS ARM64 (M1 Max, RUNS=11 medians, `taskpolicy -t 0 -l 0`). Other languages are from the 2026-04-25 v0.5.249 sweep on the same hardware (compiler versions unchanged — these numbers don't shift with Perry-side work). Source + methodology in [`benchmarks/polyglot/`](benchmarks/polyglot/).
55+
Numbers below are from a 2026-05-14 sweep on macOS ARM64 (M1 Max, RUNS=11 medians, `taskpolicy -t 0 -l 0`) at Perry v0.5.908 on an otherwise-idle machine. All languages re-measured together this run. Source + methodology in [`benchmarks/polyglot/`](benchmarks/polyglot/).
5656

5757
| Benchmark | Perry | Rust | C++ | Go | Swift | Java | Node | Bun | What it tests |
5858
|---------------------|------:|------:|------:|------:|------:|------:|------:|------:|---------------|
59-
| fibonacci | 304 | 330 | 315 | 451 | 406 | 282 | 1022 | 589 | Recursive function calls (i64 specialization) |
60-
| loop_data_dependent | 221 | 229 | 129 | 128 | 233 | 229 | 322 | 232 | Multiplicative carry through `sum` (genuinely-non-foldable f64) |
61-
| object_create | 2 | 0 | 0 | 0 | 0 | 5 | 11 | 6 | Object allocation (1M objects, scalar replacement) |
62-
| nested_loops | 17 | 8 | 8 | 10 | 8 | 11 | 18 | 21 | Nested array access (cache-bound) |
63-
| array_read | 11 | 9 | 9 | 11 | 9 | 12 | 13 | 16 | Sequential read (10M elements) |
64-
| array_write | 4 | 7 | 3 | 9 | 2 | 7 | 9 | 6 | Sequential write (10M elements) |
59+
| fibonacci | 309 | 316 | 309 | 446 | 401 | 278 | 987 | 518 | Recursive function calls (i64 specialization) |
60+
| loop_data_dependent | 225 | 226 | 129 | 128 | 225 | 226 | 226 | 230 | Multiplicative carry through `sum` (genuinely-non-foldable f64) |
61+
| object_create | 2 | 0 | 0 | 0 | 0 | 5 | 8 | 6 | Object allocation (1M objects, scalar replacement) |
62+
| nested_loops | 18 | 8 | 8 | 10 | 8 | 10 | 17 | 20 | Nested array access (cache-bound) |
63+
| array_read | 11 | 9 | 9 | 10 | 9 | 11 | 14 | 16 | Sequential read (10M elements) |
64+
| array_write | 3 | 7 | 2 | 9 | 2 | 6 | 9 | 6 | Sequential write (10M elements) |
6565

66-
Default Perry runs in the same neighborhood as Rust default `-O`, C++ `-O3`, and Swift `-O` on every row — competitive on integer recursion (`fibonacci`), within a tick of native on object allocation thanks to scalar replacement (`object_create`), within a few ms on cache-bound work (`nested_loops`, `array_read`/`array_write`), and matching the no-contract compiled pack on genuinely-non-foldable f64 (`loop_data_dependent`). Go and `clang -O3` win the `loop_data_dependent` row by fusing `sum * a + b` into a single `FMADDD` instruction (FMA contraction is `-ffp-contract=fast` — a separate knob `--fast-math` deliberately doesn't toggle). Python column omitted to keep the table readable; full numbers in [`benchmarks/polyglot/RESULTS.md`](benchmarks/polyglot/RESULTS.md).
66+
Default Perry runs in the same neighborhood as Rust default `-O`, C++ `-O3`, and Swift `-O` on every row — competitive on integer recursion (`fibonacci` 309 vs Rust 316 / C++ 309), within a tick of native on object allocation thanks to scalar replacement (`object_create`), within a few ms on cache-bound work (`nested_loops`, `array_read`/`array_write`), and matching the no-contract compiled pack on genuinely-non-foldable f64 (`loop_data_dependent` 225 vs Rust 226 / Bun 230 / Node 226). Apple Clang `-O3` and Go default win the `loop_data_dependent` row at 128-129 by fusing `sum * a + b` into a single `FMADDD` instruction (FMA contraction is `-ffp-contract=fast` — a separate knob `--fast-math` deliberately doesn't toggle). Python column omitted to keep the table readable; full numbers in [`benchmarks/polyglot/RESULTS.md`](benchmarks/polyglot/RESULTS.md).
6767

68-
We deliberately don't lead with the trivially-foldable accumulator microbenchmarks (`loop_overhead` / `math_intensive` / `accumulate`) that Perry posted big numbers on through v0.5.584. Those are flag-aggressiveness probes — they measure whether each compiler applied `reassoc + autovectorize` to a `sum += 1.0`-shaped loop, not how fast the resulting loop computes under load. Perry default sits in the no-flags pack (~95 ms) on all three; `--fast-math` recovers 12 / 14 / 33 ms. C++ `-O3 -ffast-math` matches Perry `--fast-math` to the millisecond on the same kernels — same LLVM pipeline, one flag. The full breakdown is in [`benchmarks/README.md`](benchmarks/README.md#optimization-probes-compiler-flag-aggressiveness-not-runtime-perf) and [`polyglot/RESULTS_OPT.md`](benchmarks/polyglot/RESULTS_OPT.md).
68+
We deliberately don't lead with the trivially-foldable accumulator microbenchmarks (`loop_overhead` / `math_intensive` / `accumulate`) that Perry posted big numbers on through v0.5.584. Those are flag-aggressiveness probes — they measure whether each compiler applied `reassoc + autovectorize` to a `sum += 1.0`-shaped loop, not how fast the resulting loop computes under load. Perry default sits in the no-flags pack (97 / 51 / 97 ms in this sweep) on all three; `--fast-math` recovers 12 / 14 / 34 ms. C++ `-O3 -ffast-math` matches Perry `--fast-math` to the millisecond on the same kernels — same LLVM pipeline, one flag. The full breakdown is in [`benchmarks/README.md`](benchmarks/README.md#optimization-probes-compiler-flag-aggressiveness-not-runtime-perf) and [`polyglot/RESULTS_OPT.md`](benchmarks/polyglot/RESULTS_OPT.md).
6969

7070
### vs Node.js and Bun
7171

72-
Perry's broader benchmark suite covers workloads outside the polyglot set — closures, classes, JSON, prime sieve, etc. **The numbers below are from the 2026-04-23 v0.5.173 baseline run; a v0.5.585 rerun is on the followup list.** Most of these are not FP-foldable accumulator patterns (factorial is integer modulo, method_calls dispatches through closures, json_roundtrip is parse/stringify-bound), so the v0.5.585 default-mode numbers should be close to those shown.
72+
Perry's broader benchmark suite covers workloads outside the polyglot set — closures, classes, JSON, prime sieve, etc. Numbers below from the 2026-05-14 v0.5.908 sweep via `benchmarks/suite/run_benchmarks.sh` (single-run-per-cell, not RUNS=11 medians — see [`benchmarks/polyglot/`](benchmarks/polyglot/) for the rigorous multi-run methodology).
7373

74-
| Benchmark | Perry (v0.5.173) | Node.js | Bun | What it tests |
74+
| Benchmark | Perry (v0.5.908) | Node.js | Bun | What it tests |
7575
|-----------|-----------------:|--------:|----:|---------------|
76-
| factorial | 31ms | 596ms | 98ms | Modular accumulation (integer fast path) |
77-
| method_calls | 1ms | 11ms | 9ms | Class method dispatch (10M calls) |
78-
| closure | 10ms | 309ms | 51ms | Closure creation + invocation (10M calls) |
79-
| binary_trees | 3ms | 10ms | 7ms | Tree allocation + traversal (1M nodes, scalar replacement) |
80-
| string_concat | 0ms | 3ms | 2ms | 100K string appends |
81-
| prime_sieve | 5ms | 8ms | 7ms | Sieve of Eratosthenes |
82-
| mandelbrot | 23ms | 25ms | 30ms | Complex f64 iteration (800x800) |
83-
| matrix_multiply | 24ms | 34ms | 35ms | 256x256 matrix multiply |
84-
| json_roundtrip | 314ms | 377ms | 250ms | 50× `JSON.parse` + `JSON.stringify` on a ~1MB, 10K-item blob |
76+
| factorial | 107ms | 591ms | 97ms | Modular accumulation (integer fast path) |
77+
| method_calls | 9ms | 11ms | 9ms | Class method dispatch (10M calls) |
78+
| closure | 50ms | 304ms | 51ms | Closure creation + invocation (10M calls) |
79+
| binary_trees | 2ms | 10ms | 7ms | Tree allocation + traversal (1M nodes, scalar replacement) |
80+
| string_concat | 0ms | 3ms | 1ms | 100K string appends |
81+
| prime_sieve | 3ms | 8ms | 7ms | Sieve of Eratosthenes |
82+
| mandelbrot | 28ms | 25ms | 29ms | Complex f64 iteration (800x800) |
83+
| matrix_multiply | 28ms | 34ms | 34ms | 256x256 matrix multiply |
84+
| json_roundtrip (lazy tape, gen-gc) | 83ms | 377ms | 249ms | 50× `JSON.parse` + `JSON.stringify` on a ~1MB, 10K-item blob |
85+
86+
`closure` and `factorial` are still slower than the older v0.5.173 baseline (10 → 50 ms, 31 → 107 ms). The v0.5.585 fast-math opt-in flip accounts for `factorial` (integer modulo plus an FP-tail reduction that the old default-on fast-math collapsed); `closure` regression is tracked as a follow-up. `method_calls` is back at baseline this sweep (9 ms) — yesterday's 25 ms reading was single-run noise from concurrent CPU load. The wins on `binary_trees` / `string_concat` / `prime_sieve` / `mandelbrot` / `matrix_multiply` against Node/Bun hold steady. Single-run cells are noisier than RUNS=11 medians; the lower-noise multi-run polyglot table above remains the canonical comparison.
8587

8688
Perry compiles to native machine code via LLVM — no JIT warmup, no interpreter overhead. Key optimizations that apply in both modes: **scalar replacement** of non-escaping objects (escape analysis eliminates heap allocation entirely — object fields become registers), inline bump allocator for objects that do escape, i32 loop counters for bounded array access, integer-modulo fast path (`fptosi → srem → sitofp` instead of `fmod`), elimination of redundant `js_number_coerce` calls on numeric function returns, and i64 specialization for pure numeric recursive functions.
8789

0 commit comments

Comments
 (0)