bench: refresh all benchmark suites at v0.5.908 (2026-05-14) (#765)
Full rerun of polyglot, JSON polyglot, honest_bench, and suite/
microbenchmarks on an otherwise-idle machine. Confirms that
yesterday's v0.5.891 sweep (#745 follow-up) was dominated by
parallel cargo-build contamination — σ on Perry compute cells
dropped from 25-57 ms to 0.3-2.2 ms.
Key results:
- Compute polyglot matches v0.5.585 historical numbers within
1-4 ms across all 9 cells (default + --fast-math); fast-math
cleanly reproduces 8× / 3.6× / 2.9× speedups on loop_overhead /
math_intensive / accumulate.
- honest_bench: Perry slightly faster on all 3 workloads vs
v0.5.891 (image_conv 365 → 354 ms; json_full 1155 → 1098 ms);
300/300 output-matched rows.
- #745 partial fix verification: JSON polyglot RSS dropped
254 → 227 MB roundtrip and 411 → 309 MB iterate after v0.5.900's
GC trigger-ratchet fix. Residual ~150 MB gap vs v0.5.279 baseline
flagged on the issue.
- suite/: method_calls back to 9 ms (yesterday's 25 ms was noise);
closure/factorial regressions vs v0.5.173 persist as known
follow-ups.
Docs refreshed: top-level README, benchmarks/README, polyglot
RESULTS{,_AUTO,_OPT}.md, honest_bench REPORT.md (+ regenerated
charts), json_polyglot RESULTS.md (auto), suite/results/RESULTS.md
(new). All with 2026-05-14 / v0.5.908 datestamps and historical
deltas vs v0.5.891 and v0.5.279.
README.md: 22 additions & 20 deletions
@@ -52,36 +52,38 @@ People are building real apps with Perry today. Here are some highlights:
> **As of v0.5.585, fast-math is opt-in.** Perry's default mode emits no `reassoc + contract` per-instruction FMF flags, so f64 arithmetic is bit-exact with Node. `--fast-math` (CLI), `PERRY_FAST_MATH=1` (env), or `"perry": { "fastMath": true }` in `package.json` re-enables the flags. See [`docs/src/cli/fast-math.md`](docs/src/cli/fast-math.md) for the discussion of when it does and doesn't matter. The numbers below are Perry's default mode unless noted.
-
Numbers below for Perry are from a 2026-05-06 sweep on macOS ARM64 (M1 Max, RUNS=11 medians, `taskpolicy -t 0 -l 0`). Other languages are from the 2026-04-25 v0.5.249 sweep on the same hardware (compiler versions unchanged — these numbers don't shift with Perry-side work). Source + methodology in [`benchmarks/polyglot/`](benchmarks/polyglot/).
+
Numbers below are from a 2026-05-14 sweep on macOS ARM64 (M1 Max, RUNS=11 medians, `taskpolicy -t 0 -l 0`) at Perry v0.5.908 on an otherwise-idle machine. All languages re-measured together this run. Source + methodology in [`benchmarks/polyglot/`](benchmarks/polyglot/).
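To make "RUNS=11 medians" and the σ figures concrete, here is a minimal sketch of how a median and a sample standard deviation can be computed from repeated timings. The function names and the timing values are hypothetical illustrations, not output from the benchmark harness, and the harness may compute σ differently (for example as a population rather than sample deviation).

```typescript
// Hypothetical helpers; not taken from benchmarks/polyglot/.

// Median of a list of timings: the middle value after sorting,
// or the mean of the two middle values for an even count.
function median(samples: number[]): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((x, y) => x - y);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Arithmetic mean.
function mean(samples: number[]): number {
  return samples.reduce((sum, x) => sum + x, 0) / samples.length;
}

// Sample standard deviation (n - 1 denominator); 0 for fewer than 2 samples.
function sampleStdDev(samples: number[]): number {
  if (samples.length < 2) return 0;
  const m = mean(samples);
  const sumSq = samples.reduce((sum, x) => sum + (x - m) ** 2, 0);
  return Math.sqrt(sumSq / (samples.length - 1));
}

// Made-up timings in ms for 11 runs, with one run disturbed by other load.
const timingsMs = [309, 310, 308, 309, 311, 309, 310, 309, 308, 362, 309];
console.log(median(timingsMs)); // 309
console.log(mean(timingsMs));   // 314
```

A single disturbed run moves the mean by 5 ms here but leaves the median untouched, which is one reason to report medians over 11 runs and to watch σ as a contamination signal.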
| Benchmark | Perry | Rust | C++ | Go | Swift | Java | Node | Bun | What it tests |
Default Perry runs in the same neighborhood as Rust default `-O`, C++ `-O3`, and Swift `-O` on every row — competitive on integer recursion (`fibonacci`), within a tick of native on object allocation thanks to scalar replacement (`object_create`), within a few ms on cache-bound work (`nested_loops`, `array_read`/`array_write`), and matching the no-contract compiled pack on genuinely-non-foldable f64 (`loop_data_dependent`). Go and `clang -O3` win the `loop_data_dependent` row by fusing `sum * a + b` into a single `FMADDD` instruction (FMA contraction is `-ffp-contract=fast` — a separate knob `--fast-math` deliberately doesn't toggle). Python column omitted to keep the table readable; full numbers in [`benchmarks/polyglot/RESULTS.md`](benchmarks/polyglot/RESULTS.md).
+
Default Perry runs in the same neighborhood as Rust default `-O`, C++ `-O3`, and Swift `-O` on every row: competitive on integer recursion (`fibonacci` 309 ms vs Rust 316 / C++ 309), within a tick of native on object allocation thanks to scalar replacement (`object_create`), within a few ms on cache-bound work (`nested_loops`, `array_read`/`array_write`), and matching the no-contract compiled pack on genuinely-non-foldable f64 (`loop_data_dependent` 225 ms vs Rust 226 / Bun 230 / Node 226). Apple Clang `-O3` and Go default win the `loop_data_dependent` row at 128-129 ms by fusing `sum * a + b` into a single `FMADDD` instruction (FMA contraction is `-ffp-contract=fast`, a separate knob that `--fast-math` deliberately doesn't toggle). Python column omitted to keep the table readable; full numbers in [`benchmarks/polyglot/RESULTS.md`](benchmarks/polyglot/RESULTS.md).
-
We deliberately don't lead with the trivially-foldable accumulator microbenchmarks (`loop_overhead` / `math_intensive` / `accumulate`) that Perry posted big numbers on through v0.5.584. Those are flag-aggressiveness probes — they measure whether each compiler applied `reassoc + autovectorize` to a `sum += 1.0`-shaped loop, not how fast the resulting loop computes under load. Perry default sits in the no-flags pack (~95 ms) on all three; `--fast-math` recovers 12 / 14 / 33 ms. C++ `-O3 -ffast-math` matches Perry `--fast-math` to the millisecond on the same kernels — same LLVM pipeline, one flag. The full breakdown is in [`benchmarks/README.md`](benchmarks/README.md#optimization-probes-compiler-flag-aggressiveness-not-runtime-perf) and [`polyglot/RESULTS_OPT.md`](benchmarks/polyglot/RESULTS_OPT.md).
+
We deliberately don't lead with the trivially-foldable accumulator microbenchmarks (`loop_overhead` / `math_intensive` / `accumulate`) that Perry posted big numbers on through v0.5.584. Those are flag-aggressiveness probes — they measure whether each compiler applied `reassoc + autovectorize` to a `sum += 1.0`-shaped loop, not how fast the resulting loop computes under load. Perry default sits in the no-flags pack (97 / 51 / 97 ms in this sweep) on all three; `--fast-math` recovers 12 / 14 / 34 ms. C++ `-O3 -ffast-math` matches Perry `--fast-math` to the millisecond on the same kernels — same LLVM pipeline, one flag. The full breakdown is in [`benchmarks/README.md`](benchmarks/README.md#optimization-probes-compiler-flag-aggressiveness-not-runtime-perf) and [`polyglot/RESULTS_OPT.md`](benchmarks/polyglot/RESULTS_OPT.md).
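As a small illustration of why `reassoc` is tied to bit-exactness, the sketch below adds the same three doubles under two groupings. The function names are hypothetical, and this shows only the general floating-point property that lets a reassociating compiler change results; it is not a reproduction of what Perry or LLVM emits for any benchmark kernel.

```typescript
// Adds three doubles in source order: (a + b) + c.
function sumInSourceOrder(a: number, b: number, c: number): number {
  return (a + b) + c;
}

// Same operands regrouped as a + (b + c), the kind of rewrite
// that a reassociation flag allows the optimizer to make.
function sumRegrouped(a: number, b: number, c: number): number {
  return a + (b + c);
}

const strict = sumInSourceOrder(0.1, 0.2, 0.3);
const regrouped = sumRegrouped(0.1, 0.2, 0.3);

console.log(strict, regrouped, strict === regrouped);
// prints: 0.6000000000000001 0.6 false
```

Because IEEE 754 addition is not associative, an engine that keeps source order (as Node does) and a compiler that regroups can disagree in the last bit, so leaving the flag off by default is what keeps results identical.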
### vs Node.js and Bun
-
Perry's broader benchmark suite covers workloads outside the polyglot set — closures, classes, JSON, prime sieve, etc. **The numbers below are from the 2026-04-23 v0.5.173 baseline run; a v0.5.585 rerun is on the followup list.** Most of these are not FP-foldable accumulator patterns (factorial is integer modulo, method_calls dispatches through closures, json_roundtrip is parse/stringify-bound), so the v0.5.585 default-mode numbers should be close to those shown.
+
Perry's broader benchmark suite covers workloads outside the polyglot set — closures, classes, JSON, prime sieve, etc. Numbers below from the 2026-05-14 v0.5.908 sweep via `benchmarks/suite/run_benchmarks.sh` (single-run-per-cell, not RUNS=11 medians — see [`benchmarks/polyglot/`](benchmarks/polyglot/) for the rigorous multi-run methodology).
-
| Benchmark | Perry (v0.5.173) | Node.js | Bun | What it tests |
+
| Benchmark | Perry (v0.5.908) | Node.js | Bun | What it tests |
`closure` and `factorial` are still slower than the older v0.5.173 baseline (10 → 50 ms, 31 → 107 ms). The v0.5.585 fast-math opt-in flip accounts for `factorial` (integer modulo plus an FP-tail reduction that the old default-on fast-math collapsed); `closure` regression is tracked as a follow-up. `method_calls` is back at baseline this sweep (9 ms) — yesterday's 25 ms reading was single-run noise from concurrent CPU load. The wins on `binary_trees` / `string_concat` / `prime_sieve` / `mandelbrot` / `matrix_multiply` against Node/Bun hold steady. Single-run cells are noisier than RUNS=11 medians; the lower-noise multi-run polyglot table above remains the canonical comparison.
Perry compiles to native machine code via LLVM — no JIT warmup, no interpreter overhead. Key optimizations that apply in both modes: **scalar replacement** of non-escaping objects (escape analysis eliminates heap allocation entirely — object fields become registers), inline bump allocator for objects that do escape, i32 loop counters for bounded array access, integer-modulo fast path (`fptosi → srem → sitofp` instead of `fmod`), elimination of redundant `js_number_coerce` calls on numeric function returns, and i64 specialization for pure numeric recursive functions.
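The integer-modulo fast path can be modeled in TypeScript as follows. This is a sketch under stated assumptions, not Perry's implementation: the 32-bit width, the guard conditions, and the names `isInt32` and `modFastPath` are all hypothetical, and the text above does not say how Perry decides when the fast path is safe.

```typescript
// Hypothetical model of `fptosi -> srem -> sitofp`; widths and guards are assumptions.

// True when v is an integer value representable as a signed 32-bit integer.
function isInt32(v: number): boolean {
  return Number.isInteger(v) && v >= -2147483648 && v <= 2147483647;
}

function modFastPath(a: number, b: number): number {
  const overflow = a === -2147483648 && b === -1; // srem is undefined here in LLVM
  if (isInt32(a) && isInt32(b) && b !== 0 && !overflow) {
    // `| 0` converts to int32. The trailing `| 0` also maps -0 to +0,
    // mirroring an integer remainder converted back to a double.
    return ((a | 0) % (b | 0)) | 0;
  }
  return a % b; // float remainder (fmod-style) for everything else
}

console.log(modFastPath(17, 5));  // 2
console.log(modFastPath(-7, 3));  // -1
console.log(modFastPath(7.5, 2)); // 1.5
```

One detail the model makes visible: JavaScript's `-4 % 2` is `-0`, while an integer remainder yields `0`. The two compare equal under `===` but differ under `Object.is`, so a compiler taking this route has to either accept or special-case that difference.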