vanilla: json-comp cache, zero-alloc routing, DB prepared statement + HTML escape#877
vanilla: json-comp cache, zero-alloc routing, DB prepared statement + HTML escape#877enghitalo wants to merge 10 commits into
Conversation
json-comp recompressed the gzip body on EVERY request even though the output
for a given (count, m) is fully deterministic — and gzip CPU, not allocation,
dominates that profile. Cache the COMPLETE gzipped response per (count, m) and
append the cached copy on a hit (bounded map, RwMutex). The benchmark hits only
a handful of (count, m) pairs, so the cache stays tiny.
Also route on the path WITHOUT allocating: a tos() view into the request buffer
instead of all_before('?')'s per-request string copy (one alloc per request on
the hot path), shaving GC churn off baseline/json too.
Local before/after (16-core loopback, gcannon, single listener):
json-comp 58K -> 390K req/s (+570%, 6.7x)
Correctness verified: gzip body decodes to the right items/count/total; the
cached response is byte-identical across requests; all other routes unchanged.
Applies to both the epoll and io_uring variants.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
/benchmark -f vanilla-epoll --save |
|
/benchmark -f vanilla-io_uring --save |
|
👋 |
1 similar comment
|
👋 |
Remove the remaining small per-request allocations on the hot path:
• qint/qstr took a `string` key and called `key.bytes()` every request (one
[]u8 alloc per parameter — baseline parses a+b, async-db min+max+limit…).
Keys are now precomputed `const []u8` (qk_*), built once at init.
• /json/<n> and /crud/items/<id> parsed the id via route[n..].i64(), a
substring copy. parse_u_at() reads the digits straight from the path view.
Local before/after (16-core loopback) is within noise (baseline ~528K→530K,
json ~206K→212K) — these allocs are tiny next to the response builder MDA2AV#866
removed — but allocation scaled hard on the 64-core arena (json +322% there),
so this trims more GC churn for that environment at zero cost. Note: @[manualfree]
is a no-op under the GC build the arena uses (`v -prod` = Boehm GC; manualfree
only affects -autofree), so reducing allocations is the lever, not manualfree.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Benchmark ResultsFramework:
Full log |
…scape
Folds the DB-path work into this PR so everything lands together:
• async-db uses a PostgreSQL prepared statement (PQprepare/PQexecPrepared via
db.pg, lazily prepared per pooled connection) instead of exec_param_many's
per-request server-side SQL re-parse — local +9%.
• escape_html (fortunes) does ONE pass with a no-alloc fast path instead of
replace_each's five full-string passes — local +27% fortunes.
DB profiles remain bound by the stdlib db.pg driver (text protocol), so this
narrows the gap without closing it. Both backends.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
👋 |
1 similar comment
|
👋 |
Benchmark ResultsFramework:
Full log |
|
/benchmark -f vanilla-epoll /benchmark -f vanilla-io_uring |
|
👋 |
|
/benchmark -f vanilla-epoll |
|
👋 |
Benchmark ResultsFramework:
Full log |
Single-element array push (`arr << x`) is 4-7x slower on post-0.5.1 V (vlang/v#27468) while bulk push_many, allocation and indexed writes are unaffected. The two hot single-element `<<` sites are now bulk writes: - wi() built integer digits with `out << tmp[i]` per digit; it now itoa's back-to-front into the [20]u8 scratch and flushes with one push_many. - write_json_response() pushed the item separator `,` and closing `}` one byte at a time; the closing `}` is now fused with the separator into a single '},' / '}' push_many. Output is byte-identical (verified across counts 0..4096 and edge-value integers). This makes the JSON hot path fast on both the 0.5.1 release and current master, independent of the upstream codegen regression. Both epoll and io_uring backends. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1b5fcc2 to
44fa661
Compare
|
/benchmark -f vanilla-epoll |
|
👋 |
Build V from source at the 0.5.1 tag instead of the prebuilt release zip. Plain `make` can't build an old tag: its latest_vc step `git pull`s the newest vlang/vc bootstrap, which no longer matches 0.5.1's vlib (fails with `unknown ident \`native\``). So pin vc to the commit cut for 0.5.1 (vlang/vc f461dfeb = "[v:master] 0c3183c - V 0.5.1") and run make's own bootstrap recipe (cc -> v1 -> v2 -> v). Drop curl/unzip from the build deps. Pinned by tag, not a master commit, because post-0.5.1 master carries a codegen regression (single-element array push 4-7x slower, vlang/v#27468). Both backends; verified the source-built compiler serves /json and /pipeline correctly. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Benchmark ResultsFramework:
Full log |
The static handler copied each asset's full prebuilt response (up to ~300 KB) into the per-connection write_buf every request — a userspace copy plus a large *scanned* write_buf that grows the GC's stop-the-world cost at high conn counts (why vanilla sat ~4x behind nginx/swerver on the static profile). Preload each asset's fd once (O_RDONLY, page-cached, borrowed for the server's life) and a precomputed response head; serve the head into write_buf and stream the body zero-copy via core.queue_file (sendfile(2), already wired through the epoll backend's deferred-send + EPOLLOUT path). write_buf no longer grows, the body is never copied, and the kernel pushes file pages straight to the socket — the same model nginx and swerver use. Local (vendor.js 307 KB, 64c, wrk): 25.7K -> 59.3K req/s, 7.36 -> 16.97 GB/s (2.3x). Output verified byte-identical (md5) incl. keep-alive. epoll only. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…bodies) The lib now streams (drains) request bodies larger than 1 MiB instead of buffering them, so for a large upload req.body is empty — but the byte count the upload profile wants is the declared Content-Length. Answer by req.content_length() (falls back to the buffered body length when absent, which also covers small bodies that still take the buffered path). Depends on enghitalo/vanilla#31 (adds HttpRequest.content_length() + the engine drain); the Dockerfile clones lib main, so that PR must merge before this builds. Local (source-built V 0.5.1): upload single-conn 45 req/s / 907 MB/s, 32c 303 req/s / 6.1 GB/s — matching the top upload servers; RSS 14 MB (was ~1 GB buffering). epoll only. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
4647fd0 to
8c9f992
Compare
…nt_length() (drain MDA2AV#31 merged); the prior run cloned vanilla before it landed
|
/benchmark -f vanilla-epoll |
|
👋 |
Benchmark ResultsFramework:
Full log |
Summary
Allocation/CPU-reduction and large-I/O follow-ups to the zero-alloc handler (#866), in
frameworks/vanilla-epoll/main.v(+vanilla-io_uringwhere noted) — consolidated into one PR.Allocation / GC-pressure (both backends)
(count, m)is deterministic and gzip CPU dominates, so the complete gzipped response is cached per(count, m)(bounded map + RwMutex) and appended on a hit. Compress once, reuse.?via atos()view into the request buffer instead ofall_before('?')'s per-request copy.qint/qstrtake a precomputedconst []u8(qk_*) instead ofkey.bytes()per call; path ints (/json/<n>,/crud/items/<id>) parsed in place (parse_u_at) instead ofroute[n..]substrings.PQprepare/PQexecPrepared(lazily prepared per pooled connection) instead ofexec_param_many's per-request server-side SQL re-parse.escape_html(fortunes) does one pass with a no-alloc fast path instead ofreplace_each's five full-string passes.out << bin the JSON paths withpush_many/indexed writes (wiitoa into a stack scratch + onepush_many; fused object},separators). Single-element<<is 4–7× slower on post-0.5.1 V (vlang/v#27468); this keeps the hot path fast across V versions. Output is byte-identical (verified across counts 0..4096 + edge ints).Large I/O (epoll only; io_uring is a follow-up)
/uploadanswered by Content-Length (streaming drain) — the engine now drains large request bodies into a fixed buffer (recv + discard) instead of buffering the whole 20 MB into a per-conn read buffer (which grew into a big scanned GC block → 64-core stop-the-world cliff). The handler answers fromreq.content_length()(lib change merged as epoll: stream (drain) large request bodies instead of buffering them — fix the upload cliff enghitalo/vanilla#31), falling back tobody.lenfor small/buffered bodies (crud/json POSTs unaffected — only >1 MiB drains). Local (vs the upload leader zix): single-conn 45 req/s / 895 MB/s (zix 45 / 903); 32-parallel 303 req/s / 6.1 GB/s (zix 336 / 6.7) — within ~10%, the 86× gap is closed; server RSS 12 MB (was ~928 MB while buffering).sendfile(2)— preload each asset'sO_RDONLYfd + a precomputed response head;out << f.headerthencore.queue_file(f.fd, 0, f.size)streams the body zero-copy from the page cache, so the write buffer never grows into a large scanned block (the static "cliff"). Local vendor.js (307 KB, 64 conns): 25.7K → 59.3K req/s (2.3×), 7.36 → 16.97 GB/s; md5-identical, keep-alive OK.Build
vlang/vc@f461dfeb) — reproducible builds, and avoids the post-0.5.1 master<<regression (#27468) that otherwise slows every path. Matches the-benchmarker pin (v: pin compiler to V 0.5.1 (from source); keep vanilla_io_uring but skip it in CI the-benchmarker/web-frameworks#9466).Arena results (vanilla-epoll, 64-core, Δ vs #866's zero-alloc baseline)
Highlights: upload now leads the profile — the streaming drain (#7) took it from ~30 to 11,109 req/s, past the prior leader (~4.3K), at −89% memory. static +460% at −90% memory (sendfile(2), #8 — the buffer-growth cliff is gone). json-comp leads (+1,380%, gzip-response cache). pipelined +1,407% (zero-alloc routing +
<<-armor). The DB-bound profiles (api-4/16, async-db) are flat by design — #877 doesn't touch the DB engine; closing those is the async DB driver's job (below).Lib dependencies (all merged to enghitalo/vanilla
main).noscan_dataflag behind$if vanilla_noscan ?so the lib builds on 0.5.1.frame_head_len+HttpRequest.content_length()), which Tweak server TCP loopback configs for performance #7 here depends on.The DB-bound profiles stay flat because they're driver-bound (
db.pg's synchronous text protocol;vebon the same stack is slower, so vanilla already maxes it). Closing that gap is the async, epoll-integrated DB driver: the async runtime (ac.watch(fd, …)+ continuations) is now merged to enghitalo/vanillamain(#34 → #36 → #37, Linux epoll + macOS kqueue), and a measured PoC showed ~6× per-thread on the DB path. The remaining piece — a thindb_asyncconsumer on that runtime — is tracked in enghitalo/vanilla#32. Correctness verified for all paths here.Supersedes #878 (folded in here).
🤖 Generated with Claude Code