Render speed-ups and performance investigations. #153
Merged
Extend html-compress to mark book-combined as compress-eligible so
book.html collapses inter-element whitespace at Jekyll time instead of
paged.js's WhiteSpaceFilter doing ~37k DOM mutations at render time.

Reorder :pages, :post_render and :documents, :post_render hooks into a
three-tier convention so adding compress to book.html composes
correctly with the other plugins:

  :high    mutators (book-href-rewrite)
  :normal  compress (html-compress)
  :low     readers  (pdfify capture, offlinify per-page rewrite)

Without the layering, book-href-rewrite's landing-heading strip ran
after compress, leaving adjacent single-space runs that no downstream
pass collapsed. The 3-tier ordering makes "compress is the last cleanup
pass among mutators" and "readers see final compressed bytes" hold by
construction.

Verified: 0 outside-pre multi-whitespace runs in the regenerated
book.html (was 37,087 without compress). Branch-counting the
WhiteSpaceFilter post-fix shows DOM mutations drop from ~37k to 0.
Ruby-prof A/B confirms the priority shuffle is CPU-invariant; the only
attributable cost is one extra compress! call (~480 ms once per Jekyll
build, ~300-500 ms saved per paged.js render).

Adds analyze-trace.mjs --children mode used to localise this during the
investigation. Full writeup in perf/README.md and
docs/_plugins/html-compress.md.
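The "0 outside-pre multi-whitespace runs" check above can be approximated with a few lines of JavaScript. This is a minimal sketch, not the script used in the PR: the function name is hypothetical, and it assumes that a "run" means two or more consecutive whitespace characters and that `<pre>` blocks are not nested.

```javascript
// Count runs of 2+ whitespace characters that fall outside <pre>...</pre>.
// ASSUMPTION: <pre> blocks are not nested, so a non-greedy regex is enough
// to find them; a real HTML parser would be more robust.
function countWhitespaceRunsOutsidePre(html) {
  // Splitting on a capture group keeps the <pre> blocks in the result at
  // odd indices, so even indices hold the text outside <pre>.
  const parts = html.split(/(<pre\b[\s\S]*?<\/pre>)/gi);
  let runs = 0;
  for (let i = 0; i < parts.length; i += 2) {
    const matches = parts[i].match(/\s{2,}/g);
    if (matches) runs += matches.length;
  }
  return runs;
}
```

Run against the generated book.html, a result of 0 would correspond to the "verified" state described above; whitespace inside `<pre>` is deliberately ignored because it is significant there.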
A 3+3 paired cpu-profile A/B (perf/ab-aggregate.mjs) showed the
filter's 181k TreeWalker callbacks cost ~600 ms of CPU on every render
even when html-compress has already collapsed inter-element whitespace
at Jekyll time. ~125 ms is direct (filterTree/filterEmpty self); the
rest is indirect -- gBCR, recalcStyle, performLayout and
UpdateStyleAndLayout all run ~14% cheaper per call when V8's IC and
Blink scheduler aren't being churned by 181k C++->JS dispatches. The
cost is small per call but compounds because the walk lives inside the
same microtask continuation as the per-page render loop.

Earlier wall-clock A/B (3+3, 8.78s vs 8.53s) had attributed the delta
to noise; that was wrong. Per-row aggregation across paired cpu
profiles shows the filterTree row at 88 ms (sd 14) vs 2 ms (sd 1) -- a
6 sigma shift -- and the downstream gBCR row at -338 ms mean,
consistent with the trace's -574 ms drop on
Document::UpdateStyleAndLayout total.

The fix: gate the TreeWalker invocation behind
window.PagedConfig.runWhitespaceFilter (default undefined = off). Our
pipeline never sets the flag because html-compress already does the
work; documents that need the cleanup can opt back in.

Also adds perf/ab-aggregate.mjs (per-row mean+SD aggregator across 6
paired cpu profiles) and a long writeup in perf/README.md with the
methodology, the corrected understanding of why the filter has cost
(not flush migration -- it does no layout-flushing work; it's V8 IC
pressure + Blink scheduler overhead), and lessons about when to trust
wall-clock vs aggregated cpu-profile rows.
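To illustrate the per-row mean+SD aggregation described above, here is a minimal sketch. It is not the code in perf/ab-aggregate.mjs: the function names and the input shape (one plain object per run, mapping a row name to self time in ms) are assumptions, and so is the choice to express the shift in units of the larger of the two standard deviations.

```javascript
// Aggregate self-time rows across several runs of the same variant.
// ASSUMPTION: each profile is { rowName: selfTimeMs, ... } for one run.
// A row missing from a run is simply skipped, not counted as 0.
function aggregateRows(profiles) {
  const samplesByRow = new Map();
  for (const profile of profiles) {
    for (const [name, ms] of Object.entries(profile)) {
      if (!samplesByRow.has(name)) samplesByRow.set(name, []);
      samplesByRow.get(name).push(ms);
    }
  }
  const stats = new Map();
  for (const [name, samples] of samplesByRow) {
    const n = samples.length;
    const mean = samples.reduce((a, b) => a + b, 0) / n;
    // Sample standard deviation (n - 1 denominator); 0 for a single sample.
    const variance = n > 1
      ? samples.reduce((acc, x) => acc + (x - mean) ** 2, 0) / (n - 1)
      : 0;
    stats.set(name, { n, mean, sd: Math.sqrt(variance) });
  }
  return stats;
}

// Express the before/after shift of one row in units of the larger SD.
function compareRow(before, after) {
  const delta = after.mean - before.mean;
  const spread = Math.max(before.sd, after.sd);
  return { delta, sigmas: spread > 0 ? Math.abs(delta) / spread : Infinity };
}
```

Under that assumption, feeding in the figures quoted above (`compareRow({ mean: 88, sd: 14 }, { mean: 2, sd: 1 })`) gives a delta of -86 ms and roughly 6.1 sigmas, which is consistent with the "6 sigma shift" wording.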
- docs/render-book.mjs, perf/measure.mjs: add --disable-gpu and
  --disable-software-rasterizer. Renderer ~120 MB lighter, gpu-process
  ~84 MB lighter (shrinks to a 16 MB stub -- only --in-process-gpu
  kills it entirely, at +15 s wall clock; rejected), generate ~5 s
  faster, PDF byte-identical.
- perf/probe-parallel.mjs: two-shard pageRanges parallel-generate
  probe. N=2 saves ~17 s wall clock (render+generate ~36 s vs ~53 s
  single-process), confirms two browsers parallelise at the OS level.
  Not shipped -- N=2 ~5 GB peak, N=4 ~10 GB peak, over CI budget.
- perf/probe-memory.mjs + sample-mem.ps1: per-process tree memory
  sampler. PowerShell + WMI walks the chrome.exe parent->child tree at
  500 ms intervals, reports per-process private bytes + working set.
  Used to A/B the --disable-gpu / --in-process-gpu / --single-process
  variants (the last crashes in modern headless).
- perf/probe-renderer-mem.mjs + analyze-mem-trace.mjs: per-allocator
  renderer breakdown via Chromium's memory-infra trace + on-demand PMD
  dumps. Shows the 1.9 GB renderer is ~80 % Blink (Oilpan heap), not V8
  (V8 is 34 MB). Top object classes are paged.js's per-page CSS grid
  (132 MB), 1 M ComputedStyle (74 MB), LayoutNG fragments (~200 MB
  combined), 411 k AXNodeObject for tagged-PDF (41 MB).
- --gc-passes N flag on probe-renderer-mem.mjs: triggers V8 +
  Memory.simulatePressureNotification between render and generate. One
  pass + pressure (~1 s) frees ~180 MB of dangling Blink objects
  reachable from no user-visible state. Not shipped -- masking a
  retention defect (paged.js hooks? detach-pages closures?) rather than
  fixing it. Hypotheses + next-step heap-snapshot direction documented
  in perf/README.md.
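The two-shard pageRanges probe needs a way to divide the book's pages between browsers. The following is a hedged sketch of that split only; the helper name is hypothetical, it assumes the total page count is already known, and it does not cover launching the browsers or merging the resulting PDFs.

```javascript
// Split pages 1..totalPages into up to `shards` contiguous ranges, each
// formatted as a "first-last" string (the form used for the pageRanges
// option when printing to PDF).
// ASSUMPTION: totalPages is known up front; how the probe obtains it is
// not stated in the commit message.
function shardPageRanges(totalPages, shards) {
  if (!Number.isInteger(totalPages) || totalPages < 1) {
    throw new RangeError('totalPages must be a positive integer');
  }
  if (!Number.isInteger(shards) || shards < 1) {
    throw new RangeError('shards must be a positive integer');
  }
  const count = Math.min(shards, totalPages); // never emit empty shards
  const base = Math.floor(totalPages / count);
  const extra = totalPages % count; // the first `extra` shards get one more page
  const ranges = [];
  let first = 1;
  for (let i = 0; i < count; i++) {
    const last = first + base + (i < extra ? 1 : 0) - 1;
    ranges.push(`${first}-${last}`);
    first = last + 1;
  }
  return ranges;
}
```

Each string would then be handed to one browser instance. Because every instance still has to render the whole document before printing its slice, peak memory scales with the shard count, which matches the N=2 and N=4 figures above.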
…ion. CDP HeapProfiler.takeHeapSnapshot at post-render (and post-gc when combined with --gc-passes) -- ~200 MB .heapsnapshot file per dump, loadable in Chrome DevTools Memory tab. The Comparison view between pre- and post-gc snapshots shows which V8-visible categories the GC freed; Summary + filter "Detached" surfaces DOM nodes still held by JS after their owning page was removed, and Retainers gives the exact chain. Workflow documented in perf/README.md under "--heap-snapshot: extract V8 retainer chains". Oilpan-only objects (PhysicalBoxFragment, LogicalLineItems, ConstraintSpace::RareData -- no V8 wrapper) don't appear in the V8 snapshot but are typically owned by a DOM node that does, so the investigation route is detached-DOM-from-snapshot + ownership graph from the memory-infra dump.
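A snapshot of this size arrives over CDP as a stream of `HeapProfiler.addHeapSnapshotChunk` events rather than as a single response. The sketch below shows one way to collect them; it is an illustration, not the probe's code. The function name is hypothetical, and `client` is assumed to be a CDP session object exposing `on`, `off` and `send` (for example the object returned by Puppeteer's `page.createCDPSession()`).

```javascript
// Stream a V8 heap snapshot from a CDP session to a caller-supplied sink.
// ASSUMPTION: `client` exposes on/off/send like a Puppeteer CDPSession,
// and all chunk events are delivered before the command resolves.
async function captureHeapSnapshot(client, onChunk) {
  const handler = ({ chunk }) => onChunk(chunk);
  client.on('HeapProfiler.addHeapSnapshotChunk', handler);
  try {
    await client.send('HeapProfiler.takeHeapSnapshot', { reportProgress: false });
  } finally {
    // Always detach so repeated dumps do not stack listeners.
    client.off('HeapProfiler.addHeapSnapshotChunk', handler);
  }
}

// Possible usage (file name is illustrative):
//   const out = fs.createWriteStream('post-render.heapsnapshot');
//   await captureHeapSnapshot(client, (chunk) => out.write(chunk));
//   out.end();
```

Writing each chunk straight to a stream avoids holding a ~200 MB string in the Node process on top of the browser's own memory.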
V8 heap snapshot diff pre-gc vs post-gc is byte-identical -- same
2,938,992 nodes, same 108.9 MB self_size, same per-category counts.
Rules out the "dangling JS references" hypothesis the gc-pass probe
initially suggested.
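A coarse version of this comparison can be scripted against the parsed .heapsnapshot JSON, which stores nodes as one flat integer array described by `snapshot.meta.node_fields`. The sketch below is not perf/analyze-heap-snapshot.mjs; the function names are hypothetical, and it only compares node counts and self sizes per node type rather than checking byte-for-byte equality.

```javascript
// Summarise a parsed .heapsnapshot object: total nodes, total self_size,
// and the same two figures per node type.
function summarizeSnapshot(snapshot) {
  const meta = snapshot.snapshot.meta;
  const stride = meta.node_fields.length; // integers per node record
  const typeIdx = meta.node_fields.indexOf('type');
  const sizeIdx = meta.node_fields.indexOf('self_size');
  const typeNames = meta.node_types[typeIdx]; // enum of type names
  const perType = {};
  let nodeCount = 0;
  let selfSize = 0;
  for (let i = 0; i < snapshot.nodes.length; i += stride) {
    const type = typeNames[snapshot.nodes[i + typeIdx]];
    const size = snapshot.nodes[i + sizeIdx];
    nodeCount += 1;
    selfSize += size;
    if (!perType[type]) perType[type] = { count: 0, selfSize: 0 };
    perType[type].count += 1;
    perType[type].selfSize += size;
  }
  return { nodeCount, selfSize, perType };
}

// True when two summaries agree on totals and on every per-type row.
function summariesMatch(a, b) {
  if (a.nodeCount !== b.nodeCount || a.selfSize !== b.selfSize) return false;
  const types = new Set([...Object.keys(a.perType), ...Object.keys(b.perType)]);
  for (const t of types) {
    const x = a.perType[t];
    const y = b.perType[t];
    if (!x || !y || x.count !== y.count || x.selfSize !== y.selfSize) return false;
  }
  return true;
}
```

A matching summary for the pre-gc and post-gc dumps is the kind of result reported above: if the forced GC had released JS-held objects, at least one per-type row would differ.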
Per-Blink-class diff of the memory-infra dumps (new
perf/diff-blink-classes.mjs) shows what actually gets freed: style-
system caches and layout intermediates that are unreachable from
the moment their page finalises but stay in Oilpan because nothing
forces a major GC during the synchronous render loop. Two ~100% freed
categories are the cleanest signal: CachedMatchedProperties (Blink's
style-sharing cache, dead after layout) and GridItemData (paged.js's
per-page-template CSS grid items, dead after layout). The remainder
is sub-ComputedStyle (StyleBoxData, StyleSurroundData, StyleMisc*),
ShapeResultView / HarfBuzzRunGlyphData / ShapeResultRun, layout-
fragment RareData.
Conclusion: not a leak, not actionable as a retention fix. The only
direct mitigation is forcing a GC (already rejected, costs ~1 s).
Indirect lever is upstream DOM size (DOM-shape audit).
Tooling produced:
- perf/analyze-heap-snapshot.mjs: top type x name aggregation +
  pairwise diff for V8 heap snapshots. Also surfaces the
  detachedness=2 subset (corrected from earlier mis-read of the V8
  DetachednessV8 enum, where {1=Attached, 2=Detached}).
- perf/diff-blink-classes.mjs: per-Blink-class diff between two
  memory-infra dumps in the same trace. Strips the per-dump GUID
  suffix from class names so the same class lines up across dumps.
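The per-class diff can be pictured as follows. This is a sketch only, not perf/diff-blink-classes.mjs: the function names and the input shape (class name mapped to bytes) are assumptions, and the suffix pattern in particular is a placeholder, since the exact form of the per-dump GUID suffix is not given here. Pass your own `normalize` function to match the real names.

```javascript
// HYPOTHETICAL suffix pattern: a separator followed by 8+ hex digits at the
// end of the name. The real memory-infra naming may differ.
function stripDumpSuffix(name) {
  return name.replace(/[_\/-][0-9A-Fa-f]{8,}$/, '');
}

// Diff two dumps given as { className: bytes } objects and report how much
// each class shrank, largest reduction first.
function diffClasses(before, after, normalize = stripDumpSuffix) {
  const total = (dump) => {
    const totals = new Map();
    for (const [name, bytes] of Object.entries(dump)) {
      const key = normalize(name);
      totals.set(key, (totals.get(key) ?? 0) + bytes);
    }
    return totals;
  };
  const pre = total(before);
  const post = total(after);
  const rows = [];
  for (const name of new Set([...pre.keys(), ...post.keys()])) {
    const b = pre.get(name) ?? 0;
    const a = post.get(name) ?? 0;
    rows.push({
      name,
      before: b,
      after: a,
      freed: b - a,
      freedPct: b > 0 ? (100 * (b - a)) / b : 0,
    });
  }
  return rows.sort((x, y) => y.freed - x.freed);
}
```

Rows with `freedPct` near 100 correspond to the "~100% freed" categories called out above, such as CachedMatchedProperties and GridItemData.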
README updated: GC-pass section title and intro corrected; "What
might be holding the references" replaced with "What the GC actually
freed"; --heap-snapshot workflow re-framed as a visibility check
rather than a retainer-chain hunt (because the diff is zero).
Research notes from the conversation that explored what it would take
to extract Blink's draw stream (SkPicture / cc::PaintRecord), spawn
standalone PrintCompositor utility processes, or build a Chromium-
linked helper binary -- all to enable parallel PDF generation without
N-way memory blowup.
Five approaches catalogued with honest cost estimates:
  A. Patch + upstream a Chromium flag (skip PrintCompositor for
     single-renderer, or streaming printToPDF).
  B. Port SkPDF to JS (doesn't help alone -- the input data
     extraction is the real bottleneck).
  C. Frida + reimplement Mojo client in Node (~15-22 weeks).
  D. Frida + CanvasKit-WASM workers (~6-10 weeks, tagged-PDF rebuild
     required).
  E. Helper binary linking Chromium components (~4-6 weeks total,
     corrected from earlier overestimates -- shallow gclient sync
     ~20-30 GB and ~30-90 min, targeted ninja build of ~1500-2500
     TUs ~30-90 min first time).
All rejected for the current 70 s build, but documented so the
analysis isn't lost if the book size or CI budget makes it relevant
again. Also captures the hard facts:
- chrome.dll is a single 283 MB monolithic binary with exactly six
  exported functions (ChromeMain + 5 others); PrintCompositor / Mojo
  / Skia / Blink / V8 are not externally callable.
- The idle Chromium tree is ~125-180 MB (corrected from earlier
  claim of "70-1100 MB"; the high end was PDF-in-transit, not
  steady-state).
- HarfBuzz shaping results and SkTextBlob glyph positions never
  leave the renderer via any public API; the natural extraction
  point is the Mojo serialization between renderer and
  PrintCompositor.
New probe perf/probe-idle-browser.mjs measures the idle baseline
(post-launch, post-newPage, post-goto(about:blank)) -- the data
behind the corrected memory math.
Pointer from perf/README.md "Memory" section to CHROMIUM.md so the
separate research file is discoverable.