Fix Fetch.enable + http-proxy CDP send/recv deadlock (#2462)#2468
Fix Fetch.enable + http-proxy CDP send/recv deadlock (#2462)#2468staylor wants to merge 2 commits into
Conversation
37b3431 to
f05f451
Compare
|
I rebased + added a little change in #2489 |
|
I've been talking about a possible solution for some of these issues. Currently, every CDP connection maps to 1 Browser and runs in a thread. I think having the CDP socket polled and read from the main thread (in Network.zig) and communicate messages to the worker via an Erlang OTP-style mailbox would be more robust. In normal operations, the mailbox would be drained in Runner.tick, i.e. a safe point, never inside of a JS callback. For Request Interception, which needs to block current execution, it would scan the mailbox and only process "safe" messages (a simple allow list of commands) + a timeout. This adds some complexity at the main <-> worker boundary. It removes the natural back-pressure on incoming CDP messages (which I don't think is really a problem, but there are pretty simple solutions to that if it is). The main upside is that CDP commands that mutate the environment (e.g. Target.closeTarget) are only ever executed in Runner.tick. I'm not sure if this 100% solves the blocking write issue, but it should unblock the peer send which, from the way you desribe it was at least part of the problem. Unrelated to the PR, another benefit is that we can detect a disconnected client even if the worker is stuck in a heavy JS loop. Would really appreciate any feedback you have about this. |
|
let me resolve conflicts and throw Claude at it - this PR is the main blocker for Puppeteer and Playwright. |
) WsConnection.send used to switch the socket to blocking mode on EWOULDBLOCK, with a comment claiming this 'should virtually never happen'. Under CDP Fetch.enable + --http-proxy it is routine: the CONNECT+TLS proxy round-trip means many subresources are in flight simultaneously, lightpanda emits a flood of Fetch.requestPaused events, the kernel send buffer fills, lightpanda blocks on write, and puppeteer's matching Fetch.continueRequest replies pile up in lightpanda's TCP recv buffer (which lightpanda can't drain because it is blocked on the write). Both peers wedge until the client times out. Other contributing problems all collapse paused-intercept sites where an outer loop polls without ever draining the CDP socket, OR where disconnect-time cleanup re-enters JS through paths the runtime can no longer satisfy: * HttpClient.perform skipped the CDP socket poll entirely whenever processMessages() returned non-empty, so a steady stream of HTTP completions could starve CDP reads. * ScriptManagerBase.waitForImport spun on `client.tick(200)` and discarded the .cdp_socket return, so a script `import()` whose request was paused at the InterceptionLayer hung forever. * BrowserContext.deinit aborted pending intercepts via `transfer.abort`, which fired XHR/script error_callback chains into a half-torn-down V8 context (the inspector had already been stopped two lines above). * Headers.deinit was non-idempotent, so a value-copied Headers (the hazard RobotsLayer documents) double-freed its curl_slist on the second deinit; the symptom was an "incorrect alignment" panic inside ZigToCurlAllocator.free. * Transfer.deinit was non-idempotent, so a cascade out of error_callback (e.g. Script.errorCallback -> manager.evaluate() -> JS execution -> Frame.deinit -> abortOwner -> Transfer.kill -> Transfer.deinit) reached `arena_pool.release` twice on the same arena. Coordinated changes: * src/network/WsConnection.zig: On EWOULDBLOCK, instead of switching to a blocking write, poll for both POLLOUT and POLLIN. While waiting for write space, drain any incoming bytes into the reader buffer (without dispatching - that would re-enter send and recurse). Adds tryRead/bufferedBytes accessors. * src/browser/HttpClient.zig: - Add has_buffered_input to CDPClient. In perform(), return .cdp_socket when buffered input exists, and always do at least a non-blocking poll on the CDP socket so HTTP completions can no longer starve CDP reads. - Make Transfer.deinit idempotent by claiming ownership through `client.transfers.remove(self.id)`. Second deinits (cascades out of error_callback) early-return. - Make `Transfer.kill` public (was `fn`) so BrowserContext.deinit can use it. - Tighten RequestParams.deinit / Request.deinit to take `*` instead of `*const` so they can call into `Headers.deinit` (now mutating). * src/network/http.zig: Headers.deinit now nulls out `self.headers` after `curl_slist_free_all`, so a second deinit is a no-op. Without this guard a value-copied Headers double-frees the curl_slist (the hazard RobotsLayer's call site already documents). * src/browser/ScriptManagerBase.zig: waitForImport now drains pending CDP messages on every iteration (matching the syncRequest pattern) and re-fetches the imported_modules entry per iteration. The cached entry was a use-after-free risk because the CDP-drain step above re-enters JS, and a transitively-imported module's preloadImport() -> getOrPut() can rehash the map and invalidate the prior entry pointer. A failed CDP blocking_read surfaces as error.Failed so the caller breaks the loop rather than retrying against a dead socket. * src/cdp/CDP.zig: - Wire hasBufferedInput. - Replace read() with tryRead() in readSocket and tolerate the no_new_data case so we still process messages drained during a backpressured send. - BrowserContext.deinit aborts pending intercepts via `transfer.kill` instead of `transfer.abort`. `kill` fires shutdown_callback (or no-op for transfers without one), avoiding error_callback's re-entry into JS through XHR/script error handlers - those crash because the V8 context and inspector this BC owns have either been torn down already or are about to be. Verified end-to-end against a puppeteer + http-proxy reproducer: | URL | before | after | |----------------------------|----------------------|------------------| | example.com | OK 124ms | OK 117ms | | github.com | HANG 20s (12/82) | OK 1410ms (82/82)| | shopify.com | HANG 20s (1/4) | OK 1973ms (66/66)| | allbirds.com | HANG 20s (12/53) | OK 3944ms (372) | | allbirds wool-runners PDP | HANG 20s (12/53) | OK 6286ms (459) | (nike.com still doesn't reach `load` event but all 68 continueRequests process cleanly - the remaining stall is third-party widgets keeping the page in `loading` state, not the CDP/HTTP deadlock this PR fixes.) Lightpanda survives 18 back-to-back navigation runs across the matrix above (3 per URL) without crashing. Fixes lightpanda-io#2462.
f05f451 to
865e876
Compare
EXPLORATION ON TOP OF THE DEADLOCK FIX. Drop this commit if we decide
to ship just the inline fixes.
Background: PR review suggested decoupling CDP reads from the worker's
runner loop entirely, via a dedicated reader thread that parses WS
frames and pushes complete CDP messages onto a mailbox the worker
drains at safe points (Runner.tick, syncRequest, waitForImport). That
removes the two-way backpressure deadlock at its root and replaces the
per-site point fixes from the squashed parent commit (WsConnection.send
POLLIN drain, has_buffered_input, CDP.tryRead, ScriptManager extra
cdp.blocking_read) with a uniform model.
Shape of the prototype:
* New src/cdp/Mailbox.zig
- Thread-safe FIFO + wake pipe.
- Producer = reader thread (push, close).
- Consumer = worker thread (pop, drainWake, freeMessage, deinit).
- HttpClient.perform polls Mailbox.wake_read instead of the raw WS
socket fd, so the runner still gets a .cdp_socket signal as soon
as a message lands.
* src/cdp/CDP.zig
- Add `mailbox: Mailbox` and `reader_thread: ?std.Thread`.
- init() prepares the mailbox and points CDPClient.socket at the
wake fd, but does NOT spawn the reader yet — that would race the
WS handshake on socket reads.
- New `startReader()` is called from Server.serveConnection after
handshake succeeds.
- deinit() shuts down the WS recv side (wakes the reader's blocked
poll via POLLHUP), joins the reader, then tears down everything
else.
- readSocket() now drains the mailbox; readerLoop() is the reader's
poll-then-read-then-processMessages loop.
- drainMailbox is the new CDPClient.blocking_read implementation.
blocking_read_start / blocking_read_end / has_buffered_input are
gone — the mailbox makes them obsolete.
* src/network/WsConnection.zig
- Two threads now reach the send path concurrently (reader sends
close/pong on protocol errors and ping replies; worker sends CDP
events and responses), so send_arena and the socket-write loop
have to be serialized. Added a send_mutex; split send into
sendLocked (caller holds the mutex) so sendPong / sendJSON can
hold it across their alloc-from-send_arena + send sequence.
- Reverted EWOULDBLOCK handling to the simpler "flip to blocking,
do the write, flip back" approach. Safe again because the recv
side is owned by a different thread — the peer keeps draining
while we block.
- Removed the POLLIN-drain machinery (waitWritableDrainingReads,
drainAvailable, tryRead, bufferedBytes, ReadResult) — all of that
existed to manually break the read/write deadlock the mailbox now
eliminates structurally.
* src/browser/HttpClient.zig
- Drop has_buffered_input from CDPClient. Drop the corresponding
branch in perform().
- Keep the perform change that always polls the CDP fd (with a
timeout=0 fast poll if we just processed completions) — still
valid; prevents HTTP completion floods from starving CDP reads.
- Drop blocking_read_start / blocking_read_end from CDPClient (dead
code: only the middle of the trio, blocking_read, is ever used).
* src/Server.zig
- Call cdp.startReader() after the handshake and after registering
the connection. Tests that init CDP without a real handshake
never call startReader and remain unaffected.
Verified against the PR's exact reproducer (puppeteer
setRequestInterception + http-proxy through a local CONNECT proxy):
| URL | before fixes | with mailbox |
|-------------------|------------------|--------------------|
| allbirds.com (x3) | HANG 20s (12/53) | OK 2.1-3.1s (all) |
All 692 in-tree tests pass.
Open questions to discuss before promoting this:
* Should the inline-fix commit live alongside this in the same PR, or
should we drop the inline send/perform changes once the mailbox
lands? Several of the parent commit's fixes (Transfer.deinit /
Headers.deinit / BC.deinit kill vs abort / ScriptManager re-fetch +
entry guard) are independent of the read path and stay either way.
* The "allow-list-during-sync-wait" idea from review is not
implemented here. With the mailbox we drain everything at every
safe point, including syncRequest's tick loop, so an interception
can resume even while a synchronous XHR/import is in flight. If
that scope needs tightening (e.g. only allow Fetch.continueRequest
/ failRequest / fulfillRequest / continueWithAuth during sync
wait) it would go in Mailbox.pop / CDP.readSocket.
* Outbound writes still go inline (synchronous) on the worker thread.
We could push outgoing frames through a writer thread too, but the
inbound decoupling alone already breaks the deadlock and keeps the
diff small.
|
@karlseguin — I took a swing at your mailbox idea as a second commit on top so it's easy to compare (and easy to drop if you'd rather we don't go this direction): Shape
Net: ~165 deletions in Verification against the PR reproLocal CONNECT proxy on (For reference: All 692 in-tree tests still pass. What I didn't change
How you'd like to proceed?A few options I see:
Happy to take whichever direction works for you — and to iterate on the prototype (allow-list, outbound writer thread, anything else). |
|
@staylor thanks for taking a stab at this. This is something I wanted to implement myself (with Claude's help, if I'm being honest). One notable difference with my approach is that I use the main thread (via Network.zig) to poll/read the CDP sockets, so it doesn't require spinning up any new threads. The PR is #2495. It doesn't address all of the issues in this PR, but it should address some (along with other issues we've been seeing) and hopefully serves as a better foundation. Would love if you'd be able to take a look and possibly try it. |
|
yeah I can took a look in a lil bit - I have no preference on the approach, I just want to unblock our own adoption. if your PR fixes the IO deadlock, we're probably fine. I don't think we actually need #2399 and #2476, but I'll confirm. #2476 sprang up intermediately while I was trying to track down the deadlock, which took multiple multi-hour sessions. Claude even gave up twice. |
whatever we can do to get yours merged, let's do it! Observations
Caveats
Bottom line for your PRIn this smoke test, PR #2495 alone is the most reliable Playwright path against allbirds.com under --http-proxy. Your PR #2468 (rebased on current main) consistently crashes earlier under the same load, and #2476 doesn't fix it |
|
Closing. Fixed (or at least significantly improved) by #2495 |
Closes #2462.
Summary
Five coordinated fixes for the
Fetch.enable+--http-proxydeadlock. The original symptom (CONNECT/TLS proxy round-trips inflate concurrency, lightpanda's WS send fills the kernel buffer, lightpanda blocks on write while puppeteer's matchingFetch.continueRequestreplies pile up unread) is the headline scenario, but along the way an empirical reproducer surfaced four more places that produce the same "lightpanda stops draining the CDP socket and/or crashes mid-cleanup" symptom. All five share the root pattern: a synchronous wait loop or shutdown path that re-enters JS without giving CDP a chance to make progress.A/B verification
Repro: puppeteer-core +
setRequestInterception(true)+req.continue()against a localhost forward proxy, lightpanda started with--http-proxy http://127.0.0.1:8080.https://example.comhttps://github.comhttps://www.shopify.comhttps://www.allbirds.comhttps://www.allbirds.com/products/womens-wool-runners-dapple-greyLightpanda survives 18 back-to-back navigations across the matrix above (3 per URL) without crashing on the patched binary. Pre-patch, the cleanup path triggered SIGSEGV (
ZigToCurlAllocator.free: incorrect alignment) insideBrowserContext.deinitwhenever the page actually completed before puppeteer disconnected.(
https://www.nike.comstill doesn't fire theloadevent in either version. After this fix all 68Fetch.continueRequestmessages process cleanly; the remaining stall is third-party widgets keeping the page inloadingstate, not a CDP/HTTP deadlock. Out of scope.)Reproducer
The puppeteer-core script that drove all measurements:
Lightpanda startup:
Any HTTP CONNECT-capable forward proxy on
127.0.0.1:8080reproduces (vanilla Nodehttp.createServerwith aconnecthandler is enough).Changes
src/network/WsConnection.zig-- drain reads while waiting for write spaceWsConnection.sendpreviously switched the socket to blocking mode onEWOULDBLOCK. The header comment said this "should virtually never happen" -- underFetch.enable+--http-proxyit is routine, and the blocking write deadlocks because the peer is also blocked trying to send to us.New behavior: on
EWOULDBLOCK, poll forPOLLOUTandPOLLIN, drain any incoming bytes into the reader buffer (without dispatching -- that would re-entersendand recurse). Buffered bytes are picked up by the nextCDP.readSocket. AddstryRead(variant ofreadthat distinguishes "no new bytes" from "peer closed") andbufferedBytesaccessor.src/browser/HttpClient.zig-- never starve CDP readsperform()returned.normalearly wheneverprocessMessages()had work, so a steady stream of HTTP completions never gave the CDP socket a chance to be polled. Now: always do at least a non-blockingcurl_multi_pollof the CDP socket, and return.cdp_socketimmediately when the WS reader has bytes that were rescued during a backpressured send (via the newhas_buffered_inputcallback onCDPClient).Also makes
Transfer.deinitidempotent. The error-callback path can cascade into re-entrant deinits (e.g.Script.errorCallback -> manager.evaluate() -> JS execution -> Frame.deinit -> abortOwner -> Transfer.kill -> Transfer.deinit); the second deinit then double-frees the Transfer's arena (ArenaPool: double-free detectedpanic). The fix usesclient.transfers.remove(self.id)as a one-shot ownership claim -- if it returns false, this transfer was already torn down by the cascade and we early-return.Transfer.killis promoted fromfntopub fnsoBrowserContext.deinitcan use it (see CDP.zig change below).RequestParams.deinit/Request.deinitnow take*instead of*constso they can call into the now-mutatingHeaders.deinit.src/network/http.zig-- make Headers.deinit idempotentHeaders.deinitdid not nullself.headersaftercurl_slist_free_all. Combined with the value-copy hazardRobotsLayeralready documents (Headers wraps a raw*curl_slist, so a value copy shares the pointer), a second deinit on a copied Headers double-freed the curl slist. The symptom was an "incorrect alignment" panic insideZigToCurlAllocator.freebecause the slist nodes were already returned to the pool. Nullingself.headersafter the free makes the second call a no-op regardless of how the value-copy hazard is triggered.src/browser/ScriptManagerBase.zig-- drain CDP inwaitForImportwaitForImportspun onclient.tick(200)and discarded the.cdp_socketreturn. A scriptimport()whose request was paused at theInterceptionLayerhung forever: the matchingFetch.continueRequestreply never reachedCDP.processMessage, the request never resumed, and this loop spun until the client's navigation timeout fired. Mirrors thesyncRequestpattern that already does this correctly.The entry pointer is now re-fetched on every loop iteration. The CDP-drain step above re-enters JS, which can call back into
preloadImportfor a transitively-imported module;preloadImportcallsimported_modules.getOrPut, which may rehash the map and invalidate every existing entry pointer. A cachedentrywas a use-after-free on the nextentry.value_ptr.stateaccess.src/cdp/CDP.zig-- wire backpressured-buffer drain + safer teardownreadSocketnow usestryRead(the newWsConnectionAPI) and tolerates theno_new_datacase so messages drained during a backpressuredsendstill get processed.BrowserContext.deinitaborts pending intercepts viatransfer.killinstead oftransfer.abort.killfiresshutdown_callback(or no-op for transfers without one);abortfireserror_callback, which re-enters JS via XHR/script error handlers. The handlers crash because the inspector and V8 context thisBrowserContextowns have already been torn down (inspector.stopSession()runs two lines above the abort loop). The original code already aborted "before closing the session/page since some of these might callback into the page/scriptmanager" --kill's shutdown path is the right primitive for that.Stack of failure modes I observed during debugging
For anyone hitting this region in the future, the symptoms shifted as I peeled layers:
Fetch.requestPausedevents fire in lightpanda butFetch.continueRequestreplies never arrive. (Backpressure deadlock -- fixed by the WsConnection + perform changes.)BrowserContext.deinitSIGSEGVs incurl_slist_free_all. (Headers double-free -- fixed byHeaders.deinitidempotency.)ArenaPool: double-free detectedinTransfer.deinit. (Cascade out oferror_callbackre-deinits the same transfer -- fixed byTransfer.deinitownership claim.)_v8_inspector__Session__wrapObject. (XHR/scripterror_callbackre-enters JS through a torn-down inspector -- fixed bykillinstead ofabortinBrowserContext.deinit.)