feat: allow navigation hooks to override crawling context members#3672
feat: allow navigation hooks to override crawling context members#3672barjin wants to merge 18 commits into
Conversation
Hooks may now return a Partial<Context>; returned own properties are merged into the crawling context via property descriptors. Subsequent hooks and pipeline stages observe the overrides. Browser crawler installs the navigation response on the context before postNavigationHooks run, so hooks can both read and override it. Closes #3660
… handleCloudflareChallengeHook handleCloudflareChallenge now reloads the page after solving the challenge and returns the fresh Response, so its result can be propagated back into the crawling context via a postNavigationHook return value. A new top-level handleCloudflareChallengeHook(options?) factory wraps that flow so users can pass it directly to postNavigationHooks.
…eCloudflareChallenge
There was a problem hiding this comment.
Pull request overview
Enables preNavigationHooks / postNavigationHooks to return a Partial<Context> whose own properties are merged into the crawling context, allowing hooks to override context members (notably response) for subsequent hooks and pipeline stages (e.g., after solving Cloudflare challenges).
Changes:
- Updated
BasicCrawler._executeHooks()to apply hook return-value overrides onto the shared context via property descriptors. - Extended hook typings + docs across HTTP/Browser/Playwright/Puppeteer crawlers and added a
handleCloudflareChallengeHook()helper for Playwright. - Added an
HttpCrawlertest asserting that a navigation hook can overrideresponse(and therebybody/status observed by the handler).
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
| test/core/crawlers/http_crawler.test.ts | Adds coverage for overriding response via hook return value in HttpCrawler. |
| packages/basic-crawler/src/internals/basic-crawler.ts | Implements the core “hook can return overrides” behavior by merging returned properties into context. |
| packages/browser-crawler/src/internals/browser-crawler.ts | Ensures response is definable on context and re-reads response after hooks for downstream processing. |
| packages/http-crawler/src/internals/http-crawler.ts | Updates hook types/docs so postNavigationHooks can return partial context overrides. |
| packages/puppeteer-crawler/src/internals/puppeteer-crawler.ts | Documents hook return-value overrides for postNavigationHooks. |
| packages/playwright-crawler/src/internals/playwright-crawler.ts | Documents hook return-value overrides and adds handleCloudflareChallengeHook(). |
| packages/playwright-crawler/src/internals/utils/playwright-utils.ts | Changes handleCloudflareChallenge() to return a post-challenge Response for propagation via hook overrides; updates docs/examples. |
| packages/playwright-crawler/src/internals/adaptive-playwright-crawler.ts | Applies hook return-value overrides when running adaptive hooks (HTTP and browser modes). |
| docs/guides/navigation_hooks_overrides.ts | New guide snippet demonstrating context override via hook return value. |
| docs/guides/avoid_blocking.mdx | Updates guide to mention hook-based response propagation and includes the new snippet. |
| docs/guides/avoid_blocking_camoufox.ts | Switches to using handleCloudflareChallengeHook() in the Camoufox example. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- browser-crawler: read response back from the context's own property descriptor so hook overrides win even when _navigationHandler returned undefined - navigation_hooks_overrides.ts guide: guard against handleCloudflareChallenge returning undefined to avoid overwriting response with undefined - document return-value overrides on preNavigationHooks in http, puppeteer, playwright and adaptive-playwright JSDoc - fix 'Modyfing' -> 'Modifying' typo in pre-existing JSDoc - add browser-crawler test asserting response override is observed by the next postNavigationHook and the request handler
Each pre/postNavigationHook now composes as its own ContextPipeline middleware, so the property-descriptor merge of hook return values happens in ContextPipeline.call rather than duplicated in BasicCrawler._executeHooks or the adaptive wrapper. performNavigation splits into prepareNavigation / navigate / finalizeNavigation; the HTTP crawler likewise splits prepareHttpRequest / makeHttpRequest / processHttpResponse with hooks interleaved. gotoOptions and the before-hooks cookie snapshot move to symbol-keyed context slots so the navigate step can pick them up. A small skipGuard wrapper short-circuits the navigation steps when request.skipNavigation is set, removing the inline check from every middleware.
skipGuard accepts the hook signature directly, so the wrapHook wrapper is just a rename.
…ntext gotoOptions is now a first-class field on BrowserCrawlingContext (typed narrowly on PuppeteerCrawlingContext / PlaywrightCrawlingContext / StagehandCrawlingContext). BrowserHook collapses to a single-arg signature matching the http and adaptive variants; the GOTO_OPTIONS symbol slot and the readContextField hop for it are removed. Hooks that previously mutated the gotoOptions parameter now mutate ctx.gotoOptions instead. BREAKING: navigation hooks for browser crawlers (Playwright, Puppeteer, Stagehand) no longer receive gotoOptions as a second argument. Migration: read it from the crawling context (`ctx.gotoOptions`).
Adds a GoToOptions type parameter on BrowserCrawlingContext (defaulting
to Dictionary). PuppeteerCrawlingContext / PlaywrightCrawlingContext /
StagehandCrawlingContext now narrow the field by passing their own
goto-options type as the generic argument instead of redeclaring
`gotoOptions: T & {}`. The `T & {}` shape-strip-undefined trick is
replaced with a clearer NonNullable<...> at the type alias.
preparePage swaps its empty-object gotoOptions placeholder for the
same throwing-getter idiom already used for response, so accidental
early reads now surface a helpful error instead of seeing an empty
object.
AdaptivePlaywrightCrawler drops the per-hook adaptHook wrapper; the
hooks are structurally compatible with the underlying crawlers'
signatures, so an array-level cast removes the per-request promise
allocation per hook.
pageOptions is a browser-pool concept (PrePageCreateHook on BrowserPool) unrelated to crawler navigation hooks; the cross-reference read like a constraint of preNavigationHooks itself.
…eNavigation Replaces the property-descriptor probe with a straightforward try/catch around the read — same behavior, much easier to scan.
The intermediate `pre`/`post` variables were context-free and shared between two different crawler types — the cast is clearer next to the hook list it's actually feeding into.
…turing Drops the repeated `?? []` from four call sites by defaulting in the options destructuring.
0ad26da to
9dc9728
Compare
9dc9728 to
564524a
Compare
Make the http-crawler skipGuard generic to drop the hook casts and use .map(skipGuard), simplify the browser-crawler navigate return, and remove the misplaced navigation-hooks override docs from avoid_blocking.
| export type BrowserHook<Context = BrowserCrawlingContext, GoToOptions extends Dictionary | undefined = Dictionary> = ( | ||
| export type BrowserHook<Context = BrowserCrawlingContext> = ( | ||
| crawlingContext: Context, | ||
| gotoOptions: GoToOptions, |
There was a problem hiding this comment.
db146f5 drops the second gotoOptions param for navigation hooks and passes this through the CrawlingContext instead.
I suppose we had it like this because of the lack of a unified ContextPipeline in v3, but I'd like to validate this hunch with someone else.
There was a problem hiding this comment.
I can't think of a reason to keep this a separate parameter, so your hunch is probably right.
| action: async (ctx) => (ctx.request.skipNavigation ? {} : ((await action(ctx)) ?? {})), | ||
| }); | ||
|
|
||
| const middlewares: ContextMiddleware<any, any>[] = [ |
There was a problem hiding this comment.
This doesn't ensure that the middlewares "connect" correctly — I could add a step that expects a completely random property here and it would pass type check and explode cryptically at runtime.
| .compose({ action: this.handleBlockedRequestByContent.bind(this) }) | ||
| .compose({ action: this.restoreRequestState.bind(this) }), | ||
| contextPipelineBuilder: () => { | ||
| const middlewares: ContextMiddleware<Context, Partial<Context>>[] = [ |
There was a problem hiding this comment.
preNavigationHooksandpostNavigationHooksmay now return aPartial<Context>; returned own properties are merged into the crawling context.Closes #3660