Skip to content

feat: allow navigation hooks to override crawling context members#3672

Open
barjin wants to merge 18 commits into
v4from
feat/nav-hook-context-overrides
Open

feat: allow navigation hooks to override crawling context members#3672
barjin wants to merge 18 commits into
v4from
feat/nav-hook-context-overrides

Conversation

@barjin
Copy link
Copy Markdown
Member

@barjin barjin commented May 25, 2026

preNavigationHooks and postNavigationHooks may now return a Partial<Context>; returned own properties are merged into the crawling context.

Closes #3660

barjin added 3 commits May 25, 2026 14:01
Hooks may now return a Partial<Context>; returned own properties are merged
into the crawling context via property descriptors. Subsequent hooks and
pipeline stages observe the overrides. Browser crawler installs the
navigation response on the context before postNavigationHooks run, so
hooks can both read and override it.

Closes #3660
… handleCloudflareChallengeHook

handleCloudflareChallenge now reloads the page after solving the challenge
and returns the fresh Response, so its result can be propagated back into
the crawling context via a postNavigationHook return value.

A new top-level handleCloudflareChallengeHook(options?) factory wraps that
flow so users can pass it directly to postNavigationHooks.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Enables preNavigationHooks / postNavigationHooks to return a Partial<Context> whose own properties are merged into the crawling context, allowing hooks to override context members (notably response) for subsequent hooks and pipeline stages (e.g., after solving Cloudflare challenges).

Changes:

  • Updated BasicCrawler._executeHooks() to apply hook return-value overrides onto the shared context via property descriptors.
  • Extended hook typings + docs across HTTP/Browser/Playwright/Puppeteer crawlers and added a handleCloudflareChallengeHook() helper for Playwright.
  • Added an HttpCrawler test asserting that a navigation hook can override response (and thereby body/status observed by the handler).

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
test/core/crawlers/http_crawler.test.ts Adds coverage for overriding response via hook return value in HttpCrawler.
packages/basic-crawler/src/internals/basic-crawler.ts Implements the core “hook can return overrides” behavior by merging returned properties into context.
packages/browser-crawler/src/internals/browser-crawler.ts Ensures response is definable on context and re-reads response after hooks for downstream processing.
packages/http-crawler/src/internals/http-crawler.ts Updates hook types/docs so postNavigationHooks can return partial context overrides.
packages/puppeteer-crawler/src/internals/puppeteer-crawler.ts Documents hook return-value overrides for postNavigationHooks.
packages/playwright-crawler/src/internals/playwright-crawler.ts Documents hook return-value overrides and adds handleCloudflareChallengeHook().
packages/playwright-crawler/src/internals/utils/playwright-utils.ts Changes handleCloudflareChallenge() to return a post-challenge Response for propagation via hook overrides; updates docs/examples.
packages/playwright-crawler/src/internals/adaptive-playwright-crawler.ts Applies hook return-value overrides when running adaptive hooks (HTTP and browser modes).
docs/guides/navigation_hooks_overrides.ts New guide snippet demonstrating context override via hook return value.
docs/guides/avoid_blocking.mdx Updates guide to mention hook-based response propagation and includes the new snippet.
docs/guides/avoid_blocking_camoufox.ts Switches to using handleCloudflareChallengeHook() in the Camoufox example.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread packages/browser-crawler/src/internals/browser-crawler.ts Outdated
Comment thread docs/guides/navigation_hooks_overrides.ts Outdated
Comment thread packages/http-crawler/src/internals/http-crawler.ts
Comment thread packages/puppeteer-crawler/src/internals/puppeteer-crawler.ts
Comment thread packages/playwright-crawler/src/internals/playwright-crawler.ts
Comment thread packages/browser-crawler/src/internals/browser-crawler.ts Outdated
Comment thread packages/basic-crawler/src/internals/basic-crawler.ts Outdated
Comment thread packages/playwright-crawler/src/internals/adaptive-playwright-crawler.ts Outdated
- browser-crawler: read response back from the context's own property descriptor
  so hook overrides win even when _navigationHandler returned undefined
- navigation_hooks_overrides.ts guide: guard against handleCloudflareChallenge
  returning undefined to avoid overwriting response with undefined
- document return-value overrides on preNavigationHooks in http, puppeteer,
  playwright and adaptive-playwright JSDoc
- fix 'Modyfing' -> 'Modifying' typo in pre-existing JSDoc
- add browser-crawler test asserting response override is observed by the next
  postNavigationHook and the request handler
@barjin barjin requested review from janbuchar and l2ysho May 25, 2026 13:05
Comment thread packages/basic-crawler/src/internals/basic-crawler.ts Outdated
Comment thread packages/browser-crawler/src/internals/browser-crawler.ts Outdated
barjin added 2 commits May 29, 2026 12:22
Each pre/postNavigationHook now composes as its own ContextPipeline
middleware, so the property-descriptor merge of hook return values
happens in ContextPipeline.call rather than duplicated in
BasicCrawler._executeHooks or the adaptive wrapper. performNavigation
splits into prepareNavigation / navigate / finalizeNavigation; the
HTTP crawler likewise splits prepareHttpRequest / makeHttpRequest /
processHttpResponse with hooks interleaved. gotoOptions and the
before-hooks cookie snapshot move to symbol-keyed context slots so
the navigate step can pick them up. A small skipGuard wrapper
short-circuits the navigation steps when request.skipNavigation is
set, removing the inline check from every middleware.
barjin added 7 commits May 29, 2026 12:44
skipGuard accepts the hook signature directly, so the wrapHook wrapper
is just a rename.
…ntext

gotoOptions is now a first-class field on BrowserCrawlingContext (typed
narrowly on PuppeteerCrawlingContext / PlaywrightCrawlingContext /
StagehandCrawlingContext). BrowserHook collapses to a single-arg
signature matching the http and adaptive variants; the
GOTO_OPTIONS symbol slot and the readContextField hop for it are
removed. Hooks that previously mutated the gotoOptions parameter
now mutate ctx.gotoOptions instead.

BREAKING: navigation hooks for browser crawlers (Playwright, Puppeteer,
Stagehand) no longer receive gotoOptions as a second argument.
Migration: read it from the crawling context (`ctx.gotoOptions`).
Adds a GoToOptions type parameter on BrowserCrawlingContext (defaulting
to Dictionary). PuppeteerCrawlingContext / PlaywrightCrawlingContext /
StagehandCrawlingContext now narrow the field by passing their own
goto-options type as the generic argument instead of redeclaring
`gotoOptions: T & {}`. The `T & {}` shape-strip-undefined trick is
replaced with a clearer NonNullable<...> at the type alias.

preparePage swaps its empty-object gotoOptions placeholder for the
same throwing-getter idiom already used for response, so accidental
early reads now surface a helpful error instead of seeing an empty
object.

AdaptivePlaywrightCrawler drops the per-hook adaptHook wrapper; the
hooks are structurally compatible with the underlying crawlers'
signatures, so an array-level cast removes the per-request promise
allocation per hook.
pageOptions is a browser-pool concept (PrePageCreateHook on BrowserPool)
unrelated to crawler navigation hooks; the cross-reference read like a
constraint of preNavigationHooks itself.
…eNavigation

Replaces the property-descriptor probe with a straightforward try/catch
around the read — same behavior, much easier to scan.
The intermediate `pre`/`post` variables were context-free and shared
between two different crawler types — the cast is clearer next to the
hook list it's actually feeding into.
…turing

Drops the repeated `?? []` from four call sites by defaulting in the
options destructuring.
@barjin barjin force-pushed the feat/nav-hook-context-overrides branch 2 times, most recently from 0ad26da to 9dc9728 Compare June 1, 2026 07:29
@barjin barjin force-pushed the feat/nav-hook-context-overrides branch from 9dc9728 to 564524a Compare June 1, 2026 11:45
Make the http-crawler skipGuard generic to drop the hook casts and use
.map(skipGuard), simplify the browser-crawler navigate return, and remove
the misplaced navigation-hooks override docs from avoid_blocking.
Comment thread packages/basic-crawler/src/internals/basic-crawler.ts Outdated
Comment thread packages/browser-crawler/src/internals/browser-crawler.ts Outdated
export type BrowserHook<Context = BrowserCrawlingContext, GoToOptions extends Dictionary | undefined = Dictionary> = (
export type BrowserHook<Context = BrowserCrawlingContext> = (
crawlingContext: Context,
gotoOptions: GoToOptions,
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

db146f5 drops the second gotoOptions param for navigation hooks and passes this through the CrawlingContext instead.

I suppose we had it like this because of the lack of a unified ContextPipeline in v3, but I'd like to validate this hunch with someone else.

Copy link
Copy Markdown
Contributor

@janbuchar janbuchar Jun 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't think of a reason to keep this a separate parameter, so your hunch is probably right.

@barjin barjin requested review from janbuchar and l2ysho and removed request for janbuchar and l2ysho June 1, 2026 12:29
@barjin barjin requested a review from janbuchar June 2, 2026 13:48
Comment thread packages/playwright-crawler/src/internals/adaptive-playwright-crawler.ts Outdated
action: async (ctx) => (ctx.request.skipNavigation ? {} : ((await action(ctx)) ?? {})),
});

const middlewares: ContextMiddleware<any, any>[] = [
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't ensure that the middlewares "connect" correctly — I could add a step that expects a completely random property here and it would pass type check and explode cryptically at runtime.

.compose({ action: this.handleBlockedRequestByContent.bind(this) })
.compose({ action: this.restoreRequestState.bind(this) }),
contextPipelineBuilder: () => {
const middlewares: ContextMiddleware<Context, Partial<Context>>[] = [
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@barjin barjin requested a review from janbuchar June 4, 2026 15:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants