feat: Add page lifecycle hooks to BrowserPool (#1791)
### Description
Add four page lifecycle hooks to `BrowserPool` registered as decorators:
- `pre_page_create_hook` — called before page creation;
`browser_new_context_options` is mutable, so the hook can affect how the
page context is configured.
- `post_page_create_hook` — called after page creation.
- `pre_page_close_hook` — called before page close.
- `post_page_close_hook` — called after page close.
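
The decorator-based registration described above can be pictured with a small, self-contained sketch. Only the four hook names come from this change; `MiniPool`, the argument lists passed to each hook, and the dict-based "page" are hypothetical stand-ins chosen for illustration and are not Crawlee's `BrowserPool` API.

```python
import asyncio
from collections.abc import Awaitable, Callable
from typing import Any

Hook = Callable[..., Awaitable[None]]


class MiniPool:
    """Hypothetical stand-in for a browser pool (NOT Crawlee's BrowserPool)."""

    def __init__(self) -> None:
        self._hooks: dict[str, list[Hook]] = {
            'pre_create': [], 'post_create': [], 'pre_close': [], 'post_close': [],
        }

    # Each registration method works as a decorator: it stores the
    # coroutine function and returns it unchanged.
    def pre_page_create_hook(self, hook: Hook) -> Hook:
        self._hooks['pre_create'].append(hook)
        return hook

    def post_page_create_hook(self, hook: Hook) -> Hook:
        self._hooks['post_create'].append(hook)
        return hook

    def pre_page_close_hook(self, hook: Hook) -> Hook:
        self._hooks['pre_close'].append(hook)
        return hook

    def post_page_close_hook(self, hook: Hook) -> Hook:
        self._hooks['post_close'].append(hook)
        return hook

    async def _run(self, stage: str, *args: Any) -> None:
        for hook in self._hooks[stage]:
            await hook(*args)

    async def new_page(self, page_id: str) -> dict[str, Any]:
        # The options dict is handed to the hooks by reference, so a
        # pre-create hook can mutate it before the "page" is built.
        context_options: dict[str, Any] = {}
        await self._run('pre_create', page_id, context_options)
        page = {'id': page_id, 'context_options': context_options}
        await self._run('post_create', page)
        return page

    async def close_page(self, page: dict[str, Any]) -> None:
        await self._run('pre_close', page)
        page['closed'] = True
        await self._run('post_close', page['id'])


async def demo() -> tuple[dict[str, Any], list[str]]:
    pool = MiniPool()
    events: list[str] = []

    @pool.pre_page_create_hook
    async def set_locale(page_id: str, context_options: dict[str, Any]) -> None:
        context_options['locale'] = 'en-US'  # mutation is visible to the pool
        events.append(f'pre_create:{page_id}')

    @pool.post_page_create_hook
    async def on_created(page: dict[str, Any]) -> None:
        events.append(f"post_create:{page['id']}")

    @pool.pre_page_close_hook
    async def on_closing(page: dict[str, Any]) -> None:
        events.append(f"pre_close:{page['id']}")

    @pool.post_page_close_hook
    async def on_closed(page_id: str) -> None:
        events.append(f'post_close:{page_id}')

    page = await pool.new_page('p1')
    await pool.close_page(page)
    return page, events
```

Running `asyncio.run(demo())` in this sketch fires the hooks in creation-then-closure order. Passing the options dict by reference is what lets the pre-create hook influence how the page is configured, which mirrors the role the description gives to `browser_new_context_options`.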
### Issues
- Relates: #1741
### Testing
- Added new tests for `BrowserPool`.
A <ApiLink to="class/PlaywrightCrawler">`PlaywrightCrawler`</ApiLink> is a browser-based crawler. In contrast to HTTP-based crawlers like <ApiLink to="class/ParselCrawler">`ParselCrawler`</ApiLink> or <ApiLink to="class/BeautifulSoupCrawler">`BeautifulSoupCrawler`</ApiLink>, it uses a real browser to render pages and extract data. It is built on top of the [Playwright](https://playwright.dev/python/) browser automation library. While browser-based crawlers are typically slower and less efficient than HTTP-based crawlers, they can handle dynamic, client-side rendered sites that standard HTTP-based crawlers cannot manage.
@@ -57,14 +57,22 @@ You can also configure each plugin used by <ApiLink to="class/BrowserPool">`Brow
For an example of how to implement a custom browser plugin, see the [Camoufox example](../examples/playwright-crawler-with-camoufox). [Camoufox](https://camoufox.com/) is a stealth browser plugin designed to reduce detection by anti-scraping measures and is fully compatible with <ApiLink to="class/PlaywrightCrawler">`PlaywrightCrawler`</ApiLink>.
-## Page configuration with pre-navigation hooks
+## Page configuration with lifecycle page hooks
+
+For additional setup or event-driven actions around page creation and closure, the <ApiLink to="class/BrowserPool">`BrowserPool`</ApiLink> exposes four lifecycle hooks: <ApiLink to="class/BrowserPool#pre_page_create_hook">`pre_page_create_hook`</ApiLink>, <ApiLink to="class/BrowserPool#post_page_create_hook">`post_page_create_hook`</ApiLink>, <ApiLink to="class/BrowserPool#pre_page_close_hook">`pre_page_close_hook`</ApiLink>, and <ApiLink to="class/BrowserPool#post_page_close_hook">`post_page_close_hook`</ApiLink>. To use them, create a `BrowserPool` instance and pass it to <ApiLink to="class/PlaywrightCrawler">`PlaywrightCrawler`</ApiLink> via the `browser_pool` argument.
-In some use cases, you may need to configure the [page](https://playwright.dev/python/docs/api/class-page) before it navigates to the target URL. For instance, you might set navigation timeouts or manipulate other page-level settings. For such cases you can use the <ApiLink to="class/PlaywrightCrawler#pre_navigation_hook">`pre_navigation_hook`</ApiLink> method of the <ApiLink to="class/PlaywrightCrawler">`PlaywrightCrawler`</ApiLink>. This method is called before the page navigates to the target URL and allows you to configure the page instance.
+Navigation hooks allow for additional configuration at specific points during page navigation. For example, the <ApiLink to="class/PlaywrightCrawler#pre_navigation_hook">`pre_navigation_hook`</ApiLink> is called before each navigation and provides <ApiLink to="class/PlaywrightPreNavCrawlingContext">`PlaywrightPreNavCrawlingContext`</ApiLink> - including the [page](https://playwright.dev/python/docs/api/class-page) instance and a <ApiLink to="class/PlaywrightPreNavCrawlingContext#block_requests">`block_requests`</ApiLink> helper for filtering unwanted resource types and URL patterns. See the [block requests example](https://crawlee.dev/python/docs/examples/playwright-crawler-with-block-requests) for a dedicated walkthrough.
-This guide introduced the <ApiLink to="class/PlaywrightCrawler">`PlaywrightCrawler`</ApiLink> and explained how to configure it using <ApiLink to="class/BrowserPool">`BrowserPool`</ApiLink> and <ApiLink to="class/PlaywrightBrowserPlugin">`PlaywrightBrowserPlugin`</ApiLink>. You learned how to launch multiple browsers, configure browser and context settings, and apply pre-navigation hooks. If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/crawlee-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping!
+This guide introduced the <ApiLink to="class/PlaywrightCrawler">`PlaywrightCrawler`</ApiLink> and explained how to configure it using <ApiLink to="class/BrowserPool">`BrowserPool`</ApiLink> and <ApiLink to="class/PlaywrightBrowserPlugin">`PlaywrightBrowserPlugin`</ApiLink>. You learned how to launch multiple browsers, configure browser and context settings, use <ApiLink to="class/BrowserPool">`BrowserPool`</ApiLink> lifecycle page hooks, and apply navigation hooks. If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/crawlee-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping!