| id | playwright-crawler |
|---|---|
| title | Playwright crawler |
| description | Learn how to use PlaywrightCrawler for browser-based web scraping. |
import ApiLink from '@site/src/components/ApiLink';
import CodeBlock from '@theme/CodeBlock';
import RunnableCodeBlock from '@site/src/components/RunnableCodeBlock';

import MultipleLaunchExample from '!!raw-loader!roa-loader!./code_examples/playwright_crawler/multiple_launch_example.py';
import BrowserConfigurationExample from '!!raw-loader!roa-loader!./code_examples/playwright_crawler/browser_configuration_example.py';
import NavigationHooksExample from '!!raw-loader!roa-loader!./code_examples/playwright_crawler/navigation_hooks_example.py';
import BrowserPoolLaunchHooksExample from '!!raw-loader!roa-loader!./code_examples/playwright_crawler/browser_pool_launch_hooks_example.py';
import BrowserPoolPageHooksExample from '!!raw-loader!roa-loader!./code_examples/playwright_crawler/browser_pool_page_hooks_example.py';
import PluginBrowserConfigExample from '!!raw-loader!./code_examples/playwright_crawler/plugin_browser_configuration_example.py';
A PlaywrightCrawler is a browser-based crawler. In contrast to HTTP-based crawlers like ParselCrawler or BeautifulSoupCrawler, it uses a real browser to render pages and extract data. It is built on top of the Playwright browser automation library. While browser-based crawlers are typically slower and less efficient than HTTP-based crawlers, they can handle dynamic, client-side rendered sites that standard HTTP-based crawlers cannot manage.
Use PlaywrightCrawler in scenarios that require full browser capabilities, such as:
- Dynamic content rendering: Required when pages rely on heavy JavaScript to load or modify content in the browser.
- Anti-scraping protection: Helpful for sites using JavaScript-based security or advanced anti-automation measures.
- Complex cookie management: Necessary for sites with session or cookie requirements that standard HTTP-based crawlers cannot handle easily.
If HTTP-based crawlers are insufficient, PlaywrightCrawler can address these challenges. See a basic example for a typical usage demonstration.
The PlaywrightCrawler uses other Crawlee components under the hood, notably BrowserPool and PlaywrightBrowserPlugin. These components let you configure the browser and context settings, launch multiple browsers, and apply pre-navigation hooks. You can create your own instances of these components and pass them to the PlaywrightCrawler constructor.
- The PlaywrightBrowserPlugin manages how browsers are launched and how browser contexts are created. It accepts browser launch and new context options.
- The BrowserPool manages the lifecycle of browser instances (launching, recycling, etc.). You can customize its behavior to suit your needs.
The BrowserPool allows you to manage multiple browsers. Each browser instance is managed by a separate PlaywrightBrowserPlugin and can be configured independently. This is useful for scenarios like testing multiple configurations or implementing browser rotation to help avoid blocks or detect different site behaviors.
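As a rough illustration of running several browsers side by side, the sketch below creates one PlaywrightBrowserPlugin per engine and passes them all to a single BrowserPool. The engine list, the start URL, and the handler body are assumptions made for this sketch, and both engines need to be installed first (for example with `playwright install chromium firefox`). Crawlee is imported inside `main()` only so that the module can be loaded without the dependency present.

```python
# Hypothetical choice of engines for this sketch.
BROWSER_TYPES = ['chromium', 'firefox']


async def main() -> None:
    # Imported here so the module loads even when Crawlee is not installed.
    from crawlee.browsers import BrowserPool, PlaywrightBrowserPlugin
    from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext

    # One plugin per engine; each plugin can be configured independently.
    plugins = [
        PlaywrightBrowserPlugin(browser_type=browser_type, max_open_pages_per_browser=1)
        for browser_type in BROWSER_TYPES
    ]

    crawler = PlaywrightCrawler(
        browser_pool=BrowserPool(plugins=plugins),
        max_requests_per_crawl=10,
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        # Report which engine served this page.
        browser = context.page.context.browser
        browser_name = browser.browser_type.name if browser else 'undefined'
        context.log.info(f'Processing {context.request.url} with {browser_name}')
        await context.push_data({'url': context.request.url, 'browser': browser_name})

    await crawler.run(['https://crawlee.dev'])
```

Run the coroutine with `asyncio.run(main())`. Limiting each browser to one open page is an arbitrary choice here; it simply makes it more likely that both engines are used during a short run.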
The PlaywrightBrowserPlugin provides access to all relevant Playwright configuration options for both browser launches and new browser contexts. You can specify these options in the constructor of PlaywrightBrowserPlugin or PlaywrightCrawler:
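A minimal sketch of passing these options through the PlaywrightCrawler constructor might look like the following. The option values, the start URL, and the handler are assumptions chosen for illustration; the keys themselves are standard Playwright launch and new-context options.

```python
# Hypothetical values for this sketch; adjust them to your target site.
LAUNCH_OPTIONS = {
    'slow_mo': 200,  # Slow each Playwright operation down by 200 ms.
}
NEW_CONTEXT_OPTIONS = {
    'color_scheme': 'dark',
    'locale': 'en-US',
    'extra_http_headers': {'Accept-Language': 'en'},
}


async def main() -> None:
    # Imported here so the module loads even when Crawlee is not installed.
    from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext

    crawler = PlaywrightCrawler(
        headless=True,
        browser_type='chromium',
        # Applied when a browser is launched.
        browser_launch_options=LAUNCH_OPTIONS,
        # Applied when a new browser context is created.
        browser_new_context_options=NEW_CONTEXT_OPTIONS,
        max_requests_per_crawl=5,
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        # Check from inside the page that the context option took effect.
        prefers_dark = await context.page.evaluate(
            "window.matchMedia('(prefers-color-scheme: dark)').matches"
        )
        await context.push_data({'url': context.request.url, 'prefers_dark': prefers_dark})

    await crawler.run(['https://crawlee.dev'])
```

Run the coroutine with `asyncio.run(main())`. Keeping the option dictionaries at module level makes them easy to reuse across crawlers or to load from configuration.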
You can also configure each plugin used by BrowserPool:
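The same kind of options can be set on an individual plugin, roughly as sketched below. The option values and the start URL are assumptions for this example; the plugin is then wrapped in a BrowserPool and passed to the crawler via the browser_pool argument.

```python
# Hypothetical plugin settings for this sketch.
PLUGIN_OPTIONS = {
    'browser_type': 'chromium',
    'browser_launch_options': {'headless': True, 'slow_mo': 100},
    'browser_new_context_options': {'color_scheme': 'dark', 'locale': 'en-US'},
    'max_open_pages_per_browser': 5,
}


async def main() -> None:
    # Imported here so the module loads even when Crawlee is not installed.
    from crawlee.browsers import BrowserPool, PlaywrightBrowserPlugin
    from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext

    # The plugin carries the launch and new-context options.
    plugin = PlaywrightBrowserPlugin(**PLUGIN_OPTIONS)

    crawler = PlaywrightCrawler(
        browser_pool=BrowserPool(plugins=[plugin]),
        max_requests_per_crawl=5,
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        title = await context.page.title()
        await context.push_data({'url': context.request.url, 'title': title})

    await crawler.run(['https://crawlee.dev'])
```

Run the coroutine with `asyncio.run(main())`. In this sketch the browser settings live on the plugin rather than on the crawler, because the pool is supplied explicitly.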
For an example of how to implement a custom browser plugin, see the Camoufox example. Camoufox is a stealth browser plugin designed to reduce detection by anti-scraping measures and is fully compatible with PlaywrightCrawler.
The BrowserPool exposes lifecycle hooks for both browser launches and page creation/closure. To use them, create a BrowserPool instance and pass it to PlaywrightCrawler via the browser_pool argument.
The pre_launch_hook and post_launch_hook are called once per browser instance, before and after it is launched. Use them for logging, metrics, or any setup at the browser level. Note that these hooks are not called when a new page is created in an already-running browser.
For additional setup or event-driven actions around page creation and closure, the BrowserPool exposes four hooks: pre_page_create_hook, post_page_create_hook, pre_page_close_hook, and post_page_close_hook.
Navigation hooks allow for additional configuration at specific points during page navigation. The pre_navigation_hook is called before each navigation and provides a PlaywrightPreNavCrawlingContext, which includes the page instance and a block_requests helper for filtering unwanted resource types and URL patterns. See the block requests example for a dedicated walkthrough. Similarly, the post_navigation_hook is called after each navigation and provides a PlaywrightPostNavCrawlingContext, which is useful for post-load checks such as detecting CAPTCHAs or verifying page state.
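The following sketch registers a pre-navigation hook only; a post-navigation hook would be registered in a similar way. The viewport size, the start URL, and the handler body are assumptions for illustration, and `block_requests()` is called with its default patterns.

```python
# Hypothetical viewport used for this sketch.
VIEWPORT = {'width': 1280, 'height': 1024}


async def main() -> None:
    # Imported here so the module loads even when Crawlee is not installed.
    from crawlee.crawlers import (
        PlaywrightCrawler,
        PlaywrightCrawlingContext,
        PlaywrightPreNavCrawlingContext,
    )

    crawler = PlaywrightCrawler(max_requests_per_crawl=10)

    @crawler.pre_navigation_hook
    async def configure_page(context: PlaywrightPreNavCrawlingContext) -> None:
        context.log.info(f'Navigating to {context.request.url}')
        # Adjust the page before the navigation starts.
        await context.page.set_viewport_size(VIEWPORT)
        # Block common static resources using the helper's default patterns.
        await context.block_requests()

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        title = await context.page.title()
        await context.push_data({'url': context.request.url, 'title': title})

    await crawler.run(['https://crawlee.dev'])
```

Run the coroutine with `asyncio.run(main())`. Because the hook runs before every navigation, it is a convenient place for per-page setup that must be in effect before the first request is sent.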
For full control over browser launching, you can subclass PlaywrightBrowserPlugin and override its new_browser method. This lets you integrate any Playwright-compatible browser backend — such as a custom Chromium build, a stealth browser, or a browser with a persistent profile.
The overridden new_browser method must return a PlaywrightBrowserController instance wrapping your custom browser. Pass your plugin to BrowserPool, which you then provide to PlaywrightCrawler via the browser_pool argument.
For a real-world example of a custom browser plugin, see the Camoufox example.
:::note

Third-party projects that provide alternative browser backends for Crawlee can link to this section as the canonical reference for plugin subclassing.

:::