crawlee-python/docs/guides/playwright_crawler.mdx at f1c8bba58c2eac7cf44de93f378123798c308c6d · apify/crawlee-python

---
id: playwright-crawler
title: Playwright crawler
description: Learn how to use PlaywrightCrawler for browser-based web scraping.
---

import ApiLink from '@site/src/components/ApiLink';
import CodeBlock from '@theme/CodeBlock';
import RunnableCodeBlock from '@site/src/components/RunnableCodeBlock';

import MultipleLaunchExample from '!!raw-loader!roa-loader!./code_examples/playwright_crawler/multiple_launch_example.py';
import BrowserConfigurationExample from '!!raw-loader!roa-loader!./code_examples/playwright_crawler/browser_configuration_example.py';
import NavigationHooksExample from '!!raw-loader!roa-loader!./code_examples/playwright_crawler/navigation_hooks_example.py';
import BrowserPoolLaunchHooksExample from '!!raw-loader!roa-loader!./code_examples/playwright_crawler/browser_pool_launch_hooks_example.py';
import BrowserPoolPageHooksExample from '!!raw-loader!roa-loader!./code_examples/playwright_crawler/browser_pool_page_hooks_example.py';
import PluginBrowserConfigExample from '!!raw-loader!./code_examples/playwright_crawler/plugin_browser_configuration_example.py';

A PlaywrightCrawler is a browser-based crawler. In contrast to HTTP-based crawlers like ParselCrawler or BeautifulSoupCrawler, it uses a real browser to render pages and extract data. It is built on top of the Playwright browser automation library. While browser-based crawlers are typically slower and less efficient than HTTP-based crawlers, they can handle dynamic, client-side rendered sites that standard HTTP-based crawlers cannot manage.

When to use Playwright crawler

Use PlaywrightCrawler in scenarios that require full browser capabilities, such as:

Dynamic content rendering: Required when pages rely on heavy JavaScript to load or modify content in the browser.
Anti-scraping protection: Helpful for sites using JavaScript-based security or advanced anti-automation measures.
Complex cookie management: Necessary for sites with session or cookie requirements that standard HTTP-based crawlers cannot handle easily.

If HTTP-based crawlers are insufficient, PlaywrightCrawler can address these challenges. See a basic example for a typical usage demonstration.
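
For the dynamic content criterion, one rough check is whether the data you want is already in the raw HTML that an HTTP-based crawler would receive. The sketch below is a heuristic under stated assumptions, not part of Crawlee: the helper name `appears_in_static_html` and the sample markup are hypothetical, and it only looks at visible text, so content embedded in inline scripts counts as absent.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect text nodes, skipping the bodies of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def appears_in_static_html(html: str, expected_text: str) -> bool:
    """Return True if expected_text is visible without running JavaScript."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    visible_text = ' '.join(collector.chunks).lower()
    return expected_text.lower() in visible_text


# Hypothetical responses: one server-rendered, one rendered on the client.
server_rendered = '<html><body><h1>Blue Widget</h1></body></html>'
client_rendered = (
    '<html><body><div id="root"></div>'
    '<script>render({"name": "Blue Widget"})</script></body></html>'
)

print(appears_in_static_html(server_rendered, 'Blue Widget'))
print(appears_in_static_html(client_rendered, 'Blue Widget'))
```

If the text you expect is missing from the static response but visible in a real browser, the page is probably rendered on the client, which is a reasonable signal to try PlaywrightCrawler.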

Advanced configuration

The PlaywrightCrawler uses other Crawlee components under the hood, notably BrowserPool and PlaywrightBrowserPlugin. These components let you configure the browser and context settings, launch multiple browsers, and apply pre-navigation hooks. You can create your own instances of these components and pass them to the PlaywrightCrawler constructor.

The PlaywrightBrowserPlugin manages how browsers are launched and how browser contexts are created. It accepts browser launch and new context options.
The BrowserPool manages the lifecycle of browser instances (launching, recycling, etc.). You can customize its behavior to suit your needs.

Managing multiple browsers

The BrowserPool allows you to manage multiple browsers. Each browser instance is managed by a separate PlaywrightBrowserPlugin and can be configured independently. This is useful for scenarios like testing multiple configurations or implementing browser rotation to help avoid blocks or detect different site behaviors.

{MultipleLaunchExample}

Browser launch and context configuration

The PlaywrightBrowserPlugin provides access to all relevant Playwright configuration options for both browser launches and new browser contexts. You can specify these options in the constructor of PlaywrightBrowserPlugin or PlaywrightCrawler:

{BrowserConfigurationExample}

You can also configure each plugin used by BrowserPool:

{PluginBrowserConfigExample}

For an example of how to implement a custom browser plugin, see the Camoufox example. Camoufox is a stealth browser plugin designed to reduce detection by anti-scraping measures and is fully compatible with PlaywrightCrawler.

Browser pool lifecycle hooks

The BrowserPool exposes lifecycle hooks for both browser launches and page creation/closure. To use them, create a BrowserPool instance and pass it to PlaywrightCrawler via the browser_pool argument.

Browser launch hooks

The pre_launch_hook and post_launch_hook are called once per browser instance, before and after it is launched. Use them for logging, metrics, or any setup at the browser level. Note that these hooks are not called when a new page is created in an already-running browser.

{BrowserPoolLaunchHooksExample}

Page lifecycle hooks

For additional setup or event-driven actions around page creation and closure, the BrowserPool exposes four hooks: pre_page_create_hook, post_page_create_hook, pre_page_close_hook, and post_page_close_hook.

{BrowserPoolPageHooksExample}
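
To make the difference between launch hooks and page hooks concrete, here is a toy model of the call order. It is not Crawlee's BrowserPool: the `ToyPool` class, its `max_pages_per_browser` rule for deciding when to launch another browser, and the event names are all assumptions made for illustration. It only demonstrates that launch hooks fire once per browser, while page-creation hooks fire for every page.

```python
import asyncio


class ToyPool:
    """Toy stand-in for a browser pool. NOT Crawlee's BrowserPool."""

    def __init__(self, max_pages_per_browser: int) -> None:
        self.max_pages_per_browser = max_pages_per_browser
        self.events: list[str] = []
        self._browser_count = 0
        self._pages_in_current_browser = 0

    async def new_page(self) -> None:
        # Assumed rule: launch a new browser when none exists yet or the
        # current one has reached its page cap.
        needs_browser = (
            self._browser_count == 0
            or self._pages_in_current_browser >= self.max_pages_per_browser
        )
        if needs_browser:
            self.events.append('pre_launch')
            self._browser_count += 1
            self._pages_in_current_browser = 0
            self.events.append('post_launch')

        # Page hooks run for every page, even in an already-running browser.
        self.events.append('pre_page_create')
        self._pages_in_current_browser += 1
        self.events.append('post_page_create')


async def simulate(pages: int) -> list[str]:
    pool = ToyPool(max_pages_per_browser=2)
    for _ in range(pages):
        await pool.new_page()
    return pool.events


events = asyncio.run(simulate(pages=3))
for event in events:
    print(event)
```

With three pages and a cap of two pages per browser, the toy pool launches two browsers, so the launch events appear twice and the page-creation events appear three times.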

Navigation hooks

Navigation hooks allow for additional configuration at specific points during page navigation. The pre_navigation_hook is called before each navigation and provides PlaywrightPreNavCrawlingContext - including the page instance and a block_requests helper for filtering unwanted resource types and URL patterns. See the block requests example for a dedicated walkthrough. Similarly, the post_navigation_hook is called after each navigation and provides PlaywrightPostNavCrawlingContext - useful for post-load checks such as detecting CAPTCHAs or verifying page state.

{NavigationHooksExample}
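
As a sketch of the kind of post-load check mentioned above, the helper below flags pages that look like a CAPTCHA or challenge screen. The function name `looks_like_captcha` and the marker strings are hypothetical and deliberately simple; real challenge pages vary, so treat this as a starting point, not a reliable detector.

```python
# Hypothetical marker strings; tune them for the sites you crawl.
CAPTCHA_MARKERS = ('captcha', 'verify you are human', 'unusual traffic')


def looks_like_captcha(title: str, html: str) -> bool:
    """Return True if the page title or markup contains a known marker."""
    haystack = f'{title}\n{html}'.lower()
    return any(marker in haystack for marker in CAPTCHA_MARKERS)


# Sample inputs standing in for a loaded page's title and HTML.
print(looks_like_captcha('Just a moment', '<div class="g-recaptcha"></div>'))
print(looks_like_captcha('Product list', '<ul><li>Item</li></ul>'))
```

Inside a post_navigation_hook, and assuming the post-navigation context exposes the Playwright page the same way the pre-navigation context does, the inputs could come from Playwright's `await context.page.title()` and `await context.page.content()`.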

Conclusion

Extending the browser plugin

For full control over browser launching, you can subclass PlaywrightBrowserPlugin and override its new_browser method. This lets you integrate any Playwright-compatible browser backend — such as a custom Chromium build, a stealth browser, or a browser with a persistent profile.

The overridden new_browser method must return a PlaywrightBrowserController instance wrapping your custom browser. Pass your plugin to BrowserPool, which you then provide to PlaywrightCrawler via the browser_pool argument.

{ExtendingPluginExample}

For a real-world example of a custom browser plugin, see the Camoufox example.

:::note

Third-party projects that provide alternative browser backends for Crawlee can link to this section as the canonical reference for plugin subclassing.

:::
