
Commit 2d2a050

feat: add first version of browser pool and playwright crawler (#161)
### Description

- Introduce the initial version of `BrowserPool` and `PlaywrightCrawler`.
- It lacks several features (fingerprinting, hooks, managing multiple browser instances, browser abstraction, ...); those will be done later, see #131.
- `BrowserPool` is responsible for managing browser-related resources, but currently it only supports handling a single browser instance through the plugin.
- A very first version of `PlaywrightCrawler` is also introduced, primarily to enhance testing and to provide a clear view of the end-user interface and results.

### Related issues

- #79

### Testing

- Unit tests for the new modules were written.
- For ad-hoc test code samples, see `README.md`.

### TODO

- [x] playwright install in CI
- [x] update related issues
1 parent 5c3753a commit 2d2a050

25 files changed

Lines changed: 744 additions & 8 deletions

Makefile

Lines changed: 1 addition & 0 deletions
```diff
@@ -12,6 +12,7 @@ install-dev:
 	python3 -m pip install --upgrade pip poetry
 	poetry install --all-extras
 	poetry run pre-commit install
+	poetry run playwright install
 
 build:
 	poetry build --no-interaction -vv
```

README.md

Lines changed: 110 additions & 1 deletion
````diff
@@ -206,7 +206,108 @@ from crawlee.enqueue_strategy import EnqueueStrategy
 
 #### PlaywrightCrawler
 
-- TODO
+[`PlaywrightCrawler`](https://github.com/apify/crawlee-py/tree/master/src/crawlee/playwright_crawler) extends
+the `BasicCrawler`. It provides the same features and, on top of that, it uses
+the [Playwright](https://playwright.dev/python) browser automation tool.
+
+This crawler provides a straightforward framework for parallel web page crawling using headless versions of the
+Chromium, Firefox, and WebKit browsers through Playwright. URLs to be crawled are supplied by a request provider,
+which can be either a `RequestList` containing a static list of URLs or a dynamic `RequestQueue`.
+
+Using a headless browser to download web pages and extract data, `PlaywrightCrawler` is ideal for crawling
+websites that require JavaScript execution. For websites that do not require JavaScript, consider using
+the `BeautifulSoupCrawler`, which utilizes raw HTTP requests and will be much faster.
+
+Example usage:
+
+```python
+import asyncio
+
+from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext
+from crawlee.storages import Dataset, RequestQueue
+
+
+async def main() -> None:
+    # Open a default request queue and add requests to it
+    rq = await RequestQueue.open()
+    await rq.add_request('https://crawlee.dev')
+
+    # Open a default dataset for storing results
+    dataset = await Dataset.open()
+
+    # Create a crawler instance and provide a request provider (and other optional arguments)
+    crawler = PlaywrightCrawler(
+        request_provider=rq,
+        # headless=False,
+        # browser_type='firefox',
+    )
+
+    @crawler.router.default_handler
+    async def request_handler(context: PlaywrightCrawlingContext) -> None:
+        record = {
+            'request_url': context.request.url,
+            'page_url': context.page.url,
+            'page_title': await context.page.title(),
+            'page_content': (await context.page.content())[:10000],
+        }
+        await dataset.push_data(record)
+
+    await crawler.run()
+
+
+if __name__ == '__main__':
+    asyncio.run(main())
+```
+
+Example usage with a custom browser pool:
+
+```python
+import asyncio
+
+from crawlee.browsers import BrowserPool, PlaywrightBrowserPlugin
+from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext
+from crawlee.storages import Dataset, RequestQueue
+
+
+async def main() -> None:
+    # Open a default request queue and add requests to it
+    rq = await RequestQueue.open()
+    await rq.add_request('https://crawlee.dev')
+    await rq.add_request('https://apify.com')
+
+    # Open a default dataset for storing results
+    dataset = await Dataset.open()
+
+    # Create a browser pool with a Playwright browser plugin
+    browser_pool = BrowserPool(
+        plugins=[
+            PlaywrightBrowserPlugin(
+                browser_type='firefox',
+                browser_options={'headless': False},
+                page_options={'viewport': {'width': 1920, 'height': 1080}},
+            )
+        ]
+    )
+
+    # Create a crawler instance and provide a browser pool and request provider
+    crawler = PlaywrightCrawler(request_provider=rq, browser_pool=browser_pool)
+
+    @crawler.router.default_handler
+    async def request_handler(context: PlaywrightCrawlingContext) -> None:
+        record = {
+            'request_url': context.request.url,
+            'page_url': context.page.url,
+            'page_title': await context.page.title(),
+            'page_content': (await context.page.content())[:10000],
+        }
+        await dataset.push_data(record)
+
+    await crawler.run()
+
+
+if __name__ == '__main__':
+    asyncio.run(main())
+```
 
 ### Storages
 
@@ -416,6 +517,14 @@ if __name__ == '__main__':
     asyncio.run(main())
 ```
 
+<!--
+### Browser Management
+
+- TODO
+- Write once browser rotation and/or other features are ready
+- Update PlaywrightCrawler according to this
+-->
+
 ## Running on the Apify platform
 
 Crawlee is open-source and runs anywhere, but since it's developed by [Apify](https://apify.com), it's easy to set up on the Apify platform and run in the cloud. Visit the [Apify SDK website](https://sdk.apify.com) to learn more about deploying Crawlee to the Apify platform.
````
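In both README examples, `@crawler.router.default_handler` registers the coroutine as the handler invoked for each crawled request. A self-contained sketch of how such a decorator-based router can work (a toy stand-in with hypothetical names, not crawlee's actual `Router`):

```python
import asyncio
from collections.abc import Awaitable, Callable

Handler = Callable[[str], Awaitable[str]]


class ToyRouter:
    """Registers one default handler and dispatches requests to it."""

    def __init__(self) -> None:
        self._default: Handler | None = None

    def default_handler(self, handler: Handler) -> Handler:
        # Used as a decorator: store the coroutine function and return it unchanged.
        self._default = handler
        return handler

    async def dispatch(self, url: str) -> str:
        if self._default is None:
            raise RuntimeError('no default handler registered')
        return await self._default(url)


router = ToyRouter()


@router.default_handler
async def handle(url: str) -> str:
    return f'processed {url}'


print(asyncio.run(router.dispatch('https://crawlee.dev')))
# → processed https://crawlee.dev
```

Because the decorator returns the handler unchanged, the function stays directly callable in tests while the router keeps a reference for dispatching.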

pyproject.toml

Lines changed: 3 additions & 0 deletions
```diff
@@ -55,6 +55,7 @@ html5lib = { version = "^1.1", optional = true }
 httpx = "^0.27.0"
 lxml = { version = "^5.2.1", optional = true }
 more_itertools = "^10.2.0"
+playwright = { version = "^1.43.0", optional = true }
 psutil = "^5.9.8"
 pydantic = "^2.6.3"
 pydantic-settings = "^2.2.1"
@@ -78,6 +79,7 @@ pytest-timeout = "~2.3.0"
 pytest-xdist = "~3.6.0"
 respx = "~0.21.0"
 ruff = "~0.4.0"
+setuptools = "^70.0.0" # setuptools are used by pytest, but not explicitly required
 types-aiofiles = "^23.2.0.20240106"
 types-beautifulsoup4 = "^4.12.0.20240229"
 types-colorama = "~0.4.15.20240106"
@@ -87,6 +89,7 @@ proxy-py = "^2.4.4"
 
 [tool.poetry.extras]
 beautifulsoup = ["beautifulsoup4", "lxml", "html5lib"]
+playwright = ["playwright"]
 
 [tool.ruff]
 line-length = 120
```
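Declaring `playwright` as an optional extra means it is only installed on request (e.g. via a `pip install 'crawlee[playwright]'`-style command), so code touching it typically guards the import. A generic sketch of that guard pattern (illustrative only; `try_import` is a hypothetical helper, not crawlee's actual code):

```python
from __future__ import annotations


def try_import(module_name: str, extra: str) -> object | None:
    """Import an optional dependency; return None and print a hint if it is missing."""
    try:
        return __import__(module_name)
    except ImportError:
        print(f'{module_name!r} is not installed; try: pip install "crawlee[{extra}]"')
        return None


json_mod = try_import('json', 'core')  # stdlib module: always importable
missing = try_import('definitely_missing_module', 'playwright')  # prints the hint, returns None
```

Raising (or printing) a message that names the extra gives users an actionable fix instead of a bare `ModuleNotFoundError`.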

src/crawlee/_utils/blocked.py

Lines changed: 2 additions & 0 deletions
```diff
@@ -1,3 +1,5 @@
+from __future__ import annotations
+
 # Inspiration: https://github.com/apify/crawlee/blob/v3.9.2/packages/utils/src/internals/blocked.ts
 
 CLOUDFLARE_RETRY_CSS_SELECTORS = [
```
src/crawlee/basic_crawler/__init__.py

Lines changed: 4 additions & 2 deletions

```diff
@@ -1,3 +1,5 @@
-from .basic_crawler import BasicCrawler, UserDefinedErrorHandlerError
-from .context_pipeline import BasicCrawlingContext, ContextPipeline, RequestHandlerError
+from .basic_crawler import BasicCrawler, BasicCrawlerOptions
+from .context_pipeline import ContextPipeline
+from .errors import RequestHandlerError, UserDefinedErrorHandlerError
 from .router import Router
+from .types import BasicCrawlingContext
```

src/crawlee/basic_crawler/basic_crawler.py

Lines changed: 14 additions & 0 deletions
```diff
@@ -34,6 +34,7 @@
     RequestHandlerRunResult,
     SendRequestFunction,
 )
+from crawlee.browsers import BrowserPool
 from crawlee.configuration import Configuration
 from crawlee.enqueue_strategy import EnqueueStrategy
 from crawlee.events.local_event_manager import LocalEventManager
@@ -75,6 +76,8 @@ class BasicCrawlerOptions(TypedDict, Generic[TCrawlingContext]):
     retry_on_blocked: NotRequired[bool]
     proxy_configuration: NotRequired[ProxyConfiguration]
     statistics: NotRequired[Statistics[StatisticsState]]
+    browser_pool: NotRequired[BrowserPool]
+    use_browser_pool: NotRequired[bool]
     _context_pipeline: NotRequired[ContextPipeline[TCrawlingContext]]
 
 
@@ -105,6 +108,8 @@ def __init__(
         retry_on_blocked: bool = True,
         proxy_configuration: ProxyConfiguration | None = None,
         statistics: Statistics | None = None,
+        browser_pool: BrowserPool | None = None,
+        use_browser_pool: bool = False,
         _context_pipeline: ContextPipeline[TCrawlingContext] | None = None,
     ) -> None:
         """Initialize the BasicCrawler.
@@ -125,6 +130,8 @@ def __init__(
             retry_on_blocked: If set to True, the crawler will try to automatically bypass any detected bot protection
             proxy_configuration: A HTTP proxy configuration to be used for making requests
             statistics: A preconfigured `Statistics` instance if you wish to use non-default configuration
+            browser_pool: A preconfigured `BrowserPool` instance for browser crawling.
+            use_browser_pool: Enables using the browser pool for crawling.
             _context_pipeline: Allows extending the request lifecycle and modifying the crawling context.
                 This parameter is meant to be used by child classes, not when BasicCrawler is instantiated directly.
         """
@@ -180,6 +187,10 @@ def __init__(
             log_message=f'{logger.name} request statistics',
         )
 
+        self._use_browser_pool = use_browser_pool
+        if self._use_browser_pool:
+            self._browser_pool = browser_pool or BrowserPool()
+
         self._running = False
         self._has_finished_before = False
 
@@ -293,6 +304,9 @@ async def run(self, requests: list[str | BaseRequestData] | None = None) -> Fina
         if self._use_session_pool:
             await exit_stack.enter_async_context(self._session_pool)
 
+        if self._use_browser_pool:
+            await exit_stack.enter_async_context(self._browser_pool)
+
         await self._pool.run()
 
         if self._statistics.error_tracker.total > 0:
```
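The `run()` hunk above enters the browser pool through an `AsyncExitStack`, so the pool's `__aenter__`/`__aexit__` bracket the whole crawl and cleanup happens even on failure. A minimal sketch of that pattern with a stand-in pool class (not crawlee's actual `BrowserPool`):

```python
import asyncio
from contextlib import AsyncExitStack


class DummyBrowserPool:
    """Stand-in resource pool with an async context-manager lifecycle."""

    def __init__(self) -> None:
        self.active = False

    async def __aenter__(self) -> 'DummyBrowserPool':
        self.active = True  # a real pool would launch browsers here
        return self

    async def __aexit__(self, *exc_info: object) -> None:
        self.active = False  # a real pool would close browsers here


async def run(use_browser_pool: bool = True) -> bool:
    pool = DummyBrowserPool()
    async with AsyncExitStack() as exit_stack:
        if use_browser_pool:
            # Mirrors `await exit_stack.enter_async_context(self._browser_pool)`
            await exit_stack.enter_async_context(pool)
        # ... the crawl itself would run here, with the pool guaranteed active ...
        assert pool.active
    # Leaving the stack has already closed the pool, even if the crawl raised.
    return pool.active


print(asyncio.run(run()))  # → False: the pool was shut down on exit
```

Using `AsyncExitStack` lets the crawler conditionally register any number of managed resources (session pool, browser pool) and tear them all down in reverse order with a single `async with`.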

src/crawlee/basic_crawler/errors.py

Lines changed: 2 additions & 0 deletions
```diff
@@ -1,3 +1,5 @@
+from __future__ import annotations
+
 from typing import Generic
 
 from typing_extensions import TypeVar
```
src/crawlee/beautifulsoup_crawler/__init__.py

Lines changed: 2 additions & 1 deletion

```diff
@@ -1 +1,2 @@
-from .beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
+from .beautifulsoup_crawler import BeautifulSoupCrawler
+from .types import BeautifulSoupCrawlingContext
```

src/crawlee/browsers/__init__.py

Lines changed: 2 additions & 0 deletions
```diff
@@ -0,0 +1,2 @@
+from .browser_pool import BrowserPool
+from .playwright_browser_plugin import PlaywrightBrowserPlugin
```
src/crawlee/browsers/base_browser_plugin.py

Lines changed: 46 additions & 0 deletions

```diff
@@ -0,0 +1,46 @@
+# Inspiration: https://github.com/apify/crawlee/blob/v3.10.0/packages/browser-pool/src/abstract-classes/browser-plugin.ts
+
+from __future__ import annotations
+
+from abc import ABC, abstractmethod
+from typing import TYPE_CHECKING, Literal
+
+if TYPE_CHECKING:
+    from types import TracebackType
+
+    from playwright.async_api import Browser, Page
+
+
+class BaseBrowserPlugin(ABC):
+    """An abstract base class for browser plugins.
+
+    Browser plugins act as wrappers around browser automation tools like Playwright,
+    providing a unified interface for interacting with browsers.
+    """
+
+    @property
+    @abstractmethod
+    def browser(self) -> Browser | None:
+        """Return the browser instance."""
+
+    @property
+    @abstractmethod
+    def browser_type(self) -> Literal['chromium', 'firefox', 'webkit']:
+        """Return the browser type name."""
+
+    @abstractmethod
+    async def __aenter__(self) -> BaseBrowserPlugin:
+        """Enter the context manager and initialize the browser plugin."""
+
+    @abstractmethod
+    async def __aexit__(
+        self,
+        exc_type: type[BaseException] | None,
+        exc_value: BaseException | None,
+        exc_traceback: TracebackType | None,
+    ) -> None:
+        """Exit the context manager and close the browser plugin."""
+
+    @abstractmethod
+    async def new_page(self) -> Page:
+        """Get a new page in a browser."""
```
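`BaseBrowserPlugin` combines `ABC` with the async context-manager protocol: a concrete plugin cannot be instantiated until it implements every abstract method, and its browser lifecycle is tied to `async with`. A cut-down illustration with a local stand-in ABC and a dummy subclass (hypothetical names, not the crawlee implementation):

```python
from __future__ import annotations

import asyncio
from abc import ABC, abstractmethod


class MiniPlugin(ABC):
    """Cut-down analogue of BaseBrowserPlugin, for illustration only."""

    @abstractmethod
    async def __aenter__(self) -> MiniPlugin: ...

    @abstractmethod
    async def __aexit__(self, *exc_info: object) -> None: ...

    @abstractmethod
    async def new_page(self) -> str: ...


class DummyPlugin(MiniPlugin):
    """Implements all abstract methods, so it can be instantiated."""

    def __init__(self) -> None:
        self.pages: list[str] = []

    async def __aenter__(self) -> DummyPlugin:
        return self  # a real plugin would launch the browser here

    async def __aexit__(self, *exc_info: object) -> None:
        self.pages.clear()  # a real plugin would close the browser here

    async def new_page(self) -> str:
        page = f'page-{len(self.pages)}'
        self.pages.append(page)
        return page


async def demo() -> str:
    try:
        MiniPlugin()  # type: ignore[abstract]
    except TypeError:
        pass  # the ABC refuses direct instantiation
    async with DummyPlugin() as plugin:
        return await plugin.new_page()


print(asyncio.run(demo()))  # → page-0
```

This is the contract `BrowserPool` relies on: it can enter any plugin, ask it for pages, and exit it, without knowing which automation tool sits underneath.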
