docs: Add Crawl4AI guide #942
Merged
Commits (9):
- a62a06b docs: add Crawl4AI guide (vdusek)
- aef1813 docs: renumber Crawl4AI guide to 08 and switch to a single-file example (vdusek)
- d6d4dcd chore: drop unused ruff ignore for the removed Crawl4AI project (vdusek)
- 360e0d0 docs: tidy clause-gluing dashes in the Crawl4AI guide (vdusek)
- 437924e address feedback (vdusek)
- 394de8f Merge branch 'master' into docs/crawl4ai-guide (vdusek)
- 5b6113f Merge branch 'master' into docs/crawl4ai-guide (vdusek)
- 7126de9 docs: Clone Crawl4AI guide into versioned docs (v3.4) (vdusek)
- a847279 Merge branch 'master' into docs/crawl4ai-guide (vdusek)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,80 @@ | ||
| --- | ||
| id: crawl4ai | ||
| title: LLM-ready scraping with Crawl4AI | ||
| description: Build an Apify Actor that scrapes web pages into LLM-ready markdown using the Crawl4AI library. | ||
| --- | ||
|
|
||
| import RunnableCodeBlock from '@site/src/components/RunnableCodeBlock'; | ||
|
|
||
| import Crawl4aiExample from '!!raw-loader!roa-loader!./code/08_crawl4ai.py'; | ||
|
|
||
| In this guide, you'll learn how to use the [Crawl4AI](https://crawl4ai.com/) library for LLM-ready web scraping in your Apify Actors. | ||
|
|
||
| ## Introduction | ||
|
|
||
| [Crawl4AI](https://crawl4ai.com/) is an open-source, asynchronous web crawler built for LLM and AI workflows. It renders a page in a real browser and turns the result into clean, structured markdown that's ready to feed into a language model or a retrieval-augmented generation (RAG) pipeline, while still giving you the raw HTML, extracted links, and media when you need them. | ||
|
|
||
| Some of the features that make Crawl4AI a good fit for Apify Actors: | ||
|
|
||
|
|
||
| - **LLM-ready markdown** - Crawl4AI converts each page into clean markdown, stripping boilerplate and optionally filtering content, so the output can be fed straight into a language model. | ||
|
|
||
| - **Real browser rendering** - Pages are loaded in a [Playwright](https://playwright.dev/)-driven browser, so JavaScript-heavy and dynamically rendered websites work out of the box. | ||
| - **Built-in link and media extraction** - Every crawl returns the page's links already split into `internal` and `external` groups, together with the media it found, which makes recursive crawling straightforward. | ||
| - **Flexible extraction strategies** - Beyond markdown, Crawl4AI can extract structured data with CSS/XPath schemas or with an LLM, all configured per request. | ||
| - **First-class async support** - The `AsyncWebCrawler` is built on `asyncio`, which integrates naturally with the asyncio-based Apify SDK. | ||
| - **Per-request proxy** - Each request can be routed through its own proxy, which pairs well with Apify Proxy and its rotating IP addresses. | ||
|
|
||
| Crawl4AI drives a real browser through Playwright, so after installing the library you need to download the browser binaries once with the `crawl4ai-setup` command: | ||
|
|
||
|
|
||
| ```bash | ||
| pip install crawl4ai | ||
| crawl4ai-setup | ||
| ``` | ||
|
|
||
| ## Example Actor | ||
|
|
||
| The following Actor recursively crawls pages, starting from the URLs in the Actor input and following links up to a user-defined maximum depth. It uses Crawl4AI's `AsyncWebCrawler` to render each page through [Apify Proxy](https://docs.apify.com/platform/proxy), stores the page's markdown in the dataset, and follows the internal links that Crawl4AI discovers. | ||
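The depth-limited traversal described above can be tried without a browser or the Apify platform. The following is a minimal sketch under stated assumptions: `FAKE_SITE` and `crawl` are hypothetical names invented for illustration, a `deque` stands in for the request queue, and the `seen` set imitates the deduplication that a request queue is assumed to perform on its own.

```python
from collections import deque

# Hypothetical in-memory site: page URL -> internal links found on that page.
FAKE_SITE = {
    'https://example.com/': ['https://example.com/a', 'https://example.com/b'],
    'https://example.com/a': ['https://example.com/a/deep', 'https://example.com/'],
    'https://example.com/b': [],
    'https://example.com/a/deep': [],
}


def crawl(start_urls: list[str], max_depth: int) -> list[tuple[str, int]]:
    """Visit pages breadth-first and follow links until max_depth is reached."""
    queue = deque((url, 0) for url in start_urls)  # start URLs have depth 0
    seen = set(start_urls)
    visited = []

    while queue:
        url, depth = queue.popleft()
        visited.append((url, depth))

        # Same rule as in the guide's example: a page at max_depth is still
        # scraped, but its links are not enqueued.
        if depth >= max_depth:
            continue

        for link in FAKE_SITE.get(url, []):
            if link not in seen:  # stand-in for request queue deduplication
                seen.add(link)
                queue.append((link, depth + 1))

    return visited


pages = crawl(['https://example.com/'], max_depth=1)
```

With `max_depth=1` the sketch visits the start page and its two direct links; raising it to `2` also reaches `https://example.com/a/deep`.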
|
|
||
|
|
||
| The whole Actor fits in a single file. A `scrape_page` helper holds the Crawl4AI-specific crawling and parsing, while the `main` coroutine handles the [Actor](https://docs.apify.com/platform/actors) lifecycle, reads the input, sets up [Apify Proxy](https://docs.apify.com/platform/proxy) and the [request queue](https://docs.apify.com/platform/storage/request-queue), opens a single browser-backed crawler, and drives the crawl: | ||
|
|
||
| <RunnableCodeBlock className="language-python" language="python"> | ||
| {Crawl4aiExample} | ||
| </RunnableCodeBlock> | ||
|
|
||
| A few things worth pointing out: | ||
|
|
||
|
|
||
| - A single `AsyncWebCrawler` is opened once and reused for every request. The crawler manages one browser instance, so reusing it across the whole crawl is far cheaper than launching a new browser per page. | ||
|
|
||
| - Keeping the crawling and parsing in `scrape_page` separates the Crawl4AI-specific code from the Actor's orchestration logic. The function returns the extracted data together with the discovered links, so `main` decides what to store and what to enqueue. | ||
| - `result.markdown` is the rendered page as clean markdown, and `result.metadata` carries page-level fields such as the title - exactly the kind of output you want when preparing data for an LLM. | ||
| - `result.links` already separates `internal` (same-site) links from `external` ones, so the example follows only the internal links to keep the crawl on the same website. | ||
|
|
||
| - `CacheMode.BYPASS` tells Crawl4AI to always fetch a fresh copy of the page instead of serving it from its local cache. | ||
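The link handling from the notes above can be exercised on plain data. This sketch assumes only the shape the example relies on: a dictionary with `internal` and `external` lists whose items carry an `href` key. The helper name `internal_hrefs`, the sample data, and the fragment stripping with `urldefrag` are additions for illustration and are not part of the guide's example.

```python
from urllib.parse import urldefrag


def internal_hrefs(links: dict[str, list[dict]]) -> list[str]:
    """Return unique internal link URLs in first-seen order, without fragments."""
    unique: dict[str, None] = {}  # a dict keeps insertion order and drops repeats
    for link in links.get('internal', []):
        href = link.get('href')
        if not href:
            continue  # skip entries with a missing or empty href
        url, _fragment = urldefrag(href)  # '/docs#install' and '/docs' are one page
        unique.setdefault(url, None)
    return list(unique)


# Hypothetical data in the shape described above.
sample_links = {
    'internal': [
        {'href': 'https://example.com/docs', 'text': 'Docs'},
        {'href': 'https://example.com/docs#install', 'text': 'Install'},
        {'href': '', 'text': 'Empty'},
        {'text': 'No href'},
    ],
    'external': [{'href': 'https://other.example.org/', 'text': 'Other'}],
}

hrefs = internal_hrefs(sample_links)
```

Dropping fragments before enqueuing is a design choice: two links that differ only in the fragment point at the same document, so treating them as one avoids scraping a page twice.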
|
|
||
| ## Using Apify Proxy | ||
|
|
||
| Running on the Apify platform gives your scraper access to [Apify Proxy](https://docs.apify.com/platform/proxy), which rotates IP addresses to avoid rate limiting and blocking. In the example above, `main` creates a proxy configuration with `Actor.create_proxy_configuration` and passes a fresh proxy URL to `scrape_page` for every request, which forwards it to Crawl4AI's per-request `CrawlerRunConfig`. | ||
|
|
||
| `ProxyConfig.from_string` parses the proxy URL returned by `ProxyConfiguration.new_url` (for example `http://groups-RESIDENTIAL:<password>@proxy.apify.com:8000`) into the server, username, and password that the browser needs - the browser cannot take the credentials embedded directly in the URL. To select specific proxy groups or a country, pass the relevant arguments to `Actor.create_proxy_configuration`. For more details, see the [Proxy management](../concepts/proxy-management) guide. | ||
|
|
||
| ## Running on the Apify platform | ||
|
|
||
| Because Crawl4AI renders pages in a real browser, the Actor image needs a browser and its system-level dependencies. Build on top of the [Apify Playwright base image](https://hub.docker.com/r/apify/actor-python-playwright), which already ships a browser - Crawl4AI reuses those binaries, so no separate browser-install step is required in the Dockerfile. | ||
|
|
||
| Pin the Python 3.13 variant of that image (for example `apify/actor-python-playwright:3.13-1.60.0`), because some of Crawl4AI's dependencies do not yet publish wheels for the newest Python versions, which would otherwise force a slow source build during the image build. | ||
|
|
||
| Add `apify` and `crawl4ai` to your `requirements.txt`: | ||
|
|
||
| ```text | ||
| apify | ||
| crawl4ai | ||
| ``` | ||
|
|
||
| ## Conclusion | ||
|
|
||
| In this guide, you learned how to use Crawl4AI in your Apify Actors. You can now render pages in a real browser, turn them into LLM-ready markdown, follow the links Crawl4AI discovers, route requests through Apify Proxy, and run the whole thing on the Apify platform. See the [Actor templates](https://apify.com/templates/categories/python) to get started with your own scraping tasks. If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/apify-sdk-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping! | ||
|
|
||
|
|
||
| ## Additional resources | ||
|
|
||
| - [Crawl4AI: Official documentation](https://docs.crawl4ai.com/) | ||
| - [Crawl4AI: AsyncWebCrawler and configuration](https://docs.crawl4ai.com/api/async-webcrawler/) | ||
| - [Crawl4AI: Proxy and security](https://docs.crawl4ai.com/advanced/proxy-security/) | ||
| - [Crawl4AI: GitHub repository](https://github.com/unclecode/crawl4ai) | ||
| - [Apify: Proxy management](https://docs.apify.com/platform/proxy) | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,124 @@ | ||
| import asyncio | ||
| from typing import Any | ||
|
|
||
| from crawl4ai import ( | ||
|     AsyncWebCrawler, | ||
|     BrowserConfig, | ||
|     CacheMode, | ||
|     CrawlerRunConfig, | ||
|     ProxyConfig, | ||
| ) | ||
|
|
||
| from apify import Actor, Request | ||
| from apify.storages import RequestQueue | ||
|
|
||
|
|
||
| async def scrape_page( | ||
|     crawler: AsyncWebCrawler, | ||
|     url: str, | ||
|     *, | ||
|     proxy_url: str | None = None, | ||
| ) -> tuple[dict[str, Any], list[str]]: | ||
|     """Crawl a page with Crawl4AI and return its markdown and same-site links.""" | ||
|     run_config = CrawlerRunConfig( | ||
|         cache_mode=CacheMode.BYPASS, | ||
|         proxy_config=ProxyConfig.from_string(proxy_url) if proxy_url else None, | ||
|     ) | ||
|
|
||
|     result = await crawler.arun(url, config=run_config) | ||
|     if not result.success: | ||
|         raise RuntimeError(result.error_message or f'Failed to crawl {url}') | ||
|
|
||
|     data = { | ||
|         'url': result.url, | ||
|         'title': (result.metadata or {}).get('title'), | ||
|         'markdown': str(result.markdown), | ||
|     } | ||
|
|
||
|     # Crawl4AI already classifies links; follow only the internal ones. | ||
|     internal_links = result.links.get('internal', []) | ||
|     links = [link['href'] for link in internal_links if link.get('href')] | ||
|
|
||
|     return data, links | ||
|
|
||
|
|
||
| async def enqueue_links( | ||
|     request_queue: RequestQueue, | ||
|     links: list[str], | ||
|     *, | ||
|     depth: int, | ||
|     max_depth: int, | ||
| ) -> None: | ||
|     """Enqueue the links one level deeper, unless max_depth was reached.""" | ||
|     if depth >= max_depth: | ||
|         return | ||
|
|
||
|     for link_url in links: | ||
|         Actor.log.info(f'Enqueuing {link_url} ...') | ||
|         request = Request.from_url(link_url) | ||
|         request.crawl_depth = depth + 1 | ||
|         await request_queue.add_request(request) | ||
|
|
||
|
|
||
| async def main() -> None: | ||
|     async with Actor: | ||
|         # Read the Actor input. | ||
|         actor_input = await Actor.get_input() or {} | ||
|         start_urls = actor_input.get('startUrls', [{'url': 'https://crawlee.dev'}]) | ||
|         max_depth = actor_input.get('maxDepth', 1) | ||
|
|
||
|         if not start_urls: | ||
|             Actor.log.info('No start URLs specified in Actor input, exiting...') | ||
|             await Actor.exit() | ||
|
|
||
|         # Set up Apify Proxy and the request queue. | ||
|         proxy_configuration = await Actor.create_proxy_configuration() | ||
|         request_queue = await Actor.open_request_queue() | ||
|
|
||
|         # Enqueue the start URLs (crawl depth defaults to 0). | ||
|         for start_url in start_urls: | ||
|             url = start_url.get('url') | ||
|             Actor.log.info(f'Enqueuing start URL: {url}') | ||
|             await request_queue.add_request(Request.from_url(url)) | ||
|
|
||
|         # Cap the crawl; raise or remove to follow more pages. | ||
|         max_requests = 50 | ||
|         handled_requests = 0 | ||
|
|
||
|         # Reuse one headless browser-backed crawler for every request. | ||
|         browser_config = BrowserConfig(headless=True) | ||
|
|
||
|         async with AsyncWebCrawler(config=browser_config) as crawler: | ||
|             while handled_requests < max_requests and ( | ||
|                 request := await request_queue.fetch_next_request() | ||
|             ): | ||
|                 handled_requests += 1 | ||
|                 url = request.url | ||
|                 depth = request.crawl_depth | ||
|                 Actor.log.info(f'Scraping {url} (depth={depth}) ...') | ||
|
|
||
|                 try: | ||
|                     # Fresh proxy URL per request (None if no proxy). | ||
|                     proxy_url = None | ||
|                     if proxy_configuration: | ||
|                         proxy_url = await proxy_configuration.new_url() | ||
|
|
||
|                     data, links = await scrape_page(crawler, url, proxy_url=proxy_url) | ||
|                     await Actor.push_data(data) | ||
|                     Actor.log.info( | ||
|                         f'Stored data from {url} ' | ||
|                         f'(title={data["title"]!r}, {len(links)} links found).' | ||
|                     ) | ||
|                     await enqueue_links( | ||
|                         request_queue, links, depth=depth, max_depth=max_depth | ||
|                     ) | ||
|
|
||
|                 except Exception: | ||
|                     Actor.log.exception(f'Cannot extract data from {url}.') | ||
|
|
||
|                 finally: | ||
|                     await request_queue.mark_request_as_handled(request) | ||
|
|
||
|
|
||
| if __name__ == '__main__': | ||
|     asyncio.run(main()) |
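One detail of the input handling above: `main` reads `start_url.get('url')` and passes the result straight to `Request.from_url`, so a `startUrls` entry without a `url` key would hand it `None`. A possible guard, sketched with the standard library only, is shown below. `normalize_start_urls` is a hypothetical helper that is not part of the pull request, and the input shape (a `startUrls` list of objects with a `url` field) is taken from the example's own default.

```python
from typing import Any


def normalize_start_urls(actor_input: dict[str, Any]) -> list[str]:
    """Return the usable start URLs from the Actor input, skipping bad entries."""
    urls = []
    for entry in actor_input.get('startUrls') or []:
        # Each entry is expected to look like {'url': 'https://...'}.
        url = entry.get('url') if isinstance(entry, dict) else None
        if isinstance(url, str) and url.strip():
            urls.append(url.strip())
    return urls


start_urls = normalize_start_urls(
    {'startUrls': [{'url': 'https://crawlee.dev'}, {}, {'url': '  '}, 'oops']}
)
```

Only the first entry survives in this call; the empty object, the blank URL, and the bare string are skipped instead of reaching the request queue.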