Commit 437924e

address feedback
1 parent 360e0d0 commit 437924e

1 file changed

Lines changed: 17 additions & 17 deletions

docs/03_guides/08_crawl4ai.mdx

@@ -1,7 +1,7 @@
 ---
 id: crawl4ai
 title: LLM-ready scraping with Crawl4AI
-description: Build an Apify Actor that scrapes web pages into LLM-ready markdown using the Crawl4AI library.
+description: Build an Apify Actor that scrapes web pages into LLM-ready Markdown using the Crawl4AI library.
 ---
 
 import RunnableCodeBlock from '@site/src/components/RunnableCodeBlock';
@@ -12,18 +12,18 @@ In this guide, you'll learn how to use the [Crawl4AI](https://crawl4ai.com/) lib
 
 ## Introduction
 
-[Crawl4AI](https://crawl4ai.com/) is an open-source, asynchronous web crawler built for LLM and AI workflows. It renders a page in a real browser and turns the result into clean, structured markdown that's ready to feed into a language model or a retrieval-augmented generation (RAG) pipeline, while still giving you the raw HTML, extracted links, and media when you need them.
+[Crawl4AI](https://crawl4ai.com/) is an open-source, asynchronous web crawler built for LLM and AI workflows. It renders a page in a real browser and turns the result into clean, structured Markdown that you can feed into a language model or a retrieval-augmented generation (RAG) pipeline. It also gives you the raw HTML, extracted links, and media.
 
-Some of the features that make Crawl4AI a good fit for Apify Actors:
+Crawl4AI is a great fit for Apify Actors:
 
-- **LLM-ready markdown** - Crawl4AI converts each page into clean markdown, stripping boilerplate and optionally filtering content, so the output can be fed straight into a language model.
-- **Real browser rendering** - Pages are loaded in a [Playwright](https://playwright.dev/)-driven browser, so JavaScript-heavy and dynamically rendered websites work out of the box.
-- **Built-in link and media extraction** - Every crawl returns the page's links already split into `internal` and `external` groups, together with the media it found, which makes recursive crawling straightforward.
-- **Flexible extraction strategies** - Beyond markdown, Crawl4AI can extract structured data with CSS/XPath schemas or with an LLM, all configured per request.
-- **First-class async support** - The `AsyncWebCrawler` is built on `asyncio`, which integrates naturally with the asyncio-based Apify SDK.
-- **Per-request proxy** - Each request can be routed through its own proxy, which pairs well with Apify Proxy and its rotating IP addresses.
+- Crawl4AI converts each page into clean Markdown, stripping boilerplate and optionally filtering content, so the output can be fed straight into a language model.
+- Pages are loaded in a [Playwright](https://playwright.dev/)-driven browser, so JavaScript-heavy and dynamically rendered websites work out of the box.
+- Every crawl returns the page's links already split into `internal` and `external` groups, together with the media it found, which makes recursive crawling straightforward.
+- Beyond Markdown, Crawl4AI can extract structured data with CSS/XPath schemas or with an LLM, all configured per request.
+- The `AsyncWebCrawler` is built on `asyncio`, which integrates naturally with the asyncio-based Apify SDK.
+- Each request can be routed through its own proxy, which pairs well with Apify Proxy and its rotating IP addresses.
 
-Crawl4AI drives a real browser through Playwright, so after installing the library you need to download the browser binaries once with the `crawl4ai-setup` command:
+Crawl4AI drives a real browser through Playwright. After installing the library, download the browser binaries once with the `crawl4ai-setup` command:
 
 ```bash
 pip install crawl4ai
@@ -32,27 +32,27 @@ crawl4ai-setup
 
 ## Example Actor
 
-The following Actor recursively crawls pages, starting from the URLs in the Actor input and following links up to a user-defined maximum depth. It uses Crawl4AI's `AsyncWebCrawler` to render each page through [Apify Proxy](https://docs.apify.com/platform/proxy), stores the page's markdown in the dataset, and follows the internal links that Crawl4AI discovers.
+The following Actor recursively crawls pages, starting from the URLs in the Actor input and following links up to a user-defined maximum depth. It uses Crawl4AI's `AsyncWebCrawler` to render each page through [Apify Proxy](https://docs.apify.com/platform/proxy), stores the page's Markdown in the dataset, and follows the internal links that Crawl4AI discovers.
 
 The whole Actor fits in a single file. A `scrape_page` helper holds the Crawl4AI-specific crawling and parsing, while the `main` coroutine handles the [Actor](https://docs.apify.com/platform/actors) lifecycle, reads the input, sets up [Apify Proxy](https://docs.apify.com/platform/proxy) and the [request queue](https://docs.apify.com/platform/storage/request-queue), opens a single browser-backed crawler, and drives the crawl:
 
 <RunnableCodeBlock className="language-python" language="python">
 {Crawl4aiExample}
 </RunnableCodeBlock>
 
-A few things worth pointing out:
+Note that:
 
-- A single `AsyncWebCrawler` is opened once and reused for every request. The crawler manages one browser instance, so reusing it across the whole crawl is far cheaper than launching a new browser per page.
+- A single `AsyncWebCrawler` is opened once and reused for every request. The crawler manages one browser instance, so reusing it across the whole crawl is cheaper than launching a new browser per page.
 - Keeping the crawling and parsing in `scrape_page` separates the Crawl4AI-specific code from the Actor's orchestration logic. The function returns the extracted data together with the discovered links, so `main` decides what to store and what to enqueue.
-- `result.markdown` is the rendered page as clean markdown, and `result.metadata` carries page-level fields such as the title. This is exactly the kind of output you want when preparing data for an LLM.
-- `result.links` already separates `internal` (same-site) links from `external` ones, so the example follows only the internal links to keep the crawl on the same website.
+- `result.markdown` is the rendered page as clean Markdown, and `result.metadata` carries page-level fields such as the title. This is the kind of output you need when preparing data for an LLM.
+- `result.links` already separates `internal` (same-site) links from `external` ones. The example follows only the internal links to keep the crawl on the same website.
 - `CacheMode.BYPASS` tells Crawl4AI to always fetch a fresh copy of the page instead of serving it from its local cache.
 
 ## Using Apify Proxy
 
 Running on the Apify platform gives your scraper access to [Apify Proxy](https://docs.apify.com/platform/proxy), which rotates IP addresses to avoid rate limiting and blocking. In the example above, `main` creates a proxy configuration with `Actor.create_proxy_configuration` and passes a fresh proxy URL to `scrape_page` for every request, which forwards it to Crawl4AI's per-request `CrawlerRunConfig`.
 
-`ProxyConfig.from_string` parses the proxy URL returned by `ProxyConfiguration.new_url` (for example `http://groups-RESIDENTIAL:<password>@proxy.apify.com:8000`) into the server, username, and password that the browser needs. The browser cannot take the credentials embedded directly in the URL. To select specific proxy groups or a country, pass the relevant arguments to `Actor.create_proxy_configuration`. For more details, see the [Proxy management](../concepts/proxy-management) guide.
+`ProxyConfig.from_string` parses the proxy URL returned by `ProxyConfiguration.new_url` (for example `http://groups-RESIDENTIAL:<password>@proxy.apify.com:8000`) into the server, username, and password that the browser needs. The browser can't take the credentials embedded directly in the URL. To select specific proxy groups or a country, pass the relevant arguments to `Actor.create_proxy_configuration`. For details, see [Proxy management](../concepts/proxy-management).
 
 ## Running on the Apify platform
 
@@ -69,7 +69,7 @@ crawl4ai
 
 ## Conclusion
 
-In this guide, you learned how to use Crawl4AI in your Apify Actors. You can now render pages in a real browser, turn them into LLM-ready markdown, follow the links Crawl4AI discovers, route requests through Apify Proxy, and run the whole thing on the Apify platform. See the [Actor templates](https://apify.com/templates/categories/python) to get started with your own scraping tasks. If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/apify-sdk-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping!
+In this guide, you learned how to use Crawl4AI in your Apify Actors. You can now render pages in a real browser, turn them into LLM-ready Markdown, follow the links Crawl4AI discovers, route requests through Apify Proxy, and run the whole thing on the Apify platform. To get started with your own scraping tasks, see the [Actor templates](https://apify.com/templates/categories/python). If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/apify-sdk-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping!
 
 ## Additional resources
 
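The "Using Apify Proxy" paragraph in this diff says the proxy URL is split into the server, username, and password that the browser needs. As a rough illustration of that split only, here is a minimal sketch using Python's standard `urllib.parse`. It is not Crawl4AI's implementation of `ProxyConfig.from_string`. The helper name `split_proxy_url` and the password `secret` are hypothetical placeholders.

```python
from urllib.parse import urlsplit


def split_proxy_url(proxy_url: str) -> dict:
    """Hypothetical helper: split a proxy URL into server and credentials."""
    parts = urlsplit(proxy_url)
    # Rebuild the server address without the embedded credentials.
    server = f"{parts.scheme}://{parts.hostname}:{parts.port}"
    return {
        "server": server,
        "username": parts.username,
        "password": parts.password,
    }


# "secret" stands in for the real proxy password, which the source elides.
proxy_parts = split_proxy_url("http://groups-RESIDENTIAL:secret@proxy.apify.com:8000")
print(proxy_parts)
```

The sketch assumes the URL always carries an explicit port and credentials, as in the example URL from the guide. A URL without them would need extra handling.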