Commit 0287ec7

Merge remote-tracking branch 'origin/master' into docs/add-browser-use-guide
# Conflicts:
#	docs/01_introduction/quick-start.mdx
2 parents 0f93f94 + 03f97a3 commit 0287ec7

14 files changed

Lines changed: 668 additions & 37 deletions

CHANGELOG.md

Lines changed: 6 additions & 0 deletions
@@ -5,10 +5,16 @@ All notable changes to this project will be documented in this file.
 <!-- git-cliff-unreleased-start -->
 ## 3.4.2 - **not yet released**
 
+### 🐛 Bug Fixes
+
+- **scrapy:** Correct proxy middleware exception log and import ([#953](https://github.com/apify/apify-sdk-python/pull/953)) ([5bd6eb9](https://github.com/apify/apify-sdk-python/commit/5bd6eb9843d90844cec083372e932413bceedec9)) by [@vdusek](https://github.com/vdusek)
+- **scrapy:** Skip a request that fails to convert instead of crashing the run ([#952](https://github.com/apify/apify-sdk-python/pull/952)) ([db9444f](https://github.com/apify/apify-sdk-python/commit/db9444faeb0158c29aa394121cf733ff2e843f28)) by [@vdusek](https://github.com/vdusek)
+
 ### 🚜 Refactor
 
 - [**breaking**] Remove deprecated APIs ([#918](https://github.com/apify/apify-sdk-python/pull/918)) ([3e5728d](https://github.com/apify/apify-sdk-python/commit/3e5728d94cb8fd879d5a76e33a03d55792d835d5)) by [@vdusek](https://github.com/vdusek), closes [#635](https://github.com/apify/apify-sdk-python/issues/635)
 - [**breaking**] Mark secondary arguments as keyword-only ([#917](https://github.com/apify/apify-sdk-python/pull/917)) ([eb94c99](https://github.com/apify/apify-sdk-python/commit/eb94c992ec4aba1cd7cf4dfd7a98731cb304651b)) by [@vdusek](https://github.com/vdusek), closes [#881](https://github.com/apify/apify-sdk-python/issues/881)
+- [**breaking**] Adapt to apify-client v3 ([#719](https://github.com/apify/apify-sdk-python/pull/719)) ([10203bc](https://github.com/apify/apify-sdk-python/commit/10203bc51e67590c97938b37d81614376bc3d29a)) by [@vdusek](https://github.com/vdusek), closes [#697](https://github.com/apify/apify-sdk-python/issues/697), [#736](https://github.com/apify/apify-sdk-python/issues/736), [#770](https://github.com/apify/apify-sdk-python/issues/770), [#853](https://github.com/apify/apify-sdk-python/issues/853)
 
 ### ⚙️ Miscellaneous Tasks
 

docs/01_introduction/quick-start.mdx

Lines changed: 1 addition & 0 deletions
@@ -105,6 +105,7 @@ To see how you can integrate the Apify SDK with popular web scraping libraries,
 - [Selenium](../guides/selenium)
 - [Crawlee](../guides/crawlee)
 - [Scrapy](../guides/scrapy)
+- [Crawl4AI](../guides/crawl4ai)
 - [Browser Use](../guides/browser-use)
 - [Running webserver](../guides/running-webserver)
 - [uv](../guides/uv)

docs/03_guides/08_crawl4ai.mdx

Lines changed: 80 additions & 0 deletions
@@ -0,0 +1,80 @@
---
id: crawl4ai
title: LLM-ready scraping with Crawl4AI
description: Build an Apify Actor that scrapes web pages into LLM-ready Markdown using the Crawl4AI library.
---

import RunnableCodeBlock from '@site/src/components/RunnableCodeBlock';

import Crawl4aiExample from '!!raw-loader!roa-loader!./code/08_crawl4ai.py';

In this guide, you'll learn how to use the [Crawl4AI](https://crawl4ai.com/) library for LLM-ready web scraping in your Apify Actors.

## Introduction

[Crawl4AI](https://crawl4ai.com/) is an open-source, asynchronous web crawler built for LLM and AI workflows. It renders a page in a real browser and turns the result into clean, structured Markdown that you can feed into a language model or a retrieval-augmented generation (RAG) pipeline. It also gives you the raw HTML, extracted links, and media.

Crawl4AI is a great fit for Apify Actors:

- Crawl4AI converts each page into clean Markdown, stripping boilerplate and optionally filtering content, so the output can be fed straight into a language model.
- Pages are loaded in a [Playwright](https://playwright.dev/)-driven browser, so JavaScript-heavy and dynamically rendered websites work out of the box.
- Every crawl returns the page's links already split into `internal` and `external` groups, together with the media it found, which makes recursive crawling straightforward.
- Beyond Markdown, Crawl4AI can extract structured data with CSS/XPath schemas or with an LLM, all configured per request.
- The `AsyncWebCrawler` is built on `asyncio`, which integrates naturally with the asyncio-based Apify SDK.
- Each request can be routed through its own proxy, which pairs well with Apify Proxy and its rotating IP addresses.

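
To make the `internal`/`external` split mentioned in the list above concrete, here is a minimal standard-library sketch of how such a classification could work. It is not Crawl4AI's implementation: the `classify_links` helper and its rule (a link is internal when its host equals the page's host) are assumptions made for illustration, and Crawl4AI's own notion of "same site" may differ.

```python
from urllib.parse import urljoin, urlsplit


def classify_links(base_url: str, hrefs: list[str]) -> dict[str, list[str]]:
    """Split hrefs into same-host ('internal') and other-host ('external') URLs."""
    base_host = urlsplit(base_url).hostname
    groups: dict[str, list[str]] = {'internal': [], 'external': []}

    for href in hrefs:
        # Resolve relative links against the page they were found on.
        absolute = urljoin(base_url, href)
        parts = urlsplit(absolute)

        # Skip non-web schemes such as mailto: or javascript:.
        if parts.scheme not in ('http', 'https'):
            continue

        key = 'internal' if parts.hostname == base_host else 'external'
        groups[key].append(absolute)

    return groups


links = classify_links(
    'https://crawlee.dev/docs',
    ['/python', 'https://apify.com/store', 'mailto:hello@example.com'],
)
```

Here `links['internal']` holds the resolved `https://crawlee.dev/python`, `links['external']` holds the Apify URL, and the `mailto:` link is dropped. A recursive crawler that stays on one website would enqueue only the `internal` group.
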
Crawl4AI drives a real browser through Playwright. After installing the library, download the browser binaries once with the `crawl4ai-setup` command:

```bash
pip install crawl4ai
crawl4ai-setup
```

## Example Actor

The following Actor recursively crawls pages, starting from the URLs in the Actor input and following links up to a user-defined maximum depth. It uses Crawl4AI's `AsyncWebCrawler` to render each page through [Apify Proxy](https://docs.apify.com/platform/proxy), stores the page's Markdown in the dataset, and follows the internal links that Crawl4AI discovers.

The whole Actor fits in a single file. A `scrape_page` helper holds the Crawl4AI-specific crawling and parsing, while the `main` coroutine handles the [Actor](https://docs.apify.com/platform/actors) lifecycle, reads the input, sets up [Apify Proxy](https://docs.apify.com/platform/proxy) and the [request queue](https://docs.apify.com/platform/storage/request-queue), opens a single browser-backed crawler, and drives the crawl:

<RunnableCodeBlock className="language-python" language="python">
{Crawl4aiExample}
</RunnableCodeBlock>

Note that:

- A single `AsyncWebCrawler` is opened once and reused for every request. The crawler manages one browser instance, so reusing it across the whole crawl is cheaper than launching a new browser per page.
- Keeping the crawling and parsing in `scrape_page` separates the Crawl4AI-specific code from the Actor's orchestration logic. The function returns the extracted data together with the discovered links, so `main` decides what to store and what to enqueue.
- `result.markdown` is the rendered page as clean Markdown, and `result.metadata` carries page-level fields such as the title. This is the kind of output you need when preparing data for an LLM.
- `result.links` already separates `internal` (same-site) links from `external` ones. The example follows only the internal links to keep the crawl on the same website.
- `CacheMode.BYPASS` tells Crawl4AI to always fetch a fresh copy of the page instead of serving it from its local cache.

## Using Apify Proxy

Running on the Apify platform gives your scraper access to [Apify Proxy](https://docs.apify.com/platform/proxy), which rotates IP addresses to avoid rate limiting and blocking. In the example above, `main` creates a proxy configuration with `Actor.create_proxy_configuration` and passes a fresh proxy URL to `scrape_page` for every request, which forwards it to Crawl4AI's per-request `CrawlerRunConfig`.

`ProxyConfig.from_string` parses the proxy URL returned by `ProxyConfiguration.new_url` (for example `http://groups-RESIDENTIAL:<password>@proxy.apify.com:8000`) into the server, username, and password that the browser needs. The browser can't take the credentials embedded directly in the URL. To select specific proxy groups or a country, pass the relevant arguments to `Actor.create_proxy_configuration`. For details, see [Proxy management](../concepts/proxy-management).

## Running on the Apify platform

Because Crawl4AI renders pages in a real browser, the Actor image needs a browser and its system-level dependencies. Build on top of the [Apify Playwright base image](https://hub.docker.com/r/apify/actor-python-playwright), which already ships a browser. Crawl4AI reuses those binaries, so no separate browser-install step is required in the Dockerfile.

Pin the Python 3.13 variant of that image (for example `apify/actor-python-playwright:3.13-1.60.0`), because some of Crawl4AI's dependencies do not yet publish wheels for the newest Python versions, which would otherwise force a slow source build during the image build.

Add `apify` and `crawl4ai` to your `requirements.txt`:

```text
apify
crawl4ai
```

## Conclusion

In this guide, you learned how to use Crawl4AI in your Apify Actors. You can now render pages in a real browser, turn them into LLM-ready Markdown, follow the links Crawl4AI discovers, route requests through Apify Proxy, and run the whole thing on the Apify platform. To get started with your own scraping tasks, see the [Actor templates](https://apify.com/templates/categories/python). If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/apify-sdk-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping!

## Additional resources

- [Crawl4AI: Official documentation](https://docs.crawl4ai.com/)
- [Crawl4AI: AsyncWebCrawler and configuration](https://docs.crawl4ai.com/api/async-webcrawler/)
- [Crawl4AI: Proxy and security](https://docs.crawl4ai.com/advanced/proxy-security/)
- [Crawl4AI: GitHub repository](https://github.com/unclecode/crawl4ai)
- [Apify: Proxy management](https://docs.apify.com/platform/proxy)

docs/03_guides/code/08_crawl4ai.py

Lines changed: 124 additions & 0 deletions
@@ -0,0 +1,124 @@
import asyncio
from typing import Any

from crawl4ai import (
    AsyncWebCrawler,
    BrowserConfig,
    CacheMode,
    CrawlerRunConfig,
    ProxyConfig,
)

from apify import Actor, Request
from apify.storages import RequestQueue


async def scrape_page(
    crawler: AsyncWebCrawler,
    url: str,
    *,
    proxy_url: str | None = None,
) -> tuple[dict[str, Any], list[str]]:
    """Crawl a page with Crawl4AI and return its markdown and same-site links."""
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        proxy_config=ProxyConfig.from_string(proxy_url) if proxy_url else None,
    )

    result = await crawler.arun(url, config=run_config)
    if not result.success:
        raise RuntimeError(result.error_message or f'Failed to crawl {url}')

    data = {
        'url': result.url,
        'title': (result.metadata or {}).get('title'),
        'markdown': str(result.markdown),
    }

    # Crawl4AI already classifies links; follow only the internal ones.
    internal_links = result.links.get('internal', [])
    links = [link['href'] for link in internal_links if link.get('href')]

    return data, links


async def enqueue_links(
    request_queue: RequestQueue,
    links: list[str],
    *,
    depth: int,
    max_depth: int,
) -> None:
    """Enqueue the links one level deeper, unless max_depth was reached."""
    if depth >= max_depth:
        return

    for link_url in links:
        Actor.log.info(f'Enqueuing {link_url} ...')
        request = Request.from_url(link_url)
        request.crawl_depth = depth + 1
        await request_queue.add_request(request)


async def main() -> None:
    async with Actor:
        # Read the Actor input.
        actor_input = await Actor.get_input() or {}
        start_urls = actor_input.get('startUrls', [{'url': 'https://crawlee.dev'}])
        max_depth = actor_input.get('maxDepth', 1)

        if not start_urls:
            Actor.log.info('No start URLs specified in Actor input, exiting...')
            await Actor.exit()

        # Set up Apify Proxy and the request queue.
        proxy_configuration = await Actor.create_proxy_configuration()
        request_queue = await Actor.open_request_queue()

        # Enqueue the start URLs (crawl depth defaults to 0).
        for start_url in start_urls:
            url = start_url.get('url')
            Actor.log.info(f'Enqueuing start URL: {url}')
            await request_queue.add_request(Request.from_url(url))

        # Cap the crawl; raise or remove to follow more pages.
        max_requests = 50
        handled_requests = 0

        # Reuse one headless browser-backed crawler for every request.
        browser_config = BrowserConfig(headless=True)

        async with AsyncWebCrawler(config=browser_config) as crawler:
            while handled_requests < max_requests and (
                request := await request_queue.fetch_next_request()
            ):
                handled_requests += 1
                url = request.url
                depth = request.crawl_depth
                Actor.log.info(f'Scraping {url} (depth={depth}) ...')

                try:
                    # Fresh proxy URL per request (None if no proxy).
                    proxy_url = None
                    if proxy_configuration:
                        proxy_url = await proxy_configuration.new_url()

                    data, links = await scrape_page(crawler, url, proxy_url=proxy_url)
                    await Actor.push_data(data)
                    Actor.log.info(
                        f'Stored data from {url} '
                        f'(title={data["title"]!r}, {len(links)} links found).'
                    )
                    await enqueue_links(
                        request_queue, links, depth=depth, max_depth=max_depth
                    )

                except Exception:
                    Actor.log.exception(f'Cannot extract data from {url}.')

                finally:
                    await request_queue.mark_request_as_handled(request)


if __name__ == '__main__':
    asyncio.run(main())
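
The depth and request caps in `main` and `enqueue_links` can be exercised without a browser or the Apify platform. The sketch below is a simplified, synchronous stand-in, not part of the commit: it swaps the request queue for an in-memory `deque` and the scraper for a lookup in a hard-coded link graph. `FAKE_SITE`, `crawl`, and the explicit `seen` set are hypothetical names introduced only for this illustration.

```python
from collections import deque

# Hypothetical site: each URL maps to the internal links found on that page.
FAKE_SITE: dict[str, list[str]] = {
    'https://example.com/': ['https://example.com/a', 'https://example.com/b'],
    'https://example.com/a': ['https://example.com/c', 'https://example.com/'],
    'https://example.com/b': [],
    'https://example.com/c': ['https://example.com/d'],
    'https://example.com/d': [],
}


def crawl(start_url: str, *, max_depth: int, max_requests: int = 50) -> list[tuple[str, int]]:
    """Visit pages breadth-first, following links at most max_depth levels deep."""
    queue: deque[tuple[str, int]] = deque([(start_url, 0)])
    seen = {start_url}  # assumption: stands in for deduplication done by a real queue
    visited: list[tuple[str, int]] = []

    while queue and len(visited) < max_requests:
        url, depth = queue.popleft()
        visited.append((url, depth))

        # Same guard as enqueue_links: do not enqueue beyond max_depth.
        if depth >= max_depth:
            continue

        for link in FAKE_SITE.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))

    return visited


pages = crawl('https://example.com/', max_depth=2)
```

With `max_depth=2`, the page `https://example.com/d` is never visited because it sits three links away from the start URL, and the back-link from `/a` to `/` is ignored because the start URL was already seen.
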

src/apify/_charging.py

Lines changed: 70 additions & 24 deletions
@@ -7,7 +7,7 @@
 from decimal import Decimal
 from typing import TYPE_CHECKING, Annotated, Literal, Protocol, TypedDict
 
-from pydantic import BaseModel, ConfigDict, Field
+from pydantic import Field
 
 import apify_client._models as _client_models
 from apify_client._models import ActorChargeEvent as ClientActorChargeEvent
@@ -28,14 +28,17 @@
 
     from apify._configuration import Configuration
 
-PricingModel = Literal['PAY_PER_EVENT', 'PRICE_PER_DATASET_ITEM', 'FLAT_PRICE_PER_MONTH', 'FREE']
-"""Pricing model for an Actor."""
+charging_manager_ctx: ContextVar[ChargingManager | None] = ContextVar('charging_manager_ctx', default=None)
+"""Holds the current `ChargingManager` instance, if any.
+
+Allows PPE-aware dataset clients to access the charging manager without needing to pass it explicitly.
+"""
 
 DEFAULT_DATASET_ITEM_EVENT = 'apify-default-dataset-item'
+"""Name of the synthetic event charged for each item pushed to the default dataset."""
 
-# Context variable to hold the current `ChargingManager` instance, if any. This allows PPE-aware dataset clients to
-# access the charging manager without needing to pass it explicitly.
-charging_manager_ctx: ContextVar[ChargingManager | None] = ContextVar('charging_manager_ctx', default=None)
+PricingModel = Literal['PAY_PER_EVENT', 'PRICE_PER_DATASET_ITEM', 'FLAT_PRICE_PER_MONTH', 'FREE']
+"""Pricing model for an Actor."""
 
 _ensure_context = ensure_context('active')
 

@@ -49,48 +52,91 @@
 # `apify-client` instance) flows through the same code paths without conversion.
 
 
-class _RelaxedPricingMetadata(BaseModel):
-    """Mixin relaxing the `CommonActorPricingInfo` metadata fields the platform env var omits."""
-
-    model_config = ConfigDict(populate_by_name=True, extra='allow')
-
-    apify_margin_percentage: Annotated[float | None, Field(alias='apifyMarginPercentage')] = None
-    created_at: Annotated[datetime | None, Field(alias='createdAt')] = None
-    started_at: Annotated[datetime | None, Field(alias='startedAt')] = None
-
-
 @docs_group('Charging')
 class ActorChargeEvent(ClientActorChargeEvent):
-    # `event_description` is required in apify-client but omitted from the env var.
+    """Definition of a single chargeable event in the pay-per-event pricing model."""
+
     event_description: Annotated[str | None, Field(alias='eventDescription')] = None
+    """Human-readable description of the event.
+
+    Required in apify-client but omitted from the env var, so it is relaxed to optional.
+    """
 
 
 @docs_group('Charging')
 class PricingPerEvent(ClientPricingPerEvent):
+    """Pay-per-event pricing details - the chargeable events and their prices."""
+
     actor_charge_events: Annotated[dict[str, ActorChargeEvent] | None, Field(alias='actorChargeEvents')] = None
+    """Mapping of event name to its charge definition."""
 
 
 @docs_group('Charging')
-class FreeActorPricingInfo(_RelaxedPricingMetadata, ClientFree):
-    pass
+class FreeActorPricingInfo(ClientFree):
+    """Pricing info for an Actor offered free of charge."""
+
+    apify_margin_percentage: Annotated[float | None, Field(alias='apifyMarginPercentage')] = None
+    """Apify's margin on the price, as a percentage."""
+
+    created_at: Annotated[datetime | None, Field(alias='createdAt')] = None
+    """Timestamp when this pricing info was created."""
+
+    started_at: Annotated[datetime | None, Field(alias='startedAt')] = None
+    """Timestamp when this pricing became effective."""
 
 
 @docs_group('Charging')
-class FlatPricePerMonthActorPricingInfo(_RelaxedPricingMetadata, ClientFlatPricePerMonth):
+class FlatPricePerMonthActorPricingInfo(ClientFlatPricePerMonth):
+    """Pricing info for an Actor billed at a flat monthly price."""
+
+    apify_margin_percentage: Annotated[float | None, Field(alias='apifyMarginPercentage')] = None
+    """Apify's margin on the price, as a percentage."""
+
+    created_at: Annotated[datetime | None, Field(alias='createdAt')] = None
+    """Timestamp when this pricing info was created."""
+
+    started_at: Annotated[datetime | None, Field(alias='startedAt')] = None
+    """Timestamp when this pricing became effective."""
+
     trial_minutes: Annotated[int | None, Field(alias='trialMinutes')] = None
+    """Length of the free trial period, in minutes."""
+
     price_per_unit_usd: Annotated[float | None, Field(alias='pricePerUnitUsd')] = None
+    """Price per unit, in USD."""
 
 
 @docs_group('Charging')
-class PricePerDatasetItemActorPricingInfo(_RelaxedPricingMetadata, ClientPricePerDatasetItem):
+class PricePerDatasetItemActorPricingInfo(ClientPricePerDatasetItem):
+    """Pricing info for an Actor billed per dataset item produced."""
+
+    apify_margin_percentage: Annotated[float | None, Field(alias='apifyMarginPercentage')] = None
+    """Apify's margin on the price, as a percentage."""
+
+    created_at: Annotated[datetime | None, Field(alias='createdAt')] = None
+    """Timestamp when this pricing info was created."""
+
+    started_at: Annotated[datetime | None, Field(alias='startedAt')] = None
+    """Timestamp when this pricing became effective."""
+
     unit_name: Annotated[str | None, Field(alias='unitName')] = None
-    # `price_per_unit_usd` is already optional in apify-client - inherited.
+    """Name of the billed unit."""
 
 
 @docs_group('Charging')
-class PayPerEventActorPricingInfo(_RelaxedPricingMetadata, ClientPayPerEvent):
-    # Re-typed to the relaxed element so an omitted `eventDescription` validates; the field stays required.
+class PayPerEventActorPricingInfo(ClientPayPerEvent):
+    """Pricing info for an Actor billed per charged event."""
+
+    apify_margin_percentage: Annotated[float | None, Field(alias='apifyMarginPercentage')] = None
+    """Apify's margin on the price, as a percentage."""
+
+    created_at: Annotated[datetime | None, Field(alias='createdAt')] = None
+    """Timestamp when this pricing info was created."""
+
+    started_at: Annotated[datetime | None, Field(alias='startedAt')] = None
+    """Timestamp when this pricing became effective."""
+
     pricing_per_event: Annotated[PricingPerEvent, Field(alias='pricingPerEvent')]
+    """The pay-per-event pricing details."""
 
 
 ActorPricingInfoModel = ClientFree | ClientFlatPricePerMonth | ClientPricePerDatasetItem | ClientPayPerEvent
