Commit c95f8dc

Merge branch 'master' into docs/improve-guides
# Conflicts:
#	docs/01_introduction/quick-start.mdx
2 parents 4355f84 + e54d9f9 commit c95f8dc

27 files changed

Lines changed: 2203 additions & 125 deletions

CHANGELOG.md

Lines changed: 7 additions & 0 deletions
@@ -5,10 +5,17 @@ All notable changes to this project will be documented in this file.
55
<!-- git-cliff-unreleased-start -->
66
## 3.4.2 - **not yet released**
77

8+
### 🐛 Bug Fixes
9+
10+
- **scrapy:** Correct proxy middleware exception log and import ([#953](https://github.com/apify/apify-sdk-python/pull/953)) ([5bd6eb9](https://github.com/apify/apify-sdk-python/commit/5bd6eb9843d90844cec083372e932413bceedec9)) by [@vdusek](https://github.com/vdusek)
11+
- **scrapy:** Skip a request that fails to convert instead of crashing the run ([#952](https://github.com/apify/apify-sdk-python/pull/952)) ([db9444f](https://github.com/apify/apify-sdk-python/commit/db9444faeb0158c29aa394121cf733ff2e843f28)) by [@vdusek](https://github.com/vdusek)
12+
- **scrapy:** [**breaking**] Serialize requests and HTTP cache as JSON instead of pickle ([#951](https://github.com/apify/apify-sdk-python/pull/951)) ([a87e8d1](https://github.com/apify/apify-sdk-python/commit/a87e8d1597478b4f12fd5bb9b379f65f637d8e96)) by [@vdusek](https://github.com/vdusek)
13+
814
### 🚜 Refactor
915

1016
- [**breaking**] Remove deprecated APIs ([#918](https://github.com/apify/apify-sdk-python/pull/918)) ([3e5728d](https://github.com/apify/apify-sdk-python/commit/3e5728d94cb8fd879d5a76e33a03d55792d835d5)) by [@vdusek](https://github.com/vdusek), closes [#635](https://github.com/apify/apify-sdk-python/issues/635)
1117
- [**breaking**] Mark secondary arguments as keyword-only ([#917](https://github.com/apify/apify-sdk-python/pull/917)) ([eb94c99](https://github.com/apify/apify-sdk-python/commit/eb94c992ec4aba1cd7cf4dfd7a98731cb304651b)) by [@vdusek](https://github.com/vdusek), closes [#881](https://github.com/apify/apify-sdk-python/issues/881)
18+
- [**breaking**] Adapt to apify-client v3 ([#719](https://github.com/apify/apify-sdk-python/pull/719)) ([10203bc](https://github.com/apify/apify-sdk-python/commit/10203bc51e67590c97938b37d81614376bc3d29a)) by [@vdusek](https://github.com/vdusek), closes [#697](https://github.com/apify/apify-sdk-python/issues/697), [#736](https://github.com/apify/apify-sdk-python/issues/736), [#770](https://github.com/apify/apify-sdk-python/issues/770), [#853](https://github.com/apify/apify-sdk-python/issues/853)
1219

1320
### ⚙️ Miscellaneous Tasks
1421

docs/03_guides/06_scrapy.mdx

Lines changed: 1 addition & 1 deletion
@@ -46,7 +46,7 @@ The Apify SDK provides several custom components to support integration with the
4646
- <ApiLink to="class/ApifyScheduler">`apify.scrapy.ApifyScheduler`</ApiLink> - Replaces Scrapy's default [scheduler](https://docs.scrapy.org/en/latest/topics/scheduler.html) with one that uses Apify's [request queue](https://docs.apify.com/platform/storage/request-queue) for storing requests. It manages enqueuing, dequeuing, and maintaining the state and priority of requests.
4747
- <ApiLink to="class/ActorDatasetPushPipeline">`apify.scrapy.ActorDatasetPushPipeline`</ApiLink> - A Scrapy [item pipeline](https://docs.scrapy.org/en/latest/topics/item-pipeline.html) that pushes scraped items to Apify's [dataset](https://docs.apify.com/platform/storage/dataset). When enabled, every item produced by the spider is sent to the dataset.
4848
- <ApiLink to="class/ApifyHttpProxyMiddleware">`apify.scrapy.ApifyHttpProxyMiddleware`</ApiLink> - A Scrapy [middleware](https://docs.scrapy.org/en/latest/topics/downloader-middleware.html) that manages proxy configurations. This middleware replaces Scrapy's default `HttpProxyMiddleware` to facilitate the use of Apify's proxy service.
49-
- <ApiLink to="class/ApifyCacheStorage">`apify.scrapy.extensions.ApifyCacheStorage`</ApiLink> - A storage backend for Scrapy's built-in [HTTP cache middleware](https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#module-scrapy.downloadermiddlewares.httpcache). This backend uses Apify's [key-value store](https://docs.apify.com/platform/storage/key-value-store). Make sure to set `HTTPCACHE_ENABLED` and `HTTPCACHE_EXPIRATION_SECS` in your settings, or caching won't work.
49+
- <ApiLink to="class/ApifyCacheStorage">`apify.scrapy.extensions.ApifyCacheStorage`</ApiLink> - A storage backend for Scrapy's built-in [HTTP cache middleware](https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#module-scrapy.downloadermiddlewares.httpcache). This backend uses Apify's [key-value store](https://docs.apify.com/platform/storage/key-value-store). To enable caching, set `HTTPCACHE_ENABLED` and `HTTPCACHE_EXPIRATION_SECS` in your settings. By default, when the spider closes, up to 100 expired and unreadable entries per run are cleaned up. To change this number, update `APIFY_HTTPCACHE_EXPIRATION_MAX_ITEMS`.
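
As a rough illustration of the cache settings named above, a project's `settings.py` might contain something like the following. This is a sketch under assumptions, not the SDK's prescribed configuration: the numeric values are arbitrary examples, and `HTTPCACHE_STORAGE` is Scrapy's standard setting for selecting a cache backend, which your project may or may not need to set explicitly depending on how the Apify settings are applied.

```python
# settings.py (illustrative values only)

# Turn on Scrapy's built-in HTTP cache middleware.
HTTPCACHE_ENABLED = True

# Treat cached responses older than two hours as expired (example value).
HTTPCACHE_EXPIRATION_SECS = 7200

# Assumption: point Scrapy's standard storage setting at the Apify backend.
HTTPCACHE_STORAGE = "apify.scrapy.extensions.ApifyCacheStorage"

# Raise the per-run cleanup limit from the default of 100 (example value).
APIFY_HTTPCACHE_EXPIRATION_MAX_ITEMS = 500
```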
5050

5151
Additional helper functions in the [`apify.scrapy`](https://github.com/apify/apify-sdk-python/tree/master/src/apify/scrapy) subpackage include:
5252

docs/03_guides/08_crawl4ai.mdx

Lines changed: 80 additions & 0 deletions
@@ -0,0 +1,80 @@
1+
---
2+
id: crawl4ai
3+
title: LLM-ready scraping with Crawl4AI
4+
description: Build an Apify Actor that scrapes web pages into LLM-ready Markdown using the Crawl4AI library.
5+
---
6+
7+
import RunnableCodeBlock from '@site/src/components/RunnableCodeBlock';
8+
9+
import Crawl4aiExample from '!!raw-loader!roa-loader!./code/08_crawl4ai.py';
10+
11+
In this guide, you'll learn how to use the [Crawl4AI](https://crawl4ai.com/) library for LLM-ready web scraping in your Apify Actors.
12+
13+
## Introduction
14+
15+
[Crawl4AI](https://crawl4ai.com/) is an open-source, asynchronous web crawler built for LLM and AI workflows. It renders a page in a real browser and turns the result into clean, structured Markdown that you can feed into a language model or a retrieval-augmented generation (RAG) pipeline. It also gives you the raw HTML, extracted links, and media.
16+
17+
Crawl4AI is a great fit for Apify Actors:
18+
19+
- Crawl4AI converts each page into clean Markdown, stripping boilerplate and optionally filtering content, so the output can be fed straight into a language model.
20+
- Pages are loaded in a [Playwright](https://playwright.dev/)-driven browser, so JavaScript-heavy and dynamically rendered websites work out of the box.
21+
- Every crawl returns the page's links already split into `internal` and `external` groups, together with the media it found, which makes recursive crawling straightforward.
22+
- Beyond Markdown, Crawl4AI can extract structured data with CSS/XPath schemas or with an LLM, all configured per request.
23+
- The `AsyncWebCrawler` is built on `asyncio`, which integrates naturally with the asyncio-based Apify SDK.
24+
- Each request can be routed through its own proxy, which pairs well with Apify Proxy and its rotating IP addresses.
25+
26+
Crawl4AI drives a real browser through Playwright. After installing the library, download the browser binaries once with the `crawl4ai-setup` command:
27+
28+
```bash
29+
pip install crawl4ai
30+
crawl4ai-setup
31+
```
32+
33+
## Example Actor
34+
35+
The following Actor recursively crawls pages, starting from the URLs in the Actor input and following links up to a user-defined maximum depth. It uses Crawl4AI's `AsyncWebCrawler` to render each page through [Apify Proxy](https://docs.apify.com/platform/proxy), stores the page's Markdown in the dataset, and follows the internal links that Crawl4AI discovers.
36+
37+
The whole Actor fits in a single file. A `scrape_page` helper holds the Crawl4AI-specific crawling and parsing, while the `main` coroutine handles the [Actor](https://docs.apify.com/platform/actors) lifecycle, reads the input, sets up [Apify Proxy](https://docs.apify.com/platform/proxy) and the [request queue](https://docs.apify.com/platform/storage/request-queue), opens a single browser-backed crawler, and drives the crawl:
38+
39+
<RunnableCodeBlock className="language-python" language="python">
40+
{Crawl4aiExample}
41+
</RunnableCodeBlock>
42+
43+
Note that:
44+
45+
- A single `AsyncWebCrawler` is opened once and reused for every request. The crawler manages one browser instance, so reusing it across the whole crawl is cheaper than launching a new browser per page.
46+
- Keeping the crawling and parsing in `scrape_page` separates the Crawl4AI-specific code from the Actor's orchestration logic. The function returns the extracted data together with the discovered links, so `main` decides what to store and what to enqueue.
47+
- `result.markdown` is the rendered page as clean Markdown, and `result.metadata` carries page-level fields such as the title. This is the kind of output you need when preparing data for an LLM.
48+
- `result.links` already separates `internal` (same-site) links from `external` ones. The example follows only the internal links to keep the crawl on the same website.
49+
- `CacheMode.BYPASS` tells Crawl4AI to always fetch a fresh copy of the page instead of serving it from its local cache.
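
The depth-limited, internal-links-only crawl described in this section can be sketched without Crawl4AI or the Apify SDK. The snippet below is a simplified stand-in, not the code from the example Actor: `FAKE_SITE`, `internal_links`, and `crawl` are hypothetical names, the in-memory dictionary replaces real page fetches, and the `seen` set plays the role that request deduplication would play in a real crawl.

```python
from collections import deque
from urllib.parse import urljoin, urlparse

# Hypothetical in-memory "website": page URL -> hrefs found on that page.
FAKE_SITE = {
    "https://example.com/": ["/a", "/b", "https://other.example.org/x"],
    "https://example.com/a": ["/c", "/"],
    "https://example.com/b": [],
    "https://example.com/c": ["/d"],
    "https://example.com/d": [],
}


def internal_links(page_url, hrefs):
    """Resolve hrefs against page_url and keep only same-host links."""
    host = urlparse(page_url).netloc
    resolved = (urljoin(page_url, href) for href in hrefs)
    return [url for url in resolved if urlparse(url).netloc == host]


def crawl(start_url, max_depth):
    """Breadth-first crawl that stops following links at max_depth."""
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append((url, depth))
        if depth >= max_depth:
            continue  # Page is still recorded, but its links are not followed.
        for link in internal_links(url, FAKE_SITE.get(url, [])):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


pages = crawl("https://example.com/", max_depth=2)
```

With `max_depth=2`, the page at `/c` is visited but its link to `/d` is never enqueued, and the external link is dropped before it reaches the queue.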
50+
51+
## Using Apify Proxy
52+
53+
Running on the Apify platform gives your scraper access to [Apify Proxy](https://docs.apify.com/platform/proxy), which rotates IP addresses to avoid rate limiting and blocking. In the example above, `main` creates a proxy configuration with `Actor.create_proxy_configuration` and passes a fresh proxy URL to `scrape_page` for every request, which forwards it to Crawl4AI's per-request `CrawlerRunConfig`.
54+
55+
`ProxyConfig.from_string` parses the proxy URL returned by `ProxyConfiguration.new_url` (for example `http://groups-RESIDENTIAL:<password>@proxy.apify.com:8000`) into the server, username, and password that the browser needs. The browser can't take the credentials embedded directly in the URL. To select specific proxy groups or a country, pass the relevant arguments to `Actor.create_proxy_configuration`. For details, see [Proxy management](../concepts/proxy-management).
56+
57+
## Running on the Apify platform
58+
59+
Because Crawl4AI renders pages in a real browser, the Actor image needs a browser and its system-level dependencies. Build on top of the [Apify Playwright base image](https://hub.docker.com/r/apify/actor-python-playwright), which already ships a browser. Crawl4AI reuses those binaries, so no separate browser-install step is required in the Dockerfile.
60+
61+
Pin the Python 3.13 variant of that image (for example `apify/actor-python-playwright:3.13-1.60.0`), because some of Crawl4AI's dependencies do not yet publish wheels for the newest Python versions, which would otherwise force a slow source build during the image build.
62+
63+
Add `apify` and `crawl4ai` to your `requirements.txt`:
64+
65+
```text
66+
apify
67+
crawl4ai
68+
```
69+
70+
## Conclusion
71+
72+
In this guide, you learned how to use Crawl4AI in your Apify Actors. You can now render pages in a real browser, turn them into LLM-ready Markdown, follow the links Crawl4AI discovers, route requests through Apify Proxy, and run the whole thing on the Apify platform. To get started with your own scraping tasks, see the [Actor templates](https://apify.com/templates/categories/python). If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/apify-sdk-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping!
73+
74+
## Additional resources
75+
76+
- [Crawl4AI: Official documentation](https://docs.crawl4ai.com/)
77+
- [Crawl4AI: AsyncWebCrawler and configuration](https://docs.crawl4ai.com/api/async-webcrawler/)
78+
- [Crawl4AI: Proxy and security](https://docs.crawl4ai.com/advanced/proxy-security/)
79+
- [Crawl4AI: GitHub repository](https://github.com/unclecode/crawl4ai)
80+
- [Apify: Proxy management](https://docs.apify.com/platform/proxy)

docs/03_guides/09_browser_use.mdx

Lines changed: 92 additions & 0 deletions
@@ -0,0 +1,92 @@
1+
---
2+
id: browser-use
3+
title: Browser AI agents with Browser Use
4+
description: Build an Apify Actor that automates a browser with an LLM agent using the Browser Use library.
5+
---
6+
7+
import RunnableCodeBlock from '@site/src/components/RunnableCodeBlock';
8+
9+
import BrowserUseExample from '!!raw-loader!roa-loader!./code/09_browser_use.py';
10+
11+
In this guide, you'll learn how to use the [Browser Use](https://browser-use.com/) library to drive a browser with an LLM agent in your Apify Actors.
12+
13+
## Introduction
14+
15+
[Browser Use](https://browser-use.com/) is a Python library that lets an LLM control a real web browser. Instead of writing selectors and navigation steps by hand, you give an agent a natural-language task, such as "find the top post on Hacker News and return its title and URL". The agent then decides which pages to open, what to click, and what to read until the task is done.
16+
17+
Browser Use is a great fit for Apify Actors:
18+
19+
- Describe what you want in plain English and the agent figures out the steps. This is especially useful with pages whose structure changes often or is hard to target with fixed selectors.
20+
- Browser Use ships wrappers for many providers, for example `ChatOpenAI`, `ChatAnthropic`, or `ChatGoogle`. You can pick the model that fits your task and budget.
21+
- Pass a [Pydantic](https://docs.pydantic.dev/) model as the output schema and the agent returns a validated object that maps onto an Apify dataset.
22+
- The agent drives a real Chromium over the Chrome DevTools Protocol, so JavaScript-heavy pages render just like they would for a human.
23+
- The agent's `run` method is asynchronous, which integrates naturally with the asyncio-based Apify SDK.
24+
25+
Browser Use needs only the `browser-use` package. To install it, use:
26+
27+
```bash
28+
pip install browser-use
29+
```
30+
31+
## Configuring the LLM
32+
33+
Browser Use needs an LLM to drive the agent. You choose a provider wrapper, give it a model name, and supply the provider's API key:
34+
35+
- **`ChatOpenAI`** - OpenAI models such as `gpt-4.1-mini` or `gpt-5-mini`. Reads the key from `OPENAI_API_KEY`, or accepts it via the `api_key` argument.
36+
- **`ChatAnthropic`** - Anthropic Claude models such as `claude-sonnet-4-5` or `claude-haiku-4-5`. Reads the key from `ANTHROPIC_API_KEY`.
37+
- **`ChatGoogle`** - Google Gemini models such as `gemini-2.5-flash`. Reads the key from `GOOGLE_API_KEY`.
38+
39+
The example Actor in this guide uses `ChatOpenAI`, but switching providers is a one-line change in `run_agent_task`. More capable models generally complete tasks in fewer steps and more reliably, while smaller models are cheaper per step.
40+
41+
Keep the API key out of the Actor input and source code. The example reads it from an environment variable (for example `OPENAI_API_KEY`). On the Apify platform, set it as a [secret environment variable](https://docs.apify.com/platform/actors/development/programming-interface/environment-variables); locally, export it in your shell.
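
One way to fail fast when the key is missing is a small guard like the one below. It is a minimal sketch using only the standard library: `require_api_key` is a hypothetical helper, not part of Browser Use or the Apify SDK, and the optional `env` argument exists only to make the function easy to test.

```python
import os


def require_api_key(name="OPENAI_API_KEY", env=None):
    """Return the API key stored under `name`, or raise a clear error."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {name} environment variable before running.")
    return key
```

Calling `require_api_key()` at the start of a run surfaces a missing key before any browser is launched, rather than partway through the agent's task.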
42+
43+
## Example Actor
44+
45+
The following Actor runs a Browser Use agent for a single task and stores its structured result in the default dataset. By default, it opens [Hacker News](https://news.ycombinator.com) and returns the title and URL of the top five posts, but the task, model, and step limit are all configurable through the Actor input.
46+
47+
The whole Actor fits in a single file. A `run_agent_task` helper holds the Browser Use-specific logic: it defines the output schema and builds the LLM, browser, and agent. The `main` coroutine handles the [Actor](https://docs.apify.com/platform/actors) lifecycle, reads the input, sets up [Apify Proxy](https://docs.apify.com/platform/proxy), runs the agent, and stores the result:
48+
49+
<RunnableCodeBlock className="language-python" language="python">
50+
{BrowserUseExample}
51+
</RunnableCodeBlock>
52+
53+
Note that:
54+
55+
- Keeping the agent setup in `run_agent_task` separates the Browser Use-specific code from the Actor's orchestration logic. `main` only decides what to read from the input and what to store.
56+
- Passing `output_model_schema=Posts` makes the agent return a validated `Posts` instance via `history.structured_output`, so `main` can push each item straight to the dataset. Adapt the task and the `Post`/`Posts` models together to fit your own use case.
57+
- `enable_signal_handler=False` leaves signal handling to the Actor, which manages the run's lifecycle. Without it, Browser Use would install its own handlers and interfere with a clean shutdown.
58+
- `headless=Actor.configuration.headless` runs the browser without a visible window, which is what you want on the platform.
59+
60+
The example runs one agent per Actor run, so each browser profile stays isolated. If you parallelize tasks within a single Actor, give every agent its own `Browser` instance with its own `user_data_dir`. Several concurrent agents sharing one profile can corrupt it.
61+
62+
## Using Apify Proxy
63+
64+
Running on the Apify platform gives your agent access to [Apify Proxy](https://docs.apify.com/platform/proxy), which rotates IP addresses to avoid rate limiting and blocking. In the example above, `main` creates a proxy configuration with `Actor.create_proxy_configuration` and passes a fresh proxy URL to `run_agent_task`.
65+
66+
Browser Use expects the proxy as a `ProxySettings` object with separate `server`, `username`, and `password` fields, whereas `ProxyConfiguration.new_url` returns a single URL string (for example `http://user:pass@proxy.apify.com:8000`). The `_proxy_settings` helper splits that URL into the fields Browser Use expects. To select specific proxy groups or a country, pass the relevant arguments to `Actor.create_proxy_configuration`. For details, see [Proxy management](../concepts/proxy-management).
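
Splitting a proxy URL into those three fields can be done with the standard library alone. The function below is a sketch of the idea, not the `_proxy_settings` helper from the example: `split_proxy_url` is a hypothetical name, and it returns a plain dictionary rather than a `ProxySettings` object.

```python
from urllib.parse import unquote, urlsplit


def split_proxy_url(proxy_url):
    """Split a proxy URL into server, username, and password fields."""
    parts = urlsplit(proxy_url)
    # Rebuild the server address without the embedded credentials.
    server = f"{parts.scheme}://{parts.hostname}"
    if parts.port is not None:
        server += f":{parts.port}"
    return {
        "server": server,
        # Credentials may be percent-encoded inside a URL, so decode them.
        "username": unquote(parts.username) if parts.username else None,
        "password": unquote(parts.password) if parts.password else None,
    }


fields = split_proxy_url("http://user:pass@proxy.apify.com:8000")
```

Here `fields["server"]` holds the address without credentials, while the username and password are returned separately so they can be passed as distinct values.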
67+
68+
## Running on the Apify platform
69+
70+
Browser Use drives a real Chromium over CDP, so the Actor needs a browser binary available at runtime. The simplest way to provide one is to build on top of the [Apify Playwright base image](https://hub.docker.com/r/apify/actor-python-playwright), which already ships a browser together with all of its system-level dependencies. Browser Use discovers that browser automatically, so no extra install step is needed in the image.
71+
72+
To disable Browser Use's telemetry and cloud sync inside the Actor, set the `ANONYMIZED_TELEMETRY=false` and `BROWSER_USE_CLOUD_SYNC=false` environment variables in your Dockerfile.
73+
74+
When running the Actor locally, install the browser once with the `browser-use install` command, which downloads a Chromium build together with its dependencies:
75+
76+
```bash
77+
browser-use install
78+
```
79+
80+
Remember to provide the LLM API key in both environments: as a secret environment variable on the platform, and exported in your shell when running locally.
81+
82+
## Conclusion
83+
84+
In this guide, you learned how to use Browser Use in your Apify Actors. You can now drive a real browser with an LLM agent, return its results as a validated Pydantic model, route the browser through Apify Proxy, and run the whole thing on the Apify platform. To get started with your own automation tasks, see the [Actor templates](https://apify.com/templates/categories/python). If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/apify-sdk-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy automating!
85+
86+
## Additional resources
87+
88+
- [Browser Use: Official documentation](https://docs.browser-use.com/)
89+
- [Browser Use: Supported models](https://docs.browser-use.com/customize/supported-models)
90+
- [Browser Use: Structured output](https://docs.browser-use.com/customize/agent/output-format)
91+
- [Browser Use: GitHub repository](https://github.com/browser-use/browser-use)
92+
- [Apify: Proxy management](https://docs.apify.com/platform/proxy)
