Commit b5f026f

docs: Add Scrapling guide (#938)
Adds a guide for the [Scrapling](https://scrapling.readthedocs.io/) adaptive web scraping library in Apify Actors, following the structure of the existing scraping-library guides.

- `docs/03_guides/07_scrapling.mdx`: the guide, covering introduction and features, choosing a fetcher (HTTP vs. browser-based), a runnable example Actor, Apify Proxy integration, and running browser fetchers (`DynamicFetcher`/`StealthyFetcher`) with the required `scrapling install` step in the Dockerfile.
- `code/07_scrapling.py`: runnable single-file example, a recursive title scraper using Scrapling's async HTTP `AsyncFetcher` through Apify Proxy. `code/07_scrapling_browser.py` shows the browser-based variant.
- Quick-start guides list updated.

Verified locally (`apify run`) and on the Apify platform (build + run SUCCEEDED, correct dataset output via Apify Proxy), including the browser path. Lint + type-check pass.

Closes: #836
1 parent e2034b0 commit b5f026f

10 files changed

Lines changed: 828 additions & 29 deletions

README.md

Lines changed: 5 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -92,9 +92,11 @@ To create, run, and deploy your first Actor step by step, see the [Quick start g
9292

9393
## What are Actors?
9494

95-
Actors are serverless cloud programs that can do almost anything a human can do in a web browser. They range from small tasks, such as filling in forms or unsubscribing from online services, all the way up to scraping and processing vast numbers of web pages.
95+
Actors are serverless programs that can do almost anything, from simple scripts and web scrapers to complex automation workflows, AI agents, or even always-on services that expose HTTP endpoints.
9696

97-
They run either locally or on the [Apify platform](https://docs.apify.com/platform/), where you can run them at scale, monitor them, schedule them, or publish and monetize them. If you're new to Apify, learn [what Apify is](https://docs.apify.com/platform/about) in the platform documentation.
97+
They can run either locally or on the Apify platform, where you can scale their execution, monitor runs, schedule tasks, integrate them with other services, or even publish and monetize them. If you're new to Apify, learn more about the platform in the [Apify documentation](https://docs.apify.com/platform/about).
98+
99+
For more context, read the [Actor whitepaper](https://whitepaper.actor/).
98100
99101
## Features
100102
@@ -197,7 +199,7 @@ The full SDK documentation lives at **[docs.apify.com/sdk/python](https://docs.a
197199
| [Overview](https://docs.apify.com/sdk/python/docs/overview) | What the SDK is, what Actors are, and how the pieces fit together. |
198200
| [Quick start](https://docs.apify.com/sdk/python/docs/quick-start) | Create, run, and deploy your first Python Actor. |
199201
| [Concepts](https://docs.apify.com/sdk/python/docs/concepts/actor-lifecycle) | Actor lifecycle, input, storages, events, proxy management, interacting with other Actors, webhooks, accessing the Apify API, logging, configuration, and pay-per-event. |
200-
| [Guides](https://docs.apify.com/sdk/python/docs/guides/beautifulsoup-httpx) | Integrations with BeautifulSoup, Parsel, Playwright, Selenium, Crawlee, Scrapy, Crawl4AI, and Browser Use, plus running a web server and using uv. |
202+
| [Guides](https://docs.apify.com/sdk/python/docs/guides/beautifulsoup-httpx) | Integrations with BeautifulSoup, Parsel, Playwright, Selenium, Crawlee, Scrapy, Scrapling, Crawl4AI, and Browser Use, plus running a web server and using uv. |
201203
| [Upgrading](https://docs.apify.com/sdk/python/docs/upgrading/upgrading-to-v4) | Migrating between major versions. |
202204
| [API reference](https://docs.apify.com/sdk/python/reference) | Generated reference for every class and method. |
203205
| [Changelog](https://docs.apify.com/sdk/python/docs/changelog) | Release history and breaking changes. |

docs/01_introduction/index.mdx

Lines changed: 29 additions & 13 deletions
@@ -9,26 +9,42 @@ import CodeBlock from '@theme/CodeBlock';
99

1010
import IntroductionExample from '!!raw-loader!./code/01_introduction.py';
1111

12-
The Apify SDK for Python is the official library for creating [Apify Actors](https://docs.apify.com/platform/actors) in Python. It provides everything you need to build an Actor and run it both locally and on the [Apify platform](https://docs.apify.com/platform). With the SDK, you can:
13-
14-
- Manage the Actor lifecycle: initialization, graceful shutdown, status messages, rebooting, and metamorphing.
15-
- Work with datasets, key-value stores, and request queues, with automatic local emulation when running outside the platform.
16-
- Read the Actor input, including automatic decryption of secret fields.
17-
- React to platform events (system info, migration, abort) and persist state across migrations and restarts.
18-
- Manage proxies, both [Apify Proxy](https://docs.apify.com/platform/proxy) and your own, with session and tiered-proxy support.
19-
- Start, call, and abort Actors and tasks, create webhooks, and reach the full Apify API client.
20-
- Charge users with the pay-per-event pricing model.
21-
- Integrate with [Crawlee](../guides/crawlee) and [Scrapy](../guides/scrapy), with guides for [Playwright](../guides/playwright) and others.
12+
The Apify SDK for Python is the official library for creating [Apify Actors](https://docs.apify.com/platform/actors) in Python. It provides everything you need to build an Actor and run it both locally and on the [Apify platform](https://docs.apify.com/platform). It handles the Actor lifecycle, [storage](https://docs.apify.com/platform/storage) access, platform events, [Apify Proxy](https://docs.apify.com/platform/proxy), pay-per-event charging, and more.
2213

2314
<CodeBlock className="language-python">
2415
{IntroductionExample}
2516
</CodeBlock>
2617

27-
## What are Actors
18+
## What are Actors?
19+
20+
Actors are serverless programs that can do almost anything, from simple scripts and web scrapers to complex automation workflows, AI agents, or even always-on services that expose HTTP endpoints.
21+
22+
They can run either locally or on the Apify platform, where you can scale their execution, monitor runs, schedule tasks, integrate them with other services, or even publish and monetize them. If you're new to Apify, learn more about the platform in the [Apify documentation](https://docs.apify.com/platform/about).
23+
24+
For more context, read the [Actor whitepaper](https://whitepaper.actor/).
25+
26+
## Features
27+
28+
- Run the full Actor lifecycle inside `async with Actor:`, covering init, exit, failures, status messages, rebooting, and metamorphing ([Actor lifecycle](../concepts/actor-lifecycle)).
29+
- Read Actor input validated against your input schema with `Actor.get_input()`, including automatic decryption of secret fields ([Actor input](../concepts/actor-input)).
30+
- Read and write datasets, key-value stores, and request queues, locally or on the platform ([Working with storages](../concepts/storages)).
31+
- React to platform events such as system info, migration, and abort, and persist state across migrations and restarts ([Actor events](../concepts/actor-events)).
32+
- Route requests through Apify Proxy with group selection, country targeting, and rotation, with session and tiered-proxy support ([Proxy management](../concepts/proxy-management)).
33+
- Start, call, and abort other Actors and tasks, and attach webhooks to run events ([Interacting with other Actors](../concepts/interacting-with-other-actors), [Webhooks](../concepts/webhooks)).
34+
- Monetize your Actor with pay-per-event charging ([Pay-per-event](../concepts/pay-per-event)).
35+
- Reach the full [Apify API](https://docs.apify.com/api/v2) through a preconfigured `ApifyClient` ([Accessing the Apify API](../concepts/access-apify-api)).
36+
37+
## What you can build
38+
39+
Almost any Python project can become an Actor, including projects for:
2840

29-
Actors are serverless cloud programs capable of performing tasks in a web browser, similar to what a human can do. These tasks can range from simple operations, such as filling out forms or unsubscribing from services, to complex jobs like scraping and processing large numbers of web pages.
41+
- **Web scraping and crawling** - The SDK is fully compatible with [Crawlee](https://crawlee.dev/python), which makes Apify a natural place to deploy and scale your crawlers (see the [Crawlee guide](../guides/crawlee)). It also works with other popular scraping libraries, such as [Scrapy](../guides/scrapy), [Scrapling](../guides/scrapling), or [Crawl4AI](../guides/crawl4ai).
42+
- **Browser automation** - Drive a real browser with [Playwright](../guides/playwright) or [Selenium](../guides/selenium), or with higher-level tools such as [Browser Use](../guides/browser-use).
43+
- **Web servers and APIs** - Run a [web server](../guides/running-webserver) inside an Actor to serve HTTP requests, for example to expose your scraper as a live API.
44+
- **AI agents** - Host agents built with your framework of choice. Ready-made Actor templates cover [PydanticAI](https://apify.com/templates/python-pydanticai), [CrewAI](https://apify.com/templates/python-crewai), [LangGraph](https://apify.com/templates/python-langgraph), [LlamaIndex](https://apify.com/templates/python-llamaindex-agent), and [Smolagents](https://apify.com/templates/python-smolagents).
45+
- **MCP servers** - Deploy a Python MCP server as an Actor and make its tools available to any MCP client. See the [MCP server](https://apify.com/templates/python-mcp-empty) and [MCP proxy](https://apify.com/templates/python-mcp-proxy) templates.
3046

31-
Actors can be executed locally or on the [Apify platform](https://docs.apify.com/platform). The Apify platform lets you run Actors at scale and provides features for monitoring, scheduling, publishing, and monetizing them.
47+
Whatever you build, the Apify SDK doesn't lock you into a particular framework. Bring the libraries you already use, and let Apify run your project in the cloud.
3248

3349
## Quick start
3450

docs/03_guides/07_scrapling.mdx

Lines changed: 141 additions & 0 deletions
@@ -0,0 +1,141 @@
1+
---
2+
id: scrapling
3+
title: Adaptive scraping with Scrapling
4+
description: Build an Apify Actor that scrapes web pages using the Scrapling adaptive web scraping library.
5+
---
6+
7+
import CodeBlock from '@theme/CodeBlock';
8+
import RunnableCodeBlock from '@site/src/components/RunnableCodeBlock';
9+
10+
import ScraplingExample from '!!raw-loader!roa-loader!./code/07_scrapling.py';
11+
import ScraplingBrowserScraper from '!!raw-loader!./code/07_scrapling_browser.py';
12+
13+
In this guide, you'll learn how to use the [Scrapling](https://scrapling.readthedocs.io/) library for adaptive web scraping in your Apify Actors.
14+
15+
## Introduction
16+
17+
[Scrapling](https://scrapling.readthedocs.io/) is an adaptive web scraping library for Python that combines fetching and parsing behind a single, high-level API. It can fetch a page with fast HTTP requests or with a real browser, parse the result with familiar CSS selectors and XPath, and relocate your selectors automatically when a website's structure changes.
18+
19+
Scrapling is a great fit for Apify Actors:
20+
21+
- A single API exposes a fast HTTP client with browser TLS-fingerprint impersonation, as well as full browser automation for JavaScript-heavy or protected pages.
22+
- Scrapling can remember the elements you scraped and find them again after a website redesign. Your scrapers keep working with fewer manual fixes.
23+
- Built-in stealth features (browser impersonation, realistic headers, and automatic Cloudflare Turnstile solving with the browser fetchers) help you avoid being blocked.
24+
- Elements are selected with CSS selectors (including the `::text` and `::attr()` pseudo-elements) or XPath, with a Scrapy/Parsel-like `.get()` and `.getall()` interface.
25+
- Every fetcher has an asynchronous variant, which integrates naturally with the asyncio-based Apify SDK.
26+
27+
Scrapling's parser works on its own, while the HTTP and browser fetchers ship as an optional extra. To get them, install Scrapling with the `fetchers` extra:
28+
29+
```bash
30+
pip install "scrapling[fetchers]"
31+
```
32+
33+
## Choosing a fetcher
34+
35+
All of Scrapling's fetchers are importable from `scrapling.fetchers`. Pick the one that matches the website you're scraping:
36+
37+
- **`Fetcher` / `AsyncFetcher`** - Plain HTTP requests via `.get()`, `.post()`, `.put()`, and `.delete()`. Fast and lightweight, with optional browser TLS-fingerprint impersonation (`impersonate`) and realistic headers (`stealthy_headers`). This is the best choice for static pages and APIs, and it doesn't need browser binaries.
38+
- **`DynamicFetcher` / `DynamicSession`** - Full browser automation based on [Playwright](https://playwright.dev/), for pages that require JavaScript rendering or interaction. Fetch a page with `.fetch()` or its async variant `.async_fetch()`.
39+
- **`StealthyFetcher` / `StealthySession`** - A stealth-hardened browser fetcher that can automatically solve Cloudflare Turnstile challenges (`solve_cloudflare=True`). Use it for the most heavily protected websites.
40+
41+
The returned `Response` object is also a Scrapling selector, so you can call `.css()`, `.xpath()`, `.find_all()`, and the other parsing methods on it directly.
42+
43+
The HTTP fetchers work with just the `scrapling[fetchers]` extra. The browser-based fetchers (`DynamicFetcher` and `StealthyFetcher`) additionally need browser binaries, which you download with the `scrapling install` command. See [Running browser-based fetchers](#running-browser-based-fetchers).
44+
45+
The example Actor in this guide uses the HTTP `AsyncFetcher`, which is the simplest to deploy and pairs well with Apify Proxy.
46+
47+
## Example Actor
48+
49+
The following Actor recursively scrapes data from linked pages on the same site, up to a user-defined maximum depth, starting from the URLs in the Actor input. It uses Scrapling's `AsyncFetcher` to fetch each page through [Apify Proxy](https://docs.apify.com/platform/proxy), and CSS selectors to extract the title, headings, and links.
50+
51+
The whole Actor fits in a single file. A `scrape_page` helper holds the Scrapling-specific fetching and parsing, while the `main` coroutine handles the [Actor](https://docs.apify.com/platform/actors) lifecycle, reads the input, sets up [Apify Proxy](https://docs.apify.com/platform/proxy) and the [request queue](https://docs.apify.com/platform/storage/request-queue), and drives the crawl:
52+
53+
<RunnableCodeBlock className="language-python" language="python">
54+
{ScraplingExample}
55+
</RunnableCodeBlock>
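
The depth-limited crawl described above can be sketched without Scrapling or the Apify request queue. This is a minimal illustration under stated assumptions, not the example Actor itself: the `crawl` function and its `get_links` callback are hypothetical names, a `deque` stands in for the request queue, and the start URLs are assumed to sit at depth 0.

```python
from collections import deque
from collections.abc import Callable


def crawl(
    start_urls: list[str],
    max_depth: int,
    get_links: Callable[[str], list[str]],
) -> list[tuple[str, int]]:
    """Visit pages breadth-first and stop following links at max_depth."""
    # Each queue item carries its own depth, like user data on a request.
    queue = deque((url, 0) for url in start_urls)
    seen = set(start_urls)
    visited: list[tuple[str, int]] = []

    while queue:
        url, depth = queue.popleft()
        visited.append((url, depth))

        # Pages at the maximum depth are scraped, but their links are not enqueued.
        if depth >= max_depth:
            continue

        for link in get_links(url):
            if link not in seen:  # hypothetical in-memory deduplication
                seen.add(link)
                queue.append((link, depth + 1))

    return visited
```

In a real Actor, `get_links` would be replaced by the fetch-and-parse helper, and the queue by the Apify request queue, which is expected to handle deduplication on its own.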
56+
57+
Note that:
58+
59+
- Keeping the fetching and parsing in `scrape_page` separates the Scrapling-specific code from the Actor's orchestration logic. The function returns the extracted data together with the discovered links, so `main` decides what to store and what to enqueue.
60+
- The response of `AsyncFetcher.get` is a Scrapling selector, so `response.css('title::text').get()` reads the page title and `response.css('a::attr(href)').getall()` returns every link's `href` in one call.
61+
- `response.urljoin(link_href)` resolves relative links against the page URL, so you can enqueue them directly.
62+
- The `impersonate='chrome'` and `stealthy_headers=True` options make the request look like it comes from a real Chrome browser. Combined with Apify Proxy, they reduce the chance of being blocked.
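
The link handling mentioned in the notes above can be illustrated with the standard library alone. This sketch assumes that `response.urljoin` resolves links the same way as `urllib.parse.urljoin`. The `same_site_links` helper is a hypothetical name and is not part of the example Actor.

```python
from urllib.parse import urljoin, urlsplit


def same_site_links(page_url: str, hrefs: list[str]) -> list[str]:
    """Resolve hrefs against page_url and keep unique http(s) links on the same host."""
    page_host = urlsplit(page_url).netloc
    seen: set[str] = set()
    links: list[str] = []

    for href in hrefs:
        # Turn relative links such as "page" or "/about" into absolute URLs.
        absolute = urljoin(page_url, href)
        parts = urlsplit(absolute)

        # Skip mailto:, javascript:, and links that leave the site.
        if parts.scheme not in ('http', 'https') or parts.netloc != page_host:
            continue

        # Drop the fragment so "page" and "page#section" count as one URL.
        normalized = parts._replace(fragment='').geturl()
        if normalized not in seen:
            seen.add(normalized)
            links.append(normalized)

    return links
```

Filtering before enqueueing keeps the crawl on the starting site and avoids fetching the same page twice under different fragments.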
63+
64+
## Adaptive selectors
65+
66+
The example above uses plain CSS selectors. Scrapling can also track the elements you scrape and relocate them when a website changes its markup, so a redesign doesn't immediately break your scraper. This is most useful for scrapers that revisit the same pages over time, rather than one-off crawls.
67+
68+
1. Enable adaptive matching once on the fetcher:
69+
70+
```python
71+
AsyncFetcher.configure(adaptive=True)
72+
```
73+
74+
2. On the first run, pass `auto_save=True` when you select an element. Scrapling records a fingerprint of that element, keyed by the selector:
75+
76+
```python
77+
title = response.css('h1.product-title::text', auto_save=True).get()
78+
```
79+
80+
3. On a later run, if the selector no longer matches because the page changed, pass `adaptive=True` with the same selector. Scrapling uses the saved fingerprint to find the element in its new location:
81+
82+
```python
83+
title = response.css('h1.product-title::text', adaptive=True).get()
84+
```
85+
86+
Scrapling keeps these fingerprints in a local SQLite database. On the Apify platform the Actor's filesystem doesn't persist between runs, so to keep them across runs, store that database in a [key-value store](https://docs.apify.com/platform/storage/key-value-store) and restore it on startup. For details, see [Scrapling's adaptive parsing documentation](https://scrapling.readthedocs.io/en/latest/parsing/adaptive.html).
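
One possible way to carry that database across runs is sketched below. It relies on assumptions that you should verify against your setup: `DB_PATH` and `DB_KEY` are hypothetical names, the actual location of Scrapling's SQLite file depends on how its storage is configured, and `store` is assumed to expose async `get_value` and `set_value` methods like the Apify SDK's key-value store.

```python
from pathlib import Path
from typing import Any

# Hypothetical values: point DB_PATH at the file Scrapling is configured to use.
DB_PATH = Path('storage_cache/scrapling_elements.db')
DB_KEY = 'SCRAPLING_ADAPTIVE_DB'


async def restore_db(store: Any, path: Path = DB_PATH, key: str = DB_KEY) -> bool:
    """Write the saved database to disk. Return False if nothing was stored yet."""
    data = await store.get_value(key)
    if not isinstance(data, (bytes, bytearray)):
        return False
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(bytes(data))
    return True


async def save_db(store: Any, path: Path = DB_PATH, key: str = DB_KEY) -> bool:
    """Upload the database file as a binary record. Return False if it doesn't exist."""
    if not path.exists():
        return False
    await store.set_value(key, path.read_bytes(), content_type='application/octet-stream')
    return True
```

Inside an Actor, you would likely obtain `store` with `await Actor.open_key_value_store()`, call `restore_db` before the first fetch, and call `save_db` once scraping has finished and the database file is no longer being written.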
87+
88+
## Using Apify Proxy
89+
90+
Running on the Apify platform gives your scraper access to [Apify Proxy](https://docs.apify.com/platform/proxy), which rotates IP addresses to avoid rate limiting and blocking. In the example above, `main` creates a proxy configuration with `Actor.create_proxy_configuration` and passes a fresh proxy URL to `scrape_page` for every request. `scrape_page` then forwards that URL to Scrapling's `proxy` argument.
91+
92+
Scrapling accepts the proxy as a URL string (for example `http://user:pass@proxy.apify.com:8000`), which is what `ProxyConfiguration.new_url` returns. To select specific proxy groups or a country, pass the relevant arguments to `Actor.create_proxy_configuration`. For details, see [Proxy management](../concepts/proxy-management). The browser-based fetchers accept the same `proxy` argument.
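
Because the proxy URL embeds credentials, it can be worth masking the password before the URL reaches a log message. The helper below is a small standard-library sketch. `mask_proxy_url` is a hypothetical name and is not part of Scrapling or the Apify SDK, and the sketch does not handle IPv6 literal hosts.

```python
from urllib.parse import urlsplit


def mask_proxy_url(proxy_url: str) -> str:
    """Return the proxy URL with its password replaced by '***'."""
    parts = urlsplit(proxy_url)
    if parts.password is None:
        # No credentials in the URL, so there is nothing to hide.
        return proxy_url

    host = parts.hostname or ''
    if parts.port is not None:
        host = f'{host}:{parts.port}'

    # Rebuild only the network location and keep the rest of the URL as it was.
    user = parts.username or ''
    return parts._replace(netloc=f'{user}:***@{host}').geturl()
```

For example, `mask_proxy_url('http://user:pass@proxy.apify.com:8000')` yields `http://user:***@proxy.apify.com:8000`, which is safe to include in a status or debug message.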
93+
94+
## Running browser-based fetchers
95+
96+
`DynamicFetcher` and `StealthyFetcher` drive a real browser, so they need the browser binaries installed with the `scrapling install` command. Locally, run it once after installing the `scrapling[fetchers]` extra:
97+
98+
```bash
99+
scrapling install
100+
```
101+
102+
To switch the example from HTTP to a real browser, fetch each page through a browser session instead of `AsyncFetcher`. Opening a fresh browser for every page would be wasteful, so `main` enters an `AsyncDynamicSession` once and reuses it for the whole crawl, while `scrape_page` fetches with `session.fetch`. The parsing API is identical, so the extraction code stays the same:
103+
104+
<CodeBlock className="language-python">
105+
{ScraplingBrowserScraper}
106+
</CodeBlock>
107+
108+
Note that:
109+
110+
- `AsyncDynamicSession` launches one browser and keeps it open across `session.fetch` calls, so the crawl doesn't pay the browser-startup cost on every page.
111+
- The proxy URL is passed per fetch, so each page can go through a fresh Apify Proxy IP while sharing the same browser.
112+
113+
To run this on the Apify platform, build on top of the [Apify Playwright base image](https://hub.docker.com/r/apify/actor-python-playwright), which already ships a browser together with all of its system-level dependencies. Then run `scrapling install` during the Docker build to download the browser binaries that Scrapling expects:
114+
115+
```docker title="Dockerfile"
116+
FROM apify/actor-python-playwright:3.14
117+
118+
# Install the Actor's Python dependencies.
119+
COPY requirements.txt ./
120+
RUN pip install -r requirements.txt
121+
122+
# Download the browser binaries that Scrapling's browser fetchers need.
123+
RUN scrapling install
124+
125+
# Copy in the source code and launch the Actor as a module.
126+
COPY . ./
127+
CMD ["python", "-m", "src"]
128+
```
129+
130+
## Conclusion
131+
132+
In this guide, you learned how to use Scrapling in your Apify Actors. You can now fetch pages with Scrapling's HTTP or browser-based fetchers, extract data with its CSS and XPath selectors, route requests through Apify Proxy, and run the whole thing on the Apify platform. To get started with your own scraping tasks, see the [Actor templates](https://apify.com/templates/categories/python). If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/apify-sdk-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping!
133+
134+
## Additional resources
135+
136+
- [Scrapling: Official documentation](https://scrapling.readthedocs.io/)
137+
- [Scrapling: Fetchers](https://scrapling.readthedocs.io/en/latest/fetching/choosing/)
138+
- [Scrapling: Parsing and selecting elements](https://scrapling.readthedocs.io/en/latest/parsing/selection/)
139+
- [Scrapling: Adaptive parsing](https://scrapling.readthedocs.io/en/latest/parsing/adaptive.html)
140+
- [Scrapling: GitHub repository](https://github.com/D4Vinci/Scrapling)
141+
- [Apify: Proxy management](https://docs.apify.com/platform/proxy)
