Commit 10c8199

docs: retitle and unify existing guides; move web server guide to 12
1 parent 65f8e0d commit 10c8199

21 files changed

Lines changed: 558 additions & 314 deletions

docs/01_introduction/quick-start.mdx

Lines changed: 17 additions & 9 deletions
Original file line number | Diff line number | Diff line change
@@ -67,7 +67,7 @@ The Actor's source code is in the `src` folder. This folder contains two importa
6767
{MainExample}
6868
</CodeBlock>
6969
</TabItem>
70-
<TabItem value="__main__.py" label="__main.py__">
70+
<TabItem value="__main__.py" label="__main__.py">
7171
<CodeBlock className="language-python">
7272
{UnderscoreMainExample}
7373
</CodeBlock>
@@ -97,12 +97,20 @@ To learn more about the features of the Apify SDK and how to use them, check out
9797

9898
### Guides
9999

100-
To see how you can integrate the Apify SDK with popular web scraping libraries, check out our guides:
100+
To see how you can integrate the Apify SDK with popular scraping libraries and frameworks, check out these guides:
101101

102-
- [BeautifulSoup with HTTPX](../guides/beautifulsoup-httpx)
103-
- [Parsel with Impit](../guides/parsel-impit)
104-
- [Playwright](../guides/playwright)
105-
- [Selenium](../guides/selenium)
106-
- [Crawlee](../guides/crawlee)
107-
- [Scrapy](../guides/scrapy)
108-
- [Running webserver](../guides/running-webserver)
102+
- [Scraping with BeautifulSoup and HTTPX](../guides/beautifulsoup-httpx)
103+
- [Scraping with Parsel and Impit](../guides/parsel-impit)
104+
- [Browser automation with Playwright](../guides/playwright)
105+
- [Browser automation with Selenium](../guides/selenium)
106+
- [Building crawlers with Crawlee](../guides/crawlee)
107+
- [Building crawlers with Scrapy](../guides/scrapy)
108+
- [Adaptive scraping with Scrapling](../guides/scrapling)
109+
- [LLM-ready scraping with Crawl4AI](../guides/crawl4ai)
110+
- [Browser AI agents with Browser Use](../guides/browser-use)
111+
112+
For other aspects of Actor development, explore these guides:
113+
114+
- [Project management with uv](../guides/uv)
115+
- [Input validation with Pydantic](../guides/input-validation)
116+
- [Running a web server](../guides/running-webserver)

docs/03_guides/01_beautifulsoup_httpx.mdx

Lines changed: 7 additions & 3 deletions
@@ -1,14 +1,14 @@
11
---
22
id: beautifulsoup-httpx
3-
title: Use BeautifulSoup with HTTPX
3+
title: Scraping with BeautifulSoup and HTTPX
44
description: Build an Apify Actor that scrapes web pages using BeautifulSoup and HTTPX.
55
---
66

77
import RunnableCodeBlock from '@site/src/components/RunnableCodeBlock';
88

99
import BeautifulSoupHttpxExample from '!!raw-loader!roa-loader!./code/01_beautifulsoup_httpx.py';
1010

11-
In this guide, you'll learn how to use the [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) library with the [HTTPX](https://www.python-httpx.org/) library in your Apify Actors.
11+
In this guide, you'll learn how to scrape web pages with the [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) and [HTTPX](https://www.python-httpx.org/) libraries in your Apify Actors.
1212

1313
## Introduction
1414

@@ -20,12 +20,16 @@ To create an Actor which uses those libraries, start from the [BeautifulSoup & P
2020

2121
## Example Actor
2222

23-
Below is a simple Actor that recursively scrapes titles from all linked websites, up to a specified maximum depth, starting from URLs provided in the Actor input. It uses [HTTPX](https://www.python-httpx.org/) for fetching pages and [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) for parsing their content to extract titles and links to other pages.
23+
Below is a simple Actor that recursively scrapes data from linked pages on the same site, up to a specified maximum depth, starting from URLs provided in the Actor input. It uses [HTTPX](https://www.python-httpx.org/) for fetching pages through [Apify Proxy](https://docs.apify.com/platform/proxy) and [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) for parsing their content to extract the title, headings, and links to other pages.
2424

2525
<RunnableCodeBlock className="language-python" language="python">
2626
{BeautifulSoupHttpxExample}
2727
</RunnableCodeBlock>
2828

29+
## Using Apify Proxy
30+
31+
Running on the Apify platform gives your scraper access to [Apify Proxy](https://docs.apify.com/platform/proxy), which rotates IP addresses to avoid rate limiting and blocking. The example creates a proxy configuration with `Actor.create_proxy_configuration` and fetches a fresh proxy URL for every request, so each page goes through a different IP. A new HTTPX client is created per request to apply that URL. To select specific proxy groups or a country, pass the relevant arguments to `Actor.create_proxy_configuration`. For more details, see the [Proxy management](../concepts/proxy-management) guide.
32+
2933
## Conclusion
3034

3135
In this guide, you learned how to use the [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) with the [HTTPX](https://www.python-httpx.org/) in your Apify Actors. By combining these libraries, you can efficiently extract data from HTML or XML files, making it easy to build web scraping tasks in Python. See the [Actor templates](https://apify.com/templates/categories/python) to get started with your own scraping tasks. If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/apify-sdk-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping!
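
Note on the reworded "Example Actor" description above: the behaviour it describes (follow links only on the same site, stop at a maximum depth) can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the code in `01_beautifulsoup_httpx.py`: `same_site_links`, `crawl`, and the `fetch_links` callable are hypothetical names, and `fetch_links` stands in for the HTTPX fetch plus BeautifulSoup link extraction.

```python
from collections import deque
from typing import Callable, Iterable
from urllib.parse import urldefrag, urljoin, urlparse


def same_site_links(page_url: str, hrefs: Iterable[str]) -> list[str]:
    """Resolve hrefs against page_url and keep only http(s) links on the same host."""
    site = urlparse(page_url).hostname
    links = []
    for href in hrefs:
        # Make the link absolute and drop any #fragment so duplicates collapse.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parsed = urlparse(absolute)
        if parsed.scheme in ("http", "https") and parsed.hostname == site:
            links.append(absolute)
    return links


def crawl(
    start_url: str,
    fetch_links: Callable[[str], Iterable[str]],  # hypothetical: returns hrefs found on a page
    max_depth: int,
) -> list[str]:
    """Breadth-first crawl that never leaves the start site or exceeds max_depth."""
    queue = deque([(start_url, 0)])
    seen = {start_url}
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # Pages at the depth limit are scraped but not expanded.
        for link in same_site_links(url, fetch_links(url)):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited
```

In an Actor, `fetch_links` would typically wrap the real HTTP request and HTML parsing, and the in-memory queue would likely be replaced by the platform's request queue.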

docs/03_guides/02_parsel_impit.mdx

Lines changed: 7 additions & 3 deletions
@@ -1,14 +1,14 @@
11
---
22
id: parsel-impit
3-
title: Use Parsel with Impit
3+
title: Scraping with Parsel and Impit
44
description: Build an Apify Actor that scrapes web pages using Parsel selectors and the Impit HTTP client.
55
---
66

77
import RunnableCodeBlock from '@site/src/components/RunnableCodeBlock';
88

99
import ParselImpitExample from '!!raw-loader!roa-loader!./code/02_parsel_impit.py';
1010

11-
In this guide, you'll learn how to combine the [Parsel](https://github.com/scrapy/parsel) and [Impit](https://github.com/apify/impit) libraries when building Apify Actors.
11+
In this guide, you'll learn how to scrape web pages with the [Parsel](https://github.com/scrapy/parsel) and [Impit](https://github.com/apify/impit) libraries in your Apify Actors.
1212

1313
## Introduction
1414

@@ -18,12 +18,16 @@ In this guide, you'll learn how to combine the [Parsel](https://github.com/scrap
1818

1919
## Example Actor
2020

21-
The following example shows a simple Actor that recursively scrapes titles from linked pages, up to a user-defined maximum depth. It uses [Impit](https://github.com/apify/impit) to fetch pages and [Parsel](https://github.com/scrapy/parsel) to extract titles and discover new links.
21+
The following example shows a simple Actor that recursively scrapes data from linked pages on the same site, up to a user-defined maximum depth. It uses [Impit](https://github.com/apify/impit) to fetch pages through [Apify Proxy](https://docs.apify.com/platform/proxy) and [Parsel](https://github.com/scrapy/parsel) to extract the title, headings, and links.
2222

2323
<RunnableCodeBlock className="language-python" language="python">
2424
{ParselImpitExample}
2525
</RunnableCodeBlock>
2626

27+
## Using Apify Proxy
28+
29+
Running on the Apify platform gives your scraper access to [Apify Proxy](https://docs.apify.com/platform/proxy), which rotates IP addresses to avoid rate limiting and blocking. The example creates a proxy configuration with `Actor.create_proxy_configuration` and fetches a fresh proxy URL for every request, so each page goes through a different IP. A new Impit client is created per request to apply that URL. To select specific proxy groups or a country, pass the relevant arguments to `Actor.create_proxy_configuration`. For more details, see the [Proxy management](../concepts/proxy-management) guide.
30+
2731
## Conclusion
2832

2933
In this guide, you learned how to use [Parsel](https://github.com/scrapy/parsel) with [Impit](https://github.com/apify/impit) in your Apify Actors. By combining these libraries, you get a powerful and efficient solution for web scraping: [Parsel](https://github.com/scrapy/parsel) provides excellent CSS selector and XPath support for data extraction, while [Impit](https://github.com/apify/impit) offers a fast and simple HTTP client built by Apify. This combination makes it easy to build scalable web scraping tasks in Python. See the [Actor templates](https://apify.com/templates/categories/python) to get started with your own scraping tasks. If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/apify-sdk-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping!

docs/03_guides/03_playwright.mdx

Lines changed: 8 additions & 4 deletions
@@ -1,6 +1,6 @@
11
---
22
id: playwright
3-
title: Use Playwright
3+
title: Browser automation with Playwright
44
description: Build an Apify Actor that scrapes dynamic web pages using Playwright browser automation.
55
---
66

@@ -11,7 +11,7 @@ import RunnableCodeBlock from '@site/src/components/RunnableCodeBlock';
1111

1212
import PlaywrightExample from '!!raw-loader!roa-loader!./code/03_playwright.py';
1313

14-
In this guide, you'll learn how to use [Playwright](https://playwright.dev) for web scraping in your Apify Actors.
14+
In this guide, you'll learn how to use [Playwright](https://playwright.dev) for browser automation and web scraping in your Apify Actors.
1515

1616
## Introduction
1717

@@ -48,14 +48,18 @@ playwright install --with-deps`
4848

4949
## Example Actor
5050

51-
This is a simple Actor that recursively scrapes titles from all linked websites, up to a maximum depth, starting from URLs in the Actor input.
51+
This is a simple Actor that recursively scrapes data from linked pages on the same site, up to a maximum depth, starting from URLs in the Actor input.
5252

53-
It uses Playwright to open the pages in an automated Chrome browser, and to extract the title and anchor elements after the pages load.
53+
It uses Playwright to open the pages in an automated Chrome browser, and to extract the title, headings, and links after the pages load.
5454

5555
<RunnableCodeBlock className="language-python" language="python">
5656
{PlaywrightExample}
5757
</RunnableCodeBlock>
5858

59+
## Using Apify Proxy
60+
61+
Running on the Apify platform gives your scraper access to [Apify Proxy](https://docs.apify.com/platform/proxy), which rotates IP addresses to avoid rate limiting and blocking. The example creates a proxy configuration with `Actor.create_proxy_configuration` and launches the browser through it. Playwright applies the proxy at the browser level, so the whole run shares a single proxy URL rather than rotating per request; the `to_playwright_proxy` helper splits that URL into the `server`, `username`, and `password` fields Playwright expects. To select specific proxy groups or a country, pass the relevant arguments to `Actor.create_proxy_configuration`. For more details, see the [Proxy management](../concepts/proxy-management) guide.
62+
5963
## Conclusion
6064

6165
In this guide you learned how to create Actors that use Playwright to scrape websites. Playwright is a powerful tool that can be used to manage browser instances and scrape websites that require JavaScript execution. See the [Actor templates](https://apify.com/templates/categories/python) to get started with your own scraping tasks. If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/apify-sdk-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping!
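
Note on the "Using Apify Proxy" section added above: the diff names a `to_playwright_proxy` helper but does not show it. A minimal sketch of what such a helper might look like follows, assuming the proxy URL has the usual `scheme://username:password@host:port` shape. The body is an assumption for illustration and may differ from the helper in `03_playwright.py`; only the `server`, `username`, and `password` field names come from the text.

```python
from urllib.parse import unquote, urlparse


def to_playwright_proxy(proxy_url: str) -> dict[str, str]:
    """Split a proxy URL into the server/username/password fields.

    Assumed implementation; the real helper in the example may differ.
    """
    parsed = urlparse(proxy_url)
    if not parsed.hostname or not parsed.port:
        raise ValueError(f"Proxy URL must include host and port: {proxy_url!r}")

    # The server part carries no credentials; they go into separate fields.
    proxy = {"server": f"{parsed.scheme}://{parsed.hostname}:{parsed.port}"}
    if parsed.username:
        # Credentials may be percent-encoded inside a URL, so decode them.
        proxy["username"] = unquote(parsed.username)
    if parsed.password:
        proxy["password"] = unquote(parsed.password)
    return proxy
```

The resulting dictionary would then be passed as the `proxy` option when launching the browser, which is why the whole run shares one proxy URL.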

docs/03_guides/04_selenium.mdx

Lines changed: 10 additions & 4 deletions
@@ -1,14 +1,14 @@
11
---
22
id: selenium
3-
title: Use Selenium
3+
title: Browser automation with Selenium
44
description: Build an Apify Actor that scrapes dynamic web pages using Selenium WebDriver.
55
---
66

77
import RunnableCodeBlock from '@site/src/components/RunnableCodeBlock';
88

99
import SeleniumExample from '!!raw-loader!roa-loader!./code/04_selenium.py';
1010

11-
In this guide, you'll learn how to use [Selenium](https://www.selenium.dev/) for web scraping in your Apify Actors.
11+
In this guide, you'll learn how to use [Selenium](https://www.selenium.dev/) for browser automation and web scraping in your Apify Actors.
1212

1313
## Introduction
1414

@@ -32,14 +32,20 @@ Refer to the [Selenium documentation](https://www.selenium.dev/documentation/web
3232

3333
## Example Actor
3434

35-
This is a simple Actor that recursively scrapes titles from all linked websites, up to a maximum depth, starting from URLs in the Actor input.
35+
This is a simple Actor that recursively scrapes data from linked pages on the same site, up to a maximum depth, starting from URLs in the Actor input.
3636

37-
It uses Selenium ChromeDriver to open the pages in an automated Chrome browser, and to extract the title and anchor elements after the pages load.
37+
It uses Selenium ChromeDriver to open the pages in an automated Chrome browser, and to extract the title, headings, and links after the pages load.
3838

3939
<RunnableCodeBlock className="language-python" language="python">
4040
{SeleniumExample}
4141
</RunnableCodeBlock>
4242

43+
## Using Apify Proxy
44+
45+
Running on the Apify platform gives your scraper access to [Apify Proxy](https://docs.apify.com/platform/proxy), which rotates IP addresses to avoid rate limiting and blocking. The example creates a proxy configuration with `Actor.create_proxy_configuration` and routes the browser through it for the whole run.
46+
47+
Chrome ignores the credentials passed in the `--proxy-server` flag, so an authenticated proxy such as Apify Proxy has to be configured from inside a small extension. The `proxy_auth_extension` helper builds one at runtime: its service worker sets the proxy server and answers the browser's authentication challenge with the username and password. Note that the new headless mode (`--headless=new`) is required for Chrome to load the extension. To select specific proxy groups or a country, pass the relevant arguments to `Actor.create_proxy_configuration`. For more details, see the [Proxy management](../concepts/proxy-management) guide.
48+
4349
## Conclusion
4450

4551
In this guide you learned how to use Selenium for web scraping in Apify Actors. You can now create your own Actors that use Selenium to scrape dynamic websites and interact with web pages just like a human would. See the [Actor templates](https://apify.com/templates/categories/python) to get started with your own scraping tasks. If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/apify-sdk-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping!

docs/03_guides/05_crawlee.mdx

Lines changed: 6 additions & 2 deletions
@@ -1,6 +1,6 @@
11
---
22
id: crawlee
3-
title: Use Crawlee
3+
title: Building crawlers with Crawlee
44
description: Build Apify Actors using Crawlee's BeautifulSoupCrawler, ParselCrawler, or PlaywrightCrawler.
55
---
66

@@ -10,7 +10,7 @@ import CrawleeBeautifulSoupExample from '!!raw-loader!roa-loader!./code/05_crawl
1010
import CrawleeParselExample from '!!raw-loader!roa-loader!./code/05_crawlee_parsel.py';
1111
import CrawleePlaywrightExample from '!!raw-loader!roa-loader!./code/05_crawlee_playwright.py';
1212

13-
In this guide, you'll learn how to use the [Crawlee](https://crawlee.dev/python) library in your Apify Actors.
13+
In this guide, you'll learn how to build web crawlers with the [Crawlee](https://crawlee.dev/python) library in your Apify Actors.
1414

1515
## Introduction
1616

@@ -42,6 +42,10 @@ The [`PlaywrightCrawler`](https://crawlee.dev/python/api/class/PlaywrightCrawler
4242
{CrawleePlaywrightExample}
4343
</RunnableCodeBlock>
4444

45+
## Using Apify Proxy
46+
47+
All three crawlers above route their requests through [Apify Proxy](https://docs.apify.com/platform/proxy), which rotates IP addresses to avoid rate limiting and blocking. `Actor.create_proxy_configuration` returns a Crawlee-compatible proxy configuration, which is passed to the crawler as `proxy_configuration`; Crawlee then rotates the proxy IP for every request on its own. Because the configuration is only available inside the running Actor, the crawler is created in `main` and the request handler is registered on a standalone [`Router`](https://crawlee.dev/python/api/class/Router) up front. To select specific proxy groups or a country, pass the relevant arguments to `Actor.create_proxy_configuration`. For more details, see the [Proxy management](../concepts/proxy-management) guide.
48+
4549
## Conclusion
4650

4751
In this guide, you learned how to use the [Crawlee](https://crawlee.dev/python) library in your Apify Actors. By using the [`BeautifulSoupCrawler`](https://crawlee.dev/python/api/class/BeautifulSoupCrawler), [`ParselCrawler`](https://crawlee.dev/python/api/class/ParselCrawler), and [`PlaywrightCrawler`](https://crawlee.dev/python/api/class/PlaywrightCrawler) crawlers, you can efficiently scrape static or dynamic web pages, making it easy to build web scraping tasks in Python. See the [Actor templates](https://apify.com/templates/categories/python) to get started with your own scraping tasks. If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/apify-sdk-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping!

docs/03_guides/06_scrapy.mdx

Lines changed: 2 additions & 2 deletions
@@ -1,6 +1,6 @@
11
---
22
id: scrapy
3-
title: Use Scrapy
3+
title: Building crawlers with Scrapy
44
description: Convert Scrapy spiders into Apify Actors with platform storage and proxy integration.
55
---
66

@@ -15,7 +15,7 @@ import ItemsExample from '!!raw-loader!./code/scrapy_project/src/items.py';
1515
import SpidersExample from '!!raw-loader!./code/scrapy_project/src/spiders/title.py';
1616
import SettingsExample from '!!raw-loader!./code/scrapy_project/src/settings.py';
1717

18-
In this guide, you'll learn how to use the [Scrapy](https://scrapy.org/) framework in your Apify Actors.
18+
In this guide, you'll learn how to build web crawlers with the [Scrapy](https://scrapy.org/) framework in your Apify Actors.
1919

2020
## Introduction
2121

Lines changed: 26 additions & 2 deletions
@@ -1,12 +1,13 @@
11
---
22
id: running-webserver
3-
title: Run a web server
3+
title: Running a web server
44
description: Run an HTTP server inside your Actor for monitoring or serving content during execution.
55
---
66

77
import RunnableCodeBlock from '@site/src/components/RunnableCodeBlock';
88

9-
import WebserverExample from '!!raw-loader!roa-loader!./code/07_webserver.py';
9+
import WebserverExample from '!!raw-loader!roa-loader!./code/12_webserver.py';
10+
import WebserverFastApiExample from '!!raw-loader!roa-loader!./code/12_webserver_fastapi.py';
1011

1112
In this guide, you'll learn how to run a web server inside your Apify Actor. This is useful for monitoring Actor progress, creating custom APIs, or serving content during the Actor run.
1213

@@ -30,6 +31,29 @@ The following example shows how to start a simple web server in your Actor, whic
3031
{WebserverExample}
3132
</RunnableCodeBlock>
3233

34+
## Using FastAPI
35+
36+
The example above relies only on Python's standard library, which keeps it dependency-free but leaves you handling requests by hand. For anything beyond a single endpoint, a web framework such as [FastAPI](https://fastapi.tiangolo.com/) is a better fit - it gives you routing, request parsing, and automatic JSON responses, and is served by an ASGI server like [uvicorn](https://www.uvicorn.org/).
37+
38+
Install both, for example by adding them to your `requirements.txt`:
39+
40+
```text
41+
fastapi
42+
uvicorn[standard]
43+
```
44+
45+
The following Actor serves the same processed-items counter as before, but through a FastAPI endpoint. The key difference is that uvicorn runs inside the Actor's event loop as a background task, bound to `Actor.configuration.web_server_port` so the platform routes the container URL to it:
46+
47+
<RunnableCodeBlock className="language-python" language="python">
48+
{WebserverFastApiExample}
49+
</RunnableCodeBlock>
50+
51+
A few things worth pointing out:
52+
53+
- `uvicorn.Server(...).serve()` is a coroutine, so it runs as an `asyncio` task alongside the Actor's own work instead of blocking it. Setting `server.should_exit = True` triggers a graceful shutdown once the work is done.
54+
- The server binds to `0.0.0.0` (all interfaces) rather than `localhost`, so it's reachable through the container URL, not only from inside the container.
55+
- The same pattern powers an [Actor Standby](#actor-standby) service - swap the one-off work loop for an Actor that just keeps serving requests.
56+
3357
## Actor Standby
3458

3559
The example above runs a web server for the duration of a single Actor run. With [Actor Standby](https://docs.apify.com/platform/actors/development/programming-interface/standby), you can instead expose your Actor as an always-ready HTTP API: the platform keeps the Actor running in the background and routes incoming HTTP requests to the web server inside it, spinning up additional instances as the load grows.
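
Note on the "Using FastAPI" section added above: the pattern it describes (a server running as a background `asyncio` task next to the Actor's work, bound to all interfaces, then shut down gracefully) can be illustrated without any third-party dependency. The sketch below is not the code in `12_webserver_fastapi.py`. It uses `asyncio.start_server` as a stand-in for uvicorn and an `asyncio.Event` as a stand-in for `server.should_exit`; the function names and the `processed_items` key are assumptions.

```python
import asyncio
import json


async def handle_request(reader, writer, state: dict) -> None:
    """Answer any HTTP request with the current counter as JSON."""
    await reader.readline()  # Read the request line; headers are ignored in this sketch.
    body = json.dumps(state).encode()
    head = (
        "HTTP/1.1 200 OK\r\n"
        "Content-Type: application/json\r\n"
        f"Content-Length: {len(body)}\r\n"
        "Connection: close\r\n\r\n"
    ).encode()
    writer.write(head + body)
    await writer.drain()
    writer.close()
    await writer.wait_closed()


async def run_actor_work(port: int = 0, items: int = 3) -> int:
    """Serve the counter in the background while the main work loop runs."""
    state = {"processed_items": 0}
    should_exit = asyncio.Event()  # Stand-in for uvicorn's `server.should_exit`.

    async def on_client(reader, writer):
        await handle_request(reader, writer, state)

    # Bind to all interfaces so the server is reachable from outside the container.
    # In an Actor, the port would come from `Actor.configuration.web_server_port`.
    server = await asyncio.start_server(on_client, host="0.0.0.0", port=port)

    async def serve() -> None:
        async with server:  # Closes the server when the block exits.
            await should_exit.wait()

    server_task = asyncio.create_task(serve())

    # The Actor's own work proceeds without being blocked by the server.
    for _ in range(items):
        await asyncio.sleep(0.01)
        state["processed_items"] += 1

    should_exit.set()  # Request a graceful shutdown once the work is done.
    await server_task
    return state["processed_items"]
```

For an Actor Standby service, the work loop would be dropped and the task would simply wait until the platform stops the Actor.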
