Commit e2034b0

docs: Retitle and improve the existing guides (#939)
Improves the existing framework guides: clearer titles, flatter example code, and a few content fixes. The `id:` slugs and URLs are unchanged, so existing links keep working.

- **Retitle guides 01–06** to action-oriented names (e.g. "Use Crawlee" → "Building crawlers with Crawlee").
- **Flatten the scraper examples** (BeautifulSoup, Parsel, Playwright, Selenium): extract a `scrape_page` helper and track crawl depth via Crawlee's built-in `Request.crawl_depth` instead of a manual `user_data` counter. Each stays a single file with one runnable "Run on Apify" block.
- **Crawlee & Scrapy guides**: tidy the Apify Proxy wording, and fix the Scrapy `__main__.py` tab label/title plus a grammar nit.
- **Web server guide**: fix the `Actor.configuration.web_server_url`/`web_server_port` references (the prose used non-existent `container_*` attributes), add a FastAPI example and an Actor Standby section, and renumber it 07 → 12 to make room for the new guides.
- **Quick-start**: refresh the guides list.
1 parent e41e43e commit e2034b0

44 files changed

Lines changed: 1428 additions & 825 deletions

docs/01_introduction/quick-start.mdx

Lines changed: 17 additions & 13 deletions
Original file line number | Diff line number | Diff line change
@@ -100,16 +100,20 @@ To learn more about the features of the Apify SDK and how to use them, check out
100100

101101
### Guides
102102

103-
To see how you can integrate the Apify SDK with popular web scraping libraries, check out our guides:
104-
105-
- [BeautifulSoup with HTTPX](../guides/beautifulsoup-httpx)
106-
- [Parsel with Impit](../guides/parsel-impit)
107-
- [Playwright](../guides/playwright)
108-
- [Selenium](../guides/selenium)
109-
- [Crawlee](../guides/crawlee)
110-
- [Scrapy](../guides/scrapy)
111-
- [Crawl4AI](../guides/crawl4ai)
112-
- [Browser Use](../guides/browser-use)
113-
- [Running webserver](../guides/running-webserver)
114-
- [uv](../guides/uv)
115-
- [Validate Actor input with Pydantic](../guides/input-validation)
103+
To see how you can integrate the Apify SDK with popular scraping libraries and frameworks, check out these guides:
104+
105+
- [Scraping with BeautifulSoup and HTTPX](../guides/beautifulsoup-httpx)
106+
- [Scraping with Parsel and Impit](../guides/parsel-impit)
107+
- [Browser automation with Playwright](../guides/playwright)
108+
- [Browser automation with Selenium](../guides/selenium)
109+
- [Building crawlers with Crawlee](../guides/crawlee)
110+
- [Building crawlers with Scrapy](../guides/scrapy)
111+
- [Adaptive scraping with Scrapling](../guides/scrapling)
112+
- [LLM-ready scraping with Crawl4AI](../guides/crawl4ai)
113+
- [Browser AI agents with Browser Use](../guides/browser-use)
114+
115+
For other aspects of Actor development, explore these guides:
116+
117+
- [Project management with uv](../guides/uv)
118+
- [Input validation with Pydantic](../guides/input-validation)
119+
- [Running a web server](../guides/running-webserver)

docs/03_guides/01_beautifulsoup_httpx.mdx

Lines changed: 7 additions & 3 deletions
@@ -1,14 +1,14 @@
11
---
22
id: beautifulsoup-httpx
3-
title: Use BeautifulSoup with HTTPX
3+
title: Scraping with BeautifulSoup and HTTPX
44
description: Build an Apify Actor that scrapes web pages using BeautifulSoup and HTTPX.
55
---
66

77
import RunnableCodeBlock from '@site/src/components/RunnableCodeBlock';
88

99
import BeautifulSoupHttpxExample from '!!raw-loader!roa-loader!./code/01_beautifulsoup_httpx.py';
1010

11-
In this guide, you'll learn how to use the [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) library with the [HTTPX](https://www.python-httpx.org/) library in your Apify Actors.
11+
In this guide, you'll learn how to scrape web pages with the [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) and [HTTPX](https://www.python-httpx.org/) libraries in your Apify Actors.
1212

1313
## Introduction
1414

@@ -20,12 +20,16 @@ To create an Actor which uses those libraries, start from the [BeautifulSoup & P
2020

2121
## Example Actor
2222

23-
Below is a simple Actor that recursively scrapes titles from all linked websites, up to a specified maximum depth, starting from URLs provided in the Actor input. It uses [HTTPX](https://www.python-httpx.org/) for fetching pages and [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) for parsing their content to extract titles and links to other pages.
23+
The following example is a simple Actor that recursively scrapes data from linked pages on the same site, up to a specified maximum depth, starting from URLs provided in the Actor input. It uses [HTTPX](https://www.python-httpx.org/) for fetching pages through [Apify Proxy](https://docs.apify.com/platform/proxy) and [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) for parsing their content to extract the title, headings, and links to other pages.
2424

2525
<RunnableCodeBlock className="language-python" language="python">
2626
{BeautifulSoupHttpxExample}
2727
</RunnableCodeBlock>
2828

29+
## Using Apify Proxy
30+
31+
Running on the Apify platform gives your scraper access to [Apify Proxy](https://docs.apify.com/platform/proxy), which rotates IP addresses to avoid rate limiting and blocking. The example creates a proxy configuration with `Actor.create_proxy_configuration` and fetches a fresh proxy URL for every request. Each page then goes through a different IP. A new HTTPX client is created per request to apply that URL. To select specific proxy groups or a country, pass the relevant arguments to `Actor.create_proxy_configuration`. For details, see [Proxy management](../concepts/proxy-management).
32+
2933
## Conclusion
3034

3135
In this guide, you learned how to use the [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) with the [HTTPX](https://www.python-httpx.org/) in your Apify Actors. By combining these libraries, you can efficiently extract data from HTML or XML files, making it easy to build web scraping tasks in Python. See the [Actor templates](https://apify.com/templates/categories/python) to get started with your own scraping tasks. If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/apify-sdk-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping!
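
The crawl rule that the BeautifulSoup guide describes (follow links on the same site, up to a maximum depth) can be sketched with the standard library alone. This is an illustrative sketch, not the code from `01_beautifulsoup_httpx.py`: the `should_follow` helper and its parameters are hypothetical names, and `depth` stands in for the value the commit message says is read from `Request.crawl_depth`.

```python
from typing import Optional
from urllib.parse import urljoin, urlparse


def should_follow(page_url: str, href: str, depth: int, max_depth: int) -> Optional[str]:
    """Return the absolute URL of a link worth enqueuing, or None.

    Hypothetical helper: `depth` is the depth of the page the link was found on.
    """
    # A link found on a page that is already at max_depth would exceed the limit.
    if depth >= max_depth:
        return None

    # Resolve relative links against the page they were found on.
    absolute = urljoin(page_url, href)
    parsed = urlparse(absolute)

    # Skip non-web schemes such as mailto: or javascript:.
    if parsed.scheme not in ('http', 'https'):
        return None

    # Stay on the same site: compare host (and port) with the current page.
    if parsed.netloc != urlparse(page_url).netloc:
        return None

    return absolute
```

Comparing `netloc` is the strictest reading of "same site"; a real scraper may want to treat subdomains as the same site, which this sketch does not do.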

docs/03_guides/02_parsel_impit.mdx

Lines changed: 7 additions & 3 deletions
@@ -1,14 +1,14 @@
11
---
22
id: parsel-impit
3-
title: Use Parsel with Impit
3+
title: Scraping with Parsel and Impit
44
description: Build an Apify Actor that scrapes web pages using Parsel selectors and the Impit HTTP client.
55
---
66

77
import RunnableCodeBlock from '@site/src/components/RunnableCodeBlock';
88

99
import ParselImpitExample from '!!raw-loader!roa-loader!./code/02_parsel_impit.py';
1010

11-
In this guide, you'll learn how to combine the [Parsel](https://github.com/scrapy/parsel) and [Impit](https://github.com/apify/impit) libraries when building Apify Actors.
11+
In this guide, you'll learn how to scrape web pages with the [Parsel](https://github.com/scrapy/parsel) and [Impit](https://github.com/apify/impit) libraries in your Apify Actors.
1212

1313
## Introduction
1414

@@ -18,12 +18,16 @@ In this guide, you'll learn how to combine the [Parsel](https://github.com/scrap
1818

1919
## Example Actor
2020

21-
The following example shows a simple Actor that recursively scrapes titles from linked pages, up to a user-defined maximum depth. It uses [Impit](https://github.com/apify/impit) to fetch pages and [Parsel](https://github.com/scrapy/parsel) to extract titles and discover new links.
21+
The following example shows a simple Actor that recursively scrapes data from linked pages on the same site, up to a user-defined maximum depth. It uses [Impit](https://github.com/apify/impit) to fetch pages through [Apify Proxy](https://docs.apify.com/platform/proxy) and [Parsel](https://github.com/scrapy/parsel) to extract the title, headings, and links.
2222

2323
<RunnableCodeBlock className="language-python" language="python">
2424
{ParselImpitExample}
2525
</RunnableCodeBlock>
2626

27+
## Using Apify Proxy
28+
29+
Running on the Apify platform gives your scraper access to [Apify Proxy](https://docs.apify.com/platform/proxy), which rotates IP addresses to avoid rate limiting and blocking. The example creates a proxy configuration with `Actor.create_proxy_configuration` and fetches a fresh proxy URL for every request. Each page then goes through a different IP. A new Impit client is created per request to apply that URL. To select specific proxy groups or a country, pass the relevant arguments to `Actor.create_proxy_configuration`. For details, see [Proxy management](../concepts/proxy-management).
30+
2731
## Conclusion
2832

2933
In this guide, you learned how to use [Parsel](https://github.com/scrapy/parsel) with [Impit](https://github.com/apify/impit) in your Apify Actors. By combining these libraries, you get a powerful and efficient solution for web scraping: [Parsel](https://github.com/scrapy/parsel) provides excellent CSS selector and XPath support for data extraction, while [Impit](https://github.com/apify/impit) offers a fast and simple HTTP client built by Apify. This combination makes it easy to build scalable web scraping tasks in Python. See the [Actor templates](https://apify.com/templates/categories/python) to get started with your own scraping tasks. If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/apify-sdk-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping!

docs/03_guides/03_playwright.mdx

Lines changed: 8 additions & 4 deletions
@@ -1,6 +1,6 @@
11
---
22
id: playwright
3-
title: Use Playwright
3+
title: Browser automation with Playwright
44
description: Build an Apify Actor that scrapes dynamic web pages using Playwright browser automation.
55
---
66

@@ -11,7 +11,7 @@ import RunnableCodeBlock from '@site/src/components/RunnableCodeBlock';
1111

1212
import PlaywrightExample from '!!raw-loader!roa-loader!./code/03_playwright.py';
1313

14-
In this guide, you'll learn how to use [Playwright](https://playwright.dev) for web scraping in your Apify Actors.
14+
In this guide, you'll learn how to use [Playwright](https://playwright.dev) for browser automation and web scraping in your Apify Actors.
1515

1616
## Introduction
1717

@@ -48,14 +48,18 @@ playwright install --with-deps`
4848

4949
## Example Actor
5050

51-
This is a simple Actor that recursively scrapes titles from all linked websites, up to a maximum depth, starting from URLs in the Actor input.
51+
This is a simple Actor that recursively scrapes data from linked pages on the same site, up to a maximum depth, starting from URLs in the Actor input.
5252

53-
It uses Playwright to open the pages in an automated Chrome browser, and to extract the title and anchor elements after the pages load.
53+
It uses Playwright to open the pages in an automated Chrome browser, and to extract the title, headings, and links after the pages load.
5454

5555
<RunnableCodeBlock className="language-python" language="python">
5656
{PlaywrightExample}
5757
</RunnableCodeBlock>
5858

59+
## Using Apify Proxy
60+
61+
Running on the Apify platform gives your scraper access to [Apify Proxy](https://docs.apify.com/platform/proxy), which rotates IP addresses to avoid rate limiting and blocking. The example creates a proxy configuration with `Actor.create_proxy_configuration` and fetches a fresh proxy URL for every request. Playwright applies a proxy per browser context. Each request runs in its own new context to rotate the IP. The `to_playwright_proxy` helper splits that URL into the `server`, `username`, and `password` fields Playwright expects. To select specific proxy groups or a country, pass the relevant arguments to `Actor.create_proxy_configuration`. For details, see [Proxy management](../concepts/proxy-management).
62+
5963
## Conclusion
6064

6165
In this guide you learned how to create Actors that use Playwright to scrape websites. Playwright is a powerful tool that can be used to manage browser instances and scrape websites that require JavaScript execution. See the [Actor templates](https://apify.com/templates/categories/python) to get started with your own scraping tasks. If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/apify-sdk-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping!
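
The Playwright guide mentions a `to_playwright_proxy` helper that splits a proxy URL into `server`, `username`, and `password`. The helper's body is not shown in this diff, so the following is only a guess at how such a function might look, written with the standard library. The example URL in the usage note is a placeholder, not a real proxy address.

```python
from urllib.parse import unquote, urlparse


def to_playwright_proxy(proxy_url: str) -> dict[str, str]:
    """Split a proxy URL into the fields of Playwright's proxy settings.

    Sketch only: the real helper in the example file may differ.
    """
    parsed = urlparse(proxy_url)

    # The server part carries scheme, host, and (if present) port, without credentials.
    server = f'{parsed.scheme}://{parsed.hostname}'
    if parsed.port is not None:
        server += f':{parsed.port}'

    proxy = {'server': server}

    # Credentials in a URL are percent-encoded, so decode them before use.
    if parsed.username:
        proxy['username'] = unquote(parsed.username)
    if parsed.password:
        proxy['password'] = unquote(parsed.password)

    return proxy
```

Under these assumptions, `to_playwright_proxy('http://user:secret@proxy.example.com:8000')` yields a dictionary that could be passed as the `proxy` argument of `browser.new_context(...)`, which matches the per-context rotation the guide describes.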

docs/03_guides/04_selenium.mdx

Lines changed: 10 additions & 4 deletions
@@ -1,14 +1,14 @@
11
---
22
id: selenium
3-
title: Use Selenium
3+
title: Browser automation with Selenium
44
description: Build an Apify Actor that scrapes dynamic web pages using Selenium WebDriver.
55
---
66

77
import RunnableCodeBlock from '@site/src/components/RunnableCodeBlock';
88

99
import SeleniumExample from '!!raw-loader!roa-loader!./code/04_selenium.py';
1010

11-
In this guide, you'll learn how to use [Selenium](https://www.selenium.dev/) for web scraping in your Apify Actors.
11+
In this guide, you'll learn how to use [Selenium](https://www.selenium.dev/) for browser automation and web scraping in your Apify Actors.
1212

1313
## Introduction
1414

@@ -32,14 +32,20 @@ Refer to the [Selenium documentation](https://www.selenium.dev/documentation/web
3232

3333
## Example Actor
3434

35-
This is a simple Actor that recursively scrapes titles from all linked websites, up to a maximum depth, starting from URLs in the Actor input.
35+
This is a simple Actor that recursively scrapes data from linked pages on the same site, up to a maximum depth, starting from URLs in the Actor input.
3636

37-
It uses Selenium ChromeDriver to open the pages in an automated Chrome browser, and to extract the title and anchor elements after the pages load.
37+
It uses Selenium ChromeDriver to open the pages in an automated Chrome browser, and to extract the title, headings, and links after the pages load.
3838

3939
<RunnableCodeBlock className="language-python" language="python">
4040
{SeleniumExample}
4141
</RunnableCodeBlock>
4242

43+
## Using Apify Proxy
44+
45+
Running on the Apify platform gives your scraper access to [Apify Proxy](https://docs.apify.com/platform/proxy), which rotates IP addresses to avoid rate limiting and blocking. The example creates a proxy configuration with `Actor.create_proxy_configuration` and routes the browser through it for the whole run.
46+
47+
Chrome ignores the credentials passed in the `--proxy-server` flag. Because of that, configure an authenticated proxy such as Apify Proxy from inside a small extension. The `proxy_auth_extension` helper builds one at runtime: its service worker sets the proxy server and answers the browser's authentication challenge with the username and password. Note that the new headless mode (`--headless=new`) is required for Chrome to load the extension. To select specific proxy groups or a country, pass the relevant arguments to `Actor.create_proxy_configuration`. For details, see [Proxy management](../concepts/proxy-management).
48+
4349
## Conclusion
4450

4551
In this guide you learned how to use Selenium for web scraping in Apify Actors. You can now create your own Actors that use Selenium to scrape dynamic websites and interact with web pages just like a human would. See the [Actor templates](https://apify.com/templates/categories/python) to get started with your own scraping tasks. If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/apify-sdk-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping!
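
The Selenium guide describes an extension, built at runtime, whose service worker sets the proxy server and answers the authentication challenge. The `proxy_auth_extension` helper itself is not shown in this diff, so the following is a rough sketch of the idea under stated assumptions: the function name `build_proxy_auth_extension`, the Manifest V3 layout, and the permission list are choices made for illustration and may not match the example file.

```python
import json
import tempfile
from pathlib import Path
from urllib.parse import unquote, urlparse


def build_proxy_auth_extension(proxy_url: str) -> Path:
    """Write an unpacked Chrome extension that sets and authenticates a proxy.

    Hypothetical sketch; returns the directory containing the extension files.
    """
    parsed = urlparse(proxy_url)

    manifest = {
        'manifest_version': 3,
        'name': 'Proxy auth (sketch)',
        'version': '1.0',
        # webRequestAuthProvider lets a Manifest V3 worker answer auth challenges.
        'permissions': ['proxy', 'webRequest', 'webRequestAuthProvider'],
        'host_permissions': ['<all_urls>'],
        'background': {'service_worker': 'background.js'},
    }

    proxy_settings = {
        'value': {
            'mode': 'fixed_servers',
            'rules': {
                'singleProxy': {
                    'scheme': parsed.scheme,
                    'host': parsed.hostname,
                    'port': parsed.port,
                },
            },
        },
        'scope': 'regular',
    }
    credentials = {
        'username': unquote(parsed.username or ''),
        'password': unquote(parsed.password or ''),
    }

    # json.dumps produces valid JavaScript literals and escapes the values.
    worker = (
        f'chrome.proxy.settings.set({json.dumps(proxy_settings)});\n'
        'chrome.webRequest.onAuthRequired.addListener(\n'
        f'  (details, callback) => callback({{authCredentials: {json.dumps(credentials)}}}),\n'
        "  {urls: ['<all_urls>']},\n"
        "  ['asyncBlocking'],\n"
        ');\n'
    )

    ext_dir = Path(tempfile.mkdtemp(prefix='proxy-auth-ext-'))
    (ext_dir / 'manifest.json').write_text(json.dumps(manifest), encoding='utf-8')
    (ext_dir / 'background.js').write_text(worker, encoding='utf-8')
    return ext_dir
```

How the directory is handed to Chrome is also an assumption here (for example a `--load-extension` argument, which not every Chrome build accepts). The guide's note about `--headless=new` applies either way.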

docs/03_guides/05_crawlee.mdx

Lines changed: 6 additions & 2 deletions
@@ -1,6 +1,6 @@
11
---
22
id: crawlee
3-
title: Use Crawlee
3+
title: Building crawlers with Crawlee
44
description: Build Apify Actors using Crawlee's BeautifulSoupCrawler, ParselCrawler, or PlaywrightCrawler.
55
---
66

@@ -10,7 +10,7 @@ import CrawleeBeautifulSoupExample from '!!raw-loader!roa-loader!./code/05_crawl
1010
import CrawleeParselExample from '!!raw-loader!roa-loader!./code/05_crawlee_parsel.py';
1111
import CrawleePlaywrightExample from '!!raw-loader!roa-loader!./code/05_crawlee_playwright.py';
1212

13-
In this guide, you'll learn how to use the [Crawlee](https://crawlee.dev/python) library in your Apify Actors.
13+
In this guide, you'll learn how to build web crawlers with the [Crawlee](https://crawlee.dev/python) library in your Apify Actors.
1414

1515
## Introduction
1616

@@ -42,6 +42,10 @@ The [`PlaywrightCrawler`](https://crawlee.dev/python/api/class/PlaywrightCrawler
4242
{CrawleePlaywrightExample}
4343
</RunnableCodeBlock>
4444

45+
## Using Apify Proxy
46+
47+
All three crawlers above route their requests through [Apify Proxy](https://docs.apify.com/platform/proxy), which rotates IP addresses to avoid rate limiting and blocking. `Actor.create_proxy_configuration` returns a Crawlee-compatible proxy configuration, which is passed to the crawler as `proxy_configuration`. Crawlee then rotates the proxy IP for every request on its own. Because the configuration is only available inside the running Actor, the crawler is created in `main` and the request handler is registered on a standalone [`Router`](https://crawlee.dev/python/api/class/Router) up front. To select specific proxy groups or a country, pass the relevant arguments to `Actor.create_proxy_configuration`. For details, see [Proxy management](../concepts/proxy-management).
48+
4549
## Conclusion
4650

4751
In this guide, you learned how to use the [Crawlee](https://crawlee.dev/python) library in your Apify Actors. By using the [`BeautifulSoupCrawler`](https://crawlee.dev/python/api/class/BeautifulSoupCrawler), [`ParselCrawler`](https://crawlee.dev/python/api/class/ParselCrawler), and [`PlaywrightCrawler`](https://crawlee.dev/python/api/class/PlaywrightCrawler) crawlers, you can efficiently scrape static or dynamic web pages, making it easy to build web scraping tasks in Python. See the [Actor templates](https://apify.com/templates/categories/python) to get started with your own scraping tasks. If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/apify-sdk-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping!

docs/03_guides/06_scrapy.mdx

Lines changed: 5 additions & 5 deletions
@@ -1,6 +1,6 @@
11
---
22
id: scrapy
3-
title: Use Scrapy
3+
title: Building crawlers with Scrapy
44
description: Convert Scrapy spiders into Apify Actors with platform storage and proxy integration.
55
---
66

@@ -15,17 +15,17 @@ import ItemsExample from '!!raw-loader!./code/scrapy_project/src/items.py';
1515
import SpidersExample from '!!raw-loader!./code/scrapy_project/src/spiders/title.py';
1616
import SettingsExample from '!!raw-loader!./code/scrapy_project/src/settings.py';
1717

18-
In this guide, you'll learn how to use the [Scrapy](https://scrapy.org/) framework in your Apify Actors.
18+
In this guide, you'll learn how to build web crawlers with the [Scrapy](https://scrapy.org/) framework in your Apify Actors.
1919

2020
## Introduction
2121

2222
[Scrapy](https://scrapy.org/) is an open-source web scraping framework for Python. It provides tools for defining scrapers, extracting data from web pages, following links, and handling pagination. With the Apify SDK, Scrapy projects can be converted into Apify [Actors](https://docs.apify.com/platform/actors), integrated with Apify [storages](https://docs.apify.com/platform/storage), and executed on the Apify [platform](https://docs.apify.com/platform).
2323

2424
## Integrating Scrapy with the Apify platform
2525

26-
The Apify SDK provides an Apify-Scrapy integration. The main challenge of this is to combine two asynchronous frameworks that use different event loop implementations. Scrapy uses [Twisted](https://twisted.org/) for asynchronous execution, while the Apify SDK is based on [asyncio](https://docs.python.org/3/library/asyncio.html). The key thing is to install the Twisted's `asyncioreactor` to run Twisted's asyncio compatible event loop. The `apify.scrapy.run_scrapy_actor` function handles this reactor installation automatically. This allows both Twisted and asyncio to run on a single event loop, enabling a Scrapy spider to run as an Apify Actor with minimal modifications.
26+
The Apify SDK provides an Apify-Scrapy integration. The main challenge is to combine two asynchronous frameworks that use different event loop implementations. Scrapy uses [Twisted](https://twisted.org/) for asynchronous execution, while the Apify SDK is based on [asyncio](https://docs.python.org/3/library/asyncio.html). The key thing is to install Twisted's `asyncioreactor` to run Twisted's asyncio compatible event loop. The `apify.scrapy.run_scrapy_actor` function handles this reactor installation automatically. This allows both Twisted and asyncio to run on a single event loop, enabling a Scrapy spider to run as an Apify Actor with minimal modifications.
2727

28-
<CodeBlock className="language-python" title="__main.py__: The Actor entry point ">
28+
<CodeBlock className="language-python" title="__main__.py: The Actor entry point">
2929
{UnderscoreMainExample}
3030
</CodeBlock>
3131

@@ -74,7 +74,7 @@ For further details, see the [Scrapy migration guide](https://docs.apify.com/cli
7474
The following example shows a Scrapy Actor that scrapes page titles and enqueues links found on each page. This example aligns with the structure provided in the Apify Actor templates.
7575

7676
<Tabs>
77-
<TabItem value="__main__.py" label="__main.py__">
77+
<TabItem value="__main__.py" label="__main__.py">
7878
<CodeBlock className="language-python">
7979
{UnderscoreMainExample}
8080
</CodeBlock>
