You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
In this guide, you'll learn how to use the [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/)library with the [HTTPX](https://www.python-httpx.org/)library in your Apify Actors.
11
+
In this guide, you'll learn how to scrape web pages with the [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/)and [HTTPX](https://www.python-httpx.org/)libraries in your Apify Actors.
12
12
13
13
## Introduction
14
14
@@ -20,12 +20,16 @@ To create an Actor which uses those libraries, start from the [BeautifulSoup & P
20
20
21
21
## Example Actor
22
22
23
-
Below is a simple Actor that recursively scrapes titles from all linked websites, up to a specified maximum depth, starting from URLs provided in the Actor input. It uses [HTTPX](https://www.python-httpx.org/) for fetching pages and [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) for parsing their content to extract titles and links to other pages.
23
+
Below is a simple Actor that recursively scrapes data from linked pages on the same site, up to a specified maximum depth, starting from URLs provided in the Actor input. It uses [HTTPX](https://www.python-httpx.org/) for fetching pages through [Apify Proxy](https://docs.apify.com/platform/proxy)and [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) for parsing their content to extract the title, headings, and links to other pages.
Running on the Apify platform gives your scraper access to [Apify Proxy](https://docs.apify.com/platform/proxy), which rotates IP addresses to avoid rate limiting and blocking. The example creates a proxy configuration with `Actor.create_proxy_configuration` and fetches a fresh proxy URL for every request, so each page goes through a different IP. A new HTTPX client is created per request to apply that URL. To select specific proxy groups or a country, pass the relevant arguments to `Actor.create_proxy_configuration`. For more details, see the [Proxy management](../concepts/proxy-management) guide.
32
+
29
33
## Conclusion
30
34
31
35
In this guide, you learned how to use the [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) with the [HTTPX](https://www.python-httpx.org/) in your Apify Actors. By combining these libraries, you can efficiently extract data from HTML or XML files, making it easy to build web scraping tasks in Python. See the [Actor templates](https://apify.com/templates/categories/python) to get started with your own scraping tasks. If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/apify-sdk-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping!
In this guide, you'll learn how to combine the [Parsel](https://github.com/scrapy/parsel) and [Impit](https://github.com/apify/impit) libraries when building Apify Actors.
11
+
In this guide, you'll learn how to scrape web pages with the [Parsel](https://github.com/scrapy/parsel) and [Impit](https://github.com/apify/impit) libraries in your Apify Actors.
12
12
13
13
## Introduction
14
14
@@ -18,12 +18,16 @@ In this guide, you'll learn how to combine the [Parsel](https://github.com/scrap
18
18
19
19
## Example Actor
20
20
21
-
The following example shows a simple Actor that recursively scrapes titles from linked pages, up to a user-defined maximum depth. It uses [Impit](https://github.com/apify/impit) to fetch pages and [Parsel](https://github.com/scrapy/parsel) to extract titles and discover new links.
21
+
The following example shows a simple Actor that recursively scrapes data from linked pages on the same site, up to a user-defined maximum depth. It uses [Impit](https://github.com/apify/impit) to fetch pages through [Apify Proxy](https://docs.apify.com/platform/proxy)and [Parsel](https://github.com/scrapy/parsel) to extract the title, headings, and links.
Running on the Apify platform gives your scraper access to [Apify Proxy](https://docs.apify.com/platform/proxy), which rotates IP addresses to avoid rate limiting and blocking. The example creates a proxy configuration with `Actor.create_proxy_configuration` and fetches a fresh proxy URL for every request, so each page goes through a different IP. A new Impit client is created per request to apply that URL. To select specific proxy groups or a country, pass the relevant arguments to `Actor.create_proxy_configuration`. For more details, see the [Proxy management](../concepts/proxy-management) guide.
30
+
27
31
## Conclusion
28
32
29
33
In this guide, you learned how to use [Parsel](https://github.com/scrapy/parsel) with [Impit](https://github.com/apify/impit) in your Apify Actors. By combining these libraries, you get a powerful and efficient solution for web scraping: [Parsel](https://github.com/scrapy/parsel) provides excellent CSS selector and XPath support for data extraction, while [Impit](https://github.com/apify/impit) offers a fast and simple HTTP client built by Apify. This combination makes it easy to build scalable web scraping tasks in Python. See the [Actor templates](https://apify.com/templates/categories/python) to get started with your own scraping tasks. If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/apify-sdk-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping!
Running on the Apify platform gives your scraper access to [Apify Proxy](https://docs.apify.com/platform/proxy), which rotates IP addresses to avoid rate limiting and blocking. The example creates a proxy configuration with `Actor.create_proxy_configuration` and launches the browser through it. Playwright applies the proxy at the browser level, so the whole run shares a single proxy URL rather than rotating per request; the `to_playwright_proxy` helper splits that URL into the `server`, `username`, and `password` fields Playwright expects. To select specific proxy groups or a country, pass the relevant arguments to `Actor.create_proxy_configuration`. For more details, see the [Proxy management](../concepts/proxy-management) guide.
62
+
59
63
## Conclusion
60
64
61
65
In this guide you learned how to create Actors that use Playwright to scrape websites. Playwright is a powerful tool that can be used to manage browser instances and scrape websites that require JavaScript execution. See the [Actor templates](https://apify.com/templates/categories/python) to get started with your own scraping tasks. If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/apify-sdk-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping!
Running on the Apify platform gives your scraper access to [Apify Proxy](https://docs.apify.com/platform/proxy), which rotates IP addresses to avoid rate limiting and blocking. The example creates a proxy configuration with `Actor.create_proxy_configuration` and routes the browser through it for the whole run.
46
+
47
+
Chrome ignores the credentials passed in the `--proxy-server` flag, so an authenticated proxy such as Apify Proxy has to be configured from inside a small extension. The `proxy_auth_extension` helper builds one at runtime: its service worker sets the proxy server and answers the browser's authentication challenge with the username and password. Note that the new headless mode (`--headless=new`) is required for Chrome to load the extension. To select specific proxy groups or a country, pass the relevant arguments to `Actor.create_proxy_configuration`. For more details, see the [Proxy management](../concepts/proxy-management) guide.
48
+
43
49
## Conclusion
44
50
45
51
In this guide you learned how to use Selenium for web scraping in Apify Actors. You can now create your own Actors that use Selenium to scrape dynamic websites and interact with web pages just like a human would. See the [Actor templates](https://apify.com/templates/categories/python) to get started with your own scraping tasks. If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/apify-sdk-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping!
In this guide, you'll learn how to use the [Crawlee](https://crawlee.dev/python) library in your Apify Actors.
13
+
In this guide, you'll learn how to build web crawlers with the [Crawlee](https://crawlee.dev/python) library in your Apify Actors.
14
14
15
15
## Introduction
16
16
@@ -42,6 +42,10 @@ The [`PlaywrightCrawler`](https://crawlee.dev/python/api/class/PlaywrightCrawler
42
42
{CrawleePlaywrightExample}
43
43
</RunnableCodeBlock>
44
44
45
+
## Using Apify Proxy
46
+
47
+
All three crawlers above route their requests through [Apify Proxy](https://docs.apify.com/platform/proxy), which rotates IP addresses to avoid rate limiting and blocking. `Actor.create_proxy_configuration` returns a Crawlee-compatible proxy configuration, which is passed to the crawler as `proxy_configuration`; Crawlee then rotates the proxy IP for every request on its own. Because the configuration is only available inside the running Actor, the crawler is created in `main` and the request handler is registered on a standalone [`Router`](https://crawlee.dev/python/api/class/Router) up front. To select specific proxy groups or a country, pass the relevant arguments to `Actor.create_proxy_configuration`. For more details, see the [Proxy management](../concepts/proxy-management) guide.
48
+
45
49
## Conclusion
46
50
47
51
In this guide, you learned how to use the [Crawlee](https://crawlee.dev/python) library in your Apify Actors. By using the [`BeautifulSoupCrawler`](https://crawlee.dev/python/api/class/BeautifulSoupCrawler), [`ParselCrawler`](https://crawlee.dev/python/api/class/ParselCrawler), and [`PlaywrightCrawler`](https://crawlee.dev/python/api/class/PlaywrightCrawler) crawlers, you can efficiently scrape static or dynamic web pages, making it easy to build web scraping tasks in Python. See the [Actor templates](https://apify.com/templates/categories/python) to get started with your own scraping tasks. If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/apify-sdk-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping!
In this guide, you'll learn how to run a web server inside your Apify Actor. This is useful for monitoring Actor progress, creating custom APIs, or serving content during the Actor run.
12
13
@@ -30,6 +31,29 @@ The following example shows how to start a simple web server in your Actor, whic
30
31
{WebserverExample}
31
32
</RunnableCodeBlock>
32
33
34
+
## Using FastAPI
35
+
36
+
The example above relies only on Python's standard library, which keeps it dependency-free but leaves you handling requests by hand. For anything beyond a single endpoint, a web framework such as [FastAPI](https://fastapi.tiangolo.com/) is a better fit - it gives you routing, request parsing, and automatic JSON responses, and is served by an ASGI server like [uvicorn](https://www.uvicorn.org/).
37
+
38
+
Install both, for example by adding them to your `requirements.txt`:
39
+
40
+
```text
41
+
fastapi
42
+
uvicorn[standard]
43
+
```
44
+
45
+
The following Actor serves the same processed-items counter as before, but through a FastAPI endpoint. The key difference is that uvicorn runs inside the Actor's event loop as a background task, bound to `Actor.configuration.web_server_port` so the platform routes the container URL to it:
-`uvicorn.Server(...).serve()` is a coroutine, so it runs as an `asyncio` task alongside the Actor's own work instead of blocking it. Setting `server.should_exit = True` triggers a graceful shutdown once the work is done.
54
+
- The server binds to `0.0.0.0` (all interfaces) rather than `localhost`, so it's reachable through the container URL, not only from inside the container.
55
+
- The same pattern powers an [Actor Standby](#actor-standby) service - swap the one-off work loop for an Actor that just keeps serving requests.
56
+
33
57
## Actor Standby
34
58
35
59
The example above runs a web server for the duration of a single Actor run. With [Actor Standby](https://docs.apify.com/platform/actors/development/programming-interface/standby), you can instead expose your Actor as an always-ready HTTP API: the platform keeps the Actor running in the background and routes incoming HTTP requests to the web server inside it, spinning up additional instances as the load grows.
0 commit comments