Merged
7 changes: 4 additions & 3 deletions docs/02_concepts/05_proxy_management.mdx
Original file line number Diff line number Diff line change
@@ -13,10 +13,9 @@ import ApifyProxyConfig from '!!raw-loader!roa-loader!./code/05_apify_proxy_conf
import CustomProxyFunctionExample from '!!raw-loader!roa-loader!./code/05_custom_proxy_function.py';
import ProxyActorInputExample from '!!raw-loader!roa-loader!./code/05_proxy_actor_input.py';
import ProxyHttpxExample from '!!raw-loader!roa-loader!./code/05_proxy_httpx.py';
import ApiLink from '@site/src/components/ApiLink';

[IP address blocking](https://en.wikipedia.org/wiki/IP_address_blocking) is one of the oldest and most effective ways of preventing access to a website. It is therefore paramount for a good web scraping library to provide easy to use but powerful tools which can work around IP blocking. The most powerful weapon in your anti IP blocking arsenal is a [proxy server](https://en.wikipedia.org/wiki/Proxy_server).

With the Apify SDK, you can use your own proxy servers, proxy servers acquired from third-party providers, or you can rely on [Apify Proxy](https://apify.com/proxy) for your scraping needs.
The Apify SDK provides built-in proxy management through the <ApiLink to="class/ProxyConfiguration">`ProxyConfiguration`</ApiLink> class, supporting both [Apify Proxy](https://apify.com/proxy) and custom proxy servers. Proxies are essential for web scraping to avoid [IP address blocking](https://en.wikipedia.org/wiki/IP_address_blocking) and distribute requests across multiple addresses.

## Quick start

@@ -107,3 +106,5 @@ Make sure you have the `httpx` library installed:
```bash
pip install httpx
```

For full details on proxy configuration options, see the <ApiLink to="class/ProxyConfiguration">`ProxyConfiguration`</ApiLink> API reference and the [Apify Proxy documentation](https://docs.apify.com/proxy).
5 changes: 4 additions & 1 deletion docs/02_concepts/06_interacting_with_other_actors.mdx
@@ -10,8 +10,9 @@ import InteractingStartExample from '!!raw-loader!roa-loader!./code/06_interacti
import InteractingCallExample from '!!raw-loader!roa-loader!./code/06_interacting_call.py';
import InteractingCallTaskExample from '!!raw-loader!roa-loader!./code/06_interacting_call_task.py';
import InteractingMetamorphExample from '!!raw-loader!roa-loader!./code/06_interacting_metamorph.py';
import ApiLink from '@site/src/components/ApiLink';

There are several methods that interact with other Actors and Actor tasks on the Apify platform.
The Apify SDK lets you start, call, and transform (metamorph) other Actors directly from your Actor code. This is useful for composing complex workflows from smaller, reusable Actors.

## Actor start

@@ -50,3 +51,5 @@ For example, imagine you have an Actor that accepts a hotel URL on input, and th
<RunnableCodeBlock className="language-python" language="python">
{InteractingMetamorphExample}
</RunnableCodeBlock>

For the full list of methods for interacting with other Actors, see the <ApiLink to="class/Actor">`Actor`</ApiLink> API reference.
2 changes: 2 additions & 0 deletions docs/02_concepts/07_webhooks.mdx
@@ -30,3 +30,5 @@ To ensure that duplicate ad-hoc webhooks won't get created in a case of Actor re
<RunnableCodeBlock className="language-python" language="python">
{WebhookPreventingExample}
</RunnableCodeBlock>

For more information about webhooks, including event types and payloads, see the [Apify webhooks documentation](https://docs.apify.com/platform/integrations/webhooks).
6 changes: 3 additions & 3 deletions docs/02_concepts/08_access_apify_api.mdx
@@ -9,9 +9,7 @@ import RunnableCodeBlock from '@site/src/components/RunnableCodeBlock';
import ActorClientExample from '!!raw-loader!roa-loader!./code/08_actor_client.py';
import ActorNewClientExample from '!!raw-loader!roa-loader!./code/08_actor_new_client.py';

The Apify SDK contains many useful features for making Actor development easier. However, it does not cover all the features the Apify API offers.

For working with the Apify API directly, you can use the provided instance of the [Apify API Client](https://docs.apify.com/api/client/python) library.
The Apify SDK provides a built-in instance of the [Apify API Client](https://docs.apify.com/api/client/python) for accessing Apify platform features beyond what the SDK covers directly.

## Actor client

Expand All @@ -30,3 +28,5 @@ If you want to create a completely new instance of the client, for example, to g
<RunnableCodeBlock className="language-python" language="python">
{ActorNewClientExample}
</RunnableCodeBlock>

For the full API client documentation, see the [Apify API Client for Python](https://docs.apify.com/api/client/python).
2 changes: 1 addition & 1 deletion docs/02_concepts/09_logging.mdx
@@ -11,7 +11,7 @@ import LoggerUsageExample from '!!raw-loader!roa-loader!./code/09_logger_usage.p
import RedirectLog from '!!raw-loader!roa-loader!./code/09_redirect_log.py';
import RedirectLogExistingRun from '!!raw-loader!roa-loader!./code/09_redirect_log_existing_run.py';

The Apify SDK is logging useful information through the [`logging`](https://docs.python.org/3/library/logging.html) module from Python's standard library, into the logger with the name `apify`.
The Apify SDK logs through Python's standard [`logging`](https://docs.python.org/3/library/logging.html) module, using the `apify` logger. Configuring log levels and formatting helps you debug Actors locally and monitor them on the platform.
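For example, configuring the `apify` logger by hand uses only the standard library; the handler and format shown here are illustrative choices, not SDK defaults:

```python
# A minimal sketch of configuring the `apify` logger with the standard
# logging module; the handler and format are illustrative choices.
import logging

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter('[%(levelname)s] %(name)s: %(message)s'))

apify_logger = logging.getLogger('apify')
apify_logger.setLevel(logging.DEBUG)  # also surface the SDK's debug messages
apify_logger.addHandler(handler)

apify_logger.info('Logger configured.')
```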

## Automatic configuration

7 changes: 4 additions & 3 deletions docs/02_concepts/10_configuration.mdx
@@ -7,10 +7,9 @@ description: Customize Actor behavior through the Configuration class or environ
import RunnableCodeBlock from '@site/src/components/RunnableCodeBlock';

import ConfigExample from '!!raw-loader!roa-loader!./code/10_config.py';
import ApiLink from '@site/src/components/ApiLink';

The [`Actor`](../../reference/class/Actor) class gets configured using the [`Configuration`](../../reference/class/Configuration) class, which initializes itself based on the provided environment variables.

If you're using the Apify SDK in your Actors on the Apify platform, or Actors running locally through the Apify CLI, you don't need to configure the `Actor` class manually, unless you have some specific requirements, everything will get configured automatically.
The <ApiLink to="class/Actor">`Actor`</ApiLink> class is configured through the <ApiLink to="class/Configuration">`Configuration`</ApiLink> class, which reads its settings from environment variables. When running on the Apify platform or through the Apify CLI, configuration is automatic — manual setup is only needed for custom requirements.

If you need some special configuration, you can adjust it either through the `Configuration` class directly, or by setting environment variables when running the Actor locally.

Expand All @@ -33,3 +32,5 @@ This Actor run will not persist its local storages to the filesystem:
```bash
APIFY_PERSIST_STORAGE=0 apify run
```

For the full list of configuration options, see the <ApiLink to="class/Configuration">`Configuration`</ApiLink> API reference.
2 changes: 1 addition & 1 deletion docs/03_guides/01_beautifulsoup_httpx.mdx
@@ -1,6 +1,6 @@
---
id: beautifulsoup-httpx
title: Using BeautifulSoup with HTTPX
title: Use BeautifulSoup with HTTPX
description: Build an Apify Actor that scrapes web pages using BeautifulSoup and HTTPX.
---

2 changes: 1 addition & 1 deletion docs/03_guides/02_parsel_impit.mdx
@@ -1,6 +1,6 @@
---
id: parsel-impit
title: Using Parsel with Impit
title: Use Parsel with Impit
description: Build an Apify Actor that scrapes web pages using Parsel selectors and the Impit HTTP client.
---

11 changes: 6 additions & 5 deletions docs/03_guides/03_playwright.mdx
@@ -1,6 +1,6 @@
---
id: playwright
title: Using Playwright
title: Use Playwright
description: Build an Apify Actor that scrapes dynamic web pages using Playwright browser automation.
---

@@ -19,10 +19,11 @@ In this guide, you'll learn how to use [Playwright](https://playwright.dev) for

Some of the key features of Playwright for web scraping include:

- **Cross-browser support** - Playwright supports the latest versions of major browsers like Chrome, Firefox, and Safari, so you can choose the one that suits your needs the best.
- **Headless mode** - Playwright can run in headless mode, meaning that the browser window is not visible on your screen while it is scraping, which can be useful for running scraping tasks in the background or in containers without a display.
- **Powerful selectors** - Playwright provides a variety of powerful selectors that allow you to target specific elements on a web page, including CSS selectors, XPath, and text matching.
- **Emulation of user interactions** - Playwright allows you to emulate user interactions like clicking, scrolling, filling out forms, and even typing in text, which can be useful for scraping websites that have dynamic content or require user input.
- **Cross-browser support** - Playwright supports Chromium, Firefox, and WebKit with a single API, ensuring consistent behavior across all browsers.
- **Auto-waiting** - Playwright automatically waits for elements to be ready before performing actions, reducing flaky scripts and eliminating the need for manual sleep calls.
- **Headless and headful modes** - Playwright can run with or without a visible browser window, making it suitable for both local development and containerized environments.
- **Powerful selectors** - Playwright provides CSS selectors, XPath, text matching, and its own resilient locator API for targeting elements on a page.
- **Network interception** - Playwright can intercept and modify network requests, allowing you to block unnecessary resources or mock API responses during scraping.

To create Actors which use Playwright, start from the [Playwright & Python](https://apify.com/templates/categories/python) Actor template.

16 changes: 6 additions & 10 deletions docs/03_guides/04_selenium.mdx
@@ -1,6 +1,6 @@
---
id: selenium
title: Using Selenium
title: Use Selenium
description: Build an Apify Actor that scrapes dynamic web pages using Selenium WebDriver.
---

@@ -16,15 +16,11 @@ In this guide, you'll learn how to use [Selenium](https://www.selenium.dev/) for

Some of the key features of Selenium for web scraping include:

- **Cross-browser support** - Selenium supports the latest versions of major browsers like Chrome, Firefox, and Safari,
so you can choose the one that suits your needs the best.
- **Headless mode** - Selenium can run in headless mode,
meaning that the browser window is not visible on your screen while it is scraping,
which can be useful for running scraping tasks in the background or in containers without a display.
- **Powerful selectors** - Selenium provides a variety of powerful selectors that allow you to target specific elements on a web page,
including CSS selectors, XPath, and text matching.
- **Emulation of user interactions** - Selenium allows you to emulate user interactions like clicking, scrolling, filling out forms,
and even typing in text, which can be useful for scraping websites that have dynamic content or require user input.
- **Broad ecosystem** - Selenium has a large community and extensive documentation, with support for multiple programming languages beyond Python.
- **WebDriver protocol** - Selenium uses the W3C WebDriver protocol, providing standardized browser automation that works with Chrome, Firefox, Edge, and Safari.
- **Headless and headful modes** - Selenium can run with or without a visible browser window, making it suitable for both local development and containerized environments.
- **Flexible element selection** - Selenium provides CSS selectors, XPath, ID, class name, and other strategies for locating elements on a page.
- **User interaction emulation** - Selenium allows you to emulate user actions like clicking, scrolling, filling out forms, and typing, which is useful for scraping dynamic websites.

To create Actors which use Selenium, start from the [Selenium & Python](https://apify.com/templates/categories/python) Actor template.

10 changes: 5 additions & 5 deletions docs/03_guides/05_crawlee.mdx
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
---
id: crawlee
title: Using Crawlee
title: Use Crawlee
description: Build Apify Actors using Crawlee's BeautifulSoupCrawler, ParselCrawler, or PlaywrightCrawler.
---

@@ -10,7 +10,7 @@ import CrawleeBeautifulSoupExample from '!!raw-loader!roa-loader!./code/05_crawl
import CrawleeParselExample from '!!raw-loader!roa-loader!./code/05_crawlee_parsel.py';
import CrawleePlaywrightExample from '!!raw-loader!roa-loader!./code/05_crawlee_playwright.py';

In this guide you'll learn how to use the [Crawlee](https://crawlee.dev/python) library in your Apify Actors.
In this guide, you'll learn how to use the [Crawlee](https://crawlee.dev/python) library in your Apify Actors.

## Introduction

@@ -20,23 +20,23 @@ In this guide, you'll learn how to use Crawlee with [`BeautifulSoupCrawler`](htt

## Actor with BeautifulSoupCrawler

The [`BeautifulSoupCrawler`](https://crawlee.dev/python/api/class/BeautifulSoupCrawler) is ideal for extracting data from static HTML pages. It uses [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) for parsing and [`ImpitHttpClient`](https://crawlee.dev/python/api/class/ImpitHttpClient) for HTTP communication, ensuring efficient and lightweight scraping. If you do not need to execute JavaScript on the page, [`BeautifulSoupCrawler`](https://crawlee.dev/python/api/class/BeautifulSoupCrawler) is a great choice for your scraping tasks. Below is an example of how to use it` in an Apify Actor.
The [`BeautifulSoupCrawler`](https://crawlee.dev/python/api/class/BeautifulSoupCrawler) is ideal for extracting data from static HTML pages. It uses [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) for parsing and [`ImpitHttpClient`](https://crawlee.dev/python/api/class/ImpitHttpClient) for HTTP communication, ensuring efficient and lightweight scraping. If you do not need to execute JavaScript on the page, [`BeautifulSoupCrawler`](https://crawlee.dev/python/api/class/BeautifulSoupCrawler) is a great choice for your scraping tasks. The following example shows how to use it in an Apify Actor.

<RunnableCodeBlock className="language-python" language="python">
{CrawleeBeautifulSoupExample}
</RunnableCodeBlock>

## Actor with ParselCrawler

The [`ParselCrawler`](https://crawlee.dev/python/api/class/ParselCrawler) works in the same way as [`BeautifulSoupCrawler`](https://crawlee.dev/python/api/class/BeautifulSoupCrawler), but it uses the [Parsel](https://parsel.readthedocs.io/en/latest/) library for HTML parsing. This allows for more powerful and flexible data extraction using [XPath](https://en.wikipedia.org/wiki/XPath) selectors. It should be faster than [`BeautifulSoupCrawler`](https://crawlee.dev/python/api/class/BeautifulSoupCrawler). Below is an example of how to use [`ParselCrawler`](https://crawlee.dev/python/api/class/ParselCrawler) in an Apify Actor.
The [`ParselCrawler`](https://crawlee.dev/python/api/class/ParselCrawler) works in the same way as [`BeautifulSoupCrawler`](https://crawlee.dev/python/api/class/BeautifulSoupCrawler), but it uses the [Parsel](https://parsel.readthedocs.io/en/latest/) library for HTML parsing. This allows for more powerful and flexible data extraction using [XPath](https://en.wikipedia.org/wiki/XPath) selectors. It should be faster than [`BeautifulSoupCrawler`](https://crawlee.dev/python/api/class/BeautifulSoupCrawler). The following example shows how to use [`ParselCrawler`](https://crawlee.dev/python/api/class/ParselCrawler) in an Apify Actor.

<RunnableCodeBlock className="language-python" language="python">
{CrawleeParselExample}
</RunnableCodeBlock>

## Actor with PlaywrightCrawler

The [`PlaywrightCrawler`](https://crawlee.dev/python/api/class/PlaywrightCrawler) is built for handling dynamic web pages that rely on JavaScript for content rendering. Using the [Playwright](https://playwright.dev/) library, it provides a browser-based automation environment to interact with complex websites. Below is an example of how to use [`PlaywrightCrawler`](https://crawlee.dev/python/api/class/PlaywrightCrawler) in an Apify Actor.
The [`PlaywrightCrawler`](https://crawlee.dev/python/api/class/PlaywrightCrawler) is built for handling dynamic web pages that rely on JavaScript for content rendering. Using the [Playwright](https://playwright.dev/) library, it provides a browser-based automation environment to interact with complex websites. The following example shows how to use [`PlaywrightCrawler`](https://crawlee.dev/python/api/class/PlaywrightCrawler) in an Apify Actor.

<RunnableCodeBlock className="language-python" language="python">
{CrawleePlaywrightExample}
4 changes: 2 additions & 2 deletions docs/03_guides/06_scrapy.mdx
@@ -1,6 +1,6 @@
---
id: scrapy
title: Using Scrapy
title: Use Scrapy
description: Convert Scrapy spiders into Apify Actors with platform storage and proxy integration.
---

@@ -70,7 +70,7 @@ For further details, see the [Scrapy migration guide](https://docs.apify.com/cli

## Example Actor

The following example demonstrates a Scrapy Actor that scrapes page titles and enqueues links found on each page. This example aligns with the structure provided in the Apify Actor templates.
The following example shows a Scrapy Actor that scrapes page titles and enqueues links found on each page. This example aligns with the structure provided in the Apify Actor templates.

<Tabs>
<TabItem value="__main__.py" label="__main__.py">
4 changes: 2 additions & 2 deletions docs/03_guides/07_running_webserver.mdx
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
---
id: running-webserver
title: Running webserver
title: Run a web server
description: Run an HTTP server inside your Actor for monitoring or serving content during execution.
---

@@ -24,7 +24,7 @@ The web server running inside the container must listen at the port defined by t

## Example Actor

The following example demonstrates how to start a simple web server in your Actor, which will respond to every GET request with the number of items that the Actor has processed so far:
The following example shows how to start a simple web server in your Actor, which will respond to every GET request with the number of items that the Actor has processed so far:
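The core pattern can also be sketched with the standard library alone, with no Apify dependencies. In the real Actor, the server binds to the port from the `ACTOR_WEB_SERVER_PORT` environment variable rather than an ephemeral one:

```python
# A dependency-free sketch: a tiny HTTP server that reports how many items
# have been processed so far. An Actor would bind to the platform-assigned
# port (ACTOR_WEB_SERVER_PORT) instead of an ephemeral one.
import http.server
import threading
import urllib.request

processed_count = 0

class StatusHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        # Read the counter at request time, so the response stays current.
        body = f'Processed {processed_count} items'.encode()
        self.send_response(200)
        self.send_header('Content-Type', 'text/plain')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example's output quiet

# Port 0 lets the OS pick a free port; serve in a background thread.
server = http.server.ThreadingHTTPServer(('127.0.0.1', 0), StatusHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

processed_count = 5  # simulate some scraping work
with urllib.request.urlopen(f'http://127.0.0.1:{server.server_port}/') as resp:
    body_text = resp.read().decode()
server.shutdown()

print(body_text)  # Processed 5 items
```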

<RunnableCodeBlock className="language-python" language="python">
{WebserverExample}