Commit 83e5f9b

docs: fix inaccuracies and expand SDK documentation
Correct several factual errors verified against the codebase (Python 3.11 requirement, web server config properties, Actor event payload types, a stale env var, and an empty API link), fix a latent runtime bug in the events snippet, and fill gaps by adding a Storage clients page, documenting Actor Standby, and expanding the Introduction and Configuration pages.
1 parent edd23e6 commit 83e5f9b

11 files changed

Lines changed: 171 additions & 27 deletions

docs/01_introduction/index.mdx

Lines changed: 11 additions & 2 deletions
Original file line number | Diff line number | Diff line change
@@ -9,7 +9,16 @@ import CodeBlock from '@theme/CodeBlock';
99

1010
import IntroductionExample from '!!raw-loader!./code/01_introduction.py';
1111

12-
The Apify SDK for Python is the official library for creating [Apify Actors](https://docs.apify.com/platform/actors) in Python. It provides useful features like Actor lifecycle management, local storage emulation, and Actor event handling.
12+
The Apify SDK for Python is the official library for creating [Apify Actors](https://docs.apify.com/platform/actors) in Python. It gives you everything you need to build an Actor and run it both locally and on the [Apify platform](https://docs.apify.com/platform), including:
13+
14+
- **Actor lifecycle management** — initialization, graceful shutdown, status messages, rebooting, and metamorphing.
15+
- **Storage access** — datasets, key-value stores, and request queues, with automatic local emulation when running outside the platform.
16+
- **Actor input** — convenient access to the Actor input, including automatic decryption of secret fields.
17+
- **Events & state persistence** — react to platform events (system info, migration, abort) and persist state across migrations and restarts.
18+
- **Proxy management** — Apify Proxy and custom proxies, with session and tiered-proxy support.
19+
- **Platform interaction** — start, call, and abort other Actors and tasks, create webhooks, and reach the full Apify API client.
20+
- **Monetization** — charge users with the pay-per-event pricing model.
21+
- **Framework integrations** — first-class support for [Crawlee](../guides/crawlee) and [Scrapy](../guides/scrapy).
1322

1423
<CodeBlock className="language-python">
1524
{IntroductionExample}
@@ -29,7 +38,7 @@ Explore the Guides section in the sidebar for a deeper understanding of the SDK'
2938

3039
## Installation
3140

32-
The Apify SDK for Python requires Python version 3.10 or above. It is typically installed when you create a new Actor project using the [Apify CLI](https://docs.apify.com/cli). To install it manually in an existing project, use:
41+
The Apify SDK for Python requires Python version 3.11 or above. It is typically installed when you create a new Actor project using the [Apify CLI](https://docs.apify.com/cli). To install it manually in an existing project, use:
3342

3443
```bash
3544
pip install apify

docs/01_introduction/quick-start.mdx

Lines changed: 1 addition & 0 deletions
@@ -86,6 +86,7 @@ To learn more about the features of the Apify SDK and how to use them, check out
8686
- [Actor lifecycle](../concepts/actor-lifecycle)
8787
- [Actor input](../concepts/actor-input)
8888
- [Working with storages](../concepts/storages)
89+
- [Storage clients](../concepts/storage-clients)
8990
- [Actor events & state persistence](../concepts/actor-events)
9091
- [Proxy management](../concepts/proxy-management)
9192
- [Interacting with other Actors](../concepts/interacting-with-other-actors)

docs/02_concepts/01_actor_lifecycle.mdx

Lines changed: 1 addition & 1 deletion
@@ -106,4 +106,4 @@ Update the status only when the user's understanding of progress changes - avoid
106106

107107
## Conclusion
108108

109-
This page has presented the full Actor lifecycle: initialization, execution, error handling, rebooting, shutdown and status messages. You've seen how the SDK supports both context-based and manual control patterns. For deeper dives, explore the <ApiLink to="">reference docs</ApiLink>, [guides](https://docs.apify.com/sdk/python/docs/guides/beautifulsoup-httpx), and [platform documentation](https://docs.apify.com/platform).
109+
This page has presented the full Actor lifecycle: initialization, execution, error handling, rebooting, shutdown and status messages. You've seen how the SDK supports both context-based and manual control patterns. For deeper dives, explore the <ApiLink to="class/Actor">`Actor` API reference</ApiLink>, [guides](https://docs.apify.com/sdk/python/docs/guides/beautifulsoup-httpx), and [platform documentation](https://docs.apify.com/platform).

docs/02_concepts/03_storages.mdx

Lines changed: 1 addition & 1 deletion
@@ -183,6 +183,6 @@ To check if all the requests in the queue are handled, you can use the <ApiLink
183183

184184
## Storage clients
185185

186-
Behind the scenes, the SDK uses storage clients to communicate with the storage backend. The appropriate client is selected automatically based on the runtime environment — on the Apify platform, data is persisted via the Apify API, while local runs use the filesystem. For most use cases, you don't need to think about storage clients at all. If you want to learn more about how storage clients work, the available implementations, or how to configure them, see the [Crawlee storage clients guide](https://crawlee.dev/python/docs/guides/storage-clients). The Apify-specific clients are available in the `apify.storage_clients` module.
186+
Behind the scenes, the SDK uses storage clients to communicate with the storage backend. The appropriate client is selected automatically based on the runtime environment — on the Apify platform, data is persisted via the Apify API, while local runs use the filesystem. For most use cases, you don't need to think about storage clients at all. To learn about the available implementations, how to switch between the single and shared request queue access modes, or how to configure a custom client, see the [Storage clients](./storage-clients) page. For a deeper look at how storage clients work internally, see the [Crawlee storage clients guide](https://crawlee.dev/python/docs/guides/storage-clients).
187187

188188
For comprehensive information about storage on the Apify platform, see the [storage documentation](https://docs.apify.com/platform/storage), including the pages on [datasets](https://docs.apify.com/platform/storage/dataset), [key-value stores](https://docs.apify.com/platform/storage/key-value-store), and [request queues](https://docs.apify.com/platform/storage/request-queue).

docs/02_concepts/04_actor_events.mdx

Lines changed: 14 additions & 14 deletions
@@ -14,6 +14,8 @@ During its runtime, the Actor receives Actor events sent by the Apify platform o
1414

1515
## Event types
1616

17+
A listener can optionally receive a single argument — a Pydantic model with the event's data. The table below lists the events, the type of that data object, and when each event is emitted.
18+
1719
<table>
1820
<thead>
1921
<tr>
@@ -25,25 +27,23 @@ During its runtime, the Actor receives Actor events sent by the Apify platform o
2527
<tbody>
2628
<tr>
2729
<td><code>SYSTEM_INFO</code></td>
28-
<td><pre>{`{
29-
"created_at": datetime,
30-
"cpu_current_usage": float,
31-
"mem_current_bytes": int,
32-
"is_cpu_overloaded": bool
33-
}`}
34-
</pre></td>
30+
<td><ApiLink to="class/EventSystemInfoData"><code>EventSystemInfoData</code></ApiLink></td>
3531
<td>
36-
<p>This event is emitted regularly and it indicates the current resource usage of the Actor.</p>
37-
The <code>is_cpu_overloaded</code> argument indicates whether the current CPU usage is higher than <code>Config.max_used_cpu_ratio</code>
32+
<p>Emitted regularly to report the Actor's current resource usage. The
33+
<code>cpu_info.used_ratio</code> field reports the fraction of CPU currently in use
34+
(a float between <code>0.0</code> and <code>1.0</code>), and <code>memory_info.current_size</code>
35+
reports the current memory usage. Compare <code>cpu_info.used_ratio</code> against
36+
<code>Configuration.max_used_cpu_ratio</code> to detect CPU overload.</p>
3837
</td>
3938
</tr>
4039
<tr>
4140
<td><code>MIGRATING</code></td>
42-
<td><code>None</code></td>
41+
<td><ApiLink to="class/EventMigratingData"><code>EventMigratingData</code></ApiLink></td>
4342
<td>
4443
<p>Emitted when the Actor running on the Apify platform
4544
is going to be <a href="https://docs.apify.com/platform/actors/development/state-persistence#what-is-a-migration">migrated</a>
46-
{' '}to another worker server soon.</p>
45+
{' '}to another worker server soon. The <code>time_remaining</code> field reports how much time
46+
the Actor has left before it is force-migrated.</p>
4747
You can use it to persist the state of the Actor so that once it is executed again on the new server,
4848
it doesn't have to start over from the beginning.
4949
Once you have persisted the state of your Actor, you can call <ApiLink to="class/Actor#reboot">`Actor.reboot`</ApiLink>
@@ -52,7 +52,7 @@ During its runtime, the Actor receives Actor events sent by the Apify platform o
5252
</tr>
5353
<tr>
5454
<td><code>ABORTING</code></td>
55-
<td><code>None</code></td>
55+
<td><ApiLink to="class/EventAbortingData"><code>EventAbortingData</code></ApiLink></td>
5656
<td>
5757
When a user aborts an Actor run on the Apify platform,
5858
they can choose to abort gracefully to allow the Actor some time before getting killed.
@@ -61,7 +61,7 @@ During its runtime, the Actor receives Actor events sent by the Apify platform o
6161
</tr>
6262
<tr>
6363
<td><code>PERSIST_STATE</code></td>
64-
<td><pre>{`{ "is_migrating": bool }`}</pre></td>
64+
<td><ApiLink to="class/EventPersistStateData"><code>EventPersistStateData</code></ApiLink></td>
6565
<td>
6666
<p>Emitted in regular intervals (by default 60 seconds) to notify the Actor that it should persist its state,
6767
in order to avoid repeating all work when the Actor restarts.</p>
@@ -73,7 +73,7 @@ During its runtime, the Actor receives Actor events sent by the Apify platform o
7373
</tr>
7474
<tr>
7575
<td><code>EXIT</code></td>
76-
<td><code>None</code></td>
76+
<td><ApiLink to="class/EventExitData"><code>EventExitData</code></ApiLink></td>
7777
<td>
7878
Emitted by the SDK (not the platform) when the Actor is about to exit. You can use this event to perform final cleanup tasks,
7979
such as closing external connections or sending notifications, before the Actor shuts down.
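
The `SYSTEM_INFO` guidance in this hunk (compare `cpu_info.used_ratio` against `Configuration.max_used_cpu_ratio`) can be sketched in isolation. This is a minimal illustration only: `CpuInfo`, `SystemInfo`, and `is_cpu_overloaded` are hypothetical stand-ins rather than SDK classes, and the `0.95` threshold is an assumed value. In a real Actor the listener would receive an `EventSystemInfoData` instance and the limit would come from the Actor's configuration.

```python
from dataclasses import dataclass


@dataclass
class CpuInfo:
    """Hypothetical stand-in for the `cpu_info` field of the event data."""

    used_ratio: float  # Fraction of CPU in use, between 0.0 and 1.0.


@dataclass
class SystemInfo:
    """Hypothetical stand-in for `EventSystemInfoData`."""

    cpu_info: CpuInfo


# Assumed threshold; a real Actor would read `Configuration.max_used_cpu_ratio`.
MAX_USED_CPU_RATIO = 0.95


def is_cpu_overloaded(
    event_data: SystemInfo,
    max_ratio: float = MAX_USED_CPU_RATIO,
) -> bool:
    """Return True when the reported CPU usage exceeds the allowed ratio."""
    return event_data.cpu_info.used_ratio > max_ratio


if __name__ == '__main__':
    busy = SystemInfo(cpu_info=CpuInfo(used_ratio=0.98))
    idle = SystemInfo(cpu_info=CpuInfo(used_ratio=0.20))
    print('busy overloaded:', is_cpu_overloaded(busy))
    print('idle overloaded:', is_cpu_overloaded(idle))
```

A listener registered for `Event.SYSTEM_INFO` could apply the same comparison to the data object it receives and, for example, slow down its work while the check returns `True`.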

docs/02_concepts/10_configuration.mdx

Lines changed: 18 additions & 3 deletions
@@ -27,14 +27,29 @@ This will cause the Actor to persist its state every 10 seconds:
2727

2828
## Configuring via environment variables
2929

30-
All the configuration options can be set via environment variables. The environment variables are prefixed with `APIFY_`, and the configuration options are in uppercase, with underscores as separators. See the <ApiLink to="class/Configuration">`Configuration`</ApiLink> API reference for the full list of configuration options.
30+
All configuration options can also be set via environment variables. Most options are read from an environment variable named after the option in uppercase; many options accept several aliases — commonly with an `APIFY_`, `ACTOR_`, or `CRAWLEE_` prefix. See the <ApiLink to="class/Configuration">`Configuration`</ApiLink> API reference for the full list of configuration options.
3131

32-
This Actor run will not persist its local storages to the filesystem:
32+
For example, this Actor run will keep the contents of its local storages instead of purging them on start:
3333

3434
```bash
35-
APIFY_PERSIST_STORAGE=0 apify run
35+
APIFY_PURGE_ON_START=0 apify run
3636
```
3737

38+
### Commonly used options
39+
40+
The table below lists a few options you are most likely to set or inspect yourself. When running on the Apify platform or via the Apify CLI, the platform-related options are populated automatically.
41+
42+
| Option | Environment variable | Default | Description |
43+
| --- | --- | --- | --- |
44+
| `token` | `APIFY_TOKEN` | `None` | API token used to authenticate calls to the Apify API. |
45+
| `proxy_password` | `APIFY_PROXY_PASSWORD` | `None` | Password for [Apify Proxy](https://docs.apify.com/proxy). |
46+
| `purge_on_start` | `APIFY_PURGE_ON_START` | `True` | Whether to purge local storages when the Actor starts. |
47+
| `persist_state_interval` | `APIFY_PERSIST_STATE_INTERVAL_MILLIS` | `1 min` | How often the `PERSIST_STATE` event is emitted (the variable is in milliseconds). |
48+
| `log_level` | `APIFY_LOG_LEVEL` | `'INFO'` | Minimum severity of log messages that are printed. |
49+
| `headless` | `APIFY_HEADLESS` | `True` | Whether to run browsers in headless mode. |
50+
| `storage_dir` | `APIFY_LOCAL_STORAGE_DIR` | `'./storage'` | Directory holding local storages when running outside the platform. |
51+
| `is_at_home` | `APIFY_IS_AT_HOME` | `False` | Set by the platform — `True` when the Actor runs on Apify. |
52+
3853
## Reading the runtime environment
3954

4055
The <ApiLink to="class/Actor#get_env">`Actor.get_env`</ApiLink> method returns a dictionary with all `APIFY_*` environment variables parsed into their typed values. This is useful for inspecting the Actor's runtime context, such as the Actor ID, run ID, or default storage IDs. Variables that are not set or are invalid will have a value of `None`.
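
The idea of parsing environment variables into typed values, with `None` for anything unset or invalid, can be sketched with the standard library alone. This is an illustration under stated assumptions, not the SDK's implementation: `parse_bool`, `parse_millis`, and `read_option` are hypothetical helpers, and the set of strings accepted as booleans here is an assumption that may differ from what the SDK accepts.

```python
from datetime import timedelta
from typing import Callable, Mapping, Optional, TypeVar

T = TypeVar('T')


def parse_bool(raw: str) -> bool:
    """Parse a boolean flag; the accepted spellings are an assumption."""
    value = raw.strip().lower()
    if value in {'1', 'true'}:
        return True
    if value in {'0', 'false'}:
        return False
    raise ValueError(f'not a boolean: {raw!r}')


def parse_millis(raw: str) -> timedelta:
    """Parse an integer number of milliseconds into a timedelta."""
    return timedelta(milliseconds=int(raw))


def read_option(
    env: Mapping[str, str],
    name: str,
    parser: Callable[[str], T],
) -> Optional[T]:
    """Return the parsed value, or None when the variable is unset or invalid."""
    raw = env.get(name)
    if raw is None:
        return None
    try:
        return parser(raw)
    except ValueError:
        return None


if __name__ == '__main__':
    # A plain dict stands in for `os.environ` so the example is deterministic.
    env = {
        'APIFY_PURGE_ON_START': '0',
        'APIFY_PERSIST_STATE_INTERVAL_MILLIS': '10000',
        'APIFY_HEADLESS': 'maybe',
    }
    print(read_option(env, 'APIFY_PURGE_ON_START', parse_bool))
    print(read_option(env, 'APIFY_PERSIST_STATE_INTERVAL_MILLIS', parse_millis))
    print(read_option(env, 'APIFY_HEADLESS', parse_bool))
    print(read_option(env, 'APIFY_TOKEN', str))
```

Passing a mapping instead of reading `os.environ` directly keeps the helper easy to test; in practice you would rely on `Actor.get_env` or the `Configuration` object rather than parsing the variables by hand.
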
Lines changed: 64 additions & 0 deletions
@@ -0,0 +1,64 @@
1+
---
2+
id: storage-clients
3+
title: Storage clients
4+
description: Choose and configure the backend the Actor uses for datasets, key-value stores, and request queues.
5+
---
6+
7+
import RunnableCodeBlock from '@site/src/components/RunnableCodeBlock';
8+
import ApiLink from '@theme/ApiLink';
9+
10+
import SharedRequestQueueExample from '!!raw-loader!roa-loader!./code/12_shared_request_queue.py';
11+
import CustomStorageClientExample from '!!raw-loader!roa-loader!./code/12_custom_storage_client.py';
12+
13+
Storage clients are the components that actually read and write your [storages](./storages) — datasets, key-value stores, and request queues. The Apify SDK selects an appropriate client automatically based on where the Actor runs, so for most Actors you never need to think about them. This page explains the available clients and how to customize them when you do.
14+
15+
## How the Actor selects a storage client
16+
17+
By default, the Actor uses a <ApiLink to="class/SmartApifyStorageClient">`SmartApifyStorageClient`</ApiLink> — a hybrid client that delegates to one of two underlying clients depending on the environment:
18+
19+
- When running **on the Apify platform** (detected automatically), or when you pass `force_cloud=True`, it uses the **cloud** client — <ApiLink to="class/ApifyStorageClient">`ApifyStorageClient`</ApiLink>, which persists data through the Apify API.
20+
- When running **locally**, it uses the **local** client — <ApiLink to="class/FileSystemStorageClient">`FileSystemStorageClient`</ApiLink>, which emulates platform storages on your filesystem under the `storage` folder.
21+
22+
This is what lets the same Actor code run unchanged both locally and on the platform.
23+
24+
## Available storage clients
25+
26+
The `apify.storage_clients` module provides the following clients:
27+
28+
- <ApiLink to="class/SmartApifyStorageClient">`SmartApifyStorageClient`</ApiLink> — the default hybrid client described above. It wraps a `cloud_storage_client` and a `local_storage_client` and routes each call to the right one.
29+
- <ApiLink to="class/ApifyStorageClient">`ApifyStorageClient`</ApiLink> — talks to the Apify API. Used as the cloud client.
30+
- <ApiLink to="class/FileSystemStorageClient">`FileSystemStorageClient`</ApiLink> — persists data to the local filesystem. Used as the default local client.
31+
- <ApiLink to="class/MemoryStorageClient">`MemoryStorageClient`</ApiLink> — keeps everything in memory only; nothing is persisted. Useful for tests and short-lived runs.
32+
33+
## Single vs. shared request queue
34+
35+
`ApifyStorageClient` supports two ways of accessing the Apify request queue, selected via its `request_queue_access` argument:
36+
37+
- **`'single'`** (default) — optimized for a single consumer. It makes far fewer API calls, so it is cheaper and faster, but it does not support multiple clients consuming the same queue concurrently. This is the right choice for the majority of Actors.
38+
- **`'shared'`** — supports multiple consumers working on the same queue at the same time, at the cost of more API calls.
39+
40+
To opt into the shared client, set it as the cloud client of the `SmartApifyStorageClient` in the [service locator](https://crawlee.dev/python/docs/guides/service-locator) before entering the Actor context:
41+
42+
<RunnableCodeBlock className="language-python" language="python">
43+
{SharedRequestQueueExample}
44+
</RunnableCodeBlock>
45+
46+
## Using cloud storage while running locally
47+
48+
When developing locally, storages are read from and written to the local filesystem by default. To work with a storage on the Apify platform instead — for example, to read the output of a remote Actor run — pass `force_cloud=True` to <ApiLink to="class/Actor#open_dataset">`Actor.open_dataset`</ApiLink>, <ApiLink to="class/Actor#open_key_value_store">`Actor.open_key_value_store`</ApiLink>, or <ApiLink to="class/Actor#open_request_queue">`Actor.open_request_queue`</ApiLink>. This requires an Apify token, provided via the `APIFY_TOKEN` environment variable.
49+
50+
## Customizing the storage client
51+
52+
You can replace either of the underlying clients — for example, to keep all local data in memory instead of on disk. To do this, set a `SmartApifyStorageClient` with your chosen sub-clients in the service locator **before** entering the Actor context (or awaiting <ApiLink to="class/Actor#init">`Actor.init`</ApiLink>):
53+
54+
<RunnableCodeBlock className="language-python" language="python">
55+
{CustomStorageClientExample}
56+
</RunnableCodeBlock>
57+
58+
:::note
59+
60+
The Actor's storage client must be a `SmartApifyStorageClient`. Setting a bare `ApifyStorageClient` or `MemoryStorageClient` directly in the service locator raises an error — wrap it in a `SmartApifyStorageClient` as shown above.
61+
62+
:::
63+
64+
For a deeper look at how storage clients work and how to write your own, see the [Crawlee storage clients guide](https://crawlee.dev/python/docs/guides/storage-clients).
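
The selection rule described under "How the Actor selects a storage client" can be condensed into a few lines. The sketch below is purely illustrative: `StubStorageClient` and `select_storage_client` are hypothetical names, and the real `SmartApifyStorageClient` may implement the decision differently. It only mirrors the documented behavior, which is to use the cloud client on the platform or when `force_cloud=True` and the local client otherwise.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StubStorageClient:
    """Hypothetical placeholder for a real storage client."""

    name: str


def select_storage_client(
    cloud_client: StubStorageClient,
    local_client: StubStorageClient,
    *,
    is_at_home: bool,
    force_cloud: bool = False,
) -> StubStorageClient:
    """Pick the cloud client on the platform or when forced, else the local one."""
    if is_at_home or force_cloud:
        return cloud_client
    return local_client


if __name__ == '__main__':
    cloud = StubStorageClient('cloud')
    local = StubStorageClient('local')

    # Local run: data goes to the local client.
    print(select_storage_client(cloud, local, is_at_home=False).name)
    # Local run with force_cloud=True: data goes to the cloud client.
    print(select_storage_client(cloud, local, is_at_home=False, force_cloud=True).name)
    # Run on the platform: data goes to the cloud client.
    print(select_storage_client(cloud, local, is_at_home=True).name)
```

Because both sub-clients are passed in explicitly, swapping one of them (for example an in-memory client for local runs) does not change the selection rule itself.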

docs/02_concepts/code/04_actor_events.py

Lines changed: 5 additions & 4 deletions
@@ -1,7 +1,6 @@
11
import asyncio
2-
from typing import Any
32

4-
from apify import Actor, Event
3+
from apify import Actor, Event, EventPersistStateData
54

65

76
async def main() -> None:
@@ -15,9 +14,11 @@ async def main() -> None:
1514
processed_items = actor_state
1615

1716
# Save the state when the `PERSIST_STATE` event happens
18-
async def save_state(event_data: Any) -> None:
17+
async def save_state(event_data: EventPersistStateData) -> None:
1918
nonlocal processed_items
20-
Actor.log.info('Saving Actor state', extra=event_data)
19+
Actor.log.info(
20+
'Persisting Actor state (migrating=%s)', event_data.is_migrating
21+
)
2122
await Actor.set_value('STATE', processed_items)
2223

2324
Actor.on(Event.PERSIST_STATE, save_state)
Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
1+
import asyncio
2+
3+
from crawlee import service_locator
4+
5+
from apify import Actor
6+
from apify.storage_clients import MemoryStorageClient, SmartApifyStorageClient
7+
8+
9+
async def main() -> None:
10+
    # Keep all local data in memory instead of writing it to the filesystem
11+
    # when running outside the Apify platform.
12+
    service_locator.set_storage_client(
13+
        SmartApifyStorageClient(local_storage_client=MemoryStorageClient()),
14+
    )
15+
16+
    async with Actor:
17+
        store = await Actor.open_key_value_store()
18+
        await store.set_value('example', {'hello': 'world'})
19+
20+
21+
if __name__ == '__main__':
22+
    asyncio.run(main())
Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
1+
import asyncio
2+
3+
from crawlee import service_locator
4+
5+
from apify import Actor
6+
from apify.storage_clients import ApifyStorageClient, SmartApifyStorageClient
7+
8+
9+
async def main() -> None:
10+
    # Use the shared Apify request queue client, which supports multiple
11+
    # consumers working on the same queue at the cost of more API calls.
12+
    service_locator.set_storage_client(
13+
        SmartApifyStorageClient(
14+
            cloud_storage_client=ApifyStorageClient(request_queue_access='shared'),
15+
        )
16+
    )
17+
18+
    async with Actor:
19+
        request_queue = await Actor.open_request_queue()
20+
        await request_queue.add_request('https://crawlee.dev')
21+
22+
23+
if __name__ == '__main__':
24+
    asyncio.run(main())
