Commit 55f13e8

Merge branch 'master' into apify-default-dataset-item-event
2 parents 7347d30 + 11af30a commit 55f13e8

44 files changed, +1317 -823 lines changed

.gitignore

Lines changed: 3 additions & 3 deletions
Original file line number | Diff line number | Diff line change
@@ -15,9 +15,9 @@
1515
.serena
1616
.windsurf
1717
.zed-ai
18-
AGENTS.md
19-
CLAUDE.md
20-
GEMINI.md
18+
AGENTS.local.md
19+
CLAUDE.local.md
20+
GEMINI.local.md
2121

2222
# Cache
2323
__pycache__

.rules.md

Lines changed: 116 additions & 0 deletions
@@ -0,0 +1,116 @@
1+
# Coding guidelines
2+
3+
This file provides guidance to programming agents when working with code in this repository.
4+
5+
## Project Overview
6+
7+
The Apify SDK for Python (`apify` package on PyPI) is the official library for creating [Apify Actors](https://docs.apify.com/platform/actors) in Python. It provides Actor lifecycle management, storage access (datasets, key-value stores, request queues), event handling, proxy configuration, and pay-per-event charging. It builds on top of the [Crawlee](https://crawlee.dev/python) web scraping framework and the [Apify API Client](https://docs.apify.com/api/client/python). Supports Python 3.10–3.14. Build system: hatchling.
8+
9+
## Common Commands
10+
11+
```bash
12+
# Install dependencies (including dev)
13+
uv sync --all-extras
14+
15+
# Install dev dependencies + pre-commit hooks
16+
uv run poe install-dev
17+
18+
# Format code (also auto-fixes lint issues via ruff check --fix)
19+
uv run poe format
20+
21+
# Lint (format check + ruff check)
22+
uv run poe lint
23+
24+
# Type check
25+
uv run poe type-check
26+
27+
# Run all checks (lint + type-check + unit tests)
28+
uv run poe check-code
29+
30+
# Unit tests (no API token needed)
31+
uv run poe unit-tests
32+
33+
# Run a single test file
34+
uv run pytest tests/unit/actor/test_actor_lifecycle.py
35+
36+
# Run a single test by name
37+
uv run pytest tests/unit/actor/test_actor_lifecycle.py -k "test_name"
38+
39+
# Integration tests (needs APIFY_TEST_USER_API_TOKEN)
40+
uv run poe integration-tests
41+
42+
# E2E tests (needs APIFY_TEST_USER_API_TOKEN, builds/deploys Actors on platform)
43+
uv run poe e2e-tests
44+
```
45+
46+
## Code Style
47+
48+
- **Formatter/Linter**: Ruff (line length 120, single quotes for inline, double quotes for docstrings)
49+
- **Type checker**: ty (targets Python 3.10)
50+
- **All ruff rules enabled** with specific ignores — see `pyproject.toml` `[tool.ruff.lint]` for the full ignore list
51+
- Tests are exempt from docstring rules (`D`), assert warnings (`S101`), and private member access (`SLF001`)
52+
- Unused imports are allowed in `__init__.py` files (re-exports)
53+
- **Pre-commit hooks**: lint check + type check run automatically on commit
54+
55+
## Architecture
56+
57+
### Core (`src/apify/`)
58+
59+
- **`_actor.py`** — The `_ActorType` class is the central API. `Actor` is a lazy-object-proxy (`lazy-object-proxy.Proxy`) wrapping `_ActorType` — it acts as both a class (e.g. `Actor.is_at_home()`) and an instance-like context manager (`async with Actor:`). On `__aenter__`, the proxy's `__wrapped__` is replaced with the active `_ActorType` instance. It manages the full Actor lifecycle (`init`, `exit`, `fail`), provides access to storages (`open_dataset`, `open_key_value_store`, `open_request_queue`), handles events, proxy configuration, charging, and platform API operations (`start`, `call`, `metamorph`, `reboot`).
60+
61+
- **`_configuration.py`** - `Configuration` extends Crawlee's `Configuration` with Apify-specific settings (API URL, token, Actor run metadata, proxy settings, charging config). Configuration is populated from environment variables (`APIFY_*`).
62+
63+
- **`_charging.py`** — Pay-per-event billing system. `ChargingManager` / `ChargingManagerImplementation` handle charging events against pricing info fetched from the API.
64+
65+
- **`_proxy_configuration.py`** - `ProxyConfiguration` manages Apify proxy setup (residential, datacenter, groups, country targeting).
66+
67+
- **`_models.py`** — Pydantic models for API data structures (Actor runs, webhooks, pricing info, etc.).
68+
69+
### Storage Clients (`src/apify/storage_clients/`)
70+
71+
Four storage client implementations, all implementing Crawlee's abstract storage client interface:
72+
73+
- **`_apify/`** - `ApifyStorageClient`: talks to the Apify API for dataset, key-value store, and request queue operations (separate sub-clients for single vs. shared request queues). Used when running on the Apify platform.
74+
- **`_file_system/`** - `FileSystemStorageClient` (alias `ApifyFileSystemStorageClient`): extends Crawlee's file system client with Apify-specific key-value store behavior.
75+
- **`_smart_apify/`** - `SmartApifyStorageClient`: hybrid client that writes to both API and local file system for resilience.
76+
- **`MemoryStorageClient`** — re-exported from Crawlee for in-memory storage.
77+
78+
### Storages (`src/apify/storages/`)
79+
80+
Re-exports Crawlee's `Dataset`, `KeyValueStore`, and `RequestQueue` classes.
81+
82+
### Events (`src/apify/events/`)
83+
84+
- **`_apify_event_manager.py`** - `ApifyEventManager` extends Crawlee's event system with platform-specific events received via WebSocket connection.
85+
86+
### Request Loaders (`src/apify/request_loaders/`)
87+
88+
- **`_apify_request_list.py`** - `ApifyRequestList` creates request lists from Actor input URLs (supports both direct URLs and "requests from URL" sources).
89+
90+
### Scrapy Integration (`src/apify/scrapy/`)
91+
92+
Optional integration (`apify[scrapy]` extra) providing Scrapy scheduler, middlewares, pipelines, and extensions for running Scrapy spiders as Apify Actors.
93+
94+
### Key Dependencies
95+
96+
- **`crawlee`** — Base framework providing storage abstractions, event system, configuration, service locator pattern
97+
- **`apify-client`** — HTTP client for the Apify API (`ApifyClientAsync`)
98+
- **`apify-shared`** — Shared constants and utilities (`ApifyEnvVars`, `ActorEnvVars`, etc.)
99+
100+
## Testing
101+
102+
Three test levels in `tests/`:
103+
104+
- **`unit/`** — Fast tests with no external dependencies. Use mocked API clients (`ApifyClientAsyncPatcher` fixture). Run with `uv run poe unit-tests`.
105+
- **`integration/`** — Tests making real Apify API calls but not deploying Actors. Requires `APIFY_TEST_USER_API_TOKEN`. Run with `uv run poe integration-tests`.
106+
- **`e2e/`** — Full end-to-end tests that build and deploy Actors on the platform. Slowest. Requires `APIFY_TEST_USER_API_TOKEN`. Use `make_actor` and `run_actor` fixtures. Run with `uv run poe e2e-tests`.
107+
108+
All test levels use `pytest-asyncio` with `asyncio_mode = "auto"` (no need for `@pytest.mark.asyncio`). Tests run in parallel via `pytest-xdist` (`--numprocesses`). Each test gets isolated state via the autouse `_isolate_test_environment` fixture which resets `Actor`, `service_locator`, and `AliasResolver` state. Conftest files live in each subdirectory (`tests/unit/conftest.py`, etc.) — there is no top-level `tests/conftest.py`.
109+
110+
### Key Test Fixtures
111+
112+
- **`apify_client_async_patcher`** (unit) — `ApifyClientAsyncPatcher` instance for mocking `ApifyClientAsync` methods. Patch by `method`/`submethod`, tracks call history in `.calls`.
113+
- **`make_httpserver`/`httpserver`** (unit) — session-scoped `HTTPServer` via `pytest-httpserver` for HTTP interception.
114+
- **`apify_client_async`** (integration/e2e) — real `ApifyClientAsync` using `APIFY_TEST_USER_API_TOKEN`.
115+
- **`make_actor`** (e2e) — creates a temporary Actor on the platform from a function, `main_py` string, or source files dict; cleans up after the session.
116+
- **`run_actor`** (e2e) — calls an Actor and waits up to 10 minutes for completion.
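
The `Actor` proxy behaviour described under Architecture (class-like calls before entering the context, instance-like use inside `async with`) can be illustrated with a standard-library-only sketch. This is a simplified analogy, not the SDK's implementation: `_ActorTypeSketch` and `ActorProxySketch` are hypothetical names, and the real SDK relies on `lazy-object-proxy` rather than a hand-written `__getattr__`.

```python
import asyncio


class _ActorTypeSketch:
    """Hypothetical stand-in for `_ActorType`; not the SDK class."""

    def __init__(self) -> None:
        self.active = False

    def is_at_home(self) -> bool:
        # The real method inspects the environment; this stub always says no.
        return False

    async def init(self) -> None:
        self.active = True

    async def exit(self) -> None:
        self.active = False


class ActorProxySketch:
    """Forwards attribute access to a lazily created, replaceable target."""

    def __init__(self, factory) -> None:
        self._factory = factory
        self._wrapped = None  # created on first use

    def _target(self):
        if self._wrapped is None:
            self._wrapped = self._factory()
        return self._wrapped

    def __getattr__(self, name):
        # Called only for attributes that the proxy itself does not define.
        return getattr(self._target(), name)

    async def __aenter__(self):
        # Swap in a fresh instance for this run, then initialise it.
        self._wrapped = self._factory()
        await self._wrapped.init()
        return self._wrapped

    async def __aexit__(self, exc_type, exc, tb) -> None:
        await self._wrapped.exit()


Actor = ActorProxySketch(_ActorTypeSketch)


async def demo() -> tuple:
    at_home = Actor.is_at_home()  # class-like use, before any context
    async with Actor as actor:
        inside = actor.active  # instance-like use inside the context
    return at_home, inside, actor.active


if __name__ == '__main__':
    print(asyncio.run(demo()))
```

The point of the design is that one module-level name serves both usage styles; the sketch only shows the attribute forwarding and the target swap on entry.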

AGENTS.md

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
1+
.rules.md

CHANGELOG.md

Lines changed: 18 additions & 2 deletions
@@ -3,15 +3,31 @@
33
All notable changes to this project will be documented in this file.
44

55
<!-- git-cliff-unreleased-start -->
6-
## 3.2.2 - **not yet released**
6+
## 3.3.1 - **not yet released**
7+
8+
### 🐛 Bug Fixes
9+
10+
- Fix f-string bugs in charging log message ([#817](https://github.com/apify/apify-sdk-python/pull/817)) ([bcb4050](https://github.com/apify/apify-sdk-python/commit/bcb4050b3f5ade0e577dd7499979dc65c0ba815e)) by [@vdusek](https://github.com/vdusek)
11+
- Fix BeforeValidator treating 0 as falsy in configuration fields ([#819](https://github.com/apify/apify-sdk-python/pull/819)) ([72efe88](https://github.com/apify/apify-sdk-python/commit/72efe883574ef0d05795934337c7f1c8c0d11877)) by [@vdusek](https://github.com/vdusek)
12+
- Clamp negative timedelta in _get_remaining_time() ([#818](https://github.com/apify/apify-sdk-python/pull/818)) ([69b8af9](https://github.com/apify/apify-sdk-python/commit/69b8af9b8d245cfed76875f983f374e05d93bba8)) by [@vdusek](https://github.com/vdusek)
13+
- **scrapy:** Close AsyncThread on scheduler open() failure ([#820](https://github.com/apify/apify-sdk-python/pull/820)) ([7dfaf1a](https://github.com/apify/apify-sdk-python/commit/7dfaf1a5c5af44743bd448a91140d9b074ac44bf)) by [@vdusek](https://github.com/vdusek)
14+
15+
16+
<!-- git-cliff-unreleased-end -->
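
Two of the unreleased fixes above name common bug classes: a truthiness check that treats `0` as "not set", and a remaining-time computation that can go negative. The changelog gives only the titles, so the following is a generic illustration of those two pitfalls, not the SDK's actual patches. All function names here are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def empty_to_none_buggy(value):
    # Truthiness check: 0, 0.0 and False are also turned into None.
    return value or None


def empty_to_none_fixed(value):
    # Only an empty string counts as "not set"; 0 stays 0.
    return None if value == '' else value


def remaining_time(deadline: datetime, now: datetime) -> timedelta:
    # A deadline in the past yields a negative timedelta; clamp it to zero.
    return max(deadline - now, timedelta(0))


if __name__ == '__main__':
    now = datetime(2026, 1, 1, tzinfo=timezone.utc)
    print(empty_to_none_buggy(0), empty_to_none_fixed(0))
    print(remaining_time(now - timedelta(seconds=5), now))
```

A validator written in the first style silently discards a legitimate `0` (for example, a zero timeout or limit), which is why the explicit comparison is the safer default.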
17+
## [3.3.0](https://github.com/apify/apify-sdk-python/releases/tag/v3.3.0) (2026-02-25)
18+
19+
### 🚀 Features
20+
21+
- Support Actor schema storages with Alias mechanism ([#797](https://github.com/apify/apify-sdk-python/pull/797)) ([10986ac](https://github.com/apify/apify-sdk-python/commit/10986ac2f4a3d1112aa06eaf26f82884ab9c455a)) by [@Pijukatel](https://github.com/Pijukatel), closes [#762](https://github.com/apify/apify-sdk-python/issues/762)
22+
- Migrate to Scrapy's native AsyncCrawlerRunner ([#793](https://github.com/apify/apify-sdk-python/pull/793)) ([01ad9da](https://github.com/apify/apify-sdk-python/commit/01ad9daf834894f798bbfa4362fc7d7f95bafe5c)) by [@vdusek](https://github.com/vdusek), closes [#638](https://github.com/apify/apify-sdk-python/issues/638)
723

824
### 🐛 Bug Fixes
925

1026
- Resolve LogRecord attribute conflict in event manager logging ([#802](https://github.com/apify/apify-sdk-python/pull/802)) ([e1bdbc9](https://github.com/apify/apify-sdk-python/commit/e1bdbc9e303c24571b9511f43ec0815e7e9f4b55)) by [@vdusek](https://github.com/vdusek)
1127
- Update models.py to align with the current API behavior ([#782](https://github.com/apify/apify-sdk-python/pull/782)) ([b06355d](https://github.com/apify/apify-sdk-python/commit/b06355dbc1c8276e9930ecbde72795b6570dde33)) by [@vdusek](https://github.com/vdusek), closes [#778](https://github.com/apify/apify-sdk-python/issues/778)
28+
- Handle `ServiceConflictError` when reusing `Actor` across sequential context ([#804](https://github.com/apify/apify-sdk-python/pull/804)) ([9e5078f](https://github.com/apify/apify-sdk-python/commit/9e5078fa7b1a19e44893bd3409b45108519aef63)) by [@Mantisus](https://github.com/Mantisus), closes [#678](https://github.com/apify/apify-sdk-python/issues/678)
1229

1330

14-
<!-- git-cliff-unreleased-end -->
1531
## [3.2.1](https://github.com/apify/apify-sdk-python/releases/tag/v3.2.1) (2026-02-17)
1632

1733
### 🐛 Bug Fixes

CLAUDE.md

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
1+
.rules.md

GEMINI.md

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
1+
.rules.md

docs/02_concepts/11_pay_per_event.mdx

Lines changed: 17 additions & 0 deletions
@@ -6,6 +6,7 @@ description: Monetize your Actors using the pay-per-event pricing model
66

77
import ActorChargeSource from '!!raw-loader!roa-loader!./code/11_actor_charge.py';
88
import ConditionalActorChargeSource from '!!raw-loader!roa-loader!./code/11_conditional_actor_charge.py';
9+
import ChargeLimitCheckSource from '!!raw-loader!roa-loader!./code/11_charge_limit_check.py';
910
import ApiLink from '@site/src/components/ApiLink';
1011
import RunnableCodeBlock from '@site/src/components/RunnableCodeBlock';
1112

@@ -31,6 +32,22 @@ Then you just push your code to Apify and that's it! The SDK will even keep trac
3132

3233
If you need finer control over charging, you can call <ApiLink to="class/Actor#get_charging_manager">`Actor.get_charging_manager()`</ApiLink> to access the <ApiLink to="class/ChargingManager">`ChargingManager`</ApiLink>, which can provide more detailed information - for example how many events of each type can be charged before reaching the configured limit.
3334

35+
### Handling the charge limit
36+
37+
While the SDK automatically prevents overcharging by limiting how many events are charged and how many items are pushed, **it does not stop your Actor from running**. When the charge limit is reached, <ApiLink to="class/Actor#charge">`Actor.charge`</ApiLink> and `Actor.push_data` will silently stop charging and pushing data, but your Actor will keep running — potentially doing expensive work (scraping pages, calling APIs) for no purpose. This means your Actor may never terminate on its own if you don't check the charge limit yourself.
38+
39+
To avoid this, you should check the `event_charge_limit_reached` field in the result returned by <ApiLink to="class/Actor#charge">`Actor.charge`</ApiLink> or `Actor.push_data` and stop your Actor when the limit is reached. You can also use the `chargeable_within_limit` field from the result to plan ahead — it tells you how many events of each type can still be charged within the remaining budget.
40+
41+
<RunnableCodeBlock className="language-python" language="python">
42+
{ChargeLimitCheckSource}
43+
</RunnableCodeBlock>
44+
45+
Alternatively, you can periodically check the remaining budget via <ApiLink to="class/Actor#get_charging_manager">`Actor.get_charging_manager()`</ApiLink> instead of inspecting every `ChargeResult`. This can be useful when charging happens in multiple places across your code, or when using a crawler where you don't directly control the main loop.
46+
47+
:::caution
48+
Always check the charge limit in your Actor, whether through `ChargeResult` return values or the `ChargingManager`. Without this check, your Actor will continue running and consuming platform resources after the budget is exhausted, producing no output.
49+
:::
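
The periodic budget check described above can be sketched without the SDK. `FakeChargingManager` is a hypothetical in-memory stand-in with prices in whole cents. The method name `calculate_max_event_charge_count_within_limit` follows what the SDK's `ChargingManager` is understood to expose, but treat it as an assumption and confirm it against the API reference. The event name `page-scraped` is made up for the example.

```python
from dataclasses import dataclass


@dataclass
class FakeChargingManager:
    """Hypothetical stand-in; the real manager comes from the SDK."""

    max_total_charge_cents: int
    event_prices_cents: dict
    charged_cents: int = 0

    def calculate_max_event_charge_count_within_limit(self, event_name: str) -> int:
        # How many more events of this type fit into the remaining budget.
        remaining = self.max_total_charge_cents - self.charged_cents
        return max(remaining // self.event_prices_cents[event_name], 0)

    def charge(self, event_name: str, count: int = 1) -> int:
        """Charge up to `count` events and return how many were charged."""
        allowed = min(count, self.calculate_max_event_charge_count_within_limit(event_name))
        self.charged_cents += allowed * self.event_prices_cents[event_name]
        return allowed


def process_pages(manager, urls, event_name='page-scraped'):
    processed = []
    for url in urls:
        # Check the remaining budget before doing any expensive work.
        if manager.calculate_max_event_charge_count_within_limit(event_name) <= 0:
            break
        processed.append(url)  # placeholder for the actual scraping
        manager.charge(event_name)
    return processed


if __name__ == '__main__':
    manager = FakeChargingManager(25, {'page-scraped': 10})
    print(process_pages(manager, ['a', 'b', 'c', 'd']), manager.charged_cents)
```

Checking before the work rather than after the charge means the loop stops without scraping a page it could not bill for.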
50+
3451
## Transitioning from a different pricing model
3552

3653
When you plan to start using the pay-per-event pricing model for an Actor that is already monetized with a different pricing model, your source code will need to support both pricing models during the transition period enforced by the Apify platform. Arguably the most frequent case is the transition from the pay-per-result model which utilizes the `ACTOR_MAX_PAID_DATASET_ITEMS` environment variable to prevent returning unpaid dataset items. The following is an example of how to handle such scenarios. The key part is the <ApiLink to="class/ChargingManager#get_pricing_info">`ChargingManager.get_pricing_info()`</ApiLink> method which returns information about the current pricing model.

docs/02_concepts/code/11_charge_limit_check.py

Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
1+
import asyncio
2+
3+
from apify import Actor
4+
5+
6+
async def main() -> None:
7+
    async with Actor:
8+
        urls = [
9+
            'https://example.com/1',
10+
            'https://example.com/2',
11+
            'https://example.com/3',
12+
        ]
13+
14+
        for url in urls:
15+
            # Do some expensive work (e.g. scraping, API calls)
16+
            result = {'url': url, 'data': f'Scraped data from {url}'}
17+
18+
            # highlight-start
19+
            # push_data returns a ChargeResult - check it to know if the budget ran out
20+
            charge_result = await Actor.push_data(result, 'result-item')
21+
22+
            if charge_result.event_charge_limit_reached:
23+
                Actor.log.info('Charge limit reached, stopping the Actor')
24+
                break
25+
            # highlight-end
26+
27+
28+
if __name__ == '__main__':
29+
    asyncio.run(main())
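
The example file above stops once `event_charge_limit_reached` is set. The companion field `chargeable_within_limit` can also be used to plan ahead, for instance to shrink the next batch so that no item is scraped that could not be charged. The sketch below shows that idea with standard-library stand-ins only: `ChargeResultSketch` and `fake_push` are hypothetical and model just the two fields named in the documentation, not the SDK's `ChargeResult` or `Actor.push_data`.

```python
from dataclasses import dataclass


@dataclass
class ChargeResultSketch:
    """Hypothetical stand-in with the two fields named in the docs."""

    event_charge_limit_reached: bool
    chargeable_within_limit: dict


def fake_push(budget: dict, event_name: str, items: list) -> ChargeResultSketch:
    """Hypothetical push: consumes budget and reports what is left."""
    accepted = min(len(items), budget[event_name])
    budget[event_name] -= accepted
    return ChargeResultSketch(
        event_charge_limit_reached=budget[event_name] == 0,
        chargeable_within_limit=dict(budget),
    )


def push_in_batches(items, budget, event_name='result-item', batch_size=3):
    """Return the batch sizes used; later batches shrink to fit the budget."""
    batch_sizes = []
    next_batch = batch_size
    position = 0
    while position < len(items):
        batch = items[position:position + next_batch]
        result = fake_push(budget, event_name, batch)
        batch_sizes.append(len(batch))
        position += len(batch)
        if result.event_charge_limit_reached:
            break  # nothing more can be charged, so stop working
        # Plan ahead: never prepare more items than can still be charged.
        next_batch = min(batch_size, result.chargeable_within_limit[event_name])
    return batch_sizes


if __name__ == '__main__':
    print(push_in_batches(list(range(10)), {'result-item': 7}))
```

With a budget of seven items and batches of three, the last batch shrinks to one item instead of preparing three items of which only one could be charged.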

docs/03_guides/06_scrapy.mdx

Lines changed: 3 additions & 3 deletions
@@ -17,13 +17,13 @@ import SettingsExample from '!!raw-loader!./code/scrapy_project/src/settings.py'
1717

1818
## Integrating Scrapy with the Apify platform
1919

20-
The Apify SDK provides an Apify-Scrapy integration. The main challenge of this is to combine two asynchronous frameworks that use different event loop implementations. Scrapy uses [Twisted](https://twisted.org/) for asynchronous execution, while the Apify SDK is based on [asyncio](https://docs.python.org/3/library/asyncio.html). The key thing is to install the Twisted's `asyncioreactor` to run Twisted's asyncio compatible event loop. This allows both Twisted and asyncio to run on a single event loop, enabling a Scrapy spider to run as an Apify Actor with minimal modifications.
20+
The Apify SDK provides an Apify-Scrapy integration. The main challenge is combining two asynchronous frameworks that use different event loop implementations. Scrapy uses [Twisted](https://twisted.org/) for asynchronous execution, while the Apify SDK is based on [asyncio](https://docs.python.org/3/library/asyncio.html). The key step is to install Twisted's `asyncioreactor` to run Twisted's asyncio-compatible event loop. The `apify.scrapy.run_scrapy_actor` function handles this reactor installation automatically. This allows both Twisted and asyncio to run on a single event loop, enabling a Scrapy spider to run as an Apify Actor with minimal modifications.
2121

2222
<CodeBlock className="language-python" title="__main.py__: The Actor entry point ">
2323
{UnderscoreMainExample}
2424
</CodeBlock>
2525

26-
In this setup, `apify.scrapy.initialize_logging` configures an Apify log formatter and reconfigures loggers to ensure consistent logging across Scrapy, the Apify SDK, and other libraries. The `apify.scrapy.run_scrapy_actor` bridges asyncio coroutines with Twisted's reactor, enabling the Actor's main coroutine, which contains the Scrapy spider, to be executed.
26+
In this setup, `apify.scrapy.initialize_logging` configures an Apify log formatter and reconfigures loggers to ensure consistent logging across Scrapy, the Apify SDK, and other libraries. The `apify.scrapy.run_scrapy_actor` installs Twisted's asyncio-compatible reactor and bridges asyncio coroutines with Twisted's reactor, enabling the Actor's main coroutine, which contains the Scrapy spider, to be executed.
2727

2828
Make sure the `SCRAPY_SETTINGS_MODULE` environment variable is set to the path of the Scrapy settings module. This variable is also used by the `Actor` class to detect that the project is a Scrapy project, triggering additional actions.
2929

@@ -47,7 +47,7 @@ Additional helper functions in the [`apify.scrapy`](https://github.com/apify/api
4747
- `apply_apify_settings` - Applies Apify-specific components to Scrapy settings.
4848
- `to_apify_request` and `to_scrapy_request` - Convert between Apify and Scrapy request objects.
4949
- `initialize_logging` - Configures logging for the Actor environment.
50-
- `run_scrapy_actor` - Bridges asyncio and Twisted event loops.
50+
- `run_scrapy_actor` - Installs Twisted's asyncio reactor and bridges asyncio and Twisted event loops.
5151

5252
## Create a new Apify-Scrapy project
5353

docs/03_guides/code/scrapy_project/src/__main__.py

Lines changed: 0 additions & 6 deletions
@@ -1,11 +1,5 @@
11
from __future__ import annotations
22

3-
from scrapy.utils.reactor import install_reactor
4-
5-
# Install Twisted's asyncio reactor before importing any other Twisted or
6-
# Scrapy components.
7-
install_reactor('twisted.internet.asyncioreactor.AsyncioSelectorReactor')
8-
93
import os
104

115
from apify.scrapy import initialize_logging, run_scrapy_actor
