
Commit 4be0c2f

valekjo, jgagne, and TC-MO authored
feat(docs): Multiple datasets (#2228)
The feature is now supported in both SDKs and the API docs are out, so we can finally publish the respective section in docs :)

Co-authored-by: Justin Gagne <justin.gagne@apify.com>
Co-authored-by: Michał Olender <92638966+TC-MO@users.noreply.github.com>
1 parent 65f4c72 commit 4be0c2f

3 files changed

Lines changed: 145 additions & 0 deletions


sources/platform/actors/development/actor_definition/actor_json.md

Lines changed: 1 addition & 0 deletions
@@ -81,6 +81,7 @@ Actor `name`, `version`, `buildTag`, and `environmentVariables` are currently on
 | `input` | Optional | You can embed your [input schema](./input_schema/index.md) object directly in `actor.json` under the `input` field. You can also provide a path to a custom input schema. If not provided, the input schema at `.actor/INPUT_SCHEMA.json` or `INPUT_SCHEMA.json` is used, in this order of preference. |
 | `changelog` | Optional | The path to the CHANGELOG file displayed in the Information tab of the Actor in Apify Console next to Readme. If not provided, the CHANGELOG at `.actor/CHANGELOG.md` or `CHANGELOG.md` is used, in this order of preference. Your Actor doesn't need to have a CHANGELOG, but it is good practice to keep it updated for published Actors. |
 | `storages.dataset` | Optional | You can define the schema of the items in your dataset under the `storages.dataset` field. This can be either an embedded object or a path to a JSON schema file. [Read more](/platform/actors/development/actor-definition/dataset-schema) about Actor dataset schemas. |
+| `storages.datasets` | Optional | You can define multiple datasets for the Actor under the `storages.datasets` field. This is an object whose values are either embedded schema objects or paths to JSON schema files. [Read more](/platform/actors/development/actor-definition/dataset-schema/multiple-datasets) about multiple dataset schemas. |
 | `defaultMemoryMbytes` | Optional | Specifies the default amount of memory in megabytes to be used when the Actor is started. Can be an integer or a [dynamic memory expression string](./dynamic_actor_memory/index.md). |
 | `minMemoryMbytes` | Optional | Specifies the minimum amount of memory in megabytes required by the Actor to run. Requires an _integer_ value. If both `minMemoryMbytes` and `maxMemoryMbytes` are set, then `minMemoryMbytes` must be less than or equal to `maxMemoryMbytes`. Refer to [Usage and resources](https://docs.apify.com/platform/actors/running/usage-and-resources#memory) for more details about memory allocation. |
 | `maxMemoryMbytes` | Optional | Specifies the maximum amount of memory in megabytes required by the Actor to run. It can be used to control the cost of a run, especially when developing pay-per-result Actors. Requires an _integer_ value. Refer to [Usage and resources](https://docs.apify.com/platform/actors/running/usage-and-resources#memory) for more details about memory allocation. |
Lines changed: 143 additions & 0 deletions
@@ -0,0 +1,143 @@
---
title: Multiple datasets
description: Learn how to use multiple datasets within your Actors to organize and store different types of data separately.
sidebar_position: 2
slug: /actors/development/actor-definition/dataset-schema/multiple-datasets
---

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

Actors that scrape different data types can store each type in its own dataset with separate validation rules. For example, an e-commerce scraper might store products in one dataset and categories in another.

Each dataset:

- Is created when the run starts
- Follows the run's data retention policy
- Can have its own validation schema

## Define multiple datasets

Define datasets in your Actor's `actor.json` file using the `storages.datasets` object:

```json title=".actor/actor.json"
{
    "actorSpecification": 1,
    "name": "my-e-commerce-scraper",
    "title": "E-Commerce Scraper",
    "version": "1.0.0",
    "storages": {
        "datasets": {
            "default": "./products_dataset_schema.json",
            "categories": "./categories_dataset_schema.json"
        }
    }
}
```

Provide the schema for each dataset either as a path to a JSON file or as an inline object. Each schema follows the same structure as a single-dataset schema.

The keys of the `datasets` object are aliases that refer to specific datasets. The example above defines two datasets aliased as `default` and `categories`.

:::info Alias versus named dataset

Aliases and names are different. Named datasets have specific behavior on the Apify platform (the automatic data retention policy doesn't apply to them). Aliased datasets follow the data retention policy of their run. An alias only has meaning within a specific run.

:::

Requirements:

- The `datasets` object must contain the `default` alias
- The `datasets` and `dataset` objects are mutually exclusive (use one or the other)

See the full [Actor schema reference](../actor_json.md#reference).
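
The two requirements listed above can be checked mechanically before a build. The following is a minimal sketch and not part of any Apify tooling: `validate_storages` is a hypothetical helper, and it checks only those two rules, not the dataset schemas themselves.

```python
import json


def validate_storages(storages: dict) -> list[str]:
    """Return a list of problems found in the `storages` object of actor.json."""
    problems = []
    has_single = "dataset" in storages
    has_multiple = "datasets" in storages

    # `dataset` and `datasets` are mutually exclusive.
    if has_single and has_multiple:
        problems.append("use either `dataset` or `datasets`, not both")

    # When `datasets` is used, the `default` alias is required.
    if has_multiple and "default" not in storages["datasets"]:
        problems.append("`datasets` must contain the `default` alias")

    return problems


# Same `storages` object as in the actor.json example above.
actor_json = json.loads("""
{
    "storages": {
        "datasets": {
            "default": "./products_dataset_schema.json",
            "categories": "./categories_dataset_schema.json"
        }
    }
}
""")

print(validate_storages(actor_json["storages"]))  # prints []
```

An empty list means both requirements hold. Removing the `default` key from the example would produce one entry in the returned list.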

## Access datasets in Actor code

Access aliased datasets in one of two ways: using the Apify SDK, or reading the `ACTOR_STORAGES_JSON` environment variable directly.

### Apify SDK

<Tabs groupId="main">
<TabItem value="JavaScript" label="JavaScript">

In the JavaScript/TypeScript SDK `>=3.7.0`, use `openDataset` with the `alias` option:

```js
const categoriesDataset = await Actor.openDataset({ alias: 'categories' });
```

:::note Running outside the Apify platform

When the JavaScript SDK runs outside the Apify platform, aliases fall back to names (using an alias is the same as using a named dataset). A dataset opened with the `alias` option is purged on first access.

:::

</TabItem>
<TabItem value="Python" label="Python">

In the Python SDK `>=3.3.0`, use `open_dataset` with the `alias` parameter:

```py
categories_dataset = await Actor.open_dataset(alias='categories')
```

:::note Running outside the Apify platform

When the Python SDK runs outside the Apify platform, it uses the [Crawlee for Python aliasing mechanism](https://crawlee.dev/python/docs/guides/storages#named-and-unnamed-storages). Aliased storages are created as unnamed storages and purged when the Actor starts.

:::

</TabItem>
</Tabs>


### Environment variable

`ACTOR_STORAGES_JSON` contains the JSON-encoded unique identifiers of all storages associated with the current Actor run. Use this approach when working without the SDK:

```sh
echo "$ACTOR_STORAGES_JSON" | jq '.datasets.categories'
# Outputs the ID of the categories dataset, e.g. "3ZojQDdFTsyE7Moy4"
```
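
The same lookup can be done from code without the SDK. This is a minimal sketch that assumes, based on the `jq` path above, that the `datasets` key maps each alias to a dataset ID. `dataset_id_for_alias` is a hypothetical helper, and the sample value (including the `aaa` ID) is made up for local experimentation.

```python
import json
import os


def dataset_id_for_alias(alias: str, environ=os.environ) -> str:
    """Look up the ID of an aliased dataset in ACTOR_STORAGES_JSON."""
    storages = json.loads(environ["ACTOR_STORAGES_JSON"])
    try:
        # Assumed shape: {"datasets": {"<alias>": "<dataset ID>", ...}}
        return storages["datasets"][alias]
    except KeyError:
        raise KeyError(f"No dataset with alias {alias!r} in ACTOR_STORAGES_JSON") from None


# Hypothetical value; on the platform the variable is set for the run.
sample_env = {
    "ACTOR_STORAGES_JSON": '{"datasets": {"default": "aaa", "categories": "3ZojQDdFTsyE7Moy4"}}'
}
print(dataset_id_for_alias("categories", sample_env))  # prints 3ZojQDdFTsyE7Moy4
```

Passing the environment as a parameter keeps the function easy to exercise locally. Inside a run, calling `dataset_id_for_alias("categories")` reads the real variable.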


## Configure the output schema

### Storage tab

The **Storage** tab in the Actor run view displays all datasets defined by the Actor and used by the run (up to 10).

The **Storage** tab shows the data but doesn't surface it clearly to end users. To present datasets more clearly, define an [output schema](../../actor_definition/output_schema/index.md).

### Output schema

An Actor's output schema can reference each dataset by its alias through template variables:

```json
{
    "actorOutputSchemaVersion": 1,
    "title": "Output schema",
    "properties": {
        "products": {
            "type": "string",
            "title": "Products",
            "template": "{{storages.datasets.default.apiUrl}}/items"
        },
        "categories": {
            "type": "string",
            "title": "Categories",
            "template": "{{storages.datasets.categories.apiUrl}}/items"
        }
    }
}
```

[Read more](../output_schema/index.md#how-templates-work) about how templates work.
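
To make the alias-based variables concrete, here is an illustrative sketch of how a `{{...}}` placeholder could be resolved against a run's storages. It is not the platform's implementation: `render_template`, the context structure, and the `api.example.com` URL are all assumptions made for the example.

```python
import re


def render_template(template: str, context: dict) -> str:
    """Replace each {{dotted.path}} placeholder with the value found in `context`."""

    def resolve(match: re.Match) -> str:
        value = context
        for key in match.group(1).split("."):
            value = value[key]  # raises KeyError for an unknown alias or property
        return str(value)

    return re.sub(r"\{\{\s*([\w.]+)\s*\}\}", resolve, template)


# Hypothetical run context; the URL is a placeholder, not a real dataset.
context = {
    "storages": {
        "datasets": {
            "categories": {"apiUrl": "https://api.example.com/datasets/3ZojQDdFTsyE7Moy4"},
        }
    }
}
print(render_template("{{storages.datasets.categories.apiUrl}}/items", context))
# prints https://api.example.com/datasets/3ZojQDdFTsyE7Moy4/items
```

The alias in the template path (`categories`) is the same key used under `storages.datasets` in `actor.json`, which is what ties the output schema to a specific dataset of the run.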

## Billing for non-default datasets

When an Actor uses multiple datasets, only items pushed to the `default` dataset trigger the built-in `apify-default-dataset-item` event. Items in other datasets are not charged automatically.

To charge for items in other datasets, implement custom billing in your Actor code. Refer to the [billing documentation](../../../publishing/monetize/pay_per_event.mdx) for implementation details.
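
One way to structure such custom billing is to pair every push to a non-default dataset with a charge for a custom event. The sketch below shows only that bookkeeping under stated assumptions: `push_and_charge` and the `category-item` event name are hypothetical, and `push` and `charge` are injected stand-ins for the aliased dataset's push method and the SDK's pay-per-event charging call described in the billing documentation.

```python
import asyncio


async def push_and_charge(items, push, charge, event_name="category-item"):
    """Push items to a non-default dataset, then charge one custom event per item."""
    items = list(items)
    if not items:
        return 0
    await push(items)
    # Charge only after the push succeeded, so failed pushes are not billed.
    await charge(event_name, len(items))
    return len(items)


async def main():
    pushed, charged = [], []

    # Stand-ins that record calls instead of talking to the platform.
    async def fake_push(batch):
        pushed.extend(batch)

    async def fake_charge(name, count):
        charged.append((name, count))

    await push_and_charge([{"name": "Shoes"}, {"name": "Hats"}], fake_push, fake_charge)
    print(charged)  # prints [('category-item', 2)]


asyncio.run(main())
```

Keeping the push and the charge in one helper makes it harder to add a new write path to a non-default dataset that forgets to bill for it.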

sources/platform/actors/development/programming_interface/environment_variables.md

Lines changed: 1 addition & 0 deletions
@@ -44,6 +44,7 @@ Here's a table of key system environment variables:
 | `ACTOR_BUILD_TAGS` | A comma-separated list of tags of the Actor build used in the run. Note that this environment variable is assigned when the Actor starts and doesn't change over time, even if the assigned build tags change. |
 | `ACTOR_TASK_ID` | ID of the Actor task. Empty if the Actor is run outside of any task, e.g. directly using the API. |
 | `ACTOR_EVENTS_WEBSOCKET_URL` | Websocket URL where the Actor may listen for [events](/platform/actors/development/programming-interface/system-events) from the Actor platform. |
+| `ACTOR_STORAGES_JSON` | JSON-encoded unique identifiers of storages associated with the current Actor run. |
 | `ACTOR_DEFAULT_DATASET_ID` | Unique identifier for the default dataset associated with the current Actor run. |
 | `ACTOR_DEFAULT_KEY_VALUE_STORE_ID` | Unique identifier for the default key-value store associated with the current Actor run. |
 | `ACTOR_DEFAULT_REQUEST_QUEUE_ID` | Unique identifier for the default request queue associated with the current Actor run. |
