| 1 | +--- |
| 2 | +title: Multiple datasets |
| 3 | +description: Learn how to use multiple datasets within your Actors to organize and store different types of data separately. |
| 4 | +sidebar_position: 2 |
| 5 | +slug: /actors/development/actor-definition/dataset-schema/multiple-datasets |
| 6 | +--- |
| 7 | + |
| 8 | +import Tabs from '@theme/Tabs'; |
| 9 | +import TabItem from '@theme/TabItem'; |
| 10 | + |
| 11 | +Actors that scrape different data types can store each type in its own dataset with separate validation rules. For example, an e-commerce scraper might store products in one dataset and categories in another. |
| 12 | + |
| 13 | +Each dataset: |
| 14 | + |
| 15 | +- Is created when the run starts |
| 16 | +- Follows the run's data retention policy |
| 17 | +- Can have its own validation schema |
| 18 | + |
| 19 | +## Define multiple datasets |
| 20 | + |
| 21 | +Define datasets in your Actor schema using the `datasets` object inside `storages`:
| 22 | + |
| 23 | +```json title=".actor/actor.json" |
| 24 | +{ |
| 25 | + "actorSpecification": 1, |
| 26 | + "name": "my-e-commerce-scraper", |
| 27 | + "title": "E-Commerce Scraper", |
| 28 | + "version": "1.0.0", |
| 29 | + "storages": { |
| 30 | + "datasets": { |
| 31 | + "default": "./products_dataset_schema.json", |
| 32 | + "categories": "./categories_dataset_schema.json" |
| 33 | + } |
| 34 | + } |
| 35 | +} |
| 36 | +``` |
| 37 | + |
| 38 | +Provide schemas for individual datasets as file references or inline. Schemas follow the same structure as single-dataset schemas. |
| 39 | + |
| 40 | +The keys of the `datasets` object are aliases that refer to specific datasets. The previous example defines two datasets aliased as `default` and `categories`. |
| 41 | + |
| 42 | +:::info Alias versus named dataset |
| 43 | + |
| 44 | +Aliases and names are different. Named datasets behave differently on the Apify platform: the automatic data retention policy doesn't apply to them. Aliased datasets follow the data retention policy of their run, and an alias only has meaning within that run.
| 45 | + |
| 46 | +::: |
| 47 | + |
| 48 | +Requirements: |
| 49 | + |
| 50 | +- The `datasets` object must contain the `default` alias |
| 51 | +- The `datasets` and `dataset` objects are mutually exclusive (use one or the other) |
| 52 | + |
| 53 | +See the full [Actor schema reference](../actor_json.md#reference). |
| 54 | + |
| 55 | +## Access datasets in Actor code |
| 56 | + |
| 57 | +Access aliased datasets in one of two ways: use the Apify SDK, or read the `ACTOR_STORAGES_JSON` environment variable directly.
| 58 | + |
| 59 | +### Apify SDK |
| 60 | + |
| 61 | +<Tabs groupId="main"> |
| 62 | +<TabItem value="JavaScript" label="JavaScript"> |
| 63 | + |
| 64 | +In the JavaScript/TypeScript SDK `>=3.7.0`, use `openDataset` with the `alias` option:
| 65 | + |
| 66 | +```js |
| 67 | +const categoriesDataset = await Actor.openDataset({alias: 'categories'}); |
| 68 | +``` |
| 69 | + |
| 70 | +:::note Running outside the Apify platform |
| 71 | + |
| 72 | +When the JavaScript SDK runs outside the Apify platform, aliases fall back to names (using an alias is the same as using a named dataset). A dataset opened with the `alias` option is purged on first access.
| 73 | + |
| 74 | +::: |
| 75 | + |
| 76 | +</TabItem> |
| 77 | +<TabItem value="Python" label="Python"> |
| 78 | + |
| 79 | +In the Python SDK `>=3.3.0`, use `open_dataset` with the `alias` parameter:
| 80 | + |
| 81 | +```py |
| 82 | +categories_dataset = await Actor.open_dataset(alias='categories') |
| 83 | +``` |
| 84 | + |
| 85 | +:::note Running outside the Apify platform |
| 86 | + |
| 87 | +When the Python SDK runs outside the Apify platform, it uses the [Crawlee for Python aliasing mechanism](https://crawlee.dev/python/docs/guides/storages#named-and-unnamed-storages). Aliased datasets are created as unnamed storages and are purged when the Actor starts.
| 88 | + |
| 89 | +::: |
| 90 | + |
| 91 | +</TabItem> |
| 92 | +</Tabs> |
| 93 | + |
| 94 | + |
| 95 | +### Environment variable |
| 96 | + |
| 97 | +`ACTOR_STORAGES_JSON` contains JSON-encoded unique identifiers of all storages associated with the current Actor run. Use this approach when |
| 98 | +working without the SDK: |
| 99 | + |
| 100 | +```sh |
| 101 | +echo "$ACTOR_STORAGES_JSON" | jq '.datasets.categories'
| 102 | +# Outputs the ID of the categories dataset, e.g. `"3ZojQDdFTsyE7Moy4"`
| 103 | +``` |
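
The same lookup can be sketched in Python using only the standard library. This is a minimal sketch, assuming the variable has the shape implied by the `jq` query above (a `datasets` object mapping each alias to an ID string). The helper name `dataset_id_for_alias` and the `default` ID in the sample are hypothetical and used only for illustration.

```python
import json
import os


def dataset_id_for_alias(alias, environ=os.environ):
    """Return the dataset ID stored under `alias`, or None if it is absent.

    Hypothetical helper: assumes ACTOR_STORAGES_JSON looks like
    {"datasets": {"<alias>": "<dataset id>", ...}}.
    """
    raw = environ.get("ACTOR_STORAGES_JSON")
    if raw is None:
        # The variable is unset, e.g. when running outside the platform.
        return None
    storages = json.loads(raw)
    return storages.get("datasets", {}).get(alias)


# Stand-in environment for a local dry run; on the platform the real
# variable is provided to the run. The "default" ID is a placeholder.
sample_env = {
    "ACTOR_STORAGES_JSON": json.dumps(
        {"datasets": {"default": "placeholderDefaultId", "categories": "3ZojQDdFTsyE7Moy4"}}
    )
}

print(dataset_id_for_alias("categories", sample_env))
```

Returning `None` for a missing variable or alias lets the calling code decide whether to fail or fall back, rather than raising inside the helper.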
| 104 | + |
| 105 | + |
| 106 | +## Configure the output schema |
| 107 | + |
| 108 | +### Storage tab |
| 109 | + |
| 110 | +The **Storage** tab in the Actor run view displays all datasets defined by the Actor and used by the run (up to 10). |
| 111 | + |
| 112 | +The **Storage** tab shows the data but doesn't present it in a form that is easy for end users to navigate. To present datasets more clearly, define an [output schema](../../actor_definition/output_schema/index.md).
| 113 | + |
| 114 | +### Output schema |
| 115 | + |
| 116 | +Actors with output schemas can reference datasets through variables using aliases: |
| 117 | + |
| 118 | +```json |
| 119 | +{ |
| 120 | + "actorOutputSchemaVersion": 1, |
| 121 | + "title": "Output schema", |
| 122 | + "properties": { |
| 123 | + "products": { |
| 124 | + "type": "string", |
| 125 | + "title": "Products", |
| 126 | + "template": "{{storages.datasets.default.apiUrl}}/items" |
| 127 | + }, |
| 128 | + "categories": { |
| 129 | + "type": "string", |
| 130 | + "title": "Categories", |
| 131 | + "template": "{{storages.datasets.categories.apiUrl}}/items" |
| 132 | + } |
| 133 | + } |
| 134 | +} |
| 135 | +``` |
| 136 | + |
| 137 | +[Read more](../output_schema/index.md#how-templates-work) about how templates work. |
| 138 | + |
| 139 | +## Billing for non-default datasets |
| 140 | + |
| 141 | +When an Actor uses multiple datasets, only items pushed to the `default` dataset trigger the built-in `apify-default-dataset-item` event. Items in other datasets are not charged automatically. |
| 142 | + |
| 143 | +To charge for items in other datasets, implement custom billing in your Actor code. Refer to the [billing documentation](../../../publishing/monetize/pay_per_event.mdx) for implementation details. |