---
id: storage-clients
title: Storage clients
description: How to work with storage clients in Crawlee, including the built-in clients and how to create your own.
---

import ApiLink from '@site/src/components/ApiLink'; import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; import RunnableCodeBlock from '@site/src/components/RunnableCodeBlock'; import CodeBlock from '@theme/CodeBlock';

import MemoryStorageClientBasicExample from '!!raw-loader!roa-loader!./code_examples/storage_clients/memory_storage_client_basic_example.py'; import FileSystemStorageClientBasicExample from '!!raw-loader!roa-loader!./code_examples/storage_clients/file_system_storage_client_basic_example.py'; import FileSystemStorageClientConfigurationExample from '!!raw-loader!roa-loader!./code_examples/storage_clients/file_system_storage_client_configuration_example.py'; import CustomStorageClientExample from '!!raw-loader!roa-loader!./code_examples/storage_clients/custom_storage_client_example.py'; import RegisteringStorageClientsExample from '!!raw-loader!roa-loader!./code_examples/storage_clients/registering_storage_clients_example.py'; import SQLStorageClientBasicExample from '!!raw-loader!roa-loader!./code_examples/storage_clients/sql_storage_client_basic_example.py'; import SQLStorageClientConfigurationExample from '!!raw-loader!./code_examples/storage_clients/sql_storage_client_configuration_example.py'; import RedisStorageClientBasicExample from '!!raw-loader!./code_examples/storage_clients/redis_storage_client_basic_example.py'; import RedisStorageClientConfigurationExample from '!!raw-loader!./code_examples/storage_clients/redis_storage_client_configuration_example.py';

Storage clients provide a unified interface for interacting with Dataset, KeyValueStore, and RequestQueue, regardless of the underlying implementation. They handle operations like creating, reading, updating, and deleting storage instances, as well as managing data persistence and cleanup. This abstraction makes it easy to switch between different environments, such as local development and cloud production setups.

## Built-in storage clients

Crawlee provides several built-in storage client implementations:

  • FileSystemStorageClient - Provides persistent file system storage with in-memory caching.
  • MemoryStorageClient - Stores data in memory with no persistence.
  • SqlStorageClient - Provides persistent storage using a SQL database (SQLite or PostgreSQL). Requires installing the extra dependency: crawlee[sql_sqlite] for SQLite or crawlee[sql_postgres] for PostgreSQL.
  • RedisStorageClient - Provides persistent storage using a Redis database v8.0+. Requires installing the extra dependency crawlee[redis].
  • ApifyStorageClient - Manages storage on the Apify platform, implemented in the Apify SDK.
```mermaid
---
config:
    class:
        hideEmptyMembersBox: true
---

classDiagram

%% ========================
%% Abstract classes
%% ========================

class StorageClient {
    <<abstract>>
}

%% ========================
%% Specific classes
%% ========================

class FileSystemStorageClient

class MemoryStorageClient

class SqlStorageClient

class RedisStorageClient

class ApifyStorageClient

%% ========================
%% Inheritance arrows
%% ========================

FileSystemStorageClient --|> StorageClient
MemoryStorageClient --|> StorageClient
SqlStorageClient --|> StorageClient
RedisStorageClient --|> StorageClient
ApifyStorageClient --|> StorageClient
```

## File system storage client

The FileSystemStorageClient provides persistent storage by writing data directly to the file system. It uses intelligent caching and batch processing for better performance while storing data in human-readable JSON format. This is the default storage client used by Crawlee when no other storage client is specified, making it ideal for large datasets and long-running operations where data persistence is required.

:::warning Concurrency limitation
The FileSystemStorageClient is not safe for concurrent access from multiple crawler processes. Use it only when running a single crawler process at a time.
:::

Because the data is stored in human-readable JSON files, it can be easily inspected and shared with other tools.

{FileSystemStorageClientBasicExample}

Configuration options for the FileSystemStorageClient can be set through environment variables or the Configuration class:

  • storage_dir (env: CRAWLEE_STORAGE_DIR, default: './storage') - The root directory for all storage data.
  • purge_on_start (env: CRAWLEE_PURGE_ON_START, default: True) - Whether to purge default storages on start.

Data is stored using the following directory structure:

```text
{CRAWLEE_STORAGE_DIR}/
├── datasets/
│   └── {DATASET_NAME}/
│       ├── __metadata__.json
│       ├── 000000001.json
│       └── 000000002.json
├── key_value_stores/
│   └── {KVS_NAME}/
│       ├── __metadata__.json
│       ├── key1.json
│       ├── key2.txt
│       └── key3.json
└── request_queues/
    └── {RQ_NAME}/
        ├── __metadata__.json
        ├── {REQUEST_ID_1}.json
        └── {REQUEST_ID_2}.json
```

Where:

  • {CRAWLEE_STORAGE_DIR} - The root directory for local storage.
  • {DATASET_NAME}, {KVS_NAME}, {RQ_NAME} - The unique names for each storage instance (defaults to "default").
  • Record files (dataset items and key-value store values) are stored directly, without additional per-record metadata files, which keeps the structure simple.
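Because the records are plain JSON files, they can be inspected and processed without Crawlee at all. The sketch below simulates one stored dataset item and reads it back with the standard library; the paths mirror the layout above and are purely illustrative.

```python
import json
from pathlib import Path

# Illustrative paths matching the layout above; in a real run, Crawlee
# creates these directories and files via the FileSystemStorageClient.
dataset_dir = Path('storage/datasets/default')
dataset_dir.mkdir(parents=True, exist_ok=True)

# Simulate a stored dataset item (one numbered JSON file per item).
(dataset_dir / '000000001.json').write_text(json.dumps({'url': 'https://example.com'}))

# Any external tool can read the items back with plain JSON parsing.
items = [json.loads(p.read_text()) for p in sorted(dataset_dir.glob('0*.json'))]
print(items)
```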

Here is an example of how to configure the FileSystemStorageClient:

{FileSystemStorageClientConfigurationExample}

## Memory storage client

The MemoryStorageClient stores all data in memory using Python data structures. It provides fast access but does not persist data between runs, meaning all data is lost when the program terminates. This storage client is primarily suitable for testing and development, and is usually not a good fit for production use. However, in some cases where speed is prioritized over persistence, it can make sense.

:::warning Persistence limitation
The MemoryStorageClient does not persist data between runs. All data is lost when the program terminates.
:::

{MemoryStorageClientBasicExample}

## SQL storage client

:::warning Experimental feature
The SqlStorageClient is experimental. Its API and behavior may change in future releases.
:::

The SqlStorageClient provides persistent storage using a SQL database (SQLite by default, or PostgreSQL). It supports all Crawlee storage types and enables concurrent access from multiple independent clients or processes.

:::note Dependencies
The SqlStorageClient is not included in the core Crawlee package. To use it, you need to install Crawlee with the appropriate extra dependency:

  • For SQLite support, run: pip install 'crawlee[sql_sqlite]'
  • For PostgreSQL support, run: pip install 'crawlee[sql_postgres]'
:::

By default, SqlStorageClient uses SQLite. To use PostgreSQL instead, provide a PostgreSQL connection string via the connection_string parameter. No other code changes are needed; the same client works with both databases.

{SQLStorageClientBasicExample}

Data is organized in relational tables. Below are the main tables and columns used for each storage type:

```mermaid
---
config:
    class:
        hideEmptyMembersBox: true
---

classDiagram

%% ========================
%% Storage Clients
%% ========================

class SqlDatasetClient {
    <<Dataset>>
}

class SqlKeyValueStoreClient {
    <<Key-value store>>
}

%% ========================
%% Dataset Tables
%% ========================

class datasets {
    <<table>>
    + dataset_id (PK)
    + internal_name
    + name
    + accessed_at
    + created_at
    + modified_at
    + item_count
}

class dataset_records {
    <<table>>
    + item_id (PK)
    + dataset_id (FK)
    + data
}

class dataset_metadata_buffer {
    <<table>>
    + id (PK)
    + accessed_at
    + modified_at
    + dataset_id (FK)
    + delta_item_count
}

%% ========================
%% Key-Value Store Tables
%% ========================

class key_value_stores {
    <<table>>
    + key_value_store_id (PK)
    + internal_name
    + name
    + accessed_at
    + created_at
    + modified_at
}

class key_value_store_records {
    <<table>>
    + key_value_store_id (FK, PK)
    + key (PK)
    + value
    + content_type
    + size
}

class key_value_store_metadata_buffer {
    <<table>>
    + id (PK)
    + accessed_at
    + modified_at
    + key_value_store_id (FK)
}

%% ========================
%% Client to Table arrows
%% ========================

SqlDatasetClient --> datasets
SqlDatasetClient --> dataset_records
SqlDatasetClient --> dataset_metadata_buffer

SqlKeyValueStoreClient --> key_value_stores
SqlKeyValueStoreClient --> key_value_store_records
SqlKeyValueStoreClient --> key_value_store_metadata_buffer
```
```mermaid
---
config:
    class:
        hideEmptyMembersBox: true
---

classDiagram

%% ========================
%% Storage Clients
%% ========================

class SqlRequestQueueClient {
    <<Request queue>>
}

%% ========================
%% Request Queue Tables
%% ========================

class request_queues {
    <<table>>
    + request_queue_id (PK)
    + internal_name
    + name
    + accessed_at
    + created_at
    + modified_at
    + had_multiple_clients
    + handled_request_count
    + pending_request_count
    + total_request_count
}

class request_queue_records {
    <<table>>
    + request_id (PK)
    + request_queue_id (FK, PK)
    + data
    + sequence_number
    + is_handled
    + time_blocked_until
    + client_key
}

class request_queue_state {
    <<table>>
    + request_queue_id (FK, PK)
    + sequence_counter
    + forefront_sequence_counter
}

class request_queue_metadata_buffer {
    <<table>>
    + id (PK)
    + accessed_at
    + modified_at
    + request_queue_id (FK)
    + client_id
    + delta_handled_count
    + delta_pending_count
    + delta_total_count
    + need_recalc
}

%% ========================
%% Client to Table arrows
%% ========================

SqlRequestQueueClient --> request_queues
SqlRequestQueueClient --> request_queue_records
SqlRequestQueueClient --> request_queue_state
SqlRequestQueueClient --> request_queue_metadata_buffer
```

Configuration options for the SqlStorageClient can be set through environment variables or the Configuration class:

  • storage_dir (env: CRAWLEE_STORAGE_DIR, default: './storage') - The root directory where the default SQLite database will be created if no connection string is provided.
  • purge_on_start (env: CRAWLEE_PURGE_ON_START, default: True) - Whether to purge default storages on start.

Configuration options for the SqlStorageClient can be set via constructor arguments:

  • connection_string (default: SQLite in Configuration storage dir) - SQLAlchemy connection string, e.g. sqlite+aiosqlite:///my.db or postgresql+asyncpg://user:pass@host/db.
  • engine - Pre-configured SQLAlchemy AsyncEngine (optional).

For advanced scenarios, you can configure SqlStorageClient with a custom SQLAlchemy engine and additional options via the Configuration class. This is useful, for example, when connecting to an external PostgreSQL database or customizing connection pooling.

{SQLStorageClientConfigurationExample}

## Redis storage client

:::warning Experimental feature
The RedisStorageClient is experimental. Its API and behavior may change in future releases.
:::

The RedisStorageClient provides persistent storage using a Redis database. It supports concurrent access from multiple independent clients or processes and uses native Redis data structures for efficient operations.

:::note Dependencies
The RedisStorageClient is not included in the core Crawlee package. To use it, you need to install Crawlee with the Redis extra dependency:

```sh
pip install 'crawlee[redis]'
```

Additionally, Redis version 8.0 or higher is required.
:::

:::note Redis persistence
Data persistence in Redis depends on your database configuration.
:::

The client requires either a Redis connection string or a pre-configured Redis client instance. Use a pre-configured client when you need custom Redis settings such as connection pooling, timeouts, or SSL/TLS encryption.

{RedisStorageClientBasicExample}

Data is organized using Redis key patterns. Below are the main data structures used for each storage type:

```mermaid
---
config:
    class:
        hideEmptyMembersBox: true
---

classDiagram

%% ========================
%% Storage Client
%% ========================

class RedisDatasetClient {
    <<Dataset>>
}

%% ========================
%% Dataset Keys
%% ========================

class DatasetKeys {
    datasets:[name]:items - JSON Array
    datasets:[name]:metadata - JSON Object
}

class DatasetsIndexes {
    datasets:id_to_name - Hash
    datasets:name_to_id - Hash
}

%% ========================
%% Client to Keys arrows
%% ========================

RedisDatasetClient --> DatasetKeys
RedisDatasetClient --> DatasetsIndexes
```
```mermaid
---
config:
    class:
        hideEmptyMembersBox: true
---

classDiagram

%% ========================
%% Storage Clients
%% ========================

class RedisKeyValueStoreClient {
    <<Key-value store>>
}

%% ========================
%% Key-Value Store Keys
%% ========================

class KeyValueStoreKeys {
    key_value_stores:[name]:items - Hash
    key_value_stores:[name]:metadata_items - Hash
    key_value_stores:[name]:metadata - JSON Object
}

class KeyValueStoresIndexes {
    key_value_stores:id_to_name - Hash
    key_value_stores:name_to_id - Hash
}

%% ========================
%% Client to Keys arrows
%% ========================

RedisKeyValueStoreClient --> KeyValueStoreKeys
RedisKeyValueStoreClient --> KeyValueStoresIndexes
```
```mermaid
---
config:
    class:
        hideEmptyMembersBox: true
---

classDiagram

%% ========================
%% Storage Clients
%% ========================

class RedisRequestQueueClient {
    <<Request queue>>
}

%% ========================
%% Request Queue Keys
%% ========================

class RequestQueueKeys {
    request_queues:[name]:queue - List
    request_queues:[name]:data - Hash
    request_queues:[name]:in_progress - Hash
    request_queues:[name]:added_bloom_filter - Bloom Filter | bloom queue_dedup_strategy
    request_queues:[name]:handled_bloom_filter - Bloom Filter | bloom queue_dedup_strategy
    request_queues:[name]:pending_set - Set | default queue_dedup_strategy
    request_queues:[name]:handled_set - Set | default queue_dedup_strategy
    request_queues:[name]:metadata - JSON Object
}

class RequestQueuesIndexes {
    request_queues:id_to_name - Hash
    request_queues:name_to_id - Hash
}

%% ========================
%% Client to Keys arrows
%% ========================

RedisRequestQueueClient --> RequestQueueKeys
RedisRequestQueueClient --> RequestQueuesIndexes
```

Configuration options for the RedisStorageClient can be set through environment variables or the Configuration class:

  • purge_on_start (env: CRAWLEE_PURGE_ON_START, default: True) - Whether to purge default storages on start.

Configuration options for the RedisStorageClient can be set via constructor arguments:

  • connection_string - Redis connection string, e.g. redis://localhost:6379/0.
  • redis - Pre-configured Redis client instance (optional).

{RedisStorageClientConfigurationExample}

## Creating a custom storage client

A storage client consists of two parts: the storage client factory and individual storage type clients. The StorageClient acts as a factory that creates specific clients (DatasetClient, KeyValueStoreClient, RequestQueueClient) where the actual storage logic is implemented.

Here is an example of a custom storage client that implements the StorageClient interface:

{CustomStorageClientExample}

Custom storage clients can implement any storage logic, such as connecting to a database, using a cloud storage service, or integrating with other systems. They must implement the required methods for creating, reading, updating, and deleting data in the respective storages.
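To make the factory pattern concrete, here is a deliberately simplified, synchronous sketch. All class and method names below are illustrative; the real Crawlee interface is asynchronous and defines many more methods.

```python
class InMemoryDatasetClient:
    """Toy dataset client that keeps items in a plain Python list."""

    def __init__(self) -> None:
        self._items: list[dict] = []

    def push_data(self, item: dict) -> None:
        # Store a single item in the backing list.
        self._items.append(item)

    def get_data(self) -> list[dict]:
        # Return a copy so callers cannot mutate internal state.
        return list(self._items)


class CustomStorageClient:
    """Factory that hands out storage-type clients (dataset only, for brevity)."""

    def create_dataset_client(self) -> InMemoryDatasetClient:
        return InMemoryDatasetClient()


# The factory creates the specific client, which holds the storage logic.
client = CustomStorageClient()
dataset_client = client.create_dataset_client()
dataset_client.push_data({'url': 'https://example.com'})
print(dataset_client.get_data())
```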

## Registering storage clients

Storage clients can be registered in multiple ways:

  • Globally - Using the ServiceLocator or passing directly to the crawler.
  • Per storage - When opening a specific storage instance like Dataset, KeyValueStore, or RequestQueue.

{RegisteringStorageClientsExample}

You can also register different storage clients for each storage instance, allowing you to use different backends for different storages. This is useful when you want to use a fast in-memory storage for RequestQueue while persisting scraping results in Dataset or KeyValueStore.

## Conclusion

Storage clients in Crawlee provide different backends for data storage. Use MemoryStorageClient for testing and fast operations without persistence, FileSystemStorageClient for single-process runs where data must persist locally, or SqlStorageClient and RedisStorageClient when multiple processes need concurrent access to shared storage. You can also create custom storage clients for specialized backends by implementing the StorageClient interface.

If you have questions or need assistance, feel free to reach out on our GitHub or join our Discord community. Happy scraping!