FileSystemStorageClient performance issues #1382

@Mantisus

Description

FileSystemStorageClient shows a significant performance drop compared to version 0.6.12.

Test code

import asyncio

from crawlee.crawlers import ParselCrawler, ParselCrawlingContext
from crawlee import ConcurrencySettings
from crawlee.storage_clients import FileSystemStorageClient


async def main() -> None:
    storage_client = FileSystemStorageClient()
    crawler = ParselCrawler(
        storage_client=storage_client,
        concurrency_settings=ConcurrencySettings(desired_concurrency=20),
    )

    @crawler.router.default_handler
    async def request_handler(context: ParselCrawlingContext) -> None:
        data = {
            'url': context.request.url,
            'title': context.selector.css('title::text').get(),
        }
        await context.push_data(data)
        await context.enqueue_links(strategy='same-domain')

    await crawler.run(['http://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())

Do not set max_requests_per_crawl when reproducing: the slowdown grows as the number of processed links increases, so capping the run masks the problem.

Results:

[crawlee._autoscaling.autoscaled_pool] INFO  Waiting for remaining tasks to finish
[ParselCrawler] INFO  Error analysis: total_errors=1 unique_errors=1
[ParselCrawler] INFO  Final request statistics:
┌───────────────────────────────┬──────────────────┐
│ requests_finished             │ 4512             │
│ requests_failed               │ 0                │
│ retry_histogram               │ [4511, 1]        │
│ request_avg_failed_duration   │ None             │
│ request_avg_finished_duration │ 2min 9.7s        │
│ requests_finished_per_minute  │ 81               │
│ requests_failed_per_minute    │ 0                │
│ request_total_duration        │ 162h 33min 11.4s │
│ requests_total                │ 4512             │
│ crawler_runtime               │ 55min 27.1s      │
└───────────────────────────────┴──────────────────┘
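As a rough illustration of why per-item filesystem persistence can be this much slower, here is a hypothetical stdlib-only micro-benchmark comparing in-memory accumulation with writing one small JSON file per record. This is not crawlee's actual storage implementation, just a sketch of the kind of per-item I/O overhead that could contribute to the gap.

```python
# Hypothetical micro-benchmark (stdlib only): append records in memory
# vs. write one small JSON file per record. Illustrative only; not
# crawlee's actual FileSystemStorageClient implementation.
import json
import tempfile
import time
from pathlib import Path

N = 2000
record = {'url': 'http://crawlee.dev/', 'title': 'Crawlee'}

# In-memory: append serialized records to a list.
start = time.perf_counter()
items = []
for _ in range(N):
    items.append(json.dumps(record))
mem_s = time.perf_counter() - start

# Filesystem: one JSON file per record in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    start = time.perf_counter()
    for i in range(N):
        (root / f'{i:09d}.json').write_text(json.dumps(record))
    fs_s = time.perf_counter() - start
    written = len(list(root.glob('*.json')))

print(f'memory: {mem_s:.4f}s  filesystem: {fs_s:.4f}s  files: {written}')
```

On most machines the per-file path is orders of magnitude slower than the in-memory path, and any additional per-item metadata writes or directory scans would widen the gap further.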

Test code for 0.6.12

import asyncio

from crawlee.crawlers import ParselCrawler, ParselCrawlingContext
from crawlee import ConcurrencySettings
from crawlee.configuration import Configuration
from crawlee.storage_clients import MemoryStorageClient

async def main() -> None:
    storage_client = MemoryStorageClient.from_config(Configuration(write_metadata=True, persist_storage=True))
    crawler = ParselCrawler(
        storage_client=storage_client,
        concurrency_settings=ConcurrencySettings(desired_concurrency=20),
    )

    @crawler.router.default_handler
    async def request_handler(context: ParselCrawlingContext) -> None:
        data = {
            'url': context.request.url,
            'title': context.selector.css('title::text').get(),
        }
        await context.push_data(data)
        await context.enqueue_links(strategy='same-domain')

    await crawler.run(['http://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())

Results:

[crawlee._autoscaling.autoscaled_pool] INFO  Waiting for remaining tasks to finish
[ParselCrawler] INFO  Final request statistics:
┌───────────────────────────────┬──────────────┐
│ requests_finished             │ 4512         │
│ requests_failed               │ 0            │
│ retry_histogram               │ [4512]       │
│ request_avg_failed_duration   │ None         │
│ request_avg_finished_duration │ 3.910247     │
│ requests_finished_per_minute  │ 874          │
│ requests_failed_per_minute    │ 0            │
│ request_total_duration        │ 17643.033593 │
│ requests_total                │ 4512         │
│ crawler_runtime               │ 309.715327   │
└───────────────────────────────┴──────────────┘
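Putting the two statistics tables above side by side (durations converted to seconds) gives a rough magnitude for the regression:

```python
# Numbers taken from the two statistics tables above.
fs_runtime = 55 * 60 + 27.1   # FileSystemStorageClient: crawler_runtime 55min 27.1s
mem_runtime = 309.715327      # MemoryStorageClient (0.6.12): crawler_runtime in seconds
fs_rpm, mem_rpm = 81, 874     # requests_finished_per_minute

print(f'runtime slowdown:  {fs_runtime / mem_runtime:.1f}x')  # → runtime slowdown:  10.7x
print(f'throughput drop:   {mem_rpm / fs_rpm:.1f}x')          # → throughput drop:   10.8x
```

So the same 4512-request crawl takes roughly 10.7x longer end to end with FileSystemStorageClient than with the 0.6.12 MemoryStorageClient configured to persist storage.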

Labels

t-tooling: Issues with this label are in the ownership of the tooling team.