FileSystemStorageClient shows a significant performance drop compared to version 0.6.12.
Test code
import asyncio

from crawlee.crawlers import ParselCrawler, ParselCrawlingContext
from crawlee import ConcurrencySettings
from crawlee.storage_clients import FileSystemStorageClient


async def main() -> None:
    storage_client = FileSystemStorageClient()

    crawler = ParselCrawler(
        storage_client=storage_client,
        concurrency_settings=ConcurrencySettings(desired_concurrency=20),
    )

    @crawler.router.default_handler
    async def request_handler(context: ParselCrawlingContext) -> None:
        data = {
            'url': context.request.url,
            'title': context.selector.css('title::text').get(),
        }
        await context.push_data(data)
        await context.enqueue_links(strategy='same-domain')

    await crawler.run(['http://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())
Do not set max_requests_per_crawl when reproducing this, because the slowdown compounds: per-request performance keeps degrading as the number of processed links grows.
Results:
[crawlee._autoscaling.autoscaled_pool] INFO Waiting for remaining tasks to finish
[ParselCrawler] INFO Error analysis: total_errors=1 unique_errors=1
[ParselCrawler] INFO Final request statistics:
┌───────────────────────────────┬──────────────────┐
│ requests_finished │ 4512 │
│ requests_failed │ 0 │
│ retry_histogram │ [4511, 1] │
│ request_avg_failed_duration │ None │
│ request_avg_finished_duration │ 2min 9.7s │
│ requests_finished_per_minute │ 81 │
│ requests_failed_per_minute │ 0 │
│ request_total_duration │ 162h 33min 11.4s │
│ requests_total │ 4512 │
│ crawler_runtime │ 55min 27.1s │
└───────────────────────────────┴──────────────────┘
Test code for 0.6.12
import asyncio

from crawlee.crawlers import ParselCrawler, ParselCrawlingContext
from crawlee import ConcurrencySettings
from crawlee.configuration import Configuration
from crawlee.storage_clients import MemoryStorageClient


async def main() -> None:
    storage_client = MemoryStorageClient.from_config(
        Configuration(write_metadata=True, persist_storage=True)
    )

    crawler = ParselCrawler(
        storage_client=storage_client,
        concurrency_settings=ConcurrencySettings(desired_concurrency=20),
    )

    @crawler.router.default_handler
    async def request_handler(context: ParselCrawlingContext) -> None:
        data = {
            'url': context.request.url,
            'title': context.selector.css('title::text').get(),
        }
        await context.push_data(data)
        await context.enqueue_links(strategy='same-domain')

    await crawler.run(['http://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())
Results:
[crawlee._autoscaling.autoscaled_pool] INFO Waiting for remaining tasks to finish
[ParselCrawler] INFO Final request statistics:
┌───────────────────────────────┬──────────────┐
│ requests_finished │ 4512 │
│ requests_failed │ 0 │
│ retry_histogram │ [4512] │
│ request_avg_failed_duration │ None │
│ request_avg_finished_duration │ 3.910247 │
│ requests_finished_per_minute │ 874 │
│ requests_failed_per_minute │ 0 │
│ request_total_duration │ 17643.033593 │
│ requests_total │ 4512 │
│ crawler_runtime │ 309.715327 │
└───────────────────────────────┴──────────────┘
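For context on why a filesystem-backed storage client could be this much slower, here is a minimal standalone micro-benchmark (my own sketch, not crawlee's actual implementation): it compares appending records to an in-memory list with writing one small JSON file per record, which is roughly the extra work a filesystem dataset has to do on every push_data call.

```python
import json
import tempfile
import time
from pathlib import Path

N = 2000
record = {'url': 'http://crawlee.dev/', 'title': 'Crawlee'}

# In-memory: append each record to a list.
start = time.perf_counter()
memory_store: list[dict] = []
for _ in range(N):
    memory_store.append(dict(record))
memory_s = time.perf_counter() - start

# Filesystem: write each record as its own JSON file.
with tempfile.TemporaryDirectory() as tmp:
    dataset_dir = Path(tmp)
    start = time.perf_counter()
    for i in range(N):
        (dataset_dir / f'{i:09}.json').write_text(json.dumps(record))
    disk_s = time.perf_counter() - start

print(f'memory: {memory_s:.4f}s, disk: {disk_s:.4f}s')
```

On its own this only shows that per-record file I/O is orders of magnitude slower than in-memory appends; it does not explain why the slowdown grows with the number of processed links, which points at something superlinear (e.g. rescanning or rewriting existing storage state) rather than plain write overhead.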