---
id: run-parallel-crawlers
title: Run parallel crawlers
---

import ApiLink from '@site/src/components/ApiLink';
import RunnableCodeBlock from '@site/src/components/RunnableCodeBlock';

import RunParallelCrawlersExample from '!!raw-loader!roa-loader!./code_examples/run_parallel_crawlers.py';

This example demonstrates how to run two crawlers in parallel, where one crawler processes links discovered by the other.

In some situations, you may need different approaches for scraping data from a website. For example, you might use `PlaywrightCrawler` to navigate JavaScript-heavy pages and a faster, more lightweight `ParselCrawler` to process static pages. One way to solve this is to use `AdaptivePlaywrightCrawler`; see the Adaptive Playwright crawler example to learn more.

The code below demonstrates an alternative approach using two separate crawlers. Links are passed between the crawlers via `RequestQueue` aliases. The `keep_alive` option lets the Playwright crawler run in the background and wait for incoming links instead of stopping when its queue is empty. You can also use different storage clients for each crawler without losing the ability to pass links between queues. Learn more about the available storage clients in this guide.
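To illustrate the coordination idea behind this setup, here is a minimal stdlib `asyncio` sketch, not Crawlee's actual API: a background "browser" worker keeps waiting on a shared queue (mimicking `keep_alive`), while a "static" worker discovers links and forwards the JavaScript-heavy ones. The URLs and the `#js` marker are made-up placeholders for illustration.

```python
import asyncio


async def main() -> list[str]:
    # Plays the role of the shared RequestQueue alias between the two crawlers.
    js_queue: asyncio.Queue = asyncio.Queue()
    handled: list[str] = []

    async def browser_worker() -> None:
        # Runs until an explicit stop signal (None), like a keep_alive crawler
        # that waits in the background for incoming links.
        while True:
            url = await js_queue.get()
            if url is None:
                break
            handled.append(f"browser:{url}")

    async def static_worker(urls: list[str]) -> None:
        for url in urls:
            if url.endswith("#js"):
                # Forward JavaScript-heavy pages to the background worker.
                await js_queue.put(url)
            else:
                handled.append(f"static:{url}")
        # Discovery finished: release the background worker.
        await js_queue.put(None)

    background = asyncio.create_task(browser_worker())
    await static_worker(["https://a.example", "https://b.example#js"])
    await background
    return handled


result = asyncio.run(main())
print(result)  # ['static:https://a.example', 'browser:https://b.example#js']
```

The key design point mirrors the crawler setup: the consumer is started first and never exits on an empty queue; only an explicit signal (here `None`, in Crawlee the end of the producing crawler's run) stops it.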

<RunnableCodeBlock className="language-python" language="python">
    {RunParallelCrawlersExample}
</RunnableCodeBlock>