How to stop crawling a web site when a goal is reached #503

@sebpiq

Description

@sebpiq

I have a list of web sites from which I am trying to scrape a given piece of info. For each site, once I have found that info, I want to stop and move on to the next (with several sites being scraped concurrently).

I have tried the following approach (dropping the request queue once my goal is found):

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext
from crawlee.storages import RequestQueue

request_queue = await RequestQueue.open()
crawler = PlaywrightCrawler(
    request_provider=request_queue,
    headless=True,  # Run the browser without a visible window.
    browser_type='firefox',  # Use the Firefox browser.
)

await crawler.add_requests([root_url])

@crawler.router.default_handler
async def request_handler(context: PlaywrightCrawlingContext) -> None:
    # ...
    if found:
        await request_queue.drop()

But that actually raises an error:

 ValueError: Request queue with id "default" does not exist.

Any idea how I should proceed to get finer control over the request queue? Thanks!

Labels: t-tooling