id: avoid-blocking
title: Avoid getting blocked
description: How to avoid getting blocked when scraping

import ApiLink from '@site/src/components/ApiLink'; import CodeBlock from '@theme/CodeBlock'; import RunnableCodeBlock from '@site/src/components/RunnableCodeBlock';

import PlaywrightDefaultFingerprintGenerator from '!!raw-loader!roa-loader!./code_examples/avoid_blocking/playwright_with_fingerprint_generator.py'; import PlaywrightWithCamoufox from '!!raw-loader!roa-loader!../examples/code_examples/playwright_crawler_with_camoufox.py'; import PlaywrightWithCloakBrowser from '!!raw-loader!roa-loader!./code_examples/avoid_blocking/playwright_with_cloakbrowser.py';

import PlaywrightDefaultFingerprintGeneratorWithArgs from '!!raw-loader!./code_examples/avoid_blocking/default_fingerprint_generator_with_args.py';

A scraper might get blocked for numerous reasons. Let's narrow it down to the two main ones. The first is a bad or blocked IP address. You can learn about this topic in the proxy management guide. The second reason is browser fingerprints (or signatures), which we will explore more in this guide. Check the Apify Academy anti-scraping course to gain a deeper theoretical understanding of blocking and learn a few tips and tricks.

A browser fingerprint is a collection of browser attributes and distinctive features that can reveal whether a browser is driven by a bot or a real user. Moreover, these features are unique in most browsers, which allows a website to track the browser even across different IP addresses. This is the main reason why scrapers should change browser fingerprints when doing browser-based scraping. Doing so should significantly reduce blocking.
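To make the tracking idea concrete, here is a minimal, purely illustrative sketch using only the Python standard library. It is not how Crawlee or any particular anti-bot system works. The attribute names, values, IP addresses, and the `fingerprint_id` helper are all hypothetical. It only shows that a stable set of browser attributes can link visits made from different IPs, and that changing an attribute breaks that link.

```python
import hashlib
import json


def fingerprint_id(attributes: dict[str, object]) -> str:
    """Hash a set of browser attributes into a short, stable identifier."""
    # Sort keys so the same attributes always serialize identically.
    canonical = json.dumps(attributes, sort_keys=True)
    return hashlib.sha256(canonical.encode('utf-8')).hexdigest()[:16]


# Hypothetical attributes a website might read from the browser.
browser_attributes = {
    'user_agent': 'Mozilla/5.0 (X11; Linux x86_64) HeadlessChrome/120.0',
    'screen': [800, 600],
    'timezone': 'UTC',
    'languages': ['en-US'],
}

# Two visits from the same browser through different (hypothetical) proxy IPs.
visit_a = {'ip': '203.0.113.10', 'fingerprint': fingerprint_id(browser_attributes)}
visit_b = {'ip': '198.51.100.7', 'fingerprint': fingerprint_id(browser_attributes)}

# The IP changed, but the fingerprint still links both visits.
same_browser = visit_a['fingerprint'] == visit_b['fingerprint']

# Changing even one attribute produces a different identifier.
rotated_attributes = {**browser_attributes, 'screen': [1920, 1080]}
rotated = fingerprint_id(rotated_attributes)

print(same_browser, rotated != visit_a['fingerprint'])
```

Real fingerprinting relies on many more signals than this sketch, but the takeaway is the same: rotating the IP address alone leaves the identifier unchanged.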

Using browser fingerprints

Changing browser fingerprints can be a tedious job. Luckily, Crawlee provides this feature with minimal configuration: fingerprints are enabled by default in PlaywrightCrawler. To customize them, use the fingerprint_generator argument of PlaywrightCrawler.__init__. You can either pass your own implementation of FingerprintGenerator or use DefaultFingerprintGenerator.

{PlaywrightDefaultFingerprintGenerator}

In certain cases we want to narrow down the fingerprints used, for example to a specific operating system, locale, or browser. Crawlee supports this too: the generation algorithm can be customized to reflect a particular browser version, among other options. For a description of the fingerprint generation options, see HeaderGeneratorOptions, ScreenOptions, and DefaultFingerprintGenerator.__init__. See the example below:

{PlaywrightDefaultFingerprintGeneratorWithArgs}

If you do not want to use fingerprints, pass fingerprint_generator=None to PlaywrightCrawler.__init__.

Using Camoufox

In some cases even PlaywrightCrawler with fingerprints is not enough. You can try using PlaywrightCrawler together with Camoufox. See the example integration below:

{PlaywrightWithCamoufox}

Using CloakBrowser

For sites with aggressive anti-bot protection, CloakBrowser takes a different approach. Instead of overriding fingerprints at the JavaScript level (which anti-bot scripts can detect as tampering), CloakBrowser ships a custom Chromium binary with fingerprints modified directly in the C++ source code. It is also Chromium-based, which can matter when a target site behaves differently with Firefox than with Chrome. Install it separately with pip install cloakbrowser. The plugin calls ensure_binary(), which automatically downloads and caches the Chromium binary on first run.

{PlaywrightWithCloakBrowser}

Related links