feat(web): waste headless chrome bandwidth #1614
Open
Xe wants to merge 1 commit into
Most of the worst of the worst scrapers run Headless Chrome. Headless Chrome is difficult for Anubis to combat because it follows all the rules that browsers do. The worst of the worst scrapers also use residential proxy services. Those residential proxy services charge upwards of $1 per GB of data egressed or ingressed.

The Prompt API makes Chrome download a 4 GiB or 16 GiB machine learning model. When you ask it to start downloading, it will _continue_ downloading even when you leave the Anubis challenge page.

This change makes the local model answer "why is the sky blue?" in an absurd amount of detail, which wastes both bandwidth and scraper CPU (some scraping companies charge via Chrome CPU too).

Signed-off-by: Xe Iaso <me@xeiaso.net>
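For readers unfamiliar with the Prompt API, the idea described above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: the helper names `wasteBandwidth` and `estimateProxyCostUSD` are hypothetical, the interfaces are hand-written approximations of Chrome's `LanguageModel` global, and the exact API shape (and whether a download can start without user activation) may differ between Chrome versions. At a flat $1 per GB, a 4 GiB model works out to about $4.29 of proxy traffic and a 16 GiB model to about $17.18.

```typescript
// Hand-written structural types for the parts of Chrome's Prompt API used here.
// These are assumptions for illustration, not official typings.
type Availability = "unavailable" | "downloadable" | "downloading" | "available";

interface DownloadMonitor {
  addEventListener(
    type: "downloadprogress",
    listener: (e: { loaded: number }) => void,
  ): void;
}

interface LanguageModelSession {
  prompt(input: string): Promise<string>;
}

interface LanguageModelLike {
  availability(): Promise<Availability>;
  create(options?: {
    monitor?: (m: DownloadMonitor) => void;
  }): Promise<LanguageModelSession>;
}

// Hypothetical helper: start the model download (if needed), then ask a
// deliberately verbose question. Resolves to null when the API is absent
// or reports the model as unavailable.
async function wasteBandwidth(
  api: LanguageModelLike | undefined,
  question = "Why is the sky blue? Answer in as much detail as possible.",
): Promise<string | null> {
  if (!api) return null;
  if ((await api.availability()) === "unavailable") return null;

  const session = await api.create({
    monitor(m) {
      // Assumes `loaded` is a fraction between 0 and 1.
      m.addEventListener("downloadprogress", (e) => {
        console.debug(`model download: ${Math.round(e.loaded * 100)}%`);
      });
    },
  });
  return session.prompt(question);
}

// Rough proxy cost of one model download, assuming a flat per-GB rate
// (1 GiB = 1024^3 bytes, 1 GB = 10^9 bytes).
function estimateProxyCostUSD(modelGiB: number, usdPerGB = 1): number {
  const bytes = modelGiB * 1024 ** 3;
  return (bytes / 1e9) * usdPerGB;
}
```

On a challenge page this might be invoked as `void wasteBandwidth((globalThis as any).LanguageModel)`, which does nothing in browsers that lack the API. Passing the API object in as a parameter keeps the sketch testable outside Chrome.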
Checklist:
- `[Unreleased]` section of docs/docs/CHANGELOG.md
- `npm run test:integration` (unsupported on Windows, please use WSL)