Add Startpage backend and update scraper HTTP fingerprint #58
Open
DavidTeju wants to merge 3 commits into
Replace the hardcoded 2012-era Firefox 11 User-Agent with a pool of 8 modern browser UAs (Chrome 131, Firefox 133, Safari 17.5, Edge 131) and add browser-realistic headers (Accept, Accept-Language, Upgrade-Insecure-Requests, Sec-Fetch-*) matched to the selected browser family. UA is selected per-process via pid modulo, requiring no new dependencies. Addresses samtay#16 and samtay#32.
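A minimal sketch of the per-process selection described above. The function names `select_scraper_user_agent` and `scraper_headers` and the constant `SCRAPER_USER_AGENTS` are borrowed from the PR description, but the bodies are guesses. The `Family` enum, the `pick` helper, the three-entry pool, and every UA string and header value are illustrative assumptions, not copied from the PR.

```rust
use std::process;

/// Hypothetical browser family tag, used to keep headers consistent with the UA.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Family {
    Chromium, // Chrome and Edge
    Firefox,
    Safari,
}

/// Illustrative pool. The PR describes 8 entries; these 3 strings are placeholders.
const SCRAPER_USER_AGENTS: &[(&str, Family)] = &[
    (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
        Family::Chromium,
    ),
    (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:133.0) Gecko/20100101 Firefox/133.0",
        Family::Firefox,
    ),
    (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.5 Safari/605.1.15",
        Family::Safari,
    ),
];

/// Pick a pool entry from any seed; the modulo keeps the index in range.
fn pick(seed: u32) -> (&'static str, Family) {
    SCRAPER_USER_AGENTS[seed as usize % SCRAPER_USER_AGENTS.len()]
}

/// One UA per process: the pid does not change during the process lifetime,
/// so every request in a run presents the same browser identity.
fn select_scraper_user_agent() -> (&'static str, Family) {
    pick(process::id())
}

/// Headers matched to the selected family. All values are assumptions.
fn scraper_headers(family: Family) -> Vec<(&'static str, &'static str)> {
    let accept = match family {
        Family::Chromium => {
            "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8"
        }
        Family::Firefox | Family::Safari => {
            "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
        }
    };
    vec![
        ("Accept", accept),
        ("Accept-Language", "en-US,en;q=0.9"),
        ("Upgrade-Insecure-Requests", "1"),
        ("Sec-Fetch-Dest", "document"),
        ("Sec-Fetch-Mode", "navigate"),
        ("Sec-Fetch-Site", "none"),
        ("Sec-Fetch-User", "?1"),
    ]
}
```

Using the pid as the seed avoids a random-number dependency, at the cost of rotation happening only between runs rather than between requests.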
Google and DuckDuckGo now require JavaScript execution for search results, making their scraper backends non-functional. Startpage proxies Google results as static HTML that can be scraped without JS.

- Add Startpage scraper with `a.result-title` CSS selector
- Set Startpage as the default search engine for new installs
- Add test fixture and parser test for Startpage results
- Existing user configs are preserved (no forced migration)
- Update README search engines section to document Startpage as default
- Update README example to use startpage instead of google
- Add Startpage to HTML parsing benchmarks
- Fix copy-paste doc error in Google scraper (said "duckduckgo")
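To illustrate what selecting `a.result-title` amounts to, here is a dependency-free sketch that collects the `href` of each anchor whose class list contains `result-title`. It is a naive string scan standing in for a real CSS selector engine, not the PR's parser: the helper names `extract_result_links` and `attr` are hypothetical, and it assumes double-quoted attributes and no `>` inside attribute values.

```rust
/// Collect the `href` of every `<a ...>` tag whose `class` attribute
/// contains the `result-title` class. Naive scan, for illustration only.
fn extract_result_links(html: &str) -> Vec<String> {
    let mut links = Vec::new();
    let mut rest = html;
    while let Some(start) = rest.find("<a ") {
        let after = &rest[start..];
        // Stop if the opening tag is never closed.
        let Some(end) = after.find('>') else { break };
        let tag = &after[..end];
        if let (Some(class), Some(href)) = (attr(tag, "class"), attr(tag, "href")) {
            if class.split_whitespace().any(|c| c == "result-title") {
                links.push(href.to_string());
            }
        }
        // Continue scanning after this tag.
        rest = &after[end..];
    }
    links
}

/// Read a double-quoted attribute value from the text of an opening tag.
fn attr<'a>(tag: &'a str, name: &str) -> Option<&'a str> {
    let needle = format!(" {name}=\"");
    let start = tag.find(&needle)? + needle.len();
    let len = tag[start..].find('"')?;
    Some(&tag[start..start + len])
}
```

A parser test against a saved fixture would then assert on the returned links. A proper HTML parser is the safer choice for real pages, since markup rarely stays this regular.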
Force-pushed from 8175f60 to 4a87af3.
Summary
- Update the scraper HTTP fingerprint: replace the hardcoded Firefox 11 User-Agent with a pool of modern browser UAs and add browser-realistic headers (`Accept`, `Accept-Language`, `Upgrade-Insecure-Requests`, `Sec-Fetch-*`)

Context
Both Google and DuckDuckGo have escalated their bot detection beyond User-Agent checking:
- Google: `<noscript>` redirect to `/httpservice/retry/enablejs`, with no search results in the HTML
- DuckDuckGo: `anomaly.js` challenge form that requires JavaScript to solve

These changes affect all HTTP clients equally: neither the old Firefox 11 UA nor modern browser UAs receive actual search results. The scraper backends need a fundamentally different approach.
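As a rough illustration of the two failure modes just described, a response body could be checked for the markers named above. This helper is hypothetical and not part of the PR; the marker strings come from the description, and plain substring matching is an assumption that may misfire on pages that merely mention them.

```rust
/// Hypothetical helper: returns true when the body looks like a JavaScript
/// gate (redirect or challenge page) rather than a page of search results.
fn looks_js_gated(html: &str) -> bool {
    // Markers taken from the PR description; matching is a plain substring test.
    const MARKERS: &[&str] = &["/httpservice/retry/enablejs", "anomaly.js"];
    MARKERS.iter().any(|marker| html.contains(marker))
}
```

A check like this only reports the problem. It does not recover results, which is why the PR switches backends instead.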
Startpage (startpage.com) is a privacy-focused search engine that proxies Google results and serves them as plain, scrapeable HTML. It:
- Supports the `(site:X OR site:Y) query` format already used by the Google/DDG scrapers
- Returns direct `href` links to StackOverflow questions (no redirect wrappers)
- Works with `reqwest` with browser-like headers
- Exposes results through the `a.result-title` CSS selector

Changes
Commit 1: Update scraper HTTP fingerprint

| File | Change |
| --- | --- |
| `src/stackexchange/mod.rs` | Add `SCRAPER_USER_AGENTS` pool (8 modern UAs), `select_scraper_user_agent()`, `scraper_headers()` with browser-matched headers, `scraper_client()` builder, 6 unit tests |
| `src/stackexchange/search.rs` | Use `super::scraper_client()` in `search_by_scraper()` instead of bare `Client::new()` with single UA header |

Commit 2: Add Startpage backend

| File | Change |
| --- | --- |
| `src/stackexchange/scraper.rs` | Add `Startpage` struct implementing `Scraper` trait with `a.result-title` selector |
| `src/config.rs` | Add `Startpage` variant to `SearchEngine`, make it `#[default]` |
| `src/cli.rs` | Add `"startpage"` to search engine value parser |
| `src/stackexchange/search.rs` | Add `SearchEngine::Startpage` match arm |
| `test/startpage/exit-vim.html` | Test fixture |

Zero new dependencies. UA rotation uses `std::process::id()` modulo instead of `rand`.

Test plan
- `cargo fmt --all --check` passes
- `cargo test`: all 26 tests pass (8 new: 6 for UA/headers, 2 for Startpage URL + parser)
- `--lucky -e startpage "how to exit vim"`: returns full SO answer
- `--lucky -e startpage "rust reverse string"`: returns full SO answer
- `--lucky -e stackexchange "rust reverse string"`: still works (regression)

Closes #16. Addresses #32.
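For reviewers, the config and CLI changes listed under Changes can be pictured with the sketch below. It assumes a four-variant `SearchEngine` enum and a plain `FromStr` parser. The variant names other than `Startpage`, the `"duckduckgo"` and `"google"` strings, and the error type are guesses; the real crate may wire the value parser differently.

```rust
use std::str::FromStr;

/// Sketch of the engine enum. Only `Startpage` being the default is stated
/// in the PR; the other variant names are assumptions.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
enum SearchEngine {
    DuckDuckGo,
    Google,
    StackExchange,
    #[default]
    Startpage,
}

impl FromStr for SearchEngine {
    type Err = String;

    /// Map a CLI value such as `-e startpage` onto a variant.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "duckduckgo" => Ok(Self::DuckDuckGo),
            "google" => Ok(Self::Google),
            "stackexchange" => Ok(Self::StackExchange),
            "startpage" => Ok(Self::Startpage),
            other => Err(format!("unknown search engine: {other}")),
        }
    }
}
```

Because the default applies only when no engine is configured, a config file that already names an engine keeps its value. That matches the "no forced migration" note in the commit message.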