Add Startpage backend and update scraper HTTP fingerprint #58

Open
DavidTeju wants to merge 3 commits into samtay:main from DavidTeju:fix/update-scraper-fingerprint

DavidTeju commented Mar 15, 2026

Summary

  • Add Startpage as a new search backend — Startpage proxies Google results as static HTML, making it scrapeable without JavaScript. Google and DuckDuckGo now serve JS-only challenge pages that no static HTTP client can bypass.
  • Make Startpage the default for new installs (existing user configs are preserved)
  • Update scraper HTTP fingerprint — replace the hardcoded 2012 Firefox 11 User-Agent with a rotating pool of 8 modern browser UAs and add browser-realistic headers (Accept, Accept-Language, Upgrade-Insecure-Requests, Sec-Fetch-*)

Context

Both Google and DuckDuckGo have escalated their bot detection beyond User-Agent checking:

  • Google now returns a JS shell page with a <noscript> redirect to /httpservice/retry/enablejs, with no search results in the HTML
  • DuckDuckGo returns an anomaly.js challenge form that requires JavaScript to solve

These changes affect all HTTP clients equally: neither the old Firefox 11 UA nor a modern browser UA receives actual search results. The scraper backends need a fundamentally different approach.
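For illustration, the two failure modes above could be recognized by looking for their marker strings in the response body. This is a minimal sketch, not code from this PR: `PageKind` and `classify_page` are hypothetical names, the sample HTML is made up, and the only markers used are the two mentioned above. Real challenge pages may differ or change over time.

```rust
/// Hypothetical classification of a scraped search page (not part of this PR).
#[derive(Debug, PartialEq)]
enum PageKind {
    /// Google's JS shell: contains the <noscript> redirect target.
    GoogleJsShell,
    /// DuckDuckGo's challenge form: references anomaly.js.
    DuckDuckGoChallenge,
    /// Neither marker was found; the page may or may not hold results.
    Unrecognized,
}

/// Classify a response body by the marker strings named in the description.
fn classify_page(html: &str) -> PageKind {
    if html.contains("/httpservice/retry/enablejs") {
        PageKind::GoogleJsShell
    } else if html.contains("anomaly.js") {
        PageKind::DuckDuckGoChallenge
    } else {
        PageKind::Unrecognized
    }
}

fn main() {
    // Made-up sample body, only for demonstration.
    let body = r#"<noscript><a href="/httpservice/retry/enablejs">retry</a></noscript>"#;
    println!("{:?}", classify_page(body));
}
```

A check like this only detects the block; it does not get past it, which is why a different backend is needed.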

Startpage (startpage.com) is a privacy-focused search engine that proxies Google results and serves them as plain, scrapeable HTML. It:

  • Uses the same (site:X OR site:Y) query format already used by the Google/DDG scrapers
  • Returns direct href links to StackOverflow questions (no redirect wrappers)
  • Works reliably from reqwest with browser-like headers
  • Returns 10 results per page with the a.result-title CSS selector
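The `(site:X OR site:Y)` query format mentioned above can be sketched as follows. `build_site_query` is a hypothetical helper, not the function used in the scrapers. The sketch deliberately leaves out URL encoding and the Startpage endpoint, since neither is spelled out in this description.

```rust
/// Build a query of the form "(site:a OR site:b) terms".
/// Hypothetical helper for illustration; the real scrapers may differ.
fn build_site_query(sites: &[&str], terms: &str) -> String {
    // With no site filter, fall back to the bare search terms.
    if sites.is_empty() {
        return terms.to_string();
    }
    let filter = sites
        .iter()
        .map(|site| format!("site:{site}"))
        .collect::<Vec<_>>()
        .join(" OR ");
    format!("({filter}) {terms}")
}

fn main() {
    let query = build_site_query(&["stackoverflow.com", "serverfault.com"], "how to exit vim");
    println!("{query}");
}
```

The resulting string still has to be URL-encoded before it is placed in a request.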

Changes

Commit 1: Update scraper HTTP fingerprint

| File | Change |
| --- | --- |
| src/stackexchange/mod.rs | Add SCRAPER_USER_AGENTS pool (8 modern UAs), select_scraper_user_agent(), scraper_headers() with browser-matched headers, scraper_client() builder, 6 unit tests |
| src/stackexchange/search.rs | Use super::scraper_client() in search_by_scraper() instead of bare Client::new() with single UA header |

Commit 2: Add Startpage backend

| File | Change |
| --- | --- |
| src/stackexchange/scraper.rs | Add Startpage struct implementing Scraper trait with a.result-title selector |
| src/config.rs | Add Startpage variant to SearchEngine, make it #[default] |
| src/cli.rs | Add "startpage" to search engine value parser |
| src/stackexchange/search.rs | Add SearchEngine::Startpage match arm |
| test/startpage/exit-vim.html | Test fixture (real Startpage response) |
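The config and CLI changes can be pictured roughly as below. This is a sketch under assumptions, not the actual diff: only the `Startpage` variant, the `#[default]` attribute, and the `"startpage"` and `"stackexchange"` CLI values appear in this PR text. The other variant names, the remaining value strings, and `parse_engine` are guesses made for illustration.

```rust
/// Sketch of the search engine setting. Variant names other than
/// `Startpage` are assumed, not taken from the PR.
#[derive(Debug, Clone, Copy, Default, PartialEq, Eq)]
enum SearchEngine {
    DuckDuckGo,
    Google,
    StackExchange,
    /// Used when no engine has been configured yet.
    #[default]
    Startpage,
}

/// Hypothetical stand-in for the CLI value parser.
fn parse_engine(value: &str) -> Option<SearchEngine> {
    match value.to_ascii_lowercase().as_str() {
        "duckduckgo" => Some(SearchEngine::DuckDuckGo),
        "google" => Some(SearchEngine::Google),
        "stackexchange" => Some(SearchEngine::StackExchange),
        "startpage" => Some(SearchEngine::Startpage),
        _ => None,
    }
}

fn main() {
    println!("default engine: {:?}", SearchEngine::default());
    println!("parsed: {:?}", parse_engine("startpage"));
}
```

Because `Default` is only consulted when no value has been stored, a default like this would leave an engine already written to a user's config untouched.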

Zero new dependencies. UA rotation picks an entry with std::process::id() modulo the pool size instead of pulling in the rand crate.
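A minimal sketch of pid-based selection, assuming a pool of static strings. The names `SCRAPER_USER_AGENTS` and `select_scraper_user_agent` come from the change list above, but the bodies here are illustrative. The pool entries are placeholders rather than the 8 real UA strings, and `pick_by_pid` is a hypothetical helper split out so the indexing can be tested.

```rust
/// Placeholder pool; the PR uses 8 real browser UA strings instead.
const SCRAPER_USER_AGENTS: [&str; 3] = [
    "placeholder-ua/chrome",
    "placeholder-ua/firefox",
    "placeholder-ua/safari",
];

/// Pick an entry by process id modulo pool length.
/// Panics on an empty pool (division by zero), so the pool must be non-empty.
fn pick_by_pid<'a>(pool: &[&'a str], pid: u32) -> &'a str {
    pool[pid as usize % pool.len()]
}

/// Select one UA for the lifetime of the current process.
fn select_scraper_user_agent() -> &'static str {
    pick_by_pid(&SCRAPER_USER_AGENTS, std::process::id())
}

fn main() {
    println!("{}", select_scraper_user_agent());
}
```

Since the pid is fixed for a process, every request within one run uses the same UA; the choice varies only between runs.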

Test plan

  • cargo fmt --all --check passes
  • cargo test — all 26 tests pass (8 new: 6 for UA/headers, 2 for Startpage URL + parser)
  • Smoke test --lucky -e startpage "how to exit vim" — returns full SO answer
  • Smoke test --lucky -e startpage "rust reverse string" — returns full SO answer
  • Smoke test --lucky -e stackexchange "rust reverse string" — still works (regression)
  • Existing Google/DDG backends unchanged and still available for users who want them

Closes #16. Addresses #32.

Replace the hardcoded 2012-era Firefox 11 User-Agent with a pool of 8
modern browser UAs (Chrome 131, Firefox 133, Safari 17.5, Edge 131) and
add browser-realistic headers (Accept, Accept-Language,
Upgrade-Insecure-Requests, Sec-Fetch-*) matched to the selected browser
family. UA is selected per-process via pid modulo, requiring no new
dependencies.

Addresses samtay#16 and samtay#32.

Google and DuckDuckGo now require JavaScript execution for search
results, making their scraper backends non-functional. Startpage proxies
Google results as static HTML that can be scraped without JS.

- Add Startpage scraper with `a.result-title` CSS selector
- Set Startpage as the default search engine for new installs
- Add test fixture and parser test for Startpage results
- Existing user configs are preserved (no forced migration)
- Update README search engines section to document Startpage as default
- Update README example to use startpage instead of google
- Add Startpage to HTML parsing benchmarks
- Fix copy-paste doc error in Google scraper (said "duckduckgo")
DavidTeju force-pushed the fix/update-scraper-fingerprint branch from 8175f60 to 4a87af3 on March 15, 2026 01:00

Development

Successfully merging this pull request may close these issues.

"✖ DuckDuckGo blocked this request" almost all the time