Make scraper dependency optional or replace with lightweight alternative #396

@digitalcortex

I'm using impit as a library for browser-fingerprinted HTTP requests. Having the scraper crate as a required dependency of impit causes two problems:

First problem: namespace collision. scraper is a common crate name. When impit pulls it in as a transitive dependency, any workspace that also has a crate named scraper gets ambiguous package resolution. For example, I had to specify the exact version of my own crate for this command to run: cargo run -p scraper@0.1.0 --bin scraper --. Otherwise I got this error:

error: There are multiple `scraper` packages in your project, and the specification `scraper` is ambiguous.

Since scraper is a generic name, this is likely to affect other consumers too.

Second problem: scraper is too heavyweight a dependency for what it is used for. The only usage of the scraper crate I found is in response_parsing/mod.rs: prescan_bytestream() parses the first 1024 bytes into a full html5ever DOM tree just to read charset-related <meta> attributes. This pulls in html5ever, selectors, ego-tree, cssparser, etc. for a two-selector lookup.

Additionally, this encoding detection isn't even wired into the response pipeline — Impit::send() returns a raw reqwest::Response. The decode() function is only exposed as an opt-in utility via impit::utils, so consumers who don't call it still pay the compile cost.
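For a sense of scale, here is a minimal std-only sketch of a charset prescan over the first 1024 bytes. It is not impit's code and not a spec-compliant prescan (it matches charset= anywhere in the window, including inside comments or scripts); the function name prescan_charset is hypothetical. It only shows that the lookup itself does not need a DOM tree.

```rust
/// Hypothetical helper: scan at most the first 1024 bytes of a response body
/// for a `charset=` declaration and return the declared label, lowercased.
/// Naive by design: it does not parse HTML, it only looks for the substring.
pub fn prescan_charset(body: &[u8]) -> Option<String> {
    let window = &body[..body.len().min(1024)];
    // Lossy decode plus ASCII lowercasing makes the match case-insensitive.
    let text = String::from_utf8_lossy(window).to_ascii_lowercase();

    let mut rest = text.as_str();
    while let Some(pos) = rest.find("charset") {
        // Advance past this occurrence so the loop always makes progress.
        rest = &rest[pos + "charset".len()..];

        // Accept optional whitespace around '='; otherwise try the next match.
        let Some(after_eq) = rest.trim_start().strip_prefix('=') else {
            continue;
        };

        // Skip optional whitespace and an opening quote.
        let value = after_eq
            .trim_start()
            .trim_start_matches(|c: char| c == '"' || c == '\'');

        // The label runs until the first character that cannot be part of it.
        let end = value
            .find(|c: char| !(c.is_ascii_alphanumeric() || matches!(c, '-' | '_' | '.' | ':')))
            .unwrap_or(value.len());

        if end > 0 {
            return Some(value[..end].to_string());
        }
    }
    None
}
```

Calling prescan_charset(b"<meta charset=\"UTF-8\">") yields Some("utf-8"); a body with no declaration in the first 1024 bytes yields None. A real replacement would still need to map the label to an encoding and handle the edge cases the HTML spec describes.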

Suggested fixes (either would work):

  • Make scraper optional behind a feature flag (e.g. encoding-detection) so consumers who don't use impit::utils::decode don't pull it in.
  • Replace scraper with lol_html, a streaming HTML parser that can extract the charset from <meta> tags without building any node trees. The different name avoids the namespace issue, and it's far lighter for this use case.
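A rough sketch of the first option, assuming the feature is called encoding-detection as suggested above. The Cargo.toml fragment in the comment, the module name gated, and the function detect_charset are all hypothetical placeholders, not impit's actual layout.

```rust
// Hypothetical Cargo.toml fragment (version deliberately left out):
//
//   [dependencies]
//   scraper = { version = "...", optional = true }
//
//   [features]
//   encoding-detection = ["dep:scraper"]

/// Compiled only when the `encoding-detection` feature is enabled, so builds
/// without the feature never compile scraper or its transitive dependencies.
#[cfg(feature = "encoding-detection")]
pub mod gated {
    /// Placeholder standing in for the existing scraper-based detection.
    pub fn detect_charset(bytes: &[u8]) -> Option<String> {
        let _ = bytes;
        None
    }
}

/// Reports whether encoding detection was compiled into this build.
pub fn encoding_detection_enabled() -> bool {
    cfg!(feature = "encoding-detection")
}
```

Consumers who need decoding would then opt in with features = ["encoding-detection"] on their impit dependency; everyone else gets a build without scraper in the dependency graph, which also sidesteps the package name ambiguity.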

Labels

t-tooling: Issues with this label are in the ownership of the tooling team.
