
fix: deduplicate crawled URLs, fix -m flag collision, add elapsed time + userAgent support #415

Open
VaishnavGunjari wants to merge 1 commit into BuilderIO:main from VaishnavGunjari:fix/dedup-urls-flag-collision-useragent

@VaishnavGunjari

What's changed

Bug fixes

  • Duplicate -m CLI flag: Both --match and --maxPagesToCrawl were registered with the same -m shorthand, causing silent option shadowing. Fixed by changing the shorthand for --maxPagesToCrawl to -n; the long flag names are unchanged.
  • Typo in variable name: RESOURCE_EXCLUSTIONS → RESOURCE_EXCLUSIONS (affected all 3 usages).
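
To illustrate the -m collision, here is a minimal sketch of a check that reports short flags registered for more than one long option. The flag strings mirror the ones named above; `optionFlags` and `findShortFlagCollisions` are hypothetical names for illustration and are not part of this PR or of any CLI library.

```typescript
// Hypothetical option table after the fix: --maxPagesToCrawl now uses -n.
const optionFlags: string[] = [
  "-m, --match <string>",
  "-n, --maxPagesToCrawl <number>",
];

// Group long option names by their short flag and keep only the short
// flags that are claimed by more than one long option.
function findShortFlagCollisions(flags: string[]): Map<string, string[]> {
  const byShort = new Map<string, string[]>();
  for (const flag of flags) {
    const short = flag.match(/^-\w\b/)?.[0]; // e.g. "-m"
    const long = flag.match(/--[\w-]+/)?.[0]; // e.g. "--match"
    if (!short || !long) continue;
    byShort.set(short, [...(byShort.get(short) ?? []), long]);
  }
  const collisions = new Map<string, string[]>();
  byShort.forEach((longs, short) => {
    if (longs.length > 1) collisions.set(short, longs);
  });
  return collisions;
}
```

Run against the pre-fix flags (both options on -m), this reports one collision; against the list above it reports none.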

Improvements

  • URL deduplication: Added a seenUrls Set to skip already-visited URLs during a crawl, preventing duplicate entries in the output and wasted requests.
  • Elapsed-time logging: A summary line (✅ Crawl complete: N page(s) crawled in Xs) is printed when the crawl finishes.
  • Custom userAgent config option: Users can now set a userAgent string in config.ts to override the default Playwright User-Agent on every request.
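
The deduplication and summary line can be sketched as follows. This is an illustration under stated assumptions, not the code in this PR: `dedupeUrls` and `formatSummary` are hypothetical helpers, URLs are compared as exact strings with no normalization, and the one-decimal seconds format is a guess at how "Xs" is rendered.

```typescript
// Return only the URLs not seen before, recording each one in seenUrls.
// Assumption: exact string comparison, no URL normalization.
function dedupeUrls(urls: string[], seenUrls: Set<string>): string[] {
  const fresh: string[] = [];
  for (const url of urls) {
    if (seenUrls.has(url)) continue; // already visited: skip the request
    seenUrls.add(url);
    fresh.push(url);
  }
  return fresh;
}

// Build the end-of-crawl summary from a page count and two timestamps (ms).
// Assumption: elapsed time is shown in seconds with one decimal place.
function formatSummary(pageCount: number, startMs: number, endMs: number): string {
  const seconds = ((endMs - startMs) / 1000).toFixed(1);
  return `✅ Crawl complete: ${pageCount} page(s) crawled in ${seconds}s`;
}
```

Sharing a single `seenUrls` set for the whole crawl means a URL discovered on several pages is fetched and written to the output once.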

No breaking changes

All changes are backward-compatible. New config fields are optional.
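
As a sketch of why an optional field keeps existing configs working: the `CrawlConfigSketch` type and `contextOptions` helper below are hypothetical and trimmed for illustration, and how the PR actually passes the value to Playwright is not shown here. The only assumption about Playwright is that a browser context accepts a `userAgent` option.

```typescript
// Hypothetical, trimmed-down config shape; only userAgent is new and optional.
interface CrawlConfigSketch {
  url: string;
  match: string;
  maxPagesToCrawl: number;
  userAgent?: string;
}

// Only emit a userAgent override when one is configured, so configs
// without the field keep the default User-Agent.
function contextOptions(config: CrawlConfigSketch): { userAgent?: string } {
  return config.userAgent ? { userAgent: config.userAgent } : {};
}

// An existing config without the new field still type-checks.
const legacyConfig: CrawlConfigSketch = {
  url: "https://example.com/docs",
  match: "https://example.com/docs/**",
  maxPagesToCrawl: 50,
};

// The same config with a custom User-Agent string (example value).
const customConfig: CrawlConfigSketch = {
  ...legacyConfig,
  userAgent: "my-crawler/1.0",
};
```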
