 [![Python](https://img.shields.io/badge/python-3.10%2B-blue)](https://python.org)
 [![License: MIT](https://img.shields.io/badge/license-MIT-green)](LICENSE)
 [![CI](https://github.com/FAAQJAVED/Leadhunter_Pro/actions/workflows/ci.yml/badge.svg)](https://github.com/FAAQJAVED/Leadhunter_Pro/actions)
-[![Tests](https://img.shields.io/badge/tests-72%20passing-brightgreen)](tests/)
+[![Tests](https://img.shields.io/badge/tests-78%20passing-brightgreen)](tests/)
 [![Platform](https://img.shields.io/badge/platform-Windows%20%7C%20macOS%20%7C%20Linux-lightgrey)](https://github.com/FAAQJAVED/Leadhunter_Pro)
 
 ---
@@ -18,7 +18,6 @@ Found this useful? A ⭐ on GitHub helps other developers find it.
 > Always check a site's `robots.txt` and terms of service before running
 > LeadHunter Pro against it at scale.
 
-
 ## Table of Contents
 
 [Preview](#preview) · [What It Does](#what-it-does) · [Use Cases](#use-cases) · [How It Works](#how-it-works) · [Features](#features) · [Performance](#performance) · [What Data You Get](#what-data-you-get) · [Quick Start](#quick-start) · [Blueprint Reference](#blueprint-reference) · [Run Phases Separately](#or-run-phases-separately) · [Configuration](#configuration) · [Runtime Controls](#runtime-controls) · [Output Format](#output-format) · [Diagnose Your Engines](#diagnose-your-engines) · [Architecture Notes](#architecture-notes) · [Tech Stack](#tech-stack) · [Project Structure](#project-structure) · [Requirements](#requirements) · [Troubleshooting](#troubleshooting) · [B2B Lead Toolkit](#part-of-the-b2b-lead-toolkit) · [License](#license)
@@ -51,14 +50,14 @@ Each engine runs in its own session with a warmup request to avoid HTTP 202 bot
 
 ## Use Cases
 
-| Who uses it | What they do | Example query |
-|---|---|---|
-| **Sales teams** | Generate targeted prospect lists for cold email campaigns | `"accountants london"` → 400+ HOT leads with email |
-| **Marketing agencies** | Deliver multi-source lead lists for any UK industry vertical | `"estate agents birmingham"` → enriched Excel in 2 hours |
-| **Freelance lead gen** | Automate research for clients across any niche and geography | Any query → score-sorted Excel ready for CRM import |
-| **Recruiters** | Identify employers in a sector and geography with direct contact | `"law firms edinburgh"` → HR emails and direct lines |
-| **Market researchers** | Map a category using 4 independent search indexes simultaneously | Any query → deduplicated coverage from all 4 engines |
-| **SDRs** | Build daily outreach lists with pre-scored priority rankings | Multiple queries → HOT leads on top, COLD at bottom |
+| Who uses it | What they do | Example query |
+| ---------------------------- | ---------------------------------------------------------------- | ----------------------------------------------------------- |
+| **Sales teams** | Generate targeted prospect lists for cold email campaigns | `"accountants london"` → 400+ HOT leads with email |
+| **Marketing agencies** | Deliver multi-source lead lists for any UK industry vertical | `"estate agents birmingham"` → enriched Excel in 2 hours |
+| **Freelance lead gen** | Automate research for clients across any niche and geography | Any query → score-sorted Excel ready for CRM import |
+| **Recruiters** | Identify employers in a sector and geography with direct contact | `"law firms edinburgh"` → HR emails and direct lines |
+| **Market researchers** | Map a category using 4 independent search indexes simultaneously | Any query → deduplicated coverage from all 4 engines |
+| **SDRs** | Build daily outreach lists with pre-scored priority rankings | Multiple queries → HOT leads on top, COLD at bottom |
 
 ---
 
@@ -121,26 +120,26 @@ Each engine runs in its own session with a warmup request to avoid HTTP 202 bot
 
 ## Performance
 
-| Mode | Queries | Leads generated | Enrichment | Time |
-|---|---|---|---|---|
-| Single query | 1 | 20–60 leads | All 4 engines | 3–8 min |
-| Small batch | 5–10 queries | 100–300 leads | Full 2-pass | 20–40 min |
-| Overnight run | 50+ queries | 800–2,000 leads | Full 2-pass | 3–8 hours |
+| Mode | Queries | Leads generated | Enrichment | Time |
+| ------------- | ------------- | ---------------- | ------------- | ---------- |
+| Single query | 1 | 20–60 leads | All 4 engines | 3–8 min |
+| Small batch | 5–10 queries | 100–300 leads | Full 2-pass | 20–40 min |
+| Overnight run | 50+ queries | 800–2,000 leads | Full 2-pass | 3–8 hours |
 
 > **Real run:** `"property managers manchester"` — 1 query across all 4 engines, **62 unique leads from Mojeek alone** (pages 1–9), full enrichment pipeline applied. HOT leads sorted to top with 100% keyword match.
 
 ---
 
 ## What Data You Get
 
-| Field | Example |
-|---|---|
-| Company Name | Prime Residential |
-| Website | https://primeresidentialpm.com/ |
-| Email | manchester@primeresidentialpm.com |
-| Phone | 01612413335 |
-| Lead Quality | HOT |
-| Keyword Match % | 100 |
+| Field | Example |
+| --------------- | --------------------------------- |
+| Company Name | Prime Residential |
+| Website | https://primeresidentialpm.com/ |
+| Email | manchester@primeresidentialpm.com |
+| Phone | 01612413335 |
+| Lead Quality | HOT |
+| Keyword Match % | 100 |
 
 See [`assets/sample_output.csv`](assets/sample_output.csv) for 20 rows of real output extracted from a live scrape.
 
@@ -214,25 +213,25 @@ BING_PROXY = 'socks5://user:pass@proxy-host:1080'
 cp config.example.yaml config.yaml
 ```
 
-| Key | Default | Description |
-|---|---|---|
-| `output_format` | `xlsx` | Output format — `xlsx` or `csv` |
-| `http_timeout` | `[4, 6]` | Pass 1 HTTP timeout range `[min, max]` in seconds |
-| `playwright_timeout` | `8000` | Pass 2 Playwright page load timeout in milliseconds |
-| `browser_restart_every` | `150` | Restart Chromium every N sites to prevent memory leaks |
-| `stop_at` | `""` | Wall-clock auto-stop in 24h format — `""` = disabled (e.g. `"23:00"`) |
-| `autosave_interval` | `60` | Background checkpoint save interval in seconds |
-| `enricher_workers` | `5` | Concurrent worker count for Pass 1 HTTP enrichment |
-| `rate_limit.min_seconds` | `0.1` | Minimum delay between HTTP requests |
-| `rate_limit.max_seconds` | `0.5` | Maximum delay between HTTP requests |
-| `GEO_SUSPECT_TLDS` | `[]` | TLDs flagged as geo-suspect — e.g. `['in', 'pk', 'ru']` |
-| `score_boost_keywords` | `[]` | URL keywords that give a +1 score boost to a lead |
-| `skip_email_keywords` | `[noreply, no-reply, …]` | Local-part patterns that discard an email entirely (score 999) |
-| `generic_email_keywords` | `[info, admin, support, …]` | Generics used to assign email quality tier (2 or 3) |
-| `junk_email_domains` | `[mailinator.com, …]` | Domains whose emails are always discarded |
-| `contact_paths` | `[/contact, /about, …]` | Sub-pages visited per site in Pass 1 after the homepage |
-| `locale` | `en-US` | Browser locale passed to Playwright for Pass 2 |
-| `cookie_selectors` | `[…]` | Playwright selectors tried for cookie banner dismissal (10 defaults) |
+| Key | Default | Description |
+| -------------------------- | ------------------------------ | ------------------------------------------------------------------------- |
+| `output_format` | `xlsx` | Output format — `xlsx` or `csv` |
+| `http_timeout` | `[4, 6]` | Pass 1 HTTP timeout range `[min, max]` in seconds |
+| `playwright_timeout` | `8000` | Pass 2 Playwright page load timeout in milliseconds |
+| `browser_restart_every` | `150` | Restart Chromium every N sites to prevent memory leaks |
+| `stop_at` | `""` | Wall-clock auto-stop in 24h format — `""` = disabled (e.g. `"23:00"`) |
+| `autosave_interval` | `60` | Background checkpoint save interval in seconds |
+| `enricher_workers` | `5` | Concurrent worker count for Pass 1 HTTP enrichment |
+| `rate_limit.min_seconds` | `0.1` | Minimum delay between HTTP requests |
+| `rate_limit.max_seconds` | `0.5` | Maximum delay between HTTP requests |
+| `GEO_SUSPECT_TLDS` | `[]` | TLDs flagged as geo-suspect — e.g. `['in', 'pk', 'ru']` |
+| `score_boost_keywords` | `[]` | URL keywords that give a +1 score boost to a lead |
+| `skip_email_keywords` | `[noreply, no-reply, …]` | Local-part patterns that discard an email entirely (score 999) |
+| `generic_email_keywords` | `[info, admin, support, …]` | Generics used to assign email quality tier (2 or 3) |
+| `junk_email_domains` | `[mailinator.com, …]` | Domains whose emails are always discarded |
+| `contact_paths` | `[/contact, /about, …]` | Sub-pages visited per site in Pass 1 after the homepage |
+| `locale` | `en-US` | Browser locale passed to Playwright for Pass 2 |
+| `cookie_selectors` | `[…]` | Playwright selectors tried for cookie banner dismissal (10 defaults) |
 
 ---
 
@@ -318,17 +317,17 @@ Launching a headless browser for every site would take 3–5 s per site versus ~
 
 ## Tech Stack
 
-| Library | Role |
-|---|---|
-| `httpx[http2]` | Phase 1 — async HTTP/2 requests for search engine scraping |
-| `beautifulsoup4` | Phase 1 — HTML parsing for search result extraction |
-| `lxml` | Phase 1 — fast HTML/XML parser (beautifulsoup backend) |
-| `playwright` | Phase 2 — headless Chromium fallback for JS-rendered sites |
-| `requests` | Phase 2 — lightweight HTTP GET for contact enrichment pass |
-| `openpyxl` | Excel output with colour-coded rows and Summary sheet |
-| `pyyaml` | YAML config loading for Phase 2 settings |
-| `tqdm` | Live terminal progress bar with ETA for both phases |
-| `python-dotenv` | Optional — loads BING_PROXY from .env file |
+| Library | Role |
+| ------------------ | ----------------------------------------------------------- |
+| `httpx[http2]` | Phase 1 — async HTTP/2 requests for search engine scraping |
+| `beautifulsoup4` | Phase 1 — HTML parsing for search result extraction |
+| `lxml` | Phase 1 — fast HTML/XML parser (beautifulsoup backend) |
+| `playwright` | Phase 2 — headless Chromium fallback for JS-rendered sites |
+| `requests` | Phase 2 — lightweight HTTP GET for contact enrichment pass |
+| `openpyxl` | Excel output with colour-coded rows and Summary sheet |
+| `pyyaml` | YAML config loading for Phase 2 settings |
+| `tqdm` | Live terminal progress bar with ETA for both phases |
+| `python-dotenv` | Optional — loads BING_PROXY from .env file |
 
 ---
 
@@ -410,13 +409,13 @@ Checkpoint is saved every 50 queries. Re-run with the same `queries.txt` to resu
 
 ## Part of the B2B Lead Toolkit
 
-| Repo | What it does |
-|---|---|
-| **[Leadhunter Pro](https://github.com/FAAQJAVED/Leadhunter_Pro)** ← *you are here* | Multi-engine search scraper with HOT/WARM/COLD lead scoring |
-| **[Email Phone Enrichment Tool](https://github.com/FAAQJAVED/Email-Phone-Number-Enrichment-Tool)** | Scrapes contact emails + phones from company websites |
-| **[Google Maps Business Scraper](https://github.com/FAAQJAVED/Google-Maps-Business-Scraper)** | Extracts and enriches business listings from Google Maps |
-| **[Trustpilot Business Scraper](https://github.com/FAAQJAVED/trustpilot-business-scraper)** | Extracts business listings from Trustpilot search results |
-| **[JSON Directory Harvester](https://github.com/FAAQJAVED/json-directory-harvester)** | Configurable harvester for any JSON directory API with geo-filtering |
+| Repo | What it does |
+| ----------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------- |
+| **[Leadhunter Pro](https://github.com/FAAQJAVED/Leadhunter_Pro)** ← *you are here* | Multi-engine search scraper with HOT/WARM/COLD lead scoring |
+| **[Email Phone Enrichment Tool](https://github.com/FAAQJAVED/Email-Phone-Number-Enrichment-Tool)** | Scrapes contact emails + phones from company websites |
+| **[Google Maps Business Scraper](https://github.com/FAAQJAVED/Google-Maps-Business-Scraper)** | Extracts and enriches business listings from Google Maps |
+| **[Trustpilot Business Scraper](https://github.com/FAAQJAVED/trustpilot-business-scraper)** | Extracts business listings from Trustpilot search results |
+| **[JSON Directory Harvester](https://github.com/FAAQJAVED/json-directory-harvester)** | Configurable harvester for any JSON directory API with geo-filtering |
 
 ---
 