Commit a803b02

corrected test numbers
1 parent 506d83f commit a803b02

1 file changed

Lines changed: 59 additions & 60 deletions

README.md

@@ -5,7 +5,7 @@
 [![Python](https://img.shields.io/badge/python-3.10%2B-blue)](https://python.org)
 [![License: MIT](https://img.shields.io/badge/license-MIT-green)](LICENSE)
 [![CI](https://github.com/FAAQJAVED/Leadhunter_Pro/actions/workflows/ci.yml/badge.svg)](https://github.com/FAAQJAVED/Leadhunter_Pro/actions)
-[![Tests](https://img.shields.io/badge/tests-72%20passing-brightgreen)](tests/)
+[![Tests](https://img.shields.io/badge/tests-78%20passing-brightgreen)](tests/)
 [![Platform](https://img.shields.io/badge/platform-Windows%20%7C%20macOS%20%7C%20Linux-lightgrey)](https://github.com/FAAQJAVED/Leadhunter_Pro)

 ---
@@ -18,7 +18,6 @@ Found this useful? A ⭐ on GitHub helps other developers find it.
 > Always check a site's `robots.txt` and terms of service before running
 > LeadHunter Pro against it at scale.

-
 ## Table of Contents

 [Preview](#preview) · [What It Does](#what-it-does) · [Use Cases](#use-cases) · [How It Works](#how-it-works) · [Features](#features) · [Performance](#performance) · [What Data You Get](#what-data-you-get) · [Quick Start](#quick-start) · [Blueprint Reference](#blueprint-reference) · [Run Phases Separately](#or-run-phases-separately) · [Configuration](#configuration) · [Runtime Controls](#runtime-controls) · [Output Format](#output-format) · [Diagnose Your Engines](#diagnose-your-engines) · [Architecture Notes](#architecture-notes) · [Tech Stack](#tech-stack) · [Project Structure](#project-structure) · [Requirements](#requirements) · [Troubleshooting](#troubleshooting) · [B2B Lead Toolkit](#part-of-the-b2b-lead-toolkit) · [License](#license)
@@ -51,14 +50,14 @@ Each engine runs in its own session with a warmup request to avoid HTTP 202 bot

 ## Use Cases

-| Who uses it | What they do | Example query |
-|---|---|---|
-| **Sales teams** | Generate targeted prospect lists for cold email campaigns | `"accountants london"` → 400+ HOT leads with email |
-| **Marketing agencies** | Deliver multi-source lead lists for any UK industry vertical | `"estate agents birmingham"` → enriched Excel in 2 hours |
-| **Freelance lead gen** | Automate research for clients across any niche and geography | Any query → score-sorted Excel ready for CRM import |
-| **Recruiters** | Identify employers in a sector and geography with direct contact | `"law firms edinburgh"` → HR emails and direct lines |
-| **Market researchers** | Map a category using 4 independent search indexes simultaneously | Any query → deduplicated coverage from all 4 engines |
-| **SDRs** | Build daily outreach lists with pre-scored priority rankings | Multiple queries → HOT leads on top, COLD at bottom |
+| Who uses it | What they do | Example query |
+| ---------------------------- | ---------------------------------------------------------------- | ----------------------------------------------------------- |
+| **Sales teams** | Generate targeted prospect lists for cold email campaigns | `"accountants london"` → 400+ HOT leads with email |
+| **Marketing agencies** | Deliver multi-source lead lists for any UK industry vertical | `"estate agents birmingham"` → enriched Excel in 2 hours |
+| **Freelance lead gen** | Automate research for clients across any niche and geography | Any query → score-sorted Excel ready for CRM import |
+| **Recruiters** | Identify employers in a sector and geography with direct contact | `"law firms edinburgh"` → HR emails and direct lines |
+| **Market researchers** | Map a category using 4 independent search indexes simultaneously | Any query → deduplicated coverage from all 4 engines |
+| **SDRs** | Build daily outreach lists with pre-scored priority rankings | Multiple queries → HOT leads on top, COLD at bottom |

 ---

@@ -121,26 +120,26 @@ Each engine runs in its own session with a warmup request to avoid HTTP 202 bot

 ## Performance

-| Mode | Queries | Leads generated | Enrichment | Time |
-|---|---|---|---|---|
-| Single query | 1 | 20–60 leads | All 4 engines | 3–8 min |
-| Small batch | 5–10 queries | 100–300 leads | Full 2-pass | 20–40 min |
-| Overnight run | 50+ queries | 800–2,000 leads | Full 2-pass | 3–8 hours |
+| Mode | Queries | Leads generated | Enrichment | Time |
+| ------------- | ------------- | ---------------- | ------------- | ---------- |
+| Single query | 1 | 20–60 leads | All 4 engines | 3–8 min |
+| Small batch | 5–10 queries | 100–300 leads | Full 2-pass | 20–40 min |
+| Overnight run | 50+ queries | 800–2,000 leads | Full 2-pass | 3–8 hours |

 > **Real run:** `"property managers manchester"` — 1 query across all 4 engines, **62 unique leads from Mojeek alone** (pages 1–9), full enrichment pipeline applied. HOT leads sorted to top with 100% keyword match.

 ---

 ## What Data You Get

-| Field | Example |
-|---|---|
-| Company Name | Prime Residential |
-| Website | https://primeresidentialpm.com/ |
-| Email | manchester@primeresidentialpm.com |
-| Phone | 01612413335 |
-| Lead Quality | HOT |
-| Keyword Match % | 100 |
+| Field | Example |
+| --------------- | --------------------------------- |
+| Company Name | Prime Residential |
+| Website | https://primeresidentialpm.com/ |
+| Email | manchester@primeresidentialpm.com |
+| Phone | 01612413335 |
+| Lead Quality | HOT |
+| Keyword Match % | 100 |

 See [`assets/sample_output.csv`](assets/sample_output.csv) for 20 rows of real output extracted from a live scrape.
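
To illustrate how the fields listed under "What Data You Get" might be consumed downstream, here is a minimal sketch using Python's standard `csv` module. It assumes the CSV header names match the field labels in that table exactly, which this diff does not confirm. The second sample row and the `hot_leads_with_email` helper are hypothetical and exist only for illustration.

```python
import csv
import io

# Inline sample so the sketch runs without the real file. The first data row
# reuses the example values from the README table; the second row is made up.
SAMPLE_CSV = """Company Name,Website,Email,Phone,Lead Quality,Keyword Match %
Prime Residential,https://primeresidentialpm.com/,manchester@primeresidentialpm.com,01612413335,HOT,100
Example Lettings,https://example.com/,,,COLD,20
"""


def hot_leads_with_email(rows):
    """Return HOT rows that have a non-empty Email, best keyword match first."""
    hot = [r for r in rows if r["Lead Quality"] == "HOT" and r["Email"].strip()]
    return sorted(hot, key=lambda r: int(r["Keyword Match %"]), reverse=True)


# DictReader keeps every value as a string, so the phone number's leading
# zero is preserved.
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
leads = hot_leads_with_email(rows)
for lead in leads:
    print(lead["Company Name"], lead["Email"], lead["Phone"])
```

To try this against a real export, the `io.StringIO(SAMPLE_CSV)` argument could be swapped for `open("assets/sample_output.csv", newline="", encoding="utf-8")`, provided the header names turn out to match.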

@@ -214,25 +213,25 @@ BING_PROXY = 'socks5://user:pass@proxy-host:1080'
 cp config.example.yaml config.yaml
 ```

-| Key | Default | Description |
-|---|---|---|
-| `output_format` | `xlsx` | Output format — `xlsx` or `csv` |
-| `http_timeout` | `[4, 6]` | Pass 1 HTTP timeout range `[min, max]` in seconds |
-| `playwright_timeout` | `8000` | Pass 2 Playwright page load timeout in milliseconds |
-| `browser_restart_every` | `150` | Restart Chromium every N sites to prevent memory leaks |
-| `stop_at` | `""` | Wall-clock auto-stop in 24h format — `""` = disabled (e.g. `"23:00"`) |
-| `autosave_interval` | `60` | Background checkpoint save interval in seconds |
-| `enricher_workers` | `5` | Concurrent worker count for Pass 1 HTTP enrichment |
-| `rate_limit.min_seconds` | `0.1` | Minimum delay between HTTP requests |
-| `rate_limit.max_seconds` | `0.5` | Maximum delay between HTTP requests |
-| `GEO_SUSPECT_TLDS` | `[]` | TLDs flagged as geo-suspect — e.g. `['in', 'pk', 'ru']` |
-| `score_boost_keywords` | `[]` | URL keywords that give a +1 score boost to a lead |
-| `skip_email_keywords` | `[noreply, no-reply, …]` | Local-part patterns that discard an email entirely (score 999) |
-| `generic_email_keywords` | `[info, admin, support, …]` | Generics used to assign email quality tier (2 or 3) |
-| `junk_email_domains` | `[mailinator.com, …]` | Domains whose emails are always discarded |
-| `contact_paths` | `[/contact, /about, …]` | Sub-pages visited per site in Pass 1 after the homepage |
-| `locale` | `en-US` | Browser locale passed to Playwright for Pass 2 |
-| `cookie_selectors` | `[…]` | Playwright selectors tried for cookie banner dismissal (10 defaults) |
+| Key | Default | Description |
+| -------------------------- | ------------------------------ | ------------------------------------------------------------------------- |
+| `output_format` | `xlsx` | Output format —`xlsx` or `csv` |
+| `http_timeout` | `[4, 6]` | Pass 1 HTTP timeout range `[min, max]` in seconds |
+| `playwright_timeout` | `8000` | Pass 2 Playwright page load timeout in milliseconds |
+| `browser_restart_every` | `150` | Restart Chromium every N sites to prevent memory leaks |
+| `stop_at` | `""` | Wall-clock auto-stop in 24h format —`""` = disabled (e.g. `"23:00"`) |
+| `autosave_interval` | `60` | Background checkpoint save interval in seconds |
+| `enricher_workers` | `5` | Concurrent worker count for Pass 1 HTTP enrichment |
+| `rate_limit.min_seconds` | `0.1` | Minimum delay between HTTP requests |
+| `rate_limit.max_seconds` | `0.5` | Maximum delay between HTTP requests |
+| `GEO_SUSPECT_TLDS` | `[]` | TLDs flagged as geo-suspect — e.g.`['in', 'pk', 'ru']` |
+| `score_boost_keywords` | `[]` | URL keywords that give a +1 score boost to a lead |
+| `skip_email_keywords` | `[noreply, no-reply, …]` | Local-part patterns that discard an email entirely (score 999) |
+| `generic_email_keywords` | `[info, admin, support, …]` | Generics used to assign email quality tier (2 or 3) |
+| `junk_email_domains` | `[mailinator.com, …]` | Domains whose emails are always discarded |
+| `contact_paths` | `[/contact, /about, …]` | Sub-pages visited per site in Pass 1 after the homepage |
+| `locale` | `en-US` | Browser locale passed to Playwright for Pass 2 |
+| `cookie_selectors` | `[…]` | Playwright selectors tried for cookie banner dismissal (10 defaults) |
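
As a rough illustration of how a few of the configuration keys in this hunk could be interpreted, here is a minimal Python sketch. It is not taken from the project's source: the dict layout and the function names `pick_delay`, `should_stop` and `classify_email` are hypothetical, and the matching rules (substring match for skip keywords, exact local-part match for generics) are assumptions. Because the table does not say which generic addresses map to tier 2 versus tier 3, the sketch returns descriptive labels instead of numeric tiers. The stop check also ignores runs that cross midnight.

```python
import random
from datetime import datetime, time

# Values taken from the defaults column where one is given; the list entries
# are only the ones the table shows before its ellipsis. The nesting is assumed.
config = {
    "rate_limit": {"min_seconds": 0.1, "max_seconds": 0.5},
    "stop_at": "23:00",
    "skip_email_keywords": ["noreply", "no-reply"],
    "generic_email_keywords": ["info", "admin", "support"],
    "junk_email_domains": ["mailinator.com"],
}


def pick_delay(cfg, rng=random):
    """Random pause in seconds between the configured min and max delay."""
    limits = cfg["rate_limit"]
    return rng.uniform(limits["min_seconds"], limits["max_seconds"])


def should_stop(cfg, now):
    """True once the wall clock reaches stop_at; an empty string disables it."""
    if not cfg["stop_at"]:
        return False
    hour, minute = map(int, cfg["stop_at"].split(":"))
    return now.time() >= time(hour, minute)


def classify_email(cfg, address):
    """Label an address as 'discard', 'generic' or 'named'."""
    local, _, domain = address.lower().partition("@")
    if domain in cfg["junk_email_domains"]:
        return "discard"
    if any(keyword in local for keyword in cfg["skip_email_keywords"]):
        return "discard"
    if local in cfg["generic_email_keywords"]:
        return "generic"
    return "named"


print(round(pick_delay(config), 3))
print(should_stop(config, datetime.now()))
print(classify_email(config, "info@example.com"))
```

Passing `now` in as an argument, instead of reading the clock inside `should_stop`, keeps the check easy to exercise with fixed timestamps.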

 ---

@@ -318,17 +317,17 @@ Launching a headless browser for every site would take 3–5 s per site versus ~

 ## Tech Stack

-| Library | Role |
-|---|---|
-| `httpx[http2]` | Phase 1 — async HTTP/2 requests for search engine scraping |
-| `beautifulsoup4` | Phase 1 — HTML parsing for search result extraction |
-| `lxml` | Phase 1 — fast HTML/XML parser (beautifulsoup backend) |
-| `playwright` | Phase 2 — headless Chromium fallback for JS-rendered sites |
-| `requests` | Phase 2 — lightweight HTTP GET for contact enrichment pass |
-| `openpyxl` | Excel output with colour-coded rows and Summary sheet |
-| `pyyaml` | YAML config loading for Phase 2 settings |
-| `tqdm` | Live terminal progress bar with ETA for both phases |
-| `python-dotenv` | Optional — loads BING_PROXY from .env file |
+| Library | Role |
+| ------------------ | ----------------------------------------------------------- |
+| `httpx[http2]` | Phase 1 — async HTTP/2 requests for search engine scraping |
+| `beautifulsoup4` | Phase 1 — HTML parsing for search result extraction |
+| `lxml` | Phase 1 — fast HTML/XML parser (beautifulsoup backend) |
+| `playwright` | Phase 2 — headless Chromium fallback for JS-rendered sites |
+| `requests` | Phase 2 — lightweight HTTP GET for contact enrichment pass |
+| `openpyxl` | Excel output with colour-coded rows and Summary sheet |
+| `pyyaml` | YAML config loading for Phase 2 settings |
+| `tqdm` | Live terminal progress bar with ETA for both phases |
+| `python-dotenv` | Optional — loads BING_PROXY from .env file |

 ---

@@ -410,13 +409,13 @@ Checkpoint is saved every 50 queries. Re-run with the same `queries.txt` to resu

 ## Part of the B2B Lead Toolkit

-| Repo | What it does |
-|---|---|
-| **[Leadhunter Pro](https://github.com/FAAQJAVED/Leadhunter_Pro)***you are here* | Multi-engine search scraper with HOT/WARM/COLD lead scoring |
-| **[Email Phone Enrichment Tool](https://github.com/FAAQJAVED/Email-Phone-Number-Enrichment-Tool)** | Scrapes contact emails + phones from company websites |
-| **[Google Maps Business Scraper](https://github.com/FAAQJAVED/Google-Maps-Business-Scraper)** | Extracts and enriches business listings from Google Maps |
-| **[Trustpilot Business Scraper](https://github.com/FAAQJAVED/trustpilot-business-scraper)** | Extracts business listings from Trustpilot search results |
-| **[JSON Directory Harvester](https://github.com/FAAQJAVED/json-directory-harvester)** | Configurable harvester for any JSON directory API with geo-filtering |
+| Repo | What it does |
+| ----------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------- |
+| **[Leadhunter Pro](https://github.com/FAAQJAVED/Leadhunter_Pro)***you are here* | Multi-engine search scraper with HOT/WARM/COLD lead scoring |
+| **[Email Phone Enrichment Tool](https://github.com/FAAQJAVED/Email-Phone-Number-Enrichment-Tool)** | Scrapes contact emails + phones from company websites |
+| **[Google Maps Business Scraper](https://github.com/FAAQJAVED/Google-Maps-Business-Scraper)** | Extracts and enriches business listings from Google Maps |
+| **[Trustpilot Business Scraper](https://github.com/FAAQJAVED/trustpilot-business-scraper)** | Extracts business listings from Trustpilot search results |
+| **[JSON Directory Harvester](https://github.com/FAAQJAVED/json-directory-harvester)** | Configurable harvester for any JSON directory API with geo-filtering |

 ---
