Commit 392399a

Revise README for clarity and detail
Updated README to enhance project description and clarify functionality.
1 parent ce8ed8a commit 392399a

1 file changed

Lines changed: 101 additions & 22 deletions

README.md

@@ -1,6 +1,6 @@
11
# LeadHunter Pro
22

3-
**Multi-engine search scraper + contact enricher. Finds business leads, extracts emails & phones, scores lead quality.**
3+
**Production-grade Python lead generation engine — scrapes 4 independent search engines simultaneously, enriches every result with email and phone, and scores each lead HOT / WARM / COLD for prioritised outreach. Type a query, get a ready-to-use Excel lead list.**
44

55
[![Python](https://img.shields.io/badge/python-3.10%2B-blue)](https://python.org)
66
[![License: MIT](https://img.shields.io/badge/license-MIT-green)](LICENSE)
@@ -10,20 +10,13 @@
1010

1111
---
1212

13-
## What It Does
14-
15-
LeadHunter Pro searches four independent search engines simultaneously to find real business websites matching your query. It then visits each website to extract a contact email address and phone number, and scores every lead as HOT, WARM, COLD, or NOISE based on how closely the page content matches what you searched for. The final output is a colour-coded Excel spreadsheet, ready to use.
13+
Found this useful? A ⭐ on GitHub helps other developers find it.
1614

1715
---
1816

19-
## Part of the B2B Lead Toolkit
17+
## Table of Contents
2018

21-
| Repo | What it does |
22-
| ----------------------------------------------------------------------------------------------------- | ----------------------------------------------------------- |
23-
| **[Leadhunter Pro](https://github.com/FAAQJAVED/Leadhunter_Pro)***you are here* | Multi-engine search scraper with HOT/WARM/COLD lead scoring |
24-
| **[Email Phone Enrichment Tool](https://github.com/FAAQJAVED/Email-Phone-Number-Enrichment-Tool)** | Scrapes contact emails + phones from company websites |
25-
| **[Google Maps Business Scraper](https://github.com/FAAQJAVED/Google-Maps-Business-Scraper)** | Extracts and enriches business listings from Google Maps |
26-
| **[Trustpilot Business Scraper](https://github.com/FAAQJAVED/trustpilot-business-scraper)** | Extracts business listings from Trustpilot search results |
19+
[Preview](#preview) · [What It Does](#what-it-does) · [Use Cases](#use-cases) · [How It Works](#how-it-works) · [Features](#features) · [Performance](#performance) · [What Data You Get](#what-data-you-get) · [Quick Start](#quick-start) · [Blueprint Reference](#blueprint-reference) · [Run Phases Separately](#or-run-phases-separately) · [Configuration](#configuration) · [Runtime Controls](#runtime-controls) · [Output Format](#output-format) · [Diagnose Your Engines](#diagnose-your-engines) · [Architecture Notes](#architecture-notes) · [Tech Stack](#tech-stack) · [Project Structure](#project-structure) · [Requirements](#requirements) · [Troubleshooting](#troubleshooting) · [B2B Lead Toolkit](#part-of-the-b2b-lead-toolkit) · [License](#license)
2720

2821
---
2922

@@ -39,6 +32,31 @@ LeadHunter Pro searches four independent search engines simultaneously to find r
3932

4033
---
4134

35+
## What It Does
36+
37+
1. **Reads `queries.txt`** — one search query per line (e.g. `property managers manchester`)
38+
2. **Phase 1 — Scrapes 4 search engines** (Mojeek, DuckDuckGo, Yahoo, Bing) for each query, deduplicates results across engines, and saves a lead CSV.
39+
3. **Phase 2 — Enriches every lead** by visiting each website: Pass 1 (fast HTTP GET) then Playwright fallback for JS-rendered sites.
40+
4. **Scores each lead** as HOT, WARM, COLD, or NOISE by matching page keywords against the original query, so leads can be prioritised for outreach.
41+
5. **Outputs a styled Excel file** — colour-coded by score, sorted by quality, hyperlinked websites, and a Summary sheet with engine statistics.
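
The first enrichment pass in step 3 can be pictured roughly as below. This is a minimal sketch, not the project's actual code: the function name `extract_contacts`, both regular expressions, and the `needs_browser` flag are assumptions made for illustration. It pulls the first email and phone number out of already-fetched HTML and flags pages where nothing was found, which is the case where a Playwright fallback would be worth trying.

```python
import re

# Hypothetical patterns; the real extractor may be stricter or broader.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Loose UK-style number: +44 or a leading 0, then 9-10 digits with optional spaces.
PHONE_RE = re.compile(r"(?:\+44\s?|0)(?:\d\s?){9,10}")


def extract_contacts(html: str) -> dict:
    """Return the first email and phone found in raw HTML, if any."""
    emails = EMAIL_RE.findall(html)
    # Strip internal whitespace so "0161 241 3335" becomes "01612413335".
    phones = [re.sub(r"\s+", "", p) for p in PHONE_RE.findall(html)]
    return {
        "email": emails[0] if emails else None,
        "phone": phones[0] if phones else None,
        # Nothing found in static HTML: the page may be JS-rendered.
        "needs_browser": not emails and not phones,
    }
```

Only pages flagged with `needs_browser` would need the slower headless-browser pass, which keeps the common case on the fast path.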
42+
43+
Each engine runs in its own session with a warmup request to avoid HTTP 202 bot challenges. Results are deduplicated across all four engines using URL normalisation and domain deduplication before enrichment begins. A built-in `diagnose.py` tool checks each engine's health before a run.
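
As an illustration of the deduplication described above, here is a small sketch using only the standard library. The helper names (`normalise_url`, `domain_of`, `dedupe_by_domain`) and the exact normalisation rules (lowercase host, drop `www.`, drop query string and trailing slash) are assumptions; the project's own rules may differ.

```python
from urllib.parse import urlsplit


def domain_of(url: str) -> str:
    """Lowercased host without a leading 'www.'; empty if the URL has no host."""
    host = (urlsplit(url.strip()).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def normalise_url(url: str) -> str:
    """Reduce a URL to host + path, ignoring scheme, query string and trailing slash."""
    path = urlsplit(url.strip()).path.rstrip("/")
    return f"{domain_of(url)}{path}"


def dedupe_by_domain(urls):
    """Keep the first URL seen for each domain, preserving input order."""
    seen = set()
    unique = []
    for url in urls:
        domain = domain_of(url)
        # URLs without a parsable host (e.g. missing scheme) are skipped.
        if domain and domain not in seen:
            seen.add(domain)
            unique.append(url)
    return unique
```

Keeping the first occurrence means the order in which engine results are merged decides which URL represents a domain.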
44+
45+
---
46+
47+
## Use Cases
48+
49+
| Who uses it | What they do | Example query |
50+
|---|---|---|
51+
| **Sales teams** | Generate targeted prospect lists for cold email campaigns | `"accountants london"` → 400+ HOT leads with email |
52+
| **Marketing agencies** | Deliver multi-source lead lists for any UK industry vertical | `"estate agents birmingham"` → enriched Excel in 2 hours |
53+
| **Freelance lead gen** | Automate research for clients across any niche and geography | Any query → score-sorted Excel ready for CRM import |
54+
| **Recruiters** | Identify employers in a sector and geography with direct contact | `"law firms edinburgh"` → HR emails and direct lines |
55+
| **Market researchers** | Map a category using 4 independent search indexes simultaneously | Any query → deduplicated coverage from all 4 engines |
56+
| **SDRs** | Build daily outreach lists with pre-scored priority rankings | Multiple queries → HOT leads on top, COLD at bottom |
57+
58+
---
59+
4260
## How It Works
4361

4462
```
@@ -96,6 +114,33 @@ LeadHunter Pro searches four independent search engines simultaneously to find r
96114

97115
---
98116

117+
## Performance
118+
119+
| Mode | Queries | Leads generated | Enrichment | Time |
120+
|---|---|---|---|---|
121+
| Single query | 1 | 20–60 leads | All 4 engines | 3–8 min |
122+
| Small batch | 5–10 queries | 100–300 leads | Full 2-pass | 20–40 min |
123+
| Overnight run | 50+ queries | 800–2,000 leads | Full 2-pass | 3–8 hours |
124+
125+
> **Real run:** `"property managers manchester"` — 1 query across all 4 engines, **62 unique leads from Mojeek alone** (pages 1–9), full enrichment pipeline applied. HOT leads sorted to top with 100% keyword match.
126+
127+
---
128+
129+
## What Data You Get
130+
131+
| Field | Example |
132+
|---|---|
133+
| Company Name | Prime Residential |
134+
| Website | https://primeresidentialpm.com/ |
135+
| Email | manchester@primeresidentialpm.com |
136+
| Phone | 01612413335 |
137+
| Lead Quality | HOT |
138+
| Keyword Match % | 100 |
139+
140+
See [`assets/sample_output.csv`](assets/sample_output.csv) for 20 rows of real output extracted from a live scrape.
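
One way the `Keyword Match %` and `Lead Quality` fields above could be derived is sketched below. The function names and the 80 / 50 thresholds are assumptions chosen for illustration; the actual scoring model is documented in the project itself, not here.

```python
def keyword_match_pct(query: str, page_text: str) -> int:
    """Percentage of query words that appear anywhere in the page text."""
    keywords = query.lower().split()
    if not keywords:
        return 0
    text = page_text.lower()
    hits = sum(1 for kw in keywords if kw in text)
    return round(100 * hits / len(keywords))


def lead_quality(pct: int) -> str:
    """Map a match percentage to a label. Thresholds are hypothetical."""
    if pct >= 80:
        return "HOT"
    if pct >= 50:
        return "WARM"
    if pct > 0:
        return "COLD"
    return "NOISE"
```

Under these assumed thresholds, a page containing every word of `property managers manchester` scores 100 and is labelled HOT, matching the sample row above.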
141+
142+
---
143+
99144
## Quick Start
100145

101146
```bash
@@ -117,6 +162,12 @@ python main.py
117162

118163
---
119164

165+
## Blueprint Reference
166+
167+
For a complete technical deep-dive — architecture decisions, engine behaviour, rate-limit strategy, scoring model, and extension guide — see [BLUEPRINT.md](BLUEPRINT.md).
168+
169+
---
170+
120171
## Or Run Phases Separately
121172

122173
```bash
@@ -242,6 +293,22 @@ Launching a headless browser for every site would take 3–5 s per site versus ~
242293

243294
---
244295

296+
## Tech Stack
297+
298+
| Library | Role |
299+
|---|---|
300+
| `httpx[http2]` | Phase 1 — async HTTP/2 requests for search engine scraping |
301+
| `beautifulsoup4` | Phase 1 — HTML parsing for search result extraction |
302+
| `lxml` | Phase 1 — fast HTML/XML parser (beautifulsoup backend) |
303+
| `playwright` | Phase 2 — headless Chromium fallback for JS-rendered sites |
304+
| `requests` | Phase 2 — lightweight HTTP GET for contact enrichment pass |
305+
| `openpyxl` | Excel output with colour-coded rows and Summary sheet |
306+
| `pyyaml` | YAML config loading for Phase 2 settings |
307+
| `tqdm` | Live terminal progress bar with ETA for both phases |
308+
| `python-dotenv` | Optional — loads BING_PROXY from .env file |
309+
310+
---
311+
245312
## Project Structure
246313

247314
```
@@ -302,21 +369,33 @@ Leadhunter_Pro/
302369

303370
---
304371

372+
## Troubleshooting
373+
374+
**Bing returning results in wrong language or region:**
375+
Set `BING_PROXY=http://user:pass@host:8080` in your `.env` file. `BING_PROXY` is read automatically at startup.
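
For readers wondering what "read automatically at startup" might look like, here is a minimal sketch. The function name `bing_proxy_from_env` is hypothetical, and treating `python-dotenv` as optional mirrors the Tech Stack entry rather than the project's confirmed code.

```python
import os


def bing_proxy_from_env():
    """Return the BING_PROXY value, or None when it is unset or blank."""
    try:
        # python-dotenv is optional; load_dotenv() copies .env entries into
        # os.environ without overriding variables that are already set.
        from dotenv import load_dotenv
        load_dotenv()
    except ImportError:
        pass
    proxy = os.environ.get("BING_PROXY", "").strip()
    return proxy or None
```

Returning `None` for a blank value lets the caller skip proxy configuration entirely instead of passing an empty string to the HTTP client.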
376+
377+
**DuckDuckGo returning HTTP 202 with no results:**
378+
The DDG warmup request is handled automatically. If the 202 responses persist, increase `DELAY_BETWEEN_ENGINES` in `config.py` or pause for 10–15 minutes before retrying.
379+
380+
**One engine returning zero results consistently:**
381+
Run `python diagnose.py` — it fires a test query at each engine and reports the HTTP status, result count, and error. Use it to identify which engine to temporarily disable in `ENGINES_PRIORITY` in `config.py`.
382+
383+
**Script stops mid-run:**
384+
A checkpoint is saved every 50 queries. Re-run with the same `queries.txt` to resume from where the run stopped.
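
The resume behaviour can be sketched as follows. Everything here is an assumption for illustration except the 50-query interval: the JSON file format, the helper names, and the `process` callback are not taken from the project.

```python
import json
from pathlib import Path

CHECKPOINT_EVERY = 50  # interval stated in the README


def load_done(path: Path) -> set:
    """Queries already completed in a previous run (empty if no checkpoint)."""
    if path.exists():
        return set(json.loads(path.read_text(encoding="utf-8")))
    return set()


def save_done(path: Path, done: set) -> None:
    path.write_text(json.dumps(sorted(done)), encoding="utf-8")


def run_queries(queries, path: Path, process) -> int:
    """Process only queries not in the checkpoint; return how many were run."""
    done = load_done(path)
    pending = [q for q in queries if q not in done]
    for i, query in enumerate(pending, start=1):
        process(query)
        done.add(query)
        if i % CHECKPOINT_EVERY == 0:
            save_done(path, done)
    save_done(path, done)  # final save so a clean finish is also recorded
    return len(pending)
```

With this layout, a crash loses at most the queries processed since the last save, and re-running with the same query list skips everything already recorded.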
385+
305386
---
306387

307-
## Tech Stack
388+
## Part of the B2B Lead Toolkit
308389

309-
| Library | Role |
390+
| Repo | What it does |
310391
|---|---|
311-
| `httpx[http2]` | Phase 1 — async HTTP/2 requests for search engine scraping |
312-
| `beautifulsoup4` | Phase 1 — HTML parsing for search result extraction |
313-
| `lxml` | Phase 1 — fast HTML/XML parser (beautifulsoup backend) |
314-
| `playwright` | Phase 2 — headless Chromium fallback for JS-rendered sites |
315-
| `requests` | Phase 2 — lightweight HTTP GET for contact enrichment pass |
316-
| `openpyxl` | Excel output with colour-coded rows and Summary sheet |
317-
| `pyyaml` | YAML config loading for Phase 2 settings |
318-
| `tqdm` | Live terminal progress bar with ETA for both phases |
319-
| `python-dotenv` | Optional — loads BING_PROXY from .env file |
392+
| **[Leadhunter Pro](https://github.com/FAAQJAVED/Leadhunter_Pro)** *you are here* | Multi-engine search scraper with HOT/WARM/COLD lead scoring |
393+
| **[Email Phone Enrichment Tool](https://github.com/FAAQJAVED/Email-Phone-Number-Enrichment-Tool)** | Scrapes contact emails + phones from company websites |
394+
| **[Google Maps Business Scraper](https://github.com/FAAQJAVED/Google-Maps-Business-Scraper)** | Extracts and enriches business listings from Google Maps |
395+
| **[Trustpilot Business Scraper](https://github.com/FAAQJAVED/trustpilot-business-scraper)** | Extracts business listings from Trustpilot search results |
396+
| **[JSON Directory Harvester](https://github.com/FAAQJAVED/json-directory-harvester)** | Configurable harvester for any JSON directory API with geo-filtering |
397+
398+
---
320399

321400
## License
322401
