Commit 0d7465b

docs: update README with new sections and adjust BLUEPRINT column list
1 parent 090c7ea commit 0d7465b

26 files changed

Lines changed: 257 additions & 188 deletions

BLUEPRINT.md

Lines changed: 1 addition & 1 deletion
@@ -420,7 +420,7 @@ Adds `Email`, `Phone`, `Lead Quality`, and `Keyword Match %` columns. The `lead_
 ### Step 1: Installation

 ```bash
-git clone https://github.com/FAAQJAVED/leadhunter-pro
+git clone https://github.com/<<GITHUB_USERNAME>>/leadhunter-pro
 cd leadhunter-pro
 pip install -r requirements.txt
 python -m playwright install chromium

README.md

Lines changed: 41 additions & 39 deletions
@@ -1,35 +1,26 @@
+## Readme** · **MD
+
+Copy
+
 # LeadHunter Pro

 **Multi-engine search scraper + contact enricher. Finds business leads, extracts emails & phones, scores lead quality.**

-[![Python](https://img.shields.io/badge/python-3.10%2B-blue)](https://python.org/)
-[![License: MIT](https://img.shields.io/badge/license-MIT-green)](https://claude.ai/chat/LICENSE)
-[![CI](https://github.com/FAAQJAVED/Leadhunter_Pro/actions/workflows/ci.yml/badge.svg)](https://github.com/FAAQJAVED/Leadhunter_Pro/actions)
-[![Platform](https://img.shields.io/badge/platform-Windows%20%7C%20macOS%20%7C%20Linux-lightgrey)](https://github.com/FAAQJAVED/Leadhunter_Pro)
+[Show Image](https://python.org)
+[Show Image](LICENSE)
+[Show Image](https://github.com/FAAQJAVED/Leadhunter_Pro/actions)
+[Show Image](https://github.com/FAAQJAVED/Leadhunter_Pro)

 ---

 ## What It Does

 LeadHunter Pro searches four independent search engines simultaneously to find real business websites matching your query. It then visits each website to extract a contact email address and phone number, and scores every lead as HOT, WARM, COLD, or NOISE based on how closely the page content matches what you searched for. The final output is a colour-coded Excel spreadsheet, ready to use.

----
-
-## Preview
-
-> Add your screenshots to the `assets/` folder and they will appear here automatically.
->
-> See [assets/README.md](https://claude.ai/chat/assets/README.md) for the recommended screenshot guide.

-| Phase 1 — Scraping | Phase 2 — Enrichment |
-| -------------------------------------------------------------------------------- | -------------------------------------------------------------------------------- |
-| ![Phase 1 scraping in progress](https://claude.ai/chat/assets/phase1-scraping.png) | ![Phase 2 enrichment running](https://claude.ai/chat/assets/phase2-enrichment.png) |
+---

-| Excel Output | Diagnose Output |
-| --------------------------------------------------------------------------------- | ---------------------------------------------------------------------------- |
-| ![Colour-coded Excel output](https://claude.ai/chat/assets/excel-output-sample.png) | ![Diagnose terminal output](https://claude.ai/chat/assets/diagnose-output.png) |

----

 ## How It Works
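The "What It Does" paragraph in this hunk says leads are scored HOT, WARM, COLD, or NOISE by how closely page content matches the query. The commit does not show the scoring code, so the following is only a minimal sketch of one way such a score could be computed. The function names and the 75/40 cut-offs are hypothetical and are not taken from `core/relevance.py`.

```python
def keyword_match_pct(query: str, page_text: str) -> float:
    """Share of distinct query words found in the page text, from 0 to 100."""
    words = set(query.lower().split())
    if not words:
        return 0.0
    text = page_text.lower()
    hits = sum(1 for word in words if word in text)
    return 100.0 * hits / len(words)


def lead_quality(pct: float) -> str:
    """Map a match percentage to a label (cut-offs are illustrative assumptions)."""
    if pct >= 75:
        return "HOT"
    if pct >= 40:
        return "WARM"
    if pct > 0:
        return "COLD"
    return "NOISE"


pct = keyword_match_pct(
    "letting agents Manchester",
    "We are letting agents based in Leeds.",
)
print(round(pct, 1), lead_quality(pct))
```

Substring matching keeps the sketch short; a real scorer would likely tokenise the page and weight terms differently.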

@@ -90,6 +81,8 @@ LeadHunter Pro searches four independent search engines simultaneously to find r

 ## Quick Start

+bash
+
 ```bash
 git clone https://github.com/FAAQJAVED/Leadhunter_Pro.git
 cd leadhunter-pro
@@ -111,6 +104,8 @@ python main.py

 ## Or Run Phases Separately

+bash
+
 ```bash
 # Phase 1 only — specific engines, specific query
 python main.py --query "letting agents Manchester" --mojeek --ddg
@@ -136,16 +131,20 @@ python enricher.py --input outputs/leads_2026-05-01.csv

 **Bing proxy options:**

+python
+
 ```python
 # Authenticated residential proxy
-BING_PROXY = 'http://user:pass@uk.residential.proxy:8080'
+BING_PROXY ='http://user:pass@uk.residential.proxy:8080'

 # SOCKS5
-BING_PROXY = 'socks5://user:pass@proxy-host:1080'
+BING_PROXY ='socks5://user:pass@proxy-host:1080'
 ```

 ### `config.yaml` — Phase 2 (enricher) settings

+bash
+
 ```bash
 cp config.example.yaml config.yaml
 ```
@@ -210,6 +209,8 @@ Key settings: `http_timeout`, `playwright_timeout`, `stop_at`, `contact_paths`,

 ## Diagnose Your Engines

+bash
+
 ```bash
 python diagnose.py # test Mojeek, DDG, Yahoo (default)
 python diagnose.py --bing # test Bing (run with VPN/proxy active)
@@ -240,20 +241,13 @@ Launching a headless browser for every site would take 3–5 s per site versus ~

 ## Part of the B2B Lead Toolkit

-LeadHunter Pro is the **search and discovery layer** of a three-tool pipeline. Each tool can be used independently, or run in sequence end-to-end.
-
-```
-Google Maps ──► LeadHunter Pro ──► Email Enricher
-(raw listings) (verified websites) (emails + phones)
-```
-
-| Repo | Role in pipeline |
-| ------------------------------------------------------------------------------------------------------------ | ---------------------------------------------------------------------------------------- |
-| **[google-maps-scraper](https://github.com/FAAQJAVED/google-maps-scraper)** | Extracts raw business listings — name, address, phone, website — from Google Maps |
-| **[Leadhunter_Pro](https://github.com/FAAQJAVED/Leadhunter_Pro)***you are here* | Scrapes 4 search engines to find verified company websites, deduplicates and scores them |
-| **[Email-Phone-Number-Enrichment-Tool](https://github.com/FAAQJAVED/Email-Phone-Number-Enrichment-Tool)** | Visits each website to extract contact emails and phone numbers (standalone enricher) |
+This scraper is one component of a broader B2B lead generation pipeline targeting UK property management companies, letting agents, block managers, and HMO landlords.

-> **Note:** LeadHunter Pro includes its own built-in Phase 2 enrichment — so you can run the full pipeline with this tool alone. The standalone enricher is useful if you already have a list of websites from another source (e.g. Google Maps) and just need the contact extraction step.
+| Repo | What it does |
+| ------------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------- |
+| **[Leadhunter_Pro](https://github.com/FAAQJAVED/Leadhunter_Pro)***you are here* | Scrapes 4 search engines to find verified company websites, scores and deduplicates results |
+| **[Email-Phone-Number-Enrichment-Tool](https://github.com/FAAQJAVED/Email-Phone-Number-Enrichment-Tool)** | Scrapes contact emails + phones from company websites |
+| **[google-maps-scraper](https://github.com/FAAQJAVED/google-maps-scraper)** | Extracts business listings (name, address, phone, website) from Google Maps |

 ---

@@ -272,6 +266,8 @@ If engines return no results, return HTTP 202 / 403, or the tool stops early, it

 ### Quick checks first

+bash
+
 ```bash
 python diagnose.py --all # confirms which engines are healthy right now
 ```
@@ -311,16 +307,20 @@ Bing is the most aggressively geo-blocked engine. If you have a residential prox

 **Option A — `.env` file (recommended, keeps credentials out of code):**

+bash
+
 ```bash
 # .env
 BING_PROXY=http://user:pass@your-proxy-host:8080
 ```

 **Option B — `config.py` directly:**

+python
+
 ```python
-BING_PROXY = 'http://user:pass@your-proxy-host:8080' # HTTP proxy
-BING_PROXY = 'socks5://user:pass@your-proxy-host:1080' # SOCKS5 proxy
+BING_PROXY ='http://user:pass@your-proxy-host:8080'# HTTP proxy
+BING_PROXY ='socks5://user:pass@your-proxy-host:1080'# SOCKS5 proxy
 ```

 Leave `BING_PROXY` empty to skip Bing entirely and run only the other three engines — they work well without a proxy on most residential connections.
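The README text in this hunk describes two behaviours: `BING_PROXY` can come from the environment, and an empty value skips Bing. The sketch below shows how a caller might turn that setting into a proxies mapping in the shape the `requests` library accepts. The helper name `bing_proxies` is hypothetical, and whether LeadHunter Pro itself uses `requests` here is an assumption.

```python
import os


def bing_proxies(env=None):
    """Return a requests-style proxies dict, or None when BING_PROXY is unset or empty."""
    env = os.environ if env is None else env
    proxy = (env.get("BING_PROXY") or "").strip()
    if not proxy:
        return None  # caller would skip Bing and run the other engines
    return {"http": proxy, "https": proxy}


print(bing_proxies({"BING_PROXY": "http://user:pass@your-proxy-host:8080"}))
print(bing_proxies({}))
```

Passing the environment as an argument keeps the helper easy to test without touching real credentials.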
@@ -331,10 +331,12 @@ Leave `BING_PROXY` empty to skip Bing entirely and run only the other three engi

 If you are hitting limits frequently, edit `config.py`:

+python
+
 ```python
-DELAY_BETWEEN_REQUESTS = (8, 15) # seconds between individual HTTP requests (default 3–8)
-DELAY_BETWEEN_QUERIES = (30, 60) # seconds between queries (default 20–45)
-DELAY_BETWEEN_ENGINES = (90, 150) # seconds between engines (default 60–120)
+DELAY_BETWEEN_REQUESTS =(8,15)# seconds between individual HTTP requests (default 3–8)
+DELAY_BETWEEN_QUERIES =(30,60)# seconds between queries (default 20–45)
+DELAY_BETWEEN_ENGINES =(90,150)# seconds between engines (default 60–120)
 ```

 The tool will still run — it just paces itself more cautiously.
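Each delay setting above is a `(min, max)` pair in seconds. How the tool consumes these pairs is not shown in this commit, so the sketch below assumes a random pause is drawn uniformly from the range before each request. The helper `pick_delay` is hypothetical.

```python
import random

# Value copied from the README hunk above.
DELAY_BETWEEN_REQUESTS = (8, 15)


def pick_delay(bounds, rng=random):
    """Draw a pause length in seconds from an inclusive (min, max) pair."""
    low, high = bounds
    return rng.uniform(low, high)


delay = pick_delay(DELAY_BETWEEN_REQUESTS)
print(f"would sleep {delay:.1f}s")  # a real run would call time.sleep(delay)
```

Randomising within a range, instead of sleeping a fixed interval, makes request timing less regular, which is the usual reason scrapers express delays as a pair.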
@@ -343,4 +345,4 @@ The tool will still run — it just paces itself more cautiously.

 ## License

-MIT — see [LICENSE](https://claude.ai/chat/LICENSE)
+MIT — see [LICENSE](LICENSE)

assets/README.md

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
+# Screenshots
+
+Place your preview images here and reference them in the main README.md using:
+
+![Phase 1 — scraping in progress](assets/phase1-scraping.png)
+![Phase 2 — enrichment running](assets/phase2-enrichment.png)
+![Excel output](assets/excel-output-sample.png)
+![Diagnose output](assets/diagnose-output.png)
+
+## Recommended screenshots
+
+- `phase1-scraping.png` — tqdm progress bar during a live Phase 1 run in VS Code terminal
+- `phase2-enrichment.png` — enricher running with live HOT/WARM/COLD output visible
+- `excel-output-sample.png` — the colour-coded Excel output open in Excel/LibreOffice
+- `diagnose-output.png` — diagnose.py terminal output showing all engines passing

config.py

Lines changed: 2 additions & 1 deletion
@@ -1,7 +1,8 @@
 # config.py — LeadHunter Pro: All Tuneable Settings

-from os import path as os_path
 import os
+from os import path as os_path
+
 try:
     from dotenv import load_dotenv
     load_dotenv()

core/__init__.py

Lines changed: 20 additions & 20 deletions
@@ -10,35 +10,35 @@
 relevance — query-keyword lead quality scoring
 """

+from core.browser_utils import dismiss_cookie_banner, enrich_one_browser, launch_browser
+from core.controls import (
+    AutoSaver,
+    ControlListener,
+    State,
+    check_cmd_file,
+    check_disk,
+    has_internet,
+    should_stop,
+    wait_for_internet,
+    wait_if_paused,
+)
 from core.email_utils import (
-    extract_emails_raw,
-    extract_emails_full,
+    best_email,
     decode_cloudflare_email,
+    extract_emails_full,
+    extract_emails_raw,
     extract_phones,
     score_email,
-    best_email,
 )
-from core.http_utils import fetch_url, enrich_one_http
-from core.browser_utils import launch_browser, dismiss_cookie_banner, enrich_one_browser
+from core.http_utils import enrich_one_http, fetch_url
+from core.relevance import score_relevance
 from core.storage import (
-    save_checkpoint,
+    get_output_path,
     load_checkpoint,
-    save_output,
     load_existing_output,
-    get_output_path,
-)
-from core.controls import (
-    State,
-    ControlListener,
-    AutoSaver,
-    check_cmd_file,
-    wait_if_paused,
-    should_stop,
-    has_internet,
-    wait_for_internet,
-    check_disk,
+    save_checkpoint,
+    save_output,
 )
-from core.relevance import score_relevance

 __all__ = [
     # email_utils

core/_log.py

Lines changed: 2 additions & 3 deletions
@@ -16,10 +16,9 @@
 from __future__ import annotations

 import time
-from typing import Optional

 _start_time: float = time.time()
-_active_bar: Optional[object] = None
+_active_bar: object | None = None


 def set_start_time(t: float) -> None:
@@ -28,7 +27,7 @@ def set_start_time(t: float) -> None:
     _start_time = t


-def set_active_bar(bar: Optional[object]) -> None:
+def set_active_bar(bar: object | None) -> None:
     """
     Register (or clear) the active tqdm progress bar.

core/browser_utils.py

Lines changed: 4 additions & 5 deletions
@@ -26,14 +26,13 @@
 from __future__ import annotations

 import time
-from typing import List, Tuple

 from core._log import log
 from core.email_utils import (
+    best_email,
     extract_emails_full,
     extract_phones,
     score_email,
-    best_email,
 )
 from core.http_utils import _rate_limit, random_ua

@@ -119,7 +118,7 @@ def dismiss_cookie_banner(page, cfg: dict) -> None: # noqa: ANN001
         pass


-def enrich_one_browser(page, target: dict, cfg: dict) -> Tuple[str, str]:
+def enrich_one_browser(page, target: dict, cfg: dict) -> tuple[str, str]:
     """
     Pass 2: find a contact email **and** phone using a Playwright-rendered page.
@@ -147,8 +146,8 @@ def enrich_one_browser(page, target: dict, cfg: dict) -> Tuple[str, str]:
     base = target["website"].rstrip("/")
     contact_paths = cfg.get("contact_paths", ["/contact", "/about"])
     pw_timeout = cfg.get("playwright_timeout", 8000)
-    emails: List[str] = []
-    phones: List[str] = []
+    emails: list[str] = []
+    phones: list[str] = []

     urls_to_visit = [base] + [base + p for p in contact_paths[:2]]
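The context lines of this hunk show how `enrich_one_browser` chooses pages: the site root plus at most the first two configured contact paths. The standalone sketch below isolates that step so the behaviour is easy to check. The wrapper function `urls_to_visit` is hypothetical; only the expressions inside it come from the diff.

```python
def urls_to_visit(website: str, cfg: dict) -> list:
    """Build the visit list: the base URL, then up to two contact paths."""
    base = website.rstrip("/")  # avoid a double slash when joining paths
    contact_paths = cfg.get("contact_paths", ["/contact", "/about"])
    return [base] + [base + p for p in contact_paths[:2]]


print(urls_to_visit("https://example.com/", {}))
print(urls_to_visit("https://example.com", {"contact_paths": ["/team", "/contact-us", "/legal"]}))
```

Capping the list at two extra paths bounds the number of browser page loads per site to three, whatever the length of `contact_paths` in the config.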

core/controls.py

Lines changed: 18 additions & 10 deletions
@@ -17,8 +17,6 @@

 from __future__ import annotations

-from datetime import datetime
-
 import os
 import platform
 import select
@@ -27,11 +25,11 @@
 import sys
 import threading
 import time
+from datetime import datetime

 from core._log import log
 from core.storage import save_output

-
 # ── Shared run state ──────────────────────────────────────────────────────────

 class State:
@@ -113,15 +111,19 @@ def _handle(self, key: str) -> None:
         if key == "P":
             s.paused = not s.paused
             if s.paused:
-                log("PAUSED — press P or R to resume", "warn"); _beep("stop")
+                log("PAUSED — press P or R to resume", "warn")
+                _beep("stop")
             else:
-                log("RESUMED", "good"); _beep("resume")
+                log("RESUMED", "good")
+                _beep("resume")
         elif key == "R" and s.paused:
             s.paused = False
-            log("RESUMED", "good"); _beep("resume")
+            log("RESUMED", "good")
+            _beep("resume")
         elif key == "Q":
             s.stop = True
-            log("QUIT — saving and exiting …", "warn"); _beep("stop")
+            log("QUIT — saving and exiting …", "warn")
+            _beep("stop")
         elif key == "S":
             log(
                 f"status → found:{self._ctx.get('found', 0)} "
@@ -277,11 +279,17 @@ def check_cmd_file(state: State, cmd_file: str, checkpoint_file: str) -> None:
         _f.write("")

     if cmd == "pause":
-        state.paused = True; log("PAUSED (cmd file)", "warn"); _beep("stop")
+        state.paused = True
+        log("PAUSED (cmd file)", "warn")
+        _beep("stop")
     elif cmd in ("resume", "r"):
-        state.paused = False; log("RESUMED (cmd file)", "good"); _beep("resume")
+        state.paused = False
+        log("RESUMED (cmd file)", "good")
+        _beep("resume")
     elif cmd in ("stop", "q"):
-        state.stop = True; log("STOP — saving …", "warn"); _beep("stop")
+        state.stop = True
+        log("STOP — saving …", "warn")
+        _beep("stop")
     elif cmd == "fresh":
         if os.path.exists(checkpoint_file):
             os.remove(checkpoint_file)
