Commit 090c7ea

Initial Release : Leadhunter_Pro
0 parents  commit 090c7ea

47 files changed

Lines changed: 8091 additions & 0 deletions

.env.example

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
# LeadHunter Pro: environment variables
# Copy to .env and fill in your values. This file is safe to commit; .env is not.

# Residential proxy URL for Bing geo-unlock (optional; leave blank to skip Bing)
# Format: http://user:pass@host:port or socks5://user:pass@host:port
BING_PROXY=

.github/workflows/ci.yml

Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
name: CI

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.10", "3.11", "3.12"]

    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}

      - name: Install dependencies
        run: pip install -r requirements.txt -r requirements-dev.txt

      - name: Lint with ruff
        run: ruff check .

      - name: Run tests
        run: pytest tests/ --tb=short

.gitignore

Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
# Python
__pycache__/
*.py[cod]
*.pyo
.venv/
venv/
env/
*.egg-info/
dist/
build/

# Project outputs (never commit scraped data)
outputs/
logs/
checkpoints/
debug_html/

# Enricher state
enrich_checkpoint.json
found_contacts_*.xlsx
found_contacts_*.csv
command.txt

# Config (user must copy from example)
config.yaml

# OS
.DS_Store
Thumbs.db
desktop.ini

# IDE
.idea/
.vscode/
*.swp
*.swo

# Coverage
.coverage
htmlcov/
.pytest_cache/

# Environment variables (never commit)
.env

BLUEPRINT.md

Lines changed: 535 additions & 0 deletions
Large diffs are not rendered by default.

CHANGELOG.md

Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
# Changelog

## v1.1.0 - 2026-05-02

### Bug Fixes

| Ref | File | What changed |
|-----|------|--------------|
| BUG-1 | `core/controls.py`, `main.py` | Added `interruptible_sleep()` to `controls.py`. Phase 1 inter-engine and inter-query delays now use it instead of bare `time.sleep()`. `ControlListener` is now instantiated in `main.py` at startup so P/Q/R/S/W keys work instantly during any sleep. |
| BUG-2 | `pipeline/data_cleaner.py` | Company name is now derived from the domain (`derive_name_from_domain()`) instead of the search-engine page title. Uses 2-pass CamelCase splitting (before lowercasing) + hyphen/underscore splitting. Example: `alpha-block-management.co.uk` → `Alpha Block Management`. |
| BUG-3 | `core/email_utils.py` | Added `_PLACEHOLDER_DOMAINS` and `_PLACEHOLDER_LOCALS` blocklists. Addresses like `user@domain.com`, `john@doe.com`, and `filler@godaddy.com` are now rejected before they reach scoring. |
| BUG-4 | `core/email_utils.py` | Mailto query strings (`?subject=…`) and URL fragments (`#…`) are now stripped from every extracted email address before validation. |
| BUG-5 | `core/email_utils.py` | HTML entities in phone strings are decoded with `html.unescape()` before extraction. Phone candidates containing a decimal point (`412 132.305`) or three+ consecutive zeros are rejected as prices/placeholder numbers. |
| BUG-6 | `pipeline/data_cleaner.py`, `enricher.py` | Directory domains are now hard-excluded in `DataCleaner.process()`; no `CleanRecord` is created. `enricher.py` additionally filters any row where `flagged=YES` and `flag_reason=directory` from its input. `threebestrated.co.uk`, `trovit.co.uk`, `idobusiness.co.uk`, `servicevista.co.uk` moved to `_ALWAYS_EXCLUDED`. |
| BUG-7 | `pipeline/data_cleaner.py` | `_normalise_to_root()` threshold changed from `>= 2` segments to `>= 1`. Any URL with a path (`/property-management-london`) is collapsed to root (`/`). Cuts ~30% of Phase 2 HTTP requests. |
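
As a rough illustration of the name derivation described in BUG-2, the sketch below splits CamelCase in two passes while the original casing is still intact, then splits on hyphens and underscores. It is not the project's `derive_name_from_domain()`: the suffix list `_SUFFIXES` and the exact regular expressions are assumptions made for this example.

```python
import re

# Hypothetical, deliberately short suffix list; a real implementation would
# more likely consult the Public Suffix List.
_SUFFIXES = (".co.uk", ".org.uk", ".com", ".net", ".org", ".io")


def derive_name_from_domain(domain: str) -> str:
    """Turn a bare domain into a display name (illustrative sketch only)."""
    host = domain.strip()
    if host.lower().startswith("www."):
        host = host[4:]
    # Drop the longest matching suffix; casing of the remainder is untouched.
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if host.lower().endswith(suffix):
            host = host[: -len(suffix)]
            break
    # Pass 1: split a run of capitals from a following capitalised word,
    # e.g. "JProperty" -> "J Property".
    host = re.sub(r"([A-Z]+)([A-Z][a-z])", r"\1 \2", host)
    # Pass 2: split a lowercase letter or digit from a following capital,
    # e.g. "PropertyManagement" -> "Property Management".
    host = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", host)
    # Hyphens, underscores, dots and the spaces added above separate words.
    words = re.split(r"[-_.\s]+", host)
    # str.capitalize() lowercases the tail of each word, so casing is only
    # discarded after the CamelCase boundaries have been found.
    return " ".join(word.capitalize() for word in words if word)
```

Running the two regex passes before `capitalize()` matters: once the string is lowercased, the word boundaries in a name such as `JPropertyManagement` can no longer be recovered.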

### Improvements

| Ref | File | What changed |
|-----|------|--------------|
| IMP-1 | `enricher.py` | Phase 2 Pass 1 (HTTP) now runs concurrently via `ThreadPoolExecutor`. Worker count controlled by `enricher_workers` in `config.yaml` (default 5). Thread-safe writes via `threading.Lock`. Playwright Pass 2 remains sequential. |
| IMP-2 | `pipeline/data_cleaner.py`, `config.py` | `GEO_SUSPECT_TLDS` list added to `config.py` (default empty). Domains whose TLD matches get `flagged=True, flag_reason='geo-suspect'` and a `-2` score penalty. |
| IMP-3 | Multiple files | All UK-specific hardcoding removed: city lists (`_UK_CITIES`, `_US_CITIES`), `.co.uk` scoring bonus, `.gov.uk`/`.org.uk` auto-flag, `en-GB` Accept-Language header, industry-specific `generic_email_keywords`. `SCORE_BOOST_KEYWORDS` added to `config.py` (default empty). Tool is now general-purpose. |
| IMP-4 | `enricher.py` | Email list deduplicated with `list(set(emails))` before `best_email()` selection. `junk_email_domains` expanded to match `_PLACEHOLDER_DOMAINS` in `email_utils.py`. |
| IMP-5 | `main.py` | Per-engine stats table printed in Phase 1 completion summary (engine, leads found, pages completed). |
| IMP-6 | `engines/bing.py` | Bing loop detection added. If page 2 returns a domain set that is a subset of page 1's domains, engine logs `"Results are looping (geo-block confirmed)"` and sets `is_banned = True` immediately rather than running all 20 pages. |
| CF-FIX | `core/email_utils.py` | Fixed Cloudflare regex typo: closing quote was inside the capture group (`([a-f0-9]+"`), causing zero matches. Correct pattern: `([a-f0-9]+)"`. |
| CAMEL-FIX | `pipeline/data_cleaner.py` | `derive_name_from_domain()` now does CamelCase detection **before** lowercasing the domain string. Fixes `JPropertyManagement.com` → `J Property Management` (was `Jpropertymanagement`). |
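
The loop check described in IMP-6 comes down to a subset test on two sets of domains. The following is a minimal sketch, assuming each results page has already been reduced to a set of domain strings. The helper name `is_looping` and the decision to ignore an empty second page are assumptions for illustration, not code taken from `engines/bing.py`.

```python
def is_looping(page_one_domains: set[str], page_two_domains: set[str]) -> bool:
    """Return True when page 2 contains no domain that page 1 lacked."""
    # Assumption: an empty page 2 means "no results", not "looping results",
    # so it is reported separately rather than treated as a subset match.
    if not page_two_domains:
        return False
    # `<=` on sets is the subset test: every page 2 domain is also on page 1.
    return page_two_domains <= page_one_domains


if __name__ == "__main__":
    first = {"a.example", "b.example", "c.example"}
    second = {"b.example", "a.example"}
    if is_looping(first, second):
        # In the real engine this is where `is_banned` would be set.
        print("Results are looping (geo-block confirmed)")
```

Comparing sets rather than ordered lists means a reshuffled repeat of page 1 is still caught.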

### Test updates

- `tests/test_cleaner.py`: Updated `test_derive_name_from_domain` expectations; added `test_directory_domain_produces_none`, `test_normalise_to_root_single_segment`, `test_geo_suspect_flag`, `test_irrelevance_flag`.
- `tests/test_email_utils.py`: Added `test_mailto_query_string_stripped`, `test_placeholder_domain_rejected`, `test_placeholder_local_rejected`, `test_html_entity_phone_decoded`, `test_decimal_phone_rejected`, `test_zero_loop_phone_rejected`.
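
To make a few of the email test names above concrete, here is a hedged sketch of the behaviour they appear to cover. The helper `clean_email` is hypothetical, the blocklist entries are illustrative guesses based on the examples in BUG-3, and the test bodies are assumptions; none of it is copied from `core/email_utils.py` or `tests/test_email_utils.py`.

```python
import re
from typing import Optional

# Illustrative entries only; the real blocklists are not shown in this commit view.
_PLACEHOLDER_DOMAINS = {"domain.com", "doe.com"}
_PLACEHOLDER_LOCALS = {"filler"}


def clean_email(raw: str) -> Optional[str]:
    """Strip mailto extras, then drop placeholder addresses (sketch)."""
    address = raw.strip()
    if address.lower().startswith("mailto:"):
        address = address[len("mailto:"):]
    # Cut at the first "?" or "#" to remove query strings and fragments.
    address = re.split(r"[?#]", address, maxsplit=1)[0].lower()
    local, sep, domain = address.partition("@")
    if not sep or not local or not domain:
        return None
    if domain in _PLACEHOLDER_DOMAINS or local in _PLACEHOLDER_LOCALS:
        return None
    return address


def test_mailto_query_string_stripped():
    assert clean_email("mailto:info@example.org?subject=Hello") == "info@example.org"


def test_placeholder_domain_rejected():
    assert clean_email("user@domain.com") is None


def test_placeholder_local_rejected():
    assert clean_email("filler@godaddy.com") is None
```

Stripping the query string before the blocklist lookup keeps an address such as `user@domain.com?subject=Hi` from slipping past the domain check.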

---

## v1.0.0 - 2026-04-01

Initial release.

CONTRIBUTING.md

Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@
# Contributing to LeadHunter Pro

Contributions are welcome. Please read the guidelines below before opening a pull request; they keep the codebase consistent and reviews fast.

## Running the test suite

```bash
pytest
```

All tests live under `tests/`. Run the full suite before pushing; CI will also run it automatically on every PR.

## Adding a new search engine

1. Create `engines/<engine_name>.py` and subclass `engine_base.EngineBase`.
2. Implement the `search(query: str, pages: int) -> list[SearchResult]` method.
3. Register the engine in `engines/__init__.py` by adding it to `ENGINE_MAP`:

   ```python
   from engines.myengine import MyEngine
   ENGINE_MAP["myengine"] = MyEngine
   ```

4. Add `"myengine"` to `ENGINES_PRIORITY` in `config.py` if it should run by default.
5. Add at least one HTML-parsing test in `tests/test_engines.py` (see existing tests for the pattern).
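
The steps above can be sketched roughly as follows. To keep the example self-contained, `SearchResult` and `EngineBase` are redefined here as stand-ins; the real classes in `engine_base` may have different fields and helpers. The `fetch_page` hook, the `parse` method, and the `class="result"` selector are likewise hypothetical.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import Optional


@dataclass
class SearchResult:  # stand-in; the project's real fields are not shown here
    url: str
    title: str


class EngineBase(ABC):  # stand-in for engine_base.EngineBase
    @abstractmethod
    def search(self, query: str, pages: int) -> list[SearchResult]: ...


class _LinkParser(HTMLParser):
    """Collect href and text from <a class="result"> elements."""

    def __init__(self) -> None:
        super().__init__()
        self.results: list[SearchResult] = []
        self._href: Optional[str] = None
        self._text: list[str] = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "a" and attr_map.get("class") == "result":
            self._href = attr_map.get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            title = "".join(self._text).strip()
            self.results.append(SearchResult(self._href, title))
            self._href = None


class MyEngine(EngineBase):
    def fetch_page(self, query: str, page: int) -> str:
        """Hypothetical hook: return the raw HTML of one results page."""
        raise NotImplementedError("wire this to the project's HTTP client")

    def parse(self, html: str) -> list[SearchResult]:
        parser = _LinkParser()
        parser.feed(html)
        return parser.results

    def search(self, query: str, pages: int) -> list[SearchResult]:
        results: list[SearchResult] = []
        for page in range(1, pages + 1):
            results.extend(self.parse(self.fetch_page(query, page)))
        return results
```

Keeping parsing in its own method is one way to satisfy step 5: a test can feed a saved HTML snippet to `parse` without touching the network.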

## Code style

The project uses **ruff** for linting and formatting (configured in `pyproject.toml`). Line length is **100 characters**.

```bash
ruff check .
ruff format .
```

Fix all ruff warnings before submitting. Do not suppress rules without a comment explaining why.

## Pull request guidelines

- **One PR per feature or fix.** Mixed-concern PRs are hard to review and harder to revert.
- **Reference the relevant issue** in the PR description (e.g. `Closes #42`).
- Keep commit messages short and imperative: `Add Ecosia engine`, `Fix Yahoo warmup retry logic`.
- If your change touches scraping logic, include a note on which engine/site was tested and what the result looked like.

LICENSE

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2026 LeadHunter Pro Contributors

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
