|
1 | 1 | # Changelog |
2 | 2 |
|
| 3 | +## v1.1.1 — 2026-05-12 |
| 4 | + |
| 5 | +### Test updates |
| 6 | + |
| 7 | +- `tests/test_email_utils.py`: Added 6 tests covering previously untested |
| 8 | + core functions — `decode_cloudflare_email`, `score_email`, and |
| 9 | + `best_email`. Test suite now has 78 tests total. |
| 10 | + |
| 11 | +| Test | Function | What it verifies | |
| 12 | +| ---------------------------------------------------- | ----------------------------- | ---------------------------------------------------------------------------------------------------------------------------- | |
| 13 | +| `test_decode_cloudflare_known_fixture` | `decode_cloudflare_email()` | Decodes a pre-computed XOR fixture (`key=0x1a`, plaintext `hello@example.com`) without any network or browser dependency | |
| 14 | +| `test_decode_cloudflare_invalid_hex_returns_empty` | `decode_cloudflare_email()` | Returns `""` for malformed hex input and empty string input | |
| 15 | +| `test_score_email_tier_1_personal` | `score_email()` | Personal-name local part scores 1 (highest quality) | |
| 16 | +| `test_score_email_tiers_2_3_and_999` | `score_email()` | Priority generic → 2; other generic → 3; skip keyword and junk domain → 999 | |
| 17 | +| `test_best_email_returns_lowest_score` | `best_email()` | Selects the lowest-scoring candidate from a mixed-tier list | |
| 18 | +| `test_best_email_discards_junk_score_999` | `best_email()` | Returns `""` when all candidates score 999; returns valid candidate from mixed junk+valid list | |
| 19 | + |
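The XOR fixture used by `test_decode_cloudflare_known_fixture` can be illustrated with a minimal sketch. It assumes the commonly described Cloudflare obfuscation layout (first byte is the key, every following byte is XORed with it). The names `encode_cloudflare_fixture` and `decode_cloudflare_sketch` are hypothetical and are not the project's `decode_cloudflare_email()`.

```python
def encode_cloudflare_fixture(plaintext: str, key: int) -> str:
    """Build a hex fixture: key byte first, then each byte XORed with the key."""
    raw = plaintext.encode("utf-8")
    return bytes([key] + [b ^ key for b in raw]).hex()


def decode_cloudflare_sketch(encoded: str) -> str:
    """Hypothetical decoder: returns "" for empty or malformed hex input."""
    try:
        data = bytes.fromhex(encoded)
    except (ValueError, TypeError):
        # Non-hex characters or an odd number of hex digits.
        return ""
    if len(data) < 2:
        # Need at least the key byte plus one payload byte.
        return ""
    key, payload = data[0], data[1:]
    return bytes(b ^ key for b in payload).decode("utf-8", errors="replace")


# Pre-computed fixture with the key and plaintext named in the table above.
fixture = encode_cloudflare_fixture("hello@example.com", 0x1A)
```

Because the fixture is computed offline, a test built this way needs no network or browser, which matches the intent stated in the table.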
| 20 | +--- |
| 21 | + |
3 | 22 | ## v1.1.0 — 2026-05-02 |
4 | 23 |
|
5 | 24 | ### Bug Fixes |
6 | 25 |
|
7 | | -| Ref | File | What changed | |
8 | | -|-----|------|--------------| |
9 | | -| BUG-1 | `core/controls.py`, `main.py` | Added `interruptible_sleep()` to `controls.py`. Phase 1 inter-engine and inter-query delays now use it instead of bare `time.sleep()`. `ControlListener` is now instantiated in `main.py` at startup so P/Q/R/S/W keys work instantly during any sleep. | |
10 | | -| BUG-2 | `pipeline/data_cleaner.py` | Company name is now derived from the domain (`derive_name_from_domain()`) instead of the search-engine page title. Uses 2-pass CamelCase splitting (before lowercasing) + hyphen/underscore splitting. Example: `alpha-block-management.co.uk` → `Alpha Block Management`. | |
11 | | -| BUG-3 | `core/email_utils.py` | Added `_PLACEHOLDER_DOMAINS` and `_PLACEHOLDER_LOCALS` blocklists. Addresses like `user@domain.com`, `john@doe.com`, and `filler@godaddy.com` are now rejected before they reach scoring. | |
12 | | -| BUG-4 | `core/email_utils.py` | Mailto query strings (`?subject=…`) and URL fragments (`#…`) are now stripped from every extracted email address before validation. | |
13 | | -| BUG-5 | `core/email_utils.py` | HTML entities in phone strings are decoded with `html.unescape()` before extraction. Phone candidates containing a decimal point (`412 132.305`) or three+ consecutive zeros are rejected as prices/placeholder numbers. | |
| 26 | +| Ref | File | What changed | |
| 27 | +| ----- | --------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | |
| 28 | +| BUG-1 | `core/controls.py`, `main.py` | Added `interruptible_sleep()` to `controls.py`. Phase 1 inter-engine and inter-query delays now use it instead of bare `time.sleep()`. `ControlListener` is now instantiated in `main.py` at startup so P/Q/R/S/W keys work instantly during any sleep. | |
| 29 | +| BUG-2 | `pipeline/data_cleaner.py` | Company name is now derived from the domain (`derive_name_from_domain()`) instead of the search-engine page title. Uses 2-pass CamelCase splitting (before lowercasing) + hyphen/underscore splitting. Example: `alpha-block-management.co.uk` → `Alpha Block Management`. | |
| 30 | +| BUG-3 | `core/email_utils.py` | Added `_PLACEHOLDER_DOMAINS` and `_PLACEHOLDER_LOCALS` blocklists. Addresses like `user@domain.com`, `john@doe.com`, and `filler@godaddy.com` are now rejected before they reach scoring. | |
| 31 | +| BUG-4 | `core/email_utils.py` | Mailto query strings (`?subject=…`) and URL fragments (`#…`) are now stripped from every extracted email address before validation. | |
| 32 | +| BUG-5 | `core/email_utils.py` | HTML entities in phone strings are decoded with `html.unescape()` before extraction. Phone candidates containing a decimal point (`412 132.305`) or three+ consecutive zeros are rejected as prices/placeholder numbers. | |
14 | 33 | | BUG-6 | `pipeline/data_cleaner.py`, `enricher.py` | Directory domains are now hard-excluded in `DataCleaner.process()` — no `CleanRecord` is created. `enricher.py` additionally filters any row where `flagged=YES` and `flag_reason=directory` from its input. `threebestrated.co.uk`, `trovit.co.uk`, `idobusiness.co.uk`, `servicevista.co.uk` moved to `_ALWAYS_EXCLUDED`. | |
15 | | -| BUG-7 | `pipeline/data_cleaner.py` | `_normalise_to_root()` threshold changed from `>= 2` segments to `>= 1`. Any URL with a path (`/property-management-london`) is collapsed to root (`/`). Cuts ~30% of Phase 2 HTTP requests. | |
| 34 | +| BUG-7 | `pipeline/data_cleaner.py` | `_normalise_to_root()` threshold changed from `>= 2` segments to `>= 1`. Any URL with a path (`/property-management-london`) is collapsed to root (`/`). Cuts ~30% of Phase 2 HTTP requests. | |
16 | 35 |
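As an illustration of the BUG-1 idea, the sketch below shows one way a sleep can be made interruptible with `threading.Event`. It is an assumption about the approach, not the code in `core/controls.py`; the function name `interruptible_sleep_sketch` and the event passed to it are hypothetical.

```python
import threading


def interruptible_sleep_sketch(seconds: float, interrupt: threading.Event) -> bool:
    """Sleep for up to `seconds`, waking early if `interrupt` is set.

    Returns True if the full delay elapsed, False if it was interrupted.
    """
    # Event.wait() returns True as soon as the flag is set,
    # or False once the timeout expires.
    return not interrupt.wait(timeout=seconds)
```

A key listener running on another thread could call `interrupt.set()` when a control key is pressed, so the waiting loop reacts immediately instead of finishing a bare `time.sleep()`.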
|
17 | 36 | ### Improvements |
18 | 37 |
|
19 | | -| Ref | File | What changed | |
20 | | -|-----|------|--------------| |
21 | | -| IMP-1 | `enricher.py` | Phase 2 Pass 1 (HTTP) now runs concurrently via `ThreadPoolExecutor`. Worker count controlled by `enricher_workers` in `config.yaml` (default 5). Thread-safe writes via `threading.Lock`. Playwright Pass 2 remains sequential. | |
22 | | -| IMP-2 | `pipeline/data_cleaner.py`, `config.py` | `GEO_SUSPECT_TLDS` list added to `config.py` (default empty). Domains whose TLD matches get `flagged=True, flag_reason='geo-suspect'` and a `-2` score penalty. | |
23 | | -| IMP-3 | Multiple files | All UK-specific hardcoding removed: city lists (`_UK_CITIES`, `_US_CITIES`), `.co.uk` scoring bonus, `.gov.uk`/`.org.uk` auto-flag, `en-GB` Accept-Language header, industry-specific `generic_email_keywords`. `SCORE_BOOST_KEYWORDS` added to `config.py` (default empty). Tool is now general-purpose. | |
24 | | -| IMP-4 | `enricher.py` | Email list deduplicated with `list(set(emails))` before `best_email()` selection. `junk_email_domains` expanded to match `_PLACEHOLDER_DOMAINS` in `email_utils.py`. | |
25 | | -| IMP-5 | `main.py` | Per-engine stats table printed in Phase 1 completion summary (engine, leads found, pages completed). | |
26 | | -| IMP-6 | `engines/bing.py` | Bing loop detection added. If page 2 returns a domain set that is a subset of page 1's domains, engine logs `"Results are looping (geo-block confirmed)"` and sets `is_banned = True` immediately rather than running all 20 pages. | |
27 | | -| CF-FIX | `core/email_utils.py` | Fixed Cloudflare regex typo: closing quote was inside the capture group (`([a-f0-9]+"`), causing zero matches. Correct pattern: `([a-f0-9]+)"`. | |
28 | | -| CAMEL-FIX | `pipeline/data_cleaner.py` | `derive_name_from_domain()` now does CamelCase detection **before** lowercasing the domain string. Fixes `JPropertyManagement.com` → `J Property Management` (was `Jpropertymanagement`). | |
| 38 | +| Ref | File | What changed | |
| 39 | +| --------- | ------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | |
| 40 | +| IMP-1 | `enricher.py` | Phase 2 Pass 1 (HTTP) now runs concurrently via `ThreadPoolExecutor`. Worker count controlled by `enricher_workers` in `config.yaml` (default 5). Thread-safe writes via `threading.Lock`. Playwright Pass 2 remains sequential. | |
| 41 | +| IMP-2 | `pipeline/data_cleaner.py`, `config.py` | `GEO_SUSPECT_TLDS` list added to `config.py` (default empty). Domains whose TLD matches get `flagged=True, flag_reason='geo-suspect'` and a `-2` score penalty. | |
| 42 | +| IMP-3 | Multiple files | All UK-specific hardcoding removed: city lists (`_UK_CITIES`, `_US_CITIES`), `.co.uk` scoring bonus, `.gov.uk`/`.org.uk` auto-flag, `en-GB` Accept-Language header, industry-specific `generic_email_keywords`. `SCORE_BOOST_KEYWORDS` added to `config.py` (default empty). Tool is now general-purpose. | |
| 43 | +| IMP-4 | `enricher.py` | Email list deduplicated with `list(set(emails))` before `best_email()` selection. `junk_email_domains` expanded to match `_PLACEHOLDER_DOMAINS` in `email_utils.py`. | |
| 44 | +| IMP-5 | `main.py` | Per-engine stats table printed in Phase 1 completion summary (engine, leads found, pages completed). | |
| 45 | +| IMP-6 | `engines/bing.py` | Bing loop detection added. If page 2 returns a domain set that is a subset of page 1's domains, engine logs `"Results are looping (geo-block confirmed)"` and sets `is_banned = True` immediately rather than running all 20 pages. | |
| 46 | +| CF-FIX | `core/email_utils.py` | Fixed Cloudflare regex typo: closing quote was inside the capture group (`([a-f0-9]+"`), causing zero matches. Correct pattern: `([a-f0-9]+)"`. | |
| 47 | +| CAMEL-FIX | `pipeline/data_cleaner.py` | `derive_name_from_domain()` now does CamelCase detection **before** lowercasing the domain string. Fixes `JPropertyManagement.com` → `J Property Management` (was `Jpropertymanagement`). | |
29 | 48 |
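The CAMEL-FIX and BUG-2 entries describe splitting CamelCase before lowercasing, then splitting on hyphens and underscores. A minimal sketch of that ordering follows. The two regex passes and the name `derive_name_sketch` are assumptions for illustration and may differ from `derive_name_from_domain()`.

```python
import re


def derive_name_sketch(domain: str) -> str:
    """Hypothetical name derivation from the first label of a domain."""
    label = domain.split(".", 1)[0]
    # Pass 1: break where a lowercase letter or digit meets an uppercase letter,
    # e.g. "PropertyManagement" -> "Property Management".
    label = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", label)
    # Pass 2: break an uppercase letter from a following capitalised word,
    # e.g. "JProperty" -> "J Property".
    label = re.sub(r"(?<=[A-Z])(?=[A-Z][a-z])", " ", label)
    # Split on hyphens, underscores, and the spaces inserted above.
    words = re.split(r"[-_\s]+", label)
    # Case is normalised only now, after the case boundaries have been used.
    return " ".join(w.capitalize() for w in words if w)
```

Lowercasing first would erase the case boundaries both passes rely on, which is the failure mode the changelog entry describes.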
|
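The IMP-6 subset check can be sketched as a small set comparison. The function name `results_are_looping` is hypothetical, and treating an empty second page as "not looping" is an assumption added here; the changelog only states the subset condition.

```python
from typing import Iterable


def results_are_looping(page1_domains: Iterable[str], page2_domains: Iterable[str]) -> bool:
    """Return True when page 2 adds no domain that page 1 did not already have."""
    first = set(page1_domains)
    second = set(page2_domains)
    # Assumption: an empty page 2 is handled elsewhere, so it is not reported as a loop.
    return bool(second) and second <= first
```

An engine could call this after fetching page 2 and stop paginating early when it returns True.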
30 | 49 | ### Test updates |
31 | 50 |
|
|