Commit 799203d

docs: add CHANGELOG entry for v1.1.1 test coverage additions
1 parent a803b02 commit 799203d

1 file changed

Lines changed: 37 additions & 18 deletions

CHANGELOG.md

@@ -1,31 +1,50 @@
 # Changelog
 
+## v1.1.1 — 2026-05-12
+
+### Test updates
+
+- `tests/test_email_utils.py`: Added 6 tests covering previously untested
+core functions — `decode_cloudflare_email`, `score_email`, and
+`best_email`. Test suite now has 78 tests total.
+
+| Test | Function | What it verifies |
+| ---------------------------------------------------- | ----------------------------- | ---------------------------------------------------------------------------------------------------------------------------- |
+| `test_decode_cloudflare_known_fixture` | `decode_cloudflare_email()` | Decodes a pre-computed XOR fixture (`key=0x1a`, plaintext `hello@example.com`) without any network or browser dependency |
+| `test_decode_cloudflare_invalid_hex_returns_empty` | `decode_cloudflare_email()` | Returns `""` for malformed hex input and empty string input |
+| `test_score_email_tier_1_personal` | `score_email()` | Personal-name local part scores 1 (highest quality) |
+| `test_score_email_tiers_2_3_and_999` | `score_email()` | Priority generic → 2; other generic → 3; skip keyword and junk domain → 999 |
+| `test_best_email_returns_lowest_score` | `best_email()` | Selects the lowest-scoring candidate from a mixed-tier list |
+| `test_best_email_discards_junk_score_999` | `best_email()` | Returns `""` when all candidates score 999; returns valid candidate from mixed junk+valid list |
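
The two `decode_cloudflare_email()` rows above describe an XOR fixture test and an invalid-input test. The sketch below shows what such a decoder and fixture could look like, assuming the standard Cloudflare email-obfuscation layout (a hex string whose first byte is the XOR key). It is not the project's actual code: the function body is a guess at the behaviour the table describes, and `encode_fixture` is a hypothetical helper introduced only for illustration.

```python
def decode_cloudflare_email(encoded: str) -> str:
    """Decode a Cloudflare-obfuscated address; return "" on bad input.

    Assumed layout: hex string, first byte is the XOR key, the remaining
    bytes are the address XOR-ed with that key.
    """
    try:
        data = bytes.fromhex(encoded)
    except (ValueError, TypeError):
        return ""  # malformed hex (or a non-string argument)
    if len(data) < 2:
        return ""  # empty input, or a key byte with no payload
    key = data[0]
    try:
        return bytes(b ^ key for b in data[1:]).decode("utf-8")
    except UnicodeDecodeError:
        return ""


def encode_fixture(plaintext: str, key: int) -> str:
    """Hypothetical helper: build a fixture as key byte + XOR-ed payload."""
    payload = bytes(b ^ key for b in plaintext.encode("utf-8"))
    return (bytes([key]) + payload).hex()


# Round trip with the key and plaintext named in the changelog table.
fixture = encode_fixture("hello@example.com", 0x1A)
print(decode_cloudflare_email(fixture))
```

Because the fixture is computed offline from a fixed key, a test built this way needs no network or browser, which matches the intent stated in the table.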
19+
20+
---
21+
322
## v1.1.0 — 2026-05-02
423

524
### Bug Fixes
625

7-
| Ref | File | What changed |
8-
|-----|------|--------------|
9-
| BUG-1 | `core/controls.py`, `main.py` | Added `interruptible_sleep()` to `controls.py`. Phase 1 inter-engine and inter-query delays now use it instead of bare `time.sleep()`. `ControlListener` is now instantiated in `main.py` at startup so P/Q/R/S/W keys work instantly during any sleep. |
10-
| BUG-2 | `pipeline/data_cleaner.py` | Company name is now derived from the domain (`derive_name_from_domain()`) instead of the search-engine page title. Uses 2-pass CamelCase splitting (before lowercasing) + hyphen/underscore splitting. Example: `alpha-block-management.co.uk``Alpha Block Management`. |
11-
| BUG-3 | `core/email_utils.py` | Added `_PLACEHOLDER_DOMAINS` and `_PLACEHOLDER_LOCALS` blocklists. Addresses like `user@domain.com`, `john@doe.com`, and `filler@godaddy.com` are now rejected before they reach scoring. |
12-
| BUG-4 | `core/email_utils.py` | Mailto query strings (`?subject=…`) and URL fragments (`#…`) are now stripped from every extracted email address before validation. |
13-
| BUG-5 | `core/email_utils.py` | HTML entities in phone strings are decoded with `html.unescape()` before extraction. Phone candidates containing a decimal point (`412 132.305`) or three+ consecutive zeros are rejected as prices/placeholder numbers. |
26+
| Ref | File | What changed |
27+
| ----- | --------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
28+
| BUG-1 | `core/controls.py`, `main.py` | Added `interruptible_sleep()` to `controls.py`. Phase 1 inter-engine and inter-query delays now use it instead of bare `time.sleep()`. `ControlListener` is now instantiated in `main.py` at startup so P/Q/R/S/W keys work instantly during any sleep. |
29+
| BUG-2 | `pipeline/data_cleaner.py` | Company name is now derived from the domain (`derive_name_from_domain()`) instead of the search-engine page title. Uses 2-pass CamelCase splitting (before lowercasing) + hyphen/underscore splitting. Example: `alpha-block-management.co.uk``Alpha Block Management`. |
30+
| BUG-3 | `core/email_utils.py` | Added `_PLACEHOLDER_DOMAINS` and `_PLACEHOLDER_LOCALS` blocklists. Addresses like `user@domain.com`, `john@doe.com`, and `filler@godaddy.com` are now rejected before they reach scoring. |
31+
| BUG-4 | `core/email_utils.py` | Mailto query strings (`?subject=…`) and URL fragments (`#…`) are now stripped from every extracted email address before validation. |
32+
| BUG-5 | `core/email_utils.py` | HTML entities in phone strings are decoded with `html.unescape()` before extraction. Phone candidates containing a decimal point (`412 132.305`) or three+ consecutive zeros are rejected as prices/placeholder numbers. |
1433
| BUG-6 | `pipeline/data_cleaner.py`, `enricher.py` | Directory domains are now hard-excluded in `DataCleaner.process()` — no `CleanRecord` is created. `enricher.py` additionally filters any row where `flagged=YES` and `flag_reason=directory` from its input. `threebestrated.co.uk`, `trovit.co.uk`, `idobusiness.co.uk`, `servicevista.co.uk` moved to `_ALWAYS_EXCLUDED`. |
15-
| BUG-7 | `pipeline/data_cleaner.py` | `_normalise_to_root()` threshold changed from `>= 2` segments to `>= 1`. Any URL with a path (`/property-management-london`) is collapsed to root (`/`). Cuts ~30% of Phase 2 HTTP requests. |
34+
| BUG-7 | `pipeline/data_cleaner.py` | `_normalise_to_root()` threshold changed from `>= 2` segments to `>= 1`. Any URL with a path (`/property-management-london`) is collapsed to root (`/`). Cuts ~30% of Phase 2 HTTP requests. |
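
BUG-2 describes deriving a company name from the domain by splitting CamelCase before lowercasing, then splitting on hyphens and underscores. The sketch below is one way that could work. It is not the project's implementation: the two regular expressions, the `_MULTI_PART_SUFFIXES` tuple, and the `www.` handling are all assumptions made for illustration.

```python
import re

# Assumed, deliberately short list of multi-part suffixes; the real
# project may handle public suffixes differently.
_MULTI_PART_SUFFIXES = (".co.uk", ".org.uk", ".com.au")


def derive_name_from_domain(domain: str) -> str:
    """Turn a domain into a display name, keeping CamelCase word breaks."""
    host = domain.strip()
    if host.lower().startswith("www."):
        host = host[4:]

    # Drop the suffix while the original casing is still intact.
    for suffix in _MULTI_PART_SUFFIXES:
        if host.lower().endswith(suffix):
            label = host[: -len(suffix)]
            break
    else:
        label = host.rsplit(".", 1)[0]
    label = label.rsplit(".", 1)[-1]  # ignore any remaining subdomain

    # Pass 1: break before a capital that starts a word after a run of
    # capitals ("JProperty" -> "J Property").
    label = re.sub(r"([A-Z]+)([A-Z][a-z])", r"\1 \2", label)
    # Pass 2: break between a lowercase letter or digit and a capital
    # ("PropertyManagement" -> "Property Management").
    label = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", label)

    # Hyphen/underscore splitting, then title-case each word.
    words = re.split(r"[-_\s]+", label)
    return " ".join(word.capitalize() for word in words if word)


print(derive_name_from_domain("alpha-block-management.co.uk"))
print(derive_name_from_domain("JPropertyManagement.com"))
```

Running the CamelCase passes on the original string matters because lowercasing first erases the only signal that marks word boundaries in a domain like `JPropertyManagement.com`.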
1635

1736
### Improvements
1837

19-
| Ref | File | What changed |
20-
|-----|------|--------------|
21-
| IMP-1 | `enricher.py` | Phase 2 Pass 1 (HTTP) now runs concurrently via `ThreadPoolExecutor`. Worker count controlled by `enricher_workers` in `config.yaml` (default 5). Thread-safe writes via `threading.Lock`. Playwright Pass 2 remains sequential. |
22-
| IMP-2 | `pipeline/data_cleaner.py`, `config.py` | `GEO_SUSPECT_TLDS` list added to `config.py` (default empty). Domains whose TLD matches get `flagged=True, flag_reason='geo-suspect'` and a `-2` score penalty. |
23-
| IMP-3 | Multiple files | All UK-specific hardcoding removed: city lists (`_UK_CITIES`, `_US_CITIES`), `.co.uk` scoring bonus, `.gov.uk`/`.org.uk` auto-flag, `en-GB` Accept-Language header, industry-specific `generic_email_keywords`. `SCORE_BOOST_KEYWORDS` added to `config.py` (default empty). Tool is now general-purpose. |
24-
| IMP-4 | `enricher.py` | Email list deduplicated with `list(set(emails))` before `best_email()` selection. `junk_email_domains` expanded to match `_PLACEHOLDER_DOMAINS` in `email_utils.py`. |
25-
| IMP-5 | `main.py` | Per-engine stats table printed in Phase 1 completion summary (engine, leads found, pages completed). |
26-
| IMP-6 | `engines/bing.py` | Bing loop detection added. If page 2 returns a domain set that is a subset of page 1's domains, engine logs `"Results are looping (geo-block confirmed)"` and sets `is_banned = True` immediately rather than running all 20 pages. |
27-
| CF-FIX | `core/email_utils.py` | Fixed Cloudflare regex typo: closing quote was inside the capture group (`([a-f0-9]+"`), causing zero matches. Correct pattern: `([a-f0-9]+)"`. |
28-
| CAMEL-FIX | `pipeline/data_cleaner.py` | `derive_name_from_domain()` now does CamelCase detection **before** lowercasing the domain string. Fixes `JPropertyManagement.com``J Property Management` (was `Jpropertymanagement`). |
38+
| Ref | File | What changed |
39+
| --------- | ------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
40+
| IMP-1 | `enricher.py` | Phase 2 Pass 1 (HTTP) now runs concurrently via `ThreadPoolExecutor`. Worker count controlled by `enricher_workers` in `config.yaml` (default 5). Thread-safe writes via `threading.Lock`. Playwright Pass 2 remains sequential. |
41+
| IMP-2 | `pipeline/data_cleaner.py`, `config.py` | `GEO_SUSPECT_TLDS` list added to `config.py` (default empty). Domains whose TLD matches get `flagged=True, flag_reason='geo-suspect'` and a `-2` score penalty. |
42+
| IMP-3 | Multiple files | All UK-specific hardcoding removed: city lists (`_UK_CITIES`, `_US_CITIES`), `.co.uk` scoring bonus, `.gov.uk`/`.org.uk` auto-flag, `en-GB` Accept-Language header, industry-specific `generic_email_keywords`. `SCORE_BOOST_KEYWORDS` added to `config.py` (default empty). Tool is now general-purpose. |
43+
| IMP-4 | `enricher.py` | Email list deduplicated with `list(set(emails))` before `best_email()` selection. `junk_email_domains` expanded to match `_PLACEHOLDER_DOMAINS` in `email_utils.py`. |
44+
| IMP-5 | `main.py` | Per-engine stats table printed in Phase 1 completion summary (engine, leads found, pages completed). |
45+
| IMP-6 | `engines/bing.py` | Bing loop detection added. If page 2 returns a domain set that is a subset of page 1's domains, engine logs `"Results are looping (geo-block confirmed)"` and sets `is_banned = True` immediately rather than running all 20 pages. |
46+
| CF-FIX | `core/email_utils.py` | Fixed Cloudflare regex typo: closing quote was inside the capture group (`([a-f0-9]+"`), causing zero matches. Correct pattern: `([a-f0-9]+)"`. |
47+
| CAMEL-FIX | `pipeline/data_cleaner.py` | `derive_name_from_domain()` now does CamelCase detection **before** lowercasing the domain string. Fixes `JPropertyManagement.com``J Property Management` (was `Jpropertymanagement`). |
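
IMP-1 pairs a `ThreadPoolExecutor` with a `threading.Lock` for thread-safe writes. The sketch below illustrates that general pattern. Only the default of 5 workers comes from the changelog: the names `enrich_all`, `fetch`, and `results` are hypothetical, and a stub stands in for the real HTTP pass so the example runs offline.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def enrich_all(
    domains: Iterable[str],
    fetch: Callable[[str], str],
    workers: int = 5,  # mirrors the documented `enricher_workers` default
) -> Dict[str, str]:
    """Run `fetch` for every domain concurrently and collect the results."""
    results: Dict[str, str] = {}
    lock = threading.Lock()

    def worker(domain: str) -> None:
        data = fetch(domain)  # slow I/O happens outside the lock
        with lock:            # only the shared write is serialised
            results[domain] = data

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() drains the iterator so a worker exception is re-raised here.
        list(pool.map(worker, domains))
    return results


# Stub fetcher standing in for the real HTTP request.
print(enrich_all(["a.example", "b.example"], lambda d: d.upper(), workers=2))
```

Holding the lock only around the shared write keeps the network-bound work parallel, which is where the speed-up from concurrency comes from.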
 
 ### Test updates
 