Commit 6023fa7
RFC-0001 OSINT Layer — scaffold (Phase 1–5) (#147)
* RFC-016 OSINT Layer: functional Phase 1 + Phase 2-5 stubs (amends #147)
Replaces the original Phase 1-5 scaffold commit (830156f) with an
implementation that integrates with the existing ontology validator and
ships working Phase 1 collectors with mocked tests.
What changed vs the original scaffold:
- ontology.py: Phase 1-5 entity / edge declarations, all using the
from_types/to_types/cardinality shape that OntologyValidator actually
validates against. The scaffold's "from": "*", "to": "A|B" syntax was
invisible to the validator and silently waved every relation through.
- ontology.py: canonicalization helpers (asn, cidr, domain, ipv6, mx,
port, url, web_title) used by collectors and tests.
- ontology.py: merge_into_global_ontology() — idempotent merge into core
ENTITY_TYPES / RELATION_TYPES at import time, no schema migration.
- transform_registry.py: function-based register(metadata, fn) API,
CollectorTuple NamedTuple, idempotent re-registration. Scaffold used a
decorator pattern with no idempotency.
- collectors are sync (matches existing yara/sigma pipelines; the
codebase has zero asyncio usage today). Scaffold used async def.
- Phase 1 collectors are functional with mocked tests:
dns_collector — A/AAAA/NS/MX with NXDOMAIN/Timeout absorbed
whois_collector — domain via python-whois, IP via ipwhois RDAP
(graceful ImportError handling per AGENTS.OE
Override 4 — surface failures, no silent retry)
cert_collector — crt.sh SAN enumeration, dedup, HTTP error → []
- Phase 1.5 collectors (bgp_collector, port_scanner) and Phase 2-5 stubs
(Hunter, Holehe, Namechk, Wappalyzer, BuiltWith, Twitter, Hashtag,
HIBP, Breach Directory) register their metadata so discovery works,
return [] until their integration ships. port_scanner is gated behind
ZETTELFORGE_OSINT_ACTIVE_SCAN=1.
- IPv6Address added to core ENTITY_TYPES (parity with IPv4Address).
- Top-level __init__.py: side-effect import of osint subpackage. The
original scaffold's __init__ called __all__ += [...] before __all__
was defined — would have raised NameError on import.
- Tests: 67 mocked tests in tests/test_osint_entities.py +
tests/test_osint_collectors.py covering entity validation, edge
validation, canonicalization, collector shape, registry dispatch.
- docs/rfcs/RFC-016-osint-layer.md: canonical RFC with Status block
documenting the three deviations from the literal RFC text (single
kg_nodes table, no CLI in this PR, sync collectors).
- SCOPING_DOC.md: Phase 1 planning notes.
- docs/rfc-0001-osint-layer.md (scaffold) removed; docs/rfcs/RFC-016
is the canonical doc.
Verification:
- 67/67 osint tests pass (0.12s)
- 121/121 focused subset (basic + kg_edge_schema + extensions +
consolidation + osint) pass (3.71s)
- ruff check src/zettelforge: clean
- ruff format --check src/zettelforge: clean
- Smoke import: 14 collectors registered, all entity/edge types merged.
Out of scope (deferred):
- Real BGPView / Wappalyzer / Hunter / etc. integrations (Phase 1.5+)
- Investigation workflow engine + state machine (Phase 4)
- Top-level zettelforge CLI (Phase 4 brings it alongside the workflow
engine)
- Container packaging (RFC §10 already defers to vNext per docs/03)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(osint): install dnspython in CI; add coverage smoke tests
Addresses the two CI failures from ecb3f12:
- 3 dns_collector tests failed with ModuleNotFoundError: No module
named 'dns'. The tests use dns.resolver.NXDOMAIN to construct realistic
mocks; that requires dnspython to actually be installed even though the
collector calls themselves are mocked. Adds a new [osint] optional
dependency with dnspython, python-whois, and ipwhois (the runtime
Phase 1 collectors need them anyway), and pulls it into [dev] so CI's
pip install -e .[dev] resolves them.
- Total coverage was 66.74% vs 67% required (GOV-007). The OSINT layer
added ~720 statements with most stub branches uncovered. Adds three
parametrized tests over every registered collector:
1. unsupported input returns a list (covers early-return)
2. metadata is well-formed (covers TransformMetadata fields)
3. every declared input_type is callable without raising
That is 14 collectors x 3 tests = 42 new tests, all using mocked or
short-circuited paths (no network, no API keys).
Local results:
- 109/109 osint tests pass (was 67; +42 from parametrized smoke tests)
- ruff check src/zettelforge/osint: clean
- OSINT package coverage: 70% (was untested for stubs)
No force-push needed; this is a follow-up commit on rfc/osint-layer-scaffold.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(osint/whois): swallow ipwhois IPDefinedError on reserved IPs
The new universal smoke test
test_every_collector_handles_each_declared_input_type[whois_collector]
exercised whois_collector with the canonical RFC 5737 documentation IP
192.0.2.1, which ipwhois treats as a poisoned input and raises
IPDefinedError before any network call. The collector wasn't catching
that exception, so the bare-bones probe call propagated it and CI failed
again on test (3.12) and test (3.13).
Real callers will hit the same exception any time an analyst feeds in a
private (10/8, 192.168/16), loopback (127/8), or documentation-net IP.
The right behaviour is fail-closed: log and return None / [] so the KG
doesn't pretend it has data it doesn't.
This commit catches IPDefinedError and BaseIpwhoisException in
_lookup_ip, logging the reserved-IP case at debug and the broader
failure case at warning. Per AGENTS.OE Override 4, the failure is
surfaced (not silently retried) — it just doesn't bubble as an
exception.
Local: 109/109 osint tests pass; ruff clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(osint): bgp edge direction; entity_resolver bug fixes from review
Addresses reviewer concerns on PR #147 that survived the amend (the rest
were against scaffold code my amend already replaced).
bgp_collector edge direction inverted:
The ontology declares part_of_as as Netblock -> ASNumber (with IP
families also valid as the from side). The collector was emitting
from_entity_type="ASNumber", to_entity_type="Netblock" which inverts
the declared direction. validate_relation would have rejected those
edges. Swap the from/to fields; the entity emitted is still the
Netblock (semantically: "given an ASN, here are netblocks that are
part_of_as that ASN"). Caught by Copilot review of the original
scaffold; the bug carried over into my amend.
entity_resolver canonicalise_ipv4 leading zeros:
ipaddress.IPv4Address("001.002.003.004") raises AddressValueError on
Python 3.10+. Replace with explicit octet parsing (split on '.',
int() each, validate 0-255). Raises ValueError on malformed input
instead of silently corrupting it.
entity_resolver canonicalise_asn hex strip:
Old behavior stripped non-digits, so "0x3039" became "AS03039" rather
than the expected hex-decoded value. Delegate to
zettelforge.osint.ontology.canonicalize_asn which uses int() and
raises ValueError on non-decimal input — fail-loud.
entity_resolver._canonical_key used "Domain" (typo, scaffold relic):
The core ontology uses DomainName. Fix to match. Also added IPv6Address
case (parity with IPv4Address).
entity_resolver late `from datetime import datetime` at module bottom:
Moved to module-top imports.
Same module: phone canonicalization preserved as-is (E.164 best-effort
heuristic; documented now). Netblock canonicalization upgraded to use
canonicalize_cidr so host bits get trimmed.
Local: 109/109 osint tests pass; ruff clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>1 parent 82868bb commit 6023fa7
32 files changed
Lines changed: 3797 additions & 0 deletions
File tree
- docs/rfcs
- src/zettelforge
- osint
- collectors
- breach
- infrastructure
- people
- social
- tech
- tests
Large diffs are not rendered by default.
Large diffs are not rendered by default.
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
110 | 110 | | |
111 | 111 | | |
112 | 112 | | |
| 113 | + | |
| 114 | + | |
| 115 | + | |
| 116 | + | |
| 117 | + | |
| 118 | + | |
| 119 | + | |
| 120 | + | |
| 121 | + | |
| 122 | + | |
| 123 | + | |
113 | 124 | | |
114 | 125 | | |
115 | 126 | | |
| |||
121 | 132 | | |
122 | 133 | | |
123 | 134 | | |
| 135 | + | |
124 | 136 | | |
125 | 137 | | |
126 | 138 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
22 | 22 | | |
23 | 23 | | |
24 | 24 | | |
| 25 | + | |
| 26 | + | |
| 27 | + | |
| 28 | + | |
25 | 29 | | |
26 | 30 | | |
27 | 31 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
207 | 207 | | |
208 | 208 | | |
209 | 209 | | |
| 210 | + | |
| 211 | + | |
| 212 | + | |
| 213 | + | |
| 214 | + | |
210 | 215 | | |
211 | 216 | | |
212 | 217 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
| 1 | + | |
| 2 | + | |
| 3 | + | |
| 4 | + | |
| 5 | + | |
| 6 | + | |
| 7 | + | |
| 8 | + | |
| 9 | + | |
| 10 | + | |
| 11 | + | |
| 12 | + | |
| 13 | + | |
| 14 | + | |
| 15 | + | |
| 16 | + | |
| 17 | + | |
| 18 | + | |
| 19 | + | |
| 20 | + | |
| 21 | + | |
| 22 | + | |
| 23 | + | |
| 24 | + | |
| 25 | + | |
| 26 | + | |
| 27 | + | |
| 28 | + | |
| 29 | + | |
| 30 | + | |
| 31 | + | |
| 32 | + | |
| 33 | + | |
| 34 | + | |
| 35 | + | |
| 36 | + | |
| 37 | + | |
| 38 | + | |
| 39 | + | |
| 40 | + | |
| 41 | + | |
| 42 | + | |
| 43 | + | |
| 44 | + | |
| 45 | + | |
| 46 | + | |
| 47 | + | |
| 48 | + | |
| 49 | + | |
| 50 | + | |
| 51 | + | |
| 52 | + | |
| 53 | + | |
| 54 | + | |
| 55 | + | |
| 56 | + | |
| 57 | + | |
| 58 | + | |
| 59 | + | |
| 60 | + | |
| 61 | + | |
| 62 | + | |
| 63 | + | |
| 64 | + | |
| 65 | + | |
| 66 | + | |
| 67 | + | |
| 68 | + | |
| 69 | + | |
| 70 | + | |
| 71 | + | |
| 72 | + | |
| 73 | + | |
| 74 | + | |
| 75 | + | |
| 76 | + | |
| 77 | + | |
| 78 | + | |
| 79 | + | |
| 80 | + | |
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
| 1 | + | |
| 2 | + | |
| 3 | + | |
| 4 | + | |
| 5 | + | |
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
| 1 | + | |
| 2 | + | |
| 3 | + | |
| 4 | + | |
| 5 | + | |
| 6 | + | |
| 7 | + | |
| 8 | + | |
| 9 | + | |
| 10 | + | |
| 11 | + | |
| 12 | + | |
| 13 | + | |
| 14 | + | |
| 15 | + | |
| 16 | + | |
| 17 | + | |
Lines changed: 46 additions & 0 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
| 1 | + | |
| 2 | + | |
| 3 | + | |
| 4 | + | |
| 5 | + | |
| 6 | + | |
| 7 | + | |
| 8 | + | |
| 9 | + | |
| 10 | + | |
| 11 | + | |
| 12 | + | |
| 13 | + | |
| 14 | + | |
| 15 | + | |
| 16 | + | |
| 17 | + | |
| 18 | + | |
| 19 | + | |
| 20 | + | |
| 21 | + | |
| 22 | + | |
| 23 | + | |
| 24 | + | |
| 25 | + | |
| 26 | + | |
| 27 | + | |
| 28 | + | |
| 29 | + | |
| 30 | + | |
| 31 | + | |
| 32 | + | |
| 33 | + | |
| 34 | + | |
| 35 | + | |
| 36 | + | |
| 37 | + | |
| 38 | + | |
| 39 | + | |
| 40 | + | |
| 41 | + | |
| 42 | + | |
| 43 | + | |
| 44 | + | |
| 45 | + | |
| 46 | + | |
Lines changed: 46 additions & 0 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
| 1 | + | |
| 2 | + | |
| 3 | + | |
| 4 | + | |
| 5 | + | |
| 6 | + | |
| 7 | + | |
| 8 | + | |
| 9 | + | |
| 10 | + | |
| 11 | + | |
| 12 | + | |
| 13 | + | |
| 14 | + | |
| 15 | + | |
| 16 | + | |
| 17 | + | |
| 18 | + | |
| 19 | + | |
| 20 | + | |
| 21 | + | |
| 22 | + | |
| 23 | + | |
| 24 | + | |
| 25 | + | |
| 26 | + | |
| 27 | + | |
| 28 | + | |
| 29 | + | |
| 30 | + | |
| 31 | + | |
| 32 | + | |
| 33 | + | |
| 34 | + | |
| 35 | + | |
| 36 | + | |
| 37 | + | |
| 38 | + | |
| 39 | + | |
| 40 | + | |
| 41 | + | |
| 42 | + | |
| 43 | + | |
| 44 | + | |
| 45 | + | |
| 46 | + | |
0 commit comments