Commit 6023fa7

rolandpg and claude authored
RFC-0001 OSINT Layer — scaffold (Phase 1–5) (#147)
* RFC-016 OSINT Layer: functional Phase 1 + Phase 2-5 stubs (amends #147)

  Replaces the original Phase 1-5 scaffold commit (830156f) with an implementation that integrates with the existing ontology validator and ships working Phase 1 collectors with mocked tests.

  What changed vs the original scaffold:

  - ontology.py: Phase 1-5 entity / edge declarations, all using the from_types/to_types/cardinality shape that OntologyValidator actually validates against. The scaffold's "from": "*", "to": "A|B" syntax was invisible to the validator and silently waved every relation through.
  - ontology.py: canonicalization helpers (asn, cidr, domain, ipv6, mx, port, url, web_title) used by collectors and tests.
  - ontology.py: merge_into_global_ontology() — idempotent merge into core ENTITY_TYPES / RELATION_TYPES at import time, no schema migration.
  - transform_registry.py: function-based register(metadata, fn) API, CollectorTuple NamedTuple, idempotent re-registration. Scaffold used a decorator pattern with no idempotency.
  - collectors are sync (matches existing yara/sigma pipelines; the codebase has zero asyncio usage today). Scaffold used async def.
  - Phase 1 collectors are functional with mocked tests:
      dns_collector — A/AAAA/NS/MX with NXDOMAIN/Timeout absorbed
      whois_collector — domain via python-whois, IP via ipwhois RDAP (graceful ImportError handling per AGENTS.OE Override 4 — surface failures, no silent retry)
      cert_collector — crt.sh SAN enumeration, dedup, HTTP error → []
  - Phase 1.5 collectors (bgp_collector, port_scanner) and Phase 2-5 stubs (Hunter, Holehe, Namechk, Wappalyzer, BuiltWith, Twitter, Hashtag, HIBP, Breach Directory) register their metadata so discovery works, return [] until their integration ships. port_scanner is gated behind ZETTELFORGE_OSINT_ACTIVE_SCAN=1.
  - IPv6Address added to core ENTITY_TYPES (parity with IPv4Address).
  - Top-level __init__.py: side-effect import of osint subpackage. The original scaffold's __init__ called __all__ += [...] before __all__ was defined — would have raised NameError on import.
  - Tests: 67 mocked tests in tests/test_osint_entities.py + tests/test_osint_collectors.py covering entity validation, edge validation, canonicalization, collector shape, registry dispatch.
  - docs/rfcs/RFC-016-osint-layer.md: canonical RFC with Status block documenting the three deviations from the literal RFC text (single kg_nodes table, no CLI in this PR, sync collectors).
  - SCOPING_DOC.md: Phase 1 planning notes.
  - docs/rfc-0001-osint-layer.md (scaffold) removed; docs/rfcs/RFC-016 is the canonical doc.

  Verification:

  - 67/67 osint tests pass (0.12s)
  - 121/121 focused subset (basic + kg_edge_schema + extensions + consolidation + osint) pass (3.71s)
  - ruff check src/zettelforge: clean
  - ruff format --check src/zettelforge: clean
  - Smoke import: 14 collectors registered, all entity/edge types merged.

  Out of scope (deferred):

  - Real BGPView / Wappalyzer / Hunter / etc. integrations (Phase 1.5+)
  - Investigation workflow engine + state machine (Phase 4)
  - Top-level zettelforge CLI (Phase 4 brings it alongside the workflow engine)
  - Container packaging (RFC §10 already defers to vNext per docs/03)

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(osint): install dnspython in CI; add coverage smoke tests

  Addresses the two CI failures from ecb3f12:

  - 3 dns_collector tests failed with ModuleNotFoundError: No module named 'dns'. The tests use dns.resolver.NXDOMAIN to construct realistic mocks; that requires dnspython to actually be installed even though the collector calls themselves are mocked. Adds a new [osint] optional dependency with dnspython, python-whois, and ipwhois (the runtime Phase 1 collectors need them anyway), and pulls it into [dev] so CI's pip install -e .[dev] resolves them.
  - Total coverage was 66.74% vs 67% required (GOV-007). The OSINT layer added ~720 statements with most stub branches uncovered. Adds three parametrized tests over every registered collector:
      1. unsupported input returns a list (covers early-return)
      2. metadata is well-formed (covers TransformMetadata fields)
      3. every declared input_type is callable without raising
    That is 14 collectors x 3 tests = 42 new tests, all using mocked or short-circuited paths (no network, no API keys).

  Local results:

  - 109/109 osint tests pass (was 67; +42 from parametrized smoke tests)
  - ruff check src/zettelforge/osint: clean
  - OSINT package coverage: 70% (was untested for stubs)

  No force-push needed; this is a follow-up commit on rfc/osint-layer-scaffold.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(osint/whois): swallow ipwhois IPDefinedError on reserved IPs

  The new universal smoke test test_every_collector_handles_each_declared_input_type[whois_collector] exercised whois_collector with the canonical RFC 5737 documentation IP 192.0.2.1, which ipwhois treats as a poisoned input and raises IPDefinedError before any network call. The collector wasn't catching that exception, so the bare-bones probe call propagated it and CI failed again on test (3.12) and test (3.13).

  Real callers will hit the same exception any time an analyst feeds in a private (10/8, 192.168/16), loopback (127/8), or documentation-net IP. The right behaviour is fail-closed: log and return None / [] so the KG doesn't pretend it has data it doesn't.

  This commit catches IPDefinedError and BaseIpwhoisException in _lookup_ip, logging the reserved-IP case at debug and the broader failure case at warning. Per AGENTS.OE Override 4, the failure is surfaced (not silently retried) — it just doesn't bubble as an exception.

  Local: 109/109 osint tests pass; ruff clean.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(osint): bgp edge direction; entity_resolver bug fixes from review

  Addresses reviewer concerns on PR #147 that survived the amend (the rest were against scaffold code my amend already replaced).

  bgp_collector edge direction inverted:
    The ontology declares part_of_as as Netblock -> ASNumber (with IP families also valid as the from side). The collector was emitting from_entity_type="ASNumber", to_entity_type="Netblock" which inverts the declared direction. validate_relation would have rejected those edges. Swap the from/to fields; the entity emitted is still the Netblock (semantically: "given an ASN, here are netblocks that are part_of_as that ASN"). Caught by Copilot review of the original scaffold; the bug carried over into my amend.

  entity_resolver canonicalise_ipv4 leading zeros:
    ipaddress.IPv4Address("001.002.003.004") raises AddressValueError on Python 3.10+. Replace with explicit octet parsing (split on '.', int() each, validate 0-255). Raises ValueError on malformed input instead of silently corrupting it.

  entity_resolver canonicalise_asn hex strip:
    Old behavior stripped non-digits, so "0x3039" became "AS03039" rather than the expected hex-decoded value. Delegate to zettelforge.osint.ontology.canonicalize_asn which uses int() and raises ValueError on non-decimal input — fail-loud.

  entity_resolver._canonical_key used "Domain" (typo, scaffold relic):
    The core ontology uses DomainName. Fix to match. Also added IPv6Address case (parity with IPv4Address).

  entity_resolver late `from datetime import datetime` at module bottom:
    Moved to module-top imports.

  Same module: phone canonicalization preserved as-is (E.164 best-effort heuristic; documented now). Netblock canonicalization upgraded to use canonicalize_cidr so host bits get trimmed.

  Local: 109/109 osint tests pass; ruff clean.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
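
The two entity_resolver canonicalization fixes described in the commit message (explicit octet parsing for IPv4, int()-based ASN parsing) can be sketched roughly as below. This is an illustration under assumptions, not the code from this commit: the function names follow the message, but the bodies, the "AS<number>" output format, the 32-bit range check, and the error messages are hypothetical.

```python
def canonicalise_ipv4(value: str) -> str:
    """Parse dotted-quad octets explicitly so leading zeros are tolerated."""
    parts = value.strip().split(".")
    if len(parts) != 4:
        raise ValueError(f"expected 4 octets, got {len(parts)}: {value!r}")
    octets = []
    for part in parts:
        # Reject empty, signed, or non-ASCII-digit octets instead of guessing.
        if not (part.isascii() and part.isdigit()):
            raise ValueError(f"non-numeric octet {part!r} in {value!r}")
        number = int(part)  # int("001") == 1, so leading zeros are dropped
        if number > 255:
            raise ValueError(f"octet {number} out of range in {value!r}")
        octets.append(str(number))
    return ".".join(octets)


def canonicalize_asn(value: str) -> str:
    """Normalise 'as12345' / '12345' to 'AS12345'; fail loudly otherwise."""
    text = value.strip().upper()
    if text.startswith("AS"):
        text = text[2:]
    # Base-10 int() raises ValueError for inputs such as "0X3039" or "12a".
    number = int(text)
    # Assumption: accept only the 32-bit ASN range.
    if not 0 <= number <= 0xFFFFFFFF:
        raise ValueError(f"ASN out of range: {value!r}")
    return f"AS{number}"
```

With this sketch, canonicalise_ipv4("001.002.003.004") yields "1.2.3.4", and canonicalize_asn("0x3039") raises ValueError rather than producing a mangled identifier.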
1 parent 82868bb commit 6023fa7

32 files changed

Lines changed: 3797 additions & 0 deletions

SCOPING_DOC.md

Lines changed: 283 additions & 0 deletions
Large diffs are not rendered by default.

docs/rfcs/RFC-016-osint-layer.md

Lines changed: 411 additions & 0 deletions
Large diffs are not rendered by default.

pyproject.toml

Lines changed: 12 additions & 0 deletions
@@ -110,6 +110,17 @@ crewai = [
     "crewai>=1.14.0",
 ]
 
+# RFC-016 OSINT layer runtime deps. Phase 1 collectors (DNS, WHOIS,
+# crt.sh) need these to function; tests need them to construct realistic
+# mocks (e.g. raise dns.resolver.NXDOMAIN). Optional so non-OSINT users
+# don't pay the cost. CI's [dev] install pulls these in via the line
+# below.
+osint = [
+    "dnspython>=2.4.0",
+    "python-whois>=0.9.0",
+    "ipwhois>=1.2.0",
+]
+
 dev = [
     "pytest>=7.0.0",
     "pytest-asyncio>=0.21.0",
@@ -121,6 +132,7 @@ dev = [
     "uvicorn>=0.20.0", # for test_web_api.py
     "psutil>=5.9.0", # for test_web_api.py
     "jinja2>=3.0.0", # for test_web_api.py
+    "zettelforge[osint]", # OSINT collectors need dnspython etc. for tests
 ]
 
 [project.urls]
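
Because the [osint] extra is optional, runtime code has to cope with dnspython, python-whois, or ipwhois being absent. The commit message mentions graceful ImportError handling in whois_collector, but the mechanism is not shown in this diff. The sketch below is one possible shape; load_optional, lookup_with_optional_dep, and the logger name are hypothetical helpers, not part of zettelforge.

```python
import importlib
import logging
from types import ModuleType
from typing import Callable, Optional

# Hypothetical logger name, used only for this illustration.
_logger = logging.getLogger("osint.optional_deps")


def load_optional(module_name: str, extra: str = "osint") -> Optional[ModuleType]:
    """Import a module from an optional extra, or return None if it is missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        # Surface the failure in the log instead of retrying or hiding it.
        _logger.warning(
            "%s is not installed; install the [%s] extra to enable it",
            module_name,
            extra,
        )
        return None


def lookup_with_optional_dep(
    module_name: str, query: Callable[[ModuleType], list]
) -> list:
    """Run query(module) if the dependency is present, else fail closed with []."""
    module = load_optional(module_name)
    if module is None:
        return []
    return query(module)
```

A collector built this way returns an empty list on a machine without the extra installed, while the warning makes the missing dependency visible.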

src/zettelforge/__init__.py

Lines changed: 4 additions & 0 deletions
@@ -22,6 +22,10 @@
 - OpenCTI integration
 """
 
+# RFC-016 OSINT layer (Phase 1, Infrastructure tier).
+# Side-effect import: merges OSINT entity / edge types into the global
+# ontology and registers Phase 1 collectors with TRANSFORM_REGISTRY.
+from zettelforge import osint
 from zettelforge.blended_retriever import BlendedRetriever
 from zettelforge.edition import (
     Edition,

src/zettelforge/ontology.py

Lines changed: 5 additions & 0 deletions
@@ -207,6 +207,11 @@
         "optional": ["belongs_to_ref", "resolves_to_refs"],
         "properties": {},
     },
+    "IPv6Address": {
+        "required": ["value"],
+        "optional": ["belongs_to_ref", "resolves_to_refs"],
+        "properties": {},
+    },
     "DomainName": {
         "required": ["value"],
         "optional": ["resolves_to_refs"],
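
To show how the new IPv6Address declaration might be used, here is a small sketch that canonicalizes an address with the standard-library ipaddress module and builds an entity dict that satisfies the required/optional fields from the hunk above. The real canonicalize_ipv6 in zettelforge.osint.ontology is not shown in this diff, so the body here is an assumption, and build_ipv6_entity is a hypothetical helper.

```python
import ipaddress

# Spec copied from the hunk above.
IPV6_ENTITY_SPEC = {
    "required": ["value"],
    "optional": ["belongs_to_ref", "resolves_to_refs"],
    "properties": {},
}


def canonicalize_ipv6(value: str) -> str:
    """Return the compressed, lowercase form; raises ValueError if invalid."""
    # ipaddress.AddressValueError is a subclass of ValueError.
    return ipaddress.IPv6Address(value.strip()).compressed


def build_ipv6_entity(value: str, **optional: object) -> dict:
    """Hypothetical helper: build an entity dict that matches the spec."""
    unknown = set(optional) - set(IPV6_ENTITY_SPEC["optional"])
    if unknown:
        raise ValueError(f"unknown fields for IPv6Address: {sorted(unknown)}")
    return {"value": canonicalize_ipv6(value), **optional}
```

Canonicalizing before storage means that different spellings of the same address collapse to one key.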

src/zettelforge/osint/__init__.py

Lines changed: 80 additions & 0 deletions
@@ -0,0 +1,80 @@
+"""
+ZettelForge OSINT layer (RFC-016 / RFC-0001).
+
+Importing the package merges the OSINT entity / edge types into the global
+ontology and imports each collector subpackage so collectors register with
+``TRANSFORM_REGISTRY`` at module load time.
+
+Phase 1 (Infrastructure) ships functional collectors (DNS, WHOIS, crt.sh)
+plus stubs for BGP and port scanning. Phases 2-5 ship as graceful stubs:
+each collector registers its metadata so callers can discover it, and
+returns ``[]`` until the underlying API integration lands.
+
+Public surface:
+
+- ``OSINT_ENTITY_TYPES`` / ``OSINT_RELATION_TYPES`` / ``ONTOLOGY`` — additive
+  ontology declarations.
+- ``TRANSFORM_REGISTRY`` — the singleton registry.
+- ``CollectorTuple`` — collector return-row shape.
+- ``TransformMetadata`` / ``TransformRegistry`` — types for adding new
+  collectors.
+- ``Investigation`` / ``EntityResolver`` — Phase 4 / Phase 1.5 utilities
+  (re-exported from their modules).
+"""
+
+from zettelforge.osint.ontology import (
+    ONTOLOGY,
+    OSINT_ENTITY_TYPES,
+    OSINT_RELATION_TYPES,
+    canonicalize_asn,
+    canonicalize_cidr,
+    canonicalize_domain,
+    canonicalize_ipv6,
+    canonicalize_mx,
+    canonicalize_port,
+    canonicalize_url,
+    canonicalize_web_title,
+    merge_into_global_ontology,
+)
+from zettelforge.osint.transform_registry import (
+    TRANSFORM_REGISTRY,
+    CollectorFn,
+    CollectorTuple,
+    TransformMetadata,
+    TransformRegistry,
+    get_transform_registry,
+)
+
+# Merge OSINT types into the global ontology before any collector runs.
+# Idempotent — safe under repeated imports (pytest, REPL re-imports, etc.).
+merge_into_global_ontology()
+
+# Trigger collector self-registration. Each subpackage's __init__ imports
+# the collector modules under it, and each module calls
+# ``TRANSFORM_REGISTRY.register(...)`` at import time.
+from zettelforge.osint.collectors import breach as _breach # noqa: F401
+from zettelforge.osint.collectors import infrastructure as _infrastructure # noqa: F401
+from zettelforge.osint.collectors import people as _people # noqa: F401
+from zettelforge.osint.collectors import social as _social # noqa: F401
+from zettelforge.osint.collectors import tech as _tech # noqa: F401
+
+__all__ = [
+    "ONTOLOGY",
+    "OSINT_ENTITY_TYPES",
+    "OSINT_RELATION_TYPES",
+    "TRANSFORM_REGISTRY",
+    "CollectorFn",
+    "CollectorTuple",
+    "TransformMetadata",
+    "TransformRegistry",
+    "canonicalize_asn",
+    "canonicalize_cidr",
+    "canonicalize_domain",
+    "canonicalize_ipv6",
+    "canonicalize_mx",
+    "canonicalize_port",
+    "canonicalize_url",
+    "canonicalize_web_title",
+    "get_transform_registry",
+    "merge_into_global_ontology",
+]
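
The module above calls merge_into_global_ontology() at import time and relies on it being idempotent. Its implementation lives in ontology.py and is not part of the rendered diff, so the following is only a sketch of what an idempotent, additive merge could look like. The merge_into function, the stand-in dictionaries, and the entity specs are all hypothetical.

```python
# Stand-in for the core ENTITY_TYPES dict (hypothetical contents).
CORE_ENTITY_TYPES = {
    "IPv4Address": {"required": ["value"], "optional": [], "properties": {}},
}

# Stand-in for OSINT_ENTITY_TYPES (entity names from the commit message,
# specs invented for illustration).
OSINT_ADDITIONS = {
    "ASNumber": {"required": ["value"], "optional": [], "properties": {}},
    "Netblock": {"required": ["value"], "optional": [], "properties": {}},
}


def merge_into(core: dict, additions: dict) -> int:
    """Add missing types to core in place and return how many were added.

    Existing keys are left untouched, so calling this repeatedly (for
    example on re-import) changes nothing after the first call.
    """
    added = 0
    for name, spec in additions.items():
        if name not in core:
            core[name] = spec
            added += 1
    return added


first = merge_into(CORE_ENTITY_TYPES, OSINT_ADDITIONS)
second = merge_into(CORE_ENTITY_TYPES, OSINT_ADDITIONS)
```

In this sketch the first call adds both types and the second adds none, which is the property that makes a side-effect import safe under pytest or REPL re-imports.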
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+"""OSINT collector subpackages.
+
+Each tier (infrastructure, people, tech) is its own subpackage. Phase 1
+ships only ``infrastructure``.
+"""
Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
+"""
+Breach-data collectors (RFC-016 Phase 4 stubs).
+
+HaveIBeenPwned (k-anon password / breach lookup) and Breach Directory
+collectors. Both register their metadata at import time but return ``[]``
+until their integrations land.
+"""
+
+from zettelforge.osint.collectors.breach import (
+    breach_directory,
+    hibp_collector,
+)
+
+__all__ = [
+    "breach_directory",
+    "hibp_collector",
+]
Lines changed: 46 additions & 0 deletions
@@ -0,0 +1,46 @@
+"""
+Breach Directory collector — Phase 4 stub (RFC-016 §5).
+
+Looks up breach records via Breach Directory's API. Stub: requires
+``BREACH_DIRECTORY_API_KEY`` and returns ``[]`` without it. Phase 4
+ships the live lookup.
+"""
+
+from __future__ import annotations
+
+import os
+
+from zettelforge.log import get_logger
+from zettelforge.osint.transform_registry import (
+    TRANSFORM_REGISTRY,
+    CollectorTuple,
+    TransformMetadata,
+)
+
+_logger = get_logger("zettelforge.osint.collectors.breach_directory")
+
+API_KEY_ENV = "BREACH_DIRECTORY_API_KEY"
+
+
+def collect(input_entity_type: str, input_value: str) -> list[CollectorTuple]:
+    """Query Breach Directory for an EmailAddress. Stub: returns ``[]``."""
+    if input_entity_type != "EmailAddress":
+        return []
+    if not os.environ.get(API_KEY_ENV):
+        _logger.debug("breach_directory_no_api_key", env=API_KEY_ENV)
+        return []
+    # Phase 4: real Breach Directory call goes here. For now: fail closed.
+    return []
+
+
+_METADATA = TransformMetadata(
+    name="breach_directory",
+    description="Breach Directory: look up breach records for an email address.",
+    input_types=("EmailAddress",),
+    output_types=(),
+    api_dependencies=("breachdirectory.org",),
+    rate_limit=2.0,
+)
+
+
+TRANSFORM_REGISTRY.register(_METADATA, collect)
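
The collector above registers itself through a function-based register(metadata, fn) call, which the commit message describes as idempotent on re-registration. transform_registry.py is not rendered in this view, so the sketch below is a guess at a minimal registry with that behaviour. The TransformMetadata field names are taken from the call above; the CollectorTuple fields, the default values, and the for_input and run methods are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, NamedTuple


class CollectorTuple(NamedTuple):
    # Hypothetical row shape; the real fields are not shown in this diff.
    entity_type: str
    value: str


@dataclass(frozen=True)
class TransformMetadata:
    name: str
    description: str
    input_types: tuple[str, ...]
    output_types: tuple[str, ...] = ()
    api_dependencies: tuple[str, ...] = ()
    rate_limit: float = 1.0  # assumed default


CollectorFn = Callable[[str, str], list[CollectorTuple]]


class TransformRegistry:
    def __init__(self) -> None:
        self._entries: dict[str, tuple[TransformMetadata, CollectorFn]] = {}

    def register(self, metadata: TransformMetadata, fn: CollectorFn) -> None:
        # Keyed by name: registering the same collector again (for example
        # on module re-import) overwrites the entry instead of duplicating it.
        self._entries[metadata.name] = (metadata, fn)

    def for_input(self, entity_type: str) -> list[str]:
        """Names of collectors that declare this entity type as an input."""
        return [
            meta.name
            for meta, _ in self._entries.values()
            if entity_type in meta.input_types
        ]

    def run(self, name: str, entity_type: str, value: str) -> list[CollectorTuple]:
        _, fn = self._entries[name]
        return fn(entity_type, value)

    def __len__(self) -> int:
        return len(self._entries)
```

Keying entries by metadata.name is what makes module-level registration safe to execute more than once; a decorator that appends to a list would need an explicit duplicate check to get the same property.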
Lines changed: 46 additions & 0 deletions
@@ -0,0 +1,46 @@
+"""
+HaveIBeenPwned collector — Phase 4 stub (RFC-016 §5).
+
+Looks up breach exposure for an email address via the HIBP API. Stub:
+requires ``HIBP_API_KEY`` and returns ``[]`` without it. Phase 4 will
+ship the live lookup and breach-record emission.
+"""
+
+from __future__ import annotations
+
+import os
+
+from zettelforge.log import get_logger
+from zettelforge.osint.transform_registry import (
+    TRANSFORM_REGISTRY,
+    CollectorTuple,
+    TransformMetadata,
+)
+
+_logger = get_logger("zettelforge.osint.collectors.hibp")
+
+API_KEY_ENV = "HIBP_API_KEY"
+
+
+def collect(input_entity_type: str, input_value: str) -> list[CollectorTuple]:
+    """Look up breaches associated with an EmailAddress. Stub: returns ``[]``."""
+    if input_entity_type != "EmailAddress":
+        return []
+    if not os.environ.get(API_KEY_ENV):
+        _logger.debug("hibp_collector_no_api_key", env=API_KEY_ENV)
+        return []
+    # Phase 4: real HIBP call goes here. For now: fail closed.
+    return []
+
+
+_METADATA = TransformMetadata(
+    name="hibp_collector",
+    description="HaveIBeenPwned: enumerate breach exposures for an email.",
+    input_types=("EmailAddress",),
+    output_types=(),
+    api_dependencies=("haveibeenpwned.com",),
+    rate_limit=2.0,
+)
+
+
+TRANSFORM_REGISTRY.register(_METADATA, collect)
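
The commit message describes three smoke checks that run over every registered collector: unsupported input returns a list, metadata is well-formed, and every declared input type is callable without raising. The actual pytest code is not rendered here, so the following is a plain-Python sketch of the same idea against stand-in stubs. make_stub, COLLECTORS, and smoke_check are hypothetical names; only the collector names, the environment variable names, and the fail-closed behaviour come from the diff above.

```python
import os
from typing import Callable


def make_stub(env_var: str, input_type: str) -> Callable[[str, str], list]:
    """Build a fail-closed stub shaped like the collectors in this diff."""

    def collect(input_entity_type: str, input_value: str) -> list:
        if input_entity_type != input_type:
            return []  # unsupported input: early return
        if not os.environ.get(env_var):
            return []  # no API key: fail closed
        return []  # integration not shipped yet

    return collect


# name -> (declared input types, collector function); stand-in for the registry.
COLLECTORS = {
    "hibp_collector": (("EmailAddress",), make_stub("HIBP_API_KEY", "EmailAddress")),
    "breach_directory": (
        ("EmailAddress",),
        make_stub("BREACH_DIRECTORY_API_KEY", "EmailAddress"),
    ),
}


def smoke_check(collectors: dict) -> list[str]:
    """Run the three checks on every collector and return failure messages."""
    failures = []
    for name, (input_types, fn) in collectors.items():
        # 1. Unsupported input must still return a list.
        if not isinstance(fn("NoSuchEntityType", "x"), list):
            failures.append(f"{name}: unsupported input did not return a list")
        # 2. Metadata must be well-formed (non-empty tuple of strings here).
        if not input_types or not all(isinstance(t, str) for t in input_types):
            failures.append(f"{name}: malformed input_types")
        # 3. Every declared input type must be callable without raising.
        for entity_type in input_types:
            try:
                fn(entity_type, "probe@example.com")
            except Exception as exc:  # broad on purpose: any raise is a failure
                failures.append(f"{name}: raised {type(exc).__name__} on {entity_type}")
    return failures
```

Looping over the registry rather than writing one test per collector means a newly registered stub is covered automatically, which matches the 14 collectors x 3 tests arithmetic in the commit message.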
