Skip to content

Commit 8d91ace

Browse files
authored
Merge pull request #2049 from Open-Source-Legal/claude/hopeful-davinci-s5jd4h
Harden authority Phase-4 SSRF guard + address review (issue #2026)
2 parents 7d50c0f + fd8da6a commit 8d91ace

15 files changed

Lines changed: 1075 additions & 128 deletions
Lines changed: 66 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,66 @@
1+
- **Closed a CGNAT SSRF gap in the authority-source fetch guard (issue #2026).**
2+
`opencontractserver/utils/safe_http.py::_assert_public_ip` rejected resolved
3+
addresses via the `ipaddress` property denylist (`is_private` / `is_loopback`
4+
/ `is_link_local` / `is_multicast` / `is_reserved` / `is_unspecified`), but
5+
RFC 6598 Carrier-Grade NAT shared address space (`100.64.0.0/10`) is
6+
classified as **neither** private **nor** reserved **nor** global on current
7+
CPython — verified `is_private == is_reserved == is_global == False` on 3.11
8+
and 3.12 — so a host resolving into that block would have passed validation
9+
and been fetched. It is now rejected explicitly and version-independently via
10+
a `_CGNAT_NETWORK` membership check, with the CIDR pinned in
11+
`opencontractserver/constants/safe_http.py::CGNAT_SHARED_ADDRESS_SPACE_CIDR`.
12+
The `.gov` host allowlist already made exploitation hard in practice (an
13+
attacker would need an allowlisted host to resolve into the block), but the
14+
IP guard is defence-in-depth and is also reachable by callers that pass a
15+
custom `allowlist`. Regression coverage in
16+
`opencontractserver/tests/test_safe_http.py` rejects the CGNAT block and
17+
proves the adjacent public addresses (`100.63.255.255`, `100.128.0.0`) still
18+
pass.
19+
- **Closed an IPv4-mapped IPv6 bypass of the same guard.** `_assert_public_ip`
20+
now unwraps an IPv4-mapped IPv6 address (`::ffff:a.b.c.d`) to its embedded
21+
IPv4 before the property/CGNAT checks. On CPython 3.11 the IPv6
22+
`is_private` / `_CGNAT_NETWORK` checks do not reflect the mapped IPv4 for the
23+
CGNAT-mapped form, so a resolver returning `::ffff:100.64.0.1` would have
24+
slipped past every check; unwrapping makes the guard version-independent for
25+
the mapped forms of private/loopback/link-local/CGNAT addresses. The CGNAT
26+
membership test is additionally guarded by `isinstance(ip, IPv4Address)` so a
27+
native IPv6 address is skipped rather than relying on `IPv6Address in
28+
IPv4Network` returning `False` (true only on CPython 3.11+; 3.10 raises
29+
`TypeError`). Covered by parametrized regressions in `test_safe_http.py`
30+
(mapped forms rejected; public native IPv6 passes). Other IPv6-embedded-IPv4
31+
forms (NAT64 `64:ff9b::/96` + RFC 8215 `64:ff9b:1::/48`, 6to4 `2002::/16`,
32+
Teredo `2001:0::/32`, deprecated IPv4-compatible `::/96`) need no special
33+
handling — CPython already flags those whole prefixes `is_private`/
34+
`is_reserved` on 3.11 and 3.12 — and `test_ipv6_embedded_ipv4_tunnels_rejected`
35+
now pins that coverage so a future Python change couldn't silently open the
36+
hole.
37+
- **Strip per-service credentials on cross-host redirects in `safe_fetch_bytes`.**
38+
Request headers are now dropped of `Authorization` / `Cookie` /
39+
`Proxy-Authorization` (`CROSS_HOST_STRIPPED_HEADERS`) when a redirect crosses to
40+
a different host (RFC 9110 §15.4). httpx — unlike browsers / `requests`
41+
forwards request headers verbatim across origins, so without this a future
42+
caller passing a `.gov` API credential could leak it from one allowlisted host
43+
to another (e.g. `ecfr.gov``federalregister.gov`) while following a redirect.
44+
No current caller passes credentials, so this is forward-looking hardening; the
45+
default `User-Agent` is preserved across the hop. The cross-origin test compares
46+
`netloc` (host **and** port), so a same-host/different-port redirect
47+
(`ecfr.gov``ecfr.gov:9000`, a different service) also strips. Covered by
48+
`TestSafeFetchBytesCredentialStripping` (cross-host, cross-port, same-host).
49+
- **`_assert_public_ip` now fails CLOSED on an empty DNS result.**
50+
`socket.getaddrinfo` can return an empty list **without** raising `gaierror` on
51+
some resolver configs; the per-address loop would then be a no-op and the host
52+
declared safe (fail-open) while httpx still resolves independently at connect
53+
time. It now raises `SSRFValidationError` when no addresses resolve. Covered by
54+
`test_empty_getaddrinfo_rejected`.
55+
- **Redirect-loop robustness in `safe_fetch_bytes`.** The hop check now keys off
56+
`r.has_redirect_location` rather than `r.is_redirect` — in httpx `is_redirect`
57+
is true for *any* 3xx, so a `304 Not Modified` (or any non-`Location` 3xx) was
58+
treated as a redirect, resolved `Location: ""` back to the current URL, and
59+
looped to the redirect cap with a misleading "exceeded N redirects". A present-
60+
but-empty `Location:` now fails fast with a clear error, and a negative
61+
`Content-Length` (e.g. `-1`, which parsed cleanly and slipped the `> max_bytes`
62+
guard) is rejected as malformed. None are SSRF bypasses (the loop is bounded
63+
and the streamed-bytes cap is the real backstop), but they remove wasted hops
64+
and misleading diagnostics under server misbehaviour. Covered by
65+
`test_empty_location_header_rejected`, `test_non_location_3xx_not_followed_as_redirect`,
66+
and `test_negative_content_length_raises`.
Lines changed: 61 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,61 @@
1+
- **Authority discovery Phase-4 code-review follow-ups (issue #2026).**
2+
- `opencontractserver/utils/safe_http.py::safe_fetch_bytes` now sends a
3+
default `User-Agent` (`DEFAULT_USER_AGENT`) so the US Code title-ZIP loader
4+
and the agentic fetch tool identify OpenContracts to `.gov` servers instead
5+
of going out as an anonymous `httpx` client (politeness + fewer rate-limit
6+
blocks); a caller-supplied `User-Agent` (the FR/CFR providers) still
7+
overrides it. This applies to **every** no-`headers` caller of
8+
`safe_fetch_bytes`/`safe_fetch_text`, not only the providers — notably
9+
`enrichment/services/popular_name_importer.py`, which now sends
10+
`DEFAULT_USER_AGENT` instead of the bare `httpx` default (no test asserts on
11+
the old UA string, so nothing regresses). The two byte-identical `_USER_AGENT` literals in
12+
`cfr_provider.py` / `federal_register_provider.py` are consolidated into a
13+
single `AUTHORITY_PROVIDER_USER_AGENT` constant (same `constants/safe_http.py`
14+
neighbourhood) so the contact address can no longer drift between them, and
15+
`us_code_provider.py` (the third deterministic authority provider, previously
16+
sending the generic default) now sends the same UA so all three present
17+
consistently to `.gov` hosts. The default/caller merge uses `httpx.Headers`
18+
(case-insensitive), so a caller passing `user-agent` in any casing overrides
19+
the default rather than emitting two conflicting `User-Agent` lines.
20+
`AUTHORITY_PROVIDER_USER_AGENT` also gains a `+` before its info URL
21+
(`(+https://github.com/...)`, the conventional "informational URL" marker)
22+
to match `DEFAULT_USER_AGENT` — a cosmetic change to the UA bytes the
23+
FR/CFR/US Code providers send to `.gov` hosts, with no behavioural impact.
24+
- Replaced the bare `* 4` UTF-8 worst-case byte factor in
25+
`opencontractserver/pipeline/authority_source_providers/agentic_web_locator_provider.py`
26+
with the named `UTF8_MAX_BYTES_PER_CHAR` constant
27+
(`opencontractserver/constants/safe_http.py`).
28+
- Removed the unused `source_url` override parameter from
29+
`AuthorityGateService.evaluate`
30+
(`opencontractserver/enrichment/services/authority_gate_service.py`): no
31+
caller passed it and no test exercised it, so it was speculative API surface
32+
over a single source of truth (the domain gate reads
33+
`sections[0].source_url`).
34+
- Moved the agentic provider's private `_sanitize_for_prompt` helper into
35+
`opencontractserver/utils/prompt_sanitization.py` as the public
36+
`sanitize_for_prompt_strict` (its stricter, ASCII-only companion to the
37+
existing `sanitize_plaintext_for_prompt`), so this security helper is
38+
discoverable alongside the other prompt-injection utilities rather than
39+
buried in a provider file. Tests moved to `test_prompt_sanitization.py`.
40+
- Hardened the Federal Register provider's step-1 redirect parsing: the
41+
document-number capture in `_LOCATION_DOC_NUMBER_RE` is now `[\d-]+` (real FR
42+
numbers are `YYYY-NNNNN`), so a malformed/attacker-influenced `Location`
43+
carrying letters, underscores, or URL-special characters (`?`, `#`) fails to
44+
match and raises instead of silently being interpolated into the step-2 URL
45+
(wrong endpoint, no error). The step-1 `requests.get` magic `timeout=15` is
46+
replaced with the shared `(CONNECT_TIMEOUT_SECONDS, READ_TIMEOUT_SECONDS)`
47+
constants.
48+
- Documented the deliberate trust boundary on the agentic
49+
`_tool_web_search` query (LLM-generated, forwarded unsanitized to a
50+
text-only search tool — no SSRF surface), why `AuthorityGateService` does
51+
not extend `BaseService` (stateless, no user context), and why the redirect
52+
loop in `safe_fetch_bytes` does not `r.read()` the redirect body (unbounded
53+
read that would bypass the per-hop size cap).
54+
- **New backend test coverage (issue #2026).**
55+
`test_safe_http.py` adds the CGNAT rejection + boundary cases, the
56+
`MAX_REDIRECTS` exhaustion path, and the default/overridden `User-Agent`
57+
behaviour; `test_federal_register_provider.py` adds the step-3 raw-text
58+
size-cap `SSRFValidationError`→abstract fallback regression; `test_authority_gate.py`
59+
adds the no-colon `canonical_key` heading-fallback edge case; and
60+
`test_agentic_web_locator.py` adds `_tool_web_search` delegation/error-propagation
61+
coverage.

opencontractserver/constants/safe_http.py

Lines changed: 54 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -18,11 +18,58 @@
1818
}
1919
)
2020

21-
ALLOWED_SCHEMES: frozenset[str] = frozenset({"https"}) # gov sources are all TLS
21+
# HTTPS only. ``http://`` is intentionally excluded even for local/test
22+
# convenience: a downgraded hop is an SSRF/MITM foothold and every allowlisted
23+
# .gov source serves TLS. Do NOT add "http" here to make a test pass — mock the
24+
# transport instead (see test_safe_http.py::TestValidateUrlScheme).
25+
ALLOWED_SCHEMES: frozenset[str] = frozenset({"https"})
2226
MAX_REDIRECTS: int = 5
2327
CONNECT_TIMEOUT_SECONDS: float = 5.0
2428
READ_TIMEOUT_SECONDS: float = 60.0 # OLRC title ZIPs are large
2529

30+
# RFC 6598 Carrier-Grade NAT / shared address space. ``ipaddress`` does NOT
31+
# classify this block as private/reserved/global on any current CPython
32+
# (verified False for is_private AND is_reserved on 3.11 and 3.12) — it is simply
33+
# absent from CPython's ``ipaddress`` ``_private_networks`` list — so the
34+
# property-based denylist in ``_assert_public_ip`` would let a host resolving
35+
# here slip through. Rejected explicitly and version-independently; the behaviour
36+
# is pinned by ``test_safe_http.py::test_cgnat_shared_address_space_rejected``
37+
# (re-run it to re-verify the gap after a Python upgrade).
38+
CGNAT_SHARED_ADDRESS_SPACE_CIDR: str = "100.64.0.0/10"
39+
40+
# Identifies OpenContracts to public-domain .gov servers when a caller does not
41+
# supply its own User-Agent (the FR/CFR providers pass a more specific one that
42+
# overrides this). Sending a real UA — rather than the bare ``httpx`` default —
43+
# is polite and reduces the chance of being rate-limited or outright blocked.
44+
DEFAULT_USER_AGENT: str = (
45+
"OpenContracts/1.0 "
46+
"(+https://github.com/Open-Source-Legal/OpenContracts; "
47+
"contact: opensource@opencontracts.dev)"
48+
)
49+
50+
# Specific UA the deterministic authority-source providers (Federal Register,
51+
# eCFR) send, overriding DEFAULT_USER_AGENT. Single source of truth so the two
52+
# providers cannot drift the contact address / URL apart.
53+
AUTHORITY_PROVIDER_USER_AGENT: str = (
54+
"OpenContracts-authority-provider/1.0 "
55+
"(+https://github.com/Open-Source-Legal/OpenContracts; "
56+
"contact: opensource@opencontracts.dev)"
57+
)
58+
59+
# Per-service credential headers stripped when ``safe_fetch_bytes`` follows a
60+
# redirect to a DIFFERENT host (RFC 9110 §15.4). httpx — unlike browsers and
61+
# ``requests`` — forwards request headers verbatim across a cross-origin
62+
# redirect, so a caller-supplied Authorization/Cookie must not leak from one
63+
# allowlisted .gov service to another. Lowercase for case-insensitive matching.
64+
#
65+
# NOTE: this is the standard credential set, NOT an exhaustive safe-list.
66+
# Non-standard per-service headers (``X-Api-Key``, ``X-Auth-Token``, …) are NOT
67+
# stripped, so a caller that sends such a header for a specific service must not
68+
# rely on this set to protect it across a cross-host redirect.
69+
CROSS_HOST_STRIPPED_HEADERS: frozenset[str] = frozenset(
70+
{"authorization", "cookie", "proxy-authorization"}
71+
)
72+
2673
# Conservative DEFAULT body cap. Most authority fetches (FR JSON, eCFR/FR raw
2774
# text bodies) are well under this; a constrained worker should never buffer
2875
# hundreds of MB by default. Callers that genuinely need a larger body (only the
@@ -33,3 +80,9 @@
3380
# (Title 26, Tax) ships well under 100 MB, so 200 MB is generous headroom while
3481
# still bounding a runaway download far below the old 500 MB blanket default.
3582
OLRC_TITLE_ZIP_MAX_BYTES: int = 200 * 1024 * 1024 # 200 MB
83+
84+
# UTF-8 worst-case encodes one Unicode scalar in up to 4 bytes. Callers that cap
85+
# *characters* (e.g. the agentic locator's ``max_fetch_chars``) multiply by this
86+
# to derive the *byte* cap passed to ``safe_fetch_*`` — so streaming aborts at
87+
# the byte ceiling instead of buffering a huge body before truncating to chars.
88+
UTF8_MAX_BYTES_PER_CHAR: int = 4

opencontractserver/enrichment/services/authority_gate_service.py

Lines changed: 12 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -41,6 +41,11 @@ class AuthorityGateService:
4141
public-domain allowlist (else GATE_BLOCKED_DOMAIN).
4242
4. At least one section key or heading must match the canonical_key.
4343
5. If require_approval_for_agentic, park at pending_approval.
44+
45+
Unlike the other authority services this does NOT extend ``BaseService``:
46+
``evaluate`` is a pure, stateless classmethod with no user context and no
47+
ORM access, so the Tier-0 visibility/permission helpers BaseService provides
48+
have nothing to operate on here.
4449
"""
4550

4651
@classmethod
@@ -50,17 +55,15 @@ def evaluate(
5055
canonical_key: str,
5156
sections: list[AuthoritySection],
5257
provider_license: str,
53-
source_url: str | None = None,
5458
require_approval_for_agentic: bool = False,
5559
) -> GateDecision:
5660
"""Evaluate whether fetched sections may be ingested.
5761
5862
Args:
5963
canonical_key: The requested authority key (e.g. "usc-15:78j").
6064
sections: List of AuthoritySection objects returned by the provider.
65+
The domain gate checks ``sections[0].source_url``.
6166
provider_license: The provider's declared license ClassVar.
62-
source_url: Optional override for the URL to check against the
63-
allowlist. When None, uses sections[0].source_url if available.
6467
require_approval_for_agentic: When True, gate returns
6568
PENDING_APPROVAL for an otherwise-valid result.
6669
@@ -74,11 +77,11 @@ def evaluate(
7477
(fix the provider's license metadata vs. an allowlist/security
7578
review), so operators can filter on state alone without parsing
7679
``reason``.
77-
- A missing source URL (``source_url`` None/"" AND no
78-
``sections[0].source_url``) is treated as ``GATE_UNLOCATED`` when
79-
sections are present: a result we cannot attribute to an
80-
allowlisted domain must NOT bypass the domain gate on its license
81-
alone. See ``test_none_source_url_is_unlocated``.
80+
- A missing source URL (``sections[0].source_url`` None/"") is
81+
treated as ``GATE_UNLOCATED`` when sections are present: a result
82+
we cannot attribute to an allowlisted domain must NOT bypass the
83+
domain gate on its license alone. See
84+
``test_none_source_url_is_unlocated``.
8285
"""
8386
# 1) License gate -------------------------------------------------------
8487
if provider_license != "public-domain":
@@ -99,10 +102,7 @@ def evaluate(
99102
)
100103

101104
first = sections[0]
102-
effective_url = (
103-
source_url if source_url is not None else (first.source_url or "")
104-
)
105-
domain = urlparse(effective_url).hostname or None
105+
domain = urlparse(first.source_url or "").hostname or None
106106

107107
# 3) Source-domain allowlist --------------------------------------------
108108
# A missing/un-parseable source URL means we cannot attribute the result

0 commit comments

Comments
 (0)