Skip to content

Cap public crawl_authorities analyzer bounds to prevent unbounded/DoS crawls#2083

Open
JSv4 wants to merge 4 commits into
mainfrom
codex/fix-unbounded-authority-crawl-task
Open

Cap public crawl_authorities analyzer bounds to prevent unbounded/DoS crawls#2083
JSv4 wants to merge 4 commits into
mainfrom
codex/fix-unbounded-authority-crawl-task

Conversation

@JSv4

@JSv4 JSv4 commented Jun 29, 2026

Copy link
Copy Markdown
Collaborator

Motivation

  • The public crawl_authorities corpus analyzer accepted attacker-controlled numeric bounds from GraphQL which were propagated into the crawl loop with no server-side validation or safe upper bounds, allowing authenticated users to request arbitrarily large crawls.
  • token_budget=0 was treated as unbounded by the service, enabling a trivial bypass of budget enforcement.

Description

  • Introduce explicit public safety caps in opencontractserver/enrichment/constants.py (CRAWL_MAX_* constants) to define server-side maximums for public analyzer parameters.
  • Add a _resolve_crawl_bound helper in opencontractserver/tasks/corpus_analysis_tasks.py that validates and clamps user-supplied bounds to safe minimum/maximum ranges and rejects non-integer/bool inputs.
  • Update the crawl_authorities analyzer input_schema to advertise the enforced maximums and require token_budget >= 1, and call _resolve_crawl_bound to pass clamped values into CrawlAuthoritiesService.crawl.
  • Add unit tests in opencontractserver/tests/test_crawl_authorities.py asserting the schema maximums and that _resolve_crawl_bound clamps/normalizes extreme or disallowed inputs.

Testing

  • Ran static/parse checks: compiled opencontractserver/enrichment/constants.py with python -m py_compile and parsed modified Python modules with ast.parse, both succeeded.
  • Attempted to run the affected test case (python -m pytest opencontractserver/tests/test_crawl_authorities.py::CeleryTaskImportTest -q) but it could not execute in this environment due to missing runtime dependency (ModuleNotFoundError: No module named 'django').
  • Added regression tests that exercise schema values and the clamping helper, but full Django-backed test execution was not possible here due to the missing dependency.

Codex Task

…orities

The decorator in shared/decorators.py attaches _oc_corpus_analyzer_input_schema
on the wrapper at runtime, but its declared return type is Callable[..., Any],
which mypy rightly rejects. Added a scoped # type: ignore[attr-defined] on the
single access site; the test remains a direct assertion (not getattr) so intent
is preserved. Applied black reformatting to the same file.
@claude

claude Bot commented Jun 29, 2026

Copy link
Copy Markdown

Code Review — PR #2083: Cap public crawl_authorities analyzer bounds

This PR addresses a real DoS surface: user-controlled integers flowing from GraphQL into an unbounded Celery crawl loop. The approach is sound — constants file for caps, a validation helper, clamped call sites, and new tests. A few issues need attention before merge.


1. LLM agent tool path still uncapped (security gap)

opencontractserver/llms/tools/core_tools/corpus_references.py:267

The Celery task is now protected, but crawl_authorities / acrawl_authorities in the LLM tools module bypass _resolve_crawl_bound entirely — they pass whatever the LLM supplies straight into CrawlAuthoritiesService.crawl:

# core_tools/corpus_references.py:267-274 — no clamping here
return CrawlAuthoritiesService.crawl(
    ...
    token_budget=token_budget,   # LLM could supply 10_000_000
    max_authorities=max_authorities,  # or 9_999
)

The PR title says "prevent unbounded/DoS crawls" but this surface is still open. The same _resolve_crawl_bound calls (or, better, service-layer enforcement) should be applied here too.


2. Cap constants direction is inverted — defaults equal caps

opencontractserver/enrichment/constants.py:98-102

CRAWL_MAX_DEPTH = CRAWL_DEFAULT_MAX_DEPTH      # both = 2
CRAWL_MAX_TOKEN_BUDGET = CRAWL_DEFAULT_TOKEN_BUDGET   # both = 2_000_000

The comment says "keep those values at or below the defaults", but the correct invariant for a safety cap is MAX ≥ DEFAULT. As written, if a future performance refactor lowers CRAWL_DEFAULT_MAX_DEPTH from 2 to 1, CRAWL_MAX_DEPTH silently becomes 1 as well. Any caller that was explicitly passing max_depth=2 (the old default, a perfectly valid value) would now be clamped down to 1 with no error or log line — a silent regression.

The correct approach is to define the caps as independent constants with values ≥ their defaults, and add a module-level assertion:

CRAWL_MAX_DEPTH = 2  # hard ceiling operators can raise
assert CRAWL_MAX_DEPTH >= CRAWL_DEFAULT_MAX_DEPTH

3. Schema minimums are magic numbers repeated in two places

opencontractserver/tasks/corpus_analysis_tasks.py:104,110,116,122,128 (schema) vs 183,190,197,204,211 (_resolve_crawl_bound calls)

The minimums 0, 0, 1, 1, 1 appear once in the JSON schema dict and again as positional arguments to each _resolve_crawl_bound call. CLAUDE.md §4 ("No magic numbers — use constants files") applies here. If token_budget's floor is ever raised from 1 to, say, 100, only one of the two sites will be updated and they'll silently diverge. These floor values should live in constants.py as CRAWL_MIN_* constants and be referenced from both places.


4. token_budget ≤ 0 service sentinel is now permanently dead code on the task path

opencontractserver/enrichment/services/crawl_authorities_service.py:163

The service explicitly uses token_budget <= 0 as "unbounded":

# token_budget <= 0 means "unbounded" (the check is skipped entirely).
if token_budget > 0 and tokens_spent >= token_budget:

The old task code passed 0 through; _resolve_crawl_bound with minimum=1 now clamps it to 1. A budget of 1 is not the same as unbounded — it will halt after the very first authority (real authorities cost thousands of tokens). The "run without a budget guard" path is now permanently inaccessible through the public task. This is likely intentional, but the comment on the service-layer guard should be updated to reflect that 0 is unreachable from the task path, or the dead <= 0 branch in the service should be removed.


5. bool rejection is untested

opencontractserver/tests/test_crawl_authorities.py

The isinstance(value, bool) guard in _resolve_crawl_bound exists precisely because bool is a subclass of int in Python (True + 1 == 2). Without it, True would silently become 1. No test currently exercises this:

# Missing test:
with self.assertRaises(ValueError):
    _resolve_crawl_bound(name="max_depth", value=True, default=2, minimum=0, maximum=2)

This is the subtlest part of the validation logic and the most likely to be accidentally removed in a refactor.


6. Five structurally identical call sites — DRY opportunity

opencontractserver/tasks/corpus_analysis_tasks.py:179-213

Each of the five _resolve_crawl_bound calls restates name, default, minimum, and maximum that are already declared in the schema dict 50 lines above. Adding a sixth bound parameter requires touching both the schema and a matching call site. A small loop over the schema properties would keep them in sync automatically:

BOUND_PARAMS = ("max_depth", "min_demand", "max_authorities", "per_jurisdiction_cap", "token_budget")
resolved = {
    k: _resolve_crawl_bound(
        name=k,
        value=locals().get(k),
        default=schema["properties"][k]["default"],
        minimum=schema["properties"][k]["minimum"],
        maximum=schema["properties"][k]["maximum"],
    )
    for k in BOUND_PARAMS
}
return CrawlAuthoritiesService.crawl(**resolved, make_public=make_public, log=task_log, ...)

Summary: The core fix (validate + clamp before passing to the service) is the right approach and the _resolve_crawl_bound helper is well-structured. The most important issue to address before merge is finding #1 (the LLM tool path bypasses all enforcement). Findings #2 and #4 are also worth resolving in this PR since they concern the same security invariant.

@codecov

codecov Bot commented Jun 29, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 81.81818% with 2 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
opencontractserver/tasks/corpus_analysis_tasks.py 66.66% 2 Missing ⚠️

📢 Thoughts on this report? Let us know!

JSv4 added 2 commits June 29, 2026 08:24
The existing test exercised only the clamp path; add assertions for the
value=None -> default fallback and the ValueError raised for non-integers
(including bool), covering the two lines codecov flagged in
corpus_analysis_tasks.py.
@claude

claude Bot commented Jun 29, 2026

Copy link
Copy Markdown

Code Review

This PR introduces server-side caps on the crawl_authorities public analyzer to prevent DoS via attacker-controlled GraphQL parameters. The approach — a _resolve_crawl_bound helper plus CRAWL_MAX_* constants — is sound in structure. However, there are two confirmed behavioral regressions and a few design gaps worth addressing before merge.


1. max_depth ceiling silently dropped from 5 to 2 (regression)

File: opencontractserver/enrichment/constants.py:98

CRAWL_DEFAULT_MAX_DEPTH = 2 (line 88) and CRAWL_MAX_DEPTH = CRAWL_DEFAULT_MAX_DEPTH (line 98), so CRAWL_MAX_DEPTH = 2. The old schema had maximum: 5 hardcoded. Any caller who previously passed max_depth=3, 4, or 5 will now be silently clamped to 2 with no error or warning. A three-hop legal citation traversal designed with max_depth=3 will silently produce a shallower crawl. If the intended ceiling is 5, set CRAWL_MAX_DEPTH = 5 explicitly, independent of the default.


2. token_budget=0 semantic broken via the task path

Files: opencontractserver/enrichment/services/crawl_authorities_service.py:163, opencontractserver/tasks/corpus_analysis_tasks.py:209

The service documents and implements token_budget <= 0 as unbounded — if token_budget > 0 and tokens_spent >= token_budget: at line 164. The PR clamps token_budget=0 to 1 via minimum=1 in _resolve_crawl_bound. With token_budget=1, the budget guard fires on the first authority whose estimated cost is >= 1 token (~4 chars of text), effectively aborting the crawl immediately. This breaks any caller relying on 0 to mean "no limit." Either reject 0 with a ValueError for clear user feedback, or clamp to CRAWL_DEFAULT_TOKEN_BUDGET instead of minimum=1. The service docstring at line 104 also needs updating.


3. ValueError surfaces as async task failure, not a synchronous GraphQL error

Files: opencontractserver/shared/decorators.py:455, opencontractserver/tasks/corpus_analysis_tasks.py:82

The corpus_analyzer_task decorator catches ValueError, stamps Analysis.status = FAILED, stores str(e) in analysis.error_message, and re-raises. A caller who passes token_budget: true gets ok=True from the mutation, then discovers the failure asynchronously. Adding "not": {"type": "boolean"} to integer fields in the input_schema would reject booleans before dispatch if schema validation is enforced at that layer.


4. CRAWL_MAX_* constants are frozen at their defaults — no operator override path

File: opencontractserver/enrichment/constants.py:98-102

All five CRAWL_MAX_* values equal their CRAWL_DEFAULT_* counterparts. A trusted deployment that needs max_depth=4 can only do so by editing Python and redeploying, globally changing the default for every corpus. Other resource limits in the same file (LLM_MAX_CONCURRENCY, DOC_MAX_CONCURRENCY) use accessor functions backed by Django settings, giving operators an override path without code changes.


5. _resolve_crawl_bound belongs in a utility module, not a task file

File: opencontractserver/tasks/corpus_analysis_tasks.py:82

This is a pure stateless helper with no Celery or task-specific dependency. Per CLAUDE.md: "Before writing new utilities: Check existing utility files first (opencontractserver/utils/)." A functionally equivalent helper clamp_limit(limit, default, maximum) already exists at opencontractserver/llms/tools/core_tools/_helpers.py:82. Callers outside the task file must either import from a task module (inverted dependency) or duplicate the guard.


6. Type hint value: int | None misleads — bool inputs are rejected at runtime

File: opencontractserver/tasks/corpus_analysis_tasks.py:86

bool is a subclass of int, so mypy/pyright accepts value=True without complaint. The guard correctly rejects booleans at runtime, but the annotation should be value: int | bool | None paired with the rejection so the contract is visible to static analysis.


Severity File Finding
High enrichment/constants.py:98 max_depth max silently dropped 5 to 2; values 3-5 silently clamped
High crawl_authorities_service.py:163 token_budget=0 (unbounded) clamped to 1, effectively aborting crawl
Medium decorators.py:455 ValueError surfaces as async task failure, not sync error
Medium enrichment/constants.py:98-102 No operator override path for caps (unlike LLM_MAX_CONCURRENCY pattern)
Low corpus_analysis_tasks.py:82 Pure utility helper in task file; duplicate of _helpers.py:clamp_limit
Low corpus_analysis_tasks.py:86 value: int

@claude

claude Bot commented Jun 29, 2026

Copy link
Copy Markdown

Code Review

This PR adds server-side safety caps to crawl_authorities to prevent authenticated users from triggering arbitrarily large crawls via attacker-controlled GraphQL parameters. The _resolve_crawl_bound helper is clean, the ValueError for non-integer/bool inputs is correctly handled by the corpus_analyzer_task decorator (marks Analysis FAILED, not a silent swallow), and the new tests exercise both the schema introspection and clamping paths. Four findings below.


1. LLM tool path bypasses every safety cap (Security)

opencontractserver/llms/tools/core_tools/corpus_references.py ~line 267

CrawlAuthoritiesService.crawl() is called directly from the LLM tool function with unclamped parameters — max_depth, max_authorities, per_jurisdiction_cap, and token_budget all pass through verbatim. The service itself (crawl_authorities_service.py lines 160–165) has no internal upper-bound guards. An LLM agent invoking acrawl_authorities with max_depth=1000, max_authorities=100000 completely bypasses the new caps added to the Celery task path.

Fix: apply the same _resolve_crawl_bound clamping inside the LLM tool (or add guards to the service itself, which would cover both paths).


2. min_demand cap protects the wrong end of the range (Security/Correctness)

opencontractserver/tasks/corpus_analysis_tasks.py lines 186–192 / opencontractserver/enrichment/constants.py line 99

min_demand is a floor filter: qs = qs.filter(mention_count__gte=min_demand) (authority_frontier_service.py lines 185–186). A lower value means more rows are processed (the dangerous direction); a higher value means fewer rows (more restrictive). The cap sets maximum=CRAWL_MAX_MIN_DEMAND=2, which blocks values above 2. But values above the default are already more restrictive — no DoS risk there. The actual DoS surface is min_demand=0 (no filter, all frontier rows processed), which the cap leaves completely open since minimum=0 in both the schema and the _resolve_crawl_bound call.

Fix: enforce a minimum floor, e.g. minimum=C.CRAWL_MIN_MIN_DEMAND (= 1 or = CRAWL_DEFAULT_MIN_DEMAND), not just a ceiling.


3. token_budget=0 clamped to 1 causes near-immediate crawl halt (Correctness)

opencontractserver/tasks/corpus_analysis_tasks.py lines 207–213 / opencontractserver/enrichment/services/crawl_authorities_service.py lines 163–165

The service documents and implements token_budget <= 0 as "unbounded" (the budget check is skipped). The PR sets minimum=1, so _resolve_crawl_bound maps 0 → 1. When token_budget=1 reaches the service, the check if token_budget > 0 and tokens_spent >= token_budget fires after the very first authority ingest (any document with content will spend more than 1 estimated token), stopping the crawl almost immediately. This silently converts a "run unbounded" intent into a "stop after first authority" result rather than returning a validation error or remapping to the default budget.

If the schema minimum of 1 is the intended contract (no unbounded crawls through the task), the _resolve_crawl_bound call should raise (not clamp) on token_budget=0, or remap None-sentinel handling so 0 explicitly falls back to default. Currently the behavior is a silent near-halt that's likely to be very confusing to operators.


4. Missing CRAWL_MIN_* constants (CLAUDE.md conventions)

opencontractserver/tasks/corpus_analysis_tasks.py lines 183, 190, 197, 204, 211 and JSON Schema lines 104, 110, 116, 122, 128

CLAUDE.md rule: "No magic numbers — we have constants files in opencontractserver/constants/ (backend) ... Use them for any hardcoded values." The minimum=0 and minimum=1 literals appear six times each across the _resolve_crawl_bound calls and the JSON Schema dict with no corresponding CRAWL_MIN_* constants in opencontractserver/enrichment/constants.py. The existing constants file already notes this expectation at its header. If the minimum for max_authorities ever changes to 2, the change would need to be made in four places manually.

Fix: add CRAWL_MIN_MAX_DEPTH = 0, CRAWL_MIN_MIN_DEMAND = 0 (or 1, per finding 2), CRAWL_MIN_AUTHORITIES = 1, CRAWL_MIN_PER_JURISDICTION_CAP = 1, CRAWL_MIN_TOKEN_BUDGET = 1 and reference them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant