Cap public `crawl_authorities` analyzer bounds to prevent unbounded/DoS crawls by JSv4 · Pull Request #2083 · Open-Source-Legal/OpenContracts

JSv4 · 2026-06-29T02:13:12Z

Motivation

The public crawl_authorities corpus analyzer accepted attacker-controlled numeric bounds from GraphQL which were propagated into the crawl loop with no server-side validation or safe upper bounds, allowing authenticated users to request arbitrarily large crawls.
token_budget=0 was treated as unbounded by the service, enabling a trivial bypass of budget enforcement.

Description

Introduce explicit public safety caps in opencontractserver/enrichment/constants.py (CRAWL_MAX_* constants) to define server-side maximums for public analyzer parameters.
Add a _resolve_crawl_bound helper in opencontractserver/tasks/corpus_analysis_tasks.py that validates and clamps user-supplied bounds to safe minimum/maximum ranges and rejects non-integer/bool inputs.
Update the crawl_authorities analyzer input_schema to advertise the enforced maximums and require token_budget >= 1, and call _resolve_crawl_bound to pass clamped values into CrawlAuthoritiesService.crawl.
Add unit tests in opencontractserver/tests/test_crawl_authorities.py asserting the schema maximums and that _resolve_crawl_bound clamps/normalizes extreme or disallowed inputs.

Testing

Ran static/parse checks: compiled opencontractserver/enrichment/constants.py with python -m py_compile and parsed modified Python modules with ast.parse, both succeeded.
Attempted to run the affected test case (python -m pytest opencontractserver/tests/test_crawl_authorities.py::CeleryTaskImportTest -q) but it could not execute in this environment due to missing runtime dependency (ModuleNotFoundError: No module named 'django').
Added regression tests that exercise schema values and the clamping helper, but full Django-backed test execution was not possible here due to the missing dependency.

…orities The decorator in shared/decorators.py attaches _oc_corpus_analyzer_input_schema on the wrapper at runtime, but its declared return type is Callable[..., Any], which mypy rightly rejects. Added a scoped # type: ignore[attr-defined] on the single access site; the test remains a direct assertion (not getattr) so intent is preserved. Applied black reformatting to the same file.

claude · 2026-06-29T12:16:15Z

Code Review — PR #2083: Cap public `crawl_authorities` analyzer bounds

This PR addresses a real DoS surface: user-controlled integers flowing from GraphQL into an unbounded Celery crawl loop. The approach is sound — constants file for caps, a validation helper, clamped call sites, and new tests. A few issues need attention before merge.

1. LLM agent tool path still uncapped (security gap)

opencontractserver/llms/tools/core_tools/corpus_references.py:267

The Celery task is now protected, but crawl_authorities / acrawl_authorities in the LLM tools module bypass _resolve_crawl_bound entirely — they pass whatever the LLM supplies straight into CrawlAuthoritiesService.crawl:

# core_tools/corpus_references.py:267-274 — no clamping here
return CrawlAuthoritiesService.crawl(
    ...
    token_budget=token_budget,   # LLM could supply 10_000_000
    max_authorities=max_authorities,  # or 9_999
)

The PR title says "prevent unbounded/DoS crawls" but this surface is still open. The same _resolve_crawl_bound calls (or, better, service-layer enforcement) should be applied here too.

2. Cap constants direction is inverted — defaults equal caps

opencontractserver/enrichment/constants.py:98-102

CRAWL_MAX_DEPTH = CRAWL_DEFAULT_MAX_DEPTH      # both = 2
CRAWL_MAX_TOKEN_BUDGET = CRAWL_DEFAULT_TOKEN_BUDGET   # both = 2_000_000

The comment says "keep those values at or below the defaults", but the correct invariant for a safety cap is MAX ≥ DEFAULT. As written, if a future performance refactor lowers CRAWL_DEFAULT_MAX_DEPTH from 2 to 1, CRAWL_MAX_DEPTH silently becomes 1 as well. Any caller that was explicitly passing max_depth=2 (the old default, a perfectly valid value) would now be clamped down to 1 with no error or log line — a silent regression.

The correct approach is to define the caps as independent constants with values ≥ their defaults, and add a module-level assertion:

CRAWL_MAX_DEPTH = 2  # hard ceiling operators can raise
assert CRAWL_MAX_DEPTH >= CRAWL_DEFAULT_MAX_DEPTH

3. Schema minimums are magic numbers repeated in two places

opencontractserver/tasks/corpus_analysis_tasks.py:104,110,116,122,128 (schema) vs 183,190,197,204,211 (_resolve_crawl_bound calls)

The minimums 0, 0, 1, 1, 1 appear once in the JSON schema dict and again as positional arguments to each _resolve_crawl_bound call. CLAUDE.md §4 ("No magic numbers — use constants files") applies here. If token_budget's floor is ever raised from 1 to, say, 100, only one of the two sites will be updated and they'll silently diverge. These floor values should live in constants.py as CRAWL_MIN_* constants and be referenced from both places.

4. `token_budget ≤ 0` service sentinel is now permanently dead code on the task path

opencontractserver/enrichment/services/crawl_authorities_service.py:163

The service explicitly uses token_budget <= 0 as "unbounded":

# token_budget <= 0 means "unbounded" (the check is skipped entirely).
if token_budget > 0 and tokens_spent >= token_budget:

The old task code passed 0 through; _resolve_crawl_bound with minimum=1 now clamps it to 1. A budget of 1 is not the same as unbounded — it will halt after the very first authority (real authorities cost thousands of tokens). The "run without a budget guard" path is now permanently inaccessible through the public task. This is likely intentional, but the comment on the service-layer guard should be updated to reflect that 0 is unreachable from the task path, or the dead <= 0 branch in the service should be removed.

5. `bool` rejection is untested

opencontractserver/tests/test_crawl_authorities.py

The isinstance(value, bool) guard in _resolve_crawl_bound exists precisely because bool is a subclass of int in Python (True + 1 == 2). Without it, True would silently become 1. No test currently exercises this:

# Missing test:
with self.assertRaises(ValueError):
    _resolve_crawl_bound(name="max_depth", value=True, default=2, minimum=0, maximum=2)

This is the subtlest part of the validation logic and the most likely to be accidentally removed in a refactor.

6. Five structurally identical call sites — DRY opportunity

opencontractserver/tasks/corpus_analysis_tasks.py:179-213

Each of the five _resolve_crawl_bound calls restates name, default, minimum, and maximum that are already declared in the schema dict 50 lines above. Adding a sixth bound parameter requires touching both the schema and a matching call site. A small loop over the schema properties would keep them in sync automatically:

BOUND_PARAMS = ("max_depth", "min_demand", "max_authorities", "per_jurisdiction_cap", "token_budget")
resolved = {
    k: _resolve_crawl_bound(
        name=k,
        value=locals().get(k),
        default=schema["properties"][k]["default"],
        minimum=schema["properties"][k]["minimum"],
        maximum=schema["properties"][k]["maximum"],
    )
    for k in BOUND_PARAMS
}
return CrawlAuthoritiesService.crawl(**resolved, make_public=make_public, log=task_log, ...)

Summary: The core fix (validate + clamp before passing to the service) is the right approach and the _resolve_crawl_bound helper is well-structured. The most important issue to address before merge is finding #1 (the LLM tool path bypasses all enforcement). Findings #2 and #4 are also worth resolving in this PR since they concern the same security invariant.

codecov · 2026-06-29T12:36:47Z

Codecov Report

❌ Patch coverage is 81.81818% with 2 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
opencontractserver/tasks/corpus_analysis_tasks.py	66.66%	2 Missing ⚠️

📢 Thoughts on this report? Let us know!

The existing test exercised only the clamp path; add assertions for the value=None -> default fallback and the ValueError raised for non-integers (including bool), covering the two lines codecov flagged in corpus_analysis_tasks.py.

…nd test

claude · 2026-06-29T13:31:26Z

Code Review

This PR introduces server-side caps on the crawl_authorities public analyzer to prevent DoS via attacker-controlled GraphQL parameters. The approach — a _resolve_crawl_bound helper plus CRAWL_MAX_* constants — is sound in structure. However, there are two confirmed behavioral regressions and a few design gaps worth addressing before merge.

1. max_depth ceiling silently dropped from 5 to 2 (regression)

File: opencontractserver/enrichment/constants.py:98

CRAWL_DEFAULT_MAX_DEPTH = 2 (line 88) and CRAWL_MAX_DEPTH = CRAWL_DEFAULT_MAX_DEPTH (line 98), so CRAWL_MAX_DEPTH = 2. The old schema had maximum: 5 hardcoded. Any caller who previously passed max_depth=3, 4, or 5 will now be silently clamped to 2 with no error or warning. A three-hop legal citation traversal designed with max_depth=3 will silently produce a shallower crawl. If the intended ceiling is 5, set CRAWL_MAX_DEPTH = 5 explicitly, independent of the default.

2. token_budget=0 semantic broken via the task path

Files: opencontractserver/enrichment/services/crawl_authorities_service.py:163, opencontractserver/tasks/corpus_analysis_tasks.py:209

The service documents and implements token_budget <= 0 as unbounded — if token_budget > 0 and tokens_spent >= token_budget: at line 164. The PR clamps token_budget=0 to 1 via minimum=1 in _resolve_crawl_bound. With token_budget=1, the budget guard fires on the first authority whose estimated cost is >= 1 token (~4 chars of text), effectively aborting the crawl immediately. This breaks any caller relying on 0 to mean "no limit." Either reject 0 with a ValueError for clear user feedback, or clamp to CRAWL_DEFAULT_TOKEN_BUDGET instead of minimum=1. The service docstring at line 104 also needs updating.

3. ValueError surfaces as async task failure, not a synchronous GraphQL error

Files: opencontractserver/shared/decorators.py:455, opencontractserver/tasks/corpus_analysis_tasks.py:82

The corpus_analyzer_task decorator catches ValueError, stamps Analysis.status = FAILED, stores str(e) in analysis.error_message, and re-raises. A caller who passes token_budget: true gets ok=True from the mutation, then discovers the failure asynchronously. Adding "not": {"type": "boolean"} to integer fields in the input_schema would reject booleans before dispatch if schema validation is enforced at that layer.

4. CRAWL_MAX_* constants are frozen at their defaults — no operator override path

File: opencontractserver/enrichment/constants.py:98-102

All five CRAWL_MAX_* values equal their CRAWL_DEFAULT_* counterparts. A trusted deployment that needs max_depth=4 can only do so by editing Python and redeploying, globally changing the default for every corpus. Other resource limits in the same file (LLM_MAX_CONCURRENCY, DOC_MAX_CONCURRENCY) use accessor functions backed by Django settings, giving operators an override path without code changes.

5. _resolve_crawl_bound belongs in a utility module, not a task file

File: opencontractserver/tasks/corpus_analysis_tasks.py:82

This is a pure stateless helper with no Celery or task-specific dependency. Per CLAUDE.md: "Before writing new utilities: Check existing utility files first (opencontractserver/utils/)." A functionally equivalent helper clamp_limit(limit, default, maximum) already exists at opencontractserver/llms/tools/core_tools/_helpers.py:82. Callers outside the task file must either import from a task module (inverted dependency) or duplicate the guard.

6. Type hint value: int | None misleads — bool inputs are rejected at runtime

File: opencontractserver/tasks/corpus_analysis_tasks.py:86

bool is a subclass of int, so mypy/pyright accepts value=True without complaint. The guard correctly rejects booleans at runtime, but the annotation should be value: int | bool | None paired with the rejection so the contract is visible to static analysis.

Severity	File	Finding
High	`enrichment/constants.py:98`	max_depth max silently dropped 5 to 2; values 3-5 silently clamped
High	`crawl_authorities_service.py:163`	token_budget=0 (unbounded) clamped to 1, effectively aborting crawl
Medium	`decorators.py:455`	ValueError surfaces as async task failure, not sync error
Medium	`enrichment/constants.py:98-102`	No operator override path for caps (unlike LLM_MAX_CONCURRENCY pattern)
Low	`corpus_analysis_tasks.py:82`	Pure utility helper in task file; duplicate of _helpers.py:clamp_limit
Low	`corpus_analysis_tasks.py:86`	value: int

claude · 2026-06-29T13:31:42Z

Code Review

This PR adds server-side safety caps to crawl_authorities to prevent authenticated users from triggering arbitrarily large crawls via attacker-controlled GraphQL parameters. The _resolve_crawl_bound helper is clean, the ValueError for non-integer/bool inputs is correctly handled by the corpus_analyzer_task decorator (marks Analysis FAILED, not a silent swallow), and the new tests exercise both the schema introspection and clamping paths. Four findings below.

1. LLM tool path bypasses every safety cap (Security)

opencontractserver/llms/tools/core_tools/corpus_references.py ~line 267

CrawlAuthoritiesService.crawl() is called directly from the LLM tool function with unclamped parameters — max_depth, max_authorities, per_jurisdiction_cap, and token_budget all pass through verbatim. The service itself (crawl_authorities_service.py lines 160–165) has no internal upper-bound guards. An LLM agent invoking acrawl_authorities with max_depth=1000, max_authorities=100000 completely bypasses the new caps added to the Celery task path.

Fix: apply the same _resolve_crawl_bound clamping inside the LLM tool (or add guards to the service itself, which would cover both paths).

2. `min_demand` cap protects the wrong end of the range (Security/Correctness)

opencontractserver/tasks/corpus_analysis_tasks.py lines 186–192 / opencontractserver/enrichment/constants.py line 99

min_demand is a floor filter: qs = qs.filter(mention_count__gte=min_demand) (authority_frontier_service.py lines 185–186). A lower value means more rows are processed (the dangerous direction); a higher value means fewer rows (more restrictive). The cap sets maximum=CRAWL_MAX_MIN_DEMAND=2, which blocks values above 2. But values above the default are already more restrictive — no DoS risk there. The actual DoS surface is min_demand=0 (no filter, all frontier rows processed), which the cap leaves completely open since minimum=0 in both the schema and the _resolve_crawl_bound call.

Fix: enforce a minimum floor, e.g. minimum=C.CRAWL_MIN_MIN_DEMAND (= 1 or = CRAWL_DEFAULT_MIN_DEMAND), not just a ceiling.

3. `token_budget=0` clamped to `1` causes near-immediate crawl halt (Correctness)

opencontractserver/tasks/corpus_analysis_tasks.py lines 207–213 / opencontractserver/enrichment/services/crawl_authorities_service.py lines 163–165

The service documents and implements token_budget <= 0 as "unbounded" (the budget check is skipped). The PR sets minimum=1, so _resolve_crawl_bound maps 0 → 1. When token_budget=1 reaches the service, the check if token_budget > 0 and tokens_spent >= token_budget fires after the very first authority ingest (any document with content will spend more than 1 estimated token), stopping the crawl almost immediately. This silently converts a "run unbounded" intent into a "stop after first authority" result rather than returning a validation error or remapping to the default budget.

If the schema minimum of 1 is the intended contract (no unbounded crawls through the task), the _resolve_crawl_bound call should raise (not clamp) on token_budget=0, or remap None-sentinel handling so 0 explicitly falls back to default. Currently the behavior is a silent near-halt that's likely to be very confusing to operators.

4. Missing `CRAWL_MIN_*` constants (CLAUDE.md conventions)

opencontractserver/tasks/corpus_analysis_tasks.py lines 183, 190, 197, 204, 211 and JSON Schema lines 104, 110, 116, 122, 128

CLAUDE.md rule: "No magic numbers — we have constants files in opencontractserver/constants/ (backend) ... Use them for any hardcoded values." The minimum=0 and minimum=1 literals appear six times each across the _resolve_crawl_bound calls and the JSON Schema dict with no corresponding CRAWL_MIN_* constants in opencontractserver/enrichment/constants.py. The existing constants file already notes this expectation at its header. If the minimum for max_authorities ever changes to 2, the change would need to be made in four places manually.

Fix: add CRAWL_MIN_MAX_DEPTH = 0, CRAWL_MIN_MIN_DEMAND = 0 (or 1, per finding 2), CRAWL_MIN_AUTHORITIES = 1, CRAWL_MIN_PER_JURISDICTION_CAP = 1, CRAWL_MIN_TOKEN_BUDGET = 1 and reference them.

Cap public authority crawl analyzer bounds

1116317

JSv4 added codex aardvark labels Jun 29, 2026 — with ChatGPT Codex Connector

JSv4 added codex aardvark labels Jun 29, 2026

JSv4 added 2 commits June 29, 2026 08:24

Scope type-ignore for intentional bad-type args in _resolve_crawl_bou…

ae6ece8

…nd test

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Cap public `crawl_authorities` analyzer bounds to prevent unbounded/DoS crawls#2083

Cap public `crawl_authorities` analyzer bounds to prevent unbounded/DoS crawls#2083
JSv4 wants to merge 4 commits into
mainfrom
codex/fix-unbounded-authority-crawl-task

JSv4 commented Jun 29, 2026

Uh oh!

claude Bot commented Jun 29, 2026

Uh oh!

codecov Bot commented Jun 29, 2026

Uh oh!

claude Bot commented Jun 29, 2026

Uh oh!

claude Bot commented Jun 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Uh oh!

Conversation

JSv4 commented Jun 29, 2026

Motivation

Description

Testing

Uh oh!

claude Bot commented Jun 29, 2026

Code Review — PR #2083: Cap public crawl_authorities analyzer bounds

1. LLM agent tool path still uncapped (security gap)

2. Cap constants direction is inverted — defaults equal caps

3. Schema minimums are magic numbers repeated in two places

4. token_budget ≤ 0 service sentinel is now permanently dead code on the task path

5. bool rejection is untested

6. Five structurally identical call sites — DRY opportunity

Uh oh!

codecov Bot commented Jun 29, 2026

Codecov Report

Uh oh!

claude Bot commented Jun 29, 2026

Code Review

Uh oh!

claude Bot commented Jun 29, 2026

Code Review

1. LLM tool path bypasses every safety cap (Security)

2. min_demand cap protects the wrong end of the range (Security/Correctness)

3. token_budget=0 clamped to 1 causes near-immediate crawl halt (Correctness)

4. Missing CRAWL_MIN_* constants (CLAUDE.md conventions)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Code Review — PR #2083: Cap public `crawl_authorities` analyzer bounds

4. `token_budget ≤ 0` service sentinel is now permanently dead code on the task path

5. `bool` rejection is untested

2. `min_demand` cap protects the wrong end of the range (Security/Correctness)

3. `token_budget=0` clamped to `1` causes near-immediate crawl halt (Correctness)

4. Missing `CRAWL_MIN_*` constants (CLAUDE.md conventions)