Fix Discover query vector cache bounds by JSv4 · Pull Request #2091 · Open-Source-Legal/OpenContracts

JSv4 · 2026-06-29T02:21:40Z

Motivation

The process-global LRU used the raw textSearch string as a cache key without any length bound, allowing attacker-controlled large strings to be retained in worker memory and enabling a remote memory-DoS when a default embedder is configured.
The cache also stored failed (None) embeddings, which could pin attacker inputs in the LRU until eviction.

Description

Add a new constant DISCOVER_TEXT_SEARCH_MAX_LENGTH = 512 to bound accepted Discover textSearch input. (opencontractserver/constants/search.py)
Introduce _normalise_text_search() to strip and reject empty or overlong searches, and route all five Discover resolvers to call it before performing text or semantic work. (config/graphql/discover_queries.py)
Change the query-vector cache wrapper to raise _UncacheableQueryVector when embedding fails instead of returning/caching None, and update _semantic_ids() to catch that exception and fall back to text-only results. (config/graphql/discover_queries.py)
Add regression tests asserting that an overlong query returns an empty result before any embedding call and that failed embeddings do not populate the LRU cache. (opencontractserver/tests/test_discover_search_graphql.py)

Testing

python -m py_compile config/graphql/discover_queries.py opencontractserver/constants/search.py opencontractserver/tests/test_discover_search_graphql.py succeeded (syntax check passed).
git diff --check produced no whitespace/patch errors.
pytest opencontractserver/tests/test_discover_search_graphql.py could not be executed in this environment due to a missing django dependency (ModuleNotFoundError: No module named 'django'), so full test-suite execution was not performed here.

Codex Task

claude · 2026-06-29T12:12:17Z

Review: Fix Discover query vector cache bounds

This PR correctly addresses two real security issues: unbounded user-controlled strings being retained as LRU cache keys, and failed embedding calls being cached as None (pinning the query string in worker memory until LRU eviction). The approach — input normalisation at the resolver layer plus converting the cache wrapper to raise-on-failure — is sound. A few issues worth addressing before merge:

1. Stale return-type annotation on `_cached_query_vector` — maintenance hazard

config/graphql/discover_queries.py, line 149

The function signature still reads -> Optional[list], but after this PR the function never returns None — it either returns a list or raises _UncacheableQueryVector. Any future caller that copies the old call pattern:

vector = _cached_query_vector(query_text, embedder_path)
if not vector:   # dead branch — None can never reach here
    return []

would silently swallow the uncaught exception as a 500 error to GraphQL instead of the intended text-only graceful fallback. Update the annotation to -> list so mypy and future readers see the real contract.

2. Missing changelog fragment — CLAUDE.md requirement

changelog.d/

CLAUDE.md mandates a changelog.d/<slug>.security.md (or .fixed.md) fragment for any security fix. No 2091* file appears in changelog.d/. The closest existing fragments (1919-discover-followup.changed.md, 1919-discover-followup.fixed.md) cover the prior caching refactor, not this DoS guard.

Add something like changelog.d/2091-discover-cache-dos.security.md with a bullet describing the vulnerability and fix.

3. Protection lives only at call sites, not at the resource boundary — altitude concern

config/graphql/discover_queries.py, line 119 (_normalise_text_search) and line 149 (_cached_query_vector)

The 512-char cap is enforced inside each of the five resolver methods. _cached_query_vector itself accepts an unbounded query_text. If a sixth resolver is added later (or any internal caller reuses _semantic_ids directly), the protection is silently absent — the long string becomes an LRU cache key with no warning.

The minimum fix is a # PRECONDITION: query_text must have passed _normalise_text_search docstring on both _semantic_ids and _cached_query_vector. A stronger fix is a defensive assert len(query_text) <= DISCOVER_TEXT_SEARCH_MAX_LENGTH inside _cached_query_vector that makes a violation loud rather than silent.

4. `test_overlong_query_returns_empty_before_searching` verifies the semantic arm only, not the text arm

opencontractserver/tests/test_discover_search_graphql.py, line 307

The test patches _query_vector and asserts it was never called, which proves the semantic arm is bypassed. It does not assert that the text arm (_text_ids) was also skipped. If a future refactor moves the length check to after the _text_ids call (guarding only the LRU), the test would still pass while the unbounded DB text-predicate DoS vector is silently reintroduced. Adding patch("config.graphql.discover_queries._text_ids") alongside the existing mock and asserting that too was not called would close this gap.

Minor: `test_overlong_query_returns_empty_before_searching` assertion is outside the `with` block

opencontractserver/tests/test_discover_search_graphql.py, line 310

        with patch("config.graphql.discover_queries._query_vector") as query_vector:
            result = self.graphene_client.execute(...)
        self.assertIsNone(result.get("errors"), ...)
        self.assertEqual(...)
        query_vector.assert_not_called()   # <-- outside the with block

query_vector.assert_not_called() runs after the patch has been deactivated. This works correctly in practice (mock objects retain their call history), but if the resolver ever spawns an async or deferred post-execute callback that calls _query_vector, those calls would not be captured by the inner mock after the with exits. Move lines 308–310 inside the with block to make the assertion scope explicit.

Overall the security intent is correct and the test coverage for the new behaviour is a good start. Addressing the stale type annotation and the missing changelog fragment are the two must-haves before merge; the other points are improvements worth discussing.

codecov · 2026-06-29T12:31:38Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

Fix Discover query vector cache bounds

697c30b

JSv4 added aardvark codex labels Jun 29, 2026 — with ChatGPT Codex Connector

JSv4 added the aardvark label Jun 29, 2026

Apply isort import ordering to test_discover_search_graphql.py

6b12816

JSv4 merged commit f025fd6 into main Jun 29, 2026
13 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Fix Discover query vector cache bounds#2091

Fix Discover query vector cache bounds#2091
JSv4 merged 2 commits into
mainfrom
codex/propose-fix-for-discover-query-cache-vulnerability

JSv4 commented Jun 29, 2026

Uh oh!

claude Bot commented Jun 29, 2026

Uh oh!

codecov Bot commented Jun 29, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Uh oh!

Conversation

JSv4 commented Jun 29, 2026

Motivation

Description

Testing

Uh oh!

claude Bot commented Jun 29, 2026

Review: Fix Discover query vector cache bounds

1. Stale return-type annotation on _cached_query_vector — maintenance hazard

2. Missing changelog fragment — CLAUDE.md requirement

3. Protection lives only at call sites, not at the resource boundary — altitude concern

4. test_overlong_query_returns_empty_before_searching verifies the semantic arm only, not the text arm

Minor: test_overlong_query_returns_empty_before_searching assertion is outside the with block

Uh oh!

codecov Bot commented Jun 29, 2026

Codecov Report

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

1. Stale return-type annotation on `_cached_query_vector` — maintenance hazard

4. `test_overlong_query_returns_empty_before_searching` verifies the semantic arm only, not the text arm

Minor: `test_overlong_query_returns_empty_before_searching` assertion is outside the `with` block