Fix structured-data extraction: restore dropped column constraints, cap tool-looping, full-text for short docs by JSv4 · Pull Request #2070 · Open-Source-Legal/OpenContracts

JSv4 · 2026-06-26T11:35:07Z

Summary

An end-to-end evaluation of the structured-data extraction pipeline (Fieldset / Column /
Extract / Datacell) over 50 real Fort Worth municipal contracts surfaced three functional
bugs that severely degraded extraction quality and reliability. This PR fixes all three.

Measured on a 15-doc subset with an 8-field commercial-diligence fieldset (counterparty,
effective/expiration date, contract value, governing law, indemnification, MFN,
termination-for-convenience):

	before (no fixes)	after (this PR)
objective accuracy	27%	90%
failed cells	11 (`tool_loop_no_output`)	0
LLM tool calls / cell	~16–100	~1

Bugs fixed

1. `Column.instructions` / `must_contain_text` / `limit_to_label` silently dropped

These three Column fields are GraphQL-settable, persisted, copied on clone, and diffed — but
the marvin→doc-agent rewrite (commit 184903f62) stopped feeding them to the extraction LLM,
so per-column guidance/scoping had zero effect. The get_column_extraction_params helper
that still "returned" instructions had zero callers (dead code).
Fix: doc_extract_query_task folds all three back into the prompt; dead helper removed.

2. Runaway agent tool-looping (no request budget)

With no usage_limits, a weak model (e.g. gpt-4o-mini) would make up to 100
similarity_search calls on a hard/absent value before pydantic-ai gave up — 770 KB message
logs, ~100 s/cell, surfacing as failure_mode=tool_loop_no_output.
Fix: _structured_response_raw passes UsageLimits(request_limit=EXTRACT_AGENT_REQUEST_LIMIT)
(=20). A capable model commits well within it; the cap bounds the pathological case.

3. Short documents: agent can't confirm clause absence via retrieval → loops

Retrieval can't prove a clause is absent; for short contracts the agent searched endlessly
looking for clauses that weren't there.
Fix: for documents ≤ EXTRACT_FULL_TEXT_CHAR_LIMIT (24 K chars), doc_extract_query_task
injects the full (fenced) document text so the agent answers — and confirms absence — in one
read. Cut tool calls ~16 → ~1 per cell and eliminated all failures. Retrieval remains the
path for longer documents.

Tests

opencontractserver/tests/test_extract_prompt_wiring.py — asserts the three constraint fields
and the full-text both reach the prompt the agent runs (agent patched; no LLM hit).

Changelog

changelog.d/extract-column-constraints.fixed.md, extract-tool-budget.fixed.md,
extract-fulltext-short-docs.changed.md.

Notes / follow-ups

The dominant practical issue was runaway tool-looping (Bump traefik from v2.8.7 to v2.9.1 in /compose/production/traefik #2) + the inability to confirm
absence (Bump actions/setup-node from 3.4.1 to 3.5.1 #3); Bump postgres from 14.5 to 15.0 in /compose/production/postgres #1 is a real but quieter feature-loss.
A broader static review also flagged: no date/datetime output type (a CreateColumn
with output_type="date" crashes the cell in parse_model_or_primitive); legitimately
absent values are recorded as failed cells (the FE renders them as errors); a generic
return None on any structured-run exception can mislabel a real failure as
agent_committed_none. These are documented for follow-up.

…ap tool-looping, full-text for short docs An end-to-end eval of the extraction pipeline over 50 Fort Worth contracts surfaced three functional bugs. On a 15-doc commercial-diligence subset the fixes move objective accuracy 27% -> 90%, failed cells 11 -> 0, and LLM tool calls ~16 -> ~1 per cell. 1. Column.instructions / must_contain_text / limit_to_label were silently dropped by the marvin->doc-agent rewrite (184903f): persisted and GraphQL-settable but never fed to the extraction LLM. Re-wired into the prompt; the dead get_column_extraction_params helper (zero callers) removed. 2. No request budget let a weak model run away making up to ~100 similarity_search calls on a hard/absent value (failure_mode= tool_loop_no_output, 770KB logs, ~100s/cell). Added UsageLimits(request_limit=EXTRACT_AGENT_REQUEST_LIMIT=20) on the structured run. 3. Retrieval cannot prove a clause is absent, so on short contracts the agent looped. For documents <= EXTRACT_FULL_TEXT_CHAR_LIMIT (50K chars), inject the full (fenced) document text so the agent answers -- and confirms absence -- in one read. Retrieval remains the path for longer documents. Tests: opencontractserver/tests/test_extract_prompt_wiring.py asserts the three constraint fields and the full text all reach the prompt the agent runs.

JSv4 · 2026-06-26T11:39:37Z

Full-contract clause-selector validation

The 90% headline was measured on a renewal-notice subset that (it turned out) doesn't contain the diligence clauses. To validate the clause selectors on documents that do contain them, I ran the same fixed pipeline (claude-sonnet-4-6) on 2 full commercial agreements:

selector	result (LLM-judge)
governing_law	2/2 (Texas)
total_contract_value	2/2 ($371.93, $1,860,738.94)
has_indemnification	2/2 (True / False — correct)
has_most_favored_nation	2/2 (both False)
termination_for_convenience	2/2

100% on the clause selectors when the documents contain the relevant information — confirming the renewal-notice gaps were a data characteristic, not a pipeline issue. Across both subsets the fixed pipeline ran with 0 failures and ~1 tool call/cell.

claude · 2026-06-26T11:41:11Z

Code Review

This PR fixes three real extraction bugs that were killing quality on short municipal contracts: dropped column constraints, runaway tool-looping, and the agent's inability to confirm clause absence via retrieval alone. The fixes are well-motivated, the test coverage is appropriate, and the changelog fragments are correct. Three findings worth addressing before merge:

1. File handle leak in `_load_doc_text` — and a DRY violation (Bug)

opencontractserver/tasks/data_extract_tasks.py, ~line 426

f.open("rb")
raw = f.read()
f.close()

If f.read() raises (e.g., an S3/GCS network error mid-read), the bare outer except Exception catches it and returns "" — but f.close() is never reached, leaking the FieldFile handle. On cloud storage backends this holds an HTTP connection open; under load or repeated failures this exhausts connection pools.

Also: opencontractserver/utils/files.py:65 already has read_field_file_text that does exactly this — with a proper with context manager and the same bytes/str normalization — and is already used in six other call sites (mcp/tools.py, mcp/resources.py, export_v2.py, etc.). The new _load_doc_text re-implements it incorrectly.

Fix:

from opencontractserver.utils.files import read_field_file_text

full_text = await sync_to_async(read_field_file_text)(
    document.txt_extract_file, errors="replace"
) if document.txt_extract_file else ""

No inner function needed; the @sync_to_async wrapping is done inline on an existing utility.

2. `UsageLimitExceeded` silently mis-classified as `tool_loop_no_output` (Observability Bug)

opencontractserver/llms/agents/pydantic_ai_agents.py, line 1822

When the 20-request budget trips, pydantic-ai raises UsageLimitExceeded. The generic except Exception at line 1822 catches it, logs a generic warning, and returns None. The extraction task then calls _classify_none_result(messages), which sees many similarity_search calls in the message history and emits failure_mode = NONE_RESULT_TOOL_LOOP — the exact diagnosis the limit was introduced to fix. Operators can't distinguish "budget hit" from "genuine runaway loop."

Note: the streaming path (_stream_raw, line 1667) already handles UsageLimitExceeded specifically. Structured-response should do the same.

Fix: Catch it explicitly before the generic handler:

from pydantic_ai.exceptions import UsageLimitExceeded

try:
    run_result = await structured_agent.run(...)
    return run_result.output
except UsageLimitExceeded as e:
    logger.warning("Structured-extraction usage limit hit (request_limit=%d): %s", EXTRACT_AGENT_REQUEST_LIMIT, e)
    return None  # caller will classify; consider NONE_RESULT_USAGE_LIMIT constant
except Exception as e:
    ...

Ideally, also add a NONE_RESULT_USAGE_LIMIT = "usage_limit_exceeded" constant to opencontractserver/constants/llm.py and thread it through _classify_none_result so the datacell stacktrace field carries a distinguishable failure mode.

3. Deferred import of `fence_user_content` (Convention)

opencontractserver/tasks/data_extract_tasks.py, lines 437–439

from opencontractserver.utils.prompt_sanitization import (
    fence_user_content,
)

This import is inside the function body, while every other import in the file is top-level. CLAUDE.md's style convention and the existing pattern across the codebase (core_agents.py, pydantic_ai_agents.py) keep these at the module level. There's no conditional need here that would justify deferring it.

Fix: Move to the top-level import block.

Overall the logic is sound and the quality improvement (27% → 90% accuracy) is significant. Findings 1 and 2 are worth fixing before merge; finding 3 is a quick cleanup.

…reconcile prompt Adversarial review of the diff surfaced refinements (all confirmed): - Request budget is now applied via run_kwargs.setdefault() instead of a hardcoded explicit arg, so a caller may override it (usage_limits is whitelisted in _run_accepted) and the explicit value can never collide with a pass-through one into a duplicate-keyword TypeError (which the broad except would mask as a silent None). Documented as a default ceiling for all structured runs (tightening pydantic-ai's own default of 50), overridable per-call for e.g. a future corpus-wide structured analytic. - Full-text load reuses the canonical read_field_file_text() helper instead of an inline reimplementation, and logs storage errors instead of silently swallowing them. - Injected full text is prefixed with UNTRUSTED_CONTENT_NOTICE and passed through warn_if_content_large(); the prompt explicitly relaxes the system prompt's mandatory multi-search negative-case rule when the whole document is in context. - Lowered EXTRACT_FULL_TEXT_CHAR_LIMIT 50K -> 24K (~6K tokens) to bound redundant per-column token cost and stay well inside the context budget. - Clarified must_contain_text / limit_to_label are advisory prompt guidance, not enforced retrieval filters.

JSv4 · 2026-06-26T11:51:04Z

Review + verification

Ran an adversarial review of the diff (15 confirmed findings) and addressed the substantive ones in a56d4a631:

Request budget now applied via run_kwargs.setdefault(...) — overridable per-call and collision-proof (no duplicate-keyword TypeError); documented as a default ceiling for all structured runs (tightening pydantic-ai's own default of 50), not extraction-only.
DRY — full-text load reuses the canonical read_field_file_text(); storage errors are logged, not swallowed.
Untrusted content — injected text prefixed with UNTRUSTED_CONTENT_NOTICE + warn_if_content_large(); the prompt explicitly relaxes the system prompt's multi-search negative-case rule when the full doc is in context.
Token budget — EXTRACT_FULL_TEXT_CHAR_LIMIT 50K→24K (~6K tokens) to bound per-column cost.
Clarified must_contain_text/limit_to_label are advisory prompt guidance, not enforced retrieval filters.

Tests: 3 new (test_extract_prompt_wiring.py) pass; 8/9 existing test_extract_tasks pass (the VCR ones exercising the new prompt). The 1 failure (test_extract_completion_callback — MagicMock is not JSON serializable) is pre-existing (fails identically on base main) and untouched by this PR. pre-commit (black/isort/flake8/mypy) clean.

claude · 2026-06-26T11:53:01Z

Code Review

This PR fixes three real extraction bugs — column constraints silently ignored since the marvin→doc-agent rewrite, runaway tool-looping with no request budget, and short-document clause-absence loops — and backs each fix with a regression test. The measured accuracy jump (27% → 90%) and the test coverage for the wiring changes are both solid. One confirmed bug introduced by the budget-cap fix, one low-priority footgun.

1. `UsageLimitExceeded` is silently swallowed and misclassified — CONFIRMED bug

File: opencontractserver/llms/agents/pydantic_ai_agents.py, line ~1826
File: opencontractserver/tasks/data_extract_tasks.py, line ~641

The new UsageLimits(request_limit=EXTRACT_AGENT_REQUEST_LIMIT) is the right guard against runaway loops, but when the limit fires, pydantic-ai raises UsageLimitExceeded (a subclass of Exception). That exception hits the broad except Exception in _structured_response_raw, is logged as a warning, and returns None. The caller's _classify_none_result then inspects the message history and classifies the run as either tool_loop_no_output or no_final_response — neither of which is "budget exceeded."

Ironically this makes the new cap harder to observe: an operator who sets request_limit too low will see cells silently fail with the same codes as before, with no signal that the limit is the cause. Notably the streaming chat() path at line ~1667 already has "UsageLimitExceeded" in type(e).__name__ as a labelled guard — the same pattern is missing from _structured_response_raw.

Suggested fix: catch UsageLimitExceeded before the broad except and either re-raise it (letting doc_extract_query_task record a distinct failure_mode) or map it to a new NONE_RESULT_BUDGET_EXCEEDED constant — consistent with how the chat path already handles it.

from pydantic_ai.exceptions import UsageLimitExceeded  # at top of file

# inside _structured_response_raw, before the broad except:
except UsageLimitExceeded as e:
    logger.warning("Structured run hit request budget (%s): %s", EXTRACT_AGENT_REQUEST_LIMIT, e)
    return None  # caller can check failure_mode = NONE_RESULT_BUDGET_EXCEEDED

2. `usage_limits=None` in kwargs silently disables the new budget cap — low-severity footgun

File: opencontractserver/llms/agents/pydantic_ai_agents.py, line ~1818

run_kwargs.setdefault("usage_limits", UsageLimits(...)) correctly skips insertion when the caller has already provided a value. But if a caller passes usage_limits=None (intending "no limit" or just "use the default"), setdefault sees the key as already present and leaves None in place — which passes usage_limits=None to pydantic-ai and removes the cap entirely. The PR description's comment about setdefault preventing a TypeError is correct, but the None edge case is undocumented.

This won't break current callers (none pass usage_limits to the structured path today), but it's a trap for any future caller or test harness that passes usage_limits=None expecting the default to apply. A one-line comment or an explicit if run_kwargs.get("usage_limits") is not None guard would close it.

Minor notes (no action required)

changelog.d/extract-fulltext-short-docs.changed.md says EXTRACT_FULL_TEXT_CHAR_LIMIT is "24 K chars" but the constant is set to 24_000 — consistent, just worth confirming the token estimate (~6K) stays valid if the model changes.
The test's fake_agent returns ("ok", []) but the real get_structured_response_and_sources_from_document returns (value, sources) — the test works because it only checks the captured prompt, not the return value. This is fine for its purpose but worth a comment if the test grows.
The PR description already calls out date/datetime output type, agent_committed_none mislabeling, and absent-value-as-failed-cell as known follow-ups — all accurate; nothing to add there.

codecov · 2026-06-26T12:24:19Z

Codecov Report

❌ Patch coverage is 83.33333% with 2 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
...encontractserver/llms/agents/pydantic_ai_agents.py	77.77%	2 Missing ⚠️

📢 Thoughts on this report? Let us know!

…diligence

doc_extract_query_task now appends per-column constraints and (for short docs) the full document text to the agent prompt, so the prompt is no longer exactly task.query. The benchmark integration mock looked up its canned answer by exact prompt equality, returning '' and dropping answer_exact_match to 0.0. Match the canned answer by the query the prompt starts with (exact lookup first).

- Catch UsageLimitExceeded explicitly in _structured_response_raw and log it as a named condition (mirroring the streaming chat() path) instead of letting the broad except mask it as a generic structured-response failure. - Classify the budget hit distinctly: _classify_none_result recognises the >= EXTRACT_AGENT_REQUEST_LIMIT model-response fingerprint and records failure_mode=usage_limit_exceeded (new NONE_RESULT_USAGE_LIMIT constant), checked before tool-loop detection since a budget hit is the more precise diagnosis. Operators can now tell a too-tight budget from a runaway loop. - Fix usage_limits=None footgun: an explicit None from a caller no longer disables the cap; the default applies whenever the resolved value is None. - Hoist the prompt_sanitization import in data_extract_tasks to module level. - Tests for the budget-exhaustion classification, its boundary, final-result precedence, and the new failure message.

claude · 2026-06-27T18:50:33Z

Code Review

This PR fixes three real extraction-pipeline bugs (dropped column constraints, runaway tool-looping, inability to confirm clause absence in short docs) and the benchmark numbers make a compelling case that all three were material. The structure is clean — new constants in the right files, a named failure mode for budget exhaustion, and regression tests that pin the prompt-wiring behavior without hitting an LLM. Four issues worth addressing before merge:

1. Column constraint fields are injected raw — no fencing, no untrusted-content notice (security)

opencontractserver/tasks/data_extract_tasks.py lines 431–443

The full document text (lines 484–490) is correctly wrapped with UNTRUSTED_CONTENT_NOTICE + fence_user_content. The column constraint fields are not:

column_constraints.append(f"Additional instructions: {column.instructions}")
# ...
column_constraints.append(f"Only extract data from sections that contain the text: '{column.must_contain_text}'.")

Any authenticated user can set column.instructions, column.must_contain_text, and column.limit_to_label via CreateColumn / UpdateColumnMutation — both are guarded by @login_required only, with no staff/superuser check. Because these fields are injected before the fenced document block, they land in the instruction section of the prompt at higher positional authority than the document text. A value like "\n\nIgnore all previous instructions..." in column.instructions is a textbook prompt-injection vector, and the sanitization utilities that exist in prompt_sanitization.py (fence_user_content, sanitize_for_prompt_strict) are unused here despite being used one block later.

2. `warn_if_content_large` fires on virtually every short-document injection (log noise / semantic mismatch)

opencontractserver/tasks/data_extract_tasks.py line 478

UNTRUSTED_CONTENT_SIZE_WARNING_THRESHOLD = 1000 chars (opencontractserver/constants/moderation.py:12) but EXTRACT_FULL_TEXT_CHAR_LIMIT = 24_000 chars. Any document between 1,001 and 24,000 characters — the vast majority of real contracts — will emit a [PromptInjection] WARNING log on every extraction cell. An extract with 20 columns over 100 documents produces ~2,000 spurious [PromptInjection] warnings, burying the signal that warn_if_content_large is meant to surface.

warn_if_content_large was designed for short user-supplied field values (titles, messages). Document body text at up to 24 K chars is the expected input here, not an anomaly. The call should either be removed (the document text is already fenced and prefixed with UNTRUSTED_CONTENT_NOTICE) or its threshold documented as inapplicable in this context.

3. Full document text is read unconditionally before the length guard (wasted I/O for long docs)

opencontractserver/tasks/data_extract_tasks.py lines 463–477

full_text = await sync_to_async(read_field_file_text)(
    document.txt_extract_file, errors="replace"      # full file read into memory
)
# ...
if full_text and len(full_text) <= EXTRACT_FULL_TEXT_CHAR_LIMIT:   # only then check

read_field_file_text does a plain f.read() with no size limit. For documents above 24 K characters the entire file is fetched from object storage and immediately discarded. Django's FieldFile exposes a .size attribute (bytes); a pre-check like document.txt_extract_file.size <= EXTRACT_FULL_TEXT_CHAR_LIMIT * 4 before the read would eliminate the wasted round-trip for large documents.

4. `_classify_none_result` is coupled to the default budget constant (latent misclassification)

opencontractserver/tasks/data_extract_tasks.py line 130

if model_response_count >= EXTRACT_AGENT_REQUEST_LIMIT:   # hardcoded 20
    return NONE_RESULT_USAGE_LIMIT

_structured_response_raw applies EXTRACT_AGENT_REQUEST_LIMIT as a default that any caller can override by passing usage_limits=UsageLimits(request_limit=N). If a future caller sets a lower limit (e.g. 5), pydantic-ai raises UsageLimitExceeded after 5 responses, but _classify_none_result sees 5 < 20 and falls through to NONE_RESULT_TOOL_LOOP or NONE_RESULT_NO_FINAL — a wrong diagnosis written to Datacell.stacktrace. The inverse holds for a higher custom limit. No current caller exercises this, so it is latent today, but the same EXTRACT_AGENT_REQUEST_LIMIT comment in _structured_response_raw explicitly invites callers to override it. Passing the actual request_limit used into _classify_none_result, or annotating the None return with the exception type before returning, would decouple them.

Minor note: the UsageLimitExceeded warning log at pydantic_ai_agents.py:1843 always prints EXTRACT_AGENT_REQUEST_LIMIT as the limit, which would also be misleading if a caller supplies a different UsageLimits — same root cause as finding 4.

JSv4 added 3 commits June 27, 2026 12:57

Merge remote-tracking branch 'origin/main' into feature/extract-eval-…

f001631

…diligence

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Fix structured-data extraction: restore dropped column constraints, cap tool-looping, full-text for short docs#2070

Fix structured-data extraction: restore dropped column constraints, cap tool-looping, full-text for short docs#2070
JSv4 wants to merge 5 commits into
mainfrom
feature/extract-eval-diligence

JSv4 commented Jun 26, 2026 •

edited

Loading

Uh oh!

JSv4 commented Jun 26, 2026

Uh oh!

claude Bot commented Jun 26, 2026

Uh oh!

JSv4 commented Jun 26, 2026

Uh oh!

claude Bot commented Jun 26, 2026

Uh oh!

codecov Bot commented Jun 26, 2026 •

edited

Loading

Uh oh!

claude Bot commented Jun 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Uh oh!

Conversation

JSv4 commented Jun 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Bugs fixed

1. Column.instructions / must_contain_text / limit_to_label silently dropped

2. Runaway agent tool-looping (no request budget)

3. Short documents: agent can't confirm clause absence via retrieval → loops

Tests

Changelog

Notes / follow-ups

Uh oh!

JSv4 commented Jun 26, 2026

Full-contract clause-selector validation

Uh oh!

claude Bot commented Jun 26, 2026

Code Review

1. File handle leak in _load_doc_text — and a DRY violation (Bug)

2. UsageLimitExceeded silently mis-classified as tool_loop_no_output (Observability Bug)

3. Deferred import of fence_user_content (Convention)

Uh oh!

JSv4 commented Jun 26, 2026

Review + verification

Uh oh!

claude Bot commented Jun 26, 2026

Code Review

1. UsageLimitExceeded is silently swallowed and misclassified — CONFIRMED bug

2. usage_limits=None in kwargs silently disables the new budget cap — low-severity footgun

Minor notes (no action required)

Uh oh!

codecov Bot commented Jun 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

claude Bot commented Jun 27, 2026

Code Review

1. Column constraint fields are injected raw — no fencing, no untrusted-content notice (security)

2. warn_if_content_large fires on virtually every short-document injection (log noise / semantic mismatch)

3. Full document text is read unconditionally before the length guard (wasted I/O for long docs)

4. _classify_none_result is coupled to the default budget constant (latent misclassification)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

JSv4 commented Jun 26, 2026 •

edited

Loading

1. `Column.instructions` / `must_contain_text` / `limit_to_label` silently dropped

1. File handle leak in `_load_doc_text` — and a DRY violation (Bug)

2. `UsageLimitExceeded` silently mis-classified as `tool_loop_no_output` (Observability Bug)

3. Deferred import of `fence_user_content` (Convention)

1. `UsageLimitExceeded` is silently swallowed and misclassified — CONFIRMED bug

2. `usage_limits=None` in kwargs silently disables the new budget cap — low-severity footgun

codecov Bot commented Jun 26, 2026 •

edited

Loading

2. `warn_if_content_large` fires on virtually every short-document injection (log noise / semantic mismatch)

4. `_classify_none_result` is coupled to the default budget constant (latent misclassification)