Skip to content

Fix structured-data extraction: restore dropped column constraints, cap tool-looping, full-text for short docs#2070

Open
JSv4 wants to merge 5 commits into
mainfrom
feature/extract-eval-diligence
Open

Fix structured-data extraction: restore dropped column constraints, cap tool-looping, full-text for short docs#2070
JSv4 wants to merge 5 commits into
mainfrom
feature/extract-eval-diligence

Conversation

@JSv4

@JSv4 JSv4 commented Jun 26, 2026

Copy link
Copy Markdown
Collaborator

Summary

An end-to-end evaluation of the structured-data extraction pipeline (Fieldset / Column /
Extract / Datacell) over 50 real Fort Worth municipal contracts surfaced three functional
bugs that severely degraded extraction quality and reliability. This PR fixes all three.

Measured on a 15-doc subset with an 8-field commercial-diligence fieldset (counterparty,
effective/expiration date, contract value, governing law, indemnification, MFN,
termination-for-convenience):

before (no fixes) after (this PR)
objective accuracy 27% 90%
failed cells 11 (tool_loop_no_output) 0
LLM tool calls / cell ~16–100 ~1

Bugs fixed

1. Column.instructions / must_contain_text / limit_to_label silently dropped

These three Column fields are GraphQL-settable, persisted, copied on clone, and diffed — but
the marvin→doc-agent rewrite (commit 184903f62) stopped feeding them to the extraction LLM,
so per-column guidance/scoping had zero effect. The get_column_extraction_params helper
that still "returned" instructions had zero callers (dead code).
Fix: doc_extract_query_task folds all three back into the prompt; dead helper removed.

2. Runaway agent tool-looping (no request budget)

With no usage_limits, a weak model (e.g. gpt-4o-mini) would make up to 100
similarity_search calls
on a hard/absent value before pydantic-ai gave up — 770 KB message
logs, ~100 s/cell, surfacing as failure_mode=tool_loop_no_output.
Fix: _structured_response_raw passes UsageLimits(request_limit=EXTRACT_AGENT_REQUEST_LIMIT)
(=20). A capable model commits well within it; the cap bounds the pathological case.

3. Short documents: agent can't confirm clause absence via retrieval → loops

Retrieval can't prove a clause is absent; for short contracts the agent searched endlessly
looking for clauses that weren't there.
Fix: for documents ≤ EXTRACT_FULL_TEXT_CHAR_LIMIT (24 K chars), doc_extract_query_task
injects the full (fenced) document text so the agent answers — and confirms absence — in one
read. Cut tool calls ~16 → ~1 per cell and eliminated all failures. Retrieval remains the
path for longer documents.

Tests

opencontractserver/tests/test_extract_prompt_wiring.py — asserts the three constraint fields
and the full-text both reach the prompt the agent runs (agent patched; no LLM hit).

Changelog

changelog.d/extract-column-constraints.fixed.md, extract-tool-budget.fixed.md,
extract-fulltext-short-docs.changed.md.

Notes / follow-ups

…ap tool-looping, full-text for short docs

An end-to-end eval of the extraction pipeline over 50 Fort Worth contracts
surfaced three functional bugs. On a 15-doc commercial-diligence subset the
fixes move objective accuracy 27% -> 90%, failed cells 11 -> 0, and LLM tool
calls ~16 -> ~1 per cell.

1. Column.instructions / must_contain_text / limit_to_label were silently
   dropped by the marvin->doc-agent rewrite (184903f): persisted and
   GraphQL-settable but never fed to the extraction LLM. Re-wired into the
   prompt; the dead get_column_extraction_params helper (zero callers) removed.

2. No request budget let a weak model run away making up to ~100
   similarity_search calls on a hard/absent value (failure_mode=
   tool_loop_no_output, 770KB logs, ~100s/cell). Added
   UsageLimits(request_limit=EXTRACT_AGENT_REQUEST_LIMIT=20) on the structured
   run.

3. Retrieval cannot prove a clause is absent, so on short contracts the agent
   looped. For documents <= EXTRACT_FULL_TEXT_CHAR_LIMIT (50K chars), inject the
   full (fenced) document text so the agent answers -- and confirms absence --
   in one read. Retrieval remains the path for longer documents.

Tests: opencontractserver/tests/test_extract_prompt_wiring.py asserts the three
constraint fields and the full text all reach the prompt the agent runs.
@JSv4

JSv4 commented Jun 26, 2026

Copy link
Copy Markdown
Collaborator Author

Full-contract clause-selector validation

The 90% headline was measured on a renewal-notice subset that (it turned out) doesn't contain the diligence clauses. To validate the clause selectors on documents that do contain them, I ran the same fixed pipeline (claude-sonnet-4-6) on 2 full commercial agreements:

selector result (LLM-judge)
governing_law 2/2 (Texas)
total_contract_value 2/2 ($371.93, $1,860,738.94)
has_indemnification 2/2 (True / False — correct)
has_most_favored_nation 2/2 (both False)
termination_for_convenience 2/2

100% on the clause selectors when the documents contain the relevant information — confirming the renewal-notice gaps were a data characteristic, not a pipeline issue. Across both subsets the fixed pipeline ran with 0 failures and ~1 tool call/cell.

@claude

claude Bot commented Jun 26, 2026

Copy link
Copy Markdown

Code Review

This PR fixes three real extraction bugs that were killing quality on short municipal contracts: dropped column constraints, runaway tool-looping, and the agent's inability to confirm clause absence via retrieval alone. The fixes are well-motivated, the test coverage is appropriate, and the changelog fragments are correct. Three findings worth addressing before merge:


1. File handle leak in _load_doc_text — and a DRY violation (Bug)

opencontractserver/tasks/data_extract_tasks.py, ~line 426

f.open("rb")
raw = f.read()
f.close()

If f.read() raises (e.g., an S3/GCS network error mid-read), the bare outer except Exception catches it and returns "" — but f.close() is never reached, leaking the FieldFile handle. On cloud storage backends this holds an HTTP connection open; under load or repeated failures this exhausts connection pools.

Also: opencontractserver/utils/files.py:65 already has read_field_file_text that does exactly this — with a proper with context manager and the same bytes/str normalization — and is already used in six other call sites (mcp/tools.py, mcp/resources.py, export_v2.py, etc.). The new _load_doc_text re-implements it incorrectly.

Fix:

from opencontractserver.utils.files import read_field_file_text

full_text = await sync_to_async(read_field_file_text)(
    document.txt_extract_file, errors="replace"
) if document.txt_extract_file else ""

No inner function needed; the @sync_to_async wrapping is done inline on an existing utility.


2. UsageLimitExceeded silently mis-classified as tool_loop_no_output (Observability Bug)

opencontractserver/llms/agents/pydantic_ai_agents.py, line 1822

When the 20-request budget trips, pydantic-ai raises UsageLimitExceeded. The generic except Exception at line 1822 catches it, logs a generic warning, and returns None. The extraction task then calls _classify_none_result(messages), which sees many similarity_search calls in the message history and emits failure_mode = NONE_RESULT_TOOL_LOOP — the exact diagnosis the limit was introduced to fix. Operators can't distinguish "budget hit" from "genuine runaway loop."

Note: the streaming path (_stream_raw, line 1667) already handles UsageLimitExceeded specifically. Structured-response should do the same.

Fix: Catch it explicitly before the generic handler:

from pydantic_ai.exceptions import UsageLimitExceeded

try:
    run_result = await structured_agent.run(...)
    return run_result.output
except UsageLimitExceeded as e:
    logger.warning("Structured-extraction usage limit hit (request_limit=%d): %s", EXTRACT_AGENT_REQUEST_LIMIT, e)
    return None  # caller will classify; consider NONE_RESULT_USAGE_LIMIT constant
except Exception as e:
    ...

Ideally, also add a NONE_RESULT_USAGE_LIMIT = "usage_limit_exceeded" constant to opencontractserver/constants/llm.py and thread it through _classify_none_result so the datacell stacktrace field carries a distinguishable failure mode.


3. Deferred import of fence_user_content (Convention)

opencontractserver/tasks/data_extract_tasks.py, lines 437–439

from opencontractserver.utils.prompt_sanitization import (
    fence_user_content,
)

This import is inside the function body, while every other import in the file is top-level. CLAUDE.md's style convention and the existing pattern across the codebase (core_agents.py, pydantic_ai_agents.py) keep these at the module level. There's no conditional need here that would justify deferring it.

Fix: Move to the top-level import block.


Overall the logic is sound and the quality improvement (27% → 90% accuracy) is significant. Findings 1 and 2 are worth fixing before merge; finding 3 is a quick cleanup.

…reconcile prompt

Adversarial review of the diff surfaced refinements (all confirmed):

- Request budget is now applied via run_kwargs.setdefault() instead of a
  hardcoded explicit arg, so a caller may override it (usage_limits is
  whitelisted in _run_accepted) and the explicit value can never collide with a
  pass-through one into a duplicate-keyword TypeError (which the broad except
  would mask as a silent None). Documented as a default ceiling for all
  structured runs (tightening pydantic-ai's own default of 50), overridable
  per-call for e.g. a future corpus-wide structured analytic.

- Full-text load reuses the canonical read_field_file_text() helper instead of
  an inline reimplementation, and logs storage errors instead of silently
  swallowing them.

- Injected full text is prefixed with UNTRUSTED_CONTENT_NOTICE and passed
  through warn_if_content_large(); the prompt explicitly relaxes the system
  prompt's mandatory multi-search negative-case rule when the whole document is
  in context.

- Lowered EXTRACT_FULL_TEXT_CHAR_LIMIT 50K -> 24K (~6K tokens) to bound
  redundant per-column token cost and stay well inside the context budget.

- Clarified must_contain_text / limit_to_label are advisory prompt guidance,
  not enforced retrieval filters.
@JSv4

JSv4 commented Jun 26, 2026

Copy link
Copy Markdown
Collaborator Author

Review + verification

Ran an adversarial review of the diff (15 confirmed findings) and addressed the substantive ones in a56d4a631:

  • Request budget now applied via run_kwargs.setdefault(...) — overridable per-call and collision-proof (no duplicate-keyword TypeError); documented as a default ceiling for all structured runs (tightening pydantic-ai's own default of 50), not extraction-only.
  • DRY — full-text load reuses the canonical read_field_file_text(); storage errors are logged, not swallowed.
  • Untrusted content — injected text prefixed with UNTRUSTED_CONTENT_NOTICE + warn_if_content_large(); the prompt explicitly relaxes the system prompt's multi-search negative-case rule when the full doc is in context.
  • Token budgetEXTRACT_FULL_TEXT_CHAR_LIMIT 50K→24K (~6K tokens) to bound per-column cost.
  • Clarified must_contain_text/limit_to_label are advisory prompt guidance, not enforced retrieval filters.

Tests: 3 new (test_extract_prompt_wiring.py) pass; 8/9 existing test_extract_tasks pass (the VCR ones exercising the new prompt). The 1 failure (test_extract_completion_callbackMagicMock is not JSON serializable) is pre-existing (fails identically on base main) and untouched by this PR. pre-commit (black/isort/flake8/mypy) clean.

@claude

claude Bot commented Jun 26, 2026

Copy link
Copy Markdown

Code Review

This PR fixes three real extraction bugs — column constraints silently ignored since the marvin→doc-agent rewrite, runaway tool-looping with no request budget, and short-document clause-absence loops — and backs each fix with a regression test. The measured accuracy jump (27% → 90%) and the test coverage for the wiring changes are both solid. One confirmed bug introduced by the budget-cap fix, one low-priority footgun.


1. UsageLimitExceeded is silently swallowed and misclassified — CONFIRMED bug

File: opencontractserver/llms/agents/pydantic_ai_agents.py, line ~1826
File: opencontractserver/tasks/data_extract_tasks.py, line ~641

The new UsageLimits(request_limit=EXTRACT_AGENT_REQUEST_LIMIT) is the right guard against runaway loops, but when the limit fires, pydantic-ai raises UsageLimitExceeded (a subclass of Exception). That exception hits the broad except Exception in _structured_response_raw, is logged as a warning, and returns None. The caller's _classify_none_result then inspects the message history and classifies the run as either tool_loop_no_output or no_final_response — neither of which is "budget exceeded."

Ironically this makes the new cap harder to observe: an operator who sets request_limit too low will see cells silently fail with the same codes as before, with no signal that the limit is the cause. Notably the streaming chat() path at line ~1667 already has "UsageLimitExceeded" in type(e).__name__ as a labelled guard — the same pattern is missing from _structured_response_raw.

Suggested fix: catch UsageLimitExceeded before the broad except and either re-raise it (letting doc_extract_query_task record a distinct failure_mode) or map it to a new NONE_RESULT_BUDGET_EXCEEDED constant — consistent with how the chat path already handles it.

from pydantic_ai.exceptions import UsageLimitExceeded  # at top of file

# inside _structured_response_raw, before the broad except:
except UsageLimitExceeded as e:
    logger.warning("Structured run hit request budget (%s): %s", EXTRACT_AGENT_REQUEST_LIMIT, e)
    return None  # caller can check failure_mode = NONE_RESULT_BUDGET_EXCEEDED

2. usage_limits=None in kwargs silently disables the new budget cap — low-severity footgun

File: opencontractserver/llms/agents/pydantic_ai_agents.py, line ~1818

run_kwargs.setdefault("usage_limits", UsageLimits(...)) correctly skips insertion when the caller has already provided a value. But if a caller passes usage_limits=None (intending "no limit" or just "use the default"), setdefault sees the key as already present and leaves None in place — which passes usage_limits=None to pydantic-ai and removes the cap entirely. The PR description's comment about setdefault preventing a TypeError is correct, but the None edge case is undocumented.

This won't break current callers (none pass usage_limits to the structured path today), but it's a trap for any future caller or test harness that passes usage_limits=None expecting the default to apply. A one-line comment or an explicit if run_kwargs.get("usage_limits") is not None guard would close it.


Minor notes (no action required)

  • changelog.d/extract-fulltext-short-docs.changed.md says EXTRACT_FULL_TEXT_CHAR_LIMIT is "24 K chars" but the constant is set to 24_000 — consistent, just worth confirming the token estimate (~6K) stays valid if the model changes.
  • The test's fake_agent returns ("ok", []) but the real get_structured_response_and_sources_from_document returns (value, sources) — the test works because it only checks the captured prompt, not the return value. This is fine for its purpose but worth a comment if the test grows.
  • The PR description already calls out date/datetime output type, agent_committed_none mislabeling, and absent-value-as-failed-cell as known follow-ups — all accurate; nothing to add there.

@codecov

codecov Bot commented Jun 26, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 83.33333% with 2 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
...encontractserver/llms/agents/pydantic_ai_agents.py 77.77% 2 Missing ⚠️

📢 Thoughts on this report? Let us know!

JSv4 added 3 commits June 27, 2026 12:57
doc_extract_query_task now appends per-column constraints and (for short
docs) the full document text to the agent prompt, so the prompt is no
longer exactly task.query. The benchmark integration mock looked up its
canned answer by exact prompt equality, returning '' and dropping
answer_exact_match to 0.0. Match the canned answer by the query the
prompt starts with (exact lookup first).
- Catch UsageLimitExceeded explicitly in _structured_response_raw and log
  it as a named condition (mirroring the streaming chat() path) instead of
  letting the broad except mask it as a generic structured-response failure.
- Classify the budget hit distinctly: _classify_none_result recognises the
  >= EXTRACT_AGENT_REQUEST_LIMIT model-response fingerprint and records
  failure_mode=usage_limit_exceeded (new NONE_RESULT_USAGE_LIMIT constant),
  checked before tool-loop detection since a budget hit is the more precise
  diagnosis. Operators can now tell a too-tight budget from a runaway loop.
- Fix usage_limits=None footgun: an explicit None from a caller no longer
  disables the cap; the default applies whenever the resolved value is None.
- Hoist the prompt_sanitization import in data_extract_tasks to module level.
- Tests for the budget-exhaustion classification, its boundary, final-result
  precedence, and the new failure message.
@claude

claude Bot commented Jun 27, 2026

Copy link
Copy Markdown

Code Review

This PR fixes three real extraction-pipeline bugs (dropped column constraints, runaway tool-looping, inability to confirm clause absence in short docs) and the benchmark numbers make a compelling case that all three were material. The structure is clean — new constants in the right files, a named failure mode for budget exhaustion, and regression tests that pin the prompt-wiring behavior without hitting an LLM. Four issues worth addressing before merge:


1. Column constraint fields are injected raw — no fencing, no untrusted-content notice (security)

opencontractserver/tasks/data_extract_tasks.py lines 431–443

The full document text (lines 484–490) is correctly wrapped with UNTRUSTED_CONTENT_NOTICE + fence_user_content. The column constraint fields are not:

column_constraints.append(f"Additional instructions: {column.instructions}")
# ...
column_constraints.append(f"Only extract data from sections that contain the text: '{column.must_contain_text}'.")

Any authenticated user can set column.instructions, column.must_contain_text, and column.limit_to_label via CreateColumn / UpdateColumnMutation — both are guarded by @login_required only, with no staff/superuser check. Because these fields are injected before the fenced document block, they land in the instruction section of the prompt at higher positional authority than the document text. A value like "\n\nIgnore all previous instructions..." in column.instructions is a textbook prompt-injection vector, and the sanitization utilities that exist in prompt_sanitization.py (fence_user_content, sanitize_for_prompt_strict) are unused here despite being used one block later.


2. warn_if_content_large fires on virtually every short-document injection (log noise / semantic mismatch)

opencontractserver/tasks/data_extract_tasks.py line 478

UNTRUSTED_CONTENT_SIZE_WARNING_THRESHOLD = 1000 chars (opencontractserver/constants/moderation.py:12) but EXTRACT_FULL_TEXT_CHAR_LIMIT = 24_000 chars. Any document between 1,001 and 24,000 characters — the vast majority of real contracts — will emit a [PromptInjection] WARNING log on every extraction cell. An extract with 20 columns over 100 documents produces ~2,000 spurious [PromptInjection] warnings, burying the signal that warn_if_content_large is meant to surface.

warn_if_content_large was designed for short user-supplied field values (titles, messages). Document body text at up to 24 K chars is the expected input here, not an anomaly. The call should either be removed (the document text is already fenced and prefixed with UNTRUSTED_CONTENT_NOTICE) or its threshold documented as inapplicable in this context.


3. Full document text is read unconditionally before the length guard (wasted I/O for long docs)

opencontractserver/tasks/data_extract_tasks.py lines 463–477

full_text = await sync_to_async(read_field_file_text)(
    document.txt_extract_file, errors="replace"      # full file read into memory
)
# ...
if full_text and len(full_text) <= EXTRACT_FULL_TEXT_CHAR_LIMIT:   # only then check

read_field_file_text does a plain f.read() with no size limit. For documents above 24 K characters the entire file is fetched from object storage and immediately discarded. Django's FieldFile exposes a .size attribute (bytes); a pre-check like document.txt_extract_file.size <= EXTRACT_FULL_TEXT_CHAR_LIMIT * 4 before the read would eliminate the wasted round-trip for large documents.


4. _classify_none_result is coupled to the default budget constant (latent misclassification)

opencontractserver/tasks/data_extract_tasks.py line 130

if model_response_count >= EXTRACT_AGENT_REQUEST_LIMIT:   # hardcoded 20
    return NONE_RESULT_USAGE_LIMIT

_structured_response_raw applies EXTRACT_AGENT_REQUEST_LIMIT as a default that any caller can override by passing usage_limits=UsageLimits(request_limit=N). If a future caller sets a lower limit (e.g. 5), pydantic-ai raises UsageLimitExceeded after 5 responses, but _classify_none_result sees 5 < 20 and falls through to NONE_RESULT_TOOL_LOOP or NONE_RESULT_NO_FINAL — a wrong diagnosis written to Datacell.stacktrace. The inverse holds for a higher custom limit. No current caller exercises this, so it is latent today, but the same EXTRACT_AGENT_REQUEST_LIMIT comment in _structured_response_raw explicitly invites callers to override it. Passing the actual request_limit used into _classify_none_result, or annotating the None return with the exception type before returning, would decouple them.


Minor note: the UsageLimitExceeded warning log at pydantic_ai_agents.py:1843 always prints EXTRACT_AGENT_REQUEST_LIMIT as the limit, which would also be misleading if a caller supplies a different UsageLimits — same root cause as finding 4.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant