Fix structured-data extraction: restore dropped column constraints, cap tool-looping, full-text for short docs#2070
Fix structured-data extraction: restore dropped column constraints, cap tool-looping, full-text for short docs#2070JSv4 wants to merge 5 commits into
Conversation
…ap tool-looping, full-text for short docs An end-to-end eval of the extraction pipeline over 50 Fort Worth contracts surfaced three functional bugs. On a 15-doc commercial-diligence subset the fixes move objective accuracy 27% -> 90%, failed cells 11 -> 0, and LLM tool calls ~16 -> ~1 per cell. 1. Column.instructions / must_contain_text / limit_to_label were silently dropped by the marvin->doc-agent rewrite (184903f): persisted and GraphQL-settable but never fed to the extraction LLM. Re-wired into the prompt; the dead get_column_extraction_params helper (zero callers) removed. 2. No request budget let a weak model run away making up to ~100 similarity_search calls on a hard/absent value (failure_mode= tool_loop_no_output, 770KB logs, ~100s/cell). Added UsageLimits(request_limit=EXTRACT_AGENT_REQUEST_LIMIT=20) on the structured run. 3. Retrieval cannot prove a clause is absent, so on short contracts the agent looped. For documents <= EXTRACT_FULL_TEXT_CHAR_LIMIT (50K chars), inject the full (fenced) document text so the agent answers -- and confirms absence -- in one read. Retrieval remains the path for longer documents. Tests: opencontractserver/tests/test_extract_prompt_wiring.py asserts the three constraint fields and the full text all reach the prompt the agent runs.
Full-contract clause-selector validationThe 90% headline was measured on a renewal-notice subset that (it turned out) doesn't contain the diligence clauses. To validate the clause selectors on documents that do contain them, I ran the same fixed pipeline (
100% on the clause selectors when the documents contain the relevant information — confirming the renewal-notice gaps were a data characteristic, not a pipeline issue. Across both subsets the fixed pipeline ran with 0 failures and ~1 tool call/cell. |
Code ReviewThis PR fixes three real extraction bugs that were killing quality on short municipal contracts: dropped column constraints, runaway tool-looping, and the agent's inability to confirm clause absence via retrieval alone. The fixes are well-motivated, the test coverage is appropriate, and the changelog fragments are correct. Three findings worth addressing before merge: 1. File handle leak in
|
…reconcile prompt Adversarial review of the diff surfaced refinements (all confirmed): - Request budget is now applied via run_kwargs.setdefault() instead of a hardcoded explicit arg, so a caller may override it (usage_limits is whitelisted in _run_accepted) and the explicit value can never collide with a pass-through one into a duplicate-keyword TypeError (which the broad except would mask as a silent None). Documented as a default ceiling for all structured runs (tightening pydantic-ai's own default of 50), overridable per-call for e.g. a future corpus-wide structured analytic. - Full-text load reuses the canonical read_field_file_text() helper instead of an inline reimplementation, and logs storage errors instead of silently swallowing them. - Injected full text is prefixed with UNTRUSTED_CONTENT_NOTICE and passed through warn_if_content_large(); the prompt explicitly relaxes the system prompt's mandatory multi-search negative-case rule when the whole document is in context. - Lowered EXTRACT_FULL_TEXT_CHAR_LIMIT 50K -> 24K (~6K tokens) to bound redundant per-column token cost and stay well inside the context budget. - Clarified must_contain_text / limit_to_label are advisory prompt guidance, not enforced retrieval filters.
Review + verificationRan an adversarial review of the diff (15 confirmed findings) and addressed the substantive ones in
Tests: 3 new ( |
Code ReviewThis PR fixes three real extraction bugs — column constraints silently ignored since the marvin→doc-agent rewrite, runaway tool-looping with no request budget, and short-document clause-absence loops — and backs each fix with a regression test. The measured accuracy jump (27% → 90%) and the test coverage for the wiring changes are both solid. One confirmed bug introduced by the budget-cap fix, one low-priority footgun. 1.
|
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
doc_extract_query_task now appends per-column constraints and (for short docs) the full document text to the agent prompt, so the prompt is no longer exactly task.query. The benchmark integration mock looked up its canned answer by exact prompt equality, returning '' and dropping answer_exact_match to 0.0. Match the canned answer by the query the prompt starts with (exact lookup first).
- Catch UsageLimitExceeded explicitly in _structured_response_raw and log it as a named condition (mirroring the streaming chat() path) instead of letting the broad except mask it as a generic structured-response failure. - Classify the budget hit distinctly: _classify_none_result recognises the >= EXTRACT_AGENT_REQUEST_LIMIT model-response fingerprint and records failure_mode=usage_limit_exceeded (new NONE_RESULT_USAGE_LIMIT constant), checked before tool-loop detection since a budget hit is the more precise diagnosis. Operators can now tell a too-tight budget from a runaway loop. - Fix usage_limits=None footgun: an explicit None from a caller no longer disables the cap; the default applies whenever the resolved value is None. - Hoist the prompt_sanitization import in data_extract_tasks to module level. - Tests for the budget-exhaustion classification, its boundary, final-result precedence, and the new failure message.
Code ReviewThis PR fixes three real extraction-pipeline bugs (dropped column constraints, runaway tool-looping, inability to confirm clause absence in short docs) and the benchmark numbers make a compelling case that all three were material. The structure is clean — new constants in the right files, a named failure mode for budget exhaustion, and regression tests that pin the prompt-wiring behavior without hitting an LLM. Four issues worth addressing before merge: 1. Column constraint fields are injected raw — no fencing, no untrusted-content notice (security)
The full document text (lines 484–490) is correctly wrapped with column_constraints.append(f"Additional instructions: {column.instructions}")
# ...
column_constraints.append(f"Only extract data from sections that contain the text: '{column.must_contain_text}'.")Any authenticated user can set 2.
|
Summary
An end-to-end evaluation of the structured-data extraction pipeline (Fieldset / Column /
Extract / Datacell) over 50 real Fort Worth municipal contracts surfaced three functional
bugs that severely degraded extraction quality and reliability. This PR fixes all three.
Measured on a 15-doc subset with an 8-field commercial-diligence fieldset (counterparty,
effective/expiration date, contract value, governing law, indemnification, MFN,
termination-for-convenience):
tool_loop_no_output)Bugs fixed
1.
Column.instructions/must_contain_text/limit_to_labelsilently droppedThese three
Columnfields are GraphQL-settable, persisted, copied on clone, and diffed — butthe marvin→doc-agent rewrite (commit
184903f62) stopped feeding them to the extraction LLM,so per-column guidance/scoping had zero effect. The
get_column_extraction_paramshelperthat still "returned"
instructionshad zero callers (dead code).Fix:
doc_extract_query_taskfolds all three back into the prompt; dead helper removed.2. Runaway agent tool-looping (no request budget)
With no
usage_limits, a weak model (e.g.gpt-4o-mini) would make up to 100similarity_searchcalls on a hard/absent value before pydantic-ai gave up — 770 KB messagelogs, ~100 s/cell, surfacing as
failure_mode=tool_loop_no_output.Fix:
_structured_response_rawpassesUsageLimits(request_limit=EXTRACT_AGENT_REQUEST_LIMIT)(=20). A capable model commits well within it; the cap bounds the pathological case.
3. Short documents: agent can't confirm clause absence via retrieval → loops
Retrieval can't prove a clause is absent; for short contracts the agent searched endlessly
looking for clauses that weren't there.
Fix: for documents ≤
EXTRACT_FULL_TEXT_CHAR_LIMIT(24 K chars),doc_extract_query_taskinjects the full (fenced) document text so the agent answers — and confirms absence — in one
read. Cut tool calls ~16 → ~1 per cell and eliminated all failures. Retrieval remains the
path for longer documents.
Tests
opencontractserver/tests/test_extract_prompt_wiring.py— asserts the three constraint fieldsand the full-text both reach the prompt the agent runs (agent patched; no LLM hit).
Changelog
changelog.d/extract-column-constraints.fixed.md,extract-tool-budget.fixed.md,extract-fulltext-short-docs.changed.md.Notes / follow-ups
absence (Bump actions/setup-node from 3.4.1 to 3.5.1 #3); Bump postgres from 14.5 to 15.0 in /compose/production/postgres #1 is a real but quieter feature-loss.
date/datetimeoutput type (aCreateColumnwith
output_type="date"crashes the cell inparse_model_or_primitive); legitimatelyabsent values are recorded as
failedcells (the FE renders them as errors); a genericreturn Noneon any structured-run exception can mislabel a real failure asagent_committed_none. These are documented for follow-up.