Commit 9521b68

neoneye and claude committed
napkin-math(compress): tighten token-overlap fallback to require every quote token (was 90%)
Code review on PR #744 noted the 90% threshold lets a long quote pass with one substituted content word (13/14 overlap), so 'highest'/'lowest' inversions in a long quote could verify even though they invert meaning. Tighten to require all quote tokens to appear in the source — the digit-bearing anchor is subsumed by the all-tokens rule, so it is removed.

Empirically (1366 scored candidates across 6 plans, 8 compress runs): 0 items lose qv=True under the tightening. The 90% rule was never functionally looser than the 100% rule on observed data; the LLM's paraphrases are either full token-overlap (reordering/elision) or hit the substring fast path. Tightening removes the theoretical false-positive surface at no empirical cost.

Adds a long-quote substitution test (14 tokens with one 'highest'/'lowest' swap) to lock the new bound in.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
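The arithmetic behind the review comment can be checked directly. The sketch below is illustrative only: both helper names are hypothetical and do not appear in the repository. It computes the overlap fraction left after a single substituted token and finds the shortest quote for which that fraction still clears a fractional threshold.

```python
from fractions import Fraction


def overlap_with_one_substitution(n_tokens: int) -> Fraction:
    """Fraction of quote tokens still found in the source when exactly one
    of n_tokens was swapped for a word the source does not contain."""
    return Fraction(n_tokens - 1, n_tokens)


def shortest_quote_that_slips_through(threshold: Fraction) -> int:
    """Smallest quote length at which one substitution still meets a
    fractional overlap threshold (hypothetical helper, illustration only)."""
    if threshold >= 1:
        # A 100% threshold can never be met with a missing token.
        raise ValueError("no quote length passes with a missing token")
    n = 1
    while overlap_with_one_substitution(n) < threshold:
        n += 1
    return n


old_threshold = Fraction(9, 10)  # the removed QUOTE_MATCH_OVERLAP_THRESHOLD

print(shortest_quote_that_slips_through(old_threshold))    # 10
print(overlap_with_one_substitution(14) >= old_threshold)  # True  (13/14)
print(overlap_with_one_substitution(7) >= old_threshold)   # False (6/7)
```

Under these assumptions, any quote of ten or more tokens could carry one substituted word past the 90% rule, while the seven-token case covered by the existing test could not. Exact fractions are used so the boundary case 9/10 is compared without floating-point rounding.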
1 parent 4242687 commit 9521b68

2 files changed

Lines changed: 35 additions & 17 deletions

worker_plan/worker_plan_internal/parameter_extraction/compress_report_section.py

Lines changed: 8 additions & 14 deletions
@@ -920,7 +920,6 @@ def normalise_for_quote_match(text: str) -> str:
 
 WORD_TOKEN_PATTERN: re.Pattern[str] = re.compile(r"\w+", re.UNICODE)
 
-QUOTE_MATCH_OVERLAP_THRESHOLD: float = 0.9
 QUOTE_MATCH_MIN_TOKENS: int = 3
 
 
@@ -929,8 +928,7 @@ def tokenize_for_quote_match(text: str) -> list[str]:
 
     Language- and domain-neutral: it splits on whatever the Unicode word
     class considers a word character. Numeric tokens like ``$75,000`` split
-    into ``["75", "000"]`` consistently in both quote and source, so digit
-    anchoring still works.
+    into ``["75", "000"]`` consistently in both quote and source.
     """
     return WORD_TOKEN_PATTERN.findall(normalise_for_quote_match(text))
 
@@ -940,12 +938,12 @@ def quote_is_in_source(quote: str, section_markdown: str) -> bool:
 
     Fast path is the existing substring check after normalisation. When the
     LLM paraphrases (drops intermediate words, reorders the noun phrase),
-    that fast path misses even though the quote is faithful to the source.
-    A token-overlap fallback catches those cases without letting hallucinated
-    numbers or substituted content words through: every digit-bearing token
-    in the quote must appear in the source, and at least
-    ``QUOTE_MATCH_OVERLAP_THRESHOLD`` of all quote tokens must appear in the
-    source token set.
+    that fast path misses even though every content token came from the
+    source. The fallback requires every quote token to appear in the source
+    token set, which accepts reordering and elision but rejects any
+    substituted word — including numeric substitutions, since digit-bearing
+    tokens fall under the same all-tokens rule. A short-quote floor avoids
+    trivial overlap on a large source.
     """
     if not quote:
         return False
@@ -959,11 +957,7 @@ def quote_is_in_source(quote: str, section_markdown: str) -> bool:
     if len(quote_tokens) < QUOTE_MATCH_MIN_TOKENS:
         return False
     source_tokens = set(tokenize_for_quote_match(section_markdown))
-    for tok in quote_tokens:
-        if any(ch.isdigit() for ch in tok) and tok not in source_tokens:
-            return False
-    matched = sum(1 for tok in quote_tokens if tok in source_tokens)
-    return matched / len(quote_tokens) >= QUOTE_MATCH_OVERLAP_THRESHOLD
+    return all(tok in source_tokens for tok in quote_tokens)
 
 
 def numeric_density_bonus(text: str) -> float:
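For readers without the full file, the following is a minimal, self-contained sketch of the behaviour the new docstring describes. It is not the repository's implementation. The body of `normalise_for_quote_match` and the lines between the two hunks are not shown in this diff, so `normalise` below (lowercase plus whitespace collapse) and the exact placement of the substring fast path are assumptions, and `quote_in_source_sketch` is a hypothetical name.

```python
import re

WORD = re.compile(r"\w+", re.UNICODE)
MIN_TOKENS = 3  # mirrors QUOTE_MATCH_MIN_TOKENS from the diff


def normalise(text: str) -> str:
    # ASSUMPTION: stand-in for normalise_for_quote_match, whose body is
    # not part of this diff. Lowercases and collapses whitespace.
    return " ".join(text.lower().split())


def tokenize(text: str) -> list[str]:
    return WORD.findall(normalise(text))


def quote_in_source_sketch(quote: str, source: str) -> bool:
    if not quote:
        return False
    # Fast path: verbatim substring after normalisation.
    if normalise(quote) in normalise(source):
        return True
    quote_tokens = tokenize(quote)
    # Short-quote floor: too easy to satisfy by coincidence.
    if len(quote_tokens) < MIN_TOKENS:
        return False
    source_tokens = set(tokenize(source))
    # All-tokens rule: one missing token rejects the quote.
    return all(tok in source_tokens for tok in quote_tokens)


source = "The lowest qualified bid exceeds $75,000 for the vendor."

print(tokenize("$75,000"))                                              # ['75', '000']
print(quote_in_source_sketch("bid lowest qualified exceeds", source))   # True  (reordering)
print(quote_in_source_sketch("lowest bid exceeds $75,000", source))     # True  (elision)
print(quote_in_source_sketch("lowest qualified bid exceeds $85,000", source))  # False ('85' missing)
print(quote_in_source_sketch("vendor bid", source))                     # False (below the floor)
```

Because membership is tested against a set, neither token order nor repetition is compared, which is what lets reordered and elided paraphrases through. The numeric case needs no separate digit check: `85` is simply a token the source does not contain.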

worker_plan/worker_plan_internal/parameter_extraction/tests/test_compress_report_section.py

Lines changed: 27 additions & 3 deletions
@@ -510,9 +510,9 @@ def test_quote_is_in_source_rejects_hallucinated_number() -> None:
 
 
 def test_quote_is_in_source_rejects_substituted_content_word() -> None:
-    """Token-overlap threshold (90%) blocks single-word substitutions that
-    invert meaning while keeping most surface tokens. ``highest`` is not in
-    the source, so a six-other-tokens overlap fails."""
+    """All-tokens-in-source rule blocks single-word substitutions that invert
+    meaning while keeping most surface tokens. ``highest`` is not in the
+    source, so even a six-other-tokens overlap fails."""
     from worker_plan_internal.parameter_extraction.compress_report_section import (
         quote_is_in_source,
     )
@@ -524,6 +524,30 @@ def test_quote_is_in_source_rejects_substituted_content_word() -> None:
     ) is False
 
 
+def test_quote_is_in_source_rejects_substitution_in_long_quote() -> None:
+    """A longer quote with one substituted content word must still fail —
+    high fractional overlap is not a free pass. The all-tokens rule rejects
+    on the single missing token regardless of quote length, which a
+    fractional threshold like ≥90% would let through."""
+    from worker_plan_internal.parameter_extraction.compress_report_section import (
+        quote_is_in_source,
+    )
+
+    source = (
+        "If the lowest qualified bid for OPC UA middleware exceeds $75,000, "
+        "then the project reverts to the current rule-based integration "
+        "vendor and escalates to the steering committee."
+    )
+    # Same 14-token clause; only ``lowest`` swapped for ``highest``. 13/14 of
+    # the tokens still appear in source, but the substituted word inverts
+    # the meaning and must not verify.
+    quote = (
+        "highest qualified bid for OPC UA middleware exceeds $75,000 "
+        "then project reverts to integration vendor"
+    )
+    assert quote_is_in_source(quote, source) is False
+
+
 def test_quote_is_in_source_rejects_short_unrelated_quote() -> None:
     """Two- or one-token quotes do not get the token-overlap fallback —
     too easy to satisfy by coincidence on a large source."""
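A quick way to confirm that the long-quote test is a genuine regression guard is to run its strings through both rules side by side. The sketch below re-implements the removed 90% rule (with its digit anchor) and the new all-tokens rule from the diff. It assumes normalisation amounts to lowercasing, since the real `normalise_for_quote_match` is not shown. Under that assumption a plain `\w+` split yields 16 quote tokens rather than the 14 mentioned in the test comment, because `$75,000` splits in two; the real normaliser may count differently, and the conclusion is the same either way.

```python
import re

WORD = re.compile(r"\w+", re.UNICODE)


def tokens(text: str) -> list[str]:
    # ASSUMPTION: lowercasing stands in for normalise_for_quote_match.
    return WORD.findall(text.lower())


source = (
    "If the lowest qualified bid for OPC UA middleware exceeds $75,000, "
    "then the project reverts to the current rule-based integration "
    "vendor and escalates to the steering committee."
)
quote = (
    "highest qualified bid for OPC UA middleware exceeds $75,000 "
    "then project reverts to integration vendor"
)

quote_tokens = tokens(quote)
source_set = set(tokens(source))
missing = [tok for tok in quote_tokens if tok not in source_set]
matched = len(quote_tokens) - len(missing)

# Removed rule: digit-bearing tokens must match, plus >= 90% overall overlap.
digits_ok = all(
    tok in source_set
    for tok in quote_tokens
    if any(ch.isdigit() for ch in tok)
)
old_rule = digits_ok and matched / len(quote_tokens) >= 0.9

# New rule: every quote token must appear in the source token set.
new_rule = not missing

print(len(quote_tokens), matched, missing)  # 16 15 ['highest']
print(old_rule, new_rule)                   # True False
```

With 15 of 16 tokens matching (about 94%), the removed rule would have verified the inverted quote, while the all-tokens rule rejects it on the single missing token `highest`.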
