feat(detect): chunked validation with validator pools#126
feat(detect): chunked validation with validator pools#126lipikaramaswamy merged 20 commits intomainfrom
Conversation
Follow-ups from code review on #126. No user-facing behaviour change; the happy path is unchanged and all tests in the affected modules pass. Source hardening (chunked_validation.py): - Replace `assert last_exc is not None` in `_dispatch_chunk` with an explicit `if`+`RuntimeError` guard. `assert` is stripped under `python -O`, which would turn a caller-contract violation (empty `facades`) into a confusing `TypeError: exceptions must derive from BaseException` from `raise None`. The new guard survives `-O` and names the precondition being violated. - Drop the redundant `max_entities_per_call <= 0` check in `chunk_candidates`. Positivity is already enforced at two upstream layers (`AnonymizerDetectConfig.validation_max_entities_per_call` and `ChunkedValidationParams.max_entities_per_call`, both `Field(gt=0)`); the inner check was defence-in-depth at a layer below the validated boundary. - Widen `chunk_candidates` input from `list[...]` to `Sequence[...]` and normalize the return to `list(...)`, so the declared `list[list[tuple[...]]]` return type is honest regardless of whether a list, tuple, or other sequence is passed. Previously tuple input silently produced `list[tuple[tuple[...]]]`. Documentation (models.py): - Expand `normalize_entity_validator`'s docstring to explicitly document tuple acceptance (previously undocumented behaviour that matched Pydantic v2's default `list[str]` coercion but wasn't called out). Test coverage: - `test_tuple_input_returns_list_of_lists` replaces the now-obsolete `test_non_positive_limit_raises`; pins the sequence-input / list-output contract. - `test_tuple_coerced_to_list` on `DetectionModelSelection` pins the `entity_validator=(...)` path. - `test_system_prompt_is_forwarded_to_facade` and `test_system_prompt_default_none_is_forwarded_untouched` close the untested-forwarding gap on `ChunkedValidationParams.system_prompt`. The field is plumbed but not populated by any production caller today; tests pin the contract for when it's wired up (tracked in #127). Made-with: Cursor
andreatgretel
left a comment
There was a problem hiding this comment.
Additional comments on unchanged lines:
src/anonymizer/engine/ndd/model_loader.py:95 - str(value) in load_workflow_selections will silently mangle a list - now that entity_validator supports list-form, this becomes a trap if someone adds a pool to the bundled detection.yaml. maybe preserve native types here (return {str(key): value for ...}) or at least add a comment warning about it?
src/anonymizer/engine/detection/custom_columns.py:138 - build_validation_skeleton is no longer used in the detection pipeline (replaced by build_chunk_skeleton in the chunked path). worth removing to avoid confusion about which skeleton builder is authoritative?
Allows users to declare `entity_validator` as either a single alias (current behavior) or a list of aliases that form a validator pool. Scalars normalize to single-element lists so every consumer sees `list[str]`. Adds `validation_max_entities_per_call` (default 100) and `validation_excerpt_window_chars` (default 500) to Detect, which commit 2 will consume when it replaces the single validation LLM call with a chunked custom column that rotates across the pool. Behavior is unchanged in this commit: the workflow still issues one validation call per row using pool[0]. Adds `resolve_model_aliases()` helper for list-valued roles and updates `validate_model_alias_references` to flag unknown aliases in a pool by index. Made-with: Cursor
Replaces the single per-row validation LLM call with a chunked custom
column that partitions each row's validation candidates, dispatches
each chunk to one alias from the configured `entity_validator` pool
(round-robin by chunk index), and merges the decisions back into the
shape `apply_validation_decisions` already expects. Single-alias pools
behave identically to the prior path; multi-alias pools distribute
calls to bypass per-alias TPM/RPM caps.
Chunk size and the per-chunk excerpt window are driven by the Detect
config knobs added in commit 1 (`validation_max_entities_per_call`,
`validation_excerpt_window_chars`) and threaded through
`Anonymizer -> EntityDetectionWorkflow.run`. Per-row tag notation is
resolved once against the full text and forced onto every chunk so
excerpts stay internally consistent; `build_tagged_text` gains an
optional `notation=` override for that purpose.
Implementation uses data_designer's `CustomColumnConfig` with
`model_aliases=pool`, letting DD inject a `{alias: ModelFacade}` dict
and giving us its retry/throttle behavior for free. A terminal LLM
error in any chunk fails the whole row (surfaced as a `FailedRecord`);
decisions for ids outside the candidate set are dropped to mirror
the single-call path's `enrich_validation_decisions` filter.
Emits a warning when the pool has more than one alias noting that
`max_parallel_requests` is enforced per alias, so the pool multiplies
total in-flight validator calls.
Made-with: Cursor
Adds a "Chunked validation" section to the detection concept doc covering the two new Detect knobs (`validation_max_entities_per_call`, `validation_excerpt_window_chars`), when chunking kicks in, and how to tune in either direction. Adds a "Validator pools" subsection to the models doc demonstrating list-form `entity_validator:` YAML, with a warning admonition calling out that `max_parallel_requests` is enforced per alias (so a pool multiplies total in-flight calls) and that pool aliases should target equivalent models. Cross-links the two sections. Rate-limit jargon (TPM/RPM) is expanded inline on first use in each page since a reader can enter from either; the operator-facing log warning in detection_workflow still uses the acronyms. Also adds a behavioral regression test pinning single-chunk vs multi-chunk parity: same deterministic per-id decisions, pool-of-one, three chunks vs one chunk must produce byte-identical merged `COL_VALIDATION_DECISIONS` and byte-identical post-`apply_validation_decisions` entities, with the expected keep/reclass/drop/untouched tallies and the final entity list pinned. Made-with: Cursor
Follow-ups from code review on #126. No user-facing behaviour change; the happy path is unchanged and all tests in the affected modules pass. Source hardening (chunked_validation.py): - Replace `assert last_exc is not None` in `_dispatch_chunk` with an explicit `if`+`RuntimeError` guard. `assert` is stripped under `python -O`, which would turn a caller-contract violation (empty `facades`) into a confusing `TypeError: exceptions must derive from BaseException` from `raise None`. The new guard survives `-O` and names the precondition being violated. - Drop the redundant `max_entities_per_call <= 0` check in `chunk_candidates`. Positivity is already enforced at two upstream layers (`AnonymizerDetectConfig.validation_max_entities_per_call` and `ChunkedValidationParams.max_entities_per_call`, both `Field(gt=0)`); the inner check was defence-in-depth at a layer below the validated boundary. - Widen `chunk_candidates` input from `list[...]` to `Sequence[...]` and normalize the return to `list(...)`, so the declared `list[list[tuple[...]]]` return type is honest regardless of whether a list, tuple, or other sequence is passed. Previously tuple input silently produced `list[tuple[tuple[...]]]`. Documentation (models.py): - Expand `normalize_entity_validator`'s docstring to explicitly document tuple acceptance (previously undocumented behaviour that matched Pydantic v2's default `list[str]` coercion but wasn't called out). Test coverage: - `test_tuple_input_returns_list_of_lists` replaces the now-obsolete `test_non_positive_limit_raises`; pins the sequence-input / list-output contract. - `test_tuple_coerced_to_list` on `DetectionModelSelection` pins the `entity_validator=(...)` path. - `test_system_prompt_is_forwarded_to_facade` and `test_system_prompt_default_none_is_forwarded_untouched` close the untested-forwarding gap on `ChunkedValidationParams.system_prompt`. The field is plumbed but not populated by any production caller today; tests pin the contract for when it's wired up (tracked in #127). Made-with: Cursor
Four inline prose touches in the chunked-validation module. No code or
test behaviour changes; only doc/comment rewording so the source reads
as a standalone spec rather than a narrated diff against prior versions.
- `chunk_candidates` docstring: state the ``max_entities_per_call > 0``
precondition positively, point at the two upstream enforcement layers,
drop the "so we do not re-check it here" coda.
- `_dispatch_chunk` defensive-guard comment: keep the substantive reasons
(empty-facades precondition, loud named error vs. ``raise None``, guard
survives ``python -O``) and drop the assert-vs-if comparison that only
makes sense in the context of the review exchange that introduced it.
- `_dispatch_chunk` docstring: replace "matching the pre-failover
contract" with a direct statement of the single-alias-pool behaviour
("no alternate alias to try").
- `test_system_prompt_default_none_is_forwarded_untouched` comment:
tighten the "accidentally inject a placeholder" framing to just state
the current recipe contract and what the test pins.
Made-with: Cursor
…m YAML Addresses two review findings on the validator-pool config path. normalize_entity_validator now collapses duplicate aliases to the first occurrence (order preserved) and logs a warning. A duplicate would burn a failover attempt on an already-exhausted endpoint -- almost certainly not what the user intended. Empty pools still raise. load_workflow_selections preserves list-valued roles as list[str] instead of stringifying them. Stringifying would silently collapse a pool YAML block into a single garbled alias repr, which Pydantic would then treat as one alias. get_model_alias now raises TypeError when called on a list-valued role with a pointer to resolve_model_aliases for the pool shape. Tests cover dedup, the no-dup-no-warning path, list-value preservation, and the get_model_alias type error. Made-with: Cursor
build_validation_skeleton has been dead since chunked validation moved skeleton construction inline into chunked_validate_row. No internal callers remain; its tests go with it. Made-with: Cursor
…che Jinja template Drops asyncio.run / asyncio.to_thread / asyncio.gather from the chunked validation path. _dispatch_chunk and chunked_validate_row become sync defs; chunked_validate_row dispatches chunks via a ThreadPoolExecutor and calls facade.generate() directly. Per-alias concurrency is still enforced downstream by each facade's ThrottledModelClient, so the pool exists purely to overlap this row's chunks. Under DataDesigner's async engine the sync calls are transparently bridged to agenerate by the DD runtime (DD#545), so the code path stays engine-agnostic. Test call sites lose their asyncio.run(...) wrappers. A TODO(async-native) comment in the module docstring flags the follow-up migration once the async engine becomes the DD default. Wraps _compile_template in functools.lru_cache(maxsize=4) so a row with N chunks parses the Jinja source once instead of N times. Folds in the post-#119 column rename for this module (COL_MERGED_TAGGED_TEXT -> COL_SEED_TAGGED_TEXT, COL_VALIDATION_CANDIDATES -> COL_SEED_VALIDATION_CANDIDATES, prompt placeholder _merged_tagged_text -> _seed_tagged_text) that the rebase resolved in neighbouring modules but never applied here. Module docstring gains two follow-up-deferred paragraphs explaining (a) why per-instance validation is intentional -- dedup by (value, label) was considered and rejected because it conflates surface form with meaning -- and (b) how peak prompt memory scales with chunk count, with pointers to the orthogonal cost levers to pull if pressure shows up in a real workload. Both are tracked as separate follow-up issues. Made-with: Cursor
cfc8370 to
f093b99
Compare
The rejected-dedup rationale previously used "Apple"/"May" examples, which are actually *different* (value, label) keys -- so dedup would not affect them and the examples didn't demonstrate the concern. The real failure case is same (value, label) carrying different correctness across a document, which happens for short names that collide with common English words: "Ira" / "irate" (substring false positive), "Mark" / "mark", "Bill" / "bill", "Hope" / "hope". Validator would keep one and drop the other; dedup broadcasts one decision across both. Made-with: Cursor
model_copy(update=...) in Pydantic v2 silently skips field validators, so user overrides to selected_models bypassed DetectionModelSelection.normalize_entity_validator entirely: - entity_validator: [] -> accepted as [] (should raise) - entity_validator: [a, a, b] -> not deduped, no warning - entity_validator: [" a ", ""] -> whitespace not stripped Fix routes overrides through type(section).model_validate(merged) in _merge_selections so every field validator re-runs at parse time. Applied uniformly to detection/replace/rewrite to avoid leaving the same landmine for any validators we add to Replace or Rewrite later. Regression tests pin all three failure modes against parse_model_configs itself, which is the entry point the bug was reproduced against. Made-with: Cursor
…dule-level functions Matches the majority style of test_model_loader.py. Class-grouped shared-rationale docstring becomes a short block comment above the three functions. Made-with: Cursor
andreatgretel
left a comment
There was a problem hiding this comment.
LGTM! Two minor follow-ups:
-
merge_chunk_decisions:seen.add()runs before null-decision filtering - a null entry can shadow a valid decision for the same ID from another chunk. Unlikely since candidates partition cleanly, but the insertion order is technically wrong. -
Single-chunk rows now go through
build_chunk_excerpt()with a 500-char window instead of getting the full tagged text. A fast path forlen(chunks) == 1would preserve parity with the old behavior.
|
smoke_chunked_validation.py |
…Schema Mirrors the _coerce_proposed_label validator already on ValidationDecisionSchema. The chunked-validation path (added in #126, merged into this PR via ca08fb2) routes LLM output through RawValidationDecisionSchema instead of the wire-loose ValidationDecisionSchema, so the existing coercer didn't apply on that path. Failure mode (gemma4-e4b on the post-merge bench): model emits ``proposed_label: null`` for ``decision: keep`` decisions; pydantic rejects the entire chunk's records with ``string_type`` errors; DataDesigner drops the row. On a 5-note × 2-mode bench across 5 candidate models, this caused gemma4-e4b's success rate to crater to 1/5 redact + 2/5 rewrite even though the model otherwise delivers best-in-class quality (util 0.96 / leak 0.03 on the rows that survived). Tests cover null and int-typed proposed_label drift. 501 tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Replaces the single per-row entity-validation LLM call with a chunked dispatch that partitions each row's candidates and rotates them across a configurable pool of validator aliases. This unblocks rows with many candidates (which can exceed the validator's context window or the provider's rate limits as a single call), gives users a quota-aggregation knob, and hardens validation against transient provider issues — all without changing defaults or single-validator behaviour.
Detectconfig knobs.validation_max_entities_per_call(default100) bounds per-call candidate count;validation_excerpt_window_chars(default500) bounds the text context per chunk. Rows under the candidate limit dispatch as one call, exactly equivalent to the prior path.entity_validatoraccepts list-form. A single alias remains valid (normalized to a one-element list); multiple aliases form a validator pool. Pools now serve two jobs:entity_validatorrequired splittinganonymizer.engine.ndd.model_loader.resolve_model_aliasinto a scalar helper (unchanged for the 12 still-scalar roles) and a new plural helperresolve_model_aliases()that returnslist[str]for list-valued roles and wraps scalars in a one-element list for uniformity.resolve_model_alias("entity_validator", ...)now raisesTypeErrorrather than silently returning a list. All in-repo callers are updated. These helpers are not re-exported fromanonymizer.__init__and are not part of the public package surface, so no supported external consumers are affected; theTypeErroris a loud, trivially-triagable signal for anyone who had imported the private path directly.result.failed_recordswithstep="detection", not as silent passthrough with un-scrubbed text.anonymizer validatereports unknown pool aliases by index.CustomColumnConfigin a newchunked_validationmodule: position-ordered chunking, per-chunk text excerpts with forced-consistent tag notation per row, per-chunk Jinja prompt rendering, dispatch through syncModelFacade.generate(wrapped inasyncio.to_threadfor per-row chunk concurrency) +PydanticResponseRecipe, and merged output back intoCOL_VALIDATION_DECISIONS. The sync facade path is required because DataDesigner's default thread-pool engine hands custom columns sync-modeHttpModelClientfacades whoseagenerateraises; it is also forward-compatible with the async engine after Anonymizer#119 + DD#545 land, since DD#545 bridges sync.generate()calls toageneratetransparently. Decisions for ids outside the candidate set are dropped, mirroring the legacyenrich_validation_decisionsfilter.len(pool) > 1, noting thatmax_parallel_requestsis enforced per alias so pooling multiplies total in-flight calls — rate-limit budget accordingly.failed_recordscontract.chunked_validate_rowcoverage for pool-of-one, pool-of-two round-robin, multi-chunk reassembly, missing-id drop, and bad-config guards; cross-alias failover coverage (primary-fails-secondary-succeeds, whole-pool-exhausted, single-alias-no-failover, wrap-around order); workflow-level tests for the newCustomColumnConfigwiring and kwarg threading; regression test pinning the DD-exposed wrapper as sync (guards against re-introducing the coroutine-return bug); behavioural regression test pinning 1-chunk vs 3-chunk parity forCOL_VALIDATION_DECISIONSandCOL_DETECTED_ENTITIESunder identical deterministic LLM behaviour.build.nvidia.com. 3-row smoke run withvalidation_max_entities_per_call=3and a{gpt-oss-120b, nemotron-30b-thinking}pool dispatched 30 chunks across the pool, hit a live 429 on one chunk, failed over to the other alias, recovered, and completed with 0 dropped rows.What users write
Chunking knobs on the Python
Detectconfig — safe defaults mean most users never touch these:List-form
entity_validatorin the models YAML to spread validation across aliases:Type of Change
Testing
make testpasses locallymake checkpasses locally (format + lint + typecheck + lock-check)Documentation
make docs-buildpasses locally