Skip to content

feat(detect): chunked validation with validator pools#126

Merged
lipikaramaswamy merged 20 commits intomainfrom
lipikaramaswamy/feature/chunked-validation-pool
Apr 23, 2026
Merged

feat(detect): chunked validation with validator pools#126
lipikaramaswamy merged 20 commits intomainfrom
lipikaramaswamy/feature/chunked-validation-pool

Conversation

@lipikaramaswamy
Copy link
Copy Markdown
Collaborator

@lipikaramaswamy lipikaramaswamy commented Apr 21, 2026

Summary

Replaces the single per-row entity-validation LLM call with a chunked dispatch that partitions each row's candidates and rotates them across a configurable pool of validator aliases. This unblocks rows with many candidates (which can exceed the validator's context window or the provider's rate limits as a single call), gives users a quota-aggregation knob, and hardens validation against transient provider issues — all without changing defaults or single-validator behaviour.

  • New Detect config knobs. validation_max_entities_per_call (default 100) bounds per-call candidate count; validation_excerpt_window_chars (default 500) bounds the text context per chunk. Rows under the candidate limit dispatch as one call, exactly equivalent to the prior path.
  • entity_validator accepts list-form. A single alias remains valid (normalized to a one-element list); multiple aliases form a validator pool. Pools now serve two jobs:
    1. Load spreading. Chunks are dispatched round-robin by chunk index, aggregating quota across equivalent endpoints to work around per-alias rate limits.
    2. Failover. If a chunk's assigned alias can't complete the call (unrecoverable rate limit, persistent 5xx, malformed response), the same chunk is automatically retried against the other aliases in the pool before the row is dropped. A chunk only fails once every alias in the pool has failed for it. Single-alias pools behave exactly as before.
  • Internal helper split (not user-facing). The list-form change to entity_validator required splitting anonymizer.engine.ndd.model_loader.resolve_model_alias into a scalar helper (unchanged for the 12 still-scalar roles) and a new plural helper resolve_model_aliases() that returns list[str] for list-valued roles and wraps scalars in a one-element list for uniformity. resolve_model_alias("entity_validator", ...) now raises TypeError rather than silently returning a list. All in-repo callers are updated. These helpers are not re-exported from anonymizer.__init__ and are not part of the public package surface, so no supported external consumers are affected; the TypeError is a loud, trivially-triagable signal for anyone who had imported the private path directly.
  • Row-level failure contract. A row is only dropped when every alias in the pool has failed on at least one of its chunks. Dropped rows surface on result.failed_records with step="detection", not as silent passthrough with un-scrubbed text. anonymizer validate reports unknown pool aliases by index.
  • Chunked validation via CustomColumnConfig in a new chunked_validation module: position-ordered chunking, per-chunk text excerpts with forced-consistent tag notation per row, per-chunk Jinja prompt rendering, dispatch through sync ModelFacade.generate (wrapped in asyncio.to_thread for per-row chunk concurrency) + PydanticResponseRecipe, and merged output back into COL_VALIDATION_DECISIONS. The sync facade path is required because DataDesigner's default thread-pool engine hands custom columns sync-mode HttpModelClient facades whose agenerate raises; it is also forward-compatible with the async engine after Anonymizer#119 + DD#545 land, since DD#545 bridges sync .generate() calls to agenerate transparently. Decisions for ids outside the candidate set are dropped, mirroring the legacy enrich_validation_decisions filter.
  • Operator warning when len(pool) > 1, noting that max_parallel_requests is enforced per alias so pooling multiplies total in-flight calls — rate-limit budget accordingly.
  • Docs. New "Chunked validation" section on the detection concept page covering both knobs and tuning direction. Updated "Validator pools" subsection on the models page covering both load-spreading and failover roles with YAML example, plus a "What happens when a row can't be validated" subsection on the detection page documenting the row-drop + failed_records contract.
  • Tests. Unit coverage of every pure helper in the chunked module; async chunked_validate_row coverage for pool-of-one, pool-of-two round-robin, multi-chunk reassembly, missing-id drop, and bad-config guards; cross-alias failover coverage (primary-fails-secondary-succeeds, whole-pool-exhausted, single-alias-no-failover, wrap-around order); workflow-level tests for the new CustomColumnConfig wiring and kwarg threading; regression test pinning the DD-exposed wrapper as sync (guards against re-introducing the coroutine-return bug); behavioural regression test pinning 1-chunk vs 3-chunk parity for COL_VALIDATION_DECISIONS and COL_DETECTED_ENTITIES under identical deterministic LLM behaviour.
  • Verified against build.nvidia.com. 3-row smoke run with validation_max_entities_per_call=3 and a {gpt-oss-120b, nemotron-30b-thinking} pool dispatched 30 chunks across the pool, hit a live 429 on one chunk, failed over to the other alias, recovered, and completed with 0 dropped rows.

What users write

Chunking knobs on the Python Detect config — safe defaults mean most users never touch these:

from anonymizer import Anonymizer, AnonymizerConfig, Detect, Redact
config = AnonymizerConfig(
    detect=Detect(
        validation_max_entities_per_call=50,   # smaller chunks -> more calls, less context per call
        validation_excerpt_window_chars=800,   # more text context per chunk
    ),
    replace=Redact(),
)

List-form entity_validator in the models YAML to spread validation across aliases:

selected_models:
  detection:
    entity_detector: gliner-pii-detector
    entity_validator:
      - gpt5-primary
      - gpt5-secondary
    entity_augmenter: gpt5-primary
    latent_detector: claude-sonnet

Type of Change

  • Bug fix
  • New feature
  • Breaking change
  • Documentation update
  • Refactoring

Testing

  • make test passes locally
  • make check passes locally (format + lint + typecheck + lock-check)
  • Added/updated tests for changes

Documentation

  • If docs changed: make docs-build passes locally

@lipikaramaswamy lipikaramaswamy requested a review from a team as a code owner April 21, 2026 02:05
lipikaramaswamy added a commit that referenced this pull request Apr 21, 2026
Follow-ups from code review on #126. No user-facing behaviour change; the
happy path is unchanged and all tests in the affected modules pass.

Source hardening (chunked_validation.py):
- Replace `assert last_exc is not None` in `_dispatch_chunk` with an explicit
  `if`+`RuntimeError` guard. `assert` is stripped under `python -O`, which
  would turn a caller-contract violation (empty `facades`) into a confusing
  `TypeError: exceptions must derive from BaseException` from `raise None`.
  The new guard survives `-O` and names the precondition being violated.
- Drop the redundant `max_entities_per_call <= 0` check in `chunk_candidates`.
  Positivity is already enforced at two upstream layers
  (`AnonymizerDetectConfig.validation_max_entities_per_call` and
  `ChunkedValidationParams.max_entities_per_call`, both `Field(gt=0)`); the
  inner check was defence-in-depth at a layer below the validated boundary.
- Widen `chunk_candidates` input from `list[...]` to `Sequence[...]` and
  normalize the return to `list(...)`, so the declared `list[list[tuple[...]]]`
  return type is honest regardless of whether a list, tuple, or other
  sequence is passed. Previously tuple input silently produced
  `list[tuple[tuple[...]]]`.

Documentation (models.py):
- Expand `normalize_entity_validator`'s docstring to explicitly document
  tuple acceptance (previously undocumented behaviour that matched Pydantic
  v2's default `list[str]` coercion but wasn't called out).

Test coverage:
- `test_tuple_input_returns_list_of_lists` replaces the now-obsolete
  `test_non_positive_limit_raises`; pins the sequence-input / list-output
  contract.
- `test_tuple_coerced_to_list` on `DetectionModelSelection` pins the
  `entity_validator=(...)` path.
- `test_system_prompt_is_forwarded_to_facade` and
  `test_system_prompt_default_none_is_forwarded_untouched` close the
  untested-forwarding gap on `ChunkedValidationParams.system_prompt`. The
  field is plumbed but not populated by any production caller today; tests
  pin the contract for when it's wired up (tracked in #127).

Made-with: Cursor
Copy link
Copy Markdown
Contributor

@andreatgretel andreatgretel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional comments on unchanged lines:

src/anonymizer/engine/ndd/model_loader.py:95 - str(value) in load_workflow_selections will silently mangle a list - now that entity_validator supports list-form, this becomes a trap if someone adds a pool to the bundled detection.yaml. maybe preserve native types here (return {str(key): value for ...}) or at least add a comment warning about it?

src/anonymizer/engine/detection/custom_columns.py:138 - build_validation_skeleton is no longer used in the detection pipeline (replaced by build_chunk_skeleton in the chunked path). worth removing to avoid confusion about which skeleton builder is authoritative?

Comment thread src/anonymizer/engine/detection/chunked_validation.py Outdated
Comment thread src/anonymizer/config/models.py
Comment thread src/anonymizer/engine/detection/chunked_validation.py Outdated
Comment thread src/anonymizer/engine/detection/chunked_validation.py Outdated
Comment thread src/anonymizer/engine/detection/chunked_validation.py
Comment thread src/anonymizer/engine/detection/chunked_validation.py
Comment thread src/anonymizer/engine/detection/chunked_validation.py
Allows users to declare `entity_validator` as either a single alias
(current behavior) or a list of aliases that form a validator pool.
Scalars normalize to single-element lists so every consumer sees
`list[str]`. Adds `validation_max_entities_per_call` (default 100) and
`validation_excerpt_window_chars` (default 500) to Detect, which
commit 2 will consume when it replaces the single validation LLM call
with a chunked custom column that rotates across the pool. Behavior
is unchanged in this commit: the workflow still issues one validation
call per row using pool[0].

Adds `resolve_model_aliases()` helper for list-valued roles and
updates `validate_model_alias_references` to flag unknown aliases in
a pool by index.

Made-with: Cursor
Replaces the single per-row validation LLM call with a chunked custom
column that partitions each row's validation candidates, dispatches
each chunk to one alias from the configured `entity_validator` pool
(round-robin by chunk index), and merges the decisions back into the
shape `apply_validation_decisions` already expects. Single-alias pools
behave identically to the prior path; multi-alias pools distribute
calls to bypass per-alias TPM/RPM caps.

Chunk size and the per-chunk excerpt window are driven by the Detect
config knobs added in commit 1 (`validation_max_entities_per_call`,
`validation_excerpt_window_chars`) and threaded through
`Anonymizer -> EntityDetectionWorkflow.run`. Per-row tag notation is
resolved once against the full text and forced onto every chunk so
excerpts stay internally consistent; `build_tagged_text` gains an
optional `notation=` override for that purpose.

Implementation uses data_designer's `CustomColumnConfig` with
`model_aliases=pool`, letting DD inject a `{alias: ModelFacade}` dict
and giving us its retry/throttle behavior for free. A terminal LLM
error in any chunk fails the whole row (surfaced as a `FailedRecord`);
decisions for ids outside the candidate set are dropped to mirror
the single-call path's `enrich_validation_decisions` filter.

Emits a warning when the pool has more than one alias noting that
`max_parallel_requests` is enforced per alias, so the pool multiplies
total in-flight validator calls.

Made-with: Cursor
Adds a "Chunked validation" section to the detection concept doc
covering the two new Detect knobs (`validation_max_entities_per_call`,
`validation_excerpt_window_chars`), when chunking kicks in, and how to
tune in either direction. Adds a "Validator pools" subsection to the
models doc demonstrating list-form `entity_validator:` YAML, with a
warning admonition calling out that `max_parallel_requests` is
enforced per alias (so a pool multiplies total in-flight calls) and
that pool aliases should target equivalent models. Cross-links the
two sections.

Rate-limit jargon (TPM/RPM) is expanded inline on first use in each
page since a reader can enter from either; the operator-facing log
warning in detection_workflow still uses the acronyms.

Also adds a behavioral regression test pinning single-chunk vs
multi-chunk parity: same deterministic per-id decisions, pool-of-one,
three chunks vs one chunk must produce byte-identical merged
`COL_VALIDATION_DECISIONS` and byte-identical post-`apply_validation_decisions`
entities, with the expected keep/reclass/drop/untouched tallies and
the final entity list pinned.

Made-with: Cursor
Follow-ups from code review on #126. No user-facing behaviour change; the
happy path is unchanged and all tests in the affected modules pass.

Source hardening (chunked_validation.py):
- Replace `assert last_exc is not None` in `_dispatch_chunk` with an explicit
  `if`+`RuntimeError` guard. `assert` is stripped under `python -O`, which
  would turn a caller-contract violation (empty `facades`) into a confusing
  `TypeError: exceptions must derive from BaseException` from `raise None`.
  The new guard survives `-O` and names the precondition being violated.
- Drop the redundant `max_entities_per_call <= 0` check in `chunk_candidates`.
  Positivity is already enforced at two upstream layers
  (`AnonymizerDetectConfig.validation_max_entities_per_call` and
  `ChunkedValidationParams.max_entities_per_call`, both `Field(gt=0)`); the
  inner check was defence-in-depth at a layer below the validated boundary.
- Widen `chunk_candidates` input from `list[...]` to `Sequence[...]` and
  normalize the return to `list(...)`, so the declared `list[list[tuple[...]]]`
  return type is honest regardless of whether a list, tuple, or other
  sequence is passed. Previously tuple input silently produced
  `list[tuple[tuple[...]]]`.

Documentation (models.py):
- Expand `normalize_entity_validator`'s docstring to explicitly document
  tuple acceptance (previously undocumented behaviour that matched Pydantic
  v2's default `list[str]` coercion but wasn't called out).

Test coverage:
- `test_tuple_input_returns_list_of_lists` replaces the now-obsolete
  `test_non_positive_limit_raises`; pins the sequence-input / list-output
  contract.
- `test_tuple_coerced_to_list` on `DetectionModelSelection` pins the
  `entity_validator=(...)` path.
- `test_system_prompt_is_forwarded_to_facade` and
  `test_system_prompt_default_none_is_forwarded_untouched` close the
  untested-forwarding gap on `ChunkedValidationParams.system_prompt`. The
  field is plumbed but not populated by any production caller today; tests
  pin the contract for when it's wired up (tracked in #127).

Made-with: Cursor
Four inline prose touches in the chunked-validation module. No code or
test behaviour changes; only doc/comment rewording so the source reads
as a standalone spec rather than a narrated diff against prior versions.

- `chunk_candidates` docstring: state the ``max_entities_per_call > 0``
  precondition positively, point at the two upstream enforcement layers,
  drop the "so we do not re-check it here" coda.
- `_dispatch_chunk` defensive-guard comment: keep the substantive reasons
  (empty-facades precondition, loud named error vs. ``raise None``, guard
  survives ``python -O``) and drop the assert-vs-if comparison that only
  makes sense in the context of the review exchange that introduced it.
- `_dispatch_chunk` docstring: replace "matching the pre-failover
  contract" with a direct statement of the single-alias-pool behaviour
  ("no alternate alias to try").
- `test_system_prompt_default_none_is_forwarded_untouched` comment:
  tighten the "accidentally inject a placeholder" framing to just state
  the current recipe contract and what the test pins.

Made-with: Cursor
…m YAML

Addresses two review findings on the validator-pool config path.

normalize_entity_validator now collapses duplicate aliases to the first
occurrence (order preserved) and logs a warning. A duplicate would burn
a failover attempt on an already-exhausted endpoint -- almost certainly
not what the user intended. Empty pools still raise.

load_workflow_selections preserves list-valued roles as list[str]
instead of stringifying them. Stringifying would silently collapse a
pool YAML block into a single garbled alias repr, which Pydantic would
then treat as one alias. get_model_alias now raises TypeError when
called on a list-valued role with a pointer to resolve_model_aliases
for the pool shape.

Tests cover dedup, the no-dup-no-warning path, list-value preservation,
and the get_model_alias type error.

Made-with: Cursor
build_validation_skeleton has been dead since chunked validation moved
skeleton construction inline into chunked_validate_row. No internal
callers remain; its tests go with it.

Made-with: Cursor
…che Jinja template

Drops asyncio.run / asyncio.to_thread / asyncio.gather from the chunked
validation path. _dispatch_chunk and chunked_validate_row become sync
defs; chunked_validate_row dispatches chunks via a ThreadPoolExecutor
and calls facade.generate() directly. Per-alias concurrency is still
enforced downstream by each facade's ThrottledModelClient, so the pool
exists purely to overlap this row's chunks. Under DataDesigner's async
engine the sync calls are transparently bridged to agenerate by the DD
runtime (DD#545), so the code path stays engine-agnostic. Test call
sites lose their asyncio.run(...) wrappers. A TODO(async-native)
comment in the module docstring flags the follow-up migration once the
async engine becomes the DD default.

Wraps _compile_template in functools.lru_cache(maxsize=4) so a row with
N chunks parses the Jinja source once instead of N times.

Folds in the post-#119 column rename for this module
(COL_MERGED_TAGGED_TEXT -> COL_SEED_TAGGED_TEXT,
COL_VALIDATION_CANDIDATES -> COL_SEED_VALIDATION_CANDIDATES, prompt
placeholder _merged_tagged_text -> _seed_tagged_text) that the rebase
resolved in neighbouring modules but never applied here.

Module docstring gains two follow-up-deferred paragraphs explaining
(a) why per-instance validation is intentional -- dedup by (value,
label) was considered and rejected because it conflates surface form
with meaning -- and (b) how peak prompt memory scales with chunk count,
with pointers to the orthogonal cost levers to pull if pressure shows
up in a real workload. Both are tracked as separate follow-up issues.

Made-with: Cursor
@lipikaramaswamy lipikaramaswamy force-pushed the lipikaramaswamy/feature/chunked-validation-pool branch from cfc8370 to f093b99 Compare April 22, 2026 22:37
The rejected-dedup rationale previously used "Apple"/"May" examples,
which are actually *different* (value, label) keys -- so dedup would
not affect them and the examples didn't demonstrate the concern. The
real failure case is same (value, label) carrying different correctness
across a document, which happens for short names that collide with
common English words: "Ira" / "irate" (substring false positive),
"Mark" / "mark", "Bill" / "bill", "Hope" / "hope". Validator would
keep one and drop the other; dedup broadcasts one decision across both.

Made-with: Cursor
model_copy(update=...) in Pydantic v2 silently skips field validators,
so user overrides to selected_models bypassed
DetectionModelSelection.normalize_entity_validator entirely:

- entity_validator: []            -> accepted as [] (should raise)
- entity_validator: [a, a, b]     -> not deduped, no warning
- entity_validator: ["  a  ", ""] -> whitespace not stripped

Fix routes overrides through type(section).model_validate(merged) in
_merge_selections so every field validator re-runs at parse time.
Applied uniformly to detection/replace/rewrite to avoid leaving the
same landmine for any validators we add to Replace or Rewrite later.

Regression tests pin all three failure modes against parse_model_configs
itself, which is the entry point the bug was reproduced against.

Made-with: Cursor
…dule-level functions

Matches the majority style of test_model_loader.py. Class-grouped
shared-rationale docstring becomes a short block comment above the
three functions.

Made-with: Cursor
Copy link
Copy Markdown
Contributor

@andreatgretel andreatgretel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Two minor follow-ups:

  1. merge_chunk_decisions: seen.add() runs before null-decision filtering - a null entry can shadow a valid decision for the same ID from another chunk. Unlikely since candidates partition cleanly, but the insertion order is technically wrong.

  2. Single-chunk rows now go through build_chunk_excerpt() with a 500-char window instead of getting the full tagged text. A fast path for len(chunks) == 1 would preserve parity with the old behavior.

@lipikaramaswamy lipikaramaswamy merged commit 5030936 into main Apr 23, 2026
11 checks passed
@lipikaramaswamy lipikaramaswamy deleted the lipikaramaswamy/feature/chunked-validation-pool branch April 23, 2026 17:54
@lipikaramaswamy
Copy link
Copy Markdown
Collaborator Author

smoke_chunked_validation.py
Just tracking the file i used to test chunked validation

mvansegbroeck added a commit that referenced this pull request Apr 26, 2026
…Schema

Mirrors the _coerce_proposed_label validator already on
ValidationDecisionSchema. The chunked-validation path (added in #126,
merged into this PR via ca08fb2) routes LLM output through
RawValidationDecisionSchema instead of the wire-loose
ValidationDecisionSchema, so the existing coercer didn't apply on
that path.

Failure mode (gemma4-e4b on the post-merge bench): model emits
``proposed_label: null`` for ``decision: keep`` decisions; pydantic
rejects the entire chunk's records with ``string_type`` errors;
DataDesigner drops the row. On a 5-note × 2-mode bench across 5
candidate models, this caused gemma4-e4b's success rate to crater
to 1/5 redact + 2/5 rewrite even though the model otherwise
delivers best-in-class quality (util 0.96 / leak 0.03 on the rows
that survived).

Tests cover null and int-typed proposed_label drift. 501 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants