Merged
20 commits
1cd4c1f
feat(detect): accept validator pool + chunked-validation config knobs
lipikaramaswamy Apr 20, 2026
2e7c97b
feat(detect): chunked validation across a validator pool
lipikaramaswamy Apr 20, 2026
f767e4c
docs(detect): document chunked validation + validator pools
lipikaramaswamy Apr 21, 2026
ab42bd5
fix: format
lipikaramaswamy Apr 21, 2026
c4196f2
feat(detect): cross-alias failover + DD engine compatibility
lipikaramaswamy Apr 21, 2026
fd4fe4d
docs: update wording
lipikaramaswamy Apr 21, 2026
e3c1ec6
refactor(detect): harden chunked-validation guards per review feedback
lipikaramaswamy Apr 21, 2026
134f4da
docs(detect): tidy review-history residue in chunked-validation prose
lipikaramaswamy Apr 21, 2026
841899f
refactor(config): dedupe validator pool and preserve native types fro…
lipikaramaswamy Apr 22, 2026
2fe252e
chore(detect): drop unused build_validation_skeleton helper
lipikaramaswamy Apr 22, 2026
f093b99
refactor(detect): replace asyncio fan-out with ThreadPoolExecutor; ca…
lipikaramaswamy Apr 22, 2026
113074d
docs(detect): fix misleading C5 examples in chunked-validation docstring
lipikaramaswamy Apr 22, 2026
d5fd992
docs: clean up docstrings
lipikaramaswamy Apr 22, 2026
0757e88
fix(config): re-validate selected_models overrides instead of model_copy
lipikaramaswamy Apr 22, 2026
6b613b0
style(tests): flatten TestParseModelConfigsRevalidatesOverrides to mo…
lipikaramaswamy Apr 22, 2026
ff80afc
make workflow yaml validation helper work with list valued pools
lipikaramaswamy Apr 23, 2026
4eaee1d
display: remove prompt template from logging of validation step
lipikaramaswamy Apr 23, 2026
025cbf7
update logging
lipikaramaswamy Apr 23, 2026
743a132
apply fast path for single chunk
lipikaramaswamy Apr 23, 2026
6cf1ce3
treat none decisions as not present when chunking validation
lipikaramaswamy Apr 23, 2026
28 changes: 28 additions & 0 deletions docs/concepts/detection.md
@@ -45,7 +45,35 @@ config = AnonymizerConfig(
|-------|---------|-------------|
| `entity_labels` | `None` (all defaults) | List of labels to detect. Leave unset (or pass `None`) to use the full default set. |
| `gliner_threshold` | `0.3` | GLiNER confidence threshold (0.0--1.0). Lower values detect more entities but may increase false positives. |
| `validation_max_entities_per_call` | `100` | Maximum candidate entities per validator LLM call. Rows with more candidates are split into chunks. See [Chunked validation](#chunked-validation). |
| `validation_excerpt_window_chars` | `500` | Characters of context included before and after a chunk's entity spans in the validator prompt. Bounds per-chunk prompt size; not the model's context-window limit. |

---

## Chunked validation

When a row yields many entity candidates, validating them in a single LLM call can often exceed the model's context window or the provider's rate limits (tokens-per-minute or requests-per-minute quotas that many hosted models enforce). Anonymizer automatically splits validation for such rows: candidates are grouped in position order into chunks of at most `validation_max_entities_per_call`, and each chunk is validated independently with its own bounded text excerpt (`validation_excerpt_window_chars` before and after the chunk's span). Decisions are merged back into a single per-row set.

The chunked path is always on; if a row has no more candidates than the limit, it runs as a single call and is equivalent to the unchunked behavior. Tuning guidance:

- **Raise `validation_max_entities_per_call`** if your validator has a large context window and you want fewer, larger calls.
- **Lower it** if you hit provider rate limits or want more uniform per-call latency.
- **Raise `validation_excerpt_window_chars`** when short windows hide the context needed to disambiguate entities (e.g., `"John"` as first name vs. last name depends on surrounding text).
- **Lower it** to reduce per-chunk prompt tokens, at the risk of lower validation quality on context-sensitive labels.
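
A minimal sketch of how the two knobs interact, assuming candidates are simple character spans. The `Candidate` class and both helper functions are hypothetical illustrations, not Anonymizer's internal API; only the grouping rule (position order, at most `validation_max_entities_per_call` per chunk) and the excerpt rule (`validation_excerpt_window_chars` on each side of the chunk's span) come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical candidate entity: character offsets into the row text."""

    start: int
    end: int
    label: str


def chunk_candidates(candidates, max_per_call):
    """Group candidates in position order into chunks of at most max_per_call."""
    ordered = sorted(candidates, key=lambda c: (c.start, c.end))
    return [ordered[i : i + max_per_call] for i in range(0, len(ordered), max_per_call)]


def excerpt_for_chunk(text, chunk, window_chars):
    """Slice the text from window_chars before the chunk's first span to window_chars after its last."""
    lo = max(0, min(c.start for c in chunk) - window_chars)
    hi = min(len(text), max(c.end for c in chunk) + window_chars)
    return text[lo:hi]


# Illustrative data: 250 three-character candidates in a 1000-character row.
text = "x" * 1000
candidates = [Candidate(start=i * 4, end=i * 4 + 3, label="name") for i in range(250)]

chunks = chunk_candidates(candidates, max_per_call=100)
print([len(chunk) for chunk in chunks])  # [100, 100, 50]
print(len(excerpt_for_chunk(text, chunks[1], window_chars=50)))  # 499
```

Raising the per-call limit lowers the number of calls (here three calls at 100, one call at 250), while the window only changes how much surrounding text each call carries.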

### Validator pools

`entity_validator` can be a single alias (the default) or a list of aliases — a **pool**. When multiple aliases are configured, each chunk in a row is dispatched to the next alias in round-robin order, which lets you work around per-alias rate limits by spreading requests across equivalent endpoints.

Pools also act as **failover**. If a chunk's assigned alias can't complete the call (an unrecoverable rate limit, a 5xx that didn't clear on retry, a malformed response), the same chunk is automatically retried against the other aliases in your pool before the row is given up on. A chunk only fails once every alias in the pool has failed for it. This is a cheap way to harden validation against any one endpoint having a bad day, on top of the load-spreading role.
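
The dispatch and failover behavior described above could be sketched roughly as follows. This is a sequential illustration under stated assumptions: `call_validator` is a hypothetical callable that raises on failure, the failover order (remaining aliases in pool order) is assumed, and the real implementation may run chunks concurrently and retry within an alias before failing over.

```python
def validate_chunks(chunks, pool, call_validator):
    """Validate each chunk, starting at its round-robin alias and failing over through the rest.

    Returns (decisions_per_chunk, None) on success, or (None, failed_chunk_index)
    as soon as one chunk has exhausted every alias in the pool.
    """
    results = []
    for index, chunk in enumerate(chunks):
        start = index % len(pool)
        # Assigned alias first, then the other aliases in pool order (assumed order).
        attempt_order = pool[start:] + pool[:start]
        for alias in attempt_order:
            try:
                results.append(call_validator(alias, chunk))
                break
            except Exception:
                continue  # this alias failed for this chunk; try the next one
        else:
            return None, index  # every alias failed for this chunk
    return results, None


def flaky_validator(alias, chunk):
    """Hypothetical validator stub: alias 'b' is down, every other alias succeeds."""
    if alias == "b":
        raise RuntimeError("endpoint down")
    return f"{alias}:{len(chunk)}"


print(validate_chunks([[1, 2], [3], [4, 5, 6]], ["a", "b"], flaky_validator))
# (['a:2', 'a:1', 'a:3'], None)
```

In the stub run, the second chunk is assigned to `b`, fails there, and is completed by `a`, so the row still gets a full set of decisions.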

#### What happens when a row can't be validated

If validation can't get a complete answer for a row — every alias in the pool has failed on at least one of that row's chunks — the row is **dropped from the output** rather than passed through with some entities unvalidated. This is deliberate: the alternative would be writing the original text back out with those entities still un-scrubbed, which is exactly the outcome you're trying to avoid.

Dropped rows show up on `result.failed_records` with `step="detection"`, so you can tell which inputs didn't make it through by comparing input IDs against output IDs and reprocess those on a follow-up pass.
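
One way to build the follow-up pass is to collect the IDs that are missing from the output and were reported as detection failures. The sketch below uses plain dictionaries with hypothetical `id` and `step` keys; the actual shape of the entries in `result.failed_records` is not shown here and may differ.

```python
def ids_to_reprocess(input_ids, output_ids, failed_records):
    """Return input IDs that are absent from the output and failed at the detection step."""
    produced = set(output_ids)
    detection_failures = {record["id"] for record in failed_records if record["step"] == "detection"}
    return [rid for rid in input_ids if rid not in produced and rid in detection_failures]


# Illustrative data only; the "replace" step name is a placeholder.
failed = [
    {"id": 2, "step": "detection"},
    {"id": 4, "step": "replace"},
]
print(ids_to_reprocess([1, 2, 3, 4], [1, 3], failed))  # [2]
```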

See [Validator pools](models.md#validator-pools) for the YAML syntax and caveats.


## Entity labels
27 changes: 27 additions & 0 deletions docs/concepts/models.md
@@ -109,6 +109,33 @@ Roles you don't override keep their default alias selections, but those aliases
Use [`anonymizer.validate_config(config)`](../reference/anonymizer/interface/anonymizer.md) (or [`anonymizer validate`](../reference/anonymizer/interface/cli/main.md) from the CLI) after changing model configs to catch alias mismatches before processing data.


### Validator pools

`entity_validator` accepts either a single alias (shown above) or a list of aliases. A list forms a **validator pool** with two jobs:

1. **Load spreading.** [Chunked validation](detection.md#chunked-validation) dispatches each chunk to the next alias in round-robin order, aggregating quota across equivalent endpoints when a single alias would hit the provider's rate limits (tokens-per-minute or requests-per-minute quotas).
2. **Failover.** If a chunk's assigned alias can't complete the call (an unrecoverable rate limit, a 5xx that didn't clear on retry, a malformed response), the same chunk is automatically retried against the other aliases in your pool before the row is given up on. A row is only dropped when *every* alias in the pool has failed for the same chunk. Single-alias pools have nothing to fall back to, so they behave the same as not using a pool.

```yaml
selected_models:
  detection:
    entity_detector: gliner-pii-detector
    entity_validator:
      - gpt5-primary
      - gpt5-secondary
    entity_augmenter: gpt5-primary
    latent_detector: claude-sonnet
```

Every alias in the pool must also appear in `model_configs`; `anonymizer validate` flags unknown aliases by index. A scalar value remains valid and is equivalent to a one-element list.
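
The by-index check can be pictured with a short sketch. The helper below is hypothetical and is not the code behind `anonymizer validate`; it only illustrates reporting each unknown alias together with its position in the pool.

```python
def unknown_pool_aliases(pool, model_configs):
    """Return (index, alias) pairs for pool entries that have no matching model config."""
    known = set(model_configs)
    return [(index, alias) for index, alias in enumerate(pool) if alias not in known]


# Illustrative config: only one of the two pool aliases is defined.
model_configs = {"gpt5-primary": {}, "claude-sonnet": {}}
print(unknown_pool_aliases(["gpt5-primary", "gpt5-secondary"], model_configs))
# [(1, 'gpt5-secondary')]
```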

!!! warning "`max_parallel_requests` is enforced per alias"

    A pool with N aliases allows up to `sum(max_parallel_requests for alias in pool)` concurrent validator calls for a single row once that row is split into chunks. Budget your provider rate limits accordingly: the whole point of pooling is to multiply in-flight requests, but the multiplication is real.

Pool aliases should target **equivalent models** (same model family, similar quality). Mixing heterogeneous models produces inconsistent validation across chunks in the same row and is almost always a misconfiguration.
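
As a rough budgeting aid, the upper bound from the warning above can be computed directly. The function and the per-alias values are hypothetical; capping the result at the number of chunks is an assumption that each chunk corresponds to one validator call at a time.

```python
def max_in_flight_validator_calls(pool, max_parallel_requests, n_chunks):
    """Upper bound on concurrent validator calls for one row: pooled limit, capped by chunk count."""
    pooled_limit = sum(max_parallel_requests[alias] for alias in pool)
    return min(pooled_limit, n_chunks)


# Illustrative limits, not defaults from any real configuration.
limits = {"gpt5-primary": 4, "gpt5-secondary": 4}
print(max_in_flight_validator_calls(["gpt5-primary", "gpt5-secondary"], limits, n_chunks=20))  # 8
print(max_in_flight_validator_calls(["gpt5-primary", "gpt5-secondary"], limits, n_chunks=3))  # 3
```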


### Choosing custom models

For Anonymizer, the best overall leaderboard model is not always the best default for every role.
18 changes: 18 additions & 0 deletions src/anonymizer/config/anonymizer_config.py
@@ -82,6 +82,24 @@ class Detect(BaseModel):
    gliner_threshold: float = Field(
        default=0.3, ge=0.0, le=1.0, description="GLiNER detection confidence threshold (0.0-1.0)."
    )
    validation_max_entities_per_call: int = Field(
        default=100,
        gt=0,
        description=(
            "Maximum number of candidate entities included in a single validator LLM call. "
            "When a row has more candidates than this, validation is split into chunks that "
            "are dispatched (round-robin) across the validator pool."
        ),
    )
    validation_excerpt_window_chars: int = Field(
        default=500,
        gt=0,
        description=(
            "Number of characters to include before and after a chunk's entity span when "
            "building the text excerpt sent to the validator. Bounds the prompt context the "
            "validator sees per chunk; it is NOT the LLM's context window limit."
        ),
    )

    @field_validator("entity_labels")
    @classmethod
65 changes: 62 additions & 3 deletions src/anonymizer/config/models.py
@@ -3,17 +3,76 @@

from __future__ import annotations

from pydantic import BaseModel
import logging
from typing import Any

from pydantic import BaseModel, field_validator

logger = logging.getLogger(__name__)


class DetectionModelSelection(BaseModel):
    """Model aliases for the entity detection pipeline."""
    """Model aliases for the entity detection pipeline.

    ``entity_validator`` accepts either a single alias or a list of aliases.
    A list forms a validator *pool*: chunked validation rotates calls
    across the pool in round-robin order, which is useful for bypassing
    per-alias TPM/RPM limits. A single scalar is normalized to a
    one-element list.
    """

    entity_detector: str
    entity_validator: str
    entity_validator: list[str]
    entity_augmenter: str
    latent_detector: str

    @field_validator("entity_validator", mode="before")
    @classmethod
    def normalize_entity_validator(cls, value: Any) -> list[str]:
        """Accept a scalar alias, a list of aliases, or a tuple of aliases; return a non-empty deduplicated list.

        Normalizing at parse time keeps every downstream consumer on the
        same shape (``list[str]``) regardless of whether the user wrote
        ``entity_validator: some-alias`` or
        ``entity_validator: [alias-a, alias-b]``. Tuples are accepted for
        parity with Pydantic v2's default coercion for ``list[str]`` fields,
        which lets programmatic callers pass either
        ``DetectionModelSelection(entity_validator=["a", "b"])`` or
        ``DetectionModelSelection(entity_validator=("a", "b"))`` without
        caring about the concrete sequence type. Any other input type
        raises ``TypeError``.

        Duplicate aliases are collapsed to the first occurrence (order
        preserved) and a warning is logged. A duplicate in the pool would
        burn a failover attempt on an already-exhausted endpoint, which
        almost certainly isn't what the user wants.
        """
        if isinstance(value, str):
            aliases: list[str] = [value]
        elif isinstance(value, (list, tuple)):
            aliases = [str(item) for item in value]
        else:
            raise TypeError(f"entity_validator must be a string or list of strings, got {type(value).__name__}")
        cleaned = [alias.strip() for alias in aliases if alias.strip()]
        if not cleaned:
            raise ValueError("entity_validator must name at least one model alias.")
        seen: set[str] = set()
        deduped: list[str] = []
        for alias in cleaned:
            if alias in seen:
                continue
            seen.add(alias)
            deduped.append(alias)
        if len(deduped) != len(cleaned):
            removed = [alias for alias in cleaned if cleaned.count(alias) > 1]
            logger.warning(
                "entity_validator pool contained duplicate aliases %s; collapsing to %s. "
                "Duplicates burn a failover attempt on an already-exhausted endpoint.",
                sorted(set(removed)),
                deduped,
            )
        return deduped


class ReplaceModelSelection(BaseModel):
"""Model aliases for the replacement pipeline."""