fix: improved small-model support #130

Open
mvansegbroeck wants to merge 39 commits into main from maarten/bugfix/small-model-support
Conversation


@mvansegbroeck mvansegbroeck commented Apr 22, 2026

Summary

Anonymizer's rewrite pipeline was dropping records on every non-gpt-oss-120b model due to schema strictness that small models couldn't satisfy. Root cause: DataDesigner runs jsonschema.validate() on raw LLM output before pydantic's before-validators run — any strict enum, required, type, minLength, or uniqueness check becomes an un-coercible gate for small-model drift.

This PR applies a two-layer pattern to every LLM-facing schema that was dropping records:

  1. Loose wire schema: typed as str (not Enum), defaults instead of required, nullable where useful, no strict lengths. The emitted JSON Schema stops rejecting drifted output. For some shapes (disposition / meaning units) the wire schema is widened to oneOf({wrapper}, {array}) via __get_pydantic_json_schema__ so DD's pre-validate accepts both shapes.
  2. Server-side reconstruction / before-validators normalize drift: coerces None/int → str, lowercases enum drift, substring-matches merged enums, dedups IDs, pads missing IDs, truncates overlong / coerces empty strings, reconciles inconsistent field pairs. Disposition uses the reconstructor as the single normalization gate; remaining schemas use before-validators. The reconstructor column wraps construction in try/except ValidationError so an orphans-only response emits an empty disposition rather than dropping the row.
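As a minimal sketch of the two-layer pattern in pydantic v2 — field names, drift rules, and the truncation threshold here are illustrative stand-ins, not the actual Anonymizer schemas:

```python
from pydantic import BaseModel, field_validator

class LooseWireItem(BaseModel):
    # Loose wire layer: str instead of Enum, defaults instead of required,
    # so the emitted JSON Schema rejects nothing at DD's jsonschema gate.
    category: str = "quasi_identifier"
    rationale: str = ""

    @field_validator("category", mode="before")
    @classmethod
    def _normalize_category(cls, v):
        # Coerce None/int drift to str, lowercase display-label drift,
        # then substring-match merged enum values.
        v = str(v or "").strip().lower().replace(" ", "_")
        for canonical in ("direct_identifier", "sensitive_attribute",
                          "quasi_identifier"):
            if canonical in v or (v and v in canonical):
                return canonical
        return "quasi_identifier"  # conservative fallback

    @field_validator("rationale", mode="before")
    @classmethod
    def _coerce_rationale(cls, v):
        # None -> "", int -> str; truncate overlong output instead of rejecting.
        s = str(v) if v is not None else ""
        return s[:147] + "..." if len(s) > 150 else s

# Drifted output that a strict enum/minLength schema would drop:
assert LooseWireItem(category="DIRECT IDENTIFIERS").category == "direct_identifier"
```

The key property is that every check a small model can violate lives in Python, after the wire boundary, where it can coerce instead of reject.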

Schemas touched

| Schema | Role | Change |
| --- | --- | --- |
| EntityDispositionSchema | disposition_analyzer | Architecture: strict internal schema, reconstructor as single gate. Wire schema is the loose SimpleDispositionItem (oneOf widening accepts the bare-list shape). disposition_derivation.reconstruct_full_disposition builds the strict EntityDispositionSchema server-side via _normalize_category + _normalize_method (both case-fold + substring fallback) + derive_needs_protection (tautologically consistent (needs, method) pair). category restored to the EntityCategory enum; _coerce_small_model_output deleted; combined_risk_level deleted. Pessimistic needs_protection default for high-risk entities with an omitted method. The reconstructor column guards ValidationError from orphans-only responses. |
| SimpleDispositionResult | disposition_analyzer (wire) | New schema. Wire shape widened to oneOf({wrapper-object}, {bare-array}) so models that omit the sensitivity_disposition key (observed: nemotron-3-nano:4b, 5/5 records) still pass DD's jsonschema pre-validate. A before-validator normalizes the bare-list shape back to the wrapper. |
| MeaningUnitsSchema | meaning_extractor | Same oneOf + _accept_bare_list widening (qwen3.5:4b legal_court drift). _ensure_list reassigns sequential ids 1..N when the LLM omits ids or emits collisions. |
| RawValidationDecisionSchema | chunked-validation path | Adds _normalize_decision (preserves None for the chunk-merger's "no answer, skip" semantics; prose drift → "keep") and _coerce_proposed_label (None → "", int → str). Mirrors ValidationDecisionSchema's coercers, which previously existed only on the non-chunked wire path. |
| ValidationDecisionSchema | validator (wire) | decision enum → str; proposed_label nullable; drops the dead value/label fields (already server-overridden in enrich_validation_decisions). |
| LatentEntitySchema | latent_detector | category/confidence enums → str; rationale truncation + empty default; evidence list clamp; sentinel injected on an empty list to keep pyarrow happy. |
| MeaningUnitSchema | meaning_extractor | 16-value aspect enum → str + normalize. |
| DomainClassificationSchema | domain_classifier | 24-value Domain enum → str + normalize; string-to-float confidence coercion (accepts "0.95", "85%", falls back to 0.5 for "high"). |
| QualityAnswersSchema / PrivacyAnswersSchema / QACompareResultsSchema | evaluator / judge | @model_validator(mode="before") dedups IDs, pads missing with pessimistic defaults, drops extras. PrivacyAnswerItemSchema.reason truncates at 200 chars and now also coerces empty/whitespace-only strings (previously rejected by min_length=1). |

_ENTITY_LABEL_TO_CATEGORY is derived from three frozensets (_DIRECT_ID_LABELS, _QUASI_ID_LABELS, _SENSITIVE_ATTR_LABELS); a CI-asserted coverage test guards against future divergence from DEFAULT_ENTITY_LABELS.

Drift modes uncovered post-merge (round-2 fixes)

The initial PR was reviewed and revised (commits 19efe6e..16bff86). After merging origin/main (chunked-validation feature #126) and re-benching against 5 candidate models on DGX Spark, four additional drift modes surfaced. Each is now covered:

| Drift mode | Affected schema | Fix commit | Result |
| --- | --- | --- | --- |
| Bare list at top level instead of {sensitivity_disposition: [...]} | SimpleDispositionResult | dc0bc84 | nemotron-3-nano:4b rewrite 0/5 → 5/5 |
| proposed_label: null from the chunked-validation path | RawValidationDecisionSchema | 576de6a | gemma4-e4b redact 1/5 → 4/5 |
| Free-form prose in the decision field ("No specific entity type...") | RawValidationDecisionSchema | 5fa19b7 | gemma4-e4b 4/5 → 5/5 |
| Bare list at top level instead of {units: [...]} | MeaningUnitsSchema | 329d488 | qwen3.5:4b legal_court rewrite ❌ → ✅ |

Round-2 hardening (commit 5e55aea) closes greptile's automated review:

  • Lowercase + substring-match protection_method_suggestion before strict-schema construction ("REPLACE" no longer drops record)
  • Empty-string _truncate_reason coerces to placeholder (was failing min_length=1)
  • _reconstruct_full_disposition_column guards ValidationError from orphans-only responses
  • Drop dead endswith("_identifiers") branch in _normalize_category

Bench results — 5 candidate models on DGX Spark (Ollama Q4_K_M, post-fix)

| Model | Replace | Rewrite | Mean rewrite wall | Mean utility | Mean leakage |
| --- | --- | --- | --- | --- | --- |
| gemma4-e2b (2B) | 5/5 ✅ | 5/5 ✅ | 92 s | 0.81 | 0.13 |
| nemotron-3-nano:4b | 5/5 ✅ | 5/5 ✅ | 164 s | 0.63 | 0.28 |
| gemma4-e4b | 5/5 ✅ | 5/5 ✅ | 199 s | 0.92 | 0.13 |
| qwen3.5:4b | 5/5 ✅ | 5/5 ✅ | 204 s | 0.69 | 0.19 |
| qwen3.5:9b | 5/5 ✅ | 5/5 ✅ | 256 s | 0.82 | 0.04 |

Recommended sweet spot for rewrite: gemma4-e4b — best small-model utility (0.92), 5/5 reliable, mid-tier latency. gemma4-e2b is the speed pick for replace mode (~21 s average).

The NVFP4 / vLLM second half was bench-tested separately (its own commit set, using the cosmicproc/E4B + prithivMLmods/E2B community quants). Result: no Blackwell speedup observed at single-stream batch with community W4A4 quants — both gemma4 sizes were equal to or slower than Ollama Q4_K_M, with a 14-22 pp utility drop. Documented as future work; out of this PR's scope.

Type of Change

  • Bug fix
  • New feature
  • Breaking change
  • Documentation update
  • Refactoring

Testing

  • make test passes locally — 514 tests pass (108 new since the original PR open: schema coverage, label coverage, privacy reason truncation including empty-string, domain confidence string coercion, MeaningUnit id renumbering, _normalize_category + _normalize_method drift modes, class-K reconstructor reconciliation, _flatten_context enumeration alignment, _pad_empty_latent_column, end-to-end enrich_validation_decisions with dropped wire fields, bare-list shapes for both disposition and meaning, RawValidationDecisionSchema None/prose/int coercers, reconstruct-column orphans-only guard)
  • make check passes locally (format + lint + typecheck)
  • Added/updated tests for changes

Documentation

  • If docs changed: make docs-build passes locally — no doc changes in this PR

Review response

@memadi-nv (round 1, 2026-04-24): 7 inline comments + 6 missing-test gaps. All addressed by commits 19efe6e..16bff86. Brief mapping in commit messages.

greptile-apps[bot] (rounds 1-2): 3 P1 + 1 P2 from the round-2 review on commit 329d488. All closed by commit 5e55aea.

Round-2 reviewers (Amy / Andre): Andre's engineering nits (_normalize_method lowercase, dead branch, ValidationError guard, empty-string _truncate_reason, cross-ref comment) all addressed. Amy's prompt-restoration concerns (COVERAGE / QUALITY clauses dropped from disposition prompt) and the requested A/B utility comparison are noted as follow-up — recommended for a separate PR with isolated bench evidence rather than reverting prompts that bench shows produce 0.92 utility on the headline model.

Pattern shared via helper (commit 4a54b85)

The two schemas (SimpleDispositionResult, MeaningUnitsSchema) that need bare-list tolerance now share a single helper pair in engine/schemas/shared.py:

  • loose_list_wrapper_json_schema(handler, schema, *, list_field) — emits oneOf({wrapper}, {array}), gracefully degrades + logs a warning if pydantic ever moves the inline property schema behind a $ref (defensive against future restructure).
  • accept_bare_list(*, list_field) — builds the mode="before" validator that wraps a top-level bare list back into the canonical wrapper dict.

Each call site is now a single line, and future loose-wire schemas can adopt the pattern trivially.
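Under the signatures described above, accept_bare_list might look roughly like this (a sketch; the real helper lives in engine/schemas/shared.py, and its JSON-schema counterpart loose_list_wrapper_json_schema is omitted here):

```python
from typing import Any
from pydantic import BaseModel, model_validator

def accept_bare_list(*, list_field: str):
    """Build a mode="before" validator that wraps a top-level bare list
    back into the canonical wrapper dict."""
    def _wrap(cls, data: Any) -> Any:
        if isinstance(data, list):
            return {list_field: data}
        return data
    # Returning the decorated classmethod lets call sites register it
    # with a single class-level assignment.
    return model_validator(mode="before")(classmethod(_wrap))

class UnitsWrapper(BaseModel):
    units: list[str] = []
    _accept_bare = accept_bare_list(list_field="units")

# Both the canonical wrapper and the drifted bare-list shape validate:
assert UnitsWrapper.model_validate({"units": ["x"]}).units == ["x"]
assert UnitsWrapper.model_validate(["a", "b"]).units == ["a", "b"]
```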

Also in 4a54b85: RawValidationDecisionSchema._normalize_decision renamed to _normalize_decision_preserve_none so the divergence vs the canonical ValidationDecisionSchema._normalize_decision (which returns "keep" instead of preserving None) is visible at call sites without requiring docstring comparison.

Open follow-up

Amy's prompt-restoration request (re-add COVERAGE / QUALITY clauses to the disposition prompt + run isolated A/B utility comparison) is deferred to a separate PR. Aggregate bench evidence on this branch shows util preserved (0.92 mean on gemma4:e4b headline; 5/5 reliable across all 5 candidate models). The follow-up PR will pair restored clauses with side-by-side bench evidence to isolate the utility delta.

Related

Analysis trail and full benchmark matrix are documented externally in notes/lipika-heads-up-local-serving.md §5/§5b in the anonymizer-strategy workspace.

Systematically observed across 7 local-Ollama models x 10 notes (see
notes/lipika-heads-up-local-serving.md §5):

- Gemma family puts entity_label strings (last_name, date_time) into
  the EntityCategory enum field, emits proposed_label: null instead of
  empty string, and emits unquoted postcodes like 98101 as ints.
- Qwen 3.5 4B omits combined_risk_level entirely; 9B writes too-short
  rationale strings that fail minLength=20 on latent entities.

Two kinds of fix here:

1. Prompt (sensitivity_disposition.py):
   - use exact enum strings as <entity_categories> labels
   - add explicit <allowed_enum_values> block
   - distinguish sensitivity vs combined_risk_level with an example
   - add TYPE RULES section calling out null vs empty-string and
     quote-numeric-looking-strings

2. Schemas (detection.py + rewrite.py) add defensive coercion:
   - ValidationDecisionSchema: coerce None to empty string and int
     to str on value/label/proposed_label
   - EntityDispositionSchema: default combined_risk_level to
     sensitivity when missing; normalize category display-label
     drift (DIRECT IDENTIFIERS -> direct_identifier); int to str on
     entity_value/entity_label/protection_reason
   - LatentEntitySchema: rationale minLength 20 -> 10

Additive: schemas that validated before still validate; rows that
would have raised now coerce into the intended form.
…ntityCategory

Small Gemma models consistently write the entity_label value (e.g.
'last_name', 'date_of_birth') into the EntityDispositionSchema.category
slot, which expects one of the four EntityCategory enum values. Prompt
fixes alone (commit f5b8b39) did not eliminate this drift on gemma4-e2b.

Extend the before-validator with a best-effort entity_label -> category
mapping table covering ~50 of the standard Anonymizer entity labels.
Unmapped labels that still look like entity_label confusion fall back
to quasi_identifier (most conservative). Unknown strings are left for
pydantic to reject with a clean enum error.
Two small-model failure modes seen in the 140-run bench sweep that are
not addressed by the entity_label->category map:

Class E — LatentEntitySchema.rationale maxLength overrun.
Nemotron-3-Super-120B-A12B emitted a 260-char rationale on the
oncology bench note (limit is 150), dropping the entire record at
JSON-schema validation. Add a @field_validator(mode='before') that
truncates overlong strings to 147 chars + '...' instead of rejecting.

Class G — merged EntityCategory enum value.
Same model hallucinated category='latent_sensitive_attribute' on the
same note — a splice of two legitimate EntityCategory values.
Extend EntityDispositionSchema's before-validator with a substring
match (direct > sensitive > latent > quasi) so merged values map to
the strongest matching protection category. Verified:
  latent_sensitive_attribute -> sensitive_attribute
  direct_sensitive           -> direct_identifier
… run

DataDesigner's parse_fn validates the raw LLM response against the
pydantic model's emitted JSON Schema BEFORE calling model_validate().
That means the @model_validator(mode='before') coercion added in
commits f5b8b39 + 0a4df44 never got the chance to run: drift values
like 'last_name' or 'latent_sensitive_attribute' were rejected at
jsonschema.validate() by the 'enum' check.

Type EntityDispositionSchema.category as str (not EntityCategory) so
the emitted JSON Schema has no enum constraint. Add a
@field_validator(mode='after') that re-enforces enum membership AFTER
the before-validator has had its chance to coerce. The model-facing
enum hint is preserved in the field description (DD inlines it into
the prompt).

Verified end-to-end: last_name -> direct_identifier,
latent_sensitive_attribute -> sensitive_attribute; truly-novel
strings still rejected by the after-validator safety net.
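A condensed sketch of the str-typed wire field plus after-validator safety net (enum values from the PR; the coercion body is simplified):

```python
from enum import Enum
from pydantic import BaseModel, field_validator

class EntityCategory(str, Enum):
    DIRECT = "direct_identifier"
    QUASI = "quasi_identifier"
    SENSITIVE = "sensitive_attribute"

class DispositionWire(BaseModel):
    # Typed str, so the emitted JSON Schema carries no enum constraint
    # and DD's jsonschema pre-check cannot reject drifted values.
    category: str

    @field_validator("category", mode="before")
    @classmethod
    def _coerce(cls, v):
        # Before-validator gets its chance to normalize drift first.
        return str(v or "").strip().lower()

    @field_validator("category", mode="after")
    @classmethod
    def _enforce_enum(cls, v: str) -> str:
        # Safety net: truly-novel strings still fail with a clean error.
        if v not in {c.value for c in EntityCategory}:
            raise ValueError(f"unknown category: {v!r}")
        return v

assert DispositionWire(category="QUASI_IDENTIFIER").category == "quasi_identifier"
```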
…pply

Retest of gemma4-e2b on oncology bench note (after class B/I/E/G fixes)
surfaced a new failure: model omits protection_reason entirely on some
entries. DD jsonschema pre-validates required fields before pydantic
runs, so the omission was fatal at the JSON Schema gate.

Two paired changes:

1. Give protection_reason a valid default (>=minLength 10). The field
   leaves the emitted JSON Schema required list, so the record is no
   longer dropped at the DD pre-check when the model omits it.

2. Change the before-validator string coercion to pop None instead of
   writing empty string. For fields without defaults this surfaces a
   clearer "field required" error; for fields with defaults (now
   protection_reason) the default correctly takes effect.

Verified: missing or null protection_reason -> auto placeholder;
valid model-supplied reason passes through unchanged.
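The paired change can be sketched as follows (schema reduced to one field; the default string is the fallback quoted elsewhere in this PR):

```python
from pydantic import BaseModel, model_validator

class Disposition(BaseModel):
    # Valid default (>= 10 chars), so the field leaves the emitted
    # JSON Schema's required list and omission is no longer fatal.
    protection_reason: str = "auto: model omitted protection_reason"

    @model_validator(mode="before")
    @classmethod
    def _pop_none(cls, data):
        # Pop None instead of writing "": fields with defaults fall back
        # to the default; fields without defaults surface a clear
        # "field required" error rather than a minLength failure.
        if isinstance(data, dict):
            return {k: v for k, v in data.items() if v is not None}
        return data

# Missing or null protection_reason -> auto placeholder:
assert Disposition(protection_reason=None).protection_reason.startswith("auto:")
```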
Fix-branch sweep showed small models violate the existing after-
validator consistency rule in both directions:
  * needs_protection=False + method="suppress_inference"  (gemma4-e2b
    on 01_simple_clinical, latent-entity case)
  * needs_protection=True  + method="leave_as_is"        (gemma4-e2b
    on 06_financial_statement, 4 entities on one record)

Both patterns are logically inconsistent and previously killed the
record. Reconcile in the before-validator, biasing toward protection
(safer default for a privacy tool):
  * non-trivial method implies needs_protection=True;
  * needs_protection=True + method=leave_as_is -> method=replace.

Round-tripped both failure patterns end-to-end; consistent inputs
pass through unchanged.
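The reconciliation rules reduce to a tiny pure function (a sketch; in the codebase the logic lives in the before-validator, and the helper name here is illustrative):

```python
def reconcile(needs_protection: bool, method: str) -> tuple[bool, str]:
    """Reconcile an inconsistent (needs_protection, method) pair,
    biasing toward protection."""
    if method != "leave_as_is":
        # A non-trivial method implies protection is needed.
        needs_protection = True
    elif needs_protection:
        # Protection needed but no-op method: escalate to replace.
        method = "replace"
    return needs_protection, method

# The two observed gemma4-e2b failure patterns:
assert reconcile(False, "suppress_inference") == (True, "suppress_inference")
assert reconcile(True, "leave_as_is") == (True, "replace")
# Consistent input passes through unchanged:
assert reconcile(False, "leave_as_is") == (False, "leave_as_is")
```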
Grep audit showed combined_risk_level is declared on EntityDispositionSchema
and requested by the disposition prompt, but never read anywhere in the
library (only in test fixtures, which pydantic v2 silently ignores as
extra kwargs once the field is gone).

Field was pure schema debt — every disposition call paid token cost for
an output that flows into nothing. Deleting it:
  - shrinks the LLM contract by 10% with zero semantic change
  - removes one column from the retry/drift surface that classes
    A/B/C fixes on this branch are working around
  - removes the CombinedRiskLevel enum class (also unused)

Changes:
  * schemas/rewrite.py: drop CombinedRiskLevel class, field, and the
    @model_validator(mode=before) defaulting that injected it from
    sensitivity.
  * schemas/__init__.py: drop CombinedRiskLevel import and __all__.
  * rewrite/sensitivity_disposition.py: drop combined_risk_level
    prompt lines (QUALITY REQUIREMENTS bullet, ALLOWED ENUM VALUES
    entry, DISTINGUISHING block).

Test impact: 0 new failures. 2 stale tests from commit bbdb6dc
(class K reconcile) remain failing and will be updated separately.
…ction helpers

New loose wire-contract schema (SimpleDispositionItem / SimpleDispositionResult)
that will be the output_format for the disposition_analyzer LLM column in
the next commit. Every field is typed as str with a default, so the emitted
JSON Schema has no enum / required / minLength constraints — DataDesigner's
…t assertions

Two small hygiene items:

1. Add drop=True to the _simple_disposition LLM column in
   SensitivityDispositionWorkflow so it does not leak into the final
   user-facing preview DataFrame. It is an internal hand-off to the
   reconstruction column; matches the pattern used for
   _validation_skeleton in detection_workflow.py.

2. Refresh two test assertions in test_schemas.py that predated the
   class-K reconciliation commit (bbdb6dc). They asserted pydantic
   raises on inconsistent (needs_protection, method) pairs, but the
   before-validator now reconciles instead of raising — safer for
   small-model drift. Tests now assert the reconciled outcome:
     * needs_protection=False + method=replace -> needs_protection=True
     * needs_protection=True + method=leave_as_is -> method=replace
DataDesigner writes the _latent_entities column to parquet; when
the LLM returns an empty list, pyarrow fails with
ArrowNotImplementedError: Cannot write struct type _latent_entities
with no child field to Parquet. Observed on nemotron-3-super (and
any model that correctly concludes no inferable latent PII is
present).

Inject a single all-defaults sentinel LatentEntitySchema via
@model_validator(mode=after) on LatentEntitiesSchema when the list
is empty. Downstream code (reconstruct_full_disposition et al)
already filters entries with empty label/value so the sentinel is
invisible semantically.
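A sketch of the sentinel injection (class and field names mirror the commit; the field set is reduced for illustration):

```python
from pydantic import BaseModel, model_validator

class LatentEntitySchema(BaseModel):
    label: str = ""
    value: str = ""
    rationale: str = ""

class LatentEntitiesSchema(BaseModel):
    latent_entities: list[LatentEntitySchema] = []

    @model_validator(mode="after")
    def _inject_sentinel(self):
        # Empty struct lists cannot be written to Parquet by pyarrow;
        # inject one all-defaults entry. Downstream code already filters
        # entries with empty label/value, so the sentinel is invisible.
        if not self.latent_entities:
            self.latent_entities = [LatentEntitySchema()]
        return self

assert LatentEntitiesSchema().latent_entities[0].label == ""
```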
@mvansegbroeck mvansegbroeck requested a review from a team as a code owner April 22, 2026 17:18
@github-actions

github-actions Bot commented Apr 22, 2026

All contributors have signed the DCO ✍️ ✅
Posted by the DCO Assistant Lite bot.

The class R @model_validator(mode="after") sentinel on LatentEntitiesSchema
only fires when DD routes through model_validate(). On partial-failure
fallback paths DD can write the raw dict (including latent_entities: [])
directly to the DataFrame. Downstream workflows then hit pyarrow with:

  ArrowNotImplementedError: Cannot write struct type '_latent_entities'
  with no child field to Parquet.

_pad_empty_latent_column() runs on the DataFrame after DD returns and
injects a neutral LatentEntitySchema() default into any empty cell,
covering both the dict-wrapper and raw-list shapes DD may produce.
Verified on rich_data with rewrite preview: pyarrow error gone.
# Best-effort mapping from Anonymizer entity-labels to EntityCategory, used when
# the model outputs an entity_label in the `category` field (observed with small
# Gemma models on the disposition step). Values are from DEFAULT_ENTITY_LABELS in
# config/entity_labels.py; any label not in this table falls back to

There is no config/entity_labels.py. DEFAULT_ENTITY_LABELS lives at src/anonymizer/engine/constants.py.

# config/entity_labels.py; any label not in this table falls back to
# "quasi_identifier" which is the most conservative (protect-cautiously) choice.
_ENTITY_LABEL_TO_CATEGORY: dict[str, str] = {
# Direct identifiers: strong re-id on their own.

_ENTITY_LABEL_TO_CATEGORY is a hand-maintained copy and is already out of sync

The PR's own comment calls DEFAULT_ENTITY_LABELS the source of truth, but the table in schemas/rewrite.py contains labels that do not exist in ENTITY_LABEL_EXAMPLES — e.g. full_name, passport_number, drivers_license, address, zip_code, credit_card_number. Conversely it likely omits labels someone will add tomorrow.

Two reasonable options:

  • (preferred) Define three small sets _DIRECT_IDS, _QUASI_IDS, _SENSITIVE_ATTRS and derive _ENTITY_LABEL_TO_CATEGORY as {lbl: "direct_identifier" for lbl in _DIRECT_IDS} | .... Then add a test: assert set(DEFAULT_ENTITY_LABELS) <= set(_ENTITY_LABEL_TO_CATEGORY).
  • At minimum add the coverage assertion as a unit test so the next label-addition PR breaks CI rather than silently losing fallback coverage.


# Derive derived fields.
method = (item.protection_method_suggestion or "").strip() or "leave_as_is"

When the small model omits the method, this silently derives needs_protection=False, even for a direct_identifier with sensitivity="high". That's exactly the kind of failure mode this PR exists to fix — and it's inconsistent with the pessimistic defaults the PR uses elsewhere (e.g. PrivacyAnswersSchema pads missing with "yes", _normalize_decision defaults to "keep" = preserve detection).

Suggested fix — pessimistic default based on category/sensitivity:

Suggested change — from:

    method = (item.protection_method_suggestion or "").strip() or "leave_as_is"

to:

    raw_method = (item.protection_method_suggestion or "").strip()
    if raw_method:
        method = raw_method
    elif category in ("direct_identifier", "sensitive_attribute") or sensitivity in ("medium", "high"):
        method = "replace"
    else:
        method = "leave_as_is"

Add a test for SimpleDispositionItem(id=1, category="direct_identifier", sensitivity="high", protection_method_suggestion="") → reconstructed with needs_protection=True.

# like 98101 as bare integers).
@model_validator(mode="before")
@classmethod
def _coerce_small_model_output(cls, data):

@memadi-nv memadi-nv Apr 23, 2026


Two layers of coercion for the same drift — duplicated logic, only one is reachable in prod

EntityDispositionSchema._coerce_small_model_output does ~70 lines of normalization (category drift, int→str, needs↔method reconciliation, etc.) — but since the workflow now routes every disposition through SimpleDispositionResult + reconstruct_full_disposition, the reconstructor always pre-massages the fields before calling EntityDispositionSchema(...). The strict-schema's before-validator never sees raw LLM output in the production path; it only runs in the unit tests.

Two downsides:

  • ~70 lines of production-untested code billed as a live safety net.
  • template_protection_reason vs protection_reason default ("auto: model omitted protection_reason") are two different fallbacks for the same field, reachable via different paths. Currently the reconstructor path wins in prod; the stringly-typed default only kicks in if someone constructs EntityDispositionSchema directly from LLM output — a combination no one should be doing anymore.

Pick one gate:

  • (a) Drop _coerce_small_model_output from EntityDispositionSchema, make category: EntityCategory again, rely exclusively on the reconstructor. Simpler, one source of truth.
  • (b) Keep _coerce_small_model_output as the single unified gate, remove the templating/defaulting from reconstruct_full_disposition, and route every item through EntityDispositionSchema.model_validate.

(a) seems like a better pick. Reason: the reconstructor has access to context (category, sensitivity, method together) that the field-by-field coercion doesn't, so its reason templating is strictly better than the static default.


@memadi-nv memadi-nv left a comment


Missing tests for claims in the PR summary / commits

  • PR summary: "5 stale class-K assertions refreshed to assert reconciliation instead of raising" — the diff only shows 2 refreshed tests (test_entity_disposition_reconciles_no_protection_but_method_set, test_entity_disposition_reconciles_needs_protection_but_leave_as_is). Could not find the other 3.
  • No test for _pad_empty_latent_column in detection_workflow.py (the belt-and-braces DataFrame sentinel).
  • No test exercising the entity-label-stuffed-into-category path in _coerce_small_model_output via _ENTITY_LABEL_TO_CATEGORY — this was called out as a key Gemma failure and gets its own ~70 lines of table, but tests only cover display-label variants ("DIRECT IDENTIFIER", plural, "_identifiers"), not the "last_name" → "direct_identifier" mapping.
  • No test that ValidationDecisionSchema dropping value/label from the wire still produces correct output through enrich_validation_decisions (the PR claims those fields were already server-overridden — worth an end-to-end assertion).
  • No test for PrivacyAnswerItemSchema._truncate_reason (200+ char inputs, the specific small-model drift the method exists for).
  • No test for DomainClassificationSchema._coerce_confidence string inputs ("0.95", "85%", "high"), despite qa_generation.py's _DOMAIN_KEY lookup being downstream of this.



def test_multi_label_entity_flattens_to_multiple_slots() -> None:
simple = _make_simple(


Prompt enumeration order must match _flatten_context — not verified

reconstruct_full_disposition pairs LLM items with context by context[item.id - 1], and _flatten_context emits one slot per (value, label) pair. If the jinja rendering of <<FINAL_ENTITIES>> presents entities to the LLM as one row per value (multi-label values merged), the LLM's numbering will not align with the context index — multi-label entities will either get the wrong pairing or fall into the orphan path and be silently dropped.

The new reconstructor has a unit test (test_multi_label_entity_flattens_to_multiple_slots), but there is no test asserting the prompt rendering enumerates in the same order and cardinality.

Please add an end-to-end test that renders the prompt for a multi-label entities_by_value row, asserts the entity list the model sees is numbered 1..N identically to _flatten_context, and feeds a matching hand-crafted LLM response through reconstruct_full_disposition with zero orphans. If the rendering turns out to be one-per-value, _flatten_context should collapse to match.

emits an empty list the record still survives so the pipeline can
decide what to do downstream. A @model_validator(mode="before") drops
units with empty unit text to reduce noise.
"""


MeaningUnitsSchema docstring describes behavior that isn't implemented

The docstring claims "A @model_validator(mode="before") drops units with empty unit text to reduce noise", but no such validator exists — only _ensure_list (list-wrap). Either implement the drop or fix the docstring.

Related: MeaningUnitSchema.id: int = Field(ge=1, default=1) — if the LLM emits multiple units whose ids all default to 1, any downstream id-keyed dedup collapses them. Consider default=None with sequential reassignment in _ensure_list, or document that the default=1 is purely a wire-layer escape hatch.

terse (Qwen-observed sub-10 chars) with an ellipsis to reach the
10-char soft floor. Exact floor kept as internal expectation but
not enforced at wire — server reads rationale for context only.
"""


Only the upper bound is enforced. Either implement the padding or drop the promise from the docstring — non-blocking (wire no longer has min_length) but confusing for whoever reads this next.

@mvansegbroeck

I have read the DCO document and I hereby sign the DCO.

@mvansegbroeck mvansegbroeck changed the title Maarten/bugfix/small model support fix: tolerate small-model schema drift across rewrite pipeline Apr 25, 2026
@mvansegbroeck mvansegbroeck changed the title fix: tolerate small-model schema drift across rewrite pipeline fix: improved small-model support Apr 25, 2026
mvansegbroeck and others added 11 commits April 25, 2026 07:40
Comment referenced config/entity_labels.py; DEFAULT_ENTITY_LABELS
actually lives in engine/constants.py. No behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The validator only truncates above 150 chars; the docstring's promise to
pad sub-10-char rationales was never implemented. Reviewer flagged as
non-blocking. Behavior is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The class only has a list-coercion before-validator on the units field;
the docstring claimed a model_validator that drops empty units, which
was never implemented. Behavior is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…entities

reconstruct_full_disposition previously defaulted an omitted
protection_method_suggestion to leave_as_is regardless of category or
sensitivity, which silently let high-sensitivity direct_identifiers slip
through as needs_protection=False — exactly the small-model failure mode
this PR exists to prevent. Mirrors the pessimistic-default pattern
already used by PrivacyAnswersSchema and _normalize_decision: when method
is empty and the entity is direct_identifier/sensitive_attribute or has
medium/high sensitivity, default to replace.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
MeaningUnitSchema.id defaults to 1 to keep the wire schema permissive
for small-model output. When the LLM emits multiple units and omits the
id on all of them, every unit collapses to id=1, which makes downstream
prompts that reference units by id ambiguous. _ensure_list now
reassigns sequential 1..N when any id is missing or duplicated.
Explicit unique ids are preserved verbatim.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
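The reassignment logic simplifies to a pure function over the raw dicts (a sketch; in the codebase this lives inside MeaningUnitsSchema._ensure_list, before pydantic validation):

```python
def ensure_ids(units: list[dict]) -> list[dict]:
    """Reassign sequential ids 1..N when any id is missing or duplicated;
    explicit unique ids are preserved verbatim."""
    ids = [u.get("id") for u in units]
    if None in ids or len(set(ids)) != len(ids):
        return [{**u, "id": i} for i, u in enumerate(units, start=1)]
    return units

# All ids defaulted/collided -> renumbered 1..N:
assert ensure_ids([{"id": 1}, {"id": 1}]) == [{"id": 1}, {"id": 2}]
# Explicit unique ids untouched:
assert ensure_ids([{"id": 5}, {"id": 2}]) == [{"id": 5}, {"id": 2}]
```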
The 71-key hand-maintained dict had drifted from DEFAULT_ENTITY_LABELS:
six labels (full_name, passport_number, drivers_license, address,
zip_code, credit_card_number) were not in the live label set and
therefore unreachable via real detector output. Restructure into three
frozensets — _DIRECT_ID_LABELS, _QUASI_ID_LABELS, _SENSITIVE_ATTR_LABELS
— and derive the mapping. This puts each label's category assignment in
one place per category and makes the next label-addition PR a simple
set-membership change rather than a hand-edit risking silent omission.
The phantom labels are dropped (they cannot be emitted by the detector).

Mapping now covers exactly the 65 labels in DEFAULT_ENTITY_LABELS with
no gaps and no phantoms. CI assertion guarding the inclusion lands in a
follow-up commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add _normalize_category helper in disposition_derivation that resolves
all four small-model category drift modes (display variants, merged
enums, entity-label confusion, empty/unknown) into a valid
EntityCategory value, with conservative fallback to quasi_identifier.

reconstruct_full_disposition now calls _normalize_category before
constructing EntityDispositionSchema, replacing the inline
"category or empty -> quasi_identifier" default. The reconstructor
already had access to context-derived entity_label, so the normalizer
gets richer signal than the schema's own before-validator did. Sets up
the next commit, which deletes the redundant before-validator.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…egory

The schema-level @model_validator(mode="before") was unreachable in
production: every disposition record now flows through the wire-loose
SimpleDispositionItem schema and reconstruct_full_disposition, which
calls _normalize_category and derive_needs_protection before
constructing EntityDispositionSchema. The before-validator's category
normalization and needs_protection<->method reconciliation were
duplicated by, and superseded by, the reconstructor's path.

Changes:
- Delete _coerce_small_model_output (~88 lines).
- Restore category: EntityCategory (strict enum) — emitted JSON Schema
  is internal-only now, so DD jsonschema pre-checks no longer matter.
- Drop the static "auto: model omitted protection_reason" default; the
  reconstructor always supplies a >=10-char reason via
  template_protection_reason.
- Drop _validate_category after-validator (redundant with enum type).

Tests refreshed: the two class-K reconciliation tests now assert
ValidationError instead of reconciliation, since reconciliation lives
in the reconstructor (covered by the new pessimistic-default tests in
test_disposition_derivation.py).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…nce coercion

Adds tests for the three review-summary missing-test gaps in test_schemas.py:

- DEFAULT_ENTITY_LABELS coverage assertion: any label added to the live
  set without a category assignment in schemas/rewrite.py now breaks CI
  with a directive pointing at the right sets to extend.
- PrivacyAnswerItemSchema._truncate_reason: verifies 250+ char input
  truncates to <=200 chars (rather than rejecting the record), boundary
  case at exactly 200, and None coerces to a placeholder.
- DomainClassificationSchema._coerce_confidence: accepts float strings
  ("0.95"), percentage strings ("85%"), unparseable strings ("high"
  falls back to 0.5), and clamps out-of-range values.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
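The confidence-coercion behaviour those tests pin can be sketched as a plain function. Illustrative only, not the validator's actual body:

```python
# Accept float strings, percentage strings, unparseable prose (0.5
# fallback), and clamp out-of-range numbers into [0.0, 1.0].
def coerce_confidence(v: object) -> float:
    if isinstance(v, str):
        s = v.strip()
        if s.endswith("%"):
            try:
                return min(1.0, max(0.0, float(s[:-1]) / 100.0))
            except ValueError:
                return 0.5
        try:
            v = float(s)
        except ValueError:
            return 0.5
    if isinstance(v, (int, float)):
        return min(1.0, max(0.0, float(v)))
    return 0.5
```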
…tten alignment

Adds three new test groups for disposition reconstruction:

- _normalize_category: drift modes (entity-label-stuffed-into-category
  for all three categories, display variants, merged-enum hallucinations,
  unknown fallback) plus an end-to-end via the reconstructor that pins
  the canonical gemma failure mode (category="last_name" -> "direct_identifier").
- Class-K-style reconciliation through the reconstructor: now that the
  schema's before-validator was removed, document that
  derive_needs_protection makes the (needs_protection, method) pair
  tautologically consistent for replace, generalize, and leave_as_is.
- _flatten_context enumeration alignment: the prompt-rendering contract
  is "one slot per (value, label) pair" with tagged before latent.
  Multi-label entities round-trip with zero orphans when the LLM emits
  ids 1..N matching slot count. Closes the test gap memadi-nv flagged
  on test_disposition_derivation.py:258.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ment

Closes the last two review-summary missing-test gaps:

- _pad_empty_latent_column: 4 new tests covering the dict-cell sentinel
  injection, populated-cell preservation, bare-empty-list cells, and
  the missing-column no-op. Each pins the pyarrow workaround so a
  future refactor can't silently regress to "Cannot write struct type
  with no child field".
- enrich_validation_decisions end-to-end with the dropped wire fields:
  pins that ValidationDecisionSchema.model_dump() omits value/label
  (they were intentionally removed from the wire), and that enrichment
  fills them server-side from candidate_lookup. Reclass + proposed_label
  flow through unchanged.

Also: ruff's import-order fix on test_disposition_derivation.py and
test_schemas.py (the new imports landed at non-canonical positions).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@greptile-apps


greptile-apps Bot commented Apr 25, 2026

Greptile Summary

This PR systematically applies a two-layer small-model hardening pattern across every LLM-facing schema: a loose wire schema (str instead of enum, defaults instead of required, no strict length guards) that survives DataDesigner's pre-validate pass, paired with server-side before-validators and the new disposition_derivation reconstruction module that normalize drift into the strict internal shapes. All four round-2 drift modes (bare-list disposition, proposed_label: null, prose in decision, bare-list meaning units) are covered and bench-verified at 5/5 across five candidate models. Previous P1 findings from rounds 1 and 2 (ValidationError guard for orphans-only responses, protection_method_suggestion case-normalization, _truncate_reason empty-string handling, dead _identifiers branch) are addressed in commit 5e55aea.

Confidence Score: 5/5

Safe to merge — all previous P1 findings are addressed, 514 tests pass, and the only remaining finding is a P2 docstring discrepancy.

No P0 or P1 issues remain. The single P2 (docstring priority order in _normalize_method) has no runtime impact since valid ProtectionMethod values are not substrings of each other in practice. Extensive test coverage (642 new test lines for disposition_derivation, 437 for schemas) and bench results across five models increase confidence.

No files require special attention.

Important Files Changed

Filename Overview
src/anonymizer/engine/rewrite/disposition_derivation.py New module: server-side reconstruction of strict EntityDispositionSchema from the loose LLM wire output; handles category/method normalization, context pairing, and orphan skipping. Minor docstring discrepancy in _normalize_method priority order.
src/anonymizer/engine/schemas/rewrite.py Major schema loosening: enums → str with before-validators, new SimpleDispositionResult/SimpleDispositionItem wire shapes, oneOf widening helpers, id-coverage normalization for QA schemas, _truncate_reason empty-string handling. Well-tested and previous P1s addressed.
src/anonymizer/engine/rewrite/sensitivity_disposition.py Disposition pipeline refactored to two-step (LLM wire → reconstruction column); ValidationError guard on orphans-only responses; prompt updated to simplified 4-field output format.
src/anonymizer/engine/schemas/detection.py RawValidationDecisionSchema gains _normalize_decision_preserve_none and _coerce_proposed_label; ValidationDecisionSchema drops value/label fields and loosens decision to str; LatentEntitySchema fully loosened with before-validators; sentinel injection for empty parquet lists.
src/anonymizer/engine/schemas/shared.py New loose_list_wrapper_json_schema and accept_bare_list helpers for oneOf JSON Schema widening; graceful fallback logged when inline array schema unavailable.
src/anonymizer/engine/detection/detection_workflow.py _pad_empty_latent_column added to guard against pyarrow empty-struct serialization failure at the DataFrame level (belt-and-braces for pydantic-level sentinel in LatentEntitiesSchema).
tests/engine/test_disposition_derivation.py 642 lines of new tests covering normalize_category/method drift, reconstruct_full_disposition context pairing, orphan skipping, template_protection_reason, and bare-list shapes.
tests/engine/test_schemas.py 437 lines added: schema coverage for all loosened fields, label-to-category coverage CI guard, privacy reason truncation/empty-string, domain confidence coercion, MeaningUnit id renumbering, RawValidationDecisionSchema None/prose coercers.

Sequence Diagram

sequenceDiagram
    participant DD as DataDesigner
    participant LLM as disposition_analyzer LLM
    participant WS as SimpleDispositionResult
    participant RC as _reconstruct_full_disposition_column
    participant DR as disposition_derivation

    DD->>LLM: prompt + loose JSON Schema (no enum/minLength/required)
    LLM-->>DD: raw JSON (bare list, case-drifted enums, null fields)
    DD->>DD: jsonschema.validate() against oneOf(wrapper, bare-array)
    DD->>WS: model_validate() + _accept_bare_list normalizes bare list
    WS-->>DD: SimpleDispositionResult (loose, defaults filled)
    DD->>RC: row dict with COL_SIMPLE_DISPOSITION + context cols
    RC->>DR: reconstruct_full_disposition(simple, entities_by_value, latent_entities)
    DR->>DR: _normalize_category / _normalize_method / derive_needs_protection / template_protection_reason
    DR-->>RC: SensitivityDispositionSchema (strict)
    RC-->>DD: row[COL_SENSITIVITY_DISPOSITION] = full.model_dump()
    Note over RC: ValidationError from orphans-only response
    Note over RC: emits empty disposition, no record drop

Reviews (8): Last reviewed commit: "refactor(schemas): extract loose-list-wr..."

Comment on lines +212 to +217
full = reconstruct_full_disposition(
    simple,
    row.get(COL_ENTITIES_BY_VALUE),
    row.get(COL_LATENT_ENTITIES),
)
row[COL_SENSITIVITY_DISPOSITION] = full.model_dump()

P1 Unhandled ValidationError can still drop records

reconstruct_full_disposition raises ValidationError when full_items is empty (e.g., all SimpleDispositionItem IDs fall outside the context range AND the model omitted the optional echoes). SensitivityDispositionSchema enforces min_length=1 on sensitivity_disposition; a fully-orphaned small-model response that emits IDs like 100–103 with no entity_label/entity_value echoes would trip this path and propagate an unhandled exception from the column generator — exactly the record-drop behaviour this PR is designed to prevent.

The test test_empty_simple_result_produces_empty_schema explicitly asserts the raise and comments "caller is expected to short-circuit", but _reconstruct_full_disposition_column performs no such guard.

Consider wrapping the call (or returning early when simple.sensitivity_disposition is empty) to avoid surfacing the exception to DataDesigner's error-handling layer:

    if not simple.sensitivity_disposition:
        logger.warning("_reconstruct_full_disposition_column: empty simple disposition for row; skipping reconstruction")
        return row
    try:
        full = reconstruct_full_disposition(
            simple,
            row.get(COL_ENTITIES_BY_VALUE),
            row.get(COL_LATENT_ENTITIES),
        )
    except Exception as exc:
        logger.warning("_reconstruct_full_disposition_column: reconstruction failed (%s); leaving disposition empty", exc)
        return row
    row[COL_SENSITIVITY_DISPOSITION] = full.model_dump()
    return row

Comment on lines +142 to +146
method_label = method or "an appropriate method"
base = f"{cat_label} — protected via {method_label}"

prefix = _SENSITIVITY_PREFIX.get(sensitivity, "")
reason = (prefix + base).strip()

P2 Dead code in plural-stripping logic

The endswith("_identifiers") branch can never be reached for any of the intended inputs. The preceding endswith("s") check already handles "direct_identifiers", "quasi_identifiers", and "latent_identifiers" — their [:-1] forms are all in _VALID_CATEGORIES, so that check returns first. The _identifiers branch is only reached when normalized[:-1] is not in _VALID_CATEGORIES, in which case it returns an invalid category string that would cause a ValidationError downstream since EntityDispositionSchema.category has no before-validator. In practice unreachable for expected model output, but the dead code is misleading and could cause silent failures for hypothetical edge-case inputs.

mvansegbroeck and others added 6 commits April 25, 2026 08:22
Resolves DIRTY merge state for PR #130. Conflicts in:

- src/anonymizer/engine/rewrite/sensitivity_disposition.py (4 hunks):
  the strict_entity_protection feature (#123) and the simple-wire-schema
  refactor on this branch both rewrote the disposition prompt and
  workflow column setup. Resolution: keep the two-step pipeline (loose
  SimpleDispositionResult on the wire, server-side reconstruction into
  SensitivityDispositionSchema) and pass strict_entity_protection
  through to the prompt. The prompt's <strict_entity_protection> block
  is the strict-mode contract; the reconstructor produces the
  downstream schema. The output_schema selection between
  SensitivityDispositionSchema and StrictSensitivityDispositionSchema
  is no longer needed because we always emit SimpleDispositionResult.

- src/anonymizer/engine/detection/detection_workflow.py: keep the
  imports of LatentEntitySchema this branch added; ValidationDecisionsSchema
  was no longer used after main's chunked-validation refactor.

- tests/engine/test_detection_workflow.py: keep both the new
  _pad_empty_latent_column tests from this branch and the
  chunked-validation wiring tests from main.

- src/anonymizer/engine/schemas/rewrite.py: re-add the CombinedRiskLevel
  enum that this branch had deleted as unused — main's
  StrictEntityDispositionSchema needs it. The non-strict
  EntityDispositionSchema still does not carry the field; the strict
  variant's to_entity_disposition() drops combined_risk_level during
  conversion.

- tests/engine/test_chunked_validation.py: two assertions in
  TestMergeChunkDecisions assumed the merge step's payload retained
  value/label. After the wire-schema drop, those fields are stripped
  by ValidationDecisionsSchema validation and re-filled later by
  enrich_validation_decisions from candidate_lookup. Tests updated to
  match the post-drop contract with a comment explaining why.

- tests/engine/test_detection_custom_columns.py: COL_VALIDATION_CANDIDATES
  was renamed to COL_SEED_VALIDATION_CANDIDATES in the new e2e test.

All 496 tests pass (was 427 pre-merge; +69 from main).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI's format-check (ruff format --check via tools/codestyle/format.sh)
flagged 8 files I touched as needing reformat. Locally these passed
ruff check, but ruff format applies stricter line-length / wrapping
rules. No semantic change; pure mechanical reformat.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two ruff lint violations surfaced after the main merge; both auto-fixable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…utput

nemotron-3-nano:4b on rewrite mode (5/5 records on Spark Q5 bench)
consistently emits the disposition items as a top-level array without
the sensitivity_disposition wrapper key. Other small models drift to
the bare-list shape intermittently on dense entity sets.

Failure path: DataDesigner runs jsonschema.validate() on the raw LLM
response BEFORE pydantic gets to coerce. The previous wire schema
required type=object, so a bare-list response was rejected as "is not
of type 'object'" and the record dropped — wasting the upstream
detection cost (validator + augmenter + latent + domain ≈ 40s).

Fix is two-layered to match the rest of PR #130:

  * __get_pydantic_json_schema__ widens the emitted JSON Schema to a
    oneOf of {wrapper-object, bare-array}. The DD pre-check now
    accepts both shapes.
  * _accept_bare_list (mode="before") normalizes a bare-list input
    back to the wrapper dict so the reconstructor and downstream
    consumers continue to read the canonical
    SimpleDispositionResult.sensitivity_disposition shape.

Tests cover: bare-list parses, wrapper-dict still parses (no
regression), emitted JSON Schema has oneOf with object+array branches.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
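The second layer can be sketched with pydantic directly. A minimal illustration, assuming the field name `sensitivity_disposition` and a simplified item type; the real schema carries the full SimpleDispositionItem fields:

```python
from pydantic import BaseModel, model_validator


class SimpleDispositionResult(BaseModel):
    sensitivity_disposition: list[dict] = []

    @model_validator(mode="before")
    @classmethod
    def _accept_bare_list(cls, data: object) -> object:
        # A bare top-level list is wrapped back into the canonical dict
        # shape before field validation runs, so downstream consumers
        # always read .sensitivity_disposition.
        if isinstance(data, list):
            return {"sensitivity_disposition": data}
        return data
```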
…Schema

Mirrors the _coerce_proposed_label validator already on
ValidationDecisionSchema. The chunked-validation path (added in #126,
merged into this PR via ca08fb2) routes LLM output through
RawValidationDecisionSchema instead of the wire-loose
ValidationDecisionSchema, so the existing coercer didn't apply on
that path.

Failure mode (gemma4-e4b on the post-merge bench): model emits
``proposed_label: null`` for ``decision: keep`` decisions; pydantic
rejects the entire chunk's records with ``string_type`` errors;
DataDesigner drops the row. On a 5-note × 2-mode bench across 5
candidate models, this caused gemma4-e4b's success rate to crater
to 1/5 redact + 2/5 rewrite even though the model otherwise
delivers best-in-class quality (util 0.96 / leak 0.03 on the rows
that survived).

Tests cover null and int-typed proposed_label drift. 501 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ema.decision

gemma4-e4b on the legal_court note re-bench emitted free-form prose in
the decision slot — observed strings: "No specific entity type for the
date placeholder." and "No specific field matched for this token." —
which pydantic rejects as not matching the ValidationChoice enum,
dropping the whole chunk's records.

Mirrors ValidationDecisionSchema._normalize_decision but with one
deliberate difference: this schema's chunked-validation merger
(merge_chunk_decisions) treats decision=None as "no answer, skip" so a
later chunk can supply a real verdict for the same id. Therefore the
normalizer preserves None-ness and only normalizes string drift —
None / non-string / blank string -> None, free-form prose -> "keep"
(conservative privacy-safe fallback: preserve detection over silently
dropping the entity).

Tests cover: free-form prose -> "keep"; blank/None -> None (preserves
merger's skip semantic); display-variant substring matching ("Keep.",
"DROP!", "reclass entity"); and most-specific-first when prose mixes
choices.

Caught and fixed during the gemma4-e4b 10-run revalidation that
followed 576de6a — that fix landed redact 1/5 -> 4/5 and rewrite 2/5
-> 4/5. The remaining legal_court failure is this drift mode.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment on lines +307 to +309
raw_method = (item.protection_method_suggestion or "").strip()
if raw_method:
    method = raw_method

P1 protection_method_suggestion not normalised before strict-schema construction

raw_method is taken verbatim from the LLM output and passed directly to EntityDispositionSchema(protection_method_suggestion=method). EntityDispositionSchema.protection_method_suggestion is typed as ProtectionMethod, a (str, Enum) subclass with lowercase values ("replace", "generalize", etc.) and use_enum_values=True. Pydantic v2 performs case-sensitive enum coercion, so a small-model emission like "REPLACE", "Replace", or "suppress_Inference" raises a ValidationError that propagates unhandled from _reconstruct_full_disposition_column, dropping the record — the exact failure mode this PR is designed to prevent.

Add case-folding analogous to how sensitivity is handled:

raw_method = (item.protection_method_suggestion or "").strip().lower()

A substring fallback (similar to _normalize_category) would further harden against drift like "replace_with_surrogate".

qwen3.5-4b on legal_court rewrite emits the units as a top-level array
without the ``units`` wrapper key. DataDesigner's jsonschema gate
rejects this as not type=object, dropping the row before pydantic gets
to coerce.

Same two-layer fix as dc0bc84 for SimpleDispositionResult:
__get_pydantic_json_schema__ widens to oneOf(object, array), and a
mode="before" validator normalizes the bare list back to the wrapper.

Tests cover bare-list parses, wrapper-dict still parses, and oneOf
emitted with both branches.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment on lines +810 to +819
@field_validator("reason", mode="before")
@classmethod
def _truncate_reason(cls, v: object) -> object:
    """Small-model observed emitting 250+ char reasons (nemotron-3-nano on
    vLLM). Truncate to fit max_length=200 rather than dropping the record."""
    if isinstance(v, str) and len(v) > 200:
        return v[:197].rstrip() + "..."
    if v is None:
        return "no reason provided"
    return v

P1 _truncate_reason misses the empty-string case

reason: str = Field(min_length=1, max_length=200) rejects "" after the before-validator runs. The validator handles None and len > 200, but an empty string falls through the two guards and is returned unchanged. Pydantic then raises ValidationError on min_length=1, dropping the record — the same failure mode this PR is designed to prevent.

Suggested change
@field_validator("reason", mode="before")
@classmethod
def _truncate_reason(cls, v: object) -> object:
    """Small-model observed emitting 250+ char reasons (nemotron-3-nano on
    vLLM). Truncate to fit max_length=200 rather than dropping the record."""
    if isinstance(v, str) and len(v) > 200:
        return v[:197].rstrip() + "..."
    if v is None:
        return "no reason provided"
    return v
@field_validator("reason", mode="before")
@classmethod
def _truncate_reason(cls, v: object) -> object:
    """Small-model observed emitting 250+ char reasons (nemotron-3-nano on
    vLLM). Truncate to fit max_length=200 rather than dropping the record."""
    if isinstance(v, str) and len(v) > 200:
        return v[:197].rstrip() + "..."
    if v is None or (isinstance(v, str) and not v.strip()):
        return "no reason provided"
    return v

mvansegbroeck and others added 3 commits April 26, 2026 07:33
…ructor

Addresses greptile-apps[bot] inline findings on PR #130 HEAD, plus two
defensive nits surfaced in the round-2 review.

Greptile P1s (each is the same class of failure this PR is meant to fix:
small-model output drops the record before our normalization runs):

  * disposition_derivation: protection_method_suggestion now flows through
    a new _normalize_method() (case-fold + substring fallback) before
    EntityDispositionSchema construction. Pydantic enum coercion on
    ProtectionMethod is case-sensitive, so a small-model "REPLACE" or
    "Replace" was raising ValidationError and dropping the record.

  * rewrite: PrivacyAnswerItemSchema._truncate_reason now also coerces
    empty/whitespace-only strings to "no reason provided". Without this,
    the empty-string emission falls through both pre-fix guards (None,
    len > 200) and trips Field(min_length=1) on the strict pass.

  * sensitivity_disposition: _reconstruct_full_disposition_column now
    short-circuits when the simple wire payload is empty AND wraps the
    reconstruct call in try/except ValidationError. The orphans-only
    case (every SimpleDispositionItem id is out-of-context-range AND the
    model omitted entity_label/value echoes) used to propagate
    ValidationError unhandled. Now it emits an empty
    sensitivity_disposition column with a warning.

Greptile P2:

  * disposition_derivation: drop dead endswith("_identifiers") branch in
    _normalize_category. The preceding endswith("s") + _VALID_CATEGORIES
    check already handles direct/quasi/sensitive/latent_identifiers; the
    plural-stripping branch was unreachable.

Defensive nit:

  * detection: add cross-reference comment in
    ValidationDecisionSchema._normalize_decision pointing at
    RawValidationDecisionSchema._normalize_decision (same name, divergent
    None semantics — first cross-ref already existed in the chunked-path
    docstring; this closes the loop).

Tests added: 5 new (_normalize_method case + substring + empty +
end-to-end via reconstructor; _truncate_reason empty + whitespace
variant; reconstruct column empty + orphans-only). 514 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… pydantic restructure

Closes engineering nits A/B/C from the round-2 /review on PR #130:

A) **__get_pydantic_json_schema__ brittleness** — the inline overrides on
   SimpleDispositionResult and MeaningUnitsSchema both hardcoded
   wrapped["properties"][field_name] access. A pydantic minor bump that
   moves inline property schemas behind $ref indirection would silently
   break the oneOf widening, sending bare-list LLM responses back to DD's
   jsonschema gate and dropping records — the failure mode this PR is
   meant to fix. The new loose_list_wrapper_json_schema helper detects
   the missing-property and $ref-only cases, degrades to the unwidened
   wrapper, and emits a warning so future regressions are observable.

B) **Pattern duplication** — three near-identical __get_pydantic_json_schema__
   + _accept_bare_list pairs (one each on SimpleDispositionResult and
   MeaningUnitsSchema; the third would arrive with the next loose-wire
   schema) now share a single implementation in
   engine/schemas/shared.py. Two helpers:
     * loose_list_wrapper_json_schema(handler, schema, *, list_field) -> dict
     * accept_bare_list(*, list_field) -> classmethod-decorated validator
   Call sites compress from ~10 LOC each to a single line.

C) **Two _normalize_decision functions diverge in semantics** —
   ValidationDecisionSchema returns "keep" for None/blank;
   RawValidationDecisionSchema preserves None for the chunk-merger's
   "no answer, skip" path. Same name made the divergence invisible at
   call sites. Renamed the chunked variant to
   _normalize_decision_preserve_none. Cross-reference docstrings updated
   in both directions.

Tests: +2 (loose_list_wrapper widens to oneOf; falls back gracefully on
$ref-only or missing property). Total 516 passing. Existing tests for
SimpleDispositionResult / MeaningUnitsSchema oneOf emission and
RawValidationDecisionSchema._normalize_decision_preserve_none continue
to pass via the helper indirection / mechanical rename respectively.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
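The widen-with-fallback behaviour from (A) can be sketched as a plain-dict transform. Simplified: the real helper receives pydantic's schema handler and field name, and the function name here is illustrative.

```python
import logging

logger = logging.getLogger(__name__)


def widen_to_one_of(wrapper_schema: dict, list_field: str) -> dict:
    """Widen an object schema to oneOf(wrapper, bare-array), degrading to
    the unwidened wrapper with a warning when the inline array schema is
    unavailable (missing property or hidden behind $ref indirection)."""
    props = wrapper_schema.get("properties", {})
    array_schema = props.get(list_field)
    if not isinstance(array_schema, dict) or "$ref" in array_schema:
        logger.warning("cannot widen %r: inline array schema unavailable", list_field)
        return wrapper_schema
    return {"oneOf": [wrapper_schema, array_schema]}
```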
# treat it as "no entities to protect" rather than dropping the row.
if not simple.sensitivity_disposition:
    logger.warning("reconstruct: empty SimpleDispositionResult for row; emitting empty sensitivity_disposition")
    row[COL_SENSITIVITY_DISPOSITION] = {"sensitivity_disposition": []}
Collaborator

Downstream parsing still routes through SensitivityDispositionSchema.model_validate(...) in parse_sensitivity_disposition(), so I think this guard will just cause a failure later when the QA and rewrite steps call this function.

I also think that in the case of missing sensitivity disposition, we should be forcing all direct identifiers to be replaced and all quasi identifiers to be generalized - that's the rough guidance in the sensitivity disposition prompt too (cc @asteier2026)

Contributor

That would be optimal, I agree with "in the case of missing sensitivity disposition, we should be forcing all direct identifiers to be replaced and all quasi identifiers to be generalized "

Comment on lines +175 to +183
# Mapping from Anonymizer entity-labels to EntityCategory, used when the
# disposition LLM outputs an entity_label in the `category` field (observed
# with small Gemma models). Derived from three category sets so the source
# of truth lives in one place per category — see
# test_entity_label_to_category_covers_default_labels for the CI guard
# that fires when a label is added to DEFAULT_ENTITY_LABELS without a
# category assignment here. Any label not in this table falls back to
# "quasi_identifier" which is the most conservative (protect-cautiously)
# choice.
Collaborator

Two things I need to think more about in case the LLM at the sensitivity disposition step fails to generate output (and @asteier2026 probably knows more here)

  1. Is this list of direct identifiers, quasi and sensitive attrs right?
  2. Do we want any user supplied label to just be generalized instead of replaced?
