fix: improved small-model support #130
Conversation
Systematically observed across 7 local-Ollama models x 10 notes (see
notes/lipika-heads-up-local-serving.md §5):
- Gemma family puts entity_label strings (last_name, date_time) into
the EntityCategory enum field, emits proposed_label: null instead of
empty string, and emits unquoted postcodes like 98101 as ints.
- Qwen 3.5 4B omits combined_risk_level entirely; 9B writes too-short
rationale strings that fail minLength=20 on latent entities.
Two kinds of fix here:
1. Prompt (sensitivity_disposition.py):
- use exact enum strings as <entity_categories> labels
- add explicit <allowed_enum_values> block
- distinguish sensitivity vs combined_risk_level with an example
- add TYPE RULES section calling out null vs empty-string and
quote-numeric-looking-strings
2. Schemas (detection.py + rewrite.py) add defensive coercion:
- ValidationDecisionSchema: coerce None to empty string and int
to str on value/label/proposed_label
- EntityDispositionSchema: default combined_risk_level to
sensitivity when missing; normalize category display-label
drift (DIRECT IDENTIFIERS -> direct_identifier); int to str on
entity_value/entity_label/protection_reason
- LatentEntitySchema: rationale minLength 20 -> 10
Additive: schemas that validated before still validate; rows that
would have raised now coerce into the intended form.
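The coercion in (2) can be sketched as a plain function (illustrative only — in the PR it lives in a pydantic @model_validator(mode="before"); the field names follow the commit message above):

```python
def coerce_small_model_fields(data):
    """Best-effort repair of small-model type drift before strict validation."""
    if not isinstance(data, dict):
        return data
    for key in ("value", "label", "proposed_label"):
        if key in data:
            if data[key] is None:
                data[key] = ""              # null -> empty string
            elif isinstance(data[key], int):
                data[key] = str(data[key])  # unquoted postcode like 98101 -> "98101"
    return data
```

Inputs that already validated pass through unchanged, matching the "additive" claim.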
…ntityCategory Small Gemma models consistently write the entity_label value (e.g. 'last_name', 'date_of_birth') into the EntityDispositionSchema.category slot, which expects one of the four EntityCategory enum values. Prompt fixes alone (commit f5b8b39) did not eliminate this drift on gemma4-e2b. Extend the before-validator with a best-effort entity_label -> category mapping table covering ~50 of the standard Anonymizer entity labels. Unmapped labels that still look like entity_label confusion fall back to quasi_identifier (most conservative). Unknown strings are left for pydantic to reject with a clean enum error.
Two small-model failure modes seen in the 140-run bench sweep that are not addressed by the entity_label->category map:
Class E — LatentEntitySchema.rationale maxLength overrun. Nemotron-3-Super-120B-A12B emitted a 260-char rationale on the oncology bench note (limit is 150), dropping the entire record at JSON-schema validation. Add a @field_validator(mode='before') that truncates overlong strings to 147 chars + '...' instead of rejecting.
Class G — merged EntityCategory enum value. Same model hallucinated category='latent_sensitive_attribute' on the same note — a splice of two legitimate EntityCategory values. Extend EntityDispositionSchema's before-validator with a substring match (direct > sensitive > latent > quasi) so merged values map to the strongest matching protection category. Verified:
latent_sensitive_attribute -> sensitive_attribute
direct_sensitive -> direct_identifier
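Both repairs can be sketched in isolation (the 150-char limit and the priority order come from the text above; the exact spelling of the latent category is an assumption, and the real code is pydantic validators):

```python
# Class E: truncate instead of reject when rationale exceeds maxLength.
def truncate_rationale(text, limit=150):
    return text if len(text) <= limit else text[: limit - 3] + "..."

# Class G: map a merged/spliced enum value to the strongest matching
# protection category (direct > sensitive > latent > quasi).
_CATEGORY_PRIORITY = [
    ("direct", "direct_identifier"),
    ("sensitive", "sensitive_attribute"),
    ("latent", "latent_identifier"),      # assumed spelling of the latent category
    ("quasi", "quasi_identifier"),
]

def resolve_merged_category(raw):
    lowered = raw.lower()
    for needle, canonical in _CATEGORY_PRIORITY:
        if needle in lowered:
            return canonical
    return None  # leave truly-unknown strings for the enum check to reject
```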
… run DataDesigner's parse_fn validates the raw LLM response against the pydantic model's emitted JSON Schema BEFORE calling model_validate(). That means the @model_validator(mode='before') coercion added in commits f5b8b39 + 0a4df44 never got the chance to run: drift values like 'last_name' or 'latent_sensitive_attribute' were rejected at jsonschema.validate() by the 'enum' check. Type EntityDispositionSchema.category as str (not EntityCategory) so the emitted JSON Schema has no enum constraint. Add a @field_validator(mode='after') that re-enforces enum membership AFTER the before-validator has had its chance to coerce. The model-facing enum hint is preserved in the field description (DD inlines it into the prompt). Verified end-to-end: last_name -> direct_identifier, latent_sensitive_attribute -> sensitive_attribute; truly-novel strings still rejected by the after-validator safety net.
…pply Retest of gemma4-e2b on oncology bench note (after class B/I/E/G fixes) surfaced a new failure: model omits protection_reason entirely on some entries. DD jsonschema pre-validates required fields before pydantic runs, so the omission was fatal at the JSON Schema gate. Two paired changes:
1. Give protection_reason a valid default (>= minLength 10). The field leaves the emitted JSON Schema required list, so the record is no longer dropped at the DD pre-check when the model omits it.
2. Change the before-validator string coercion to pop None instead of writing empty string. For fields without defaults this surfaces a clearer "field required" error; for fields with defaults (now protection_reason) the default correctly takes effect.
Verified: missing or null protection_reason -> auto placeholder; valid model-supplied reason passes through unchanged.
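The pop-None half of the change can be sketched as follows (hypothetical helper name; in the PR this logic sits inside the schema's before-validator):

```python
def drop_null_string_fields(data, keys=("value", "label", "protection_reason")):
    """Pop None values so a pydantic field default applies, or a clear
    'field required' error surfaces for fields without defaults."""
    for key in keys:
        if key in data and data[key] is None:
            del data[key]
    return data
```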
Fix-branch sweep showed small models violate the existing
after-validator consistency rule in both directions:
* needs_protection=False + method="suppress_inference" (gemma4-e2b
on 01_simple_clinical, latent-entity case)
* needs_protection=True + method="leave_as_is" (gemma4-e2b
on 06_financial_statement, 4 entities on one record)
Both patterns are logically inconsistent and previously killed the
record. Reconcile in the before-validator, biasing toward protection
(safer default for a privacy tool):
* non-trivial method implies needs_protection=True;
* needs_protection=True + method=leave_as_is -> method=replace.
Round-tripped both failure patterns end-to-end; consistent inputs
pass through unchanged.
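The reconciliation rules reduce to a small pure function (a sketch; the name is illustrative, the rules are exactly the two bullets above):

```python
def reconcile_protection(needs_protection, method):
    """Resolve inconsistent (needs_protection, method) pairs, biased
    toward protection -- the safer default for a privacy tool."""
    if method not in ("", "leave_as_is"):
        needs_protection = True          # non-trivial method implies protection
    if needs_protection and method == "leave_as_is":
        method = "replace"               # protect rather than leave as-is
    return needs_protection, method
```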
Grep audit showed combined_risk_level is declared on EntityDispositionSchema
and requested by the disposition prompt, but never read anywhere in the
library (only in test fixtures, which pydantic v2 silently ignores as
extra kwargs once the field is gone).
Field was pure schema debt — every disposition call paid token cost for
an output that flows into nothing. Deleting it:
- shrinks the LLM contract by 10% with zero semantic change
- removes one column from the retry/drift surface that classes
A/B/C fixes on this branch are working around
- removes the CombinedRiskLevel enum class (also unused)
Changes:
* schemas/rewrite.py: drop CombinedRiskLevel class, field, and the
@model_validator(mode=before) defaulting that injected it from
sensitivity.
* schemas/__init__.py: drop CombinedRiskLevel import and __all__.
* rewrite/sensitivity_disposition.py: drop combined_risk_level
prompt lines (QUALITY REQUIREMENTS bullet, ALLOWED ENUM VALUES
entry, DISTINGUISHING block).
Test impact: 0 new failures. 2 stale tests from commit bbdb6dc
(class K reconcile) remain failing and will be updated separately.
…ction helpers New loose wire-contract schema (SimpleDispositionItem / SimpleDispositionResult) that will be the output_format for the disposition_analyzer LLM column in the next commit. Every field is typed as str with a default, so the emitted JSON Schema has no enum / required / minLength constraints — DataDesigner's
…side reconstruction
…text over LLM echo
…t assertions Two small hygiene items:
1. Add drop=True to the _simple_disposition LLM column in SensitivityDispositionWorkflow so it does not leak into the final user-facing preview DataFrame. It is an internal hand-off to the reconstruction column; matches the pattern used for _validation_skeleton in detection_workflow.py.
2. Refresh two test assertions in test_schemas.py that predated the class-K reconciliation commit (bbdb6dc). They asserted pydantic raises on inconsistent (needs_protection, method) pairs, but the before-validator now reconciles instead of raising — safer for small-model drift. Tests now assert the reconciled outcome:
* needs_protection=False + method=replace -> needs_protection=True
* needs_protection=True + method=leave_as_is -> method=replace
DataDesigner writes the _latent_entities column to parquet; when the LLM returns an empty list, pyarrow fails with ArrowNotImplementedError: Cannot write struct type _latent_entities with no child field to Parquet. Observed on nemotron-3-super (and any model that correctly concludes no inferable latent PII is present). Inject a single all-defaults sentinel LatentEntitySchema via @model_validator(mode=after) on LatentEntitiesSchema when the list is empty. Downstream code (reconstruct_full_disposition et al) already filters entries with empty label/value so the sentinel is invisible semantically.
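The sentinel injection amounts to the following (field names are assumed from LatentEntitySchema's defaults; the real code is an after-validator on the wrapper schema):

```python
def pad_empty_latent_list(latent_entities):
    """pyarrow cannot write a struct column built only from empty lists,
    so inject one neutral all-defaults sentinel. Downstream code already
    filters entries with empty label/value, so it is semantically invisible."""
    if not latent_entities:
        return [{"entity_label": "", "entity_value": "", "rationale": ""}]
    return latent_entities
```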
All contributors have signed the DCO ✍️ ✅
The class R @model_validator(mode="after") sentinel on LatentEntitiesSchema only fires when DD routes through model_validate(). On partial-failure fallback paths DD can write the raw dict (including latent_entities: []) directly to the DataFrame. Downstream workflows then hit pyarrow with: ArrowNotImplementedError: Cannot write struct type '_latent_entities' with no child field to Parquet. _pad_empty_latent_column() runs on the DataFrame after DD returns and injects a neutral LatentEntitySchema() default into any empty cell, covering both the dict-wrapper and raw-list shapes DD may produce. Verified on rich_data with rewrite preview: pyarrow error gone.
# Best-effort mapping from Anonymizer entity-labels to EntityCategory, used when
# the model outputs an entity_label in the `category` field (observed with small
# Gemma models on the disposition step). Values are from DEFAULT_ENTITY_LABELS in
# config/entity_labels.py; any label not in this table falls back to
There is no config/entity_labels.py. DEFAULT_ENTITY_LABELS lives at src/anonymizer/engine/constants.py.
# config/entity_labels.py; any label not in this table falls back to
# "quasi_identifier" which is the most conservative (protect-cautiously) choice.
_ENTITY_LABEL_TO_CATEGORY: dict[str, str] = {
    # Direct identifiers: strong re-id on their own.
_ENTITY_LABEL_TO_CATEGORY is a hand-maintained copy and is already out of sync
The PR's own comment calls DEFAULT_ENTITY_LABELS the source of truth, but the table in schemas/rewrite.py contains labels that do not exist in ENTITY_LABEL_EXAMPLES — e.g. full_name, passport_number, drivers_license, address, zip_code, credit_card_number. Conversely it likely omits labels someone will add tomorrow.
Two reasonable options:
- (preferred) Define three small sets _DIRECT_IDS, _QUASI_IDS, _SENSITIVE_ATTRS and derive _ENTITY_LABEL_TO_CATEGORY as {lbl: "direct_identifier" for lbl in _DIRECT_IDS} | .... Then add a test: assert set(DEFAULT_ENTITY_LABELS) <= set(_ENTITY_LABEL_TO_CATEGORY).
- At minimum add the coverage assertion as a unit test so the next label-addition PR breaks CI rather than silently losing fallback coverage.
continue

# Derive derived fields.
method = (item.protection_method_suggestion or "").strip() or "leave_as_is"
When the small model omits the method, this silently derives needs_protection=False, even for a direct_identifier with sensitivity="high". That's exactly the kind of failure mode this PR exists to fix — and it's inconsistent with the pessimistic defaults the PR uses elsewhere (e.g. PrivacyAnswersSchema pads missing with "yes", _normalize_decision defaults to "keep" = preserve detection).
Suggested fix — pessimistic default based on category/sensitivity:
- method = (item.protection_method_suggestion or "").strip() or "leave_as_is"
+ raw_method = (item.protection_method_suggestion or "").strip()
+ if raw_method:
+     method = raw_method
+ elif category in ("direct_identifier", "sensitive_attribute") or sensitivity in ("medium", "high"):
+     method = "replace"
+ else:
+     method = "leave_as_is"
Add a test for SimpleDispositionItem(id=1, category="direct_identifier", sensitivity="high", protection_method_suggestion="") → reconstructed with needs_protection=True.
# like 98101 as bare integers).
@model_validator(mode="before")
@classmethod
def _coerce_small_model_output(cls, data):
Two layers of coercion for the same drift — duplicated logic, only one is reachable in prod
EntityDispositionSchema._coerce_small_model_output does ~70 lines of normalization (category drift, int→str, needs↔method reconciliation, etc.) — but since the workflow now routes every disposition through SimpleDispositionResult + reconstruct_full_disposition, the reconstructor always pre-massages the fields before calling EntityDispositionSchema(...). The strict-schema's before-validator never sees raw LLM output in the production path; it only runs in the unit tests.
Two downsides:
- ~70 lines of production-untested code billed as a live safety net.
- template_protection_reason vs the protection_reason default ("auto: model omitted protection_reason") are two different fallbacks for the same field, reachable via different paths. Currently the reconstructor path wins in prod; the stringly-typed default only kicks in if someone constructs EntityDispositionSchema directly from LLM output — a combination no one should be doing anymore.
Pick one gate:
- (a) Drop _coerce_small_model_output from EntityDispositionSchema, make category: EntityCategory again, rely exclusively on the reconstructor. Simpler, one source of truth.
- (b) Keep _coerce_small_model_output as the single unified gate, remove the templating/defaulting from reconstruct_full_disposition, and route every item through EntityDispositionSchema.model_validate.
(a) seems like a better pick. Reason: the reconstructor has access to context (category, sensitivity, method together) that the field-by-field coercion doesn't, so its reason templating is strictly better than the static default.
memadi-nv left a comment
Missing tests for claims in the PR summary / commits
- PR summary: "5 stale class-K assertions refreshed to assert reconciliation instead of raising" — the diff only shows 2 refreshed tests (test_entity_disposition_reconciles_no_protection_but_method_set, test_entity_disposition_reconciles_needs_protection_but_leave_as_is). We could not find the other 3!
- No test for _pad_empty_latent_column in detection_workflow.py (the belt-and-braces DataFrame sentinel).
- No test exercising the entity-label-stuffed-into-category path in _coerce_small_model_output via _ENTITY_LABEL_TO_CATEGORY — this was called out as a key Gemma failure and gets its own ~70 lines of table, but tests only cover display-label variants ("DIRECT IDENTIFIER", plural, "_identifiers"), not the "last_name" -> direct_identifier mapping.
- No test that ValidationDecisionSchema dropping value/label from the wire still produces correct output through enrich_validation_decisions (the PR claims those fields were already server-overridden — worth an end-to-end assertion).
- No test for PrivacyAnswerItemSchema._truncate_reason (200+ char inputs, the specific small-model drift the method exists for).
- No test for DomainClassificationSchema._coerce_confidence string inputs ("0.95", "85%", "high"), despite qa_generation.py's _DOMAIN_KEY lookup being downstream of this.
def test_multi_label_entity_flattens_to_multiple_slots() -> None:
    simple = _make_simple(
Prompt enumeration order must match _flatten_context — not verified
reconstruct_full_disposition pairs LLM items with context by context[item.id - 1], and _flatten_context emits one slot per (value, label) pair. If the jinja rendering of <<FINAL_ENTITIES>> presents entities to the LLM as one row per value (multi-label values merged), the LLM's numbering will not align with the context index — multi-label entities will either get the wrong pairing or fall into the orphan path and be silently dropped.
The new reconstructor has a unit test (test_multi_label_entity_flattens_to_multiple_slots), but there is no test asserting the prompt rendering enumerates in the same order and cardinality.
Please add an end-to-end test that renders the prompt for a multi-label entities_by_value row, asserts the entity list the model sees is numbered 1..N identically to _flatten_context, and feeds a matching hand-crafted LLM response through reconstruct_full_disposition with zero orphans. If the rendering turns out to be one-per-value, _flatten_context should collapse to match.
emits an empty list the record still survives so the pipeline can
decide what to do downstream. A @model_validator(mode="before") drops
units with empty unit text to reduce noise.
"""
MeaningUnitsSchema docstring describes behavior that isn't implemented
The docstring claims "A @model_validator(mode="before") drops units with empty unit text to reduce noise", but no such validator exists — only _ensure_list (list-wrap). Either implement the drop or fix the docstring.
Related: MeaningUnitSchema.id: int = Field(ge=1, default=1) — if the LLM emits multiple units whose ids all default to 1, any downstream id-keyed dedup collapses them. Consider default=None with sequential reassignment in _ensure_list, or document that the default=1 is purely a wire-layer escape hatch.
terse (Qwen-observed sub-10 chars) with an ellipsis to reach the
10-char soft floor. Exact floor kept as internal expectation but
not enforced at wire — server reads rationale for context only.
"""
Only the upper bound is enforced. Either implement the padding or drop the promise from the docstring — non-blocking (wire no longer has min_length) but confusing for whoever reads this next.
I have read the DCO document and I hereby sign the DCO.
Comment referenced config/entity_labels.py; DEFAULT_ENTITY_LABELS actually lives in engine/constants.py. No behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The validator only truncates above 150 chars; the docstring's promise to pad sub-10-char rationales was never implemented. Reviewer flagged as non-blocking. Behavior is unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The class only has a list-coercion before-validator on the units field; the docstring claimed a model_validator that drops empty units, which was never implemented. Behavior is unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…entities reconstruct_full_disposition previously defaulted an omitted protection_method_suggestion to leave_as_is regardless of category or sensitivity, which silently let high-sensitivity direct_identifiers slip through as needs_protection=False — exactly the small-model failure mode this PR exists to prevent. Mirrors the pessimistic-default pattern already used by PrivacyAnswersSchema and _normalize_decision: when method is empty and the entity is direct_identifier/sensitive_attribute or has medium/high sensitivity, default to replace. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
MeaningUnitSchema.id defaults to 1 to keep the wire schema permissive for small-model output. When the LLM emits multiple units and omits the id on all of them, every unit collapses to id=1, which makes downstream prompts that reference units by id ambiguous. _ensure_list now reassigns sequential 1..N when any id is missing or duplicated. Explicit unique ids are preserved verbatim. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
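The reassignment rule can be sketched as follows (operating on plain dicts for illustration; in the PR it lives in _ensure_list):

```python
def reassign_unit_ids(units):
    """Keep explicit unique ids verbatim; reassign sequential 1..N when
    any id is missing or duplicated."""
    ids = [u.get("id") for u in units]
    if units and (None in ids or len(set(ids)) != len(ids)):
        for n, unit in enumerate(units, start=1):
            unit["id"] = n
    return units
```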
The 71-key hand-maintained dict had drifted from DEFAULT_ENTITY_LABELS: six labels (full_name, passport_number, drivers_license, address, zip_code, credit_card_number) were not in the live label set and therefore unreachable via real detector output. Restructure into three frozensets — _DIRECT_ID_LABELS, _QUASI_ID_LABELS, _SENSITIVE_ATTR_LABELS — and derive the mapping. This puts each label's category assignment in one place per category and makes the next label-addition PR a simple set-membership change rather than a hand-edit risking silent omission. The phantom labels are dropped (they cannot be emitted by the detector). Mapping now covers exactly the 65 labels in DEFAULT_ENTITY_LABELS with no gaps and no phantoms. CI assertion guarding the inclusion lands in a follow-up commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add _normalize_category helper in disposition_derivation that resolves all four small-model category drift modes (display variants, merged enums, entity-label confusion, empty/unknown) into a valid EntityCategory value, with conservative fallback to quasi_identifier. reconstruct_full_disposition now calls _normalize_category before constructing EntityDispositionSchema, replacing the inline "category or empty -> quasi_identifier" default. The reconstructor already had access to context-derived entity_label, so the normalizer gets richer signal than the schema's own before-validator did. Sets up the next commit, which deletes the redundant before-validator. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…egory The schema-level @model_validator(mode="before") was unreachable in production: every disposition record now flows through the wire-loose SimpleDispositionItem schema and reconstruct_full_disposition, which calls _normalize_category and derive_needs_protection before constructing EntityDispositionSchema. The before-validator's category normalization and needs_protection<->method reconciliation were duplicated by, and superseded by, the reconstructor's path. Changes:
- Delete _coerce_small_model_output (~88 lines).
- Restore category: EntityCategory (strict enum) — emitted JSON Schema is internal-only now, so DD jsonschema pre-checks no longer matter.
- Drop the static "auto: model omitted protection_reason" default; the reconstructor always supplies a >=10-char reason via template_protection_reason.
- Drop _validate_category after-validator (redundant with enum type).
Tests refreshed: the two class-K reconciliation tests now assert ValidationError instead of reconciliation, since reconciliation lives in the reconstructor (covered by the new pessimistic-default tests in test_disposition_derivation.py). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…nce coercion
Adds tests for the three review-summary missing-test gaps in test_schemas.py:
- DEFAULT_ENTITY_LABELS coverage assertion: any label added to the live
set without a category assignment in schemas/rewrite.py now breaks CI
with a directive pointing at the right sets to extend.
- PrivacyAnswerItemSchema._truncate_reason: verifies 250+ char input
truncates to <=200 chars (rather than rejecting the record), boundary
case at exactly 200, and None coerces to a placeholder.
- DomainClassificationSchema._coerce_confidence: accepts float strings
("0.95"), percentage strings ("85%"), unparseable strings ("high"
falls back to 0.5), and clamps out-of-range values.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
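The confidence coercion those tests pin down might look like this (a sketch consistent with the cases listed above; the fallback of 0.5 comes from the commit message):

```python
def coerce_confidence(raw, fallback=0.5):
    """Accept floats, float strings, and percentage strings; fall back
    to a neutral 0.5 for unparseable input; clamp to [0.0, 1.0]."""
    if isinstance(raw, bool):
        return fallback
    if isinstance(raw, (int, float)):
        value = float(raw)
    elif isinstance(raw, str):
        text = raw.strip()
        try:
            value = float(text[:-1]) / 100 if text.endswith("%") else float(text)
        except ValueError:
            return fallback          # "high" and friends
    else:
        return fallback
    return min(1.0, max(0.0, value))
```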
…tten alignment Adds three new test groups for disposition reconstruction:
- _normalize_category: drift modes (entity-label-stuffed-into-category for all three categories, display variants, merged-enum hallucinations, unknown fallback) plus an end-to-end via the reconstructor that pins the canonical gemma failure mode (category="last_name" -> "direct_identifier").
- Class-K-style reconciliation through the reconstructor: now that the schema's before-validator was removed, document that derive_needs_protection makes the (needs_protection, method) pair tautologically consistent for replace, generalize, and leave_as_is.
- _flatten_context enumeration alignment: the prompt-rendering contract is "one slot per (value, label) pair" with tagged before latent. Multi-label entities round-trip with zero orphans when the LLM emits ids 1..N matching slot count.
Closes the test gap memadi-nv flagged on test_disposition_derivation.py:258. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ment Closes the last two review-summary missing-test gaps:
- _pad_empty_latent_column: 4 new tests covering the dict-cell sentinel injection, populated-cell preservation, bare-empty-list cells, and the missing-column no-op. Each pins the pyarrow workaround so a future refactor can't silently regress to "Cannot write struct type with no child field".
- enrich_validation_decisions end-to-end with the dropped wire fields: pins that ValidationDecisionSchema.model_dump() omits value/label (they were intentionally removed from the wire), and that enrichment fills them server-side from candidate_lookup. Reclass + proposed_label flow through unchanged.
Also: ruff's import-order fix on test_disposition_derivation.py and test_schemas.py (the new imports landed at non-canonical positions). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Greptile Summary
This PR systematically applies a two-layer small-model hardening pattern across every LLM-facing schema: a loose wire schema (str instead of enum, defaults instead of required, no strict length guards) that survives DataDesigner's pre-validate pass, paired with server-side before-validators and the new
Confidence Score: 5/5
Safe to merge — all previous P1 findings are addressed, 514 tests pass, and the only remaining finding is a P2 docstring discrepancy. No P0 or P1 issues remain. The single P2 (docstring priority order in _normalize_method) has no runtime impact since valid ProtectionMethod values are not substrings of each other in practice. Extensive test coverage (642 new test lines for disposition_derivation, 437 for schemas) and bench results across five models increase confidence. No files require special attention.
Important Files Changed
Sequence Diagram
sequenceDiagram
participant DD as DataDesigner
participant LLM as disposition_analyzer LLM
participant WS as SimpleDispositionResult
participant RC as _reconstruct_full_disposition_column
participant DR as disposition_derivation
DD->>LLM: prompt + loose JSON Schema (no enum/minLength/required)
LLM-->>DD: raw JSON (bare list, case-drifted enums, null fields)
DD->>DD: jsonschema.validate() against oneOf(wrapper, bare-array)
DD->>WS: model_validate() + _accept_bare_list normalizes bare list
WS-->>DD: SimpleDispositionResult (loose, defaults filled)
DD->>RC: row dict with COL_SIMPLE_DISPOSITION + context cols
RC->>DR: reconstruct_full_disposition(simple, entities_by_value, latent_entities)
DR->>DR: _normalize_category / _normalize_method / derive_needs_protection / template_protection_reason
DR-->>RC: SensitivityDispositionSchema (strict)
RC-->>DD: row[COL_SENSITIVITY_DISPOSITION] = full.model_dump()
Note over RC: ValidationError from orphans-only response
Note over RC: emits empty disposition, no record drop
Reviews (8): Last reviewed commit: "refactor(schemas): extract loose-list-wr..."
full = reconstruct_full_disposition(
    simple,
    row.get(COL_ENTITIES_BY_VALUE),
    row.get(COL_LATENT_ENTITIES),
)
row[COL_SENSITIVITY_DISPOSITION] = full.model_dump()
Unhandled ValidationError can still drop records
reconstruct_full_disposition raises ValidationError when full_items is empty (e.g., all SimpleDispositionItem IDs fall outside the context range AND the model omitted the optional echoes). SensitivityDispositionSchema enforces min_length=1 on sensitivity_disposition; a fully-orphaned small-model response that emits IDs like 100–103 with no entity_label/entity_value echoes would trip this path and propagate an unhandled exception from the column generator — exactly the record-drop behaviour this PR is designed to prevent.
The test test_empty_simple_result_produces_empty_schema explicitly asserts the raise and comments "caller is expected to short-circuit", but _reconstruct_full_disposition_column performs no such guard.
Consider wrapping the call (or returning early when simple.sensitivity_disposition is empty) to avoid surfacing the exception to DataDesigner's error-handling layer:
if not simple.sensitivity_disposition:
    logger.warning("_reconstruct_full_disposition_column: empty simple disposition for row; skipping reconstruction")
    return row
try:
    full = reconstruct_full_disposition(
        simple,
        row.get(COL_ENTITIES_BY_VALUE),
        row.get(COL_LATENT_ENTITIES),
    )
except Exception as exc:
    logger.warning("_reconstruct_full_disposition_column: reconstruction failed (%s); leaving disposition empty", exc)
    return row
row[COL_SENSITIVITY_DISPOSITION] = full.model_dump()
return row

method_label = method or "an appropriate method"
base = f"{cat_label} — protected via {method_label}"

prefix = _SENSITIVITY_PREFIX.get(sensitivity, "")
reason = (prefix + base).strip()
Dead code in plural-stripping logic
The endswith("_identifiers") branch can never be reached for any of the intended inputs. The preceding endswith("s") check already handles "direct_identifiers", "quasi_identifiers", and "latent_identifiers" — their [:-1] forms are all in _VALID_CATEGORIES, so that check returns first. The _identifiers branch is only reached when normalized[:-1] is not in _VALID_CATEGORIES, in which case it returns an invalid category string that would cause a ValidationError downstream since EntityDispositionSchema.category has no before-validator. In practice unreachable for expected model output, but the dead code is misleading and could cause silent failures for hypothetical edge-case inputs.
Resolves DIRTY merge state for PR #130. Conflicts in:
- src/anonymizer/engine/rewrite/sensitivity_disposition.py (4 hunks): the strict_entity_protection feature (#123) and the simple-wire-schema refactor on this branch both rewrote the disposition prompt and workflow column setup. Resolution: keep the two-step pipeline (loose SimpleDispositionResult on the wire, server-side reconstruction into SensitivityDispositionSchema) and pass strict_entity_protection through to the prompt. The prompt's <strict_entity_protection> block is the strict-mode contract; the reconstructor produces the downstream schema. The output_schema selection between SensitivityDispositionSchema and StrictSensitivityDispositionSchema is no longer needed because we always emit SimpleDispositionResult.
- src/anonymizer/engine/detection/detection_workflow.py: keep the imports of LatentEntitySchema this branch added; ValidationDecisionsSchema was no longer used after main's chunked-validation refactor.
- tests/engine/test_detection_workflow.py: keep both the new _pad_empty_latent_column tests from this branch and the chunked-validation wiring tests from main.
- src/anonymizer/engine/schemas/rewrite.py: re-add the CombinedRiskLevel enum that this branch had deleted as unused — main's StrictEntityDispositionSchema needs it. The non-strict EntityDispositionSchema still does not carry the field; the strict variant's to_entity_disposition() drops combined_risk_level during conversion.
- tests/engine/test_chunked_validation.py: two assertions in TestMergeChunkDecisions assumed the merge step's payload retained value/label. After the wire-schema drop, those fields are stripped by ValidationDecisionsSchema validation and re-filled later by enrich_validation_decisions from candidate_lookup. Tests updated to match the post-drop contract with a comment explaining why.
- tests/engine/test_detection_custom_columns.py: COL_VALIDATION_CANDIDATES was renamed to COL_SEED_VALIDATION_CANDIDATES in the new e2e test.
All 496 tests pass (was 427 pre-merge; +69 from main). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI's format-check (ruff format --check via tools/codestyle/format.sh) flagged 8 files I touched as needing reformat. Locally these passed ruff check, but ruff format applies stricter line-length / wrapping rules. No semantic change; pure mechanical reformat. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two ruff lint violations surfaced after the main merge; both auto-fixable. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…utput nemotron-3-nano:4b on rewrite mode (5/5 records on Spark Q5 bench) consistently emits the disposition items as a top-level array without the sensitivity_disposition wrapper key. Other small models drift to the bare-list shape intermittently on dense entity sets.
Failure path: DataDesigner runs jsonschema.validate() on the raw LLM response BEFORE pydantic gets to coerce. The previous wire schema required type=object, so a bare-list response was rejected as "is not of type 'object'" and the record dropped — wasting the upstream detection cost (validator + augmenter + latent + domain ≈ 40s).
Fix is two-layered to match the rest of PR #130:
* __get_pydantic_json_schema__ widens the emitted JSON Schema to a oneOf of {wrapper-object, bare-array}. The DD pre-check now accepts both shapes.
* _accept_bare_list (mode="before") normalizes a bare-list input back to the wrapper dict so the reconstructor and downstream consumers continue to read the canonical SimpleDispositionResult.sensitivity_disposition shape.
Tests cover: bare-list parses, wrapper-dict still parses (no regression), emitted JSON Schema has oneOf with object+array branches. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
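Both layers reduce to small pure functions (a sketch; in the PR the first lives in a pydantic before-validator and the second inside the __get_pydantic_json_schema__ override):

```python
def accept_bare_list(data):
    """Wrap a bare top-level array back into the canonical wrapper dict so
    downstream consumers always read sensitivity_disposition."""
    if isinstance(data, list):
        return {"sensitivity_disposition": data}
    return data

def widen_wire_schema(wrapper_schema):
    """Emit a oneOf so DataDesigner's jsonschema pre-check accepts both the
    wrapper object and the bare-array shape small models drift to."""
    return {"oneOf": [wrapper_schema, {"type": "array"}]}
```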
…Schema

Mirrors the _coerce_proposed_label validator already on ValidationDecisionSchema. The chunked-validation path (added in #126, merged into this PR via ca08fb2) routes LLM output through RawValidationDecisionSchema instead of the wire-loose ValidationDecisionSchema, so the existing coercer didn't apply on that path.

Failure mode (gemma4-e4b on the post-merge bench): model emits ``proposed_label: null`` for ``decision: keep`` decisions; pydantic rejects the entire chunk's records with ``string_type`` errors; DataDesigner drops the row. On a 5-note × 2-mode bench across 5 candidate models, this caused gemma4-e4b's success rate to crater to 1/5 redact + 2/5 rewrite even though the model otherwise delivers best-in-class quality (util 0.96 / leak 0.03 on the rows that survived).

Tests cover null and int-typed proposed_label drift. 501 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
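The coercer body is small; a minimal sketch of its logic as a plain function (illustrative name — the real version is a `@field_validator(mode="before")` on the schema):

```python
def coerce_proposed_label(v: object) -> object:
    """None -> "" and int -> str, per the drift modes above. Anything
    else passes through so pydantic can still reject genuinely wrong
    types. bool is excluded because bool is a subclass of int."""
    if v is None:
        return ""
    if isinstance(v, int) and not isinstance(v, bool):
        return str(v)
    return v
```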
…ema.decision
gemma4-e4b on the legal_court note re-bench emitted free-form prose in
the decision slot — observed strings: "No specific entity type for the
date placeholder." and "No specific field matched for this token." —
which pydantic rejects as not matching the ValidationChoice enum,
dropping the whole chunk's records.
Mirrors ValidationDecisionSchema._normalize_decision but with one
deliberate difference: this schema's chunked-validation merger
(merge_chunk_decisions) treats decision=None as "no answer, skip" so a
later chunk can supply a real verdict for the same id. Therefore the
normalizer preserves None-ness and only normalizes string drift —
None / non-string / blank string -> None, free-form prose -> "keep"
(conservative privacy-safe fallback: preserve detection over silently
dropping the entity).
Tests cover: free-form prose -> "keep"; blank/None -> None (preserves
merger's skip semantic); display-variant substring matching ("Keep.",
"DROP!", "reclass entity"); and most-specific-first when prose mixes
choices.
Caught and fixed during the gemma4-e4b 10-run revalidation that
followed 576de6a — that fix landed redact 1/5 -> 4/5 and rewrite 2/5
-> 4/5. The remaining legal_court failure is this drift mode.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
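The None-preserving normalizer described above might look like the following sketch. The choice strings and their most-specific-first ordering are assumptions (the actual ValidationChoice enum members are not shown in this thread):

```python
# Assumed ValidationChoice values, ordered most-specific-first so that
# prose mixing choices resolves to the most specific match.
_CHOICES = ("reclassify", "drop", "keep")

def normalize_decision_preserve_none(v):
    """None / non-string / blank -> None (the chunk-merger's 'no answer,
    skip' semantic); display-variant drift ('Keep.', 'DROP!') -> the
    canonical value via substring match; unrecognized free-form prose ->
    'keep' (conservative privacy-safe fallback)."""
    if not isinstance(v, str) or not v.strip():
        return None
    s = v.strip().lower()
    for choice in _CHOICES:
        if choice in s:
            return choice
    return "keep"
```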
```python
raw_method = (item.protection_method_suggestion or "").strip()
if raw_method:
    method = raw_method
```
protection_method_suggestion not normalised before strict-schema construction
raw_method is taken verbatim from the LLM output and passed directly to EntityDispositionSchema(protection_method_suggestion=method). EntityDispositionSchema.protection_method_suggestion is typed as ProtectionMethod — a `(str, Enum)` with lowercase values ("replace", "generalize", etc.) and use_enum_values=True. Pydantic v2 performs case-sensitive coercion, so a small-model emission like "REPLACE", "Replace", or "suppress_Inference" raises a ValidationError that propagates unhandled from _reconstruct_full_disposition_column, dropping the record — the exact failure mode this PR is designed to prevent.
Add case-folding analogous to how sensitivity is handled:
```python
raw_method = (item.protection_method_suggestion or "").strip().lower()
```

A substring fallback (similar to _normalize_category) would further harden against drift like "replace_with_surrogate".
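A sketch of what the suggested normalizer could look like, combining the case-fold with the substring fallback. The enum values are taken from this thread ("replace", "generalize", "suppress_inference"); the function name and ordering are illustrative:

```python
# Ordered so the most specific method wins when one value is a
# substring of drifted output.
_METHODS = ("suppress_inference", "generalize", "replace")

def normalize_method(raw):
    """Case-fold ('REPLACE' -> 'replace'), then fall back to substring
    matching ('replace_with_surrogate' -> 'replace'). Returns None for
    unrecognized input so pydantic can reject it with a clean error."""
    s = (raw or "").strip().lower()
    if s in _METHODS:
        return s
    for m in _METHODS:
        if m in s:
            return m
    return None
```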
qwen3.5-4b on legal_court rewrite emits the units as a top-level array without the ``units`` wrapper key. DataDesigner's jsonschema gate rejects this as not type=object, dropping the row before pydantic gets to coerce. Same two-layer fix as dc0bc84 for SimpleDispositionResult: __get_pydantic_json_schema__ widens to oneOf(object, array), and a mode="before" validator normalizes the bare list back to the wrapper. Tests cover bare-list parses, wrapper-dict still parses, and oneOf emitted with both branches. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
```python
@field_validator("reason", mode="before")
@classmethod
def _truncate_reason(cls, v: object) -> object:
    """Small-model observed emitting 250+ char reasons (nemotron-3-nano on
    vLLM). Truncate to fit max_length=200 rather than dropping the record."""
    if isinstance(v, str) and len(v) > 200:
        return v[:197].rstrip() + "..."
    if v is None:
        return "no reason provided"
    return v
```
_truncate_reason misses the empty-string case
reason: str = Field(min_length=1, max_length=200) rejects "" after the before-validator runs. The validator handles None and len > 200, but an empty string falls through the two guards and is returned unchanged. Pydantic then raises ValidationError on min_length=1, dropping the record — the same failure mode this PR is designed to prevent.
Suggested change (extend the None guard to also catch empty/whitespace-only strings):

```python
@field_validator("reason", mode="before")
@classmethod
def _truncate_reason(cls, v: object) -> object:
    """Small-model observed emitting 250+ char reasons (nemotron-3-nano on
    vLLM). Truncate to fit max_length=200 rather than dropping the record."""
    if isinstance(v, str) and len(v) > 200:
        return v[:197].rstrip() + "..."
    if v is None or (isinstance(v, str) and not v.strip()):
        return "no reason provided"
    return v
```
…ructor

Addresses greptile-apps[bot] inline findings on PR #130 HEAD, plus two defensive nits surfaced in the round-2 review.

Greptile P1s (each is the same class of failure this PR is meant to fix: small-model output drops the record before our normalization runs):

* disposition_derivation: protection_method_suggestion now flows through a new _normalize_method() (case-fold + substring fallback) before EntityDispositionSchema construction. Pydantic enum coercion on ProtectionMethod is case-sensitive, so a small-model "REPLACE" or "Replace" was raising ValidationError and dropping the record.
* rewrite: PrivacyAnswerItemSchema._truncate_reason now also coerces empty/whitespace-only strings to "no reason provided". Without this, the empty-string emission falls through both pre-fix guards (None, len > 200) and trips Field(min_length=1) on the strict pass.
* sensitivity_disposition: _reconstruct_full_disposition_column now short-circuits when the simple wire payload is empty AND wraps the reconstruct call in try/except ValidationError. The orphans-only case (every SimpleDispositionItem id is out-of-context-range AND the model omitted entity_label/value echoes) used to propagate ValidationError unhandled. Now it emits an empty sensitivity_disposition column with a warning.

Greptile P2:

* disposition_derivation: drop dead endswith("_identifiers") branch in _normalize_category. The preceding endswith("s") + _VALID_CATEGORIES check already handles direct/quasi/sensitive/latent_identifiers; the plural-stripping branch was unreachable.

Defensive nit:

* detection: add cross-reference comment in ValidationDecisionSchema._normalize_decision pointing at RawValidationDecisionSchema._normalize_decision (same name, divergent None semantics — first cross-ref already existed in the chunked-path docstring; this closes the loop).
Tests added: 5 new (_normalize_method case + substring + empty + end-to-end via reconstructor; _truncate_reason empty + whitespace variant; reconstruct column empty + orphans-only). 514 tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… pydantic restructure

Closes engineering nits A/B/C from the round-2 /review on PR #130:

A) **__get_pydantic_json_schema__ brittleness** — the inline overrides on SimpleDispositionResult and MeaningUnitsSchema both hardcoded wrapped["properties"][field_name] access. A pydantic minor bump that moves inline property schemas behind $ref indirection would silently break the oneOf widening, sending bare-list LLM responses back to DD's jsonschema gate and dropping records — the failure mode this PR is meant to fix. The new loose_list_wrapper_json_schema helper detects the missing-property and $ref-only cases, degrades to the unwidened wrapper, and emits a warning so future regressions are observable.

B) **Pattern duplication** — three near-identical __get_pydantic_json_schema__ + _accept_bare_list pairs (one each on SimpleDispositionResult and MeaningUnitsSchema; the third would arrive with the next loose-wire schema) now share a single implementation in engine/schemas/shared.py. Two helpers:

* loose_list_wrapper_json_schema(handler, schema, *, list_field) -> dict
* accept_bare_list(*, list_field) -> classmethod-decorated validator

Call sites compress from ~10 LOC each to a single line.

C) **Two _normalize_decision functions diverge in semantics** — ValidationDecisionSchema returns "keep" for None/blank; RawValidationDecisionSchema preserves None for the chunk-merger's "no answer, skip" path. Same name made the divergence invisible at call sites. Renamed the chunked variant to _normalize_decision_preserve_none. Cross-reference docstrings updated in both directions.

Tests: +2 (loose_list_wrapper widens to oneOf; falls back gracefully on $ref-only or missing property). Total 516 passing. Existing tests for SimpleDispositionResult / MeaningUnitsSchema oneOf emission and RawValidationDecisionSchema._normalize_decision_preserve_none continue to pass via the helper indirection / mechanical rename respectively.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
```python
# treat it as "no entities to protect" rather than dropping the row.
if not simple.sensitivity_disposition:
    logger.warning("reconstruct: empty SimpleDispositionResult for row; emitting empty sensitivity_disposition")
    row[COL_SENSITIVITY_DISPOSITION] = {"sensitivity_disposition": []}
```
Downstream parsing still routes through SensitivityDispositionSchema.model_validate(...) in parse_sensitivity_disposition(), so I think this guard will just cause a failure later when the QA and rewrite steps call this function.
I also think that in the case of missing sensitivity disposition, we should be forcing all direct identifiers to be replaced and all quasi identifiers to be generalized - that's the rough guidance in the sensitivity disposition prompt too (cc @asteier2026)
That would be optimal, I agree with "in the case of missing sensitivity disposition, we should be forcing all direct identifiers to be replaced and all quasi identifiers to be generalized "
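A minimal sketch of the fallback policy proposed here — replace direct identifiers, generalize quasi identifiers — assuming illustrative dict-shaped entities and field names (the real code would build EntityDispositionSchema instances):

```python
# Default protection when the disposition step returns nothing, per the
# review discussion above. Category/field names are illustrative.
_FALLBACK_METHOD = {
    "direct_identifier": "replace",
    "quasi_identifier": "generalize",
}

def fallback_dispositions(entities):
    """Force a conservative disposition onto each detected entity when
    the LLM disposition output is missing; other categories are left
    untouched for downstream handling."""
    out = []
    for entity in entities:
        method = _FALLBACK_METHOD.get(entity.get("category"))
        if method is None:
            out.append(dict(entity))
        else:
            out.append({**entity,
                        "protection_method_suggestion": method,
                        "needs_protection": True})
    return out
```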
```python
# Mapping from Anonymizer entity-labels to EntityCategory, used when the
# disposition LLM outputs an entity_label in the `category` field (observed
# with small Gemma models). Derived from three category sets so the source
# of truth lives in one place per category — see
# test_entity_label_to_category_covers_default_labels for the CI guard
# that fires when a label is added to DEFAULT_ENTITY_LABELS without a
# category assignment here. Any label not in this table falls back to
# "quasi_identifier" which is the most conservative (protect-cautiously)
# choice.
```
Two things I need to think more about in case the LLM at the sensitivity disposition step fails to generate output (and @asteier2026 probably knows more here)
- Is this list of direct identifiers, quasi and sensitive attrs right?
- Do we want any user supplied label to just be generalized instead of replaced?
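For illustration, a tiny subset of the table's shape (the label-to-category assignments below are examples, not the authoritative ~50-label list under review above):

```python
# Illustrative subset of the three source-of-truth sets; the real code
# derives _ENTITY_LABEL_TO_CATEGORY from frozensets covering ~50 labels.
_DIRECT_ID_LABELS = frozenset({"first_name", "last_name", "email_address", "ssn"})
_QUASI_ID_LABELS = frozenset({"date_of_birth", "postcode", "city"})
_SENSITIVE_ATTR_LABELS = frozenset({"medical_condition", "religion"})

_ENTITY_LABEL_TO_CATEGORY = (
    {label: "direct_identifier" for label in _DIRECT_ID_LABELS}
    | {label: "quasi_identifier" for label in _QUASI_ID_LABELS}
    | {label: "sensitive_attribute" for label in _SENSITIVE_ATTR_LABELS}
)

def map_label_to_category(value: str) -> str:
    """Mapped labels are rewritten to their category; everything else
    passes through so the EntityCategory enum either accepts a valid
    value or rejects with a clean enum error."""
    return _ENTITY_LABEL_TO_CATEGORY.get(value, value)
```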
Summary
Anonymizer's rewrite pipeline was dropping records on every non-`gpt-oss-120b` model due to schema strictness that small models couldn't satisfy. Root cause: DataDesigner runs `jsonschema.validate()` on raw LLM output before pydantic's before-validators run — any strict `enum`, `required`, `type`, `minLength`, or uniqueness check becomes an un-coercible gate for small-model drift. This PR applies a two-layer pattern to every LLM-facing schema that was dropping records:

1. A loose wire schema — `str` (not `Enum`), defaults instead of `required`, nullable where useful, no strict lengths. The emitted JSON Schema stops rejecting drifted output. For some shapes (disposition / meaning units) the wire schema is widened to `oneOf({wrapper}, {array})` via `__get_pydantic_json_schema__` so DD's pre-validate accepts both shapes.
2. Server-side reconstruction into the strict schema, guarded with `try/except ValidationError` so an orphans-only response emits an empty disposition rather than dropping the row.

Schemas touched

- `EntityDispositionSchema` → `SimpleDispositionItem` on the wire (`oneOf` widening accepts bare-list shape). `disposition_derivation.reconstruct_full_disposition` builds the strict `EntityDispositionSchema` server-side via `_normalize_category` + `_normalize_method` (both case-fold + substring fallback) + `derive_needs_protection` (tautologically consistent (needs, method) pair). `category` restored to `EntityCategory` enum; `_coerce_small_model_output` deleted; `combined_risk_level` deleted. Pessimistic `needs_protection` default for high-risk entities with omitted method. Reconstructor column guards `ValidationError` from orphans-only responses.
- `SimpleDispositionResult` — `oneOf({wrapper-object}, {bare-array})` so models that omit the `sensitivity_disposition` key (observed: nemotron-3-nano:4b, 5/5 records) still pass DD's jsonschema pre-validate. Before-validator normalizes the bare-list shape back to the wrapper.
- `MeaningUnitsSchema` — `oneOf` + `_accept_bare_list` widening (qwen3.5:4b legal_court drift). `_ensure_list` reassigns sequential ids 1..N when the LLM omits ids or emits collisions.
- `RawValidationDecisionSchema` — `_normalize_decision` (preserves None for the chunk-merger's "no answer, skip" semantics; prose drift → `"keep"`) and `_coerce_proposed_label` (None → `""`, int → str). Mirrors `ValidationDecisionSchema`'s coercers which existed only on the non-chunked wire path.
- `ValidationDecisionSchema` — `decision` enum → str; `proposed_label` nullable; drop dead `value`/`label` (already server-overridden in `enrich_validation_decisions`).
- `LatentEntitySchema` — `rationale` truncation + empty default; `evidence` list clamp; sentinel injected on empty list to keep pyarrow happy.
- `MeaningUnitSchema` — `aspect` enum → str + normalize.
- `DomainClassificationSchema` — `Domain` enum → str + normalize; string-to-float confidence coercion (accepts `"0.95"`, `"85%"`, falls back to 0.5 for `"high"`).
- `QualityAnswersSchema` / `PrivacyAnswersSchema` / `QACompareResultsSchema` — `@model_validator(mode="before")` dedups IDs, pads missing with pessimistic defaults, drops extras.
- `PrivacyAnswerItemSchema.reason` — truncates at 200 chars and now also coerces empty/whitespace-only strings (was rejected by `min_length=1`).
- `_ENTITY_LABEL_TO_CATEGORY` — derived from three frozensets (`_DIRECT_ID_LABELS`, `_QUASI_ID_LABELS`, `_SENSITIVE_ATTR_LABELS`); a CI-asserted coverage test guards against future divergence from `DEFAULT_ENTITY_LABELS`.

Drift modes uncovered post-merge (round-2 fixes)
The initial PR was reviewed and revised (commits 19efe6e..16bff86). After merging origin/main (chunked-validation feature #126) and re-benching against 5 candidate models on DGX Spark, four additional drift modes surfaced. Each is now covered:
| Drift mode | Schema | Commit |
| --- | --- | --- |
| bare list without the `{sensitivity_disposition: [...]}` wrapper | `SimpleDispositionResult` | dc0bc84 |
| `proposed_label: null` from chunked-validation path | `RawValidationDecisionSchema` | 576de6a |
| free-form prose in the `decision` field ("No specific entity type...") | `RawValidationDecisionSchema` | 5fa19b7 |
| bare list without the `{units: [...]}` wrapper | `MeaningUnitsSchema` | 329d488 |

Round-2 hardening (commit 5e55aea) closes greptile's automated review:

- normalize `protection_method_suggestion` before strict-schema construction (`"REPLACE"` no longer drops the record)
- empty-string `_truncate_reason` coerces to a placeholder (was failing `min_length=1`)
- `_reconstruct_full_disposition_column` guards `ValidationError` from orphans-only responses
- drop the dead `endswith("_identifiers")` branch in `_normalize_category`

Bench results — 5 candidate models on DGX Spark (Ollama Q4_K_M, post-fix)
[Bench table: `gemma4-e2b` (2B) · `nemotron-3-nano:4b` · `gemma4-e4b` ⭐ · `qwen3.5:4b` · `qwen3.5:9b`]

Recommended sweet spot for rewrite: `gemma4:e4b` — best small-model utility, 5/5 reliable, mid-tier latency. `gemma4:e2b` is the speed pick for replace mode (~21s avg).

NVFP4 / vLLM second-half was bench-tested (separate commit set, on cosmicproc/E4B + prithivMLmods/E2B community quants). Result: no Blackwell speedup observed at single-stream batch with community W4A4 quants — both gemma4 sizes were equal-or-slower than Ollama Q4_K_M with -14 to -22pp utility drop. Documented as future work; not in this PR's scope.
Type of Change
Testing
- `make test` passes locally — 514 tests pass (108 new since the original PR open: schema coverage, label coverage, privacy reason truncation including empty-string, domain confidence string coercion, MeaningUnit id renumbering, `_normalize_category` + `_normalize_method` drift modes, class-K reconstructor reconciliation, `_flatten_context` enumeration alignment, `_pad_empty_latent_column`, end-to-end `enrich_validation_decisions` with dropped wire fields, bare-list shapes for both disposition and meaning, `RawValidationDecisionSchema` None/prose/int coercers, reconstruct-column orphans-only guard)
- `make check` passes locally (format + lint + typecheck)

Documentation

- `make docs-build` passes locally — no doc changes in this PR

Review response
@memadi-nv (round 1, 2026-04-24): 7 inline comments + 6 missing-test gaps. All addressed by commits 19efe6e..16bff86. Brief mapping in commit messages.
greptile-apps[bot] (rounds 1-2): 3 P1 + 1 P2 from the round-2 review on commit 329d488. All closed by commit 5e55aea.

Round-2 reviewers (Amy / Andre): Andre's engineering nits (`_normalize_method` lowercase, dead branch, ValidationError guard, empty-string `_truncate_reason`, cross-ref comment) all addressed. Amy's prompt-restoration concerns (COVERAGE / QUALITY clauses dropped from the disposition prompt) and the requested A/B utility comparison are noted as follow-up — recommended for a separate PR with isolated bench evidence rather than reverting prompts that bench shows produce 0.92 utility on the headline model.
4a54b85)The two schemas (
SimpleDispositionResult,MeaningUnitsSchema) that need bare-list tolerance now share a single helper pair inengine/schemas/shared.py:loose_list_wrapper_json_schema(handler, schema, *, list_field)— emitsoneOf({wrapper}, {array}), gracefully degrades + logs a warning if pydantic ever moves the inline property schema behind a$ref(defensive against future restructure).accept_bare_list(*, list_field)— builds themode="before"validator that wraps a top-level bare list back into the canonical wrapper dict.Each call site is now one line each. Future loose-wire schemas drop in trivially.
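The factory shape of `accept_bare_list` can be sketched as a plain closure (the real helper returns a classmethod-decorated validator for attachment to a pydantic model; this standalone version only shows the parameterization):

```python
def accept_bare_list(*, list_field: str):
    """Build a mode='before'-style normalizer parameterized on the
    wrapper's list field, so each loose-wire schema gets bare-list
    tolerance from a one-line call site."""
    def _validator(value):
        if isinstance(value, list):
            return {list_field: value}   # bare-list drift: re-wrap
        return value                     # wrapper dict (or junk left
                                         # for pydantic to reject)
    return _validator
```

A call site would then read something like `_accept = accept_bare_list(list_field="units")`.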
Also in 4a54b85: `RawValidationDecisionSchema._normalize_decision` renamed to `_normalize_decision_preserve_none` so the divergence vs the canonical `ValidationDecisionSchema._normalize_decision` (which returns "keep" instead of preserving `None`) is visible at call sites without requiring docstring comparison.

Open follow-up
Amy's prompt-restoration request (re-add COVERAGE / QUALITY clauses to the disposition prompt + run isolated A/B utility comparison) is deferred to a separate PR. Aggregate bench evidence on this branch shows util preserved (0.92 mean on the `gemma4:e4b` headline; 5/5 reliable across all 5 candidate models). The follow-up PR will pair restored clauses with side-by-side bench evidence to isolate the utility delta.

Related
Analysis trail and full benchmark matrix are documented externally in `notes/lipika-heads-up-local-serving.md` §5/§5b in the `anonymizer-strategy` workspace.