Priority Level
Medium (Annoying but has workaround)
Describe the bug
The entity validator's output schema declares value: str, so DataDesigner's pre-validation demands type: "string" for every decision's value field. Some LLMs (observed with GPT-5.4-mini) occasionally strip the quotes from numeric-looking entity values when filling in the skeleton, returning "value": 42 instead of "value": "42". DD rejects the response and the record is dropped from the dataset with Record missing from workflow output.
This happens probabilistically — in a small smoke-test run against real endpoints, ~1 in 6 records containing an age entity (or any other numeric-looking value) got dropped this way. It affects both the sync and async engines identically.
Steps/Code to reproduce bug
Run detection on any record containing a numeric quasi-identifier like age, using GPT-5.4-mini (or a similarly liberal model) as entity_validator:
from anonymizer import Anonymizer
from anonymizer.config.anonymizer_config import AnonymizerConfig, AnonymizerInput, Detect
from anonymizer.config.replace_strategies import Redact
# (model config wiring GPT-5.4-mini as validator elided)
anonymizer.run(
    config=AnonymizerConfig(detect=Detect(), replace=Redact()),
    data=AnonymizerInput(
        source="input.csv",  # single row: "Patient Bob Smith, 42, was admitted..."
        text_column="text",
    ),
)
Representative DD warning (from a real run):
Non-retryable failure on _validation_decisions[rg=0, row=1]:
| Cause: The model output from 'openai/openai/gpt-5.4-mini' could not be
| parsed into the requested format while running generation for column
| '_validation_decisions'. Validation detail: Response doesn't match
| requested <response_schema> 42 is not of type 'string' Failed
| validating 'type' in schema['properties']['decisions']['items']
| ['properties']['value']: {'default': '', 'description': 'Entity value
| (echoed from skeleton)', 'title': 'Value', 'type': 'string'}
| On instance['decisions'][2]['value']: 42.
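The type mismatch in the warning can be reproduced in isolation with a minimal stdlib sketch that mirrors the `type: "string"` constraint (the warning's wording suggests DD's pre-validation is jsonschema-style; this stand-in only checks the one failing rule):

```python
import json

def value_fields_valid(payload: dict) -> bool:
    """Mimic the failing check: every decision's 'value' must be a JSON string."""
    return all(isinstance(d.get("value", ""), str) for d in payload["decisions"])

quoted = json.loads('{"decisions": [{"value": "42"}]}')   # model kept the quotes
unquoted = json.loads('{"decisions": [{"value": 42}]}')   # model stripped them

print(value_fields_valid(quoted))    # True  -> record survives
print(value_fields_valid(unquoted))  # False -> record dropped
```

Whether a given record trips this is entirely down to how the model serializes numeric-looking values, which is why the failure is probabilistic.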
Expected behavior
Records should survive the validator regardless of whether the LLM echoes back a numeric value as a JSON string ("42") or a JSON number (42). The value field is purely echoed context — enrich_validation_decisions in src/anonymizer/engine/detection/custom_columns.py overwrites it from the candidate lookup before any downstream consumer reads it — so its type shouldn't gate record survival.
Additional context
Suggested fix (not attempted in the PR that surfaced this):
Drop value and label from ValidationDecisionSchema entirely. They're never read downstream, and asking the LLM to echo them is pure cost and failure surface. Concretely:
- src/anonymizer/engine/schemas/detection.py: remove the value and label fields from ValidationDecisionSchema.
- src/anonymizer/engine/detection/detection_workflow.py::_get_validation_prompt: update the few-shot Output: line so the example no longer includes value/label — the skeleton (Template: line) still carries them as context for the LLM, just not the output.
- Add a regression test covering a numeric-looking entity value.
Approach that does NOT work (flagging it so we don't repeat the mistake):
Loosening the pydantic field to str | int | float + a coercion validator passes DD's pre-validation, but DD stores the raw LLM dict (not the pydantic-validated object) in the dataframe. Once some records have "42" (string) and others have 42 (int) for the same column, PyArrow can't pick a single Arrow dtype at parquet checkpoint time and the whole batch fails with Could not convert 'Alice' with type str: tried to convert to int64.
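The mixed-dtype failure can be illustrated without PyArrow. This stdlib sketch loosely mimics how a columnar writer fixes the column type from the first value and then rejects any later mismatch (Arrow's real inference is more sophisticated; `checkpoint_column` is a hypothetical stand-in, not a DD or PyArrow API):

```python
def checkpoint_column(values):
    """Loose stand-in for Arrow dtype inference at parquet checkpoint time:
    the first value fixes the column type; any later mismatch fails the batch."""
    dtype = type(values[0])
    for v in values[1:]:
        if not isinstance(v, dtype):
            raise TypeError(
                f"Could not convert {v!r} with type {type(v).__name__}: "
                f"tried to convert to {dtype.__name__}"
            )
    return dtype

# One record's value coerced to int, a later one kept as str -> whole batch fails.
try:
    checkpoint_column([42, "Alice"])
except TypeError as e:
    print(e)
```

This is why per-record coercion inside the pydantic model doesn't help: the inconsistency only becomes fatal later, at batch serialization time, when all records must share one column dtype.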
Environment:
- data-designer >= 0.5.7
- Observed on gpt-5.4-mini served via an internal API gateway. Likely reproducible on any LLM that isn't strictly JSON-schema compliant on numeric-string fields.