fix(replace): drop workflow-internal columns and use COL_* constants by lipikaramaswamy · Pull Request #154 · NVIDIA-NeMo/Anonymizer

lipikaramaswamy · 2026-05-11T18:35:08Z

Summary

Both issues (#148 and #152) touch the same three lines in engine/replace/llm_replace_workflow.py:53-55. Addresses both in one mechanical fix.

#148 — string-literal column names violate the STYLEGUIDE.md rule that all column names live in engine/constants.py:

Add COL_ENTITY_EXAMPLES, COL_ENTITIES_FOR_REPLACE, COL_ENTITIES_FOR_REPLACE_JSON to engine/constants.py.
Use them in all 5 places these columns appear in llm_replace_workflow.py.

#152 — result.trace_dataframe.to_parquet fails to round-trip:

These three columns are prompt-construction scratch: consumed only by the replacement-generator Jinja template, never read downstream. They carry pyarrow-backed pandas extension dtypes that pandas can't reconstruct on read_parquet, which poisoned the entire trace file's readability.
Drop them from the result before returning, in both generate_map_only return paths (entity rows present and entity rows absent).
Tracked via a module-level _INTERNAL_COLUMNS tuple so the drop and the regression tests stay in sync.

Test plan

Two regression tests in test_llm_replace_workflow.py cover both return paths
Mutation tested — removing the .drop() calls makes both regression tests fail; restoring them makes them pass
Manually verified parquet round-trip on a captured trace dataframe: stripping the 3 columns lets pd.read_parquet succeed; downstream trace columns (_validated_entities, _sensitivity_disposition, etc.) remain accessible as Python dicts
make format-check passes locally
CI passes

Closes #148.
Closes #152.

- Add COL_ENTITY_EXAMPLES, COL_ENTITIES_FOR_REPLACE, COL_ENTITIES_FOR_REPLACE_JSON to engine/constants.py and use them in llm_replace_workflow.py (closes #148). - Drop these three columns from the LlmReplaceWorkflow result before returning. They are prompt-construction scratch consumed only by the replacement-generator template, never read downstream, and carry pyarrow-backed pandas extension dtypes that broke result.trace_dataframe.to_parquet round-trip (closes #152). - Add regression tests asserting the columns are stripped in both return paths (entity rows present and absent). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: lipikaramaswamy <lramaswamy@nvidia.com>

greptile-apps · 2026-05-11T18:37:31Z

Greptile Summary

This PR resolves two linked issues by replacing bare string column-name literals with COL_* constants and dropping three prompt-construction scratch columns from the LlmReplaceWorkflow.generate_map_only result before it is returned — preventing a pyarrow-dtype-induced to_parquet failure on the downstream trace dataframe.

constants.py: Adds COL_ENTITY_EXAMPLES, COL_ENTITIES_FOR_REPLACE, and COL_ENTITIES_FOR_REPLACE_JSON, satisfying the STYLEGUIDE rule that all column names live in one place.
llm_replace_workflow.py: Wires up the new constants everywhere the columns are referenced, introduces the module-level _INTERNAL_COLUMNS tuple as a single source of truth, applies .drop() in both generate_map_only return paths, and fixes the previously bare Jinja {{ _entity_examples }} reference via the <<ENTITY_EXAMPLES_COLUMN>> substitution mechanism so renames propagate automatically.
test_llm_replace_workflow.py: Two regression tests cover the no-entities early-return path and the entity path; the no-entities test genuinely exercises the drop, while the entity-path test passes independently of the drop call (see inline comment).

Confidence Score: 5/5

Safe to merge — the changes are mechanical string-to-constant replacements plus a targeted .drop() applied in both return paths, with no behavioural changes to the replacement logic itself.

The constant migration is straightforward and the new _INTERNAL_COLUMNS drop is guarded with errors=ignore, so it cannot break existing callers. The Jinja template fix correctly threads COL_ENTITY_EXAMPLES through substitute_placeholders. The only gap is that the entity-rows regression test does not force the drop to fire, but this does not affect production correctness.

The entity-rows regression test in tests/engine/test_llm_replace_workflow.py could be strengthened to mix entity and passthrough rows so the drop is genuinely exercised.

Important Files Changed

Filename	Overview
src/anonymizer/engine/constants.py	Adds three new COL_* constants for the LlmReplaceWorkflow internal scratch columns; placement and naming are consistent with the rest of the file.
src/anonymizer/engine/replace/llm_replace_workflow.py	Replaces all string-literal column names with constants, introduces _INTERNAL_COLUMNS sentinel, drops internal columns in both return paths, and fixes the Jinja template _entity_examples reference via the <<ENTITY_EXAMPLES_COLUMN>> substitution placeholder.
tests/engine/test_llm_replace_workflow.py	Two new regression tests added; early-return (no-entities) test exercises the drop correctly, but the entity-rows test doesn't because the mock return value omits the internal columns, leaving the .drop() as a no-op in that path.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[generate_map_only called] --> B[working_df = dataframe.copy]
    B --> C[Add COL_ENTITY_EXAMPLES\nCOL_ENTITIES_FOR_REPLACE\nCOL_ENTITIES_FOR_REPLACE_JSON\nto working_df]
    C --> D{split_rows on\nCOL_ENTITIES_FOR_REPLACE}
    D -->|all rows empty| E[passthrough_rows only\nset COL_REPLACEMENT_MAP = empty]
    D -->|entity rows present| F[entity_rows to NDD adapter]
    E --> G[merge_and_reorder passthrough_rows]
    G --> H[.drop _INTERNAL_COLUMNS\nerrors=ignore]
    F --> I[run_result from adapter]
    I --> J[filter replacement map]
    J --> K[merge_and_reorder output_df + passthrough_rows]
    K --> L[.drop _INTERNAL_COLUMNS\nerrors=ignore]
    H --> M[LlmReplaceResult returned\nclean of internal columns]
    L --> M

_{Reviews (2): Last reviewed commit: "fix(replace): parameterize _entity_examp..." | Re-trigger Greptile}

…plate Address greptile review feedback on #154: the Jinja template still referenced `{{ _entity_examples }}` as a raw literal even after the COL_ENTITY_EXAMPLES constant was introduced. If the constant ever gets renamed, the DataFrame column would change but the template would silently render an empty/undefined Jinja variable. Route the column name through `substitute_placeholders` the same way `<<ENTITIES_COLUMN>>` already is, so the constant is the single source of truth. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: lipikaramaswamy <lramaswamy@nvidia.com>

lipikaramaswamy requested a review from a team as a code owner May 11, 2026 18:35

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(replace): drop workflow-internal columns and use COL_* constants#154

fix(replace): drop workflow-internal columns and use COL_* constants#154
lipikaramaswamy wants to merge 2 commits into
mainfrom
lipikaramaswamy/bugfix/llm-replace-internal-columns

lipikaramaswamy commented May 11, 2026

Uh oh!

greptile-apps Bot commented May 11, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

lipikaramaswamy commented May 11, 2026

Summary

Test plan

Uh oh!

greptile-apps Bot commented May 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Flowchart

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

greptile-apps Bot commented May 11, 2026 •

edited

Loading