Skip to content

fix(replace): drop workflow-internal columns and use COL_* constants#154

Open
lipikaramaswamy wants to merge 2 commits into
mainfrom
lipikaramaswamy/bugfix/llm-replace-internal-columns
Open

fix(replace): drop workflow-internal columns and use COL_* constants#154
lipikaramaswamy wants to merge 2 commits into
mainfrom
lipikaramaswamy/bugfix/llm-replace-internal-columns

Conversation

@lipikaramaswamy
Copy link
Copy Markdown
Collaborator

Summary

Both issues (#148 and #152) touch the same three lines in engine/replace/llm_replace_workflow.py:53-55. Addresses both in one mechanical fix.

#148 — string-literal column names violate the STYLEGUIDE.md rule that all column names live in engine/constants.py:

  • Add COL_ENTITY_EXAMPLES, COL_ENTITIES_FOR_REPLACE, COL_ENTITIES_FOR_REPLACE_JSON to engine/constants.py.
  • Use them in all 5 places these columns appear in llm_replace_workflow.py.

#152result.trace_dataframe.to_parquet fails to round-trip:

  • These three columns are prompt-construction scratch: consumed only by the replacement-generator Jinja template, never read downstream. They carry pyarrow-backed pandas extension dtypes that pandas can't reconstruct on read_parquet, which poisoned the entire trace file's readability.
  • Drop them from the result before returning, in both generate_map_only return paths (entity rows present and entity rows absent).
  • Tracked via a module-level _INTERNAL_COLUMNS tuple so the drop and the regression tests stay in sync.

Test plan

  • Two regression tests in test_llm_replace_workflow.py cover both return paths
  • Mutation tested — removing the .drop() calls makes both regression tests fail; restoring them makes them pass
  • Manually verified parquet round-trip on a captured trace dataframe: stripping the 3 columns lets pd.read_parquet succeed; downstream trace columns (_validated_entities, _sensitivity_disposition, etc.) remain accessible as Python dicts
  • make format-check passes locally
  • CI passes

Closes #148.
Closes #152.

- Add COL_ENTITY_EXAMPLES, COL_ENTITIES_FOR_REPLACE,
  COL_ENTITIES_FOR_REPLACE_JSON to engine/constants.py and use them
  in llm_replace_workflow.py (closes #148).
- Drop these three columns from the LlmReplaceWorkflow result before
  returning. They are prompt-construction scratch consumed only by
  the replacement-generator template, never read downstream, and
  carry pyarrow-backed pandas extension dtypes that broke
  result.trace_dataframe.to_parquet round-trip (closes #152).
- Add regression tests asserting the columns are stripped in both
  return paths (entity rows present and absent).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: lipikaramaswamy <lramaswamy@nvidia.com>
@lipikaramaswamy lipikaramaswamy requested a review from a team as a code owner May 11, 2026 18:35
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 11, 2026

Greptile Summary

This PR resolves two linked issues by replacing bare string column-name literals with COL_* constants and dropping three prompt-construction scratch columns from the LlmReplaceWorkflow.generate_map_only result before it is returned — preventing a pyarrow-dtype-induced to_parquet failure on the downstream trace dataframe.

  • constants.py: Adds COL_ENTITY_EXAMPLES, COL_ENTITIES_FOR_REPLACE, and COL_ENTITIES_FOR_REPLACE_JSON, satisfying the STYLEGUIDE rule that all column names live in one place.
  • llm_replace_workflow.py: Wires up the new constants everywhere the columns are referenced, introduces the module-level _INTERNAL_COLUMNS tuple as a single source of truth, applies .drop() in both generate_map_only return paths, and fixes the previously bare Jinja {{ _entity_examples }} reference via the <<ENTITY_EXAMPLES_COLUMN>> substitution mechanism so renames propagate automatically.
  • test_llm_replace_workflow.py: Two regression tests cover the no-entities early-return path and the entity path; the no-entities test genuinely exercises the drop, while the entity-path test passes independently of the drop call (see inline comment).

Confidence Score: 5/5

Safe to merge — the changes are mechanical string-to-constant replacements plus a targeted .drop() applied in both return paths, with no behavioural changes to the replacement logic itself.

The constant migration is straightforward and the new _INTERNAL_COLUMNS drop is guarded with errors=ignore, so it cannot break existing callers. The Jinja template fix correctly threads COL_ENTITY_EXAMPLES through substitute_placeholders. The only gap is that the entity-rows regression test does not force the drop to fire, but this does not affect production correctness.

The entity-rows regression test in tests/engine/test_llm_replace_workflow.py could be strengthened to mix entity and passthrough rows so the drop is genuinely exercised.

Important Files Changed

Filename Overview
src/anonymizer/engine/constants.py Adds three new COL_* constants for the LlmReplaceWorkflow internal scratch columns; placement and naming are consistent with the rest of the file.
src/anonymizer/engine/replace/llm_replace_workflow.py Replaces all string-literal column names with constants, introduces _INTERNAL_COLUMNS sentinel, drops internal columns in both return paths, and fixes the Jinja template _entity_examples reference via the <<ENTITY_EXAMPLES_COLUMN>> substitution placeholder.
tests/engine/test_llm_replace_workflow.py Two new regression tests added; early-return (no-entities) test exercises the drop correctly, but the entity-rows test doesn't because the mock return value omits the internal columns, leaving the .drop() as a no-op in that path.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[generate_map_only called] --> B[working_df = dataframe.copy]
    B --> C[Add COL_ENTITY_EXAMPLES\nCOL_ENTITIES_FOR_REPLACE\nCOL_ENTITIES_FOR_REPLACE_JSON\nto working_df]
    C --> D{split_rows on\nCOL_ENTITIES_FOR_REPLACE}
    D -->|all rows empty| E[passthrough_rows only\nset COL_REPLACEMENT_MAP = empty]
    D -->|entity rows present| F[entity_rows to NDD adapter]
    E --> G[merge_and_reorder passthrough_rows]
    G --> H[.drop _INTERNAL_COLUMNS\nerrors=ignore]
    F --> I[run_result from adapter]
    I --> J[filter replacement map]
    J --> K[merge_and_reorder output_df + passthrough_rows]
    K --> L[.drop _INTERNAL_COLUMNS\nerrors=ignore]
    H --> M[LlmReplaceResult returned\nclean of internal columns]
    L --> M
Loading

Reviews (2): Last reviewed commit: "fix(replace): parameterize _entity_examp..." | Re-trigger Greptile

…plate

Address greptile review feedback on #154: the Jinja template still
referenced `{{ _entity_examples }}` as a raw literal even after the
COL_ENTITY_EXAMPLES constant was introduced. If the constant ever
gets renamed, the DataFrame column would change but the template
would silently render an empty/undefined Jinja variable.

Route the column name through `substitute_placeholders` the same way
`<<ENTITIES_COLUMN>>` already is, so the constant is the single source
of truth.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: lipikaramaswamy <lramaswamy@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug: result.trace_dataframe is not persistable via to_parquet chore: use COL_* constants for intermediate columns in llm_replace_workflow.py

1 participant