
refactor: replace df.attrs with typed PipelineContext dataclass #141

Open
memadi-nv wants to merge 3 commits into main from memadi/feature/refactor-df-attribute

Conversation

@memadi-nv
Contributor

Summary

Replaces the use of pandas' experimental DataFrame.attrs slot for threading pipeline-level metadata (original_text_column) through workflow stages. df.attrs is silently dropped by merge / concat / groupby, which forced manual attrs={**a.attrs, **b.attrs} plumbing across every split/recombine boundary in the engine.

A new typed PipelineContext dataclass now carries the metadata at the orchestration layer. Engine workflows operate on plain DataFrames; the orchestrator owns the metadata.
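The failure mode is easy to reproduce. A minimal sketch (column names invented; attrs propagation behavior as of current pandas releases, where merge does not carry attrs through):

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2], "text": ["foo", "bar"]})
left.attrs["original_text_column"] = "text"  # pipeline metadata stashed in attrs
right = pd.DataFrame({"id": [1, 2], "score": [0.9, 0.1]})

merged = left.merge(right, on="id")
# merge does not propagate attrs, so the metadata is silently gone
print(merged.attrs)
```

Every split/recombine boundary therefore needed the manual `attrs={**a.attrs, **b.attrs}` re-plumbing this PR removes.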

Changes

New

  • engine/pipeline_context.py — frozen PipelineContext(dataframe, original_text_column) with a with_dataframe() helper for evolving the wrapped DataFrame while preserving metadata. Future pipeline-wide metadata gets added here as new fields rather than as ad-hoc df.attrs keys.
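A minimal sketch of the dataclass described above, assuming `with_dataframe()` is built on `dataclasses.replace` (the actual implementation may differ):

```python
from dataclasses import dataclass, replace

import pandas as pd


@dataclass(frozen=True)
class PipelineContext:
    """Pipeline-level metadata owned by the orchestrator (sketch)."""

    dataframe: pd.DataFrame
    original_text_column: str

    def with_dataframe(self, dataframe: pd.DataFrame) -> "PipelineContext":
        # Swap the wrapped DataFrame while keeping every metadata field.
        return replace(self, dataframe=dataframe)
```

Any future pipeline-wide metadata becomes another typed field here, and `with_dataframe()` keeps carrying it automatically.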

Engine

  • engine/io/reader.py — read_input() now returns PipelineContext. Folded the previously separate ResolvedInputColumns helper into PipelineContext (same shape, same intent — one less concept).
  • engine/row_partitioning.py — dropped the attrs= parameter from merge_and_reorder(); it only existed to forward original_text_column.
  • engine/replace/llm_replace_workflow.py — removed df.attrs reads/writes from LlmReplaceWorkflow.generate_map_only. Clears the .attrs TODO at the original line referenced in issue #4 (refactor: replace df.attrs with typed pipeline context dataclass).
  • engine/rewrite/rewrite_workflow.py — removed df.attrs plumbing in RewriteWorkflow.run (fast-path and full-pipeline merges).
  • engine/detection/detection_workflow.py — removed the if "original_text_column" in dataframe.attrs: ... hack in EntityDetectionWorkflow.run.

Interface

  • interface/anonymizer.py — Anonymizer.run / Anonymizer.preview thread a PipelineContext into _run_internal(context=...). Helpers _rename_output_columns and _build_user_dataframe now take original_text_column as an explicit keyword argument instead of reading it back out of df.attrs.
  • interface/results.py — AnonymizerResult and PreviewResult expose original_text_column as a typed field. _DisplayMixin.display_record reads it from there; display_record behavior is unchanged.
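The explicit-kwarg pattern can be sketched as follows; `_rename_output_columns` here is a hypothetical stand-in (the real rename logic is not shown in this PR description), but the signature shape matches the change:

```python
import pandas as pd


def _rename_output_columns(
    df: pd.DataFrame, *, original_text_column: str
) -> pd.DataFrame:
    # Metadata arrives as an explicit keyword argument instead of being
    # read back out of df.attrs. Rename target is illustrative only.
    return df.rename(columns={original_text_column: "original_text"})


trace = pd.DataFrame({"text": ["hi there"], "entities": [["PERSON"]]})
renamed = _rename_output_columns(trace, original_text_column="text")
```

The caller (the orchestrator) is the only place that knows the column name, so the helpers no longer depend on attrs surviving the engine stages.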

Tests

  • Removed all df.attrs["original_text_column"] = ... setup from test stubs (no longer required by the orchestrator).
  • Migrated test_io.py from result.attrs[...] / result.columns / result[col] / len(result) to the new result.original_text_column / result.dataframe.* API.
  • Converted three legacy "attrs propagated" tests in the workflow tests into regression guards that assert original_text_column is not present in result.dataframe.attrs.
  • New: tests/engine/test_pipeline_context.py — covers construction, immutability, with_dataframe(), and the headline invariant (metadata survives merge / concat / groupby on the wrapped DataFrame).
  • New: test_run_threads_original_text_column_via_context_not_df_attrs in test_anonymizer_interface.py — integration-level regression guard for the orchestrator threading metadata via PipelineContext rather than via workflow-output df.attrs.
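The headline invariant boils down to a test like this sketch (context class and test body are illustrative, not copied from the PR):

```python
from dataclasses import dataclass, replace

import pandas as pd


@dataclass(frozen=True)
class PipelineContext:
    dataframe: pd.DataFrame
    original_text_column: str

    def with_dataframe(self, df: pd.DataFrame) -> "PipelineContext":
        return replace(self, dataframe=df)


def test_metadata_survives_merge_via_context():
    left = pd.DataFrame({"id": [1], "text": ["hello"]})
    right = pd.DataFrame({"id": [1], "entity": ["PERSON"]})
    ctx = PipelineContext(left, "text")

    ctx = ctx.with_dataframe(ctx.dataframe.merge(right, on="id"))

    # Metadata survives the merge because it lives on the context...
    assert ctx.original_text_column == "text"
    # ...and never rides along in df.attrs anymore.
    assert "original_text_column" not in ctx.dataframe.attrs
```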

Net effect

  • No more dependency on an experimental pandas feature.
  • No more manual attrs={**a.attrs, **b.attrs} plumbing across merge / concat / groupby boundaries.
  • Pipeline-wide metadata is now statically typed, IDE-discoverable, and mypy/pyright-checkable.
  • Single source of truth for "what is the user's text column": one typed field, threaded explicitly.

Test plan

  • ruff check src tests — clean
  • ruff format --check src tests — clean
  • pytest — 653 passed

@memadi-nv memadi-nv requested a review from a team as a code owner April 28, 2026 19:53
@greptile-apps

greptile-apps Bot commented Apr 28, 2026

Greptile Summary

This PR replaces the experimental df.attrs mechanism for threading original_text_column through the pipeline with a typed PipelineContext dataclass, cleanly separating orchestrator-level metadata from engine-level DataFrame operations. The refactor is well-scoped: all attrs plumbing is removed from merge_and_reorder, workflow classes, and result objects, and comprehensive regression tests confirm the new invariant.

Confidence Score: 5/5

Safe to merge; no logic bugs found, all 653 tests pass, and the refactor correctly centralises metadata ownership at the orchestration layer.

Only a P2 style note about field-insertion order in public result dataclasses; no P0 or P1 findings.

src/anonymizer/interface/results.py — field insertion order in AnonymizerResult/PreviewResult is a minor public API break worth noting in release notes.

Important Files Changed

Filename Overview
src/anonymizer/engine/pipeline_context.py New frozen dataclass wrapping a DataFrame + original_text_column; docstring warns that == / hash() are broken, since frozen=True still auto-generates both methods over the unhashable DataFrame field (already noted in a prior review thread)
src/anonymizer/engine/io/reader.py read_input now returns PipelineContext; ResolvedInputColumns absorbed into PipelineContext; private helper _resolve_output_column_collisions correctly preserves selected column name through rename
src/anonymizer/interface/results.py original_text_column added as a required positional field between trace_dataframe and failed_records in both AnonymizerResult and PreviewResult — breaks external callers using positional construction
src/anonymizer/interface/anonymizer.py _run_internal accepts PipelineContext instead of raw DataFrame; context.original_text_column wired to _rename_output_columns and _build_user_dataframe as explicit kwargs; clean
src/anonymizer/engine/row_partitioning.py merge_and_reorder drops attrs= parameter; implementation simplified to a single return expression
tests/engine/test_pipeline_context.py New tests cover construction, immutability, with_dataframe(), and the headline invariant that metadata survives pandas merge/concat/groupby

Sequence Diagram

sequenceDiagram
    participant U as User
    participant A as Anonymizer
    participant R as reader.read_input
    participant PC as PipelineContext
    participant DW as DetectionWorkflow
    participant RW as Replace/RewriteWorkflow
    participant Res as AnonymizerResult

    U->>A: run(config, data)
    A->>R: read_input(data)
    R->>PC: PipelineContext(dataframe, original_text_column)
    R-->>A: context: PipelineContext
    A->>A: _run_internal(context=context)
    A->>DW: run(context.dataframe, ...)
    DW-->>A: EntityDetectionResult(dataframe)
    A->>RW: run(detection_result.dataframe, ...)
    RW-->>A: RewriteResult(dataframe)
    note over A: context.original_text_column used here
    A->>A: _rename_output_columns(final_df, original_text_column=text_col)
    A->>A: _build_user_dataframe(renamed_trace, original_text_column=text_col)
    A->>Res: AnonymizerResult(dataframe, trace_dataframe, original_text_column, failed_records)
    Res-->>U: result

Reviews (3): Last reviewed commit: "address greptile comment"

@memadi-nv memadi-nv changed the title refactor: replace df.attrs with typed PipelineContext dataclass (#4) refactor: replace df.attrs with typed PipelineContext dataclass Apr 28, 2026
@memadi-nv memadi-nv linked an issue Apr 28, 2026 that may be closed by this pull request


Development

Successfully merging this pull request may close these issues.

refactor: replace df.attrs with typed pipeline context dataclass
