refactor: replace df.attrs with typed PipelineContext dataclass #141
Greptile Summary
This PR replaces the experimental …

Confidence Score: 5/5
Safe to merge; no logic bugs found, all 653 tests pass, and the refactor correctly centralises metadata ownership at the orchestration layer. Only a P2 style note about field-insertion order in public result dataclasses; no P0 or P1 findings.

src/anonymizer/interface/results.py: field insertion order in AnonymizerResult/PreviewResult is a minor public API break worth noting in release notes.
Sequence Diagram
sequenceDiagram
participant U as User
participant A as Anonymizer
participant R as reader.read_input
participant PC as PipelineContext
participant DW as DetectionWorkflow
participant RW as Replace/RewriteWorkflow
participant Res as AnonymizerResult
U->>A: run(config, data)
A->>R: read_input(data)
R->>PC: PipelineContext(dataframe, original_text_column)
R-->>A: context: PipelineContext
A->>A: _run_internal(context=context)
A->>DW: run(context.dataframe, ...)
DW-->>A: EntityDetectionResult(dataframe)
A->>RW: run(detection_result.dataframe, ...)
RW-->>A: RewriteResult(dataframe)
note over A: context.original_text_column used here
A->>A: _rename_output_columns(final_df, original_text_column=text_col)
A->>A: _build_user_dataframe(renamed_trace, original_text_column=text_col)
A->>Res: AnonymizerResult(dataframe, trace_dataframe, original_text_column, failed_records)
Res-->>U: result
Reviews (3): Last reviewed commit: "address greptile comment"
Signed-off-by: memadi <memadi@nvidia.com>
Summary
Replaces the use of pandas' experimental `DataFrame.attrs` slot for threading pipeline-level metadata (`original_text_column`) through workflow stages. `df.attrs` is silently dropped by `merge`/`concat`/`groupby`, which forced manual `attrs={**a.attrs, **b.attrs}` plumbing across every split/recombine boundary in the engine.

A new typed `PipelineContext` dataclass now carries the metadata at the orchestration layer. Engine workflows operate on plain DataFrames; the orchestrator owns the metadata.

Changes
New
- `engine/pipeline_context.py`: frozen `PipelineContext(dataframe, original_text_column)` with a `with_dataframe()` helper for evolving the wrapped DataFrame while preserving metadata. Future pipeline-wide metadata gets added here as new fields rather than as ad-hoc `df.attrs` keys.

Engine
- `engine/io/reader.py`: `read_input()` now returns `PipelineContext`. Folded the previously separate `ResolvedInputColumns` helper into `PipelineContext` (same shape, same intent; one less concept).
- `engine/row_partitioning.py`: dropped the `attrs=` parameter from `merge_and_reorder()`; it only existed to forward `original_text_column`.
- `engine/replace/llm_replace_workflow.py`: removed `df.attrs` reads/writes from `LlmReplaceWorkflow.generate_map_only`. Clears the `.attrs` TODO at the original line referenced in #4 (refactor: replace df.attrs with typed pipeline context dataclass).
- `engine/rewrite/rewrite_workflow.py`: removed `df.attrs` plumbing in `RewriteWorkflow.run` (fast-path and full-pipeline merges).
- `engine/detection/detection_workflow.py`: removed the `if "original_text_column" in dataframe.attrs: ...` hack in `EntityDetectionWorkflow.run`.

Interface
- `interface/anonymizer.py`: `Anonymizer.run`/`Anonymizer.preview` thread a `PipelineContext` into `_run_internal(context=...)`. Helpers `_rename_output_columns` and `_build_user_dataframe` now take `original_text_column` as an explicit keyword argument instead of reading it back out of `df.attrs`.
- `interface/results.py`: `AnonymizerResult` and `PreviewResult` expose `original_text_column` as a typed field. `_DisplayMixin.display_record` reads it from there. `display_record` behavior is unchanged.

Tests
- Removed `df.attrs["original_text_column"] = ...` setup from test stubs (no longer required by the orchestrator).
- Migrated `test_io.py` from `result.attrs[...]`/`result.columns`/`result[col]`/`len(result)` to the new `result.original_text_column`/`result.dataframe.*` API.
- `original_text_column` is not present in `result.dataframe.attrs`.
- `tests/engine/test_pipeline_context.py`: covers construction, immutability, `with_dataframe()`, and the headline invariant (metadata survives `merge`/`concat`/`groupby` on the wrapped DataFrame).
- `test_run_threads_original_text_column_via_context_not_df_attrs` in `test_anonymizer_interface.py`: integration-level regression guard for the orchestrator threading metadata via `PipelineContext` rather than via workflow-output `df.attrs`.

Net effect
- No more `attrs={**a.attrs, **b.attrs}` plumbing across `merge`/`concat`/`groupby` boundaries.
- Pipeline metadata is `mypy`/`pyright`-checkable.

Test plan
- `ruff check src tests`: clean
- `ruff format --check src tests`: clean
- `pytest`: 653 passed
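
For readers without the diff open, here is a minimal sketch of the pattern this PR describes. It is not the code in `engine/pipeline_context.py`. Only the class name, the two fields, and `with_dataframe()` come from the description above. The `eq=False` flag, the use of `dataclasses.replace`, and the sample data are assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import FrozenInstanceError, dataclass, replace

import pandas as pd


# eq=False is an assumption of this sketch: it avoids a generated __eq__
# that would try to compare DataFrames element-wise.
@dataclass(frozen=True, eq=False)
class PipelineContext:
    """Pairs a DataFrame with pipeline-level metadata (illustrative sketch)."""

    dataframe: pd.DataFrame
    original_text_column: str

    def with_dataframe(self, dataframe: pd.DataFrame) -> PipelineContext:
        # Return a new context wrapping `dataframe`; all other fields are copied.
        return replace(self, dataframe=dataframe)


# Hypothetical sample data, not taken from the PR.
ctx = PipelineContext(
    dataframe=pd.DataFrame({"id": [1, 2], "text": ["alpha", "beta"]}),
    original_text_column="text",
)
labels = pd.DataFrame({"id": [1, 2], "label": ["x", "y"]})

# Stages operate on plain DataFrames; none of them touch df.attrs.
merged = ctx.dataframe.merge(labels, on="id")
stacked = pd.concat([merged, merged], ignore_index=True)
counts = stacked.groupby("id")["label"].count().reset_index()

# The caller re-wraps the result; the metadata lives on the context object,
# so it is unaffected by whatever the DataFrame operations did.
new_ctx = ctx.with_dataframe(counts)
assert new_ctx.original_text_column == "text"

# frozen=True blocks rebinding fields on the context. It does not make the
# wrapped DataFrame's contents immutable.
try:
    ctx.original_text_column = "other"  # type: ignore[misc]
except FrozenInstanceError:
    rebinding_blocked = True
else:
    rebinding_blocked = False
```

Because the metadata sits on the context and not on the DataFrame, the sketch does not depend on how a given pandas version propagates `attrs` through `merge`, `concat`, or `groupby`.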