Optimize cleaning providers for faster type fixes #2

Open
LinklyLuck wants to merge 1 commit into master from c17e2p-codex/modify-app.py-and-the_pipeline_v2.py

@LinklyLuck

Owner

Summary

  • refactor the type-based cleaning provider to use vectorized Polars expressions for numeric, date, and boolean normalization
  • skip unnecessary work for empty or already-typed columns while keeping detailed change reports intact
  • ensure the cleaning helpers expose clean returns with newline-terminated modules

Testing

  • python -m compileall cleaning_providers/type_cleaner.py cleaning_providers/__init__.py

https://chatgpt.com/codex/tasks/task_e_6902641b0434832995bfa7ad87b3d6f9

@chatgpt-codex-connector (Bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +325 to +331
updated_df = df.with_columns([
    pl.when(converted.is_not_null())
    .then(converted)
    .when(invalid_mask)
    .then(pl.lit(None, dtype=pl.Boolean))
    .otherwise(original)
    .alias(col)


P1: Boolean cleaner leaves column as strings

The refactored _clean_boolean_column builds a with_columns expression that mixes Boolean values with the original string values but never casts the result. In Polars, a when/then/otherwise chain whose branches are Boolean and string resolves to a string (Utf8) column, so the cleaned dataframe still contains strings like "true"/"false" rather than True/False, even though the report claims the types were fixed. Downstream code that expects a Boolean dtype will keep receiving strings, which means the type-cleaning stage no longer enforces boolean typing. Consider casting the final expression to pl.Boolean (or casting the original branch) so the column becomes a real Boolean series.
