fix: preserve column names with spaces in wr.redshift.copy()#3298
Conversation
Passes flavor=None to internal s3.to_parquet call to prevent pyarrow spark flavor from sanitizing column names (spaces → underscores). Fixes aws#3293
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
@kukushking Could you confirm this failure is pre-existing and unrelated to the fix? Happy to address any other feedback! The |
@kukushking Could you confirm this failure is pre-existing and unrelated to the fix? Happy to address any other feedback! The GitHubCodeBuild (non-distributed) pipeline passed successfully. |
Hi @kukushking |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
|
@kukushking Could you please take a look at this PR when you get a chance? Quick summary:
Would appreciate your review! |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
|
Hi @hirenkumar-n-dholariya yes - the failure you are referring to is pre-existing, no reason to worry about. With regards to your change, this is a breaking change of default behavior for any current redshift user will be impacted. We will consider this for the next major version. |
|
@kukushking Thank you for the feedback! That's a fair point about the breaking change concern. Would it make sense to make the behavior configurable via an optional parameter, so existing users are not impacted by default? For example: def copy(
df,
...
sanitize_column_names: bool = True, # preserves backward compatibility
):This way:
Happy to implement this if it sounds like a good direction. |
|
@hirenkumar-n-dholariya yes, that would makes sense! Please also consider adding a test case to test the new behavior. Thank you! |
Problem: wr.redshift.copy() internally calls s3.to_parquet() which defaults to pyarrow flavor='spark'. This causes column names with spaces to be silently renamed (e.g. "my col" → "my_col"), leading to a mismatch between the DataFrame schema and the Redshift table schema. Solution: Add an optional sanitize_column_names parameter (default=True) to wr.redshift.copy() that controls whether pyarrow sanitizes column names. - sanitize_column_names=True (default): preserves existing behavior, column names are sanitized for backward compatibility. - sanitize_column_names=False: passes flavor=None to the internal s3.to_parquet() call, preserving original column names including spaces. This is a non-breaking change — existing users are unaffected since the default value maintains the current behavior. Changes: - Added sanitize_column_names: bool = True parameter to copy() - Updated pyarrow_additional_kwargs in s3.to_parquet() call accordingly - Added docstring for the new parameter - Added test case for sanitize_column_names=False behavior Fixes aws#3293
test: add test for sanitize_column_names=False in wr.redshift.copy()
style: fix ruff formatting - remove trailing whitespace
style: fix ruff formatting - remove trailing whitespace in _write.py
style: fix ruff formatting in test_redshift.py
fix: add required blank line in docstring for ruff D410/D411
fix: remove trailing whitespace in sanitize_column_names docstring
|
@kukushking Both files are now formatted correctly. Could you please review, give your approval/feeedback to proceed with merge. |
Problem
wr.redshift.copy()silently renames columns with spaces (e.g. "my col" → "my_col")because the internal
s3.to_parquetcall defaults to pyarrowflavor='spark',which sanitizes column names.
Fix
Explicitly pass
pyarrow_additional_kwargs={"flavor": None}in the internals3.to_parquetcall to preserve original column names.Fixes #3293