fix: preserve column names with spaces in wr.redshift.copy()#3298

Open
hirenkumar-n-dholariya wants to merge 10 commits into
aws:mainfrom
hirenkumar-n-dholariya:hirenkumar-n-dholariya-fix/redshift-copy-column-space-rename

@hirenkumar-n-dholariya

Problem

wr.redshift.copy() silently renames columns with spaces (e.g. "my col" → "my_col")
because the internal s3.to_parquet call defaults to pyarrow flavor='spark',
which sanitizes column names.

Fix

Explicitly pass pyarrow_additional_kwargs={"flavor": None} in the internal
s3.to_parquet call to preserve original column names.
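
To make the renaming described above concrete, here is a minimal pure-Python sketch of this kind of column-name sanitization. The helper names and the exact set of replaced characters are assumptions for illustration only; this is not pyarrow's or awswrangler's code, and the real renaming happens inside the Parquet write when flavor='spark' is in effect.

```python
import re

# Assumption: characters treated as disallowed in Spark-style column names.
# The set is illustrative; check pyarrow itself for the exact rules.
_DISALLOWED = re.compile(r"[ ,;{}()\n\t=]")


def sanitize_spark_style(name: str) -> str:
    """Replace each disallowed character with an underscore (hypothetical helper)."""
    return _DISALLOWED.sub("_", name)


def find_renamed_columns(columns: list[str]) -> dict[str, str]:
    """Return {original: sanitized} for every column whose name would change."""
    renamed = {}
    for col in columns:
        sanitized = sanitize_spark_style(col)
        if sanitized != col:
            renamed[col] = sanitized
    return renamed


if __name__ == "__main__":
    # "id" is untouched; "my col" is the case reported in this PR.
    print(find_renamed_columns(["id", "my col"]))  # → {'my col': 'my_col'}
```

A check like this could be run on a DataFrame's columns before calling copy() to see which names would not match the target table.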

Fixes #3293

Passes flavor=None to internal s3.to_parquet call to prevent pyarrow spark flavor from sanitizing column names (spaces → underscores). Fixes aws#3293
@kukushking
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: GitHubDistributedCodeBuild6-jWcl5DLmvupS
  • Commit ID: fdccae4
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@kukushking
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: GitHubCodeBuild8756EF16-4rfo0GHQ0u9a
  • Commit ID: fdccae4
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

@hirenkumar-dholariya

hirenkumar-dholariya commented Apr 10, 2026

@kukushking Could you confirm this failure is pre-existing and unrelated to the fix? Happy to address any other feedback!
The CI failure is unrelated to this fix. The GitHubDistributedCodeBuild failure is caused by a pre-existing incompatibility between modin==0.37.1 and pandas==3.0.1: pandas.read_gbq was removed in pandas 3.x, which causes an AttributeError when loading modin.

The GitHubCodeBuild (non-distributed) pipeline passed successfully.
This issue exists independently in the main branch and is not introduced by this PR.

@hirenkumar-n-dholariya
Author

@kukushking Could you confirm this failure is pre-existing and unrelated to the fix? Happy to address any other feedback!
The CI failure is unrelated to this fix. The GitHubDistributedCodeBuild failure is caused by a pre-existing incompatibility between modin==0.37.1 and pandas==3.0.1 — pandas.read_gbq was removed in pandas 3.x, which causes an AttributeError when loading modin.

The GitHubCodeBuild (non-distributed) pipeline passed successfully.
This issue exists independently in the main branch and is not introduced by this PR.

@hirenkumar-n-dholariya
Author

Hi @kukushking
Hope you are doing well. Could you please take a look at the comments and share your feedback or approval on the PR? Thank you in advance for your time.

@kukushking
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: GitHubDistributedCodeBuild6-jWcl5DLmvupS
  • Commit ID: 857326e
  • Result: FAILED
  • Build Logs (available for 30 days)

@hirenkumar-n-dholariya
Author

@kukushking Could you please take a look at this PR when you get a chance?

Quick summary:

  • The fix passes flavor=None to the internal s3.to_parquet call to preserve column names with spaces in wr.redshift.copy()
  • The GitHubCodeBuild pipeline passed
  • The GitHubDistributedCodeBuild failure is a pre-existing issue caused by an incompatibility between modin==0.37.1 and pandas==3.0.1 (pandas.read_gbq was removed in pandas 3.x), so it is unrelated to this fix

Would appreciate your review!

@kukushking
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: GitHubCodeBuild8756EF16-4rfo0GHQ0u9a
  • Commit ID: 857326e
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

@kukushking
Collaborator

Hi @hirenkumar-n-dholariya, yes, the failure you are referring to is pre-existing; there is nothing to worry about.

Regarding your change: this is a breaking change to default behavior, and every current Redshift user would be impacted. We will consider this for the next major version.

@hirenkumar-n-dholariya
Author

@kukushking Thank you for the feedback! That's a fair point about the breaking change concern.

Would it make sense to make the behavior configurable via an optional parameter, so existing users are not impacted by default?

For example:

def copy(
    df,
    # ... other existing parameters unchanged ...
    sanitize_column_names: bool = True,  # default preserves backward compatibility
):
    ...  # body omitted in this proposal

This way:

  • Existing users are unaffected (default = True keeps current behavior)
  • Users who need to preserve column names with spaces can opt in with sanitize_column_names=False

Happy to implement this if it sounds like a good direction.
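
As a rough illustration of the opt-in idea above, the sketch below shows one way the flag could translate into the keyword arguments forwarded to the internal Parquet write. build_parquet_kwargs is a hypothetical helper invented for this example, not part of awswrangler; it assumes, as stated in this PR, that omitting "flavor" leaves the existing default in place and that {"flavor": None} disables sanitization.

```python
from typing import Any, Optional


def build_parquet_kwargs(
    sanitize_column_names: bool = True,
    user_kwargs: Optional[dict[str, Any]] = None,
) -> dict[str, Any]:
    """Hypothetical helper: build pyarrow_additional_kwargs for the internal write.

    With the default (True) nothing is added, so the writer's existing default
    flavor still applies. With False, flavor is set to None explicitly.
    """
    # Copy so the caller's dictionary is never mutated.
    kwargs: dict[str, Any] = dict(user_kwargs or {})
    if not sanitize_column_names:
        kwargs["flavor"] = None
    return kwargs


if __name__ == "__main__":
    print(build_parquet_kwargs())       # → {}
    print(build_parquet_kwargs(False))  # → {'flavor': None}
```

Leaving the dictionary untouched in the default case is what keeps the change backward compatible: existing callers get exactly the arguments they got before.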

@kukushking
Collaborator

@hirenkumar-n-dholariya yes, that would make sense! Please also consider adding a test case to test the new behavior. Thank you!

Problem:
wr.redshift.copy() internally calls s3.to_parquet() which defaults to
pyarrow flavor='spark'. This causes column names with spaces to be
silently renamed (e.g. "my col" → "my_col"), leading to a mismatch
between the DataFrame schema and the Redshift table schema.

Solution:
Add an optional sanitize_column_names parameter (default=True) to
wr.redshift.copy() that controls whether pyarrow sanitizes column names.

- sanitize_column_names=True (default): preserves existing behavior,
  column names are sanitized for backward compatibility.
- sanitize_column_names=False: passes flavor=None to the internal
  s3.to_parquet() call, preserving original column names including spaces.

This is a non-breaking change — existing users are unaffected since
the default value maintains the current behavior.

Changes:
- Added sanitize_column_names: bool = True parameter to copy()
- Updated pyarrow_additional_kwargs in s3.to_parquet() call accordingly
- Added docstring for the new parameter
- Added test case for sanitize_column_names=False behavior
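
The test case listed above is not shown in this thread. As a sketch of what such a check might assert, the example below uses stand-ins that run without AWS access: fake_to_parquet and copy_sketch are hypothetical names, not awswrangler functions, and the stand-in only models the space-to-underscore renaming described in this PR.

```python
from typing import Any


def fake_to_parquet(columns: list[str], **kwargs: Any) -> list[str]:
    """Hypothetical stand-in for the internal Parquet write.

    Returns the column names as they would be written. Assumes the default
    flavor is 'spark' when none is supplied, as described in this PR.
    """
    extra = kwargs.get("pyarrow_additional_kwargs", {})
    flavor = extra.get("flavor", "spark")
    if flavor == "spark":
        return [c.replace(" ", "_") for c in columns]
    return list(columns)


def copy_sketch(columns: list[str], sanitize_column_names: bool = True) -> list[str]:
    """Hypothetical stand-in for copy(): forwards the flag as flavor=None."""
    extra = {} if sanitize_column_names else {"flavor": None}
    return fake_to_parquet(columns, pyarrow_additional_kwargs=extra)


def test_sanitize_false_preserves_spaces() -> None:
    assert copy_sketch(["my col"], sanitize_column_names=False) == ["my col"]


def test_default_keeps_existing_behavior() -> None:
    assert copy_sketch(["my col"]) == ["my_col"]


if __name__ == "__main__":
    test_sanitize_false_preserves_spaces()
    test_default_keeps_existing_behavior()
```

A real test would instead write to Redshift and compare the resulting table's column names with the DataFrame's.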

Fixes aws#3293
test: add test for sanitize_column_names=False in wr.redshift.copy()
style: fix ruff formatting - remove trailing whitespace
style: fix ruff formatting - remove trailing whitespace in _write.py
style: fix ruff formatting in test_redshift.py
fix: add required blank line in docstring for ruff D410/D411
fix: remove trailing whitespace in sanitize_column_names docstring
@hirenkumar-n-dholariya
Author

@kukushking Both files are now formatted correctly.
The sanitize_column_names parameter has been added with default=True for backward compatibility, a test case has been added, and the ruff formatting issues are fixed.

Could you please review and share your approval or feedback so we can proceed with the merge?

Development

Successfully merging this pull request may close these issues.

wr.redshift.copy() silently renames columns with spaces due to pyarrow defaulting to flavor='spark' in internal s3.to_parquet call

4 participants