fix(file-based): strip trailing empty CSV headers instead of rejecting them#1039
devin-ai-integration[bot] wants to merge 1 commit
…g them

Trailing empty/whitespace-only column names (common in CSVs with trailing delimiters) are now silently stripped with a warning instead of raising a config_error. Non-trailing empty headers remain an error. This fixes a compatibility regression introduced in v7.19.1 where existing S3 CSV streams with externally generated trailing delimiters started failing.

Co-Authored-By: bot_apk <apk@cognition.ai>
Testing This CDK Version
You can test this version of the CDK using the following:
# Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@devin/1779970954-fix-csv-trailing-empty-headers#egg=airbyte-python-cdk[dev]' --help
# Update a connector to use the CDK from this branch ref:
cd airbyte-integrations/connectors/source-example
poe use-cdk-branch devin/1779970954-fix-csv-trailing-empty-headers
PyTest Results (Fast)
4 084 tests (+3), 4 073 ✅ (+3), 8m 10s ⏱️ (-8s)
Results for commit 85caa6a. ± Comparison against base commit f4c0779.
This pull request removes 6 and adds 9 tests. Note that renamed tests count towards both.

PyTest Results (Full)
4 087 tests (+3), 4 075 ✅ (+3), 11m 53s ⏱️ (+21s)
Results for commit 85caa6a. ± Comparison against base commit f4c0779.
This pull request removes 6 and adds 9 tests. Note that renamed tests count towards both.

Closing in favor of #1044
Summary
Fixes a compatibility regression introduced in CDK v7.19.1 (#1010) where CSV files with trailing empty/whitespace-only column names (common in externally generated CSVs with trailing delimiters) started failing with a `config_error`. This broke existing source-s3 streams that were syncing successfully before the upgrade.

Root cause: The empty header validation in `_CsvReader._get_headers` treated all empty column names identically, whether they were trailing empties from trailing delimiters or genuine interior empties from malformed schemas.

Fix:
- Trailing empty/whitespace-only column names are now stripped and a warning is logged.
- Non-trailing (interior) empty column names still raise `config_error`, as these indicate genuine schema problems.

This restores backward compatibility for the common case (trailing delimiters) while preserving error detection for genuinely malformed CSV headers.
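The header rule described above can be illustrated with a minimal standalone sketch. This is not the CDK's `_get_headers` implementation: the function name `strip_trailing_empty_headers` is hypothetical, `ValueError` stands in for the CDK's `config_error`, and returning the number of stripped columns as the second tuple element is an assumption about what the `int` represents.

```python
import logging
from typing import List, Tuple


def strip_trailing_empty_headers(
    headers: List[str], logger: logging.Logger
) -> Tuple[List[str], int]:
    """Hypothetical helper illustrating the rule; not the CDK's _get_headers."""
    kept = list(headers)
    # Remove empty or whitespace-only names from the right-hand end only.
    while kept and not kept[-1].strip():
        kept.pop()
    stripped = len(headers) - len(kept)
    if stripped:
        logger.warning("Ignoring %d trailing empty CSV column name(s).", stripped)
    # Any empty name that survives is interior and is treated as a schema error.
    # ValueError is a stand-in for the CDK's config_error here.
    for index, name in enumerate(kept):
        if not name.strip():
            raise ValueError(f"Empty CSV column name at position {index}")
    return kept, stripped
```

With this sketch, `["col1", "col2", "", ""]` is reduced to `["col1", "col2"]` with a warning, while `["col1", "", "col3"]` raises.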
Not a breaking change — this strictly relaxes validation. No public API, spec, schema, or state changes.
Resolves: airbytehq/oncall#12736
Review & Testing Checklist for Human
- Trailing empty headers are stripped (e.g. `col1,col2,,\nv1,v2,v3,v4` should yield `{"col1": "v1", "col2": "v2"}` and log a warning)
- Interior empty headers (e.g. `col1,,col3`) still raise `config_error`

Notes
- The `_get_headers` method signature changed (added `logger` param, returns `Tuple[List[str], int]`), but this is a private method on the private `_CsvReader` class, so there is no public API impact.
- … `read_data` integration with trailing empty columns.

Link to Devin session: https://app.devin.ai/sessions/1eee4b15c3594dfc9f644a490b7f1753
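For reviewers who want to reproduce the first checklist case outside the CDK, a small sketch using only Python's standard `csv` module follows. The function `read_rows` is hypothetical, and dropping the extra row values via `zip` is an assumption chosen to match the expected output in the checklist, not a description of how `read_data` works internally.

```python
import csv
import io
from typing import Dict, Iterator


def read_rows(text: str) -> Iterator[Dict[str, str]]:
    """Hypothetical reader: trims trailing empty headers, then maps rows."""
    reader = csv.reader(io.StringIO(text))
    headers = next(reader)
    # Drop trailing empty or whitespace-only column names.
    while headers and not headers[-1].strip():
        headers.pop()
    for row in reader:
        # zip stops at the shorter sequence, so values that fall under the
        # stripped trailing columns are discarded.
        yield dict(zip(headers, row))


print(list(read_rows("col1,col2,,\nv1,v2,v3,v4\n")))
# prints [{'col1': 'v1', 'col2': 'v2'}]
```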