fix(file-based): Allow safe trailing empty CSV headers#1044
Conversation
🤖 Devin AI EngineerI'll be helping with this pull request! Here's what you should know: ✅ I will automatically:
Note: I can only respond to comments from users who have write access to this repository. ⚙️ Control Options:
|
👋 Greetings, Airbyte Team Member!Here are some helpful tips and reminders for your convenience. 💡 Show Tips and TricksTesting This CDK VersionYou can test this version of the CDK using the following: # Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@devin/1780625970-s3-csv-trailing-empty-headers#egg=airbyte-python-cdk[dev]' --help
# Update a connector to use the CDK from this branch ref:
cd airbyte-integrations/connectors/source-example
poe use-cdk-branch devin/1780625970-s3-csv-trailing-empty-headersPR Slash CommandsAirbyte Maintainers can execute the following slash commands on your PR:
|
PyTest Results (Fast)4 103 tests +9 4 091 ✅ +9 7m 36s ⏱️ ±0s Results for commit 012d045. ± Comparison against base commit 7da322d. This pull request removes 3 and adds 12 tests. Note that renamed tests count towards both.♻️ This comment has been updated with latest results. |
PyTest Results (Full)4 106 tests +9 4 094 ✅ +9 12m 7s ⏱️ + 2m 43s Results for commit 012d045. ± Comparison against base commit 7da322d. This pull request removes 3 and adds 12 tests. Note that renamed tests count towards both.♻️ This comment has been updated with latest results. |
|
CI note: all required checks are passing. The non-required I left this PR scoped to the CSV compatibility fix rather than changing the connector-test matrix or generic connector test harness. |
…y_count Consolidate logic into _get_headers (returns List[str]) and call _get_trailing_empty_header_count separately in read_data where the trailing count is needed. This eliminates the unnecessary wrapper method while keeping the interface clean. Co-Authored-By: bot_apk <apk@cognition.ai>
…e_trailing_headers Extract _validate_trailing_headers method that encapsulates all trailing header validation logic (count, strip, interior/all-empty checks) in one place. Add _read_and_validate_headers as the single entry point that reads raw headers from file and validates them. _get_headers delegates to it. Co-Authored-By: bot_apk <apk@cognition.ai>
…iling_empty_header_values _strip_trailing_empty_header_values now calls _get_trailing_empty_header_count internally on the raw_headers, removing the need for read_data to track trailing_empty_header_count separately. _read_and_validate_headers returns (clean_headers, raw_headers) instead of (clean_headers, trailing_count). _validate_trailing_headers returns just the clean headers list. Co-Authored-By: bot_apk <apk@cognition.ai>
Review: CSV trailing-empty-headers changeReviewed the switch from 🔴 Bug: blank lines in CSV files now raise an error instead of being skipped
Confirmed end-to-end against the real parser: Before this PR the same file parsed fine. This also affects files ending with an extra blank line ( Suggested fix — skip empty rows explicitly: for values in reader:
lineno += 1
if not values: # mirror DictReader's blank-row skipping
continue
...🟡 Minor: trailing-empty-header count recomputed per data row
🟡 Minor: no test coverage for blank lines
The trailing-empty-header logic itself looks correct: the interior-vs-trailing distinction in |
- Add blank-row skip (if not values: continue) to match DictReader behavior - Precompute trailing_empty_header_count once before the loop instead of per-row - Add test coverage for blank lines mid-file and trailing blank lines Co-Authored-By: bot_apk <apk@cognition.ai>
|
Good catch on all three points — thanks for the thorough review Daryna Ishchenko (@darynaishchenko). 🔴 Blank-row regression — Fixed. Added 🟡 Per-row recomputation — Fixed. 🟡 Blank-line test coverage — Added two new parametrized test cases ( All 69 tests pass locally, lint clean. |
09cc18f
into
main
Summary
Fixes a regression introduced in CDK v7.19.1 (airbytehq/airbyte-python-cdk#1010) where
_CsvReader._get_headers()rejects all empty CSV column names — including safe trailing empties from trailing delimiters. This broke existingsource-s3connections (v4.15.4) whose externally generated CSV files have trailing blank header columns (e.g.,col1,col2,col3,,).Fix:
_get_headers→_get_headers_and_trailing_empty_countnow distinguishes trailing vs interior empty headers:_strip_trailing_empty_header_valuesconfig_errorconfig_errorconfig_error(safety check)Uses raw
csv.reader+_row_to_dictinstead ofcsv.DictReaderfor explicit control over trailing column handling.Declarative-First Evaluation
N/A — this is a CDK-level fix in
csv_parser.py, not a connector manifest change.Test Coverage
Breaking Change Evaluation
Not breaking. This strictly relaxes validation — trailing empty headers that previously caused
config_errorare now silently handled. No schema, spec, state, or public API changes.Resolves https://github.com/airbytehq/oncall/issues/12736:
Link to Devin session: https://app.devin.ai/sessions/fe8a17db8ccc4484a231583efb869590