fix(file-based): Allow safe trailing empty CSV headers by devin-ai-integration[bot] · Pull Request #1044 · airbytehq/airbyte-python-cdk

devin-ai-integration · 2026-06-05T02:26:16Z

Summary

Fixes a regression introduced in CDK v7.19.1 (airbytehq/airbyte-python-cdk#1010) where _CsvReader._get_headers() rejects all empty CSV column names — including safe trailing empties from trailing delimiters. This broke existing source-s3 connections (v4.15.4) whose externally generated CSV files have trailing blank header columns (e.g., col1,col2,col3,,).

Fix: _get_headers → _get_headers_and_trailing_empty_count now distinguishes trailing vs interior empty headers:

Trailing empty/whitespace-only headers are stripped, and their corresponding row values are discarded via _strip_trailing_empty_header_values
Interior empty headers still raise config_error
All-empty headers still raise config_error
Non-empty data values under stripped trailing headers raise config_error (safety check)

Uses raw csv.reader + _row_to_dict instead of csv.DictReader for explicit control over trailing column handling.

Declarative-First Evaluation

N/A — this is a CDK-level fix in csv_parser.py, not a connector manifest change.

Test Coverage

10 new parametrized tests covering: trailing empty header stripping, interior empty header rejection, all-empty headers, read_data integration with trailing empties, non-empty values under trailing headers
3 existing tests updated to reflect the interior-vs-trailing distinction
All 67 CSV parser tests pass locally

Breaking Change Evaluation

Not breaking. This strictly relaxes validation — trailing empty headers that previously caused config_error are now silently handled. No schema, spec, state, or public API changes.

Resolves https://github.com/airbytehq/oncall/issues/12736:

https://github.com/airbytehq/oncall/issues/12736

Link to Devin session: https://app.devin.ai/sessions/fe8a17db8ccc4484a231583efb869590

devin-ai-integration · 2026-06-05T02:26:17Z

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

Disable automatic comment, CI, and merge conflict monitoring

github-actions · 2026-06-05T02:26:27Z

👋 Greetings, Airbyte Team Member!

Here are some helpful tips and reminders for your convenience.

💡 Show Tips and Tricks

Testing This CDK Version

You can test this version of the CDK using the following:

# Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@devin/1780625970-s3-csv-trailing-empty-headers#egg=airbyte-python-cdk[dev]' --help

# Update a connector to use the CDK from this branch ref:
cd airbyte-integrations/connectors/source-example
poe use-cdk-branch devin/1780625970-s3-csv-trailing-empty-headers

PR Slash Commands

Airbyte Maintainers can execute the following slash commands on your PR:

/autofix - Fixes most formatting and linting issues
/poetry-lock - Updates poetry.lock file
/test - Runs connector tests with the updated CDK
/prerelease - Triggers a prerelease publish with default arguments
/poe build - Regenerate git-committed build artifacts, such as the pydantic models which are generated from the manifest JSON schema in YAML.
/poe <command> - Runs any poe command in the CDK environment

📚 Show Repo Guidance

Helpful Resources

CDK API Reference

📝 Edit this welcome message.

github-actions · 2026-06-05T02:35:58Z

PyTest Results (Fast)

4 103 tests +9 4 091 ✅ +9 7m 36s ⏱️ ±0s
1 suites ±0 12 💤 ±0
1 files ±0 0 ❌ ±0

Results for commit 012d045. ± Comparison against base commit 7da322d.

This pull request removes 3 and adds 12 tests. Note that renamed tests count towards both.

unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_get_headers_raises_on_empty_column_names[trailing_empty_columns]
unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_get_headers_raises_on_empty_column_names[whitespace_only_column]
unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_read_data_raises_on_empty_column_names

unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_get_headers_raises_on_empty_column_names[middle_whitespace_only_column]
unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_get_headers_raises_on_empty_column_names[only_empty_columns]
unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_get_headers_strips_trailing_empty_column_names[empty_headers]
unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_get_headers_strips_trailing_empty_column_names[whitespace_only_headers]
unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_read_data_raises_on_interior_empty_column_names[interior_empty_header]
unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_read_data_raises_on_interior_empty_column_names[interior_whitespace_header]
unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_read_data_raises_on_non_empty_values_under_trailing_empty_headers
unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_read_data_strips_trailing_empty_headers_and_values[blank_line_mid_file]
unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_read_data_strips_trailing_empty_headers_and_values[trailing_blank_line]
unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_read_data_strips_trailing_empty_headers_and_values[trailing_empty_headers_and_cells]
…

♻️ This comment has been updated with latest results.

github-actions · 2026-06-05T02:39:30Z

PyTest Results (Full)

4 106 tests +9 4 094 ✅ +9 12m 7s ⏱️ + 2m 43s
1 suites ±0 12 💤 ±0
1 files ±0 0 ❌ ±0

Results for commit 012d045. ± Comparison against base commit 7da322d.

This pull request removes 3 and adds 12 tests. Note that renamed tests count towards both.

unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_get_headers_raises_on_empty_column_names[trailing_empty_columns]
unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_get_headers_raises_on_empty_column_names[whitespace_only_column]
unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_read_data_raises_on_empty_column_names

unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_get_headers_raises_on_empty_column_names[middle_whitespace_only_column]
unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_get_headers_raises_on_empty_column_names[only_empty_columns]
unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_get_headers_strips_trailing_empty_column_names[empty_headers]
unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_get_headers_strips_trailing_empty_column_names[whitespace_only_headers]
unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_read_data_raises_on_interior_empty_column_names[interior_empty_header]
unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_read_data_raises_on_interior_empty_column_names[interior_whitespace_header]
unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_read_data_raises_on_non_empty_values_under_trailing_empty_headers
unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_read_data_strips_trailing_empty_headers_and_values[blank_line_mid_file]
unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_read_data_strips_trailing_empty_headers_and_values[trailing_blank_line]
unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_read_data_strips_trailing_empty_headers_and_values[trailing_empty_headers_and_cells]
…

♻️ This comment has been updated with latest results.

devin-ai-integration · 2026-06-05T03:12:46Z

CI note: all required checks are passing. The non-required Check: source-google-drive job fails before exercising this PR's CSV parser changes: the stack trace shows airbyte_cdk/test/standard_tests/connector_base.py calling SourceGoogleDrive() with no args, while the current source-google-drive connector defines __init__(catalog, config, state). The same downstream connector-test failure is visible on the related draft PR #1039.

I left this PR scoped to the CSV compatibility fix rather than changing the connector-test matrix or generic connector test harness.

Devin session

…y_count Consolidate logic into _get_headers (returns List[str]) and call _get_trailing_empty_header_count separately in read_data where the trailing count is needed. This eliminates the unnecessary wrapper method while keeping the interface clean. Co-Authored-By: bot_apk <apk@cognition.ai>

…e_trailing_headers Extract _validate_trailing_headers method that encapsulates all trailing header validation logic (count, strip, interior/all-empty checks) in one place. Add _read_and_validate_headers as the single entry point that reads raw headers from file and validates them. _get_headers delegates to it. Co-Authored-By: bot_apk <apk@cognition.ai>

…iling_empty_header_values _strip_trailing_empty_header_values now calls _get_trailing_empty_header_count internally on the raw_headers, removing the need for read_data to track trailing_empty_header_count separately. _read_and_validate_headers returns (clean_headers, raw_headers) instead of (clean_headers, trailing_count). _validate_trailing_headers returns just the clean headers list. Co-Authored-By: bot_apk <apk@cognition.ai>

Daryna Ishchenko (darynaishchenko) · 2026-06-08T15:24:37Z

Review: CSV trailing-empty-headers change

Reviewed the switch from csv.DictReader to csv.reader + manual _row_to_dict. One significant bug plus a couple of minor points.

🔴 Bug: blank lines in CSV files now raise an error instead of being skipped

csv.DictReader silently skips blank rows (it has an internal while row == []: ... loop). A plain csv.reader does not — it yields [] for each blank line. The new loop has no equivalent skip, so a blank line flows into _row_to_dict(headers, []), producing {header: None, ...}, which trips the None in row.values() check → RecordParseError (or a logged error when ignore_errors_on_fields_mismatch is set).

Confirmed end-to-end against the real parser:

data = 'a,b\n1,2\n\n3,4\n'   # blank line between records
→ RAISED: RecordParseError "CSV data row contains fewer columns than the header row defines." lineno=3

Before this PR the same file parsed fine. This also affects files ending with an extra blank line (...\n\n), which is common. The regression is unrelated to trailing empty headers — it's a side effect of dropping DictReader.

Suggested fix — skip empty rows explicitly:

for values in reader:
    lineno += 1
    if not values:          # mirror DictReader's blank-row skipping
        continue
    ...

🟡 Minor: trailing-empty-header count recomputed per data row

_strip_trailing_empty_header_values calls _get_trailing_empty_header_count(raw_headers) on every row, though raw_headers is constant for the whole file. The count (or strip_start/strip_end) could be computed once before the loop and passed in. Not a correctness issue, just wasted work on large files.

🟡 Minor: no test coverage for blank lines

test_csv_parser.py has no test exercising blank/empty rows, which is why this regression isn't caught. Worth adding cases for a blank line mid-file and a trailing blank line.

The trailing-empty-header logic itself looks correct: the interior-vs-trailing distinction in _validate_trailing_headers, the all-empty-header guard, the _row_to_dict replication of DictReader's restkey/restval semantics, and the "non-empty value under a stripped trailing column" guard all behave as intended. The one real problem is the lost blank-row skipping.

- Add blank-row skip (if not values: continue) to match DictReader behavior - Precompute trailing_empty_header_count once before the loop instead of per-row - Add test coverage for blank lines mid-file and trailing blank lines Co-Authored-By: bot_apk <apk@cognition.ai>

devin-ai-integration · 2026-06-08T15:27:35Z

Good catch on all three points — thanks for the thorough review Daryna Ishchenko (@darynaishchenko).

🔴 Blank-row regression — Fixed. Added if not values: continue in the read_data loop to mirror DictReader's internal blank-row skipping. Added test cases for both mid-file blank lines and trailing blank lines.

🟡 Per-row recomputation — Fixed. trailing_empty_header_count is now computed once before the loop and passed into _strip_trailing_empty_header_values instead of recomputing from raw_headers on every row.

🟡 Blank-line test coverage — Added two new parametrized test cases (blank_line_mid_file and trailing_blank_line).

All 69 tests pass locally, lint clean.

Devin session

fix(file-based): allow safe trailing empty CSV headers

60b189b

devin-ai-integration Bot added the hyd-fix label Jun 6, 2026

devin-ai-integration Bot and others added 3 commits June 8, 2026 13:53

Daryna Ishchenko (darynaishchenko) marked this pull request as ready for review June 8, 2026 15:51

Merge branch 'main' into devin/1780625970-s3-csv-trailing-empty-headers

012d045

Daryna Ishchenko (darynaishchenko) requested a review from Anatolii Yatsuk (tolik0) June 8, 2026 16:09

Patrick Nilan (pnilan) mentioned this pull request Jun 8, 2026

fix(file-based): strip trailing empty CSV headers instead of rejecting them #1039

Closed

3 tasks

Anatolii Yatsuk (tolik0) approved these changes Jun 9, 2026

View reviewed changes

Daryna Ishchenko (darynaishchenko) merged commit 09cc18f into main Jun 9, 2026
26 of 27 checks passed

Daryna Ishchenko (darynaishchenko) deleted the devin/1780625970-s3-csv-trailing-empty-headers branch June 9, 2026 14:55

devin-ai-integration Bot mentioned this pull request Jun 9, 2026

fix(source-s3): upgrade CDK to v7.21.2 to fix CSV trailing empty header regression airbytehq/airbyte#79599

Merged

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(file-based): Allow safe trailing empty CSV headers#1044

fix(file-based): Allow safe trailing empty CSV headers#1044
Daryna Ishchenko (darynaishchenko) merged 6 commits into
mainfrom
devin/1780625970-s3-csv-trailing-empty-headers

devin-ai-integration Bot commented Jun 5, 2026 •

edited

Loading

Uh oh!

devin-ai-integration Bot commented Jun 5, 2026

Uh oh!

github-actions Bot commented Jun 5, 2026

Testing This CDK Version

PR Slash Commands

Helpful Resources

Uh oh!

github-actions Bot commented Jun 5, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jun 5, 2026 •

edited

Loading

Uh oh!

devin-ai-integration Bot commented Jun 5, 2026

Uh oh!

Daryna Ishchenko (darynaishchenko) commented Jun 8, 2026

Uh oh!

devin-ai-integration Bot commented Jun 8, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

devin-ai-integration Bot commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Declarative-First Evaluation

Test Coverage

Breaking Change Evaluation

Uh oh!

devin-ai-integration Bot commented Jun 5, 2026

🤖 Devin AI Engineer

Uh oh!

github-actions Bot commented Jun 5, 2026

👋 Greetings, Airbyte Team Member!

Testing This CDK Version

PR Slash Commands

Helpful Resources

Uh oh!

github-actions Bot commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PyTest Results (Fast)

Uh oh!

github-actions Bot commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PyTest Results (Full)

Uh oh!

devin-ai-integration Bot commented Jun 5, 2026

Uh oh!

Daryna Ishchenko (darynaishchenko) commented Jun 8, 2026

Review: CSV trailing-empty-headers change

🔴 Bug: blank lines in CSV files now raise an error instead of being skipped

🟡 Minor: trailing-empty-header count recomputed per data row

🟡 Minor: no test coverage for blank lines

Uh oh!

devin-ai-integration Bot commented Jun 8, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

devin-ai-integration Bot commented Jun 5, 2026 •

edited

Loading

github-actions Bot commented Jun 5, 2026 •

edited

Loading

github-actions Bot commented Jun 5, 2026 •

edited

Loading