Skip to content

fix(file-based): Allow safe trailing empty CSV headers#1044

Merged
Daryna Ishchenko (darynaishchenko) merged 6 commits into
mainfrom
devin/1780625970-s3-csv-trailing-empty-headers
Jun 9, 2026
Merged

fix(file-based): Allow safe trailing empty CSV headers#1044
Daryna Ishchenko (darynaishchenko) merged 6 commits into
mainfrom
devin/1780625970-s3-csv-trailing-empty-headers

Conversation

@devin-ai-integration

@devin-ai-integration devin-ai-integration Bot commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

Summary

Fixes a regression introduced in CDK v7.19.1 (airbytehq/airbyte-python-cdk#1010) where _CsvReader._get_headers() rejects all empty CSV column names — including safe trailing empties from trailing delimiters. This broke existing source-s3 connections (v4.15.4) whose externally generated CSV files have trailing blank header columns (e.g., col1,col2,col3,,).

Fix: _get_headers_get_headers_and_trailing_empty_count now distinguishes trailing vs interior empty headers:

  • Trailing empty/whitespace-only headers are stripped, and their corresponding row values are discarded via _strip_trailing_empty_header_values
  • Interior empty headers still raise config_error
  • All-empty headers still raise config_error
  • Non-empty data values under stripped trailing headers raise config_error (safety check)

Uses raw csv.reader + _row_to_dict instead of csv.DictReader for explicit control over trailing column handling.

Declarative-First Evaluation

N/A — this is a CDK-level fix in csv_parser.py, not a connector manifest change.

Test Coverage

  • 10 new parametrized tests covering: trailing empty header stripping, interior empty header rejection, all-empty headers, read_data integration with trailing empties, non-empty values under trailing headers
  • 3 existing tests updated to reflect the interior-vs-trailing distinction
  • All 67 CSV parser tests pass locally

Breaking Change Evaluation

Not breaking. This strictly relaxes validation — trailing empty headers that previously caused config_error are now silently handled. No schema, spec, state, or public API changes.

Resolves https://github.com/airbytehq/oncall/issues/12736:

Link to Devin session: https://app.devin.ai/sessions/fe8a17db8ccc4484a231583efb869590

@devin-ai-integration

Copy link
Copy Markdown
Contributor Author

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment, CI, and merge conflict monitoring

@github-actions

github-actions Bot commented Jun 5, 2026

Copy link
Copy Markdown

👋 Greetings, Airbyte Team Member!

Here are some helpful tips and reminders for your convenience.

💡 Show Tips and Tricks

Testing This CDK Version

You can test this version of the CDK using the following:

# Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@devin/1780625970-s3-csv-trailing-empty-headers#egg=airbyte-python-cdk[dev]' --help

# Update a connector to use the CDK from this branch ref:
cd airbyte-integrations/connectors/source-example
poe use-cdk-branch devin/1780625970-s3-csv-trailing-empty-headers

PR Slash Commands

Airbyte Maintainers can execute the following slash commands on your PR:

  • /autofix - Fixes most formatting and linting issues
  • /poetry-lock - Updates poetry.lock file
  • /test - Runs connector tests with the updated CDK
  • /prerelease - Triggers a prerelease publish with default arguments
  • /poe build - Regenerate git-committed build artifacts, such as the pydantic models which are generated from the manifest JSON schema in YAML.
  • /poe <command> - Runs any poe command in the CDK environment
📚 Show Repo Guidance

Helpful Resources

📝 Edit this welcome message.

@github-actions

github-actions Bot commented Jun 5, 2026

Copy link
Copy Markdown

PyTest Results (Fast)

4 103 tests  +9   4 091 ✅ +9   7m 36s ⏱️ ±0s
    1 suites ±0      12 💤 ±0 
    1 files   ±0       0 ❌ ±0 

Results for commit 012d045. ± Comparison against base commit 7da322d.

This pull request removes 3 and adds 12 tests. Note that renamed tests count towards both.
unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_get_headers_raises_on_empty_column_names[trailing_empty_columns]
unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_get_headers_raises_on_empty_column_names[whitespace_only_column]
unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_read_data_raises_on_empty_column_names
unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_get_headers_raises_on_empty_column_names[middle_whitespace_only_column]
unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_get_headers_raises_on_empty_column_names[only_empty_columns]
unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_get_headers_strips_trailing_empty_column_names[empty_headers]
unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_get_headers_strips_trailing_empty_column_names[whitespace_only_headers]
unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_read_data_raises_on_interior_empty_column_names[interior_empty_header]
unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_read_data_raises_on_interior_empty_column_names[interior_whitespace_header]
unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_read_data_raises_on_non_empty_values_under_trailing_empty_headers
unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_read_data_strips_trailing_empty_headers_and_values[blank_line_mid_file]
unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_read_data_strips_trailing_empty_headers_and_values[trailing_blank_line]
unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_read_data_strips_trailing_empty_headers_and_values[trailing_empty_headers_and_cells]
…

♻️ This comment has been updated with latest results.

@github-actions

github-actions Bot commented Jun 5, 2026

Copy link
Copy Markdown

PyTest Results (Full)

4 106 tests  +9   4 094 ✅ +9   12m 7s ⏱️ + 2m 43s
    1 suites ±0      12 💤 ±0 
    1 files   ±0       0 ❌ ±0 

Results for commit 012d045. ± Comparison against base commit 7da322d.

This pull request removes 3 and adds 12 tests. Note that renamed tests count towards both.
unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_get_headers_raises_on_empty_column_names[trailing_empty_columns]
unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_get_headers_raises_on_empty_column_names[whitespace_only_column]
unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_read_data_raises_on_empty_column_names
unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_get_headers_raises_on_empty_column_names[middle_whitespace_only_column]
unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_get_headers_raises_on_empty_column_names[only_empty_columns]
unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_get_headers_strips_trailing_empty_column_names[empty_headers]
unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_get_headers_strips_trailing_empty_column_names[whitespace_only_headers]
unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_read_data_raises_on_interior_empty_column_names[interior_empty_header]
unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_read_data_raises_on_interior_empty_column_names[interior_whitespace_header]
unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_read_data_raises_on_non_empty_values_under_trailing_empty_headers
unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_read_data_strips_trailing_empty_headers_and_values[blank_line_mid_file]
unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_read_data_strips_trailing_empty_headers_and_values[trailing_blank_line]
unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_read_data_strips_trailing_empty_headers_and_values[trailing_empty_headers_and_cells]
…

♻️ This comment has been updated with latest results.

@devin-ai-integration

Copy link
Copy Markdown
Contributor Author

CI note: all required checks are passing. The non-required Check: source-google-drive job fails before exercising this PR's CSV parser changes: the stack trace shows airbyte_cdk/test/standard_tests/connector_base.py calling SourceGoogleDrive() with no args, while the current source-google-drive connector defines __init__(catalog, config, state). The same downstream connector-test failure is visible on the related draft PR #1039.

I left this PR scoped to the CSV compatibility fix rather than changing the connector-test matrix or generic connector test harness.


Devin session

devin-ai-integration Bot and others added 3 commits June 8, 2026 13:53
…y_count

Consolidate logic into _get_headers (returns List[str]) and call
_get_trailing_empty_header_count separately in read_data where the
trailing count is needed. This eliminates the unnecessary wrapper
method while keeping the interface clean.

Co-Authored-By: bot_apk <apk@cognition.ai>
…e_trailing_headers

Extract _validate_trailing_headers method that encapsulates all trailing
header validation logic (count, strip, interior/all-empty checks) in one
place. Add _read_and_validate_headers as the single entry point that reads
raw headers from file and validates them. _get_headers delegates to it.

Co-Authored-By: bot_apk <apk@cognition.ai>
…iling_empty_header_values

_strip_trailing_empty_header_values now calls _get_trailing_empty_header_count
internally on the raw_headers, removing the need for read_data to track
trailing_empty_header_count separately.

_read_and_validate_headers returns (clean_headers, raw_headers) instead of
(clean_headers, trailing_count). _validate_trailing_headers returns just
the clean headers list.

Co-Authored-By: bot_apk <apk@cognition.ai>
@darynaishchenko

Copy link
Copy Markdown
Contributor

Review: CSV trailing-empty-headers change

Reviewed the switch from csv.DictReader to csv.reader + manual _row_to_dict. One significant bug plus a couple of minor points.

🔴 Bug: blank lines in CSV files now raise an error instead of being skipped

csv.DictReader silently skips blank rows (it has an internal while row == []: ... loop). A plain csv.reader does not — it yields [] for each blank line. The new loop has no equivalent skip, so a blank line flows into _row_to_dict(headers, []), producing {header: None, ...}, which trips the None in row.values() check → RecordParseError (or a logged error when ignore_errors_on_fields_mismatch is set).

Confirmed end-to-end against the real parser:

data = 'a,b\n1,2\n\n3,4\n'   # blank line between records
→ RAISED: RecordParseError "CSV data row contains fewer columns than the header row defines." lineno=3

Before this PR the same file parsed fine. This also affects files ending with an extra blank line (...\n\n), which is common. The regression is unrelated to trailing empty headers — it's a side effect of dropping DictReader.

Suggested fix — skip empty rows explicitly:

for values in reader:
    lineno += 1
    if not values:          # mirror DictReader's blank-row skipping
        continue
    ...

🟡 Minor: trailing-empty-header count recomputed per data row

_strip_trailing_empty_header_values calls _get_trailing_empty_header_count(raw_headers) on every row, though raw_headers is constant for the whole file. The count (or strip_start/strip_end) could be computed once before the loop and passed in. Not a correctness issue, just wasted work on large files.

🟡 Minor: no test coverage for blank lines

test_csv_parser.py has no test exercising blank/empty rows, which is why this regression isn't caught. Worth adding cases for a blank line mid-file and a trailing blank line.


The trailing-empty-header logic itself looks correct: the interior-vs-trailing distinction in _validate_trailing_headers, the all-empty-header guard, the _row_to_dict replication of DictReader's restkey/restval semantics, and the "non-empty value under a stripped trailing column" guard all behave as intended. The one real problem is the lost blank-row skipping.

- Add blank-row skip (if not values: continue) to match DictReader behavior
- Precompute trailing_empty_header_count once before the loop instead of per-row
- Add test coverage for blank lines mid-file and trailing blank lines

Co-Authored-By: bot_apk <apk@cognition.ai>
@devin-ai-integration

Copy link
Copy Markdown
Contributor Author

Good catch on all three points — thanks for the thorough review Daryna Ishchenko (@darynaishchenko).

🔴 Blank-row regression — Fixed. Added if not values: continue in the read_data loop to mirror DictReader's internal blank-row skipping. Added test cases for both mid-file blank lines and trailing blank lines.

🟡 Per-row recomputation — Fixed. trailing_empty_header_count is now computed once before the loop and passed into _strip_trailing_empty_header_values instead of recomputing from raw_headers on every row.

🟡 Blank-line test coverage — Added two new parametrized test cases (blank_line_mid_file and trailing_blank_line).

All 69 tests pass locally, lint clean.


Devin session

@darynaishchenko Daryna Ishchenko (darynaishchenko) marked this pull request as ready for review June 8, 2026 15:51
@darynaishchenko Daryna Ishchenko (darynaishchenko) merged commit 09cc18f into main Jun 9, 2026
26 of 27 checks passed
@darynaishchenko Daryna Ishchenko (darynaishchenko) deleted the devin/1780625970-s3-csv-trailing-empty-headers branch June 9, 2026 14:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants