fix(file-based): clarify CSV row/header column mismatch error messages#996

Draft
devin-ai-integration[bot] wants to merge 2 commits into main from devin/1776797319-improve-csv-mismatch-error-messages

Conversation

@devin-ai-integration
Contributor

Summary

Replaces the two long, implementation-leaky CSV parse-error strings in FileBasedSourceError with concise, deterministic messages that comply with Airbyte's writing-good-error-messages guidelines.

Before:

A header field has resolved to `None`. This indicates that the CSV has more rows than the number of header fields. If you input your schema or headers, please verify that the number of columns corresponds to the number of columns in your CSV's rows.

A row's value has resolved to `None`. This indicates that the CSV has more columns in the header field than the number of columns in the row(s). If you input your schema or headers, please verify that the number of columns corresponds to the number of columns in your CSV's rows.

After:

CSV row has more columns than the header.
CSV row has fewer columns than the header.

The filename=<file uri> lineno=<line> context is already appended to the raised exception by BaseFileBasedSourceError.__init__ via kwargs, so per-row location is preserved in the log without embedding user-supplied data directly in the deterministic message.
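The split between a static message and appended context can be sketched as below. This is a minimal illustration, not the CDK's code: `SketchError`, `SketchBaseError`, the member names, the example URI, and the exact formatting of the appended context are all assumptions. Only the two message strings and the `filename=` / `lineno=` keys come from this PR.

```python
from enum import Enum


class SketchError(Enum):
    # The two static strings from this PR; the member names are hypothetical.
    ROW_HAS_MORE_COLUMNS = "CSV row has more columns than the header."
    ROW_HAS_FEWER_COLUMNS = "CSV row has fewer columns than the header."


class SketchBaseError(Exception):
    """Hypothetical stand-in for BaseFileBasedSourceError."""

    def __init__(self, error: SketchError, **kwargs: object) -> None:
        # The static part stays constant across rows and files.
        self.static_message = error.value
        # Per-row context is appended as key=value pairs (assumed format).
        context = " ".join(f"{key}={value}" for key, value in kwargs.items())
        super().__init__(f"{self.static_message} {context}".strip())


# Example values only; the URI and line number are made up.
err = SketchBaseError(
    SketchError.ROW_HAS_FEWER_COLUMNS,
    filename="s3://bucket/a.csv",
    lineno=42,
)
print(err.static_message)
print(err)
```

Because `static_message` never varies, it can serve as a grouping key, while the full string still tells the user which file and line to inspect.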

What this addresses

  • Implementation leakage. The phrase resolved to `None` described a Python dict symptom (the DictReader sentinel for extra/missing values), not a CSV concept. Users didn't know what None meant.
  • Length. Both messages previously exceeded the 120-char target by roughly 3×.
  • Determinism. The new strings are static and usable as log-aggregation keys.
  • Accuracy of direction. Each variant now directly names whether the row has more or fewer columns than the header, matching the two distinct csv.DictReader branches (None in row keys vs. None in row.values()).
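The two `csv.DictReader` branches mentioned above can be reproduced with the standard library alone. In this sketch the `classify` helper is hypothetical and only mirrors the direction of each message; it is not the CDK's parser code.

```python
import csv
import io
from typing import Optional

# Header has two columns; the first row has three fields, the second has one.
data = "a,b\n1,2,3\n4\n"
too_many, too_few = csv.DictReader(io.StringIO(data, newline=""))


def classify(row: dict) -> Optional[str]:
    """Hypothetical helper mapping DictReader symptoms to the new messages."""
    # Extra fields are collected in a list under restkey, which defaults to None.
    if None in row:
        return "CSV row has more columns than the header."
    # Missing fields are filled with restval, which also defaults to None.
    if None in row.values():
        return "CSV row has fewer columns than the header."
    return None


print(too_many)
print(classify(too_many))
print(too_few)
print(classify(too_few))
```

Checking the keys before the values matters here: a row with extra fields has a `None` key, and its values are all populated, so the two conditions do not overlap for these inputs.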

Resolves https://github.com/airbytehq/airbyte-internal-issues/issues/16225
Related to https://github.com/airbytehq/oncall/issues/12046

Scope note — out of scope for this PR

The parent triage issue also flagged a deeper source-s3 v3→v4 regression: the legacy newlines_in_values: True CSV option is silently dropped by LegacyConfigTransformer, which causes the CDK's csv.DictReader to mis-split rows with embedded newlines in quoted fields. That connector-side regression is a frequent cause of these MISMATCHED_* errors in the wild, but fixing it requires:

  • Adding a newlines_in_values-equivalent option to CsvFormat (or explicit handling in LegacyConfigTransformer._transform_file_format), and
  • Plumbing newline="" through source-s3/v4/stream_reader.py::open_file to csv.DictReader.
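Why embedded newlines matter can be shown with the standard library. The sketch below is an illustration only and does not reproduce the source-s3 code path: the per-line parsing is a hypothetical way for rows to be mis-split, chosen because it yields the same kind of column-count mismatch.

```python
import csv
import io

# One data row contains a newline inside a quoted field.
text = 'id,comment\n1,"line one\nline two"\n2,ok\n'

# Reading the stream with newline="" keeps the quoted newline inside one field.
whole = list(csv.DictReader(io.StringIO(text, newline="")))
print(len(whole), repr(whole[0]["comment"]))

# Hypothetical mis-split: every physical line is parsed as its own record,
# so the quoted field is cut in two and the column counts no longer match.
per_line = [next(csv.reader([line])) for line in text.splitlines()]
print([len(fields) for fields in per_line])
```

With the stream read as a whole there are two data rows of two columns each. With the per-line split, the record containing the quoted newline becomes two fragments, and at least one of them has a different column count than the header.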

Those changes are connector-level, higher-risk, and should ship as a separate PR. This PR intentionally scopes to the CDK-level message quality improvement so the fix can land quickly for all file-based CSV sources.

Review & Testing Checklist for Human

  • Confirm the two replacement strings read well in real sync logs (they are surfaced via default_file_based_stream.py's errors_collector with filename= / lineno= appended by BaseFileBasedSourceError).
  • Confirm no downstream tooling (log-ingestion rules, alert regexes, dashboards) was keying on the previous resolved to `None` strings. A quick repo / dashboard grep is recommended before merge.
  • Decide whether a follow-up issue/PR should be opened for the source-s3 newlines_in_values regression described in the scope note above.

Notes

No tests referenced the old strings (verified via ripgrep for `header field has resolved`, `row's value has resolved`, `MISMATCHED_COLUMNS`, and `MISMATCHED_ROWS`). Ruff lint and format both pass on the modified file.

Link to Devin session: https://app.devin.ai/sessions/cc54e092a81c48c5b492a605abafc1ea

Co-Authored-By: bot_apk <apk@cognition.ai>

@github-actions

👋 Greetings, Airbyte Team Member!

Here are some helpful tips and reminders for your convenience.

Testing This CDK Version

You can test this version of the CDK using the following:

# Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@devin/1776797319-improve-csv-mismatch-error-messages#egg=airbyte-python-cdk[dev]' --help

# Update a connector to use the CDK from this branch ref:
cd airbyte-integrations/connectors/source-example
poe use-cdk-branch devin/1776797319-improve-csv-mismatch-error-messages

PR Slash Commands

Airbyte Maintainers can execute the following slash commands on your PR:

  • /autofix - Fixes most formatting and linting issues
  • /poetry-lock - Updates poetry.lock file
  • /test - Runs connector tests with the updated CDK
  • /prerelease - Triggers a prerelease publish with default arguments
  • /poe build - Regenerate git-committed build artifacts, such as the pydantic models which are generated from the manifest JSON schema in YAML.
  • /poe <command> - Runs any poe command in the CDK environment

@github-actions

github-actions Bot commented Apr 21, 2026

PyTest Results (Fast)

4 022 tests  ±0   4 011 ✅ ±0   7m 39s ⏱️ -7s
    1 suites ±0      11 💤 ±0 
    1 files   ±0       0 ❌ ±0 

Results for commit a742618. ± Comparison against base commit fd553bd.

♻️ This comment has been updated with latest results.

@github-actions

github-actions Bot commented Apr 21, 2026

PyTest Results (Full)

4 025 tests  ±0   4 013 ✅ ±0   11m 26s ⏱️ +39s
    1 suites ±0      12 💤 ±0 
    1 files   ±0       0 ❌ ±0 

Results for commit a742618. ± Comparison against base commit fd553bd.

♻️ This comment has been updated with latest results.

…rce-google-drive failures

Co-Authored-By: bot_apk <apk@cognition.ai>