fix(file-based): clarify CSV row/header column mismatch error messages #996
Draft
devin-ai-integration[bot] wants to merge 2 commits into main
Co-Authored-By: bot_apk <apk@cognition.ai>
👋 Greetings, Airbyte Team Member! Here are some helpful tips and reminders for your convenience.

Testing This CDK Version

You can test this version of the CDK using the following:

# Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@devin/1776797319-improve-csv-mismatch-error-messages#egg=airbyte-python-cdk[dev]' --help
# Update a connector to use the CDK from this branch ref:
cd airbyte-integrations/connectors/source-example
poe use-cdk-branch devin/1776797319-improve-csv-mismatch-error-messages
…rce-google-drive failures Co-Authored-By: bot_apk <apk@cognition.ai>
Summary
Replaces the two long, implementation-leaky CSV parse-error strings in `FileBasedSourceError` with concise, deterministic messages that comply with Airbyte's writing-good-error-messages guidelines.

Before:
After:
The `filename=<file uri> lineno=<line>` context is already appended to the raised exception by `BaseFileBasedSourceError.__init__` via kwargs, so per-row location is preserved in the log without embedding user-supplied data directly in the deterministic message.

What this addresses
- The phrase "resolved to `None`" described a Python `dict` symptom (the `DictReader` sentinel for extra/missing values), not a CSV concept. Users didn't know what `None` meant.
- […] `csv.DictReader` branches (`None` in row keys vs. `None` in `row.values()`).

Resolves https://github.com/airbytehq/airbyte-internal-issues/issues/16225
Related to https://github.com/airbytehq/oncall/issues/12046
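
For readers unfamiliar with the `None` sentinel discussed above, here is a minimal sketch of how `csv.DictReader` surfaces the two mismatch cases with its default `restkey`/`restval`. The function name `classify_rows` and the label strings are hypothetical, chosen for illustration only; they are not the CDK's actual code or the messages introduced by this PR.

```python
import csv
import io


def classify_rows(text: str) -> list[str]:
    """Label each data row by how it lines up with the header (illustrative labels only)."""
    reader = csv.DictReader(io.StringIO(text))
    labels = []
    for row in reader:
        if None in row:
            # Extra values are collected under the default restkey, which is None.
            labels.append("more values than header columns")
        elif None in row.values():
            # Missing values are filled with the default restval, which is None.
            labels.append("fewer values than header columns")
        else:
            labels.append("ok")
    return labels


sample = "a,b\n1,2\n1,2,3\n1\n"
print(classify_rows(sample))
```

The first data row matches the header, the second carries one extra value, and the third is one value short, so each branch is hit once.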
Scope note — out of scope for this PR
The parent triage issue also flagged a deeper `source-s3` v3→v4 regression where the legacy `newlines_in_values: True` CSV option is silently dropped by `LegacyConfigTransformer`, which causes the CDK's `csv.DictReader` to mis-split rows with embedded newlines in quoted fields. That connector-side regression is a frequent cause of these `MISMATCHED_*` errors in the wild, but fixing it requires:

- adding a `newlines_in_values`-equivalent option to `CsvFormat` (or explicit handling in `LegacyConfigTransformer._transform_file_format`), and
- passing `newline=""` through `source-s3/v4/stream_reader.py::open_file` to `csv.DictReader`.

Those changes are connector-level, higher-risk, and should ship as a separate PR. This PR intentionally scopes to the CDK-level message quality improvement so the fix can land quickly for all file-based CSV sources.
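
To illustrate why a mis-split row shows up as a column mismatch, the sketch below parses the same CSV text twice. Disabling quote handling with `csv.QUOTE_NONE` is only a stand-in used here to force the mis-split locally; it is an assumption for demonstration and not necessarily the mechanism at work in `source-s3`. The helper name `parse` is hypothetical.

```python
import csv
import io


def parse(text: str, quoting: int) -> list[dict]:
    """Parse CSV text into a list of row dicts using the given quoting mode."""
    return list(csv.DictReader(io.StringIO(text), quoting=quoting))


# One logical record whose quoted field contains an embedded newline.
sample = 'id,note\n1,"hello\nworld"\n'

# Quotes honored: the embedded newline stays inside the field.
intact = parse(sample, csv.QUOTE_MINIMAL)

# Quotes ignored (stand-in for a mis-split): the record breaks into two rows,
# and the second row is short, so its missing column is filled with None.
broken = parse(sample, csv.QUOTE_NONE)

print(intact)
print(broken)
```

In the second parse the trailing fragment becomes its own row with a `None` value, which is the kind of row the mismatch errors report.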
Review & Testing Checklist for Human
- […] `default_file_based_stream.py`'s `errors_collector` with `filename=`/`lineno=` appended by `BaseFileBasedSourceError`).
- […] "resolved to `None`" strings. A quick repo / dashboard grep is recommended before merge.
- […] `source-s3` `newlines_in_values` regression described in the scope note above.

Notes
No tests referenced the old strings (verified via ripgrep for `header field has resolved`, `row's value has resolved`, `MISMATCHED_COLUMNS`, `MISMATCHED_ROWS`). Ruff lint and format both pass on the modified file.

Link to Devin session: https://app.devin.ai/sessions/cc54e092a81c48c5b492a605abafc1ea