#7 fix #707 undefined keys in rows by nickolasclarke · Pull Request #709 · airbytehq/PyAirbyte

Nick Clarke (nickolasclarke) · 2025-07-03T21:15:19Z

This should resolve #707 by handling record field keys that are undefined. This is technically valid JSON, but undesirable. I'm not sure this is the desired approach, however, so I'll keep this in draft and hold off on tests until I hear more.

Summary by CodeRabbit

Bug Fixes
- Improved handling of input data by detecting and ignoring empty string keys in records, with a warning logged when this occurs.

Nick Clarke (nickolasclarke) · 2025-07-03T21:15:54Z

        result = name
+        if not result:
+            # Use a short hash of the original name to avoid collisions.
+            uuid_suffix = uuid.uuid1().hex[:4]


4 characters should be more than enough, but this can always be expanded.

Nick Clarke (nickolasclarke) · 2025-07-03T21:16:19Z

        - "-1" -> "_1"
        """
        result = name
+        if not result:


Aaron ("AJ") Steers (@aaronsteers) this should probably log a WARNING, but I don't immediately see the pattern for doing so in this repo.

I think warn_once may be a good fit for this.

PyAirbyte/airbyte/logs.py

Lines 48 to 58 in 6794987

    def warn_once(
        message: str,
        logger: logging.Logger | None = None,
        *,
        with_stack: int | bool,
    ) -> None:
        """Emit a warning message only once.

        This function is a wrapper around the `warnings.warn` function that logs the warning message
        to the global logger. The warning message is only emitted once per unique message.
        """

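For readers unfamiliar with the pattern, the once-per-message behaviour described in that docstring can be sketched roughly as follows. This is a minimal stand-in, not PyAirbyte's implementation: `warn_once_sketch` and the module-level `_seen_messages` set are hypothetical names, and the real `warn_once` also takes a required keyword argument `with_stack` that is left out here.

```python
import logging
from typing import Optional

# Hypothetical registry of messages that have already been emitted.
_seen_messages = set()


def warn_once_sketch(message: str, logger: Optional[logging.Logger] = None) -> bool:
    """Log `message` at WARNING level the first time it is seen.

    Returns True if the message was emitted, False if it was suppressed.
    """
    if message in _seen_messages:
        return False
    _seen_messages.add(message)
    (logger or logging.getLogger(__name__)).warning(message)
    return True
```

Keying the registry on the full message text means that a message which interpolates a per-record value is treated as new for every distinct value.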
coderabbitai · 2025-07-03T21:18:57Z

📝 Walkthrough

A check was added to the StreamRecord.__init__ method to detect and remove any empty string keys from input dictionaries. If such a key is found, a warning is logged once, and the empty key entry is ignored during record initialization. No changes were made to public interfaces.

Changes

| File(s) | Change Summary |
|---|---|
| airbyte/records.py | Added logic in `StreamRecord.__init__` to detect, warn, and remove empty string keys from input dictionaries before processing. No public API changes. |

Assessment against linked issues

| Objective | Addressed | Explanation |
|---|---|---|
| Prevent PyAirbyteNameNormalizationError by handling empty string keys in StreamRecord initialization (#707) | ✅ | |

Assessment against linked issues: Out-of-scope changes

No out-of-scope changes were found.

Would you like me to help draft a brief test suggestion to verify this behavior, or is this summary good to go as is? Wdyt?



coderabbitai

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6794987 and 8e792ca.

📒 Files selected for processing (1)

airbyte/_util/name_normalizers.py (2 hunks)

🧰 Additional context used

🧠 Learnings (2)

📓 Common learnings

Learnt from: aaronsteers
PR: airbytehq/PyAirbyte#343
File: tests/unit_tests/test_text_normalization.py:214-214
Timestamp: 2024-09-07T00:30:42.566Z
Learning: In the `test_lower_case_normalizer` function, the `False` flag in the test cases refers to whether the normalization should raise an exception (`should_raise`), not whether the normalization should occur.

🔇 Additional comments (1)

airbyte/_util/name_normalizers.py (1)

9-9: LGTM on the import addition!

The UUID import is necessary for the new functionality and is properly placed.

coderabbitai · 2025-07-03T21:19:00Z

+        if not result:
+            # Use a short hash of the original name to avoid collisions.
+            uuid_suffix = uuid.uuid1().hex[:4]
+            return f"undefined_key_{uuid_suffix}"


🛠️ Refactor suggestion

Consider making this deterministic to preserve caching benefits - wdyt?

The current approach using uuid.uuid1() is non-deterministic, meaning the same empty input will generate different normalized names each time. This breaks the effectiveness of the @functools.cache decorator and could cause inconsistencies.

Would you consider a deterministic approach instead? Here are a couple of alternatives:

Option 1: Use a fixed key for all empty inputs

-        if not result:
-            # Use a short hash of the original name to avoid collisions.
-            uuid_suffix = uuid.uuid1().hex[:4]
-            return f"undefined_key_{uuid_suffix}"
+        if not result:
+            return "undefined_key"

Option 2: Use a hash-based approach for better determinism

-        if not result:
-            # Use a short hash of the original name to avoid collisions.
-            uuid_suffix = uuid.uuid1().hex[:4]
-            return f"undefined_key_{uuid_suffix}"
+        if not result:
+            # Use a hash of the original name for deterministic results
+            import hashlib
+            hash_suffix = hashlib.md5(str(result).encode()).hexdigest()[:4]
+            return f"undefined_key_{hash_suffix}"

The deterministic approach would ensure consistent results and make the caching more effective. What do you think about this approach?

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-        if not result:
-            # Use a short hash of the original name to avoid collisions.
-            uuid_suffix = uuid.uuid1().hex[:4]
-            return f"undefined_key_{uuid_suffix}"
+        if not result:
+            return "undefined_key"

🤖 Prompt for AI Agents

In airbyte/_util/name_normalizers.py around lines 72 to 75, the use of uuid.uuid1() to generate a suffix for empty input keys is non-deterministic, which breaks caching consistency. Replace this with a deterministic approach such as using a fixed suffix for all empty inputs or generating a hash-based suffix derived from the original input to ensure the same output for the same input, thereby preserving the benefits of the @functools.cache decorator.
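To make the caching concern concrete, here is a small sketch of how a cached normalizer behaves when its fallback is non-deterministic. `normalize_sketch` is a hypothetical stand-in, not the function in `name_normalizers.py`. Within one process, `functools.cache` pins whatever value the first call produced, so repeated calls agree; nothing guarantees the same value in a later process or after the cache is cleared.

```python
import functools
import uuid


@functools.cache
def normalize_sketch(name: str) -> str:
    """Hypothetical normalizer using the UUID fallback from the diff."""
    if not name:
        return f"undefined_key_{uuid.uuid1().hex[:4]}"
    return name.lower()


first = normalize_sketch("")
second = normalize_sketch("")  # served from the cache, so identical to `first`
```

In other words, the cache hides the non-determinism inside a single run, while the generated name can still change between runs.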

I believe that cache busting here would be fine because we do not want to get key collisions.

I think the challenge with uuid here is that re-running the sync will end up with a new column name on subsequent executions. Some other implementations which need to solve a similar problem apply an ordinal id or letter like _undefined_key_1, _undefined_key_2, etc.

Another option is to simply treat unnamed properties as ignored. Did you find the key node that was having a null/empty field name? Or is the root cause something different?

I'm not sure what the best path is here. The AI had a good instinct, but bad suggestions. For example: if "result" is an empty string, then hashing it (hashlib.md5(str(result).encode())) won't be helpful for uniqueness if other field names could also be blank/empty.
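The point about hashing can be checked directly with the standard library: the digest of an empty string is a constant, so it cannot tell one blank field name from another.

```python
import hashlib

# Every empty name hashes to the same digest, so the suffix never varies.
suffix_a = hashlib.md5("".encode()).hexdigest()[:4]
suffix_b = hashlib.md5("".encode()).hexdigest()[:4]
print(suffix_a, suffix_b)
```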

Yes, the problematic data in question was JSONL results that contained an empty key with a value, `"": "my_still_valuable_value"`, which is technically valid JSON. This was a bug on our side, but it is still present in the historical records.

An ordinal ID is probably the best solution here if you want to maintain the KV pair, but it would require passing additional state into this function, which feels wrong to me as well. I think logging a warning and skipping the row is probably best.
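A rough sketch of the ordinal approach, written as a hypothetical helper (`rename_blank_keys` is not part of PyAirbyte). It assumes the counter can live inside a single call that sees the whole record, which sidesteps rather than solves the state-passing concern for a per-name normalizer.

```python
from typing import Any


def rename_blank_keys(record: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of `record` with blank keys renamed to `_undefined_key_N`."""
    renamed: dict[str, Any] = {}
    counter = 0
    for key, value in record.items():
        if not key.strip():
            # Number blank keys in the order they appear in the record.
            counter += 1
            key = f"_undefined_key_{counter}"
        renamed[key] = value
    return renamed
```

Because numbering follows key order within each record, the generated names stay stable across runs only if the source emits keys in a consistent order. The sketch also does not guard against a record that already contains a key named `_undefined_key_1`.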

Yohann Jardin (yohannj) · 2025-07-07T08:33:58Z

+        if not result:
+            # Use a short hash of the original name to avoid collisions.
+            uuid_suffix = uuid.uuid1().hex[:4]
+            return f"undefined_key_{uuid_suffix}"


Considering this method is used to normalize names, this behavior feels very unnatural to me.

Rather than this implementation, what about skipping columns with an empty key completely?

PyAirbyte/airbyte/records.py

Line 209 in 6794987

from_dict: The dictionary to initialize the StreamRecord with.

yes I think I agree now. See my reply here: https://github.com/airbytehq/PyAirbyte/pull/709/files#r2192886996

It seems that if we do want to skip, this has to be done before the StreamRecordHandler is instantiated on L215, correct, Aaron ("AJ") Steers (@aaronsteers)?

Actually, I think we may be overthinking this. This edge case will only occur when there is a single falsy key, you cannot have a dict with multiple falsy keys. So perhaps my original implementation without the hash and throwing a warning would be sufficient. Python would throw when casting the json to a dict.

Otherwise I could add something like

    for k, v in from_dict.items():
        if not k:
            warn_once(
                f"Empty key found in StreamRecord initialization with value: {v}. Ignoring.",
            )
            from_dict = {k: v for k, v in from_dict.items() if k}
            break

at L215

I'm happy with either implementation.
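As a quick sanity check of the reasoning above (not part of the PR): JSON object keys are always strings, so `""` is the only falsy key a parsed record can contain. One caveat is that Python's `json.loads` does not raise on duplicate keys; it keeps the last value.

```python
import json

# A record with an empty key parses without error.
record = json.loads('{"": "my_still_valuable_value", "id": 1}')
falsy_keys = [key for key in record if not key]

# Duplicate keys are accepted silently; the last occurrence wins.
duplicate = json.loads('{"": "first", "": "second"}')
```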

Yes, something like that.
It could also be a variable like invalid_keys = [''] that we use to remove some keys.
It could also be:

    if '' in from_dict:
        warn_once(f"Empty key found in StreamRecord. Ignoring.")
        del from_dict['']
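Both variants amount to filtering the empty key out before the record is processed. A self-contained sketch of that idea follows; `drop_empty_keys` is a hypothetical helper, and it uses the standard `logging` module in place of PyAirbyte's `warn_once`. Unlike `del from_dict['']`, it returns a new dict and leaves the caller's input untouched.

```python
import logging
from typing import Any

logger = logging.getLogger(__name__)


def drop_empty_keys(from_dict: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of `from_dict` without the empty-string key, if present."""
    if "" not in from_dict:
        return dict(from_dict)
    logger.warning("Empty key found in record. Ignoring.")
    return {key: value for key, value in from_dict.items() if key != ""}
```

Whether to copy or mutate in place is a judgment call; copying avoids surprising a caller that reuses the input dict, at the cost of one extra allocation per record.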

coderabbitai

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8e792ca and 756963f.

📒 Files selected for processing (2)

airbyte/_util/name_normalizers.py (1 hunks)
airbyte/records.py (2 hunks)

✅ Files skipped from review due to trivial changes (1)

airbyte/_util/name_normalizers.py

🧰 Additional context used

🧠 Learnings (2)

📓 Common learnings

Learnt from: aaronsteers
PR: airbytehq/PyAirbyte#369
File: airbyte/_connector_base.py:0-0
Timestamp: 2024-09-17T21:18:12.530Z
Learning: In this codebase, `message.record.stream` is a required property enforced by schema, so it will not be `None`.

Learnt from: aaronsteers
PR: airbytehq/PyAirbyte#369
File: airbyte/_connector_base.py:0-0
Timestamp: 2024-10-08T15:34:31.026Z
Learning: In this codebase, `message.record.stream` is a required property enforced by schema, so it will not be `None`.

🧬 Code Graph Analysis (1)

airbyte/records.py (1)

airbyte/logs.py (1)

warn_once (48-76)

🔇 Additional comments (1)

airbyte/records.py (1)

84-84: Import addition looks good!

The import of warn_once is appropriate for the new functionality being added.

Nick Clarke (nickolasclarke) · 2025-07-10T06:39:26Z

Aaron ("AJ") Steers (@aaronsteers) there do not seem to be unit tests against the StreamRecord class directly. I could put a unit test in the normalization testers, but that feels out of place. Let me know where or if I should put tests in for this, and I'll do so.

handle undefined keys

8e792ca

Nick Clarke (nickolasclarke) commented Jul 3, 2025

coderabbitai Bot requested changes Jul 3, 2025

Yohann Jardin (yohannj) reviewed Jul 7, 2025

Nick Clarke (nickolasclarke) and others added 3 commits July 8, 2025 09:46

Merge branch 'airbytehq:main' into nclarke/handle_empty_keys

9a69456

simply delete the entry if the key is an emtpy string, instead.

756963f

add back in whitespace

92c804f

coderabbitai Bot requested changes Jul 10, 2025

Comment thread airbyte/records.py Outdated

Nick Clarke (nickolasclarke) and others added 3 commits July 9, 2025 23:18

remove uuid

30a5d09

Apply suggestions from code review

05c378d

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

simplify to pass ruff

356ff07

Nick Clarke (nickolasclarke) marked this pull request as ready for review July 10, 2025 06:39

Nick Clarke (nickolasclarke) changed the title ~~#7 Handle undefined keys in rows~~ #7 fix #707 undefined keys in rows Jul 10, 2025

devin-ai-integration Bot added the community PRs from community contributors label Dec 10, 2025
