
#7 fix #707 undefined keys in rows #709

Open
Nick Clarke (nickolasclarke) wants to merge 7 commits into airbytehq:main from nickolasclarke:nclarke/handle_empty_keys

Conversation

@nickolasclarke Nick Clarke (nickolasclarke) commented Jul 3, 2025

This should resolve #707 by handling record field keys that are undefined. Such keys are technically valid JSON, but undesirable. I'm not sure this is the desired approach, however, so I'll keep this in draft and hold off on tests until I hear more.

Summary by CodeRabbit

  • Bug Fixes
    • Improved handling of input data by detecting and ignoring empty string keys in records, with a warning logged when this occurs.

Comment thread airbyte/_util/name_normalizers.py Outdated
result = name
if not result:
# Use a short hash of the original name to avoid collisions.
uuid_suffix = uuid.uuid1().hex[:4]

Author

4 characters should be more than enough, but this can always be expanded.

Comment thread airbyte/_util/name_normalizers.py Outdated
- "-1" -> "_1"
"""
result = name
if not result:

Author

Aaron ("AJ") Steers (@aaronsteers) this should probably log a WARNING, but I don't immediately see the pattern for doing so in this repo.

Member

I think warn_once may be a good fit for this.

def warn_once(
    message: str,
    logger: logging.Logger | None = None,
    *,
    with_stack: int | bool,
) -> None:
    """Emit a warning message only once.

    This function is a wrapper around the `warnings.warn` function that logs the warning message
    to the global logger. The warning message is only emitted once per unique message.
    """

coderabbitai Bot commented Jul 3, 2025

Contributor
📝 Walkthrough

A check was added to the StreamRecord.__init__ method to detect and remove any empty string keys from input dictionaries. If such a key is found, a warning is logged once, and the empty key entry is ignored during record initialization. No changes were made to public interfaces.

Changes

File(s) | Change Summary
airbyte/records.py | Added logic in StreamRecord.__init__ to detect, warn, and remove empty string keys from input dictionaries before processing. No public API changes.

Assessment against linked issues

Objective | Addressed | Explanation
Prevent PyAirbyteNameNormalizationError by handling empty string keys in StreamRecord initialization (#707) | |

Assessment against linked issues: Out-of-scope changes

No out-of-scope changes were found.

Would you like me to help draft a brief test suggestion to verify this behavior, or is this summary good to go as is? Wdyt?


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 05c378d and 356ff07.

📒 Files selected for processing (1)
  • airbyte/records.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • airbyte/records.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Dependency Analysis with Deptry
  • GitHub Check: MyPy Check
  • GitHub Check: Pytest (No Creds)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: preview_docs

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6794987 and 8e792ca.

📒 Files selected for processing (1)
  • airbyte/_util/name_normalizers.py (2 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: aaronsteers
PR: airbytehq/PyAirbyte#343
File: tests/unit_tests/test_text_normalization.py:214-214
Timestamp: 2024-09-07T00:30:42.566Z
Learning: In the `test_lower_case_normalizer` function, the `False` flag in the test cases refers to whether the normalization should raise an exception (`should_raise`), not whether the normalization should occur.
🔇 Additional comments (1)
airbyte/_util/name_normalizers.py (1)

9-9: LGTM on the import addition!

The UUID import is necessary for the new functionality and is properly placed.

Comment thread airbyte/_util/name_normalizers.py Outdated
Comment on lines +72 to +75
if not result:
# Use a short hash of the original name to avoid collisions.
uuid_suffix = uuid.uuid1().hex[:4]
return f"undefined_key_{uuid_suffix}"

Contributor

🛠️ Refactor suggestion

Consider making this deterministic to preserve caching benefits - wdyt?

The current approach using uuid.uuid1() is non-deterministic, meaning the same empty input will generate different normalized names each time. This breaks the effectiveness of the @functools.cache decorator and could cause inconsistencies.

Would you consider a deterministic approach instead? Here are a couple of alternatives:

Option 1: Use a fixed key for all empty inputs

-        if not result:
-            # Use a short hash of the original name to avoid collisions.
-            uuid_suffix = uuid.uuid1().hex[:4]
-            return f"undefined_key_{uuid_suffix}"
+        if not result:
+            return "undefined_key"

Option 2: Use a hash-based approach for better determinism

-        if not result:
-            # Use a short hash of the original name to avoid collisions.
-            uuid_suffix = uuid.uuid1().hex[:4]
-            return f"undefined_key_{uuid_suffix}"
+        if not result:
+            # Use a hash of the original name for deterministic results
+            import hashlib
+            hash_suffix = hashlib.md5(str(result).encode()).hexdigest()[:4]
+            return f"undefined_key_{hash_suffix}"

The deterministic approach would ensure consistent results and make the caching more effective. What do you think about this approach?

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-        if not result:
-            # Use a short hash of the original name to avoid collisions.
-            uuid_suffix = uuid.uuid1().hex[:4]
-            return f"undefined_key_{uuid_suffix}"
+        if not result:
+            return "undefined_key"
🤖 Prompt for AI Agents
In airbyte/_util/name_normalizers.py around lines 72 to 75, the use of
uuid.uuid1() to generate a suffix for empty input keys is non-deterministic,
which breaks caching consistency. Replace this with a deterministic approach
such as using a fixed suffix for all empty inputs or generating a hash-based
suffix derived from the original input to ensure the same output for the same
input, thereby preserving the benefits of the @functools.cache decorator.

Author

I believe that cache busting here would be fine because we do not want to get key collisions.

@aaronsteers Aaron ("AJ") Steers (aaronsteers) Jul 4, 2025

Member

I think the challenge with uuid here is that re-running the sync will end up with a new column name on subsequent executions. Some other implementations which need to solve a similar problem apply an ordinal id or letter like _undefined_key_1, _undefined_key_2, etc.

Another option is to simply treat unnamed properties as ignored. Did you find the key node that was having a null/empty field name? Or is the root cause something different?

I'm not sure what the best path is here. The AI had a good instinct, but bad suggestions. For example: if "result" is an empty string, then hashing it (hashlib.md5(str(result).encode())) won't be helpful for uniqueness if other field names could also be blank/empty.
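
To make the two concerns above concrete, here is a small check using only the standard library. It is an illustration, not code from this PR: `uuid.uuid1()` yields a new value on every call, so a name derived from it would change between sync runs, while hashing an empty name always yields the same digest and so cannot distinguish one empty-named field from another.

```python
import hashlib
import uuid

# uuid1() incorporates the current timestamp, so successive calls differ
# and any column name derived from it is not stable across runs.
first = uuid.uuid1().hex
second = uuid.uuid1().hex
print(first != second)  # True

# The MD5 digest of an empty string is a fixed constant, so the suffix is
# stable across runs but identical for every empty field name.
empty_suffix = hashlib.md5(b"").hexdigest()[:4]
print(empty_suffix)  # d41d
```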

Author

Yes, the problematic data in question was JSONL results that contained an empty key with a value, "": "my_still_valuable_value", which is technically valid JSON. This was a bug on our side, but is still present in the historical records.

An ordinal ID is probably the best solution here if you want to maintain the KV pair, but it would require passing additional state into this function, which feels wrong to me as well. I think warning in the logs and skipping the row is probably best.
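
For reference, a sketch of what the ordinal approach might look like, and why it needs state that a pure, cached normalizer function does not have. `OrdinalKeyNamer` and `rename_empty_keys` are hypothetical names for illustration only and are not part of PyAirbyte.

```python
from __future__ import annotations

from collections.abc import Iterable


class OrdinalKeyNamer:
    """Hypothetical helper that assigns `_undefined_key_N` to empty names."""

    def __init__(self, prefix: str = "_undefined_key_") -> None:
        self._prefix = prefix
        self._count = 0  # the extra state a stateless normalizer cannot hold

    def name_for(self, key: str) -> str:
        """Return `key` unchanged, or the next ordinal name if it is empty."""
        if key:
            return key
        self._count += 1
        return f"{self._prefix}{self._count}"


def rename_empty_keys(keys: Iterable[str]) -> list[str]:
    """Rename empty keys in order of appearance using a fresh counter."""
    namer = OrdinalKeyNamer()
    return [namer.name_for(key) for key in keys]


print(rename_empty_keys(["id", "", "name", ""]))
# ['id', '_undefined_key_1', 'name', '_undefined_key_2']
```

The names are only stable across runs if the keys arrive in the same order each time, which is one more reason the counter has to live somewhere outside the normalizer.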

Comment thread airbyte/_util/name_normalizers.py Outdated
if not result:
# Use a short hash of the original name to avoid collisions.
uuid_suffix = uuid.uuid1().hex[:4]
return f"undefined_key_{uuid_suffix}"

Contributor

Considering this method is used to normalize names, this behavior feels very unnatural to me.

Rather than this implementation, what about skipping columns with an empty key completely?

from_dict: The dictionary to initialize the StreamRecord with.

Author

yes I think I agree now. See my reply here: https://github.com/airbytehq/PyAirbyte/pull/709/files#r2192886996

It seems that if we do want to skip, this has to be done before the StreamRecordHandler is instantiated on L215, correct, Aaron ("AJ") Steers (@aaronsteers)?

@nickolasclarke Nick Clarke (nickolasclarke) Jul 8, 2025

Author

Actually, I think we may be overthinking this. This edge case can only ever produce a single empty key, because a dict cannot hold multiple empty-string keys: repeated empty keys in the JSON collapse into one entry when it is parsed into a dict. So perhaps my original implementation, without the hash and with a warning, would be sufficient.

Otherwise I could add something like

        for k, v in from_dict.items():
            if not k:
                warn_once(
                    f"Empty key found in StreamRecord initialization with value: {v}. Ignoring.",
                    with_stack=False,
                )
                from_dict = {k: v for k, v in from_dict.items() if k}
                break

at L215

I'm happy with either implementation.
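
A quick check with the standard `json` module, as a sketch of how the data described in this thread behaves when parsed. The sample payloads are made up for illustration. An empty name parses without error, and a repeated name keeps only the last value by default, so at most one empty key reaches the record.

```python
import json

# An empty string is a legal JSON object name, so this parses without error.
record = json.loads('{"": "my_still_valuable_value", "id": 1}')
print(record)  # {'': 'my_still_valuable_value', 'id': 1}

# With a repeated name, json.loads keeps only the last value by default.
repeated = json.loads('{"": "first", "": "second"}')
print(repeated)  # {'': 'second'}

# The filtering step from the snippet above, written as one comprehension.
cleaned = {key: value for key, value in record.items() if key}
print(cleaned)  # {'id': 1}
```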

@yohannj Yohann Jardin (yohannj) Jul 8, 2025

Contributor

Yes, something like that.
It could also be a variable like invalid_keys = [''] that we use to remove some keys.
It could also be:

if '' in from_dict:
  warn_once("Empty key found in StreamRecord. Ignoring.", with_stack=False)
  del from_dict['']
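
One way the `invalid_keys` idea could be written without mutating the caller's dictionary is sketched below. `INVALID_KEYS` and `drop_invalid_keys` are hypothetical names, and `warnings.warn` stands in for the repository's `warn_once` helper so the example runs on its own.

```python
from __future__ import annotations

import warnings
from typing import Any

# Hypothetical set of key values to drop; only "" is discussed in this thread.
INVALID_KEYS: frozenset[str] = frozenset({""})


def drop_invalid_keys(from_dict: dict[str, Any]) -> dict[str, Any]:
    """Return `from_dict` without invalid keys; the input is never mutated."""
    found = INVALID_KEYS.intersection(from_dict)
    if not found:
        return from_dict
    # Stand-in for warn_once, which would deduplicate repeated messages.
    warnings.warn(f"Ignoring invalid keys in record: {sorted(found)!r}", stacklevel=2)
    return {key: value for key, value in from_dict.items() if key not in INVALID_KEYS}
```

Building a new dictionary avoids the side effect of `del from_dict['']` on an object the caller may still be using, at the cost of one extra copy for affected records.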

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8e792ca and 756963f.

📒 Files selected for processing (2)
  • airbyte/_util/name_normalizers.py (1 hunks)
  • airbyte/records.py (2 hunks)
✅ Files skipped from review due to trivial changes (1)
  • airbyte/_util/name_normalizers.py
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: aaronsteers
PR: airbytehq/PyAirbyte#369
File: airbyte/_connector_base.py:0-0
Timestamp: 2024-09-17T21:18:12.530Z
Learning: In this codebase, `message.record.stream` is a required property enforced by schema, so it will not be `None`.
Learnt from: aaronsteers
PR: airbytehq/PyAirbyte#369
File: airbyte/_connector_base.py:0-0
Timestamp: 2024-10-08T15:34:31.026Z
Learning: In this codebase, `message.record.stream` is a required property enforced by schema, so it will not be `None`.
🧬 Code Graph Analysis (1)
airbyte/records.py (1)
airbyte/logs.py (1)
  • warn_once (48-76)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Dependency Analysis with Deptry
  • GitHub Check: Pytest (No Creds)
  • GitHub Check: MyPy Check
  • GitHub Check: preview_docs
  • GitHub Check: Pytest (Fast)
🔇 Additional comments (1)
airbyte/records.py (1)

84-84: Import addition looks good!

The import of warn_once is appropriate for the new functionality being added.

Comment thread airbyte/records.py Outdated
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
@nickolasclarke

Author

Aaron ("AJ") Steers (@aaronsteers) there do not seem to be unit tests against the StreamRecord class directly. I could put a unit test in the normalization testers, but that feels out of place. Let me know where or if I should put tests in for this, and I'll do so.

@nickolasclarke Nick Clarke (nickolasclarke) marked this pull request as ready for review July 10, 2025 06:39
@nickolasclarke Nick Clarke (nickolasclarke) changed the title #7 Handle undefined keys in rows #7 fix #707 undefined keys in rows Jul 10, 2025
@devin-ai-integration devin-ai-integration Bot added the community PRs from community contributors label Dec 10, 2025

Labels

community PRs from community contributors

Development

Successfully merging this pull request may close these issues.

source-mixpanel raises PyAirbyteNameNormalizationError

3 participants