#7 fix #707 undefined keys in rows#709
Conversation
| result = name | ||
| if not result: | ||
| # Use a short hash of the original name to avoid collisions. | ||
| uuid_suffix = uuid.uuid1().hex[:4] |
There was a problem hiding this comment.
4 characters should be more than enough, but this can always be expanded.
| - "-1" -> "_1" | ||
| """ | ||
| result = name | ||
| if not result: |
There was a problem hiding this comment.
Aaron ("AJ") Steers (@aaronsteers) this should probably log out a WARNING, but I dont immediately see the pattern for doing so in this repo.
There was a problem hiding this comment.
I think warn_once may be a good fit for this.
Lines 48 to 58 in 6794987
📝 Walkthrough""" WalkthroughA check was added to the Changes
Assessment against linked issues
Assessment against linked issues: Out-of-scope changesNo out-of-scope changes were found. Would you like me to help draft a brief test suggestion to verify this behavior, or is this summary good to go as is? Wdyt? 📜 Recent review detailsConfiguration used: CodeRabbit UI 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
✨ Finishing Touches
🧪 Generate unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
airbyte/_util/name_normalizers.py(2 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: aaronsteers
PR: airbytehq/PyAirbyte#343
File: tests/unit_tests/test_text_normalization.py:214-214
Timestamp: 2024-09-07T00:30:42.566Z
Learning: In the `test_lower_case_normalizer` function, the `False` flag in the test cases refers to whether the normalization should raise an exception (`should_raise`), not whether the normalization should occur.
airbyte/_util/name_normalizers.py (1)
Learnt from: aaronsteers
PR: airbytehq/PyAirbyte#343
File: tests/unit_tests/test_text_normalization.py:214-214
Timestamp: 2024-09-07T00:30:42.566Z
Learning: In the `test_lower_case_normalizer` function, the `False` flag in the test cases refers to whether the normalization should raise an exception (`should_raise`), not whether the normalization should occur.
🔇 Additional comments (1)
airbyte/_util/name_normalizers.py (1)
9-9: LGTM on the import addition!The UUID import is necessary for the new functionality and is properly placed.
| if not result: | ||
| # Use a short hash of the original name to avoid collisions. | ||
| uuid_suffix = uuid.uuid1().hex[:4] | ||
| return f"undefined_key_{uuid_suffix}" |
There was a problem hiding this comment.
🛠️ Refactor suggestion
Consider making this deterministic to preserve caching benefits - wdyt?
The current approach using uuid.uuid1() is non-deterministic, meaning the same empty input will generate different normalized names each time. This breaks the effectiveness of the @functools.cache decorator and could cause inconsistencies.
Would you consider a deterministic approach instead? Here are a couple of alternatives:
Option 1: Use a fixed key for all empty inputs
- if not result:
- # Use a short hash of the original name to avoid collisions.
- uuid_suffix = uuid.uuid1().hex[:4]
- return f"undefined_key_{uuid_suffix}"
+ if not result:
+ return "undefined_key"Option 2: Use a hash-based approach for better determinism
- if not result:
- # Use a short hash of the original name to avoid collisions.
- uuid_suffix = uuid.uuid1().hex[:4]
- return f"undefined_key_{uuid_suffix}"
+ if not result:
+ # Use a hash of the original name for deterministic results
+ import hashlib
+ hash_suffix = hashlib.md5(str(result).encode()).hexdigest()[:4]
+ return f"undefined_key_{hash_suffix}"The deterministic approach would ensure consistent results and make the caching more effective. What do you think about this approach?
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if not result: | |
| # Use a short hash of the original name to avoid collisions. | |
| uuid_suffix = uuid.uuid1().hex[:4] | |
| return f"undefined_key_{uuid_suffix}" | |
| if not result: | |
| return "undefined_key" |
🤖 Prompt for AI Agents
In airbyte/_util/name_normalizers.py around lines 72 to 75, the use of
uuid.uuid1() to generate a suffix for empty input keys is non-deterministic,
which breaks caching consistency. Replace this with a deterministic approach
such as using a fixed suffix for all empty inputs or generating a hash-based
suffix derived from the original input to ensure the same output for the same
input, thereby preserving the benefits of the @functools.cache decorator.
There was a problem hiding this comment.
I believe that cache busting here would be fine because we do not want to get key collisions.
There was a problem hiding this comment.
I think the challenge with uuid here is that re-running the sync will end up with a new column name on subsequent executions. Some other implementations which need to solve a similar problem apply an ordinal id or letter like _undefined_key_1, _undefined_key_2, etc.
Another option is to simply treat unnamed properties as ignored. Did you find the key node that was having a null/empty field name? Or is the root cause something different?
I'm not sure what the best path is here. The AI had a good instinct, but bad suggestions. For example: if "result" is an empty string, then hashing it (hashlib.md5(str(result).encode())) won't be helpful for uniqueness if other field names could also be blank/empty.
There was a problem hiding this comment.
Yes, the problematic data in question, was JSONL results that contained an empty key with a value "":my_still_valuable_value", which is technically valid JSON. This was a bug on our side, but is still present in the historical records.
An ordinal ID is probably the best solution here if you want to maintain the KV pair, but it would require passing additional state into this function, which feels wrong to me as well. I think warning in the logs skipping the row is probably best.
| if not result: | ||
| # Use a short hash of the original name to avoid collisions. | ||
| uuid_suffix = uuid.uuid1().hex[:4] | ||
| return f"undefined_key_{uuid_suffix}" |
There was a problem hiding this comment.
Considering this method is used to normalize names, this behavior feels very unnatural to me.
Rather than this implementation, what about skipping columns with an empty key completely?
Line 209 in 6794987
There was a problem hiding this comment.
yes I think I agree now. See my reply here: https://github.com/airbytehq/PyAirbyte/pull/709/files#r2192886996
It seems that if we do want to skip, this has to be done before the StreamRecordHandler is instantiated on L215 correct Aaron ("AJ") Steers (@aaronsteers)?
There was a problem hiding this comment.
Actually, I think we may be overthinking this. This edge case will only occur when there is a single falsy key, you cannot have a dict with multiple falsy keys. So perhaps my original implementation without the hash and throwing a warning would be sufficient. Python would throw when casting the json to a dict.
Otherwise I could add something like
for k, v in from_dict.items():
if not k:
warn_once(
f"Empty key found in StreamRecord initialization with value: {v}. Ignoring.",
)
from_dict = {k: v for k, v in from_dict.items() if k}
breakat L215
I'm happy with either implementation.
There was a problem hiding this comment.
Yes, something like that.
It could also be a variable like invalid_keys = [''] that we use to remove some keys.
It could also be:
if '' in from_dict:
warn_once(f"Empty key found in StreamRecord. Ignoring.")
del from_dict['']There was a problem hiding this comment.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
airbyte/_util/name_normalizers.py(1 hunks)airbyte/records.py(2 hunks)
✅ Files skipped from review due to trivial changes (1)
- airbyte/_util/name_normalizers.py
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: aaronsteers
PR: airbytehq/PyAirbyte#369
File: airbyte/_connector_base.py:0-0
Timestamp: 2024-09-17T21:18:12.530Z
Learning: In this codebase, `message.record.stream` is a required property enforced by schema, so it will not be `None`.
Learnt from: aaronsteers
PR: airbytehq/PyAirbyte#369
File: airbyte/_connector_base.py:0-0
Timestamp: 2024-10-08T15:34:31.026Z
Learning: In this codebase, `message.record.stream` is a required property enforced by schema, so it will not be `None`.
airbyte/records.py (2)
Learnt from: aaronsteers
PR: airbytehq/PyAirbyte#369
File: airbyte/_connector_base.py:0-0
Timestamp: 2024-09-17T21:18:12.530Z
Learning: In this codebase, `message.record.stream` is a required property enforced by schema, so it will not be `None`.
Learnt from: aaronsteers
PR: airbytehq/PyAirbyte#369
File: airbyte/_connector_base.py:0-0
Timestamp: 2024-10-08T15:34:31.026Z
Learning: In this codebase, `message.record.stream` is a required property enforced by schema, so it will not be `None`.
🧬 Code Graph Analysis (1)
airbyte/records.py (1)
airbyte/logs.py (1)
warn_once(48-76)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: Dependency Analysis with Deptry
- GitHub Check: Pytest (No Creds)
- GitHub Check: MyPy Check
- GitHub Check: preview_docs
- GitHub Check: Pytest (Fast)
🔇 Additional comments (1)
airbyte/records.py (1)
84-84: Import addition looks good!The import of
warn_onceis appropriate for the new functionality being added.
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
|
Aaron ("AJ") Steers (@aaronsteers) there does not seem to be unit tests against the |
This should resolve #707 by handling record field keys that are undefined. This is technically valid json, but undesirable. Not sure if this is the desired approach, however, so I'll keep this in draft and hold off on tests until I hear more.
Summary by CodeRabbit