feat(sinks): add new databricks_zerobus for Databricks ingestion#24840
flaviofcruz wants to merge 16 commits into vectordotdev:master
Thanks @flaviofcruz for this new integration! Apologies for the slow review on this one. @codex review |
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 436d0da4bd
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
All contributors have signed the CLA ✍️ ✅ |
💡 Codex Review
Reviewed commit: 9dcb5d1e71
FYI I am waiting for @hsuanyi to sign the CLA (see comment) before reviewing this further. Also, there is an unresolved review comment.
@flaviofcruz in case you missed the above, we will require all profiles who contributed to this PR to sign the CLA. Happy to review once that is done. |
I have read the CLA Document and I hereby sign the CLA |
recheck |
💡 Codex Review
Reviewed commit: 050409defd
@pront I really appreciate your work on the review. However, I was looking at the zerobus SDK license, and it could be problematic: https://github.com/databricks/zerobus-sdk/blob/main/LICENSE Do you know if this could be a blocker? Let me know if it is.
💡 Codex Review
Reviewed commit: 36a74ef530
💡 Codex Review
Reviewed commit: 9d7b620bfb
…ck up the environment variables
…f no URL is set up
Thanks, I have rebased and force pushed. I will follow up with the VRL fixes which should address codex comment above. Will also follow up with arrow batch support once I get the greenlight from the zerobus team. |
Hi @flaviofcruz, there are a few open threads and it's unclear if they are resolved or not. Ping me when ready for a final review. |
Threads should be closed now I think. |
pront left a comment
Left a few review comments from the local pass.
    }

    /// Encode a batch of events into a `BatchOutput`.
    pub fn encode_batch(&self, events: &[Event]) -> Result<BatchOutput, Error> {
Just noting down here that the 10MB limit enforced in config.rs might not hold in all cases. For example, a numeric-heavy schema right at the 10MB boundary could realistically encode over 10MB and fail at the SDK call.
Should we just add a comment to the sink documentation? Or do you have another proposal?
A short mention in the docs is a good idea. But
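One way to make the concern in this thread concrete is to measure the encoded bytes rather than rely on the configured batch size. The following is only a minimal sketch under stated assumptions: the names `split_by_encoded_size` and `MAX_PAYLOAD_BYTES` are hypothetical, rows are assumed to be already encoded as byte vectors, and this is not the sink's actual implementation.

```rust
/// Hypothetical limit, for illustration only; the real limit is enforced
/// by the SDK and the service.
pub const MAX_PAYLOAD_BYTES: usize = 10 * 1024 * 1024;

/// Split already-encoded rows into chunks whose total encoded size stays
/// within `limit` bytes. A single row larger than `limit` is reported as an
/// error instead of being sent.
pub fn split_by_encoded_size(
    rows: Vec<Vec<u8>>,
    limit: usize,
) -> Result<Vec<Vec<Vec<u8>>>, String> {
    let mut chunks = Vec::new();
    let mut current: Vec<Vec<u8>> = Vec::new();
    let mut current_size = 0usize;

    for row in rows {
        if row.len() > limit {
            return Err(format!(
                "row of {} bytes exceeds limit of {} bytes",
                row.len(),
                limit
            ));
        }
        // Close the current chunk before this row would push it over the limit.
        if current_size + row.len() > limit && !current.is_empty() {
            chunks.push(std::mem::take(&mut current));
            current_size = 0;
        }
        current_size += row.len();
        current.push(row);
    }
    if !current.is_empty() {
        chunks.push(current);
    }
    Ok(chunks)
}
```

Calling `split_by_encoded_size(rows, MAX_PAYLOAD_BYTES)` after encoding would bound each request by its real wire size, at the cost of an extra pass over the encoded rows.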
💡 Codex Review
Reviewed commit: d564d98f70
💡 Codex Review
Reviewed commit: 141f680972
    When disabled (the default), events are marked as delivered as soon as the
    ingestion call completes without error, without waiting for an explicit
    server acknowledgement.
Correct ack-disabled behavior description
This paragraph says that with acknowledgements disabled the sink marks events delivered without waiting for server acknowledgement, but the implementation always waits on wait_for_offset after every ingest and even errors on missing offsets (MissingAckOffset) regardless of sink ack config (src/sinks/databricks_zerobus/service.rs:390-393). Users who disable acknowledgements to avoid server-ack latency/failures will get different runtime behavior than documented, so this guidance should be updated to match the actual sink semantics.
Summary
Databricks provides a Zerobus ingest connector [1], a push based API that writes data directly into Unity Catalog Delta tables. This PR introduces a new vector sink that integrates with Databricks, allowing Vector to push data into Databricks. We use the Databricks provided SDK to implement the sink [2].
Zerobus supports row-level ingestion, and that's what we do here. Zerobus also has Arrow batch ingestion in experimental mode, but we didn't add support for it. We will switch from row-level ingestion to Arrow batch once it becomes stable, and that will be the future default.
For row-based ingestion, we extended the BatchSerializerConfig to support a batch serializer that creates vectors of protocol buffer bytes. This makes it the second option for batch serialization, alongside Arrow batch.
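To illustrate what "vectors of protocol buffer bytes" means here: each event becomes its own protobuf-encoded byte buffer, and the batch is the list of those buffers. The sketch below hand-encodes a message with a single string field (field number 1) following the protobuf wire format. The function names are hypothetical, and a real serializer would presumably derive the message layout from the table schema rather than hard-code one field.

```rust
/// Append `value` as a protobuf base-128 varint.
fn put_varint(mut value: u64, out: &mut Vec<u8>) {
    while value >= 0x80 {
        // Low 7 bits plus the continuation bit.
        out.push((value as u8 & 0x7f) | 0x80);
        value >>= 7;
    }
    out.push(value as u8);
}

/// Encode one row as a message with a single string field (field number 1).
pub fn encode_row(message: &str) -> Vec<u8> {
    let mut out = Vec::with_capacity(message.len() + 2);
    // Tag = (field_number << 3) | wire_type; wire type 2 is length-delimited.
    put_varint((1 << 3) | 2, &mut out);
    put_varint(message.len() as u64, &mut out);
    out.extend_from_slice(message.as_bytes());
    out
}

/// Encode a batch: one independent byte buffer per row.
pub fn encode_rows(messages: &[&str]) -> Vec<Vec<u8>> {
    messages.iter().map(|m| encode_row(m)).collect()
}
```

Keeping one buffer per row, rather than one concatenated buffer, lets the caller hand rows to a row-level ingestion API individually.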
Users do not have to specify the schema at all: we fetch it for them from Unity Catalog and then use it with the API. If users want to make schema changes, they should update their table as needed. We don't have much support for dynamic schema changes at the moment.
Vector configuration
How did you test this PR?
Unit tests, running small toy examples and using it in production for actual traffic.
Change Type
Is this a breaking change?
Does this PR include user facing changes?
References
[1] https://docs.databricks.com/aws/en/ingestion/zerobus-overview
[2] https://github.com/databricks/zerobus-sdk