Skip to content

fix(databricks_zerobus sink): defer Unity Catalog schema fetch out of build()#25408

Open
flaviofcruz wants to merge 8 commits into
vectordotdev:masterfrom
flaviofcruz:zerobus-defer-descriptor
Open

fix(databricks_zerobus sink): defer Unity Catalog schema fetch out of build()#25408
flaviofcruz wants to merge 8 commits into
vectordotdev:masterfrom
flaviofcruz:zerobus-defer-descriptor

Conversation

@flaviofcruz
Copy link
Copy Markdown
Contributor

@flaviofcruz flaviofcruz commented May 11, 2026

Summary

databricks_zerobus's SinkConfig::build() synchronously calls Unity Catalog to fetch the table's protobuf descriptor. If the table doesn't exist or credentials are wrong, the call fails inside build() and Vector exits before the sink even starts — even when healthcheck.enabled = false.

This aligns the sink with the convention used by AWS S3, Kafka, etc.: build() does only local setup, the healthcheck does the remote probe, and runtime failures surface per-batch via the existing retry/event-status path.

Changes:

  • UC descriptor + encoder + stream-mode are resolved lazily via tokio::sync::OnceCell on first use (and re-attempted on each failure).
  • Event encoding moves from ZerobusSink into ZerobusService::encode_batch, since the encoder now depends on the lazily-resolved descriptor.
  • ZerobusService::new only builds the SDK client and the HTTP client (both local).
  • ensure_stream (the healthcheck) is the natural gate: it triggers schema resolution + stream creation, and runs the same way on first ingest if the healthcheck is disabled.
  • Replaced the eager-ProxyConfig-stash with an HttpClient built once in new, so fetch_table_schema takes &HttpClient and doesn't reconstruct it per retry.
  • Added logic to decide whether schema fetch errors should be treated as retryable or not. If UC is transiently failing, we will try again during ingestion. Potentially this opens the door to dynamically re-fetch the schema as time goes on.

Vector configuration

  sinks:
    zb:
      type: databricks_zerobus
      inputs: [demo]
      ingestion_endpoint: "https://ingest.dev.databricks.com"
      unity_catalog_endpoint: "https://workspace.cloud.databricks.com"
      table_name: "main.default.does_not_exist"
      auth:
        strategy: oauth
        client_id: "${DATABRICKS_CLIENT_ID}"
        client_secret: "${DATABRICKS_CLIENT_SECRET}"
      healthcheck:
        enabled: false

Before: Vector exits at startup with Unity Catalog API returned error 404: ....
After: Vector starts; batches are rejected at ingest time with the same error logged per batch.

How did you test this PR?

  • Manual smoke tests still to do:
  • Non-existent table with default settings → sink starts, events rejected.
  • Bad OAuth credentials → sink starts, events rejected.
  • healthcheck.enabled = false with unreachable UC → sink starts.
  • require_healthy = true → Vector exits (opt-in fail-fast preserved).

Change Type

  • Bug fix
  • New feature
  • Dependencies
  • Non-functional (chore, refactoring, docs)
  • Performance

Is this a breaking change?

  • Yes
  • No

Does this PR include user facing changes?

  • Yes. Please add a changelog fragment based on our guidelines.
  • No. A maintainer will apply the no-changelog label to this PR.

References

Notes

  • Please read our Vector contributor resources.
  • Do not hesitate to use @vectordotdev/vector to reach out to us regarding this PR.
  • Some CI checks run only after we manually approve them.
    • We recommend adding a pre-push hook, please see this template.
    • Alternatively, we recommend running the following locally before pushing to the remote branch:
      • make fmt
      • make check-clippy (if there are failures it's possible some of them can be fixed with make clippy-fix)
      • make test
  • After a review is requested, please avoid force pushes to help us review incrementally.
    • Feel free to push as many commits as you want. They will be squashed into one before merging.
    • For example, you can run git merge origin master and git push.
  • If this PR introduces changes Vector dependencies (modifies Cargo.lock), please
    run make build-licenses to regenerate the license inventory and commit the changes (if any). More details on the dd-rust-license-tool.

@github-actions github-actions Bot added the domain: sinks Anything related to the Vector's sinks label May 11, 2026
@flaviofcruz flaviofcruz force-pushed the zerobus-defer-descriptor branch from 30bffb2 to 83ebd94 Compare May 11, 2026 16:45
@flaviofcruz flaviofcruz marked this pull request as ready for review May 11, 2026 19:46
@flaviofcruz flaviofcruz requested a review from a team as a code owner May 11, 2026 19:46
@github-actions github-actions Bot added the domain: ci Anything related to Vector's CI environment label May 12, 2026
Comment thread .github/workflows/semantic.yml
Copy link
Copy Markdown
Contributor

@petere-datadog petere-datadog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested, works as expected, code looks good to me

Copy link
Copy Markdown
Member

@pront pront left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@codex review

also a few checks are failing

Comment thread changelog.d/25408_databricks_zerobus_lazy_schema.fix.md Outdated
@flaviofcruz flaviofcruz force-pushed the zerobus-defer-descriptor branch from 3e6c015 to bdd9e09 Compare May 20, 2026 20:26
@flaviofcruz flaviofcruz requested a review from pront May 20, 2026 21:08
@pront pront added the no-changelog Changes in this PR do not need user-facing explanations in the release changelog label May 22, 2026
@pront pront enabled auto-merge May 22, 2026 16:38
@pront pront added this pull request to the merge queue May 22, 2026
@pront pront removed this pull request from the merge queue due to a manual request May 22, 2026
@pront
Copy link
Copy Markdown
Member

pront commented May 22, 2026

please add databricks_zerobus sink to semantic.yml

@flaviofcruz
Copy link
Copy Markdown
Contributor Author

please add databricks_zerobus sink to semantic.yml

It is added here: https://github.com/flaviofcruz/vector/blob/bdd9e0910af6bd3fd2c2459222f402d823b99858/.github/workflows/semantic.yml But I can merge a separate PR if that makes it easier.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f2cea4a861

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +486 to +488
let schema = service.ensure_schema().await?;
let records = Self::encode_records(schema, &request.events)?;
let stream = service.get_or_create_stream(schema).await?;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Preserve rejection status for non-retriable sink failures

call now propagates schema/encoding/stream-init failures as Err via ?, but the generic driver treats every Err as EventStatus::Errored (transient) regardless of retryability. That means permanent failures introduced here (for example Unity Catalog 404/401 mapped to SchemaError { retryable: false }, or non-retriable EncodingError) no longer become rejected batches and can be NACKed/retried indefinitely by ack-enabled sources instead of being dropped as permanent failures.

Useful? React with 👍 / 👎.

/// retryable statuses for HTTP-based sinks lives at
/// [`crate::sinks::util::http::RetryStrategy::Default`].
fn status_is_retryable(status: StatusCode) -> bool {
status.is_server_error()
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Exclude HTTP 501 from retryable schema failures

status_is_retryable currently marks all 5xx responses as retryable, but Vector’s default HTTP retry semantics explicitly exclude 501 Not Implemented. This means a permanent 501 returned by Unity Catalog or an intermediate gateway will now be retried as transient, delaying failure and repeatedly replaying the same request instead of rejecting it immediately.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

domain: ci Anything related to Vector's CI environment domain: sinks Anything related to the Vector's sinks no-changelog Changes in this PR do not need user-facing explanations in the release changelog

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants