feat(vector sink): add multiple endpoint strategies #25662

Open
fpytloun wants to merge 4 commits into vectordotdev:master from fpytloun:fpytloun/vector-sink-multiple-backends

@fpytloun

Contributor

Summary

Adds multi-endpoint support to the vector sink:

  • addresses = [...] configures multiple downstream Vector endpoints
  • endpoint_strategy = "load_balance" is the default and uses the existing distributed service / endpoint-health path
  • endpoint_strategy = "failover" uses ordered, non-preemptive failover for stateful downstream Vector aggregators
  • keeps existing address = "..." behavior unchanged
  • updates generated component docs and adds a changelog fragment

The failover strategy starts with the first configured endpoint, moves to the next endpoint on request failure or per-endpoint timeout, and keeps using the successful endpoint until it fails. Hosts can use different address ordering to spread primary ownership without requiring random failover semantics.
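
One way to spread primary ownership across hosts is to rotate the same address list by a per-host index when rendering each host's config. The sketch below is illustrative only: `addresses_for_host` and `host_index` are hypothetical names and are not part of this PR or of Vector.

```rust
/// Rotate a shared address list so that each host gets a different primary.
/// Hypothetical helper for config generation; not part of Vector.
pub fn addresses_for_host(addresses: &[&str], host_index: usize) -> Vec<String> {
    if addresses.is_empty() {
        return Vec::new();
    }
    // Host 0 keeps the original order, host 1 starts at the second address, etc.
    let shift = host_index % addresses.len();
    addresses
        .iter()
        .cycle()
        .skip(shift)
        .take(addresses.len())
        .map(|addr| (*addr).to_owned())
        .collect()
}
```

With three aggregators, host 1 would get the second address as its primary and keep the other two as ordered fallbacks.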

Validation

  • make generate-component-docs
  • cargo fmt --all -- --check
  • cargo test --no-default-features --features sinks-vector --lib sinks::vector::tests::
  • cargo test --no-default-features --features sinks-vector --lib sinks::vector::test::generate_config
  • cargo clippy --no-default-features --features sinks-vector --lib -- -D warnings -A clippy::manual_option_zip
  • ./scripts/check_changelog_fragments.sh
  • Docker Compose E2E:
    • load_balance delivered events to both downstream Vector servers
    • failover delivered only to the first endpoint while it was healthy, then moved to the second endpoint after the first was stopped (with a bounded request.timeout_secs)

@fpytloun requested review from a team as code owners on June 22, 2026 14:47
@github-actions (bot) added labels on Jun 22, 2026:
  • domain: sinks (Anything related to the Vector's sinks)
  • domain: external docs (Anything related to Vector's external, public documentation)
  • docs review on hold (The documentation team reviews PRs only after a PR is approved by the COSE team.)

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d7499cf772

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread src/sinks/vector/config.rs
Comment thread src/sinks/vector/config.rs
Comment thread src/sinks/vector/config.rs Outdated
Comment thread src/sinks/vector/config.rs
@fpytloun force-pushed the fpytloun/vector-sink-multiple-backends branch from d7499cf to 25ceca6 on June 22, 2026 15:12
@fpytloun

Contributor Author

Addressed the review comments in the latest push (25ceca6c6f):

  • Preserved the old single-address service path so unchanged single-endpoint configs do not go through endpoint-health distributed_service.
  • Made ordered failover advance only on retriable Vector errors; non-retriable gRPC statuses such as DataLoss now bubble immediately and are not resent to later endpoints.
  • Guarded active endpoint updates so stale concurrent successes cannot preempt a newer failover target.
  • Added timeout slack around the internal per-endpoint failover loop so the final endpoint still gets its full per-endpoint timeout budget.
  • Added a regression test proving non-retriable primary rejection is not resent to secondary.
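
The retriable-only advance described above can be sketched as follows. This is a simplified, synchronous illustration, not the code in this PR: `SendError`, `is_retriable`, and `send_with_failover` are hypothetical names, and the real sink works with gRPC statuses and async services.

```rust
/// Hypothetical error type standing in for the sink's response errors.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SendError {
    Unavailable, // retriable: the endpoint could not be reached
    Timeout,     // retriable: the per-endpoint timeout elapsed
    DataLoss,    // non-retriable: the endpoint answered and rejected the request
}

impl SendError {
    pub fn is_retriable(self) -> bool {
        matches!(self, SendError::Unavailable | SendError::Timeout)
    }
}

/// Try endpoints in order. Advance to the next endpoint only on a retriable
/// error; a non-retriable error is returned immediately and never resent.
/// Returns the index of the endpoint that accepted the request.
pub fn send_with_failover<F>(endpoints: &[&str], mut send: F) -> Result<usize, SendError>
where
    F: FnMut(&str) -> Result<(), SendError>,
{
    // With no endpoints configured, this sketch reports `Unavailable`.
    let mut last_err = SendError::Unavailable;
    for (idx, &endpoint) in endpoints.iter().enumerate() {
        match send(endpoint) {
            Ok(()) => return Ok(idx),
            Err(err) if err.is_retriable() => last_err = err,
            Err(err) => return Err(err),
        }
    }
    Err(last_err)
}
```

In this sketch a `DataLoss` from the primary stops the loop before the secondary is contacted, which is the behavior the regression test above is described as covering.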

Validation rerun:

  • cargo fmt --all
  • cargo test --no-default-features --features sinks-vector --lib sinks::vector::tests::
  • cargo clippy --no-default-features --features sinks-vector --lib -- -D warnings -A clippy::manual_option_zip
  • delegated code review: APPROVE

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 25ceca6c6f

Comment thread src/sinks/vector/config.rs Outdated
Comment thread src/sinks/vector/config.rs Outdated
Comment thread src/sinks/vector/config.rs
@fpytloun force-pushed the fpytloun/vector-sink-multiple-backends branch from 25ceca6 to 75fcbe3 on June 22, 2026 15:51
@fpytloun

Contributor Author

Updated the branch to address the latest review comments:

  • Endpoint health now uses the same retriable-error classification as the Vector sink retry logic, so non-retriable Vector responses such as DataLoss do not mark a reachable endpoint unhealthy.
  • HealthConfig::default() now matches the documented/serde defaults instead of zero backoff values.
  • Failover active-state advancement now uses a generation-aware CAS loop and reloads stale observed state before deciding whether to advance.
  • Added focused regressions for health classification, documented health defaults, stale failover generation handling, and stale mismatched-state reload behavior.
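
A generation-aware CAS loop of the kind mentioned above might look like the sketch below. It assumes the active endpoint index and a generation counter are packed into one `AtomicU64`; the `ActiveEndpoint` type and its layout are assumptions made for illustration and may differ from what the PR actually does.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical shared failover state: (generation << 32) | active index.
pub struct ActiveEndpoint {
    state: AtomicU64,
    len: u32,
}

impl ActiveEndpoint {
    pub fn new(len: u32) -> Self {
        assert!(len > 0, "at least one endpoint is required");
        Self { state: AtomicU64::new(0), len }
    }

    fn unpack(raw: u64) -> (u32, u32) {
        ((raw >> 32) as u32, raw as u32)
    }

    fn pack(generation: u32, index: u32) -> u64 {
        (u64::from(generation) << 32) | u64::from(index)
    }

    /// Returns the current (generation, active index).
    pub fn load(&self) -> (u32, u32) {
        Self::unpack(self.state.load(Ordering::Acquire))
    }

    /// Advance past `failed_index` only if nobody else has moved the state
    /// since the caller observed it. Returns true if this call advanced it.
    pub fn advance_if_current(&self, observed_generation: u32, failed_index: u32) -> bool {
        let mut raw = self.state.load(Ordering::Acquire);
        loop {
            let (generation, index) = Self::unpack(raw);
            if generation != observed_generation || index != failed_index {
                return false; // stale observation: another request already advanced
            }
            let next = Self::pack(generation.wrapping_add(1), (index + 1) % self.len);
            match self
                .state
                .compare_exchange_weak(raw, next, Ordering::AcqRel, Ordering::Acquire)
            {
                Ok(_) => return true,
                // Spurious failure or a concurrent update: reload and re-check.
                Err(actual) => raw = actual,
            }
        }
    }
}
```

Because the generation changes on every advance, a late failure report from a request that observed an older generation cannot move the state a second time.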

Validation:

  • cargo fmt --all -- --check
  • cargo test --no-default-features --features sinks-vector --lib
  • cargo clippy --no-default-features --features sinks-vector --lib -- -D warnings -A clippy::manual_option_zip
  • Delegated code review: APPROVE

@fpytloun

fpytloun commented Jun 25, 2026

Contributor Author

Tested load-balancing mode in a real environment and it works really well.

Before (clients stick to one aggregator and are spread randomly, and each client's traffic is different):
[image]

After (aggregators are used equally by all clients):
[image]

@fpytloun

fpytloun commented Jun 25, 2026

Contributor Author

I found one caveat in failover mode when it is combined with the keepalive.max_connection_age_secs feature (#25660). The behavior is acceptable in itself, but it is a problem for spreading load in some environments.

  • agent A has addresses = ["aggr-a", "aggr-b"] and reaches aggr-a
  • agent B has addresses = ["aggr-b", "aggr-a"] and reaches aggr-a after aggr-b crashes
  • aggr-a closes the connection gracefully because of keepalive.max_connection_age_secs
  • agent A falls back to aggr-b and agent B falls back to aggr-a, so they just swap positions instead of reconnecting to their primaries

I am going to change it to the following operational semantics:

  • normal steady state: keep using the current active endpoint
  • when a request to the current active endpoint returns a retriable error or timeout: 1. try the configured primary first, 2. then try the rest of the configured order
  • if this was just a source max-age / connection recycle, the reconnect to the primary succeeds
  • if the primary is really down, the request falls through to the secondary
  • no background probing or failback while the current active endpoint is healthy.

@fpytloun force-pushed the fpytloun/vector-sink-multiple-backends branch from 75fcbe3 to 0247494 on June 25, 2026 09:40
@fpytloun

Contributor Author

Updated failover semantics to handle source-side max_connection_age reconnects correctly.

Before this update, failover mode could treat a connection-age close as an endpoint failure and advance in ring order. That meant stations could drift away from their configured primary after the receiver recycled the long-lived gRPC connection.

The failover strategy now:

  • tries the current active endpoint first for steady-state behavior;
  • after a retriable error or per-endpoint timeout, re-evaluates endpoints from the configured address order;
  • retries the configured primary before falling through to secondary endpoints;
  • keeps the existing retriable/non-retriable error handling and generation-aware state update protections.

Added coverage:

  • failover_strategy_retries_primary_before_secondary
  • failover_attempts_current_then_configured_order

Validation:

  • make generate-component-docs
  • cargo fmt --all -- --check
  • cargo test --no-default-features --features sinks-vector --lib
  • cargo clippy --no-default-features --features sinks-vector --lib -- -D warnings -A clippy::manual_option_zip
  • delegated code review: APPROVE

@fpytloun force-pushed the fpytloun/vector-sink-multiple-backends branch from 0247494 to 3d6f738 on June 25, 2026 10:03
@fpytloun

Contributor Author

Updated the endpoint strategy split:

  • failover is now the non-preemptive/ring mode: current active endpoint first, then continue through endpoints from the next address.
  • failover_primary is the primary-preferring mode: current active endpoint first, then retry from configured address order so receiver-side connection recycling can converge back to addresses[0].
  • failover_random is intentionally not included in this PR; it needs separate health/random-selection semantics.
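
The difference between the two modes comes down to the order in which endpoints are attempted for a single request. The sketch below reads the descriptions above literally: `attempt_order` and this `EndpointStrategy` enum are hypothetical, and whether the PR removes the active endpoint from the second pass in primary-preferring mode is not stated in this thread, so this sketch leaves it in.

```rust
/// Hypothetical mirror of the two failover strategies discussed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EndpointStrategy {
    Failover,
    FailoverPrimary,
}

/// Endpoint indices to try for one request, given the currently active index.
pub fn attempt_order(strategy: EndpointStrategy, active: usize, len: usize) -> Vec<usize> {
    if len == 0 {
        return Vec::new();
    }
    let active = active % len;
    // Both modes try the current active endpoint first.
    let mut order = vec![active];
    match strategy {
        // Ring mode: continue with the address after the active one.
        EndpointStrategy::Failover => {
            order.extend((1..len).map(|step| (active + step) % len));
        }
        // Primary-preferring mode: restart from addresses[0] in configured order.
        // Assumption: the active endpoint is not deduplicated from this pass.
        EndpointStrategy::FailoverPrimary => {
            order.extend(0..len);
        }
    }
    order
}
```

With three endpoints and index 1 active, ring mode yields 1, 2, 0, while primary-preferring mode goes back to index 0 right after the active endpoint fails, which is what lets an agent converge back to addresses[0] after a connection recycle.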

Added coverage for parsing failover_primary, ring attempt ordering, primary-preferring attempt ordering, and 3-endpoint service behavior for both failover modes.

Validation rerun:

  • make generate-component-docs
  • cargo fmt --all -- --check
  • cargo test --no-default-features --features sinks-vector --lib
  • cargo clippy --no-default-features --features sinks-vector --lib -- -D warnings -A clippy::manual_option_zip
  • delegated code review: APPROVE

@fpytloun

Contributor Author

Updated the changelog fragment to explicitly mention the final endpoint strategies added by this PR: load_balance, failover, and failover_primary.

Validation:

  • ./scripts/check_changelog_fragments.sh
  • git diff --check

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3d185b63e9

Comment thread src/sinks/vector/config.rs Outdated
Comment thread src/sinks/vector/config.rs
Comment thread src/sinks/vector/config.rs Outdated
@fpytloun

Contributor Author

Addressed the automated failover review suggestions.

Changes:

  • stale failover failures no longer advance shared state after another request has already moved it
  • attempts are recomputed from the current active endpoint when another request advances to a different endpoint
  • recomputed attempts skip endpoints already tried by the current request, preserving bounded retries for untried endpoints
  • failover_primary now gives the outer timeout one endpoint-timeout of slack beyond the worst-case endpoint-attempt budget
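
Two of the points above lend themselves to a small sketch: recomputing the remaining attempts while skipping endpoints the request has already tried, and sizing the outer timeout with one extra per-endpoint timeout of slack. `remaining_attempts`, `outer_timeout`, and the `max_attempts` parameter are hypothetical; the exact worst-case attempt count used by the PR is not given in this thread.

```rust
use std::collections::HashSet;
use std::time::Duration;

/// Filter a freshly computed attempt order down to endpoints this request
/// has not tried yet, dropping duplicates within the fresh order as well.
pub fn remaining_attempts(fresh_order: &[usize], tried: &HashSet<usize>) -> Vec<usize> {
    let mut seen = tried.clone();
    fresh_order
        .iter()
        .copied()
        // `insert` returns false for endpoints that were already seen.
        .filter(|idx| seen.insert(*idx))
        .collect()
}

/// Outer timeout: the worst-case attempt budget plus one per-endpoint
/// timeout of slack, so the final attempt is not cut short.
pub fn outer_timeout(per_endpoint: Duration, max_attempts: u32) -> Duration {
    per_endpoint.saturating_mul(max_attempts.saturating_add(1))
}
```

Since every endpoint can be returned at most once per request, the number of attempts stays bounded even when the shared active endpoint moves while the request is in flight.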

Validation:

  • cargo test --no-default-features --features sinks-vector --lib sinks::vector::config::tests::failover_
  • cargo test --no-default-features --features sinks-vector --lib sinks::vector::tests::failover_
  • cargo clippy --no-default-features --features sinks-vector --lib -- -D warnings -A clippy::manual_option_zip
  • ./scripts/check_changelog_fragments.sh
  • git diff --check
