Skip to content

enhancement(gcp_cloud_storage sink): Add support for OIDC via Workload Identity Federation#25454

Open
ganelo wants to merge 13 commits into
vectordotdev:masterfrom
ganelo:og/gcp-token-refresh-fix
Open

enhancement(gcp_cloud_storage sink): Add support for OIDC via Workload Identity Federation#25454
ganelo wants to merge 13 commits into
vectordotdev:masterfrom
ganelo:og/gcp-token-refresh-fix

Conversation

@ganelo
Copy link
Copy Markdown
Contributor

@ganelo ganelo commented May 18, 2026

Summary

Change crates for GCP SDK to a seemingly better-supported one, that also supports Workload Identity Federation, in order to unblock support for OIDC via WIF.

Note: this feature was implemented in part through the use of AI. It has also been validated via deploying in a dev environment (see below). I view it as a starting point for iterating to get it above the bar - I'm committed to working with you to get it to that point.

Vector configuration

Standard gcp_cloud_storage sink with environment variable GOOGLE_APPLICATION_CREDENTIALS pointing at a file containing GCP external account json.

How did you test this PR?

Deployed via a helm chart on a containerized GCP dev environment; confirmed writes succeed to GCP sink over the course of several days (ensuring no issues with eg token refresh, etc).

Change Type

  • Bug fix
  • New feature
  • Dependencies
  • Non-functional (chore, refactoring, docs)
  • Performance

Is this a breaking change?

  • Yes
  • No

Does this PR include user facing changes?

  • Yes. Please add a changelog fragment based on our guidelines.
  • No. A maintainer will apply the no-changelog label to this PR.

References

Brian Park and others added 4 commits May 12, 2026 16:59
- Remove .claude/settings.local.json from git index
- Use tokio::fs::read instead of std::fs::read in credential loading
- Add cred_type and project_id to InnerCreds for debug logging
- Use ring crypto provider instead of default-rustls-provider to avoid
  pulling in aws-lc-rs/aws-lc-sys (cmake-built C library)
- Remove dead MissingAuth error variant
- Return explicit error for missing type field in credentials JSON
- Gate SCOPE_COMPUTE with #[cfg(test)] since it is only used in tests
- Simplify GcpAuthConfig::build by delegating GOOGLE_APPLICATION_CREDENTIALS
  handling to the SDK ADC builder

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tart

The google-cloud-auth SDK's `AccessTokenCredentials::access_token()` is the
only exposed API and gives back a token string with no expiry information.
When the SDK's internal refresh fails (e.g. external_account credential_source
file unreadable, STS impersonation denied), it keeps returning the cached
stale token without surfacing an error. Vector's 30-min regenerator then
"succeeds" indefinitely while every GCS request gets 401 — a state observed
in chalet-production to last until manual pod replacement.

Three changes converge on self-healing:

- Each `regenerate_token` rebuilds `AccessTokenCredentials` from disk via a
  new `CredSource` enum, so the SDK has no cached token to return. Side
  benefit: configmap / IAM-controller rotations are picked up without a
  restart.
- `fetch_token` is bounded by a 30 s `tokio::time::timeout` and a new
  `GcpError::FetchTokenTimeout` variant. A stuck metadata server / STS
  endpoint surfaces as a logged error and falls into the 2 s retry loop
  instead of hanging the regenerator task forever.
- `GcsRetryLogic` optionally carries a `GcpAuthenticator` and, on a 401,
  fires a throttled `force_refresh` before returning Retry. Throttle
  (`MIN_FORCE_REFRESH_INTERVAL = 30 s`) prevents retry storms from
  hammering the STS endpoint. Both `gcp_cloud_storage` and `gcp_chronicle`
  sinks now construct the retry logic with `with_auth(...)`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Generated by `make build-licenses`. The drift is pre-existing — Cargo.lock
on og/gcp-oidc has moved past what the license file was last regenerated
against. Same regeneration would land on og/gcp-oidc if run there; tying
it to this branch so the fix PR is license-clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ganelo ganelo requested a review from a team as a code owner May 18, 2026 20:24
@github-actions github-actions Bot added domain: sources Anything related to the Vector's sources domain: sinks Anything related to the Vector's sinks labels May 18, 2026
@ganelo ganelo changed the title enhancement(gcp sink): Add support for OIDC via Workload Identity Federation enhancement(gcp_cloud_storage sink): Add support for OIDC via Workload Identity Federation May 18, 2026
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 270045d0c3

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread src/gcp.rs
Comment on lines +269 to 270
let mut deadline = TOKEN_REFRESH_INTERVAL;
loop {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Refresh OAuth tokens using their real expiry time

The new regenerator now sleeps a fixed 30 minutes before every refresh, but this commit also adds credential types (external_account / impersonated_service_account in build_credentials_from_json) whose access-token lifetime can be configured below 1800s. In that case the cached token can expire well before the next scheduled refresh, causing repeated auth failures until the 30-minute loop runs again (or indefinitely noisy retries in components without 401-triggered force_refresh, such as Pub/Sub and Stackdriver paths). Please derive the next refresh deadline from each token’s actual expiry instead of a constant interval.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, it doesn't look like the new Google SDK crate exposes expiry time on the token anymore, so this isn't possible.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2a488dfeda

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread Cargo.lock
oganel and others added 3 commits May 19, 2026 11:16
The recent master merge produced two reqwest 0.13.3 blocks because each
parent's entry referenced different transitive deps. Dropped the stale
duplicate whose referenced versions (rustls 0.23.37, hyper 1.7.0, http
1.3.1, hyper-rustls 0.27.5, tokio-rustls 0.26.2, tower-http 0.6.8) no
longer exist elsewhere in the lockfile, so Cargo can parse the lockfile
again.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d9d4aa1036

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread src/gcp.rs Outdated
The prior self-heal commit only wired GCS and Chronicle through
GcsRetryLogic::with_auth. The other GCP components — gcp_pubsub sink,
gcp_stackdriver_logs sink, gcp_stackdriver_metrics sink, and the
gcp_pubsub source — used the default HTTP retry path, which treats 401
as DontRetry and so left them dependent on the periodic regenerator
tick for recovery. With external_account / impersonated_service_account
tokens routinely under 1800s, that gap could expire a token well before
the next refresh.

Three changes converge on full coverage:

- New GcpAuthRetryLogic in sinks/gcs_common/config.rs, generic over the
  response type via a status-extractor closure, with convenience
  constructors for both http::Response<Bytes> (BatchedHttpSink) and
  HttpResponse (HttpService). On 401 it fires auth.force_refresh() and
  retries; everything else falls through to the configured RetryStrategy.
- gcp_pubsub sink, stackdriver logs sink, and stackdriver metrics sink
  now use that retry logic with their existing strategy + auth handle.
- gcp_pubsub source's translate_error fires force_refresh on
  tonic::Code::Unauthenticated before falling through to RetryDelay, so
  the next streaming-pull reconnect uses a fresh token.

Also shorten TOKEN_REFRESH_INTERVAL from 30 min to 5 min as a defense in
depth for revocation paths that don't surface as 401 (clock skew, IdP
revocation, etc.). The 401 self-heal handles the common case ahead of
this tick; force_refresh's existing 30s throttle keeps STS load bounded
even when both fire simultaneously.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 01dbb648ec

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread src/sinks/gcs_common/config.rs
GcpAuthRetryLogic::should_retry_response unconditionally returned Retry
on 401, which silently overrode RetryStrategy::None for the new
stackdriver wiring. A user who explicitly disabled retries could still
see repeated attempts when tokens were rejected. Gate the override on
retry_strategy != None and skip the credential refresh in that case
(the failed request will not be re-issued, so the refresh has no payoff
and would still produce STS load).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ganelo
Copy link
Copy Markdown
Contributor Author

ganelo commented May 19, 2026

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6e3f40e1b4

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread src/sinks/gcs_common/config.rs Outdated
Custom is an explicit user-configured allowlist; overriding it on 401
was a behavioral regression vs. the previous http_response_retry_logic
path (in configs like `Custom { status_codes: [500] }`, a revoked
credential would now silently burn the request's retry budget). Make
401 self-heal only when the strategy is Default or All; Custom defers
to the user's allowlist (refresh + retry only when 401 is in it), None
stays disabled.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ganelo
Copy link
Copy Markdown
Contributor Author

ganelo commented May 19, 2026

@codex review

@chatgpt-codex-connector
Copy link
Copy Markdown

Codex Review: Didn't find any major issues. More of your lovely PRs please.

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Addresses a sweep of post-merge review feedback on the OIDC migration
and 401 self-heal work.

Auth structure & lifecycle (src/gcp.rs):
- Restore master's `GOOGLE_APPLICATION_CREDENTIALS` precedence over
  `api_key` when `credentials_path` is unset, with a comment on why.
- Collapse the three RwLocks (token/cred_type/project_id) into a single
  `RwLock<CredState>` so readers can no longer observe a torn snapshot.
- Cache the `Authorization` `HeaderValue` once per refresh instead of
  formatting and parsing on every request; removes the per-request
  `parse().unwrap()` panic risk and surfaces malformed tokens as
  `InvalidAuthHeader` at refresh time.
- `force_refresh` now updates `last_refresh_success` only on success
  (failed attempts no longer silently throttle out subsequent 401s) and
  coalesces concurrent 401 bursts via `refresh_in_flight: AtomicBool`
  with an RAII `InFlightGuard`.
- Coordinate the periodic regenerator with the force_refresh throttle:
  successful periodic refreshes also update `last_refresh_success`.
- Move the `rustls::crypto::ring` provider install behind a `OnceLock`
  helper so it's a one-shot process init, not per-config-build.
- Exponential backoff on regenerator failure (2s doubling, capped at
  TOKEN_REFRESH_INTERVAL); resets on success. Replaces the constant 2s
  loop that would hot-spin during STS outages.
- Rename `spawn_regenerate_token` to `start_background_refresh` (sinks)
  and split notifications into `subscribe_token_rotation` (only the
  gcp_pubsub source needs the watch receiver).
- New `Scope` newtype with `pub const PUBSUB`, `DEVSTORAGE_READ_WRITE`,
  etc. — a typo in a sink wiring is now a compile error.
- Update `api_key` doc to reflect non-service-account file types.
- Comment on the synchronous JSON parse in `build_from_file`.

Retry logic (src/sinks/gcs_common/config.rs):
- `should_retry_response` (both `GcpAuthRetryLogic` and `GcsRetryLogic`)
  now synchronously waits for `force_refresh` via `block_in_place` +
  `block_on` before returning `Retry`, so the retry uses a fresh token
  instead of burning the retry budget on the stale one. The in-flight
  guard above coalesces concurrent 401s to one STS call per burst.
- `PhantomData<fn(Req) -> Res>` → `PhantomData<(Req, Res)>` for
  conventional invariance.
- Document why `gcp_pubsub` sink hardcodes `RetryStrategy::Default`
  (no public retry_strategy knob on PubsubConfig).

Tests:
- `retry_logic_matrix`: table-driven test over (status × strategy) for
  `GcpAuthRetryLogic` — 17 cases covering 200/401/500/404 against
  None/Default/All/Custom-with-401/Custom-with-500. Codex caught two
  semantic bugs that this matrix would have caught.
- `translate_error_unauthenticated_returns_retry_delay` (+ sibling
  for the non-Unauthenticated branch) in `gcp_pubsub` source.
- Replace `fails_missing_creds` (which fails outside GCE due to
  metadata.google.internal DNS) with four file-based tests covering
  the concrete failure paths in `build_from_file` /
  `build_credentials_from_json`: missing file, malformed JSON, missing
  type field, unsupported type.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ganelo
Copy link
Copy Markdown
Contributor Author

ganelo commented May 20, 2026

@codex review

@chatgpt-codex-connector
Copy link
Copy Markdown

Codex Review: Didn't find any major issues. You're on a roll.

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

domain: sinks Anything related to the Vector's sinks domain: sources Anything related to the Vector's sources

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Include support for GCP authentication via workload identity federation

2 participants