enhancement(gcp_cloud_storage sink): Add support for OIDC via Workload Identity Federation#25454
enhancement(gcp_cloud_storage sink): Add support for OIDC via Workload Identity Federation#25454ganelo wants to merge 13 commits into
Conversation
- Remove .claude/settings.local.json from git index - Use tokio::fs::read instead of std::fs::read in credential loading - Add cred_type and project_id to InnerCreds for debug logging - Use ring crypto provider instead of default-rustls-provider to avoid pulling in aws-lc-rs/aws-lc-sys (cmake-built C library) - Remove dead MissingAuth error variant - Return explicit error for missing type field in credentials JSON - Gate SCOPE_COMPUTE with #[cfg(test)] since it is only used in tests - Simplify GcpAuthConfig::build by delegating GOOGLE_APPLICATION_CREDENTIALS handling to the SDK ADC builder Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tart The google-cloud-auth SDK's `AccessTokenCredentials::access_token()` is the only exposed API and gives back a token string with no expiry information. When the SDK's internal refresh fails (e.g. external_account credential_source file unreadable, STS impersonation denied), it keeps returning the cached stale token without surfacing an error. Vector's 30-min regenerator then "succeeds" indefinitely while every GCS request gets 401 — a state observed in chalet-production to last until manual pod replacement. Three changes converge on self-healing: - Each `regenerate_token` rebuilds `AccessTokenCredentials` from disk via a new `CredSource` enum, so the SDK has no cached token to return. Side benefit: configmap / IAM-controller rotations are picked up without a restart. - `fetch_token` is bounded by a 30 s `tokio::time::timeout` and a new `GcpError::FetchTokenTimeout` variant. A stuck metadata server / STS endpoint surfaces as a logged error and falls into the 2 s retry loop instead of hanging the regenerator task forever. - `GcsRetryLogic` optionally carries a `GcpAuthenticator` and, on a 401, fires a throttled `force_refresh` before returning Retry. Throttle (`MIN_FORCE_REFRESH_INTERVAL = 30 s`) prevents retry storms from hammering the STS endpoint. Both `gcp_cloud_storage` and `gcp_chronicle` sinks now construct the retry logic with `with_auth(...)`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Generated by `make build-licenses`. The drift is pre-existing — Cargo.lock on og/gcp-oidc has moved past what the license file was last regenerated against. Same regeneration would land on og/gcp-oidc if run there; tying it to this branch so the fix PR is license-clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 270045d0c3
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| let mut deadline = TOKEN_REFRESH_INTERVAL; | ||
| loop { |
There was a problem hiding this comment.
Refresh OAuth tokens using their real expiry time
The new regenerator now sleeps a fixed 30 minutes before every refresh, but this commit also adds credential types (external_account / impersonated_service_account in build_credentials_from_json) whose access-token lifetime can be configured below 1800s. In that case the cached token can expire well before the next scheduled refresh, causing repeated auth failures until the 30-minute loop runs again (or indefinitely noisy retries in components without 401-triggered force_refresh, such as Pub/Sub and Stackdriver paths). Please derive the next refresh deadline from each token’s actual expiry instead of a constant interval.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Unfortunately, it doesn't look like the new Google SDK crate exposes expiry time on the token anymore, so this isn't possible.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2a488dfeda
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
The recent master merge produced two reqwest 0.13.3 blocks because each parent's entry referenced different transitive deps. Dropped the stale duplicate whose referenced versions (rustls 0.23.37, hyper 1.7.0, http 1.3.1, hyper-rustls 0.27.5, tokio-rustls 0.26.2, tower-http 0.6.8) no longer exist elsewhere in the lockfile, so Cargo can parse the lockfile again. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d9d4aa1036
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
The prior self-heal commit only wired GCS and Chronicle through GcsRetryLogic::with_auth. The other GCP components — gcp_pubsub sink, gcp_stackdriver_logs sink, gcp_stackdriver_metrics sink, and the gcp_pubsub source — used the default HTTP retry path, which treats 401 as DontRetry and so left them dependent on the periodic regenerator tick for recovery. With external_account / impersonated_service_account tokens routinely under 1800s, that gap could expire a token well before the next refresh. Three changes converge on full coverage: - New GcpAuthRetryLogic in sinks/gcs_common/config.rs, generic over the response type via a status-extractor closure, with convenience constructors for both http::Response<Bytes> (BatchedHttpSink) and HttpResponse (HttpService). On 401 it fires auth.force_refresh() and retries; everything else falls through to the configured RetryStrategy. - gcp_pubsub sink, stackdriver logs sink, and stackdriver metrics sink now use that retry logic with their existing strategy + auth handle. - gcp_pubsub source's translate_error fires force_refresh on tonic::Code::Unauthenticated before falling through to RetryDelay, so the next streaming-pull reconnect uses a fresh token. Also shorten TOKEN_REFRESH_INTERVAL from 30 min to 5 min as a defense in depth for revocation paths that don't surface as 401 (clock skew, IdP revocation, etc.). The 401 self-heal handles the common case ahead of this tick; force_refresh's existing 30s throttle keeps STS load bounded even when both fire simultaneously. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 01dbb648ec
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
GcpAuthRetryLogic::should_retry_response unconditionally returned Retry on 401, which silently overrode RetryStrategy::None for the new stackdriver wiring. A user who explicitly disabled retries could still see repeated attempts when tokens were rejected. Gate the override on retry_strategy != None and skip the credential refresh in that case (the failed request will not be re-issued, so the refresh has no payoff and would still produce STS load). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6e3f40e1b4
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Custom is an explicit user-configured allowlist; overriding it on 401
was a behavioral regression vs. the previous http_response_retry_logic
path (in configs like `Custom { status_codes: [500] }`, a revoked
credential would now silently burn the request's retry budget). Make
401 self-heal only when the strategy is Default or All; Custom defers
to the user's allowlist (refresh + retry only when 401 is in it), None
stays disabled.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
@codex review |
|
Codex Review: Didn't find any major issues. More of your lovely PRs please. ℹ️ About Codex in GitHubCodex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback". |
Addresses a sweep of post-merge review feedback on the OIDC migration and 401 self-heal work. Auth structure & lifecycle (src/gcp.rs): - Restore master's `GOOGLE_APPLICATION_CREDENTIALS` precedence over `api_key` when `credentials_path` is unset, with a comment on why. - Collapse the three RwLocks (token/cred_type/project_id) into a single `RwLock<CredState>` so readers can no longer observe a torn snapshot. - Cache the `Authorization` `HeaderValue` once per refresh instead of formatting and parsing on every request; removes the per-request `parse().unwrap()` panic risk and surfaces malformed tokens as `InvalidAuthHeader` at refresh time. - `force_refresh` now updates `last_refresh_success` only on success (failed attempts no longer silently throttle out subsequent 401s) and coalesces concurrent 401 bursts via `refresh_in_flight: AtomicBool` with an RAII `InFlightGuard`. - Coordinate the periodic regenerator with the force_refresh throttle: successful periodic refreshes also update `last_refresh_success`. - Move the `rustls::crypto::ring` provider install behind a `OnceLock` helper so it's a one-shot process init, not per-config-build. - Exponential backoff on regenerator failure (2s doubling, capped at TOKEN_REFRESH_INTERVAL); resets on success. Replaces the constant 2s loop that would hot-spin during STS outages. - Rename `spawn_regenerate_token` to `start_background_refresh` (sinks) and split notifications into `subscribe_token_rotation` (only the gcp_pubsub source needs the watch receiver). - New `Scope` newtype with `pub const PUBSUB`, `DEVSTORAGE_READ_WRITE`, etc. — a typo in a sink wiring is now a compile error. - Update `api_key` doc to reflect non-service-account file types. - Comment on the synchronous JSON parse in `build_from_file`. Retry logic (src/sinks/gcs_common/config.rs): - `should_retry_response` (both `GcpAuthRetryLogic` and `GcsRetryLogic`) now synchronously waits for `force_refresh` via `block_in_place` + `block_on` before returning `Retry`, so the retry uses a fresh token instead of burning the retry budget on the stale one. The in-flight guard above coalesces concurrent 401s to one STS call per burst. - `PhantomData<fn(Req) -> Res>` → `PhantomData<(Req, Res)>` for conventional invariance. - Document why `gcp_pubsub` sink hardcodes `RetryStrategy::Default` (no public retry_strategy knob on PubsubConfig). Tests: - `retry_logic_matrix`: table-driven test over (status × strategy) for `GcpAuthRetryLogic` — 17 cases covering 200/401/500/404 against None/Default/All/Custom-with-401/Custom-with-500. Codex caught two semantic bugs that this matrix would have caught. - `translate_error_unauthenticated_returns_retry_delay` (+ sibling for the non-Unauthenticated branch) in `gcp_pubsub` source. - Replace `fails_missing_creds` (which fails outside GCE due to metadata.google.internal DNS) with four file-based tests covering the concrete failure paths in `build_from_file` / `build_credentials_from_json`: missing file, malformed JSON, missing type field, unsupported type. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
@codex review |
|
Codex Review: Didn't find any major issues. You're on a roll. ℹ️ About Codex in GitHubCodex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback". |
Summary
Change crates for GCP SDK to a seemingly better-supported one, that also supports Workload Identity Federation, in order to unblock support for OIDC via WIF.
Note: this feature was implemented in part through the use of AI. It has also been validated via deploying in a dev environment (see below). I view it as a starting point for iterating to get it above the bar - I'm committed to working with you to get it to that point.
Vector configuration
Standard
gcp_cloud_storagesink with environment variable GOOGLE_APPLICATION_CREDENTIALS pointing at a file containing GCP external account json.How did you test this PR?
Deployed via a helm chart on a containerized GCP dev environment; confirmed writes succeed to GCP sink over the course of several days (ensuring no issues with eg token refresh, etc).
Change Type
Is this a breaking change?
Does this PR include user facing changes?
no-changeloglabel to this PR.References