fix(core): honor AWS_ENDPOINT_URL_STS in IAM credentials-provider loader #5968

Open
edwardpark97 wants to merge 1 commit into valkey-io:main from edwardpark97:upstream-pr/iam-aws-endpoint-url-sts-fix

Summary

Fixes ElastiCache/MemoryDB IAM authentication in AWS partitions that do not publish a separate FIPS STS hostname (most notably us-gov-west-1, where the standard sts.us-gov-west-1.amazonaws.com endpoint is itself FIPS-validated and is the only STS endpoint in the region). Previously, setting AWS_USE_FIPS_ENDPOINT=true made the default credential provider's internal STS client synthesize a non-existent sts-fips.<region>.amazonaws.com hostname, causing credential acquisition to hang and IAM auth to time out with Connection error: Timeout. Setting AWS_ENDPOINT_URL_STS did not help because the Rust SDK's credentials-provider STS client doesn't honor that env var (unlike boto3, which threads it through).

Issue link

This Pull Request is linked to issue: core: ElastiCache IAM auth hangs in us-gov-west-1 when AWS_USE_FIPS_ENDPOINT=true
Closes #5967

Features / Behaviour Changes

glide-core/src/iam/mod.rs::get_signing_identity() now honors an explicit AWS_ENDPOINT_URL_STS environment variable when building the credentials-provider config loader, and disables FIPS on that scoped loader so the SDK's endpoint resolver does not reject the override. This mirrors boto3's long-standing behavior.

  • No public API changes. The fix is internal to credential acquisition.
  • No behavior change when AWS_ENDPOINT_URL_STS is unset or empty — the loader path is identical to before.
  • All language bindings benefit (Python, Java, Node, Go) since they all go through glide-core for IAM auth.

Implementation

The change is a single ~30-line addition inside get_signing_identity():

let mut loader = aws_config::defaults(BehaviorVersion::latest())
    .region(aws_config::Region::new(region.to_string()));

if let Ok(sts_endpoint) = std::env::var("AWS_ENDPOINT_URL_STS")
    && !sts_endpoint.is_empty()
{
    loader = loader.use_fips(false).endpoint_url(sts_endpoint);
}

let config = loader.load().await;

Two reviewer hot-spots:

  1. Why .use_fips(false) is needed alongside .endpoint_url() — the SDK's endpoint resolver fails fast when a FIPS partition is requested but the resolved (or user-provided) endpoint isn't on the SDK's hard-coded FIPS endpoint list. Without .use_fips(false), the override is rejected before any DNS lookup is attempted, producing "an error occurred while loading credentials" within ~100ms (not a DNS timeout). Disabling FIPS on this scoped loader is safe because (a) the override is local to credential acquisition only and does not affect SigV4 presigning of the actual ElastiCache/MemoryDB connect request (which runs through aws-sigv4 independently), and (b) the user remains responsible for pointing AWS_ENDPOINT_URL_STS at a FIPS-validated endpoint where compliance requires it.

  2. The !sts_endpoint.is_empty() guard: std::env::var() returns Ok("") when the env var is set but blank (common in Kubernetes manifests that templatize values). Treating blank as unset avoids passing an empty string to endpoint_url(), which would fail with a confusing error later.

A short comment in the source flags both points and links back to issue #5967 for the full writeup.
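
The guard described in point 2 can also be isolated into a small pure function, which makes the blank-versus-unset behavior easy to unit-test without touching the process environment. The sketch below is illustrative only: `sts_endpoint_override` is a hypothetical name and is not part of this PR, which keeps the check inline.

```rust
use std::env::VarError;

/// Hypothetical helper (not in the PR): decides whether an STS endpoint
/// override should be applied, given the raw result of reading the env var.
fn sts_endpoint_override(raw: Result<String, VarError>) -> Option<String> {
    match raw {
        // Set and non-empty: use the value as the override.
        Ok(value) if !value.is_empty() => Some(value),
        // Set but blank, unset, or not valid Unicode: keep the default loader path.
        _ => None,
    }
}
```

A caller would pass `std::env::var("AWS_ENDPOINT_URL_STS")` and, on `Some(url)`, apply `loader.use_fips(false).endpoint_url(url)` exactly as in the snippet above.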

Limitations

  • Only AWS_ENDPOINT_URL_STS is honored, not the more general AWS_ENDPOINT_URL. The narrower env var is sufficient for the GovCloud use case and matches what we observed boto3 honoring in the same code path.
  • The fix is in the IAM credentials-provider loader only; if a user constructs a custom credentials provider externally, this code path is bypassed entirely (and the bug was never present in that case).

Testing

  • Added regression test test_get_signing_identity_honors_aws_endpoint_url_sts in glide-core/src/iam/mod.rs covering both a populated and an empty AWS_ENDPOINT_URL_STS value. The test exercises the new loader path with static credentials supplied via AWS_ACCESS_KEY_ID so no actual STS call is made, but it proves the code is plumbed through without breaking the happy path. Runs in the existing #[serial] test group to avoid env-var races with other tests.
  • Validated runtime behavior against AWS GovCloud (us-gov-west-1) ElastiCache Serverless with a real IRSA setup. Before the fix: credential acquisition hangs and IAM auth times out with Connection error: Timeout. After the fix: PING/PONG completes in ~2s, and KV/publisher/subscriber clients all connect cleanly.
  • Local checks (all pass):
    • cargo fmt --check (from glide-core/) — clean
    • cargo clippy --all-targets -- -D warnings (from glide-core/) — exit 0, no warnings
    • cargo test --lib iam:: (from glide-core/) — 9 passed, 0 failed (8 pre-existing + the new regression test)
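
For readers unfamiliar with why the regression test must run serially: tests that set process-wide env vars can leak state into each other. One common pattern is a scoped guard that restores the previous value on drop. The sketch below is an assumption about how such a helper could look; `EnvVarGuard` is a hypothetical name and this is not the test code from the PR.

```rust
use std::env;
use std::ffi::OsString;

/// Hypothetical test helper: sets an env var and restores the previous
/// value (or removes the var) when the guard goes out of scope.
struct EnvVarGuard {
    key: &'static str,
    previous: Option<OsString>,
}

impl EnvVarGuard {
    fn set(key: &'static str, value: &str) -> Self {
        let previous = env::var_os(key);
        // SAFETY (assumption): callers serialize env access, e.g. via #[serial].
        // The unsafe block is required in the 2024 edition and harmless before it.
        unsafe { env::set_var(key, value) };
        EnvVarGuard { key, previous }
    }
}

impl Drop for EnvVarGuard {
    fn drop(&mut self) {
        match self.previous.take() {
            // The var existed before the test: put the old value back.
            Some(old) => unsafe { env::set_var(self.key, old) },
            // The var was unset before the test: remove it again.
            None => unsafe { env::remove_var(self.key) },
        }
    }
}
```

The guard only limits leakage between tests; it does not make concurrent env access safe, so serial execution is still needed.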

Checklist

  • This Pull Request is related to one issue.
  • Commit message has a detailed description of what changed and why.
  • Tests are added or updated.
  • CHANGELOG.md and documentation files are updated.
  • Linters have been run — cargo fmt --check, cargo clippy --all-targets -- -D warnings, and cargo test --lib iam:: all pass from glide-core/. (No Prettier-relevant files changed.)
  • Destination branch is correct — main.
  • Create merge commit if merging release branch into main, squash otherwise. — Squash.

In AWS partitions where a separate FIPS STS hostname is not published
(e.g. us-gov-west-1, where the standard sts.us-gov-west-1.amazonaws.com
is itself FIPS-validated and is the only STS endpoint in the region),
the default credential provider's internal STS client otherwise
constructs a non-existent sts-fips.<region>.amazonaws.com whenever
AWS_USE_FIPS_ENDPOINT=true is set. DNS resolution fails, credential
acquisition hangs, and ElastiCache/MemoryDB IAM authentication times
out with 'Connection error: Timeout'.

Honor an explicit AWS_ENDPOINT_URL_STS override on the SDK config
loader when building the credentials provider. The Python SDK (boto3)
already threads AWS_ENDPOINT_URL_STS into both direct STS calls and
the credentials-provider STS client; this mirrors that behavior for
the Rust SDK loader.

Also explicitly disable FIPS on this loader: even when the override
is provided, the AWS SDK's endpoint resolver fails fast when a FIPS
partition is requested but the resolved (or user-provided) endpoint
is not on the SDK's hard-coded FIPS endpoint list. Disabling FIPS
here is safe because the override is scoped to credential acquisition;
SigV4 presigning of the actual ElastiCache/MemoryDB connect request
happens separately via aws-sigv4 and is unaffected. The user remains
responsible for pointing AWS_ENDPOINT_URL_STS at a FIPS-validated
endpoint where required.

Adds a regression test exercising the new code path with both a
populated and an empty AWS_ENDPOINT_URL_STS.

Signed-off-by: Edward Park <edwardpark97@gmail.com>

edwardpark97 requested a review from a team as a code owner (May 17, 2026 06:26)

Comment thread glide-core/src/iam/mod.rs
// the SDK endpoint resolver rejects user URLs not on its FIPS list. Scoped
// to this loader; SigV4 presigning is unaffected. See valkey-io/valkey-glide#5967.
if let Ok(sts_endpoint) = std::env::var("AWS_ENDPOINT_URL_STS")
    && !sts_endpoint.is_empty()
Collaborator

Do we need to also check for whitespace-only values here?
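
If whitespace-only values should be treated as unset, one possible shape for the check is to trim before testing for emptiness. This is a sketch, not code from the PR; `normalize_endpoint` is a hypothetical helper name.

```rust
/// Hypothetical variant of the guard: trims surrounding whitespace and
/// treats an empty or whitespace-only value as "not set".
fn normalize_endpoint(raw: &str) -> Option<&str> {
    let trimmed = raw.trim();
    if trimmed.is_empty() {
        None
    } else {
        Some(trimmed)
    }
}
```

Returning the trimmed slice also strips a stray trailing newline from templated values before the string reaches endpoint_url().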

Comment thread glide-core/src/iam/mod.rs
if let Ok(sts_endpoint) = std::env::var("AWS_ENDPOINT_URL_STS")
    && !sts_endpoint.is_empty()
{
    loader = loader.use_fips(false).endpoint_url(sts_endpoint);
Collaborator

The endpoint URL is passed directly without validating it uses HTTPS. An http:// URL would send credential requests over plaintext, potentially exposing STS session tokens.
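
A minimal scheme check could look like the sketch below. It uses only the standard library, and `is_https_endpoint` is a hypothetical name, not something in the PR. Whether a non-HTTPS override should be rejected, ignored, or only logged is a decision for the maintainers, since a plain http:// endpoint may be intentional against a local test stub.

```rust
/// Hypothetical check: accept only endpoint overrides that use the https
/// scheme and have something after the "https://" prefix.
fn is_https_endpoint(url: &str) -> bool {
    const PREFIX: &str = "https://";
    // URL schemes are case-insensitive, so compare the prefix ignoring ASCII case.
    // `get` returns None if the string is too short or the cut is not on a
    // character boundary, so this never panics on unusual input.
    url.len() > PREFIX.len()
        && url
            .get(..PREFIX.len())
            .map_or(false, |p| p.eq_ignore_ascii_case(PREFIX))
}
```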

Comment thread CHANGELOG.md
## Pending 2.4

#### Fixes
* CORE: Honor `AWS_ENDPOINT_URL_STS` in the IAM credentials-provider loader so ElastiCache/MemoryDB IAM auth works in AWS partitions that do not publish a separate FIPS STS hostname (e.g. `us-gov-west-1`). Previously, setting `AWS_USE_FIPS_ENDPOINT=true` made the SDK construct a non-existent `sts-fips.<region>.amazonaws.com`, causing credential acquisition to hang. Matches `boto3` behavior. ([#5967](https://github.com/valkey-io/valkey-glide/issues/5967))
Collaborator

Since we have already released version 2.4.0 of GLIDE, there is a merge conflict here. Can you please move this entry under Pending 2.5?
