fix: mitigate STS assume role throttling in Kafka buffer #6634

Merged
dinujoh merged 3 commits into opensearch-project:main from dinujoh:fix/sts-throttling-kafka-buffer
Mar 13, 2026

@dinujoh dinujoh commented Mar 12, 2026

Member

Description

Prevent excessive STS AssumeRole calls when customers delete their IAM role or misconfigure the trust policy. Previously, one customer could generate 12,000 STS calls in 4 minutes due to unbounded retries of non-retryable AccessDeniedException errors.
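
For scale, 12,000 calls in 4 minutes is roughly 50 calls per second. The sketch below illustrates the fail-fast idea under simplifying assumptions: an STS call is modelled as a supplier that returns an HTTP status code, and the loop stops as soon as it sees a 403. The class and method names (`FailFastSketch`, `callsMade`) are hypothetical and are not taken from the Data Prepper code; the real change lives in `KafkaSecurityConfigurer`.

```java
import java.util.function.IntSupplier;

/** Hypothetical sketch of fail-fast retry handling; not the Data Prepper implementation. */
class FailFastSketch {
    static final int HTTP_OK = 200;
    static final int HTTP_FORBIDDEN = 403;

    /**
     * Invokes the (simulated) STS call until it succeeds, returns 403,
     * or maxAttempts is reached. Returns the number of calls made.
     */
    static int callsMade(IntSupplier stsCall, int maxAttempts) {
        int calls = 0;
        while (calls < maxAttempts) {
            int status = stsCall.getAsInt();
            calls++;
            if (status == HTTP_OK) {
                break; // success, nothing left to retry
            }
            if (status == HTTP_FORBIDDEN) {
                break; // AccessDenied will not fix itself, so stop immediately
            }
            // Any other status is treated as retryable here. Real code would
            // sleep with a backoff before the next attempt.
        }
        return calls;
    }

    public static void main(String[] args) {
        System.out.println(callsMade(() -> 403, 360)); // prints 1
        System.out.println(callsMade(() -> 500, 360)); // prints 360
    }
}
```

With a permanently failing 403, the loop makes one call instead of exhausting all 360 attempts.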

Changes:

  • KafkaSecurityConfigurer: Fail fast on STS 403 (AccessDenied) in getBootStrapServersForMsk() instead of retrying 360 times
  • KafkaSecurityConfigurer: Replace fixed 10s retry sleep with exponential backoff (10s to 10min max) for retryable STS and Kafka errors
  • KafkaCustomConsumer: Replace fixed 10s retry with exponential backoff using Kafka's ExponentialBackoff (10s to 10min max) for AuthenticationException errors
  • KafkaCustomConsumer: Use Duration constants for backoff readability
  • KafkaCustomConsumer: Reset backoff counter on successful poll to handle transient errors gracefully
  • Add exponential backoff to outer run() exception handler
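
The backoff schedule described in the list above can be sketched as follows. This is a plain-Java illustration and not Kafka's `ExponentialBackoff` class: it assumes a multiplier of 2, omits jitter, and uses hypothetical names (`BackoffSketch`, `delayFor`, `onFailure`, `onSuccess`). Only the 10s initial delay and the 10min cap come from this PR.

```java
import java.time.Duration;

/** Hypothetical sketch of capped exponential backoff with reset on success. */
class BackoffSketch {
    static final Duration INITIAL = Duration.ofSeconds(10);
    static final Duration MAX = Duration.ofMinutes(10);

    private int attempts = 0;

    /** Delay for a zero-based attempt: INITIAL * 2^attempt, capped at MAX. */
    static Duration delayFor(int attempt) {
        // Clamp the shift so very large attempt counts cannot overflow the long.
        int shift = Math.min(attempt, 20);
        long millis = INITIAL.toMillis() << shift;
        return Duration.ofMillis(Math.min(millis, MAX.toMillis()));
    }

    /** Returns the delay to wait after a failure and advances the counter. */
    Duration onFailure() {
        return delayFor(attempts++);
    }

    /** A successful poll resets the schedule back to the initial delay. */
    void onSuccess() {
        attempts = 0;
    }

    public static void main(String[] args) {
        BackoffSketch backoff = new BackoffSketch();
        for (int i = 0; i < 8; i++) {
            // prints 10, 20, 40, 80, 160, 320, 600, 600 (one per line)
            System.out.println(backoff.onFailure().getSeconds());
        }
        backoff.onSuccess();
        System.out.println(backoff.onFailure().getSeconds()); // prints 10
    }
}
```

Under these assumptions the delay reaches the 10-minute cap on the seventh consecutive failure, and a single successful poll returns the next failure to the 10-second delay.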

Check List

  • New functionality includes testing.
  • New functionality has a documentation issue. Please link to it in this PR.
    • New functionality has javadoc added
  • Commits are signed with a real name per the DCO

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Prevent excessive STS AssumeRole calls when customers delete their IAM
role or misconfigure the trust policy. Previously, one pipeline could
generate 12,000 STS calls in 4 minutes due to unbounded retries of
non-retryable AccessDeniedException errors.

Changes:
- KafkaSecurityConfigurer: Fail fast on STS 403 (AccessDenied) in
  getBootStrapServersForMsk() instead of retrying 360 times
- KafkaCustomConsumer: Replace fixed 10s retry with exponential backoff
  (10s to 5min max) for AuthenticationException errors
- KafkaCustomConsumer: Stop consumer after 50 consecutive auth failures
  to prevent indefinite STS hammering
- KafkaCustomConsumer: Reset backoff and failure counter on successful
  poll to handle transient errors gracefully
- Add exponential backoff to outer run() exception handler

Signed-off-by: Dinu John <86094133+dinujoh@users.noreply.github.com>
sb2k16 previously approved these changes Mar 12, 2026
dinujoh added 2 commits March 12, 2026 17:49
Signed-off-by: Dinu John <86094133+dinujoh@users.noreply.github.com>
…ation constants, remove silent shutdown

Signed-off-by: Dinu John <86094133+dinujoh@users.noreply.github.com>
@dinujoh dinujoh force-pushed the fix/sts-throttling-kafka-buffer branch from d029b97 to d44e9e9 Compare March 12, 2026 22:52

@dlvenable dlvenable left a comment

Member

Thanks @dinujoh for this fix!

@dinujoh dinujoh merged commit 7cb72ca into opensearch-project:main Mar 13, 2026
74 of 82 checks passed
@dinujoh dinujoh deleted the fix/sts-throttling-kafka-buffer branch March 13, 2026 16:00

3 participants