feat: retry transient 401/403 auth errors with exponential backoff #20

Merged
Rolf Håvard Blindheim (rhblind) merged 1 commit into main from feat/auth-retry on Jun 18, 2026
Conversation

@rhblind

Summary

  • Add configurable max_auth_retries option (default 3) so the producer retries transient 401/403 auth errors during Splunk rolling updates instead of immediately stopping the pipeline
  • Retries use exponential backoff: the delay before retry n is base interval * 2^n, capped at 30s or the base interval, whichever is larger (so a base interval above 30s is never shortened)
  • Emit [:off_broadway_splunk, :auth, :error] telemetry event on each auth error for observability
  • No configuration changes required: existing producers get the new retry behavior by default; set max_auth_retries: 1 to restore the previous immediate-stop behavior
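
For reviewers who want to sanity-check the backoff rule described above, here is a minimal sketch of the arithmetic. It is not the code from this PR: the module name `BackoffSketch`, the argument names, and the assumption that intervals are in milliseconds are all illustrative; only the formula (base interval * 2^n, capped at the larger of 30s and the base interval) comes from the summary.

```elixir
defmodule BackoffSketch do
  # Hypothetical module, for illustration only.
  # Assumption: intervals are expressed in milliseconds.
  @max_backoff_ms 30_000

  # Returns base_ms * 2^attempt, capped at max(30s, base_ms).
  def backoff_interval(base_ms, attempt)
      when is_integer(base_ms) and base_ms >= 0 and
             is_integer(attempt) and attempt >= 0 do
    # The cap never drops below the base interval, so a base above 30s
    # is returned unchanged instead of being shortened.
    cap = max(@max_backoff_ms, base_ms)
    min(base_ms * Integer.pow(2, attempt), cap)
  end
end

# With a 5s base the delay doubles until it reaches the 30s cap.
for attempt <- 0..3 do
  IO.puts("attempt #{attempt}: #{BackoffSketch.backoff_interval(5_000, attempt)} ms")
end
```

Under these assumptions a 5s base yields 5s, 10s, 20s, then 30s, while a 60s base stays at 60s for every attempt because the cap is the base itself. A zero base always yields zero.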

Closes #19

Test plan

  • 4 validation tests for max_auth_retries option (negative, zero, default, positive)
  • 4 unit tests for backoff_interval/2 (doubling, cap when base < 30s, cap when base > 30s, zero base)
  • Transient 401 recovery test (2 failures then success)
  • Exceeding max retries stops the producer
  • Legacy behavior test (max_auth_retries: 1 stops immediately)
  • Telemetry event emission test
  • Existing 401/403 tests still pass unchanged
  • Full suite: 45 tests, 0 failures
  • Dialyzer, Credo, format checks pass

Add configurable max_auth_retries (default 3) so the producer survives
transient auth failures during Splunk rolling updates instead of stopping
immediately. Retries use exponential backoff capped at 30s. Emits
[:off_broadway_splunk, :auth, :error] telemetry on each auth error.

Closes #19

Approved

@rhblind Rolf Håvard Blindheim (rhblind) merged commit 49b4dbc into main Jun 18, 2026
11 checks passed
Development

Successfully merging this pull request may close these issues.

Producer crashes on transient 401/403 during Splunk rolling updates