
[Feat][Router]: Add automatic retry with exponential backoff and jitter#939

Open
ikaadil wants to merge 71 commits into vllm-project:main from ikaadil:request-retry


@ikaadil

@ikaadil ikaadil commented May 1, 2026

Contributor

Overview

Implements automatic retry for transient HTTP failures with configurable exponential backoff and jitter. This enhances the router's resilience by automatically retrying failed requests instead of immediately returning errors to clients.

Details: https://medium.com/@avnein4988/mitigating-the-thundering-herd-problem-exponential-backoff-with-jitter-b507cdf90d62

Changes

Core Implementation

  • RetryConfig dataclass: Configurable retry parameters with exponential backoff calculation
  • Retry logic: Integrated into request processing flow in route_general_request()
  • Retryable status detection: Function to identify transient failures
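The retry flow described above can be sketched as a loop that tries an endpoint, stops on a non-retryable status, and otherwise excludes the failed URL and backs off before the next attempt. This is a minimal, hypothetical sketch, not the code in route_general_request(): `send_with_retries`, the `send` callable, and `delay_ms_for` are illustrative names, and `max_attempts` counts total attempts, following the commit note that the retry setting means total attempts rather than retries.

```python
import asyncio
from typing import Awaitable, Callable, Optional, Sequence

# Transient statuses listed in the PR description.
RETRYABLE = frozenset({408, 429, 500, 502, 503, 504})


async def send_with_retries(
    endpoints: Sequence[str],
    send: Callable[[str], Awaitable[int]],  # hypothetical: returns an HTTP status
    max_attempts: int,                       # total attempts, including the first
    delay_ms_for: Callable[[int], float],    # backoff for retry index 0, 1, ...
) -> tuple[Optional[str], int]:
    """Return (url, status) of the first non-retryable response, or of the
    last attempt if every attempt failed with a retryable status."""
    error_urls: set[str] = set()
    url: Optional[str] = None
    status = 503  # assumed result when no endpoint is available at all
    for attempt in range(max_attempts):
        # Skip backends that already failed with a retryable status.
        remaining = [u for u in endpoints if u not in error_urls]
        if not remaining:
            break
        if attempt > 0:
            # Back off before every retry, but not before the first attempt.
            await asyncio.sleep(delay_ms_for(attempt - 1) / 1000.0)
        url = remaining[0]  # stand-in for the configured routing logic
        status = await send(url)
        if status not in RETRYABLE:
            return url, status
        error_urls.add(url)
    return url, status
```

Because failed URLs are excluded, a deployment with a single backend gets only one attempt in this sketch, which is the trade-off raised in the automated review below.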

CLI Arguments

Added retry configuration options:

  • --retry-max-retries (default: 5)
  • --retry-initial-backoff-ms (default: 50)
  • --retry-max-backoff-ms (default: 30000)
  • --retry-backoff-multiplier (default: 1.5)
  • --retry-jitter-factor (default: 0.2)
  • --disable-retries: Disable all retries
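A minimal sketch of how these flags could be parsed into a RetryConfig dataclass, using the flag names and defaults listed above. The field names, the validation rules, and the choice that `--disable-retries` forces zero retries are assumptions: the commit log only says the CLI inputs are validated. This follows the flag list as written in the description; per the later review thread, retries were subsequently made opt-in, which this sketch does not model.

```python
import argparse
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryConfig:
    max_retries: int = 5
    initial_backoff_ms: int = 50
    max_backoff_ms: int = 30000
    backoff_multiplier: float = 1.5
    jitter_factor: float = 0.2

    def __post_init__(self) -> None:
        # Assumed validation rules; the PR does not spell them out.
        if self.max_retries < 0:
            raise ValueError("max_retries must be >= 0")
        if self.initial_backoff_ms <= 0 or self.max_backoff_ms <= 0:
            raise ValueError("backoff values must be positive")
        if self.initial_backoff_ms > self.max_backoff_ms:
            raise ValueError("initial_backoff_ms must not exceed max_backoff_ms")
        if self.backoff_multiplier < 1.0:
            raise ValueError("backoff_multiplier must be >= 1.0")
        if not 0.0 <= self.jitter_factor <= 1.0:
            raise ValueError("jitter_factor must be in [0, 1]")


def add_retry_args(parser: argparse.ArgumentParser) -> None:
    # Flag names and defaults as listed in the PR description.
    parser.add_argument("--retry-max-retries", type=int, default=5)
    parser.add_argument("--retry-initial-backoff-ms", type=int, default=50)
    parser.add_argument("--retry-max-backoff-ms", type=int, default=30000)
    parser.add_argument("--retry-backoff-multiplier", type=float, default=1.5)
    parser.add_argument("--retry-jitter-factor", type=float, default=0.2)
    parser.add_argument("--disable-retries", action="store_true")


def retry_config_from_args(args: argparse.Namespace) -> RetryConfig:
    return RetryConfig(
        # Assumption: --disable-retries overrides any configured retry count.
        max_retries=0 if args.disable_retries else args.retry_max_retries,
        initial_backoff_ms=args.retry_initial_backoff_ms,
        max_backoff_ms=args.retry_max_backoff_ms,
        backoff_multiplier=args.retry_backoff_multiplier,
        jitter_factor=args.retry_jitter_factor,
    )
```

Validating in `__post_init__` means an invalid combination fails at startup instead of surfacing as odd delays under load.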

Key Features

Retryable Status Codes

Automatically retries on transient failures:

  • 408 - Request Timeout
  • 429 - Too Many Requests
  • 500 - Internal Server Error
  • 502 - Bad Gateway
  • 503 - Service Unavailable
  • 504 - Gateway Timeout
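The status check above reduces to a set lookup. A minimal sketch, assuming a helper named `is_retryable_status` (the commit log mentions tests for a function of that name, but its exact signature is not shown here):

```python
# Transient failures from the list above; other codes are returned as-is.
RETRYABLE_STATUS_CODES = frozenset(
    {
        408,  # Request Timeout
        429,  # Too Many Requests
        500,  # Internal Server Error
        502,  # Bad Gateway
        503,  # Service Unavailable
        504,  # Gateway Timeout
    }
)


def is_retryable_status(status_code: int) -> bool:
    return status_code in RETRYABLE_STATUS_CODES
```

Other 4xx responses such as 400 or 404 are excluded on purpose: resending an identical request would fail the same way.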

Exponential Backoff with Jitter

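Prevents a thundering herd through randomized delays:
Formula: delay = min(initial_backoff_ms × multiplier^attempt, max_backoff_ms)
With jitter: delay' = delay × (1 + U[-j, +j]), where U[-j, +j] is a value drawn uniformly from [-j, +j] and j is the jitter factor
Example with defaults (nominal delays before jitter; with the default jitter factor of 0.2, each actual delay falls within ±20% of these values):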

  • Attempt 0: ~50ms
  • Attempt 1: ~75ms
  • Attempt 2: ~112ms
  • Attempt 3: ~168ms
  • Attempt 4: ~253ms
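The formula above can be written as a small function. This is a sketch of the formula as stated, not the PR's implementation; `backoff_delay_ms` and its optional `rng` parameter are hypothetical names.

```python
import random
from typing import Optional


def backoff_delay_ms(
    attempt: int,
    initial_backoff_ms: float = 50.0,
    max_backoff_ms: float = 30000.0,
    multiplier: float = 1.5,
    jitter_factor: float = 0.2,
    rng: Optional[random.Random] = None,
) -> float:
    # delay = min(initial_backoff_ms * multiplier^attempt, max_backoff_ms)
    nominal = min(initial_backoff_ms * multiplier**attempt, max_backoff_ms)
    if jitter_factor == 0:
        return nominal
    # delay' = delay * (1 + U[-j, +j]), with U drawn uniformly.
    rng = rng if rng is not None else random.Random()
    return nominal * (1.0 + rng.uniform(-jitter_factor, jitter_factor))
```

With `jitter_factor=0`, attempts 0 through 4 give 50, 75, 112.5, 168.75 and 253.125 ms, matching the list above. Because the formula applies jitter after the cap, a capped delay can exceed `max_backoff_ms` by up to the jitter factor; clamping again after jitter is an alternative if a hard ceiling matters.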

Backward Compatibility

  • Replaces the existing max_instance_failover_reroute_attempts behavior
  • Falls back to a single attempt when set to 0
  • No other breaking changes to existing functionality

Usage

vllm-router --port 8000 \
    --service-discovery static \
    --static-backends "http://localhost:9001,http://localhost:9002" \
    --static-models "facebook/opt-125m,facebook/opt-125m" \
    --routing-logic roundrobin \
    --retry-max-retries 5 \
    --retry-initial-backoff-ms 100 \
    --retry-max-backoff-ms 60000 \
    --retry-backoff-multiplier 2.0 \
    --retry-jitter-factor 0.1
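As a worked check of the usage example, the nominal (pre-jitter) delays for these flags follow from the formula in the description; with `--retry-jitter-factor 0.1`, each actual delay lands within ±10% of these values.

```python
# Flags from the usage example:
# --retry-initial-backoff-ms 100, --retry-backoff-multiplier 2.0,
# --retry-max-backoff-ms 60000, --retry-jitter-factor 0.1
initial_ms, multiplier, cap_ms = 100, 2.0, 60000

# Nominal delays for retry indices 0..4.
nominal = [min(initial_ms * multiplier**n, cap_ms) for n in range(5)]

# First retry index whose nominal delay reaches the cap.
first_capped = next(n for n in range(64) if initial_ms * multiplier**n >= cap_ms)

print(nominal)       # [100.0, 200.0, 400.0, 800.0, 1600.0]
print(sum(nominal))  # 3100.0
print(first_capped)  # 10
```

The 60000 ms cap is never reached with this configuration. Whether all five of these delays or only the first four are used depends on whether `--retry-max-retries` counts retries or total attempts; the commit notes describe it as total attempts.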

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor


Code Review

This pull request introduces a retry mechanism with exponential backoff and jitter for transient failures, aligning the router's behavior with the sglang model gateway. The changes include a new RetryConfig dataclass, CLI arguments for configuration, updated documentation, and logic in the request service to handle retryable HTTP status codes (408, 429, 500, 502, 503, 504). Feedback identifies a logic error where retries are effectively disabled by default due to the max_attempts calculation, the inclusion of an unused last_response variable, and a concern that blacklisting URLs for transient errors prevents retrying the same backend in single-node environments.
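The single-node concern can be illustrated with a small endpoint-selection helper. This is a hypothetical sketch of one possible mitigation, not what the PR does (the commit log says remaining endpoints are filtered solely on error_urls): it prefers backends that have not failed yet, but falls back to the full list when every backend has failed, so a single backend can still be retried after backoff.

```python
from typing import Sequence


def candidate_endpoints(all_urls: Sequence[str], error_urls: set[str]) -> list[str]:
    """Hypothetical helper: prefer backends that have not failed yet; if every
    backend has failed (e.g. a single-node deployment), retry all of them."""
    remaining = [u for u in all_urls if u not in error_urls]
    return remaining if remaining else list(all_urls)
```

The cost of this choice is that a persistently failing single backend receives every retry, so the backoff delays carry the full burden of easing load on it.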

Comment thread src/vllm_router/services/request_service/request.py Outdated
Comment thread src/vllm_router/services/request_service/request.py Outdated
Comment thread src/vllm_router/services/request_service/request.py
ikaadil and others added 28 commits May 2, 2026 10:07
Signed-off-by: Ifta Khairul Alam Adil <ikaadil007@gmail.com>
Increase timeout values in e2e test workflow

Signed-off-by: Rui Zhang <51696593+ruizhang0101@users.noreply.github.com>
Signed-off-by: Ifta Khairul Alam Adil <ikaadil007@gmail.com>
…te attempts (vllm-project#839)

* Add max instance failover reroute attempts configuration

- Introduced a new command-line argument  to specify the number of reroute attempts for failed requests.
- Updated the routing logic to utilize this new configuration, allowing for better handling of request failures.
- Enhanced the request routing service to incorporate the maximum reroute attempts in its logic.

This change improves the robustness of the routing mechanism by allowing for configurable failover behavior.

Signed-off-by: Ifta Khairul Alam Adil <ikaadil007@gmail.com>

* Add command-line argument for LMCache health check interval

Signed-off-by: Ifta Khairul Alam Adil <ikaadil007@gmail.com>

* Refactor routing logic to directly set max instance failover reroute attempts

- Removed the set_max_instance_failover_reroute_attempts method and directly assigned the value to the router's attribute.
- Simplified the request routing logic by consolidating endpoint filtering and error handling, improving readability and maintainability.

This change enhances the clarity of the routing logic and streamlines the handling of reroute attempts for failed requests.

Signed-off-by: Ifta Khairul Alam Adil <ikaadil007@gmail.com>

* Add unit tests for instance failover routing logic

Signed-off-by: Ifta Khairul Alam Adil <ikaadil007@gmail.com>

* Trigger pipeline

Signed-off-by: Ifta Khairul Alam Adil <ikaadil007@gmail.com>

* Refactor request routing to improve request tracking

- Moved the tracking of valid incoming requests to a more appropriate location in the routing logic.
- Simplified the retrieval of endpoint information by ensuring it is called only once, enhancing code clarity.

This change improves the maintainability of the request routing service.

Signed-off-by: Ifta Khairul Alam Adil <ikaadil007@gmail.com>

* Add space

Signed-off-by: Ifta Khairul Alam Adil <ikaadil007@gmail.com>

* Add the comments

Signed-off-by: Ifta Khairul Alam Adil <ikaadil007@gmail.com>

* Add empty line

Signed-off-by: Ifta Khairul Alam Adil <ikaadil007@gmail.com>

* Add comment

Signed-off-by: Ifta Khairul Alam Adil <ikaadil007@gmail.com>

* Addressed the comment

Signed-off-by: Ifta Khairul Alam Adil <ikaadil007@gmail.com>

* Fix the log

Signed-off-by: Ifta Khairul Alam Adil <ikaadil007@gmail.com>

* Add comment

Signed-off-by: Ifta Khairul Alam Adil <ikaadil007@gmail.com>

* Resolve conflict

Signed-off-by: Ifta Khairul Alam Adil <ikaadil007@gmail.com>

---------

Signed-off-by: Ifta Khairul Alam Adil <ikaadil007@gmail.com>
Signed-off-by: Ifta khairul Alam Adil <25082512+ikaadil@users.noreply.github.com>
Co-authored-by: Rui Zhang <51696593+ruizhang0101@users.noreply.github.com>
Signed-off-by: Ifta Khairul Alam Adil <ikaadil007@gmail.com>
…m-project#847)

Signed-off-by: enneitex <etienne.divet@gmail.com>
Co-authored-by: Rui Zhang <51696593+ruizhang0101@users.noreply.github.com>
Signed-off-by: Ifta Khairul Alam Adil <ikaadil007@gmail.com>
…llm-project#760) (vllm-project#844)

Allow the router to be served under a subpath (e.g. /vllm) by passing
root_path through to uvicorn. Also adds Helm chart support via
routerSpec.rootPath.

Signed-off-by: Keyu Chen <54015474+keyuchen21@users.noreply.github.com>
Co-authored-by: Rui Zhang <51696593+ruizhang0101@users.noreply.github.com>
Signed-off-by: Ifta Khairul Alam Adil <ikaadil007@gmail.com>
…t to INFO. (vllm-project#846)

* Expose LMCache log level as configurable Helm value and default to INFO.

Signed-off-by: nargit <NargiT@users.noreply.github.com>

* Fix names

Signed-off-by: NargiT <NargiT@users.noreply.github.com>

* fix tests and code

Signed-off-by: NargiT <NargiT@users.noreply.github.com>

* fix test for ray-cluster

Signed-off-by: NargiT <NargiT@users.noreply.github.com>

* fix typo

Signed-off-by: NargiT <NargiT@users.noreply.github.com>

* fix yet another typo

Signed-off-by: NargiT <NargiT@users.noreply.github.com>

* update doc

Signed-off-by: NargiT <NargiT@users.noreply.github.com>

---------

Signed-off-by: nargit <NargiT@users.noreply.github.com>
Signed-off-by: NargiT <NargiT@users.noreply.github.com>
Co-authored-by: nargit <NargiT@users.noreply.github.com>
Co-authored-by: Rui Zhang <51696593+ruizhang0101@users.noreply.github.com>
Signed-off-by: Ifta Khairul Alam Adil <ikaadil007@gmail.com>
…ect#849)

* feat(router): add --log-format json option for structured logging

Add a JsonFormatter that outputs log records as JSON with timestamp,
level, logger, message, filename, and lineno fields. The new
--log-format flag (choices: text, json) controls the output format
for both the router loggers and uvicorn.

Signed-off-by: Keyu Chen <54015474+keyuchen21@users.noreply.github.com>

* test: add tests for JsonFormatter and --log-format parser arg

Add TestJsonFormatter class covering JSON output validation, exception
inclusion/exclusion, format switching via set_log_format, and
init_logger format respect. Add parser tests verifying --log-format
defaults to text and accepts json. Update README logging options
documentation.

Signed-off-by: Keyu Chen <54015474+keyuchen21@users.noreply.github.com>

* refactor: address review feedback for JsonFormatter

Instantiate formatter once outside the loop in set_log_format to avoid
redundant allocations. Add stack_info support and default=str fallback
to JsonFormatter for robustness. Add tests for stack_info inclusion and
non-serializable object handling.

Signed-off-by: Keyu Chen <54015474+keyuchen21@users.noreply.github.com>

* style: fix black formatting in test files

Signed-off-by: Keyu Chen <54015474+keyuchen21@users.noreply.github.com>

---------

Signed-off-by: Keyu Chen <54015474+keyuchen21@users.noreply.github.com>
Signed-off-by: Rui Zhang <51696593+ruizhang0101@users.noreply.github.com>
Co-authored-by: Rui Zhang <51696593+ruizhang0101@users.noreply.github.com>
Signed-off-by: Ifta Khairul Alam Adil <ikaadil007@gmail.com>
* [Router][Image Edit]: routing multi-part form requests

Signed-off-by: Nuno Ramos <nmiguel123@gmail.com>

* [Router][Refactor]: abstraction for proxying multipart form requests

Signed-off-by: Nuno Ramos <nmiguel123@gmail.com>

---------

Signed-off-by: Nuno Ramos <nmiguel123@gmail.com>
Co-authored-by: Nuno Ramos <nmiguel123@gmail.com>
Signed-off-by: Ifta Khairul Alam Adil <ikaadil007@gmail.com>
Signed-off-by: Rui Zhang <51696593+ruizhang0101@users.noreply.github.com>
Signed-off-by: Ifta Khairul Alam Adil <ikaadil007@gmail.com>
* feat(helm) add pdb and expose various options in the values. Add tests

Signed-off-by: enneitex <etienne.divet@gmail.com>

* feat(helm) update README and json schema with new fields

Signed-off-by: enneitex <etienne.divet@gmail.com>

---------

Signed-off-by: enneitex <etienne.divet@gmail.com>
Co-authored-by: Rui Zhang <51696593+ruizhang0101@users.noreply.github.com>
Signed-off-by: Ifta Khairul Alam Adil <ikaadil007@gmail.com>
…#834)

Includes:
- GPU Bare-Metal node orchestration
- Secure Traefik ingress + TLS Endpoints (cert-manager)
- Prometheus + Grafana monitoring
- Built-in vLLM production stack + Vllm inference dashboards
- Terraform + Helm integration

Signed-off-by: Kosseila (CloudThrill) <klouddude@gmail.com>
Signed-off-by: Ifta Khairul Alam Adil <ikaadil007@gmail.com>
…roject#777)

* [Feat][Router] Add disaggregated prefill orchestrated routing

Implements support for disaggregated prefill as outlined in the 2025 Q1 roadmap.
This enables prefill/decode disaggregation with router-orchestrated KV cache transfer.

Closes vllm-project#26

Signed-off-by: Yahav <yahavb@amazon.com>

* [CI/Build] Lower Python version requirement to 3.10 for Neuron SDK compatibility

Signed-off-by: Yahav <yahavb@amazon.com>

* [Feat][Router] Address PR review feedback for disaggregated prefill orchestrated routing

- Remove dead code (handle_orchestrated_request method in routing_logic.py)
- Fix prefill request to use max_tokens=1 per proposal spec
- Use shared aiohttp client instead of creating new session per request
- Fix streaming to yield chunks immediately (true streaming)
- Remove redundant isinstance check for DisaggregatedPrefillOrchestratedRouter
- Use router's _find_endpoints method to avoid code duplication

Signed-off-by: Yahav <yahavb@amazon.com>

* fix: use kv_transfer_params instead of disagg_prefill_resp

- Add kv_transfer_params to prefill request to enable disaggregated mode
- Extract kv_transfer_params from prefill response and forward to decode
- Set remote_host to prefill endpoint for KV cache retrieval

Signed-off-by: Yahav <yahavb@amazon.com>

* docs: add example for disaggregated_prefill_orchestrated mode

- Add README with usage instructions and configuration notes
- Add sanitized Kubernetes manifests (router, prefill, decode)
- Include example curl command and expected router logs

Signed-off-by: Yahav <yahavb@amazon.com>

* style: fix black formatting

Signed-off-by: Yahav <yahavb@amazon.com>

* style: fix markdownlint errors in README.md

Signed-off-by: Yahav <yahavb@amazon.com>

* style: fix markdownlint errors in proposal doc

Signed-off-by: Yahav <yahavb@amazon.com>

* docs: clean up DisaggregatedPrefillOrchestratedRouter docstring

Signed-off-by: Yahav <yahavb@amazon.com>

* feat: return 503 with distinct codes for prefill/decode unavailability

- PREFILL_SERVICE_UNAVAILABLE: No prefill endpoints discovered
- DECODE_SERVICE_UNAVAILABLE: No decode endpoints discovered

This allows automated tests to distinguish transient startup issues from real bugs.

Signed-off-by: Yahav <yahavb@amazon.com>

* revert: restore requires-python = 3.12

Signed-off-by: Yahav <yahavb@amazon.com>

* fix: replace angle bracket placeholders with uppercase format

Angle brackets like <your-pvc-name> are interpreted as shell redirections
by shellcheck, causing CI failures. Use uppercase format instead:
YOUR-PVC-NAME, YOUR-MODEL-PATH, etc.

Signed-off-by: Yahav <yahavb@amazon.com>

* fix: remove trailing whitespace from YAML files

Signed-off-by: Yahav <yahavb@amazon.com>

---------

Signed-off-by: Yahav <yahavb@amazon.com>
Co-authored-by: Rui Zhang <51696593+ruizhang0101@users.noreply.github.com>
Signed-off-by: Ifta Khairul Alam Adil <ikaadil007@gmail.com>
* bugfix: deprecate disable log request

Signed-off-by: Rui Zhang <zrfishnoodles@gmail.com>

* Update helm/README.md

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Rui Zhang <51696593+ruizhang0101@users.noreply.github.com>

---------

Signed-off-by: Rui Zhang <zrfishnoodles@gmail.com>
Signed-off-by: Rui Zhang <51696593+ruizhang0101@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Ifta Khairul Alam Adil <ikaadil007@gmail.com>
…#875)

* feat(helm): add configurable NodePort to router service

Add optional `routerSpec.nodePort` field that, when set alongside
`routerSpec.serviceType: NodePort`, pins the NodePort to a fixed value
instead of letting Kubernetes assign a random one on every helm upgrade.

Closes vllm-project#763

Signed-off-by: Keyu Chen <54015474+keyuchen21@users.noreply.github.com>

* fix(helm): move nodePort schema to routerSpec and use truthiness check

- Move nodePort JSON schema property from servingEngineSpec to routerSpec
  where it belongs
- Replace hasKey check with truthiness check in service-router.yaml to
  correctly handle nodePort: null

Signed-off-by: Keyu Chen <54015474+keyuchen21@users.noreply.github.com>

* docs(helm): document nodePort field in Router Configuration table

Add routerSpec.nodePort entry to the Helm README to document the
configurable NodePort introduced for the router service.

Signed-off-by: Keyu_Chen <54015474+km5ar@users.noreply.github.com>

---------

Signed-off-by: Keyu Chen <54015474+keyuchen21@users.noreply.github.com>
Signed-off-by: Keyu_Chen <54015474+km5ar@users.noreply.github.com>
Co-authored-by: Keyu_Chen <54015474+km5ar@users.noreply.github.com>
Co-authored-by: Rui Zhang <51696593+ruizhang0101@users.noreply.github.com>
Signed-off-by: Ifta Khairul Alam Adil <ikaadil007@gmail.com>
…oning models emitting reasoning_content instead of content (vllm-project#873)

Signed-off-by: Kosseila (CloudThrill) <klouddude@gmail.com>
Signed-off-by: Ifta Khairul Alam Adil <ikaadil007@gmail.com>
Signed-off-by: Rui Zhang <zrfishnoodles@gmail.com>
Signed-off-by: Ifta Khairul Alam Adil <ikaadil007@gmail.com>
Signed-off-by: enneitex <etienne.divet@gmail.com>
Co-authored-by: Rui Zhang <51696593+ruizhang0101@users.noreply.github.com>
Signed-off-by: Ifta Khairul Alam Adil <ikaadil007@gmail.com>
…ect#880)

* fix: Detect the media_type instead of hardcode to text/event-stream

Signed-off-by: Shern Shiou Tan <shernshiou@gmail.com>

* test: Add test for audio/wav and text/event-stream

Signed-off-by: Shern Shiou Tan <shernshiou@gmail.com>

* fix: Move media-type before header

Signed-off-by: Shern Shiou Tan <shernshiou@gmail.com>

---------

Signed-off-by: Shern Shiou Tan <shernshiou@gmail.com>
Co-authored-by: Rui Zhang <51696593+ruizhang0101@users.noreply.github.com>
Signed-off-by: Ifta Khairul Alam Adil <ikaadil007@gmail.com>
…lm-project#889)

Signed-off-by: Nejc Habjan <nejc.habjan@siemens.com>
Co-authored-by: Rui Zhang <51696593+ruizhang0101@users.noreply.github.com>
Signed-off-by: Ifta Khairul Alam Adil <ikaadil007@gmail.com>
ikaadil and others added 4 commits May 2, 2026 10:12
Signed-off-by: Ifta khairul Alam Adil <25082512+ikaadil@users.noreply.github.com>
- Fix Helm chart version regression (0.1.10 -> 0.1.19)
- Remove dead max_instance_failover_reroute_attempts parameter
- Clarify max_retries semantics in docs (total attempts, not retries)
- Add input validation for retry CLI arguments
- Fix thread safety in RoutingInterface singleton init
- Add super().__init__() to DisaggregatedPrefillOrchestratedRouter
- Add comment explaining HTTPException retry behavior
- Rename test to test_non_retryable_http_exception_not_retried
- Add test for retryable HTTPException (503)

Signed-off-by: Ifta Khairul Alam Adil <ikaadil007@gmail.com>
… 0.1.10.

Signed-off-by: Ifta Khairul Alam Adil <ikaadil007@gmail.com>
Move media_type extraction after headers are received to properly
capture the Content-Type from backend responses. This fixes the
audio content type forwarding test.

Signed-off-by: Ifta Khairul Alam Adil <ikaadil007@gmail.com>
ikaadil added 2 commits May 2, 2026 15:53
- Add tests for is_retryable_status function
- Fix retry logic to exclude backends with retryable errors
- Update validation message for retry_max_retries
- Fix type annotation in routing_logic.py
Signed-off-by: Ifta Khairul Alam Adil <ikaadil007@gmail.com>
@ikaadil

ikaadil commented May 4, 2026

Contributor Author

@ruizhang0101 could you please review the MR? Thanks!

@ruizhang0101

Collaborator

@aeon-x Could you take a look at this?

ikaadil added 6 commits May 6, 2026 16:20
- Removed req_id, sorted_endpoints, last_endpoints_id, and last_endpoints_hash from the RoundRobinRouter class initialization.
- Streamlined the constructor to focus on essential attributes.

Signed-off-by: Ifta Khairul Alam Adil <ikaadil007@gmail.com>
- Removed unused retry_urls set to simplify the retry mechanism.
- Updated logic to filter remaining endpoints based solely on error_urls.

Signed-off-by: Ifta Khairul Alam Adil <ikaadil007@gmail.com>
Signed-off-by: Ifta Khairul Alam Adil <ikaadil007@gmail.com>
@aeon-x

aeon-x commented May 6, 2026

Collaborator

Hey @ikaadil, I think retrying should be optional, as most users need fast failover.

Can you make sure that this retry mechanism is turned off unless it is explicitly turned on by a flag?

ikaadil added 4 commits May 6, 2026 19:28
Signed-off-by: Ifta Khairul Alam Adil <ikaadil007@gmail.com>
…e as int

Signed-off-by: Ifta Khairul Alam Adil <ikaadil007@gmail.com>
@ikaadil

ikaadil commented May 6, 2026

Contributor Author

Can you make sure that this retry mechanism is turned off unless it is explicitly turned on by a flag?

Done

ikaadil and others added 4 commits May 22, 2026 11:11
Signed-off-by: Ifta khairul Alam Adil <ikaadil007@gmail.com>

Labels

None yet

Projects

None yet
