feat(vllm-router): add fallback model support for zero-downtime GPU node reboots
When all backends for a model are unavailable (either health-checked away
or all attempts errored out), requests automatically fall through to a
configured fallback model. The model name in the request body is rewritten
so downstream gateways (e.g. Envoy AI Gateway routing to Bedrock) receive
the correct model identifier.
Config: per-model fallback_model in YAML, or --static-fallback-models CLI flag.
Signed-off-by: Max Wittig <max.wittig@siemens.com>
src/vllm_router/README.md (60 additions, 0 deletions)
@@ -29,6 +29,7 @@ The router can be configured using command-line arguments. Below are the availab
 - `--static-models`: The models running in the static serving engines, separated by commas (e.g., `model1,model2`).
 - `--static-aliases`: The aliases of the models running in the static serving engines, separated by commas and associated using colons (e.g., `model_alias1:model,mode_alias2:model`).
 - `--static-backend-health-checks`: Enable this flag to make vllm-router check periodically if the models work by sending dummy requests to their endpoints.
+- `--static-fallback-models`: Fallback model mappings, separated by commas (e.g., `model1:fallback1,model2:fallback2`). When all backends for a model are unavailable, requests are retried on the fallback model.
 - `--k8s-port`: The port of vLLM processes when using K8s service discovery. Default is `8000`.
 - `--k8s-namespace`: The namespace of vLLM pods when using K8s service discovery. Default is `default`.
 - `--k8s-label-selector`: The label selector to filter vLLM pods when using K8s service discovery.
@@ -108,6 +109,64 @@ different endpoints for each model type.
 > Enabling this flag will put some load on your backend every minute as real requests are send to the nodes
 > to test their functionality.

+## Fallback models
+
+When all backends for a model become unavailable (e.g. during node reboots), the
+router can automatically retry the request on a different **fallback model**. The
+model name in the request body is rewritten to the fallback model name before
+forwarding, so the fallback backend receives the correct model identifier.
+
+Fallback triggers in two situations:
+
+1. **No healthy endpoints** -- all backends have been marked unhealthy by the
+   periodic health check. The router switches to the fallback model immediately
+   without attempting the primary backends.
+2. **All instance-level failover attempts failed** -- the primary backends were
+   still considered healthy but every attempt returned a connection error (e.g.
+   the node went down between health checks). After exhausting
+   `--max-instance-failover-reroute-attempts`, the router retries once on the
+   fallback model.
+
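The two trigger conditions and the model-name rewrite can be illustrated with a minimal Python sketch. This is not the router's actual implementation: `route_with_fallback`, `rewrite_model`, the `send` callable, and the `max_attempts` parameter (standing in for `--max-instance-failover-reroute-attempts`) are hypothetical names chosen for illustration, and the exact attempt-counting semantics of the real flag are an assumption.

```python
import json
from typing import Callable, Optional


def rewrite_model(body: bytes, new_model: str) -> bytes:
    """Return a copy of a JSON request body with its "model" field replaced."""
    payload = json.loads(body)
    payload["model"] = new_model
    return json.dumps(payload).encode()


def route_with_fallback(
    body: bytes,
    healthy: dict[str, list[str]],  # model name -> currently healthy backend URLs
    fallbacks: dict[str, str],      # model name -> fallback model name
    send: Callable[[str, bytes], bytes],  # hypothetical transport function
    max_attempts: int = 2,          # assumed stand-in for the reroute-attempts flag
) -> bytes:
    model = json.loads(body)["model"]
    last_error: Optional[Exception] = None

    # Situation 2: primaries look healthy, so try them, up to max_attempts.
    # Situation 1: the list is empty, so this loop is skipped entirely.
    for url in healthy.get(model, [])[:max_attempts]:
        try:
            return send(url, body)
        except ConnectionError as exc:
            last_error = exc

    # Either no healthy endpoint existed or every attempt failed:
    # retry once on the fallback model with the rewritten model name.
    fallback = fallbacks.get(model)
    fallback_urls = healthy.get(fallback, []) if fallback else []
    if not fallback_urls:
        raise RuntimeError(f"no backend available for {model!r}") from last_error
    return send(fallback_urls[0], rewrite_model(body, fallback))
```

Rewriting the body only on the fallback path keeps the primary request untouched, so the primary backends still see the model name the client sent.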
+### Configuration
+
+**In a YAML config file**, add `fallback_model` to any model entry. The value
+must be the name of another model defined in `static_models`:
+
+```yaml
+static_models:
+  glm-5:
+    static_backends:
+      - https://gpu-node-1/glm-5
+      - https://gpu-node-2/glm-5
+    static_model_type: chat
+    fallback_model: glm-5-cloud # fall back to the cloud-hosted variant
+  glm-5-cloud:
+    static_backends:
+      - http://cloud-gateway:1975
+    static_model_type: chat
+    healthcheck_disabled: true
+```
+
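The rule that `fallback_model` must name another model defined in `static_models` could be checked along the following lines. This is a sketch only: `validate_fallbacks` is a hypothetical helper, not a function from the router, and the config is written as a plain Python dict mirroring the YAML above so that no YAML library is needed.

```python
def validate_fallbacks(static_models: dict[str, dict]) -> dict[str, str]:
    """Return {model: fallback_model}, raising ValueError on a bad reference."""
    mapping: dict[str, str] = {}
    for name, entry in static_models.items():
        fallback = entry.get("fallback_model")
        if fallback is None:
            continue  # this model has no fallback configured
        if fallback == name:
            raise ValueError(f"{name!r} cannot fall back to itself")
        if fallback not in static_models:
            raise ValueError(f"{name!r}: unknown fallback_model {fallback!r}")
        mapping[name] = fallback
    return mapping


# Dict equivalent of the YAML example above.
config = {
    "static_models": {
        "glm-5": {
            "static_backends": ["https://gpu-node-1/glm-5", "https://gpu-node-2/glm-5"],
            "static_model_type": "chat",
            "fallback_model": "glm-5-cloud",
        },
        "glm-5-cloud": {
            "static_backends": ["http://cloud-gateway:1975"],
            "static_model_type": "chat",
            "healthcheck_disabled": True,
        },
    }
}

print(validate_fallbacks(config["static_models"]))  # {'glm-5': 'glm-5-cloud'}
```

Failing at load time rather than at request time means a misspelled fallback name surfaces before the primary backends ever go down.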
+**Via CLI**, use `--static-fallback-models` with comma-separated
+Combining `fallback_model` with `--max-instance-failover-reroute-attempts` and a
+short `--static-backend-health-check-interval` gives the best resilience: failed
+requests are retried on other instances first, then on the fallback model, while
+the health check quickly removes dead backends from future routing decisions.
+
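For reference, the `--static-fallback-models` value format (`model1:fallback1,model2:fallback2`) can be turned into a lookup table roughly as follows. `parse_fallback_models` is a hypothetical helper written for illustration; the router's own argument parsing may differ, and the strictness about malformed pairs is an assumption.

```python
def parse_fallback_models(value: str) -> dict[str, str]:
    """Parse 'model1:fallback1,model2:fallback2' into {model: fallback}."""
    mapping: dict[str, str] = {}
    for pair in value.split(","):
        pair = pair.strip()
        if not pair:
            continue  # tolerate empty segments such as a trailing comma
        model, sep, fallback = pair.partition(":")
        model, fallback = model.strip(), fallback.strip()
        if not sep or not model or not fallback:
            raise ValueError(f"expected 'model:fallback', got {pair!r}")
        mapping[model] = fallback
    return mapping


print(parse_fallback_models("model1:fallback1,model2:fallback2"))
```

Note that `partition` splits on the first colon only, so this sketch assumes the primary model name itself contains no colon.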
 ## Dynamic Router Config

 The router can be configured dynamically using a config file when passing the `--dynamic-config-yaml` or
@@ -128,6 +187,7 @@ Currently, the dynamic config supports the following fields:
 - (When using `static` service discovery) `static_models`: The models running in the static serving engines, separated by commas (e.g., `model1,model2`).
 - (When using `static` service discovery) `static_aliases`: The aliases of the models running in the static serving engines, separated by commas and associated using colons (e.g., `model_alias1:model,mode_alias2:model`).
 - (When using `static` service discovery and if you enable the `--static-backend-health-checks` flag) `static_model_types`: The model types running in the static serving engines, separated by commas (e.g., `chat,chat`).
+- (When using `static` service discovery) `fallback_model`: A per-model string in the YAML config (under each model entry) specifying another model to fall back to when all backends are unavailable.
 - (When using `k8s` service discovery) `k8s_port`: The port of vLLM processes when using K8s service discovery. Default is `8000`.
 - (When using `k8s` service discovery) `k8s_namespace`: The namespace of vLLM pods when using K8s service discovery. Default is `default`.
 - (When using `k8s` service discovery) `k8s_label_selector`: The label selector to filter vLLM pods when using K8s service discovery.