Commit fd97ff7

feat(vllm-router): add fallback model support for zero-downtime GPU node reboots
When all backends for a model are unavailable (either health-checked away or all attempts errored out), requests automatically fall through to a configured fallback model. The model name in the request body is rewritten so downstream gateways (e.g. Envoy AI Gateway routing to Bedrock) receive the correct model identifier.

Config: per-model fallback_model in YAML, or --static-fallback-models CLI flag.

Signed-off-by: Max Wittig <max.wittig@siemens.com>
1 parent 5a3f13d commit fd97ff7

8 files changed

Lines changed: 246 additions & 33 deletions

pyproject.toml

Lines changed: 3 additions & 0 deletions
@@ -65,6 +65,9 @@ write_to = "src/vllm_router/_version.py"
 [tool.isort]
 profile = "black"

+[tool.ruff]
+target-version = "py312"
+
 [tool.pytest.ini_options]
 asyncio_mode = "auto"

src/vllm_router/README.md

Lines changed: 60 additions & 0 deletions
@@ -29,6 +29,7 @@ The router can be configured using command-line arguments. Below are the availab
 - `--static-models`: The models running in the static serving engines, separated by commas (e.g., `model1,model2`).
 - `--static-aliases`: The aliases of the models running in the static serving engines, separated by commas and associated using colons (e.g., `model_alias1:model,mode_alias2:model`).
 - `--static-backend-health-checks`: Enable this flag to make vllm-router check periodically if the models work by sending dummy requests to their endpoints.
+- `--static-fallback-models`: Fallback model mappings, separated by commas (e.g., `model1:fallback1,model2:fallback2`). When all backends for a model are unavailable, requests are retried on the fallback model.
 - `--k8s-port`: The port of vLLM processes when using K8s service discovery. Default is `8000`.
 - `--k8s-namespace`: The namespace of vLLM pods when using K8s service discovery. Default is `default`.
 - `--k8s-label-selector`: The label selector to filter vLLM pods when using K8s service discovery.
@@ -108,6 +109,64 @@ different endpoints for each model type.
 > Enabling this flag will put some load on your backend every minute as real requests are send to the nodes
 > to test their functionality.

+## Fallback models
+
+When all backends for a model become unavailable (e.g. during node reboots), the
+router can automatically retry the request on a different **fallback model**. The
+model name in the request body is rewritten to the fallback model name before
+forwarding, so the fallback backend receives the correct model identifier.
+
+Fallback triggers in two situations:
+
+1. **No healthy endpoints** -- all backends have been marked unhealthy by the
+   periodic health check. The router switches to the fallback model immediately
+   without attempting the primary backends.
+2. **All instance-level failover attempts failed** -- the primary backends were
+   still considered healthy but every attempt returned a connection error (e.g.
+   the node went down between health checks). After exhausting
+   `--max-instance-failover-reroute-attempts`, the router retries once on the
+   fallback model.
+
+### Configuration
+
+**In a YAML config file**, add `fallback_model` to any model entry. The value
+must be the name of another model defined in `static_models`:
+
+```yaml
+static_models:
+  glm-5:
+    static_backends:
+      - https://gpu-node-1/glm-5
+      - https://gpu-node-2/glm-5
+    static_model_type: chat
+    fallback_model: glm-5-cloud # fall back to the cloud-hosted variant
+  glm-5-cloud:
+    static_backends:
+      - http://cloud-gateway:1975
+    static_model_type: chat
+    healthcheck_disabled: true
+```
+
+**Via CLI**, use `--static-fallback-models` with comma-separated
+`model:fallback` pairs:
+
+```bash
+vllm-router --port 8000 \
+    --service-discovery static \
+    --static-backends "https://gpu-node-1/glm-5,https://gpu-node-2/glm-5,http://cloud-gateway:1975" \
+    --static-models "glm-5,glm-5,glm-5-cloud" \
+    --static-model-types "chat,chat,chat" \
+    --static-fallback-models "glm-5:glm-5-cloud" \
+    --static-backend-health-checks \
+    --max-instance-failover-reroute-attempts 2 \
+    --routing-logic roundrobin
+```
+
+Combining `fallback_model` with `--max-instance-failover-reroute-attempts` and a
+short `--static-backend-health-check-interval` gives the best resilience: failed
+requests are retried on other instances first, then on the fallback model, while
+the health check quickly removes dead backends from future routing decisions.
+
 ## Dynamic Router Config

 The router can be configured dynamically using a config file when passing the `--dynamic-config-yaml` or
@@ -128,6 +187,7 @@ Currently, the dynamic config supports the following fields:
 - (When using `static` service discovery) `static_models`: The models running in the static serving engines, separated by commas (e.g., `model1,model2`).
 - (When using `static` service discovery) `static_aliases`: The aliases of the models running in the static serving engines, separated by commas and associated using colons (e.g., `model_alias1:model,mode_alias2:model`).
 - (When using `static` service discovery and if you enable the `--static-backend-health-checks` flag) `static_model_types`: The model types running in the static serving engines, separated by commas (e.g., `chat,chat`).
+- (When using `static` service discovery) `fallback_model`: A per-model string in the YAML config (under each model entry) specifying another model to fall back to when all backends are unavailable.
 - (When using `k8s` service discovery) `k8s_port`: The port of vLLM processes when using K8s service discovery. Default is `8000`.
 - (When using `k8s` service discovery) `k8s_namespace`: The namespace of vLLM pods when using K8s service discovery. Default is `default`.
 - (When using `k8s` service discovery) `k8s_label_selector`: The label selector to filter vLLM pods when using K8s service discovery.
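The two fallback triggers described in the README text above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the router's request path (the request-handling code is not among the hunks shown here). `route_with_fallback`, `rewrite_model`, the `healthy` mapping, and the `send` callable are hypothetical names, and `max_attempts` is a simplified stand-in for `--max-instance-failover-reroute-attempts`.

```python
import json
from typing import Callable


def rewrite_model(body: bytes, new_model: str) -> bytes:
    """Return a copy of a JSON request body with its "model" field replaced."""
    payload = json.loads(body)
    payload["model"] = new_model
    return json.dumps(payload).encode("utf-8")


def route_with_fallback(
    body: bytes,
    healthy: dict[str, list[str]],       # hypothetical: model name -> healthy endpoint URLs
    fallback_models: dict[str, str],     # e.g. {"glm-5": "glm-5-cloud"}
    send: Callable[[str, bytes], bytes],  # hypothetical transport; raises ConnectionError
    max_attempts: int = 2,
) -> bytes:
    model = json.loads(body)["model"]

    # Situation 2: primaries still look healthy, so try them first.
    # If the list is empty (situation 1), this loop is skipped entirely.
    for url in healthy.get(model, [])[:max_attempts]:
        try:
            return send(url, body)
        except ConnectionError:
            continue

    # Either no healthy primary existed or every attempt failed:
    # retry once on the fallback model, with the model name rewritten.
    fallback = fallback_models.get(model)
    if fallback is None or not healthy.get(fallback):
        raise RuntimeError(f"no backend available for {model!r}")
    return send(healthy[fallback][0], rewrite_model(body, fallback))
```

Rewriting the body rather than only changing the target URL matters here because, per the commit message, a downstream gateway may itself route on the `model` field.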

src/vllm_router/app.py

Lines changed: 5 additions & 0 deletions
@@ -202,6 +202,11 @@ def initialize_all(app: FastAPI, args):
             static_backend_health_check_timeout_seconds=args.static_backend_health_check_timeout_seconds,
             prefill_model_labels=args.prefill_model_labels,
             decode_model_labels=args.decode_model_labels,
+            fallback_models=(
+                parse_static_aliases(args.static_fallback_models)
+                if args.static_fallback_models
+                else None
+            ),
         )
     elif args.service_discovery == "k8s":
         initialize_service_discovery(

src/vllm_router/dynamic_config.py

Lines changed: 7 additions & 0 deletions
@@ -57,6 +57,7 @@ class DynamicRouterConfig:
     static_aliases: Optional[str] = None
     static_model_labels: Optional[str] = None
     static_model_types: Optional[str] = None
+    static_fallback_models: Optional[str] = None
     static_backend_health_checks: Optional[bool] = False
     static_backend_health_check_interval: Optional[int] = 60
     static_backend_health_check_timeout_seconds: Optional[int] = 10
@@ -97,6 +98,7 @@ def from_args(args) -> "DynamicRouterConfig":
             static_backend_health_checks=args.static_backend_health_checks,
             static_backend_health_check_interval=args.static_backend_health_check_interval,
             static_backend_health_check_timeout_seconds=args.static_backend_health_check_timeout_seconds,
+            static_fallback_models=getattr(args, "static_fallback_models", None),
             k8s_port=args.k8s_port,
             k8s_namespace=args.k8s_namespace,
             k8s_label_selector=args.k8s_label_selector,
@@ -185,6 +187,11 @@ def reconfigure_service_discovery(self, config: DynamicRouterConfig):
                 decode_model_labels=parse_comma_separated_args(
                     config.decode_model_labels
                 ),
+                fallback_models=(
+                    parse_static_aliases(config.static_fallback_models)
+                    if config.static_fallback_models
+                    else None
+                ),
             )
         elif config.service_discovery == "k8s":
             reconfigure_service_discovery(

src/vllm_router/parsers/parser.py

Lines changed: 7 additions & 0 deletions
@@ -171,6 +171,13 @@ def parse_args():
         default=None,
         help="The model labels of static backends, separated by commas. E.g., model1,model2",
     )
+    parser.add_argument(
+        "--static-fallback-models",
+        type=str,
+        default=None,
+        help="Fallback model mappings, separated by commas. E.g., model1:fallback1,model2:fallback2. "
+        "When all backends for a model are unavailable, requests are retried on the fallback model.",
+    )
     parser.add_argument(
         "--static-backend-health-checks",
         action="store_true",

src/vllm_router/parsers/yaml_utils.py

Lines changed: 15 additions & 0 deletions
@@ -37,6 +37,18 @@ def generate_static_model_types(models: dict[str, Any]) -> str:
     return ",".join(static_model_types)


+def generate_static_fallback_models(models: dict[str, Any]) -> str | None:
+    """Generate comma-separated fallback model mappings.
+
+    Format: model1:fallback1,model2:fallback2
+    """
+    fallback_models = []
+    for name, details in models.items():
+        if "fallback_model" in details:
+            fallback_models.append(f"{name}:{details['fallback_model']}")
+    return ",".join(fallback_models) if fallback_models else None
+
+
 def read_and_process_yaml_config_file(config_path: str) -> dict[str, Any]:
     with open(config_path, encoding="utf-8") as f:
         try:
@@ -49,6 +61,9 @@ def read_and_process_yaml_config_file(config_path: str) -> dict[str, Any]:
         yaml_config["static_backends"] = generate_static_backends(models)
         yaml_config["static_models"] = generate_static_models(models)
         yaml_config["static_model_types"] = generate_static_model_types(models)
+        fallback_models = generate_static_fallback_models(models)
+        if fallback_models:
+            yaml_config["static_fallback_models"] = fallback_models
     if aliases:
         yaml_config["static_aliases"] = generate_static_aliases(aliases)
     return yaml_config

src/vllm_router/service_discovery.py

Lines changed: 2 additions & 0 deletions
@@ -217,6 +217,7 @@ def __init__(
         static_backend_health_check_timeout_seconds: int = 10,
         prefill_model_labels: List[str] | None = None,
         decode_model_labels: List[str] | None = None,
+        fallback_models: Dict[str, str] | None = None,
     ):
         self.app = app
         assert len(urls) == len(models), "URLs and models should have the same length"
@@ -225,6 +226,7 @@ def __init__(
         self.aliases = aliases
         self.model_labels = model_labels
         self.model_types = model_types
+        self.fallback_models = fallback_models or {}
         self.engines_id = [str(uuid.uuid4()) for i in range(0, len(urls))]
         self.added_timestamp = int(time.time())
         self.unhealthy_endpoint_hashes = []
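The README text in this commit states that a `fallback_model` value must name another model defined in `static_models`, while the hunks shown here store the mapping as given. Whether validation happens elsewhere is not visible on this page. A startup check along the following lines could catch misconfiguration early. `validate_fallback_models` is a hypothetical helper, not something this commit adds.

```python
def validate_fallback_models(
    models: list[str], fallback_models: dict[str, str]
) -> list[str]:
    """Return human-readable problems found in a fallback mapping.

    Hypothetical check for illustration; an empty list means the mapping is consistent.
    """
    known = set(models)
    problems: list[str] = []
    for model, fallback in fallback_models.items():
        if model not in known:
            problems.append(f"unknown model {model!r} has a fallback configured")
        if fallback not in known:
            problems.append(f"fallback {fallback!r} for {model!r} is not a configured model")
        if fallback == model:
            problems.append(f"model {model!r} falls back to itself")
    return problems


# Values taken from the README's CLI example.
print(validate_fallback_models(
    ["glm-5", "glm-5", "glm-5-cloud"], {"glm-5": "glm-5-cloud"}
))  # []
```

Returning a list of problems instead of raising on the first one is a design choice for this sketch, so that all configuration mistakes can be reported in a single pass.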
