Among these, `role`, `host_ip`, and `port` are required; all other parameters are optional.

## Scheduling Strategies

The Router supports the following scheduling strategies, configurable in the configuration file via the `policy` field (mixed mode) and the `prefill-policy` and `decode-policy` fields (PD disaggregated mode).

**Default strategies**: When not configured, prefill nodes default to `process_tokens`; mixed and decode nodes default to `request_num`.

| Strategy | Applicable Scenario | Description |
| :--- | :--- | :--- |
| `random` | General | Randomly selects one available instance; stateless and suitable for lightweight scenarios. |
| `round_robin` | General | Uses an atomic counter to cycle through the instance list, distributing requests evenly in order. |
| `power_of_two` | General | Randomly picks two instances, compares their concurrent request counts, and selects the one with the lower load. |
| `process_tokens` | **prefill (default)** | Iterates over all instances and selects the one with the fewest tokens currently being processed (in-memory counting); suitable for load-balancing long prefill requests. |
| `request_num` | **mixed / decode (default)** | Iterates over all instances and selects the one with the fewest concurrent requests (in-memory counting); suitable for decode and mixed scenarios. |
| `fd_metrics_score` | mixed / decode | Uses in-memory counting to get running/waiting request counts, scores each instance by `running + waiting × waitingWeight`, and selects the instance with the lowest score. |
| `fd_remote_metrics_score` | mixed / decode | Fetches running/waiting request counts from each instance's remote `/metrics` endpoint in real time, scores each instance by `running + waiting × waitingWeight`, and selects the instance with the lowest score. Requires `metrics_port` in instance registration. **Note: A synchronous remote HTTP request is issued on every scheduling decision. With a large number of instances or poor network conditions, this can significantly increase scheduling latency. Evaluate your deployment conditions carefully before enabling this strategy.** |
| `cache_aware` | prefill | Maintains per-instance KV Cache prefix hit information in a Radix Tree and selects instances by combining hit ratio and load scores (in-memory counting); automatically falls back to `process_tokens` when load is severely imbalanced. |
| `remote_cache_aware` | prefill | Same cache-aware strategy as `cache_aware`, but uses the remote `/metrics` endpoint for instance load data. Requires `metrics_port` in instance registration. **Note: A synchronous remote HTTP request is issued on every scheduling decision. With a large number of instances or poor network conditions, this can significantly increase scheduling latency. Evaluate your deployment conditions carefully before enabling this strategy.** |

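
To make two of the strategies above more concrete, the following is a minimal Python sketch of `power_of_two` selection and of the `running + waiting × waitingWeight` score. It is an illustration only, not the Router's implementation: the `Instance` class, the function names, and the weight value `2.0` used in the usage note are hypothetical, and `running` is assumed to stand in for an instance's concurrent request count.

```python
import random
from dataclasses import dataclass


@dataclass
class Instance:
    """Hypothetical in-memory view of one backend instance."""
    url: str
    running: int = 0  # requests currently being processed (assumed load signal)
    waiting: int = 0  # requests queued on the instance


def power_of_two(instances, rng=random):
    """Sample two distinct instances and keep the one with the lower load."""
    if not instances:
        raise ValueError("no instances available")
    if len(instances) == 1:
        return instances[0]
    a, b = rng.sample(instances, 2)
    return a if a.running <= b.running else b


def metrics_score(inst, waiting_weight):
    """Score described in the table: running + waiting * waitingWeight."""
    return inst.running + inst.waiting * waiting_weight


def select_lowest_score(instances, waiting_weight):
    """Return the instance with the lowest score (the first one wins ties)."""
    return min(instances, key=lambda inst: metrics_score(inst, waiting_weight))
```

For example, with instances `a` (running=5, waiting=0), `b` (running=1, waiting=4), and `c` (running=3, waiting=2) and an illustrative weight of `2.0`, the scores are 5, 9, and 7, so `select_lowest_score` returns `a` even though `b` has the fewest running requests.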

## Troubleshooting

If you encounter issues while using the Router, refer to the [Router Troubleshooting Guide](router_faq.md), which covers common log analysis, response output interpretation, and troubleshooting methods.

docs/online_serving/router_faq.md (48 additions & 3 deletions)

@@ -23,14 +23,22 @@ For basic Router usage, please refer to [Load-Balancing Scheduling Router](route
|`Failed to register instance from index {index}: {error}`| Instance at index {index} in config file failed to register | That instance was not registered | Health status, registration parameters |
|`failed to send request to {url} with error: {error}`| Health check request failed to send | The instance may be marked as unhealthy | Network connectivity, proxy settings |
|`scanner error: {error}`| Error occurred while reading backend streaming response | The current request may fail | Backend instance status |
|`[prefill] scanner error: {error}, message={message}`| Error occurred while reading Prefill backend streaming response | The current Prefill request may fail | Backend instance status |
|`[prefill] copy error: {error}, message={message}`| Error occurred while copying Prefill response data | The current Prefill request may fail | Backend instance status |
|`Panic recovered: {error}`| A panic occurred during request processing and was recovered | The current request fails, but the service continues running | Backend instance status, request content |
|`empty baseURL provided`| Health check received an empty base URL | Health check cannot be performed | Registration parameters |
|`failed to create request: {error}`| Failed to create health check request | The instance may be marked as unhealthy | Network environment |
|`failed to read response body: {error}`| Failed to read health check response body | The instance may be marked as unhealthy | Backend instance status |

### Warn-Level Logs

| Log Message | Meaning | Impact | What to Check |
| :--- | :--- | :--- | :--- |
|`Server {url} is not healthy`| The instance at this URL failed health check | Router cannot register the instance, or will remove it from the registered list | Health status |
|`Instance {url} role is unknown`| Instance role cannot be recognized | The instance will not be added to the scheduling list | Registration parameters |
|`cache-aware prefill: tokenizer failed, fallback to char tokens: {error}`| Tokenizer service call failed, automatically falling back to character-based tokenization | cache_aware strategy remains active, using character-based tokenization for cache matching instead of the Tokenizer; normal request processing is not affected | Tokenizer service status |
|`cache-aware prefill: tokenize failed, fallback to process_tokens: {error}`| Tokenization completely failed (e.g., empty input), falling back to process_tokens strategy | Prefill scheduling temporarily does not use cache_aware strategy; normal request processing is not affected | Request content, Tokenizer service status |
|`cache-aware prefill: final strategy: process_tokens, reason: tokenize failed: {error}. ts_ms={ts}`| Tokenization failed (new format), falling back to process_tokens strategy | Prefill scheduling temporarily does not use cache_aware strategy; normal request processing is not affected | Request content, Tokenizer service status |

### Info-Level Logs
@@ -42,6 +50,42 @@ For basic Router usage, please refer to [Load-Balancing Scheduling Router](route
|`No instances found in config file {path}`| No instances found in the registration config file | Check whether register.yaml is empty |
|`Request failed, retrying...`| Request failed, retrying | Router will retry up to 3 times |
|`select worker (prefill): {url}, tokens: {tokens}`| Prefill scheduler selected a worker, showing current token processing count | Normal operation log |
|`select worker ({type}): {url}, count: {count}`| Decode/Mixed scheduler selected a worker, showing current request concurrency | Normal operation log |
|`release worker: {url}, count: {count}`| Request ended, worker counter released | Normal operation log |
|`release prefill tokens: {url}, tokens: {tokens}`| Prefill request ended, token load released | Normal operation log |
|`cleanup unhealthy worker counter: {url}`| Cleaned up counter for unhealthy worker | Normal operation log |
|`removed counters for {count} unhealthy workers: {urls}`| Batch cleanup of counters for unhealthy workers | Normal operation log |
|`[stats] total_running={n}, workers: [{loads}], cache_hit_rate={rate}% (hits={hits}/total={total})`| Periodic stats: total requests, worker loads, cache hit rate | Normal operation log, useful for monitoring and tuning |
|`[prefill] first chunk received, release counter url={url}`| Prefill streaming response received first chunk, counter released | Normal operation log |
|`[prefill] non-stream prefill response done, release counter url={url}`| Prefill non-streaming response completed, counter released | Normal operation log |
|`[prefill] backendResp is nil or backendResp.Body is nil, url={url}`| Prefill backend response is nil | May indicate backend connection issue |
|`[prefill] release in defer (fallback) url={url}, isStream={bool}`| Fallback resource release when Prefill request exits abnormally | Error handling log |
|`[prefill] release in CommonCompletions defer (error path) url={url}`| Prefill resource release on error path | Error handling log |
|`cache-aware prefill: final strategy: process_tokens, reason: strategy not initialized`| cache_aware strategy not initialized, falling back to process_tokens | Check cache_aware configuration |
|`cache-aware prefill: final strategy: process_tokens, reason: load imbalanced, loads={loads}. ts_ms={ts}`| Load imbalanced across instances, falling back to process_tokens strategy | Normal operation log, automatic load balancing switch |
|`cache-aware prefill: final strategy: cache_aware_scoring, selected={url}, loads={loads}, hitRatios={ratios}. ts_ms={ts}`| cache_aware scoring strategy selected a worker | Normal operation log, showing loads and hit ratios |
|`[{method}] {path} {proto} {status} {latency} {clientIP}`| HTTP request access log | Normal operation log, records basic info for each request |
|`before SelectWorker prefill. ts_ms={ts}`| Starting Prefill worker selection in PD disaggregated mode | Normal operation log, for performance tracing |
|`before SelectWorker decode, after prefill. ts_ms={ts}`| Starting Decode worker selection after Prefill selection | Normal operation log, for performance tracing |
|`after SelectWorker decode, before return. ts_ms={ts}`| Decode worker selection completed | Normal operation log, for performance tracing |

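
Because the periodic `[stats]` line is described as useful for monitoring and tuning, a small parser can help track the cache hit rate over time. The sketch below is illustrative only: the function name `parse_stats` is hypothetical, the numeric formats (integer counts, a possibly decimal rate) are assumptions, and the content of `{loads}` is kept as raw text because its inner format is not specified here.

```python
import re

# Pattern follows the template
# "[stats] total_running={n}, workers: [{loads}], cache_hit_rate={rate}% (hits={hits}/total={total})".
# The numeric formats below are assumptions.
STATS_RE = re.compile(
    r"\[stats\] total_running=(?P<total_running>\d+), "
    r"workers: \[(?P<workers>.*)\], "
    r"cache_hit_rate=(?P<rate>\d+(?:\.\d+)?)% "
    r"\(hits=(?P<hits>\d+)/total=(?P<total>\d+)\)"
)


def parse_stats(line):
    """Return the fields of a [stats] log line as a dict, or None if it does not match."""
    m = STATS_RE.search(line)
    if m is None:
        return None
    return {
        "total_running": int(m["total_running"]),
        "workers": m["workers"],  # raw text; inner format is not specified
        "cache_hit_rate": float(m["rate"]),
        "hits": int(m["hits"]),
        "total": int(m["total"]),
    }
```

Lines that do not contain a `[stats]` record return `None`, so the function can be applied to every line of a log file and the non-matching lines skipped.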
### Debug-Level Logs

> Debug-level logs are only output when the log level is set to `debug`; they are typically used for development debugging.

| Log Message | Meaning | Description |
| :--- | :--- | :--- |
|`Healthy instances: prefill={urls}, decode={urls}, mixed={urls}`| Lists healthy instances for each role | Useful for verifying instance discovery |
|`cache-aware prefill: hashes={n} workers={n} load={loads} hit={hits}`| Hash count, worker count, and load info for cache_aware strategy | Useful for debugging cache hits |
|`cache-aware prefill: tokenizer tokens={tokens}`| Tokenizer tokenization result | Useful for debugging tokenization results |
|`cache-aware score: worker={url} hit={hit} loadRatio={ratio} score={score}`| Scoring details for cache_aware strategy | Useful for debugging scheduling decisions |
|`radix match: hashes={n} matched_len={n} node_children={n}`| Radix tree match details | Useful for debugging cache matching |
|`radix record: worker={url} hashes={n} node_depth={n}`| Radix tree record details | Useful for debugging cache recording |
|`radix eviction: removed={n} nodeCount={n}`| Radix tree eviction details | Useful for debugging cache eviction |

## Common Response Output Analysis
@@ -189,9 +233,10 @@ If `Failed to start server` appears in startup logs, check:
### Tokenizer Service (cache_aware Strategy)

When using the `cache_aware` scheduling strategy, the Router calls a Tokenizer service to tokenize requests for cache hit ratio computation. When the Tokenizer service is unavailable, the Router applies a two-level degradation mechanism:

1. **Fallback to character-based tokenization** (common case): The log shows `tokenizer failed, fallback to char tokens`. The `cache_aware` strategy remains active and uses character-based tokenization, instead of the Tokenizer, for cache matching. Cache hit accuracy may decrease, but normal request processing is not affected.
2. **Fallback to the `process_tokens` strategy** (extreme case): When tokenization fails completely (e.g., empty request content), the log shows `tokenize failed, fallback to process_tokens`. The `cache_aware` strategy becomes temporarily inactive, and scheduling falls back to balancing by the number of tokens being processed. Normal request processing is not affected.
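
The two degradation levels above can be sketched as a small decision function. This is a hedged illustration in Python rather than the Router's code: `tokenize_with_fallback` and `choose_prefill_strategy` are hypothetical names, the tokenizer is modeled as any callable that may raise, and "character-based tokenization" is assumed here to mean one token per character.

```python
def tokenize_with_fallback(text, tokenizer):
    """Try the tokenizer service; on failure, fall back to character tokens (level 1)."""
    try:
        return tokenizer(text), "tokenizer"
    except Exception:
        # Assumption: one token per character, represented by its code point.
        return [ord(ch) for ch in text], "char"


def choose_prefill_strategy(text, tokenizer):
    """Return (strategy, token_source) for one request."""
    tokens, source = tokenize_with_fallback(text, tokenizer)
    if not tokens:
        # Level 2: nothing to match against (e.g., empty request content).
        return "process_tokens", source
    return "cache_aware", source
```

With a tokenizer that raises, a non-empty request still yields `("cache_aware", "char")`, while an empty request yields `("process_tokens", "char")`, which mirrors the common and extreme cases in the list.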