Skip to content

Latest commit

 

History

History
182 lines (136 loc) · 13.1 KB

File metadata and controls

182 lines (136 loc) · 13.1 KB
title Router Guide
subtitle Deployment modes, quick start, and page map for Dynamo routing docs

Overview

The Dynamo KV Router intelligently routes requests by evaluating their computational costs across different workers. It considers both decoding costs (from active blocks) and prefill costs (from newly computed blocks), using KV cache overlap to minimize redundant computation. Optimizing the KV Router is critical for achieving maximum throughput and minimum latency in distributed inference setups. This guide helps you get started with using the Dynamo router and points to the pages that cover routing concepts, configuration, disaggregated serving, and operations in more detail.

Quick Start

The router can be deployed using Python / CLI, Kubernetes, or as a standalone component.

Python / CLI Deployment

To launch the Dynamo frontend with the KV Router:

python -m dynamo.frontend --router-mode kv --http-port 8000

This command:

  • Launches the Dynamo frontend service with KV routing enabled
  • Exposes the service on port 8000 (configurable)
  • Automatically handles all backend workers registered to the Dynamo endpoint

Backend workers register themselves using the register_model API. For accurate prefix-cache state, workers must also publish KV cache events with the backend-specific event flags; otherwise the router can run in approximate mode with --no-router-kv-events.

CLI Arguments

Argument Default Description
--router-mode kv round-robin Enable KV cache-aware routing
--router-temperature <float> 0.0 Controls routing randomness (0.0 = deterministic, higher = more random)
--kv-cache-block-size <size> Backend-specific KV cache block size (should match backend config)
--router-kv-events / --no-router-kv-events --router-kv-events Enable/disable real-time KV event tracking
--load-aware / --no-load-aware --no-load-aware Route by active load without cache-reuse signals; implies --router-mode kv on the frontend
--router-kv-overlap-score-credit <float> 1.0 Credit multiplier for device-local prefix overlap, from 0.0 to 1.0
--router-prefill-load-scale <float> 1.0 Scale adjusted prompt-side prefill load before adding decode blocks
--router-track-prefill-tokens / --no-router-track-prefill-tokens --router-track-prefill-tokens Include prompt-side load in active worker load accounting
--router-prefill-load-model <none|aic> none Prompt-side load model; see Routing Concepts and Configuration and Tuning
--router-queue-threshold <float> 16.0 Queue threshold fraction; priority hints only reorder requests while this queue is non-empty
--router-queue-policy <str> fcfs Scheduling policy for the queue: fcfs (tail TTFT), wspt (avg TTFT), or lcfs (comparison-only reverse ordering)
--serve-indexer false Serve the Dynamo-native remote indexer from this frontend/router on the worker component
--use-remote-indexer false Query the worker component's served remote indexer instead of maintaining a local overlap indexer

For all available options: python -m dynamo.frontend --help

For detailed configuration options and tuning parameters, see Configuration and Tuning. For candidate eligibility rules, see Router Filtering. For how the router models prefill and decode load in the cost function, see Routing Concepts.

Kubernetes Deployment

To enable the KV Router in Kubernetes, add the DYN_ROUTER_MODE environment variable to your frontend service:

apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-deployment
spec:
  services:
    Frontend:
      componentType: frontend
      replicas: 1
      envs:
        - name: DYN_ROUTER_MODE
          value: kv  # Enable KV Smart Router

Key Points:

  • Set DYN_ROUTER_MODE=kv on the Frontend service only
  • Configure worker-side KV event publishing when you want event-driven prefix-cache state
  • Use --no-router-kv-events for approximate cache-state prediction when workers are not publishing events

Environment Variables

All CLI arguments can be configured via environment variables using the DYN_ prefix:

CLI Argument Environment Variable Default
--router-mode kv DYN_ROUTER_MODE=kv round-robin
--load-aware DYN_ROUTER_LOAD_AWARE=true false
--router-temperature DYN_ROUTER_TEMPERATURE 0.0
--kv-cache-block-size DYN_KV_CACHE_BLOCK_SIZE Backend-specific
--no-router-kv-events DYN_ROUTER_USE_KV_EVENTS=false true
--router-kv-overlap-score-credit DYN_ROUTER_KV_OVERLAP_SCORE_CREDIT 1.0
--router-prefill-load-scale DYN_ROUTER_PREFILL_LOAD_SCALE 1.0
--router-queue-policy DYN_ROUTER_QUEUE_POLICY fcfs
DYN_ENCODER_CUDA_TO_CPU_RATIO 8 Throughput ratio of a non-CPU worker relative to one CPU worker for device-aware-weighted routing

For complete K8s examples and advanced configuration, see K8s Examples and Configuration and Tuning. For A/B testing and advanced K8s setup, see the KV Router A/B Benchmarking Guide.

Standalone Router

You can also run the KV router as a standalone service (without the Dynamo frontend) for disaggregated serving (e.g., routing to prefill workers), multi-tier architectures, or any scenario requiring intelligent KV cache-aware routing decisions. See the Standalone Router component for more details.

Frontend-Embedded vs. Standalone Router

Deployment Process Metrics Port Use Case
Frontend-embedded python -m dynamo.frontend --router-mode kv Frontend HTTP port (default 8000) Standard deployment; router runs inside the frontend process
Standalone python -m dynamo.router DYN_SYSTEM_PORT (if set) Multi-tier architectures, advanced disaggregated prefill routing, custom pipelines

The standalone router does not include the HTTP frontend (no /v1/chat/completions endpoint). It exposes only the RouterRequestMetrics via the system status server. See the Standalone Router README.

Deployment Modes

The Dynamo router can be deployed in several configurations. The table below shows every combination and when to use it:

Mode Command Routing Logic KV Events Topology Use Case
Frontend + Round-Robin python -m dynamo.frontend --router-mode round-robin Cycles through workers None Aggregated Simplest baseline; no KV awareness
Frontend + Random python -m dynamo.frontend --router-mode random Random worker selection None Aggregated Stateless load balancing
Frontend + KV (Aggregated) python -m dynamo.frontend --router-mode kv KV cache overlap + load NATS Core / JetStream / ZMQ / Approx Aggregated Production single-pool serving with cache reuse
Frontend + KV (Disaggregated) python -m dynamo.frontend --router-mode kv with prefill + decode workers KV cache overlap + load NATS Core / JetStream / ZMQ / Approx Disaggregated (prefill + decode pools) Separate prefill/decode for large-scale serving
Frontend + Least-Loaded python -m dynamo.frontend --router-mode least-loaded Fewest active connections None Aggregated or disaggregated fallback Simple load-aware balancing without KV awareness
Frontend + Device-Aware Weighted python -m dynamo.frontend --router-mode device-aware-weighted Device-aware budget + least-loaded within selected device group None Aggregated or disaggregated fallback Heterogeneous fleet balancing (CPU/non-CPU); degenerates to least-loaded when only one device class is present
Frontend + Direct python -m dynamo.frontend --router-mode direct Worker ID from request hints None Aggregated External orchestrator (e.g., EPP/GAIE) selects workers
Standalone Router python -m dynamo.router KV cache overlap + load NATS Core / JetStream / ZMQ Any Routing without the HTTP frontend (multi-tier, custom pipelines)

Routing Modes (--router-mode)

Mode Value How Workers Are Selected
Round-Robin round-robin (default) Cycles through available workers in order
Random random Selects a random worker for each request
KV kv Evaluates KV cache overlap and decode load per worker; picks lowest cost
Least-Loaded least-loaded Routes to the worker with fewest active connections; in disaggregated prefill paths it skips bootstrap optimization and falls back to synchronous prefill
Device-Aware Weighted device-aware-weighted Partitions workers into CPU and non-CPU groups, applies capability-normalized ratio budgeting using DYN_ENCODER_CUDA_TO_CPU_RATIO to decide which group receives the request, then selects the least-loaded worker within that group
Direct direct Reads the target worker_id from the request's routing hints; no selection logic

Device-Aware Weighted Routing

device-aware-weighted is designed for heterogeneous fleets where workers of different compute capability, for example CPU embedding encoders alongside GPU embedding encoders, share the same endpoint.

Workers are split into CPU and non-CPU groups. The router compares a capability-normalized load across the two groups:

normalized_load = total_inflight(group) / (instance_count(group) x throughput_weight)

The throughput weight is 1 for CPU workers and DYN_ENCODER_CUDA_TO_CPU_RATIO for non-CPU workers. The next request is routed to the group with the lower normalized load, then to the least-loaded worker inside that group.

Use DYN_ENCODER_CUDA_TO_CPU_RATIO to approximate the throughput ratio of a non-CPU worker relative to one CPU worker. The default is 8.

When only one device class is present, the policy degenerates to standard least-loaded routing.

KV Event Transport Modes (within --router-mode kv)

When using KV routing, the router needs to know what each worker has cached. There are four ways to get this information:

Event Mode How to Enable Description
NATS Core (local indexer) Router default (no router flag) Workers maintain a local indexer; configure backend-side KV event publishing so the router can recover state and receive events via NATS Core
JetStream (durable) --router-durable-kv-events Events persisted in NATS JetStream; supports snapshots and durable consumers. Deprecated.
ZMQ --event-plane zmq Workers publish via ZMQ PUB sockets; the standalone dynamo.indexer service aggregates events
Approximate (no events) --no-router-kv-events No events consumed; router predicts cache state from its own routing decisions with TTL-based expiration

Aggregated vs. Disaggregated Topology

Topology Workers How It Works
Aggregated Single pool (prefill + decode in one process) All workers handle the full request lifecycle
Disaggregated Separate prefill and decode pools Frontend routes to a prefill worker first, then to a decode worker; requires workers registered with ModelType.Prefill

Disaggregated mode is activated automatically when prefill workers register alongside decode workers. See Disaggregated Serving for details.

More Router Docs