| title | Discovery Plane |
|---|
Dynamo's service discovery layer lets components find each other at runtime. Workers register their endpoints when they start, and frontends discover them automatically. The discovery backend adapts to the deployment environment.
| Deployment | Discovery Backend | Configuration |
|---|---|---|
| Kubernetes (with Dynamo operator) | Native K8s (CRDs, EndpointSlices) | Operator sets DYN_DISCOVERY_BACKEND=kubernetes |
| Bare metal / Local (default) | etcd | ETCD_ENDPOINTS (defaults to http://localhost:2379) |
Note: The runtime always defaults to etcd. Kubernetes discovery must be explicitly enabled -- the Dynamo operator handles this automatically.
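The selection rule in the note above can be sketched in a few lines of Python. This is an illustration, not Dynamo's actual code: the function name `resolve_discovery_backend` is hypothetical, and both the case-insensitive matching and the rejection of unknown values are assumptions made for this example.

```python
import os


def resolve_discovery_backend(env=None):
    """Return the discovery backend implied by the environment (sketch)."""
    env = os.environ if env is None else env
    # Unset or empty falls back to etcd, matching the documented default.
    value = env.get("DYN_DISCOVERY_BACKEND", "").strip().lower()
    if value in ("", "etcd"):
        return "etcd"
    if value == "kubernetes":
        return "kubernetes"
    # Assumption: this sketch only knows the two values described on this page.
    raise ValueError(f"unsupported DYN_DISCOVERY_BACKEND: {value!r}")
```

Passing a plain dictionary instead of `os.environ` makes the rule easy to check without touching the real environment.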
When running on Kubernetes with the Dynamo operator, service discovery uses native Kubernetes resources instead of etcd.
- Workers register their endpoints by creating DynamoWorkerMetadata custom resources.
- EndpointSlices signal pod readiness to the system.
- Components watch for CRD changes to discover available workers.
- No external etcd cluster required.
- Native integration with Kubernetes pod lifecycle.
- Automatic cleanup when pods terminate.
- Works with standard Kubernetes RBAC.
| Variable | Description |
|---|---|
| DYN_DISCOVERY_BACKEND | Set to kubernetes |
| POD_NAME | Current pod name |
| POD_NAMESPACE | Current namespace |
| POD_UID | Pod unique identifier |
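When debugging a pod that fails to register, it can help to confirm that the pod identity variables are present. The check below is a minimal sketch; the helper name `missing_pod_identity` is hypothetical, and treating all three variables as mandatory is an assumption rather than documented behavior.

```python
import os

# Pod identity variables listed in the table above.
POD_IDENTITY_VARS = ("POD_NAME", "POD_NAMESPACE", "POD_UID")


def missing_pod_identity(env=None):
    """Return the pod identity variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in POD_IDENTITY_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_pod_identity()
    if missing:
        print("missing pod identity variables:", ", ".join(missing))
    else:
        print("pod identity variables are set")
```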
When DYN_DISCOVERY_BACKEND is not set (or set to etcd), etcd is used for service discovery.
| Variable | Description | Default |
|---|---|---|
| ETCD_ENDPOINTS | Comma-separated etcd URLs | http://localhost:2379 |
| ETCD_AUTH_USERNAME | Basic auth username | None |
| ETCD_AUTH_PASSWORD | Basic auth password | None |
| ETCD_AUTH_CA | CA certificate path (TLS) | None |
| ETCD_AUTH_CLIENT_CERT | Client certificate path | None |
| ETCD_AUTH_CLIENT_KEY | Client key path | None |
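As an illustration of how these variables and their defaults fit together, the following sketch reads them into a small settings object. The `EtcdSettings` class and `load_etcd_settings` function are hypothetical names for this example, and trimming whitespace around each URL is an assumption.

```python
import os
from dataclasses import dataclass
from typing import List, Optional

DEFAULT_ETCD_ENDPOINT = "http://localhost:2379"


@dataclass
class EtcdSettings:
    endpoints: List[str]
    username: Optional[str]
    password: Optional[str]
    ca_path: Optional[str]
    client_cert_path: Optional[str]
    client_key_path: Optional[str]


def load_etcd_settings(env=None):
    """Collect the etcd variables from the table above (sketch)."""
    env = os.environ if env is None else env
    # Fall back to the documented default when ETCD_ENDPOINTS is unset or empty.
    raw = env.get("ETCD_ENDPOINTS") or DEFAULT_ETCD_ENDPOINT
    endpoints = [url.strip() for url in raw.split(",") if url.strip()]
    return EtcdSettings(
        endpoints=endpoints,
        username=env.get("ETCD_AUTH_USERNAME"),
        password=env.get("ETCD_AUTH_PASSWORD"),
        ca_path=env.get("ETCD_AUTH_CA"),
        client_cert_path=env.get("ETCD_AUTH_CLIENT_CERT"),
        client_key_path=env.get("ETCD_AUTH_CLIENT_KEY"),
    )
```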
Example:
export ETCD_ENDPOINTS=http://etcd-0:2379,http://etcd-1:2379,http://etcd-2:2379

Workers register their endpoints in etcd with a key hierarchy:
/services/{namespace}/{component}/{endpoint}/{instance_id}
For example:
/services/vllm-agg/backend/generate/694d98147d54be25
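The key layout above is regular enough to build and parse mechanically. The sketch below shows one way to do that; `InstanceKey` and `parse_instance_key` are hypothetical names, and the strict validation (exactly four non-empty segments after `/services`) is an assumption based only on the pattern shown here.

```python
from typing import NamedTuple


class InstanceKey(NamedTuple):
    namespace: str
    component: str
    endpoint: str
    instance_id: str

    def path(self) -> str:
        """Render the key in the /services/... layout shown above."""
        return (
            f"/services/{self.namespace}/{self.component}"
            f"/{self.endpoint}/{self.instance_id}"
        )


def parse_instance_key(key: str) -> InstanceKey:
    parts = key.split("/")
    # A well-formed key splits into ["", "services", ns, component, endpoint, id].
    if len(parts) != 6 or parts[0] != "" or parts[1] != "services" or not all(parts[2:]):
        raise ValueError(f"not a service instance key: {key!r}")
    return InstanceKey(*parts[2:])


if __name__ == "__main__":
    key = parse_instance_key("/services/vllm-agg/backend/generate/694d98147d54be25")
    print(key.namespace, key.component, key.endpoint, key.instance_id)
```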
Frontends and routers discover available workers by watching the relevant prefix and receiving real-time updates when workers join or leave.
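Conceptually, a watcher keeps a local set of live instances and updates it as events arrive. The sketch below models that with plain Python data instead of a real etcd client. The function names are hypothetical; the `"put"` and `"delete"` event types mirror the two event kinds an etcd watch reports.

```python
def endpoint_prefix(namespace, component, endpoint):
    """Prefix covering every instance of one endpoint."""
    # The trailing slash keeps "generate" from also matching "generate_v2".
    return f"/services/{namespace}/{component}/{endpoint}/"


def apply_event(live, prefix, event_type, key):
    """Update the set of live instance keys from a single watch event."""
    if not key.startswith(prefix):
        return  # Event belongs to a different endpoint; ignore it.
    if event_type == "put":
        live.add(key)
    elif event_type == "delete":
        live.discard(key)


if __name__ == "__main__":
    prefix = endpoint_prefix("vllm-agg", "backend", "generate")
    live = set()
    apply_event(live, prefix, "put", prefix + "694d98147d54be25")
    apply_event(live, prefix, "delete", prefix + "694d98147d54be25")
    print(sorted(live))
```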
Each runtime maintains a lease with etcd (default TTL: 10 seconds). If a worker crashes or loses connectivity:
- Keep-alive heartbeats stop.
- The lease expires after the TTL.
- All registered endpoints are automatically deleted.
- Clients receive removal events and reroute traffic to healthy workers.
This ensures stale endpoints are cleaned up without manual intervention.
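The lease behavior described above can be modeled in memory to see why no manual cleanup is needed. This is a simplified stand-in, not etcd itself: the `LeaseRegistry` class is hypothetical, and time is passed in explicitly so the example is deterministic.

```python
class LeaseRegistry:
    """In-memory model of lease-bound keys (illustrative only)."""

    def __init__(self, ttl_seconds=10.0):
        self.ttl = ttl_seconds
        self._deadline = {}  # lease_id -> time at which the lease expires
        self._keys = {}      # key -> lease_id the key is attached to

    def grant(self, lease_id, now):
        self._deadline[lease_id] = now + self.ttl

    def put(self, key, lease_id):
        if lease_id not in self._deadline:
            raise KeyError(f"unknown lease: {lease_id!r}")
        self._keys[key] = lease_id

    def keep_alive(self, lease_id, now):
        """Extend a lease; returns False if it is unknown or already expired."""
        if self._deadline.get(lease_id, float("-inf")) > now:
            self._deadline[lease_id] = now + self.ttl
            return True
        return False

    def expire(self, now):
        """Drop expired leases and return the keys deleted with them."""
        dead = {lid for lid, deadline in self._deadline.items() if deadline <= now}
        removed = sorted(k for k, lid in self._keys.items() if lid in dead)
        for key in removed:
            del self._keys[key]
        for lid in dead:
            del self._deadline[lid]
        return removed

    def live_keys(self):
        return sorted(self._keys)
```

A worker that keeps calling `keep_alive` stays registered. One that stops is removed by the next `expire` call after its deadline, which corresponds to the removal event that clients receive.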
Dynamo provides a KV store abstraction for storing metadata (endpoint instances, model deployment cards, event channels). Multiple backends are supported:
| Backend | Use Case |
|---|---|
| etcd | Production deployments |
| Memory | Testing and development |
| NATS | NATS-only deployments |
| File | Local persistence |
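To illustrate what a backend-agnostic KV abstraction can look like, here is a minimal sketch with an in-memory backend. The `KeyValueStore` protocol and its method names are assumptions made for this example and are not Dynamo's actual interface.

```python
from typing import Dict, Optional, Protocol


class KeyValueStore(Protocol):
    """Hypothetical interface that each backend would implement."""

    def put(self, key: str, value: bytes) -> None: ...
    def get(self, key: str) -> Optional[bytes]: ...
    def delete(self, key: str) -> None: ...
    def list_prefix(self, prefix: str) -> Dict[str, bytes]: ...


class MemoryStore:
    """Dictionary-backed store, suitable for tests and local development."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: str) -> Optional[bytes]:
        return self._data.get(key)

    def delete(self, key: str) -> None:
        self._data.pop(key, None)

    def list_prefix(self, prefix: str) -> Dict[str, bytes]:
        return {k: v for k, v in self._data.items() if k.startswith(prefix)}


def count_instances(store: KeyValueStore, prefix: str) -> int:
    """Code written against the protocol works with any backend."""
    return len(store.list_prefix(prefix))
```

Callers such as `count_instances` depend only on the interface, so a memory-backed store can replace a production backend in tests.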
The Dynamo operator automatically sets DYN_DISCOVERY_BACKEND=kubernetes for pods. No additional setup required.
For bare-metal production deployments, deploy a 3-node etcd cluster for high availability.
The lease TTL trades failure-detection speed against keep-alive overhead:
- Short TTL (5s) -- Faster failure detection, more keep-alive traffic.
- Long TTL (30s) -- Less overhead, slower detection.
The default (10s) is a reasonable starting point for most deployments.
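A rough calculation makes the trade-off concrete. The sketch below assumes each worker sends one keep-alive every third of the TTL. That interval is an assumption for illustration (some etcd clients use it), not a documented Dynamo setting, and the function names are hypothetical.

```python
def keepalive_rate(workers, ttl_seconds, interval_fraction=1 / 3):
    """Approximate cluster-wide keep-alive requests per second."""
    interval = ttl_seconds * interval_fraction  # seconds between heartbeats
    return workers / interval


def detection_window(ttl_seconds, interval_fraction=1 / 3):
    """(best, worst) seconds between a crash and lease expiry.

    Worst case: the worker dies right after a heartbeat, so the full TTL
    must elapse. Best case: it dies just before the next heartbeat was due.
    """
    interval = ttl_seconds * interval_fraction
    return (ttl_seconds - interval, ttl_seconds)


if __name__ == "__main__":
    for ttl in (5, 10, 30):
        rate = keepalive_rate(workers=100, ttl_seconds=ttl)
        best, worst = detection_window(ttl)
        print(f"TTL {ttl:>2}s: ~{rate:.0f} keep-alives/s, "
              f"detection in {best:.1f}s to {worst:.1f}s")
```

Under these assumptions, 100 workers generate roughly six times as much keep-alive traffic at a 5-second TTL as at a 30-second TTL, in exchange for detecting failures about six times sooner.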
- Event Plane -- Pub/sub for KV cache events and worker metrics
- Distributed Runtime -- Runtime architecture
- Request Plane -- Request transport configuration
- Fault Tolerance -- Failure handling