You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
fix(distributed): route per request across loaded replicas + cache probeHealth (#9968)
* refactor(distributed): extract PickBestReplica from FindAndLockNodeWithModel
Lifts the replica-selection policy (in_flight ASC, last_used ASC,
available_vram DESC) out of the SQL ORDER BY into a pure Go function in
the new replicapicker.go. The SQL clause keeps its FOR UPDATE atomicity
and remains the production path used by SmartRouter; PickBestReplica is
the canonical implementation that the future per-frontend rotating
replica cache (TODO referenced from pkg/model) will call against an
in-memory snapshot without paying a DB round-trip per inference.
A new registry_test mirror spec seeds a multi-tier scenario and asserts
both layers pick the same replica, so any future tweak to either side
fails the test until the other side is updated.
No behavior change.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
* fix(distributed): route per inference request and cache probeHealth
Two related fixes that together restore load balancing across loaded
replicas of the same model.
1. ModelLoader.Load and LoadModel bypass the local *Model cache when
modelRouter is set. The cached *Model wraps an InFlightTrackingClient
bound to a single (nodeID, replicaIndex) — reusing it pinned every
subsequent request to whichever node won the very first pick, so
FindAndLockNodeWithModel's round-robin never got a chance to run
even after the reconciler scaled the model out to a second node. In
distributed mode SmartRouter.Route now runs per request, and
PickBestReplica picks the least-loaded replica each time.
SmartRouter has its own coalescing (advisory DB lock for first-time
loads + singleflight on backend.install RPC) so concurrent first
requests for a not-yet-loaded model still produce a single worker
side install.
2. SmartRouter.probeHealth memoizes successful gRPC HealthCheck results
in a new probeCache (probe_cache.go) with a 30s TTL. With per-request
routing every inference call hits probeHealth, and llama.cpp-style
backends serialize HealthCheck behind active Predict — so a burst of
incoming requests stalled on the probe to a node already mid-stream,
tripping the 2s timeout and falling through to the install path.
singleflight collapses N concurrent first-time probes for the same
(node, addr) into one round-trip, failed probes invalidate the entry
so the staleness-recovery path still triggers, and the TTL matches
pkg/model/model.go's healthCheckTTL so the single-process and
distributed paths share a staleness budget. The background
HealthMonitor still reaps actually-dead backends within ~45s.
The bypass introduces one short FindAndLockNodeWithModel transaction per
inference. A TODO in pkg/model/loader.go documents the future per modelID
rotating-replica cache that would reuse PickBestReplica against an
in-memory snapshot and skip the DB round-trip for hot paths.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
0 commit comments