
Commit a891eed

localai-bot and mudler authored
fix(distributed): persist per-model load info so reconciler survives frontend restart (#9981)
* feat(distributed): add per-model ModelLoadInfo persistence

  Adds a dedicated ModelLoadInfo table keyed by model name, decoupled from
  the per-replica NodeModel rows. The reconciler can now recover model load
  metadata after every NodeModel row has been removed (worker death,
  eviction, MarkOffline reaping, frontend restart with stale heartbeats),
  which is the read side of Bug-1 from the distributed mode bug hunt.

  Registry exposes:
  - UpsertModelLoadInfo: ON CONFLICT (model_name) update; last-write-wins,
    matching the existing per-replica blob semantics under concurrent
    multi-frontend dispatch.
  - GetModelLoadInfo: read from the new table first; fall back to the
    legacy NodeModel-blob scan for rows written before any frontend in the
    cluster ran an UpsertModelLoadInfo (rolling-upgrade transition).

  SetNodeModelLoadInfo (per-replica blob) is preserved for backward
  compatibility and per-replica diagnostics; the dispatch-path hook in the
  next commit calls both.

  The new table joins the existing nodes AutoMigrate set under the same
  schema-migration advisory lock.

  Refs: Bug-1, docs/superpowers/specs/2026-05-24-distributed-mode-bug-hunt-findings.md

  Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
  Assisted-by: Claude:claude-opus-4-7[1m]

* fix(distributed): persist per-model load info on dispatch

  scheduleAndLoad now writes the (backendType, ModelOptions blob) pair to
  the new ModelLoadInfo table in addition to the existing per-replica
  NodeModel.model_opts_blob field. The per-replica blob still works for the
  hot path; the per-model row outlives every NodeModel row going away,
  which is what unblocks the reconciler on the read side.

  Both writes are best-effort with warn-level logging on failure: a write
  miss here just means the reconciler may need a fresh inference request to
  repopulate, which is the pre-fix behavior.

  Concurrency: two frontends loading the same model at the same time both
  fire UpsertModelLoadInfo; ON CONFLICT (model_name) makes the row converge
  to whichever commits last. Matches the existing per-replica blob
  semantics.

  Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
  Assisted-by: Claude:claude-opus-4-7[1m]

* test(distributed): cover load info persistence and Bug-1 recovery

  Adds Ginkgo specs that prove the persistence layer behaves correctly and
  that the reconciler actually recovers from the frontend-restart scenario
  that was failing in production:

  registry_test.go:
  - per-model row survives RemoveAllNodeModelReplicas (the bug repro)
  - ON CONFLICT (model_name) updates backend type + blob, last-write-wins
  - legacy NodeModel-blob fallback still works (rolling-upgrade transition)
  - GetModelLoadInfo returns ErrRecordNotFound when both sources are empty
  - UpsertModelLoadInfo rejects empty model names

  reconciler_test.go:
  - Bug-1 end-to-end: with min_replicas=2, no NodeModel rows, but a
    ModelLoadInfo row present, one reconcile tick fires two scheduler
    calls. Pre-fix this returned "no load info" and the scheduler never got
    called until a fresh inference request arrived.

  Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
  Assisted-by: Claude:claude-opus-4-7[1m]

* docs(distributed): note restart-safe reconciler behavior

  Adds a bullet to the Replica Reconciler section explaining that per-model
  load metadata is persisted across frontend restarts via the new
  model_load_infos PostgreSQL table, so a rolling upgrade no longer needs a
  fresh inference request per model before the reconciler can replace dead
  replicas.

  Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
  Assisted-by: Claude:claude-opus-4-7[1m]

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
1 parent 06e777b commit a891eed

8 files changed

Lines changed: 194 additions & 4 deletions


core/services/nodes/interfaces.go

Lines changed: 1 addition & 0 deletions
@@ -17,6 +17,7 @@ type ModelRouter interface {
 	TouchNodeModel(ctx context.Context, nodeID, modelName string, replicaIndex int)
 	SetNodeModel(ctx context.Context, nodeID, modelName string, replicaIndex int, state, address string, initialInFlight int) error
 	SetNodeModelLoadInfo(ctx context.Context, nodeID, modelName string, replicaIndex int, backendType string, optsBlob []byte) error
+	UpsertModelLoadInfo(ctx context.Context, modelName, backendType string, optsBlob []byte) error
 	GetModelLoadInfo(ctx context.Context, modelName string) (backendType string, optsBlob []byte, err error)
 	NextFreeReplicaIndex(ctx context.Context, nodeID, modelName string, maxSlots int) (int, error)
 	CountReplicasOnNode(ctx context.Context, nodeID, modelName string) (int, error)
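
The last-write-wins behaviour that the commit message describes for UpsertModelLoadInfo can be illustrated without a database. The following is only a sketch: `memLoadInfoStore`, `loadInfo`, `Upsert` and `Get` are hypothetical names, and a mutex-guarded map stands in for the PostgreSQL table. It assumes the upsert leaves the creation timestamp alone on conflict, since the diff's conflict assignments list only the backend type, the blob and `updated_at`.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// loadInfo mirrors the columns of the per-model row (hypothetical type).
type loadInfo struct {
	BackendType string
	OptsBlob    []byte
	CreatedAt   time.Time
	UpdatedAt   time.Time
}

// memLoadInfoStore is an in-memory stand-in for the per-model table.
type memLoadInfoStore struct {
	mu   sync.Mutex
	rows map[string]loadInfo
}

func newMemLoadInfoStore() *memLoadInfoStore {
	return &memLoadInfoStore{rows: make(map[string]loadInfo)}
}

// Upsert inserts a row or, when the model name already exists, overwrites
// the backend type, blob and UpdatedAt while keeping the original CreatedAt.
func (s *memLoadInfoStore) Upsert(modelName, backendType string, optsBlob []byte) error {
	if modelName == "" {
		return errors.New("model name is required")
	}
	s.mu.Lock()
	defer s.mu.Unlock()

	now := time.Now()
	row, exists := s.rows[modelName]
	if !exists {
		row.CreatedAt = now
	}
	row.BackendType = backendType
	row.OptsBlob = append([]byte(nil), optsBlob...) // copy so callers cannot mutate the stored blob
	row.UpdatedAt = now
	s.rows[modelName] = row
	return nil
}

// Get returns the stored row and whether it exists.
func (s *memLoadInfoStore) Get(modelName string) (loadInfo, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	row, ok := s.rows[modelName]
	return row, ok
}

func main() {
	s := newMemLoadInfoStore()
	_ = s.Upsert("lww", "llama-cpp", []byte("v1"))
	_ = s.Upsert("lww", "vllm", []byte("v2"))
	row, _ := s.Get("lww")
	fmt.Println(row.BackendType, string(row.OptsBlob)) // prints "vllm v2"
}
```

A storing fake along these lines could back tests that need to observe what was written; the fakes in this commit only return nil.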

core/services/nodes/model_router_test.go

Lines changed: 3 additions & 0 deletions
@@ -56,6 +56,9 @@ func (f *fakeModelRouterForSmartRouter) SetNodeModel(_ context.Context, _, _ str
 func (f *fakeModelRouterForSmartRouter) SetNodeModelLoadInfo(_ context.Context, _, _ string, _ int, _ string, _ []byte) error {
 	return nil
 }
+func (f *fakeModelRouterForSmartRouter) UpsertModelLoadInfo(_ context.Context, _, _ string, _ []byte) error {
+	return nil
+}
 func (f *fakeModelRouterForSmartRouter) GetModelLoadInfo(_ context.Context, _ string) (string, []byte, error) {
 	return "", nil, fmt.Errorf("not found")
 }

core/services/nodes/reconciler_test.go

Lines changed: 41 additions & 0 deletions
@@ -423,6 +423,47 @@ var _ = Describe("ReplicaReconciler", func() {
 			Expect(cap).To(Equal(2))
 		})
 	})
+
+	Context("frontend-restart scenario (Bug-1)", func() {
+		It("recovers replicas after every NodeModel row has been removed", func() {
+			ctx := context.Background()
+
+			// One healthy node. UpsertModelLoadInfo records per-model metadata
+			// independently of any NodeModel row, mirroring what
+			// scheduleAndLoad does on a successful dispatch.
+			node := registerNode("restart-node", "10.0.20.1:50051")
+			setSchedulingConfig("restart-model", 2, 4, "")
+			Expect(registry.UpsertModelLoadInfo(ctx, "restart-model", "llama-cpp", []byte("opts-from-pre-restart"))).To(Succeed())
+
+			// Simulate the bug: between frontend instances the NodeModel rows
+			// are wiped (MarkOffline path, stale-heartbeat reaping). The
+			// per-model load info row stays because it's not tied to any
+			// (node, replica) slot.
+			Expect(registry.RemoveAllNodeModelReplicas(ctx, node.ID, "restart-model")).To(Succeed())
+
+			// Pre-fix: GetModelLoadInfo returned ErrRecordNotFound here and
+			// the reconciler logged "no load info" every 30s until a manual
+			// inference request arrived.
+			bt, blob, err := registry.GetModelLoadInfo(ctx, "restart-model")
+			Expect(err).ToNot(HaveOccurred())
+			Expect(bt).To(Equal("llama-cpp"))
+			Expect(blob).To(Equal([]byte("opts-from-pre-restart")))
+
+			// And the reconciler tick should now call into the scheduler
+			// for the missing replicas instead of bailing out.
+			scheduler := &fakeScheduler{scheduleNode: node}
+			reconciler := NewReplicaReconciler(ReplicaReconcilerOptions{
+				Registry:  registry,
+				Scheduler: scheduler,
+				DB:        db,
+			})
+			reconciler.reconcile(ctx)
+
+			Expect(scheduler.scheduleCalls).ToNot(BeEmpty(),
+				"reconciler must call the scheduler after a frontend restart that wiped NodeModel rows")
+			Expect(scheduler.scheduleCalls[0].modelName).To(Equal("restart-model"))
+		})
+	})
 })
 
 // fakeProber lets tests control whether a model's gRPC address "responds".

core/services/nodes/registry.go

Lines changed: 72 additions & 3 deletions
@@ -94,6 +94,31 @@ type NodeModel struct {
 	UpdatedAt time.Time `json:"updated_at"`
 }
 
+// ModelLoadInfo is per-model load metadata kept independently of NodeModel rows
+// so the Replica Reconciler can re-load a model after every replica row has
+// been removed (worker death, eviction, MarkOffline reaping, frontend restart
+// with stale heartbeats).
+//
+// Why a separate table when the same blob is also stamped on each NodeModel
+// row? NodeModel rows are tied to a live (node, replica) slot and get deleted
+// when a backend stops being healthy. Tying the only copy of load info to
+// that lifecycle is exactly what caused Bug-1: a frontend restart followed by
+// transient worker-row removal left no copy of ModelOptions, so the reconciler
+// could not bring `min_replicas` back without a fresh inference request.
+//
+// Keyed by ModelName (the tracking key used by the router); last-write-wins
+// on the opts blob because two concurrent frontends dispatching the same
+// model with slightly different opts converge on whichever finished last.
+// That is identical to the per-NodeModel-row semantics today; if a stronger
+// guarantee is needed in the future, the row carries UpdatedAt for ordering.
+type ModelLoadInfo struct {
+	ModelName     string    `gorm:"primaryKey;size:255" json:"model_name"`
+	BackendType   string    `gorm:"size:128" json:"backend_type"`
+	ModelOptsBlob []byte    `gorm:"type:bytea" json:"-"`
+	CreatedAt     time.Time `json:"created_at"`
+	UpdatedAt     time.Time `json:"updated_at"`
+}
+
 // NodeLabel is a key-value label on a node (like K8s labels).
 type NodeLabel struct {
 	ID string `gorm:"primaryKey;size:36" json:"id"`
@@ -178,7 +203,7 @@ type NodeRegistry struct {
 // when multiple instances (frontend + workers) start at the same time.
 func NewNodeRegistry(db *gorm.DB) (*NodeRegistry, error) {
 	if err := advisorylock.WithLockCtx(context.Background(), db, advisorylock.KeySchemaMigrate, func() error {
-		return db.AutoMigrate(&BackendNode{}, &NodeModel{}, &NodeLabel{}, &ModelSchedulingConfig{}, &PendingBackendOp{})
+		return db.AutoMigrate(&BackendNode{}, &NodeModel{}, &NodeLabel{}, &ModelSchedulingConfig{}, &PendingBackendOp{}, &ModelLoadInfo{})
 	}); err != nil {
 		return nil, fmt.Errorf("migrating node tables: %w", err)
 	}
@@ -622,10 +647,54 @@ func (r *NodeRegistry) SetNodeModelLoadInfo(ctx context.Context, nodeID, modelNa
 		Updates(map[string]any{"backend_type": backendType, "model_opts_blob": optsBlob}).Error
 }
 
+// UpsertModelLoadInfo records or replaces the per-model load info in the
+// dedicated ModelLoadInfo table. Unlike SetNodeModelLoadInfo (which writes the
+// blob onto a specific replica row and dies with it), this survives every
+// NodeModel row being removed and so lets the reconciler recover replicas
+// after worker death + frontend restart (Bug-1).
+//
+// ON CONFLICT updates backend_type, model_opts_blob, and updated_at. Two
+// frontends dispatching the same model concurrently with slightly different
+// opts converge on whichever transaction committed last; that matches the
+// existing per-replica blob semantics today.
+func (r *NodeRegistry) UpsertModelLoadInfo(ctx context.Context, modelName, backendType string, optsBlob []byte) error {
+	if modelName == "" {
+		return fmt.Errorf("model name is required")
+	}
+	now := time.Now()
+	rec := ModelLoadInfo{
+		ModelName:     modelName,
+		BackendType:   backendType,
+		ModelOptsBlob: optsBlob,
+		CreatedAt:     now,
+		UpdatedAt:     now,
+	}
+	return r.db.WithContext(ctx).Clauses(clause.OnConflict{
+		Columns: []clause.Column{{Name: "model_name"}},
+		DoUpdates: clause.Assignments(map[string]any{
+			"backend_type":    backendType,
+			"model_opts_blob": optsBlob,
+			"updated_at":      now,
+		}),
+	}).Create(&rec).Error
+}
+
 // GetModelLoadInfo retrieves the stored backend type and serialized model
-// options from any existing loaded replica. Returns gorm.ErrRecordNotFound
-// if no replica has stored options.
+// options. Reads from the dedicated ModelLoadInfo table first (survives every
+// NodeModel row being deleted); falls back to scanning loaded NodeModel rows
+// for the load info stamped before any frontend in this cluster ran an
+// UpsertModelLoadInfo (rolling-upgrade transition). Returns
+// gorm.ErrRecordNotFound when neither source has an entry.
 func (r *NodeRegistry) GetModelLoadInfo(ctx context.Context, modelName string) (backendType string, optsBlob []byte, err error) {
+	var info ModelLoadInfo
+	err = r.db.WithContext(ctx).Where("model_name = ?", modelName).First(&info).Error
+	if err == nil {
+		return info.BackendType, info.ModelOptsBlob, nil
+	}
+	if !errors.Is(err, gorm.ErrRecordNotFound) {
+		return "", nil, err
+	}
+
 	var nm NodeModel
 	err = r.db.WithContext(ctx).
 		Where("model_name = ? AND state = ? AND model_opts_blob IS NOT NULL", modelName, "loaded").

core/services/nodes/registry_test.go

Lines changed: 65 additions & 0 deletions
@@ -1100,4 +1100,69 @@ var _ = Describe("NodeRegistry", func() {
 				"reserved capacity must remove a node from VRAM-aware candidates")
 		})
 	})
+
+	Describe("ModelLoadInfo persistence (Bug-1)", func() {
+		It("survives every NodeModel row being removed", func() {
+			ctx := context.Background()
+
+			// One node with one loaded replica + per-replica blob (the legacy path).
+			node := makeNode("li-1", "10.0.1.1:50051", 8_000_000_000)
+			Expect(registry.Register(ctx, node, true)).To(Succeed())
+			Expect(registry.SetNodeModel(ctx, node.ID, "load-info-model", 0, "loaded", node.Address, 0)).To(Succeed())
+			Expect(registry.SetNodeModelLoadInfo(ctx, node.ID, "load-info-model", 0, "llama-cpp", []byte("opts-v1"))).To(Succeed())
+
+			// Persist per-model via the new path (the dispatch hook does this).
+			Expect(registry.UpsertModelLoadInfo(ctx, "load-info-model", "llama-cpp", []byte("opts-v1"))).To(Succeed())
+
+			// Simulate worker death + MarkOffline reaping: every NodeModel row gone.
+			Expect(registry.RemoveAllNodeModelReplicas(ctx, node.ID, "load-info-model")).To(Succeed())
+
+			bt, blob, err := registry.GetModelLoadInfo(ctx, "load-info-model")
+			Expect(err).ToNot(HaveOccurred(),
+				"per-model load info must survive every NodeModel row going away")
+			Expect(bt).To(Equal("llama-cpp"))
+			Expect(blob).To(Equal([]byte("opts-v1")))
+		})
+
+		It("ON CONFLICT updates backend type and opts (last-write-wins)", func() {
+			ctx := context.Background()
+
+			Expect(registry.UpsertModelLoadInfo(ctx, "lww", "llama-cpp", []byte("v1"))).To(Succeed())
+			Expect(registry.UpsertModelLoadInfo(ctx, "lww", "vllm", []byte("v2"))).To(Succeed())
+
+			bt, blob, err := registry.GetModelLoadInfo(ctx, "lww")
+			Expect(err).ToNot(HaveOccurred())
+			Expect(bt).To(Equal("vllm"))
+			Expect(blob).To(Equal([]byte("v2")))
+		})
+
+		It("falls back to legacy NodeModel blob when no per-model row exists", func() {
+			// Pre-fix rolling-upgrade path: a frontend that ran before the new
+			// table existed only wrote the per-replica blob. The new
+			// GetModelLoadInfo must still find it so an upgrade doesn't
+			// regress the reconciler for already-loaded models.
+			ctx := context.Background()
+
+			node := makeNode("li-legacy", "10.0.1.2:50051", 8_000_000_000)
+			Expect(registry.Register(ctx, node, true)).To(Succeed())
+			Expect(registry.SetNodeModel(ctx, node.ID, "legacy-model", 0, "loaded", node.Address, 0)).To(Succeed())
+			Expect(registry.SetNodeModelLoadInfo(ctx, node.ID, "legacy-model", 0, "llama-cpp", []byte("legacy-opts"))).To(Succeed())
+
+			bt, blob, err := registry.GetModelLoadInfo(ctx, "legacy-model")
+			Expect(err).ToNot(HaveOccurred())
+			Expect(bt).To(Equal("llama-cpp"))
+			Expect(blob).To(Equal([]byte("legacy-opts")))
+		})
+
+		It("returns ErrRecordNotFound when neither source has the model", func() {
+			ctx := context.Background()
+			_, _, err := registry.GetModelLoadInfo(ctx, "never-loaded")
+			Expect(err).To(MatchError(gorm.ErrRecordNotFound))
+		})
+
+		It("rejects empty model names", func() {
+			err := registry.UpsertModelLoadInfo(context.Background(), "", "llama-cpp", []byte("x"))
+			Expect(err).To(HaveOccurred())
+		})
+	})
 })

core/services/nodes/router.go

Lines changed: 7 additions & 1 deletion
@@ -156,12 +156,18 @@ func (r *SmartRouter) scheduleAndLoad(ctx context.Context, backendType, tracking
 		xlog.Warn("Failed to record model on node", "node", node.Name, "model", trackingKey, "replica", replicaIndex, "error", err)
 	}
 
-	// Store load metadata for future replica scale-ups by the reconciler
+	// Store load metadata for future replica scale-ups by the reconciler.
+	// Writes both per-replica (NodeModel.model_opts_blob) for backward compat
+	// and per-model (ModelLoadInfo table) so the reconciler can recover after
+	// every replica row has been removed (Bug-1).
 	if modelOpts != nil {
 		if optsBlob, marshalErr := proto.Marshal(modelOpts); marshalErr == nil {
 			if storeErr := r.registry.SetNodeModelLoadInfo(ctx, node.ID, trackingKey, replicaIndex, backendType, optsBlob); storeErr != nil {
 				xlog.Warn("Failed to store model load info", "node", node.Name, "model", trackingKey, "replica", replicaIndex, "error", storeErr)
 			}
+			if storeErr := r.registry.UpsertModelLoadInfo(ctx, trackingKey, backendType, optsBlob); storeErr != nil {
+				xlog.Warn("Failed to upsert per-model load info", "model", trackingKey, "error", storeErr)
+			}
 		}
 	}

core/services/nodes/router_test.go

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -159,6 +159,10 @@ func (f *fakeModelRouter) SetNodeModelLoadInfo(_ context.Context, _, _ string, _
159159
return nil
160160
}
161161

162+
func (f *fakeModelRouter) UpsertModelLoadInfo(_ context.Context, _, _ string, _ []byte) error {
163+
return nil
164+
}
165+
162166
func (f *fakeModelRouter) GetModelLoadInfo(_ context.Context, _ string) (string, []byte, error) {
163167
return "", nil, fmt.Errorf("not found")
164168
}

docs/content/features/distributed-mode.md

Lines changed: 1 addition & 0 deletions
@@ -490,6 +490,7 @@ The **Replica Reconciler** runs as a background process on the frontend:
 - **Scale down**: Removes idle replicas after 5 minutes of inactivity
 - **Maintain minimum**: Ensures `min_replicas` are always loaded (recovers from node failures)
 - **Eviction protection**: Models with auto-scaling enabled are never evicted below `min_replicas`
+- **Restart-safe**: Per-model load metadata (backend type + `ModelOptions`) is persisted in the `model_load_infos` PostgreSQL table on the first successful dispatch, so a frontend restart or rolling upgrade does not require a fresh inference request to repopulate state before the reconciler can scale up replacement replicas.
 
 All fields are optional and composable:
 - Node selector only: pin model to matching nodes, single replica
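
The "Maintain minimum" and "Restart-safe" bullets together imply a simple per-tick decision: for each model, schedule the replicas missing below `min_replicas`, but only if load metadata is available. The sketch below is a deliberately simplified model of that decision, not the reconciler's actual code. `modelState` and `plan` are hypothetical names, and real scheduling also has to account for things this ignores, such as maximum replicas and node capacity.

```go
package main

import "fmt"

// modelState is a hypothetical snapshot of one model at reconcile time.
type modelState struct {
	name        string
	minReplicas int
	loaded      int  // replicas currently recorded as loaded
	hasLoadInfo bool // whether backend type + options can be recovered
}

// plan returns, per model, how many schedule calls one tick would issue.
// Models at or above their minimum are skipped, and so are models with no
// recoverable load info, which must wait for a fresh inference request.
func plan(models []modelState) map[string]int {
	calls := make(map[string]int)
	for _, m := range models {
		missing := m.minReplicas - m.loaded
		if missing <= 0 || !m.hasLoadInfo {
			continue
		}
		calls[m.name] = missing
	}
	return calls
}

func main() {
	calls := plan([]modelState{
		// After a restart: no replica rows left, but the per-model row survived.
		{name: "restart-model", minReplicas: 2, loaded: 0, hasLoadInfo: true},
		// Pre-fix situation: nothing to load from, so nothing is scheduled.
		{name: "no-info-model", minReplicas: 2, loaded: 0, hasLoadInfo: false},
		// Already satisfied.
		{name: "healthy-model", minReplicas: 1, loaded: 1, hasLoadInfo: true},
	})
	fmt.Println(calls) // prints "map[restart-model:2]"
}
```

The first case mirrors the scenario in the commit message: `min_replicas=2`, no NodeModel rows, and a surviving ModelLoadInfo row yield two scheduler calls in one tick.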
