Commit 8d6548c

localai-bot and mudler authored
fix(distributed): sync gallery OpCache + caches across frontend replicas (#9983)
When the LocalAI frontend deployment is scaled past one replica, the UI's
/api/operations poll round-robins between pods. Each pod kept the OpCache
(galleryID->jobID), OpStatus map, and the post-install in-memory caches
(ModelConfigLoader, UpgradeChecker) purely in-process. Reads never
consulted PostgreSQL or NATS even though writes already published to PG.

Symptoms:
- A user installing a model on replica A saw the operation card flicker in
  and out as the load balancer alternated.
- The Models page re-fetched the whole gallery on every flicker because
  useEffect([operations.length]) re-fires when the count changes.
- A chat completion that landed on replica B after the install completed
  on replica A failed to find the new model; B's ModelConfigLoader was
  still the old one because nothing told it to reload from disk.
- The UpgradeChecker 6-hour cache stayed stale on peer replicas after a
  backend upgrade, so /api/backends/upgrades kept surfacing an upgrade
  that had already shipped.

Mirror the jobs Dispatcher pattern for gallery ops:
- OpCache learns SetMessagingClient/SetGalleryStore + a Start(ctx) that
  hydrates from PostgreSQL and subscribes to gallery.opcache.{start,end}.
  Set/SetBackend now upsert cache_key + is_backend_op on the
  gallery_operations row and broadcast OpCacheEvent so peers merge it in.
  The hydrate path uses a new GalleryStore.ListActive() (status in
  {pending, downloading, processing} and updated within 30 min).
- GalleryService.SubscribeBroadcasts wires a
  SubjectGalleryProgressWildcard subscriber that calls a new lock-light
  mergeStatus into the local statuses map, plus a
  SubjectGalleryCancelWildcard subscriber that runs the locally-registered
  cancel func. Hydrate() restores active rows from PostgreSQL on startup
  so a freshly-started replica is not observably empty mid-install.
  CancelOperation tolerates the cancel func living on a different replica
  and publishes anyway.
- modelHandler and backendHandler publish on the new
  SubjectCacheInvalidateModels / SubjectCacheInvalidateBackends after a
  successful install/delete/upgrade. SubscribeBroadcasts wires peers to
  refresh: OnModelsChanged (re-runs LoadModelConfigsFromPath) and
  OnBackendOpCompleted (re-triggers UpgradeChecker). The originating
  replica reloads inline so it never enters the broadcast handler.
- OpStatus.Error (an error interface) flat-marshalled to "{}" over JSON,
  so a failed install replicated to a peer arrived with a nil error and
  the UI's failure banner never appeared. Add MarshalJSON/UnmarshalJSON
  via an opStatusWire shim that round-trips Error as a string.
- UpdateStatus and CancelOperation now drop the mutex before publishing to
  NATS or persisting to PostgreSQL. The wildcard subscriber's mergeStatus
  loops back into the same service on the publishing replica and would
  deadlock otherwise; this also prevents future PG round-trips from
  stalling concurrent readers on every progress tick.

Tests cover the OpStatus error round-trip, OpCache propagation through a
shared in-memory bus, OpCache PostgreSQL hydration (active-only),
GalleryService progress + cancel broadcast, Nodes preservation across a
peer's bare progress tick, GalleryService hydration from PG, and the two
cache-invalidation broadcasts (models + backends). 44 specs total in
galleryop; routes/operations specs and jobs/agents suites still pass.

Assisted-by: claude-code:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
1 parent b02e3ff commit 8d6548c

9 files changed

Lines changed: 1185 additions & 36 deletions

core/application/startup.go

Lines changed: 25 additions & 0 deletions
@@ -15,6 +15,7 @@ import (
 	"github.com/mudler/LocalAI/core/http/auth"
 	"github.com/mudler/LocalAI/core/services/galleryop"
 	"github.com/mudler/LocalAI/core/services/jobs"
+	"github.com/mudler/LocalAI/core/services/messaging"
 	"github.com/mudler/LocalAI/core/services/monitoring"
 	"github.com/mudler/LocalAI/core/services/nodes"
 	"github.com/mudler/LocalAI/core/services/routing/admission"
@@ -312,6 +313,30 @@ func New(opts ...config.AppOption) (*Application, error) {
 			}
 			application.galleryService.SetGalleryStore(distSvc.DistStores.Gallery)
 		}
+		// Hydrate from the store first so the wildcard subscriber finds an
+		// already-populated statuses map for any operations still in flight
+		// on a peer replica.
+		if err := application.galleryService.Hydrate(); err != nil {
+			xlog.Warn("Gallery service hydrate failed", "error", err)
+		}
+		// Bind cache-invalidation handler before SubscribeBroadcasts so the
+		// first inbound event is already routed. Peer replicas install a
+		// model and broadcast on SubjectCacheInvalidateModels; this
+		// callback re-runs LoadModelConfigsFromPath so a subsequent chat
+		// completion that load-balances onto this replica finds the new
+		// config. The originating replica reloads inline in modelHandler
+		// and never enters this path.
+		gs := application.galleryService
+		sys := options.SystemState
+		cfgLoaderOpts := options.ToConfigLoaderOptions()
+		gs.OnModelsChanged = func(_ messaging.CacheInvalidateEvent) {
+			if err := application.ModelConfigLoader().LoadModelConfigsFromPath(sys.Model.ModelsPath, cfgLoaderOpts...); err != nil {
+				xlog.Warn("Failed to reload model configs after peer invalidation", "error", err)
+			}
+		}
+		if err := application.galleryService.SubscribeBroadcasts(); err != nil {
+			xlog.Warn("Gallery service subscribe failed", "error", err)
+		}
 		// Wire distributed model/backend managers so delete propagates to workers
 		application.galleryService.SetModelManager(
 			nodes.NewDistributedModelManager(options, application.modelLoader, distSvc.Unloader),
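
The ordering constraint in this hunk (bind OnModelsChanged, then call SubscribeBroadcasts) can be sketched with a toy synchronous bus. Everything below except the names CacheInvalidateEvent, OnModelsChanged, and SubscribeBroadcasts is hypothetical scaffolding, and the drop-when-unbound behaviour is an assumption about how an unbound callback would be handled, not the project's confirmed implementation.

```go
package main

import "fmt"

// CacheInvalidateEvent mirrors the two fields visible in the commit.
type CacheInvalidateEvent struct {
	Element string
	Op      string
}

// bus is a toy synchronous stand-in for a NATS subject (hypothetical).
type bus struct {
	handlers []func(CacheInvalidateEvent)
}

func (b *bus) Subscribe(h func(CacheInvalidateEvent)) {
	b.handlers = append(b.handlers, h)
}

func (b *bus) Publish(ev CacheInvalidateEvent) {
	for _, h := range b.handlers {
		h(ev)
	}
}

// service is a hypothetical stand-in for the gallery service.
type service struct {
	bus             *bus
	OnModelsChanged func(CacheInvalidateEvent)
	dropped         int
}

// SubscribeBroadcasts routes inbound events to the callback if one is
// bound; otherwise the event is counted as dropped (assumed behaviour).
func (s *service) SubscribeBroadcasts() {
	s.bus.Subscribe(func(ev CacheInvalidateEvent) {
		if s.OnModelsChanged == nil {
			s.dropped++
			return
		}
		s.OnModelsChanged(ev)
	})
}

// run wires a service in the given order while one peer event arrives
// immediately after subscribing, and reports what happened to it.
func run(bindFirst bool) (reloads, dropped int) {
	b := &bus{}
	s := &service{bus: b}
	bind := func() {
		s.OnModelsChanged = func(CacheInvalidateEvent) { reloads++ }
	}
	if bindFirst {
		bind()
	}
	s.SubscribeBroadcasts()
	b.Publish(CacheInvalidateEvent{Element: "example-model", Op: "install"})
	if !bindFirst {
		bind()
	}
	return reloads, s.dropped
}

func main() {
	fmt.Println(run(true))
	fmt.Println(run(false))
}
```

With the callback bound first, the early peer event triggers a reload; with the order reversed, the same event is lost and the replica keeps serving its old model configs until the next invalidation.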

core/http/app.go

Lines changed: 14 additions & 0 deletions
@@ -367,6 +367,20 @@ func API(application *application.Application) (*echo.Echo, error) {
 	var opcache *galleryop.OpCache
 	if !application.ApplicationConfig().DisableWebUI {
 		opcache = galleryop.NewOpCache(application.GalleryService())
+		// In distributed mode, wire the NATS client + gallery store so this
+		// replica's OpCache stays in sync with peers — without this the
+		// /api/operations endpoint returns whatever this single replica
+		// happened to admit, and a load-balanced UI poll alternates between
+		// "operation visible" and "operation gone" between replicas.
+		if d := application.Distributed(); d != nil {
+			opcache.SetMessagingClient(d.Nats)
+			if d.DistStores != nil && d.DistStores.Gallery != nil {
+				opcache.SetGalleryStore(d.DistStores.Gallery)
+			}
+			if err := opcache.Start(application.ApplicationConfig().Context); err != nil {
+				xlog.Warn("OpCache distributed subscribe failed; running standalone", "error", err)
+			}
+		}
 	}
 
 	mcpMw := auth.RequireFeature(application.AuthDB(), auth.FeatureMCP)
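
A rough sketch of how two replicas' OpCaches might stay in sync over a shared bus, in the spirit of the "OpCache propagation through a shared in-memory bus" test named in the commit message. The memBus type, the OpCacheEvent fields, the Origin-based self-skip, and the method set are all assumptions for illustration; the real OpCache talks to NATS and PostgreSQL and is not shown in this diff.

```go
package main

import (
	"fmt"
	"sync"
)

// OpCacheEvent is a hypothetical wire shape for start/end notifications.
type OpCacheEvent struct {
	Origin string // replica that produced the event
	Key    string // cache key, e.g. a gallery ID
	JobID  string
	End    bool // true when the operation is finished
}

// memBus fans every event out to all subscribers, including the sender.
type memBus struct {
	mu   sync.Mutex
	subs []func(OpCacheEvent)
}

func (b *memBus) Subscribe(fn func(OpCacheEvent)) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.subs = append(b.subs, fn)
}

func (b *memBus) Publish(ev OpCacheEvent) {
	b.mu.Lock()
	subs := append([]func(OpCacheEvent){}, b.subs...)
	b.mu.Unlock()
	for _, fn := range subs {
		fn(ev)
	}
}

// OpCache maps cache keys to job IDs; a nil bus means standalone mode.
type OpCache struct {
	id  string
	bus *memBus
	mu  sync.RWMutex
	ops map[string]string
}

func NewOpCache(id string, bus *memBus) *OpCache {
	c := &OpCache{id: id, bus: bus, ops: map[string]string{}}
	if bus != nil {
		bus.Subscribe(c.merge)
	}
	return c
}

// merge applies a peer's event; our own events were applied on write.
func (c *OpCache) merge(ev OpCacheEvent) {
	if ev.Origin == c.id {
		return
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	if ev.End {
		delete(c.ops, ev.Key)
		return
	}
	c.ops[ev.Key] = ev.JobID
}

func (c *OpCache) publish(ev OpCacheEvent) {
	if c.bus != nil {
		c.bus.Publish(ev)
	}
}

// Set records the operation locally, then releases the lock before
// publishing because the bus calls straight back into merge.
func (c *OpCache) Set(key, jobID string) {
	c.mu.Lock()
	c.ops[key] = jobID
	c.mu.Unlock()
	c.publish(OpCacheEvent{Origin: c.id, Key: key, JobID: jobID})
}

func (c *OpCache) Delete(key string) {
	c.mu.Lock()
	delete(c.ops, key)
	c.mu.Unlock()
	c.publish(OpCacheEvent{Origin: c.id, Key: key, End: true})
}

func (c *OpCache) Get(key string) (string, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	jobID, ok := c.ops[key]
	return jobID, ok
}

func main() {
	bus := &memBus{}
	a, b := NewOpCache("replica-a", bus), NewOpCache("replica-b", bus)
	a.Set("example-gallery@example-model", "job-1")
	jobID, ok := b.Get("example-gallery@example-model")
	fmt.Println(jobID, ok)
}
```

Passing a nil bus gives the standalone behaviour that the hunk falls back to when Start fails: writes still land in the local map, they are simply not shared.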

core/services/distributed/gallery.go

Lines changed: 75 additions & 8 deletions
@@ -6,16 +6,25 @@ import (
 
 	"github.com/google/uuid"
 	"gorm.io/gorm"
+	"gorm.io/gorm/clause"
 )
 
 // GalleryOperationRecord tracks model/backend download operations in PostgreSQL.
+//
+// CacheKey and IsBackendOp mirror the in-memory OpCache held by each frontend
+// replica. They are written when a request first lands so a freshly-started
+// (or freshly-routed-to) replica can rebuild its OpCache from this table
+// instead of returning an empty `/api/operations` payload while the real
+// operation is still in flight on a peer.
 type GalleryOperationRecord struct {
 	ID                 string    `gorm:"primaryKey;size:36" json:"id"`
 	UserID             string    `gorm:"index;size:36" json:"user_id,omitempty"`
 	GalleryElementName string    `gorm:"size:255" json:"gallery_element_name"`
-	OpType             string    `gorm:"size:32" json:"op_type"`                // "model_install", "model_delete", "backend_install"
-	Status             string    `gorm:"size:32;default:pending" json:"status"` // pending, downloading, processing, completed, failed, cancelled
-	Progress           float64   `json:"progress"`                              // 0.0 to 1.0
+	CacheKey           string    `gorm:"index;size:512" json:"cache_key,omitempty"` // OpCache key (galleryID or node:<id>:<backend>)
+	IsBackendOp        bool      `json:"is_backend_op"`                             // true if installed via SetBackend
+	OpType             string    `gorm:"size:32" json:"op_type"`                    // "model_install", "model_delete", "backend_install"
+	Status             string    `gorm:"size:32;default:pending" json:"status"`     // pending, downloading, processing, completed, failed, cancelled
+	Progress           float64   `json:"progress"`                                  // 0.0 to 1.0
 	Message            string    `gorm:"type:text" json:"message,omitempty"`
 	Error              string    `gorm:"type:text" json:"error,omitempty"`
 	FileName           string    `gorm:"size:512" json:"file_name,omitempty"`
@@ -27,6 +36,12 @@ type GalleryOperationRecord struct {
 	UpdatedAt          time.Time `json:"updated_at"`
 }
 
+// activeStatuses lists the gallery_operations.status values that represent an
+// operation a replica should still surface via /api/operations. Hydration and
+// the dedup lookup share this set so the two paths never disagree about what
+// "still active" means.
+var activeStatuses = []string{"pending", "downloading", "processing"}
+
 func (GalleryOperationRecord) TableName() string { return "gallery_operations" }
 
 // GalleryStore manages gallery operation state in PostgreSQL.
@@ -42,14 +57,26 @@ func NewGalleryStore(db *gorm.DB) (*GalleryStore, error) {
 	return &GalleryStore{db: db}, nil
 }
 
-// Create stores a new gallery operation.
+// Create stores a new gallery operation. Tolerates a row already existing
+// for this ID — OpCache.Set may have written a placeholder row via
+// UpsertCacheKey before the galleryop service goroutine called Create, and
+// in that case we want to fill in the descriptive columns (gallery element
+// name, op type, status) rather than fail with a primary-key conflict.
+// CacheKey and IsBackendOp are intentionally not in DoUpdates so the
+// placeholder's values win.
 func (s *GalleryStore) Create(op *GalleryOperationRecord) error {
 	if op.ID == "" {
 		op.ID = uuid.New().String()
 	}
 	op.CreatedAt = time.Now()
 	op.UpdatedAt = op.CreatedAt
-	return s.db.Create(op).Error
+	return s.db.Clauses(clause.OnConflict{
+		Columns: []clause.Column{{Name: "id"}},
+		DoUpdates: clause.AssignmentColumns([]string{
+			"gallery_element_name", "op_type", "status",
+			"frontend_id", "user_id", "cancellable", "updated_at",
+		}),
+	}).Create(op).Error
 }
 
 // UpdateProgress updates progress for an operation.
@@ -93,14 +120,55 @@ func (s *GalleryStore) List(status string) ([]GalleryOperationRecord, error) {
 	return ops, q.Find(&ops).Error
 }
 
+// ListActive returns operations still considered in-flight — used by replicas
+// to rehydrate their in-memory OpCache + statuses on startup. Stale records
+// (older than 30 minutes without an update) are excluded so a crashed peer's
+// orphaned rows never resurrect on a healthy replica; the existing CleanStale
+// reaper eventually marks them failed.
+func (s *GalleryStore) ListActive() ([]GalleryOperationRecord, error) {
+	var ops []GalleryOperationRecord
+	staleCutoff := time.Now().Add(-30 * time.Minute)
+	err := s.db.Where("status IN ? AND updated_at > ?", activeStatuses, staleCutoff).
+		Order("created_at DESC").Find(&ops).Error
+	return ops, err
+}
+
+// UpsertCacheKey records the in-memory OpCache key + IsBackendOp flag on the
+// gallery_operations row, creating the row if it does not exist yet.
+//
+// Why upsert: OpCache.Set is called by the HTTP admission handler before the
+// galleryop service goroutine processes the operation and calls Create. If
+// OpCache wrote with a plain Updates() those columns would silently be lost
+// in the window between the two, so peer replicas hydrating in that window
+// would still rebuild an empty OpCache. Upsert closes that window.
+func (s *GalleryStore) UpsertCacheKey(id, cacheKey string, isBackend bool) error {
+	now := time.Now()
+	rec := GalleryOperationRecord{
+		ID:          id,
+		CacheKey:    cacheKey,
+		IsBackendOp: isBackend,
+		Status:      "pending",
+		CreatedAt:   now,
+		UpdatedAt:   now,
+	}
+	return s.db.Clauses(clause.OnConflict{
+		Columns: []clause.Column{{Name: "id"}},
+		DoUpdates: clause.Assignments(map[string]any{
+			"cache_key":     cacheKey,
+			"is_backend_op": isBackend,
+			"updated_at":    now,
+		}),
+	}).Create(&rec).Error
+}
+
 // FindDuplicate checks if another instance is already downloading the same element.
 // Only considers records updated within the last 30 minutes as active — older
 // in-progress records are assumed to be stale (crashed instance).
 func (s *GalleryStore) FindDuplicate(elementName string) (*GalleryOperationRecord, error) {
 	var op GalleryOperationRecord
 	staleCutoff := time.Now().Add(-30 * time.Minute)
 	err := s.db.Where("gallery_element_name = ? AND status IN ? AND updated_at > ?", elementName,
-		[]string{"pending", "downloading", "processing"}, staleCutoff).First(&op).Error
+		activeStatuses, staleCutoff).First(&op).Error
 	if err != nil {
 		return nil, err
 	}
@@ -118,8 +186,7 @@ func (s *GalleryStore) Cancel(id string) error {
 func (s *GalleryStore) CleanStale(age time.Duration) error {
 	cutoff := time.Now().Add(-age)
 	return s.db.Model(&GalleryOperationRecord{}).
-		Where("updated_at < ? AND status IN ?", cutoff,
-			[]string{"pending", "downloading", "processing"}).
+		Where("updated_at < ? AND status IN ?", cutoff, activeStatuses).
 		Updates(map[string]any{
 			"status": "failed",
 			"error":  "stale operation cleaned up on startup",

core/services/galleryop/backends.go

Lines changed: 16 additions & 0 deletions
@@ -9,6 +9,7 @@ import (
 
 	"github.com/mudler/LocalAI/core/config"
 	"github.com/mudler/LocalAI/core/gallery"
+	"github.com/mudler/LocalAI/core/services/messaging"
 	"github.com/mudler/LocalAI/pkg/downloader"
 	"github.com/mudler/LocalAI/pkg/model"
 	"github.com/mudler/LocalAI/pkg/system"
@@ -114,6 +115,21 @@ func (g *GalleryService) backendHandler(op *ManagementOp[gallery.GalleryBackend,
 		return err
 	}
 
+	// Tell peer replicas that the backend set has changed. UpgradeChecker
+	// caches upgrade-available bits for 6 hours, so without this peers would
+	// keep advertising an upgrade for a backend that already moved.
+	opName := "install"
+	switch {
+	case op.Delete:
+		opName = "delete"
+	case op.Upgrade:
+		opName = "upgrade"
+	}
+	g.publishCacheInvalidate(messaging.SubjectCacheInvalidateBackends, messaging.CacheInvalidateEvent{
+		Element: op.GalleryElementName,
+		Op:      opName,
+	})
+
 	g.UpdateStatus(op.ID,
 		&OpStatus{
 			Deletion: op.Delete,
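
The operation-name selection in this hunk is small enough to isolate. In the sketch below, opName and encodeInvalidate are hypothetical helpers, and the JSON tags on CacheInvalidateEvent are assumptions; the diff only shows that the event carries Element and Op.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// CacheInvalidateEvent carries the two fields used in the hunk above.
// The JSON tags are assumptions, not the project's actual wire format.
type CacheInvalidateEvent struct {
	Element string `json:"element"`
	Op      string `json:"op"`
}

// opName mirrors the switch in backendHandler: delete is checked first,
// then upgrade, and anything else counts as an install.
func opName(isDelete, isUpgrade bool) string {
	switch {
	case isDelete:
		return "delete"
	case isUpgrade:
		return "upgrade"
	}
	return "install"
}

// encodeInvalidate builds the payload a publisher could put on the wire.
func encodeInvalidate(element string, isDelete, isUpgrade bool) ([]byte, error) {
	return json.Marshal(CacheInvalidateEvent{
		Element: element,
		Op:      opName(isDelete, isUpgrade),
	})
}

func main() {
	payload, err := encodeInvalidate("example-backend", false, true)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(payload))
}
```

Since the switch evaluates cases in order, an operation flagged as both delete and upgrade is reported as a delete.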
