
Commit 75a63f8

feat(distributed): sync state with frontends, better backend management reporting (#9426)
* fix(distributed): detect backend upgrades across worker nodes Before this change `DistributedBackendManager.CheckUpgrades` delegated to the local manager, which read backends from the frontend filesystem. In distributed deployments the frontend has no backends installed locally — they live on workers — so the upgrade-detection loop never ran and the UI silently never surfaced upgrades even when the gallery advertised newer versions or digests. Worker-side: NATS backend.list reply now carries Version, URI and Digest for each installed backend (read from metadata.json). Frontend-side: DistributedBackendManager.ListBackends aggregates per-node refs (name, status, version, digest) instead of deduping, and CheckUpgrades feeds that aggregation into gallery.CheckUpgradesAgainst — a new entrypoint factored out of CheckBackendUpgrades so both paths share the same core logic. Cluster drift policy: when per-node version/digest tuples disagree, the backend is flagged upgradeable regardless of whether any single node matches the gallery, and UpgradeInfo.NodeDrift enumerates the outliers so operators can see *why* it is out of sync. The next upgrade-all realigns the cluster. Tests cover: drift detection, unanimous-match (no upgrade), and the empty-installed-version path that the old distributed code silently missed. * feat(ui): surface backend upgrades in the System page The System page (Manage.jsx) only showed updates as a tiny inline arrow, so operators routinely missed them. Port the Backend Gallery's upgrade UX so System speaks the same visual language: - Yellow banner at the top of the Backends tab when upgrades are pending, with an "Upgrade all" button (serial fan-out, matches the gallery) and a "Updates only" filter toggle. - Warning pill (↑ N) next to the tab label so the count is glanceable even when the banner is scrolled out of view. 
  - Per-row labeled "Upgrade to vX.Y" button (replaces the icon-only button that silently flipped semantics between Reinstall and Upgrade), plus an "Update available" badge in the new Version column.
  - New columns: Version (with upgrade + drift chips), Nodes (per-node attribution badges for distributed mode, degrading to a compact "on N nodes · M offline" chip above three nodes), Installed (relative time).
  - System backends render a "Protected" chip instead of a bare "—" so rows still align and the reason is obvious.
  - Delete uses the softer btn-danger-ghost so rows don't scream red; the ConfirmDialog still owns the "are you sure".

  The upgrade checker also needed the same per-worker fix as the previous commit: NewUpgradeChecker now takes a BackendManager getter so its periodic runs call the distributed CheckUpgrades (which asks workers) instead of the empty frontend filesystem. Without this the /api/backends/upgrades endpoint stayed empty in distributed mode even with the protocol change in place.

  New CSS primitives — .upgrade-banner, .tab-pill, .badge-row, .cell-stack, .cell-mono, .cell-muted, .row-actions, .btn-danger-ghost — all live in App.css so other pages can adopt them without duplicating styles.

* feat(ui): polish the Nodes page so it reads like a product

  The Nodes page was the biggest visual liability in distributed mode. Rework the main dashboard surfaces in place without changing behavior:

  StatCards: uniform height (96px min), left accent bar colored by the metric's semantic (success/warning/error/primary), icon lives in a 36x36 soft-tinted chip top-right, value is left-aligned and large. Grid auto-fills so the row doesn't collapse on narrow viewports. This replaces the previous thin-bordered boxes with inconsistent heights.

  Table rows: expandable rows now show a chevron cue on the left (rotates on expand) so users know rows open. The status cell became a dedicated chip with an LED-style halo dot instead of a bare bullet.
  Action buttons gained labels — "Approve", "Resume", "Drain" — so the icons aren't doing all the semantic work; the destructive remove action uses the softer btn-danger-ghost variant so rows don't scream red, with the ConfirmDialog still owning the real "are you sure".

  Applied cell-mono/cell-muted utility classes so label chips and addresses share one spacing/font grammar instead of re-declaring inline styles everywhere.

  Expanded drawer: empty states for Loaded Models and Installed Backends now render as a proper drawer-empty card (dashed border, icon, one-line hint) instead of a plain muted string that read like broken formatting.

  Tabs: three inline-styled buttons became the shared .tab class so they inherit focus ring, hover state, and the rest of the design system — matches the System page.

  The "Add more workers" toggle turned into a .nodes-add-worker dashed-border button labelled "Register a new worker" (action voice) instead of a chevron + muted link that operators kept mistaking for broken text.

  New shared CSS primitives carry over to other pages: .stat-grid + .stat-card, .row-chevron, .node-status, .drawer-empty, .nodes-add-worker.

* feat(distributed): durable backend fan-out + state reconciliation

  Two connected problems handled together:

  1) Backend delete/install/upgrade used to silently skip non-healthy nodes, so a delete during an outage left a zombie on the offline node once it returned. The fan-out now records intent in a new pending_backend_ops table before attempting the NATS round-trip. Currently-healthy nodes get an immediate attempt; everyone else is queued. A unique index on (node_id, backend, op) means reissuing the same operation refreshes next_retry_at instead of stacking duplicates.

  2) Loaded-model state could drift from reality: a worker that OOM'd, was killed, or restarted a backend process could leave a node_models row claiming the model was still loaded, feeding ghost entries into the /api/nodes/models listing and the router's scheduling decisions.
  The existing ReplicaReconciler gains two new passes that run under a fresh KeyStateReconciler advisory lock (non-blocking, so one wedged frontend doesn't freeze the cluster):

  - drainPendingBackendOps: retries queued ops whose next_retry_at has passed on currently-healthy nodes. Success deletes the row; failure bumps attempts and pushes next_retry_at out with exponential backoff (30s → 15m cap). ErrNoResponders also marks the node unhealthy.
  - probeLoadedModels: runs a gRPC HealthCheck against addresses the DB believes are loaded but hasn't seen touched within probeStaleAfter (2m). Unreachable addresses are removed from the registry. A pluggable ModelProber lets tests substitute a fake without standing up gRPC.

  DistributedBackendManager exposes DeleteBackendDetailed so the HTTP handler can surface per-node outcomes ("2 succeeded, 1 queued") to the UI in a follow-up commit; the existing DeleteBackend still returns error-only for callers that don't care about the node breakdown.

  Multi-frontend safety: the state pass uses advisorylock.TryWithLockCtx on a new key so N frontends coordinate — the same pattern the health monitor and replica reconciler already rely on. Single-node mode runs both passes inline (adapter is nil, state drain is a no-op).

  Tests cover the upsert semantics, backoff math, the probe removing an unreachable model but keeping a reachable one, and filtering by probeStaleAfter.

* feat(ui): show cluster distribution of models in the System page

  When a frontend restarted in distributed mode, models that workers had already loaded weren't visible until the operator clicked into each node manually — the /api/models/capabilities endpoint only knew about configs on the frontend's filesystem, not the registry-backed truth.

  /api/models/capabilities now joins in ListAllLoadedModels() when the registry is active, returning loaded_on[] with node id/name/state/status for each model.
  Models that live in the registry but lack a local config (the actual ghosts, not recovered from the frontend's file cache) still surface with source="registry-only" so operators can see and persist them; without that emission they'd be invisible to this frontend.

  Manage → Models replaces the old Running/Idle pill with a distribution cell that lists the first three nodes the model is loaded on as chips colored by state (green loaded, blue loading, amber anything else). On wider clusters the remaining count collapses into a +N chip with a title-attribute breakdown. Disabled / single-node behavior is unchanged.

  Adopted models get an extra "Adopted" ghost-icon chip with hover copy explaining what it means and how to make it permanent.

  Distributed mode also enables a 10s auto-refresh and a "Last synced Xs ago" indicator next to the Update button so ghost rows drop off within one reconcile tick after their owning process dies. Non-distributed mode is untouched — no polling, no cell-stack, same old Running/Idle.

* feat(ui): NodeDistributionChip — shared per-node attribution component

  Large clusters were going to break the Manage → Backends Nodes column: the old inline logic rendered every node as a badge and would shred the layout at >10 workers, plus the Manage → Models distribution cell had copy-pasted its own slightly-different version.

  NodeDistributionChip handles any cluster size with two render modes:

  - small (≤3 nodes): inline chips of node names, colored by health.
  - large: a single "on N nodes · M offline · K drift" summary chip; clicking opens a Popover with a per-node table (name, status, version, digest for backends; name, status, state for models).

  Drift counting mirrors the backend's summarizeNodeDrift so the UI number matches UpgradeInfo.NodeDrift. Digests are truncated to the docker-style 12-char form with the full value preserved in the title.
  Popover is a new general-purpose primitive: fixed positioning anchored to the trigger, flips above when there's no room below, closes on outside-click or Escape, and returns focus to the trigger. It uses .card as its surface so theming is inherited. Also useful for a future labels-editor popup and the user menu.

  Manage.jsx drops its duplicated inline Nodes-column + loaded_on cell and uses the shared chip with context="backends" / "models" respectively. Deleting the duplicated code removes ~40 lines of ad-hoc logic.

* feat(ui): shared FilterBar across the System page tabs

  The Backends gallery had a nice search + chip + toggle strip; the System page had nothing, so the two surfaces felt like different apps. Lift the pattern into a reusable FilterBar and wire both System tabs through it.

  New component core/http/react-ui/src/components/FilterBar.jsx renders a search input, a role="tablist" chip row (aria-selected for a11y), and optional toggles / right slot. Chips support an optional `count` which the System page uses to show "User 3", "Updates 1" etc.

  System Models tab: search by id or backend; chips for All/Running/Idle/Disabled/Pinned plus a conditional Distributed chip in distributed mode. "Last synced" + the Update button live in the right slot.

  System Backends tab: search by name/alias/meta-backend-for; chips for All/User/System/Meta plus conditional Updates / Offline-nodes chips when relevant. The old ad-hoc "Updates only" toggle from the upgrade banner folded into the Updates chip — one source of truth for that filter. The Offline chip only appears in distributed mode when at least one backend has an unhealthy node, so the chip row stays quiet on healthy clusters.

  Filter state persists in URL query params (mq/mf/bq/bf) so deep links and tab switches keep the operator's filter context instead of resetting every time.
Also adds an "Adopted" distribution path: when a model in /api/models/capabilities carries source="registry-only" (discovered on a worker but not configured locally), the Models tab shows a ghost chip labelled "Adopted" with hover copy explaining how to persist it — this is what closes the loop on the ghost-model story end-to-end.
1 parent 9cd8d79 commit 75a63f8

21 files changed: +2182 additions, -309 deletions


core/application/distributed.go

Lines changed: 13 additions & 7 deletions

@@ -242,14 +242,20 @@ func initDistributed(cfg *config.ApplicationConfig, authDB *gorm.DB) (*Distribut
 		DB: authDB,
 	})
 
-	// Create ReplicaReconciler for auto-scaling model replicas
+	// Create ReplicaReconciler for auto-scaling model replicas. Adapter +
+	// RegistrationToken feed the state-reconciliation passes: pending op
+	// drain uses the adapter, and model health probes use the token to auth
+	// against workers' gRPC HealthCheck.
 	reconciler := nodes.NewReplicaReconciler(nodes.ReplicaReconcilerOptions{
-		Registry:       registry,
-		Scheduler:      router,
-		Unloader:       remoteUnloader,
-		DB:             authDB,
-		Interval:       30 * time.Second,
-		ScaleDownDelay: 5 * time.Minute,
+		Registry:          registry,
+		Scheduler:         router,
+		Unloader:          remoteUnloader,
+		Adapter:           remoteUnloader,
+		RegistrationToken: cfg.Distributed.RegistrationToken,
+		DB:                authDB,
+		Interval:          30 * time.Second,
+		ScaleDownDelay:    5 * time.Minute,
+		ProbeStaleAfter:   2 * time.Minute,
 	})
 
 	// Create ModelRouterAdapter to wire into ModelLoader

core/application/startup.go

Lines changed: 6 additions & 1 deletion

@@ -235,7 +235,12 @@ func New(opts ...config.AppOption) (*Application, error) {
 	// In distributed mode, uses PostgreSQL advisory lock so only one frontend
 	// instance runs periodic checks (avoids duplicate upgrades across replicas).
 	if len(options.BackendGalleries) > 0 {
-		uc := NewUpgradeChecker(options, application.ModelLoader(), application.distributedDB())
+		// Pass a lazy getter for the backend manager so the checker always
+		// uses the active one — DistributedBackendManager is swapped in above
+		// and asks workers for their installed backends, which is what
+		// upgrade detection needs in distributed mode.
+		bmFn := func() galleryop.BackendManager { return application.GalleryService().BackendManager() }
+		uc := NewUpgradeChecker(options, application.ModelLoader(), application.distributedDB(), bmFn)
 		application.upgradeChecker = uc
 		go uc.Run(options.Context)
 	}

core/application/upgrade_checker.go

Lines changed: 39 additions & 14 deletions

@@ -8,6 +8,7 @@ import (
 	"github.com/mudler/LocalAI/core/config"
 	"github.com/mudler/LocalAI/core/gallery"
 	"github.com/mudler/LocalAI/core/services/advisorylock"
+	"github.com/mudler/LocalAI/core/services/galleryop"
 	"github.com/mudler/LocalAI/pkg/model"
 	"github.com/mudler/LocalAI/pkg/system"
 	"github.com/mudler/xlog"
@@ -26,6 +27,12 @@ type UpgradeChecker struct {
 	galleries   []config.Gallery
 	systemState *system.SystemState
 	db          *gorm.DB // non-nil in distributed mode
+	// backendManagerFn lazily returns the current backend manager (may be
+	// swapped from Local to Distributed after startup). Pulled through each
+	// check so the UpgradeChecker uses whichever is active. In distributed
+	// mode this ensures CheckUpgrades asks workers instead of the (empty)
+	// frontend filesystem — fixing the bug where upgrades never surfaced.
+	backendManagerFn func() galleryop.BackendManager
 
 	checkInterval time.Duration
 	stop          chan struct{}
@@ -40,18 +47,22 @@ type UpgradeChecker struct {
 // NewUpgradeChecker creates a new UpgradeChecker service.
 // Pass db=nil for standalone mode, or a *gorm.DB for distributed mode
 // (uses advisory locks so only one instance runs periodic checks).
-func NewUpgradeChecker(appConfig *config.ApplicationConfig, ml *model.ModelLoader, db *gorm.DB) *UpgradeChecker {
+// backendManagerFn is optional; when set, CheckUpgrades is routed through
+// the active backend manager — required in distributed mode so the check
+// aggregates from workers rather than the empty frontend filesystem.
+func NewUpgradeChecker(appConfig *config.ApplicationConfig, ml *model.ModelLoader, db *gorm.DB, backendManagerFn func() galleryop.BackendManager) *UpgradeChecker {
 	return &UpgradeChecker{
-		appConfig:     appConfig,
-		modelLoader:   ml,
-		galleries:     appConfig.BackendGalleries,
-		systemState:   appConfig.SystemState,
-		db:            db,
-		checkInterval: 6 * time.Hour,
-		stop:          make(chan struct{}),
-		done:          make(chan struct{}),
-		triggerCh:     make(chan struct{}, 1),
-		lastUpgrades:  make(map[string]gallery.UpgradeInfo),
+		appConfig:        appConfig,
+		modelLoader:      ml,
+		galleries:        appConfig.BackendGalleries,
+		systemState:      appConfig.SystemState,
+		db:               db,
+		backendManagerFn: backendManagerFn,
+		checkInterval:    6 * time.Hour,
+		stop:             make(chan struct{}),
+		done:             make(chan struct{}),
+		triggerCh:        make(chan struct{}, 1),
+		lastUpgrades:     make(map[string]gallery.UpgradeInfo),
 	}
 }
 
@@ -64,13 +75,16 @@ func NewUpgradeChecker(appConfig *config.ApplicationConfig, ml *model.ModelLoade
 func (uc *UpgradeChecker) Run(ctx context.Context) {
 	defer close(uc.done)
 
-	// Initial delay: don't slow down startup
+	// Initial delay: don't slow down startup. Short enough that operators
+	// don't stare at an empty upgrade banner for long; long enough that
+	// workers have registered and reported their installed backends.
+	initialDelay := 10 * time.Second
 	select {
 	case <-ctx.Done():
 		return
 	case <-uc.stop:
 		return
-	case <-time.After(30 * time.Second):
+	case <-time.After(initialDelay):
 	}
 
 	// First check always runs locally (to warm the cache on this instance)
@@ -144,7 +158,18 @@ func NewUpgradeChecker(appConfig *config.ApplicationConfig, ml *model.ModelLoade
 }
 
 func (uc *UpgradeChecker) runCheck(ctx context.Context) {
-	upgrades, err := gallery.CheckBackendUpgrades(ctx, uc.galleries, uc.systemState)
+	var (
+		upgrades map[string]gallery.UpgradeInfo
+		err      error
+	)
+	if uc.backendManagerFn != nil {
+		if bm := uc.backendManagerFn(); bm != nil {
+			upgrades, err = bm.CheckUpgrades(ctx)
+		}
+	}
+	if upgrades == nil && err == nil {
+		upgrades, err = gallery.CheckBackendUpgrades(ctx, uc.galleries, uc.systemState)
+	}
 
 	uc.mu.Lock()
 	uc.lastCheckTime = time.Now()

core/cli/worker.go

Lines changed: 3 additions & 0 deletions

@@ -738,6 +738,9 @@ func (s *backendSupervisor) subscribeLifecycleEvents() {
 		if b.Metadata != nil {
 			info.InstalledAt = b.Metadata.InstalledAt
 			info.GalleryURL = b.Metadata.GalleryURL
+			info.Version = b.Metadata.Version
+			info.URI = b.Metadata.URI
+			info.Digest = b.Metadata.Digest
 		}
 		infos = append(infos, info)
 	}

core/gallery/backends.go

Lines changed: 17 additions & 0 deletions

@@ -394,6 +394,23 @@ type SystemBackend struct {
 	Metadata         *BackendMetadata
 	UpgradeAvailable bool   `json:"upgrade_available,omitempty"`
 	AvailableVersion string `json:"available_version,omitempty"`
+	// Nodes holds per-node attribution in distributed mode. Empty in single-node.
+	// Each entry describes a node that has this backend installed, with the
+	// version/digest it reports. Lets the UI surface drift and per-node status.
+	Nodes []NodeBackendRef `json:"nodes,omitempty"`
+}
+
+// NodeBackendRef describes one node's view of an installed backend. Used both
+// for per-node attribution in the UI and for drift detection during upgrade
+// checks (a cluster with mismatched versions/digests is flagged upgradeable).
+type NodeBackendRef struct {
+	NodeID      string `json:"node_id"`
+	NodeName    string `json:"node_name"`
+	NodeStatus  string `json:"node_status"` // healthy | unhealthy | offline | draining | pending
+	Version     string `json:"version,omitempty"`
+	Digest      string `json:"digest,omitempty"`
+	URI         string `json:"uri,omitempty"`
+	InstalledAt string `json:"installed_at,omitempty"`
 }
 
 type SystemBackends map[string]SystemBackend

core/gallery/upgrade.go

Lines changed: 108 additions & 24 deletions

@@ -23,20 +23,43 @@ type UpgradeInfo struct {
 	AvailableVersion string `json:"available_version"`
 	InstalledDigest  string `json:"installed_digest,omitempty"`
 	AvailableDigest  string `json:"available_digest,omitempty"`
+	// NodeDrift lists nodes whose installed version or digest differs from
+	// the cluster majority. Non-empty means the cluster has diverged and an
+	// upgrade will realign it. Empty in single-node mode.
+	NodeDrift []NodeDriftInfo `json:"node_drift,omitempty"`
 }
 
-// CheckBackendUpgrades compares installed backends against gallery entries
-// and returns a map of backend names to UpgradeInfo for those that have
-// newer versions or different OCI digests available.
+// NodeDriftInfo describes one node that disagrees with the cluster majority
+// on which version/digest of a backend is installed.
+type NodeDriftInfo struct {
+	NodeID   string `json:"node_id"`
+	NodeName string `json:"node_name"`
+	Version  string `json:"version,omitempty"`
+	Digest   string `json:"digest,omitempty"`
+}
+
+// CheckBackendUpgrades is the single-node entrypoint. Distributed callers use
+// CheckUpgradesAgainst directly with their aggregated SystemBackends.
 func CheckBackendUpgrades(ctx context.Context, galleries []config.Gallery, systemState *system.SystemState) (map[string]UpgradeInfo, error) {
-	galleryBackends, err := AvailableBackends(galleries, systemState)
+	installed, err := ListSystemBackends(systemState)
 	if err != nil {
-		return nil, fmt.Errorf("failed to list available backends: %w", err)
+		return nil, fmt.Errorf("failed to list installed backends: %w", err)
 	}
+	return CheckUpgradesAgainst(ctx, galleries, systemState, installed)
+}
 
-	installedBackends, err := ListSystemBackends(systemState)
+// CheckUpgradesAgainst compares a caller-supplied SystemBackends set against
+// the gallery. Fixes the distributed-mode bug where the old code passed the
+// frontend's (empty) local filesystem through ListSystemBackends and so never
+// surfaced any upgrades.
+//
+// Cluster drift policy: if a backend's per-node versions/digests disagree, the
+// row is flagged upgradeable regardless of whether any node matches the gallery
+// — next Upgrade All realigns the cluster. NodeDrift lists the outliers.
+func CheckUpgradesAgainst(ctx context.Context, galleries []config.Gallery, systemState *system.SystemState, installedBackends SystemBackends) (map[string]UpgradeInfo, error) {
+	galleryBackends, err := AvailableBackends(galleries, systemState)
 	if err != nil {
-		return nil, fmt.Errorf("failed to list installed backends: %w", err)
+		return nil, fmt.Errorf("failed to list available backends: %w", err)
 	}
 
 	result := make(map[string]UpgradeInfo)
@@ -57,56 +80,117 @@ func CheckBackendUpgrades(ctx context.Context, galleries []config.Gallery, syste
 		}
 
 		installedVersion := installed.Metadata.Version
+		installedDigest := installed.Metadata.Digest
 		galleryVersion := galleryEntry.Version
 
-		// If both sides have versions, compare them
+		// Detect cluster drift: does every node report the same version+digest?
+		// In single-node mode this stays empty (Nodes is nil).
+		majority, drift := summarizeNodeDrift(installed.Nodes)
+		if majority.version != "" {
+			installedVersion = majority.version
+		}
+		if majority.digest != "" {
+			installedDigest = majority.digest
+		}
+
+		makeInfo := func(info UpgradeInfo) UpgradeInfo {
+			info.NodeDrift = drift
+			return info
+		}
+
+		// If versions are available on both sides, they're the source of truth.
 		if galleryVersion != "" && installedVersion != "" {
-			if galleryVersion != installedVersion {
-				result[installed.Metadata.Name] = UpgradeInfo{
+			if galleryVersion != installedVersion || len(drift) > 0 {
+				result[installed.Metadata.Name] = makeInfo(UpgradeInfo{
 					BackendName:      installed.Metadata.Name,
 					InstalledVersion: installedVersion,
 					AvailableVersion: galleryVersion,
-				}
+				})
 			}
-			// Versions match — no upgrade needed
 			continue
 		}
 
-		// Gallery has a version but installed doesn't — this happens for backends
-		// installed before version tracking was added. Flag as upgradeable so
-		// users can re-install to pick up version metadata.
+		// Gallery has a version but installed doesn't — backends installed before
+		// version tracking was added. Flag as upgradeable to pick up metadata.
		if galleryVersion != "" && installedVersion == "" {
-			result[installed.Metadata.Name] = UpgradeInfo{
+			result[installed.Metadata.Name] = makeInfo(UpgradeInfo{
 				BackendName:      installed.Metadata.Name,
 				InstalledVersion: "",
 				AvailableVersion: galleryVersion,
-			}
+			})
 			continue
 		}
 
-		// Fall back to OCI digest comparison when versions are unavailable
+		// Fall back to OCI digest comparison when versions are unavailable.
 		if downloader.URI(galleryEntry.URI).LooksLikeOCI() {
 			remoteDigest, err := oci.GetImageDigest(galleryEntry.URI, "", nil, nil)
 			if err != nil {
 				xlog.Warn("Failed to get remote OCI digest for upgrade check", "backend", installed.Metadata.Name, "error", err)
 				continue
 			}
 			// If we have a stored digest, compare; otherwise any remote digest
-			// means we can't confirm we're up to date — flag as upgradeable
-			if installed.Metadata.Digest == "" || remoteDigest != installed.Metadata.Digest {
-				result[installed.Metadata.Name] = UpgradeInfo{
+			// means we can't confirm we're up to date — flag as upgradeable.
+			if installedDigest == "" || remoteDigest != installedDigest || len(drift) > 0 {
+				result[installed.Metadata.Name] = makeInfo(UpgradeInfo{
 					BackendName:     installed.Metadata.Name,
-					InstalledDigest: installed.Metadata.Digest,
+					InstalledDigest: installedDigest,
 					AvailableDigest: remoteDigest,
-				}
+				})
 			}
+		} else if len(drift) > 0 {
+			// No version/digest path but nodes disagree — still worth flagging.
+			result[installed.Metadata.Name] = makeInfo(UpgradeInfo{
+				BackendName:      installed.Metadata.Name,
+				InstalledVersion: installedVersion,
+				InstalledDigest:  installedDigest,
+			})
 		}
-		// No version info and non-OCI URI — cannot determine, skip
 	}
 
 	return result, nil
 }
 
+// summarizeNodeDrift collapses per-node version/digest tuples to a majority
+// pair and returns the outliers. In single-node mode (empty nodes slice) this
+// returns zero values and a nil drift list.
+func summarizeNodeDrift(nodes []NodeBackendRef) (majority struct{ version, digest string }, drift []NodeDriftInfo) {
+	if len(nodes) == 0 {
+		return majority, nil
+	}
+
+	type key struct{ version, digest string }
+	counts := map[key]int{}
+	var topKey key
+	var topCount int
+	for _, n := range nodes {
+		k := key{n.Version, n.Digest}
+		counts[k]++
+		if counts[k] > topCount {
+			topCount = counts[k]
+			topKey = k
+		}
+	}
+
+	majority.version = topKey.version
+	majority.digest = topKey.digest
+
+	if len(counts) == 1 {
+		return majority, nil // unanimous — no drift
+	}
+	for _, n := range nodes {
+		if n.Version == majority.version && n.Digest == majority.digest {
+			continue
+		}
+		drift = append(drift, NodeDriftInfo{
+			NodeID:   n.NodeID,
+			NodeName: n.NodeName,
+			Version:  n.Version,
+			Digest:   n.Digest,
		})
+	}
+	return majority, drift
+}
+
 // UpgradeBackend upgrades a single backend to the latest gallery version using
 // an atomic swap with backup-based rollback on failure.
 func UpgradeBackend(ctx context.Context, systemState *system.SystemState, modelLoader *model.ModelLoader, galleries []config.Gallery, backendName string, downloadStatus func(string, string, string, float64)) error {
