Fix sync coordinator starvation when a source always fails#769
Merged
Fix sync coordinator starvation when a source always fails#769
Conversation
`coordinator.performRegistrySync` only set `syncStatus.LastSyncTime` on the success branch. That field is persisted to `registry_sync.ended_at`, which the scheduler reads via `ListSourceSyncsByLastUpdate` (ordered `ended_at ASC NULLS FIRST LIMIT 1`). A source whose syncs always failed therefore kept `ended_at` NULL on every tick, perpetually sorted first under NULLS FIRST, and starved every other syncable source. This was observed in v1.4.0 staging where a broken Git source blocked a healthy one from ever syncing, leaving `dev.toolhive/skills` empty. The Kubernetes source was unaffected because it is informer-driven and does not ride the coordinator. Move the `LastSyncTime = &now` assignment ahead of the success/failure split so `ended_at` always advances, then have `fetchSyncStatus` only project it onto the public `SourceSyncStatus.LastSyncTime` field when the most recent attempt completed. That preserves the documented "last successful sync" semantics on the API surface while letting the DB column serve its scheduling role. Add a unit regression test that asserts the failure branch sets `LastSyncTime`, and a higher-level test that runs two syncable sources (one always failing) against an in-memory state service mirroring the real ordering and asserts the healthy source is selected by the second tick. Fixes the staging starvation reported on v1.4.0. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #769 +/- ##
==========================================
+ Coverage 60.50% 61.01% +0.51%
==========================================
Files 108 108
Lines 10580 10580
==========================================
+ Hits 6401 6455 +54
+ Misses 3610 3547 -63
- Partials 569 578 +9 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
`task docs` (which runs `swag init`) emits no trailing newline, but the previous edit reintroduced one. The `Verify Generated Files` CI job re-runs `swag init` and diffs against the committed file, so the stray newline failed the check.
rdimitrov
added a commit
that referenced
this pull request
May 6, 2026
A staging deployment was wedged for ~2.5 months because a `git`-type sync hit a clone error and the pod was killed before `performRegistrySync`'s defer could write a terminal `sync_status`. The `registry_sync` row was left at `IN_PROGRESS` indefinitely. Subsequent restarts never repaired it: `BulkInitializeSourceSyncs` is `ON CONFLICT DO NOTHING` and there is no other call path that updates an existing `registry_sync` row outside of an active sync. Every coordinator tick called `GetNextSyncJob`, picked the wedged row, and `ShouldSync` short-circuited on `SyncPhaseSyncing` before reaching its trailing log line — so the wedge was completely silent. Add an `isOrphanedInProgressSync` check inside `ShouldSync`. When the phase is `Syncing` but `started_at` is nil or older than a one-hour grace window, treat the row as orphaned, log a warning, and fall through to the normal sync flow. The coordinator's deferred `UpdateSyncStatus` will then write a fresh terminal status on the next sync attempt. The grace window is comfortably longer than any realistic sync run, so a real in-flight sync is never misclassified as orphaned. The single-replica deployment has no sync-side leader election; the grace window is purely defensive in case multi-replica is ever introduced. The fix lives in `ShouldSync` rather than as a startup reaper because it heals the wedge wherever it manifests rather than only on pod restart, and avoids a SQL-layer surface change. It also incidentally plugs the `ShouldSync` early-return log gap that made the wedge silent in production: every IN_PROGRESS skip now emits a structured log line. #769 closes the graceful-failure starvation hole; this commit closes the crash-mid-sync wedge hole. Improves on #769
ended_at on failed sync to prevent scheduler starvation
ChrisJBurns
approved these changes
May 6, 2026
rdimitrov
added a commit
that referenced
this pull request
May 6, 2026
## Bug A `git` sync hit a clone error and the pod was killed before `performRegistrySync`'s defer could write a terminal `sync_status`, leaving the `registry_sync` row at `IN_PROGRESS`. `BulkInitializeSourceSyncs` is `ON CONFLICT DO NOTHING`, so restarts never repaired it. Every coordinator tick then short-circuited in `ShouldSync` on `SyncPhaseSyncing` before reaching its trailing log line — wedged for ~2.5 months on staging, completely silent. ## Fix In `ShouldSync`, detect `IN_PROGRESS` rows whose `started_at` is `nil` or older than a 1-hour grace window. Log a `WARN` and fall through to the normal sync flow so the coordinator's deferred `UpdateSyncStatus` writes a fresh terminal status on the next sync attempt that actually runs. The grace window is comfortably longer than any realistic sync run. Lives in `ShouldSync` rather than a startup reaper so it heals lazily on every tick (not only on restart) and avoids a SQL surface change. As a side effect, every `IN_PROGRESS` skip now emits a structured log line, closing the silent-wedge gap that hid the original incident. ## Smoke test (docker compose) - **Before**: wedge `registry_sync` to `IN_PROGRESS` with `started_at = now() - 2h`, restart → coordinator silently skips (`Registry does not need sync, reason: sync-already-in-progress`, `DEBUG` only — invisible at `INFO`). - **After**: same setup → `WARN: Detected stale IN_PROGRESS sync, treating as failed`. `ShouldSync` falls through; once the regular flow runs the sync (data change, filter change, manual trigger), the deferred status update flips the row to a terminal status. ## Tests Added to `TestDefaultSyncManager_ShouldSync`: - `stale IN_PROGRESS past grace window is treated as orphaned` — fails without the fix. - `IN_PROGRESS with no started_at is treated as orphaned` — covers nil timestamps. Tightened the existing `sync not needed when already syncing` case to `LastAttempt = now - 5min` to exercise the inside-grace-window path. ## Relationship to #769 #769 closes the graceful-failure starvation hole; this PR closes the SIGKILL-mid-sync wedge hole.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
A permanently-failing syncable source starved every other syncable source.
coordinator.performRegistrySynconly setLastSyncTime(persisted asregistry_sync.ended_at) on success, so a perpetually failing source keptended_at = NULLand always won the scheduler'sORDER BY ended_at ASC NULLS FIRST LIMIT 1.Observed on
v1.4.0staging: a Git source with a broken URL blocked a healthy one from ever syncing, leavingdev.toolhive/skillsempty.Changes
LastSyncTime = &nowon every attempt (success and failure) soended_atalways advances.lastSyncTimesemantically from "last successful sync" to "last completed sync attempt" —phasealready tells consumers whether the attempt succeeded. Doc + swagger updated.internal/sync/coordinator/coordinator_test.go.Test plan
go test ./internal/sync/...task lint-fixmainvs this branch with two syncable Git sourcesFixes #770