Fix sync coordinator starvation when a source always fails by rdimitrov · Pull Request #769 · stacklok/toolhive-registry-server

rdimitrov · 2026-05-06T11:59:32Z

Summary

A permanently-failing syncable source starved every other syncable source. coordinator.performRegistrySync only set LastSyncTime (persisted as registry_sync.ended_at) on success, so a perpetually failing source kept ended_at = NULL and always won the scheduler's ORDER BY ended_at ASC NULLS FIRST LIMIT 1.

Observed on v1.4.0 staging: a Git source with a broken URL blocked a healthy one from ever syncing, leaving dev.toolhive/skills empty.

Changes

Set LastSyncTime = &now on every attempt (success and failure) so ended_at always advances.
Rename lastSyncTime semantically from "last successful sync" to "last completed sync attempt" — phase already tells consumers whether the attempt succeeded. Doc + swagger updated.
Regression tests in internal/sync/coordinator/coordinator_test.go.

Test plan

go test ./internal/sync/...
task lint-fix
Local docker-compose A/B repro on main vs this branch with two syncable Git sources

Fixes #770

`coordinator.performRegistrySync` only set `syncStatus.LastSyncTime` on the success branch. That field is persisted to `registry_sync.ended_at`, which the scheduler reads via `ListSourceSyncsByLastUpdate` (ordered `ended_at ASC NULLS FIRST LIMIT 1`). A source whose syncs always failed therefore kept `ended_at` NULL on every tick, perpetually sorted first under NULLS FIRST, and starved every other syncable source. This was observed in v1.4.0 staging where a broken Git source blocked a healthy one from ever syncing, leaving `dev.toolhive/skills` empty. The Kubernetes source was unaffected because it is informer-driven and does not ride the coordinator. Move the `LastSyncTime = &now` assignment ahead of the success/failure split so `ended_at` always advances, then have `fetchSyncStatus` only project it onto the public `SourceSyncStatus.LastSyncTime` field when the most recent attempt completed. That preserves the documented "last successful sync" semantics on the API surface while letting the DB column serve its scheduling role. Add a unit regression test that asserts the failure branch sets `LastSyncTime`, and a higher-level test that runs two syncable sources (one always failing) against an in-memory state service mirroring the real ordering and asserts the healthy source is selected by the second tick. Fixes the staging starvation reported on v1.4.0. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

codecov-commenter · 2026-05-06T12:04:53Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 61.01%. Comparing base (193417d) to head (3c8d7bc).

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #769      +/-   ##
==========================================
+ Coverage   60.50%   61.01%   +0.51%     
==========================================
  Files         108      108              
  Lines       10580    10580              
==========================================
+ Hits         6401     6455      +54     
+ Misses       3610     3547      -63     
- Partials      569      578       +9

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

`task docs` (which runs `swag init`) emits no trailing newline, but the previous edit reintroduced one. The `Verify Generated Files` CI job re-runs `swag init` and diffs against the committed file, so the stray newline failed the check.

A staging deployment was wedged for ~2.5 months because a `git`-type sync hit a clone error and the pod was killed before `performRegistrySync`'s defer could write a terminal `sync_status`. The `registry_sync` row was left at `IN_PROGRESS` indefinitely. Subsequent restarts never repaired it: `BulkInitializeSourceSyncs` is `ON CONFLICT DO NOTHING` and there is no other call path that updates an existing `registry_sync` row outside of an active sync. Every coordinator tick called `GetNextSyncJob`, picked the wedged row, and `ShouldSync` short-circuited on `SyncPhaseSyncing` before reaching its trailing log line — so the wedge was completely silent. Add an `isOrphanedInProgressSync` check inside `ShouldSync`. When the phase is `Syncing` but `started_at` is nil or older than a one-hour grace window, treat the row as orphaned, log a warning, and fall through to the normal sync flow. The coordinator's deferred `UpdateSyncStatus` will then write a fresh terminal status on the next sync attempt. The grace window is comfortably longer than any realistic sync run, so a real in-flight sync is never misclassified as orphaned. The single-replica deployment has no sync-side leader election; the grace window is purely defensive in case multi-replica is ever introduced. The fix lives in `ShouldSync` rather than as a startup reaper because it heals the wedge wherever it manifests rather than only on pod restart, and avoids a SQL-layer surface change. It also incidentally plugs the `ShouldSync` early-return log gap that made the wedge silent in production: every IN_PROGRESS skip now emits a structured log line. #769 closes the graceful-failure starvation hole; this commit closes the crash-mid-sync wedge hole. Improves on #769

## Bug A `git` sync hit a clone error and the pod was killed before `performRegistrySync`'s defer could write a terminal `sync_status`, leaving the `registry_sync` row at `IN_PROGRESS`. `BulkInitializeSourceSyncs` is `ON CONFLICT DO NOTHING`, so restarts never repaired it. Every coordinator tick then short-circuited in `ShouldSync` on `SyncPhaseSyncing` before reaching its trailing log line — wedged for ~2.5 months on staging, completely silent. ## Fix In `ShouldSync`, detect `IN_PROGRESS` rows whose `started_at` is `nil` or older than a 1-hour grace window. Log a `WARN` and fall through to the normal sync flow so the coordinator's deferred `UpdateSyncStatus` writes a fresh terminal status on the next sync attempt that actually runs. The grace window is comfortably longer than any realistic sync run. Lives in `ShouldSync` rather than a startup reaper so it heals lazily on every tick (not only on restart) and avoids a SQL surface change. As a side effect, every `IN_PROGRESS` skip now emits a structured log line, closing the silent-wedge gap that hid the original incident. ## Smoke test (docker compose) - **Before**: wedge `registry_sync` to `IN_PROGRESS` with `started_at = now() - 2h`, restart → coordinator silently skips (`Registry does not need sync, reason: sync-already-in-progress`, `DEBUG` only — invisible at `INFO`). - **After**: same setup → `WARN: Detected stale IN_PROGRESS sync, treating as failed`. `ShouldSync` falls through; once the regular flow runs the sync (data change, filter change, manual trigger), the deferred status update flips the row to a terminal status. ## Tests Added to `TestDefaultSyncManager_ShouldSync`: - `stale IN_PROGRESS past grace window is treated as orphaned` — fails without the fix. - `IN_PROGRESS with no started_at is treated as orphaned` — covers nil timestamps. Tightened the existing `sync not needed when already syncing` case to `LastAttempt = now - 5min` to exercise the inside-grace-window path. ## Relationship to #769 #769 closes the graceful-failure starvation hole; this PR closes the SIGKILL-mid-sync wedge hole.

Align lastSyncTime with completed sync attempts

cf2b2ed

rdimitrov mentioned this pull request May 6, 2026

Sync coordinator starves healthy sources when one always fails #770

Closed

Regenerate swagger.json to drop trailing newline

3c8d7bc

`task docs` (which runs `swag init`) emits no trailing newline, but the previous edit reintroduced one. The `Verify Generated Files` CI job re-runs `swag init` and diffs against the committed file, so the stray newline failed the check.

rdimitrov mentioned this pull request May 6, 2026

Self-heal stale IN_PROGRESS syncs in ShouldSync #772

Merged

rdimitrov changed the title ~~Advance ended_at on failed sync to prevent scheduler starvation~~ Fix sync coordinator starvation when a source always fails May 6, 2026

ChrisJBurns approved these changes May 6, 2026

View reviewed changes

rdimitrov merged commit 415190e into main May 6, 2026
15 checks passed

rdimitrov deleted the fix/sync-coordinator-starvation branch May 6, 2026 21:27

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix sync coordinator starvation when a source always fails#769

Fix sync coordinator starvation when a source always fails#769
rdimitrov merged 3 commits intomainfrom
fix/sync-coordinator-starvation

rdimitrov commented May 6, 2026 •

edited

Loading

Uh oh!

codecov-commenter commented May 6, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

rdimitrov commented May 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Test plan

Uh oh!

codecov-commenter commented May 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

rdimitrov commented May 6, 2026 •

edited

Loading

codecov-commenter commented May 6, 2026 •

edited

Loading