Skip to content

Fix sync coordinator starvation when a source always fails#769

Merged
rdimitrov merged 3 commits intomainfrom
fix/sync-coordinator-starvation
May 6, 2026
Merged

Fix sync coordinator starvation when a source always fails#769
rdimitrov merged 3 commits intomainfrom
fix/sync-coordinator-starvation

Conversation

@rdimitrov
Copy link
Copy Markdown
Member

@rdimitrov rdimitrov commented May 6, 2026

Summary

A permanently-failing syncable source starved every other syncable source. coordinator.performRegistrySync only set LastSyncTime (persisted as registry_sync.ended_at) on success, so a perpetually failing source kept ended_at = NULL and always won the scheduler's ORDER BY ended_at ASC NULLS FIRST LIMIT 1.

Observed on v1.4.0 staging: a Git source with a broken URL blocked a healthy one from ever syncing, leaving dev.toolhive/skills empty.

Changes

  • Set LastSyncTime = &now on every attempt (success and failure) so ended_at always advances.
  • Rename lastSyncTime semantically from "last successful sync" to "last completed sync attempt" — phase already tells consumers whether the attempt succeeded. Doc + swagger updated.
  • Regression tests in internal/sync/coordinator/coordinator_test.go.

Test plan

  • go test ./internal/sync/...
  • task lint-fix
  • Local docker-compose A/B repro on main vs this branch with two syncable Git sources

Fixes #770

`coordinator.performRegistrySync` only set `syncStatus.LastSyncTime`
on the success branch. That field is persisted to
`registry_sync.ended_at`, which the scheduler reads via
`ListSourceSyncsByLastUpdate` (ordered `ended_at ASC NULLS FIRST
LIMIT 1`). A source whose syncs always failed therefore kept
`ended_at` NULL on every tick, perpetually sorted first under
NULLS FIRST, and starved every other syncable source. This was
observed in v1.4.0 staging where a broken Git source blocked a
healthy one from ever syncing, leaving `dev.toolhive/skills`
empty. The Kubernetes source was unaffected because it is
informer-driven and does not ride the coordinator.

Move the `LastSyncTime = &now` assignment ahead of the
success/failure split so `ended_at` always advances, then have
`fetchSyncStatus` only project it onto the public
`SourceSyncStatus.LastSyncTime` field when the most recent
attempt completed. That preserves the documented "last successful
sync" semantics on the API surface while letting the DB column
serve its scheduling role. Add a unit regression test that
asserts the failure branch sets `LastSyncTime`, and a higher-level
test that runs two syncable sources (one always failing) against
an in-memory state service mirroring the real ordering and
asserts the healthy source is selected by the second tick.

Fixes the staging starvation reported on v1.4.0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented May 6, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 61.01%. Comparing base (193417d) to head (3c8d7bc).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #769      +/-   ##
==========================================
+ Coverage   60.50%   61.01%   +0.51%     
==========================================
  Files         108      108              
  Lines       10580    10580              
==========================================
+ Hits         6401     6455      +54     
+ Misses       3610     3547      -63     
- Partials      569      578       +9     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

`task docs` (which runs `swag init`) emits no trailing newline, but
the previous edit reintroduced one. The `Verify Generated Files`
CI job re-runs `swag init` and diffs against the committed file,
so the stray newline failed the check.
rdimitrov added a commit that referenced this pull request May 6, 2026
A staging deployment was wedged for ~2.5 months because a `git`-type
sync hit a clone error and the pod was killed before
`performRegistrySync`'s defer could write a terminal `sync_status`. The
`registry_sync` row was left at `IN_PROGRESS` indefinitely. Subsequent
restarts never repaired it: `BulkInitializeSourceSyncs` is
`ON CONFLICT DO NOTHING` and there is no other call path that updates an
existing `registry_sync` row outside of an active sync. Every
coordinator tick called `GetNextSyncJob`, picked the wedged row, and
`ShouldSync` short-circuited on `SyncPhaseSyncing` before reaching its
trailing log line — so the wedge was completely silent.

Add an `isOrphanedInProgressSync` check inside `ShouldSync`. When the
phase is `Syncing` but `started_at` is nil or older than a one-hour
grace window, treat the row as orphaned, log a warning, and fall
through to the normal sync flow. The coordinator's deferred
`UpdateSyncStatus` will then write a fresh terminal status on the next
sync attempt.

The grace window is comfortably longer than any realistic sync run,
so a real in-flight sync is never misclassified as orphaned. The
single-replica deployment has no sync-side leader election; the grace
window is purely defensive in case multi-replica is ever introduced.

The fix lives in `ShouldSync` rather than as a startup reaper because
it heals the wedge wherever it manifests rather than only on pod
restart, and avoids a SQL-layer surface change. It also incidentally
plugs the `ShouldSync` early-return log gap that made the wedge silent
in production: every IN_PROGRESS skip now emits a structured log line.

#769 closes the graceful-failure starvation hole; this commit closes
the crash-mid-sync wedge hole.

Improves on #769
@rdimitrov rdimitrov changed the title Advance ended_at on failed sync to prevent scheduler starvation Fix sync coordinator starvation when a source always fails May 6, 2026
@rdimitrov rdimitrov merged commit 415190e into main May 6, 2026
15 checks passed
@rdimitrov rdimitrov deleted the fix/sync-coordinator-starvation branch May 6, 2026 21:27
rdimitrov added a commit that referenced this pull request May 6, 2026
## Bug

A `git` sync hit a clone error and the pod was killed before
`performRegistrySync`'s defer could write a terminal `sync_status`,
leaving the `registry_sync` row at `IN_PROGRESS`.
`BulkInitializeSourceSyncs` is `ON CONFLICT DO NOTHING`, so restarts
never repaired it. Every coordinator tick then short-circuited in
`ShouldSync` on `SyncPhaseSyncing` before reaching its trailing log line
— wedged for ~2.5 months on staging, completely silent.

## Fix

In `ShouldSync`, detect `IN_PROGRESS` rows whose `started_at` is `nil`
or older than a 1-hour grace window. Log a `WARN` and fall through to
the normal sync flow so the coordinator's deferred `UpdateSyncStatus`
writes a fresh terminal status on the next sync attempt that actually
runs. The grace window is comfortably longer than any realistic sync
run.

Lives in `ShouldSync` rather than a startup reaper so it heals lazily on
every tick (not only on restart) and avoids a SQL surface change. As a
side effect, every `IN_PROGRESS` skip now emits a structured log line,
closing the silent-wedge gap that hid the original incident.

## Smoke test (docker compose)

- **Before**: wedge `registry_sync` to `IN_PROGRESS` with `started_at =
now() - 2h`, restart → coordinator silently skips (`Registry does not
need sync, reason: sync-already-in-progress`, `DEBUG` only — invisible
at `INFO`).
- **After**: same setup → `WARN: Detected stale IN_PROGRESS sync,
treating as failed`. `ShouldSync` falls through; once the regular flow
runs the sync (data change, filter change, manual trigger), the deferred
status update flips the row to a terminal status.

## Tests

Added to `TestDefaultSyncManager_ShouldSync`:
- `stale IN_PROGRESS past grace window is treated as orphaned` — fails
without the fix.
- `IN_PROGRESS with no started_at is treated as orphaned` — covers nil
timestamps.

Tightened the existing `sync not needed when already syncing` case to
`LastAttempt = now - 5min` to exercise the inside-grace-window path.

## Relationship to #769

#769 closes the graceful-failure starvation hole; this PR closes the
SIGKILL-mid-sync wedge hole.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Sync coordinator starves healthy sources when one always fails

3 participants