feature(agent): per-source circuit breaker, health endpoints, stagger#252
Open
nghiadaulau wants to merge 1 commit into
A failing signal source was retried on every tick — hammering a downed
backend forever — and the long-promised pull staggering was never actually
implemented (all sources fired at the same instant).
- pkg/agent/health.go: a per-source circuit breaker. Repeated failures back
off exponentially (from poll_interval up to source_backoff_max, default
15m); a single successful pull closes the breaker, so a source recovers
on its own with no restart. tickSource consults it before pulling and
records success/failure.
- Stagger: tick() now spreads source pulls across up to half the poll
interval instead of launching them simultaneously.
- Admin endpoints (X-Gateway-Secret), keyed by the source name shown in the
health view:
- GET /api/agent/sources/health — per-source state (ok / backing_off /
paused), failures, last error, next-eligible time. Sources are seeded
at startup so they appear before their first pull.
- POST /api/agent/sources/:name/pause | /resume — manual hold; reject an
unknown name (404) instead of creating a phantom source.
- DELETE /api/agent/sources/:name/cursor — drop a stuck cursor so the
source re-backfills from lookback (CursorStore.Delete added).
- Config triple-touch (source_backoff_max) + Helm (sourceBackoffMax) + docs.
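The backoff rule in the list above can be sketched as follows. This is a minimal illustration, not the code in pkg/agent/health.go: the `breaker` type, its fields, and the helper names are hypothetical, and it assumes the first failure waits exactly one poll_interval before doubling.

```go
package main

import (
	"fmt"
	"time"
)

// breaker is a hypothetical per-source state; the PR's real type may differ.
type breaker struct {
	failures int       // consecutive failed pulls
	nextTry  time.Time // earliest time the next pull is allowed
}

// backoffDelay returns base * 2^(failures-1), capped at maxDelay.
// Zero failures means no delay.
func backoffDelay(base, maxDelay time.Duration, failures int) time.Duration {
	if failures <= 0 {
		return 0
	}
	d := base
	for i := 1; i < failures; i++ {
		if d >= maxDelay/2 { // doubling would reach or pass the cap
			return maxDelay
		}
		d *= 2
	}
	if d > maxDelay {
		return maxDelay
	}
	return d
}

// recordFailure bumps the failure count and pushes nextTry out.
func (b *breaker) recordFailure(now time.Time, base, maxDelay time.Duration) {
	b.failures++
	b.nextTry = now.Add(backoffDelay(base, maxDelay, b.failures))
}

// recordSuccess closes the breaker: one good pull clears all state.
func (b *breaker) recordSuccess() {
	b.failures = 0
	b.nextTry = time.Time{}
}

// allow reports whether a pull may run at time now.
func (b *breaker) allow(now time.Time) bool {
	return !now.Before(b.nextTry)
}

func main() {
	// Assumed values: poll_interval 30s, source_backoff_max 15m (the default).
	base, maxDelay := 30*time.Second, 15*time.Minute
	for n := 1; n <= 7; n++ {
		fmt.Printf("failure %d -> wait %s\n", n, backoffDelay(base, maxDelay, n))
	}
}
```

Passing `now` in explicitly, rather than calling `time.Now()` inside the methods, keeps the backoff/cap/reset behavior testable without sleeping.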
Addresses upstream issue VersusControl#117. Deferred (follow-ups): sharing the breaker
with the analyze get_related_logs reader, and gap-window tracking.
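One way to picture the staggering from the list above is an even spread of start offsets over half the poll interval. The sketch below is an assumption about the shape of the logic, not the PR's tick() implementation; `staggerOffsets` is a hypothetical helper, and the actual spacing may differ.

```go
package main

import (
	"fmt"
	"time"
)

// staggerOffsets returns one start delay per source, evenly spaced over
// [0, interval/2). Source 0 starts immediately.
func staggerOffsets(n int, interval time.Duration) []time.Duration {
	if n <= 0 {
		return nil
	}
	window := interval / 2
	offsets := make([]time.Duration, n)
	for i := range offsets {
		offsets[i] = window * time.Duration(i) / time.Duration(n)
	}
	return offsets
}

func main() {
	// Assumed values: 4 sources, 60s poll interval.
	for i, d := range staggerOffsets(4, 60*time.Second) {
		fmt.Printf("source %d starts after %s\n", i, d)
	}
}
```

Keeping every offset inside the first half of the interval leaves the second half free for slow pulls to finish before the next tick.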
Tests: breaker backoff/cap/reset, pause/resume + unknown-name rejection,
snapshot states, tickSource skips a backing-off source then resumes, and
CursorStore.Delete.
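The pause/resume behavior with unknown-name rejection can be sketched as a registry that is seeded at startup and never grows on lookup. The type, method, and source names here (`registry`, `setPaused`, "loki") are hypothetical and only illustrate the idea; the PR's handlers and types may be organized differently.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errUnknownSource is what a handler would translate into HTTP 404.
var errUnknownSource = errors.New("unknown source")

// registry tracks the paused flag per known source. A key being present
// means the source was seeded at startup.
type registry struct {
	mu     sync.Mutex
	paused map[string]bool
}

// newRegistry seeds every configured source so it is visible before its
// first pull.
func newRegistry(names ...string) *registry {
	r := &registry{paused: make(map[string]bool, len(names))}
	for _, n := range names {
		r.paused[n] = false
	}
	return r
}

// setPaused flips the flag for a known source and refuses unknown names,
// so a typo never creates a phantom entry.
func (r *registry) setPaused(name string, paused bool) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if _, ok := r.paused[name]; !ok {
		return errUnknownSource
	}
	r.paused[name] = paused
	return nil
}

// state returns "paused" or "ok" for a known source.
func (r *registry) state(name string) (string, bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	p, ok := r.paused[name]
	if !ok {
		return "", false
	}
	if p {
		return "paused", true
	}
	return "ok", true
}

// size reports how many sources the registry knows about.
func (r *registry) size() int {
	r.mu.Lock()
	defer r.mu.Unlock()
	return len(r.paused)
}

func main() {
	r := newRegistry("loki") // hypothetical source name
	fmt.Println(r.setPaused("lokii", true), r.size())
	fmt.Println(r.setPaused("loki", true))
	fmt.Println(r.state("loki"))
}
```

Checking for the key before writing is the whole point: a plain `r.paused[name] = true` would silently add the misspelled source to the health view.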
Addresses #117.
Problem
- A failing signal source was retried on every tick, hammering a downed backend forever with no backoff.
- The "... same instant" comment described staggering that was never implemented: all sources fired simultaneously.
Fix
- Per-source circuit breaker (pkg/agent/health.go). Repeated failures back off exponentially: starting at poll_interval, doubling, capped at agent.source_backoff_max (default 15m). A single successful pull (even an empty one) closes the breaker, so a source recovers on its own with no restart. tickSource checks the breaker before pulling and records the outcome.
- Stagger. tick() now spreads source pulls across up to half the poll interval instead of launching them all at once.
- Admin endpoints (X-Gateway-Secret), keyed by the source name as shown in the health view:
  - GET /api/agent/sources/health: per-source ok / backing_off / paused, consecutive failures, last error, next-eligible time. Sources are seeded at startup so they appear before their first pull.
  - POST /api/agent/sources/:name/pause | /resume: manual hold. An unknown name returns 404 rather than creating a phantom source.
  - DELETE /api/agent/sources/:name/cursor: drop a stuck resume cursor so the source re-backfills from lookback (added CursorStore.Delete).
Scope
This is the breaker + visibility + control core. Two items from the broader
plan are deliberately deferred to keep the PR focused: sharing the breaker
with the analyze get_related_logs reader (so an open breaker short-circuits that tool), and gap-window tracking (which later feeds burn-rate coverage).
Testing
- Tests: breaker backoff/cap/reset, pause/resume + unknown-name rejection (no phantom sources), snapshot states; CursorStore.Delete.
- tickSource: a source is skipped while its breaker is open, then pulled again after resume.
- With poll_interval: 2s, /sources/health shows the source from boot; pausing the wrong name returns 404 with no phantom entry; pausing the correct name flips it to paused; the unauthenticated request returns 401.
- gofmt, go vet, go build ./..., go test -race ./pkg/agent ./pkg/config ./pkg/controllers green; helm template renders source_backoff_max.
Checklist
- ... (feature: prefix); docs + CHANGELOG