feature(agent): per-source circuit breaker, health endpoints, stagger #252

Open
nghiadaulau wants to merge 1 commit into VersusControl:main from
nghiadaulau:feat/agent-source-health

Conversation

@nghiadaulau

Addresses #117.

Problem

  • A signal source that kept failing was pulled again on every tick,
    hammering a downed backend forever with no backoff.
  • The worker's "stagger initial pull so sources don't hit backends at the
    same instant" comment described behavior that was never implemented:
    all sources fired simultaneously.
  • There was no way to see why a source went quiet, or to pause one.

Fix

  • Per-source circuit breaker (pkg/agent/health.go). Repeated failures
    back off exponentially: starting at poll_interval, doubling, capped at
    agent.source_backoff_max (default 15m). A single successful pull (even
    an empty one) closes the breaker, so a source recovers on its own with
    no restart. tickSource checks the breaker before pulling and records
    the outcome.
  • Real staggering: tick() now spreads source pulls across up to half
    the poll interval instead of launching them all at once.
  • Source-health admin endpoints (gated by X-Gateway-Secret), keyed by
    the source name as shown in the health view:
    • GET /api/agent/sources/health — per-source ok / backing_off /
      paused, consecutive failures, last error, next-eligible time. Sources
      are seeded at startup so they appear before their first pull.
    • POST /api/agent/sources/:name/pause | /resume — manual hold. An
      unknown name returns 404 rather than creating a phantom source.
    • DELETE /api/agent/sources/:name/cursor — drop a stuck resume cursor so
      the source re-backfills from lookback (added CursorStore.Delete).
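
To make the backoff rule above concrete, here is a minimal sketch of a per-source breaker. It is not the code in pkg/agent/health.go: the breaker type, its fields, and the method names are hypothetical, and only the behavior (start at the poll interval, double on each consecutive failure, cap at the configured maximum, reset on any success) is taken from the description.

```go
package main

import (
	"fmt"
	"time"
)

// breaker is a hypothetical per-source circuit breaker; the real type in
// pkg/agent/health.go may be shaped differently.
type breaker struct {
	base     time.Duration // first backoff step (poll_interval)
	max      time.Duration // ceiling (agent.source_backoff_max)
	failures int           // consecutive failed pulls
	nextOK   time.Time     // earliest time the next pull is allowed
}

// backoff returns the delay after n consecutive failures:
// base, 2*base, 4*base, ... capped at max.
func (b *breaker) backoff(n int) time.Duration {
	if n <= 0 {
		return 0
	}
	d := b.base
	for i := 1; i < n; i++ {
		d *= 2
		if d >= b.max {
			return b.max // stop early so d can never overflow
		}
	}
	if d > b.max {
		return b.max
	}
	return d
}

// allow reports whether a pull may run at time now.
func (b *breaker) allow(now time.Time) bool {
	return !now.Before(b.nextOK)
}

// recordFailure opens the breaker, or extends it if it is already open.
func (b *breaker) recordFailure(now time.Time) {
	b.failures++
	b.nextOK = now.Add(b.backoff(b.failures))
}

// recordSuccess closes the breaker; an empty pull counts as a success.
func (b *breaker) recordSuccess() {
	b.failures = 0
	b.nextOK = time.Time{}
}

func main() {
	b := &breaker{base: 2 * time.Second, max: 15 * time.Minute}
	now := time.Now()
	for i := 0; i < 4; i++ {
		b.recordFailure(now)
		fmt.Printf("failure %d: wait %s, allowed now: %v\n",
			b.failures, b.nextOK.Sub(now), b.allow(now))
	}
	b.recordSuccess()
	fmt.Println("after success, allowed:", b.allow(now))
}
```

In this sketch a tick would call allow before pulling and then exactly one of recordFailure or recordSuccess, which mirrors the "check, pull, record the outcome" flow described for tickSource.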

Scope

This is the breaker + visibility + control core. Two items from the broader
plan are deliberately deferred to keep the PR focused: sharing the breaker
with the analyze get_related_logs reader (so an open breaker short-circuits
that tool), and gap-window tracking (which later feeds burn-rate coverage).

Testing

  • Unit: breaker backoff / cap / reset, pause/resume + unknown-name
    rejection (no phantom sources), snapshot states; CursorStore.Delete.
  • Worker: a failing source is pulled once, then skipped while the
    breaker is open, then pulled again after resume.
  • Live (built binary): with a file source and poll_interval: 2s,
    /sources/health shows the source from boot; pausing the wrong name
    returns 404 with no phantom entry; pausing the correct name flips it to
    paused; the unauthenticated request returns 401.
  • gofmt, go vet, go build ./...,
    go test -race ./pkg/agent ./pkg/config ./pkg/controllers green;
    helm template renders source_backoff_max.
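
The "no phantom sources" property checked above can be illustrated with a small sketch. Everything here is an assumption made for illustration (the registry type, newRegistry, setPaused, errUnknownSource); the point is only that pause and resume look a name up in a set seeded at startup and never insert into it, so an HTTP handler can map the error to a 404.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errUnknownSource is the error a handler would translate to HTTP 404.
var errUnknownSource = errors.New("unknown source")

// registry is a hypothetical stand-in for the source-health tracker.
// A key being present in paused means the source is known.
type registry struct {
	mu     sync.Mutex
	paused map[string]bool
}

// newRegistry seeds every configured source up front, so sources are
// visible before their first pull.
func newRegistry(names ...string) *registry {
	r := &registry{paused: make(map[string]bool, len(names))}
	for _, n := range names {
		r.paused[n] = false
	}
	return r
}

// setPaused pauses or resumes a known source. It never adds a new key,
// so a mistyped name cannot create a phantom entry.
func (r *registry) setPaused(name string, paused bool) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if _, ok := r.paused[name]; !ok {
		return errUnknownSource
	}
	r.paused[name] = paused
	return nil
}

func main() {
	r := newRegistry("file-logs")
	fmt.Println("pause typo:     ", r.setPaused("file-lgos", true))
	fmt.Println("pause file-logs:", r.setPaused("file-logs", true))
	fmt.Println("known sources:  ", len(r.paused))
}
```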

Checklist

  • Unit + worker + live verification
  • Config triple-touch + Helm mirrored
  • Endpoints authenticated; unknown source → 404
  • House conventions (feature: prefix); docs + CHANGELOG

A failing signal source was retried on every tick — hammering a downed
backend forever — and the long-promised pull staggering was never actually
implemented (all sources fired at the same instant).

- pkg/agent/health.go: a per-source circuit breaker. Repeated failures back
  off exponentially (from poll_interval up to source_backoff_max, default
  15m); a single successful pull closes the breaker, so a source recovers
  on its own with no restart. tickSource consults it before pulling and
  records success/failure.
- Stagger: tick() now spreads source pulls across up to half the poll
  interval instead of launching them simultaneously.
- Admin endpoints (X-Gateway-Secret), keyed by the source name shown in the
  health view:
  - GET /api/agent/sources/health  — per-source state (ok / backing_off /
    paused), failures, last error, next-eligible time. Sources are seeded
    at startup so they appear before their first pull.
  - POST /api/agent/sources/:name/pause | /resume  — manual hold; reject an
    unknown name (404) instead of creating a phantom source.
  - DELETE /api/agent/sources/:name/cursor  — drop a stuck cursor so the
    source re-backfills from lookback (CursorStore.Delete added).
- Config triple-touch (source_backoff_max) + Helm (sourceBackoffMax) + docs.
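
One way to picture the staggering is the sketch below. It assumes evenly spaced start delays across the first half of the poll interval; the commit only states that pulls are spread "across up to half the poll interval", so the spacing rule and the staggerOffsets name are illustrative, not the actual tick() logic.

```go
package main

import (
	"fmt"
	"time"
)

// staggerOffsets returns one start delay per source, evenly spaced across
// the first half of pollInterval. Even spacing is an assumption of this
// sketch; the real tick() may distribute pulls differently.
func staggerOffsets(n int, pollInterval time.Duration) []time.Duration {
	if n <= 0 {
		return nil
	}
	offsets := make([]time.Duration, n)
	window := pollInterval / 2       // never delay past half the interval
	step := window / time.Duration(n) // gap between consecutive sources
	for i := range offsets {
		offsets[i] = time.Duration(i) * step
	}
	return offsets
}

func main() {
	for i, d := range staggerOffsets(4, 2*time.Second) {
		fmt.Printf("source %d starts after %s\n", i, d)
	}
}
```

Because the last offset is strictly less than half the interval, every source in this sketch still starts its pull well before the next tick.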

Addresses upstream issue VersusControl#117. Deferred (follow-ups): sharing the breaker
with the analyze get_related_logs reader, and gap-window tracking.

Tests: breaker backoff/cap/reset, pause/resume + unknown-name rejection,
snapshot states, tickSource skips a backing-off source then resumes, and
CursorStore.Delete.