fix(fcm): gate native-fallback probe on rolling FCM health
The native-fallback probe previously returned true whenever FCM was
configured AND devices were registered, which suppressed web-push for
the namespace. The HAPI Bot correctly pointed out the gap: if the FCM
pipeline silently breaks (expired service-account key, sustained 5xx,
OAuth token-fetch failure, network blackhole) the operator gets nothing
on either channel until they manually intervene.
Approach (deliberate, not the bot's exact suggested fix):
- FcmService now keeps a small rolling window (the last 8 outcomes) of
send attempts and exposes `isHealthy()`. The service reports unhealthy
once 5 or more of those 8 outcomes are failures. The buffer starts
empty, so a freshly booted hub is optimistic ("innocent until proven
guilty") and does not double-fire on event #1.
- Token-fetch failure (`getFcmAccessToken` throws) now records exactly
one health-failure (not one per device), short-circuits the send
loop, and returns a result so `sendToNamespace` no longer leaks the
exception.
- `invalid` token responses are explicitly excluded from the health
buffer because they are per-device facts (rotated/uninstalled token),
not pipeline failures: FCM was reachable and simply rejected one
stale token.
- `buildNativeFallbackProbe` now optionally accepts the FcmService and
short-circuits to "let web-push fire" when health is bad, before it
even queries the device registry. The single-arg call shape is still
supported for back-compat.
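For reviewers, the rolling-window rule described in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the code in this commit: the class name `RollingHealth`, the `record()` method, and the `SendOutcome` type are hypothetical. Only `isHealthy()`, the window size of 8, the threshold of 5, and the invalid-token exclusion come from the description.

```typescript
// Hypothetical sketch: only isHealthy(), the 8-slot window, the 5-failure
// threshold, and the "invalid" exclusion are taken from the commit message.
type SendOutcome = "ok" | "failure" | "invalid";

class RollingHealth {
  // true = pipeline failure, false = successful send
  private readonly outcomes: boolean[] = [];
  private readonly windowSize: number;
  private readonly failureThreshold: number;

  constructor(windowSize = 8, failureThreshold = 5) {
    this.windowSize = windowSize;
    this.failureThreshold = failureThreshold;
  }

  record(outcome: SendOutcome): void {
    // A rejected stale token is a per-device fact, so it never enters the buffer.
    if (outcome === "invalid") return;
    this.outcomes.push(outcome === "failure");
    // Drop the oldest entry once the window is full.
    if (this.outcomes.length > this.windowSize) this.outcomes.shift();
  }

  isHealthy(): boolean {
    // An empty buffer has zero failures, so a fresh instance is healthy.
    const failures = this.outcomes.filter(Boolean).length;
    return failures < this.failureThreshold;
  }
}
```

Under this sketch a token-fetch failure would call `record("failure")` once for the whole send, rather than once per device, which matches the intent of the second bullet.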
Why not the bot's exact suggestion ("invert: call FCM first, fall back
on result.sent === 0"):
- Couples PushNotificationChannel to FcmService and FcmSendPayload,
reversing the clean parallel-channel architecture established earlier
in this PR.
- Treats every transient single-event failure as fallback-worthy, which
re-opens the duplicate-notification race that the suppression logic
was added to close (FCM HTTP timeout that delivers later + the web
push we sent in the meantime = two pings).
- A rolling health window only flips on sustained breakage, which is
the actual operational scenario the bot is worried about.
The wrist-first design intent ("FCM fires unconditionally, web-push is
suppressed for the same namespace") documented in
docs/api/native-companion-contract.md is preserved on the happy path.
The probe only re-enables web-push when there is concrete evidence the
native pipeline is not delivering.
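The probe behavior can be pictured with the sketch below. It is an assumption-laden illustration rather than the real `buildNativeFallbackProbe`: the function name `buildProbeSketch`, the `DeviceRegistryLike` and `HealthSource` interfaces, and the synchronous return type are all hypothetical, and the "FCM is configured" check is omitted for brevity.

```typescript
// Hypothetical shapes: the real registry and FcmService interfaces are not
// shown in this commit message.
interface DeviceRegistryLike {
  hasDevices(namespace: string): boolean;
}

interface HealthSource {
  isHealthy(): boolean;
}

// Returns a probe. true means "native delivery covers this namespace, so
// suppress web-push"; false means "let web-push fire".
function buildProbeSketch(
  registry: DeviceRegistryLike,
  fcm?: HealthSource,
): (namespace: string) => boolean {
  return (namespace: string): boolean => {
    // Unhealthy pipeline: re-enable web-push without querying the registry.
    if (fcm !== undefined && !fcm.isHealthy()) return false;
    // Healthy, or no health source supplied (single-arg call shape).
    return registry.hasDevices(namespace);
  };
}
```

Calling `buildProbeSketch(registry)` with one argument keeps the earlier behavior, since the health check is skipped when no health source is passed.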
Tests:
- New FcmService.isHealthy suite covers empty-buffer, threshold flip,
recovery as failures age out of the window, invalid-token exclusion,
and network-error path.
- nativeFallbackProbe gains coverage for the unhealthy-but-registered,
healthy-and-registered, and absent-fcmService (back-compat) cases.
- All 292 hub tests still pass; typecheck clean.
Co-authored-by: Cursor <cursoragent@cursor.com>