Skip to content

Commit 422b5b0

Browse files
heavygeecursoragent
andcommitted
fix(fcm): gate native-fallback probe on rolling FCM health
The native-fallback probe previously returned true whenever FCM was configured AND devices were registered, which suppressed web-push for the namespace. The HAPI Bot correctly pointed out the gap: if the FCM pipeline silently breaks (expired service-account key, sustained 5xx, OAuth token-fetch failure, network blackhole) the operator gets nothing on either channel until they manually intervene. Approach (deliberate, not the bot's exact suggested fix): - FcmService now keeps a small rolling window (last 8 outcomes) of send attempts and exposes `isHealthy()`. The threshold is 5+/8 failures = unhealthy; the buffer starts empty so a freshly-booted hub is optimistic ("innocent until proven guilty") and does not double-fire on event #1. - Token-fetch failure (`getFcmAccessToken` throws) now records exactly one health-failure (not one per device), short-circuits the send loop, and returns a result so `sendToNamespace` no longer leaks the exception. - `invalid` token responses are explicitly excluded from the health buffer because they are per-device facts (rotated/uninstalled token), not pipeline failures - FCM was reachable, it just rejected one stale token. - `buildNativeFallbackProbe` now optionally accepts the FcmService and short-circuits to "let web-push fire" when health is bad, before it even queries the device registry. The single-arg call shape is still supported for back-compat. Why not the bot's exact suggestion ("invert: call FCM first, fall back on result.sent === 0"): - Couples PushNotificationChannel to FcmService and FcmSendPayload, reversing the clean parallel-channel architecture established earlier in this PR. - Treats every transient single-event failure as fallback-worthy, which re-opens the duplicate-notification race that the suppression logic was added to close (FCM HTTP timeout that delivers later + the web push we sent in the meantime = two pings). - A rolling health window only flips on sustained breakage, which is the actual operational scenario the bot is worried about. The wrist-first design intent ("FCM fires unconditionally, web-push is suppressed for the same namespace") documented in docs/api/native-companion-contract.md is preserved on the happy path. The probe only re-enables web-push when there is concrete evidence the native pipeline is not delivering. Tests: - New FcmService.isHealthy suite covers empty-buffer, threshold flip, recovery as failures age out of the window, invalid-token exclusion, and network-error path. - nativeFallbackProbe gains coverage for the unhealthy-but-registered, healthy-and-registered, and absent-fcmService (back-compat) cases. - All 292 hub tests still pass; typecheck clean. Co-authored-by: Cursor <cursoragent@cursor.com>
1 parent 4568880 commit 422b5b0

5 files changed

Lines changed: 226 additions & 17 deletions

File tree

hub/src/fcm/fcmService.test.ts

Lines changed: 99 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -199,3 +199,102 @@ describe('FcmService.sendToNamespace', () => {
199199
expect(globalThis.fetch).not.toHaveBeenCalled()
200200
})
201201
})
202+
203+
describe('FcmService.isHealthy (rolling outcome window)', () => {
204+
let originalFetch: typeof globalThis.fetch
205+
beforeEach(() => {
206+
originalFetch = globalThis.fetch
207+
})
208+
afterEach(() => {
209+
globalThis.fetch = originalFetch
210+
})
211+
212+
it('starts healthy with an empty outcome buffer (innocent until proven guilty)', () => {
213+
const store = makeStore([])
214+
const svc = new FcmService('proj-id', { client_email: 'x', private_key: 'y' }, store as never)
215+
expect(svc.isHealthy()).toBe(true)
216+
})
217+
218+
it('flips to unhealthy after 5 consecutive transient failures', async () => {
219+
const store = makeStore([
220+
{ namespace: 'default', token: 't1', platform: 'phone', deviceId: 'p1' }
221+
])
222+
globalThis.fetch = mock(async () =>
223+
new Response('Service Unavailable', { status: 503 })
224+
) as unknown as typeof fetch
225+
226+
const svc = new FcmService('proj-id', { client_email: 'x', private_key: 'y' }, store as never)
227+
228+
// 4 failures: still healthy (threshold is 5)
229+
for (let i = 0; i < 4; i += 1) {
230+
await svc.sendToNamespace('default', makePayload())
231+
}
232+
expect(svc.isHealthy()).toBe(true)
233+
234+
// 5th failure crosses the threshold
235+
await svc.sendToNamespace('default', makePayload())
236+
expect(svc.isHealthy()).toBe(false)
237+
})
238+
239+
it('recovers to healthy as recent successes age out the failure tail', async () => {
240+
const store = makeStore([
241+
{ namespace: 'default', token: 't1', platform: 'phone', deviceId: 'p1' }
242+
])
243+
let callCount = 0
244+
globalThis.fetch = mock(async () => {
245+
callCount += 1
246+
// First 5 calls fail (503), rest succeed
247+
if (callCount <= 5) {
248+
return new Response('Service Unavailable', { status: 503 })
249+
}
250+
return new Response('{"name":"ok"}', { status: 200 })
251+
}) as unknown as typeof fetch
252+
253+
const svc = new FcmService('proj-id', { client_email: 'x', private_key: 'y' }, store as never)
254+
255+
for (let i = 0; i < 5; i += 1) {
256+
await svc.sendToNamespace('default', makePayload())
257+
}
258+
expect(svc.isHealthy()).toBe(false)
259+
260+
// 4 successes after 5 failures: window is [F,F,F,F,F,S,S,S,S] -> trim
261+
// to last 8: [F,F,F,F,S,S,S,S] -> 4 failures, threshold 5 -> healthy.
262+
for (let i = 0; i < 4; i += 1) {
263+
await svc.sendToNamespace('default', makePayload())
264+
}
265+
expect(svc.isHealthy()).toBe(true)
266+
})
267+
268+
it('does NOT count invalid-token responses against health (per-device fact, not pipeline failure)', async () => {
269+
const store = makeStore([
270+
{ namespace: 'default', token: 'rotated', platform: 'phone', deviceId: 'p1' }
271+
])
272+
globalThis.fetch = mock(async () =>
273+
new Response('{"error":{"status":"UNREGISTERED"}}', { status: 404 })
274+
) as unknown as typeof fetch
275+
276+
const svc = new FcmService('proj-id', { client_email: 'x', private_key: 'y' }, store as never)
277+
278+
// 10 invalid-token responses - the pipeline is fine, just the
279+
// device is dead. Health must not flip.
280+
for (let i = 0; i < 10; i += 1) {
281+
await svc.sendToNamespace('default', makePayload())
282+
}
283+
expect(svc.isHealthy()).toBe(true)
284+
})
285+
286+
it('counts fetch-throw (network error) as a health failure', async () => {
287+
const store = makeStore([
288+
{ namespace: 'default', token: 't1', platform: 'phone', deviceId: 'p1' }
289+
])
290+
globalThis.fetch = mock(async () => {
291+
throw new Error('ECONNREFUSED')
292+
}) as unknown as typeof fetch
293+
294+
const svc = new FcmService('proj-id', { client_email: 'x', private_key: 'y' }, store as never)
295+
for (let i = 0; i < 5; i += 1) {
296+
await svc.sendToNamespace('default', makePayload())
297+
}
298+
expect(svc.isHealthy()).toBe(false)
299+
})
300+
})

hub/src/fcm/fcmService.ts

Lines changed: 53 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -55,25 +55,77 @@ type FcmSendResult = {
5555
type FcmTokenSendResult = 'sent' | 'invalid' | 'failed'
5656

5757
export class FcmService {
58+
/**
59+
* Rolling window of the last N send outcomes. Drives `isHealthy()`,
60+
* which the native-fallback probe consults to decide whether suppressing
61+
* web-push for this namespace is still safe. We deliberately do NOT
62+
* count `invalid` here - an invalid token is a per-device fact, not an
63+
* FCM-pipeline-broken signal (FCM was reachable, it just rejected one
64+
* stale token). Only `sent` and `failed` populate the buffer.
65+
*/
66+
private recentOutcomes: Array<'sent' | 'failed'> = []
67+
private static readonly HEALTH_WINDOW = 8
68+
private static readonly HEALTH_FAILURE_THRESHOLD = 5
69+
5870
constructor(
5971
private readonly projectId: string,
6072
private readonly serviceAccount: ServiceAccount,
6173
private readonly store: Store
6274
) {}
6375

76+
/**
77+
* Health gate for the native-fallback probe. Returns false when the
78+
* recent-outcome window is dominated by failures (broken Firebase
79+
* credentials, sustained 5xx, network blackhole). When unhealthy, the
80+
* probe lets web-push fire as a last-resort surface for this namespace.
81+
*
82+
* Empty buffer is treated as healthy (innocent until proven guilty) so
83+
* a freshly-started hub does not silently double-fire on event #1.
84+
*/
85+
isHealthy(): boolean {
86+
if (this.recentOutcomes.length === 0) return true
87+
const failures = this.recentOutcomes.filter((o) => o === 'failed').length
88+
return failures < FcmService.HEALTH_FAILURE_THRESHOLD
89+
}
90+
91+
private recordOutcome(outcome: 'sent' | 'failed'): void {
92+
this.recentOutcomes.push(outcome)
93+
if (this.recentOutcomes.length > FcmService.HEALTH_WINDOW) {
94+
this.recentOutcomes.shift()
95+
}
96+
}
97+
6498
async sendToNamespace(namespace: string, payload: FcmSendPayload): Promise<FcmSendResult> {
6599
const devices = this.store.fcm.getDevicesByNamespace(namespace)
66100
if (devices.length === 0) {
67101
return { sent: 0, failed: 0, invalidTokens: [] }
68102
}
69103

70-
const accessToken = await getFcmAccessToken(this.serviceAccount)
104+
let accessToken: string
105+
try {
106+
accessToken = await getFcmAccessToken(this.serviceAccount)
107+
} catch (e) {
108+
// Token-fetch failure (expired service account key, OAuth
109+
// outage, network) - count one health-failure (not one per
110+
// device, that would over-weight the buffer) and return.
111+
console.error('[FcmService] Token fetch failed:', e instanceof Error ? e.message : e)
112+
this.recordOutcome('failed')
113+
return { sent: 0, failed: devices.length, invalidTokens: [] }
114+
}
115+
71116
const invalidTokens: string[] = []
72117
let sent = 0
73118
let failed = 0
74119

75120
await Promise.all(devices.map(async (device) => {
76121
const result = await this.sendToToken(accessToken, device.token, payload, device.platform)
122+
// `invalid` is a per-device fact, not a pipeline signal -
123+
// exclude it from the health buffer (see field doc above).
124+
if (result === 'sent') {
125+
this.recordOutcome('sent')
126+
} else if (result === 'failed') {
127+
this.recordOutcome('failed')
128+
}
77129
if (result === 'sent') {
78130
sent += 1
79131
return

hub/src/fcm/nativeFallbackProbe.test.ts

Lines changed: 32 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -55,4 +55,36 @@ describe('buildNativeFallbackProbe', () => {
5555
expect(probe('empty')).toBe(false)
5656
expect(probe('untouched')).toBe(false)
5757
})
58+
59+
it('returns false when fcmService reports unhealthy, even with registered devices', () => {
60+
const store = makeStore({ default: 3 })
61+
const fcmConfig = { projectId: 'p', serviceAccount: { client_email: 'x', private_key: 'y' } }
62+
const fcmService = { isHealthy: mock(() => false) }
63+
const probe = buildNativeFallbackProbe(store as unknown as Store, fcmConfig, fcmService)
64+
65+
expect(probe('default')).toBe(false)
66+
expect(fcmService.isHealthy).toHaveBeenCalled()
67+
// We short-circuit before hitting the device store - the namespace
68+
// has registered devices but a broken FCM pipeline means web-push
69+
// is the only surface still able to reach the operator.
70+
expect(store.fcm.getDevicesByNamespace).not.toHaveBeenCalled()
71+
})
72+
73+
it('returns true when fcmService reports healthy and devices exist', () => {
74+
const store = makeStore({ default: 1 })
75+
const fcmConfig = { projectId: 'p', serviceAccount: { client_email: 'x', private_key: 'y' } }
76+
const fcmService = { isHealthy: mock(() => true) }
77+
const probe = buildNativeFallbackProbe(store as unknown as Store, fcmConfig, fcmService)
78+
79+
expect(probe('default')).toBe(true)
80+
expect(fcmService.isHealthy).toHaveBeenCalled()
81+
})
82+
83+
it('treats absent fcmService as healthy (back-compat with single-arg call site)', () => {
84+
const store = makeStore({ default: 2 })
85+
const fcmConfig = { projectId: 'p', serviceAccount: { client_email: 'x', private_key: 'y' } }
86+
const probe = buildNativeFallbackProbe(store as unknown as Store, fcmConfig)
87+
88+
expect(probe('default')).toBe(true)
89+
})
5890
})

hub/src/fcm/nativeFallbackProbe.ts

Lines changed: 20 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -1,25 +1,37 @@
11
import type { Store } from '../store'
2+
import type { FcmService } from './fcmService'
23

34
/**
45
* Build a per-namespace probe used by the web-push channel to decide whether
56
* to defer to the native FCM channel. The probe MUST only return true when
6-
* BOTH conditions hold:
7+
* ALL of these hold:
78
*
89
* 1. FCM is actually configured on this hub start (fcmConfig is truthy),
910
* so a registered FCM channel exists to deliver the notification.
10-
* 2. At least one device row is registered for the namespace.
11+
* 2. The FCM service is currently healthy (recent sends are not all
12+
* failing). When credentials expire or the FCM pipeline blackholes,
13+
* we let web-push run again so the operator does not get silently
14+
* cut off from all notifications.
15+
* 3. At least one device row is registered for the namespace.
1116
*
12-
* Failing either condition means web-push must fire normally - otherwise
13-
* notifications go to /dev/null when, for example, an operator removes
14-
* `FCM_SERVICE_ACCOUNT_PATH` from the env without clearing stored device
15-
* registrations from the database.
17+
* Failing any condition means web-push must fire normally. The default
18+
* "happy path" remains: native is the canonical wrist-first surface and
19+
* web-push is suppressed to avoid duplicate OS notifications - this probe
20+
* only re-enables web-push when we have evidence the native pipeline is
21+
* not actually delivering.
1622
*/
1723
export function buildNativeFallbackProbe(
1824
store: Store,
19-
fcmConfig: unknown
25+
fcmConfig: unknown,
26+
fcmService?: Pick<FcmService, 'isHealthy'>
2027
): (namespace: string) => boolean {
2128
if (!fcmConfig) {
2229
return () => false
2330
}
24-
return (namespace: string) => store.fcm.getDevicesByNamespace(namespace).length > 0
31+
return (namespace: string) => {
32+
if (fcmService && !fcmService.isHealthy()) {
33+
return false
34+
}
35+
return store.fcm.getDevicesByNamespace(namespace).length > 0
36+
}
2537
}

hub/src/startHub.ts

Lines changed: 22 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -201,13 +201,28 @@ export async function startHub(options: StartHubOptions = {}): Promise<HubInstan
201201

202202
const fcmConfig = resolveFcmConfig()
203203

204+
// Build the optional FCM service early so the native-fallback probe
205+
// can consult its health gate. When FCM is configured, `fcmService` is
206+
// shared between the FcmNotificationChannel and the probe so a broken
207+
// pipeline (expired credentials, sustained 5xx) lets web-push run as
208+
// a last-resort surface for the namespace instead of silently muting
209+
// both channels.
210+
const fcmService = fcmConfig
211+
? new FcmService(fcmConfig.projectId, fcmConfig.serviceAccount, store)
212+
: null
213+
204214
// Only suppress web-push when a native FCM channel is actually live
205-
// AND a device is registered for the namespace. The fcmConfig gate is
206-
// critical: without it, stale device rows from a prior FCM-enabled
207-
// boot would silently drop web-push on a hub started without the
208-
// service-account env var (notifications would vanish entirely).
209-
// See `buildNativeFallbackProbe` for the contract.
210-
const nativeFallbackProbe = buildNativeFallbackProbe(store, fcmConfig)
215+
// AND a device is registered for the namespace AND the FCM pipeline is
216+
// currently healthy. The fcmConfig gate is critical: without it, stale
217+
// device rows from a prior FCM-enabled boot would silently drop
218+
// web-push on a hub started without the service-account env var
219+
// (notifications would vanish entirely). See `buildNativeFallbackProbe`
220+
// for the full contract.
221+
const nativeFallbackProbe = buildNativeFallbackProbe(
222+
store,
223+
fcmConfig,
224+
fcmService ?? undefined
225+
)
211226

212227
const notificationChannels: NotificationChannel[] = [
213228
new PushNotificationChannel(
@@ -219,8 +234,7 @@ export async function startHub(options: StartHubOptions = {}): Promise<HubInstan
219234
)
220235
]
221236

222-
if (fcmConfig) {
223-
const fcmService = new FcmService(fcmConfig.projectId, fcmConfig.serviceAccount, store)
237+
if (fcmConfig && fcmService) {
224238
notificationChannels.push(new FcmNotificationChannel(fcmService, sseManager, visibilityTracker, store))
225239
console.log('[Fcm] Native companion push enabled (project:', fcmConfig.projectId + ')')
226240
}

0 commit comments

Comments
 (0)