Skip to content

Commit f737f63

Browse files
author
kjgbot
committed
Design integration event re-drive after lost steers
1 parent b9cba58 commit f737f63

1 file changed

Lines changed: 283 additions & 0 deletions

File tree

Lines changed: 283 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,283 @@
1+
# Integration Event Re-Drive After Lost Steers
2+
3+
Issue: #147. Base: #145 (`split/132-slack-integration-event-context-retry`).
4+
Status: DESIGN ONLY. Do not implement or merge until #145 has landed. The
5+
broker delivery-signal contract was pinned by relay-worker and project-lead on
6+
2026-06-07.
7+
8+
## Problem
9+
10+
The integration-event bridge currently treats broker send acceptance as delivery.
11+
For Slack logical dedupe, `claimSlackLogicalInjection()` records the
12+
logical/content-hash claim before a steer is known to have reached the target
13+
agent. For non-Slack paths, `wasRecentlyInjected()` does the same with the path
14+
dedupe key.
15+
16+
That is safe for duplicate suppression only when the broker send reliably reaches
17+
the agent. The confirmed failure mode from 2026-06-07 is different:
18+
19+
- Relayfile emits the same Slack record several times.
20+
- The first delivery logs `injecting` for `slack-comms`, so the bridge claims the
21+
logical key.
22+
- The steer is lost after broker acceptance and never appears in the agent
23+
conversation.
24+
- Later re-deliveries are suppressed for the one-hour Slack replay TTL because
25+
the content hash is unchanged.
26+
27+
#145 fixes the separate "content unavailable" path, but it cannot fix a lost
28+
first steer with unchanged content. The bridge needs delivery confirmation and a
29+
bounded re-drive window.
30+
31+
## Goals
32+
33+
- Make dedupe claims provisional until delivery is confirmed.
34+
- Commit a provisional claim only after the broker/harness-driver reports that
35+
the steer reached the intended recipient.
36+
- Release a provisional claim on explicit delivery failure or confirmation
37+
timeout, so a duplicate/replayed event can re-drive the same message.
38+
- Preserve current pacing: event dispatch must not block the project send queue
39+
while waiting for delivery confirmation.
40+
- Keep duplicate hardening by stable event identity first, Slack logical key and
41+
content hash second.
42+
- Make telemetry low-noise: count released provisional claims and confirmation
43+
timeouts without logging every duplicate.
44+
45+
## Non-Goals
46+
47+
- No durable mailbox design in this PR. That belongs to relay#1056.
48+
- No delivery retry loop inside Pear beyond releasing the dedupe claim. Relayfile
49+
replay, stream reconnect, or event-feed polling provides the re-drive input.
50+
- No merge before Khaliq's explicit go. This design stacks on #145 and should
51+
become a follow-up PR after #145 lands.
52+
- No implementation until relay-worker confirms `delivery_injected` coverage for
53+
every agent transport the integration bridge can target.
54+
55+
## Broker Delivery Contract
56+
57+
Pear already has a related confirmation helper,
58+
`BrokerManager.sendMessageAndWaitForDelivery(projectId, input, { timeoutMs })`.
59+
That helper:
60+
61+
- sends through the broker;
62+
- captures the returned `event_id` and targets;
63+
- subscribes to `session.client.onEvent(...)`;
64+
- resolves on `delivery_ack`, `delivery_verified`, or
65+
`message_delivery_confirmed` for that event and target;
66+
- rejects on `delivery_failed`, `message_delivery_failed`, or its own timeout.
67+
68+
Project-lead ruled that #147 must NOT commit on that helper's current success
69+
condition. `delivery_ack` and `delivery_verified` can be satisfied by an echo
70+
timeout fallback, which is too weak for this bug: it could commit the dedupe
71+
claim even if the steer never crossed the lost hop.
72+
73+
Commit signal for #147: `delivery_injected`, filtered by returned
74+
`event_id` and target. That event is emitted by the PTY worker after it writes
75+
the injection bytes and carriage return to the PTY. It is the correct boundary
76+
for mode-B loss: "accepted by broker" is too early, and "read by model" is out
77+
of scope.
78+
79+
Relay-worker owns the BrokerManager helper extension. The bridge should depend
80+
on a sibling helper, not hand-subscribe to broker events:
81+
82+
```ts
83+
sendMessageAndWaitForInjected(
84+
projectId: string,
85+
input: SendMessageInput,
86+
options?: { timeoutMs?: number }
87+
): Promise<{ eventId: string; targets: string[] }>
88+
```
89+
90+
Semantics:
91+
92+
- Internals share/refactor the existing send-plus-event-wait logic from
93+
`sendMessageAndWaitForDelivery`.
94+
- Success is only `delivery_injected` matching the returned `event_id` and
95+
target `name`.
96+
- `delivery_ack`, `delivery_verified`, and `message_delivery_confirmed` are not
97+
success for this helper. If they arrive without `delivery_injected`, the
98+
helper keeps waiting until `delivery_injected` or timeout.
99+
- Failure stays `delivery_failed` / `message_delivery_failed`, or helper
100+
timeout.
101+
- Target handling matches the existing helper: use returned `targets` when
102+
present; for direct agent sends with no targets, fall back to `[input.to]`.
103+
- Unsupported brokers may return no `eventId` or no targets; the bridge treats
104+
those as accepted-only and commits immediately, preserving current behavior.
105+
106+
The helper may internally reuse the current `sendMessageAndWaitForDelivery`
107+
mechanics: send once, capture `event_id` / targets, subscribe internally, filter
108+
events in BrokerManager. It must differ by resolving only on
109+
`delivery_injected`.
110+
111+
Calling this helper from the existing project pacer and awaiting it would
112+
reintroduce the old head-of-line block. The bridge must start the helper as a
113+
background confirmation task, attach `.then(commit).catch(release)`, and return
114+
from the pacer callback immediately after starting the send attempt. The pacer
115+
rate-limits starts, not confirmation completion.
116+
117+
Coverage gate before implementation: confirm `delivery_injected` is emitted for
118+
every transport the integration bridge can inject to, including local PTY,
119+
cloud, and remote agents. If any transport lacks `delivery_injected`, define
120+
that transport's success signal first; otherwise the bridge would release on
121+
every timeout and re-drive duplicates indefinitely.
122+
123+
File ownership for implementation:
124+
125+
- pear-worker owns the additive BrokerManager helper and the bridge state
126+
machine.
127+
- relay-worker owns BrokerManager helper tests.
128+
- Keep broker.ts changes strictly additive: new helper, new predicate, and
129+
shared/refactored wait logic only. A shared wait refactor is allowed, but it
130+
must preserve `sendMessageAndWaitForDelivery()` behavior exactly and keep that
131+
method's existing tests green. Do not touch the #146 workspace-key/cloud
132+
regions.
133+
134+
## Bridge State Model
135+
136+
Replace optimistic one-shot claims with a small state machine shared by generic
137+
path dedupe and Slack logical/content-hash dedupe.
138+
139+
```ts
140+
type DeliveryClaimStatus = 'provisional' | 'committed'
141+
142+
type DeliveryClaim = {
143+
status: DeliveryClaimStatus
144+
expiresAt: number
145+
confirmationDeadline: number
146+
recipients: Map<string, 'pending' | 'confirmed' | 'failed'>
147+
contentHash?: string
148+
eventIds: Set<string>
149+
}
150+
```
151+
152+
For Slack logical keys, the entry is still keyed by logical Slack identity. A
153+
content-hash bucket carries one claim:
154+
155+
```ts
156+
type SlackLogicalInjectionState = {
157+
expiresAt: number
158+
blindClaim?: DeliveryClaim
159+
contentHashes: Map<string, DeliveryClaim>
160+
}
161+
```
162+
163+
Behavior:
164+
165+
- A committed matching content hash suppresses duplicates until TTL expiry.
166+
- A provisional matching content hash suppresses only concurrent in-flight alias
167+
copies while confirmation is still pending.
168+
- When confirmation times out or fails, the provisional claim is removed. A later
169+
replay of the same logical key/content hash can inject again.
170+
- A blind claim can still learn the first later content hash as in #145, but that
171+
hash stays provisional until delivery is confirmed.
172+
- Genuine edits with new content hashes continue to inject independently.
173+
174+
For non-Slack integration events, the existing recent-injection map becomes the
175+
same provisional/committed state keyed by `eventDedupeKeyWithFingerprint()`.
176+
177+
## Delivery Flow
178+
179+
1. Build the dedupe key and resolve recipients as today.
180+
2. If an equivalent committed claim exists, drop as duplicate.
181+
3. If an equivalent provisional claim exists, drop as duplicate only while its
182+
confirmation window is still open.
183+
4. Create or update a provisional claim before sending, scoped to the event key,
184+
content hash, and intended recipients.
185+
5. Enqueue each broker send through the existing project pacer. The pacer starts
186+
the `sendMessageAndWaitForInjected()` task and does not await its
187+
confirmation.
188+
6. For each started send, attach background handlers:
189+
- On `delivery_injected`: mark that recipient confirmed.
190+
- When all recipients are confirmed: commit the claim and extend `expiresAt`
191+
to the normal replay TTL.
192+
- On failure or timeout: mark failed and release the provisional claim if no
193+
recipient confirmed.
194+
7. If every recipient fails synchronously before send acceptance, release the
195+
claim immediately as today.
196+
8. If some recipients confirm and some fail, commit for confirmed recipients and
197+
log/telemetry the partial failure. A later replay should target only missing
198+
recipients if recipient-scoped tracking is implemented in the same pass;
199+
otherwise release the whole claim to prefer duplicate delivery over message
200+
loss.
201+
202+
The first implementation should prefer whole-claim release on partial failure.
203+
It may duplicate a message for a recipient that already got it, but it avoids the
204+
known worse behavior of losing human instructions.
205+
206+
## Timeout And Telemetry
207+
208+
Proposed default injection-confirmation timeout: 5 seconds, conditional on
209+
relay-worker's timing finding.
210+
211+
If `delivery_injected` fires immediately when the PTY worker writes bytes on
212+
receipt, 5 seconds is intentionally generous: healthy agents should confirm
213+
quickly even if the model reads the bytes later. That keeps the provisional
214+
claim window short enough that replay can recover promptly and concurrent
215+
duplicate suppression does not linger.
216+
217+
If the PTY write is gated on agent idle, 5 seconds is too short for a legitimate
218+
long-running turn. In that case, use a larger or adaptive timeout so busy agents
219+
do not manufacture duplicate re-drives for messages that will eventually write.
220+
The timeout must be based on the actual `delivery_injected` timing semantics,
221+
not just the desired injection latency target.
222+
223+
Add aggregate counters/logs:
224+
225+
- `deliveryClaimsProvisional`
226+
- `deliveryClaimsCommitted`
227+
- `deliveryClaimsReleased`
228+
- `deliveryInjectedTimeouts`
229+
- `deliveryInjectedUnsupported`
230+
231+
Use the existing aggregated warning cadence for failures and include only
232+
projectId, provider, logical key prefix, recipient count, and reason. Do not log
233+
message content.
234+
235+
## Test Plan
236+
237+
Add integration-event bridge tests on top of the existing #145 suite:
238+
239+
- Slack unchanged-content replay re-drives after first accepted steer never
240+
emits `delivery_injected`.
241+
- Slack unchanged-content replay is suppressed after `delivery_injected`
242+
commits the content hash.
243+
- Slack alias copies during an open provisional window are suppressed, but a
244+
later replay after timeout is delivered.
245+
- Generic non-Slack event dedupe releases after delivery timeout and re-drives.
246+
- No-recipient and synchronous send-failure paths still release immediately.
247+
- Project broker pacing does not wait on delivery confirmation; the existing
248+
"does not wait on delivery confirmation path" test should become "does not
249+
wait while tracking injection confirmation".
250+
- Missing `sendMessageAndWaitForInjected` in tests/mocks is treated as a hard
251+
harness gap after #147 lands; do not silently fall back to
252+
`sendMessageAndWaitForDelivery`.
253+
- Unsupported injection confirmation commits accepted sends immediately and
254+
emits low-noise telemetry.
255+
256+
Add broker tests for relay-worker's helper extension:
257+
258+
- send acceptance returns before `delivery_injected`;
259+
- confirmation resolves on `delivery_injected` for the matching event id and
260+
target;
261+
- confirmation rejects on delivery failure;
262+
- confirmation rejects on timeout;
263+
- `delivery_ack` / `delivery_verified` without `delivery_injected` do not
264+
commit the helper;
265+
- multi-target sends wait for all reported targets.
266+
267+
## Rollout
268+
269+
1. Land #145 and restart Pear on it to restore reliable content reads.
270+
2. Relay-worker confirms `delivery_injected` coverage across bridge target
271+
transports and whether it fires on immediate PTY write or is gated on agent
272+
idle.
273+
3. pear-worker adds the additive BrokerManager helper and bridge state machine;
274+
relay-worker adds helper tests.
275+
4. Implement #147 as a follow-up stacked from #145/main using that helper as the
276+
confirmation source.
277+
5. Keep slack-comms' debug-log/OOB drop guard enabled until #147 is merged and
278+
the running app is restarted on it.
279+
280+
Residual risk: if the PTY accepts bytes but the agent process is wedged and
281+
never semantically processes them, `delivery_injected` still commits. That is an
282+
agent-liveness/read-receipt problem and belongs to the fuller relay#1056 read
283+
ack path, not #147.

0 commit comments

Comments
 (0)