Skip to content

Commit f6a9d4e

Browse files
author
kjgbot
committed
Design integration event re-drive after lost steers
1 parent fe72ac1 commit f6a9d4e

1 file changed

Lines changed: 295 additions & 0 deletions

File tree

Lines changed: 295 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,295 @@
1+
# Integration Event Re-Drive After Lost Steers
2+
3+
Issue: #147. Base: #145 (`split/132-slack-integration-event-context-retry`).
4+
Status: DESIGN ONLY. Do not implement or merge until #145 has landed. The
5+
broker delivery-signal contract was pinned by relay-worker and project-lead on
6+
2026-06-07.
7+
8+
## Problem
9+
10+
The integration-event bridge currently treats broker send acceptance as delivery.
11+
For Slack logical dedupe, `claimSlackLogicalInjection()` records the
12+
logical/content-hash claim before a steer is known to have reached the target
13+
agent. For non-Slack paths, `wasRecentlyInjected()` does the same with the path
14+
dedupe key.
15+
16+
That is safe for duplicate suppression only when the broker send reliably reaches
17+
the agent. The confirmed failure mode from 2026-06-07 is different:
18+
19+
- Relayfile emits the same Slack record several times.
20+
- The first delivery logs `injecting` for `slack-comms`, so the bridge claims the
21+
logical key.
22+
- The steer is lost after broker acceptance and never appears in the agent
23+
conversation.
24+
- Later re-deliveries are suppressed for the one-hour Slack replay TTL because
25+
the content hash is unchanged.
26+
27+
#145 fixes the separate "content unavailable" path, but it cannot fix a lost
28+
first steer with unchanged content. The bridge needs delivery confirmation and a
29+
bounded re-drive window.
30+
31+
## Goals
32+
33+
- Make dedupe claims provisional until delivery is confirmed.
34+
- Commit a provisional claim only after the broker/harness-driver reports that
35+
the steer reached the intended recipient.
36+
- Release a provisional claim on explicit delivery failure or confirmation
37+
timeout, so a duplicate/replayed event can re-drive the same message.
38+
- Preserve current pacing: event dispatch must not block the project send queue
39+
while waiting for delivery confirmation.
40+
- Keep duplicate hardening by stable event identity first, Slack logical key and
41+
content hash second.
42+
- Make telemetry low-noise: count released provisional claims and confirmation
43+
timeouts without logging every duplicate.
44+
45+
## Non-Goals
46+
47+
- No durable mailbox design in this PR. That belongs to relay#1056.
48+
- No delivery retry loop inside Pear beyond releasing the dedupe claim. Relayfile
49+
replay, stream reconnect, or event-feed polling provides the re-drive input.
50+
- No merge before Khaliq's explicit go. This design stacks on #145 and should
51+
become a follow-up PR after #145 lands.
52+
- No implementation until relay-worker confirms `delivery_injected` coverage for
53+
every agent transport the integration bridge can target.
54+
55+
## Broker Delivery Contract
56+
57+
Pear already has a related confirmation helper,
58+
`BrokerManager.sendMessageAndWaitForDelivery(projectId, input, { timeoutMs })`.
59+
That helper:
60+
61+
- sends through the broker;
62+
- captures the returned `event_id` and targets;
63+
- subscribes to `session.client.onEvent(...)`;
64+
- resolves on `delivery_ack`, `delivery_verified`, or
65+
`message_delivery_confirmed` for that event and target;
66+
- rejects on `delivery_failed`, `message_delivery_failed`, or its own timeout.
67+
68+
Project-lead ruled that #147 must NOT commit on that helper's current success
69+
condition. `delivery_ack` and `delivery_verified` can be satisfied by an echo
70+
timeout fallback, which is too weak for this bug: it could commit the dedupe
71+
claim even if the steer never crossed the lost hop.
72+
73+
Commit signal for #147: `delivery_injected`, filtered by returned
74+
`event_id` and target. That event is emitted by the PTY worker after it writes
75+
the injection bytes and carriage return to the PTY. It is the correct boundary
76+
for mode-B loss: "accepted by broker" is too early, and "read by model" is out
77+
of scope.
78+
79+
Relay-worker owns the BrokerManager helper extension. The bridge should depend
80+
on a sibling helper, not hand-subscribe to broker events:
81+
82+
```ts
83+
sendMessageAndWaitForInjected(
84+
projectId: string,
85+
input: SendMessageInput,
86+
options?: { timeoutMs?: number }
87+
): Promise<{ eventId: string; targets: string[] }>
88+
```
89+
90+
Semantics:
91+
92+
- Internals share/refactor the existing send-plus-event-wait logic from
93+
`sendMessageAndWaitForDelivery`.
94+
- Success is only `delivery_injected` matching the returned `event_id` and
95+
target `name`.
96+
- `delivery_ack`, `delivery_verified`, and `message_delivery_confirmed` are not
97+
success for this helper. If they arrive without `delivery_injected`, the
98+
helper keeps waiting until `delivery_injected` or timeout.
99+
- Failure stays `delivery_failed` / `message_delivery_failed`, or helper
100+
timeout.
101+
- Target handling matches the existing helper: use returned `targets` when
102+
present; for direct agent sends with no targets, fall back to `[input.to]`.
103+
- Unsupported brokers may return no `eventId` or no targets; the bridge treats
104+
those as accepted-only and commits immediately, preserving current behavior.
105+
106+
The helper may internally reuse the current `sendMessageAndWaitForDelivery`
107+
mechanics: send once, capture `event_id` / targets, subscribe internally, filter
108+
events in BrokerManager. It must differ by resolving only on
109+
`delivery_injected`.
110+
111+
Calling this helper from the existing project pacer and awaiting it would
112+
reintroduce the old head-of-line block. The bridge must start the helper as a
113+
background confirmation task, attach `.then(commit).catch(release)`, and return
114+
from the pacer callback immediately after starting the send attempt. The pacer
115+
rate-limits starts, not confirmation completion.
116+
117+
Coverage result: relay-worker confirmed `delivery_injected` is emitted for the
118+
agent transports the integration bridge can inject to. Local PTY workers emit it
119+
after writing the injection bytes and carriage return. Headless/app-server
120+
runtimes emit it on successful handler delivery before ack/failure.
121+
Cloud-attached sandboxes use the same harness-driver event stream through
122+
`attachCloudSandbox`, so direct sends to cloud-owned agents use the same
123+
delivery signal once the agent has been observed/listed.
124+
125+
Timing result: for current integration-event `mode: 'steer'`, `delivery_injected`
126+
is not gated on the model becoming idle. PTY steer sends ESC ESC, waits briefly,
127+
writes the injection and CR, then emits `delivery_injected`. It can still wait
128+
behind the worker's local pending-injection queue or a wedged PTY write, but it
129+
does not wait for a long model turn to complete.
130+
131+
Channel caveat: the bridge can target channels as well as direct agents. Broker
132+
channel sends fan out to concrete workers that use the same delivery machinery,
133+
but today's `/api/send` response does not include the resolved worker names.
134+
Existing wait helpers therefore see `targets=[]` for `#channel` sends and return
135+
immediately. For #147, that must NOT count as committed delivery. The
136+
implementation must either resolve concrete channel recipients before
137+
waiting/committing, or explicitly leave channel-target commit/re-drive as a
138+
residual and avoid committing on `targets=[]` for channel sends. The long-term
139+
fix is for the broker send response to include resolved targets.
140+
141+
File ownership for implementation:
142+
143+
- pear-worker owns the additive BrokerManager helper and the bridge state
144+
machine.
145+
- relay-worker owns BrokerManager helper tests.
146+
- Keep broker.ts changes strictly additive: new helper, new predicate, and
147+
shared/refactored wait logic only. A shared wait refactor is allowed, but it
148+
must preserve `sendMessageAndWaitForDelivery()` behavior exactly and keep that
149+
method's existing tests green. Do not touch the #146 workspace-key/cloud
150+
regions.
151+
152+
## Bridge State Model
153+
154+
Replace optimistic one-shot claims with a small state machine shared by generic
155+
path dedupe and Slack logical/content-hash dedupe.
156+
157+
```ts
158+
type DeliveryClaimStatus = 'provisional' | 'committed'
159+
160+
type DeliveryClaim = {
161+
status: DeliveryClaimStatus
162+
expiresAt: number
163+
confirmationDeadline: number
164+
recipients: Map<string, 'pending' | 'confirmed' | 'failed'>
165+
contentHash?: string
166+
eventIds: Set<string>
167+
}
168+
```
169+
170+
For Slack logical keys, the entry is still keyed by logical Slack identity. A
171+
content-hash bucket carries one claim:
172+
173+
```ts
174+
type SlackLogicalInjectionState = {
175+
expiresAt: number
176+
blindClaim?: DeliveryClaim
177+
contentHashes: Map<string, DeliveryClaim>
178+
}
179+
```
180+
181+
Behavior:
182+
183+
- A committed matching content hash suppresses duplicates until TTL expiry.
184+
- A provisional matching content hash suppresses only concurrent in-flight alias
185+
copies while confirmation is still pending.
186+
- When confirmation times out or fails, the provisional claim is removed. A later
187+
replay of the same logical key/content hash can inject again.
188+
- A blind claim can still learn the first later content hash as in #145, but that
189+
hash stays provisional until delivery is confirmed.
190+
- Genuine edits with new content hashes continue to inject independently.
191+
192+
For non-Slack integration events, the existing recent-injection map becomes the
193+
same provisional/committed state keyed by `eventDedupeKeyWithFingerprint()`.
194+
195+
## Delivery Flow
196+
197+
1. Build the dedupe key and resolve recipients as today.
198+
2. If an equivalent committed claim exists, drop as duplicate.
199+
3. If an equivalent provisional claim exists, drop as duplicate only while its
200+
confirmation window is still open.
201+
4. Create or update a provisional claim before sending, scoped to the event key,
202+
content hash, and intended recipients.
203+
5. Enqueue each broker send through the existing project pacer. The pacer starts
204+
the `sendMessageAndWaitForInjected()` task and does not await its
205+
confirmation.
206+
6. For each started send, attach background handlers:
207+
- On `delivery_injected`: mark that recipient confirmed.
208+
- When all recipients are confirmed: commit the claim and extend `expiresAt`
209+
to the normal replay TTL.
210+
- On failure or timeout: mark failed and release the provisional claim if no
211+
recipient confirmed.
212+
7. If every recipient fails synchronously before send acceptance, release the
213+
claim immediately as today.
214+
8. If some recipients confirm and some fail, commit for confirmed recipients and
215+
log/telemetry the partial failure. A later replay should target only missing
216+
recipients if recipient-scoped tracking is implemented in the same pass;
217+
otherwise release the whole claim to prefer duplicate delivery over message
218+
loss.
219+
220+
The first implementation should prefer whole-claim release on partial failure.
221+
It may duplicate a message for a recipient that already got it, but it avoids the
222+
known worse behavior of losing human instructions.
223+
224+
## Timeout And Telemetry
225+
226+
Default injection-confirmation timeout: 5 seconds for direct `mode: 'steer'`
227+
agent sends.
228+
229+
Rationale: relay-worker confirmed `delivery_injected` fires after the PTY worker
230+
writes bytes, not after the model reads or completes a turn. Five seconds is
231+
therefore intentionally generous for healthy agents while keeping the
232+
provisional claim window short enough that replay can recover promptly and
233+
concurrent duplicate suppression does not linger.
234+
235+
Add aggregate counters/logs:
236+
237+
- `deliveryClaimsProvisional`
238+
- `deliveryClaimsCommitted`
239+
- `deliveryClaimsReleased`
240+
- `deliveryInjectedTimeouts`
241+
- `deliveryInjectedUnsupported`
242+
243+
Use the existing aggregated warning cadence for failures and include only
244+
projectId, provider, logical key prefix, recipient count, and reason. Do not log
245+
message content.
246+
247+
## Test Plan
248+
249+
Add integration-event bridge tests on top of the existing #145 suite:
250+
251+
- Slack unchanged-content replay re-drives after first accepted steer never
252+
emits `delivery_injected`.
253+
- Slack unchanged-content replay is suppressed after `delivery_injected`
254+
commits the content hash.
255+
- Slack alias copies during an open provisional window are suppressed, but a
256+
later replay after timeout is delivered.
257+
- Generic non-Slack event dedupe releases after delivery timeout and re-drives.
258+
- No-recipient and synchronous send-failure paths still release immediately.
259+
- Project broker pacing does not wait on delivery confirmation; the existing
260+
"does not wait on delivery confirmation path" test should become "does not
261+
wait while tracking injection confirmation".
262+
- Missing `sendMessageAndWaitForInjected` in tests/mocks is treated as a hard
263+
harness gap after #147 lands; do not silently fall back to
264+
`sendMessageAndWaitForDelivery`.
265+
- Unsupported injection confirmation commits accepted sends immediately and
266+
emits low-noise telemetry.
267+
268+
Add broker tests for relay-worker's helper extension:
269+
270+
- send acceptance returns before `delivery_injected`;
271+
- confirmation resolves on `delivery_injected` for the matching event id and
272+
target;
273+
- confirmation rejects on delivery failure;
274+
- confirmation rejects on timeout;
275+
- `delivery_ack` / `delivery_verified` without `delivery_injected` do not
276+
commit the helper;
277+
- multi-target sends wait for all reported targets.
278+
279+
## Rollout
280+
281+
1. Land #145 and restart Pear on it to restore reliable content reads.
282+
2. Direct-agent implementation can use `delivery_injected` with a 5-second
283+
timeout. Channel targets must either resolve concrete recipients before
284+
commit or remain explicitly residual without committing on `targets=[]`.
285+
3. pear-worker adds the additive BrokerManager helper and bridge state machine;
286+
relay-worker adds helper tests.
287+
4. Implement #147 as a follow-up stacked from #145/main using that helper as the
288+
confirmation source.
289+
5. Keep slack-comms' debug-log/OOB drop guard enabled until #147 is merged and
290+
the running app is restarted on it.
291+
292+
Residual risk: if the PTY accepts bytes but the agent process is wedged and
293+
never semantically processes them, `delivery_injected` still commits. That is an
294+
agent-liveness/read-receipt problem and belongs to the fuller relay#1056 read
295+
ack path, not #147.

0 commit comments

Comments
 (0)