Skip to content

Commit b09e585

Browse files
author
kjgbot
committed
Design integration event re-drive after lost steers
1 parent b9cba58 commit b09e585

1 file changed

Lines changed: 273 additions & 0 deletions

File tree

Lines changed: 273 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,273 @@
1+
# Integration Event Re-Drive After Lost Steers
2+
3+
Issue: #147. Base: #145 (`split/132-slack-integration-event-context-retry`).
4+
Status: DESIGN ONLY. Do not implement or merge until #145 has landed. The
5+
broker delivery-signal contract was pinned by relay-worker and project-lead on
6+
2026-06-07.
7+
8+
## Problem
9+
10+
The integration-event bridge currently treats broker send acceptance as delivery.
11+
For Slack logical dedupe, `claimSlackLogicalInjection()` records the
12+
logical/content-hash claim before a steer is known to have reached the target
13+
agent. For non-Slack paths, `wasRecentlyInjected()` does the same with the path
14+
dedupe key.
15+
16+
That is safe for duplicate suppression only when the broker send reliably reaches
17+
the agent. The confirmed failure mode from 2026-06-07 is different:
18+
19+
- Relayfile emits the same Slack record several times.
20+
- The first delivery logs `injecting` for `slack-comms`, so the bridge claims the
21+
logical key.
22+
- The steer is lost after broker acceptance and never appears in the agent
23+
conversation.
24+
- Later re-deliveries are suppressed for the one-hour Slack replay TTL because
25+
the content hash is unchanged.
26+
27+
#145 fixes the separate "content unavailable" path, but it cannot fix a lost
28+
first steer with unchanged content. The bridge needs delivery confirmation and a
29+
bounded re-drive window.
30+
31+
## Goals
32+
33+
- Make dedupe claims provisional until delivery is confirmed.
34+
- Commit a provisional claim only after the broker/harness-driver reports that
35+
the steer reached the intended recipient.
36+
- Release a provisional claim on explicit delivery failure or confirmation
37+
timeout, so a duplicate/replayed event can re-drive the same message.
38+
- Preserve current pacing: event dispatch must not block the project send queue
39+
while waiting for delivery confirmation.
40+
- Keep duplicate hardening by stable event identity first, Slack logical key and
41+
content hash second.
42+
- Make telemetry low-noise: count released provisional claims and confirmation
43+
timeouts without logging every duplicate.
44+
45+
## Non-Goals
46+
47+
- No durable mailbox design in this PR. That belongs to relay#1056.
48+
- No delivery retry loop inside Pear beyond releasing the dedupe claim. Relayfile
49+
replay, stream reconnect, or event-feed polling provides the re-drive input.
50+
- No merge before Khaliq's explicit go. This design stacks on #145 and should
51+
become a follow-up PR after #145 lands.
52+
- No implementation until relay-worker confirms `delivery_injected` coverage for
53+
every agent transport the integration bridge can target.
54+
55+
## Broker Delivery Contract
56+
57+
Pear already has a related confirmation helper,
58+
`BrokerManager.sendMessageAndWaitForDelivery(projectId, input, { timeoutMs })`.
59+
That helper:
60+
61+
- sends through the broker;
62+
- captures the returned `event_id` and targets;
63+
- subscribes to `session.client.onEvent(...)`;
64+
- resolves on `delivery_ack`, `delivery_verified`, or
65+
`message_delivery_confirmed` for that event and target;
66+
- rejects on `delivery_failed`, `message_delivery_failed`, or its own timeout.
67+
68+
Project-lead ruled that #147 must NOT commit on that helper's current success
69+
condition. `delivery_ack` and `delivery_verified` can be satisfied by an echo
70+
timeout fallback, which is too weak for this bug: it could commit the dedupe
71+
claim even if the steer never crossed the lost hop.
72+
73+
Commit signal for #147: `delivery_injected`, filtered by returned
74+
`event_id` and target. That event is emitted by the PTY worker after it writes
75+
the injection bytes and carriage return to the PTY. It is the correct boundary
76+
for mode-B loss: "accepted by broker" is too early, and "read by model" is out
77+
of scope.
78+
79+
Relay-worker owns the BrokerManager helper extension. The bridge should depend
80+
on a sibling helper, not hand-subscribe to broker events:
81+
82+
```ts
83+
sendMessageAndWaitForInjected(
84+
projectId: string,
85+
input: SendMessageInput,
86+
options?: { timeoutMs?: number }
87+
): Promise<{ eventId: string; targets: string[] }>
88+
```
89+
90+
Semantics:
91+
92+
- Internals share/refactor the existing send-plus-event-wait logic from
93+
`sendMessageAndWaitForDelivery`.
94+
- Success is only `delivery_injected` matching the returned `event_id` and
95+
target `name`.
96+
- `delivery_ack`, `delivery_verified`, and `message_delivery_confirmed` are not
97+
success for this helper. If they arrive without `delivery_injected`, the
98+
helper keeps waiting until `delivery_injected` or timeout.
99+
- Failure stays `delivery_failed` / `message_delivery_failed`, or helper
100+
timeout.
101+
- Target handling matches the existing helper: use returned `targets` when
102+
present; for direct agent sends with no targets, fall back to `[input.to]`.
103+
- Unsupported brokers may return no `eventId` or no targets; the bridge treats
104+
those as accepted-only and commits immediately, preserving current behavior.
105+
106+
The helper may internally reuse the current `sendMessageAndWaitForDelivery`
107+
mechanics: send once, capture `event_id` / targets, subscribe internally, filter
108+
events in BrokerManager. It must differ by resolving only on
109+
`delivery_injected`.
110+
111+
Calling this helper from the existing project pacer and awaiting it would
112+
reintroduce the old head-of-line block. The bridge must start the helper as a
113+
background confirmation task, attach `.then(commit).catch(release)`, and return
114+
from the pacer callback immediately after starting the send attempt. The pacer
115+
rate-limits starts, not confirmation completion.
116+
117+
Coverage gate before implementation: confirm `delivery_injected` is emitted for
118+
every transport the integration bridge can inject to, including local PTY,
119+
cloud, and remote agents. If any transport lacks `delivery_injected`, define
120+
that transport's success signal first; otherwise the bridge would release on
121+
every timeout and re-drive duplicates indefinitely.
122+
123+
File ownership for implementation:
124+
125+
- pear-worker owns the additive BrokerManager helper and the bridge state
126+
machine.
127+
- relay-worker owns BrokerManager helper tests.
128+
- Keep broker.ts changes strictly additive: new helper, new predicate, and
129+
shared/refactored wait logic only. Do not touch the #146 workspace-key/cloud
130+
regions.
131+
132+
## Bridge State Model
133+
134+
Replace optimistic one-shot claims with a small state machine shared by generic
135+
path dedupe and Slack logical/content-hash dedupe.
136+
137+
```ts
138+
type DeliveryClaimStatus = 'provisional' | 'committed'
139+
140+
type DeliveryClaim = {
141+
status: DeliveryClaimStatus
142+
expiresAt: number
143+
confirmationDeadline: number
144+
recipients: Map<string, 'pending' | 'confirmed' | 'failed'>
145+
contentHash?: string
146+
eventIds: Set<string>
147+
}
148+
```
149+
150+
For Slack logical keys, the entry is still keyed by logical Slack identity. A
151+
content-hash bucket carries one claim:
152+
153+
```ts
154+
type SlackLogicalInjectionState = {
155+
expiresAt: number
156+
blindClaim?: DeliveryClaim
157+
contentHashes: Map<string, DeliveryClaim>
158+
}
159+
```
160+
161+
Behavior:
162+
163+
- A committed matching content hash suppresses duplicates until TTL expiry.
164+
- A provisional matching content hash suppresses only concurrent in-flight alias
165+
copies while confirmation is still pending.
166+
- When confirmation times out or fails, the provisional claim is removed. A later
167+
replay of the same logical key/content hash can inject again.
168+
- A blind claim can still learn the first later content hash as in #145, but that
169+
hash stays provisional until delivery is confirmed.
170+
- Genuine edits with new content hashes continue to inject independently.
171+
172+
For non-Slack integration events, the existing recent-injection map becomes the
173+
same provisional/committed state keyed by `eventDedupeKeyWithFingerprint()`.
174+
175+
## Delivery Flow
176+
177+
1. Build the dedupe key and resolve recipients as today.
178+
2. If an equivalent committed claim exists, drop as duplicate.
179+
3. If an equivalent provisional claim exists, drop as duplicate only while its
180+
confirmation window is still open.
181+
4. Create or update a provisional claim before sending, scoped to the event key,
182+
content hash, and intended recipients.
183+
5. Enqueue each broker send through the existing project pacer. The pacer starts
184+
the `sendMessageAndWaitForInjected()` task and does not await its
185+
confirmation.
186+
6. For each started send, attach background handlers:
187+
- On `delivery_injected`: mark that recipient confirmed.
188+
- When all recipients are confirmed: commit the claim and extend `expiresAt`
189+
to the normal replay TTL.
190+
- On failure or timeout: mark failed and release the provisional claim if no
191+
recipient confirmed.
192+
7. If every recipient fails synchronously before send acceptance, release the
193+
claim immediately as today.
194+
8. If some recipients confirm and some fail, commit for confirmed recipients and
195+
log/telemetry the partial failure. A later replay should target only missing
196+
recipients if recipient-scoped tracking is implemented in the same pass;
197+
otherwise release the whole claim to prefer duplicate delivery over message
198+
loss.
199+
200+
The first implementation should prefer whole-claim release on partial failure.
201+
It may duplicate a message for a recipient that already got it, but it avoids the
202+
known worse behavior of losing human instructions.
203+
204+
## Timeout And Telemetry
205+
206+
Proposed default injection-confirmation timeout: 5 seconds.
207+
208+
Rationale: `delivery_injected` is emitted at the PTY write boundary, so healthy
209+
agents should confirm quickly even if they are momentarily busy. Five seconds
210+
tolerates short event-loop or PTY scheduling delays while keeping the
211+
provisional claim window short enough that replay can recover promptly and
212+
concurrent duplicate suppression does not linger.
213+
214+
Add aggregate counters/logs:
215+
216+
- `deliveryClaimsProvisional`
217+
- `deliveryClaimsCommitted`
218+
- `deliveryClaimsReleased`
219+
- `deliveryInjectedTimeouts`
220+
- `deliveryInjectedUnsupported`
221+
222+
Use the existing aggregated warning cadence for failures and include only
223+
projectId, provider, logical key prefix, recipient count, and reason. Do not log
224+
message content.
225+
226+
## Test Plan
227+
228+
Add integration-event bridge tests on top of the existing #145 suite:
229+
230+
- Slack unchanged-content replay re-drives after first accepted steer never
231+
emits `delivery_injected`.
232+
- Slack unchanged-content replay is suppressed after `delivery_injected`
233+
commits the content hash.
234+
- Slack alias copies during an open provisional window are suppressed, but a
235+
later replay after timeout is delivered.
236+
- Generic non-Slack event dedupe releases after delivery timeout and re-drives.
237+
- No-recipient and synchronous send-failure paths still release immediately.
238+
- Project broker pacing does not wait on delivery confirmation; the existing
239+
"does not wait on delivery confirmation path" test should become "does not
240+
wait while tracking injection confirmation".
241+
- Missing `sendMessageAndWaitForInjected` in tests/mocks is treated as a hard
242+
harness gap after #147 lands; do not silently fall back to
243+
`sendMessageAndWaitForDelivery`.
244+
- Unsupported injection confirmation commits accepted sends immediately and
245+
emits low-noise telemetry.
246+
247+
Add broker tests for relay-worker's helper extension:
248+
249+
- send acceptance returns before `delivery_injected`;
250+
- confirmation resolves on `delivery_injected` for the matching event id and
251+
target;
252+
- confirmation rejects on delivery failure;
253+
- confirmation rejects on timeout;
254+
- `delivery_ack` / `delivery_verified` without `delivery_injected` do not
255+
commit the helper;
256+
- multi-target sends wait for all reported targets.
257+
258+
## Rollout
259+
260+
1. Land #145 and restart Pear on it to restore reliable content reads.
261+
2. Relay-worker confirms `delivery_injected` coverage across bridge target
262+
transports.
263+
3. pear-worker adds the additive BrokerManager helper and bridge state machine;
264+
relay-worker adds helper tests.
265+
4. Implement #147 as a follow-up stacked from #145/main using that helper as the
266+
confirmation source.
267+
5. Keep slack-comms' debug-log/OOB drop guard enabled until #147 is merged and
268+
the running app is restarted on it.
269+
270+
Residual risk: if the PTY accepts bytes but the agent process is wedged and
271+
never semantically processes them, `delivery_injected` still commits. That is an
272+
agent-liveness/read-receipt problem and belongs to the fuller relay#1056 read
273+
ack path, not #147.

0 commit comments

Comments
 (0)