Skip to content

Commit 3262f8d

Browse files
kjgbotkjgbotagent-relay-code[bot]
authored
DO-NOT-MERGE: Design #147 integration event re-drive (#148)
* Design integration event re-drive after lost steers * chore: apply pr-reviewer fixes for #148 * Implement integration event redrive on injected delivery * Document multi-recipient redrive semantics * test: cover injected delivery helper --------- Co-authored-by: kjgbot <kjgbot@agentrelay.dev> Co-authored-by: agent-relay-code[bot] <agent-relay-code[bot]@users.noreply.github.com>
1 parent fe72ac1 commit 3262f8d

5 files changed

Lines changed: 923 additions & 66 deletions

File tree

Lines changed: 316 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,316 @@
1+
# Integration Event Re-Drive After Lost Steers
2+
3+
Issue: #147. Base: #145 (`split/132-slack-integration-event-context-retry`).
4+
Status: IMPLEMENTATION APPROVED FOR DIRECT-AGENT PATH ONLY. Do not merge until
5+
#145 has landed and project-lead explicitly clears #147. The broker
6+
delivery-signal contract was pinned by relay-worker and project-lead on
7+
2026-06-07.
8+
9+
## Problem
10+
11+
The integration-event bridge currently treats broker send acceptance as delivery.
12+
For Slack logical dedupe, `claimSlackLogicalInjection()` records the
13+
logical/content-hash claim before a steer is known to have reached the target
14+
agent. For non-Slack paths, `wasRecentlyInjected()` does the same with the path
15+
dedupe key.
16+
17+
That is safe for duplicate suppression only when the broker send reliably reaches
18+
the agent. The confirmed failure mode from 2026-06-07 is different:
19+
20+
- Relayfile emits the same Slack record several times.
21+
- The first delivery logs `injecting` for `slack-comms`, so the bridge claims the
22+
logical key.
23+
- The steer is lost after broker acceptance and never appears in the agent
24+
conversation.
25+
- Later re-deliveries are suppressed for the one-hour Slack replay TTL because
26+
the content hash is unchanged.
27+
28+
#145 fixes the separate "content unavailable" path, but it cannot fix a lost
29+
first steer with unchanged content. The bridge needs delivery confirmation and a
30+
bounded re-drive window.
31+
32+
## Goals
33+
34+
- Make dedupe claims provisional until delivery is confirmed.
35+
- Commit a provisional claim only after the broker/harness-driver reports that
36+
the steer reached the intended recipient.
37+
- Release a provisional claim on explicit delivery failure or confirmation
38+
timeout, so a duplicate/replayed event can re-drive the same message.
39+
- Preserve current pacing: event dispatch must not block the project send queue
40+
while waiting for delivery confirmation.
41+
- Keep duplicate hardening by stable event identity first, Slack logical key and
42+
content hash second.
43+
- Make telemetry low-noise: count released provisional claims and confirmation
44+
timeouts without logging every duplicate.
45+
46+
## Non-Goals
47+
48+
- No durable mailbox design in this PR. That belongs to relay#1056.
49+
- No delivery retry loop inside Pear beyond releasing the dedupe claim. Relayfile
50+
replay, stream reconnect, or event-feed polling provides the re-drive input.
51+
- No merge before Khaliq's explicit go. This design stacks on #145 and should
52+
become a follow-up PR after #145 lands.
53+
- No channel-target delivery confirmation in #147 unless concrete resolved agent
54+
targets are available. Empty/unresolved channel targets must never commit a
55+
provisional claim.
56+
57+
## Broker Delivery Contract
58+
59+
Pear already has a related confirmation helper,
60+
`BrokerManager.sendMessageAndWaitForDelivery(projectId, input, { timeoutMs })`.
61+
That helper:
62+
63+
- sends through the broker;
64+
- captures the returned `event_id` and targets;
65+
- subscribes to `session.client.onEvent(...)`;
66+
- resolves on `delivery_ack`, `delivery_verified`, or
67+
`message_delivery_confirmed` for that event and target;
68+
- rejects on `delivery_failed`, `message_delivery_failed`, or its own timeout.
69+
70+
Project-lead ruled that #147 must NOT commit on that helper's current success
71+
condition. `delivery_ack` and `delivery_verified` can be satisfied by an echo
72+
timeout fallback, which is too weak for this bug: it could commit the dedupe
73+
claim even if the steer never crossed the lost hop.
74+
75+
Commit signal for #147: `delivery_injected`, filtered by returned
76+
`event_id` and target. That event is emitted by the PTY worker after it writes
77+
the injection bytes and carriage return to the PTY. It is the correct boundary
78+
for mode-B loss: "accepted by broker" is too early, and "read by model" is out
79+
of scope.
80+
81+
Relay-worker owns the BrokerManager helper extension. The bridge should depend
82+
on a sibling helper, not hand-subscribe to broker events:
83+
84+
```ts
85+
sendMessageAndWaitForInjected(
86+
projectId: string,
87+
input: SendMessageInput,
88+
options?: { timeoutMs?: number; signal?: AbortSignal }
89+
): Promise<{ eventId: string; targets: string[] }>
90+
```
91+
92+
Semantics:
93+
94+
- Internals share/refactor the existing send-plus-event-wait logic from
95+
`sendMessageAndWaitForDelivery`.
96+
- Success is only `delivery_injected` matching the returned `event_id` and
97+
target `name`.
98+
- `delivery_ack`, `delivery_verified`, and `message_delivery_confirmed` are not
99+
success for this helper. If they arrive without `delivery_injected`, the
100+
helper keeps waiting until `delivery_injected` or timeout.
101+
- Failure stays `delivery_failed` / `message_delivery_failed`, or helper
102+
timeout.
103+
- Target handling matches the existing helper: use returned `targets` when
104+
present; for direct agent sends with no targets, fall back to `[input.to]`.
105+
- Unsupported brokers may return no usable `event_id`; the bridge treats those as
106+
accepted-only and commits immediately, preserving current behavior. A direct
107+
agent send with no returned targets is not unsupported: it falls back to
108+
`[input.to]` and must still wait for `delivery_injected`. Channel/project
109+
fanout with no concrete targets is the residual case described below.
110+
- Before waiting for `delivery_injected`, BrokerManager should detect support
111+
from broker session metadata, such as an explicit capability flag or the
112+
protocol/version threshold that introduced `delivery_injected`. If the session
113+
is known not to support the signal, return an accepted-only result immediately
114+
and record `deliveryInjectedUnsupported` instead of waiting for timeout.
115+
116+
The helper may internally reuse the current `sendMessageAndWaitForDelivery`
117+
mechanics: send once, capture `event_id` / targets, subscribe internally, filter
118+
events in BrokerManager. It must differ by resolving only on
119+
`delivery_injected`.
120+
121+
Calling this helper from the existing project pacer and awaiting it would
122+
reintroduce the old head-of-line block. The bridge must start the helper as a
123+
background confirmation task, attach `.then(commit).catch(release)`, and return
124+
from the pacer callback immediately after starting the send attempt. The pacer
125+
rate-limits starts, not confirmation completion.
126+
127+
Background confirmation tasks must be project-scoped and cancellable. The bridge
128+
should keep an `AbortController` per active confirmation task, pass its
129+
`signal` to `sendMessageAndWaitForInjected()`, and abort all pending
130+
confirmations from `close(projectId)` before disposing the pacer. Abort should
131+
release provisional claims without emitting delivery-failure warnings for a
132+
project that is intentionally closing.
133+
134+
Coverage result: relay-worker confirmed `delivery_injected` is emitted for the
135+
agent transports the integration bridge can inject to. Local PTY workers emit it
136+
after writing the injection bytes and carriage return. Headless/app-server
137+
runtimes emit it on successful handler delivery before ack/failure.
138+
Cloud-attached sandboxes use the same harness-driver event stream through
139+
`attachCloudSandbox`, so direct sends to cloud-owned agents use the same
140+
delivery signal once the agent has been observed/listed.
141+
142+
Timing result: for current integration-event `mode: 'steer'`, `delivery_injected`
143+
is not gated on the model becoming idle. PTY steer sends ESC ESC, waits briefly,
144+
writes the injection and CR, then emits `delivery_injected`. It can still wait
145+
behind the worker's local pending-injection queue or a wedged PTY write, but it
146+
does not wait for a long model turn to complete.
147+
148+
Channel guard: the bridge can target channels as well as direct agents. Broker
149+
channel sends fan out to concrete workers that use the same delivery machinery,
150+
but today's `/api/send` response does not include the resolved worker names.
151+
Existing wait helpers therefore see `targets=[]` for `#channel` sends and return
152+
immediately. For #147, that must NOT count as committed delivery. Engage
153+
provisional -> commit-on-injected only when there is at least one concrete agent
154+
target. Empty-target/channel sends keep prior semantics, are logged as residual,
155+
and must never be falsely committed. Mixed direct-agent plus channel batches use
156+
the same conservative behavior for the whole batch: no injected-delivery claim
157+
is committed, so direct-agent recipients in that mixed batch do not get #147
158+
re-drive protection. The long-term fix is for the broker send response to
159+
include resolved targets.
160+
161+
File ownership for implementation:
162+
163+
- pear-worker owns the additive BrokerManager helper and the bridge state
164+
machine.
165+
- relay-worker owns BrokerManager helper tests.
166+
- Keep broker.ts changes strictly additive: new helper, new predicate, and
167+
shared/refactored wait logic only. A shared wait refactor is allowed, but it
168+
must preserve `sendMessageAndWaitForDelivery()` behavior exactly and keep that
169+
method's existing tests green. Do not touch the #146 workspace-key/cloud
170+
regions.
171+
172+
## Bridge State Model
173+
174+
Replace optimistic one-shot claims with a small state machine shared by generic
175+
path dedupe and Slack logical/content-hash dedupe.
176+
177+
```ts
178+
type DeliveryClaimStatus = 'provisional' | 'committed'
179+
180+
type DeliveryClaim = {
181+
status: DeliveryClaimStatus
182+
expiresAt: number
183+
confirmationDeadline: number
184+
recipients: Map<string, 'pending' | 'confirmed' | 'failed'>
185+
contentHash?: string
186+
eventIds: Set<string>
187+
}
188+
```
189+
190+
For Slack logical keys, the entry is still keyed by logical Slack identity. A
191+
content-hash bucket carries one claim:
192+
193+
```ts
194+
type SlackLogicalInjectionState = {
195+
expiresAt: number
196+
blindClaim?: DeliveryClaim
197+
contentHashes: Map<string, DeliveryClaim>
198+
}
199+
```
200+
201+
Behavior:
202+
203+
- A committed matching content hash suppresses duplicates until TTL expiry.
204+
- A provisional matching content hash suppresses only concurrent in-flight alias
205+
copies while confirmation is still pending.
206+
- When confirmation times out or fails, the provisional claim is removed. A later
207+
replay of the same logical key/content hash can inject again.
208+
- A blind claim can still learn the first later content hash as in #145. When
209+
the hash is learned, move the existing `blindClaim` into `contentHashes` under
210+
that hash and clear `blindClaim`; the moved claim stays provisional until
211+
delivery is confirmed.
212+
- Genuine edits with new content hashes continue to inject independently.
213+
214+
For non-Slack integration events, the existing recent-injection map becomes the
215+
same provisional/committed state keyed by `eventDedupeKeyWithFingerprint()`.
216+
217+
## Delivery Flow
218+
219+
1. Build the dedupe key and resolve recipients as today.
220+
2. If an equivalent committed claim exists, drop as duplicate.
221+
3. If an equivalent provisional claim exists, drop as duplicate only while its
222+
confirmation window is still open.
223+
4. Create or update a provisional claim before sending, scoped to the event key,
224+
content hash, and intended recipients.
225+
5. Enqueue each broker send through the existing project pacer. The pacer starts
226+
the `sendMessageAndWaitForInjected()` task and does not await its
227+
confirmation.
228+
6. For each started send, attach background handlers:
229+
- On `delivery_injected`: mark that recipient confirmed.
230+
- When all recipients are confirmed: commit the claim and extend `expiresAt`
231+
to the normal replay TTL.
232+
- On failure or timeout: mark failed and release the provisional claim.
233+
7. If every recipient fails synchronously before send acceptance, release the
234+
claim immediately as today.
235+
8. If some recipients confirm and some fail, release the whole shared logical
236+
claim rather than committing on first success. A later replay re-sends to all
237+
recipients, so recipients that already received the steer may see a duplicate.
238+
That is the accepted duplicate-over-drop bias for #147. True per-recipient
239+
claims could avoid the duplicate, but they add complexity and are out of
240+
scope for this pass; the live drop path is single-recipient (`slack-comms`).
241+
242+
## Timeout And Telemetry
243+
244+
Default injection-confirmation timeout: 5 seconds for direct `mode: 'steer'`
245+
agent sends.
246+
247+
Rationale: relay-worker confirmed `delivery_injected` fires after the PTY worker
248+
writes bytes, not after the model reads or completes a turn. Five seconds is
249+
therefore intentionally generous for healthy agents while keeping the
250+
provisional claim window short enough that replay can recover promptly and
251+
concurrent duplicate suppression does not linger.
252+
253+
Add aggregate counters/logs:
254+
255+
- `deliveryClaimsProvisional`
256+
- `deliveryClaimsCommitted`
257+
- `deliveryClaimsReleased`
258+
- `deliveryInjectedTimeouts`
259+
- `deliveryInjectedUnsupported`
260+
261+
Use the existing aggregated warning cadence for failures and include only
262+
projectId, provider, logical key prefix, recipient count, and reason. Do not log
263+
message content.
264+
265+
## Test Plan
266+
267+
Add integration-event bridge tests on top of the existing #145 suite:
268+
269+
- Slack unchanged-content replay re-drives after first accepted steer never
270+
emits `delivery_injected`.
271+
- Slack unchanged-content replay is suppressed after `delivery_injected`
272+
commits the content hash.
273+
- Slack alias copies during an open provisional window are suppressed, but a
274+
later replay after timeout is delivered.
275+
- Generic non-Slack event dedupe releases after delivery timeout and re-drives.
276+
- No-recipient and synchronous send-failure paths still release immediately.
277+
- Project broker pacing does not wait on delivery confirmation; the existing
278+
"does not wait on delivery confirmation path" test should become "does not
279+
wait while tracking injection confirmation".
280+
- Missing `sendMessageAndWaitForInjected` in tests/mocks is treated as a hard
281+
harness gap after #147 lands; do not silently fall back to
282+
`sendMessageAndWaitForDelivery`.
283+
- Unsupported injection confirmation commits accepted sends immediately and
284+
emits low-noise telemetry.
285+
- Project close aborts pending background confirmations, releases provisional
286+
claims, and does not log delivery-failure warnings for the intentional abort.
287+
288+
Add broker tests for relay-worker's helper extension:
289+
290+
- send acceptance returns before `delivery_injected`;
291+
- confirmation resolves on `delivery_injected` for the matching event id and
292+
target;
293+
- confirmation rejects on delivery failure;
294+
- confirmation rejects on timeout;
295+
- `delivery_ack` / `delivery_verified` without `delivery_injected` do not
296+
commit the helper;
297+
- multi-target sends wait for all reported targets.
298+
299+
## Rollout
300+
301+
1. Land #145 and restart Pear on it to restore reliable content reads.
302+
2. Direct-agent implementation uses `delivery_injected` with a 5-second timeout.
303+
Channel/unresolved targets remain residual and never commit on `targets=[]`.
304+
Mixed direct-agent plus channel batches also keep prior semantics for the
305+
whole batch.
306+
3. pear-worker adds the additive BrokerManager helper and bridge state machine;
307+
relay-worker adds helper tests.
308+
4. Implement #147 as a follow-up stacked from #145/main using that helper as the
309+
confirmation source.
310+
5. Keep slack-comms' debug-log/OOB drop guard enabled until #147 is merged and
311+
the running app is restarted on it.
312+
313+
Residual risk: if the PTY accepts bytes but the agent process is wedged and
314+
never semantically processes them, `delivery_injected` still commits. That is an
315+
agent-liveness/read-receipt problem and belongs to the fuller relay#1056 read
316+
ack path, not #147.

0 commit comments

Comments
 (0)