|
| 1 | +# Integration Event Re-Drive After Lost Steers |
| 2 | + |
| 3 | +Issue: #147. Base: #145 (`split/132-slack-integration-event-context-retry`). |
| 4 | +Status: IMPLEMENTATION APPROVED FOR DIRECT-AGENT PATH ONLY. Do not merge until |
| 5 | +#145 has landed and project-lead explicitly clears #147. The broker |
| 6 | +delivery-signal contract was pinned by relay-worker and project-lead on |
| 7 | +2026-06-07. |
| 8 | + |
| 9 | +## Problem |
| 10 | + |
| 11 | +The integration-event bridge currently treats broker send acceptance as delivery. |
| 12 | +For Slack logical dedupe, `claimSlackLogicalInjection()` records the |
| 13 | +logical/content-hash claim before a steer is known to have reached the target |
| 14 | +agent. For non-Slack paths, `wasRecentlyInjected()` does the same with the path |
| 15 | +dedupe key. |
| 16 | + |
| 17 | +That is safe for duplicate suppression only when the broker send reliably reaches |
| 18 | +the agent. The confirmed failure mode from 2026-06-07 is different: |
| 19 | + |
| 20 | +- Relayfile emits the same Slack record several times. |
| 21 | +- The first delivery logs `injecting` for `slack-comms`, so the bridge claims the |
| 22 | + logical key. |
| 23 | +- The steer is lost after broker acceptance and never appears in the agent |
| 24 | + conversation. |
| 25 | +- Later re-deliveries are suppressed for the one-hour Slack replay TTL because |
| 26 | + the content hash is unchanged. |
| 27 | + |
| 28 | +#145 fixes the separate "content unavailable" path, but it cannot fix a lost |
| 29 | +first steer with unchanged content. The bridge needs delivery confirmation and a |
| 30 | +bounded re-drive window. |
| 31 | + |
| 32 | +## Goals |
| 33 | + |
| 34 | +- Make dedupe claims provisional until delivery is confirmed. |
| 35 | +- Commit a provisional claim only after the broker/harness-driver reports that |
| 36 | + the steer reached the intended recipient. |
| 37 | +- Release a provisional claim on explicit delivery failure or confirmation |
| 38 | + timeout, so a duplicate/replayed event can re-drive the same message. |
| 39 | +- Preserve current pacing: event dispatch must not block the project send queue |
| 40 | + while waiting for delivery confirmation. |
| 41 | +- Keep duplicate hardening by stable event identity first, Slack logical key and |
| 42 | + content hash second. |
| 43 | +- Make telemetry low-noise: count released provisional claims and confirmation |
| 44 | + timeouts without logging every duplicate. |
| 45 | + |
| 46 | +## Non-Goals |
| 47 | + |
| 48 | +- No durable mailbox design in this PR. That belongs to relay#1056. |
| 49 | +- No delivery retry loop inside Pear beyond releasing the dedupe claim. Relayfile |
| 50 | + replay, stream reconnect, or event-feed polling provides the re-drive input. |
| 51 | +- No merge before Khaliq's explicit go. This design stacks on #145 and should |
| 52 | + become a follow-up PR after #145 lands. |
| 53 | +- No channel-target delivery confirmation in #147 unless concrete resolved agent |
| 54 | + targets are available. Empty/unresolved channel targets must never commit a |
| 55 | + provisional claim. |
| 56 | + |
| 57 | +## Broker Delivery Contract |
| 58 | + |
| 59 | +Pear already has a related confirmation helper, |
| 60 | +`BrokerManager.sendMessageAndWaitForDelivery(projectId, input, { timeoutMs })`. |
| 61 | +That helper: |
| 62 | + |
| 63 | +- sends through the broker; |
| 64 | +- captures the returned `event_id` and targets; |
| 65 | +- subscribes to `session.client.onEvent(...)`; |
| 66 | +- resolves on `delivery_ack`, `delivery_verified`, or |
| 67 | + `message_delivery_confirmed` for that event and target; |
| 68 | +- rejects on `delivery_failed`, `message_delivery_failed`, or its own timeout. |
| 69 | + |
| 70 | +Project-lead ruled that #147 must NOT commit on that helper's current success |
| 71 | +condition. `delivery_ack` and `delivery_verified` can be satisfied by an echo |
| 72 | +timeout fallback, which is too weak for this bug: it could commit the dedupe |
| 73 | +claim even if the steer never crossed the lost hop. |
| 74 | + |
| 75 | +Commit signal for #147: `delivery_injected`, filtered by returned |
| 76 | +`event_id` and target. That event is emitted by the PTY worker after it writes |
| 77 | +the injection bytes and carriage return to the PTY. It is the correct boundary |
| 78 | +for mode-B loss: "accepted by broker" is too early, and "read by model" is out |
| 79 | +of scope. |
| 80 | + |
| 81 | +Relay-worker owns the BrokerManager helper extension. The bridge should depend |
| 82 | +on a sibling helper, not hand-subscribe to broker events: |
| 83 | + |
| 84 | +```ts |
| 85 | +sendMessageAndWaitForInjected( |
| 86 | + projectId: string, |
| 87 | + input: SendMessageInput, |
| 88 | + options?: { timeoutMs?: number; signal?: AbortSignal } |
| 89 | +): Promise<{ eventId: string; targets: string[] }> |
| 90 | +``` |
| 91 | + |
| 92 | +Semantics: |
| 93 | + |
| 94 | +- Internals share/refactor the existing send-plus-event-wait logic from |
| 95 | + `sendMessageAndWaitForDelivery`. |
| 96 | +- Success is only `delivery_injected` matching the returned `event_id` and |
| 97 | + target `name`. |
| 98 | +- `delivery_ack`, `delivery_verified`, and `message_delivery_confirmed` are not |
| 99 | + success for this helper. If they arrive without `delivery_injected`, the |
| 100 | + helper keeps waiting until `delivery_injected` or timeout. |
| 101 | +- Failure stays `delivery_failed` / `message_delivery_failed`, or helper |
| 102 | + timeout. |
| 103 | +- Target handling matches the existing helper: use returned `targets` when |
| 104 | + present; for direct agent sends with no targets, fall back to `[input.to]`. |
| 105 | +- Unsupported brokers may return no usable `event_id`; the bridge treats those as |
| 106 | + accepted-only and commits immediately, preserving current behavior. A direct |
| 107 | + agent send with no returned targets is not unsupported: it falls back to |
| 108 | + `[input.to]` and must still wait for `delivery_injected`. Channel/project |
| 109 | + fanout with no concrete targets is the residual case described below. |
| 110 | +- Before waiting for `delivery_injected`, BrokerManager should detect support |
| 111 | + from broker session metadata, such as an explicit capability flag or the |
| 112 | + protocol/version threshold that introduced `delivery_injected`. If the session |
| 113 | + is known not to support the signal, return an accepted-only result immediately |
| 114 | + and record `deliveryInjectedUnsupported` instead of waiting for timeout. |
| 115 | + |
| 116 | +The helper may internally reuse the current `sendMessageAndWaitForDelivery` |
| 117 | +mechanics: send once, capture `event_id` / targets, subscribe internally, filter |
| 118 | +events in BrokerManager. It must differ by resolving only on |
| 119 | +`delivery_injected`. |
| 120 | + |
| 121 | +Calling this helper from the existing project pacer and awaiting it would |
| 122 | +reintroduce the old head-of-line block. The bridge must start the helper as a |
| 123 | +background confirmation task, attach `.then(commit).catch(release)`, and return |
| 124 | +from the pacer callback immediately after starting the send attempt. The pacer |
| 125 | +rate-limits starts, not confirmation completion. |
| 126 | + |
| 127 | +Background confirmation tasks must be project-scoped and cancellable. The bridge |
| 128 | +should keep an `AbortController` per active confirmation task, pass its |
| 129 | +`signal` to `sendMessageAndWaitForInjected()`, and abort all pending |
| 130 | +confirmations from `close(projectId)` before disposing the pacer. Abort should |
| 131 | +release provisional claims without emitting delivery-failure warnings for a |
| 132 | +project that is intentionally closing. |
| 133 | + |
| 134 | +Coverage result: relay-worker confirmed `delivery_injected` is emitted for the |
| 135 | +agent transports the integration bridge can inject to. Local PTY workers emit it |
| 136 | +after writing the injection bytes and carriage return. Headless/app-server |
| 137 | +runtimes emit it on successful handler delivery before ack/failure. |
| 138 | +Cloud-attached sandboxes use the same harness-driver event stream through |
| 139 | +`attachCloudSandbox`, so direct sends to cloud-owned agents use the same |
| 140 | +delivery signal once the agent has been observed/listed. |
| 141 | + |
| 142 | +Timing result: for current integration-event `mode: 'steer'`, `delivery_injected` |
| 143 | +is not gated on the model becoming idle. PTY steer sends ESC ESC, waits briefly, |
| 144 | +writes the injection and CR, then emits `delivery_injected`. It can still wait |
| 145 | +behind the worker's local pending-injection queue or a wedged PTY write, but it |
| 146 | +does not wait for a long model turn to complete. |
| 147 | + |
| 148 | +Channel guard: the bridge can target channels as well as direct agents. Broker |
| 149 | +channel sends fan out to concrete workers that use the same delivery machinery, |
| 150 | +but today's `/api/send` response does not include the resolved worker names. |
| 151 | +Existing wait helpers therefore see `targets=[]` for `#channel` sends and return |
| 152 | +immediately. For #147, that must NOT count as committed delivery. Engage |
| 153 | +provisional -> commit-on-injected only when there is at least one concrete agent |
| 154 | +target. Empty-target/channel sends keep prior semantics, are logged as residual, |
| 155 | +and must never be falsely committed. Mixed direct-agent plus channel batches use |
| 156 | +the same conservative behavior for the whole batch: no injected-delivery claim |
| 157 | +is committed, so direct-agent recipients in that mixed batch do not get #147 |
| 158 | +re-drive protection. The long-term fix is for the broker send response to |
| 159 | +include resolved targets. |
| 160 | + |
| 161 | +File ownership for implementation: |
| 162 | + |
| 163 | +- pear-worker owns the additive BrokerManager helper and the bridge state |
| 164 | + machine. |
| 165 | +- relay-worker owns BrokerManager helper tests. |
| 166 | +- Keep broker.ts changes strictly additive: new helper, new predicate, and |
| 167 | + shared/refactored wait logic only. A shared wait refactor is allowed, but it |
| 168 | + must preserve `sendMessageAndWaitForDelivery()` behavior exactly and keep that |
| 169 | + method's existing tests green. Do not touch the #146 workspace-key/cloud |
| 170 | + regions. |
| 171 | + |
| 172 | +## Bridge State Model |
| 173 | + |
| 174 | +Replace optimistic one-shot claims with a small state machine shared by generic |
| 175 | +path dedupe and Slack logical/content-hash dedupe. |
| 176 | + |
| 177 | +```ts |
| 178 | +type DeliveryClaimStatus = 'provisional' | 'committed' |
| 179 | + |
| 180 | +type DeliveryClaim = { |
| 181 | + status: DeliveryClaimStatus |
| 182 | + expiresAt: number |
| 183 | + confirmationDeadline: number |
| 184 | + recipients: Map<string, 'pending' | 'confirmed' | 'failed'> |
| 185 | + contentHash?: string |
| 186 | + eventIds: Set<string> |
| 187 | +} |
| 188 | +``` |
| 189 | +
|
| 190 | +For Slack logical keys, the entry is still keyed by logical Slack identity. A |
| 191 | +content-hash bucket carries one claim: |
| 192 | +
|
| 193 | +```ts |
| 194 | +type SlackLogicalInjectionState = { |
| 195 | + expiresAt: number |
| 196 | + blindClaim?: DeliveryClaim |
| 197 | + contentHashes: Map<string, DeliveryClaim> |
| 198 | +} |
| 199 | +``` |
| 200 | +
|
| 201 | +Behavior: |
| 202 | +
|
| 203 | +- A committed matching content hash suppresses duplicates until TTL expiry. |
| 204 | +- A provisional matching content hash suppresses only concurrent in-flight alias |
| 205 | + copies while confirmation is still pending. |
| 206 | +- When confirmation times out or fails, the provisional claim is removed. A later |
| 207 | + replay of the same logical key/content hash can inject again. |
| 208 | +- A blind claim can still learn the first later content hash as in #145. When |
| 209 | + the hash is learned, move the existing `blindClaim` into `contentHashes` under |
| 210 | + that hash and clear `blindClaim`; the moved claim stays provisional until |
| 211 | + delivery is confirmed. |
| 212 | +- Genuine edits with new content hashes continue to inject independently. |
| 213 | +
|
| 214 | +For non-Slack integration events, the existing recent-injection map becomes the |
| 215 | +same provisional/committed state keyed by `eventDedupeKeyWithFingerprint()`. |
| 216 | +
|
| 217 | +## Delivery Flow |
| 218 | +
|
| 219 | +1. Build the dedupe key and resolve recipients as today. |
| 220 | +2. If an equivalent committed claim exists, drop as duplicate. |
| 221 | +3. If an equivalent provisional claim exists, drop as duplicate only while its |
| 222 | + confirmation window is still open. |
| 223 | +4. Create or update a provisional claim before sending, scoped to the event key, |
| 224 | + content hash, and intended recipients. |
| 225 | +5. Enqueue each broker send through the existing project pacer. The pacer starts |
| 226 | + the `sendMessageAndWaitForInjected()` task and does not await its |
| 227 | + confirmation. |
| 228 | +6. For each started send, attach background handlers: |
| 229 | + - On `delivery_injected`: mark that recipient confirmed. |
| 230 | + - When all recipients are confirmed: commit the claim and extend `expiresAt` |
| 231 | + to the normal replay TTL. |
| 232 | + - On failure or timeout: mark failed and release the provisional claim. |
| 233 | +7. If every recipient fails synchronously before send acceptance, release the |
| 234 | + claim immediately as today. |
| 235 | +8. If some recipients confirm and some fail, release the whole shared logical |
| 236 | + claim rather than committing on first success. A later replay re-sends to all |
| 237 | + recipients, so recipients that already received the steer may see a duplicate. |
| 238 | + That is the accepted duplicate-over-drop bias for #147. True per-recipient |
| 239 | + claims could avoid the duplicate, but they add complexity and are out of |
| 240 | + scope for this pass; the live drop path is single-recipient (`slack-comms`). |
| 241 | +
|
| 242 | +## Timeout And Telemetry |
| 243 | +
|
| 244 | +Default injection-confirmation timeout: 5 seconds for direct `mode: 'steer'` |
| 245 | +agent sends. |
| 246 | +
|
| 247 | +Rationale: relay-worker confirmed `delivery_injected` fires after the PTY worker |
| 248 | +writes bytes, not after the model reads or completes a turn. Five seconds is |
| 249 | +therefore intentionally generous for healthy agents while keeping the |
| 250 | +provisional claim window short enough that replay can recover promptly and |
| 251 | +concurrent duplicate suppression does not linger. |
| 252 | +
|
| 253 | +Add aggregate counters/logs: |
| 254 | +
|
| 255 | +- `deliveryClaimsProvisional` |
| 256 | +- `deliveryClaimsCommitted` |
| 257 | +- `deliveryClaimsReleased` |
| 258 | +- `deliveryInjectedTimeouts` |
| 259 | +- `deliveryInjectedUnsupported` |
| 260 | +
|
| 261 | +Use the existing aggregated warning cadence for failures and include only |
| 262 | +projectId, provider, logical key prefix, recipient count, and reason. Do not log |
| 263 | +message content. |
| 264 | +
|
| 265 | +## Test Plan |
| 266 | +
|
| 267 | +Add integration-event bridge tests on top of the existing #145 suite: |
| 268 | +
|
| 269 | +- Slack unchanged-content replay re-drives after first accepted steer never |
| 270 | + emits `delivery_injected`. |
| 271 | +- Slack unchanged-content replay is suppressed after `delivery_injected` |
| 272 | + commits the content hash. |
| 273 | +- Slack alias copies during an open provisional window are suppressed, but a |
| 274 | + later replay after timeout is delivered. |
| 275 | +- Generic non-Slack event dedupe releases after delivery timeout and re-drives. |
| 276 | +- No-recipient and synchronous send-failure paths still release immediately. |
| 277 | +- Project broker pacing does not wait on delivery confirmation; the existing |
| 278 | + "does not wait on delivery confirmation path" test should become "does not |
| 279 | + wait while tracking injection confirmation". |
| 280 | +- Missing `sendMessageAndWaitForInjected` in tests/mocks is treated as a hard |
| 281 | + harness gap after #147 lands; do not silently fall back to |
| 282 | + `sendMessageAndWaitForDelivery`. |
| 283 | +- Unsupported injection confirmation commits accepted sends immediately and |
| 284 | + emits low-noise telemetry. |
| 285 | +- Project close aborts pending background confirmations, releases provisional |
| 286 | + claims, and does not log delivery-failure warnings for the intentional abort. |
| 287 | +
|
| 288 | +Add broker tests for relay-worker's helper extension: |
| 289 | +
|
| 290 | +- send acceptance returns before `delivery_injected`; |
| 291 | +- confirmation resolves on `delivery_injected` for the matching event id and |
| 292 | + target; |
| 293 | +- confirmation rejects on delivery failure; |
| 294 | +- confirmation rejects on timeout; |
| 295 | +- `delivery_ack` / `delivery_verified` without `delivery_injected` do not |
| 296 | + commit the helper; |
| 297 | +- multi-target sends wait for all reported targets. |
| 298 | +
|
| 299 | +## Rollout |
| 300 | +
|
| 301 | +1. Land #145 and restart Pear on it to restore reliable content reads. |
| 302 | +2. Direct-agent implementation uses `delivery_injected` with a 5-second timeout. |
| 303 | + Channel/unresolved targets remain residual and never commit on `targets=[]`. |
| 304 | + Mixed direct-agent plus channel batches also keep prior semantics for the |
| 305 | + whole batch. |
| 306 | +3. pear-worker adds the additive BrokerManager helper and bridge state machine; |
| 307 | + relay-worker adds helper tests. |
| 308 | +4. Implement #147 as a follow-up stacked from #145/main using that helper as the |
| 309 | + confirmation source. |
| 310 | +5. Keep slack-comms' debug-log/OOB drop guard enabled until #147 is merged and |
| 311 | + the running app is restarted on it. |
| 312 | +
|
| 313 | +Residual risk: if the PTY accepts bytes but the agent process is wedged and |
| 314 | +never semantically processes them, `delivery_injected` still commits. That is an |
| 315 | +agent-liveness/read-receipt problem and belongs to the fuller relay#1056 read |
| 316 | +ack path, not #147. |
0 commit comments