|
| 1 | +# Integration Event Re-Drive After Lost Steers |
| 2 | + |
| 3 | +Issue: #147. Base: #145 (`split/132-slack-integration-event-context-retry`). |
| 4 | +Status: DESIGN ONLY. Do not implement or merge until #145 has landed. The |
| 5 | +broker delivery-signal contract was pinned by relay-worker and project-lead on |
| 6 | +2026-06-07. |
| 7 | + |
| 8 | +## Problem |
| 9 | + |
| 10 | +The integration-event bridge currently treats broker send acceptance as delivery. |
| 11 | +For Slack logical dedupe, `claimSlackLogicalInjection()` records the |
| 12 | +logical/content-hash claim before a steer is known to have reached the target |
| 13 | +agent. For non-Slack paths, `wasRecentlyInjected()` does the same with the path |
| 14 | +dedupe key. |
| 15 | + |
| 16 | +That is safe for duplicate suppression only when the broker send reliably reaches |
| 17 | +the agent. The confirmed failure mode from 2026-06-07 is different: |
| 18 | + |
| 19 | +- Relayfile emits the same Slack record several times. |
| 20 | +- The first delivery logs `injecting` for `slack-comms`, so the bridge claims the |
| 21 | + logical key. |
| 22 | +- The steer is lost after broker acceptance and never appears in the agent |
| 23 | + conversation. |
| 24 | +- Later re-deliveries are suppressed for the one-hour Slack replay TTL because |
| 25 | + the content hash is unchanged. |
| 26 | + |
| 27 | +#145 fixes the separate "content unavailable" path, but it cannot fix a lost |
| 28 | +first steer with unchanged content. The bridge needs delivery confirmation and a |
| 29 | +bounded re-drive window. |
| 30 | + |
| 31 | +## Goals |
| 32 | + |
| 33 | +- Make dedupe claims provisional until delivery is confirmed. |
| 34 | +- Commit a provisional claim only after the broker/harness-driver reports that |
| 35 | + the steer reached the intended recipient. |
| 36 | +- Release a provisional claim on explicit delivery failure or confirmation |
| 37 | + timeout, so a duplicate/replayed event can re-drive the same message. |
| 38 | +- Preserve current pacing: event dispatch must not block the project send queue |
| 39 | + while waiting for delivery confirmation. |
| 40 | +- Keep duplicate hardening by stable event identity first, Slack logical key and |
| 41 | + content hash second. |
| 42 | +- Make telemetry low-noise: count released provisional claims and confirmation |
| 43 | + timeouts without logging every duplicate. |
| 44 | + |
| 45 | +## Non-Goals |
| 46 | + |
| 47 | +- No durable mailbox design in this PR. That belongs to relay#1056. |
| 48 | +- No delivery retry loop inside Pear beyond releasing the dedupe claim. Relayfile |
| 49 | + replay, stream reconnect, or event-feed polling provides the re-drive input. |
| 50 | +- No merge before Khaliq's explicit go. This design stacks on #145 and should |
| 51 | + become a follow-up PR after #145 lands. |
| 52 | +- No implementation until relay-worker confirms `delivery_injected` coverage for |
| 53 | + every agent transport the integration bridge can target. |
| 54 | + |
| 55 | +## Broker Delivery Contract |
| 56 | + |
| 57 | +Pear already has a related confirmation helper, |
| 58 | +`BrokerManager.sendMessageAndWaitForDelivery(projectId, input, { timeoutMs })`. |
| 59 | +That helper: |
| 60 | + |
| 61 | +- sends through the broker; |
| 62 | +- captures the returned `event_id` and targets; |
| 63 | +- subscribes to `session.client.onEvent(...)`; |
| 64 | +- resolves on `delivery_ack`, `delivery_verified`, or |
| 65 | + `message_delivery_confirmed` for that event and target; |
| 66 | +- rejects on `delivery_failed`, `message_delivery_failed`, or its own timeout. |
| 67 | + |
| 68 | +Project-lead ruled that #147 must NOT commit on that helper's current success |
| 69 | +condition. `delivery_ack` and `delivery_verified` can be satisfied by an echo |
| 70 | +timeout fallback, which is too weak for this bug: it could commit the dedupe |
| 71 | +claim even if the steer never crossed the lost hop. |
| 72 | + |
| 73 | +Commit signal for #147: `delivery_injected`, filtered by returned |
| 74 | +`event_id` and target. That event is emitted by the PTY worker after it writes |
| 75 | +the injection bytes and carriage return to the PTY. It is the correct boundary |
| 76 | +for mode-B loss: "accepted by broker" is too early, and "read by model" is out |
| 77 | +of scope. |
| 78 | + |
| 79 | +Relay-worker owns the BrokerManager helper extension. The bridge should depend |
| 80 | +on a sibling helper, not hand-subscribe to broker events: |
| 81 | + |
| 82 | +```ts |
| 83 | +sendMessageAndWaitForInjected( |
| 84 | + projectId: string, |
| 85 | + input: SendMessageInput, |
| 86 | + options?: { timeoutMs?: number } |
| 87 | +): Promise<{ eventId: string; targets: string[] }> |
| 88 | +``` |
| 89 | + |
| 90 | +Semantics: |
| 91 | + |
| 92 | +- Internals share/refactor the existing send-plus-event-wait logic from |
| 93 | + `sendMessageAndWaitForDelivery`. |
| 94 | +- Success is only `delivery_injected` matching the returned `event_id` and |
| 95 | + target `name`. |
| 96 | +- `delivery_ack`, `delivery_verified`, and `message_delivery_confirmed` are not |
| 97 | + success for this helper. If they arrive without `delivery_injected`, the |
| 98 | + helper keeps waiting until `delivery_injected` or timeout. |
| 99 | +- Failure stays `delivery_failed` / `message_delivery_failed`, or helper |
| 100 | + timeout. |
| 101 | +- Target handling matches the existing helper: use returned `targets` when |
| 102 | + present; for direct agent sends with no targets, fall back to `[input.to]`. |
| 103 | +- Unsupported brokers may return no `eventId` or no targets; the bridge treats |
| 104 | + those as accepted-only and commits immediately, preserving current behavior. |
| 105 | + |
| 106 | +The helper may internally reuse the current `sendMessageAndWaitForDelivery` |
| 107 | +mechanics: send once, capture `event_id` / targets, subscribe internally, filter |
| 108 | +events in BrokerManager. It must differ by resolving only on |
| 109 | +`delivery_injected`. |
| 110 | + |
| 111 | +Calling this helper from the existing project pacer and awaiting it would |
| 112 | +reintroduce the old head-of-line block. The bridge must start the helper as a |
| 113 | +background confirmation task, attach `.then(commit).catch(release)`, and return |
| 114 | +from the pacer callback immediately after starting the send attempt. The pacer |
| 115 | +rate-limits starts, not confirmation completion. |
| 116 | + |
| 117 | +Coverage gate before implementation: confirm `delivery_injected` is emitted for |
| 118 | +every transport the integration bridge can inject to, including local PTY, |
| 119 | +cloud, and remote agents. If any transport lacks `delivery_injected`, define |
| 120 | +that transport's success signal first; otherwise the bridge would release on |
| 121 | +every timeout and re-drive duplicates indefinitely. |
| 122 | + |
| 123 | +File ownership for implementation: |
| 124 | + |
| 125 | +- pear-worker owns the additive BrokerManager helper and the bridge state |
| 126 | + machine. |
| 127 | +- relay-worker owns BrokerManager helper tests. |
| 128 | +- Keep broker.ts changes strictly additive: new helper, new predicate, and |
| 129 | + shared/refactored wait logic only. Do not touch the #146 workspace-key/cloud |
| 130 | + regions. |
| 131 | + |
| 132 | +## Bridge State Model |
| 133 | + |
| 134 | +Replace optimistic one-shot claims with a small state machine shared by generic |
| 135 | +path dedupe and Slack logical/content-hash dedupe. |
| 136 | + |
| 137 | +```ts |
| 138 | +type DeliveryClaimStatus = 'provisional' | 'committed' |
| 139 | + |
| 140 | +type DeliveryClaim = { |
| 141 | + status: DeliveryClaimStatus |
| 142 | + expiresAt: number |
| 143 | + confirmationDeadline: number |
| 144 | + recipients: Map<string, 'pending' | 'confirmed' | 'failed'> |
| 145 | + contentHash?: string |
| 146 | + eventIds: Set<string> |
| 147 | +} |
| 148 | +``` |
| 149 | +
|
| 150 | +For Slack logical keys, the entry is still keyed by logical Slack identity. A |
| 151 | +content-hash bucket carries one claim: |
| 152 | +
|
| 153 | +```ts |
| 154 | +type SlackLogicalInjectionState = { |
| 155 | + expiresAt: number |
| 156 | + blindClaim?: DeliveryClaim |
| 157 | + contentHashes: Map<string, DeliveryClaim> |
| 158 | +} |
| 159 | +``` |
| 160 | +
|
| 161 | +Behavior: |
| 162 | +
|
| 163 | +- A committed matching content hash suppresses duplicates until TTL expiry. |
| 164 | +- A provisional matching content hash suppresses only concurrent in-flight alias |
| 165 | + copies while confirmation is still pending. |
| 166 | +- When confirmation times out or fails, the provisional claim is removed. A later |
| 167 | + replay of the same logical key/content hash can inject again. |
| 168 | +- A blind claim can still learn the first later content hash as in #145, but that |
| 169 | + hash stays provisional until delivery is confirmed. |
| 170 | +- Genuine edits with new content hashes continue to inject independently. |
| 171 | +
|
| 172 | +For non-Slack integration events, the existing recent-injection map becomes the |
| 173 | +same provisional/committed state keyed by `eventDedupeKeyWithFingerprint()`. |
| 174 | +
|
| 175 | +## Delivery Flow |
| 176 | +
|
| 177 | +1. Build the dedupe key and resolve recipients as today. |
| 178 | +2. If an equivalent committed claim exists, drop as duplicate. |
| 179 | +3. If an equivalent provisional claim exists, drop as duplicate only while its |
| 180 | + confirmation window is still open. |
| 181 | +4. Create or update a provisional claim before sending, scoped to the event key, |
| 182 | + content hash, and intended recipients. |
| 183 | +5. Enqueue each broker send through the existing project pacer. The pacer starts |
| 184 | + the `sendMessageAndWaitForInjected()` task and does not await its |
| 185 | + confirmation. |
| 186 | +6. For each started send, attach background handlers: |
| 187 | + - On `delivery_injected`: mark that recipient confirmed. |
| 188 | + - When all recipients are confirmed: commit the claim and extend `expiresAt` |
| 189 | + to the normal replay TTL. |
| 190 | + - On failure or timeout: mark failed and release the provisional claim if no |
| 191 | + recipient confirmed. |
| 192 | +7. If every recipient fails synchronously before send acceptance, release the |
| 193 | + claim immediately as today. |
| 194 | +8. If some recipients confirm and some fail, commit for confirmed recipients and |
| 195 | + log/telemetry the partial failure. A later replay should target only missing |
| 196 | + recipients if recipient-scoped tracking is implemented in the same pass; |
| 197 | + otherwise release the whole claim to prefer duplicate delivery over message |
| 198 | + loss. |
| 199 | +
|
| 200 | +The first implementation should prefer whole-claim release on partial failure. |
| 201 | +It may duplicate a message for a recipient that already got it, but it avoids the |
| 202 | +known worse behavior of losing human instructions. |
| 203 | +
|
| 204 | +## Timeout And Telemetry |
| 205 | +
|
| 206 | +Proposed default injection-confirmation timeout: 5 seconds. |
| 207 | +
|
| 208 | +Rationale: `delivery_injected` is emitted at the PTY write boundary, so healthy |
| 209 | +agents should confirm quickly even if they are momentarily busy. Five seconds |
| 210 | +tolerates short event-loop or PTY scheduling delays while keeping the |
| 211 | +provisional claim window short enough that replay can recover promptly and |
| 212 | +concurrent duplicate suppression does not linger. |
| 213 | +
|
| 214 | +Add aggregate counters/logs: |
| 215 | +
|
| 216 | +- `deliveryClaimsProvisional` |
| 217 | +- `deliveryClaimsCommitted` |
| 218 | +- `deliveryClaimsReleased` |
| 219 | +- `deliveryInjectedTimeouts` |
| 220 | +- `deliveryInjectedUnsupported` |
| 221 | +
|
| 222 | +Use the existing aggregated warning cadence for failures and include only |
| 223 | +projectId, provider, logical key prefix, recipient count, and reason. Do not log |
| 224 | +message content. |
| 225 | +
|
| 226 | +## Test Plan |
| 227 | +
|
| 228 | +Add integration-event bridge tests on top of the existing #145 suite: |
| 229 | +
|
| 230 | +- Slack unchanged-content replay re-drives after first accepted steer never |
| 231 | + emits `delivery_injected`. |
| 232 | +- Slack unchanged-content replay is suppressed after `delivery_injected` |
| 233 | + commits the content hash. |
| 234 | +- Slack alias copies during an open provisional window are suppressed, but a |
| 235 | + later replay after timeout is delivered. |
| 236 | +- Generic non-Slack event dedupe releases after delivery timeout and re-drives. |
| 237 | +- No-recipient and synchronous send-failure paths still release immediately. |
| 238 | +- Project broker pacing does not wait on delivery confirmation; the existing |
| 239 | + "does not wait on delivery confirmation path" test should become "does not |
| 240 | + wait while tracking injection confirmation". |
| 241 | +- Missing `sendMessageAndWaitForInjected` in tests/mocks is treated as a hard |
| 242 | + harness gap after #147 lands; do not silently fall back to |
| 243 | + `sendMessageAndWaitForDelivery`. |
| 244 | +- Unsupported injection confirmation commits accepted sends immediately and |
| 245 | + emits low-noise telemetry. |
| 246 | +
|
| 247 | +Add broker tests for relay-worker's helper extension: |
| 248 | +
|
| 249 | +- send acceptance returns before `delivery_injected`; |
| 250 | +- confirmation resolves on `delivery_injected` for the matching event id and |
| 251 | + target; |
| 252 | +- confirmation rejects on delivery failure; |
| 253 | +- confirmation rejects on timeout; |
| 254 | +- `delivery_ack` / `delivery_verified` without `delivery_injected` do not |
| 255 | + commit the helper; |
| 256 | +- multi-target sends wait for all reported targets. |
| 257 | +
|
| 258 | +## Rollout |
| 259 | +
|
| 260 | +1. Land #145 and restart Pear on it to restore reliable content reads. |
| 261 | +2. Relay-worker confirms `delivery_injected` coverage across bridge target |
| 262 | + transports. |
| 263 | +3. pear-worker adds the additive BrokerManager helper and bridge state machine; |
| 264 | + relay-worker adds helper tests. |
| 265 | +4. Implement #147 as a follow-up stacked from #145/main using that helper as the |
| 266 | + confirmation source. |
| 267 | +5. Keep slack-comms' debug-log/OOB drop guard enabled until #147 is merged and |
| 268 | + the running app is restarted on it. |
| 269 | +
|
| 270 | +Residual risk: if the PTY accepts bytes but the agent process is wedged and |
| 271 | +never semantically processes them, `delivery_injected` still commits. That is an |
| 272 | +agent-liveness/read-receipt problem and belongs to the fuller relay#1056 read |
| 273 | +ack path, not #147. |
0 commit comments