-
Notifications
You must be signed in to change notification settings - Fork 0
DO-NOT-MERGE: Design #147 integration event re-drive #148
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
f6a9d4e
082b549
729b601
fd05051
720a6ff
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,316 @@ | ||
| # Integration Event Re-Drive After Lost Steers | ||
|
|
||
| Issue: #147. Base: #145 (`split/132-slack-integration-event-context-retry`). | ||
| Status: IMPLEMENTATION APPROVED FOR DIRECT-AGENT PATH ONLY. Do not merge until | ||
| #145 has landed and project-lead explicitly clears #147. The broker | ||
| delivery-signal contract was pinned by relay-worker and project-lead on | ||
| 2026-06-07. | ||
|
|
||
| ## Problem | ||
|
|
||
| The integration-event bridge currently treats broker send acceptance as delivery. | ||
| For Slack logical dedupe, `claimSlackLogicalInjection()` records the | ||
| logical/content-hash claim before a steer is known to have reached the target | ||
| agent. For non-Slack paths, `wasRecentlyInjected()` does the same with the path | ||
| dedupe key. | ||
|
|
||
| That is safe for duplicate suppression only when the broker send reliably reaches | ||
| the agent. The confirmed failure mode from 2026-06-07 is different: | ||
|
|
||
| - Relayfile emits the same Slack record several times. | ||
| - The first delivery logs `injecting` for `slack-comms`, so the bridge claims the | ||
| logical key. | ||
| - The steer is lost after broker acceptance and never appears in the agent | ||
| conversation. | ||
| - Later re-deliveries are suppressed for the one-hour Slack replay TTL because | ||
| the content hash is unchanged. | ||
|
|
||
| #145 fixes the separate "content unavailable" path, but it cannot fix a lost | ||
| first steer with unchanged content. The bridge needs delivery confirmation and a | ||
| bounded re-drive window. | ||
|
|
||
| ## Goals | ||
|
|
||
| - Make dedupe claims provisional until delivery is confirmed. | ||
| - Commit a provisional claim only after the broker/harness-driver reports that | ||
| the steer reached the intended recipient. | ||
| - Release a provisional claim on explicit delivery failure or confirmation | ||
| timeout, so a duplicate/replayed event can re-drive the same message. | ||
| - Preserve current pacing: event dispatch must not block the project send queue | ||
| while waiting for delivery confirmation. | ||
| - Keep duplicate hardening by stable event identity first, Slack logical key and | ||
| content hash second. | ||
| - Make telemetry low-noise: count released provisional claims and confirmation | ||
| timeouts without logging every duplicate. | ||
|
|
||
| ## Non-Goals | ||
|
|
||
| - No durable mailbox design in this PR. That belongs to relay#1056. | ||
| - No delivery retry loop inside Pear beyond releasing the dedupe claim. Relayfile | ||
| replay, stream reconnect, or event-feed polling provides the re-drive input. | ||
| - No merge before Khaliq's explicit go. This design stacks on #145 and should | ||
| become a follow-up PR after #145 lands. | ||
| - No channel-target delivery confirmation in #147 unless concrete resolved agent | ||
| targets are available. Empty/unresolved channel targets must never commit a | ||
| provisional claim. | ||
|
|
||
| ## Broker Delivery Contract | ||
|
|
||
| Pear already has a related confirmation helper, | ||
| `BrokerManager.sendMessageAndWaitForDelivery(projectId, input, { timeoutMs })`. | ||
| That helper: | ||
|
|
||
| - sends through the broker; | ||
| - captures the returned `event_id` and targets; | ||
| - subscribes to `session.client.onEvent(...)`; | ||
| - resolves on `delivery_ack`, `delivery_verified`, or | ||
| `message_delivery_confirmed` for that event and target; | ||
| - rejects on `delivery_failed`, `message_delivery_failed`, or its own timeout. | ||
|
|
||
| Project-lead ruled that #147 must NOT commit on that helper's current success | ||
| condition. `delivery_ack` and `delivery_verified` can be satisfied by an echo | ||
| timeout fallback, which is too weak for this bug: it could commit the dedupe | ||
| claim even if the steer never crossed the lost hop. | ||
|
|
||
| Commit signal for #147: `delivery_injected`, filtered by returned | ||
| `event_id` and target. That event is emitted by the PTY worker after it writes | ||
| the injection bytes and carriage return to the PTY. It is the correct boundary | ||
| for mode-B loss: "accepted by broker" is too early, and "read by model" is out | ||
| of scope. | ||
|
|
||
| Relay-worker owns the BrokerManager helper extension. The bridge should depend | ||
| on a sibling helper, not hand-subscribe to broker events: | ||
|
|
||
| ```ts | ||
| sendMessageAndWaitForInjected( | ||
| projectId: string, | ||
| input: SendMessageInput, | ||
| options?: { timeoutMs?: number; signal?: AbortSignal } | ||
| ): Promise<{ eventId: string; targets: string[] }> | ||
| ``` | ||
|
|
||
| Semantics: | ||
|
|
||
| - Internals share/refactor the existing send-plus-event-wait logic from | ||
| `sendMessageAndWaitForDelivery`. | ||
| - Success is only `delivery_injected` matching the returned `event_id` and | ||
| target `name`. | ||
| - `delivery_ack`, `delivery_verified`, and `message_delivery_confirmed` are not | ||
| success for this helper. If they arrive without `delivery_injected`, the | ||
| helper keeps waiting until `delivery_injected` or timeout. | ||
| - Failure stays `delivery_failed` / `message_delivery_failed`, or helper | ||
| timeout. | ||
| - Target handling matches the existing helper: use returned `targets` when | ||
| present; for direct agent sends with no targets, fall back to `[input.to]`. | ||
| - Unsupported brokers may return no usable `event_id`; the bridge treats those as | ||
| accepted-only and commits immediately, preserving current behavior. A direct | ||
| agent send with no returned targets is not unsupported: it falls back to | ||
| `[input.to]` and must still wait for `delivery_injected`. Channel/project | ||
| fanout with no concrete targets is the residual case described below. | ||
| - Before waiting for `delivery_injected`, BrokerManager should detect support | ||
| from broker session metadata, such as an explicit capability flag or the | ||
| protocol/version threshold that introduced `delivery_injected`. If the session | ||
| is known not to support the signal, return an accepted-only result immediately | ||
| and record `deliveryInjectedUnsupported` instead of waiting for timeout. | ||
|
|
||
| The helper may internally reuse the current `sendMessageAndWaitForDelivery` | ||
| mechanics: send once, capture `event_id` / targets, subscribe internally, filter | ||
| events in BrokerManager. It must differ by resolving only on | ||
| `delivery_injected`. | ||
|
|
||
| Calling this helper from the existing project pacer and awaiting it would | ||
| reintroduce the old head-of-line block. The bridge must start the helper as a | ||
| background confirmation task, attach `.then(commit).catch(release)`, and return | ||
| from the pacer callback immediately after starting the send attempt. The pacer | ||
| rate-limits starts, not confirmation completion. | ||
|
Comment on lines
+122
to
+125
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Since these confirmation tasks run asynchronously in the background and can take up to 15 seconds to resolve or timeout, there is a risk of memory leaks, log spam, or null-pointer/property-access errors if the project is closed or reconciled (via |
||
|
|
||
| Background confirmation tasks must be project-scoped and cancellable. The bridge | ||
| should keep an `AbortController` per active confirmation task, pass its | ||
| `signal` to `sendMessageAndWaitForInjected()`, and abort all pending | ||
| confirmations from `close(projectId)` before disposing the pacer. Abort should | ||
| release provisional claims without emitting delivery-failure warnings for a | ||
| project that is intentionally closing. | ||
|
|
||
| Coverage result: relay-worker confirmed `delivery_injected` is emitted for the | ||
| agent transports the integration bridge can inject to. Local PTY workers emit it | ||
| after writing the injection bytes and carriage return. Headless/app-server | ||
| runtimes emit it on successful handler delivery before ack/failure. | ||
| Cloud-attached sandboxes use the same harness-driver event stream through | ||
| `attachCloudSandbox`, so direct sends to cloud-owned agents use the same | ||
| delivery signal once the agent has been observed/listed. | ||
|
|
||
| Timing result: for current integration-event `mode: 'steer'`, `delivery_injected` | ||
| is not gated on the model becoming idle. PTY steer sends ESC ESC, waits briefly, | ||
| writes the injection and CR, then emits `delivery_injected`. It can still wait | ||
| behind the worker's local pending-injection queue or a wedged PTY write, but it | ||
| does not wait for a long model turn to complete. | ||
|
|
||
| Channel guard: the bridge can target channels as well as direct agents. Broker | ||
| channel sends fan out to concrete workers that use the same delivery machinery, | ||
| but today's `/api/send` response does not include the resolved worker names. | ||
| Existing wait helpers therefore see `targets=[]` for `#channel` sends and return | ||
| immediately. For #147, that must NOT count as committed delivery. Engage | ||
| provisional -> commit-on-injected only when there is at least one concrete agent | ||
| target. Empty-target/channel sends keep prior semantics, are logged as residual, | ||
| and must never be falsely committed. Mixed direct-agent plus channel batches use | ||
| the same conservative behavior for the whole batch: no injected-delivery claim | ||
| is committed, so direct-agent recipients in that mixed batch do not get #147 | ||
| re-drive protection. The long-term fix is for the broker send response to | ||
| include resolved targets. | ||
|
|
||
| File ownership for implementation: | ||
|
|
||
| - pear-worker owns the additive BrokerManager helper and the bridge state | ||
| machine. | ||
| - relay-worker owns BrokerManager helper tests. | ||
| - Keep broker.ts changes strictly additive: new helper, new predicate, and | ||
| shared/refactored wait logic only. A shared wait refactor is allowed, but it | ||
| must preserve `sendMessageAndWaitForDelivery()` behavior exactly and keep that | ||
| method's existing tests green. Do not touch the #146 workspace-key/cloud | ||
| regions. | ||
|
|
||
| ## Bridge State Model | ||
|
|
||
| Replace optimistic one-shot claims with a small state machine shared by generic | ||
| path dedupe and Slack logical/content-hash dedupe. | ||
|
|
||
| ```ts | ||
| type DeliveryClaimStatus = 'provisional' | 'committed' | ||
|
|
||
| type DeliveryClaim = { | ||
| status: DeliveryClaimStatus | ||
| expiresAt: number | ||
| confirmationDeadline: number | ||
| recipients: Map<string, 'pending' | 'confirmed' | 'failed'> | ||
| contentHash?: string | ||
| eventIds: Set<string> | ||
| } | ||
| ``` | ||
|
|
||
| For Slack logical keys, the entry is still keyed by logical Slack identity. A | ||
| content-hash bucket carries one claim: | ||
|
|
||
| ```ts | ||
| type SlackLogicalInjectionState = { | ||
| expiresAt: number | ||
| blindClaim?: DeliveryClaim | ||
| contentHashes: Map<string, DeliveryClaim> | ||
| } | ||
| ``` | ||
|
|
||
| Behavior: | ||
|
|
||
| - A committed matching content hash suppresses duplicates until TTL expiry. | ||
| - A provisional matching content hash suppresses only concurrent in-flight alias | ||
| copies while confirmation is still pending. | ||
| - When confirmation times out or fails, the provisional claim is removed. A later | ||
| replay of the same logical key/content hash can inject again. | ||
| - A blind claim can still learn the first later content hash as in #145. When | ||
| the hash is learned, move the existing `blindClaim` into `contentHashes` under | ||
| that hash and clear `blindClaim`; the moved claim stays provisional until | ||
| delivery is confirmed. | ||
| - Genuine edits with new content hashes continue to inject independently. | ||
|
|
||
| For non-Slack integration events, the existing recent-injection map becomes the | ||
| same provisional/committed state keyed by `eventDedupeKeyWithFingerprint()`. | ||
|
|
||
| ## Delivery Flow | ||
|
|
||
| 1. Build the dedupe key and resolve recipients as today. | ||
| 2. If an equivalent committed claim exists, drop as duplicate. | ||
| 3. If an equivalent provisional claim exists, drop as duplicate only while its | ||
| confirmation window is still open. | ||
| 4. Create or update a provisional claim before sending, scoped to the event key, | ||
| content hash, and intended recipients. | ||
| 5. Enqueue each broker send through the existing project pacer. The pacer starts | ||
| the `sendMessageAndWaitForInjected()` task and does not await its | ||
| confirmation. | ||
| 6. For each started send, attach background handlers: | ||
| - On `delivery_injected`: mark that recipient confirmed. | ||
| - When all recipients are confirmed: commit the claim and extend `expiresAt` | ||
| to the normal replay TTL. | ||
| - On failure or timeout: mark failed and release the provisional claim. | ||
| 7. If every recipient fails synchronously before send acceptance, release the | ||
| claim immediately as today. | ||
| 8. If some recipients confirm and some fail, release the whole shared logical | ||
| claim rather than committing on first success. A later replay re-sends to all | ||
| recipients, so recipients that already received the steer may see a duplicate. | ||
| That is the accepted duplicate-over-drop bias for #147. True per-recipient | ||
| claims could avoid the duplicate, but they add complexity and are out of | ||
| scope for this pass; the live drop path is single-recipient (`slack-comms`). | ||
|
|
||
| ## Timeout And Telemetry | ||
|
|
||
| Default injection-confirmation timeout: 5 seconds for direct `mode: 'steer'` | ||
| agent sends. | ||
|
|
||
| Rationale: relay-worker confirmed `delivery_injected` fires after the PTY worker | ||
| writes bytes, not after the model reads or completes a turn. Five seconds is | ||
| therefore intentionally generous for healthy agents while keeping the | ||
| provisional claim window short enough that replay can recover promptly and | ||
| concurrent duplicate suppression does not linger. | ||
|
|
||
| Add aggregate counters/logs: | ||
|
|
||
| - `deliveryClaimsProvisional` | ||
| - `deliveryClaimsCommitted` | ||
| - `deliveryClaimsReleased` | ||
| - `deliveryInjectedTimeouts` | ||
| - `deliveryInjectedUnsupported` | ||
|
|
||
| Use the existing aggregated warning cadence for failures and include only | ||
| projectId, provider, logical key prefix, recipient count, and reason. Do not log | ||
| message content. | ||
|
|
||
| ## Test Plan | ||
|
|
||
| Add integration-event bridge tests on top of the existing #145 suite: | ||
|
|
||
| - Slack unchanged-content replay re-drives after first accepted steer never | ||
| emits `delivery_injected`. | ||
| - Slack unchanged-content replay is suppressed after `delivery_injected` | ||
| commits the content hash. | ||
| - Slack alias copies during an open provisional window are suppressed, but a | ||
| later replay after timeout is delivered. | ||
| - Generic non-Slack event dedupe releases after delivery timeout and re-drives. | ||
| - No-recipient and synchronous send-failure paths still release immediately. | ||
| - Project broker pacing does not wait on delivery confirmation; the existing | ||
| "does not wait on delivery confirmation path" test should become "does not | ||
| wait while tracking injection confirmation". | ||
| - Missing `sendMessageAndWaitForInjected` in tests/mocks is treated as a hard | ||
| harness gap after #147 lands; do not silently fall back to | ||
| `sendMessageAndWaitForDelivery`. | ||
| - Unsupported injection confirmation commits accepted sends immediately and | ||
| emits low-noise telemetry. | ||
| - Project close aborts pending background confirmations, releases provisional | ||
| claims, and does not log delivery-failure warnings for the intentional abort. | ||
|
|
||
| Add broker tests for relay-worker's helper extension: | ||
|
|
||
| - send acceptance returns before `delivery_injected`; | ||
| - confirmation resolves on `delivery_injected` for the matching event id and | ||
| target; | ||
| - confirmation rejects on delivery failure; | ||
| - confirmation rejects on timeout; | ||
| - `delivery_ack` / `delivery_verified` without `delivery_injected` do not | ||
| commit the helper; | ||
| - multi-target sends wait for all reported targets. | ||
|
|
||
| ## Rollout | ||
|
|
||
| 1. Land #145 and restart Pear on it to restore reliable content reads. | ||
| 2. Direct-agent implementation uses `delivery_injected` with a 5-second timeout. | ||
| Channel/unresolved targets remain residual and never commit on `targets=[]`. | ||
| Mixed direct-agent plus channel batches also keep prior semantics for the | ||
| whole batch. | ||
| 3. pear-worker adds the additive BrokerManager helper and bridge state machine; | ||
| relay-worker adds helper tests. | ||
| 4. Implement #147 as a follow-up stacked from #145/main using that helper as the | ||
| confirmation source. | ||
| 5. Keep slack-comms' debug-log/OOB drop guard enabled until #147 is merged and | ||
| the running app is restarted on it. | ||
|
|
||
| Residual risk: if the PTY accepts bytes but the agent process is wedged and | ||
| never semantically processes them, `delivery_injected` still commits. That is an | ||
| agent-liveness/read-receipt problem and belongs to the fuller relay#1056 read | ||
| ack path, not #147. | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Escape line-leading issue references to avoid accidental headings.
#145at line start is parsed as an ATX heading marker in Markdown, which breaks section structure. Escape the hash (or rephrase the sentence) for those references.✏️ Suggested doc fix
As reported by markdownlint MD018 for these exact lines.
Also applies to: 28-28
🧰 Tools
🪛 markdownlint-cli2 (0.22.1)
[warning] 5-5: No space after hash on atx style heading
(MD018, no-missing-space-atx)
🤖 Prompt for AI Agents
Source: Linters/SAST tools