[pull] main from triggerdotdev:main #215
Merged
…workload manager (#3902)

`ComputeWorkloadManager.create` currently swallows gateway errors, so a cold start that fails placement (e.g. a netns slot with a busy tap, a full node disk) silently abandons the dequeued run until the run engine's `PENDING_EXECUTING` heartbeat timeout redrives it via stall detection.

### Changes

- Retry `instances.create` with short backoff (default 3 attempts, 250ms backoff), recording `createAttempts` in the wide event.
- **Only statuses where the create definitely did not commit are retried**: 500 (agent/fcrun create failed) and 503 (no placement). 502/504 are excluded: the gateway emits those when it fails to reach the node or read its response, which can happen *after* the agent committed the create. The gateway only records the instance name on a clean 201, so a same-name retry would miss the collision check and could double-create the VM on another node. Network-level fetch failures are retried (if the gateway processed the create, its name index is populated and the retry 409s harmlessly). Timeouts are not retried.
- **Retry attempts after a 5xx use a deterministic `-rN` name suffix**: a failed create can leave its name registered until async cleanup runs. Attempt 1 keeps the unsuffixed name.
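The retry rules above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `CreateOutcome`, `isRetryable`, `nameForAttempt`, and `createWithRetry` are hypothetical names, and treating `N` in `-rN` as the attempt number is an assumption. Only the status codes, the defaults (3 attempts, 250ms), and the suffix rule come from the description.

```typescript
// Hypothetical model of what a create attempt can return.
type CreateOutcome =
  | { kind: "ok" }
  | { kind: "status"; status: number }
  | { kind: "network-error" }
  | { kind: "timeout" };

/** True only for outcomes the description lists as safe to retry. */
function isRetryable(outcome: CreateOutcome): boolean {
  switch (outcome.kind) {
    case "ok":
      return false;
    case "status":
      // 500 and 503: the create definitely did not commit.
      // 502/504 can arrive after the agent committed, so they are excluded.
      return outcome.status === 500 || outcome.status === 503;
    case "network-error":
      // If the gateway processed the create, a same-name retry 409s harmlessly.
      return true;
    case "timeout":
      return false;
  }
}

/** Attempt 1 keeps the base name; a retry after a 5xx gets `-rN` (N assumed to be the attempt number). */
function nameForAttempt(baseName: string, attempt: number, previous?: CreateOutcome): string {
  const afterServerError = previous?.kind === "status" && previous.status >= 500;
  return attempt > 1 && afterServerError ? `${baseName}-r${attempt}` : baseName;
}

/** Retry loop with a fixed short backoff, reporting how many attempts were made. */
async function createWithRetry(
  baseName: string,
  create: (name: string) => Promise<CreateOutcome>,
  maxAttempts = 3,
  backoffMs = 250
): Promise<{ outcome: CreateOutcome; createAttempts: number }> {
  let previous: CreateOutcome | undefined;
  for (let attempt = 1; ; attempt++) {
    const outcome = await create(nameForAttempt(baseName, attempt, previous));
    if (attempt >= maxAttempts || !isRetryable(outcome)) {
      return { outcome, createAttempts: attempt };
    }
    previous = outcome;
    await new Promise((resolve) => setTimeout(resolve, backoffMs));
  }
}
```

Keeping the unsuffixed name after a network-level failure is what lets the gateway's name index turn a duplicate into a 409, while the suffix after a 5xx sidesteps a name that may still be registered pending cleanup.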
…tes or disconnects (#3894)

The compute suspend flow delays snapshots by `snapshotDelayMs` (~30s) so short-lived waitpoints skip the snapshot entirely, with the intent that a run continuing before the delay expires cancels the pending snapshot. But the only `cancel()` call site was the `/continue` action, which runners only invoke when restoring from an already-taken snapshot, so pending snapshots were never cancelled (zero `snapshot.canceled` events ever emitted in prod). When a run resumed and completed inside the window, the stale snapshot fired ~30s later anyway, pausing the VM 6–13s mid warm-start long-poll; the frozen guest couldn't fire its abort timer or send a FIN, causing stalls and run-engine driven retries.

### Change

- Cancel the pending snapshot on `attempt.complete`: after the platform accepts the completion, before the HTTP reply (so it can't reorder with the runner's next `/suspend`).
- Cancel on `runDisconnected` (crash, exit, or run replaced on the socket).
- Both cancels are guarded by a runnerId match (new `TimerWheel.peek()`): a stale duplicate runner for a reassigned run must not cancel the fresh runner's pending snapshot. A missing runnerId falls through to an unconditional cancel (the pre-existing `/continue` behavior is unchanged).

Waitpoint suspensions keep the runner socket connected and the attempt incomplete, so neither hook touches a snapshot that is still wanted.

Known limitation (fail-safe direction): `socket.data.runnerId` is frozen at the websocket handshake, so after a same-supervisor restore the disconnect-path guard refuses the cancel. The `attempt.complete` path uses the runner's current header id and is unaffected.
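The runnerId guard can be illustrated with a small sketch. The `PendingSnapshots` class below is a hypothetical Map-backed stand-in, since the real `TimerWheel` API is not shown beyond `peek()` and `cancel()`; `cancelIfOwner` is likewise an assumed helper name.

```typescript
// Hypothetical stand-in for the pending-snapshot timer store.
type PendingSnapshot = { runnerId: string };

class PendingSnapshots {
  private readonly pending = new Map<string, PendingSnapshot>();

  schedule(runId: string, snapshot: PendingSnapshot): void {
    this.pending.set(runId, snapshot);
  }

  /** Inspect the pending entry without cancelling it. */
  peek(runId: string): PendingSnapshot | undefined {
    return this.pending.get(runId);
  }

  /** Returns true if an entry existed and was removed. */
  cancel(runId: string): boolean {
    return this.pending.delete(runId);
  }
}

/**
 * Cancel a pending snapshot only if the caller owns it.
 * A missing runnerId falls through to an unconditional cancel.
 */
function cancelIfOwner(store: PendingSnapshots, runId: string, callerRunnerId?: string): boolean {
  if (callerRunnerId === undefined) {
    return store.cancel(runId);
  }
  const entry = store.peek(runId);
  if (entry === undefined || entry.runnerId !== callerRunnerId) {
    // A stale duplicate runner must not cancel the fresh runner's snapshot.
    return false;
  }
  return store.cancel(runId);
}
```

Refusing the cancel on a mismatch is the fail-safe direction: the worst case is a snapshot that still fires, which is the behavior that existed before this change.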
## Summary

HIPAA BAA is offered as a paid add-on on every paid plan. Each paid tier on the in-app pricing card now has a "HIPAA BAA add-on" row with a "Request a BAA" link that opens the existing contact dialog pre-filled with a new `hipaa` inquiry type, prompting the user for their company name and a brief description of the PHI workload.

The contact form's `feedbackTypes` are restructured to match the marketing /contact form: every inquiry type carries a Plain label ID and a "Contact form: ..." thread title, so threads land in Plain identically whether they come from the dashboard or the marketing site.

The included-compute line on each tier also picks up the credits wording from the marketing pricing page, and the Enterprise tier lifts its title above the features row.
…latency (#3907)

## Summary

Three related fixes for `chat.headStart` and continuation boots, found while investigating customer reports.

**1. `chat.headStart` now works with `hydrateMessages`.** The turn-0 handover splice only ran on the default accumulation path, so agents registering `hydrateMessages` silently lost the warm route's step-1 response: pure-text turns fired `onTurnComplete` with no assistant message (and an empty durable write), tool-call turns re-ran step 1 from scratch under a fresh `messageId`, and the head-start user message never reached the hydrate hook at all. The first-turn history now reaches `hydrateMessages` as `incomingMessages`, and the splice runs after both accumulation branches, deduplicated by the handover `messageId`.

**2. Reasoning parts survive the handover.** The synthesized partial only mapped text and tool-call parts, so an extended-thinking model's step-1 reasoning streamed to the browser but never reached durable history. Reasoning parts now map through with provider metadata, so Anthropic thinking signatures survive a UIMessage round trip on hydrate replays.

**3. Continuation boots no longer stall for ~10 seconds.** The `.in` resume cursor was found by draining an SSE subscription that only closes after its 5 second inactivity window, and the scan ran twice per boot. It is now a non-blocking records read of the latest turn-complete header, runs at most once per boot, the boot reads run concurrently, and chat snapshots carry the cursor so subsequent boots skip the scan entirely. Measured locally on a cancel-then-continue repro: pre-turn continuation latency dropped from ~11s to ~0.5s.

Every fix was verified red-green: new unit tests reproduced each failure before the fix, and end-to-end smoke tests against a live local stack covered both handover legs, reasoning persistence with extended thinking (including a follow-up turn that round-trips the persisted signed reasoning back to the provider), and the boot timing comparison.
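As a rough illustration of the first fix, a splice that is deduplicated by message id might look like the sketch below. The `ChatMessage` shape and `spliceHandover` are hypothetical and much simpler than the SDK's real message types; appending at the end of the history is also an assumption.

```typescript
// Hypothetical, simplified message shape for illustration only.
type ChatMessage = { id: string; role: "user" | "assistant"; text: string };

/**
 * Append the handover message unless a message with the same id is already
 * present, so the splice is safe to run after either accumulation branch.
 */
function spliceHandover(history: readonly ChatMessage[], handover: ChatMessage): ChatMessage[] {
  if (history.some((message) => message.id === handover.id)) {
    return [...history];
  }
  return [...history, handover];
}
```

Because the check is keyed on the id, running the splice twice, or after a hydrate hook that already returned the handover message, leaves the history unchanged.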
## Rollout

SDK-only; no server change required. A new SDK against a server that does not serialize record headers degrades to the existing no-cursor fallback. Old SDKs ignore the new snapshot field, and new SDKs fall back to the records scan on snapshots written before it existed.
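The fallback order described here can be sketched as a single resolution step. The `resumeCursor` field name, the `ChatSnapshot` shape, and `resolveResumeCursor` are assumptions for illustration; the real snapshot field is not named in the description.

```typescript
// Hypothetical snapshot shape; older snapshots simply lack the field.
type ChatSnapshot = { resumeCursor?: string };

/**
 * Prefer the cursor carried by the snapshot; otherwise run the records scan once.
 * An undefined result corresponds to the existing no-cursor fallback.
 */
function resolveResumeCursor(
  snapshot: ChatSnapshot | undefined,
  scanRecords: () => string | undefined
): string | undefined {
  return snapshot?.resumeCursor ?? scanRecords();
}
```

The scan callback is only invoked when the snapshot has no cursor, which is how newer snapshots skip the scan while older ones keep working.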
…controls (#3906)

## Summary

The run trace page loader serialized every span's raw OTel events (with full properties) into the response, even though the tree UI only renders the derived `timelineEvents` and the span detail panel refetches what it needs. On event-heavy traces that inflated both the loader payload and the server-side heap copies built per request.

This PR keeps raw span events server-side and pairs that with a few related trace-view improvements:

- A new optional `TRACE_VIEW_EMERGENCY_SPAN_CAP` env var (unset by default) clamps the trace summary and detailed trace summary span limits on both event store paths, including the public run trace endpoint, so operators can bound trace query sizes in one place without retuning the per-store limits.
- The TreeView virtualizer resolved every rendered row with a linear scan over the whole tree (and `getNodeProps` did the same via `findIndex`); rows now resolve through memoized id lookup maps, which matters once traces reach tens of thousands of spans.
- The run stream SSE lookup now applies the same organization membership scoping as the rest of the run page presenters, for consistency.

Behavior is unchanged by default: the trace tree renders from the same `timelineEvents` it always has, and the new cap only takes effect when set.
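Two of the changes above lend themselves to a short sketch: the id lookup map that replaces per-row linear scans, and the clamp applied by the optional cap. All names here (`FlatNode`, `buildIndexById`, `parseSpanCap`, `effectiveSpanLimit`) are hypothetical, and treating an unset or invalid value as "no cap" is an assumption about parsing that the description does not spell out.

```typescript
// Hypothetical flattened tree node; the real TreeView node type is richer.
type FlatNode = { id: string };

/** Build an id -> index map once per tree so each row resolves in O(1). */
function buildIndexById(nodes: readonly FlatNode[]): Map<string, number> {
  const indexById = new Map<string, number>();
  nodes.forEach((node, index) => indexById.set(node.id, index));
  return indexById;
}

/** Parse the optional cap; unset, empty, or invalid input means no cap (assumed). */
function parseSpanCap(raw: string | undefined): number | undefined {
  if (raw === undefined || raw.trim() === "") return undefined;
  const value = Number(raw);
  return Number.isInteger(value) && value > 0 ? value : undefined;
}

/** Clamp a per-store span limit to the cap; unchanged when no cap is set. */
function effectiveSpanLimit(storeLimit: number, cap: number | undefined): number {
  return cap === undefined ? storeLimit : Math.min(storeLimit, cap);
}
```

In a React component the map would typically be rebuilt only when the node list changes (for example inside `useMemo`), which is what makes the lookup "memoized" rather than recomputed per row.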
…sts (#3912)

## Summary

`test/runsRepositoryCursor.test.ts` pinned its fixture runs to `createdAt = 2026-06-04T16:55:07Z`. `listRuns` applies the default 7 day window when no time filter is given, so the fixtures aged out of the window at 16:55 UTC on 2026-06-11 and all five tests started failing for every branch, regardless of what the branch changed. The tests were green on their own CI two days earlier because the fixtures were only five days old at the time.

This switches the fixture base to a relative timestamp (one hour ago), so the fixtures stay inside the default window permanently. Verified the suite goes 5/5 green with this change on the same environment where the pinned dates fail 5/5.
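The failure mode and the fix can be reproduced with plain date arithmetic. This is a sketch only: `inDefaultWindow` and `relativeFixtureBase` are hypothetical helpers, and the inclusive window boundary is an assumption about how `listRuns` compares timestamps.

```typescript
const HOUR_MS = 60 * 60 * 1000;
const DEFAULT_WINDOW_MS = 7 * 24 * HOUR_MS;

/** Fixture base one hour before the time the test runs. */
function relativeFixtureBase(now: Date): Date {
  return new Date(now.getTime() - HOUR_MS);
}

/** True when createdAt falls inside the default 7 day window ending at `now`. */
function inDefaultWindow(createdAt: Date, now: Date): boolean {
  const ageMs = now.getTime() - createdAt.getTime();
  return ageMs >= 0 && ageMs <= DEFAULT_WINDOW_MS;
}

// The pinned fixture date from the description.
const pinnedCreatedAt = new Date("2026-06-04T16:55:07Z");

// About five days old: still inside the window, so the tests pass.
const fiveDaysLater = new Date("2026-06-09T12:00:00Z");

// Just past seven days: the pinned fixture has aged out.
const afterCutoff = new Date("2026-06-11T17:00:00Z");
```

With these values, `inDefaultWindow(pinnedCreatedAt, afterCutoff)` is false while `inDefaultWindow(relativeFixtureBase(afterCutoff), afterCutoff)` stays true no matter when the suite runs.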
## Summary

The environment variables page loaded every variable value in the project, unfiltered by environment. Archiving a preview branch does not delete its environment variable value rows, so projects that churn preview branches accumulate values forever, and every page view loaded all of them. On large projects this made the page loader take many seconds and stalled the server while deserializing the oversized result.

## Fix

The presenter now loads the displayed environments first and filters the `values` relation to those environment IDs. That matches the display semantics exactly (per-user dev environments and active branch environments included), and the lookup is covered by the existing unique index on `(variableId, environmentId)`. Values in archived branch environments are no longer fetched at all.

Covered by a new testcontainers test asserting that values from active environments (including branch environments) are returned while archived branch environments are excluded.
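The filtering rule can be shown with an in-memory sketch. In the presenter the filter is applied in the database query rather than after loading; the types and the `archived` flag below are hypothetical simplifications of the real schema.

```typescript
// Hypothetical, simplified rows for illustration only.
type Environment = { id: string; archived: boolean };
type VariableValue = { variableId: string; environmentId: string; value: string };

/** Step 1: decide which environments the page displays (archived ones are not shown). */
function displayedEnvironmentIds(environments: readonly Environment[]): Set<string> {
  return new Set(environments.filter((env) => !env.archived).map((env) => env.id));
}

/** Step 2: keep only values that belong to a displayed environment. */
function valuesForDisplay(
  values: readonly VariableValue[],
  environmentIds: ReadonlySet<string>
): VariableValue[] {
  return values.filter((value) => environmentIds.has(value.environmentId));
}
```

Resolving the environment IDs first is what lets the value lookup be expressed as a filter on `environmentId`, which the existing `(variableId, environmentId)` index can serve.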
Created by
pull[bot] (v2.0.0-alpha.4)