Dustin4444 · pull · Jun 1, 2026 · Jun 1, 2026 · Jun 1, 2026 · Jun 1, 2026
diff --git a/.changeset/chat-agent-tools.md b/.changeset/chat-agent-tools.md
@@ -0,0 +1,15 @@
+---
+"@trigger.dev/sdk": patch
+---
+
+Add a `tools` option to `chat.agent`. Declaring your tools here threads them into the SDK's internal `convertToModelMessages`, so each tool's `toModelOutput` is re-applied when prior-turn history is re-converted.
+
+```ts
+chat.agent({
+  tools: { readFile, search },
+  run: async ({ messages, tools, signal }) =>
+    streamText({ model, messages, tools, abortSignal: signal }),
+});
+```
+
+Also exports `InferChatUIMessageFromTools<typeof tools>` to derive the chat `UIMessage` type (typed tool parts) directly from a tool set.
diff --git a/.changeset/coerce-concurrency-key-to-string.md b/.changeset/coerce-concurrency-key-to-string.md
@@ -0,0 +1,5 @@
+---
+"@trigger.dev/core": patch
+---
+
+Coerce numeric `concurrencyKey` values to string at the API boundary across `tasks.trigger`, `tasks.batchTrigger`, and the Phase-2 streaming batch endpoint.
diff --git a/.changeset/mollifier-buffer-extensions.md b/.changeset/mollifier-buffer-extensions.md
@@ -0,0 +1,5 @@
+---
+"@trigger.dev/redis-worker": minor
+---
+
+Mollifier buffer extensions: idempotency dedup, an atomic `mutateSnapshot` API, metadata CAS, claim primitives, and a `MollifierSnapshot` type. The buffer's Redis client now reconnects with jittered backoff so a fleet of clients doesn't stampede Redis in lockstep after a blip.
diff --git a/.changeset/mollifier-buffer-pipeline-list-entries.md b/.changeset/mollifier-buffer-pipeline-list-entries.md
@@ -0,0 +1,5 @@
+---
+"@trigger.dev/redis-worker": patch
+---
+
+Pipeline the per-entry `HGETALL` fetches in `MollifierBuffer.listEntriesForEnv`. The previous serial implementation issued one Redis round-trip per runId returned by `LRANGE`, which dominated stale-sweep wall-time at any meaningful backlog (at the sweep's default maxCount=1000, this is ~1000 RTTs per env per pass). Behaviour is unchanged — entries are still skipped when the entry hash has been torn down by a concurrent drainer ack/fail between the LRANGE and the HGETALL.
diff --git a/.changeset/mollifier-drainer-terminal-failure-callback.md b/.changeset/mollifier-drainer-terminal-failure-callback.md
@@ -0,0 +1,5 @@
+---
+"@trigger.dev/redis-worker": minor
+---
+
+Add `onTerminalFailure` callback to `MollifierDrainerOptions` so the customer's run lands a SYSTEM_FAILURE PG row even when the drainer exhausts `maxAttempts` on a retryable PG error. Previously, retryable-error exhaustion called `buffer.fail()` directly, which atomically marks FAILED + DELs the entry hash with no PG write — silent data loss when PG was unreachable across the full retry budget. The callback fires before `buffer.fail()` on any terminal path (`cause: "non-retryable"` or `"max-attempts-exhausted"`); throwing a retryable error from the callback causes the drainer to requeue rather than fail.
diff --git a/.changeset/mollifier-tag-cap.md b/.changeset/mollifier-tag-cap.md
@@ -0,0 +1,5 @@
+---
+"@trigger.dev/redis-worker": patch
+---
+
+Mollifier `mutateSnapshot` now enforces a tag cap: an `append_tags` patch carrying `maxTags` returns `"limit_exceeded"` (writing nothing) when the deduped tag count would exceed the limit, so a buffered run can't accumulate more tags via the tags API than the trigger validator allows at creation.
diff --git a/.server-changes/mollifier-dashboard.md b/.server-changes/mollifier-dashboard.md
@@ -0,0 +1,6 @@
+---
+area: webapp
+type: feature
+---
+
+Mollifier dashboard surface: run-detail page renders buffered runs via synthetic trace, header, and span shapes; admin-only "Buffered" indicator and drainer LOG event in the trace tree.
diff --git a/.server-changes/mollifier-drainer-replay.md b/.server-changes/mollifier-drainer-replay.md
@@ -0,0 +1,6 @@
+---
+area: webapp
+type: feature
+---
+
+Mollifier drainer replay: replay buffered entries into `engine.trigger`, stale-entry sweep, a drainer-health gauge, and run-engine cancelled/failed run APIs. Known limitation: stale-sweep runs per-webapp instance, so stale-entry counter metrics multiply by N webapps in HA until a distributed lease lands as follow-up.
diff --git a/.server-changes/mollifier-mutations.md b/.server-changes/mollifier-mutations.md
@@ -0,0 +1,6 @@
+---
+area: webapp
+type: feature
+---
+
+Mollifier API mutations on buffered runs: tag, metadata, replay, reschedule, cancel, and idempotency-key reset via a buffer-snapshot fallback. When a mutation races a mid-drain run, the wait-and-bounce loop watches the buffer entry in Redis (cheap) and reads the primary exactly once for the actual mutation, instead of polling the writer on a fixed cadence; polls use jittered exponential backoff.
diff --git a/.server-changes/mollifier-reads.md b/.server-changes/mollifier-reads.md
@@ -0,0 +1,6 @@
+---
+area: webapp
+type: feature
+---
+
+Mollifier API read-fallback: serve buffered runs from synthetic run/trace/span data on the retrieve, trace, spans, and events endpoints.
diff --git a/.server-changes/mollifier-trigger.md b/.server-changes/mollifier-trigger.md
@@ -0,0 +1,6 @@
+---
+area: webapp
+type: feature
+---
+
+Mollifier trigger-time decisions: gate `engine.trigger`, mollify bursts into the buffer, claim idempotency keys, and read-fallback for buffered runs.
diff --git a/apps/webapp/app/components/runs/v3/CancelRunDialog.tsx b/apps/webapp/app/components/runs/v3/CancelRunDialog.tsx
@@ -10,9 +10,18 @@ import { SpinnerWhite } from "~/components/primitives/Spinner";
 type CancelRunDialogProps = {
   runFriendlyId: string;
   redirectPath: string;
+  // Fired on submit so the parent can close the Radix Dialog without
+  // wrapping the submit button in `DialogClose` — that wrapper races
+  // submit (close fires first, unmounts the form, and the cancel POST
+  // never lands). Optional so existing call sites still type-check.
+  onCancelSubmitted?: () => void;
 };
 
-export function CancelRunDialog({ runFriendlyId, redirectPath }: CancelRunDialogProps) {
+export function CancelRunDialog({
+  runFriendlyId,
+  redirectPath,
+  onCancelSubmitted,
+}: CancelRunDialogProps) {
   const navigation = useNavigation();
 
   const formAction = `/resources/taskruns/${runFriendlyId}/cancel`;
@@ -27,7 +36,11 @@ export function CancelRunDialog({ runFriendlyId, redirectPath }: CancelRunDialog
         </Paragraph>
         <FormButtons
           confirmButton={
-            <Form action={`/resources/taskruns/${runFriendlyId}/cancel`} method="post">
+            <Form
+              action={`/resources/taskruns/${runFriendlyId}/cancel`}
+              method="post"
+              onSubmit={() => onCancelSubmitted?.()}
+            >
               <Button
                 type="submit"
                 name="redirectUrl"

diff --git a/apps/webapp/app/entry.server.tsx b/apps/webapp/app/entry.server.tsx
@@ -9,6 +9,7 @@ import { renderToPipeableStream } from "react-dom/server";
 import { PassThrough } from "stream";
 import * as Worker from "~/services/worker.server";
 import { initMollifierDrainerWorker } from "~/v3/mollifierDrainerWorker.server";
+import { initMollifierStaleSweepWorker } from "~/v3/mollifierStaleSweepWorker.server";
 import { bootstrap } from "./bootstrap";
 import { LocaleContextProvider } from "./components/primitives/LocaleProvider";
 import {
@@ -228,6 +229,7 @@ Worker.init().catch((error) => {
 });
 
 initMollifierDrainerWorker();
+initMollifierStaleSweepWorker();
 
 bootstrap().catch((error) => {
   logError(error);

diff --git a/apps/webapp/app/env.server.ts b/apps/webapp/app/env.server.ts
@@ -1063,13 +1063,16 @@ const EnvironmentSchema = z
     // Separate switch for the drainer (consumer side) so it can be split
     // off onto a dedicated worker service. Unset → inherits
     // TRIGGER_MOLLIFIER_ENABLED, so single-container self-hosters don't have to
-    // flip two switches. In multi-replica deployments, set this to "0"
-    // explicitly on every replica except the one dedicated drainer
-    // service — otherwise every replica's polling loop races for the
-    // same buffer entries. `TRIGGER_MOLLIFIER_ENABLED` is still the master kill
-    // switch; setting this to "1" while `TRIGGER_MOLLIFIER_ENABLED` is "0" is a
-    // no-op because the gate-side singleton refuses to construct a
-    // buffer when the system is off.
+    // flip two switches. Multi-replica drainers are correct — `popAndMarkDraining`
+    // is an atomic ZPOPMIN + status flip in one Lua call, so only one replica
+    // can win any given entry — but inefficient: polling load (SMEMBERS +
+    // per-env scans) multiplies by N, and `TRIGGER_MOLLIFIER_DRAIN_CONCURRENCY`
+    // is per-process so engine load also multiplies. Splitting the drainer
+    // onto a dedicated worker keeps that traffic off the request-serving
+    // replicas. `TRIGGER_MOLLIFIER_ENABLED` is still the master kill switch;
+    // setting this to "1" while `TRIGGER_MOLLIFIER_ENABLED` is "0" is a
+    // no-op because the gate-side singleton refuses to construct a buffer
+    // when the system is off.
     TRIGGER_MOLLIFIER_DRAINER_ENABLED: z.string().default(process.env.TRIGGER_MOLLIFIER_ENABLED ?? "0"),
     TRIGGER_MOLLIFIER_SHADOW_MODE: z.string().default("0"),
     TRIGGER_MOLLIFIER_REDIS_HOST: z
@@ -1095,10 +1098,35 @@ const EnvironmentSchema = z
     TRIGGER_MOLLIFIER_TRIP_THRESHOLD: z.coerce.number().int().positive().default(100),
     TRIGGER_MOLLIFIER_HOLD_MS: z.coerce.number().int().positive().default(500),
     TRIGGER_MOLLIFIER_DRAIN_CONCURRENCY: z.coerce.number().int().positive().default(50),
-    TRIGGER_MOLLIFIER_ENTRY_TTL_S: z.coerce.number().int().positive().default(600),
     TRIGGER_MOLLIFIER_DRAIN_MAX_ATTEMPTS: z.coerce.number().int().positive().default(3),
     TRIGGER_MOLLIFIER_DRAIN_SHUTDOWN_TIMEOUT_MS: z.coerce.number().int().positive().default(30_000),
     TRIGGER_MOLLIFIER_DRAIN_MAX_ORGS_PER_TICK: z.coerce.number().int().positive().default(500),
+    // Periodic sweep that scans buffer queue LISTs for entries whose
+    // dwell exceeds the stale threshold. Independent of the drainer —
+    // its job is exactly to make a stuck/offline drainer visible to
+    // ops. Defaults: explicitly opt-in (a separate kill switch from
+    // the mollifier itself), run every 5 minutes, alert on anything
+    // that's been dwelling for 5+ minutes (matches the sweep interval
+    // — "anything still here when we check" is the simplest threshold
+    // that converges).
+    //
+    // The sweep was previously defaulting to inherit
+    // `TRIGGER_MOLLIFIER_ENABLED`, which meant any deployment already
+    // running with the mollifier on would auto-start the sweep worker
+    // on upgrade — turning on new background load with no explicit
+    // rollout step. Hard-defaulting to "0" preserves the intent of
+    // exposing the sweep as a separate switch.
+    TRIGGER_MOLLIFIER_STALE_SWEEP_ENABLED: z.string().default("0"),
+    TRIGGER_MOLLIFIER_STALE_SWEEP_INTERVAL_MS: z.coerce
+      .number()
+      .int()
+      .positive()
+      .default(5 * 60_000),
+    TRIGGER_MOLLIFIER_STALE_SWEEP_THRESHOLD_MS: z.coerce
+      .number()
+      .int()
+      .positive()
+      .default(5 * 60_000),
 
     BATCH_TRIGGER_PROCESS_JOB_VISIBILITY_TIMEOUT_MS: z.coerce
       .number()