fix(webapp): mollifier mutation routes — log silent buffer outcomes + writer fallback in degraded mode

d-cs · claude · d-cs · commit c1c4d6a9cc6c · 2026-06-01T15:45:12.000+01:00
Three CodeRabbit findings on PR #3756, bundled because they share the same regression class: post-PR the mutation routes read from the replica for offload, but several non-happy-path branches lost the writer-side safety net the pre-PR routes had (or never logged non-throw failures from helper outcomes).

1. **`api.v1.runs.$runId.metadata.ts`: silent failure on non-throw buffer outcomes.** The parent/root fan-out helper wrapped `applyMetadataMutationToBufferedRun` in `tryCatch` and only inspected the thrown error. The helper reports non-throw failures via outcome `kind` (`not_found`, `busy`, `version_exhausted`, `metadata_too_large`); those silently disappeared. Now warn-log each non-success kind so ops can trace where a parent/root op went. Best-effort behaviour is preserved: these outcomes still don't bubble to the customer response. Helper exported for unit-test reach.

2. **`mutateWithFallback.server.ts`: `!buffer` short-circuit returned a false 404.** Pre-PR mutation routes read from the writer directly, so a fresh PG row was always visible regardless of replication lag. Post-PR the replica read became the primary lookup; if the buffer isn't available (mollifier disabled, boot-time init error), the helper returned `not_found` without probing the writer, regressing mutation behaviour in mollifier-disabled mode. Mirror the writer-disambiguation block already used in the buffer-says-not-found branch.

3. **`resolveRunForMutation.server.ts`: pre-handler resolver did the same.** It returned null if both replica and buffer missed; the route builder converts null to a hard 404 BEFORE the action handler runs, so the downstream `mutateWithFallback` writer recovery can never fire. Add a final writer probe before returning null, so replica-lag and degraded-buffer states are still served.

Tests:
- `metadataRouteOperationsLogging.test.ts` (new): 7 test cases: 4× non-success kind logs the warn, 1× happy path stays silent, 1× thrown-error branch unaffected, 1× missing-args short-circuit.
- `mollifierMutateWithFallback.test.ts`: +2 tests for the `!buffer + writer hit/miss` paths.
- `mollifierResolveRunForMutation.test.ts`: +4 tests for the writer fallback paths (replica+buffer miss → writer hit, `!buffer` + writer hit, all-miss legitimate 404, replica-hit short-circuits writer).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
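
For reviewers, the logging change in finding 1 can be pictured with the minimal sketch below. It is an illustration only: `applyBestEffort`, the `BufferOutcome` type, the log messages, and the local `tryCatch` are hypothetical stand-ins written for this note, assuming a `tryCatch` that resolves to an `[error, result]` tuple as the route code's destructuring suggests.

```typescript
// Hypothetical stand-in for the codebase's `tryCatch`: resolves to
// [error, null] on rejection and [null, result] on success.
async function tryCatch<T>(promise: Promise<T>): Promise<[unknown, null] | [null, T]> {
  try {
    return [null, await promise];
  } catch (error) {
    return [error, null];
  }
}

// Simplified outcome union; only `kind` matters for this sketch.
type BufferOutcome =
  | { kind: "applied" }
  | { kind: "not_found" | "busy" | "version_exhausted" | "metadata_too_large" };

type Warn = (message: string, fields: Record<string, unknown>) => void;

// Best-effort buffer op: never throws to the caller, but logs both the
// thrown-error branch and every non-success outcome kind.
async function applyBestEffort(
  apply: () => Promise<BufferOutcome>,
  warn: Warn
): Promise<void> {
  const [error, outcome] = await tryCatch(apply());
  if (error) {
    warn("buffer op threw", {
      error: error instanceof Error ? error.message : String(error),
    });
    return; // return early so the outcome branch cannot also fire
  }
  // A clean return is not the same as success: inspect `kind`.
  if (outcome && outcome.kind !== "applied") {
    warn("buffer op did not apply", { kind: outcome.kind });
  }
}
```

The early `return` after the thrown-error log is what keeps the two warn branches mutually exclusive, which is the property the thrown-error test case checks.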
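
The lookup order that findings 2 and 3 converge on (replica, then buffer if configured, then a final writer probe) can be sketched as follows. This is a simplified illustration, not the code in the diff: `resolveWithWriterFallback`, `Lookup`, and `Run` are hypothetical names, and the real resolver talks to Prisma clients and a `MollifierBuffer` rather than plain functions.

```typescript
// Hypothetical, simplified shapes for illustration.
type Run = { friendlyId: string; envId: string };
type Lookup = (friendlyId: string, envId: string) => Promise<Run | null>;
type Resolved = { source: "pg" | "buffer"; friendlyId: string } | null;

async function resolveWithWriterFallback(
  friendlyId: string,
  envId: string,
  deps: { replica: Lookup; buffer: Lookup | null; writer: Lookup }
): Promise<Resolved> {
  // 1. Offloaded read: the replica usually has the row already.
  const fromReplica = await deps.replica(friendlyId, envId);
  if (fromReplica) return { source: "pg", friendlyId: fromReplica.friendlyId };

  // 2. Buffer lookup, skipped when no buffer is configured
  //    (mollifier disabled or boot-time init error).
  if (deps.buffer) {
    const entry = await deps.buffer(friendlyId, envId);
    if (entry) return { source: "buffer", friendlyId };
  }

  // 3. Final writer probe: covers replica lag and degraded-buffer states
  //    before a miss is reported as "not found".
  const fromWriter = await deps.writer(friendlyId, envId);
  if (fromWriter) return { source: "pg", friendlyId: fromWriter.friendlyId };

  return null;
}
```

The writer is consulted only after both cheaper lookups miss, so the common replica-hit path keeps the offload benefit and the extra writer query is paid only on misses.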
diff --git a/apps/webapp/app/routes/api.v1.runs.$runId.metadata.ts b/apps/webapp/app/routes/api.v1.runs.$runId.metadata.ts
@@ -71,7 +71,12 @@ export async function loader({ request, params }: LoaderFunctionArgs) {
 // `_ingestion_only` flag: a synthetic body that has the operations
 // promoted to top-level `operations` so the service applies them to
 // `targetRunId` directly.
-async function routeOperationsToRun(
+// Exported so the silent-failure logging behaviour can be unit-tested.
+// The route handler itself isn't an attractive test target (createActionApiRoute
+// wraps it in auth + body parsing + error-handler middleware), but the
+// fan-out helper carries the load-bearing logic — including the ops-
+// visibility branch this change adds.
+export async function routeOperationsToRun(
   targetRunId: string | undefined,
   operations: RunMetadataChangeOperation[] | undefined,
   env: AuthenticatedEnvironment
@@ -118,7 +123,7 @@ async function routeOperationsToRun(
   // Best-effort buffer fallback. Wrap so a transient Redis throw on
   // this auxiliary op can't 500 the request after the primary mutation
   // already succeeded.
-  const [bufferError] = await tryCatch(
+  const [bufferError, bufferOutcome] = await tryCatch(
     applyMetadataMutationToBufferedRun({
       runId: targetRunId,
       environmentId: env.id,
@@ -132,6 +137,22 @@ async function routeOperationsToRun(
       targetRunId,
       error: bufferError instanceof Error ? bufferError.message : String(bufferError),
     });
+    return;
+  }
+  // `applyMetadataMutationToBufferedRun` reports non-throw failures via
+  // its returned outcome kind: `not_found`, `busy`, `version_exhausted`,
+  // `metadata_too_large`. Without inspecting `.kind`, the parent/root
+  // operation can silently disappear: the PG path above did not land it,
+  // and the buffer rejected it for one of these reasons while the
+  // helper returned cleanly. Surface a warn log per non-success branch
+  // so ops can trace why a parent/root op went missing. The customer's
+  // primary mutation has already succeeded by this point; this remains
+  // best-effort, so we still don't bubble these to the response.
+  if (bufferOutcome && bufferOutcome.kind !== "applied") {
+    logger.warn("metadata route: parent/root buffer op did not apply", {
+      targetRunId,
+      kind: bufferOutcome.kind,
+    });
   }
 }
 
diff --git a/apps/webapp/app/v3/mollifier/mutateWithFallback.server.ts b/apps/webapp/app/v3/mollifier/mutateWithFallback.server.ts
@@ -91,8 +91,19 @@ export async function mutateWithFallback<TResponse>(
   }
 
   if (!buffer) {
-    // No buffer configured (mollifier disabled or boot-time error). PG
-    // missed; nothing else to consult.
+    // No buffer configured (mollifier disabled or boot-time error). The
+    // pre-PR mutation routes read from the writer directly, so a freshly-
+    // created PG row was always visible regardless of replication lag.
+    // Now that the read moved to the replica (line 87) for the offload,
+    // a `!buffer` short-circuit would regress: a real PG row + replica
+    // lag would return 404. Mirror the writer-disambiguation block below
+    // (line 148, the buffer-says-not-found path) so degraded mode
+    // (mollifier disabled) still matches pre-PR mutation behaviour.
+    const writerRow = await findRunInPg(writer, input.runId, input.environmentId);
+    if (writerRow) {
+      const response = await input.pgMutation(writerRow);
+      return { kind: "pg", response };
+    }
     return { kind: "not_found" };
   }
 
diff --git a/apps/webapp/app/v3/mollifier/resolveRunForMutation.server.ts b/apps/webapp/app/v3/mollifier/resolveRunForMutation.server.ts
@@ -1,5 +1,5 @@
 import type { MollifierBuffer } from "@trigger.dev/redis-worker";
-import { $replica as defaultReplica } from "~/db.server";
+import { $replica as defaultReplica, prisma as defaultWriter } from "~/db.server";
 import { getMollifierBuffer as defaultGetBuffer } from "./mollifierBuffer.server";
 
 // Discriminated-union resolver used by mutation routes' `findResource`.
@@ -16,15 +16,18 @@ export type ResolvedRunForMutation =
   | { source: "pg"; friendlyId: string }
   | { source: "buffer"; friendlyId: string };
 
-export type ResolveRunForMutationDeps = {
-  prismaReplica?: {
-    taskRun: {
-      findFirst(args: {
-        where: { friendlyId: string; runtimeEnvironmentId: string };
-        select: { friendlyId: true };
-      }): Promise<{ friendlyId: string } | null>;
-    };
+type PrismaTaskRunFindFirst = {
+  taskRun: {
+    findFirst(args: {
+      where: { friendlyId: string; runtimeEnvironmentId: string };
+      select: { friendlyId: true };
+    }): Promise<{ friendlyId: string } | null>;
   };
+};
+
+export type ResolveRunForMutationDeps = {
+  prismaReplica?: PrismaTaskRunFindFirst;
+  prismaWriter?: PrismaTaskRunFindFirst;
   getBuffer?: () => MollifierBuffer | null;
 };
 
@@ -35,6 +38,7 @@ export async function resolveRunForMutation(input: {
   deps?: ResolveRunForMutationDeps;
 }): Promise<ResolvedRunForMutation | null> {
   const replica = input.deps?.prismaReplica ?? defaultReplica;
+  const writer = input.deps?.prismaWriter ?? defaultWriter;
   const getBuffer = input.deps?.getBuffer ?? defaultGetBuffer;
 
   const pgRun = await replica.taskRun.findFirst({
@@ -44,15 +48,35 @@ export async function resolveRunForMutation(input: {
   if (pgRun) return { source: "pg", friendlyId: pgRun.friendlyId };
 
   const buffer = getBuffer();
-  if (!buffer) return null;
-
-  const entry = await buffer.getEntry(input.runParam);
-  if (
-    entry &&
-    entry.envId === input.environmentId &&
-    entry.orgId === input.organizationId
-  ) {
-    return { source: "buffer", friendlyId: input.runParam };
+
+  if (buffer) {
+    const entry = await buffer.getEntry(input.runParam);
+    if (
+      entry &&
+      entry.envId === input.environmentId &&
+      entry.orgId === input.organizationId
+    ) {
+      return { source: "buffer", friendlyId: input.runParam };
+    }
   }
+
+  // Replica + buffer both missed. Before declaring "not found" (which the
+  // route builder converts to a hard 404 *before* the action handler runs,
+  // so the downstream `mutateWithFallback` writer-recovery never gets a
+  // chance to fire), do one final probe against the writer. This catches
+  // two cases:
+  //   1. Replica lag on a freshly-created PG row.
+  //   2. A buffered run that materialised in the window between the
+  //      replica read and our buffer check (the entry was ack'd and the
+  //      hash is mid-grace-TTL but our getEntry returned null due to
+  //      lookup-by-friendlyId timing).
+  // Without this, the resolver returns null in degraded states that the
+  // downstream mutateWithFallback flow would otherwise handle correctly.
+  const writerRun = await writer.taskRun.findFirst({
+    where: { friendlyId: input.runParam, runtimeEnvironmentId: input.environmentId },
+    select: { friendlyId: true },
+  });
+  if (writerRun) return { source: "pg", friendlyId: writerRun.friendlyId };
+
   return null;
 }
diff --git a/apps/webapp/test/metadataRouteOperationsLogging.test.ts b/apps/webapp/test/metadataRouteOperationsLogging.test.ts
@@ -0,0 +1,132 @@
+import { describe, expect, it, vi } from "vitest";
+
+// `vi.mock` factories are hoisted above regular top-level `const`s, so
+// any cross-references between the spy/mock fns and the factories have
+// to live inside `vi.hoisted`. See `mollifierDrainerHandler.test.ts`
+// for the same pattern.
+const { warnSpy, applyMetadataMutationToBufferedRunMock } = vi.hoisted(() => ({
+  warnSpy: vi.fn(),
+  applyMetadataMutationToBufferedRunMock: vi.fn(),
+}));
+
+// The route module's import graph (createActionApiRoute, the env, the
+// services singleton) is heavier than the helper actually needs. Stub
+// the leaf modules so only the helper under test executes; the route's
+// top-level `createActionApiRoute(...)` call runs against the stubbed
+// builder and never touches platform.v3.server / prisma.
+vi.mock("~/db.server", () => ({ prisma: {}, $replica: {} }));
+vi.mock("~/env.server", () => ({
+  env: { TASK_RUN_METADATA_MAXIMUM_SIZE: 256 * 1024 },
+}));
+vi.mock("~/services/routeBuilders/apiBuilder.server", () => ({
+  createActionApiRoute: () => ({ action: vi.fn() }),
+}));
+vi.mock("~/services/apiAuth.server", () => ({
+  authenticateApiRequest: vi.fn(),
+}));
+vi.mock("~/v3/services/common.server", () => ({
+  ServiceValidationError: class extends Error {
+    constructor(public override message: string, public status?: number) {
+      super(message);
+    }
+  },
+}));
+vi.mock("~/services/metadata/updateMetadataInstance.server", () => ({
+  updateMetadataService: { call: vi.fn(async () => undefined) },
+}));
+vi.mock("~/v3/mollifier/applyMetadataMutation.server", () => ({
+  applyMetadataMutationToBufferedRun: applyMetadataMutationToBufferedRunMock,
+}));
+vi.mock("~/v3/mollifier/readFallback.server", () => ({
+  findRunByIdWithMollifierFallback: vi.fn(),
+}));
+vi.mock("~/services/logger.server", () => ({
+  logger: {
+    warn: warnSpy,
+    info: vi.fn(),
+    error: vi.fn(),
+    debug: vi.fn(),
+  },
+}));
+
+import { routeOperationsToRun } from "~/routes/api.v1.runs.$runId.metadata";
+import type { AuthenticatedEnvironment } from "~/services/apiAuth.server";
+
+const env = {
+  id: "env_a",
+  organizationId: "org_1",
+} as unknown as AuthenticatedEnvironment;
+
+const opsFixture = [{ type: "set", key: "k", value: "v" }] as Parameters<
+  typeof routeOperationsToRun
+>[1];
+
+describe("routeOperationsToRun — non-throw buffer outcome logging", () => {
+  // Each non-success outcome `applyMetadataMutationToBufferedRun` can
+  // return (`not_found`, `busy`, `version_exhausted`, `metadata_too_large`)
+  // must produce a warn log so ops can trace silent drops. Without this
+  // branch the parent/root operation would disappear with no record —
+  // `tryCatch` only catches throws, and the outcome object was
+  // previously ignored.
+  for (const kind of ["not_found", "busy", "version_exhausted", "metadata_too_large"] as const) {
+    it(`warn-logs when buffer outcome is { kind: "${kind}" }`, async () => {
+      warnSpy.mockClear();
+      applyMetadataMutationToBufferedRunMock.mockResolvedValueOnce({ kind });
+
+      await routeOperationsToRun("run_buffered_1", opsFixture, env);
+
+      expect(warnSpy).toHaveBeenCalledWith(
+        "metadata route: parent/root buffer op did not apply",
+        expect.objectContaining({ targetRunId: "run_buffered_1", kind }),
+      );
+    });
+  }
+
+  it("does NOT warn on the happy path (kind: 'applied')", async () => {
+    warnSpy.mockClear();
+    applyMetadataMutationToBufferedRunMock.mockResolvedValueOnce({
+      kind: "applied",
+      newMetadata: { k: "v" },
+      parentTaskRunFriendlyId: undefined,
+      rootTaskRunFriendlyId: undefined,
+    });
+
+    await routeOperationsToRun("run_buffered_1", opsFixture, env);
+
+    expect(warnSpy).not.toHaveBeenCalledWith(
+      "metadata route: parent/root buffer op did not apply",
+      expect.anything(),
+    );
+  });
+
+  it("warn-logs once when the helper throws (the pre-existing throw branch keeps working)", async () => {
+    warnSpy.mockClear();
+    applyMetadataMutationToBufferedRunMock.mockRejectedValueOnce(new Error("ECONNRESET"));
+
+    await routeOperationsToRun("run_buffered_1", opsFixture, env);
+
+    // Pre-existing branch — the catch logs `buffer fallback for parent/root
+    // op failed`. The new non-throw branch must NOT also fire (we return
+    // early on bufferError).
+    expect(warnSpy).toHaveBeenCalledWith(
+      "metadata route: buffer fallback for parent/root op failed",
+      expect.objectContaining({ targetRunId: "run_buffered_1" }),
+    );
+    expect(warnSpy).not.toHaveBeenCalledWith(
+      "metadata route: parent/root buffer op did not apply",
+      expect.anything(),
+    );
+  });
+
+  it("skips both PG and buffer when targetRunId is missing or operations is empty", async () => {
+    warnSpy.mockClear();
+    applyMetadataMutationToBufferedRunMock.mockClear();
+
+    await routeOperationsToRun(undefined, opsFixture, env);
+    await routeOperationsToRun("run_x", undefined, env);
+    await routeOperationsToRun("run_x", [], env);
+
+    expect(applyMetadataMutationToBufferedRunMock).not.toHaveBeenCalled();
+    expect(warnSpy).not.toHaveBeenCalled();
+  });
+});
diff --git a/apps/webapp/test/mollifierMutateWithFallback.test.ts b/apps/webapp/test/mollifierMutateWithFallback.test.ts
@@ -159,6 +159,40 @@ describe("mutateWithFallback", () => {
     expect(ctx?.bufferEntry?.orgId).toBe("org_1");
   });
 
+  // Symmetric writer-fallback in the `!buffer` short-circuit. Without
+  // this, mollifier-disabled deployments (or boot-time buffer init
+  // failures) would regress the pre-PR mutation routes — those read
+  // from the writer directly, so a fresh PG row was always visible.
+  // The replica offload introduced here moves the read to the lagging
+  // follower; if the buffer isn't available to disambiguate, we still
+  // probe the writer before returning 404.
+  it("replica miss + !buffer + writer hit → pgMutation (mollifier-disabled mode recovery)", async () => {
+    const row = fakeRun({ friendlyId: "run_1" });
+    const pgMutation = vi.fn(async () => "pg-recovered-no-buffer");
+    const result = await mutateWithFallback({
+      ...baseInput,
+      pgMutation,
+      synthesisedResponse: () => "snap",
+      prismaReplica: fakePrisma([null]) as unknown as typeof import("~/db.server").$replica,
+      prismaWriter: fakePrisma([row]) as unknown as typeof import("~/db.server").prisma,
+      getBuffer: () => null,
+    });
+    expect(result).toEqual({ kind: "pg", response: "pg-recovered-no-buffer" });
+    expect(pgMutation).toHaveBeenCalledWith(row);
+  });
+
+  it("replica miss + !buffer + writer miss → not_found (genuine 404 in mollifier-disabled mode)", async () => {
+    const result = await mutateWithFallback({
+      ...baseInput,
+      pgMutation: async () => "pg",
+      synthesisedResponse: () => "snap",
+      prismaReplica: fakePrisma([null]) as unknown as typeof import("~/db.server").$replica,
+      prismaWriter: fakePrisma([null]) as unknown as typeof import("~/db.server").prisma,
+      getBuffer: () => null,
+    });
+    expect(result).toEqual({ kind: "not_found" });
+  });
+
   it("replica miss + buffer not_found + writer miss → not_found", async () => {
     const result = await mutateWithFallback({
       ...baseInput,
diff --git a/apps/webapp/test/mollifierResolveRunForMutation.test.ts b/apps/webapp/test/mollifierResolveRunForMutation.test.ts