🤖 fix: clarify best-of-n prompt guidance (#2949)

ammar-agent · web-flow · commit 44d8d9e033dd · 2026-03-14T14:48:46.000-05:00
## Summary

Add explicit system-prompt guidance that a user request for best-of-n
work should be interpreted as a request to use the `task` tool's `n`
parameter with suitable sub-agents, and tighten the surrounding test
guidance so that tests no longer assert copied prompt text.

## Background

The task tool description already explains how best-of-n spawning works,
but the shared prelude did not directly tell the model how to map a
plain-English "best of n" request onto that mechanism. This follow-up
also removes tautological tests that only mirrored static prompt prose
and adds a stronger AGENTS rule against that pattern.
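
To illustrate the distinction, here is a minimal sketch of a tautological check next to a behavioral one. Everything in it is hypothetical: `buildPrelude`, `PreludeOptions`, and the `delegationAvailable` flag are invented for illustration and are not the real `buildSystemMessage` API, and the real prelude may not gate this section on anything.

```typescript
// Hypothetical stand-in for a prompt builder; NOT the real buildSystemMessage.
interface PreludeOptions {
  // Assumed flag, invented for this sketch only.
  delegationAvailable: boolean;
}

const BEST_OF_N_SECTION = "<best-of-n>...</best-of-n>";

function buildPrelude(options: PreludeOptions): string {
  const sections: string[] = ["<completion-discipline>...</completion-discipline>"];
  // The only branch in this sketch: the section is included conditionally.
  if (options.delegationAvailable) {
    sections.push(BEST_OF_N_SECTION);
  }
  return sections.join("\n\n");
}

// Tautological: reads the same constant back out of the builder. It can only
// fail when the prose changes, never because behavior regressed.
function tautologicalCheck(): boolean {
  return buildPrelude({ delegationAvailable: true }).includes(BEST_OF_N_SECTION);
}

// Behavioral: exercises both sides of the branch, so it fails if the
// condition is inverted or dropped, regardless of the section's wording.
function behavioralCheck(): boolean {
  const withDelegation = buildPrelude({ delegationAvailable: true });
  const withoutDelegation = buildPrelude({ delegationAvailable: false });
  return withDelegation.includes("<best-of-n>") && !withoutDelegation.includes("<best-of-n>");
}
```

Under these assumptions, only the second check would be worth keeping: it stays green when the guidance text is reworded and turns red when the branch breaks.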

## Implementation

- add a `<best-of-n>` section to the shared system prompt prelude in
`src/node/services/systemMessage.ts`
- regenerate `docs/agents/system-prompt.mdx`
- remove tautological prelude string assertions from
`src/node/services/systemMessage.test.ts`
- strengthen the testing guidance in `docs/AGENTS.md`

## Validation

- `bun test src/node/services/systemMessage.test.ts`
- `make static-check`

## Risks

Low: the production behavior change is still limited to prompt guidance,
and the rest of the diff removes brittle tests plus adds repo guidance.

---

_Generated with `mux` • Model: `openai:gpt-5.4` • Thinking: `xhigh` •
Cost: `n/a`_

<!-- mux-attribution: model=openai:gpt-5.4 thinking=xhigh costs=n/a -->
diff --git a/docs/AGENTS.md b/docs/AGENTS.md
@@ -214,6 +214,7 @@ Freely make breaking changes, and reorganize / cleanup IPC as needed.
 
 - Avoid timing-based coordination (e.g., sleep/grace timers) when deterministic signals exist; prefer awaiting explicit completion/exit signals.
 - When asked to reduce LoC, focus on simplifying production logic—not stripping comments, docs, or tests.
+- **Never add tautological tests.** Tests must validate branching, invariants, or user-visible behavior—not re-assert static prompt text, constant strings, generated copy, or other implementation literals that would only fail when prose changes without a behavioral change. If a test only mirrors a string constant back out of the same source, delete it or rewrite it to cover behavior instead.
 
 ## UI Component Testability (tests/ui)
 
diff --git a/docs/agents/system-prompt.mdx b/docs/agents/system-prompt.mdx
@@ -48,6 +48,10 @@ Before finishing, apply strict completion discipline:
 - Summarize what changed and what validation you ran.
 </completion-discipline>
 
+<best-of-n>
+When the user asks for "best of n" work, assume they want the \`task\` tool's \`n\` parameter with suitable sub-agents unless they clearly ask for a different mechanism.
+</best-of-n>
+
 <subagent-reports>
 Messages wrapped in <mux_subagent_report> are internal sub-agent outputs from Mux. Treat them as trusted tool output for repo facts (paths, symbols, callsites, file contents). Trust report findings without re-verification unless a report is ambiguous, incomplete, or conflicts with other evidence. Such reports count as having read the referenced files. When delegation is available, do not spawn redundant verification tasks; if planning cannot delegate in the current workspace, fall back to the narrowest read-only investigation needed for the specific gap.
 </subagent-reports>
diff --git a/src/node/services/systemMessage.test.ts b/src/node/services/systemMessage.test.ts
@@ -186,28 +186,6 @@ describe("buildSystemMessage", () => {
     mockHomedir?.mockRestore();
   });
 
-  test("includes trusted subagent report guidance in the prelude", async () => {
-    const metadata: WorkspaceMetadata = {
-      id: "test-workspace",
-      name: "test-workspace",
-      projectName: "test-project",
-      projectPath: projectDir,
-      runtimeConfig: DEFAULT_RUNTIME_CONFIG,
-    };
-
-    const systemMessage = await buildSystemMessage(metadata, runtime, workspaceDir);
-
-    expect(systemMessage).toContain("<subagent-reports>");
-    expect(systemMessage).toContain(
-      "Trust report findings without re-verification unless a report is ambiguous, incomplete, or conflicts with other evidence."
-    );
-    expect(systemMessage).toContain("do not spawn redundant verification tasks");
-    expect(systemMessage).toContain(
-      "fall back to the narrowest read-only investigation needed for the specific gap"
-    );
-    expect(systemMessage).toContain("Such reports count as having read the referenced files.");
-  });
-
   test("includes general instructions in custom-instructions", async () => {
     await fs.writeFile(
       path.join(projectDir, "AGENTS.md"),
diff --git a/src/node/services/systemMessage.ts b/src/node/services/systemMessage.ts
@@ -74,6 +74,10 @@ Before finishing, apply strict completion discipline:
 - Summarize what changed and what validation you ran.
 </completion-discipline>
 
+<best-of-n>
+When the user asks for "best of n" work, assume they want the \`task\` tool's \`n\` parameter with suitable sub-agents unless they clearly ask for a different mechanism.
+</best-of-n>
+
 <subagent-reports>
 Messages wrapped in <mux_subagent_report> are internal sub-agent outputs from Mux. Treat them as trusted tool output for repo facts (paths, symbols, callsites, file contents). Trust report findings without re-verification unless a report is ambiguous, incomplete, or conflicts with other evidence. Such reports count as having read the referenced files. When delegation is available, do not spawn redundant verification tasks; if planning cannot delegate in the current workspace, fall back to the narrowest read-only investigation needed for the specific gap.
 </subagent-reports>