feat(agent-workspace): add replay triage and retention governance

Jacobinwwey · commit 5c75a96c41d5 · 2026-04-16T06:55:24.000-05:00
diff --git a/docs/brainstorms/2026-04-16-mainline-ci-stabilization-and-m7-direction-requirements.md b/docs/brainstorms/2026-04-16-mainline-ci-stabilization-and-m7-direction-requirements.md
@@ -282,6 +282,29 @@ Deliverables:
   - `npm run test:agent-workspace:contracts`
   - `npm run verify:agent-workspace:runtime`
 
+### M7.8 (Now): Operator Replay Triage and Bounded Retention Governance (Lane Ops Bridge)
+
+Deliverables:
+
+- add operator replay triage summary surface with explicit runbook links.
+- enforce bounded retention governance so diagnostics index and report files stay aligned.
+
+#### M7.8 Progress Note (2026-04-16)
+
+- [Done] expanded `src/server.ts` with replay triage route:
+  - `GET /api/knowledge/operator/agent-workspace-diagnostics/triage`.
+- [Done] expanded diagnostics summary semantics for operator triage:
+  - `replayCandidateRate` and `replayRiskLevel` (`low|medium|high`) are now stored in diagnostics index entries.
+- [Done] landed bounded retention governance:
+  - diagnostics report files that fall outside the retained index are pruned, so files on disk and index entries stay consistent within the `AGENT_WORKSPACE_DIAGNOSTICS_MAX_ENTRIES` bound.
+- [Done] expanded evidence coverage:
+  - `src/server.migration.test.ts` now asserts replay triage semantics and retention-bound enforcement,
+  - `scripts/verify-agent-workspace-runtime.js` + `src/agent_workspace.verification.contract.test.ts` now fail fast on triage/retention gate drift.
+- [Done] verification evidence:
+  - `npm test -- src/server.migration.test.ts --runInBand --testNamePattern "agent workspace diagnostics report|triage route summarizes replay risk"`
+  - `npm run test:agent-workspace:contracts`
+  - `npm run verify:agent-workspace:runtime`
+
 ## Success Criteria
 
 - CI failure mode that previously blocked the three agent-workspace suites is eliminated on mainline.
@@ -291,4 +314,4 @@ Deliverables:
 
 ## Next Step
 
-Proceed to `/prompts:ce-plan` using this document as the source for `M7.8` decomposition (operator runbook replay triage and bounded retention governance), while preserving M7 lane boundary constraints.
+Proceed to `/prompts:ce-plan` using this document as the source for `M7.9` decomposition (operator triage trend history and alert-threshold governance), while preserving M7 lane boundary constraints.
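
Reviewer sketch for the `replayCandidateRate` and `replayRiskLevel` fields described in the M7.8 progress note. This is a minimal illustration, not the code in `src/server.ts`. It assumes the rate is `replayCandidateTurns / userTurns`, which agrees with the fixture in `src/server.migration.test.ts` (1 replay candidate turn over 2 user turns gives `0.5` and `high`). The function names and both threshold constants are hypothetical, and whether `hasLastFailure` also influences the level is not stated in this commit.

```typescript
type ReplayRiskLevel = 'low' | 'medium' | 'high';

// Assumed thresholds, chosen only so that a rate of 0.5 maps to 'high'
// as in the test fixture. The real values may differ.
const HIGH_RISK_THRESHOLD = 0.5;
const MEDIUM_RISK_THRESHOLD = 0.2;

// Share of user turns flagged as replay candidates, clamped to [0, 1].
// Returns 0 when there are no user turns or the inputs are not finite.
function computeReplayCandidateRate(replayCandidateTurns: number, userTurns: number): number {
  if (!Number.isFinite(replayCandidateTurns) || !Number.isFinite(userTurns) || userTurns <= 0) {
    return 0;
  }
  return Math.min(1, Math.max(0, replayCandidateTurns / userTurns));
}

// Map a rate onto the three-level scale used by the diagnostics index.
function classifyReplayRisk(rate: number): ReplayRiskLevel {
  if (rate >= HIGH_RISK_THRESHOLD) return 'high';
  if (rate >= MEDIUM_RISK_THRESHOLD) return 'medium';
  return 'low';
}
```

Keeping the classification a pure function of the rate would make the thresholds easy to pin in a unit test, independent of the HTTP route.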
diff --git a/docs/diataxis/en/explanation/development-progress-dashboard.md b/docs/diataxis/en/explanation/development-progress-dashboard.md
@@ -350,6 +350,22 @@ Execution anchor:
   - `npm run test:agent-workspace:contracts`
   - `npm run verify:agent-workspace:runtime`
 
+## Latest Mainline Increment (2026-04-16 M7.8 Operator Replay Triage and Bounded Retention Governance Lane)
+
+- Extended sidecar diagnostics governance in `src/server.ts`:
+  - added `GET /api/knowledge/operator/agent-workspace-diagnostics/triage` for replay-risk triage summary and runbook links,
+  - added bounded retention cleanup for diagnostics report files so retained files stay aligned with index bounds (`AGENT_WORKSPACE_DIAGNOSTICS_MAX_ENTRIES`).
+- Extended diagnostics summary semantics:
+  - each diagnostics index entry now carries `replayCandidateRate` and `replayRiskLevel` (`low|medium|high`) for operator triage.
+- Expanded executable evidence:
+  - `src/server.migration.test.ts` now validates replay-risk triage output and verifies retention enforcement at max-entry boundary.
+- Hardened runtime verification gate:
+  - `scripts/verify-agent-workspace-runtime.js` now fails fast when the triage route or the retention-governance helper wiring is missing.
+- Verification evidence:
+  - `npm test -- src/server.migration.test.ts --runInBand --testNamePattern "agent workspace diagnostics report|triage route summarizes replay risk"`
+  - `npm run test:agent-workspace:contracts`
+  - `npm run verify:agent-workspace:runtime`
+
 ## Mainline vs Working-Branch Snapshot (2026-04-14)
 
 | Capability Slice | Working Branch (`feat/learning-multi-tutor-adapter`) | Mainline (`origin/main`) | Integration Status |
@@ -358,7 +374,7 @@ Execution anchor:
 | Focus + learning-path side-by-side pane model | Implemented in branch UI/runtime | Dock coexistence baseline integrated (`styles.css`, `path_styles.css`, `path_app.js`) | Partially integrated |
 | Agent workspace contract parity suite | Implemented (`src/agent_workspace.contract.parity.test.ts`, `src/agent_workspace.frontend.test.ts`, `src/agent_workspace.locale.contract.test.ts`, `src/agent_workspace.tauri.contract.test.ts`) | Baseline parity suite integrated (`src/agent_workspace.contract.parity.test.ts`, `src/agent_workspace.frontend.test.ts`, `src/agent_workspace.runtime.integration.test.ts`) | Partially integrated |
 | Result-presentation allowlist/override fail-fast governance | Implemented in branch execution registry and parity tests | Integrated in M1 (`src/frontend/agent_workspace.js` + parity tests) | Baseline integrated |
-| Conversation turn stream/replay/operator diagnostics expansion | Implemented in branch routes/tests | Mainline has snapshot + trend/index/export diagnostics baseline in runtime (`src/frontend/agent_workspace_runtime.js`) | Partially integrated |
+| Conversation turn stream/replay/operator diagnostics expansion | Implemented in branch routes/tests | Mainline has runtime snapshot+trend/index/export plus sidecar persistence+triage+bounded retention governance (`src/frontend/agent_workspace_runtime.js`, `src/server.ts`) | Partially integrated |
 | Graphdb/ANN foundation hardening lane | Branch-oriented lane claims exist in prior docs | Mainline currently exposes file-backed store baseline (`src/learning/store.ts`) | Not integrated on mainline |
 | Markdown reader governance refactor lane | Planned and partially implemented in branch | Mainline baseline only | Partially integrated |
 
@@ -398,7 +414,7 @@ This dashboard aligns against the following requirement chain:
 | L2 Retrieval | explainable hybrid/vector retrieval + governance | Expanded in branch-oriented plans | Mainline file-backed baseline only (`src/learning/store.ts`) | Re-enter lane after concrete module evidence lands on mainline |
 | L3 Learning | mastery diagnostics + path/session loop | Expanded in branch | Partially integrated | Contract and integration parity |
 | L4 Interaction | agent conversation + focus/path pane runtime | Implemented in branch | M1-M4 baseline integrated on mainline | Expand capability surface via typed contract only |
-| L5 Governance | runbook, diagnostics, replay/autonomy controls | Expanded in branch | Earlier runbook baseline | Integrate operator and CI gates |
+| L5 Governance | runbook, diagnostics, replay/autonomy controls | Expanded in branch | Operator diagnostics persistence/triage/retention baseline integrated | Expand operator runbook automation and CI evidence depth |
 
 ## Verification Baseline
 
diff --git a/docs/diataxis/zh/explanation/development-progress-dashboard.md b/docs/diataxis/zh/explanation/development-progress-dashboard.md
@@ -352,6 +352,22 @@
   - `npm run test:agent-workspace:contracts`
   - `npm run verify:agent-workspace:runtime`
 
+## 主线最新增量（2026-04-16 M7.8 运维回放分级与有界保留治理链路）
+
+- 已在 `src/server.ts` 扩展运维诊断治理能力：
+  - 新增 `GET /api/knowledge/operator/agent-workspace-diagnostics/triage`，输出 replay 风险分级摘要与 runbook 链接，
+  - 新增诊断报告文件有界保留清理逻辑，确保落盘文件集合与索引上限（`AGENT_WORKSPACE_DIAGNOSTICS_MAX_ENTRIES`）一致。
+- 已扩展诊断摘要语义：
+  - 每条诊断索引新增 `replayCandidateRate` 与 `replayRiskLevel`（`low|medium|high`）字段，用于运维分级研判。
+- 已补可执行证据：
+  - `src/server.migration.test.ts` 新增 replay 分级摘要断言与保留上限断言。
+- 已加固 runtime 验证门禁：
+  - `scripts/verify-agent-workspace-runtime.js` 新增 triage 路由与 retention helper 的 fail-fast 接线断言。
+- 验证证据：
+  - `npm test -- src/server.migration.test.ts --runInBand --testNamePattern "agent workspace diagnostics report|triage route summarizes replay risk"`
+  - `npm run test:agent-workspace:contracts`
+  - `npm run verify:agent-workspace:runtime`
+
 ## 主线 vs 工作分支快照（2026-04-14）
 
 | 能力切片 | 工作分支（`feat/learning-multi-tutor-adapter`） | 主线（`origin/main`） | 集成状态 |
@@ -360,7 +376,7 @@
 | Focus + learning-path 并排 pane 模型 | 分支已实现 | 已落入 dock 并排基线（`styles.css`、`path_styles.css`、`path_app.js`） | 部分集成 |
 | Agent workspace 合同门禁测试 | 已实现（`src/agent_workspace.contract.parity.test.ts`、`src/agent_workspace.frontend.test.ts`、`src/agent_workspace.locale.contract.test.ts`、`src/agent_workspace.tauri.contract.test.ts`） | 已落入基线门禁（`src/agent_workspace.contract.parity.test.ts`、`src/agent_workspace.frontend.test.ts`、`src/agent_workspace.runtime.integration.test.ts`） | 部分集成 |
 | 结果呈现 allowlist/override fail-fast 治理 | 分支已实现 | M1 已集成（`src/frontend/agent_workspace.js` + parity tests） | 基线已集成 |
-| conversation turn 流式/重放/诊断扩展 | 分支已扩展 | 主线已落入 snapshot + trend/index/export 诊断基线（`src/frontend/agent_workspace_runtime.js`） | 部分集成 |
+| conversation turn 流式/重放/诊断扩展 | 分支已扩展 | 主线已落入 runtime snapshot+trend/index/export + sidecar 持久化/分级/有界保留治理基线（`src/frontend/agent_workspace_runtime.js`、`src/server.ts`） | 部分集成 |
 | graphdb/ANN 底座收敛 | 先前文档存在分支导向结论 | 主线当前为 file-backed store 基线（`src/learning/store.ts`） | 主线未集成 |
 | Markdown 阅读器治理升级 | 分支已有规划与部分实现 | 主线为旧基线 | 部分集成 |
 
@@ -400,7 +416,7 @@
 | L2 检索层 | 可解释混合/向量检索 + 治理 | 分支规划增强中 | 主线当前为 file-backed 基线（`src/learning/store.ts`） | 待主线出现对应模块证据后再收敛 |
 | L3 学习层 | 掌握诊断 + 路径/会话闭环 | 分支增强中 | 主线部分集成 | 契约与集成一致性 |
 | L4 交互层 | agent 对话 + focus/path pane 运行时 | 分支已实现 | 主线 M1-M4 已落入基线 | 继续通过 typed contract 扩展动作面 |
-| L5 治理层 | runbook/诊断/回放与自动化 | 分支增强中 | 主线旧 runbook 基线 | 集成运维门禁 |
+| L5 治理层 | runbook/诊断/回放与自动化 | 分支增强中 | 主线已集成运维诊断持久化/分级/保留治理基线 | 扩展 runbook 自动化与 CI 证据深度 |
 
 ## 验证基线
 
diff --git a/scripts/verify-agent-workspace-runtime.js b/scripts/verify-agent-workspace-runtime.js
@@ -104,6 +104,18 @@ function verifyAgentWorkspaceRuntime(repoRoot = path.resolve(__dirname, '..')) {
     serverSource.includes('/api/knowledge/operator/agent-workspace-diagnostics/latest'),
     'Missing diagnostics report latest route in src/server.ts'
   );
+  assert(
+    serverSource.includes('/api/knowledge/operator/agent-workspace-diagnostics/triage'),
+    'Missing diagnostics report triage route in src/server.ts'
+  );
+  assert(
+    serverSource.includes('cleanupStaleAgentWorkspaceDiagnosticsReports'),
+    'Missing diagnostics retention cleanup helper in src/server.ts'
+  );
+  assert(
+    serverSource.includes('AGENT_WORKSPACE_DIAGNOSTICS_MAX_ENTRIES'),
+    'Missing diagnostics retention bound constant in src/server.ts'
+  );
   assert(
     runtimeSource.includes('persistDiagnosticsReport'),
     'Missing persistDiagnosticsReport runtime surface in src/frontend/agent_workspace_runtime.js'
@@ -119,6 +131,8 @@ function verifyAgentWorkspaceRuntime(repoRoot = path.resolve(__dirname, '..')) {
       'frontend contract module exists',
       'conversation route wiring exists',
       'diagnostics report persistence routes exist',
+      'diagnostics triage route exists',
+      'diagnostics retention governance exists',
       'runtime diagnostics persistence surface exists',
       'agent workspace contract test suite passes',
     ],
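
Reviewer sketch: the three new `assert(serverSource.includes(...))` gates in this script follow the same shape, so a table-driven variant is one possible way to keep them in sync as more routes land. The marker strings and messages below are copied from the hunk above. `REQUIRED_SERVER_MARKERS` and `findMissingMarkers` are hypothetical names, and the sketch is written in TypeScript for consistency with the test files even though the script itself is plain JavaScript.

```typescript
interface RequiredMarker {
  marker: string;  // substring that must appear in the source text
  message: string; // failure message reported when it is absent
}

// Hypothetical table holding the three gates added in this commit.
const REQUIRED_SERVER_MARKERS: ReadonlyArray<RequiredMarker> = [
  {
    marker: '/api/knowledge/operator/agent-workspace-diagnostics/triage',
    message: 'Missing diagnostics report triage route in src/server.ts',
  },
  {
    marker: 'cleanupStaleAgentWorkspaceDiagnosticsReports',
    message: 'Missing diagnostics retention cleanup helper in src/server.ts',
  },
  {
    marker: 'AGENT_WORKSPACE_DIAGNOSTICS_MAX_ENTRIES',
    message: 'Missing diagnostics retention bound constant in src/server.ts',
  },
];

// Returns the failure message for every marker that is absent from `source`.
function findMissingMarkers(
  source: string,
  required: ReadonlyArray<RequiredMarker> = REQUIRED_SERVER_MARKERS
): string[] {
  return required
    .filter(({ marker }) => !source.includes(marker))
    .map(({ message }) => message);
}
```

A caller could throw on the first returned message to preserve the existing fail-fast behavior, or report all of them at once.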
diff --git a/src/agent_workspace.verification.contract.test.ts b/src/agent_workspace.verification.contract.test.ts
@@ -47,6 +47,9 @@ describe('agent workspace verification script contracts', () => {
     expect(runtimeSource).toContain('/api/knowledge/operator/agent-workspace-diagnostics/report');
     expect(runtimeSource).toContain('/api/knowledge/operator/agent-workspace-diagnostics/index');
     expect(runtimeSource).toContain('/api/knowledge/operator/agent-workspace-diagnostics/latest');
+    expect(runtimeSource).toContain('/api/knowledge/operator/agent-workspace-diagnostics/triage');
+    expect(runtimeSource).toContain('cleanupStaleAgentWorkspaceDiagnosticsReports');
+    expect(runtimeSource).toContain('AGENT_WORKSPACE_DIAGNOSTICS_MAX_ENTRIES');
     expect(runtimeSource).toContain('persistDiagnosticsReport');
     expect(browserSource).toContain('verifyAgentWorkspaceBrowser');
     expect(tauriSource).toContain('verifyAgentWorkspaceTauri');
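
Reviewer sketch of the bounded-retention idea behind `cleanupStaleAgentWorkspaceDiagnosticsReports`. The helper's actual body is not visible in this part of the diff, so everything here is an assumption: the index is taken to be ordered newest first, each report is assumed to live in `<reportId>.json`, and `selectStaleReportFiles` is a hypothetical name. The `awd-*.json` pattern mirrors the one asserted in `src/server.migration.test.ts`.

```typescript
interface DiagnosticsIndexEntry {
  reportId: string;
}

// Same filename shape the migration test filters on.
const REPORT_FILE_PATTERN = /^awd-[a-z0-9_-]+\.json$/i;

// Pure planning step: decide which index entries survive and which report
// files on disk no longer have a retained index entry.
function selectStaleReportFiles(
  index: DiagnosticsIndexEntry[],
  filesOnDisk: string[],
  maxEntries: number
): { retained: DiagnosticsIndexEntry[]; staleFiles: string[] } {
  // Assumes newest-first ordering, so the head of the list is kept.
  const retained = index.slice(0, Math.max(0, maxEntries));
  const retainedFiles = new Set(retained.map((entry) => `${entry.reportId}.json`));
  // Only files that look like reports are candidates; anything else is left alone.
  const staleFiles = filesOnDisk.filter(
    (file) => REPORT_FILE_PATTERN.test(file) && !retainedFiles.has(file)
  );
  return { retained, staleFiles };
}
```

Separating the selection from the deletion keeps the filesystem calls (for example `fs.promises.unlink`) in one small async wrapper, which fits the existing test that forbids synchronous filesystem APIs in the server runtime path.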
diff --git a/src/knowledge.api.contract.test.ts b/src/knowledge.api.contract.test.ts
@@ -11,6 +11,7 @@ describe('Knowledge mastery API contract wiring', () => {
             '/api/knowledge/store-diagnostics',
             '/api/knowledge/operator/agent-workspace-diagnostics/index',
             '/api/knowledge/operator/agent-workspace-diagnostics/latest',
+            '/api/knowledge/operator/agent-workspace-diagnostics/triage',
             '/api/knowledge/operator/agent-workspace-diagnostics/report',
             '/api/knowledge/store/reload',
             '/api/knowledge/ingest',
diff --git a/src/server.migration.test.ts b/src/server.migration.test.ts
@@ -465,8 +465,10 @@ describe('server migration settings routes', () => {
             conversationRequests: 2,
             replayCandidateTurns: 1,
             userTurns: 2,
+            replayCandidateRate: 0.5,
             capabilityEvents: 2,
-            hasLastFailure: false
+            hasLastFailure: false,
+            replayRiskLevel: 'high'
           })
         })
       })
@@ -536,6 +538,99 @@ describe('server migration settings routes', () => {
     ).resolves.toContain(reportId);
   });
 
+  test('triage route summarizes replay risk and retention stays bounded to max entries', async () => {
+    const createReport = async (index: number) => {
+      const pattern = index % 3;
+      const userTurns = pattern === 0 ? 6 : pattern === 1 ? 5 : 4;
+      const replayCandidateTurns = pattern === 0 ? 4 : pattern === 1 ? 1 : 0;
+      const hasFailure = pattern === 2;
+      return requestJson(
+        port,
+        'POST',
+        '/api/knowledge/operator/agent-workspace-diagnostics/report',
+        {
+          source: `triage-test-${index}`,
+          report: {
+            snapshot: {
+              conversationRequests: 1,
+              replayCandidateTurns,
+              turnCounts: {
+                user: userTurns
+              },
+              capabilityEvents: [{ eventId: `cap_${index}`, status: 'success' }],
+              lastFailure: hasFailure ? { source: 'triage-test', message: 'simulated failure' } : null
+            },
+            trend: {
+              userTurns
+            },
+            index: {
+              capabilityIndex: {
+                operationIds: ['build_learning_path']
+              }
+            }
+          }
+        }
+      );
+    };
+
+    for (let index = 0; index < 45; index += 1) {
+      const response = await createReport(index);
+      expect(response.status).toBe(200);
+      expect(response.body.success).toBe(true);
+    }
+
+    const indexResponse = await requestJson(
+      port,
+      'GET',
+      '/api/knowledge/operator/agent-workspace-diagnostics/index'
+    );
+    expect(indexResponse.status).toBe(200);
+    expect(indexResponse.body.success).toBe(true);
+    expect(indexResponse.body.count).toBe(40);
+    expect(indexResponse.body.index.length).toBe(40);
+
+    const triageResponse = await requestJson(
+      port,
+      'GET',
+      '/api/knowledge/operator/agent-workspace-diagnostics/triage'
+    );
+    expect(triageResponse.status).toBe(200);
+    expect(triageResponse.body.success).toBe(true);
+    expect(triageResponse.body.triage).toEqual(
+      expect.objectContaining({
+        maxEntries: 40,
+        indexedEntries: 40,
+        byRiskLevel: expect.objectContaining({
+          high: expect.any(Number),
+          medium: expect.any(Number),
+          low: expect.any(Number)
+        }),
+        withFailureCount: expect.any(Number),
+        replayCandidateRateAverage: expect.any(Number),
+        topReplayReports: expect.any(Array),
+        runbookLinks: expect.arrayContaining([
+          expect.objectContaining({
+            id: 'development-progress-dashboard'
+          }),
+          expect.objectContaining({
+            id: 'm7-direction-requirements'
+          })
+        ])
+      })
+    );
+    expect(triageResponse.body.triage.topReplayReports.length).toBeLessThanOrEqual(5);
+    const riskBucketTotal = triageResponse.body.triage.byRiskLevel.high
+      + triageResponse.body.triage.byRiskLevel.medium
+      + triageResponse.body.triage.byRiskLevel.low;
+    expect(riskBucketTotal).toBe(40);
+    expect(triageResponse.body.triage.byRiskLevel.high).toBeGreaterThan(0);
+
+    const diagnosticsDir = path.join(runtimeDataDir, 'agent_workspace_diagnostics');
+    const diagnosticsFiles = (await fs.promises.readdir(diagnosticsDir))
+      .filter((entry) => /^awd-[a-z0-9_-]+\.json$/i.test(entry));
+    expect(diagnosticsFiles.length).toBe(40);
+  });
+
   test('server runtime path avoids synchronous filesystem APIs', () => {
     const serverSourcePath = path.join(__dirname, 'server.ts');
     const serverSource = fs.readFileSync(serverSourcePath, 'utf8');
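
Reviewer sketch of an aggregation that would satisfy the triage assertions in the test above. The summary field names (`maxEntries`, `indexedEntries`, `byRiskLevel`, `withFailureCount`, `replayCandidateRateAverage`, `topReplayReports`) come from the test. The entry shape, the `summarizeTriage` name, and the choice to rank `topReplayReports` by descending rate are assumptions, and `runbookLinks` is left out because its contents are static configuration rather than computed values.

```typescript
type RiskLevel = 'low' | 'medium' | 'high';

// Assumed shape of one diagnostics index entry.
interface TriageIndexEntry {
  reportId: string;
  replayCandidateRate: number;
  replayRiskLevel: RiskLevel;
  hasLastFailure: boolean;
}

interface TriageSummary {
  maxEntries: number;
  indexedEntries: number;
  byRiskLevel: Record<RiskLevel, number>;
  withFailureCount: number;
  replayCandidateRateAverage: number;
  topReplayReports: TriageIndexEntry[];
}

// The test only asserts an upper bound of 5 on topReplayReports.
const TOP_REPLAY_REPORT_LIMIT = 5;

function summarizeTriage(entries: TriageIndexEntry[], maxEntries: number): TriageSummary {
  const byRiskLevel: Record<RiskLevel, number> = { low: 0, medium: 0, high: 0 };
  let withFailureCount = 0;
  let rateSum = 0;
  for (const entry of entries) {
    byRiskLevel[entry.replayRiskLevel] += 1;
    if (entry.hasLastFailure) withFailureCount += 1;
    rateSum += entry.replayCandidateRate;
  }
  // Copy before sorting so the caller's index order is not mutated.
  const topReplayReports = [...entries]
    .sort((a, b) => b.replayCandidateRate - a.replayCandidateRate)
    .slice(0, TOP_REPLAY_REPORT_LIMIT);
  return {
    maxEntries,
    indexedEntries: entries.length,
    byRiskLevel,
    withFailureCount,
    replayCandidateRateAverage: entries.length > 0 ? rateSum / entries.length : 0,
    topReplayReports,
  };
}
```

Because every entry lands in exactly one risk bucket, the bucket counts always sum to `indexedEntries`, which is the invariant the test checks with `riskBucketTotal`.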
diff --git a/src/server.ts b/src/server.ts