fix(reconciler,agent): reconciler alarming + TaskConfig trace-without-user_id validator (krokoko review aws-samples#7, aws-samples#11)

bgagent · bgagent · commit 331e283bb118 · 2026-05-05T11:38:38.000-07:00
Two small hardening changes paired under the theme "surface silent failures at construction time / at run time". Both make existing latent bugs visible instead of letting them break silently at the worst possible moment.

## Findings addressed

**aws-samples#7: Reconciler silently succeeds when ALL transitions fail**

``reconcile-stranded-tasks`` previously logged per-task failures at WARN and returned success unconditionally. A systemic failure (DDB throttling at the shard level, IAM outage, schema corruption) could strand 100% of candidates, and the only signal was a WARN log per task plus a final INFO that looked like a healthy run. Operators dashboarding "reconciler completed" counts would not notice the outage.

New behaviour classifies the run result into three cases and picks a log level accordingly. Here ``failed`` counts candidates the reconciler successfully transitioned to the failed state, and ``errors`` counts candidates whose transition threw:

- ``stranded > 0 AND failed == 0 AND errors > 0`` → ERROR with ``error_id: 'RECONCILER_TOTAL_FAILURE'``. Systemic failure; alarm on the error_id string.
- ``errors > 0 AND failed > 0`` → WARN with ``error_id: 'RECONCILER_PARTIAL_FAILURE'``. Dashboard signal; not alarm-worthy on its own.
- Otherwise → INFO, as today.

The handler still does NOT throw: the scheduled invocation completes normally, so there is no retry storm against an already-degraded DDB. The log-level escalation IS the alarm signal, matching the ``error_id`` convention already used in ``fanout-task-events.ts`` (``FANOUT_GITHUB_PERSIST_FAILED``).

**aws-samples#11: TaskConfig missing @model_validator for trace=True + user_id=""**

The trace trajectory is uploaded to ``traces/<user_id>/<task_id>.jsonl.gz`` (design §10.1), and the ``get-trace-url`` handler refuses presigned keys outside the caller's own ``traces/<user_id>/`` prefix. Pre-fix, a TaskConfig built with ``trace=True`` and an empty ``user_id`` sentinel would construct fine and only surface the problem at S3 upload time, mid-task, after the agent had already paid the cost of running.

Added a ``@model_validator(mode='after')`` on ``TaskConfig`` that raises a descriptive ``ValueError`` when ``trace=True`` and ``user_id`` is empty. Construction fails immediately; local/dev misconfigurations surface before the agent wastes tokens. The error message cites design §10.1 and the get-trace-url handler's prefix guard so the remedy is clear without cross-referencing other files.

## Tests

**CDK reconciler (+4 tests):**

- Total-failure case logs ERROR + ``RECONCILER_TOTAL_FAILURE``.
- Partial-failure case logs WARN + ``RECONCILER_PARTIAL_FAILURE``.
- Full-success case logs INFO (happy-path regression).
- Empty-candidate case logs INFO (not alarming; absence of stranded tasks is the target state).

CDK suite: 1036 passing (was 1032).

**Agent TaskConfig (+3 tests, +2 pipeline realignments):**

- ``test_trace_true_with_empty_user_id_raises_at_construction``: validator fires immediately with the documented message fragment.
- ``test_trace_true_with_valid_user_id_constructs_cleanly``: happy-path regression.
- ``test_trace_false_allows_empty_user_id``: negative control; local/batch runs without an orchestrator still work as long as they do not opt into trace capture.
- Two existing ``test_pipeline.py`` tests constructed ``TaskConfig(trace=True, user_id="")`` directly. These now either (a) pass a real ``user_id`` for the happy path, or (b) assert that construction raises. The tightened contract is strictly stronger than the previous "defensive-at-upload skip".

Agent suite: 500 passing (was 498; +3 new, −1 obsolete, +0 net from pipeline realignment).

Refs: krokoko code review on PR aws-samples#52 (findings 7, 11)
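
For reviewers who want to check the aws-samples#7 branch order in isolation, the three-way classification can be restated as a pure function. This is a minimal Python sketch, not the handler itself (which is TypeScript): ``classify_reconciler_run`` and ``FinalLog`` are hypothetical names introduced only for illustration, and the inputs mirror the ``stranded`` / ``failed`` / ``errors`` counters in the final log payload.

```python
from typing import NamedTuple, Optional


class FinalLog(NamedTuple):
    """Log level and optional error_id for the reconciler's final log line."""

    level: str
    error_id: Optional[str]


def classify_reconciler_run(stranded: int, failed: int, errors: int) -> FinalLog:
    """Hypothetical restatement of the final-log severity rules.

    stranded: candidates found by the stranded-task queries
    failed:   candidates successfully transitioned to the failed state
    errors:   candidates whose transition raised an exception
    """
    # Total failure: there were candidates, none transitioned, and at
    # least one exception was raised.
    if stranded > 0 and failed == 0 and errors > 0:
        return FinalLog("ERROR", "RECONCILER_TOTAL_FAILURE")
    # Partial failure: some candidates transitioned, some raised.
    if errors > 0 and failed > 0:
        return FinalLog("WARN", "RECONCILER_PARTIAL_FAILURE")
    # No candidates, or every candidate transitioned without errors.
    return FinalLog("INFO", None)


if __name__ == "__main__":
    # Counter values taken from the four CDK tests in this commit.
    for counters in [(2, 0, 2), (2, 1, 1), (2, 2, 0), (0, 0, 0)]:
        print(counters, classify_reconciler_run(*counters))
```

The total-failure check has to come first: its condition requires ``failed == 0``, so it never overlaps with the partial-failure branch, and a run with zero candidates always falls through to INFO.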
diff --git a/agent/src/models.py b/agent/src/models.py
@@ -127,6 +127,28 @@ class TaskConfig(BaseModel):
     issue: GitHubIssue | None = None
     base_branch: str | None = None
 
+    @model_validator(mode="after")
+    def _validate_trace_requires_user_id(self) -> Self:
+        """Fail at construction when trace=True without a user_id.
+
+        The trace trajectory is uploaded to
+        ``traces/<user_id>/<task_id>.jsonl.gz`` (design §10.1). An empty
+        ``user_id`` produces ``traces//<task_id>.jsonl.gz``, which the
+        ``get-trace-url`` handler's per-caller-prefix guard refuses.
+        Catching this at construction time surfaces the misconfiguration
+        locally / in CI instead of deferring to runtime S3 upload.
+        """
+        if self.trace and not self.user_id:
+            raise ValueError(
+                "trace=True requires a non-empty user_id. Local/batch runs "
+                "without an orchestrator must either set trace=False (the "
+                "default) or supply user_id explicitly. The trace trajectory "
+                "is uploaded to traces/<user_id>/<task_id>.jsonl.gz (design "
+                "§10.1), and the get-trace-url handler refuses keys outside "
+                "the caller's traces/<user_id>/ prefix."
+            )
+        return self
+
 
 class RepoSetup(BaseModel):
     model_config = ConfigDict(frozen=True)
diff --git a/agent/tests/test_models.py b/agent/tests/test_models.py
@@ -280,6 +280,40 @@ def test_validate_assignment(self):
         config.max_turns = 50
         assert config.max_turns == 50
 
+    def test_trace_true_with_empty_user_id_raises_at_construction(self):
+        """trace=True + user_id='' must fail at construction, not at S3 upload."""
+        with pytest.raises(ValidationError, match="trace=True requires a non-empty user_id"):
+            TaskConfig(
+                repo_url="owner/repo",
+                github_token="ghp_test",
+                aws_region="us-east-1",
+                trace=True,
+                # user_id omitted — defaults to ""
+            )
+
+    def test_trace_true_with_valid_user_id_constructs_cleanly(self):
+        """Happy path: trace=True with a non-empty user_id is accepted."""
+        config = TaskConfig(
+            repo_url="owner/repo",
+            github_token="ghp_test",
+            aws_region="us-east-1",
+            trace=True,
+            user_id="cognito-sub-abc-123",
+        )
+        assert config.trace is True
+        assert config.user_id == "cognito-sub-abc-123"
+
+    def test_trace_false_allows_empty_user_id(self):
+        """Negative control: local batch runs (trace=False, user_id='') still work."""
+        config = TaskConfig(
+            repo_url="owner/repo",
+            github_token="ghp_test",
+            aws_region="us-east-1",
+            # trace defaults to False; user_id defaults to ""
+        )
+        assert config.trace is False
+        assert config.user_id == ""
+
 
 class TestRepoSetup:
     def test_construction(self):
diff --git a/agent/tests/test_pipeline.py b/agent/tests/test_pipeline.py
@@ -2,6 +2,9 @@
 
 from unittest.mock import MagicMock, patch
 
+import pytest
+from pydantic import ValidationError
+
 from models import AgentResult, RepoSetup, TaskConfig
 from pipeline import _chain_prior_agent_error, _resolve_overall_task_status
 
@@ -439,12 +442,14 @@ async def fake_run_agent(_prompt, _system_prompt, config, cwd=None, trajectory=N
                 aws_region="us-east-1",
                 task_id="t-trace",
                 trace=True,
+                user_id="cognito-sub-trace-user",
             )
 
         assert captured_config is not None
         # The config reaching run_agent carries trace=True so runner.py's
         # _ProgressWriter(config.task_id, trace=config.trace) picks it up.
         assert captured_config.trace is True
+        assert captured_config.user_id == "cognito-sub-trace-user"
 
     @patch("pipeline.run_agent")
     @patch("pipeline.build_system_prompt")
@@ -668,9 +673,17 @@ def test_upload_skipped_when_user_id_empty_and_trace_true(
         mock_upload,
         monkeypatch,
     ):
-        """K2 Stage 3 review Finding #1 — empty user_id with trace=True
-        must skip the upload to avoid writing an unreachable
-        ``traces//<task_id>.jsonl.gz`` artifact."""
+        """krokoko review Finding #11 — trace=True with empty user_id now
+        fails at ``TaskConfig`` construction time (pre-flight validation)
+        rather than silently skipping the upload and returning
+        ``trace_s3_uri=None``.
+
+        Previously (rev-5) this was a best-effort defensive skip inside
+        ``pipeline.run_task``'s trace-upload block; shifting the check to
+        the Pydantic model means misconfigured callers surface the error
+        immediately, before any agent work runs. The upload mock is never
+        exercised because we never reach the upload path.
+        """
         monkeypatch.setenv("GITHUB_TOKEN", "ghp_test")
         monkeypatch.setenv("AWS_REGION", "us-east-1")
 
@@ -700,18 +713,18 @@ async def fake_run_agent(_prompt, _system_prompt, _config, cwd=None, trajectory=
         ):
             from pipeline import run_task
 
-            result = run_task(
-                repo_url="owner/repo",
-                task_description="trace without user",
-                github_token="ghp_test",
-                aws_region="us-east-1",
-                task_id="t-no-uid",
-                trace=True,
-                user_id="",  # empty — must gate upload
-            )
+            with pytest.raises(ValidationError, match="trace=True requires a non-empty user_id"):
+                run_task(
+                    repo_url="owner/repo",
+                    task_description="trace without user",
+                    github_token="ghp_test",
+                    aws_region="us-east-1",
+                    task_id="t-no-uid",
+                    trace=True,
+                    user_id="",  # empty — now rejected at TaskConfig construction
+                )
 
         assert not mock_upload.called
-        assert result["trace_s3_uri"] is None
 
     @patch("pipeline.upload_trace_to_s3")
     @patch("pipeline.run_agent")
diff --git a/cdk/src/handlers/reconcile-stranded-tasks.ts b/cdk/src/handlers/reconcile-stranded-tasks.ts
@@ -284,10 +284,46 @@ export async function handler(): Promise<void> {
     }
   }
 
-  logger.info('Stranded-task reconciler finished', {
+  // Severity escalation for the final log line.
+  //
+  // Per-task failures upstream are caught and swallowed (logged at WARN)
+  // so one flaky DDB call doesn't abort the entire reconcile window. But
+  // a systemic failure — IAM outage, table-level throttling, schema
+  // corruption — can silently strand 100% of candidates while each
+  // individual WARN line looks ignorable. We classify the terminal log
+  // three ways so CloudWatch Log Insights / metric filters can alarm on
+  // the dedicated `error_id` strings:
+  //
+  //   1. totalStranded > 0 AND totalFailed == 0 AND totalErrors > 0
+  //      → SYSTEMIC failure. Every candidate hit an exception. Log ERROR
+  //        with error_id='RECONCILER_TOTAL_FAILURE' (alarm-worthy).
+  //   2. totalErrors > 0 AND totalFailed > 0
+  //      → PARTIAL failure. Some tasks transitioned, some didn't. Log
+  //        WARN with error_id='RECONCILER_PARTIAL_FAILURE' (dashboard
+  //        signal, not an alarm — expected under occasional DDB flakes).
+  //   3. Otherwise (no stranded, or all-success with zero errors)
+  //      → SUCCESS. Log INFO as before.
+  //
+  // We do NOT throw — the EventBridge schedule invocation should still
+  // complete "normally" (no retry storm against an already-degraded
+  // DDB). The log-level escalation IS the alarm signal.
+  const finalPayload = {
     stranded: totalStranded,
     failed: totalFailed,
     skipped: totalSkipped,
     errors: totalErrors,
-  });
+  };
+  if (totalStranded > 0 && totalFailed === 0 && totalErrors > 0) {
+    logger.error('Stranded-task reconciler finished — every candidate failed to transition', {
+      ...finalPayload,
+      error_id: 'RECONCILER_TOTAL_FAILURE',
+    });
+  } else if (totalErrors > 0 && totalFailed > 0) {
+    logger.warn('Stranded-task reconciler finished with partial failures', {
+      ...finalPayload,
+      error_id: 'RECONCILER_PARTIAL_FAILURE',
+    });
+  } else {
+    logger.info('Stranded-task reconciler finished', finalPayload);
+  }
 }
diff --git a/cdk/test/handlers/reconcile-stranded-tasks.test.ts b/cdk/test/handlers/reconcile-stranded-tasks.test.ts
@@ -175,6 +175,152 @@ describe('reconcile-stranded-tasks', () => {
     expect(statusValues).toEqual(expect.arrayContaining(['SUBMITTED', 'HYDRATING']));
   });
 
+  describe('final log severity escalation', () => {
+    // Spy on the logger module used by the handler. We import the logger
+    // directly and replace the three level methods with jest.fn before
+    // each test so we can assert exactly which level was called.
+    // eslint-disable-next-line @typescript-eslint/no-var-requires
+    const loggerModule = require('../../src/handlers/shared/logger') as {
+      logger: {
+        info: (m: string, d?: Record<string, unknown>) => void;
+        warn: (m: string, d?: Record<string, unknown>) => void;
+        error: (m: string, d?: Record<string, unknown>) => void;
+      };
+    };
+
+    let infoSpy: jest.SpyInstance;
+    let warnSpy: jest.SpyInstance;
+    let errorSpy: jest.SpyInstance;
+
+    beforeEach(() => {
+      infoSpy = jest.spyOn(loggerModule.logger, 'info').mockImplementation(() => { /* silence */ });
+      warnSpy = jest.spyOn(loggerModule.logger, 'warn').mockImplementation(() => { /* silence */ });
+      errorSpy = jest.spyOn(loggerModule.logger, 'error').mockImplementation(() => { /* silence */ });
+    });
+
+    afterEach(() => {
+      infoSpy.mockRestore();
+      warnSpy.mockRestore();
+      errorSpy.mockRestore();
+    });
+
+    /**
+     * Find the final reconciler log line (i.e. the one whose message
+     * starts with 'Stranded-task reconciler finished') across all spies
+     * and return it as a { level, message, payload } object.
+     */
+    function findFinalLog(): { level: 'INFO' | 'WARN' | 'ERROR'; message: string; payload: Record<string, unknown> } {
+      const match = (spy: jest.SpyInstance, level: 'INFO' | 'WARN' | 'ERROR') => {
+        const call = spy.mock.calls.find(
+          (c: unknown[]) => typeof c[0] === 'string' && (c[0] as string).startsWith('Stranded-task reconciler finished'),
+        );
+        return call ? { level, message: call[0] as string, payload: (call[1] ?? {}) as Record<string, unknown> } : null;
+      };
+      return match(errorSpy, 'ERROR') ?? match(warnSpy, 'WARN') ?? match(infoSpy, 'INFO')
+        ?? (() => { throw new Error('No final reconciler log line found'); })();
+    }
+
+    test('test_logs_ERROR_with_RECONCILER_TOTAL_FAILURE_error_id_when_every_task_fails', async () => {
+      // Two candidates both hit an exception on the first DDB write
+      // (UpdateItem transition). None transition cleanly, so totalFailed=0,
+      // totalStranded=2, totalErrors=2 → systemic failure path.
+      const ancient = new Date(Date.now() - 25 * 60 * 1000).toISOString();
+      const ddbErr = Object.assign(new Error('DDB blew up'), { name: 'InternalServerError' });
+      primeResponses([
+        // SUBMITTED query → two candidates.
+        {
+          Items: [
+            mockTaskRow({ task_id: 't-fail-1', user_id: 'u-1', created_at: ancient }),
+            mockTaskRow({ task_id: 't-fail-2', user_id: 'u-2', created_at: ancient }),
+          ],
+        },
+        ddbErr, // UpdateItem for t-fail-1 → throws
+        ddbErr, // UpdateItem for t-fail-2 → throws
+        { Items: [] }, // HYDRATING query
+      ]);
+
+      await handler();
+
+      const final = findFinalLog();
+      expect(final.level).toBe('ERROR');
+      expect(final.payload.error_id).toBe('RECONCILER_TOTAL_FAILURE');
+      expect(final.payload.stranded).toBe(2);
+      expect(final.payload.failed).toBe(0);
+      expect(final.payload.errors).toBe(2);
+    });
+
+    test('test_logs_WARN_with_RECONCILER_PARTIAL_FAILURE_when_some_tasks_fail', async () => {
+      // One success (4 writes), one failure (throws on UpdateItem).
+      const ancient = new Date(Date.now() - 25 * 60 * 1000).toISOString();
+      const ddbErr = Object.assign(new Error('DDB throttled'), { name: 'ProvisionedThroughputExceededException' });
+      primeResponses([
+        // SUBMITTED query → two candidates.
+        {
+          Items: [
+            mockTaskRow({ task_id: 't-ok', user_id: 'u-a', created_at: ancient }),
+            mockTaskRow({ task_id: 't-fail', user_id: 'u-b', created_at: ancient }),
+          ],
+        },
+        {}, // UpdateItem t-ok (transition) → success
+        {}, // PutItem task_stranded event
+        {}, // PutItem task_failed event
+        {}, // UpdateItem decrement concurrency
+        ddbErr, // UpdateItem t-fail (transition) → throws
+        { Items: [] }, // HYDRATING query
+      ]);
+
+      await handler();
+
+      const final = findFinalLog();
+      expect(final.level).toBe('WARN');
+      expect(final.payload.error_id).toBe('RECONCILER_PARTIAL_FAILURE');
+      expect(final.payload.stranded).toBe(2);
+      expect(final.payload.failed).toBe(1);
+      expect(final.payload.errors).toBe(1);
+    });
+
+    test('test_logs_INFO_on_full_success', async () => {
+      // Two candidates, both transition cleanly.
+      const ancient = new Date(Date.now() - 25 * 60 * 1000).toISOString();
+      primeResponses([
+        {
+          Items: [
+            mockTaskRow({ task_id: 't-1', user_id: 'u-a', created_at: ancient }),
+            mockTaskRow({ task_id: 't-2', user_id: 'u-b', created_at: ancient }),
+          ],
+        },
+        {}, {}, {}, {}, // t-1: transition + 2 events + decrement
+        {}, {}, {}, {}, // t-2: transition + 2 events + decrement
+        { Items: [] }, // HYDRATING
+      ]);
+
+      await handler();
+
+      const final = findFinalLog();
+      expect(final.level).toBe('INFO');
+      expect(final.payload.error_id).toBeUndefined();
+      expect(final.payload.stranded).toBe(2);
+      expect(final.payload.failed).toBe(2);
+      expect(final.payload.errors).toBe(0);
+    });
+
+    test('test_no_stranded_tasks_logs_INFO_not_ERROR', async () => {
+      // Empty-query case: totalStranded=0. Must NOT alarm.
+      primeResponses([
+        { Items: [] }, // SUBMITTED
+        { Items: [] }, // HYDRATING
+      ]);
+
+      await handler();
+
+      const final = findFinalLog();
+      expect(final.level).toBe('INFO');
+      expect(final.payload.stranded).toBe(0);
+      expect(final.payload.errors).toBe(0);
+      expect(errorSpy).not.toHaveBeenCalled();
+    });
+  });
+
   test('query paginates with ExclusiveStartKey when LastEvaluatedKey present', async () => {
     const ancient = new Date(Date.now() - 25 * 60 * 1000).toISOString();
     // findStrandedCandidates paginates internally and returns ALL rows