You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Two small hardening changes paired under the theme "surface silent
failures at construction time / at run time". Both make existing
latent bugs visible instead of silently breaking at the worst
possible moment.
## Findings addressed
**aws-samples#7 — Reconciler silently succeeds when ALL transitions fail**
``reconcile-stranded-tasks`` previously logged per-task failures at
WARN and returned success unconditionally. A systemic failure (DDB
throttling at the shard level, IAM outage, schema corruption) could
strand 100% of candidates — and the only signal was a WARN log per
task plus a final INFO that looked like a healthy run. Operators
dashboarding "reconciler completed" counts would not notice the
outage.
New behaviour classifies the run result into three cases and picks
a log level accordingly:
- ``stranded > 0 AND failed == 0 AND errors > 0`` → ERROR with
``error_id: 'RECONCILER_TOTAL_FAILURE'``. Systemic failure; alarm
on the error_id string.
- ``errors > 0 AND failed > 0`` → WARN with
``error_id: 'RECONCILER_PARTIAL_FAILURE'``. Dashboards signal;
not alarm-worthy on its own.
- Otherwise → INFO, as today.
The handler still does NOT throw — event-source-mapping invocations
complete normally. The log-level escalation IS the alarm signal,
matching the ``error_id`` convention already used in
``fanout-task-events.ts`` (``FANOUT_GITHUB_PERSIST_FAILED``).
**aws-samples#11 — TaskConfig missing @model_validator for trace=True + user_id=""**
The trace trajectory is uploaded to
``traces/<user_id>/<task_id>.jsonl.gz`` (design §10.1), and the
``get-trace-url`` handler refuses presigned keys outside the
caller's own ``traces/<user_id>/`` prefix. Pre-fix, a TaskConfig
built with ``trace=True`` and an empty ``user_id`` sentinel would
construct fine and fail later at S3 upload time — mid-task, when
the agent had already paid the cost of running.
Added a ``@model_validator(mode='after')`` on ``TaskConfig`` that
raises a descriptive ``ValueError`` when ``trace=True`` and
``user_id`` is empty. Construction fails immediately; local/dev
misconfigurations surface before the agent wastes tokens.
The error message cites design §10.1 + the get-trace-url handler's
prefix guard so the remedy is clear without cross-referencing other
files.
## Tests
**CDK reconciler (+4 tests):**
- Total-failure case logs ERROR + ``RECONCILER_TOTAL_FAILURE``.
- Partial-failure case logs WARN + ``RECONCILER_PARTIAL_FAILURE``.
- Full-success case logs INFO (happy-path regression).
- Empty-candidate case logs INFO (not alarming — absence of
stranded tasks is the target state).
CDK suite: 1036 passing (was 1032).
**Agent TaskConfig (+3 tests, +2 pipeline realignments):**
- ``test_trace_true_with_empty_user_id_raises_at_construction`` —
validator fires immediately with the documented message fragment.
- ``test_trace_true_with_valid_user_id_constructs_cleanly`` —
happy-path regression.
- ``test_trace_false_allows_empty_user_id`` — negative control;
local/batch runs without an orchestrator still work as long as
they do not opt into trace capture.
- Two existing ``test_pipeline.py`` tests constructed
``TaskConfig(trace=True, user_id="")`` directly. These now
either (a) pass a real ``user_id`` for the happy path, or (b)
assert that construction raises — the tightened contract is
strictly stronger than the previous "defensive-at-upload skip".
Agent suite: 500 passing (was 498; +3 new, −1 obsolete, +0 net from
pipeline realignment).
Refs: krokoko code review on PR aws-samples#52 (findings 7, 11)
0 commit comments