Commit db55bfa
bgagent
fix(agent): progress writer error classification + shared circuit breaker (krokoko review aws-samples#6, aws-samples#8)
Two related hardening changes on ``agent/src/progress_writer.py``.
Grouped because the shared circuit breaker reuses the error-
classification decision to know when NOT to flip itself, and
separating the commits would force an awkward "half-fix" intermediate
state.
## Findings addressed
**aws-samples#6 — Circuit breaker trips on ValidationException (permanent errors)**
The pre-fix ``except Exception`` branch fed ALL errors into the same
``_failure_count`` counter. A persistent schema/size error
(e.g. ``ValidationException`` from an item >400 KB under a
trace-heavy event) counted against the transient-failure budget and
tripped the breaker within 3 events, silencing the entire progress
stream for the rest of the task — even though most subsequent
events were normal size and would have written fine.
New behaviour classifies each DDB error into four buckets:
- **Permanent (drop event, keep stream alive):**
``ValidationException``, ``ItemCollectionSizeLimitExceededException``.
The individual event is malformed or oversized, so retrying it or
treating it as transient would not help. Log at WARN, skip the
event, do NOT increment the failure counter.
- **Immediate-disable (fatal, don't even try to retry):**
``ResourceNotFoundException``, ``AccessDeniedException``,
``UnauthorizedOperation``: wrong deploy or IAM misconfig. Disable
the writer on the first occurrence instead of waiting for 3
failures; log at WARN with the error code. Avoids spamming
operator dashboards with 3 copies of the same permissions error.
- **Transient (trip the breaker on repeated failures, as today):**
``ProvisionedThroughputExceededException``,
``RequestLimitExceeded``, ``ServiceUnavailable``,
``InternalServerError``, plus network-layer (``ConnectionError``,
``EndpointConnectionError``, ``ReadTimeoutError``). Same counter
semantics as pre-fix.
- **Unknown (default conservative):** counted as transient (counter
increments) but logged at ERROR with an explicit ``UNKNOWN``
marker so operators notice and can add new codes to the
permanent/transient lists. Does NOT instant-disable — over-
correcting from pre-fix behaviour would swap one failure mode
for another.
Uses ``botocore.exceptions.ClientError`` + ``err.response["Error"]["Code"]``
for AWS errors; class-name matching for non-ClientError (network-
layer) paths. Helper: ``_classify_ddb_error(exc) ->
Literal["permanent", "immediate_disable", "transient", "unknown"]``.
**aws-samples#8 — Dual _ProgressWriter instances with independent circuit breakers**
Pre-fix, the runner-level writer (turn/tool events at ``runner.py:240``)
and the pipeline-level writer (milestones at ``pipeline.py:303``) each
held their own ``_failure_count`` / ``_disabled`` state. If throttling
tripped one writer, the other kept writing, leaving gaps in the
stream that operators could not tell apart from genuine agent
behaviour (milestones firing after turn events stop, or vice versa).
Fix: consolidate circuit-breaker state into a module-level
``_SharedCircuitBreaker`` singleton keyed by ``task_id``. Both writers
for the same task read/write the same ``(_failure_count, _disabled)``
pair through named methods (``is_disabled``, ``record_failure``,
``record_success``, ``disable``). One task's stream is either healthy
(all events flow from both writers) or degraded (no events flow from
either). Cannot have a half-alive stream.
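The shared state described above can be sketched roughly as follows. This is an illustration, not the committed code: the class name ``SharedCircuitBreaker``, the list-based storage, and the threshold constant (taken from the "3 events" figure mentioned earlier) are assumptions.

```python
import threading

FAILURE_THRESHOLD = 3  # assumed from the "within 3 events" figure above


class SharedCircuitBreaker:
    """Breaker state keyed by task_id, shared by every writer for a task."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        # task_id -> [failure_count, disabled]
        self._state: dict[str, list] = {}

    def _entry(self, task_id: str) -> list:
        # Caller must already hold the lock.
        return self._state.setdefault(task_id, [0, False])

    def is_disabled(self, task_id: str) -> bool:
        with self._lock:
            return self._entry(task_id)[1]

    def record_failure(self, task_id: str) -> bool:
        """Count one failure; return True if the breaker is now tripped."""
        with self._lock:
            entry = self._entry(task_id)
            entry[0] += 1
            if entry[0] >= FAILURE_THRESHOLD:
                entry[1] = True
            return entry[1]

    def record_success(self, task_id: str) -> None:
        # Resets the counter only; a tripped breaker stays tripped.
        with self._lock:
            self._entry(task_id)[0] = 0

    def disable(self, task_id: str) -> None:
        # Used for the immediate-disable bucket: no counting involved.
        with self._lock:
            self._entry(task_id)[1] = True


# Two writers for the same task would both call into this one object.
breaker = SharedCircuitBreaker()
for _ in range(FAILURE_THRESHOLD):
    breaker.record_failure("task-a")
```

After the loop, any writer asking ``breaker.is_disabled("task-a")`` sees the tripped state, while a different ``task_id`` is unaffected.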
Semantics notes:
- ``record_success`` resets the shared counter but NOT the
``_disabled`` flag. Re-enabling mid-task would let a flaky
minute burn the failure budget repeatedly and defeat the
breaker's purpose.
- Empty-string ``task_id`` (``runner.py`` falls back to sentinel
``"unknown"``) collapses to shared state for all ``"unknown"``
writers. Real task_ids stay isolated.
- Writers retain ``_disabled`` / ``_failure_count`` as properties
that proxy to the shared map. Existing callers (``hooks.py``
does ``getattr(progress, "_disabled", False)``) and tests that
assign ``writer._failure_count = 0`` keep working unchanged —
no constructor signature change required, no
``runner.py`` / ``pipeline.py`` edits.
- Single ``threading.Lock()`` protects the shared map; DDB write
rate (single-digit events/sec) never contends meaningfully.
- Test hygiene: ``_reset_circuit_breakers()`` helper rebinds the
module global so autouse fixtures give each test a clean slate.
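The property-proxy and reset-helper notes above might look something like this in miniature. The names ``ProgressWriterSketch``, ``reset_circuit_breakers``, ``_get`` and ``_set`` are hypothetical, and the dict-of-dicts layout is an assumption; only the idea (attributes that read and write shared per-task state, plus a helper that rebinds the module global) comes from the text.

```python
import threading

_lock = threading.Lock()
# task_id -> {"failure_count": int, "disabled": bool}
_breakers: dict[str, dict] = {}


def _get(task_id: str, key: str):
    with _lock:
        entry = _breakers.setdefault(
            task_id, {"failure_count": 0, "disabled": False}
        )
        return entry[key]


def _set(task_id: str, key: str, value) -> None:
    with _lock:
        entry = _breakers.setdefault(
            task_id, {"failure_count": 0, "disabled": False}
        )
        entry[key] = value


def reset_circuit_breakers() -> None:
    """Rebind the module global so each test starts from empty state."""
    global _breakers
    with _lock:
        _breakers = {}


class ProgressWriterSketch:
    """Writer whose breaker attributes proxy to the shared per-task map."""

    def __init__(self, task_id: str) -> None:
        self._task_id = task_id

    @property
    def _failure_count(self) -> int:
        return _get(self._task_id, "failure_count")

    @_failure_count.setter
    def _failure_count(self, value: int) -> None:
        _set(self._task_id, "failure_count", value)

    @property
    def _disabled(self) -> bool:
        return _get(self._task_id, "disabled")

    @_disabled.setter
    def _disabled(self, value: bool) -> None:
        _set(self._task_id, "disabled", value)


runner_writer = ProgressWriterSketch("task-1")
pipeline_writer = ProgressWriterSketch("task-1")
runner_writer._disabled = True
seen_by_pipeline = getattr(pipeline_writer, "_disabled", False)
```

Because the attributes are properties, callers that use ``getattr(progress, "_disabled", False)`` or assign ``writer._failure_count = 0`` need no changes.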
## Tests
+24 regression tests net (36 → 60 in ``test_progress_writer.py``).
Coverage:
- Finding aws-samples#6 classification:
``test_permanent_error_does_not_trip_breaker`` (10 consecutive
``ValidationException`` writes keep ``_disabled=False``),
``test_transient_error_trips_breaker_as_before``,
``test_access_denied_disables_writer_immediately_with_loud_log``,
``test_unknown_exception_treated_as_transient_with_error_log``.
- Finding aws-samples#8 sharing:
``test_shared_circuit_breaker_across_writers_same_task_id``
(writer-A trips the breaker; writer-B sees ``is_disabled`` and
skips the DDB call),
``test_separate_tasks_have_independent_breakers``,
``test_unknown_sentinel_task_id_is_isolated``,
``test_reset_helper_clears_shared_state_between_tests``.
- Plus edge cases: an interleaved success resets the counter across
writers; ``_disabled`` stays set after a mid-task success (no
re-enabling); thread-safety via concurrent writes.
Agent suite: 497 passing (was 473; +24).
Refs: krokoko code review on PR aws-samples#52 (findings 6, 8)
1 parent 9e6c23f commit db55bfa
2 files changed
Lines changed: 702 additions & 16 deletions