fix(agent): cancel hook short-circuits nudge consumption (krokoko review aws-samples#3)

bgagent · bgagent · commit 9e6c23ff11b9 · 2026-05-05T10:40:52.000-07:00

Pre-fix, the between-turns-hooks dispatcher at ``agent/src/hooks.py`` ran EVERY registered hook in the list before checking ``ctx["_cancel_requested"]``. The registered order is cancel-hook first, nudge-hook second, so when cancel fired, the nudge hook had already run and:

- Read ``TaskNudgesTable`` via ``read_pending``.
- Marked the nudge rows ``status=consumed`` via ``mark_consumed``.
- Added them to the in-memory ``_INJECTED_NUDGES`` dedup set.

But the dispatcher's return value discarded everything when cancel was detected, so the nudge was silently lost. The user's ``bgagent nudge`` had returned 202 Accepted, the DDB row was now ``consumed``, yet the agent never saw the nudge content. A cancelled task leaks a nudge on every invocation.

## Fix: belt-and-braces

Two independent guards so the invariant is preserved even if a future refactor reorders the hook registry:

1. **Loop-level break (primary).** The between-turns-hooks dispatcher checks ``ctx.get("_cancel_requested")`` immediately after each hook returns and breaks out of the loop. Cancel-hook runs first by convention, flips the flag, and the nudge-hook never runs. No DDB reads, no status mutation.
2. **Internal early-return in the nudge hook (secondary).** ``_nudge_between_turns_hook`` checks ``ctx.get("_cancel_requested")`` up front (right after the empty-task-id guard, BEFORE any DDB call and BEFORE ``_emit_nudge_milestone`` would write a spurious ``nudge_acknowledged``). Also guards against the pre-loop-cancel case (cancel flag already set when the dispatcher enters).

The two guards are independent: if the registry ordering breaks (e.g. Phase 3 prepends an approval hook that flips cancel after the registered cancel hook), the internal early-return still protects against data loss.

Hardened the registry-declaration comment at ``agent/src/hooks.py`` to state the ordering is load-bearing ("cancel MUST come before nudge") and documented how future hooks should be appended.
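
The two guards can be sketched in isolation as below. This is a simplified illustration, not the code in ``agent/src/hooks.py``: the names ``dispatch``, ``cancel_hook`` and ``nudge_hook`` and the ``status`` and ``rows`` keys on ``ctx`` are hypothetical stand-ins, and a plain dict replaces the DynamoDB table that the real hook reaches through ``nudge_reader``. The real dispatcher, ``stop_hook``, is async and also handles hook exceptions.

```python
from typing import Callable

BetweenTurnsHook = Callable[[dict], list[str]]


def cancel_hook(ctx: dict) -> list[str]:
    """Flip the cancel flag when the simulated task status is CANCELLED."""
    if ctx.get("status") == "CANCELLED":
        ctx["_cancel_requested"] = True
    return []


def nudge_hook(ctx: dict) -> list[str]:
    """Consume pending nudges unless cancel has already been flagged."""
    # Secondary guard: return before any read or status mutation.
    if ctx.get("_cancel_requested"):
        return []
    produced: list[str] = []
    # ctx["rows"] is a hypothetical in-memory stand-in for the nudge table.
    for row in ctx.get("rows", {}).values():
        if row["status"] == "pending":
            row["status"] = "consumed"
            produced.append(row["message"])
    return produced


def dispatch(hooks: list[BetweenTurnsHook], ctx: dict) -> list[str]:
    """Run hooks in list order, stopping as soon as cancel is flagged."""
    chunks: list[str] = []
    for hook in hooks:
        chunks.extend(hook(ctx))
        # Primary guard: later hooks never run once cancel is requested.
        if ctx.get("_cancel_requested"):
            break
    return chunks


def make_ctx(status: str) -> dict:
    """Build a fresh context with one pending nudge row."""
    return {
        "status": status,
        "rows": {"n-1": {"message": "add logging", "status": "pending"}},
    }
```

With ``make_ctx("CANCELLED")``, ``dispatch([cancel_hook, nudge_hook], ctx)`` leaves the row pending because the loop breaks after the cancel hook. With ``make_ctx("RUNNING")`` the row is consumed and its message is returned. Calling ``nudge_hook`` directly with ``_cancel_requested`` already set exercises the secondary guard alone.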

Did NOT add a runtime sort: the list literal preserves insertion order deterministically, and forcing an implicit sort would hide real bugs from anyone intentionally reordering.

## Tests

+5 regression tests in a new ``TestCancelShortCircuitsNudgeConsumption`` class in ``agent/tests/test_cancel_hook.py``:

- ``test_nudge_hook_not_invoked_when_cancel_fires_first``: spy hook verifies invocation count is 0 after cancel; dedup set untouched.
- ``test_real_nudge_reader_not_touched_on_cancel``: end-to-end with mocked DDB; asserts neither ``table.query`` nor ``update_item`` is called when cancel fires first.
- ``test_preloop_cancel_skips_all_hooks_via_internal_guard``: the nudge hook is invoked directly with ``_cancel_requested=True``; no DDB I/O, no dedup mutation.
- ``test_nudge_hook_internal_guard_fires_even_if_registry_reordered``: asserts ``write_agent_milestone`` is NOT called when cancel is set (no spurious ack milestone for a cancelled turn).
- ``test_running_task_nudge_still_consumed_normally``: negative-control regression; RUNNING tasks still flow through the cancel→nudge→inject path with DDB calls firing exactly once.

Agent suite: 473 passing (was 468).

Refs: krokoko code review on PR aws-samples#52 (finding 3)
diff --git a/agent/src/hooks.py b/agent/src/hooks.py
@@ -248,6 +248,21 @@ def _nudge_between_turns_hook(ctx: dict) -> list[str]:
     if not task_id:
         return []
 
+    # Belt-and-braces second guard against the "cancel consumes nudges" hazard
+    # (krokoko PR #52 review finding #3).  The primary guard is the loop-level
+    # break in :func:`stop_hook` which short-circuits the dispatcher as soon as
+    # any earlier hook sets ``_cancel_requested``.  That assumes
+    # ``_cancel_between_turns_hook`` runs BEFORE this hook — true for the
+    # module-level ``between_turns_hooks`` registry today, but a
+    # future reorder (or a test that rebinds the list without preserving
+    # order) would silently reintroduce the bug: ``read_pending`` +
+    # ``mark_consumed`` would flip the DDB rows to consumed and stamp
+    # ``_INJECTED_NUDGES`` for a dying agent that will never see the text.
+    # Early-returning here makes the invariant structural — no nudges are
+    # ever consumed once cancel is flagged, regardless of hook ordering.
+    if ctx.get("_cancel_requested"):
+        return []
+
     try:
         pending = nudge_reader.read_pending(task_id)
     except Exception as exc:
@@ -334,9 +349,15 @@ def _cancel_between_turns_hook(ctx: dict) -> list[str]:
     return []
 
 
-# Global list of between-turns hooks.  Cancel runs first so it can short-circuit
-# nudges on cancelled tasks (no point injecting nudges into a dying agent).
-# Phase 3 (approval gates) will ``append`` additional hooks here.
+# Global list of between-turns hooks.  Cancel MUST run first so it can
+# short-circuit nudges on cancelled tasks (no point injecting nudges into a
+# dying agent — worse, the nudge reader mutates DDB state that the agent will
+# never act on; see krokoko PR #52 review finding #3).  The :func:`stop_hook`
+# dispatcher breaks out of the loop as soon as ``_cancel_requested`` is set,
+# and :func:`_nudge_between_turns_hook` early-returns when the flag is already
+# present — belt-and-braces in case a future ``append`` reorders this list.
+# Phase 3 (approval gates) should ``append`` additional hooks AFTER the
+# nudge reader to preserve cancel-wins semantics.
 between_turns_hooks: list[BetweenTurnsHook] = [
     _cancel_between_turns_hook,
     _nudge_between_turns_hook,
@@ -371,6 +392,20 @@ async def stop_hook(
         "progress": progress,
     }
 
+    # Cancel-before-nudge short-circuit (krokoko PR #52 review finding #3).
+    # Previously the loop ran ALL hooks before checking ``_cancel_requested``,
+    # which meant the nudge hook's ``read_pending`` + ``mark_consumed`` path
+    # executed even on cancelled tasks — flipping the DDB rows to consumed
+    # and stamping ``_INJECTED_NUDGES`` for a dying agent.  The user saw a
+    # 202 Accepted for their nudge but the injection was discarded when we
+    # returned ``continue_=False`` below.  Breaking out of the loop as soon
+    # as any hook sets ``_cancel_requested`` guarantees subsequent hooks
+    # (notably the nudge reader) never run, so DDB state is never mutated
+    # for work the agent will never do.  The module-level registry keeps
+    # ``_cancel_between_turns_hook`` first so this break fires before the
+    # nudge hook gets a chance.  ``_nudge_between_turns_hook`` also carries
+    # an internal cancel-check as belt-and-braces in case a future refactor
+    # reorders the registry.
     chunks: list[str] = []
     for hook in between_turns_hooks:
         try:
@@ -383,6 +418,13 @@ async def stop_hook(
             continue
         if produced:
             chunks.extend(produced)
+        if ctx.get("_cancel_requested"):
+            # Any text produced by earlier hooks in this same dispatch pass
+            # is discarded below — the ``_cancel_requested`` branch returns
+            # ``continue_=False`` and never reads ``chunks``.  This is
+            # intentional: cancel wins, and we would rather drop a
+            # simultaneous nudge than inject into a dying agent.
+            break
 
     # Cancel takes precedence over nudge injection.  ``continue_: False`` tells
     # the SDK to end the turn loop and return control to the caller, which
diff --git a/agent/tests/test_cancel_hook.py b/agent/tests/test_cancel_hook.py
@@ -18,6 +18,7 @@
 import pytest
 
 import hooks as hooks_mod
+import nudge_reader
 import task_state
 
 
@@ -29,8 +30,12 @@ def _run(coro):
 def _reset():
     # Restore the default registry after each test.
     original = list(hooks_mod.between_turns_hooks)
+    nudge_reader._reset_cache_for_tests()
+    hooks_mod._reset_injected_nudges_for_tests()
     yield
     hooks_mod.between_turns_hooks[:] = original
+    nudge_reader._reset_cache_for_tests()
+    hooks_mod._reset_injected_nudges_for_tests()
 
 
 class TestCancelBetweenTurnsHook:
@@ -180,3 +185,237 @@ def test_milestone_emitted_on_cancel_detect(self, monkeypatch):
         progress.write_agent_milestone.assert_called_once()
         call_kwargs = progress.write_agent_milestone.call_args.kwargs
         assert call_kwargs["milestone"] == "cancel_detected"
+
+
+class TestCancelShortCircuitsNudgeConsumption:
+    """Regression for krokoko PR #52 review finding #3.
+
+    Before the fix, :func:`stop_hook` iterated ALL between-turns hooks BEFORE
+    checking ``_cancel_requested`` — so when cancel fired, the nudge hook had
+    already run, mutated DDB (``mark_consumed`` + stamped ``_INJECTED_NUDGES``),
+    and had its return value silently discarded by the cancel branch.  Users
+    saw a 202 Accepted for their nudge but the instruction was never injected
+    into the (dying) agent.
+
+    The fix is two-layered:
+    1. ``stop_hook`` breaks out of the dispatcher loop as soon as any hook
+       sets ``_cancel_requested``, so the nudge hook never runs on a
+       cancelled turn.
+    2. ``_nudge_between_turns_hook`` itself early-returns when
+       ``_cancel_requested`` is already present, as belt-and-braces in
+       case a future refactor reorders the registry.
+    """
+
+    def test_nudge_hook_not_invoked_when_cancel_fires_first(self, monkeypatch):
+        """Happy-path regression: cancel hook flips sentinel → nudge hook is
+        never called → DDB query never issued → injected-nudges set untouched.
+        """
+        monkeypatch.setattr(task_state, "get_task", lambda _tid: {"status": "CANCELLED"})
+
+        nudge_calls = {"count": 0}
+
+        def _spy_nudge(_ctx):
+            nudge_calls["count"] += 1
+            return ["<user_nudge>should never be injected</user_nudge>"]
+
+        hooks_mod.between_turns_hooks[:] = [
+            hooks_mod._cancel_between_turns_hook,
+            _spy_nudge,
+        ]
+
+        result = _run(
+            hooks_mod.stop_hook(
+                hook_input={},
+                tool_use_id=None,
+                hook_context=None,
+                task_id="t-cancel-nudge-race",
+                progress=MagicMock(),
+            )
+        )
+
+        # Cancel-wins semantics unchanged.
+        assert result == {
+            "continue_": False,
+            "stopReason": "Task cancelled by user",
+        }
+        # Critical invariant: the nudge hook was NEVER called.  Before the
+        # fix, ``nudge_calls["count"]`` would have been 1 and the pending
+        # DDB row would have been marked consumed.
+        assert nudge_calls["count"] == 0
+        # In-process dedup set must be untouched — the "task set" should not
+        # have been created because the nudge hook never ran.
+        assert "t-cancel-nudge-race" not in hooks_mod._INJECTED_NUDGES
+
+    def test_real_nudge_reader_not_touched_on_cancel(self, monkeypatch):
+        """End-to-end regression: with the ACTUAL ``_nudge_between_turns_hook``
+        registered alongside the cancel hook, a pending DDB row MUST NOT be
+        read or marked consumed when cancel fires in the same turn.
+
+        This is the scenario the review was concerned about — a user submits
+        a nudge, then immediately cancels, and the nudge disappears silently
+        because it was consumed but never injected.
+        """
+        monkeypatch.setattr(task_state, "get_task", lambda _tid: {"status": "CANCELLED"})
+
+        table = MagicMock()
+        # If the nudge hook runs, it would see this pending row.
+        table.query.return_value = {
+            "Items": [
+                {
+                    "task_id": "t-cancel-real",
+                    "nudge_id": "01NUDGE",
+                    "message": "please add logging",
+                    "created_at": "2026-05-05T12:00:00Z",
+                    "consumed": False,
+                }
+            ]
+        }
+        table.update_item.return_value = {}
+        nudge_reader._TABLE_CACHE = table
+
+        # Default registry order: cancel first, nudge second.
+        hooks_mod.between_turns_hooks[:] = [
+            hooks_mod._cancel_between_turns_hook,
+            hooks_mod._nudge_between_turns_hook,
+        ]
+
+        result = _run(
+            hooks_mod.stop_hook(
+                hook_input={},
+                tool_use_id=None,
+                hook_context=None,
+                task_id="t-cancel-real",
+                progress=MagicMock(),
+            )
+        )
+
+        assert result["continue_"] is False
+        # DDB must not have been queried — the nudge hook never ran.
+        table.query.assert_not_called()
+        # And therefore no ``mark_consumed`` call either.
+        table.update_item.assert_not_called()
+
+    def test_preloop_cancel_skips_all_hooks_via_internal_guard(self, monkeypatch):
+        """If cancel is already flagged on ``ctx`` entering the dispatcher
+        (e.g. a Phase 3 hook prepended to the registry sets it, or a future
+        code path stamps the flag before hook dispatch), the nudge hook's
+        own early-return covers it.
+
+        Today ``stop_hook`` builds ``ctx`` fresh each call so the pre-loop
+        case is not reachable from the normal SDK entry point, but the
+        nudge hook's internal guard is tested here directly to document the
+        second line of defence.
+        """
+        table = MagicMock()
+        table.query.return_value = {
+            "Items": [
+                {
+                    "task_id": "t-preloop",
+                    "nudge_id": "01PRELOOP",
+                    "message": "should not be consumed",
+                    "created_at": "2026-05-05T12:00:00Z",
+                    "consumed": False,
+                }
+            ]
+        }
+        table.update_item.return_value = {}
+        nudge_reader._TABLE_CACHE = table
+
+        # Cancel sentinel already set on ctx entering the nudge hook.
+        ctx = {"task_id": "t-preloop", "_cancel_requested": True}
+        result = hooks_mod._nudge_between_turns_hook(ctx)
+
+        assert result == []
+        # Belt-and-braces check: the nudge hook returned before any DDB I/O.
+        table.query.assert_not_called()
+        table.update_item.assert_not_called()
+        # And the in-process dedup set was not stamped.
+        assert "t-preloop" not in hooks_mod._INJECTED_NUDGES
+
+    def test_nudge_hook_internal_guard_fires_even_if_registry_reordered(
+        self, monkeypatch
+    ):
+        """If a future refactor accidentally puts nudge before cancel in the
+        registry, the loop-level break no longer helps, but the nudge
+        hook's own ``_cancel_requested`` check still has to short-circuit.
+
+        Registering a synthetic "early cancel" hook ahead of the nudge hook
+        would only exercise the loop-level break, which the tests above
+        already cover.  Instead, this test verifies the nudge hook's
+        internal guard by driving the hook directly with cancel already
+        set in ctx and an attached progress writer, and asserts that no
+        ``nudge_acknowledged`` milestone is written.
+        """
+        table = MagicMock()
+        table.query.return_value = {
+            "Items": [
+                {
+                    "task_id": "t-guard",
+                    "nudge_id": "01GUARD",
+                    "message": "must not inject",
+                    "created_at": "ts",
+                    "consumed": False,
+                }
+            ]
+        }
+        table.update_item.return_value = {}
+        nudge_reader._TABLE_CACHE = table
+
+        progress = MagicMock()
+        ctx = {
+            "task_id": "t-guard",
+            "progress": progress,
+            "_cancel_requested": True,
+        }
+        result = hooks_mod._nudge_between_turns_hook(ctx)
+
+        assert result == []
+        # The early-return happens before ``_emit_nudge_milestone`` — no
+        # ``nudge_acknowledged`` event should be written for a cancelled task.
+        progress.write_agent_milestone.assert_not_called()
+        table.query.assert_not_called()
+        table.update_item.assert_not_called()
+
+    def test_running_task_nudge_still_consumed_normally(self, monkeypatch):
+        """Negative control: the guard must not regress the happy path.
+
+        A RUNNING task with a pending nudge should still flow through:
+        cancel hook returns [] without setting the sentinel, nudge hook
+        reads + consumes + injects as before.
+        """
+        monkeypatch.setattr(task_state, "get_task", lambda _tid: {"status": "RUNNING"})
+
+        table = MagicMock()
+        table.query.return_value = {
+            "Items": [
+                {
+                    "task_id": "t-live",
+                    "nudge_id": "01LIVE",
+                    "message": "add docs",
+                    "created_at": "ts",
+                    "consumed": False,
+                }
+            ]
+        }
+        table.update_item.return_value = {}
+        nudge_reader._TABLE_CACHE = table
+
+        hooks_mod.between_turns_hooks[:] = [
+            hooks_mod._cancel_between_turns_hook,
+            hooks_mod._nudge_between_turns_hook,
+        ]
+
+        result = _run(
+            hooks_mod.stop_hook(
+                hook_input={},
+                tool_use_id=None,
+                hook_context=None,
+                task_id="t-live",
+                progress=MagicMock(),
+            )
+        )
+
+        assert result["decision"] == "block"
+        assert "add docs" in result["reason"]
+        table.query.assert_called_once()
+        table.update_item.assert_called_once()