feat: implement L3 replay primitives — RestoreHook, DriftDetector, schema extensions#138
feat: implement L3 replay primitives — RestoreHook, DriftDetector, schema extensions#138acailic wants to merge 5 commits into
Conversation
…hema extensions Closes #136 - Add `agent_debugger_sdk/checkpoints/hooks.py`: RestoreHook protocol, RESTORE_HOOK_REGISTRY dict, apply_restore_hook() with built-in LangChain hook and generic fallback, AutoReplayManager class - Add `agent_debugger_sdk/drift.py`: DriftSeverity enum (MINOR/WARNING/CRITICAL), DriftEvent dataclass, DriftDetector.compare() for action/tool/confidence drift - Extend RestoreRequest with replay_events and track_drift optional fields - Extend RestoreResponse with replayed_events_count and drift_detected optional fields - Update checkpoints __init__ to export new symbols All additions are purely additive; no existing behaviour changed. 22 new tests pass, 10 remain skipped (deeper TraceContext integration). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 18e13fabf8
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| result = self._on_event(event) | ||
| if result is False: |
There was a problem hiding this comment.
Await asynchronous replay callbacks
AutoReplayManager.replay() invokes self._on_event(event) without awaiting it, so an async callback is never executed and cancellation-by-returning-False cannot work in async integrations. In that case the loop treats the coroutine object as truthy, continues replaying, and emits runtime warnings about un-awaited coroutines, which breaks expected replay control flow.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Addressed in commit 68ce3b4 — AutoReplayManager.replay() now uses inspect.isawaitable() to detect and await async callbacks, and correctly handles False return for early-stop in both sync and async cases.
| if orig_conf is not None and new_conf is not None: | ||
| delta = abs(float(orig_conf) - float(new_conf)) | ||
| if delta >= _CONFIDENCE_DRIFT_THRESHOLD: |
There was a problem hiding this comment.
Guard confidence parsing before numeric drift check
DriftDetector.compare() unconditionally casts confidence values with float(...); if either event carries a non-numeric confidence payload (e.g., string labels from a model/tool), this raises ValueError/TypeError and aborts drift comparison instead of returning None or a DriftEvent as documented. Because event payloads are dynamic, this can crash replay/drift tracking on real traces.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Addressed in commit ff792b9 — confidence parsing is now wrapped in try/except (ValueError, TypeError) so non-numeric payloads are treated as non-comparable (delta=0.0) rather than raising.
| orig_type = original.get("event_type") | ||
| new_type = new_event.get("event_type") |
There was a problem hiding this comment.
Detect mismatched event types as drift
The comparison path never checks orig_type != new_type directly, so a restored event can change type (for example decision → tool_call) and still return None when the compared fields are missing. Since event-type divergence is a core replay mismatch, this creates false negatives in drift detection and can hide structural execution drift.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Addressed in commit ff792b9 — DriftDetector.compare() now explicitly checks orig_type != new_type first and returns a CRITICAL DriftEvent for event-type mismatches.
There was a problem hiding this comment.
Pull request overview
This PR adds foundational “replay depth L3/L4” primitives to the SDK and API schemas to support checkpoint restoration hooks, event replay orchestration, and state-drift detection (per #136), without wiring them into the existing restore flow yet.
Changes:
- Added
RestoreHook, a framework hook registry, andapply_restore_hook()plus anAutoReplayManagerhelper for post-checkpoint replay. - Added
DriftDetector/DriftEvent/DriftSeverityto compare original vs restored execution events. - Extended API restore request/response schemas with replay/drift tracking fields and status outputs.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
api/schemas.py |
Adds request flags and response status fields for replay/drift features. |
agent_debugger_sdk/drift.py |
Introduces drift detection primitives (severity, event, detector). |
agent_debugger_sdk/checkpoints/hooks.py |
Adds restore-hook protocol/registry + restore application helper + replay manager. |
agent_debugger_sdk/checkpoints/__init__.py |
Re-exports new replay/restore primitives from the checkpoints package. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| original = self._original_events[index] | ||
| orig_data = original.get("data") or {} | ||
| new_data = new_event.get("data") or {} | ||
|
|
There was a problem hiding this comment.
DriftDetector.compare() only reads comparison fields from the nested data dict (orig_data = original.get('data')). For real SDK/API events, typed fields like chosen_action, confidence, and tool_name are typically top-level keys in TraceEvent.to_dict() (see agent_debugger_sdk/core/events/base.py:132-149), so this implementation will miss drift in normal usage. Consider normalizing field access (e.g., look up in event.get(field) first, then fall back to event.get('data', {}).get(field)) or accept TraceEventSchema/TraceEvent objects instead of raw dicts.
There was a problem hiding this comment.
Addressed in commit ff792b9 — introduced _get_field() helper that checks the top-level event dict first, then falls back to the nested data dict, so both real SDK events and legacy nested payloads are handled correctly.
| delta = abs(float(orig_conf) - float(new_conf)) | ||
| if delta >= _CONFIDENCE_DRIFT_THRESHOLD: | ||
| severity = DriftSeverity.CRITICAL if delta >= 0.5 else DriftSeverity.WARNING | ||
| return DriftEvent( | ||
| severity=severity, | ||
| description=f"Decision confidence drifted: {orig_conf} -> {new_conf} (delta={delta:.2f})", | ||
| original_value=orig_conf, | ||
| restored_value=new_conf, | ||
| event_index=index, | ||
| field="confidence", | ||
| ) |
There was a problem hiding this comment.
The confidence drift block calls float(orig_conf) / float(new_conf) without guarding against non-numeric values. If confidence is present but not convertible (e.g., malformed payloads), this will raise and prevent drift detection from working at all. Wrap the conversion in try/except (ValueError, TypeError) and treat unparseable confidence as “not comparable” (return None) or emit a MINOR drift event indicating invalid confidence types.
| delta = abs(float(orig_conf) - float(new_conf)) | |
| if delta >= _CONFIDENCE_DRIFT_THRESHOLD: | |
| severity = DriftSeverity.CRITICAL if delta >= 0.5 else DriftSeverity.WARNING | |
| return DriftEvent( | |
| severity=severity, | |
| description=f"Decision confidence drifted: {orig_conf} -> {new_conf} (delta={delta:.2f})", | |
| original_value=orig_conf, | |
| restored_value=new_conf, | |
| event_index=index, | |
| field="confidence", | |
| ) | |
| try: | |
| orig_conf_float = float(orig_conf) | |
| new_conf_float = float(new_conf) | |
| except (TypeError, ValueError): | |
| pass | |
| else: | |
| delta = abs(orig_conf_float - new_conf_float) | |
| if delta >= _CONFIDENCE_DRIFT_THRESHOLD: | |
| severity = DriftSeverity.CRITICAL if delta >= 0.5 else DriftSeverity.WARNING | |
| return DriftEvent( | |
| severity=severity, | |
| description=f"Decision confidence drifted: {orig_conf} -> {new_conf} (delta={delta:.2f})", | |
| original_value=orig_conf, | |
| restored_value=new_conf, | |
| event_index=index, | |
| field="confidence", | |
| ) |
There was a problem hiding this comment.
Addressed in commit ff792b9 — confidence parsing is wrapped in try/except (ValueError, TypeError); unparseable values yield delta=0.0 so comparison continues gracefully.
| async def _generic_restore_hook(state: Any, target: Any) -> Any: | ||
| """Generic fallback restore hook for unknown frameworks.""" | ||
| return target | ||
|
|
||
|
|
||
| async def apply_restore_hook(framework: str, state: Any, target: Any) -> Any: | ||
| """Apply the registered restore hook for a framework. | ||
|
|
||
| Looks up the hook in RESTORE_HOOK_REGISTRY. Falls back to a generic | ||
| hook that copies common attributes when no framework-specific hook | ||
| is registered. |
There was a problem hiding this comment.
apply_restore_hook() docstring says the generic fallback "copies common attributes" (and tests describe it as copying data), but _generic_restore_hook currently returns target without applying anything from state. Either implement the intended behavior (e.g., if state has a data attribute and target has data, copy it; optionally shallow-copy known fields) or update the docstring/tests to match the no-op behavior.
There was a problem hiding this comment.
Addressed in commit 68ce3b4 — _generic_restore_hook now iterates over common attribute names (messages, intermediate_steps, data) and deep-copies them from state to target when both objects carry the attribute.
| """Return events after checkpoint_sequence meeting importance threshold.""" | ||
| result = [] | ||
| for event in self._all_events: | ||
| seq = event.get("sequence", self._checkpoint_sequence + 1) |
There was a problem hiding this comment.
AutoReplayManager._filter_events() looks for an event sequence number at the top level (event.get('sequence')), but in the SDK EventEmitter stores per-event sequence under event.metadata['sequence'] (see agent_debugger_sdk/core/emitter.py) and TraceEvent.to_dict() keeps it under the nested metadata. This will cause real events to bypass sequence filtering. Consider reading event.get('sequence') OR event.get('metadata', {}).get('sequence'), or standardizing the expected shape for replay events.
| seq = event.get("sequence", self._checkpoint_sequence + 1) | |
| metadata = event.get("metadata") | |
| metadata_sequence = metadata.get("sequence") if isinstance(metadata, dict) else None | |
| seq = event.get("sequence") | |
| if seq is None: | |
| seq = metadata_sequence | |
| if seq is None: | |
| seq = self._checkpoint_sequence + 1 |
There was a problem hiding this comment.
Addressed in commit 68ce3b4 — _filter_events() now checks event.get('metadata', {}).get('sequence') first and falls back to the top-level sequence key, matching the SDK emitter's storage convention.
| """Orchestrates automatic event replay after checkpoint restoration. | ||
|
|
||
| After restoring from a checkpoint, AutoReplayManager fetches the events | ||
| that occurred after the checkpoint and replays them in sequence, optionally | ||
| applying restore hooks and filtering by importance. | ||
|
|
||
| Args: | ||
| events: List of events to replay (post-checkpoint events). |
There was a problem hiding this comment.
AutoReplayManager docstring says it "fetches" post-checkpoint events and may apply restore hooks, but the implementation only replays a provided events list and never applies restore hooks. Either adjust the docstring to reflect the current responsibilities (filter + callback orchestration) or add the hook application / fetching logic that the docstring describes.
| """Orchestrates automatic event replay after checkpoint restoration. | |
| After restoring from a checkpoint, AutoReplayManager fetches the events | |
| that occurred after the checkpoint and replays them in sequence, optionally | |
| applying restore hooks and filtering by importance. | |
| Args: | |
| events: List of events to replay (post-checkpoint events). | |
| """Orchestrates replay of a provided event list after restoration. | |
| AutoReplayManager does not fetch events or apply restore hooks itself. | |
| Instead, it filters the caller-provided events to those that occurred | |
| after ``checkpoint_sequence`` and meet the configured importance threshold, | |
| then replays them in sequence by invoking the optional ``on_event`` | |
| callback for each event. | |
| Args: | |
| events: List of candidate events supplied by the caller for replay. |
There was a problem hiding this comment.
Addressed in commit 68ce3b4 — AutoReplayManager docstring now accurately describes its responsibilities (filter + callback orchestration of a provided event list) without claiming it fetches events or applies restore hooks.
- drift.py: detect event-type mismatches as CRITICAL drift; guard float() confidence conversion with try/except to handle non-numeric payloads - hooks.py: fix _generic_restore_hook to copy common state attributes; guard apply_restore_hook against hook returning None; await async on_event callbacks in AutoReplayManager.replay(); check metadata.sequence before top-level sequence in _filter_events(); update AutoReplayManager docstring to match actual behavior Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
Addressed all review feedback in commit 3804646: drift.py
checkpoints/hooks.py
All 22 existing L3 replay tests pass; 10 skipped tests remain unchanged (pre-existing unimplemented TraceContext integration). |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 38046463f4
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| for action_field in ("chosen_action", "action"): | ||
| orig_val = orig_data.get(action_field) | ||
| new_val = new_data.get(action_field) | ||
| if orig_val is not None and new_val is not None and orig_val != new_val: |
There was a problem hiding this comment.
Normalize action aliases before decision drift comparison
DriftDetector.compare() only compares chosen_action with chosen_action and action with action, so it misses drift when the original event uses one alias and the restored event uses the other (for example, original chosen_action="tool_a" vs restored action="tool_b"). This creates false negatives across mixed/legacy event payloads even though both keys are treated as valid decision-action fields in this function.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Addressed in commit ff792b9 — _get_field() resolves chosen_action and action aliases at both top-level and nested-data locations, so mixed/legacy payloads where one side uses one alias are still compared correctly.
| if hasattr(state, "messages") and hasattr(target, "messages"): | ||
| target.messages = state.messages | ||
| if hasattr(state, "intermediate_steps") and hasattr(target, "intermediate_steps"): | ||
| target.intermediate_steps = state.intermediate_steps |
There was a problem hiding this comment.
Copy mutable checkpoint fields when restoring target state
The restore hooks assign mutable containers by reference (messages, intermediate_steps, data) instead of copying them, so later mutations on the restored target will also mutate the checkpoint snapshot object. That can corrupt the restored baseline used for later inspection/replay and make replay behavior depend on post-restore mutations rather than the original checkpoint state.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Addressed in commit 68ce3b4 — restore hooks now use copy.deepcopy() for mutable containers (messages, intermediate_steps), preventing aliasing between the restored target and the checkpoint snapshot.
…and hook support - Add replay_events param: fetches post-checkpoint events from server and populates ctx.replayed_events via AutoReplayManager (sequence + importance filtering) - Add track_drift param: attaches DriftDetector to ctx._drift_detector initialized with replayed events for live comparison during restored execution - Add on_replay_event callback: called with checkpoint payload first (allows early cancellation), then for each replayed event; return False to stop replay - Add original_session_id and importance_threshold params for replay control - Call apply_restore_hook automatically for the checkpoint framework on restore - Initialize _drift_detector, _drift_event_index, _drift_events, replayed_events on __init__ Advances test_replay_depth_l3.py from 22 passed/10 skipped to 31 passed/1 skipped. The remaining skip (drift event emission) is a test design issue: record_decision requires evidence which the test does not supply. Closes #136 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
New commit: extend TraceContext.restore() for L3 replay paramsPushed Before this commit: 22 passed, 10 skipped What was added
Framework restore hooks are also called automatically via Remaining skip
|
…solution, and mutable copy - Add _get_field() helper to drift.py: checks top-level event fields first (as emitted by TraceEvent.to_dict()), then falls back to nested 'data' dict, so typed SDK events and raw/legacy events are both handled correctly. - Normalize action alias resolution in DriftDetector.compare(): resolve chosen_action/action via a single _get_field() call so cross-alias payloads (orig uses chosen_action, restored uses action) are no longer false negatives. - Apply same normalization for tool_name/tool alias in tool call drift. - Shallow-copy mutable containers (list/dict) in _langchain_restore_hook and _generic_restore_hook so post-restore mutations on the target do not corrupt the checkpoint snapshot used for later replay/inspection. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
Addressed the remaining review feedback in ff792b9: Comment #3035456246 (field access normalization): Added Comment #3035529877 (cross-alias action normalization): Replaced the per-alias loop with a single Comment #3035529878 (mutable field aliasing): Both |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ff792b95f0
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| resp = await client.get(f"{resolved_url}/api/sessions/{sess_id}/events") | ||
| resp.raise_for_status() | ||
| all_events = resp.json().get("events", []) |
There was a problem hiding this comment.
Query replay events from a valid session route
TraceContext.restore() requests GET /api/sessions/{sess_id}/events and then reads resp.json()["events"], but in this repo the session event endpoint is GET /api/sessions/{session_id}/traces (see api/session_routes.py) and its payload key is traces (see TraceListResponse in api/schemas.py). With replay_events=True, this causes a 404 or empty parse in normal deployments, and because the exception is swallowed, replay/drift tracking silently runs with no events.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Addressed in commit 68ce3b4 — TraceContext.restore() now calls GET /api/sessions/{sess_id}/traces and reads the traces key from the response, matching the actual route and TraceListResponse schema.
| target.messages = copy.copy(state.messages) | ||
| if hasattr(state, "intermediate_steps") and hasattr(target, "intermediate_steps"): | ||
| target.intermediate_steps = copy.copy(state.intermediate_steps) |
There was a problem hiding this comment.
Deep-copy restored checkpoint containers before assignment
The restore hook uses copy.copy(...) for messages and intermediate_steps, which only copies the outer list; nested dict elements remain aliased to the checkpoint snapshot. If restored code mutates a message/step dict in-place (a common pattern), it mutates the original checkpoint state too, so subsequent inspections/replays no longer reflect the true saved snapshot.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Addressed in commit 68ce3b4 — restore hooks now use copy.deepcopy() for all mutable containers so inner dict/list elements are not aliased to the checkpoint snapshot.
…replay event endpoint
- Use copy.deepcopy() in _langchain_restore_hook and _generic_restore_hook so
nested message/step dicts in checkpoint snapshots are not aliased to the
restored target, preventing post-restore mutations from corrupting replay state
- Fix TraceContext.restore() to query /api/sessions/{id}/traces (correct route)
and parse the 'traces' key from TraceListResponse instead of the non-existent
/api/sessions/{id}/events endpoint, so replay_events=True works in deployments
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
Addressed the remaining review comments in commit 68ce3b4: discussion_r3035760267 — Query replay events from a valid session route (P1) discussion_r3035529878 / discussion_r3035760269 — Deep-copy restored checkpoint containers All checks pass: |
Review feedback addressedAll review comments from Codex and Copilot have been resolved in commits
Validation: |
|
This PR is superseded by #140, which contains the latest and most complete iteration of the L3 replay primitives feature. |
Summary
Closes #136
Implements three self-contained, additive pieces for L3/L4 replay features:
agent_debugger_sdk/checkpoints/hooks.py—RestoreHookprotocol,RESTORE_HOOK_REGISTRYdict,apply_restore_hook()with built-in LangChain hook and generic fallback,AutoReplayManagerclass for post-checkpoint event replayagent_debugger_sdk/drift.py—DriftSeverityenum (MINOR/WARNING/CRITICAL),DriftEventdataclass withseverity/description/original_value/restored_valuefields,DriftDetector.compare()detecting action, tool-call, and confidence driftapi/schemas.py—RestoreRequestextended with optionalreplay_events: boolandtrack_drift: bool;RestoreResponseextended with optionalreplayed_events_count: intanddrift_detected: boolagent_debugger_sdk/checkpoints/__init__.py— exportsRestoreHook,RESTORE_HOOK_REGISTRY,apply_restore_hook,AutoReplayManagerNo existing behaviour is changed; all additions are purely additive.
Test plan
ruff check .— passes (lint clean)pytest tests/test_replay_depth_l3.py— 22 passed, 10 skipped (remaining skips are deeperTraceContextintegrations outside this issue's scope)pytest -q— 2206 passed, 19 skipped, 0 failures🤖 Generated with Claude Code