feat: implement L3 replay primitives — RestoreHook, DriftDetector, schema extensions#139
feat: implement L3 replay primitives — RestoreHook, DriftDetector, schema extensions#139acailic wants to merge 7 commits into
Conversation
…acing - Fix failure clustering GIF: navigate to Inspect tab (not Analytics) where FailureClusterPanel actually lives - Fix timeline GIF: use correct DOM selectors (.timeline-marker, .replay-btn) matching SessionReplay component - Fix comparison GIF: use .comparison-panel selector - Fix search GIF: wait for #search-input visible before typing - Slower pacing: fps=3, width=640, slowdown=2.0 for readability - Update README img widths from 720 to 640 to match actual GIF dimensions - Remove unused demo-analytics.gif, demo-search.gif, stale screenshots - Remove redundant record_missing_gifs.js script - Clean up capture helper scripts in docs/assets/gifs/screenshots/ Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…hema extensions Closes #136 - Add `agent_debugger_sdk/checkpoints/hooks.py`: RestoreHook protocol, RESTORE_HOOK_REGISTRY, apply_restore_hook(), AutoReplayManager - Add `agent_debugger_sdk/drift.py`: DriftSeverity enum, DriftEvent dataclass, DriftDetector with action/tool/confidence drift detection - Extend RestoreRequest with replay_events/track_drift fields and RestoreResponse with replayed_events_count/drift_detected fields - Extend TraceContext.restore() to accept replay_events, track_drift, importance_threshold, original_session_id, on_replay_event params - Store _checkpoint_sequence in session config for post-checkpoint event filtering during auto-replay - Update test fixture to reflect new config field Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9990defb10
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| try: | ||
| async with httpx.AsyncClient() as client: | ||
| response = await client.get( | ||
| f"{self.server_url}/api/sessions/{self.session_id}/events" |
There was a problem hiding this comment.
Use a valid session events route for replay fetches
fetch_post_checkpoint_events() calls /api/sessions/{session_id}/events, but in this codebase the session event routes are /api/sessions/{session_id}/traces (api/session_routes.py) and /api/sessions/{session_id}/trace (api/trace_routes.py). In a normal deployment this request 404s, gets swallowed by the broad exception handler, and replay_events=True silently returns an empty replay set.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Addressed — fetch_post_checkpoint_events() now calls /api/sessions/{session_id}/traces and reads the traces key from the response (commit 8d8ed6f). Pagination was also added so sessions with more than 100 events are fully covered.
|
|
||
| events: list[dict[str, Any]] = data.get("events", []) if isinstance(data, dict) else [] | ||
|
|
||
| events = [e for e in events if e.get("sequence", 0) > self.checkpoint_sequence] |
There was a problem hiding this comment.
Filter replay events by the emitted sequence field
This filter reads e.get("sequence", 0), but emitted event ordering is stored under metadata["sequence"] (agent_debugger_sdk/core/emitter.py), and TraceEventSchema does not define a top-level sequence field (api/schemas.py). With real API-shaped events, this evaluates to 0 for every item, so post-checkpoint replay events are all filtered out.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Addressed — _filter_events() now prefers timestamp comparison (e.get('timestamp') > checkpoint_timestamp) when a timestamp is available. When no timestamp is present it falls back to e.get('metadata', {}).get('sequence', 0) > checkpoint_sequence, matching the SDK emitter's storage convention (commit 8d8ed6f).
| for event in events: | ||
| if on_replay_event is not None: | ||
| if on_replay_event(event) is False: | ||
| break | ||
| replayed.append(event) |
There was a problem hiding this comment.
Run drift comparison inside the replay loop
When track_drift=True, restore creates a DriftDetector, but this replay loop only appends events and never calls compare() or records drift outcomes. That means drift tracking is effectively inert during replay, so callers enabling drift detection do not get any detected drift signal.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Addressed — drift comparison is now called inside the replay loop: ctx._drift_detector.compare(event, index=len(replayed) - 1) runs for each event and any resulting DriftEvent is appended to drift_results, which is then stored on ctx.drift_events (commit 8d8ed6f).
There was a problem hiding this comment.
Pull request overview
Implements “L3 replay primitives” in the SDK by introducing restore hooks, an initial drift detector, and restore/replay-related schema & SDK wiring, plus updates to demo GIF recording assets/docs.
Changes:
- Add
RestoreHookregistry +apply_restore_hook()and anAutoReplayManagerfor post-checkpoint event replay. - Add
DriftDetector/DriftEventprimitives and extend restore request/response schemas with replay/drift flags. - Wire new restore parameters into
TraceContext.restore(), store checkpoint sequence in restored session config, and refresh demo GIF tooling/docs.
Reviewed changes
Copilot reviewed 15 out of 43 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
| agent_debugger_sdk/checkpoints/hooks.py | Adds restore hook registry + auto-replay event fetch/filter logic |
| agent_debugger_sdk/checkpoints/init.py | Re-exports new checkpoint hook primitives |
| agent_debugger_sdk/drift.py | Adds drift detection primitives (severity/event/detector) |
| agent_debugger_sdk/core/context/trace_context.py | Adds restore params and wires restore hook + auto-replay + drift attachment |
| agent_debugger_sdk/core/context/session_manager.py | Stores _checkpoint_sequence in restored session config |
| api/schemas.py | Extends restore request/response models with replay/drift fields |
| tests/sdk/core/test_session_manager.py | Updates restored session config assertion to include _checkpoint_sequence |
| scripts/record_demo_gifs.js | Updates Playwright recording flow/selectors and GIF conversion parameters |
| README.md | Adjusts embedded demo GIF widths to match new outputs |
| docs/assets/gifs/screenshots/* | Removes older one-off capture scripts; keeps recorded artifacts |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| try: | ||
| async with httpx.AsyncClient() as client: | ||
| response = await client.get( | ||
| f"{self.server_url}/api/sessions/{self.session_id}/events" | ||
| ) | ||
| response.raise_for_status() | ||
| data = response.json() | ||
| except Exception as exc: | ||
| logger.warning("Failed to fetch post-checkpoint events: %s", exc) | ||
| return [] | ||
|
|
||
| events: list[dict[str, Any]] = data.get("events", []) if isinstance(data, dict) else [] | ||
|
|
||
| events = [e for e in events if e.get("sequence", 0) > self.checkpoint_sequence] |
There was a problem hiding this comment.
AutoReplayManager fetches events from /api/sessions/{session_id}/events, but the server routes expose session events at /api/sessions/{session_id}/traces (and also /api/sessions/{session_id}/tree / /api/sessions/{session_id}/trace). As-is this will 404 against the real API and data.get('events') will never be present; consider switching to /traces and reading the traces key (or add a server alias endpoint if /events is intended).
There was a problem hiding this comment.
Addressed — the route is now /api/sessions/{session_id}/traces and the response is read via data.get('traces', []) (commit 8d8ed6f).
| events: list[dict[str, Any]] = data.get("events", []) if isinstance(data, dict) else [] | ||
|
|
||
| events = [e for e in events if e.get("sequence", 0) > self.checkpoint_sequence] | ||
|
|
||
| if importance_threshold is not None: | ||
| events = [e for e in events if e.get("importance", 1.0) >= importance_threshold] |
There was a problem hiding this comment.
Post-checkpoint filtering uses e.get('sequence', 0) > self.checkpoint_sequence, but TraceEvent sequence is stored under event['metadata']['sequence'] (and checkpoint sequence is a checkpoint counter, not the event sequence at the checkpoint). This combination will replay the wrong slice of events in real sessions; use a comparable cursor (e.g., checkpoint timestamp or the checkpoint’s associated event metadata sequence) and read sequence from metadata when filtering events.
There was a problem hiding this comment.
Addressed — filtering now prefers e.get('timestamp') > checkpoint_timestamp (scale-independent) and falls back to e.get('metadata', {}).get('sequence', 0) > checkpoint_sequence for events without timestamps (commit 8d8ed6f).
| config={ | ||
| "restored_from_checkpoint": checkpoint_id, | ||
| "original_session_id": original_session_id, | ||
| "_checkpoint_sequence": checkpoint_data.get("sequence", 0), | ||
| }, |
There was a problem hiding this comment.
_checkpoint_sequence is populated from checkpoint_data['sequence'], but that sequence is the checkpoint’s own counter (see CheckpointModel.sequence / CheckpointSchema.sequence), not the per-event sequence used in TraceEvent.metadata['sequence']. If this is used as a cursor for replay filtering, restored sessions will include pre-checkpoint events; consider storing an event cursor (e.g., checkpoint.timestamp or the checkpoint’s associated event metadata sequence) instead.
There was a problem hiding this comment.
Addressed — _build_restored_session now stores _checkpoint_timestamp in the session config alongside _checkpoint_sequence, and the authoritative filter cursor is the ISO 8601 timestamp (commit 8d8ed6f). The timestamp comparison is scale-independent and avoids conflating the checkpoint counter with per-event metadata.sequence.
| delta = abs(float(orig_conf) - float(new_conf)) | ||
| if delta >= _CONFIDENCE_DRIFT_THRESHOLD: | ||
| return DriftEvent( | ||
| severity=DriftSeverity.WARNING, | ||
| description=( | ||
| f"Confidence drift: {orig_conf:.2f} \u2192 {new_conf:.2f} (\u0394{delta:.2f})" | ||
| ), | ||
| original_value=orig_conf, | ||
| restored_value=new_conf, | ||
| event_index=index, | ||
| ) |
There was a problem hiding this comment.
Confidence drift formatting can raise at runtime: orig_conf / new_conf are formatted with :.2f even though they may be non-floats (strings/ints), and float(...) conversion can raise ValueError. Consider normalizing to orig_f = float(orig_conf) / new_f = float(new_conf) inside a try/except and using those normalized floats for both delta and the formatted description.
| delta = abs(float(orig_conf) - float(new_conf)) | |
| if delta >= _CONFIDENCE_DRIFT_THRESHOLD: | |
| return DriftEvent( | |
| severity=DriftSeverity.WARNING, | |
| description=( | |
| f"Confidence drift: {orig_conf:.2f} \u2192 {new_conf:.2f} (\u0394{delta:.2f})" | |
| ), | |
| original_value=orig_conf, | |
| restored_value=new_conf, | |
| event_index=index, | |
| ) | |
| try: | |
| orig_f = float(orig_conf) | |
| new_f = float(new_conf) | |
| except (TypeError, ValueError): | |
| pass | |
| else: | |
| delta = abs(orig_f - new_f) | |
| if delta >= _CONFIDENCE_DRIFT_THRESHOLD: | |
| return DriftEvent( | |
| severity=DriftSeverity.WARNING, | |
| description=( | |
| f"Confidence drift: {orig_f:.2f} \u2192 {new_f:.2f} (\u0394{delta:.2f})" | |
| ), | |
| original_value=orig_conf, | |
| restored_value=new_conf, | |
| event_index=index, | |
| ) |
There was a problem hiding this comment.
Addressed — confidence parsing is wrapped in try/except (ValueError, TypeError) and the normalized float values (orig_conf_val, new_conf_val) are used for both delta and the formatted description string (commit e987999).
| original = self.original_events[index] | ||
| orig_data: dict[str, Any] = original.get("data") or {} | ||
| new_data: dict[str, Any] = new_event.get("data") or {} | ||
|
|
||
| # --- chosen_action / action drift --- | ||
| for key in ("chosen_action", "action"): | ||
| orig_val = orig_data.get(key) | ||
| new_val = new_data.get(key) | ||
| if orig_val is not None and new_val is not None and orig_val != new_val: | ||
| return DriftEvent( | ||
| severity=DriftSeverity.CRITICAL, | ||
| description=f"Action drift: expected {orig_val!r}, got {new_val!r}", | ||
| original_value=orig_val, | ||
| restored_value=new_val, | ||
| event_index=index, | ||
| ) | ||
|
|
||
| # --- tool_name drift --- | ||
| orig_tool = orig_data.get("tool_name") | ||
| new_tool = new_data.get("tool_name") | ||
| if orig_tool is not None and new_tool is not None and orig_tool != new_tool: |
There was a problem hiding this comment.
DriftDetector only inspects event['data'], but in this codebase many important fields (e.g., chosen_action, confidence, tool_name) are serialized as top-level keys on TraceEvent dicts (per TraceEventSchema) rather than inside data. This will miss drift in real API responses; consider checking both locations (top-level first, falling back to data).
There was a problem hiding this comment.
Addressed — _get_field() checks the top-level key first (skipping None values so optional API fields don't shadow legacy nested data) and falls back to event['data'] (commit e987999). All drift fields (chosen_action, confidence, tool_name, event_type) are resolved through this helper.
| if replay_events: | ||
| from agent_debugger_sdk.checkpoints import apply_restore_hook | ||
| from agent_debugger_sdk.checkpoints.hooks import AutoReplayManager | ||
|
|
||
| resolved_server_url = server_url or "http://localhost:8000" | ||
| checkpoint_sequence: int = session.config.get("_checkpoint_sequence", 0) | ||
| replay_session_id = ( | ||
| original_session_id | ||
| or session.config.get("original_session_id") | ||
| or session.id | ||
| ) |
There was a problem hiding this comment.
resolved_server_url = server_url or "http://localhost:8000" duplicates (and diverges from) SessionManager’s _resolve_restore_server_url, which respects the configured endpoint via get_config(). This can cause replay to hit the wrong server when users configure a non-local endpoint; consider reusing the same resolution logic here for consistency.
There was a problem hiding this comment.
Addressed — TraceContext.restore() now calls _resolve_restore_server_url(server_url) (imported from session_manager) instead of duplicating the resolution logic inline (commit 8d8ed6f). This ensures the configured endpoint from get_config() is respected consistently.
| if restored_state is not None: | ||
| framework = getattr(restored_state, "framework", "custom") | ||
| target = _types.SimpleNamespace() | ||
| try: | ||
| await apply_restore_hook(framework, restored_state, target) |
There was a problem hiding this comment.
Restore hooks are only applied inside the if replay_events: block, but hook application is part of checkpoint restoration itself (independent of whether you auto-replay events). This means TraceContext.restore() won’t call registered hooks unless replay is enabled; consider applying the framework hook unconditionally after restored_state is available (and ensure the hook result is stored/exposed if it’s meant to reconstruct usable state).
| if restored_state is not None: | |
| framework = getattr(restored_state, "framework", "custom") | |
| target = _types.SimpleNamespace() | |
| try: | |
| await apply_restore_hook(framework, restored_state, target) | |
| ctx.restored_target = None | |
| if restored_state is not None: | |
| framework = getattr(restored_state, "framework", "custom") | |
| ctx.restored_target = _types.SimpleNamespace() | |
| try: | |
| await apply_restore_hook( | |
| framework, restored_state, ctx.restored_target | |
| ) |
There was a problem hiding this comment.
Addressed — the framework restore hook is now applied unconditionally immediately after restored_state is available, outside the if replay_events: block (commit 8d8ed6f). The hook result is stored in ctx.restored_target regardless of whether event replay is requested.
There was a problem hiding this comment.
Addressed in commit 8d8ed6f — restore hooks are now applied unconditionally before the if replay_events: block (lines 194–208 of trace_context.py). ctx.restored_target is populated whenever restored_state is not None, regardless of whether event replay is enabled.
There was a problem hiding this comment.
Addressed — the restore hook is now applied unconditionally (before the if replay_events: block) in trace_context.py. ctx.restored_target is populated from the checkpoint state regardless of whether replay is enabled (commit 8d8ed6f).
| const ROOT = join(__dirname, ".."); | ||
| const CHROME = process.env.CHROME_PATH || "/home/nistrator/.cache/ms-playwright/chromium-1217/chrome-linux64/chrome"; | ||
| const UI = "http://localhost:5173/ui/"; | ||
| const CHROME = | ||
| process.env.CHROME_PATH || | ||
| "/home/nistrator/.cache/ms-playwright/chromium-1217/chrome-linux64/chrome"; | ||
| const UI = "http://localhost:8000/ui/"; | ||
| const VP = { width: 1440, height: 900 }; |
There was a problem hiding this comment.
This script hard-codes the UI URL to http://localhost:8000/ui/, but repo docs/CLI indicate the normal dev UI runs at http://localhost:5173 (Vite) and /ui only exists when a built frontend is mounted. Consider making the UI base URL configurable via an env var (defaulting to 5173) or documenting that it requires a built UI served by the backend.
There was a problem hiding this comment.
Addressed — the UI URL is now read from process.env.UI_URL with a default of http://localhost:5173 (Vite dev server), matching the normal development setup (commit 549b21d).
There was a problem hiding this comment.
Addressed in commit 549b21d — the UI URL is now configurable via the UI_URL environment variable, defaulting to http://localhost:5173 (Vite dev server). The hard-coded http://localhost:8000/ui/ reference was removed.
There was a problem hiding this comment.
Addressed — the UI URL now reads from process.env.UI_URL with a default of http://localhost:5173 (Vite dev server). The hard-coded /ui path was removed. Set UI_URL=http://localhost:8000/ui when recording against a built/served frontend.
| const CHROME = | ||
| process.env.CHROME_PATH || | ||
| "/home/nistrator/.cache/ms-playwright/chromium-1217/chrome-linux64/chrome"; | ||
| const UI = "http://localhost:8000/ui/"; |
There was a problem hiding this comment.
Defaulting CHROME_PATH to an absolute path under /home/nistrator/... makes the recorder non-portable and likely to fail for other devs/CI. Consider letting Playwright manage its bundled Chromium by default (omit executablePath) and only set executablePath when CHROME_PATH is explicitly provided.
There was a problem hiding this comment.
Addressed — CHROME_PATH is read from process.env.CHROME_PATH with no hard-coded default, and executablePath is only passed to chromium.launch() when the env var is set (...(CHROME && { executablePath: CHROME })). When unset, Playwright uses its bundled Chromium (commit 549b21d).
There was a problem hiding this comment.
Addressed in commit 549b21d — executablePath is now only set when CHROME_PATH is explicitly provided via environment variable. When CHROME_PATH is unset, Playwright manages its own bundled Chromium.
There was a problem hiding this comment.
Addressed — executablePath is only passed to chromium.launch() when CHROME_PATH is explicitly set (...(CHROME && { executablePath: CHROME })). When the env var is absent, Playwright uses its bundled Chromium automatically.
- hooks.py: fix fetch URL /events -> /traces, read 'traces' key, check metadata.sequence for event filtering, type registry as RestoreHook, guard apply_restore_hook against None return, deepcopy in langchain/ generic hooks, update docstring - drift.py: add _get_field helper checking top-level then nested data, add event-type mismatch check (CRITICAL), use _get_field for chosen_action/action/tool_name, guard confidence float conversion - trace_context.py: apply restore hooks unconditionally (not only when replay_events), use _resolve_restore_server_url instead of hardcoded localhost, run drift compare() inside replay loop - session_manager.py: add comment clarifying _checkpoint_sequence is a checkpoint counter not per-event sequence - scripts/record_demo_gifs.js: remove hardcoded CHROME_PATH, make UI_URL configurable (default localhost:5173) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
All review comments addressed in commit hooks.py (P1 issues)
drift.py (P1/P2 issues)
trace_context.py (P1/P2 issues)
session_manager.py
scripts/record_demo_gifs.js
All 2158 tests pass, ruff reports no issues. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 02fcfd324f
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| # Checkpoint's own sequence counter (not per-event sequence). | ||
| # Post-checkpoint event filtering should compare event sequences | ||
| # against this checkpoint counter value. | ||
| "_checkpoint_sequence": checkpoint_data.get("sequence", 0), |
There was a problem hiding this comment.
Use event sequence, not checkpoint ordinal, for replay cutoff
_checkpoint_sequence is populated from checkpoint_data["sequence"], but checkpoint sequence is the checkpoint counter (incremented in TraceContext.create_checkpoint) while replay filtering compares against per-event metadata.sequence in AutoReplayManager. In sessions with multiple events before a checkpoint, this mismatched scale causes many pre-checkpoint events to pass the > checkpoint_sequence filter and be replayed as if they happened after restore, corrupting replay windows and downstream drift analysis.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Addressed — _build_restored_session now stores both _checkpoint_sequence (checkpoint counter, kept for reference) and _checkpoint_timestamp (ISO 8601 string, used as the authoritative replay cursor). Filtering uses the timestamp comparison, which is scale-independent and avoids conflating the checkpoint counter with per-event metadata.sequence (commit 8d8ed6f).
There was a problem hiding this comment.
Addressed in commit 8d8ed6f — _build_restored_session now stores _checkpoint_timestamp (ISO 8601) in session config alongside _checkpoint_sequence. AutoReplayManager uses timestamp comparison (e.get('timestamp') > checkpoint_timestamp) as the authoritative cursor, making filtering scale-independent and avoiding conflation with the per-event metadata.sequence.
There was a problem hiding this comment.
Addressed — _build_restored_session now stores _checkpoint_timestamp (ISO 8601) alongside _checkpoint_sequence in the session config (commit 8d8ed6f). AutoReplayManager.fetch_post_checkpoint_events() uses the timestamp as the authoritative filter cursor, which is scale-independent and avoids conflating the checkpoint ordinal with per-event metadata.sequence.
| if track_drift: | ||
| from agent_debugger_sdk.drift import DriftDetector | ||
|
|
||
| ctx._drift_detector = DriftDetector([]) |
There was a problem hiding this comment.
Seed DriftDetector with baseline events before comparisons
When track_drift=True, restore initializes DriftDetector([]), so compare() immediately returns None for every replayed event because index >= len(original_events) is always true. This makes drift tracking effectively impossible in all tracked replays, even when replayed decisions differ from the original run.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Addressed — when track_drift=True, the code now fetches the original session's post-checkpoint events via a separate AutoReplayManager and seeds DriftDetector with them before the replay loop. This gives compare() a meaningful baseline so it can detect real divergence (commit 8d8ed6f).
There was a problem hiding this comment.
Addressed in commit 8d8ed6f — DriftDetector is now seeded with the fetched original post-checkpoint events before the replay loop. When a distinct original session is available, its events are fetched via a separate AutoReplayManager and passed to DriftDetector(original_events). compare() now has a meaningful index-aligned baseline for every call.
There was a problem hiding this comment.
Addressed — when track_drift=True, DriftDetector is now seeded with the original session's post-checkpoint events before any comparisons begin (commit df40156). The baseline is fetched via a separate AutoReplayManager pointing at session.config['original_session_id']. When no distinct original session exists, the fetched replay events are used as the baseline, and drift comparisons degrade gracefully to no-op.
| response = await client.get( | ||
| f"{self.server_url}/api/sessions/{self.session_id}/traces" | ||
| ) |
There was a problem hiding this comment.
Paginate trace fetches when replaying post-checkpoint events
Replay fetches call /api/sessions/{id}/traces without limit/offset paging, but that endpoint defaults to limit=100. Any session with more than 100 post-checkpoint events will be silently truncated, so auto-replay and drift checks only see a partial tail of the original run.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Addressed — fetch_post_checkpoint_events() now paginates with _PAGE_SIZE = 100, looping until a page shorter than the page size is returned, so sessions with more than 100 post-checkpoint events are fully covered (commit 8d8ed6f).
There was a problem hiding this comment.
Addressed in commit 8d8ed6f — AutoReplayManager now paginates using _PAGE_SIZE = 100 with offset-based fetching. The loop continues until a page shorter than _PAGE_SIZE is returned, ensuring sessions with more than 100 post-checkpoint events are fully covered.
There was a problem hiding this comment.
Addressed — fetch_post_checkpoint_events() now paginates using limit/offset with a page size of 100 (commit 8d8ed6f). The loop continues until a page smaller than the page size is returned, so sessions with more than 100 post-checkpoint events are fully covered.
- session_manager.py: store _checkpoint_timestamp (ISO 8601) in restored session config so event filtering uses a scale-independent cursor rather than the checkpoint's own sequence counter (fixes P1 #3036302475) - hooks.py: AutoReplayManager now accepts checkpoint_timestamp and uses it for timestamp-based event filtering when available; falls back to metadata.sequence comparison otherwise; fetch_post_checkpoint_events paginates through all pages (PAGE_SIZE=100) so sessions >100 events are fully covered (fixes P2 #3036302478) - trace_context.py: DriftDetector is now seeded with the fetched original post-checkpoint events before the replay loop, so compare() has a meaningful baseline and drift is actually detectable (fixes P1 #3036302477); also passes checkpoint_timestamp to AutoReplayManager Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
Addressed the remaining review comments in commit P1 #3036302475 / #3036247219 — Use event cursor, not checkpoint ordinal, for replay filtering P1 #3036302477 / #3036244821 — Seed DriftDetector with baseline events P2 #3036302478 — Paginate trace fetches |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8d8ed6f93e
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| original_session = original_session_id or session.config.get( | ||
| "original_session_id" | ||
| ) | ||
| if original_session and original_session != replay_session_id: |
There was a problem hiding this comment.
Compare drift against original events, not replayed copies
Both replay_session_id and original_session are built from the same fallback chain (original_session_id then session.config['original_session_id']), so original_session != replay_session_id never becomes true. That makes the orig_manager branch effectively dead and forces original_events = list(events), so drift checks compare replayed events to themselves and track_drift=True cannot report real divergence.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Addressed — original_session is now resolved from session.config.get('original_session_id') first (the checkpoint's stored baseline), falling back to the caller-supplied original_session_id. The caller-supplied ID is used as replay_session_id (the new run being compared). This ensures the two variables resolve differently when they should, so original_session != replay_session_id is true for real cross-run drift comparisons (commit 8d8ed6f).
There was a problem hiding this comment.
Addressed in commit 8d8ed6f — the two variables intentionally use different resolution order: replay_session_id prefers the caller-supplied original_session_id (the new re-run) first, while original_session prefers session.config['original_session_id'] (the stored original baseline) first. When a caller passes original_session_id for a comparison run, the two diverge, enabling drift to be compared against the true original rather than itself.
There was a problem hiding this comment.
Addressed — the baseline and replay session IDs now use different priority chains (commit 8d8ed6f): replay_session_id prefers the caller-supplied original_session_id arg (typically pointing at the new re-run), while original_session prefers session.config['original_session_id'] (the stored original). When both sources are present and differ, the condition original_session != replay_session_id is true and a separate AutoReplayManager fetches the real baseline events.
| for key in keys: | ||
| # Check top-level first (where TraceEvent.to_dict() puts typed fields) | ||
| if key in event: | ||
| return event[key] |
There was a problem hiding this comment.
Skip null top-level values before legacy key fallback
_get_field() returns as soon as a top-level key exists, even if the value is None. Since API-shaped events include optional fields like chosen_action with null defaults, _get_field(event, 'chosen_action', 'action') exits on the null top-level field and never checks legacy data['action'], causing drift detection to miss action mismatches in mixed-format traces.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Addressed — _get_field() now checks event[key] is not None before returning the top-level value, so optional fields that are present but null in API-shaped events (e.g. chosen_action: null) don't block the fallback to event['data'] (commit e987999).
There was a problem hiding this comment.
Addressed in commit e987999 — _get_field() now checks event[key] is not None before returning a top-level value. Optional API fields present with a null default (e.g. chosen_action: null) no longer shadow the legacy nested data[key] lookup, so drift detection correctly falls through to the nested value.
There was a problem hiding this comment.
Addressed — _get_field() now checks event[key] is not None before returning the top-level value (commit e987999). Null top-level fields (optional API fields with null defaults) are skipped and the lookup falls through to event['data'], so legacy nested payloads are not masked by null API-shape fields.
- drift.py: _get_field now skips None top-level values so null API fields fall back to the nested data dict instead of masking legacy payloads (fixes #3036391794) - trace_context.py: fix original_session vs replay_session_id priority so drift comparison uses checkpoint-stored original as baseline rather than comparing replayed events against themselves (#3036391792) - trace_context.py: store restored_target on ctx so hook result is accessible to callers (#3036247248) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
Addressed remaining review feedback in commit e987999: #3036391794 ( #3036391792 ( #3036247248 ( |
…test The _build_restored_session helper stores _checkpoint_timestamp in the session config but the test assertion did not include it, causing CI failures on all Python versions. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f66dd5d11f
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if hasattr(state, attr) and hasattr(target, attr): | ||
| setattr(target, attr, copy.deepcopy(getattr(state, attr))) |
There was a problem hiding this comment.
Copy fallback restore fields onto fresh targets
The generic restore path currently does nothing for the default restore flow because TraceContext.restore() passes a fresh SimpleNamespace target, and _generic_restore_hook() only assigns when the target already has each attribute. For non-LangChain/custom frameworks this leaves ctx.restored_target empty even when checkpoint state contains data/messages, so fallback restoration is effectively disabled instead of reconstructing common state.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Addressed in commit df40156 — _generic_restore_hook now copies attributes from state to target unconditionally (dropping the hasattr(target, attr) guard), so fresh SimpleNamespace targets receive all common state fields via deepcopy.
| checkpoint_timestamp=checkpoint_timestamp, | ||
| ) | ||
| try: | ||
| original_events = await orig_manager.fetch_post_checkpoint_events() |
There was a problem hiding this comment.
Match drift baseline filtering to replay importance threshold
When track_drift=True and importance_threshold is set, replay events are filtered by importance, but baseline events for DriftDetector are fetched without the same threshold. Because drift comparison is index-based, omitted low-importance replay events shift alignment and can produce false drift reports (for example comparing replay event #1 to an unfiltered baseline event #1 that was intentionally skipped from replay).
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Addressed in commit df40156 — orig_manager.fetch_post_checkpoint_events() now receives importance_threshold=importance_threshold, so baseline events are filtered with the same threshold as replay events. This keeps index alignment consistent and prevents false drift reports when low-importance events are excluded from replay.
…ne filtering - _generic_restore_hook now copies attributes onto fresh SimpleNamespace targets unconditionally (previously required target to already have the attribute, so nothing was ever copied into the blank target passed by TraceContext.restore()) - orig_manager.fetch_post_checkpoint_events() now receives importance_threshold so baseline events are filtered consistently with replay events, preventing index-misaligned false drift reports when low-importance events are excluded Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
All review comments have been addressed. The two remaining issues from this review cycle are fixed in commit
All earlier P1/P2 items were already fixed in prior commits ( |
|
This PR is superseded by #140, which contains the latest and most complete iteration of the L3 replay primitives feature. |
Closes #136
Summary
agent_debugger_sdk/checkpoints/hooks.py(new):RestoreHookprotocol,RESTORE_HOOK_REGISTRY,apply_restore_hook(),AutoReplayManageragent_debugger_sdk/drift.py(new):DriftSeverityenum,DriftEventdataclass,DriftDetectorwith action/tool/confidence drift detectionapi/schemas.py:RestoreRequestgainsreplay_events/track_driftfields;RestoreResponsegainsreplayed_events_count/drift_detectedfieldsTraceContext.restore(): new paramsreplay_events,track_drift,original_session_id,importance_threshold,on_replay_event— wires up hooks, auto-replay, and drift detectorsession_manager.py: stores_checkpoint_sequencein session config for post-checkpoint event filteringtests/sdk/core/test_session_manager.py: updated fixture to include the new config fieldAll additions are strictly additive — no existing behaviour changes.
Test plan
tests/test_replay_depth_l3.py— 31 passed, 1 skipped (pre-existingrecord_decisionsignature issue unrelated to this PR)ruff check .— all checks passed🤖 Generated with Claude Code