Skip to content

feat: implement L3 replay primitives — RestoreHook, DriftDetector, schema extensions#139

Closed
acailic wants to merge 7 commits into
mainfrom
fix/issue-136-l3-replay-primitives-v2
Closed

feat: implement L3 replay primitives — RestoreHook, DriftDetector, schema extensions#139
acailic wants to merge 7 commits into
mainfrom
fix/issue-136-l3-replay-primitives-v2

Conversation

@acailic
Copy link
Copy Markdown
Owner

@acailic acailic commented Apr 5, 2026

Closes #136

Summary

  • agent_debugger_sdk/checkpoints/hooks.py (new): RestoreHook protocol, RESTORE_HOOK_REGISTRY, apply_restore_hook(), AutoReplayManager
  • agent_debugger_sdk/drift.py (new): DriftSeverity enum, DriftEvent dataclass, DriftDetector with action/tool/confidence drift detection
  • api/schemas.py: RestoreRequest gains replay_events/track_drift fields; RestoreResponse gains replayed_events_count/drift_detected fields
  • TraceContext.restore(): new params replay_events, track_drift, original_session_id, importance_threshold, on_replay_event — wires up hooks, auto-replay, and drift detector
  • session_manager.py: stores _checkpoint_sequence in session config for post-checkpoint event filtering
  • tests/sdk/core/test_session_manager.py: updated fixture to include the new config field

All additions are strictly additive — no existing behaviour changes.

Test plan

  • tests/test_replay_depth_l3.py — 31 passed, 1 skipped (pre-existing record_decision signature issue unrelated to this PR)
  • Full suite — 2185 passed, 28 skipped, 0 failures
  • ruff check . — all checks passed

🤖 Generated with Claude Code

acailic and others added 2 commits April 4, 2026 14:29
…acing

- Fix failure clustering GIF: navigate to Inspect tab (not Analytics) where
  FailureClusterPanel actually lives
- Fix timeline GIF: use correct DOM selectors (.timeline-marker, .replay-btn)
  matching SessionReplay component
- Fix comparison GIF: use .comparison-panel selector
- Fix search GIF: wait for #search-input visible before typing
- Slower pacing: fps=3, width=640, slowdown=2.0 for readability
- Update README img widths from 720 to 640 to match actual GIF dimensions
- Remove unused demo-analytics.gif, demo-search.gif, stale screenshots
- Remove redundant record_missing_gifs.js script
- Clean up capture helper scripts in docs/assets/gifs/screenshots/

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…hema extensions

Closes #136

- Add `agent_debugger_sdk/checkpoints/hooks.py`: RestoreHook protocol,
  RESTORE_HOOK_REGISTRY, apply_restore_hook(), AutoReplayManager
- Add `agent_debugger_sdk/drift.py`: DriftSeverity enum, DriftEvent
  dataclass, DriftDetector with action/tool/confidence drift detection
- Extend RestoreRequest with replay_events/track_drift fields and
  RestoreResponse with replayed_events_count/drift_detected fields
- Extend TraceContext.restore() to accept replay_events, track_drift,
  importance_threshold, original_session_id, on_replay_event params
- Store _checkpoint_sequence in session config for post-checkpoint
  event filtering during auto-replay
- Update test fixture to reflect new config field

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 5, 2026 00:57
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9990defb10

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread agent_debugger_sdk/checkpoints/hooks.py Outdated
try:
async with httpx.AsyncClient() as client:
response = await client.get(
f"{self.server_url}/api/sessions/{self.session_id}/events"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Use a valid session events route for replay fetches

fetch_post_checkpoint_events() calls /api/sessions/{session_id}/events, but in this codebase the session event routes are /api/sessions/{session_id}/traces (api/session_routes.py) and /api/sessions/{session_id}/trace (api/trace_routes.py). In a normal deployment this request 404s, gets swallowed by the broad exception handler, and replay_events=True silently returns an empty replay set.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed — fetch_post_checkpoint_events() now calls /api/sessions/{session_id}/traces and reads the traces key from the response (commit 8d8ed6f). Pagination was also added so sessions with more than 100 events are fully covered.

Comment thread agent_debugger_sdk/checkpoints/hooks.py Outdated

events: list[dict[str, Any]] = data.get("events", []) if isinstance(data, dict) else []

events = [e for e in events if e.get("sequence", 0) > self.checkpoint_sequence]
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Filter replay events by the emitted sequence field

This filter reads e.get("sequence", 0), but emitted event ordering is stored under metadata["sequence"] (agent_debugger_sdk/core/emitter.py), and TraceEventSchema does not define a top-level sequence field (api/schemas.py). With real API-shaped events, this evaluates to 0 for every item, so post-checkpoint replay events are all filtered out.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed — _filter_events() now prefers timestamp comparison (e.get('timestamp') > checkpoint_timestamp) when a timestamp is available. When no timestamp is present it falls back to e.get('metadata', {}).get('sequence', 0) > checkpoint_sequence, matching the SDK emitter's storage convention (commit 8d8ed6f).

Comment on lines +237 to +241
for event in events:
if on_replay_event is not None:
if on_replay_event(event) is False:
break
replayed.append(event)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Run drift comparison inside the replay loop

When track_drift=True, restore creates a DriftDetector, but this replay loop only appends events and never calls compare() or records drift outcomes. That means drift tracking is effectively inert during replay, so callers enabling drift detection do not get any detected drift signal.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed — drift comparison is now called inside the replay loop: ctx._drift_detector.compare(event, index=len(replayed) - 1) runs for each event and any resulting DriftEvent is appended to drift_results, which is then stored on ctx.drift_events (commit 8d8ed6f).

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Implements “L3 replay primitives” in the SDK by introducing restore hooks, an initial drift detector, and restore/replay-related schema & SDK wiring, plus updates to demo GIF recording assets/docs.

Changes:

  • Add RestoreHook registry + apply_restore_hook() and an AutoReplayManager for post-checkpoint event replay.
  • Add DriftDetector/DriftEvent primitives and extend restore request/response schemas with replay/drift flags.
  • Wire new restore parameters into TraceContext.restore(), store checkpoint sequence in restored session config, and refresh demo GIF tooling/docs.

Reviewed changes

Copilot reviewed 15 out of 43 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
agent_debugger_sdk/checkpoints/hooks.py Adds restore hook registry + auto-replay event fetch/filter logic
agent_debugger_sdk/checkpoints/init.py Re-exports new checkpoint hook primitives
agent_debugger_sdk/drift.py Adds drift detection primitives (severity/event/detector)
agent_debugger_sdk/core/context/trace_context.py Adds restore params and wires restore hook + auto-replay + drift attachment
agent_debugger_sdk/core/context/session_manager.py Stores _checkpoint_sequence in restored session config
api/schemas.py Extends restore request/response models with replay/drift fields
tests/sdk/core/test_session_manager.py Updates restored session config assertion to include _checkpoint_sequence
scripts/record_demo_gifs.js Updates Playwright recording flow/selectors and GIF conversion parameters
README.md Adjusts embedded demo GIF widths to match new outputs
docs/assets/gifs/screenshots/* Removes older one-off capture scripts; keeps recorded artifacts

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread agent_debugger_sdk/checkpoints/hooks.py Outdated
Comment on lines +98 to +111
try:
async with httpx.AsyncClient() as client:
response = await client.get(
f"{self.server_url}/api/sessions/{self.session_id}/events"
)
response.raise_for_status()
data = response.json()
except Exception as exc:
logger.warning("Failed to fetch post-checkpoint events: %s", exc)
return []

events: list[dict[str, Any]] = data.get("events", []) if isinstance(data, dict) else []

events = [e for e in events if e.get("sequence", 0) > self.checkpoint_sequence]
Copy link

Copilot AI Apr 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AutoReplayManager fetches events from /api/sessions/{session_id}/events, but the server routes expose session events at /api/sessions/{session_id}/traces (and also /api/sessions/{session_id}/tree / /api/sessions/{session_id}/trace). As-is this will 404 against the real API and data.get('events') will never be present; consider switching to /traces and reading the traces key (or add a server alias endpoint if /events is intended).

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed — the route is now /api/sessions/{session_id}/traces and the response is read via data.get('traces', []) (commit 8d8ed6f).

Comment thread agent_debugger_sdk/checkpoints/hooks.py Outdated
Comment on lines +109 to +114
events: list[dict[str, Any]] = data.get("events", []) if isinstance(data, dict) else []

events = [e for e in events if e.get("sequence", 0) > self.checkpoint_sequence]

if importance_threshold is not None:
events = [e for e in events if e.get("importance", 1.0) >= importance_threshold]
Copy link

Copilot AI Apr 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Post-checkpoint filtering uses e.get('sequence', 0) > self.checkpoint_sequence, but TraceEvent sequence is stored under event['metadata']['sequence'] (and checkpoint sequence is a checkpoint counter, not the event sequence at the checkpoint). This combination will replay the wrong slice of events in real sessions; use a comparable cursor (e.g., checkpoint timestamp or the checkpoint’s associated event metadata sequence) and read sequence from metadata when filtering events.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed — filtering now prefers e.get('timestamp') > checkpoint_timestamp (scale-independent) and falls back to e.get('metadata', {}).get('sequence', 0) > checkpoint_sequence for events without timestamps (commit 8d8ed6f).

Comment on lines 85 to 89
config={
"restored_from_checkpoint": checkpoint_id,
"original_session_id": original_session_id,
"_checkpoint_sequence": checkpoint_data.get("sequence", 0),
},
Copy link

Copilot AI Apr 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_checkpoint_sequence is populated from checkpoint_data['sequence'], but that sequence is the checkpoint’s own counter (see CheckpointModel.sequence / CheckpointSchema.sequence), not the per-event sequence used in TraceEvent.metadata['sequence']. If this is used as a cursor for replay filtering, restored sessions will include pre-checkpoint events; consider storing an event cursor (e.g., checkpoint.timestamp or the checkpoint’s associated event metadata sequence) instead.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed — _build_restored_session now stores _checkpoint_timestamp in the session config alongside _checkpoint_sequence, and the authoritative filter cursor is the ISO 8601 timestamp (commit 8d8ed6f). The timestamp comparison is scale-independent and avoids conflating the checkpoint counter with per-event metadata.sequence.

Comment thread agent_debugger_sdk/drift.py Outdated
Comment on lines +112 to +122
delta = abs(float(orig_conf) - float(new_conf))
if delta >= _CONFIDENCE_DRIFT_THRESHOLD:
return DriftEvent(
severity=DriftSeverity.WARNING,
description=(
f"Confidence drift: {orig_conf:.2f} \u2192 {new_conf:.2f} (\u0394{delta:.2f})"
),
original_value=orig_conf,
restored_value=new_conf,
event_index=index,
)
Copy link

Copilot AI Apr 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confidence drift formatting can raise at runtime: orig_conf / new_conf are formatted with :.2f even though they may be non-floats (strings/ints), and float(...) conversion can raise ValueError. Consider normalizing to orig_f = float(orig_conf) / new_f = float(new_conf) inside a try/except and using those normalized floats for both delta and the formatted description.

Suggested change
delta = abs(float(orig_conf) - float(new_conf))
if delta >= _CONFIDENCE_DRIFT_THRESHOLD:
return DriftEvent(
severity=DriftSeverity.WARNING,
description=(
f"Confidence drift: {orig_conf:.2f} \u2192 {new_conf:.2f} (\u0394{delta:.2f})"
),
original_value=orig_conf,
restored_value=new_conf,
event_index=index,
)
try:
orig_f = float(orig_conf)
new_f = float(new_conf)
except (TypeError, ValueError):
pass
else:
delta = abs(orig_f - new_f)
if delta >= _CONFIDENCE_DRIFT_THRESHOLD:
return DriftEvent(
severity=DriftSeverity.WARNING,
description=(
f"Confidence drift: {orig_f:.2f} \u2192 {new_f:.2f} (\u0394{delta:.2f})"
),
original_value=orig_conf,
restored_value=new_conf,
event_index=index,
)

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed — confidence parsing is wrapped in try/except (ValueError, TypeError) and the normalized float values (orig_conf_val, new_conf_val) are used for both delta and the formatted description string (commit e987999).

Comment on lines +79 to +99
original = self.original_events[index]
orig_data: dict[str, Any] = original.get("data") or {}
new_data: dict[str, Any] = new_event.get("data") or {}

# --- chosen_action / action drift ---
for key in ("chosen_action", "action"):
orig_val = orig_data.get(key)
new_val = new_data.get(key)
if orig_val is not None and new_val is not None and orig_val != new_val:
return DriftEvent(
severity=DriftSeverity.CRITICAL,
description=f"Action drift: expected {orig_val!r}, got {new_val!r}",
original_value=orig_val,
restored_value=new_val,
event_index=index,
)

# --- tool_name drift ---
orig_tool = orig_data.get("tool_name")
new_tool = new_data.get("tool_name")
if orig_tool is not None and new_tool is not None and orig_tool != new_tool:
Copy link

Copilot AI Apr 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DriftDetector only inspects event['data'], but in this codebase many important fields (e.g., chosen_action, confidence, tool_name) are serialized as top-level keys on TraceEvent dicts (per TraceEventSchema) rather than inside data. This will miss drift in real API responses; consider checking both locations (top-level first, falling back to data).

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed — _get_field() checks the top-level key first (skipping None values so optional API fields don't shadow legacy nested data) and falls back to event['data'] (commit e987999). All drift fields (chosen_action, confidence, tool_name, event_type) are resolved through this helper.

Comment on lines +193 to +203
if replay_events:
from agent_debugger_sdk.checkpoints import apply_restore_hook
from agent_debugger_sdk.checkpoints.hooks import AutoReplayManager

resolved_server_url = server_url or "http://localhost:8000"
checkpoint_sequence: int = session.config.get("_checkpoint_sequence", 0)
replay_session_id = (
original_session_id
or session.config.get("original_session_id")
or session.id
)
Copy link

Copilot AI Apr 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resolved_server_url = server_url or "http://localhost:8000" duplicates (and diverges from) SessionManager’s _resolve_restore_server_url, which respects the configured endpoint via get_config(). This can cause replay to hit the wrong server when users configure a non-local endpoint; consider reusing the same resolution logic here for consistency.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed — TraceContext.restore() now calls _resolve_restore_server_url(server_url) (imported from session_manager) instead of duplicating the resolution logic inline (commit 8d8ed6f). This ensures the configured endpoint from get_config() is respected consistently.

Comment on lines +217 to +221
if restored_state is not None:
framework = getattr(restored_state, "framework", "custom")
target = _types.SimpleNamespace()
try:
await apply_restore_hook(framework, restored_state, target)
Copy link

Copilot AI Apr 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Restore hooks are only applied inside the if replay_events: block, but hook application is part of checkpoint restoration itself (independent of whether you auto-replay events). This means TraceContext.restore() won’t call registered hooks unless replay is enabled; consider applying the framework hook unconditionally after restored_state is available (and ensure the hook result is stored/exposed if it’s meant to reconstruct usable state).

Suggested change
if restored_state is not None:
framework = getattr(restored_state, "framework", "custom")
target = _types.SimpleNamespace()
try:
await apply_restore_hook(framework, restored_state, target)
ctx.restored_target = None
if restored_state is not None:
framework = getattr(restored_state, "framework", "custom")
ctx.restored_target = _types.SimpleNamespace()
try:
await apply_restore_hook(
framework, restored_state, ctx.restored_target
)

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed — the framework restore hook is now applied unconditionally immediately after restored_state is available, outside the if replay_events: block (commit 8d8ed6f). The hook result is stored in ctx.restored_target regardless of whether event replay is requested.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in commit 8d8ed6f — restore hooks are now applied unconditionally before the if replay_events: block (lines 194–208 of trace_context.py). ctx.restored_target is populated whenever restored_state is not None, regardless of whether event replay is enabled.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed — the restore hook is now applied unconditionally (before the if replay_events: block) in trace_context.py. ctx.restored_target is populated from the checkpoint state regardless of whether replay is enabled (commit 8d8ed6f).

Comment on lines 24 to 29
const ROOT = join(__dirname, "..");
const CHROME = process.env.CHROME_PATH || "/home/nistrator/.cache/ms-playwright/chromium-1217/chrome-linux64/chrome";
const UI = "http://localhost:5173/ui/";
const CHROME =
process.env.CHROME_PATH ||
"/home/nistrator/.cache/ms-playwright/chromium-1217/chrome-linux64/chrome";
const UI = "http://localhost:8000/ui/";
const VP = { width: 1440, height: 900 };
Copy link

Copilot AI Apr 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This script hard-codes the UI URL to http://localhost:8000/ui/, but repo docs/CLI indicate the normal dev UI runs at http://localhost:5173 (Vite) and /ui only exists when a built frontend is mounted. Consider making the UI base URL configurable via an env var (defaulting to 5173) or documenting that it requires a built UI served by the backend.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed — the UI URL is now read from process.env.UI_URL with a default of http://localhost:5173 (Vite dev server), matching the normal development setup (commit 549b21d).

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in commit 549b21d — the UI URL is now configurable via the UI_URL environment variable, defaulting to http://localhost:5173 (Vite dev server). The hard-coded http://localhost:8000/ui/ reference was removed.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed — the UI URL now reads from process.env.UI_URL with a default of http://localhost:5173 (Vite dev server). The hard-coded /ui path was removed. Set UI_URL=http://localhost:8000/ui when recording against a built/served frontend.

Comment thread scripts/record_demo_gifs.js Outdated
Comment on lines +25 to +28
const CHROME =
process.env.CHROME_PATH ||
"/home/nistrator/.cache/ms-playwright/chromium-1217/chrome-linux64/chrome";
const UI = "http://localhost:8000/ui/";
Copy link

Copilot AI Apr 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Defaulting CHROME_PATH to an absolute path under /home/nistrator/... makes the recorder non-portable and likely to fail for other devs/CI. Consider letting Playwright manage its bundled Chromium by default (omit executablePath) and only set executablePath when CHROME_PATH is explicitly provided.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed — CHROME_PATH is read from process.env.CHROME_PATH with no hard-coded default, and executablePath is only passed to chromium.launch() when the env var is set (...(CHROME && { executablePath: CHROME })). When unset, Playwright uses its bundled Chromium (commit 549b21d).

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in commit 549b21dexecutablePath is now only set when CHROME_PATH is explicitly provided via environment variable. When CHROME_PATH is unset, Playwright manages its own bundled Chromium.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed — executablePath is only passed to chromium.launch() when CHROME_PATH is explicitly set (...(CHROME && { executablePath: CHROME })). When the env var is absent, Playwright uses its bundled Chromium automatically.

- hooks.py: fix fetch URL /events -> /traces, read 'traces' key, check
  metadata.sequence for event filtering, type registry as RestoreHook,
  guard apply_restore_hook against None return, deepcopy in langchain/
  generic hooks, update docstring
- drift.py: add _get_field helper checking top-level then nested data,
  add event-type mismatch check (CRITICAL), use _get_field for
  chosen_action/action/tool_name, guard confidence float conversion
- trace_context.py: apply restore hooks unconditionally (not only when
  replay_events), use _resolve_restore_server_url instead of hardcoded
  localhost, run drift compare() inside replay loop
- session_manager.py: add comment clarifying _checkpoint_sequence is
  a checkpoint counter not per-event sequence
- scripts/record_demo_gifs.js: remove hardcoded CHROME_PATH, make
  UI_URL configurable (default localhost:5173)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@acailic
Copy link
Copy Markdown
Owner Author

acailic commented Apr 5, 2026

All review comments addressed in commit 02fcfd3. Summary of changes:

hooks.py (P1 issues)

  • Fixed fetch URL from /api/sessions/{id}/events/api/sessions/{id}/traces and reads the traces key from the response
  • Fixed event sequence filtering to check e.get("metadata", {}).get("sequence", 0) first, with fallback to top-level sequence
  • Typed RESTORE_HOOK_REGISTRY as dict[str, RestoreHook]
  • apply_restore_hook now returns the original target (with warning log) if the hook returns None
  • Both _langchain_restore_hook and _generic_restore_hook now use copy.deepcopy() for mutable containers
  • Updated AutoReplayManager docstring to accurately reflect filter + callback orchestration responsibility

drift.py (P1/P2 issues)

  • Added _get_field() helper that checks top-level event keys first, then falls back to nested data dict — handles both real SDK events and legacy payloads
  • Added event-type mismatch check (CRITICAL severity) before action/tool/confidence comparisons
  • Replaced or-based alias lookups with _get_field() for chosen_action/action and tool_name, so empty-string values are not treated as missing
  • Guarded confidence float conversion inside try/except (ValueError, TypeError) using normalized float() values for both delta and the formatted description

trace_context.py (P1/P2 issues)

  • Moved framework restore hook application outside the if replay_events: block — hooks are now applied whenever restored_state is available
  • Replaced hardcoded "http://localhost:8000" with _resolve_restore_server_url(server_url) (respects get_config().endpoint)
  • Added drift compare() call inside the replay event loop — ctx.drift_events is now populated during replay

session_manager.py

  • Added comment clarifying that _checkpoint_sequence is the checkpoint's own counter, not a per-event sequence cursor

scripts/record_demo_gifs.js

  • Removed hardcoded /home/nistrator/... CHROME_PATH default — executablePath is only passed when CHROME_PATH env var is explicitly set
  • UI now defaults to http://localhost:5173 (Vite dev server) via UI_URL env var

All 2158 tests pass, ruff reports no issues.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 02fcfd324f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

# Checkpoint's own sequence counter (not per-event sequence).
# Post-checkpoint event filtering should compare event sequences
# against this checkpoint counter value.
"_checkpoint_sequence": checkpoint_data.get("sequence", 0),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Use event sequence, not checkpoint ordinal, for replay cutoff

_checkpoint_sequence is populated from checkpoint_data["sequence"], but checkpoint sequence is the checkpoint counter (incremented in TraceContext.create_checkpoint) while replay filtering compares against per-event metadata.sequence in AutoReplayManager. In sessions with multiple events before a checkpoint, this mismatched scale causes many pre-checkpoint events to pass the > checkpoint_sequence filter and be replayed as if they happened after restore, corrupting replay windows and downstream drift analysis.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed — _build_restored_session now stores both _checkpoint_sequence (checkpoint counter, kept for reference) and _checkpoint_timestamp (ISO 8601 string, used as the authoritative replay cursor). Filtering uses the timestamp comparison, which is scale-independent and avoids conflating the checkpoint counter with per-event metadata.sequence (commit 8d8ed6f).

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in commit 8d8ed6f_build_restored_session now stores _checkpoint_timestamp (ISO 8601) in session config alongside _checkpoint_sequence. AutoReplayManager uses timestamp comparison (e.get('timestamp') > checkpoint_timestamp) as the authoritative cursor, making filtering scale-independent and avoiding conflation with the per-event metadata.sequence.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed — _build_restored_session now stores _checkpoint_timestamp (ISO 8601) alongside _checkpoint_sequence in the session config (commit 8d8ed6f). AutoReplayManager.fetch_post_checkpoint_events() uses the timestamp as the authoritative filter cursor, which is scale-independent and avoids conflating the checkpoint ordinal with per-event metadata.sequence.

if track_drift:
from agent_debugger_sdk.drift import DriftDetector

ctx._drift_detector = DriftDetector([])
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Seed DriftDetector with baseline events before comparisons

When track_drift=True, restore initializes DriftDetector([]), so compare() immediately returns None for every replayed event because index >= len(original_events) is always true. This makes drift tracking effectively impossible in all tracked replays, even when replayed decisions differ from the original run.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed — when track_drift=True, the code now fetches the original session's post-checkpoint events via a separate AutoReplayManager and seeds DriftDetector with them before the replay loop. This gives compare() a meaningful baseline so it can detect real divergence (commit 8d8ed6f).

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in commit 8d8ed6fDriftDetector is now seeded with the fetched original post-checkpoint events before the replay loop. When a distinct original session is available, its events are fetched via a separate AutoReplayManager and passed to DriftDetector(original_events). compare() now has a meaningful index-aligned baseline for every call.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed — when track_drift=True, DriftDetector is now seeded with the original session's post-checkpoint events before any comparisons begin (commit df40156). The baseline is fetched via a separate AutoReplayManager pointing at session.config['original_session_id']. When no distinct original session exists, the fetched replay events are used as the baseline, and drift comparisons degrade gracefully to no-op.

Comment thread agent_debugger_sdk/checkpoints/hooks.py Outdated
Comment on lines +112 to +114
response = await client.get(
f"{self.server_url}/api/sessions/{self.session_id}/traces"
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Paginate trace fetches when replaying post-checkpoint events

Replay fetches call /api/sessions/{id}/traces without limit/offset paging, but that endpoint defaults to limit=100. Any session with more than 100 post-checkpoint events will be silently truncated, so auto-replay and drift checks only see a partial tail of the original run.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed — fetch_post_checkpoint_events() now paginates with _PAGE_SIZE = 100, looping until a page shorter than the page size is returned, so sessions with more than 100 post-checkpoint events are fully covered (commit 8d8ed6f).

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in commit 8d8ed6fAutoReplayManager now paginates using _PAGE_SIZE = 100 with offset-based fetching. The loop continues until a page shorter than _PAGE_SIZE is returned, ensuring sessions with more than 100 post-checkpoint events are fully covered.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed — fetch_post_checkpoint_events() now paginates using limit/offset with a page size of 100 (commit 8d8ed6f). The loop continues until a page smaller than the page size is returned, so sessions with more than 100 post-checkpoint events are fully covered.

- session_manager.py: store _checkpoint_timestamp (ISO 8601) in restored
  session config so event filtering uses a scale-independent cursor rather
  than the checkpoint's own sequence counter (fixes P1 #3036302475)
- hooks.py: AutoReplayManager now accepts checkpoint_timestamp and uses it
  for timestamp-based event filtering when available; falls back to
  metadata.sequence comparison otherwise; fetch_post_checkpoint_events
  paginates through all pages (PAGE_SIZE=100) so sessions >100 events are
  fully covered (fixes P2 #3036302478)
- trace_context.py: DriftDetector is now seeded with the fetched original
  post-checkpoint events before the replay loop, so compare() has a
  meaningful baseline and drift is actually detectable (fixes P1 #3036302477);
  also passes checkpoint_timestamp to AutoReplayManager

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@acailic
Copy link
Copy Markdown
Owner Author

acailic commented Apr 5, 2026

Addressed the remaining review comments in commit 8d8ed6f:

P1 #3036302475 / #3036247219 — Use event cursor, not checkpoint ordinal, for replay filtering
session_manager.py now stores _checkpoint_timestamp (ISO 8601 string from checkpoint_data["timestamp"]) alongside the existing sequence counter. AutoReplayManager accepts this as checkpoint_timestamp and uses timestamp comparison for event filtering when available — avoiding the scale mismatch between the checkpoint counter (1, 2, 3…) and per-event metadata.sequence values.

P1 #3036302477 / #3036244821 — Seed DriftDetector with baseline events
DriftDetector is no longer initialized with []. After fetching post-checkpoint events, trace_context.restore() now populates the detector with the original events as baseline before the replay loop runs. If original_session_id differs from replay_session_id, a separate fetch is made for the original; otherwise the already-fetched events are used directly.

P2 #3036302478 — Paginate trace fetches
fetch_post_checkpoint_events() now paginates using limit=100 / offset until a partial page is returned, so sessions with more than 100 post-checkpoint events are fully covered.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8d8ed6f93e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +247 to +250
original_session = original_session_id or session.config.get(
"original_session_id"
)
if original_session and original_session != replay_session_id:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Compare drift against original events, not replayed copies

Both replay_session_id and original_session are built from the same fallback chain (original_session_id then session.config['original_session_id']), so original_session != replay_session_id never becomes true. That makes the orig_manager branch effectively dead and forces original_events = list(events), so drift checks compare replayed events to themselves and track_drift=True cannot report real divergence.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed — original_session is now resolved from session.config.get('original_session_id') first (the checkpoint's stored baseline), falling back to the caller-supplied original_session_id. The caller-supplied ID is used as replay_session_id (the new run being compared). This ensures the two variables resolve differently when they should, so original_session != replay_session_id is true for real cross-run drift comparisons (commit 8d8ed6f).

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in commit 8d8ed6f — the two variables intentionally use different resolution order: replay_session_id prefers the caller-supplied original_session_id (the new re-run) first, while original_session prefers session.config['original_session_id'] (the stored original baseline) first. When a caller passes original_session_id for a comparison run, the two diverge, enabling drift to be compared against the true original rather than itself.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed — the baseline and replay session IDs now use different priority chains (commit 8d8ed6f): replay_session_id prefers the caller-supplied original_session_id arg (typically pointing at the new re-run), while original_session prefers session.config['original_session_id'] (the stored original). When both sources are present and differ, the condition original_session != replay_session_id is true and a separate AutoReplayManager fetches the real baseline events.

Comment on lines +78 to +81
for key in keys:
# Check top-level first (where TraceEvent.to_dict() puts typed fields)
if key in event:
return event[key]
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Skip null top-level values before legacy key fallback

_get_field() returns as soon as a top-level key exists, even if the value is None. Since API-shaped events include optional fields like chosen_action with null defaults, _get_field(event, 'chosen_action', 'action') exits on the null top-level field and never checks legacy data['action'], causing drift detection to miss action mismatches in mixed-format traces.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed — _get_field() now checks event[key] is not None before returning the top-level value, so optional fields that are present but null in API-shaped events (e.g. chosen_action: null) don't block the fallback to event['data'] (commit e987999).

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in commit e987999_get_field() now checks event[key] is not None before returning a top-level value. Optional API fields present with a null default (e.g. chosen_action: null) no longer shadow the legacy nested data[key] lookup, so drift detection correctly falls through to the nested value.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed — _get_field() now checks event[key] is not None before returning the top-level value (commit e987999). Null top-level fields (optional API fields with null defaults) are skipped and the lookup falls through to event['data'], so legacy nested payloads are not masked by null API-shape fields.

- drift.py: _get_field now skips None top-level values so null API fields
  fall back to the nested data dict instead of masking legacy payloads
  (fixes #3036391794)
- trace_context.py: fix original_session vs replay_session_id priority so
  drift comparison uses checkpoint-stored original as baseline rather than
  comparing replayed events against themselves (#3036391792)
- trace_context.py: store restored_target on ctx so hook result is
  accessible to callers (#3036247248)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@acailic
Copy link
Copy Markdown
Owner Author

acailic commented Apr 5, 2026

Addressed remaining review feedback in commit e987999:

#3036391794 (drift.py_get_field returning None top-level values): _get_field now checks event[key] is not None before returning, so null API fields fall back to the nested data dict. Legacy payloads where optional fields (e.g. chosen_action) are null at the top level will now correctly resolve the value from data.

#3036391792 (trace_context.py — drift comparing replayed events against themselves): Fixed the priority order for original_session: it now uses session.config.get("original_session_id") (the checkpoint's stored baseline) first, while replay_session_id continues to prefer the caller-supplied original_session_id (the new re-run). When these differ, original_session != replay_session_id is now True, triggering a separate fetch of baseline events for accurate drift comparison.

#3036247248 (trace_context.py — restore hook result not stored): The result of apply_restore_hook is now stored as ctx.restored_target so callers can access the reconstructed framework state after restore.

…test

The _build_restored_session helper stores _checkpoint_timestamp in the
session config but the test assertion did not include it, causing CI
failures on all Python versions.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f66dd5d11f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread agent_debugger_sdk/checkpoints/hooks.py Outdated
Comment on lines +45 to +46
if hasattr(state, attr) and hasattr(target, attr):
setattr(target, attr, copy.deepcopy(getattr(state, attr)))
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Copy fallback restore fields onto fresh targets

The generic restore path currently does nothing for the default restore flow because TraceContext.restore() passes a fresh SimpleNamespace target, and _generic_restore_hook() only assigns when the target already has each attribute. For non-LangChain/custom frameworks this leaves ctx.restored_target empty even when checkpoint state contains data/messages, so fallback restoration is effectively disabled instead of reconstructing common state.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in commit df40156_generic_restore_hook now copies attributes from state to target unconditionally (dropping the hasattr(target, attr) guard), so fresh SimpleNamespace targets receive all common state fields via deepcopy.

checkpoint_timestamp=checkpoint_timestamp,
)
try:
original_events = await orig_manager.fetch_post_checkpoint_events()
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Match drift baseline filtering to replay importance threshold

When track_drift=True and importance_threshold is set, replay events are filtered by importance, but baseline events for DriftDetector are fetched without the same threshold. Because drift comparison is index-based, omitted low-importance replay events shift alignment and can produce false drift reports (for example comparing replay event #1 to an unfiltered baseline event #1 that was intentionally skipped from replay).

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in commit df40156orig_manager.fetch_post_checkpoint_events() now receives importance_threshold=importance_threshold, so baseline events are filtered with the same threshold as replay events. This keeps index alignment consistent and prevents false drift reports when low-importance events are excluded from replay.

…ne filtering

- _generic_restore_hook now copies attributes onto fresh SimpleNamespace targets
  unconditionally (previously required target to already have the attribute, so
  nothing was ever copied into the blank target passed by TraceContext.restore())
- orig_manager.fetch_post_checkpoint_events() now receives importance_threshold
  so baseline events are filtered consistently with replay events, preventing
  index-misaligned false drift reports when low-importance events are excluded

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@acailic
Copy link
Copy Markdown
Owner Author

acailic commented Apr 5, 2026

All review comments have been addressed. The two remaining issues from this review cycle are fixed in commit df40156:

  • _generic_restore_hook copies onto fresh targets (comment by @chatgpt-codex-connector) — removed the hasattr(target, attr) guard so attributes are always set on the target, including blank SimpleNamespace objects.
  • Drift baseline filtered by importance threshold (comment by @chatgpt-codex-connector) — orig_manager.fetch_post_checkpoint_events() now receives importance_threshold so the baseline and replay event lists are filtered identically, preventing index-misaligned false drift reports.

All earlier P1/P2 items were already fixed in prior commits (68ce3b4, ff792b9). Full test suite passes: 2158 passed, 0 failures.

@acailic
Copy link
Copy Markdown
Owner Author

acailic commented Apr 5, 2026

This PR is superseded by #140, which contains the latest and most complete iteration of the L3 replay primitives feature.

@acailic acailic closed this Apr 5, 2026
@acailic acailic deleted the fix/issue-136-l3-replay-primitives-v2 branch April 6, 2026 21:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: implement L3 replay primitives — RestoreHook, DriftDetector, schema extensions

2 participants