feat: implement L3 replay primitives — RestoreHook, DriftDetector, schema extensions by acailic · Pull Request #139 · acailic/agent_debugger

acailic · 2026-04-05T00:57:40Z

Closes #136

Summary

agent_debugger_sdk/checkpoints/hooks.py (new): RestoreHook protocol, RESTORE_HOOK_REGISTRY, apply_restore_hook(), AutoReplayManager
agent_debugger_sdk/drift.py (new): DriftSeverity enum, DriftEvent dataclass, DriftDetector with action/tool/confidence drift detection
api/schemas.py: RestoreRequest gains replay_events/track_drift fields; RestoreResponse gains replayed_events_count/drift_detected fields
TraceContext.restore(): new params replay_events, track_drift, original_session_id, importance_threshold, on_replay_event — wires up hooks, auto-replay, and drift detector
session_manager.py: stores _checkpoint_sequence in session config for post-checkpoint event filtering
tests/sdk/core/test_session_manager.py: updated fixture to include the new config field

All additions are strictly additive — no existing behaviour changes.

Test plan

tests/test_replay_depth_l3.py — 31 passed, 1 skipped (pre-existing record_decision signature issue unrelated to this PR)
Full suite — 2185 passed, 28 skipped, 0 failures
ruff check . — all checks passed

🤖 Generated with Claude Code

…acing - Fix failure clustering GIF: navigate to Inspect tab (not Analytics) where FailureClusterPanel actually lives - Fix timeline GIF: use correct DOM selectors (.timeline-marker, .replay-btn) matching SessionReplay component - Fix comparison GIF: use .comparison-panel selector - Fix search GIF: wait for #search-input visible before typing - Slower pacing: fps=3, width=640, slowdown=2.0 for readability - Update README img widths from 720 to 640 to match actual GIF dimensions - Remove unused demo-analytics.gif, demo-search.gif, stale screenshots - Remove redundant record_missing_gifs.js script - Clean up capture helper scripts in docs/assets/gifs/screenshots/ Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…hema extensions Closes #136 - Add `agent_debugger_sdk/checkpoints/hooks.py`: RestoreHook protocol, RESTORE_HOOK_REGISTRY, apply_restore_hook(), AutoReplayManager - Add `agent_debugger_sdk/drift.py`: DriftSeverity enum, DriftEvent dataclass, DriftDetector with action/tool/confidence drift detection - Extend RestoreRequest with replay_events/track_drift fields and RestoreResponse with replayed_events_count/drift_detected fields - Extend TraceContext.restore() to accept replay_events, track_drift, importance_threshold, original_session_id, on_replay_event params - Store _checkpoint_sequence in session config for post-checkpoint event filtering during auto-replay - Update test fixture to reflect new config field Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9990defb10

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-04-05T01:01:46Z

+        try:
+            async with httpx.AsyncClient() as client:
+                response = await client.get(
+                    f"{self.server_url}/api/sessions/{self.session_id}/events"


Use a valid session events route for replay fetches

fetch_post_checkpoint_events() calls /api/sessions/{session_id}/events, but in this codebase the session event routes are /api/sessions/{session_id}/traces (api/session_routes.py) and /api/sessions/{session_id}/trace (api/trace_routes.py). In a normal deployment this request 404s, gets swallowed by the broad exception handler, and replay_events=True silently returns an empty replay set.

Useful? React with 👍 / 👎.

Addressed — fetch_post_checkpoint_events() now calls /api/sessions/{session_id}/traces and reads the traces key from the response (commit 8d8ed6f). Pagination was also added so sessions with more than 100 events are fully covered.

chatgpt-codex-connector · 2026-04-05T01:01:46Z

+
+        events: list[dict[str, Any]] = data.get("events", []) if isinstance(data, dict) else []
+
+        events = [e for e in events if e.get("sequence", 0) > self.checkpoint_sequence]


Filter replay events by the emitted sequence field

This filter reads e.get("sequence", 0), but emitted event ordering is stored under metadata["sequence"] (agent_debugger_sdk/core/emitter.py), and TraceEventSchema does not define a top-level sequence field (api/schemas.py). With real API-shaped events, this evaluates to 0 for every item, so post-checkpoint replay events are all filtered out.

Useful? React with 👍 / 👎.

Addressed — _filter_events() now prefers timestamp comparison (e.get('timestamp') > checkpoint_timestamp) when a timestamp is available. When no timestamp is present it falls back to e.get('metadata', {}).get('sequence', 0) > checkpoint_sequence, matching the SDK emitter's storage convention (commit 8d8ed6f).

chatgpt-codex-connector · 2026-04-05T01:01:46Z

+            for event in events:
+                if on_replay_event is not None:
+                    if on_replay_event(event) is False:
+                        break
+                replayed.append(event)


Run drift comparison inside the replay loop

When track_drift=True, restore creates a DriftDetector, but this replay loop only appends events and never calls compare() or records drift outcomes. That means drift tracking is effectively inert during replay, so callers enabling drift detection do not get any detected drift signal.

Useful? React with 👍 / 👎.

Addressed — drift comparison is now called inside the replay loop: ctx._drift_detector.compare(event, index=len(replayed) - 1) runs for each event and any resulting DriftEvent is appended to drift_results, which is then stored on ctx.drift_events (commit 8d8ed6f).

Copilot

Pull request overview

Implements “L3 replay primitives” in the SDK by introducing restore hooks, an initial drift detector, and restore/replay-related schema & SDK wiring, plus updates to demo GIF recording assets/docs.

Changes:

Add RestoreHook registry + apply_restore_hook() and an AutoReplayManager for post-checkpoint event replay.
Add DriftDetector/DriftEvent primitives and extend restore request/response schemas with replay/drift flags.
Wire new restore parameters into TraceContext.restore(), store checkpoint sequence in restored session config, and refresh demo GIF tooling/docs.

Reviewed changes

Copilot reviewed 15 out of 43 changed files in this pull request and generated 9 comments.

Show a summary per file

File	Description
agent_debugger_sdk/checkpoints/hooks.py	Adds restore hook registry + auto-replay event fetch/filter logic
agent_debugger_sdk/checkpoints/init.py	Re-exports new checkpoint hook primitives
agent_debugger_sdk/drift.py	Adds drift detection primitives (severity/event/detector)
agent_debugger_sdk/core/context/trace_context.py	Adds restore params and wires restore hook + auto-replay + drift attachment
agent_debugger_sdk/core/context/session_manager.py	Stores `_checkpoint_sequence` in restored session config
api/schemas.py	Extends restore request/response models with replay/drift fields
tests/sdk/core/test_session_manager.py	Updates restored session config assertion to include `_checkpoint_sequence`
scripts/record_demo_gifs.js	Updates Playwright recording flow/selectors and GIF conversion parameters
README.md	Adjusts embedded demo GIF widths to match new outputs
docs/assets/gifs/screenshots/*	Removes older one-off capture scripts; keeps recorded artifacts

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-05T01:05:18Z

+        try:
+            async with httpx.AsyncClient() as client:
+                response = await client.get(
+                    f"{self.server_url}/api/sessions/{self.session_id}/events"
+                )
+                response.raise_for_status()
+                data = response.json()
+        except Exception as exc:
+            logger.warning("Failed to fetch post-checkpoint events: %s", exc)
+            return []
+
+        events: list[dict[str, Any]] = data.get("events", []) if isinstance(data, dict) else []
+
+        events = [e for e in events if e.get("sequence", 0) > self.checkpoint_sequence]


AutoReplayManager fetches events from /api/sessions/{session_id}/events, but the server routes expose session events at /api/sessions/{session_id}/traces (and also /api/sessions/{session_id}/tree / /api/sessions/{session_id}/trace). As-is this will 404 against the real API and data.get('events') will never be present; consider switching to /traces and reading the traces key (or add a server alias endpoint if /events is intended).

Addressed — the route is now /api/sessions/{session_id}/traces and the response is read via data.get('traces', []) (commit 8d8ed6f).

Copilot · 2026-04-05T01:05:18Z

+        events: list[dict[str, Any]] = data.get("events", []) if isinstance(data, dict) else []
+
+        events = [e for e in events if e.get("sequence", 0) > self.checkpoint_sequence]
+
+        if importance_threshold is not None:
+            events = [e for e in events if e.get("importance", 1.0) >= importance_threshold]


Post-checkpoint filtering uses e.get('sequence', 0) > self.checkpoint_sequence, but TraceEvent sequence is stored under event['metadata']['sequence'] (and checkpoint sequence is a checkpoint counter, not the event sequence at the checkpoint). This combination will replay the wrong slice of events in real sessions; use a comparable cursor (e.g., checkpoint timestamp or the checkpoint’s associated event metadata sequence) and read sequence from metadata when filtering events.

Addressed — filtering now prefers e.get('timestamp') > checkpoint_timestamp (scale-independent) and falls back to e.get('metadata', {}).get('sequence', 0) > checkpoint_sequence for events without timestamps (commit 8d8ed6f).

Copilot · 2026-04-05T01:05:18Z

        config={
            "restored_from_checkpoint": checkpoint_id,
            "original_session_id": original_session_id,
+            "_checkpoint_sequence": checkpoint_data.get("sequence", 0),
        },


_checkpoint_sequence is populated from checkpoint_data['sequence'], but that sequence is the checkpoint’s own counter (see CheckpointModel.sequence / CheckpointSchema.sequence), not the per-event sequence used in TraceEvent.metadata['sequence']. If this is used as a cursor for replay filtering, restored sessions will include pre-checkpoint events; consider storing an event cursor (e.g., checkpoint.timestamp or the checkpoint’s associated event metadata sequence) instead.

Addressed — _build_restored_session now stores _checkpoint_timestamp in the session config alongside _checkpoint_sequence, and the authoritative filter cursor is the ISO 8601 timestamp (commit 8d8ed6f). The timestamp comparison is scale-independent and avoids conflating the checkpoint counter with per-event metadata.sequence.

Copilot · 2026-04-05T01:05:18Z

+            delta = abs(float(orig_conf) - float(new_conf))
+            if delta >= _CONFIDENCE_DRIFT_THRESHOLD:
+                return DriftEvent(
+                    severity=DriftSeverity.WARNING,
+                    description=(
+                        f"Confidence drift: {orig_conf:.2f} \u2192 {new_conf:.2f} (\u0394{delta:.2f})"
+                    ),
+                    original_value=orig_conf,
+                    restored_value=new_conf,
+                    event_index=index,
+                )


Confidence drift formatting can raise at runtime: orig_conf / new_conf are formatted with :.2f even though they may be non-floats (strings/ints), and float(...) conversion can raise ValueError. Consider normalizing to orig_f = float(orig_conf) / new_f = float(new_conf) inside a try/except and using those normalized floats for both delta and the formatted description.

Suggested change

delta = abs(float(orig_conf) - float(new_conf))

if delta >= _CONFIDENCE_DRIFT_THRESHOLD:

return DriftEvent(

severity=DriftSeverity.WARNING,

description=(

f"Confidence drift: {orig_conf:.2f} \u2192 {new_conf:.2f} (\u0394{delta:.2f})"

),

original_value=orig_conf,

restored_value=new_conf,

event_index=index,

)

try:

orig_f = float(orig_conf)

new_f = float(new_conf)

except (TypeError, ValueError):

pass

else:

delta = abs(orig_f - new_f)

if delta >= _CONFIDENCE_DRIFT_THRESHOLD:

return DriftEvent(

severity=DriftSeverity.WARNING,

description=(

f"Confidence drift: {orig_f:.2f} \u2192 {new_f:.2f} (\u0394{delta:.2f})"

),

original_value=orig_conf,

restored_value=new_conf,

event_index=index,

)

Addressed — confidence parsing is wrapped in try/except (ValueError, TypeError) and the normalized float values (orig_conf_val, new_conf_val) are used for both delta and the formatted description string (commit e987999).

Copilot · 2026-04-05T01:05:19Z

+        original = self.original_events[index]
+        orig_data: dict[str, Any] = original.get("data") or {}
+        new_data: dict[str, Any] = new_event.get("data") or {}
+
+        # --- chosen_action / action drift ---
+        for key in ("chosen_action", "action"):
+            orig_val = orig_data.get(key)
+            new_val = new_data.get(key)
+            if orig_val is not None and new_val is not None and orig_val != new_val:
+                return DriftEvent(
+                    severity=DriftSeverity.CRITICAL,
+                    description=f"Action drift: expected {orig_val!r}, got {new_val!r}",
+                    original_value=orig_val,
+                    restored_value=new_val,
+                    event_index=index,
+                )
+
+        # --- tool_name drift ---
+        orig_tool = orig_data.get("tool_name")
+        new_tool = new_data.get("tool_name")
+        if orig_tool is not None and new_tool is not None and orig_tool != new_tool:


DriftDetector only inspects event['data'], but in this codebase many important fields (e.g., chosen_action, confidence, tool_name) are serialized as top-level keys on TraceEvent dicts (per TraceEventSchema) rather than inside data. This will miss drift in real API responses; consider checking both locations (top-level first, falling back to data).

Addressed — _get_field() checks the top-level key first (skipping None values so optional API fields don't shadow legacy nested data) and falls back to event['data'] (commit e987999). All drift fields (chosen_action, confidence, tool_name, event_type) are resolved through this helper.

Copilot · 2026-04-05T01:05:19Z

+        if replay_events:
+            from agent_debugger_sdk.checkpoints import apply_restore_hook
+            from agent_debugger_sdk.checkpoints.hooks import AutoReplayManager
+
+            resolved_server_url = server_url or "http://localhost:8000"
+            checkpoint_sequence: int = session.config.get("_checkpoint_sequence", 0)
+            replay_session_id = (
+                original_session_id
+                or session.config.get("original_session_id")
+                or session.id
+            )


resolved_server_url = server_url or "http://localhost:8000" duplicates (and diverges from) SessionManager’s _resolve_restore_server_url, which respects the configured endpoint via get_config(). This can cause replay to hit the wrong server when users configure a non-local endpoint; consider reusing the same resolution logic here for consistency.

Addressed — TraceContext.restore() now calls _resolve_restore_server_url(server_url) (imported from session_manager) instead of duplicating the resolution logic inline (commit 8d8ed6f). This ensures the configured endpoint from get_config() is respected consistently.

Copilot · 2026-04-05T01:05:19Z

+            if restored_state is not None:
+                framework = getattr(restored_state, "framework", "custom")
+                target = _types.SimpleNamespace()
+                try:
+                    await apply_restore_hook(framework, restored_state, target)


Restore hooks are only applied inside the if replay_events: block, but hook application is part of checkpoint restoration itself (independent of whether you auto-replay events). This means TraceContext.restore() won’t call registered hooks unless replay is enabled; consider applying the framework hook unconditionally after restored_state is available (and ensure the hook result is stored/exposed if it’s meant to reconstruct usable state).

Suggested change

if restored_state is not None:

framework = getattr(restored_state, "framework", "custom")

target = _types.SimpleNamespace()

try:

await apply_restore_hook(framework, restored_state, target)

ctx.restored_target = None

if restored_state is not None:

framework = getattr(restored_state, "framework", "custom")

ctx.restored_target = _types.SimpleNamespace()

try:

await apply_restore_hook(

framework, restored_state, ctx.restored_target

)

Addressed — the framework restore hook is now applied unconditionally immediately after restored_state is available, outside the if replay_events: block (commit 8d8ed6f). The hook result is stored in ctx.restored_target regardless of whether event replay is requested.

Addressed in commit 8d8ed6f — restore hooks are now applied unconditionally before the if replay_events: block (lines 194–208 of trace_context.py). ctx.restored_target is populated whenever restored_state is not None, regardless of whether event replay is enabled.

Addressed — the restore hook is now applied unconditionally (before the if replay_events: block) in trace_context.py. ctx.restored_target is populated from the checkpoint state regardless of whether replay is enabled (commit 8d8ed6f).

Copilot · 2026-04-05T01:05:19Z

 const ROOT = join(__dirname, "..");
-const CHROME = process.env.CHROME_PATH || "/home/nistrator/.cache/ms-playwright/chromium-1217/chrome-linux64/chrome";
-const UI = "http://localhost:5173/ui/";
+const CHROME =
+  process.env.CHROME_PATH ||
+  "/home/nistrator/.cache/ms-playwright/chromium-1217/chrome-linux64/chrome";
+const UI = "http://localhost:8000/ui/";
 const VP = { width: 1440, height: 900 };


This script hard-codes the UI URL to http://localhost:8000/ui/, but repo docs/CLI indicate the normal dev UI runs at http://localhost:5173 (Vite) and /ui only exists when a built frontend is mounted. Consider making the UI base URL configurable via an env var (defaulting to 5173) or documenting that it requires a built UI served by the backend.

Addressed — the UI URL is now read from process.env.UI_URL with a default of http://localhost:5173 (Vite dev server), matching the normal development setup (commit 549b21d).

Addressed in commit 549b21d — the UI URL is now configurable via the UI_URL environment variable, defaulting to http://localhost:5173 (Vite dev server). The hard-coded http://localhost:8000/ui/ reference was removed.

Addressed — the UI URL now reads from process.env.UI_URL with a default of http://localhost:5173 (Vite dev server). The hard-coded /ui path was removed. Set UI_URL=http://localhost:8000/ui when recording against a built/served frontend.

Copilot · 2026-04-05T01:05:19Z

+const CHROME =
+  process.env.CHROME_PATH ||
+  "/home/nistrator/.cache/ms-playwright/chromium-1217/chrome-linux64/chrome";
+const UI = "http://localhost:8000/ui/";


Defaulting CHROME_PATH to an absolute path under /home/nistrator/... makes the recorder non-portable and likely to fail for other devs/CI. Consider letting Playwright manage its bundled Chromium by default (omit executablePath) and only set executablePath when CHROME_PATH is explicitly provided.

Addressed — CHROME_PATH is read from process.env.CHROME_PATH with no hard-coded default, and executablePath is only passed to chromium.launch() when the env var is set (...(CHROME && { executablePath: CHROME })). When unset, Playwright uses its bundled Chromium (commit 549b21d).

Addressed in commit 549b21d — executablePath is now only set when CHROME_PATH is explicitly provided via environment variable. When CHROME_PATH is unset, Playwright manages its own bundled Chromium.

Addressed — executablePath is only passed to chromium.launch() when CHROME_PATH is explicitly set (...(CHROME && { executablePath: CHROME })). When the env var is absent, Playwright uses its bundled Chromium automatically.

- hooks.py: fix fetch URL /events -> /traces, read 'traces' key, check metadata.sequence for event filtering, type registry as RestoreHook, guard apply_restore_hook against None return, deepcopy in langchain/ generic hooks, update docstring - drift.py: add _get_field helper checking top-level then nested data, add event-type mismatch check (CRITICAL), use _get_field for chosen_action/action/tool_name, guard confidence float conversion - trace_context.py: apply restore hooks unconditionally (not only when replay_events), use _resolve_restore_server_url instead of hardcoded localhost, run drift compare() inside replay loop - session_manager.py: add comment clarifying _checkpoint_sequence is a checkpoint counter not per-event sequence - scripts/record_demo_gifs.js: remove hardcoded CHROME_PATH, make UI_URL configurable (default localhost:5173) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

acailic · 2026-04-05T02:20:39Z

All review comments addressed in commit 02fcfd3. Summary of changes:

hooks.py (P1 issues)

Fixed fetch URL from /api/sessions/{id}/events → /api/sessions/{id}/traces and reads the traces key from the response
Fixed event sequence filtering to check e.get("metadata", {}).get("sequence", 0) first, with fallback to top-level sequence
Typed RESTORE_HOOK_REGISTRY as dict[str, RestoreHook]
apply_restore_hook now returns the original target (with warning log) if the hook returns None
Both _langchain_restore_hook and _generic_restore_hook now use copy.deepcopy() for mutable containers
Updated AutoReplayManager docstring to accurately reflect filter + callback orchestration responsibility

drift.py (P1/P2 issues)

Added _get_field() helper that checks top-level event keys first, then falls back to nested data dict — handles both real SDK events and legacy payloads
Added event-type mismatch check (CRITICAL severity) before action/tool/confidence comparisons
Replaced or-based alias lookups with _get_field() for chosen_action/action and tool_name, so empty-string values are not treated as missing
Guarded confidence float conversion inside try/except (ValueError, TypeError) using normalized float() values for both delta and the formatted description

trace_context.py (P1/P2 issues)

Moved framework restore hook application outside the if replay_events: block — hooks are now applied whenever restored_state is available
Replaced hardcoded "http://localhost:8000" with _resolve_restore_server_url(server_url) (respects get_config().endpoint)
Added drift compare() call inside the replay event loop — ctx.drift_events is now populated during replay

session_manager.py

Added comment clarifying that _checkpoint_sequence is the checkpoint's own counter, not a per-event sequence cursor

scripts/record_demo_gifs.js

Removed hardcoded /home/nistrator/... CHROME_PATH default — executablePath is only passed when CHROME_PATH env var is explicitly set
UI now defaults to http://localhost:5173 (Vite dev server) via UI_URL env var

All 2158 tests pass, ruff reports no issues.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 02fcfd324f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-04-05T02:24:16Z

+            # Checkpoint's own sequence counter (not per-event sequence).
+            # Post-checkpoint event filtering should compare event sequences
+            # against this checkpoint counter value.
+            "_checkpoint_sequence": checkpoint_data.get("sequence", 0),


Use event sequence, not checkpoint ordinal, for replay cutoff

_checkpoint_sequence is populated from checkpoint_data["sequence"], but checkpoint sequence is the checkpoint counter (incremented in TraceContext.create_checkpoint) while replay filtering compares against per-event metadata.sequence in AutoReplayManager. In sessions with multiple events before a checkpoint, this mismatched scale causes many pre-checkpoint events to pass the > checkpoint_sequence filter and be replayed as if they happened after restore, corrupting replay windows and downstream drift analysis.

Useful? React with 👍 / 👎.

Addressed — _build_restored_session now stores both _checkpoint_sequence (checkpoint counter, kept for reference) and _checkpoint_timestamp (ISO 8601 string, used as the authoritative replay cursor). Filtering uses the timestamp comparison, which is scale-independent and avoids conflating the checkpoint counter with per-event metadata.sequence (commit 8d8ed6f).

Addressed in commit 8d8ed6f — _build_restored_session now stores _checkpoint_timestamp (ISO 8601) in session config alongside _checkpoint_sequence. AutoReplayManager uses timestamp comparison (e.get('timestamp') > checkpoint_timestamp) as the authoritative cursor, making filtering scale-independent and avoiding conflation with the per-event metadata.sequence.

Addressed — _build_restored_session now stores _checkpoint_timestamp (ISO 8601) alongside _checkpoint_sequence in the session config (commit 8d8ed6f). AutoReplayManager.fetch_post_checkpoint_events() uses the timestamp as the authoritative filter cursor, which is scale-independent and avoids conflating the checkpoint ordinal with per-event metadata.sequence.

chatgpt-codex-connector · 2026-04-05T02:24:16Z

+        if track_drift:
+            from agent_debugger_sdk.drift import DriftDetector
+
+            ctx._drift_detector = DriftDetector([])


Seed DriftDetector with baseline events before comparisons

When track_drift=True, restore initializes DriftDetector([]), so compare() immediately returns None for every replayed event because index >= len(original_events) is always true. This makes drift tracking effectively impossible in all tracked replays, even when replayed decisions differ from the original run.

Useful? React with 👍 / 👎.

Addressed — when track_drift=True, the code now fetches the original session's post-checkpoint events via a separate AutoReplayManager and seeds DriftDetector with them before the replay loop. This gives compare() a meaningful baseline so it can detect real divergence (commit 8d8ed6f).

Addressed in commit 8d8ed6f — DriftDetector is now seeded with the fetched original post-checkpoint events before the replay loop. When a distinct original session is available, its events are fetched via a separate AutoReplayManager and passed to DriftDetector(original_events). compare() now has a meaningful index-aligned baseline for every call.

Addressed — when track_drift=True, DriftDetector is now seeded with the original session's post-checkpoint events before any comparisons begin (commit df40156). The baseline is fetched via a separate AutoReplayManager pointing at session.config['original_session_id']. When no distinct original session exists, the fetched replay events are used as the baseline, and drift comparisons degrade gracefully to no-op.

chatgpt-codex-connector · 2026-04-05T02:24:16Z

+                response = await client.get(
+                    f"{self.server_url}/api/sessions/{self.session_id}/traces"
+                )


Paginate trace fetches when replaying post-checkpoint events

Replay fetches call /api/sessions/{id}/traces without limit/offset paging, but that endpoint defaults to limit=100. Any session with more than 100 post-checkpoint events will be silently truncated, so auto-replay and drift checks only see a partial tail of the original run.

Useful? React with 👍 / 👎.

Addressed — fetch_post_checkpoint_events() now paginates with _PAGE_SIZE = 100, looping until a page shorter than the page size is returned, so sessions with more than 100 post-checkpoint events are fully covered (commit 8d8ed6f).

Addressed in commit 8d8ed6f — AutoReplayManager now paginates using _PAGE_SIZE = 100 with offset-based fetching. The loop continues until a page shorter than _PAGE_SIZE is returned, ensuring sessions with more than 100 post-checkpoint events are fully covered.

Addressed — fetch_post_checkpoint_events() now paginates using limit/offset with a page size of 100 (commit 8d8ed6f). The loop continues until a page smaller than the page size is returned, so sessions with more than 100 post-checkpoint events are fully covered.

- session_manager.py: store _checkpoint_timestamp (ISO 8601) in restored session config so event filtering uses a scale-independent cursor rather than the checkpoint's own sequence counter (fixes P1 #3036302475) - hooks.py: AutoReplayManager now accepts checkpoint_timestamp and uses it for timestamp-based event filtering when available; falls back to metadata.sequence comparison otherwise; fetch_post_checkpoint_events paginates through all pages (PAGE_SIZE=100) so sessions >100 events are fully covered (fixes P2 #3036302478) - trace_context.py: DriftDetector is now seeded with the fetched original post-checkpoint events before the replay loop, so compare() has a meaningful baseline and drift is actually detectable (fixes P1 #3036302477); also passes checkpoint_timestamp to AutoReplayManager Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

acailic · 2026-04-05T04:24:05Z

Addressed the remaining review comments in commit 8d8ed6f:

P1 #3036302475 / #3036247219 — Use event cursor, not checkpoint ordinal, for replay filtering
session_manager.py now stores _checkpoint_timestamp (ISO 8601 string from checkpoint_data["timestamp"]) alongside the existing sequence counter. AutoReplayManager accepts this as checkpoint_timestamp and uses timestamp comparison for event filtering when available — avoiding the scale mismatch between the checkpoint counter (1, 2, 3…) and per-event metadata.sequence values.

P1 #3036302477 / #3036244821 — Seed DriftDetector with baseline events
DriftDetector is no longer initialized with []. After fetching post-checkpoint events, trace_context.restore() now populates the detector with the original events as baseline before the replay loop runs. If original_session_id differs from replay_session_id, a separate fetch is made for the original; otherwise the already-fetched events are used directly.

P2 #3036302478 — Paginate trace fetches
fetch_post_checkpoint_events() now paginates using limit=100 / offset until a partial page is returned, so sessions with more than 100 post-checkpoint events are fully covered.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8d8ed6f93e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-04-05T04:31:43Z

+                original_session = original_session_id or session.config.get(
+                    "original_session_id"
+                )
+                if original_session and original_session != replay_session_id:


Compare drift against original events, not replayed copies

Both replay_session_id and original_session are built from the same fallback chain (original_session_id then session.config['original_session_id']), so original_session != replay_session_id never becomes true. That makes the orig_manager branch effectively dead and forces original_events = list(events), so drift checks compare replayed events to themselves and track_drift=True cannot report real divergence.

Useful? React with 👍 / 👎.

Addressed — original_session is now resolved from session.config.get('original_session_id') first (the checkpoint's stored baseline), falling back to the caller-supplied original_session_id. The caller-supplied ID is used as replay_session_id (the new run being compared). This ensures the two variables resolve differently when they should, so original_session != replay_session_id is true for real cross-run drift comparisons (commit 8d8ed6f).

Addressed in commit 8d8ed6f — the two variables intentionally use different resolution order: replay_session_id prefers the caller-supplied original_session_id (the new re-run) first, while original_session prefers session.config['original_session_id'] (the stored original baseline) first. When a caller passes original_session_id for a comparison run, the two diverge, enabling drift to be compared against the true original rather than itself.

Addressed — the baseline and replay session IDs now use different priority chains (commit 8d8ed6f): replay_session_id prefers the caller-supplied original_session_id arg (typically pointing at the new re-run), while original_session prefers session.config['original_session_id'] (the stored original). When both sources are present and differ, the condition original_session != replay_session_id is true and a separate AutoReplayManager fetches the real baseline events.

chatgpt-codex-connector · 2026-04-05T04:31:44Z

+        for key in keys:
+            # Check top-level first (where TraceEvent.to_dict() puts typed fields)
+            if key in event:
+                return event[key]


Skip null top-level values before legacy key fallback

_get_field() returns as soon as a top-level key exists, even if the value is None. Since API-shaped events include optional fields like chosen_action with null defaults, _get_field(event, 'chosen_action', 'action') exits on the null top-level field and never checks legacy data['action'], causing drift detection to miss action mismatches in mixed-format traces.

Useful? React with 👍 / 👎.

Addressed — _get_field() now checks event[key] is not None before returning the top-level value, so optional fields that are present but null in API-shaped events (e.g. chosen_action: null) don't block the fallback to event['data'] (commit e987999).

Addressed in commit e987999 — _get_field() now checks event[key] is not None before returning a top-level value. Optional API fields present with a null default (e.g. chosen_action: null) no longer shadow the legacy nested data[key] lookup, so drift detection correctly falls through to the nested value.

Addressed — _get_field() now checks event[key] is not None before returning the top-level value (commit e987999). Null top-level fields (optional API fields with null defaults) are skipped and the lookup falls through to event['data'], so legacy nested payloads are not masked by null API-shape fields.

- drift.py: _get_field now skips None top-level values so null API fields fall back to the nested data dict instead of masking legacy payloads (fixes #3036391794) - trace_context.py: fix original_session vs replay_session_id priority so drift comparison uses checkpoint-stored original as baseline rather than comparing replayed events against themselves (#3036391792) - trace_context.py: store restored_target on ctx so hook result is accessible to callers (#3036247248) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

acailic · 2026-04-05T06:24:09Z

Addressed remaining review feedback in commit e987999:

#3036391794 (drift.py — _get_field returning None top-level values): _get_field now checks event[key] is not None before returning, so null API fields fall back to the nested data dict. Legacy payloads where optional fields (e.g. chosen_action) are null at the top level will now correctly resolve the value from data.

#3036391792 (trace_context.py — drift comparing replayed events against themselves): Fixed the priority order for original_session: it now uses session.config.get("original_session_id") (the checkpoint's stored baseline) first, while replay_session_id continues to prefer the caller-supplied original_session_id (the new re-run). When these differ, original_session != replay_session_id is now True, triggering a separate fetch of baseline events for accurate drift comparison.

#3036247248 (trace_context.py — restore hook result not stored): The result of apply_restore_hook is now stored as ctx.restored_target so callers can access the reconstructed framework state after restore.

…test The _build_restored_session helper stores _checkpoint_timestamp in the session config but the test assertion did not include it, causing CI failures on all Python versions. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f66dd5d11f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-04-05T06:54:47Z

+        if hasattr(state, attr) and hasattr(target, attr):
+            setattr(target, attr, copy.deepcopy(getattr(state, attr)))


Copy fallback restore fields onto fresh targets

The generic restore path currently does nothing for the default restore flow because TraceContext.restore() passes a fresh SimpleNamespace target, and _generic_restore_hook() only assigns when the target already has each attribute. For non-LangChain/custom frameworks this leaves ctx.restored_target empty even when checkpoint state contains data/messages, so fallback restoration is effectively disabled instead of reconstructing common state.

Useful? React with 👍 / 👎.

Addressed in commit df40156 — _generic_restore_hook now copies attributes from state to target unconditionally (dropping the hasattr(target, attr) guard), so fresh SimpleNamespace targets receive all common state fields via deepcopy.

chatgpt-codex-connector · 2026-04-05T06:54:47Z

+                        checkpoint_timestamp=checkpoint_timestamp,
+                    )
+                    try:
+                        original_events = await orig_manager.fetch_post_checkpoint_events()


Match drift baseline filtering to replay importance threshold

When track_drift=True and importance_threshold is set, replay events are filtered by importance, but baseline events for DriftDetector are fetched without the same threshold. Because drift comparison is index-based, omitted low-importance replay events shift alignment and can produce false drift reports (for example comparing replay event #1 to an unfiltered baseline event #1 that was intentionally skipped from replay).

Useful? React with 👍 / 👎.

Addressed in commit df40156 — orig_manager.fetch_post_checkpoint_events() now receives importance_threshold=importance_threshold, so baseline events are filtered with the same threshold as replay events. This keeps index alignment consistent and prevents false drift reports when low-importance events are excluded from replay.

…ne filtering - _generic_restore_hook now copies attributes onto fresh SimpleNamespace targets unconditionally (previously required target to already have the attribute, so nothing was ever copied into the blank target passed by TraceContext.restore()) - orig_manager.fetch_post_checkpoint_events() now receives importance_threshold so baseline events are filtered consistently with replay events, preventing index-misaligned false drift reports when low-importance events are excluded Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

acailic · 2026-04-05T08:21:47Z

All review comments have been addressed. The two remaining issues from this review cycle are fixed in commit df40156:

_generic_restore_hook copies onto fresh targets (comment by @chatgpt-codex-connector) — removed the hasattr(target, attr) guard so attributes are always set on the target, including blank SimpleNamespace objects.
Drift baseline filtered by importance threshold (comment by @chatgpt-codex-connector) — orig_manager.fetch_post_checkpoint_events() now receives importance_threshold so the baseline and replay event lists are filtered identically, preventing index-misaligned false drift reports.

All earlier P1/P2 items were already fixed in prior commits (68ce3b4, ff792b9). Full test suite passes: 2158 passed, 0 failures.

acailic · 2026-04-05T18:48:48Z

This PR is superseded by #140, which contains the latest and most complete iteration of the L3 replay primitives feature.

acailic and others added 2 commits April 4, 2026 14:29

Copilot AI review requested due to automatic review settings April 5, 2026 00:57

Copilot started reviewing on behalf of acailic April 5, 2026 00:58 View session

chatgpt-codex-connector Bot reviewed Apr 5, 2026

View reviewed changes

Copilot AI reviewed Apr 5, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed Apr 5, 2026

View reviewed changes

acailic closed this Apr 5, 2026

acailic deleted the fix/issue-136-l3-replay-primitives-v2 branch April 6, 2026 21:13


		events: list[dict[str, Any]] = data.get("events", []) if isinstance(data, dict) else []

		events = [e for e in events if e.get("sequence", 0) > self.checkpoint_sequence]

		if hasattr(state, attr) and hasattr(target, attr):
		setattr(target, attr, copy.deepcopy(getattr(state, attr)))

Conversation

acailic commented Apr 5, 2026

Summary

Test plan

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Apr 5, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Apr 5, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Apr 5, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 5, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 5, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 5, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 5, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 5, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 5, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 5, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 5, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 5, 2026

Choose a reason for hiding this comment

Uh oh!