This page describes the shortest path from a working debugger core to a research-grade product.
Do not re-open solved contract problems. Build on the now-coherent core path from trace capture to persistence to UI.
The current replay path can start from the nearest checkpoint and expose useful slices of a run. The next step is to restore meaningful agent state, not just replay stored events.
Best next steps:
- standardize what checkpoint state must contain for each adapter
- support deterministic restore hooks per framework
- record state-drift markers when replay diverges from the original run
- expose replay provenance and restore boundaries in the UI
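The checkpoint and drift-marker ideas above can be sketched as follows. This is a minimal illustration, not the repo's actual schema: `Checkpoint`, `DriftMarker`, and `first_drift` are hypothetical names, and events are assumed to be plain dictionaries that compare equal when a replay is faithful.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass(frozen=True)
class Checkpoint:
    """Hypothetical minimal checkpoint contract shared by all adapters."""
    session_id: str
    sequence: int        # index of the last event covered by this checkpoint
    adapter: str         # which framework adapter produced the state
    state: dict[str, Any] = field(default_factory=dict)


@dataclass(frozen=True)
class DriftMarker:
    """Records the first point where a replay stops matching the original run."""
    sequence: int
    expected: Optional[dict[str, Any]]   # None if the original run had already ended
    actual: Optional[dict[str, Any]]     # None if the replay ended early


def first_drift(
    original: list[dict[str, Any]],
    replayed: list[dict[str, Any]],
    start_sequence: int = 0,
) -> Optional[DriftMarker]:
    """Compare events pairwise and return a marker at the first mismatch."""
    for offset, (expected, actual) in enumerate(zip(original, replayed)):
        if expected != actual:
            return DriftMarker(start_sequence + offset, expected, actual)
    if len(original) != len(replayed):
        # One run is a strict prefix of the other: drift starts where it ends.
        offset = min(len(original), len(replayed))
        expected = original[offset] if offset < len(original) else None
        actual = replayed[offset] if offset < len(replayed) else None
        return DriftMarker(start_sequence + offset, expected, actual)
    return None
```

When replay starts from a checkpoint, passing `start_sequence=checkpoint.sequence + 1` keeps marker positions aligned with the original run, which is also the restore boundary the UI would need to display.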
The current ranking model is useful, but still local and heuristic.
Best next steps:
- cluster failures across sessions, not only within one run
- add recurrence windows for repeated loops and flaky tool behavior
- score traces using richer signals such as retry churn, latency spikes, and policy escalation
- surface one-click representative traces for each cluster
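As a rough illustration of the steps above, the sketch below scores traces on retry churn, latency spikes, and policy escalation, groups them across sessions, and picks one representative per cluster. Everything here is an assumption made for the example: `TraceSummary`, the weights, the latency budget, and the `(failure_kind, tool_name)` cluster key are hypothetical and do not describe the current ranking model.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceSummary:
    """Hypothetical per-trace summary; field names are illustrative only."""
    session_id: str
    failure_kind: str          # e.g. "tool_error" or "loop"
    tool_name: str
    retries: int
    max_latency_ms: float
    policy_escalations: int


def score(trace: TraceSummary, latency_budget_ms: float = 2000.0) -> float:
    """Combine signals into one number; the weights are placeholders."""
    # Only latency above the budget counts as a spike.
    latency_spike = max(0.0, trace.max_latency_ms / latency_budget_ms - 1.0)
    return 1.0 * trace.retries + 2.0 * latency_spike + 3.0 * trace.policy_escalations


def cluster_across_sessions(
    traces: list[TraceSummary],
) -> dict[tuple[str, str], list[TraceSummary]]:
    """Group traces by a coarse fingerprint that ignores the session id."""
    clusters: dict[tuple[str, str], list[TraceSummary]] = defaultdict(list)
    for trace in traces:
        clusters[(trace.failure_kind, trace.tool_name)].append(trace)
    return dict(clusters)


def representative(cluster: list[TraceSummary]) -> TraceSummary:
    """Pick the highest-scoring trace as the one to surface for a cluster."""
    return max(cluster, key=score)
```

Because the cluster key leaves out `session_id`, the same failure seen in two different runs lands in one cluster, which is the difference between cross-session and per-run grouping.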
The repo now has benchmark-style tests, but it needs a larger, reusable corpus for regression testing and demos.
Best next steps:
- add benchmark seeds for prompt injection, evidence-grounded tool use, prompt-policy shifts, multi-agent debate, loop detection, and replay determinism
- persist demo sessions into the local database for UI smoke testing
- track expected rankings, clusters, and breakpoint hits as regression assertions
- add fixtures that mimic both safe and unsafe tool-use paths
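One way to express expected rankings, clusters, and breakpoint hits as regression assertions is sketched below. The types `BenchmarkExpectation` and `BenchmarkResult` and the `check` helper are hypothetical; the real fixtures may be shaped quite differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkExpectation:
    """Hypothetical expected outcome for one benchmark seed."""
    seed: str
    top_ranked_event_ids: tuple[str, ...]   # expected order of the top results
    cluster_count: int
    breakpoint_hits: frozenset[str]         # breakpoints that must fire


@dataclass(frozen=True)
class BenchmarkResult:
    """Hypothetical observed outcome from running the debugger on a seed."""
    ranked_event_ids: tuple[str, ...]
    cluster_count: int
    breakpoint_hits: frozenset[str]


def check(expectation: BenchmarkExpectation, result: BenchmarkResult) -> list[str]:
    """Return human-readable regression failures; an empty list means pass."""
    failures: list[str] = []
    k = len(expectation.top_ranked_event_ids)
    if result.ranked_event_ids[:k] != expectation.top_ranked_event_ids:
        failures.append(f"{expectation.seed}: top-{k} ranking changed")
    if result.cluster_count != expectation.cluster_count:
        failures.append(
            f"{expectation.seed}: expected {expectation.cluster_count} clusters, "
            f"got {result.cluster_count}"
        )
    missing = expectation.breakpoint_hits - result.breakpoint_hits
    if missing:
        failures.append(f"{expectation.seed}: breakpoints not hit: {sorted(missing)}")
    return failures
```

Returning a list of failures rather than raising on the first mismatch lets a single test run report every regression for a seed at once. Extra breakpoint hits are tolerated in this sketch; a stricter check could compare the sets for equality.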
Best next steps:
- wire API key auth into the actual API dependency chain
- add `tenant_id` to trace models and enforce it in `TraceRepository`
- apply redaction before persistence so privacy settings affect stored traces
- implement the SDK's remote/cloud transport path
- add PostgreSQL support, migrations, and backpressure handling for high-volume streams
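The tenant-isolation and redaction-on-write items above could look roughly like the following in-memory sketch. It is not the repo's `TraceRepository`: `InMemoryTraceRepository`, `TraceEvent`, `redact`, and the email-only pattern are hypothetical stand-ins chosen to keep the example self-contained.

```python
import re
from dataclasses import dataclass, replace

# Illustrative pattern only; real redaction would cover more than emails.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


@dataclass(frozen=True)
class TraceEvent:
    """Hypothetical trace model carrying a tenant_id."""
    tenant_id: str
    session_id: str
    payload: str


def redact(event: TraceEvent) -> TraceEvent:
    """Return a copy of the event with email addresses masked."""
    return replace(event, payload=EMAIL_PATTERN.sub("[REDACTED_EMAIL]", event.payload))


class InMemoryTraceRepository:
    """Stand-in repository bound to one tenant over a shared store."""

    def __init__(self, store: list[TraceEvent], tenant_id: str) -> None:
        self._store = store
        self._tenant_id = tenant_id

    def save(self, event: TraceEvent) -> None:
        # Reject writes for any tenant other than the one this repository serves.
        if event.tenant_id != self._tenant_id:
            raise PermissionError("event belongs to a different tenant")
        # Redact before the event reaches storage, so raw data is never persisted.
        self._store.append(redact(event))

    def list_session(self, session_id: str) -> list[TraceEvent]:
        # Every read is filtered by tenant, not only by session id.
        return [
            e for e in self._store
            if e.tenant_id == self._tenant_id and e.session_id == session_id
        ]
```

Putting both checks inside the repository, rather than in each caller, means a missing filter in an API handler cannot leak another tenant's traces or store unredacted payloads.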
The current UI is coherent, but still narrow.
Best next steps:
- side-by-side run comparison
- search over traces, clusters, and safety outcomes
- saved debugger views and pinned failures
- richer drill-down for provenance chains and evidence links
- standardize checkpoint contents
- add restore semantics per adapter
- detect replay divergence
- test focused and failure replay paths end to end
- expand ranking signals
- add cross-session failure clustering
- strengthen loop and anomaly detection
- add representative trace surfacing
- add reusable benchmark fixtures
- add seeded demo sessions
- run benchmark assertions in CI
- add benchmark docs for expected debugger behavior
- repository-enforced tenant isolation
- redaction wired into persistence
- SDK cloud transport
- retention policies
- PostgreSQL support and migrations
- backpressure handling for high-volume streams
If only one engineering task is chosen next, it should be this:
- finish the end-to-end cloud ingestion path from `agent_debugger.init()` through authenticated ingestion, tenant-aware persistence, and redaction on write
That work turns several partial features into one coherent product capability.