feat(replay): debug and replay mode (#315) #449
Add three new database tables (execution_steps, llm_interactions, file_operations) and corresponding dataclasses + CRUD operations for recording complete execution traces during agent runs.
Add ExecutionRecorder class with buffered writes for recording execution traces during ReactAgent runs. Hook into _react_loop to capture iteration steps, LLM interactions, and file operations without affecting agent logic.
- cf work replay <run-id>: step-through execution with Rich formatting
- cf work diff <run-id>: show file changes with unified diff
- cf work export-trace <run-id>: export to JSON or Markdown
…nt (#315)
- ReplaySession class with n/p/j navigation for step-through
- prepare_rerun() reconstructs file state at any step
- cf work rerun command shows state and remaining steps
…315) End-to-end tests: ExecutionRecorder records a 3-step agent run, then verifies trace loading, step snapshots, diffs, JSON/Markdown export, ReplaySession navigation, and rerun preparation.
Walkthrough
Adds execution tracing and replay capabilities: a new replay module and ExecutionRecorder, DB schema additions, ReactAgent integration to record traces, CLI commands for replay/diff/export/rerun, and comprehensive unit and integration tests.
Sequence Diagram(s)
sequenceDiagram
participant Agent as ReactAgent
participant Recorder as ExecutionRecorder
participant DB as Workspace (SQLite)
participant CLI as Replay CLI
Agent->>Recorder: record_iteration(step_number, tools, summary)
Note over Recorder: Buffer ExecutionStep
Agent->>Recorder: record_llm_call(step_id, prompt, response, model, tokens)
Note over Recorder: Buffer LLMInteraction
Agent->>Recorder: record_file_operation(step_id, op_type, path, before, after)
Note over Recorder: Buffer FileOperation
Agent->>Recorder: flush()
Recorder->>DB: save_execution_step(step)
Recorder->>DB: save_llm_interaction(interaction)
Recorder->>DB: save_file_operation(operation)
CLI->>DB: load_execution_trace(run_id)
DB-->>CLI: ExecutionTrace
CLI->>CLI: ReplaySession.navigate(step)
CLI->>DB: get_step_snapshot(run_id, step_number)
DB-->>CLI: file_state
sequenceDiagram
participant CLI as Replay Commands
participant Session as ReplaySession
participant DB as Trace Data
participant Formatter as Export/Display
CLI->>DB: load_execution_trace(run_id)
DB-->>Session: ExecutionTrace
alt work_replay
Session->>Formatter: Format step with LLM/files
Formatter-->>CLI: Display output
else work_diff
CLI->>DB: compare_steps(from_step, to_step)
DB-->>CLI: Changes dict
Formatter->>CLI: Unified diff
else work_export_trace
Formatter->>Formatter: export_trace_json() / export_trace_markdown()
Formatter-->>CLI: JSON/Markdown output
else work_rerun
CLI->>DB: prepare_rerun(run_id, from_step)
DB-->>CLI: File state + metadata
Formatter->>CLI: Render checkpoint state
end
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Review: feat(replay): debug and replay mode (PR 315)
Good feature addition overall - the architecture is clean and follows the repo headless core pattern. Here are my findings:

Bugs
1. Data loss in ExecutionRecorder.flush() - critical
In codeframe/core/replay.py, the flush() method clears the buffers in a finally block, which runs even when the DB write fails. The try/except/finally structure means if save_execution_step raises, the exception is caught and logged at DEBUG, then finally clears all three buffers - permanently losing data. The clear should be inside the try block so it only runs on success.
2. Incorrect file state reconstruction for edit_file operations
In codeframe/core/react_agent.py, the recorder captures new_text as content_after for edit_file operations. But new_text in the search-replace editor is the replacement snippet, not the full file content. get_step_snapshot() replays file operations by setting file_state[path] = op.content_after, so it reconstructs the file as just the replaced fragment. This means cf work diff and cf work rerun will produce incorrect file states for any run that used edit_file. The fix is to read the actual file content from disk after the edit completes, rather than capturing the tool input argument.

Code Issues
3. format parameter shadows Python builtin
In work_export_trace (cli/app.py), the parameter format: str = typer.Option(...) shadows the Python builtin. Rename to output_format or fmt.
4. cf work rerun is misleadingly named
The command prepares a rerun (shows file state and remaining steps) but does not actually execute anything. Given cf work start starts execution, users will expect cf work rerun to do the same. Either rename to something like cf work inspect-step, or add a --execute flag consistent with the cf work start UX.
5. Silent trace degradation on flush failure
Flush failures in react_agent.py are logged at DEBUG. If tracing silently fails, cf work replay will show an empty/partial trace with no user-visible explanation. Raise this to WARNING level.

Performance Consideration
6. Full file content stored in SQLite TEXT columns
For large source files, storing full before/after content in file_operations could bloat the database significantly over time. Consider adding a size cap (e.g., skip content capture if file exceeds 100KB) or tracking this as a follow-up issue.

Minor Observations

Summary
The architecture is sound and the core pattern (headless module, thin CLI adapter, buffered recorder) is well-executed. Two items should be addressed before merging:
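Finding 2 in the review above (recording the replacement snippet instead of the whole file) can be illustrated with a small sketch. The helper name `apply_edit_and_capture` and its signature are hypothetical and are not taken from the PR; the sketch only assumes that an edit is a single search-replace on a text file and that the recorder wants full before/after contents.

```python
from pathlib import Path


def apply_edit_and_capture(path: Path, old_text: str, new_text: str) -> tuple[str, str]:
    """Apply one search-replace edit and return (content_before, content_after).

    Hypothetical helper: both returned values are full file contents,
    never the `new_text` fragment passed in by the tool call.
    """
    before = path.read_text(encoding="utf-8")
    if old_text not in before:
        raise ValueError(f"search text not found in {path}")
    path.write_text(before.replace(old_text, new_text, 1), encoding="utf-8")
    # Re-read from disk so the recorded value reflects what was actually written.
    after = path.read_text(encoding="utf-8")
    return before, after
```

With values captured this way, a snapshot step such as `file_state[path] = op.content_after` would hold the complete file rather than the replaced fragment.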
Actionable comments posted: 9
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@codeframe/cli/app.py`:
- Around line 3411-3469: work_rerun currently only calls prepare_rerun and
prints rerun_info but never restores the workspace state or triggers execution;
after obtaining rerun_info in work_rerun you should apply the returned
file_state to the workspace (e.g., call a method on the workspace like
restore_snapshot/apply_file_state using rerun_info["file_state"]) and then
invoke the runtime to start/resume execution from from_step (use your runtime
entrypoint / runner function to execute the remaining steps or start a new run
with the restored workspace, passing run_id/from_step/remaining_steps as
needed); ensure errors from restore or runtime start are handled similarly to
the existing FileNotFoundError/ValueError branches so the CLI exits with
non-zero on failure.
- Around line 3138-3246: The command handler work_replay currently prints all
steps when no --step is provided instead of starting an interactive replay
session; update work_replay to enter an interactive loop (e.g., using
typer.prompt or input) when step is None: initialize a current_index over
trace.steps, render the current step (using existing rendering logic that
references trace.steps, ops_by_step, llm_by_step and the Step fields like
id/step_number/description/status), then accept simple commands next/prev/jump
<n>/show-llm/quit to move the index, re-render the selected step, and only exit
on quit; keep the existing single-step rendering code for reuse and ensure the
--step path still shows just that step without entering the loop.
- Around line 3378-3405: The export functions export_trace_json and
export_trace_markdown currently only serialize step metadata and paths; update
them to include trace.llm_interactions and the full before/after file contents
for each file referenced in the trace so exported artifacts are reconstructible
offline. Specifically, modify export_trace_json(trace) and
export_trace_markdown(trace) to iterate trace.llm_interactions and include them
in the output structure/markdown, and for each step that references files use
load_execution_trace/get_workspace utilities or the trace’s stored file
snapshots to embed the file contents (pre-change and post-change) rather than
only paths; ensure the JSON output nests llm_interactions and file contents and
the markdown includes readable sections for interactions and before/after file
diffs.
In `@codeframe/core/react_agent.py`:
- Around line 465-488: The verification/fix path in _run_final_verification
isn’t recording LLM execution like _react_loop, so runs that perform bounded
fixes produce incomplete traces and prepare_rerun reconstructs the wrong
(pre-fix) state; update _run_final_verification to mirror the execution_recorder
usage in _react_loop by calling execution_recorder.record_iteration and
execution_recorder.record_llm_call for each LLM invocation in the bounded fix
loop (include step numbering, tool_names from response.tool_calls,
llm_response_summary, prompt_summary, model, and tokens_used computed from
response.input_tokens + response.output_tokens) and ensure prepare_rerun reads
the latest step_id produced by record_iteration so reruns reconstruct the
post-fix state.
In `@codeframe/core/replay.py`:
- Around line 244-258: The flush method currently clears _step_buffer,
_llm_buffer, and _file_op_buffer in the finally block even when
save_execution_step/save_llm_interaction/save_file_operation raise, causing
permanent data loss; change flush so buffers are only cleared after all saves
complete successfully (e.g., move the clear calls into the try block after the
loops or wrap the saves in a DB transaction and clear buffers only on commit)
and ensure exceptions still propagate or are logged without dropping buffered
items.
- Around line 539-578: The JSON exporter export_trace_json is currently omitting
trace.llm_interactions; update export_trace_json (and the other exporter
referenced around 581-632) to include LLM data by grouping
trace.llm_interactions by step_id (similar to ops_by_step) and adding an
"llm_interactions" entry to each step_dict with a list of serializable objects
(e.g., model, role, prompt/input, response/output, tokens, timestamps) taken
from each LLMInteraction instance; ensure you use the step.id to attach
interactions to the correct step and preserve None-safe serialization
(timestamps via .isoformat(), optional fields omitted or null) so offline
JSON/Markdown traces include the recorded prompts and responses.
In `@tests/core/test_execution_recording.py`:
- Around line 15-28: Remove the unused imports causing F401: delete LLMResponse
from the import line that currently reads "from codeframe.adapters.llm.base
import LLMResponse, ToolCall, ToolResult", delete FileContent from "from
codeframe.core.context import FileContent, TaskContext", and delete Workspace
from "from codeframe.core.workspace import Workspace, create_or_load_workspace";
keep the used symbols (ToolCall, ToolResult, TaskContext,
create_or_load_workspace) so the tests file no longer triggers the unused-import
lint error.
In `@tests/core/test_replay.py`:
- Around line 11-26: Add the v2 marker to this new test module by defining a
module-level variable pytestmark = pytest.mark.v2 (import pytest is already
present), placing it near the top of tests/core/test_replay.py so the test file
participates in marker-based v2 runs; ensure pytestmark is a top-level variable
(not inside the workspace fixture or any function) and keep the existing imports
and fixture name workspace unchanged.
📒 Files selected for processing (8)
- codeframe/cli/app.py
- codeframe/core/react_agent.py
- codeframe/core/replay.py
- codeframe/core/workspace.py
- tests/cli/test_replay_commands.py
- tests/core/test_execution_recording.py
- tests/core/test_replay.py
- tests/core/test_replay_integration.py
@work_app.command("replay")
def work_replay(
    run_id: str = typer.Argument(..., help="Run ID to replay"),
    workspace_path: Optional[Path] = typer.Option(
        None,
        "--workspace",
        "-w",
        help="Workspace path (defaults to current directory)",
    ),
    step: Optional[int] = typer.Option(
        None,
        "--step",
        "-s",
        help="Jump to a specific step number",
    ),
    show_llm: bool = typer.Option(
        False,
        "--show-llm",
        help="Show LLM prompts and responses",
    ),
    show_files: bool = typer.Option(
        True,
        "--show-files/--no-files",
        help="Show file changes at each step",
    ),
) -> None:
    """Replay a past execution step by step.

    Shows what happened during an agent run: which tools were called,
    what files were changed, and what the LLM produced at each step.

    Example:
        cf work replay <run-id>
        cf work replay <run-id> --step 3
        cf work replay <run-id> --show-llm
    """
    from rich.panel import Panel

    from codeframe.core.replay import (
        load_execution_trace,
    )
    from codeframe.core.workspace import get_workspace

    path = workspace_path or Path.cwd()

    try:
        workspace = get_workspace(path)
        trace = load_execution_trace(workspace, run_id)

        if not trace:
            console.print(f"[red]Error:[/red] No trace found for run '{run_id}'")
            raise typer.Exit(1)

        # Header
        console.print(
            Panel(
                f"[bold]Run:[/bold] {trace.run_id}\n"
                f"[bold]Task:[/bold] {trace.task_id}\n"
                f"[bold]Status:[/bold] {trace.status}\n"
                f"[bold]Steps:[/bold] {len(trace.steps)}",
                title="Execution Replay",
            )
        )

        # Build lookups
        ops_by_step = {}
        for op in trace.file_operations:
            ops_by_step.setdefault(op.step_id, []).append(op)

        llm_by_step = {}
        for llm in trace.llm_interactions:
            llm_by_step.setdefault(llm.step_id, []).append(llm)

        # Filter to specific step if requested
        steps_to_show = trace.steps
        if step is not None:
            steps_to_show = [s for s in trace.steps if s.step_number == step]
            if not steps_to_show:
                console.print(f"[yellow]No step {step} found (max: {len(trace.steps)})[/yellow]")
                raise typer.Exit(1)

        for s in steps_to_show:
            status_color = {"completed": "green", "failed": "red"}.get(s.status, "yellow")
            console.print(
                f"\n[bold]Step {s.step_number}:[/bold] {s.description} "
                f"[{status_color}][{s.status}][/{status_color}]"
            )

            if show_files:
                step_ops = ops_by_step.get(s.id, [])
                for op in step_ops:
                    op_color = {"create": "green", "edit": "yellow", "delete": "red"}.get(
                        op.operation_type, "white"
                    )
                    console.print(f" [{op_color}]{op.operation_type}[/{op_color}] {op.file_path}")

            if show_llm:
                step_llms = llm_by_step.get(s.id, [])
                for llm in step_llms:
                    console.print(f" [dim]LLM ({llm.model}, {llm.tokens_used} tokens):[/dim]")
                    console.print(f" [cyan]Prompt:[/cyan] {llm.prompt[:200]}")
                    console.print(f" [cyan]Response:[/cyan] {llm.response[:200]}")

        # Summary
        summary = trace.summary()
        console.print(f"\n[dim]Total: {summary['total_steps']} steps, "
                      f"{summary['llm_calls']} LLM calls, "
                      f"{summary['total_tokens']} tokens, "
                      f"{summary['files_modified']} files modified[/dim]")
work replay still dumps the trace instead of replaying it.
Without --step, this prints every step and exits. There’s no next/previous/jump loop here, so users still can’t step through a run from the CLI as a replay session.
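The navigation the comment asks for can be sketched as a small cursor that interprets next/previous/jump commands. `StepCursor` is a hypothetical name used only for illustration and is not the PR's ReplaySession; the sketch assumes step numbers are 1-based, as in `--step 3`.

```python
from dataclasses import dataclass


@dataclass
class StepCursor:
    """Hypothetical cursor over a trace's steps (not the PR's ReplaySession)."""

    total: int       # number of steps in the trace
    index: int = 0   # zero-based position of the current step

    def apply(self, command: str) -> bool:
        """Apply one navigation command; return False when the user quits."""
        parts = command.strip().split()
        if not parts:
            return True
        verb, args = parts[0], parts[1:]
        last = max(self.total - 1, 0)
        if verb in ("q", "quit"):
            return False
        if verb in ("n", "next"):
            self.index = min(self.index + 1, last)
        elif verb in ("p", "prev"):
            self.index = max(self.index - 1, 0)
        elif verb in ("j", "jump") and args and args[0].isdigit():
            # Assumed convention: the user types a 1-based step number.
            self.index = min(max(int(args[0]) - 1, 0), last)
        # Unknown commands leave the position unchanged.
        return True
```

A CLI loop could read a command with `typer.prompt`, call `apply`, and re-render `trace.steps[cursor.index]` with the existing single-step rendering until `apply` returns False.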
    from codeframe.core.replay import (
        export_trace_json,
        export_trace_markdown,
        load_execution_trace,
    )
    from codeframe.core.workspace import get_workspace

    path = workspace_path or Path.cwd()

    try:
        workspace = get_workspace(path)
        trace = load_execution_trace(workspace, run_id)

        if not trace:
            console.print(f"[red]Error:[/red] No trace found for run '{run_id}'")
            raise typer.Exit(1)

        if format == "json":
            content = json.dumps(export_trace_json(trace), indent=2)
        else:
            content = export_trace_markdown(trace)

        if output:
            output.write_text(content)
            console.print(f"[green]Trace exported to {output}[/green]")
        else:
            console.print(content, highlight=False)
export-trace is not exporting a complete trace.
codeframe/core/replay.py:538-577 and codeframe/core/replay.py:580-631 only serialize step metadata plus file path names. They omit trace.llm_interactions entirely and don’t include file before/after content, so the exported artifact isn’t enough for offline debugging or reconstruction.
@work_app.command("rerun")
def work_rerun(
    run_id: str = typer.Argument(..., help="Run ID to re-run from"),
    workspace_path: Optional[Path] = typer.Option(
        None,
        "--workspace",
        "-w",
        help="Workspace path (defaults to current directory)",
    ),
    from_step: int = typer.Option(
        1,
        "--from-step",
        help="Step number to resume from",
    ),
) -> None:
    """Prepare to re-execute a run from a specific step.

    Reconstructs the file state at step N and shows what
    would need to be re-executed. Use this to understand
    what happened and plan a manual re-run.

    Example:
        cf work rerun <run-id> --from-step 2
    """
    from codeframe.core.replay import prepare_rerun
    from codeframe.core.workspace import get_workspace

    path = workspace_path or Path.cwd()

    try:
        workspace = get_workspace(path)
        rerun_info = prepare_rerun(workspace, run_id, from_step)

        console.print(f"[bold]Re-run preparation for run {run_id}[/bold]\n")
        console.print(f"[bold]Resume from:[/bold] Step {from_step}")
        console.print(f"[bold]Task:[/bold] {rerun_info['task_id']}")

        file_state = rerun_info["file_state"]
        if file_state:
            console.print(f"\n[bold]File state at step {from_step}:[/bold]")
            for fp in sorted(file_state.keys()):
                console.print(f" {fp}")
        else:
            console.print(f"\n[yellow]No files modified at step {from_step}[/yellow]")

        remaining = rerun_info["remaining_steps"]
        if remaining:
            console.print(f"\n[bold]Remaining steps ({len(remaining)}):[/bold]")
            for rs in remaining:
                console.print(f" Step {rs['step_number']}: {rs['description']}")
        else:
            console.print("\n[yellow]No remaining steps after this point[/yellow]")

    except FileNotFoundError:
        console.print(f"[red]Error:[/red] No workspace found at {path}")
        raise typer.Exit(1)
    except ValueError as e:
        console.print(f"[red]Error:[/red] {e}")
        raise typer.Exit(1)
work rerun never actually reruns anything.
This path only calls prepare_rerun() and prints the returned state. It never restores the snapshot into the workspace or starts a new run through runtime, so users can’t resume execution from the chosen checkpoint.
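The "restore the snapshot into the workspace" step mentioned above might look roughly like the sketch below. `apply_file_state` is a hypothetical helper, not an existing workspace method. It assumes `file_state` maps workspace-relative paths to full file contents, and it treats a `None` value as "file was deleted", which is also an assumption about the recorded data.

```python
from pathlib import Path
from typing import Mapping, Optional


def apply_file_state(root: Path, file_state: Mapping[str, Optional[str]]) -> list[Path]:
    """Write a reconstructed snapshot under `root` (hypothetical helper).

    A None value is treated as "deleted at this step" (assumed convention).
    """
    root = root.resolve()
    touched: list[Path] = []
    for rel_path, content in sorted(file_state.items()):
        target = (root / rel_path).resolve()
        # Refuse absolute paths or "../" segments that escape the workspace.
        if not target.is_relative_to(root):
            raise ValueError(f"path escapes workspace: {rel_path}")
        if content is None:
            target.unlink(missing_ok=True)
        else:
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_text(content, encoding="utf-8")
        touched.append(target)
    return touched
```

Starting the actual run afterwards would still have to go through the runtime, which this sketch deliberately leaves out.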
# --- Execution recording: LLM call ---
_rec_step_id: Optional[str] = None
if self.execution_recorder is not None:
    # Build condensed summaries for the trace
    _rec_prompt = f"System: {prompt_summary} | Messages: {len(messages)}"
    if response.has_tool_calls:
        _rec_response = "Tool calls: " + ", ".join(
            tc.name for tc in response.tool_calls
        )
    else:
        _rec_response = (response.content or "")[:200]
    _rec_step_id = self.execution_recorder.record_iteration(
        step_number=iterations,
        tool_names=[tc.name for tc in response.tool_calls],
        llm_response_summary=_rec_response,
    )
    self.execution_recorder.record_llm_call(
        step_id=_rec_step_id,
        prompt_summary=_rec_prompt,
        response_summary=_rec_response,
        model=response.model or "",
        tokens_used=response.input_tokens + response.output_tokens,
        purpose="execution",
    )
Trace the verification-fix path too.
execution_recorder is only populated inside _react_loop(). The bounded fix loop in _run_final_verification() also makes LLM calls and can edit files, so runs that need gate retries will replay/export an incomplete trace and prepare_rerun() will reconstruct the pre-fix state instead of the final run state.
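One way to keep both loops in sync is to pull the recording into a shared helper that _react_loop() and the verification fix loop could both call. The sketch below is an assumption-heavy illustration: `record_llm_step`, the stub classes, and the `"verification_fix"` purpose value are hypothetical, while the `record_iteration` and `record_llm_call` keyword arguments mirror the excerpt above.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class StubToolCall:
    """Hypothetical stand-in for a tool call; only `name` is used."""
    name: str


@dataclass
class StubResponse:
    """Hypothetical stand-in for the LLM response fields used in the excerpt."""
    content: Optional[str] = None
    model: Optional[str] = None
    input_tokens: int = 0
    output_tokens: int = 0
    tool_calls: list[StubToolCall] = field(default_factory=list)


def record_llm_step(recorder: Any, step_number: int, response: Any,
                    prompt_summary: str, purpose: str) -> Optional[str]:
    """Record one LLM call as a step; return the step id, or None if not recording."""
    if recorder is None:
        return None
    names = [tc.name for tc in response.tool_calls]
    if names:
        summary = "Tool calls: " + ", ".join(names)
    else:
        summary = (response.content or "")[:200]
    step_id = recorder.record_iteration(
        step_number=step_number,
        tool_names=names,
        llm_response_summary=summary,
    )
    recorder.record_llm_call(
        step_id=step_id,
        prompt_summary=prompt_summary,
        response_summary=summary,
        model=response.model or "",
        tokens_used=response.input_tokens + response.output_tokens,
        purpose=purpose,  # e.g. "execution" or a hypothetical "verification_fix"
    )
    return step_id
```

How step numbers continue from the main loop into the fix loop is left open here; the source does not say how that numbering should work.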
def flush(self) -> None:
    """Write all buffered records to the database."""
    try:
        for step in self._step_buffer:
            save_execution_step(self.workspace, step)
        for interaction in self._llm_buffer:
            save_llm_interaction(self.workspace, interaction)
        for op in self._file_op_buffer:
            save_file_operation(self.workspace, op)
    except Exception:
        logger.debug("ExecutionRecorder flush failed", exc_info=True)
    finally:
        self._step_buffer.clear()
        self._llm_buffer.clear()
        self._file_op_buffer.clear()
Don't clear the buffers after a failed flush.
If one save_*() call raises, finally still clears _step_buffer, _llm_buffer, and _file_op_buffer. That turns a transient write error into permanent trace loss, because the caller has nothing left to retry. Clear only after a successful flush, ideally as one DB transaction.
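A minimal sketch of a flush that keeps unsaved records is shown below. `BufferedRecorder` is a hypothetical single-buffer stand-in for ExecutionRecorder, and the injected `save` callable stands in for the save_*() functions. Instead of the single DB transaction suggested above, this variant drops only the items that were written, so a retry does not write them twice.

```python
import logging
from typing import Any, Callable

logger = logging.getLogger(__name__)


class BufferedRecorder:
    """Hypothetical stand-in for ExecutionRecorder with one buffer."""

    def __init__(self, save: Callable[[Any], None]) -> None:
        self._save = save            # stands in for the save_*() calls
        self._buffer: list[Any] = []

    def record(self, item: Any) -> None:
        self._buffer.append(item)

    def flush(self) -> bool:
        """Write buffered items; return True only if everything was saved."""
        written = 0
        ok = True
        try:
            for item in self._buffer:
                self._save(item)
                written += 1
        except Exception:
            ok = False
            logger.warning(
                "flush failed; %d item(s) kept for retry",
                len(self._buffer) - written,
                exc_info=True,
            )
        # Drop only what was actually persisted; the rest stays buffered.
        del self._buffer[:written]
        return ok
```

Logging at WARNING rather than DEBUG follows the separate review note about silent trace degradation.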
def export_trace_json(trace: ExecutionTrace) -> dict[str, Any]:
    """Export an ExecutionTrace as a JSON-serializable dict.

    Returns a dict with run metadata, step details, and summary stats.
    """
    # Build a lookup of file operations by step_id
    ops_by_step: dict[str, list[FileOperation]] = {}
    for op in trace.file_operations:
        ops_by_step.setdefault(op.step_id, []).append(op)

    steps = []
    for step in trace.steps:
        step_ops = ops_by_step.get(step.id, [])
        step_dict: dict[str, Any] = {
            "step_number": step.step_number,
            "step_type": step.step_type,
            "description": step.description,
            "status": step.status,
            "started_at": step.started_at.isoformat(),
            "completed_at": step.completed_at.isoformat() if step.completed_at else None,
        }
        if step_ops:
            step_dict["file_changes"] = [
                {
                    "operation": op.operation_type,
                    "file_path": op.file_path,
                }
                for op in step_ops
            ]
        steps.append(step_dict)

    return {
        "run_id": trace.run_id,
        "task_id": trace.task_id,
        "started_at": trace.started_at.isoformat(),
        "completed_at": trace.completed_at.isoformat() if trace.completed_at else None,
        "status": trace.status,
        "steps": steps,
        "summary": trace.summary(),
    }
export-trace is dropping the recorded prompts/responses.
Both exporters ignore trace.llm_interactions, so the exported artifact loses the LLM data this feature is capturing for debugging. A run with no file edits currently exports little more than step headings, which makes the JSON/Markdown trace much less useful for offline analysis.
Also applies to: 581-632
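Grouping the interactions by step, in the same way the exporter already groups file operations, could be sketched as follows. `LLMInteractionStub` and `group_llm_by_step` are hypothetical names; the stub carries only the fields that the replay CLI code reads (`step_id`, `model`, `prompt`, `response`, `tokens_used`), and the real LLMInteraction dataclass may have more.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Any, Iterable


@dataclass
class LLMInteractionStub:
    """Hypothetical subset of LLMInteraction, limited to fields the CLI reads."""
    step_id: str
    model: str
    prompt: str
    response: str
    tokens_used: int


def group_llm_by_step(
    interactions: Iterable[LLMInteractionStub],
) -> dict[str, list[dict[str, Any]]]:
    """Return JSON-serializable interaction dicts keyed by step id, in recorded order."""
    grouped: dict[str, list[dict[str, Any]]] = defaultdict(list)
    for llm in interactions:
        grouped[llm.step_id].append(
            {
                "model": llm.model,
                "prompt": llm.prompt,
                "response": llm.response,
                "tokens_used": llm.tokens_used,
            }
        )
    return dict(grouped)
```

Inside the exporter's step loop, each step dict could then take `grouped.get(step.id, [])` under an `"llm_interactions"` key.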
import json
import uuid
from datetime import datetime, timezone
from pathlib import Path

import pytest

from codeframe.core.workspace import create_or_load_workspace, get_db_connection


@pytest.fixture
def workspace(tmp_path: Path):
    """Create a temporary workspace for testing."""
    repo_path = tmp_path / "test_repo"
    repo_path.mkdir()
    return create_or_load_workspace(repo_path)
Mark this new test module as v2.
This file is missing pytestmark = pytest.mark.v2, so it won’t participate in marker-based v2 runs.
🧪 Minimal fix
 import pytest

 from codeframe.core.workspace import create_or_load_workspace, get_db_connection

+pytestmark = pytest.mark.v2
+
 @pytest.fixture
 def workspace(tmp_path: Path):

As per coding guidelines, "New v2 Python tests must be marked with @pytest.mark.v2 decorator or pytestmark = pytest.mark.v2."
CodeFRAME Development Guidelines
Last updated: 2026-03-09

Product Vision
CodeFrame is a project delivery system: Think → Build → Prove → Ship. It owns the edges of the AI coding pipeline: everything BEFORE code gets written (PRD, specification, task decomposition) and everything AFTER (verification gates, quality memory, deployment). The actual code writing is delegated to frontier coding agents (Claude Code, Codex, OpenCode) that are better at it than any custom agent. CodeFrame does not compete with coding agents. It orchestrates them.

Status: Phase 1 ✅ | Phase 2 ✅ | Phase 2.5 ✅. CLI workflow, server layer, and ReAct agent complete. Agent adapter architecture (#408) and PROOF9 quality system (#422) are next priorities. See

If you are an agent working in this repo: do not improvise architecture. Follow the documents listed below.

Primary Contract (MUST FOLLOW)
Rule 0: If a change does not directly support the Think → Build → Prove → Ship pipeline, do not implement it.

Strategic Priority (Phase 4)
The next major architectural work is the Agent Adapter Architecture (#408):

Current Reality (Phase 1, 2 & 2.5 Complete)
What's Working Now
v2 Architecture (current)
v1 Legacy

Repository Structure

Architecture Rules (non-negotiable)
1) Core must be headless
Core is allowed to:
2) CLI must not require a server
Golden Path commands must work from the CLI with no server running. FastAPI is optional and must be started explicitly (e.g.,
3) Agent state transitions flow through runtime
Critical pattern discovered during implementation:
This separation prevents duplicate state transitions (e.g., DONE→DONE, BLOCKED→BLOCKED errors).
4) Legacy can be read, not depended on
Legacy code is reference material.
5) Keep commits runnable
At all times:

Agent System Architecture
Components
Model Selection Strategy
Task-based heuristic via
Future: Engine Selection
CodeFRAME supports two execution engines, selected via
Execution Flow (ReAct, default)
Execution Flow (Plan, legacy,
|
| Phase | Focus | Pipeline Stage | Status |
|---|---|---|---|
| 1 | CLI Completion | Think + Build | ✅ Complete |
| 2 | Server Layer | Build (API) | ✅ Complete |
| 2.5 | ReAct Agent | Build (execution) | ✅ Complete |
| 3 | Web UI Rebuild | All (dashboard) | In Progress |
| 4 | Agent Adapters + Orchestration | Build (delegate to frontier agents) | Next |
| 5 | PROOF9 + Advanced | Prove + Ship (quality memory) | Planned |
Phase 2 Complete: Server Layer (2026-02-03)
Phase 2 deliverables completed:
- ✅ Server audit and refactor ([Phase 2] Server audit and refactor - routes delegating to core modules #322) - 15 v2 routers following thin adapter pattern
- ✅ API key authentication (feat(auth): add API key authentication for CLI and REST API #326) - Scopes: read/write/admin
- ✅ Rate limiting (feat(security): add API rate limiting with slowapi #327) - Configurable per-endpoint with Redis support
- ✅ Real-time SSE streaming (feat(streaming): add real-time SSE events for task execution #328) - `/api/v2/tasks/{id}/stream`
- ✅ OpenAPI documentation ([Phase 2] Complete OpenAPI documentation for all endpoints #119) - Full Swagger/ReDoc with examples
Server Architecture (Phase 2)
Pattern: Thin adapter over core - server routes delegate to core.* modules.
CLI (typer) ─┬── core.* ─── adapters.*
│
Server (fastapi) ─┘
V2 Router Modules (15 total):
| Router | Endpoints | Purpose |
|---|---|---|
| `blockers_v2` | 5 | Blocker CRUD |
| `prd_v2` | 8 | PRD management + versioning |
| `tasks_v2` | 12 | Task management + streaming |
| `workspace_v2` | 5 | Init, status, tech stack |
| `batches_v2` | 5 | Batch execution strategies |
| `streaming_v2` | 2 | SSE event streaming |
| `api_key_v2` | 4 | API key management |
| `discovery_v2` | 5 | PRD discovery sessions |
| `checkpoints_v2` | 6 | State checkpoints |
| `schedule_v2` | 3 | Task scheduling |
| `templates_v2` | 4 | PRD templates |
| `git_v2` | 3 | Git operations |
| `review_v2` | 2 | Code review |
| `pr_v2` | 5 | GitHub PR workflow |
| `environment_v2` | 4 | Tool detection |
API Authentication:
# Create API key
cf auth api-key-create --name "my-key" --scopes read,write
# Use in requests
curl -H "X-API-Key: cf_..." https://api.example.com/api/v2/tasks
Rate Limiting:
- Default: 100 requests/minute (standard endpoints)
- Auth endpoints: 10/minute
- AI endpoints: 20/minute
- Configurable via `RATE_LIMIT_*` environment variables
OpenAPI Documentation:
- Swagger UI: `/docs`
- ReDoc: `/redoc`
- OpenAPI JSON: `/openapi.json`
Previous Updates (2026-01-29)
V2 Strategic Roadmap Established
Created comprehensive 5-phase roadmap in docs/V2_STRATEGIC_ROADMAP.md.
Phase 1 Complete: CLI Foundation
All Phase 1 priorities completed:
- ✅ `cf prd generate` - Socratic PRD discovery ([Phase 1] cf prd generate - Interactive AI PRD creation (Socratic Discovery) #307)
- ✅ `cf work follow` - Live execution streaming ([Phase 1] cf work follow - Live execution streaming #308)
- ✅ Integration tests for credential/env modules ([Phase 1] Integration tests for credential and environment modules #309)
- ✅ PRD template system ([Phase 1] PRD template system for customizable output formats #316)
Environment Validation (cf env)
New commands for validating development environment:
cf env check # Validate required tools (git, uv, ruff, pytest)
cf env install # Install missing tools automatically
cf env doctor # Comprehensive environment health check
Modules:
- `core/environment.py` - Tool detection and validation
- `core/installer.py` - Cross-platform tool installation
GitHub PR Workflow (cf pr)
Streamlined PR management without leaving the CLI:
cf pr create # Create PR from current branch
cf pr status # Show PR status and review state
cf pr checks # Show CI check results
cf pr merge # Merge approved PR
Task Self-Diagnosis (cf work diagnose)
AI-powered analysis of failed tasks:
cf work diagnose <task-id> # Analyze why a task failed
Modules:
- `core/diagnostics.py` - Failed task analysis
- `core/diagnostic_agent.py` - AI-powered diagnosis
Bug Fixes
- [Phase 1] Backend: NoneType error accessing search_pattern during task execution #265: Fixed NoneType error in `codebase_index.search_pattern()` - added null check
- [Phase 1] Checkpoint diff API returns 500 - workspace directory missing #253: Fixed checkpoint diff API returning 500 - added workspace existence validation
GitHub Issue Organization
- Created `v1-legacy` label for 22 v1-specific issues (closed, retained as Phase 3 reference)
- Created phase labels: `phase-1`, `phase-2`, `phase-4`, `phase-5`
- Created 9 new issues ([Phase 1] cf prd generate - Interactive AI PRD creation (Socratic Discovery) #307 through [Phase 5] Debug and replay mode #315) for roadmap features
- Consistent naming: `[Phase #] Title` format
Previous Updates (2026-01-16)
Phase 3.1: Tech Stack Configuration
Simplified tech stack configuration using natural language descriptions:
- ✅ `tech_stack` field on Workspace model - stores natural language description
- ✅ `--detect` flag - auto-detects from pyproject.toml, package.json, Cargo.toml, go.mod
- ✅ `--tech-stack` flag - explicit tech stack description (e.g., "Rust project with cargo")
- ✅ `--tech-stack-interactive` flag - simple prompt for user input (stub for future multi-round)
- ✅ Agent integration - TaskContext and Planner include tech_stack in LLM prompts
- ✅ Removed `cf config` subcommand - tech stack is now part of workspace init
Design philosophy: Instead of structured configuration with specific package managers and frameworks, users describe their stack in natural language. The agent interprets and adapts.
Examples:
cf init . --detect # Auto-detect: "Python with uv, pytest, ruff for linting"
cf init . --tech-stack "Rust project using cargo"
cf init . --tech-stack "TypeScript monorepo with pnpm, Next.js, jest"
cf init . --tech-stack-interactive # Prompts user for description
Future work: Multi-round interactive discovery (bead: codeframe-8d80)
Agent Self-Correction & Observability
Improved agent reliability with automatic error recovery:
- ✅ Self-correction loop in `_run_final_verification()` - agent retries up to 3 times
- ✅ Verbose mode (`--verbose` / `-v`) - shows detailed verification/self-correction progress
- ✅ FAILED task status - tasks transition to FAILED for proper error visibility
- ✅ Project preferences - agent loads AGENTS.md/CLAUDE.md for per-project config
- ✅ Fixed `fail_run()` - now properly transitions task status (was leaving tasks stuck)
Enhanced Self-Correction (Phase 3.4)
Advanced error recovery with loop prevention and smart escalation:
- ✅ Fix Attempt Tracker (`core/fix_tracker.py`) - prevents repeating failed fixes
  - Normalizes errors for comparison (removes line numbers, memory addresses)
  - Tracks (error_signature, fix_description) pairs with outcomes
  - Detects escalation patterns (same error 3+ times, same file 3+ times)
- ✅ Pattern-Based Quick Fixes (`core/quick_fixes.py`) - fixes common errors without LLM
  - `ModuleNotFoundError` → auto-install package (detects package manager)
  - `ImportError` → add missing import statement
  - `NameError` → add common imports (Optional, dataclass, Path, etc.)
  - `SyntaxError` → fix missing colons, f-string prefixes
  - `IndentationError` → normalize mixed tabs/spaces
- ✅ Escalation to Blocker - creates informative blockers when stuck
  - Triggered after MAX_SAME_ERROR_ATTEMPTS (3) failures on same error
  - Triggered after MAX_SAME_FILE_ATTEMPTS (3) failures on same file
  - Triggered after MAX_TOTAL_FAILURES (5) in a run
  - Blocker includes error type, attempted fixes, and guidance questions
Self-Correction Flow
Error occurs
│
├── Try ruff --fix (auto-lint)
│
├── Try pattern-based quick fix (no LLM)
│ ├── Check if fix already attempted → skip
│ ├── Apply fix
│ └── Record outcome in tracker
│
├── Check escalation threshold
│ └── If exceeded → create escalation blocker
│
└── Use LLM to generate fix plan
├── Include already-tried fixes to avoid repetition
├── Execute fix steps with tracking
└── Re-verify
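The tracking and escalation steps in the flow above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the contents of `core/fix_tracker.py`: the class `FixAttemptTracker`, its method names, and the normalization regexes are hypothetical. Only the three threshold values are taken from the list above.

```python
import re
from collections import Counter

# Threshold values come from the guidelines text; everything else is hypothetical.
MAX_SAME_ERROR_ATTEMPTS = 3
MAX_SAME_FILE_ATTEMPTS = 3
MAX_TOTAL_FAILURES = 5


def normalize_error(message):
    """Strip details that vary between runs so equal errors compare equal."""
    message = re.sub(r"0x[0-9a-fA-F]+", "<addr>", message)  # memory addresses
    message = re.sub(r"line \d+", "line <n>", message)  # line numbers
    return message.strip()


class FixAttemptTracker:
    """Hypothetical tracker: remembers attempted fixes and decides when to escalate."""

    def __init__(self):
        self.attempted = set()  # (error_signature, fix_description) pairs
        self.failures_by_error = Counter()
        self.failures_by_file = Counter()
        self.total_failures = 0

    def already_attempted(self, error, fix):
        return (normalize_error(error), fix) in self.attempted

    def record(self, error, fix, path, succeeded):
        signature = normalize_error(error)
        self.attempted.add((signature, fix))
        if not succeeded:
            self.failures_by_error[signature] += 1
            self.failures_by_file[path] += 1
            self.total_failures += 1

    def should_escalate(self, error, path):
        """True once any of the three thresholds has been reached."""
        return (
            self.failures_by_error[normalize_error(error)] >= MAX_SAME_ERROR_ATTEMPTS
            or self.failures_by_file[path] >= MAX_SAME_FILE_ATTEMPTS
            or self.total_failures >= MAX_TOTAL_FAILURES
        )
```

In this sketch a caller would check `already_attempted()` before applying a quick fix, call `record()` with the outcome, and create a blocker when `should_escalate()` returns true.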
Key Self-Correction Methods
- `_run_final_verification()`: While loop that re-runs gates after self-correction
- `_attempt_verification_fix()`: Orchestrates quick fixes, escalation check, LLM fixes
- `_create_escalation_blocker()`: Creates detailed blocker with context
- `_verbose_print()`: Conditional stdout output for observability
Phase 2 Complete (2026-01-15): Parallel Batch Execution
All 6 Phase 2 items from CLI_WIREFRAME.md are done:
- ✅ `work batch resume <batch-id>` - re-run failed/blocked tasks
- ✅ `depends_on` field on Task model
- ✅ Dependency graph analysis (DAG, cycle detection, topological sort)
- ✅ True parallel execution with ThreadPoolExecutor worker pool
- ✅ `--strategy auto` with LLM-based dependency inference
- ✅ `--retry N` automatic retry of failed tasks
Key Phase 2 Modules
- conductor.py: Batch orchestration with serial/parallel/auto strategies
- dependency_graph.py: DAG operations, level-based grouping for parallelization
- dependency_analyzer.py: LLM analyzes task descriptions to infer dependencies
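Level-based grouping, as mentioned for `dependency_graph.py`, can be illustrated with the standard library's `graphlib`. This is a sketch under assumptions: the function name `group_into_levels` and the input shape (task id mapped to the ids it depends on) are hypothetical and may not match the module's actual API.

```python
import graphlib


def group_into_levels(depends_on):
    """Group task ids into levels; tasks in one level do not depend on each other.

    `depends_on` maps a task id to the list of ids it depends on.
    graphlib raises CycleError from prepare() if the graph contains a cycle.
    """
    sorter = graphlib.TopologicalSorter(depends_on)
    sorter.prepare()
    levels = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        levels.append(ready)
        sorter.done(*ready)  # marking a level done unlocks the next one
    return levels
```

Each returned level could be handed to a worker pool as one parallel batch, with the next level starting only after the previous one finishes.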
Agent Implementation Complete (2026-01-14)
All 8 implementation tasks from AGENT_IMPLEMENTATION_TASKS.md are done:
- ✅ LLM Adapter Interface (`adapters/llm/`)
- ✅ Task Context Loader (`core/context.py`)
- ✅ Agent Planning (`core/planner.py`)
- ✅ Code Execution Engine (`core/executor.py`)
- ✅ Automatic Blocker Detection (in `core/agent.py`)
- ✅ Gate Integration (in `core/agent.py`)
- ✅ Agent Orchestrator (`core/agent.py`)
- ✅ Wire into Runtime (`core/runtime.py`)
Bug Fixes During Testing
- GateResult attribute access: Fixed `gate_result.status` → `gate_result.passed`
- Duplicate task transitions: Removed task status updates from agent.py (runtime handles all)
- READY→READY error: Added check in `stop_run` before transitioning
- Verification step handling: Made `_execute_verification` smarter about file vs command targets
Key Design Decisions
- State separation: Agent manages AgentState, Runtime manages TaskStatus
- Model selection: Task-based heuristic via Purpose enum
- Blocker creation: Agent creates blockers, Runtime updates task status
- Verification: Incremental (ruff after each file change) + final (all gates)
Testing
Run all tests
uv run pytest
Run v2 tests only
uv run pytest -m v2 # All v2 tests (~411 tests)
uv run pytest -m v2 -q # Quiet mode
The v2 marker identifies tests for CLI-first, headless functionality:
- All tests in `tests/core/` are automatically marked v2 (via conftest.py)
- v2 CLI tests have `pytestmark = pytest.mark.v2` at the top
Convention: When adding new v2 functionality, mark tests with @pytest.mark.v2 or add pytestmark = pytest.mark.v2 at module level for CLI tests that use codeframe.cli.app.
Run core module tests
uv run pytest tests/core/
uv run pytest tests/core/test_agent.py -v
uv run pytest tests/adapters/test_llm.py -v
Test coverage
uv run pytest --cov=codeframe --cov-report=html
Environment Variables
# Required for agent execution
ANTHROPIC_API_KEY=sk-ant-...
# Optional - Database
DATABASE_PATH=./codeframe.db
# Optional - Rate Limiting (Phase 2)
RATE_LIMIT_ENABLED=true # Enable/disable rate limiting
RATE_LIMIT_DEFAULT=100/minute # Default limit
RATE_LIMIT_AUTH=10/minute # Auth endpoints
RATE_LIMIT_AI=20/minute # AI/LLM endpoints
RATE_LIMIT_WEBSOCKET=50/minute # WebSocket connections
REDIS_URL=redis://localhost:6379 # Redis for distributed rate limiting (optional)
# Optional - API Server
CODEFRAME_API_KEY_SECRET=<random-secret> # Secret for API key hashing
Legacy sections removed on purpose
This file previously contained extensive v1 details (auth, websocket, UI template, sprint history).
Those are still in git history and legacy docs, but they are not the current contract.
The current contract is Golden Path + Refactor Plan + Command Tree mapping + Agent Implementation.
CodeFRAME v2 - Golden Path Contract (CLI-first)
This document is the contract for CodeFRAME v2 development.
Rule 0 (the only rule that matters):
This applies to both humans and agentic coding assistants.
Goals
What "done" looks like (Enhanced MVP definition)
CodeFRAME can run a complete end-to-end AI-driven development workflow from the CLI on a target repo:
No UI is required.
Non-Goals (explicitly forbidden until Golden Path works)
Do not build or refactor:
These may be revisited only after Golden Path is working and stable.
Golden Path CLI Flow (the only flow that matters)
0) Preconditions
1) Initialize a workspace
Command:
Required behavior:
Artifacts:
2) AI-driven PRD generation and refinement
Commands:
Required behavior for
3) Intelligent task generation with dependency analysis
Commands:
Required behavior:
4) Batch task execution with orchestration
Commands:
Required behavior for batch execution:
5) Enhanced human-in-loop blocker resolution
Commands:
Required behavior:
6) Integrated Git workflow and PR management
Commands:
Required behavior:
7) Enhanced verification and quality gates
Commands:
Required behavior:
8) Integrated artifact and commit management
Commands:
Required behavior:
9) Comprehensive checkpointing and state management
Commands:
Required behavior:
State Machine (authoritative)
Statuses:
Allowed transitions (comprehensive):
The CLI is the authority for transitions.
PR Workflow Integration:
Implementation Principles
Core-first (no FastAPI in the core)
CLI-first (server optional)
Salvage safely
Keep it runnable
Acceptance Checklist (Enhanced MVP - must pass)
Status: 🔄 Enhanced MVP Partially Complete
📊 Current Implementation Status
Overall Assessment: Enhanced MVP is ~60% complete with solid foundation but critical gaps remaining.
✅ Fully Implemented Phases:
|
CodeFRAME v2 Strategic Roadmap
Created: 2026-01-29
Executive Summary
CodeFRAME v2 CLI Phase 1 is complete with a production-ready foundation. The path forward involves:
Current State Assessment
What's Working (Phase 1 Complete)
Phase 1 Gaps - ALL CLOSED
Phase 1: CLI Foundation Completion ✅ COMPLETE
Goal: Make CLI fully production-ready for headless agent workflows.
Deliverables
Success Criteria - ALL MET
Phase 2: Server Layer as Thin Adapter
Goal: FastAPI server exposing core functionality via REST + real-time events.
Deliverables
Phase 2 Progress Summary
All Phase 2 Issues
Architecture Principle: Thin Adapter Pattern
Server and CLI are siblings, both calling core.
Key Pattern: V2 routers follow the thin adapter pattern:
See
Phase 2.5: ReAct Agent Architecture ✅ COMPLETE
Goal: Replace plan-then-execute agent with iterative ReAct (Reasoning + Acting) loop as the default engine.
Motivation
The plan-based agent had several failure modes discovered during testing:
Deliverables
Key Architecture Decisions
Reference Documentation
Phase 3: Web UI Rebuild
Goal: Modern dashboard consuming REST/WebSocket API.
Deliverables
Tech Stack
Note: v1-legacy issues (labeled and closed) serve as reference for this phase.
Phase 4: Multi-Agent Coordination
Goal: Realize the "FRAME" vision - specialist agents working together.
Deliverables
Related Issues
Phase 5: Advanced Features & Polish
Goal: Power user features and production hardening.
Deliverables
Execution Timeline
GitHub Issue Organization
Labels
Phase 1 Issues - ALL COMPLETE
Phase 2 Issues - MOSTLY COMPLETE
Architecture Decisions
1. Core-first pattern maintained
Core remains headless. Server and CLI are equal adapters.
2. Integration tests as guardrail
The existing 130+ v2 router tests ensure "always working codebase" through all phases.
3. No big-bang UI rewrite
Web UI is built incrementally on v2 server, not by fixing v1.
4. Agent swarms are Phase 4, not Phase 1
Focus on single-agent excellence first, then parallelize.
Verification Plan
After each phase:
Summary
Current focus: Phase 3 - Web UI rebuild on v2 foundation.
Code Review Report: PRD View - Document Creation & Discovery
Date: 2026-02-05
Executive Summary
This PR implements the full PRD View for Phase 3 UI: a well-structured, component-driven implementation across 9 incremental commits. The code follows established project patterns (Shadcn/UI Nova template, Hugeicons, SWR, axios namespace pattern) and includes 64 new tests. Two major reliability issues were found (missing error handling in page.tsx handlers and a misused React hook in DiscoveryPanel), plus a few minor improvements. No critical security vulnerabilities.
Critical Issues: 0
Review Context
Code Type: Frontend (Next.js React components, hooks, API client)
Review Focus Areas
Priority 1 Issues - Critical ⛔
None found.
Priority 2 Issues - Major
|
Code Review: feat(replay): debug and replay mode |
🧹 Nitpick comments (1)
tests/core/test_execution_recording.py (1)
141-157: Decouple buffering test from default `flush_interval`.
At Line 142, this test relies on the current default flush threshold being greater than 4. Make the threshold explicit in-test to avoid future brittle failures if defaults change.
♻️ Proposed change
- recorder = ExecutionRecorder(workspace=workspace, run_id="run-1")
+ recorder = ExecutionRecorder(workspace=workspace, run_id="run-1", flush_interval=100)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/core/test_execution_recording.py` around lines 141 - 157, The test test_flush_writes_buffered_records assumes the default flush threshold is >4; make it explicit by constructing the ExecutionRecorder with a known flush interval (e.g., flush_interval=10) so the four buffered records remain in-memory until recorder.flush() is called; update the recorder instantiation in test_flush_writes_buffered_records (the ExecutionRecorder(...) call) to include the flush_interval argument and keep the rest of the assertions unchanged.
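Why the explicit threshold matters can be seen with a small stand-in for a threshold-based buffer. This is a hedged sketch: `BufferedRecorder`, its `sink` list (standing in for the database), and the default of 50 are all hypothetical and are not the real `ExecutionRecorder` or its default.

```python
class BufferedRecorder:
    """Hypothetical stand-in for a recorder that auto-flushes at a threshold."""

    def __init__(self, sink, flush_interval=50):  # default value is an assumption
        self.sink = sink  # plain list standing in for the database
        self.flush_interval = flush_interval
        self.buffer = []

    def record(self, item):
        self.buffer.append(item)
        if len(self.buffer) >= self.flush_interval:
            self.flush()  # auto-flush once the threshold is reached

    def flush(self):
        self.sink.extend(self.buffer)
        self.buffer.clear()


# With an explicit, generous threshold the four records stay buffered
# until flush() is called, regardless of what the default happens to be.
sink = []
recorder = BufferedRecorder(sink, flush_interval=100)
for i in range(4):
    recorder.record(i)
buffered_before_flush = len(recorder.buffer)
recorder.flush()
```

If the default were ever lowered to 4 or less, a test relying on it would see the records written before `flush()` and its "nothing persisted yet" assertion would fail, which is the brittleness the comment describes.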
CodeFRAME
The Problem
Coding agents are getting remarkably good at writing code. But shipping software is not the same as writing code.
Before code gets written, someone has to figure out what to build, decompose it into tasks that an agent can execute, and resolve ambiguities. After code gets written, someone has to verify it actually works, catch regressions, and deploy with confidence.
Today, that "someone" is still you.
CodeFRAME owns the edges of the pipeline -- everything that happens before and after the code gets written. The actual coding is delegated to frontier agents (Claude Code, Codex, OpenCode, or CodeFRAME's built-in ReAct agent) that are better at it than any custom agent could be.
Think. Build. Prove. Ship.
Why CodeFRAME
Nobody else does the full upstream pipeline. Most orchestrators assume issues and specs already exist. CodeFRAME generates them through AI-guided Socratic discovery and recursive decomposition.
Agent-agnostic execution. CodeFRAME does not compete with Claude Code or Codex. It orchestrates them. The built-in ReAct agent is a capable fallback, not the point.
Quality memory (PROOF9). Every failure becomes a permanent proof obligation across 9 verification gates. Not just test coverage -- evidence-based verification that compounds over time. The closed loop is what turns a project into a learning system.
Radical simplicity. Single CLI binary, SQLite, no daemons, no infrastructure. Install and start building in under a minute.
Quick Start
# Install
git clone https://github.com/frankbria/codeframe.git
cd codeframe
curl -LsSf https://astral.sh/uv/install.sh | sh
uv venv && source .venv/bin/activate && uv sync
export ANTHROPIC_API_KEY="your-key"
# Initialize a project
cd /path/to/your/project
cf init . --detect
# Generate requirements through AI-guided discovery
cf prd generate
# Decompose into atomic tasks
cf tasks generate
# Execute (delegates to the agent engine)
cf work start <task-id> --execute
# Ship
cf pr create
That is the entire workflow. Everything else is optional.
Architecture
The core domain is headless and runs entirely from the CLI. The FastAPI server and web UI are optional adapters for teams that want a dashboard.
CLI Reference
THINK -- Requirements and Planning
# Workspace
cf init <path> # Initialize workspace
cf init <path> --detect # Auto-detect tech stack
cf status # Workspace status
# Requirements
cf prd generate # AI-guided Socratic PRD creation
cf prd generate --template lean # Use a specific template
cf prd add <file.md> # Import existing PRD
cf prd show # Display current PRD
# Task decomposition
cf tasks generate # Generate tasks from PRD (LLM-powered)
cf tasks list # List all tasks
cf tasks list --status READY # Filter by status
cf tasks show <id> # Task details with dependencies
# Scheduling
cf schedule show # Task schedule with dependencies
cf schedule predict # Completion date estimates
cf schedule bottlenecks # Identify blocking tasks
BUILD -- Execution
# Single task
cf work start <id> --execute # Execute with default engine (ReAct)
cf work start <id> --execute --engine plan # Use legacy plan engine
cf work start <id> --execute --verbose # Detailed progress output
cf work start <id> --execute --dry-run # Preview without applying
cf work start <id> --execute --stall-timeout 120 # Custom stall timeout (seconds)
cf work start <id> --execute --stall-action retry # Auto-retry on stall (blocker|retry|fail)
cf work follow <id> # Stream live output
cf work stop <id> # Cancel a run
cf work resume <id> # Resume after answering blockers
# Batch execution
cf work batch run --all-ready # All READY tasks
cf work batch run --strategy parallel # Parallel execution
cf work batch run --strategy auto # LLM-inferred dependencies
cf work batch run --retry 3 # Auto-retry failures
cf work batch status [batch_id] # Batch progress
cf work batch resume <batch_id> # Re-run failed tasks
# Blockers (human-in-the-loop)
cf blocker list # Questions the agent needs answered
cf blocker show <id> # Blocker details
cf blocker answer <id> "answer" # Unblock the agent
# Diagnostics
cf work diagnose <id> # AI-powered failure analysis
cf env check # Validate environment
cf env doctor # Comprehensive health check
PROVE -- Verification
cf review # Run verification gates
cf checkpoint create "milestone" # Snapshot project state
cf checkpoint list # List checkpoints
cf checkpoint restore <id> # Roll back to checkpoint
SHIP -- Delivery
cf pr create # Create PR from current branch
cf pr status # PR status and review state
cf pr checks # CI check results
cf pr merge # Merge approved PR
cf commit # Commit verified changes
cf patch export # Export changes as patch
What Works Today
CodeFRAME v2 (Phase 2.5 complete) delivers the full Think-Build-Ship loop:
Roadmap
THINK (upstream pipeline)
BUILD (agent adapters)
PROVE (quality memory)
SHIP (delivery confidence)
Web UI
Configuration
# Required
export ANTHROPIC_API_KEY=sk-ant-...
# Optional
export DATABASE_PATH=./codeframe.db # Default: in-memory SQLite
export RATE_LIMIT_ENABLED=true # API rate limiting
export RATE_LIMIT_DEFAULT=100/minute # Default limit
For server configuration, rate limiting options, and API key setup, see docs/PHASE_2_DEVELOPER_GUIDE.md.
Testing
uv run pytest # All tests
uv run pytest -m v2 # v2 tests only
uv run pytest tests/core/ # Core module tests
uv run pytest --cov=codeframe --cov-report=html # With coverage
Documentation
Contributing
Code standards: PEP 8,
License
AGPL-3.0 -- Free to use, modify, and distribute. Derivative works and network services must release source code under the same license.
Built by Frank Bria
Code Review: feat(replay): debug and replay mode
Well-structured addition with good test coverage. Core architecture is solid (headless replay.py, optional recorder in ReactAgent, clean CLI/data separation). A few issues need attention before merge.
Bug: work_diff crashes on empty step list
In app.py, the default to_step calculation raises ValueError if trace.steps is empty:
Gap: Recorder never wired up in builtin.py
_build_agent() in codeframe/core/adapters/builtin.py constructs ReactAgent without passing execution_recorder, so no traces are actually recorded during normal cf work start --execute runs. The recorder parameter exists but has no production callsite. The PR notes this is intentional, but it means cf work replay will always return 'no trace found' for real runs. A follow-up ticket to wire this up from builtin.py (or an explicit --record flag) would prevent user confusion.
Style: format parameter shadows built-in
In work_export_trace, format: str = typer.Option() shadows Python's built-in. Rename to output_format or fmt.
Maint: Schema DDL duplicated in init and upgrade paths
The three new table DDL statements are copy-pasted verbatim into both _init_database and _ensure_schema_upgrades. Since the upgrade path already uses CREATE TABLE IF NOT EXISTS, the tables only need to appear there.
UX: work_rerun command name implies execution it does not do
The command shows what would be re-run but does not actually execute anything. Consider cf work replay-plan or add an --apply flag stub with a clear 'not yet implemented' message.
Minor: delete operation type in schema is unreachable
The DB schema enforces CHECK (operation_type IN ('create', 'edit', 'delete')) but react_agent.py only hooks edit_file and create_file. Dead schema: either add delete recording or drop the variant from the constraint.
Minor: work_rerun has no CLI tests
test_replay_commands.py covers replay, diff, and export-trace but skips rerun. Add at minimum a happy-path and a step-not-found error case.
Nit: flush() eats all exceptions at DEBUG level
Correct to not crash the agent, but a schema error or disk-full will be invisible without debug logging. Consider logger.warning instead.
Summary table: Bug (work_diff max() crash), Gap (recorder not wired in production), Style (format shadows built-in), Maint (schema DDL duplication), UX (rerun name misleads), Minor x2 (dead delete schema, missing rerun tests), Nit (DEBUG vs WARNING).
Fix the max() bug and track the recorder wiring before merge.
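The `work_diff` crash flagged in this review is the standard behavior of `max()` on an empty iterable. The exact expression in app.py is not shown here, so the following is only a sketch of one possible guard: the `Step` dataclass and the helper `default_to_step` are hypothetical stand-ins, and the `step_number` attribute is assumed.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Step:
    """Hypothetical stand-in for a recorded execution step."""

    step_number: int


def default_to_step(steps: Sequence[Step]) -> Optional[int]:
    """Return the highest step number, or None when the trace has no steps."""
    # max() raises ValueError on an empty iterable unless `default` is given.
    return max((s.step_number for s in steps), default=None)
```

A caller could treat `None` as "nothing to diff" and print a friendly message instead of propagating a traceback.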
…format shadow (#315)
- Fix data loss in ExecutionRecorder.flush(): only clear buffers on success, retain data for retry on failure, log at WARNING not DEBUG
- Fix edit_file recording: read actual file content after edit instead of capturing search/replace snippet (create_file still uses tool input)
- Rename format parameter to output_format to avoid shadowing builtin
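The flush behavior this commit describes (clear buffers only on success, keep data for retry, log at WARNING) might look roughly like the sketch below. It is an illustration, not the actual `ExecutionRecorder.flush()`: the class `RetryingBuffer`, the injected `write` callable, and the logger name are hypothetical.

```python
import logging

logger = logging.getLogger("recorder_sketch")  # hypothetical logger name


class RetryingBuffer:
    """Hypothetical buffer whose flush() keeps records when the write fails."""

    def __init__(self, write):
        self.write = write  # callable that persists a list of records; may raise
        self.buffer = []

    def add(self, record):
        self.buffer.append(record)

    def flush(self):
        """Return True if the buffer is empty or was written successfully."""
        if not self.buffer:
            return True
        try:
            self.write(list(self.buffer))
        except Exception as exc:  # recording must never crash the agent
            logger.warning(
                "flush failed, keeping %d records for retry: %s",
                len(self.buffer),
                exc,
            )
            return False
        self.buffer.clear()  # clear only after a successful write
        return True
```

Clearing after the write, rather than in a `finally` block, is what prevents the data loss: a failed write leaves the buffer intact, so the next `flush()` retries the same records.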
Code Review: feat(replay): debug and replay mode (#449)
Overall this is a well-structured addition. The feature is clearly scoped, the module is headless (no FastAPI), the ExecutionRecorder integration on ReactAgent is entirely opt-in, and the test coverage (78 new tests across unit, integration, and CLI layers) is thorough. A few issues worth addressing before merge, ranging from a potential crash to data-quality gaps.
Critical
In
Important
In
The auto-flush fires when
Minor / Nice-to-have
Schema duplication between
The three new tables and their six indexes are copy-pasted verbatim in both functions in
The
The class is well-designed and tested, but all four CLI commands build their own
The display slice at 200 chars is fine, but appending
Conventions Check (CLAUDE.md)
The two items to fix before merge are the
Actionable comments posted: 1
♻️ Duplicate comments (4)
codeframe/cli/app.py (3)
3378-3405: ⚠️ Potential issue | 🟠 Major
`export-trace` currently exports a partial trace shape.
This command claims full trace export, but `codeframe/core/replay.py` exporters currently omit `trace.llm_interactions` and file before/after contents, so offline analysis/reconstruction is incomplete.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@codeframe/cli/app.py` around lines 3378 - 3405, The exporters used by the export-trace command (export_trace_json and export_trace_markdown in codeframe.core.replay) currently omit trace.llm_interactions and file before/after contents, so update those functions to include the full trace shape: add trace.llm_interactions (preserving interaction metadata and tokens) to the JSON/Markdown output and include file before_contents and after_contents (or full file diffs) for each file change entry; ensure load_execution_trace continues to return these fields and that export_trace_markdown formats/embeds the file contents or diffs (not just filenames) so offline consumers can reconstruct the run.
3138-3246: ⚠️ Potential issue | 🟠 Major
`work replay` still isn’t a replay session.
Without `--step`, Line 3219 just prints all steps and exits. There’s no interactive next/prev/jump navigation, so users still can’t step through execution as replay mode implies.
Suggested direction
- for s in steps_to_show:
-     ...
+ if step is None:
+     # start interactive replay loop: next/prev/jump/show-llm/quit
+     # render one current step at a time
+     ...
+ else:
+     # single-step render
+     ...
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@codeframe/cli/app.py` around lines 3138 - 3246, The command currently prints all steps when no --step is given instead of an interactive replay; update work_replay so that when step is None it enters an interactive loop letting the user navigate next/prev/jump/quit (commands like n/p/j <num>/q), updates a current index over trace.steps, and renders only the current step. To implement: extract the per-step rendering logic (the block that prints status, files and LLM output using ops_by_step, llm_by_step, show_files and show_llm) into a helper (e.g., render_step) and call it from the interactive loop; accept user input via console.input (or typer.prompt), adjust the index on n/p/j commands, validate bounds and show helpful prompts, and exit the loop on q. Keep existing single-step behavior (when --step is provided) unchanged.
3411-3463: ⚠️ Potential issue | 🟠 Major
`work rerun` prepares state but never reruns.
Line 3442 only calls `prepare_rerun()` and then prints metadata; it does not restore workspace file state nor start a new execution run, so the command behavior doesn’t match rerun semantics.
Suggested direction
- rerun_info = prepare_rerun(workspace, run_id, from_step)
- # print only
+ rerun_info = prepare_rerun(workspace, run_id, from_step)
+ # restore rerun_info["file_state"] to workspace
+ # create/start a new run for rerun_info["task_id"]
+ # invoke runtime.execute_agent(...) and report status
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@codeframe/cli/app.py` around lines 3411 - 3463, work_rerun currently only calls prepare_rerun and prints info but never restores files or starts execution; update work_rerun to accept an explicit flag (e.g., --apply / --execute) and when set: 1) obtain file_state from prepare_rerun and restore those files into the workspace (use get_workspace(path) and the workspace API to write/overwrite each path in rerun_info["file_state"]), and 2) invoke the replay execution routine (call a function such as codeframe.core.replay.execute_rerun(workspace, run_id, from_step) or the existing runner API to start/resume the run) and print the run outcome; keep prepare_rerun, get_workspace and rerun_info keys ("file_state", "remaining_steps", "task_id") as the reference points for locating and applying the changes.
codeframe/core/react_agent.py (1)
465-488: ⚠️ Potential issue | 🟠 Major
Trace recording still skips the verification/fix execution path.
Line 465 instruments only `_react_loop()`, but `_run_final_verification()` also performs LLM calls and tool executions. Runs needing verification retries will export/replay incomplete traces and reconstruct the wrong checkpoint state.
Suggested direction
+ # in _run_final_verification(), for each correction turn:
+ # 1) record_iteration(...)
+ # 2) record_llm_call(...)
+ # 3) record_file_operation(...) for successful create/edit/delete tools
+ # Reuse the same recording helper used by _react_loop() to keep behavior consistent.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@codeframe/core/react_agent.py` around lines 465 - 488, The trace recording currently only wraps `_react_loop()` LLM/tool interactions, so calls made inside `_run_final_verification()` are not recorded and lead to incomplete traces; update the logic where `_rec_step_id` and subsequent `execution_recorder.record_iteration(...)` / `record_llm_call(...)` are invoked to also run for LLM/tool responses originating from `_run_final_verification()` (or factor the recording into a helper used by both `_react_loop` and `_run_final_verification`), ensuring you pass the same fields (step_number/iterations, tool_names from `response.tool_calls`, llm_response_summary, prompt_summary, model, tokens_used using `response.input_tokens + response.output_tokens`, and purpose="execution") so verification retries are included in exported/replayed traces.
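One way to factor the recording into a helper shared by `_react_loop` and `_run_final_verification` is sketched below. The keyword names mirror the fields listed in the comment above, but the real `ExecutionRecorder` signatures are not shown here, so treat them as assumptions; `record_llm_turn`, `DemoRecorder`, the 200-character summary cap, and `response.content` are hypothetical.

```python
from typing import Any, Optional


def record_llm_turn(
    recorder: Optional[Any],
    step_number: int,
    response: Any,
    prompt_summary: str,
    model: str,
    purpose: str = "execution",
) -> Optional[Any]:
    """Record one LLM turn and return the step id, or None if recording is off."""
    if recorder is None:
        return None
    # Assumption: each tool call exposes a .name attribute.
    tool_names = [tc.name for tc in (response.tool_calls or [])]
    step_id = recorder.record_iteration(
        step_number=step_number,
        tool_names=tool_names,
        llm_response_summary=(response.content or "")[:200],
    )
    recorder.record_llm_call(
        step_id=step_id,
        prompt_summary=prompt_summary,
        model=model,
        tokens_used=response.input_tokens + response.output_tokens,
        purpose=purpose,
    )
    return step_id


class DemoRecorder:
    """Stand-in recorder for illustration only; it just stores the fields."""

    def __init__(self) -> None:
        self.steps = []
        self.llm_calls = []

    def record_iteration(self, **fields: Any) -> str:
        self.steps.append(fields)
        return f"step-{len(self.steps)}"

    def record_llm_call(self, **fields: Any) -> None:
        self.llm_calls.append(fields)
```

With a helper of this shape, both call sites pass their own step counter and prompt summary, and the `recorder is None` guard stays in one place.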
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@codeframe/core/react_agent.py`:
- Around line 569-597: The execution recording currently uses the create_file
input content and omits the pre-change state for edits, so update the recording
logic in the block handling execution_recorder/_rec_step_id for tc.name in
("edit_file","create_file") to: for create_file, read the actual on-disk file
content after the tool runs (using self.workspace.repo_path / path) and use that
as _op_after instead of tc.input.get("content"); for edit_file, read and supply
both the pre-change content (before) by reading the file prior to the edit (if
present) and the post-change content (after) by reading the file after the edit
into _op_after; then call
execution_recorder.record_file_operation(step_id=_rec_step_id, op_type=_op_type,
path=_op_path, before=<pre-content-or-None>, after=_op_after) so edits have a
non-None before value and creates reflect actual disk state.
---
Duplicate comments:
In `@codeframe/cli/app.py`:
- Around line 3378-3405: The exporters used by the export-trace command
(export_trace_json and export_trace_markdown in codeframe.core.replay) currently
omit trace.llm_interactions and file before/after contents, so update those
functions to include the full trace shape: add trace.llm_interactions
(preserving interaction metadata and tokens) to the JSON/Markdown output and
include file before_contents and after_contents (or full file diffs) for each
file change entry; ensure load_execution_trace continues to return these fields
and that export_trace_markdown formats/embeds the file contents or diffs (not
just filenames) so offline consumers can reconstruct the run.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 3cd06411-ca01-41bf-943c-74ee7739abe5
📒 Files selected for processing (3)
- codeframe/cli/app.py
- codeframe/core/react_agent.py
- codeframe/core/replay.py
🚧 Files skipped from review as they are similar to previous changes (1)
- codeframe/core/replay.py
    # --- Execution recording: file operations ---
    if (
        self.execution_recorder is not None
        and _rec_step_id is not None
        and tc.name in ("edit_file", "create_file")
        and not result.is_error
    ):
        _op_type = "create" if tc.name == "create_file" else "edit"
        _op_path = tc.input.get("path", "")
        if tc.name == "create_file":
            # create_file input has the full content
            _op_after = tc.input.get("content", "")
        else:
            # edit_file uses search/replace snippets; read the
            # actual file content after the edit for accurate state.
            _op_after = None
            try:
                _full_path = self.workspace.repo_path / _op_path
                if _full_path.is_file():
                    _op_after = _full_path.read_text(errors="replace")
            except OSError:
                pass
        self.execution_recorder.record_file_operation(
            step_id=_rec_step_id,
            op_type=_op_type,
            path=_op_path,
            before=None,
            after=_op_after,
        )
File-operation snapshots are still not fully faithful to on-disk state.
At Line 580, create_file uses input content instead of reading the post-tool file, so autofix/lint mutations are missed. At Line 595, edits are recorded with before=None, which drops pre-change state expected by ExecutionRecorder.record_file_operation() for edit operations.
Suggested fix
-        if tc.name == "create_file":
-            # create_file input has the full content
-            _op_after = tc.input.get("content", "")
-        else:
-            # edit_file uses search/replace snippets; read the
-            # actual file content after the edit for accurate state.
-            _op_after = None
-            try:
-                _full_path = self.workspace.repo_path / _op_path
-                if _full_path.is_file():
-                    _op_after = _full_path.read_text(errors="replace")
-            except OSError:
-                pass
+        _op_after = None
+        try:
+            _full_path = (self.workspace.repo_path / _op_path).resolve()
+            if _full_path.is_file():
+                _op_after = _full_path.read_text(errors="replace")
+        except OSError:
+            pass
         self.execution_recorder.record_file_operation(
             step_id=_rec_step_id,
             op_type=_op_type,
             path=_op_path,
-            before=None,
+            before=_op_before,
             after=_op_after,
         )

-    result = self._execute_tool_with_lint(tc)
+    _op_before = None
+    if tc.name == "edit_file":
+        try:
+            _before_path = (self.workspace.repo_path / tc.input.get("path", "")).resolve()
+            if _before_path.is_file():
+                _op_before = _before_path.read_text(errors="replace")
+        except OSError:
+            pass
+    result = self._execute_tool_with_lint(tc)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@codeframe/core/react_agent.py` around lines 569 - 597, The execution
recording currently uses the create_file input content and omits the pre-change
state for edits, so update the recording logic in the block handling
execution_recorder/_rec_step_id for tc.name in ("edit_file","create_file") to:
for create_file, read the actual on-disk file content after the tool runs (using
self.workspace.repo_path / path) and use that as _op_after instead of
tc.input.get("content"); for edit_file, read and supply both the pre-change
content (before) by reading the file prior to the edit (if present) and the
post-change content (after) by reading the file after the edit into _op_after;
then call execution_recorder.record_file_operation(step_id=_rec_step_id,
op_type=_op_type, path=_op_path, before=<pre-content-or-None>, after=_op_after)
so edits have a non-None before value and creates reflect actual disk state.
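As a standalone illustration of the before/after capture requested above, the snapshot reads can be pulled into a small helper that wraps the tool call. This is only a sketch under assumptions: `read_snapshot` and `run_with_snapshots` are hypothetical names, and `run_tool` stands in for whatever actually executes the tool (for example `_execute_tool_with_lint`).

```python
from pathlib import Path
from typing import Any, Callable, Optional, Tuple


def read_snapshot(repo_path: Path, rel_path: str) -> Optional[str]:
    """Return the file's current text, or None if it is missing or unreadable."""
    try:
        full_path = (repo_path / rel_path).resolve()
        if full_path.is_file():
            return full_path.read_text(errors="replace")
    except OSError:
        pass
    return None


def run_with_snapshots(
    repo_path: Path,
    rel_path: str,
    run_tool: Callable[[], Any],
) -> Tuple[Optional[str], Optional[str], Any]:
    """Capture on-disk content before and after running a tool.

    Reading from disk after the call means any lint or autofix rewrite is
    reflected in `after`, for creates as well as edits.
    """
    before = read_snapshot(repo_path, rel_path)
    result = run_tool()
    after = read_snapshot(repo_path, rel_path)
    return before, after, result
```

For a newly created file `before` comes back as `None`, which matches the "pre-content-or-None" wording in the prompt; the caller would still skip recording when the tool result is an error.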

Summary
Implements #315: Debug and replay mode for stepping through past agent executions.
- Database tables (`execution_steps`, `llm_interactions`, `file_operations`) for recording complete execution traces
- CLI commands: `cf work replay`, `cf work diff`, `cf work export-trace`, `cf work rerun`
Acceptance Criteria
Test Plan
Implementation Notes
if recorder
Closes #315
Summary by CodeRabbit
New Features
Chores
Tests