feat(replay): debug and replay mode (#315)#449

Merged
frankbria merged 8 commits into
mainfrom
feature/issue-315-debug-replay-mode
Mar 17, 2026

Conversation

@frankbria frankbria commented Mar 17, 2026

Owner

Summary

Implements #315: Debug and replay mode for stepping through past agent executions.

  • 3 new DB tables (execution_steps, llm_interactions, file_operations) for recording complete execution traces
  • ExecutionRecorder class that ReactAgent uses to persist each iteration, LLM call, and file operation
  • 4 new CLI commands: cf work replay, cf work diff, cf work export-trace, cf work rerun
  • ReplaySession class for interactive step-through navigation
  • Export to JSON and Markdown formats
  • State reconstruction at any step via file operation replay
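The file diff between two steps can be pictured as a unified diff over two reconstructed snapshots. The following is a minimal sketch using only the standard library; `diff_file_states` and the path-to-content dict shape are assumptions for illustration, not the PR's `compare_steps` implementation.

```python
import difflib


def diff_file_states(before: dict[str, str], after: dict[str, str]) -> str:
    """Return a unified diff for every path that differs between two snapshots.

    Hypothetical helper: each snapshot maps a relative path to full file content.
    """
    chunks: list[str] = []
    for path in sorted(set(before) | set(after)):
        old = before.get(path, "")  # missing path is treated as an empty file
        new = after.get(path, "")
        if old == new:
            continue
        chunks.extend(
            difflib.unified_diff(
                old.splitlines(keepends=True),
                new.splitlines(keepends=True),
                fromfile=f"a/{path}",
                tofile=f"b/{path}",
            )
        )
    return "".join(chunks)


# Example snapshots for two consecutive steps (made-up content).
step_1 = {"app.py": "print('hi')\n"}
step_2 = {"app.py": "print('hello')\n", "README.md": "# Demo\n"}
patch = diff_file_states(step_1, step_2)
print(patch)
```

Treating a missing path as an empty file keeps created and deleted files in the same diff format as edits.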

Acceptance Criteria

  • Execution details recorded to database
  • Replay command shows step-by-step execution
  • File diffs available at each step
  • Execution can be re-run from any checkpoint
  • Export execution trace for analysis

Test Plan

  • 45 unit tests for data models, CRUD, session navigation, export
  • 11 tests for execution recording + ReactAgent integration
  • 14 CLI command tests (replay, diff, export-trace, rerun)
  • 9 integration tests for full record → replay lifecycle
  • All 78 new tests passing
  • Linting clean (ruff)
  • No regressions in existing test suite (124 total)

Implementation Notes

  • ReactAgent instrumentation is optional — recorder defaults to None, all hooks guarded by if recorder
  • Deferred from original plan: HTML export, full what-if re-execution with modified LLM inputs, checkpoint system integration (existing checkpoints are workspace-level, not step-level)
  • ReplaySession is a pure data navigator — display is delegated to the CLI layer for testability
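The optional-recorder pattern from the first note can be sketched as follows. `ListRecorder` and `TinyAgent` are hypothetical stand-ins, not the PR's `ExecutionRecorder` or `ReactAgent`; only the "default to None, guard every hook" shape is taken from the description.

```python
from typing import Optional


class ListRecorder:
    """Hypothetical stand-in for a recorder; keeps steps in memory."""

    def __init__(self) -> None:
        self.steps: list[tuple[int, str]] = []

    def record_iteration(self, step_number: int, summary: str) -> None:
        self.steps.append((step_number, summary))


class TinyAgent:
    """Hypothetical agent; only the guard pattern mirrors the PR notes."""

    def __init__(self, recorder: Optional[ListRecorder] = None) -> None:
        self.recorder = recorder  # None means tracing is off

    def run(self, actions: list[str]) -> list[str]:
        results: list[str] = []
        for number, action in enumerate(actions, start=1):
            result = action.upper()  # placeholder for real agent work
            results.append(result)
            if self.recorder is not None:  # hook is skipped when tracing is off
                self.recorder.record_iteration(number, result)
        return results


recorder = ListRecorder()
traced = TinyAgent(recorder).run(["read", "edit"])
untraced = TinyAgent().run(["read", "edit"])
```

Both runs return the same results, so the recorder observes the loop without changing its behavior.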

Closes #315

Summary by CodeRabbit

  • New Features

    • CLI commands to replay traces step-by-step, show diffs between steps, export traces (JSON/Markdown), and prepare reruns.
    • Optional agent tracing: record and persist execution traces including per-step summaries, LLM interactions, and file operations; replay navigation supported.
  • Chores

    • Database schema extended to store execution steps, LLM interactions, and file operation history with upgrade handling.
  • Tests

    • Extensive unit and integration tests covering recording, persistence, replay, diffing, export, and rerun flows.

Test User added 6 commits March 17, 2026 07:16
Add three new database tables (execution_steps, llm_interactions,
file_operations) and corresponding dataclasses + CRUD operations
for recording complete execution traces during agent runs.
Add ExecutionRecorder class with buffered writes for recording execution
traces during ReactAgent runs. Hook into _react_loop to capture iteration
steps, LLM interactions, and file operations without affecting agent logic.
- cf work replay <run-id> — step-through execution with Rich formatting
- cf work diff <run-id> — show file changes with unified diff
- cf work export-trace <run-id> — export to JSON or Markdown
…nt (#315)

- ReplaySession class with n/p/j navigation for step-through
- prepare_rerun() reconstructs file state at any step
- cf work rerun command shows state and remaining steps
…315)

End-to-end tests: ExecutionRecorder records a 3-step agent run,
then verifies trace loading, step snapshots, diffs, JSON/Markdown
export, ReplaySession navigation, and rerun preparation.
coderabbitai Bot commented Mar 17, 2026

Contributor

Walkthrough

Adds execution tracing and replay capabilities: a new replay module and ExecutionRecorder, DB schema additions, ReactAgent integration to record traces, CLI commands for replay/diff/export/rerun, and comprehensive unit and integration tests.

Changes

| Cohort | File(s) | Summary |
|---|---|---|
| CLI Replay Commands | codeframe/cli/app.py | Adds work_replay, work_diff, work_export_trace, work_rerun CLI handlers and import json. Implements trace loading, diffs, exports (JSON/Markdown), and rerun preparation. Duplicate implementations appear in two file sections. |
| Core Replay Infrastructure | codeframe/core/replay.py | New module with data models (ExecutionStep, LLMInteraction, FileOperation, ExecutionTrace), ExecutionRecorder (buffered recording + flush), CRUD persistence helpers, trace loading, snapshots, step comparisons, export utilities (JSON/Markdown), ReplaySession, and prepare_rerun. |
| ReactAgent Integration | codeframe/core/react_agent.py | Adds optional execution_recorder param to ReactAgent; records iterations, LLM calls, and file operations during the ReAct loop and flushes the recorder on completion with guarded error handling. |
| Database Schema | codeframe/core/workspace.py | Adds SQLite tables execution_steps, llm_interactions, file_operations with indexes and constraints; ensures creation during initial init and schema upgrades. |
| CLI Tests | tests/cli/test_replay_commands.py | New CLI tests using Typer CliRunner for replay, step selection, LLM/file display, diffs, export (stdout and file), rerun, and error handling for missing runs. |
| Execution Recording Tests | tests/core/test_execution_recording.py | Unit and integration tests validating ExecutionRecorder methods, buffering/flush behavior, and ReactAgent recording behavior with and without a recorder. |
| Replay Unit Tests | tests/core/test_replay.py | Tests for data models, DB CRUD, trace loading, snapshots, diffs, JSON/Markdown exports, ReplaySession navigation, and prepare_rerun. |
| Replay Integration Tests | tests/core/test_replay_integration.py | End-to-end integration tests exercising recording, flush, trace loading, snapshots/diffs, exports, replay navigation, rerun preparation, and summary aggregation. |

Sequence Diagram(s)

sequenceDiagram
    participant Agent as ReactAgent
    participant Recorder as ExecutionRecorder
    participant DB as Workspace (SQLite)
    participant CLI as Replay CLI

    Agent->>Recorder: record_iteration(step_number, tools, summary)
    Note over Recorder: Buffer ExecutionStep
    Agent->>Recorder: record_llm_call(step_id, prompt, response, model, tokens)
    Note over Recorder: Buffer LLMInteraction
    Agent->>Recorder: record_file_operation(step_id, op_type, path, before, after)
    Note over Recorder: Buffer FileOperation
    Agent->>Recorder: flush()
    Recorder->>DB: save_execution_step(step)
    Recorder->>DB: save_llm_interaction(interaction)
    Recorder->>DB: save_file_operation(operation)

    CLI->>DB: load_execution_trace(run_id)
    DB-->>CLI: ExecutionTrace
    CLI->>CLI: ReplaySession.navigate(step)
    CLI->>DB: get_step_snapshot(run_id, step_number)
    DB-->>CLI: file_state
sequenceDiagram
    participant CLI as Replay Commands
    participant Session as ReplaySession
    participant DB as Trace Data
    participant Formatter as Export/Display

    CLI->>DB: load_execution_trace(run_id)
    DB-->>Session: ExecutionTrace

    alt work_replay
        Session->>Formatter: Format step with LLM/files
        Formatter-->>CLI: Display output
    else work_diff
        CLI->>DB: compare_steps(from_step, to_step)
        DB-->>CLI: Changes dict
        Formatter->>CLI: Unified diff
    else work_export_trace
        Formatter->>Formatter: export_trace_json() / export_trace_markdown()
        Formatter-->>CLI: JSON/Markdown output
    else work_rerun
        CLI->>DB: prepare_rerun(run_id, from_step)
        DB-->>CLI: File state + metadata
        Formatter->>CLI: Render checkpoint state
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 I hopped through traces, prompt and file,

Saved each step and paused to smile,
Replay the tale, diff every line,
Rerun from checkpoints, past to refine,
Crunching traces, carrot in paw — what a find!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 27.21% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'feat(replay): debug and replay mode (#315)' directly describes the main feature added: debug and replay capabilities. It is concise, clear, and accurately reflects the primary objective of this changeset. |
| Linked Issues check | ✅ Passed | The PR implements all coding-related objectives from issue #315: execution recording to database [execution_steps, llm_interactions, file_operations tables], replay command with step navigation, file diffs, rerun from checkpoints, and trace export (JSON/Markdown). The PR directly addresses all acceptance criteria marked satisfied in the PR summary. |
| Out of Scope Changes check | ✅ Passed | All changes are directly scoped to issue #315 objectives: database schema extensions for tracing, ExecutionRecorder integration with ReactAgent, CLI command additions (replay, diff, export-trace, rerun), and comprehensive test coverage. No unrelated changes detected. |

claude Bot commented Mar 17, 2026

Review: feat(replay): debug and replay mode (#315)

Good feature addition overall - the architecture is clean and follows the repo's headless-core pattern. Here are my findings:


Bugs

1. Data loss in ExecutionRecorder.flush() - critical

In codeframe/core/replay.py, the flush() method clears the buffers in a finally block, which runs even when the DB write fails. The try/except/finally structure means if save_execution_step raises, the exception is caught and logged at DEBUG, then finally clears all three buffers - permanently losing data. The clear should be inside the try block so it only runs on success.
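A minimal sketch of the suggested shape, where the buffer is cleared only after every write has succeeded. `BufferedRecorder` and its injected `save` callable are hypothetical stand-ins for `ExecutionRecorder` and the `save_*` helpers; the WARNING log level is an assumption borrowed from finding 5.

```python
import logging

logger = logging.getLogger("replay_sketch")


class BufferedRecorder:
    """Hypothetical recorder; `save` stands in for the real persistence helpers."""

    def __init__(self, save) -> None:
        self._save = save
        self._buffer: list[str] = []

    def record(self, item: str) -> None:
        self._buffer.append(item)

    def flush(self) -> bool:
        """Persist buffered items; keep them if any write fails."""
        try:
            for item in self._buffer:
                self._save(item)
            self._buffer.clear()  # reached only when every save succeeded
            return True
        except Exception:
            logger.warning(
                "flush failed; %d items kept for retry", len(self._buffer), exc_info=True
            )
            return False


stored: list[str] = []
attempts = {"count": 0}


def flaky_save(item: str) -> None:
    """Fails on the first call to simulate a transient database error."""
    attempts["count"] += 1
    if attempts["count"] == 1:
        raise OSError("database is locked")
    stored.append(item)


rec = BufferedRecorder(flaky_save)
rec.record("step-1")
first = rec.flush()   # fails, buffer is kept
second = rec.flush()  # retry succeeds, buffer is cleared
```

One caveat with this shape: if a failure happens partway through, a retry re-sends items that were already saved, so the writes would need to be idempotent or wrapped in a single transaction.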

2. Incorrect file state reconstruction for edit_file operations

In codeframe/core/react_agent.py, the recorder captures new_text as content_after for edit_file operations. But new_text in the search-replace editor is the replacement snippet, not the full file content. get_step_snapshot() replays file operations by setting file_state[path] = op.content_after, so it reconstructs the file as just the replaced fragment. This means cf work diff and cf work rerun will produce incorrect file states for any run that used edit_file. The fix is to read the actual file content from disk after the edit completes, rather than capturing the tool input argument.
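The effect can be illustrated with a small replay sketch. `FileOp` and `snapshot_at` are hypothetical, trimmed-down stand-ins for `FileOperation` and `get_step_snapshot()`; the assumption is only that a snapshot is built by assigning `content_after` per path, as described above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FileOp:
    """Hypothetical stand-in for a recorded file operation."""

    step_number: int
    path: str
    content_after: Optional[str]  # None models a delete


def snapshot_at(ops: list[FileOp], step_number: int) -> dict[str, str]:
    """Replay operations up to and including step_number."""
    state: dict[str, str] = {}
    for op in sorted(ops, key=lambda o: o.step_number):
        if op.step_number > step_number:
            break
        if op.content_after is None:
            state.pop(op.path, None)
        else:
            state[op.path] = op.content_after
    return state


original = "def greet():\n    return 'hi'\n"
edited = original.replace("'hi'", "'hello'")

# Recording the replacement snippet (the tool argument) loses the rest of the file.
snippet_ops = [FileOp(1, "greet.py", original), FileOp(2, "greet.py", "'hello'")]
# Recording the full post-edit content reconstructs the file correctly.
full_ops = [FileOp(1, "greet.py", original), FileOp(2, "greet.py", edited)]

wrong = snapshot_at(snippet_ops, 2)
right = snapshot_at(full_ops, 2)
```

With the snippet recorded, the step-2 snapshot of `greet.py` is just `'hello'`; with the full content recorded, it matches the file on disk after the edit.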


Code Issues

3. format parameter shadows Python builtin

In work_export_trace (cli/app.py), the parameter format: str = typer.Option(...) shadows the Python builtin. Rename to output_format or fmt.

4. cf work rerun is misleadingly named

The command prepares a rerun (shows file state and remaining steps) but does not actually execute anything. Given cf work start starts execution, users will expect cf work rerun to do the same. Either rename to something like cf work inspect-step, or add a --execute flag consistent with the cf work start UX.

5. Silent trace degradation on flush failure

Flush failures in react_agent.py are logged at DEBUG. If tracing silently fails, cf work replay will show an empty/partial trace with no user-visible explanation. Raise this to WARNING level.


Performance Consideration

6. Full file content stored in SQLite TEXT columns

For large source files, storing full before/after content in file_operations could bloat the database significantly over time. Consider adding a size cap (e.g., skip content capture if file exceeds 100KB) or tracking this as a follow-up issue.
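One possible shape for such a cap, as a sketch only: `capture` and `Captured` are hypothetical names, and the 100KB limit is the example figure from the paragraph above rather than a value from the PR.

```python
from typing import NamedTuple, Optional

MAX_CAPTURE_BYTES = 100 * 1024  # assumed cap, taken from the example above


class Captured(NamedTuple):
    """Result of a capture attempt (hypothetical structure)."""

    content: Optional[str]  # None when the content was too large to store
    size_bytes: int
    skipped: bool


def capture(text: str, limit: int = MAX_CAPTURE_BYTES) -> Captured:
    """Keep content under the limit; otherwise record only its size."""
    size = len(text.encode("utf-8"))
    if size > limit:
        return Captured(content=None, size_bytes=size, skipped=True)
    return Captured(content=text, size_bytes=size, skipped=False)


small = capture("x = 1\n")
big = capture("a" * (MAX_CAPTURE_BYTES + 1))
```

Keeping the size and a `skipped` flag lets a replay view say that content was omitted instead of showing the file as empty.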


Minor Observations

  • ReplaySession is well-designed and testable, but none of the CLI commands currently use it - they build their own step lookups inline. Fine for now since the interactive REPL loop is deferred, but worth noting for when that mode is added.
  • The load_execution_trace fallback returning task_id=unknown is useful for tests but could surface confusingly in production if a run record is missing.
  • Test coverage looks solid - 78 new tests across unit, integration, and CLI layers is the right approach for a new feature of this scope.

Summary

The architecture is sound and the core pattern (headless module, thin CLI adapter, buffered recorder) is well-executed. Two items should be addressed before merging:

  1. flush() data loss bug - buffer cleared in finally, so failures silently drop trace data
  2. edit_file state reconstruction bug - new_text is a snippet, not the full post-edit file content; cf work diff and cf work rerun will reconstruct wrong state

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 9

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@codeframe/cli/app.py`:
- Around line 3411-3469: work_rerun currently only calls prepare_rerun and
prints rerun_info but never restores the workspace state or triggers execution;
after obtaining rerun_info in work_rerun you should apply the returned
file_state to the workspace (e.g., call a method on the workspace like
restore_snapshot/apply_file_state using rerun_info["file_state"]) and then
invoke the runtime to start/resume execution from from_step (use your runtime
entrypoint / runner function to execute the remaining steps or start a new run
with the restored workspace, passing run_id/from_step/remaining_steps as
needed); ensure errors from restore or runtime start are handled similarly to
the existing FileNotFoundError/ValueError branches so the CLI exits with
non-zero on failure.
- Around line 3138-3246: The command handler work_replay currently prints all
steps when no --step is provided instead of starting an interactive replay
session; update work_replay to enter an interactive loop (e.g., using
typer.prompt or input) when step is None: initialize a current_index over
trace.steps, render the current step (using existing rendering logic that
references trace.steps, ops_by_step, llm_by_step and the Step fields like
id/step_number/description/status), then accept simple commands next/prev/jump
<n>/show-llm/quit to move the index, re-render the selected step, and only exit
on quit; keep the existing single-step rendering code for reuse and ensure the
--step path still shows just that step without entering the loop.
- Around line 3378-3405: The export functions export_trace_json and
export_trace_markdown currently only serialize step metadata and paths; update
them to include trace.llm_interactions and the full before/after file contents
for each file referenced in the trace so exported artifacts are reconstructible
offline. Specifically, modify export_trace_json(trace) and
export_trace_markdown(trace) to iterate trace.llm_interactions and include them
in the output structure/markdown, and for each step that references files use
load_execution_trace/get_workspace utilities or the trace’s stored file
snapshots to embed the file contents (pre-change and post-change) rather than
only paths; ensure the JSON output nests llm_interactions and file contents and
the markdown includes readable sections for interactions and before/after file
diffs.

In `@codeframe/core/react_agent.py`:
- Around line 465-488: The verification/fix path in _run_final_verification
isn’t recording LLM execution like _react_loop, so runs that perform bounded
fixes produce incomplete traces and prepare_rerun reconstructs the wrong
(pre-fix) state; update _run_final_verification to mirror the execution_recorder
usage in _react_loop by calling execution_recorder.record_iteration and
execution_recorder.record_llm_call for each LLM invocation in the bounded fix
loop (include step numbering, tool_names from response.tool_calls,
llm_response_summary, prompt_summary, model, and tokens_used computed from
response.input_tokens + response.output_tokens) and ensure prepare_rerun reads
the latest step_id produced by record_iteration so reruns reconstruct the
post-fix state.

In `@codeframe/core/replay.py`:
- Around line 244-258: The flush method currently clears _step_buffer,
_llm_buffer, and _file_op_buffer in the finally block even when
save_execution_step/save_llm_interaction/save_file_operation raise, causing
permanent data loss; change flush so buffers are only cleared after all saves
complete successfully (e.g., move the clear calls into the try block after the
loops or wrap the saves in a DB transaction and clear buffers only on commit)
and ensure exceptions still propagate or are logged without dropping buffered
items.
- Around line 539-578: The JSON exporter export_trace_json is currently omitting
trace.llm_interactions; update export_trace_json (and the other exporter
referenced around 581-632) to include LLM data by grouping
trace.llm_interactions by step_id (similar to ops_by_step) and adding an
"llm_interactions" entry to each step_dict with a list of serializable objects
(e.g., model, role, prompt/input, response/output, tokens, timestamps) taken
from each LLMInteraction instance; ensure you use the step.id to attach
interactions to the correct step and preserve None-safe serialization
(timestamps via .isoformat(), optional fields omitted or null) so offline
JSON/Markdown traces include the recorded prompts and responses.

In `@tests/core/test_execution_recording.py`:
- Around line 15-28: Remove the unused imports causing F401: delete LLMResponse
from the import line that currently reads "from codeframe.adapters.llm.base
import LLMResponse, ToolCall, ToolResult", delete FileContent from "from
codeframe.core.context import FileContent, TaskContext", and delete Workspace
from "from codeframe.core.workspace import Workspace, create_or_load_workspace";
keep the used symbols (ToolCall, ToolResult, TaskContext,
create_or_load_workspace) so the tests file no longer triggers the unused-import
lint error.

In `@tests/core/test_replay.py`:
- Around line 11-26: Add the v2 marker to this new test module by defining a
module-level variable pytestmark = pytest.mark.v2 (import pytest is already
present), placing it near the top of tests/core/test_replay.py so the test file
participates in marker-based v2 runs; ensure pytestmark is a top-level variable
(not inside the workspace fixture or any function) and keep the existing imports
and fixture name workspace unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 9f4abf3c-248b-4caf-805e-13218f6b8c71

📥 Commits

Reviewing files that changed from the base of the PR and between 5646f8d and de742c7.

📒 Files selected for processing (8)
  • codeframe/cli/app.py
  • codeframe/core/react_agent.py
  • codeframe/core/replay.py
  • codeframe/core/workspace.py
  • tests/cli/test_replay_commands.py
  • tests/core/test_execution_recording.py
  • tests/core/test_replay.py
  • tests/core/test_replay_integration.py

Comment thread codeframe/cli/app.py
Comment on lines +3138 to +3246
@work_app.command("replay")
def work_replay(
    run_id: str = typer.Argument(..., help="Run ID to replay"),
    workspace_path: Optional[Path] = typer.Option(
        None,
        "--workspace",
        "-w",
        help="Workspace path (defaults to current directory)",
    ),
    step: Optional[int] = typer.Option(
        None,
        "--step",
        "-s",
        help="Jump to a specific step number",
    ),
    show_llm: bool = typer.Option(
        False,
        "--show-llm",
        help="Show LLM prompts and responses",
    ),
    show_files: bool = typer.Option(
        True,
        "--show-files/--no-files",
        help="Show file changes at each step",
    ),
) -> None:
    """Replay a past execution step by step.

    Shows what happened during an agent run: which tools were called,
    what files were changed, and what the LLM produced at each step.

    Example:
        cf work replay <run-id>
        cf work replay <run-id> --step 3
        cf work replay <run-id> --show-llm
    """
    from rich.panel import Panel

    from codeframe.core.replay import (
        load_execution_trace,
    )
    from codeframe.core.workspace import get_workspace

    path = workspace_path or Path.cwd()

    try:
        workspace = get_workspace(path)
        trace = load_execution_trace(workspace, run_id)

        if not trace:
            console.print(f"[red]Error:[/red] No trace found for run '{run_id}'")
            raise typer.Exit(1)

        # Header
        console.print(
            Panel(
                f"[bold]Run:[/bold] {trace.run_id}\n"
                f"[bold]Task:[/bold] {trace.task_id}\n"
                f"[bold]Status:[/bold] {trace.status}\n"
                f"[bold]Steps:[/bold] {len(trace.steps)}",
                title="Execution Replay",
            )
        )

        # Build lookups
        ops_by_step = {}
        for op in trace.file_operations:
            ops_by_step.setdefault(op.step_id, []).append(op)

        llm_by_step = {}
        for llm in trace.llm_interactions:
            llm_by_step.setdefault(llm.step_id, []).append(llm)

        # Filter to specific step if requested
        steps_to_show = trace.steps
        if step is not None:
            steps_to_show = [s for s in trace.steps if s.step_number == step]
            if not steps_to_show:
                console.print(f"[yellow]No step {step} found (max: {len(trace.steps)})[/yellow]")
                raise typer.Exit(1)

        for s in steps_to_show:
            status_color = {"completed": "green", "failed": "red"}.get(s.status, "yellow")
            console.print(
                f"\n[bold]Step {s.step_number}:[/bold] {s.description} "
                f"[{status_color}][{s.status}][/{status_color}]"
            )

            if show_files:
                step_ops = ops_by_step.get(s.id, [])
                for op in step_ops:
                    op_color = {"create": "green", "edit": "yellow", "delete": "red"}.get(
                        op.operation_type, "white"
                    )
                    console.print(f" [{op_color}]{op.operation_type}[/{op_color}] {op.file_path}")

            if show_llm:
                step_llms = llm_by_step.get(s.id, [])
                for llm in step_llms:
                    console.print(f" [dim]LLM ({llm.model}, {llm.tokens_used} tokens):[/dim]")
                    console.print(f" [cyan]Prompt:[/cyan] {llm.prompt[:200]}")
                    console.print(f" [cyan]Response:[/cyan] {llm.response[:200]}")

        # Summary
        summary = trace.summary()
        console.print(f"\n[dim]Total: {summary['total_steps']} steps, "
                      f"{summary['llm_calls']} LLM calls, "
                      f"{summary['total_tokens']} tokens, "
                      f"{summary['files_modified']} files modified[/dim]")

⚠️ Potential issue | 🟠 Major

work replay still dumps the trace instead of replaying it.

Without --step, this prints every step and exits. There’s no next/previous/jump loop here, so users still can’t step through a run from the CLI as a replay session.
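A rough sketch of what a step-through loop could look like, kept free of terminal I/O so it can be tested. `StepCursor` and `run_session` are hypothetical and are not the PR's `ReplaySession`; a real command would read each command from a prompt instead of a list.

```python
class StepCursor:
    """Hypothetical navigator over step descriptions (1-based for display)."""

    def __init__(self, steps: list[str]) -> None:
        if not steps:
            raise ValueError("trace has no steps")
        self.steps = steps
        self.index = 0

    def handle(self, command: str) -> bool:
        """Apply one command; return False when the session should end."""
        parts = command.split()
        if not parts:
            return True
        name = parts[0]
        if name in ("q", "quit"):
            return False
        if name in ("n", "next"):
            self.index = min(self.index + 1, len(self.steps) - 1)
        elif name in ("p", "prev"):
            self.index = max(self.index - 1, 0)
        elif name in ("j", "jump") and len(parts) == 2 and parts[1].isdigit():
            target = int(parts[1]) - 1
            if 0 <= target < len(self.steps):
                self.index = target
        return True

    def current(self) -> str:
        return f"Step {self.index + 1}/{len(self.steps)}: {self.steps[self.index]}"


def run_session(cursor: StepCursor, commands: list[str]) -> list[str]:
    """Drive the cursor with a scripted command list and collect what is shown."""
    shown = [cursor.current()]
    for command in commands:
        if not cursor.handle(command):
            break
        shown.append(cursor.current())
    return shown


shown = run_session(
    StepCursor(["read files", "edit app.py", "run tests"]),
    ["n", "n", "n", "p", "j 1", "q", "n"],
)
```

Navigation clamps at both ends rather than raising, and commands after `q` are ignored.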

Comment thread codeframe/cli/app.py
Comment on lines +3378 to +3405
    from codeframe.core.replay import (
        export_trace_json,
        export_trace_markdown,
        load_execution_trace,
    )
    from codeframe.core.workspace import get_workspace

    path = workspace_path or Path.cwd()

    try:
        workspace = get_workspace(path)
        trace = load_execution_trace(workspace, run_id)

        if not trace:
            console.print(f"[red]Error:[/red] No trace found for run '{run_id}'")
            raise typer.Exit(1)

        if format == "json":
            content = json.dumps(export_trace_json(trace), indent=2)
        else:
            content = export_trace_markdown(trace)

        if output:
            output.write_text(content)
            console.print(f"[green]Trace exported to {output}[/green]")
        else:
            console.print(content, highlight=False)

⚠️ Potential issue | 🟠 Major

export-trace is not exporting a complete trace.

codeframe/core/replay.py:538-577 and codeframe/core/replay.py:580-631 only serialize step metadata plus file path names. They omit trace.llm_interactions entirely and don’t include file before/after content, so the exported artifact isn’t enough for offline debugging or reconstruction.
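A sketch of how LLM interactions might be attached to each exported step by grouping on `step_id`. The `Step` and `LLMCall` dataclasses and `export_steps` are hypothetical simplifications; the field names follow the quoted CLI code (`model`, `prompt`, `response`, `tokens_used`) but the real models may differ.

```python
import json
from dataclasses import dataclass


@dataclass
class Step:
    """Hypothetical, trimmed-down step record."""

    id: str
    step_number: int
    description: str


@dataclass
class LLMCall:
    """Hypothetical, trimmed-down LLM interaction record."""

    step_id: str
    model: str
    prompt: str
    response: str
    tokens_used: int


def export_steps(steps: list[Step], llm_calls: list[LLMCall]) -> dict:
    """Build a JSON-serializable dict with interactions nested under each step."""
    calls_by_step: dict[str, list[LLMCall]] = {}
    for call in llm_calls:
        calls_by_step.setdefault(call.step_id, []).append(call)

    return {
        "steps": [
            {
                "step_number": step.step_number,
                "description": step.description,
                "llm_interactions": [
                    {
                        "model": call.model,
                        "prompt": call.prompt,
                        "response": call.response,
                        "tokens_used": call.tokens_used,
                    }
                    for call in calls_by_step.get(step.id, [])
                ],
            }
            for step in steps
        ]
    }


steps = [Step("s1", 1, "read files"), Step("s2", 2, "edit app.py")]
calls = [LLMCall("s1", "demo-model", "List the files", "Tool calls: list_files", 42)]
payload = export_steps(steps, calls)
print(json.dumps(payload, indent=2))
```

Steps without recorded interactions export an empty list, which keeps the structure uniform for downstream tooling.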

Comment thread codeframe/cli/app.py
Comment on lines +3411 to +3469
@work_app.command("rerun")
def work_rerun(
    run_id: str = typer.Argument(..., help="Run ID to re-run from"),
    workspace_path: Optional[Path] = typer.Option(
        None,
        "--workspace",
        "-w",
        help="Workspace path (defaults to current directory)",
    ),
    from_step: int = typer.Option(
        1,
        "--from-step",
        help="Step number to resume from",
    ),
) -> None:
    """Prepare to re-execute a run from a specific step.

    Reconstructs the file state at step N and shows what
    would need to be re-executed. Use this to understand
    what happened and plan a manual re-run.

    Example:
        cf work rerun <run-id> --from-step 2
    """
    from codeframe.core.replay import prepare_rerun
    from codeframe.core.workspace import get_workspace

    path = workspace_path or Path.cwd()

    try:
        workspace = get_workspace(path)
        rerun_info = prepare_rerun(workspace, run_id, from_step)

        console.print(f"[bold]Re-run preparation for run {run_id}[/bold]\n")
        console.print(f"[bold]Resume from:[/bold] Step {from_step}")
        console.print(f"[bold]Task:[/bold] {rerun_info['task_id']}")

        file_state = rerun_info["file_state"]
        if file_state:
            console.print(f"\n[bold]File state at step {from_step}:[/bold]")
            for fp in sorted(file_state.keys()):
                console.print(f" {fp}")
        else:
            console.print(f"\n[yellow]No files modified at step {from_step}[/yellow]")

        remaining = rerun_info["remaining_steps"]
        if remaining:
            console.print(f"\n[bold]Remaining steps ({len(remaining)}):[/bold]")
            for rs in remaining:
                console.print(f" Step {rs['step_number']}: {rs['description']}")
        else:
            console.print("\n[yellow]No remaining steps after this point[/yellow]")

    except FileNotFoundError:
        console.print(f"[red]Error:[/red] No workspace found at {path}")
        raise typer.Exit(1)
    except ValueError as e:
        console.print(f"[red]Error:[/red] {e}")
        raise typer.Exit(1)

⚠️ Potential issue | 🟠 Major

work rerun never actually reruns anything.

This path only calls prepare_rerun() and prints the returned state. It never restores the snapshot into the workspace or starts a new run through runtime, so users can’t resume execution from the chosen checkpoint.
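The restore half of that could look roughly like the sketch below. `apply_file_state` is a hypothetical helper (no such workspace method is confirmed by the PR), it assumes `file_state` maps workspace-relative paths to full file content, and it needs Python 3.9+ for `Path.is_relative_to`. Starting a new run afterwards is left out because the runtime entry point is not shown here.

```python
import tempfile
from pathlib import Path


def apply_file_state(root: Path, file_state: dict[str, str]) -> list[Path]:
    """Write each recorded file under root, refusing paths that escape it."""
    root = root.resolve()
    written: list[Path] = []
    for rel_path, content in sorted(file_state.items()):
        target = (root / rel_path).resolve()
        if not target.is_relative_to(root):
            raise ValueError(f"path escapes workspace: {rel_path}")
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content, encoding="utf-8")
        written.append(target)
    return written


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    written = apply_file_state(root, {"src/app.py": "print('hello')\n"})
    restored = (root / "src" / "app.py").read_text(encoding="utf-8")
    try:
        apply_file_state(root, {"../escape.txt": "nope"})
        rejected = False
    except ValueError:
        rejected = True
```

This sketch only writes files present in the snapshot; files created by later steps would still need to be removed for a faithful restore.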

Comment thread codeframe/core/react_agent.py
Comment on lines +465 to +488
# --- Execution recording: LLM call ---
_rec_step_id: Optional[str] = None
if self.execution_recorder is not None:
    # Build condensed summaries for the trace
    _rec_prompt = f"System: {prompt_summary} | Messages: {len(messages)}"
    if response.has_tool_calls:
        _rec_response = "Tool calls: " + ", ".join(
            tc.name for tc in response.tool_calls
        )
    else:
        _rec_response = (response.content or "")[:200]
    _rec_step_id = self.execution_recorder.record_iteration(
        step_number=iterations,
        tool_names=[tc.name for tc in response.tool_calls],
        llm_response_summary=_rec_response,
    )
    self.execution_recorder.record_llm_call(
        step_id=_rec_step_id,
        prompt_summary=_rec_prompt,
        response_summary=_rec_response,
        model=response.model or "",
        tokens_used=response.input_tokens + response.output_tokens,
        purpose="execution",
    )

⚠️ Potential issue | 🟠 Major

Trace the verification-fix path too.

execution_recorder is only populated inside _react_loop(). The bounded fix loop in _run_final_verification() also makes LLM calls and can edit files, so runs that need gate retries will replay/export an incomplete trace and prepare_rerun() will reconstruct the pre-fix state instead of the final run state.
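One way to keep both paths traced is to route recording through a single guarded helper that each loop calls. The sketch below is illustrative only: `TraceLog`, `record_call`, and `run` are hypothetical, and the `"verification_fix"` purpose label is an assumption (only `"execution"` appears in the quoted code).

```python
from typing import Optional


class TraceLog:
    """Hypothetical recorder that stores (step_number, purpose, summary) tuples."""

    def __init__(self) -> None:
        self.entries: list[tuple[int, str, str]] = []

    def record_llm_call(self, step_number: int, purpose: str, summary: str) -> None:
        self.entries.append((step_number, purpose, summary))


def record_call(
    recorder: Optional[TraceLog], step_number: int, purpose: str, summary: str
) -> None:
    """Single guarded entry point shared by the main loop and the fix loop."""
    if recorder is not None:
        recorder.record_llm_call(step_number, purpose, summary)


def run(recorder: Optional[TraceLog], main_steps: list[str], fix_steps: list[str]) -> int:
    """Simulate a run: main iterations first, then bounded verification fixes."""
    step = 0
    for summary in main_steps:
        step += 1
        record_call(recorder, step, "execution", summary)
    for summary in fix_steps:
        step += 1  # numbering continues so replay order matches execution order
        record_call(recorder, step, "verification_fix", summary)
    return step


log = TraceLog()
total = run(log, ["read files", "edit app.py"], ["fix lint error"])
```

Continuing the step numbering across both loops means a snapshot at the last step reflects the post-fix state.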

Comment thread codeframe/core/replay.py
Comment on lines +244 to +258
    def flush(self) -> None:
        """Write all buffered records to the database."""
        try:
            for step in self._step_buffer:
                save_execution_step(self.workspace, step)
            for interaction in self._llm_buffer:
                save_llm_interaction(self.workspace, interaction)
            for op in self._file_op_buffer:
                save_file_operation(self.workspace, op)
        except Exception:
            logger.debug("ExecutionRecorder flush failed", exc_info=True)
        finally:
            self._step_buffer.clear()
            self._llm_buffer.clear()
            self._file_op_buffer.clear()

⚠️ Potential issue | 🟠 Major

Don't clear the buffers after a failed flush.

If one save_*() call raises, finally still clears _step_buffer, _llm_buffer, and _file_op_buffer. That turns a transient write error into permanent trace loss, because the caller has nothing left to retry. Clear only after a successful flush, ideally as one DB transaction.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@codeframe/core/replay.py` around lines 244 - 258, The flush method currently
clears _step_buffer, _llm_buffer, and _file_op_buffer in the finally block even
when save_execution_step/save_llm_interaction/save_file_operation raise, causing
permanent data loss; change flush so buffers are only cleared after all saves
complete successfully (e.g., move the clear calls into the try block after the
loops or wrap the saves in a DB transaction and clear buffers only on commit)
and ensure exceptions still propagate or are logged without dropping buffered
items.
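
One possible shape for the suggested fix, sketched with the standard-library `sqlite3` module. The `BufferedRecorder` class, the `steps` table, and its columns are assumptions made for this example; the project's ExecutionRecorder and `save_*` helpers may look quite different.

```python
import sqlite3


class BufferedRecorder:
    """Hypothetical stand-in for ExecutionRecorder; names are illustrative."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        self._step_buffer: list[tuple[str, int]] = []

    def record(self, step_id: str, step_number: int) -> None:
        self._step_buffer.append((step_id, step_number))

    def flush(self) -> bool:
        """Write buffered rows in one transaction; keep them if it fails."""
        try:
            # Used as a context manager, the connection commits on success
            # and rolls back if the block raises.
            with self.conn:
                self.conn.executemany(
                    "INSERT INTO steps (id, step_number) VALUES (?, ?)",
                    self._step_buffer,
                )
        except sqlite3.Error:
            return False  # buffer untouched, so the caller can retry
        self._step_buffer.clear()
        return True


conn = sqlite3.connect(":memory:")
recorder = BufferedRecorder(conn)
recorder.record("s1", 1)
failed = recorder.flush()  # table does not exist yet: nothing is lost
conn.execute("CREATE TABLE steps (id TEXT PRIMARY KEY, step_number INTEGER)")
succeeded = recorder.flush()  # the retry writes the row and clears the buffer
```

Returning a boolean instead of swallowing the error silently is a design choice for the sketch; logging and re-raising would serve the same goal of not dropping buffered items.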

Comment thread codeframe/core/replay.py
Comment on lines +539 to +578
def export_trace_json(trace: ExecutionTrace) -> dict[str, Any]:
    """Export an ExecutionTrace as a JSON-serializable dict.

    Returns a dict with run metadata, step details, and summary stats.
    """
    # Build a lookup of file operations by step_id
    ops_by_step: dict[str, list[FileOperation]] = {}
    for op in trace.file_operations:
        ops_by_step.setdefault(op.step_id, []).append(op)

    steps = []
    for step in trace.steps:
        step_ops = ops_by_step.get(step.id, [])
        step_dict: dict[str, Any] = {
            "step_number": step.step_number,
            "step_type": step.step_type,
            "description": step.description,
            "status": step.status,
            "started_at": step.started_at.isoformat(),
            "completed_at": step.completed_at.isoformat() if step.completed_at else None,
        }
        if step_ops:
            step_dict["file_changes"] = [
                {
                    "operation": op.operation_type,
                    "file_path": op.file_path,
                }
                for op in step_ops
            ]
        steps.append(step_dict)

    return {
        "run_id": trace.run_id,
        "task_id": trace.task_id,
        "started_at": trace.started_at.isoformat(),
        "completed_at": trace.completed_at.isoformat() if trace.completed_at else None,
        "status": trace.status,
        "steps": steps,
        "summary": trace.summary(),
    }

⚠️ Potential issue | 🟠 Major

export-trace is dropping the recorded prompts/responses.

Both exporters ignore trace.llm_interactions, so the exported artifact loses the LLM data this feature is capturing for debugging. A run with no file edits currently exports little more than step headings, which makes the JSON/Markdown trace much less useful for offline analysis.

Also applies to: 581-632

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@codeframe/core/replay.py` around lines 539 - 578, The JSON exporter
export_trace_json is currently omitting trace.llm_interactions; update
export_trace_json (and the other exporter referenced around 581-632) to include
LLM data by grouping trace.llm_interactions by step_id (similar to ops_by_step)
and adding an "llm_interactions" entry to each step_dict with a list of
serializable objects (e.g., model, role, prompt/input, response/output, tokens,
timestamps) taken from each LLMInteraction instance; ensure you use the step.id
to attach interactions to the correct step and preserve None-safe serialization
(timestamps via .isoformat(), optional fields omitted or null) so offline
JSON/Markdown traces include the recorded prompts and responses.
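
The grouping step described above could be sketched as follows. The `LLMInteraction` dataclass here is a hypothetical shape whose field names are borrowed from the `record_llm_call` arguments shown earlier in this review; the real model may carry different or additional fields.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class LLMInteraction:
    """Hypothetical shape; the project's model may differ."""

    step_id: str
    model: str
    prompt_summary: str
    response_summary: str
    tokens_used: int


def group_by_step(
    interactions: list[LLMInteraction],
) -> dict[str, list[dict[str, Any]]]:
    """Index serialized interactions by step_id, mirroring ops_by_step."""
    grouped: dict[str, list[dict[str, Any]]] = {}
    for item in interactions:
        grouped.setdefault(item.step_id, []).append(
            {
                "model": item.model,
                "prompt_summary": item.prompt_summary,
                "response_summary": item.response_summary,
                "tokens_used": item.tokens_used,
            }
        )
    return grouped


calls = [
    LLMInteraction("step-1", "model-a", "fix lint", "edited file", 120),
    LLMInteraction("step-1", "model-a", "rerun tests", "all green", 80),
    LLMInteraction("step-2", "model-a", "summarize", "done", 40),
]
by_step = group_by_step(calls)
```

An exporter could then attach `by_step.get(step.id, [])` to each step dict under an `"llm_interactions"` key, in the same way `file_changes` is attached.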

Comment thread tests/core/test_execution_recording.py Outdated
Comment thread tests/core/test_replay.py
Comment on lines +11 to +26
import json
import uuid
from datetime import datetime, timezone
from pathlib import Path

import pytest

from codeframe.core.workspace import create_or_load_workspace, get_db_connection


@pytest.fixture
def workspace(tmp_path: Path):
    """Create a temporary workspace for testing."""
    repo_path = tmp_path / "test_repo"
    repo_path.mkdir()
    return create_or_load_workspace(repo_path)

⚠️ Potential issue | 🟠 Major

Mark this new test module as v2.

This file is missing pytestmark = pytest.mark.v2, so it won’t participate in marker-based v2 runs.

🧪 Minimal fix
 import pytest
 
 from codeframe.core.workspace import create_or_load_workspace, get_db_connection
 
+pytestmark = pytest.mark.v2
+
 
 @pytest.fixture
 def workspace(tmp_path: Path):

As per coding guidelines, "New v2 Python tests must be marked with @pytest.mark.v2 decorator or pytestmark = pytest.mark.v2."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/core/test_replay.py` around lines 11 - 26, Add the v2 marker to this
new test module by defining a module-level variable pytestmark = pytest.mark.v2
(import pytest is already present), placing it near the top of
tests/core/test_replay.py so the test file participates in marker-based v2 runs;
ensure pytestmark is a top-level variable (not inside the workspace fixture or
any function) and keep the existing imports and fixture name workspace
unchanged.

@claude

claude Bot commented Mar 17, 2026

TEST

@claude

claude Bot commented Mar 17, 2026

CodeFRAME Development Guidelines

Last updated: 2026-03-09

Product Vision

CodeFrame is a project delivery system: Think → Build → Prove → Ship.

It owns the edges of the AI coding pipeline — everything BEFORE code gets written (PRD, specification, task decomposition) and everything AFTER (verification gates, quality memory, deployment). The actual code writing is delegated to frontier coding agents (Claude Code, Codex, OpenCode) that are better at it than any custom agent.

CodeFrame does not compete with coding agents. It orchestrates them.

THINK:  cf prd generate → cf prd stress-test → cf tasks generate
BUILD:  cf work start --engine claude-code  (or codex, opencode, built-in)
PROVE:  cf proof run  (9-gate evidence-based quality system)
SHIP:   cf pr create → cf pr merge
LOOP:   Glitch → cf proof capture → New REQ → Enforced forever

Status: Phase 1 ✅ | Phase 2 ✅ | Phase 2.5 ✅ — CLI workflow, server layer, and ReAct agent complete. Agent adapter architecture (#408) and PROOF9 quality system (#422) are next priorities. See docs/V2_STRATEGIC_ROADMAP.md for the full plan.

If you are an agent working in this repo: do not improvise architecture. Follow the documents listed below.


Primary Contract (MUST FOLLOW)

  1. Golden Path: docs/GOLDEN_PATH.md
    The only workflow we build until it works end-to-end.

  2. Refactor Plan: docs/REFACTOR_PLAN_FOR_AGENT.md
    Step-by-step refactor instructions.

  3. Command Tree + Module Mapping: docs/CLI_WIREFRAME.md
    The authoritative map from CLI commands → core modules/functions.

  4. Agent Implementation: docs/AGENT_IMPLEMENTATION_TASKS.md
    Tracks the agent system components (all complete).

  5. Strategic Roadmap: docs/V2_STRATEGIC_ROADMAP.md
    5-phase plan from CLI to multi-agent.

Rule 0: If a change does not directly support the Think → Build → Prove → Ship pipeline, do not implement it.

Strategic Priority (Phase 4)

The next major architectural work is the Agent Adapter Architecture (#408):


Current Reality (Phase 1, 2 & 2.5 Complete)

What's Working Now

  • Full agent execution: cf work start <task-id> --execute (uses ReAct engine by default)
  • Engine selection: --engine react (default) or --engine plan (legacy)
  • Verbose mode: cf work start <task-id> --execute --verbose shows detailed progress
  • Dry run mode: cf work start <task-id> --execute --dry-run
  • Self-correction loop: Agent automatically fixes failing verification gates (up to 5 attempts with ReAct)
  • FAILED task status: Tasks can transition to FAILED for proper error visibility
  • Tech stack configuration: cf init . --detect auto-detects tech stack from project files
  • Project preferences: Agent loads AGENTS.md or CLAUDE.md for per-project configuration
  • Stall detection: Thread-based monitor with configurable recovery (--stall-action blocker|retry|fail)
  • Blocker detection: Agent creates blockers when stuck
  • Verification gates: Ruff/pytest checks after file changes
  • State persistence: Pause/resume across sessions
  • Batch execution: cf work batch run with serial/parallel/auto strategies
  • Task dependencies: depends_on field with dependency graph analysis
  • LLM dependency inference: --strategy auto analyzes task descriptions
  • Automatic retry: --retry N for failed task recovery
  • Batch resume: Re-run failed/blocked tasks from previous batches
  • Task scheduling: cf schedule show/predict/bottlenecks with CPM-based scheduling
  • Task templates: cf templates list/show/apply with 7 builtin templates
  • Effort estimation: Tasks support estimated_hours field for scheduling
  • Environment validation: cf env check/install/doctor validates tools and dependencies
  • GitHub PR workflow: cf pr create/status/checks/merge for PR management
  • Task self-diagnosis: cf work diagnose <task-id> analyzes failed tasks
  • 70+ integration tests: Comprehensive CLI test coverage
  • REST API: Full v2 API with 15 router modules (see Phase 2 below)
  • API authentication: API key auth with scopes (read/write/admin)
  • Rate limiting: Configurable per-endpoint rate limits
  • Real-time streaming: SSE for task execution events
  • OpenAPI documentation: Full Swagger/ReDoc at /docs and /redoc

v2 Architecture (current)

  • Core-first: Domain logic lives in codeframe/core/ (headless, no FastAPI imports)
  • CLI-first: Golden Path works without any running FastAPI server
  • Adapters: LLM providers in codeframe/adapters/llm/
  • Server/UI optional: FastAPI and UI are thin adapters over core

v1 Legacy

  • FastAPI server + WebSockets + React/Next.js dashboard retained for reference
  • Do not build toward v1 patterns during Golden Path work

Repository Structure

codeframe/
├── core/                    # Headless domain + orchestration (NO FastAPI imports)
│   ├── react_agent.py      # ReAct agent (default engine) - observe-think-act loop
│   ├── tools.py            # Tool definitions for ReAct agent (7 tools)
│   ├── editor.py           # Search-replace file editor with fuzzy matching
│   ├── agent.py            # Legacy plan-based agent (--engine plan)
│   ├── planner.py          # LLM-powered implementation planning (plan engine)
│   ├── executor.py         # Code execution engine with rollback (plan engine)
│   ├── context.py          # Task context loader with relevance scoring
│   ├── tasks.py            # Task management with depends_on field
│   ├── blockers.py         # Human-in-the-loop blocker system
│   ├── runtime.py          # Run lifecycle management
│   ├── conductor.py        # Batch orchestration with worker pool
│   ├── dependency_graph.py # DAG operations and execution planning
│   ├── dependency_analyzer.py # LLM-based dependency inference
│   ├── gates.py            # Verification gates (ruff, pytest, BUILD)
│   ├── fix_tracker.py      # Fix attempt tracking for loop prevention
│   ├── quick_fixes.py      # Pattern-based fixes without LLM
│   ├── agents_config.py    # AGENTS.md/CLAUDE.md preference loading
│   ├── workspace.py        # Workspace initialization
│   ├── prd.py              # PRD management
│   ├── events.py           # Event emission
│   ├── state_machine.py    # Task status transitions
│   ├── environment.py      # Environment validation and tool detection
│   ├── installer.py        # Automatic tool installation
│   ├── diagnostics.py      # Failed task analysis
│   ├── diagnostic_agent.py # AI-powered task diagnosis
│   ├── credentials.py      # API key and credential management
│   ├── stall_detector.py   # Synchronous stall detector + StallAction enum + StallDetectedError
│   ├── stall_monitor.py    # Thread-based stall watchdog with callback
│   ├── streaming.py        # Real-time output streaming for cf work follow
│   └── ...
├── adapters/
│   └── llm/                # LLM provider adapters
│       ├── base.py         # Protocol + ModelSelector + Purpose enum
│       ├── anthropic.py    # Anthropic Claude provider
│       └── mock.py         # Mock provider for testing
├── cli/
│   └── app.py              # Typer CLI entry + subcommands
├── ui/                     # FastAPI server (Phase 2 - thin adapter over core)
│   ├── server.py           # FastAPI app with OpenAPI configuration
│   ├── models.py           # Pydantic request/response models
│   ├── dependencies.py     # Shared dependencies (workspace, auth)
│   └── routers/            # API route handlers
│       ├── blockers_v2.py  # Blocker CRUD
│       ├── tasks_v2.py     # Task management + streaming
│       ├── prd_v2.py       # PRD management + versioning
│       ├── workspace_v2.py # Workspace init and status
│       ├── batches_v2.py   # Batch execution
│       ├── streaming_v2.py # SSE event streaming
│       ├── api_key_v2.py   # API key management
│       └── ...             # 15 router modules total
├── lib/                    # Shared utilities
│   ├── rate_limiter.py     # SlowAPI rate limiting
│   └── audit_logger.py     # Request audit logging
├── auth/                   # Authentication
│   ├── api_key_service.py  # API key creation/validation
│   └── dependencies.py     # Auth dependencies
├── config/
│   └── rate_limits.py      # Rate limit configuration
└── server/                 # Legacy server code (reference only)

web-ui/                     # Frontend (legacy, reference only)
tests/
├── core/                   # Core module tests
│   ├── test_agent.py
│   ├── test_executor.py
│   ├── test_planner.py
│   ├── test_context.py
│   ├── test_conductor.py
│   ├── test_dependency_graph.py
│   ├── test_dependency_analyzer.py
│   ├── test_task_dependencies.py
│   └── ...
└── adapters/
    └── test_llm.py

Architecture Rules (non-negotiable)

1) Core must be headless

codeframe/core/** must NOT import:

  • FastAPI
  • WebSocket frameworks
  • HTTP request/response objects
  • UI modules

Core is allowed to:

  • read/write durable state (SQLite/filesystem)
  • run orchestration/worker loops
  • emit events to an append-only event log
  • call adapters via interfaces (LLM, git, fs)
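
A rule like this can be checked mechanically. The sketch below scans Python source for forbidden top-level imports using the standard-library `ast` module. The `FORBIDDEN` set and the `forbidden_imports` helper are assumptions for illustration, not part of the repository.

```python
import ast

# Assumed deny-list for illustration; adjust to the frameworks in use.
FORBIDDEN = {"fastapi", "starlette", "websockets"}


def forbidden_imports(source: str) -> set[str]:
    """Return forbidden top-level package names imported by `source`."""
    found: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            names = [node.module]
        else:
            continue
        for name in names:
            root = name.split(".")[0]
            if root in FORBIDDEN:
                found.add(root)
    return found


clean = forbidden_imports("import sqlite3\nfrom pathlib import Path\n")
dirty = forbidden_imports("from fastapi import APIRouter\nimport websockets.client\n")
```

Running such a check over every file under `codeframe/core/` in a test would turn the rule into a failing build rather than a review comment.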

2) CLI must not require a server

Golden Path commands must work from the CLI with no server running.

FastAPI is optional and must be started explicitly (e.g., codeframe serve) and must wrap core.

3) Agent state transitions flow through runtime

Critical pattern discovered during implementation:

  • Agent (agent.py) manages its own AgentState (IDLE, PLANNING, EXECUTING, BLOCKED, COMPLETED, FAILED)
  • Runtime (runtime.py) handles all TaskStatus transitions (BACKLOG, READY, IN_PROGRESS, DONE, BLOCKED)
  • Agent does NOT call tasks.update_status() - runtime does this based on agent state

This separation prevents duplicate state transitions (e.g., DONE→DONE, BLOCKED→BLOCKED errors).
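
The separation can be pictured as a small mapping owned by the runtime. This is a sketch under assumptions: the enum members come from the lists above (plus FAILED from the execution flow), while `next_task_status` and the string values are hypothetical.

```python
from enum import Enum
from typing import Optional


class AgentState(Enum):
    IDLE = "idle"
    PLANNING = "planning"
    EXECUTING = "executing"
    BLOCKED = "blocked"
    COMPLETED = "completed"
    FAILED = "failed"


class TaskStatus(Enum):
    BACKLOG = "backlog"
    READY = "ready"
    IN_PROGRESS = "in_progress"
    DONE = "done"
    BLOCKED = "blocked"
    FAILED = "failed"


# Only terminal agent states drive a task transition.
_TERMINAL = {
    AgentState.COMPLETED: TaskStatus.DONE,
    AgentState.BLOCKED: TaskStatus.BLOCKED,
    AgentState.FAILED: TaskStatus.FAILED,
}


def next_task_status(
    current: TaskStatus, agent_state: AgentState
) -> Optional[TaskStatus]:
    """Return the status to transition to, or None if nothing should change."""
    target = _TERMINAL.get(agent_state)
    if target is None or target is current:
        return None  # avoids duplicate transitions such as DONE -> DONE
    return target
```

Because only the runtime calls a function like this, the agent never touches task status directly.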

4) Legacy can be read, not depended on

Legacy code is reference material.

  • Copy/simplify logic into core when useful
  • Do NOT import legacy UI/server modules into core
  • Do NOT "fix the UI" during Golden Path work

5) Keep commits runnable

At all times:

  • codeframe --help works
  • Golden Path command stubs can run
  • Avoid breaking the repo with large renames/moves

Agent System Architecture

Components

| Component | File | Purpose |
|---|---|---|
| ReactAgent | core/react_agent.py | Default engine: observe-think-act loop with tool use |
| Tools | core/tools.py | 7 agent tools: read/edit/create file, run command/tests, search, list |
| Editor | core/editor.py | Search-replace editor with 4-level fuzzy matching |
| Stall Detector | core/stall_detector.py | Synchronous stall check + StallAction enum + StallDetectedError |
| Stall Monitor | core/stall_monitor.py | Thread-based watchdog with callback (integrated into ReactAgent) |
| LLM Adapter | adapters/llm/base.py | Protocol, ModelSelector, Purpose enum |
| Anthropic Provider | adapters/llm/anthropic.py | Claude integration with streaming |
| Mock Provider | adapters/llm/mock.py | Testing with call tracking |
| Context Loader | core/context.py | Codebase scanning, relevance scoring |
| Planner | core/planner.py | Task → ImplementationPlan via LLM (plan engine) |
| Executor | core/executor.py | File ops, shell commands, rollback (plan engine) |
| Agent (legacy) | core/agent.py | Plan-based orchestration (--engine plan) |
| Runtime | core/runtime.py | Run lifecycle, engine selection, agent invocation |
| Conductor | core/conductor.py | Batch orchestration, worker pool |
| Dependency Graph | core/dependency_graph.py | DAG operations, topological sort |
| Dependency Analyzer | core/dependency_analyzer.py | LLM-based dependency inference |
| Environment Validator | core/environment.py | Tool detection and validation |
| Installer | core/installer.py | Automatic tool installation |
| Diagnostics | core/diagnostics.py | Failed task analysis |
| Diagnostic Agent | core/diagnostic_agent.py | AI-powered task diagnosis |
| Credentials | core/credentials.py | API key and credential management |
| Event Publisher | core/streaming.py | Real-time SSE event distribution |
| API Key Service | auth/api_key_service.py | API key CRUD and validation |
| Rate Limiter | lib/rate_limiter.py | Per-endpoint rate limiting |

Model Selection Strategy

Task-based heuristic via Purpose enum:

  • PLANNING → claude-sonnet-4-20250514 (complex reasoning)
  • EXECUTION → claude-sonnet-4-20250514 (balanced)
  • GENERATION → claude-haiku-4-20250514 (fast/cheap)

Future: cf tasks set provider <id> <provider> for per-task override.
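
As a rough illustration, the heuristic amounts to a lookup keyed by purpose. The model identifiers are copied from the list above; the `select_model` function and its `override` parameter are hypothetical and only hint at how a per-task override might slot in.

```python
from enum import Enum
from typing import Optional


class Purpose(Enum):
    PLANNING = "planning"
    EXECUTION = "execution"
    GENERATION = "generation"


# Identifiers taken verbatim from the guideline text.
MODEL_BY_PURPOSE = {
    Purpose.PLANNING: "claude-sonnet-4-20250514",
    Purpose.EXECUTION: "claude-sonnet-4-20250514",
    Purpose.GENERATION: "claude-haiku-4-20250514",
}


def select_model(purpose: Purpose, override: Optional[str] = None) -> str:
    """Pick a model for `purpose`; an explicit override wins if given."""
    return override or MODEL_BY_PURPOSE[purpose]


default_model = select_model(Purpose.GENERATION)
custom_model = select_model(Purpose.PLANNING, override="custom-model")
```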

Engine Selection

CodeFRAME supports two execution engines, selected via --engine:

| Engine | Flag | Pattern | Best For |
|---|---|---|---|
| ReAct (default) | --engine react | Observe → Think → Act loop | Most tasks, adaptive execution |
| Plan (legacy) | --engine plan | Plan all steps → Execute sequentially | Well-defined, predictable tasks |

Execution Flow (ReAct — default)

cf work start <id> --execute [--verbose]
    │
    ├── runtime.start_task_run()      # Creates run, transitions task→IN_PROGRESS
    │
    └── runtime.execute_agent(engine="react")
            │
            └── ReactAgent.run(task_id)
                ├── Load context (PRD, codebase, blockers, AGENTS.md, tech_stack)
                ├── Build layered system prompt
                │
                └── Tool-use loop (until complete/blocked/failed):
                    ├── Check stall detector (configurable: retry/blocker/fail)
                    ├── LLM decides next action (tool call)
                    ├── Execute tool: read_file, edit_file, create_file,
                    │   run_command, run_tests, search_codebase, list_files
                    ├── Observe result → feed back to LLM
                    ├── Record activity (resets stall timer)
                    ├── Incremental verification (ruff after file changes)
                    └── Token budget management (3-tier compaction)
                │
                └── Final verification with self-correction (up to 5 retries)
                │
                └── Update run/task status based on agent result
                    ├── COMPLETED → complete_run() → task→DONE
                    ├── BLOCKED → block_run() → task→BLOCKED
                    └── FAILED → fail_run() → task→FAILED

Execution Flow (Plan — legacy, --engine plan)

cf work start <id> --execute --engine plan
    │
    ├── runtime.start_task_run()
    │
    └── runtime.execute_agent(engine="plan")
            │
            ├── agent.run(task_id)
            │   ├── Load context (PRD, codebase, blockers, AGENTS.md)
            │   ├── Create plan via LLM
            │   ├── Execute steps (file create/edit, shell commands)
            │   ├── Run incremental verification (ruff)
            │   ├── Detect blockers (consecutive failures, missing files)
            │   └── Run final verification with SELF-CORRECTION LOOP:
            │       ├── Run all gates (pytest, ruff)
            │       ├── If failed: _attempt_verification_fix()
            │       │   ├── Try ruff --fix for quick lint fixes
            │       │   ├── Use LLM to generate fix plan from errors
            │       │   └── Execute fix steps
            │       └── Retry up to max_attempts (default: 3)
            │
            └── Update run/task status based on agent result
                ├── COMPLETED → complete_run() → task→DONE
                ├── BLOCKED → block_run() → task→BLOCKED
                └── FAILED → fail_run() → task→FAILED

Commands (v2 CLI)

Python (preferred)

Use uv for Python tasks:

uv run pytest
uv run pytest tests/core/  # Core module tests only
uv run ruff check .

CLI (Golden Path)

# Workspace
cf init <repo>                                    # Initialize workspace
cf init <repo> --detect                           # Initialize + auto-detect tech stack
cf init <repo> --tech-stack "Python with uv"      # Initialize + explicit tech stack
cf init <repo> --tech-stack-interactive           # Initialize + interactive setup
cf status

# PRD
cf prd add <file.md>
cf prd show

# Tasks
cf tasks generate          # Uses LLM to generate from PRD
cf tasks list
cf tasks list --status READY
cf tasks show <id>

# Work execution (single task)
cf work start <task-id>                    # Creates run record
cf work start <task-id> --execute          # Runs AI agent (ReAct engine, default)
cf work start <task-id> --execute --engine plan  # Use legacy plan engine
cf work start <task-id> --execute --verbose  # With detailed output
cf work start <task-id> --execute --dry-run  # Preview changes
cf work start <task-id> --execute --stall-timeout 120  # Custom stall timeout (0=disabled)
cf work start <task-id> --execute --stall-action retry  # Recovery: blocker|retry|fail
cf work stop <task-id>                     # Cancel stale run
cf work resume <task-id>                   # Resume blocked work
cf work follow <task-id>                   # Stream real-time output
cf work follow <task-id> --tail 50         # Show last 50 lines then stream

# Batch execution (multiple tasks)
cf work batch run <id1> <id2> ...          # Execute multiple tasks (ReAct default)
cf work batch run --all-ready              # All READY tasks
cf work batch run --all-ready --engine plan  # Use legacy plan engine
cf work batch run --strategy serial        # Serial (default)
cf work batch run --strategy parallel      # Parallel execution
cf work batch run --strategy auto          # LLM-inferred dependencies
cf work batch run --max-parallel 4         # Concurrent limit
cf work batch run --retry 3               # Auto-retry failures
cf work batch status [batch_id]            # Show batch status
cf work batch cancel <batch_id>            # Cancel running batch
cf work batch resume <batch_id>            # Re-run failed tasks

# Blockers
cf blocker list
cf blocker show <id>
cf blocker answer <id> "answer"

# Quality
cf review
cf patch export
cf commit

# State
cf checkpoint create "name"
cf checkpoint list
cf checkpoint restore <id>
cf summary

# Environment validation
cf env check                     # Validate tools and dependencies
cf env install                   # Install missing tools
cf env doctor                    # Comprehensive environment health check

# GitHub PR workflow
cf pr create                     # Create PR from current branch
cf pr status                     # Show PR status
cf pr checks                     # Show CI check results
cf pr merge                      # Merge approved PR

# Diagnostics
cf work diagnose <task-id>       # AI-powered analysis of failed tasks

Note: codeframe serve exists but Golden Path does not depend on it.

Frontend (legacy)

cd web-ui && npm test
cd web-ui && npm run build

Do not expand frontend scope during Golden Path work.


Documentation Navigation

Authoritative (v2)

  • docs/GOLDEN_PATH.md - CLI-first workflow contract
  • docs/REFACTOR_PLAN_FOR_AGENT.md - Step-by-step refactor instructions
  • docs/CLI_WIREFRAME.md - Command → module mapping
  • docs/AGENT_IMPLEMENTATION_TASKS.md - Agent system components
  • docs/V2_STRATEGIC_ROADMAP.md - 5-phase plan from CLI to multi-agent

Agent Architecture (Phase 2.5)

  • docs/AGENT_V3_UNIFIED_PLAN.md - ReAct architecture design and rules
  • docs/REACT_AGENT_ARCHITECTURE.md - Deep-dive: tools, editor, token management
  • docs/REACT_AGENT_ANALYSIS.md - Golden path test run analysis

API Documentation (Phase 2)

  • /docs - Swagger UI (interactive API explorer)
  • /redoc - ReDoc (readable API documentation)
  • /openapi.json - OpenAPI 3.1 specification
  • docs/PHASE_2_DEVELOPER_GUIDE.md - Server layer implementation guide
  • docs/PHASE_2_CLI_API_MAPPING.md - CLI to API endpoint mapping

Legacy (v1 reference only)

These describe old server/UI-driven architecture:

  • SPRINTS.md, sprints/
  • specs/
  • CODEFRAME_SPEC.md
  • v1 feature docs (context/session/auth/UI state management)

What NOT to do (common agent failure modes)

  • Don't add new HTTP endpoints to support the CLI
  • Don't require codeframe serve for CLI workflows
  • Don't implement UI concepts (tabs, panels, progress bars) inside core
  • Don't redesign auth, websockets, or UI state management
  • Don't add multi-providers/model switching features before Golden Path works
  • Don't "clean up the repo" as a goal - only refactor to enable Golden Path
  • Don't update task status from agent.py - let runtime handle transitions

Testing / Demoing CodeFRAME on Sample Projects

When running uv run cf commands against a sample project (e.g., cf-test/) to test or demo CodeFRAME's capabilities, you are observing the CodeFRAME agent's work, not doing the work yourself.

Rules for testing/demo mode:

  • You are evaluating how well the CodeFRAME agent (ReAct or Plan engine) builds the project
  • Do NOT help out, fix errors, or write code on behalf of the CodeFRAME agent
  • Do NOT intervene when the agent makes mistakes — that's data
  • Your job is to report the process: what worked, what failed, how close the agent got
  • Document the agent's output, errors encountered, and final state
  • Assess completion against the PRD/acceptance criteria objectively
  • If the agent gets stuck or fails, report that as a finding — don't rescue it

This applies when using commands like cf work start <id> --execute, cf work batch run, or any command that triggers the AI agent to do implementation work on a target project.


Practical Working Mode for Agents

When implementing anything, do this loop:

  1. Read docs/GOLDEN_PATH.md and confirm the change is required
  2. Find the command in docs/CLI_WIREFRAME.md
  3. Implement core functionality in codeframe/core/
  4. Call it from Typer command in codeframe/cli/
  5. Emit events + persist state
  6. Keep it runnable. Commit.

If you are unsure which direction to take, default to:

  • simpler state
  • fewer dependencies
  • smaller surface area
  • core-first, CLI-first

Recent Updates (2026-03-09)

Stall Detection System (#399, #400, #401)

Complete stall detection and configurable recovery for agent execution:

Components:

  • StallMonitor (core/stall_monitor.py) — Thread-based watchdog polling every 5s
  • StallDetector (core/stall_detector.py) — Synchronous time-tracking primitive
  • StallAction enum — Recovery strategy: RETRY, BLOCKER, FAIL
  • StallDetectedError — Exception for RETRY path (propagates to runtime for retry)

CLI flags:

  • --stall-timeout N — Seconds without tool activity before stall (default: 300, 0=disabled)
  • --stall-action {blocker,retry,fail} — Recovery action (default: blocker)
  • Both flags available on cf work start and cf work batch run

Recovery flow:

  • BLOCKER (default): Creates informative blocker, task → BLOCKED
  • RETRY: Raises StallDetectedError, runtime retries once with fresh agent
  • FAIL: Task transitions directly to FAILED

Config: agent_budget.stall_timeout_s in .codeframe/config.yaml (0 = disabled)
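
To make the watchdog idea concrete, here is a minimal thread-based sketch. `StallWatchdog` and its parameters are illustrative names, not the project's StallMonitor API; the 5-second default poll interval and the "0 disables" convention are taken from the description above.

```python
import threading
import time
from typing import Callable


class StallWatchdog:
    """Illustrative watchdog; not the project's StallMonitor."""

    def __init__(
        self, timeout_s: float, on_stall: Callable[[], None], poll_s: float = 5.0
    ) -> None:
        self._timeout_s = timeout_s
        self._on_stall = on_stall
        self._poll_s = poll_s
        self._last_activity = time.monotonic()
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        if self._timeout_s > 0:  # a timeout of 0 disables the watchdog
            self._thread.start()

    def record_activity(self) -> None:
        """Reset the idle timer; call after every tool execution."""
        with self._lock:
            self._last_activity = time.monotonic()

    def stop(self) -> None:
        self._stop.set()
        if self._thread.is_alive():
            self._thread.join()

    def _run(self) -> None:
        # Event.wait returns False on timeout and True once stop() is called.
        while not self._stop.wait(self._poll_s):
            with self._lock:
                idle = time.monotonic() - self._last_activity
            if idle >= self._timeout_s:
                self._on_stall()
                return


stalled = threading.Event()
watchdog = StallWatchdog(timeout_s=0.05, on_stall=stalled.set, poll_s=0.01)
watchdog.start()
fired = stalled.wait(2.0)  # no activity is recorded, so the callback fires
watchdog.stop()
```

The callback is where a recovery action would be chosen: create a blocker, raise for a retry, or fail the task.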


Phase 2.5 Complete: ReAct Agent Architecture (#355)

Default execution engine switched from plan-based to ReAct (Reasoning + Acting).

What changed:

  • Default engine is now "react" — all cf work start --execute and cf work batch run commands use ReactAgent
  • Legacy plan engine available via --engine plan flag
  • ReactAgent uses iterative tool-use loop (observe → think → act) instead of plan-all-then-execute
  • 7 structured tools: read_file, edit_file, create_file, run_command, run_tests, search_codebase, list_files
  • Search-replace editing with 4-level fuzzy matching (exact → whitespace-normalized → indentation-agnostic → fuzzy)
  • Token budget management with 3-tier compaction
  • Adaptive iteration budget based on task complexity
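
The four-level matching cascade mentioned in the list above might look roughly like this. It is one interpretation of the level names, not the code in core/editor.py: the helper names, the line-window approach, and the 0.85 similarity threshold are all assumptions.

```python
import difflib
from typing import Callable, Optional


def _keep_indent(line: str) -> str:
    """Collapse inner whitespace runs but preserve leading indentation."""
    body = line.lstrip()
    return line[: len(line) - len(body)] + " ".join(body.split())


def _drop_indent(line: str) -> str:
    """Collapse all whitespace, ignoring indentation entirely."""
    return " ".join(line.split())


def _windows(lines: list[str], size: int) -> list[list[str]]:
    """All contiguous runs of `size` lines."""
    return [lines[i : i + size] for i in range(len(lines) - size + 1)]


def match_level(source: str, search: str, threshold: float = 0.85) -> Optional[str]:
    """Name the first strategy that locates `search` in `source`, else None."""
    if search in source:
        return "exact"
    src, want = source.splitlines(), search.splitlines()
    normalizers: list[tuple[str, Callable[[str], str]]] = [
        ("whitespace-normalized", _keep_indent),
        ("indentation-agnostic", _drop_indent),
    ]
    for name, norm in normalizers:
        target = [norm(line) for line in want]
        if any([norm(ln) for ln in win] == target for win in _windows(src, len(want))):
            return name
    # Last resort: accept a window that is merely similar enough.
    for win in _windows(src, len(want)):
        ratio = difflib.SequenceMatcher(None, "\n".join(win), search).ratio()
        if ratio >= threshold:
            return "fuzzy"
    return None


source = "def f():\n    return  1\n"
levels = [
    match_level(source, "    return  1"),
    match_level(source, "    return 1"),
    match_level(source, "return 1"),
    match_level(source, "    return  2"),
    match_level(source, "import os"),
]
```

Trying the strictest strategy first keeps edits precise when the model quotes the file verbatim and only loosens the match when it does not.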

Phase 2.5 deliverables:

| Phase | Focus | Pipeline Stage | Status |
|---|---|---|---|
| 1 | CLI Completion | Think + Build | Complete |
| 2 | Server Layer | Build (API) | Complete |
| 2.5 | ReAct Agent | Build (execution) | Complete |
| 3 | Web UI Rebuild | All (dashboard) | In Progress |
| 4 | Agent Adapters + Orchestration | Build (delegate to frontier agents) | Next |
| 5 | PROOF9 + Advanced | Prove + Ship (quality memory) | Planned |

Phase 2 Complete: Server Layer (2026-02-03)

Phase 2 deliverables completed:

Server Architecture (Phase 2)

Pattern: Thin adapter over core - server routes delegate to core.* modules.

CLI (typer) ─┬── core.* ─── adapters.*
             │
Server (fastapi) ─┘

V2 Router Modules (15 total):

| Router | Endpoints | Purpose |
|---|---|---|
| blockers_v2 | 5 | Blocker CRUD |
| prd_v2 | 8 | PRD management + versioning |
| tasks_v2 | 12 | Task management + streaming |
| workspace_v2 | 5 | Init, status, tech stack |
| batches_v2 | 5 | Batch execution strategies |
| streaming_v2 | 2 | SSE event streaming |
| api_key_v2 | 4 | API key management |
| discovery_v2 | 5 | PRD discovery sessions |
| checkpoints_v2 | 6 | State checkpoints |
| schedule_v2 | 3 | Task scheduling |
| templates_v2 | 4 | PRD templates |
| git_v2 | 3 | Git operations |
| review_v2 | 2 | Code review |
| pr_v2 | 5 | GitHub PR workflow |
| environment_v2 | 4 | Tool detection |

API Authentication:

# Create API key
cf auth api-key-create --name "my-key" --scopes read,write

# Use in requests
curl -H "X-API-Key: cf_..." https://api.example.com/api/v2/tasks

Rate Limiting:

  • Default: 100 requests/minute (standard endpoints)
  • Auth endpoints: 10/minute
  • AI endpoints: 20/minute
  • Configurable via RATE_LIMIT_* environment variables

OpenAPI Documentation:

  • Swagger UI: /docs
  • ReDoc: /redoc
  • OpenAPI JSON: /openapi.json

Previous Updates (2026-01-29)

V2 Strategic Roadmap Established

Created comprehensive 5-phase roadmap in docs/V2_STRATEGIC_ROADMAP.md.

Phase 1 Complete: CLI Foundation

All Phase 1 priorities completed:

Environment Validation (cf env)

New commands for validating development environment:

cf env check              # Validate required tools (git, uv, ruff, pytest)
cf env install            # Install missing tools automatically
cf env doctor             # Comprehensive environment health check

Modules:

  • core/environment.py - Tool detection and validation
  • core/installer.py - Cross-platform tool installation

GitHub PR Workflow (cf pr)

Streamlined PR management without leaving the CLI:

cf pr create              # Create PR from current branch
cf pr status              # Show PR status and review state
cf pr checks              # Show CI check results
cf pr merge               # Merge approved PR

Task Self-Diagnosis (cf work diagnose)

AI-powered analysis of failed tasks:

cf work diagnose <task-id>   # Analyze why a task failed

Modules:

  • core/diagnostics.py - Failed task analysis
  • core/diagnostic_agent.py - AI-powered diagnosis

Bug Fixes

GitHub Issue Organization


Previous Updates (2026-01-16)

Phase 3.1: Tech Stack Configuration

Simplified tech stack configuration using natural language descriptions:

  1. tech_stack field on Workspace model - stores natural language description
  2. --detect flag - auto-detects from pyproject.toml, package.json, Cargo.toml, go.mod
  3. --tech-stack flag - explicit tech stack description (e.g., "Rust project with cargo")
  4. --tech-stack-interactive flag - simple prompt for user input (stub for future multi-round)
  5. Agent integration - TaskContext and Planner include tech_stack in LLM prompts
  6. Removed cf config subcommand - tech stack is now part of workspace init

Design philosophy: Instead of structured configuration with specific package managers and frameworks, users describe their stack in natural language. The agent interprets and adapts.

Examples:

cf init . --detect                           # Auto-detect: "Python with uv, pytest, ruff for linting"
cf init . --tech-stack "Rust project using cargo"
cf init . --tech-stack "TypeScript monorepo with pnpm, Next.js, jest"
cf init . --tech-stack-interactive           # Prompts user for description
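
As an illustration of what `--detect` might do, the sketch below looks for the four marker files named above and returns a natural-language description. The function name and the wording of each hint are assumptions; the real detection logic is not shown in the source.

```python
import tempfile
from pathlib import Path

# Marker file -> natural-language hint. The hint wording is illustrative only.
MARKERS = {
    "pyproject.toml": "Python project (pyproject.toml)",
    "package.json": "JavaScript/TypeScript project (package.json)",
    "Cargo.toml": "Rust project using cargo",
    "go.mod": "Go project using go modules",
}


def detect_tech_stack(repo: Path) -> str:
    """Describe the stack based on which marker files exist in the repo root."""
    found = [hint for name, hint in MARKERS.items() if (repo / name).is_file()]
    return "; ".join(found) if found else "unknown"


# Usage: a throwaway directory containing only a Cargo.toml.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "Cargo.toml").write_text("[package]\n")
    detected = detect_tech_stack(Path(tmp))
    empty = detect_tech_stack(Path(tmp) / "missing")
```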

Future work: Multi-round interactive discovery (bead: codeframe-8d80)


Agent Self-Correction & Observability

Improved agent reliability with automatic error recovery:

  1. Self-correction loop in _run_final_verification() - agent retries up to 3 times
  2. Verbose mode (--verbose / -v) - shows detailed verification/self-correction progress
  3. FAILED task status - tasks transition to FAILED for proper error visibility
  4. Project preferences - agent loads AGENTS.md/CLAUDE.md for per-project config
  5. Fixed fail_run() - now properly transitions task status (was leaving tasks stuck)

Enhanced Self-Correction (Phase 3.4)

Advanced error recovery with loop prevention and smart escalation:

  1. Fix Attempt Tracker (core/fix_tracker.py) - prevents repeating failed fixes

    • Normalizes errors for comparison (removes line numbers, memory addresses)
    • Tracks (error_signature, fix_description) pairs with outcomes
    • Detects escalation patterns (same error 3+ times, same file 3+ times)
  2. Pattern-Based Quick Fixes (core/quick_fixes.py) - fixes common errors without LLM

    • ModuleNotFoundError → auto-install package (detects package manager)
    • ImportError → add missing import statement
    • NameError → add common imports (Optional, dataclass, Path, etc.)
    • SyntaxError → fix missing colons, f-string prefixes
    • IndentationError → normalize mixed tabs/spaces
  3. Escalation to Blocker - creates informative blockers when stuck

    • Triggered after MAX_SAME_ERROR_ATTEMPTS (3) failures on same error
    • Triggered after MAX_SAME_FILE_ATTEMPTS (3) failures on same file
    • Triggered after MAX_TOTAL_FAILURES (5) in a run
    • Blocker includes error type, attempted fixes, and guidance questions
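
The tracker idea can be sketched as follows. This is not the code in core/fix_tracker.py: the class, the regular expressions, and the method names are assumptions chosen to show error normalization, (error_signature, fix_description) tracking, and the same-error threshold.

```python
import re
from collections import Counter

MAX_SAME_ERROR_ATTEMPTS = 3  # threshold named in the text above


def normalize_error(message: str) -> str:
    """Strip details that vary between runs so equivalent errors compare equal."""
    message = re.sub(r"0x[0-9a-fA-F]+", "<addr>", message)  # memory addresses
    message = re.sub(r"\bline \d+", "line <n>", message)    # line numbers
    return message.strip()


class FixTracker:
    def __init__(self) -> None:
        self.attempts = {}         # (error_signature, fix_description) -> succeeded
        self.failures = Counter()  # error_signature -> number of failed fixes

    def already_tried(self, error: str, fix: str) -> bool:
        return (normalize_error(error), fix) in self.attempts

    def record(self, error: str, fix: str, succeeded: bool) -> None:
        signature = normalize_error(error)
        self.attempts[(signature, fix)] = succeeded
        if not succeeded:
            self.failures[signature] += 1

    def should_escalate(self, error: str) -> bool:
        """True once the same normalized error has failed the threshold number of times."""
        return self.failures[normalize_error(error)] >= MAX_SAME_ERROR_ATTEMPTS
```

Normalizing before comparison is what lets the tracker treat the same failure at a shifted line number as a repeat rather than a new error.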

Self-Correction Flow

Error occurs
    │
    ├── Try ruff --fix (auto-lint)
    │
    ├── Try pattern-based quick fix (no LLM)
    │   ├── Check if fix already attempted → skip
    │   ├── Apply fix
    │   └── Record outcome in tracker
    │
    ├── Check escalation threshold
    │   └── If exceeded → create escalation blocker
    │
    └── Use LLM to generate fix plan
        ├── Include already-tried fixes to avoid repetition
        ├── Execute fix steps with tracking
        └── Re-verify
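
The ordering in the flow above can be sketched as a single function. This is a simplified reading, assuming each stage stops the cascade when it resolves the error; every callable is injected and hypothetical, so the sketch says nothing about how the real _attempt_verification_fix() is written.

```python
from typing import Callable, List, Set, Tuple


def attempt_fix(
    error: str,
    ruff_fix: Callable[[str], bool],
    quick_fix: Callable[[str], bool],
    should_escalate: Callable[[str], bool],
    llm_fix: Callable[[str, List[Tuple[str, str]]], bool],
    tried: Set[Tuple[str, str]],
) -> str:
    """Return the stage that handled the error: ruff, quick_fix, escalated, llm, or failed."""
    # 1. Auto-lint first: cheapest option.
    if ruff_fix(error):
        return "ruff"
    # 2. Pattern-based quick fix, skipped when already attempted for this error.
    if ("quick", error) not in tried:
        tried.add(("quick", error))  # record the attempt before checking the outcome
        if quick_fix(error):
            return "quick_fix"
    # 3. Stop and ask a human once the escalation threshold is exceeded.
    if should_escalate(error):
        return "escalated"
    # 4. Fall back to the LLM, passing what was already tried to avoid repetition.
    if llm_fix(error, sorted(tried)):
        return "llm"
    return "failed"
```

The caller would re-run verification after any stage other than "escalated" or "failed".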

Key Self-Correction Methods

  • _run_final_verification(): While loop that re-runs gates after self-correction
  • _attempt_verification_fix(): Orchestrates quick fixes, escalation check, LLM fixes
  • _create_escalation_blocker(): Creates detailed blocker with context
  • _verbose_print(): Conditional stdout output for observability

Phase 2 Complete (2026-01-15): Parallel Batch Execution

All 6 Phase 2 items from CLI_WIREFRAME.md are done:

  1. ✅ work batch resume <batch-id> - re-run failed/blocked tasks
  2. ✅ depends_on field on Task model
  3. ✅ Dependency graph analysis (DAG, cycle detection, topological sort)
  4. ✅ True parallel execution with ThreadPoolExecutor worker pool
  5. ✅ --strategy auto with LLM-based dependency inference
  6. ✅ --retry N automatic retry of failed tasks

Key Phase 2 Modules

  • conductor.py: Batch orchestration with serial/parallel/auto strategies
  • dependency_graph.py: DAG operations, level-based grouping for parallelization
  • dependency_analyzer.py: LLM analyzes task descriptions to infer dependencies
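
Level-based grouping can be illustrated with the standard library's `graphlib` (Python 3.9+). This is a sketch of the technique, not the contents of dependency_graph.py; `execution_levels` and the task names are made up for the example.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Set


def execution_levels(depends_on: Dict[str, Set[str]]) -> List[List[str]]:
    """Group tasks into levels; every task in a level has all dependencies in earlier levels."""
    sorter = TopologicalSorter(depends_on)  # maps each task to the tasks it depends on
    sorter.prepare()  # raises CycleError if the graph contains a cycle
    levels = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks whose dependencies are all done
        levels.append(ready)
        sorter.done(*ready)
    return levels


levels = execution_levels({"b": {"a"}, "c": {"a"}, "d": {"b", "c"}})

try:
    execution_levels({"a": {"b"}, "b": {"a"}})
    cycle_detected = False
except CycleError:
    cycle_detected = True
```

Tasks within one level are independent of each other, so each level is a natural unit to hand to a worker pool.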

Agent Implementation Complete (2026-01-14)

All 8 implementation tasks from AGENT_IMPLEMENTATION_TASKS.md are done:

  1. ✅ LLM Adapter Interface (adapters/llm/)
  2. ✅ Task Context Loader (core/context.py)
  3. ✅ Agent Planning (core/planner.py)
  4. ✅ Code Execution Engine (core/executor.py)
  5. ✅ Automatic Blocker Detection (in core/agent.py)
  6. ✅ Gate Integration (in core/agent.py)
  7. ✅ Agent Orchestrator (core/agent.py)
  8. ✅ Wire into Runtime (core/runtime.py)

Bug Fixes During Testing

  • GateResult attribute access: Fixed gate_result.status → gate_result.passed
  • Duplicate task transitions: Removed task status updates from agent.py (runtime handles all)
  • READY→READY error: Added check in stop_run before transitioning
  • Verification step handling: Made _execute_verification smarter about file vs command targets

Key Design Decisions

  • State separation: Agent manages AgentState, Runtime manages TaskStatus
  • Model selection: Task-based heuristic via Purpose enum
  • Blocker creation: Agent creates blockers, Runtime updates task status
  • Verification: Incremental (ruff after each file change) + final (all gates)

Testing

Run all tests

uv run pytest

Run v2 tests only

uv run pytest -m v2           # All v2 tests (~411 tests)
uv run pytest -m v2 -q        # Quiet mode

The v2 marker identifies tests for CLI-first, headless functionality:

  • All tests in tests/core/ are automatically marked v2 (via conftest.py)
  • v2 CLI tests have pytestmark = pytest.mark.v2 at the top

Convention: When adding new v2 functionality, mark tests with @pytest.mark.v2 or add pytestmark = pytest.mark.v2 at module level for CLI tests that use codeframe.cli.app.
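
One way the automatic marking could be done in a conftest.py is sketched below. It assumes pytest 7 or later (where collected items expose `item.path`) and is not copied from the repository; `is_core_test` is a hypothetical helper.

```python
from pathlib import Path


def is_core_test(path) -> bool:
    """True when the path contains a tests/core/ directory pair."""
    parts = Path(str(path)).parts
    return any(a == "tests" and b == "core" for a, b in zip(parts, parts[1:]))


def pytest_collection_modifyitems(config, items):
    """pytest hook: add the v2 marker to every collected test under tests/core/."""
    for item in items:
        if is_core_test(item.path):
            item.add_marker("v2")  # add_marker accepts a marker name as a string
```

The `v2` marker would also need to be registered in the pytest configuration to avoid unknown-marker warnings.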

Run core module tests

uv run pytest tests/core/
uv run pytest tests/core/test_agent.py -v
uv run pytest tests/adapters/test_llm.py -v

Test coverage

uv run pytest --cov=codeframe --cov-report=html

Environment Variables

# Required for agent execution
ANTHROPIC_API_KEY=sk-ant-...

# Optional - Database
DATABASE_PATH=./codeframe.db

# Optional - Rate Limiting (Phase 2)
RATE_LIMIT_ENABLED=true                    # Enable/disable rate limiting
RATE_LIMIT_DEFAULT=100/minute              # Default limit
RATE_LIMIT_AUTH=10/minute                  # Auth endpoints
RATE_LIMIT_AI=20/minute                    # AI/LLM endpoints
RATE_LIMIT_WEBSOCKET=50/minute             # WebSocket connections
REDIS_URL=redis://localhost:6379           # Redis for distributed rate limiting (optional)

# Optional - API Server
CODEFRAME_API_KEY_SECRET=<random-secret>   # Secret for API key hashing

Legacy sections removed on purpose

This file previously contained extensive v1 details (auth, websocket, UI template, sprint history).
Those are still in git history and legacy docs, but they are not the current contract.

The current contract is Golden Path + Refactor Plan + Command Tree mapping + Agent Implementation.

@claude

claude Bot commented Mar 17, 2026


CodeFRAME v2 — Golden Path Contract (CLI-first)

This document is the contract for CodeFRAME v2 development.

Rule 0 (the only rule that matters):

If a change does not directly support the Golden Path flow below, do not implement it.

This applies to both humans and agentic coding assistants.


Goals

What "done" looks like (Enhanced MVP definition)

CodeFRAME can run a complete end-to-end AI-driven development workflow from the CLI on a target repo:

  1. Initialize workspace with project discovery

    • Analyze codebase and detect tech stack
    • Configure environment and tooling automatically
    • Create durable state storage
  2. AI-driven PRD generation and refinement

    • Interactive AI session gathers project requirements
    • AI asks follow-up questions about scope, users, constraints
    • Generates comprehensive PRD + technical specs + user stories
    • Iterative refinement based on user feedback
  3. Intelligent task generation with dependency analysis

    • Decompose PRD into actionable tasks with dependencies
    • Prioritize tasks and group by functionality
    • Generate implementation strategies per task
  4. Batch task execution with orchestration

    • Execute multiple tasks in sequence or parallel
    • Handle inter-task dependencies automatically
    • Main agent coordinates entire batch workflow
    • Real-time progress monitoring and event streaming
  5. Human-in-the-loop blocker resolution

    • Interactive blocker handling with contextual AI suggestions
    • Resume execution after blocker resolution
    • Learning from blocker patterns
  6. Integrated Git workflow and PR management

    • Automatic branch creation per task/batch
    • AI-generated commit messages and PR descriptions
    • Automated verification gate execution
    • PR creation, review, and merging workflows
  7. Comprehensive checkpointing and state management

    • Snapshots of workspace state with git refs
    • Resume interrupted workflows from checkpoints
    • Multi-environment state isolation

No UI is required.
A FastAPI server is not required for the Golden Path to work.
All Git operations are integrated into the CLI workflow.


Non-Goals (explicitly forbidden until Golden Path works)

Do not build or refactor:

  • Web UI / dashboard features
  • Settings pages, preferences, themes
  • Multi-provider/model switching UI or complex provider management
  • Advanced metrics dashboards or timeseries endpoints
  • Auth / sessions for remote users
  • Electron desktop app
  • Plugin marketplace / extensibility frameworks
  • “Perfect” project structure, monorepo tooling, or build system redesign
  • Large migrations or renames that aren’t required by Golden Path

These may be revisited only after Golden Path is working and stable.


Golden Path CLI Flow (the only flow that matters)

0) Preconditions

  • A target repo exists (any small test repo is fine).
  • CodeFRAME runs locally and can store durable state (SQLite or filesystem).
  • The CLI can be run from anywhere.

1) Initialize a workspace

Command:

  • codeframe init <path-to-repo>

Required behavior:

  • Registers the repo as a workspace.
  • Creates/updates durable state storage.
  • Prints a short workspace summary (repo path, workspace id, state location).

Artifacts:

  • Local state created (DB/file), e.g. .codeframe/ and/or codeframe.db.

2) AI-driven PRD generation and refinement

Commands:

  • codeframe prd generate (primary - interactive AI session)
  • codeframe prd add <file.md> (secondary - existing file support)
  • codeframe prd refine (iterative improvement)

Required behavior for prd generate:

  • AI conducts interactive discovery session asking:
    • Project scope, objectives, and success criteria
    • Target users, use cases, and user stories
    • Technical constraints, preferences, and requirements
    • Timeline, priorities, and MVP boundaries
  • Generates comprehensive PRD with:
    • Executive summary and problem statement
    • Functional requirements with acceptance criteria
    • Technical specifications and architecture guidance
    • User stories with priority ranking
    • Success metrics and validation criteria
  • Provides iterative refinement based on user feedback
  • Stores PRD in durable state with versioning
  • Supports multiple PRD versions with change tracking

3) Intelligent task generation with dependency analysis

Commands:

  • codeframe tasks generate (enhanced with dependencies)
  • codeframe tasks analyze (dependency graph analysis)

Required behavior:

  • Decomposes PRD into granular, actionable tasks
  • Automatically detects and assigns task dependencies
  • Estimates effort and complexity for each task
  • Groups related tasks into logical workstreams
  • Prioritizes tasks based on dependencies and value delivery
  • Supports task templates for common patterns (setup, implementation, testing, deployment)
  • Generates implementation strategy per task (files to modify, approaches to consider)
  • Creates task dependency graph with critical path identification

4) Batch task execution with orchestration

Commands:

  • codeframe work batch run (primary - main execution pathway)
  • codeframe work start <task-id> (secondary - single task fallback)
  • codeframe work batch status <batch-id> (monitoring)
  • codeframe work batch follow <batch-id> (real-time streaming)

Required behavior for batch execution:

  • Executes multiple tasks with intelligent scheduling:
    • Serial execution for dependent tasks
    • Parallel execution for independent tasks
    • Auto-strategy using dependency graph analysis
  • Main orchestrator agent coordinates entire batch:
    • Resource allocation and task scheduling
    • Inter-task communication and data sharing
    • Failure handling and retry logic
    • Progress tracking and milestone reporting
  • Real-time event streaming with:
    • Task start/completion events
    • Progress indicators and ETAs
    • Blocker detection and notification
    • Dependency resolution updates
  • Supports execution strategies:
    • --strategy serial: Linear execution
    • --strategy parallel: Max parallelization
    • --strategy auto: AI-optimized based on dependencies

5) Enhanced human-in-loop blocker resolution

Commands:

  • codeframe blockers list (enhanced with context)
  • codeframe blocker answer <blocker-id> "<text>" (with AI suggestions)
  • codeframe blocker resolve <blocker-id> (automated resolution options)

Required behavior:

  • AI provides contextual blocker resolution suggestions:
    • Similar past blockers and their solutions
    • Multiple solution approaches with trade-offs
    • Impact analysis of resolution choices
  • Interactive blocker handling with:
    • Rich context display (related code, PRD sections, task dependencies)
    • Suggested responses ranked by confidence
    • Impact on task timeline and dependencies
  • Learning system that:
    • Records blocker patterns and resolutions
    • Improves future blocker handling suggestions
    • Reduces human intervention over time

6) Integrated Git workflow and PR management

Commands:

  • codeframe work start <task-id> --create-branch (branch management)
  • codeframe pr create (PR creation with AI descriptions)
  • codeframe pr list (PR status monitoring)
  • codeframe pr merge <pr-id> (PR merging with verification)

Required behavior:

  • Branch Management:
    • Automatic feature branch creation per task/batch
    • Branch naming conventions with task/batch IDs
    • Branch cleanup and organization utilities
    • Conflict detection and resolution assistance
  • PR Creation:
    • AI generates comprehensive PR descriptions:
      • Summary of changes and business impact
      • Technical implementation details
      • Testing performed and results
      • Breaking changes and migration notes
    • Automated PR labeling and categorization
    • Reviewer assignment based on code expertise
  • PR Workflow:
    • Automated gate execution before merge (tests, lint, security scans)
    • Integration with CI/CD pipelines
    • Merge strategies (squash, merge, rebase) based on team preferences
    • Post-merge cleanup and notification

7) Enhanced verification and quality gates

Commands:

  • codeframe review (comprehensive code review)
  • codeframe gates run (automated quality checks)
  • codeframe quality report (quality metrics and trends)

Required behavior:

  • Comprehensive Gate Suite:
    • Unit tests with coverage reporting
    • Integration and end-to-end tests
    • Static code analysis (lint, security, complexity)
    • Performance regression tests
    • Documentation and API specification validation
  • AI-Assisted Code Review:
    • Automated code quality assessment
    • Best practices compliance checking
    • Potential bug detection and suggestions
    • Code style and maintainability analysis
  • Quality Tracking:
    • Trend analysis of code quality metrics
    • Technical debt accumulation tracking
    • Gate failure pattern identification

8) Integrated artifact and commit management

Commands:

  • codeframe commit create -m "<message>" (AI-generated commits)
  • codeframe patch export (safe patch generation)
  • codeframe artifacts list (artifact tracking)

Required behavior:

  • Smart Commits:
    • AI generates meaningful commit messages:
      • Conventional commit format compliance
      • Contextual change descriptions
      • References to tasks/PRDs/issues
      • Breaking change highlights
    • Atomic commit boundaries and logical grouping
  • Artifact Management:
    • Automatic patch generation for safety
    • Commit linking to tasks and batches
    • Rollback points and recovery procedures
    • Integration with external artifact repositories

9) Comprehensive checkpointing and state management

Commands:

  • codeframe checkpoint create "<name>" (enhanced snapshots)
  • codeframe checkpoint restore <checkpoint-id> (workflow resume)
  • codeframe summary (comprehensive reporting)

Required behavior:

  • Rich Checkpoints:
    • Complete workspace state capture:
      • Task statuses and progress
      • Git refs and working directory state
      • PRD versions and requirements
      • Configuration and environment settings
    • Incremental checkpoint optimization
    • Cross-environment checkpoint portability
  • Workflow Resume:
    • Seamless resumption from any checkpoint
    • Context restoration for active agents
    • Branch and working directory restoration
    • Event log continuity and replay
  • Comprehensive Reporting:
    • Executive summaries with progress metrics
    • Detailed task completion reports
    • Quality gate performance tracking
    • Resource utilization and timing analysis
    • Risk assessment and mitigation recommendations

State Machine (authoritative)

Statuses:

  • BACKLOG - Task identified but not ready for execution
  • READY - Task prepared and ready to start
  • IN_PROGRESS - Task actively being worked on
  • BLOCKED - Task waiting for human input or external dependency
  • DONE - Task completed locally, ready for review/integration
  • IN_REVIEW - Task changes in PR review process
  • MERGED - Task changes integrated into main branch
  • FAILED - Task execution failed (can be retried)

Allowed transitions (comprehensive):

  • BACKLOG -> READY (task preparation complete)
  • READY -> IN_PROGRESS (work started)
  • IN_PROGRESS -> BLOCKED (awaiting input/dependency)
  • BLOCKED -> IN_PROGRESS (blocker resolved)
  • BLOCKED -> READY (returned to queue)
  • IN_PROGRESS -> DONE (local completion)
  • IN_PROGRESS -> FAILED (execution failure)
  • DONE -> IN_REVIEW (PR created/under review)
  • IN_REVIEW -> DONE (PR rejected, needs work)
  • IN_REVIEW -> MERGED (PR approved and merged)
  • DONE -> READY (reopened for additional work)
  • FAILED -> READY (retry after failure)
  • MERGED -> BACKLOG (reopened for enhancement)
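
The statuses and transitions above map directly onto a lookup table. The sketch below is one possible encoding; `TaskStatus` is named elsewhere in the source, but the enum layout, `ALLOWED`, and `transition` are assumptions for illustration.

```python
from enum import Enum


class TaskStatus(Enum):
    BACKLOG = "BACKLOG"
    READY = "READY"
    IN_PROGRESS = "IN_PROGRESS"
    BLOCKED = "BLOCKED"
    DONE = "DONE"
    IN_REVIEW = "IN_REVIEW"
    MERGED = "MERGED"
    FAILED = "FAILED"


S = TaskStatus
# Each status maps to the set of statuses it may move to, per the list above.
ALLOWED = {
    S.BACKLOG: {S.READY},
    S.READY: {S.IN_PROGRESS},
    S.IN_PROGRESS: {S.BLOCKED, S.DONE, S.FAILED},
    S.BLOCKED: {S.IN_PROGRESS, S.READY},
    S.DONE: {S.IN_REVIEW, S.READY},
    S.IN_REVIEW: {S.DONE, S.MERGED},
    S.MERGED: {S.BACKLOG},
    S.FAILED: {S.READY},
}


def transition(current: TaskStatus, target: TaskStatus) -> TaskStatus:
    """Return the new status, or raise ValueError if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"{current.name} -> {target.name} is not an allowed transition")
    return target
```

For example, `transition(S.READY, S.IN_PROGRESS)` succeeds, while `transition(S.READY, S.DONE)` raises because a task cannot skip IN_PROGRESS.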

The CLI is the authority for transitions.
UIs (web/electron) are views over this state machine, not the source of truth.

PR Workflow Integration:

  • Tasks automatically transition to IN_REVIEW when codeframe pr create is run
  • PR status changes trigger corresponding task state updates
  • Merge actions transition tasks to MERGED status
  • Failed or rejected PRs return tasks to DONE for additional work

Implementation Principles

Core-first (no FastAPI in the core)

  • Domain logic must live in a reusable core module/package.
  • Core must not import FastAPI, websockets, or HTTP request objects.
  • FastAPI server (if used) must be a thin adapter over core.

CLI-first (server optional)

  • Golden Path commands must work without any running backend server.
  • If a server exists, it may be started separately (codeframe serve) and must wrap core.

Salvage safely

  • Legacy code can be read and copied from.
  • Core must not take dependencies on legacy UI-driven modules.
  • Prefer copying useful functions into core and simplifying interfaces.

Keep it runnable

  • Every commit should keep codeframe --help working.
  • The Golden Path commands should remain executable even if stubs at first.

Acceptance Checklist (Enhanced MVP - must pass)

Status: 🔄 Enhanced MVP Partially Complete

📊 Current Implementation Status

Overall Assessment: Enhanced MVP is ~60% complete with solid foundation but critical gaps remaining.

✅ Fully Implemented Phases:

  • Phase 1: Basic PRD functionality (prd add) - Enhanced PRD generation missing
  • Phase 2: Core task generation with LLM support - Advanced dependency analysis incomplete
  • Phase 3: Complete batch execution framework - Orchestrator integration complete
  • Phase 4: Basic blocker management system - AI-powered suggestions missing
  • Phase 6: Basic verification gates (codeframe review) - AI-assisted review missing
  • Phase 7: Comprehensive checkpointing system - Incremental/batch features missing

⚠️ Critical Missing Components:

  • AI-driven PRD generation: No codeframe prd generate command
  • Credential management: No codeframe auth system - CRITICAL BLOCKER
  • Git/PR workflow: GitHub integration exists but no CLI commands
  • Environment validation: No pre-flight validation system
  • Advanced recovery: Limited rollback beyond full checkpoints
  • Enhanced monitoring: Basic event streaming, no rich debugging

🎯 Key Finding:

The single most critical issue is missing credential management - users cannot reliably use the enhanced workflow without it.

Foundation is solid - Core CLI functionality, batch execution, and basic Git integration work reliably.

Next priority: Implement credential management system as outlined in gap analysis documents.

Phase 1: AI-Driven Project Discovery & PRD Generation

  • codeframe init with auto tech stack detection and environment setup
    • Implementation: Auto tech stack detection with --detect flag
    • Implementation: Interactive tech stack configuration with --tech-stack-interactive
    • Note: Basic init works, enhanced features not yet integrated
  • codeframe prd generate conducts interactive AI discovery session
    • ⚠️ Status: Command not implemented - only codeframe prd add <file.md> exists
    • Note: Discovery exists in legacy codebase but not integrated into CLI
  • AI asks contextual follow-up questions about requirements and constraints
  • Generates comprehensive PRD with technical specs and user stories
  • Supports iterative PRD refinement based on user feedback
  • PRD versioning and change tracking

Phase 2: Intelligent Task Generation & Dependency Management

  • codeframe tasks generate creates dependency-aware task graphs
    • Implementation: Uses LLM for task generation with dependency analysis
    • Implementation: Supports both LLM and simple extraction modes
    • ⚠️ Status: Limited dependency graph functionality
    • Note: Basic task generation works, advanced dependency analysis incomplete
  • Automatic task prioritization and workstream grouping
  • Effort estimation and complexity analysis
  • Critical path identification and scheduling
  • Task template system for common implementation patterns

Phase 3: Batch Execution & Orchestration

  • codeframe work batch run as primary execution pathway
    • Implementation: Comprehensive batch execution with multiple strategies
    • Implementation: Serial, parallel, and auto-strategy execution modes
    • Implementation: Event streaming and progress monitoring
    • Implementation: Failure handling and retry logic
    • Implementation: Real-time status and batch monitoring commands
    • Note: Main batch functionality works, orchestrator integration complete
  • Serial, parallel, and auto-strategy execution modes
  • Real-time progress monitoring with event streaming
  • Inter-task dependency management and coordination
  • Main orchestrator agent manages entire batch workflow
  • Failure handling and automatic retry logic

Phase 4: Enhanced Human-in-the-Loop Blocker Resolution

  • Contextual blocker display with rich background information
    • Implementation: Comprehensive blocker management system
    • Implementation: Rich context display with codebase references
    • ⚠️ Status: AI-powered suggestions not yet implemented
    • Note: Basic blocker listing and answering works, AI suggestions missing
  • AI-powered blocker resolution suggestions
  • Learning system for blocker pattern recognition
  • Similar past blocker solutions and recommendations
  • Impact analysis for different resolution approaches

Phase 5: Integrated Git Workflow & PR Management

  • Automatic branch creation per task/batch with naming conventions
  • AI-generated comprehensive PR descriptions with business impact
  • Automated PR labeling and reviewer assignment
  • Integration with CI/CD pipelines and gate execution
  • Multiple merge strategies (squash, merge, rebase) support
  • Post-merge cleanup and notification automation
    • ⚠️ Status: Basic Git integration exists, PR creation incomplete
    • Note: GitHub integration module exists (codeframe/git/github_integration.py)
    • Note: Auth commands exist but credential management missing
    • Note: No CLI commands for PR creation/management yet implemented

Phase 6: Comprehensive Quality Gates & Verification

  • Expanded gate suite: unit tests, integration tests, security scans
    • Implementation: Basic codeframe review command exists
    • Implementation: Supports multiple gate types (pytest, ruff, mypy, npm)
    • ⚠️ Status: Limited gate functionality - stub implementation
    • Note: Only basic verification works, AI-assisted review not implemented
  • AI-assisted code review with best practices checking
  • Quality metrics tracking and trend analysis
  • Technical debt accumulation monitoring
  • Automated regression detection and prevention

Phase 7: Advanced Checkpointing & State Management

  • Rich checkpoint snapshots with complete workspace state
    • Implementation: Comprehensive checkpoint management system
    • Implementation: Checkpoint create, list, show, and restore commands
    • Implementation: Git reference integration for state tracking
    • ⚠️ Status: Basic checkpointing works, advanced features missing
    • Note: No incremental checkpointing during batch execution
  • Cross-environment checkpoint portability
  • Seamless workflow resumption from any checkpoint
  • Incremental checkpoint optimization
  • Executive reporting with progress and risk metrics

Cross-Cutting Requirements

  • All functionality works without FastAPI server running
    • Implementation: Core functionality works independently of server
    • Verification: CLI commands work without FastAPI dependency
    • ⚠️ Status: Server wrapper incomplete but not required for CLI workflow
  • No UI required at any point in workflow
  • Event logging and streaming for observability
    • Implementation: Comprehensive event system with rich logging
    • Implementation: Real-time event streaming during batch execution
    • ⚠️ Status: Advanced monitoring features missing
  • Comprehensive error handling and recovery procedures
    • ⚠️ Status: Basic error handling exists, advanced recovery missing
    • Note: No rollback capability beyond full checkpoints
  • Performance optimization for large repositories
  • Documentation and help commands for all new features
  • Security best practices and credential management
    • ⚠️ Status: No credential management system implemented
    • ⚠️ Critical Gap: Authentication failures would block entire workflow
    • Note: See gap analysis documents for detailed credential management plan

Definition of Done:

  • All acceptance criteria must be satisfied
  • End-to-end workflow tested on real project repositories
  • Performance benchmarks meet minimum standards
  • Security audit passes all compliance checks
  • Documentation is complete and accurate
  • User feedback collected from beta testing validates approach

Next phase: Production Readiness & Advanced Features (see roadmap planning).

@claude

claude Bot commented Mar 17, 2026


CodeFRAME v2 Strategic Roadmap

Created: 2026-01-29
Updated: 2026-02-15
Status: Active - Phase 2.5 Complete, Phase 3 Next

Executive Summary

CodeFRAME v2 CLI Phase 1 is complete with a production-ready foundation. The path forward involves:

  1. Closing the remaining 5-10% CLI gap (mainly prd generate and observability) ✅ DONE
  2. Building server layer as thin adapter over core
  3. Rebuilding web UI on the v2 foundation
  4. Evolving toward the multi-agent "FRAME" vision

Current State Assessment

What's Working (Phase 1 Complete)

  • Full agent execution: cf work start <task-id> --execute
  • Batch orchestration: serial, parallel, auto (LLM-inferred dependencies)
  • Self-correction loop with up to 3 retry attempts
  • Blocker system for human-in-the-loop decisions
  • Verification gates (ruff, pytest, BUILD)
  • State persistence and checkpoint/restore
  • Tech stack auto-detection
  • 76+ integration tests, all passing
  • GitHub PR workflow commands
  • Interactive PRD generation (cf prd generate) ✅
  • Live execution streaming (cf work follow) ✅
  • PRD template system for customizable output ✅
  • Integration tests for credentials/environment modules ✅

Phase 1 Gaps - ALL CLOSED

| Gap | Issue | Status |
|---|---|---|
| cf prd generate (Socratic discovery) | #307 | ✅ CLOSED |
| Live streaming (cf work follow) | #308 | ✅ CLOSED |
| PRD template system | #316 | ✅ CLOSED |
| Integration tests for credential/env modules | #309 | ✅ CLOSED |

Phase 1: CLI Foundation Completion ✅ COMPLETE

Goal: Make CLI fully production-ready for headless agent workflows.
Status: ✅ ALL DELIVERABLES COMPLETE (2026-02-01)

Deliverables

  1. cf prd generate command (#307) - ✅ COMPLETE

    • Interactive AI-driven requirements discovery
    • Multi-turn Socratic questioning (5+ turns minimum)
    • Progressive refinement: broad vision → specific requirements → acceptance criteria
    • Outputs structured PRD document
    • Template support for customizable output formats
  2. Live execution streaming (#308) - ✅ COMPLETE

    • cf work follow <task-id> for real-time output
    • File-based streaming with tail support
  3. PRD template system (#316) - ✅ COMPLETE (BONUS)

    • 5 built-in templates: standard, lean, enterprise, technical, user-story
    • Export/import for customization
    • cf prd templates list/show/export/import commands
  4. Integration test expansion (#309) - ✅ COMPLETE

    • Test credential manager with keyring
    • Test environment validator with tool detection
    • 76+ integration tests (exceeded target)
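
The "file-based streaming with tail support" in deliverable 2 can be pictured as repeatedly reading whatever was appended to an output file since the last read. The sketch below shows that idea only; `read_new_lines` is a hypothetical helper, and the source does not describe the actual file format or polling strategy.

```python
import tempfile
from pathlib import Path
from typing import List, Tuple


def read_new_lines(path: Path, offset: int) -> Tuple[List[str], int]:
    """Return complete lines appended since `offset`, plus the offset to resume from."""
    with path.open("rb") as fh:
        fh.seek(offset)
        chunk = fh.read()
    # Consume only up to the last newline so a half-written line is not emitted.
    end = chunk.rfind(b"\n") + 1
    lines = chunk[:end].decode("utf-8").splitlines()
    return lines, offset + end


# Usage: simulate a writer that has flushed one partial line.
with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "task.log"
    log.write_bytes(b"planning\nediting\npart")
    first, pos = read_new_lines(log, 0)
    with log.open("ab") as fh:
        fh.write(b"ial line\n")
    second, pos2 = read_new_lines(log, pos)
```

A follow command would call this in a loop with a short sleep, carrying the returned offset between iterations.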

Success Criteria - ALL MET

  • ✅ New user completes full workflow without hitting credential/env failures
  • ✅ cf prd generate conducts 5+ turn discovery session
  • ✅ All v2 integration tests pass (4285 total tests)

Phase 2: Server Layer as Thin Adapter

Goal: FastAPI server exposing core functionality via REST + real-time events.
Status: ✅ COMPLETE

Deliverables

  1. Server audit and refactor (#322) - ✅ COMPLETE

    • ✅ Business logic audit completed (see docs/PHASE_2_BUSINESS_LOGIC_AUDIT.md)
    • ✅ CLI-to-API route mapping (see docs/PHASE_2_CLI_API_MAPPING.md)
    • ✅ V2 routers created following thin adapter pattern:
      • blockers_v2.py - Full CRUD delegating to core.blockers
      • prd_v2.py - Full CRUD + versioning delegating to core.prd
      • tasks_v2.py - Enhanced with PATCH/DELETE/streaming/run status
      • workspace_v2.py - Init, status, tech stack detection
      • batches_v2.py - Batch execution with strategies
      • diagnose_v2.py - Failed task analysis
      • pr_v2.py - GitHub PR workflow
      • environment_v2.py - Tool detection and validation
      • gates_v2.py - Verification gate execution
    • ✅ Integration tests: 130+ tests for v2 routers
  2. Real-time events ([Phase 2] Real-time events via SSE/WebSocket for task execution #323) - 🔄 PARTIAL

    • ✅ SSE streaming via /api/v2/tasks/{id}/stream
    • ⚠️ WebSocket for bidirectional events still needed
  3. Authentication & Security - ✅ COMPLETE

    • API key authentication (#326)
    • Rate limiting (#327)

Phase 2 Progress Summary

| Component | Routes | Status |
|---|---|---|
| Blockers v2 | 5 endpoints | ✅ Complete |
| PRD v2 | 8 endpoints | ✅ Complete |
| Tasks v2 (enhanced) | 12 endpoints | ✅ Complete |
| Discovery v2 | 5 endpoints | ✅ Complete |
| Checkpoints v2 | 6 endpoints | ✅ Complete |
| Schedule v2 | 3 endpoints | ✅ Complete |
| Templates v2 | 4 endpoints | ✅ Complete |
| Git v2 | 3 endpoints | ✅ Complete |
| Review v2 | 2 endpoints | ✅ Complete |
| Workspace v2 | 5 endpoints | ✅ Complete |
| Batches v2 | 5 endpoints | ✅ Complete |
| Diagnose v2 | 2 endpoints | ✅ Complete |
| PR v2 | 5 endpoints | ✅ Complete |
| Environment v2 | 4 endpoints | ✅ Complete |
| Gates v2 | 2 endpoints | ✅ Complete |
| API Key Auth | 4 endpoints | ✅ Complete |
| Rate Limiting | All routes | ✅ Complete |

All Phase 2 Issues

| Issue | Title | Priority | Status |
|---|---|---|---|
| #322 | Server audit and refactor | HIGH | ✅ Complete |
| #325 | Phase 2 Server Layer PR | HIGH | ✅ Complete |
| #326 | API key authentication | HIGH | ✅ Complete |
| #327 | Rate limiting | HIGH | ✅ Complete |
| #323 | Real-time events (SSE/WebSocket) | HIGH | 🔄 Partial (SSE done) |
| #119 | OpenAPI documentation | MEDIUM | Open |
| #118 | API pagination | MEDIUM | Open |

Architecture Principle: Thin Adapter Pattern

CLI (typer) ─┬── core.* ─── adapters.*
             │
Server (fastapi) ─┘

Server and CLI are siblings, both calling core.

Key Pattern: V2 routers follow the thin adapter pattern:

  1. Parse HTTP request parameters
  2. Call core module function with workspace
  3. Transform result to HTTP response
  4. Handle errors with standard format
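The four steps above can be sketched without any web framework. This is an illustration under stated assumptions: `core_get_blocker`, `BlockerNotFound`, `Response`, and `get_blocker_route` are hypothetical stand-ins, not the project's actual core functions or FastAPI routes.

```python
from dataclasses import dataclass
from typing import Any


class BlockerNotFound(Exception):
    """Hypothetical core-layer error."""


def core_get_blocker(workspace: str, blocker_id: str) -> dict[str, Any]:
    """Stand-in for a core module function; all business logic lives here."""
    blockers = {"b1": {"id": "b1", "question": "Which database?"}}
    if blocker_id not in blockers:
        raise BlockerNotFound(blocker_id)
    return blockers[blocker_id]


@dataclass
class Response:
    status: int
    body: dict[str, Any]


def get_blocker_route(params: dict[str, str]) -> Response:
    # 1. Parse HTTP request parameters
    workspace = params.get("workspace_path", "")
    blocker_id = params.get("blocker_id", "")
    if not workspace or not blocker_id:
        return Response(400, {"detail": "workspace_path and blocker_id are required"})
    try:
        # 2. Call core module function with workspace
        blocker = core_get_blocker(workspace, blocker_id)
    except BlockerNotFound:
        # 4. Handle errors with standard format
        return Response(404, {"detail": f"Blocker {blocker_id} not found"})
    # 3. Transform result to HTTP response
    return Response(200, {"blocker": blocker})
```

The route holds no business rules of its own, so a CLI command can call the same core function and get identical behavior.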

See docs/PHASE_2_DEVELOPER_GUIDE.md for implementation guide.


Phase 2.5: ReAct Agent Architecture ✅ COMPLETE

Goal: Replace plan-then-execute agent with iterative ReAct (Reasoning + Acting) loop as the default engine.
Status: ✅ COMPLETE (2026-02-15)

Motivation

The plan-based agent had several failure modes discovered during testing:

  • Config file overwrites (whole-file generation ignores existing content)
  • Cross-file naming inconsistency (each file generated in isolation)
  • Accumulated lint errors (no incremental verification)
  • Ineffective self-correction (empty error context)

Deliverables

  1. ReAct Agent Implementation - ✅ COMPLETE

    • core/react_agent.py - Observe-Think-Act loop with tool use
    • core/tools.py - 7 structured tools (read/edit/create file, run command/tests, search, list)
    • core/editor.py - Search-replace editor with 4-level fuzzy matching
  2. Engine Selection - ✅ COMPLETE

    • --engine react (default) or --engine plan (legacy) on all work commands
    • Runtime routes to ReactAgent or Agent based on engine parameter
    • API endpoints support engine parameter with validation
  3. CLI Validation ([Phase 2.5-F] End-to-end CLI validation with cf-test project #353) - ✅ COMPLETE

    • --engine flag on cf work start and cf work batch run
    • Default switched to "react"
  4. API Validation ([Phase 2.5-F] Verify ReAct engine works via API routes #354) - ✅ COMPLETE

    • Engine parameter on execute, approve, and stream endpoints
    • Backward compatible — omitting engine uses "react" default
  5. Default Switch + Documentation ([Phase 2.5-F] Switch default engine to react and update documentation #355) - ✅ COMPLETE

    • Default engine changed from "plan" to "react" across CLI, API, and runtime
    • CLAUDE.md updated with ReAct architecture documentation
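The engine selection in deliverable 2 (route to ReactAgent or Agent, validate the parameter, default to "react") might look roughly like the sketch below. The two classes are empty stand-ins and `select_agent` is a hypothetical name; the real runtime takes more arguments.

```python
from typing import Optional, Union

VALID_ENGINES = ("react", "plan")
DEFAULT_ENGINE = "react"


class ReactAgent:
    """Stand-in for the ReAct engine class."""


class Agent:
    """Stand-in for the legacy plan-based engine class."""


def select_agent(engine: Optional[str] = None) -> Union[ReactAgent, Agent]:
    """Return the agent for `engine`; omitting it falls back to the default."""
    engine = engine or DEFAULT_ENGINE
    if engine not in VALID_ENGINES:
        raise ValueError(f"Unknown engine {engine!r}; expected one of {VALID_ENGINES}")
    return ReactAgent() if engine == "react" else Agent()
```

Validating in one place keeps the CLI flag and the API parameter consistent, since both would call the same function.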

Key Architecture Decisions

  • Search-replace editing: ~98% accuracy vs ~70-80% for whole-file regeneration
  • Read before write: Agent always sees actual file state before editing
  • Lint after every change: Catch errors immediately, not after they accumulate
  • 7 focused tools: Fewer tools = higher accuracy
  • Token budget management: 3-tier compaction prevents context window overflow
  • Adaptive iteration budget: Task complexity scoring adjusts iteration limits
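To make the search-replace decision concrete, here is a minimal sketch of an edit that must match exactly one location and falls back to whitespace-insensitive line matching. It shows two illustrative levels only; the fallback levels actually used by core/editor.py are not described in this document, and `apply_edit` is a hypothetical name.

```python
def apply_edit(content: str, search: str, replace: str) -> str:
    """Replace exactly one occurrence of `search` in `content`."""
    # Level 1: exact match, which must be unique to be safe.
    count = content.count(search)
    if count == 1:
        return content.replace(search, replace, 1)
    if count > 1:
        raise ValueError("search block is ambiguous")

    # Level 2: compare line by line, ignoring leading/trailing whitespace.
    lines = content.splitlines()
    wanted = [line.strip() for line in search.splitlines()]
    hits = [
        i
        for i in range(len(lines) - len(wanted) + 1)
        if [line.strip() for line in lines[i : i + len(wanted)]] == wanted
    ]
    if len(hits) != 1:
        raise ValueError("search block not found or ambiguous")
    start = hits[0]
    lines[start : start + len(wanted)] = replace.splitlines()
    # splitlines() drops the final newline, so restore it if it was there.
    return "\n".join(lines) + ("\n" if content.endswith("\n") else "")
```

Refusing ambiguous matches is the point of the design: a failed edit is reported back to the agent, whereas a wrong edit silently corrupts the file.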

Reference Documentation

  • docs/AGENT_V3_UNIFIED_PLAN.md - Architecture design and rules
  • docs/REACT_AGENT_ARCHITECTURE.md - Deep-dive on tools, editor, token management
  • docs/PHASE_25_VALIDATION_REPORT.md - End-to-end validation results

Phase 3: Web UI Rebuild

Goal: Modern dashboard consuming REST/WebSocket API.

Deliverables

  1. Project management - Workspace list, creation, configuration
  2. PRD interface - Visual editor with AI assistance
  3. Task board - Drag-and-drop with dependency visualization
  4. Execution monitor - Live dashboard showing agent progress
  5. Blocker resolution - Interactive Q&A interface
  6. Onboarding flow - First-time user experience

Tech Stack

  • Next.js with App Router
  • Shadcn/UI + Tailwind (Nova template)
  • Hugeicons
  • Real-time via WebSocket/SSE

Note: v1-legacy issues (labeled and closed) serve as reference for this phase.


Phase 4: Multi-Agent Coordination

Goal: Realize the "FRAME" vision - specialist agents working together.

Deliverables

  1. Agent roles ([Phase 4] Agent role system with specialized prompts #310)

    • Backend Agent, Frontend Agent, Test Agent, Review Agent
    • Role-specific system prompts and tool access
    • Automatic task-to-agent matching
  2. Parallel multi-agent execution

    • Multiple agents on independent tasks
    • Worker pool management
  3. Conflict detection & resolution ([Phase 4] Multi-agent conflict detection and resolution #311)

    • Identify concurrent modifications to same files
    • Strategies: serialize, merge, escalate to blocker
    • 90%+ automatic resolution target
  4. Handoff protocols ([Phase 4] Agent handoff protocols #312)

    • Context passing between roles
    • Implementation → Test → Review pipeline
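The first step of deliverable 3, identifying concurrent modifications to the same files, can be sketched as a small lookup. This assumes each agent declares the files it plans to touch; `find_conflicts` and the input shape are hypothetical, since the Phase 4 design is not specified here.

```python
from collections import defaultdict


def find_conflicts(planned: dict[str, set[str]]) -> dict[str, list[str]]:
    """Map each file claimed by more than one agent to the sorted agent names."""
    owners: dict[str, set[str]] = defaultdict(set)
    for agent, files in planned.items():
        for path in files:
            owners[path].add(agent)
    return {path: sorted(agents) for path, agents in owners.items() if len(agents) > 1}
```

Each returned entry is a candidate for one of the listed strategies, for example serializing the two agents on that file.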

Related Issues


Phase 5: Advanced Features & Polish

Goal: Power user features and production hardening.

Deliverables

  1. TUI Dashboard ([Phase 5] TUI Dashboard with Rich/Textual #313) - Rich/Textual terminal interface
  2. Token/cost tracking ([Phase 5] Token and cost tracking per task #314) - Usage metrics per task/batch
  3. Debug/replay mode ([Phase 5] Debug and replay mode #315) - Step through past executions
  4. Performance benchmarks ([Phase 5] Add performance and load testing benchmarks #115) - Baseline metrics
  5. Context optimization ([Future] Optimize Context Assembly Order for LLM Cache Hits #63, [Future] Implement Memory Eviction Strategy for COLD Tier #67) - Assembly order, eviction strategy

Execution Timeline

Phase 1 (CLI) ──────────────────────────────────►
                  │
                  ├── Phase 2 (Server) ────────────────►
                  │                     │
                  │                     ├── Phase 3 (UI) ──────►
                  │
                  └── Phase 4 (Multi-Agent) ────────────────────►
                                                          │
                                                          └── Phase 5 (Advanced) ──►
  • Phase 1 is prerequisite for everything
  • Phases 2-3 (server/UI) can run in parallel with Phase 4 (multi-agent)
  • Phase 5 depends on earlier phases but can start partially

GitHub Issue Organization

Labels

  • phase-1: CLI Foundation (7 issues - ALL CLOSED)
  • phase-2: Server Layer (6 issues - MOSTLY COMPLETE)
  • phase-4: Multi-Agent (10 issues)
  • phase-5: Advanced Features (5 issues)
  • v1-legacy: V1-specific issues, closed but retained as Phase 3 reference (22 issues)

Phase 1 Issues - ALL COMPLETE

| Issue | Title | Status |
|---|---|---|
| #307 | cf prd generate - Socratic Discovery | ✅ CLOSED |
| #308 | cf work follow - Live streaming | ✅ CLOSED |
| #309 | Integration tests for credential/env | ✅ CLOSED |
| #316 | PRD template system | ✅ CLOSED |
| #318 | PRD template support | ✅ CLOSED |
| #265 | NoneType error fix | ✅ CLOSED |
| #253 | Checkpoint diff API fix | ✅ CLOSED |

Phase 2 Issues - MOSTLY COMPLETE

| Issue | Title | Priority | Status |
|---|---|---|---|
| #322 | Server audit and refactor | HIGH | ✅ Complete |
| #325 | Phase 2 Server Layer PR | HIGH | ✅ Complete |
| #326 | API key authentication | HIGH | ✅ Complete |
| #327 | Rate limiting | HIGH | ✅ Complete |
| #323 | Real-time events (SSE/WebSocket) | HIGH | 🔄 Partial (SSE done) |
| #119 | OpenAPI documentation | MEDIUM | Open |
| #118 | API pagination | MEDIUM | Open |

Architecture Decisions

1. Core-first pattern maintained

Core remains headless. Server and CLI are equal adapters.

2. Integration tests as guardrail

The existing 130+ v2 router tests ensure "always working codebase" through all phases.

3. No big-bang UI rewrite

Web UI is built incrementally on v2 server, not by fixing v1.

4. Agent swarms are Phase 4, not Phase 1

Focus on single-agent excellence first, then parallelize.


Verification Plan

After each phase:

  1. Run full integration test suite: uv run pytest tests/cli/test_v2_cli_integration.py
  2. Manual smoke test of Golden Path
  3. Confirm no regressions in existing functionality

Summary

| Phase | Focus | Key Outcome | Status |
|---|---|---|---|
| 1 | CLI Completion | Production-ready headless agent | COMPLETE |
| 2 | Server Layer | REST API + real-time events | COMPLETE |
| 2.5 | ReAct Agent | Iterative tool-use execution engine | COMPLETE |
| 3 | Web UI | Modern dashboard | Planned |
| 4 | Multi-Agent | Agent swarms | Planned |
| 5 | Advanced | Power features | Planned |

Current focus: Phase 3 - Web UI rebuild on v2 foundation.

@claude

claude Bot commented Mar 17, 2026


Code Review Report: PRD View - Document Creation & Discovery

Date: 2026-02-05
Reviewer: Code Review Agent
Component: PRD View (PR #337, Issue #330)
Files Reviewed: 34 files (20 source, 8 test, 1 mock, 1 docs, 4 config)
Ready for Production: Yes, with 2 major issues recommended for near-term fix

Executive Summary

This PR implements the full PRD View for Phase 3 UI — a well-structured, component-driven implementation across 9 incremental commits. The code follows established project patterns (Shadcn/UI Nova template, Hugeicons, SWR, axios namespace pattern) and includes 64 new tests. Two major reliability issues were found (missing error handling in page.tsx handlers and a misused React hook in DiscoveryPanel), plus a few minor improvements. No critical security vulnerabilities.

Critical Issues: 0
Major Issues: 2
Minor Issues: 4
Positive Findings: 7


Review Context

Code Type: Frontend (Next.js React components, hooks, API client)
Risk Level: Medium (user input handling, file upload, SSE connections, AI chat rendering)
Business Constraints: Phase 3 UI rebuild — first user-facing view beyond workspace

Review Focus Areas

  • ✅ A03 - Injection/XSS — User markdown rendered via react-markdown, file upload content
  • ✅ Reliability — Error handling in async handlers, resource cleanup in SSE hooks
  • ✅ Resource Management — EventSource lifecycle, FileReader cleanup, SWR cache management
  • ✅ A06 - Vulnerable Components — Dependency audit
  • ❌ OWASP LLM Top 10 — Skipped (frontend doesn't interact with LLM directly)
  • ❌ Zero Trust / Auth — Skipped (auth is backend concern; API client already has withCredentials: true)
  • ❌ Performance — Skipped (not performance-critical UI code)

Priority 1 Issues - Critical ⛔

None found.


Priority 2 Issues - Major ⚠️

1. Missing error handling in handleSavePrd and handleGenerateTasks

Location: web-ui/src/app/prd/page.tsx:89-103 and web-ui/src/app/prd/page.tsx:118-127
Severity: Major
Category: Reliability

Problem:
Both handleSavePrd and handleGenerateTasks use try...finally without a catch block. If the API call fails, the error propagates as an unhandled rejection. Unlike DiscoveryPanel and UploadPRDModal (which properly catch and display errors), these handlers silently fail — the user sees the spinner stop but gets no feedback about what went wrong.

Current Code:

const handleSavePrd = async (content: string, changeSummary: string) => {
  if (!prd || !workspacePath) return;
  setIsSaving(true);
  try {
    const updated = await prdApi.createVersion(...);
    mutatePrd(updated, false);
  } finally {
    setIsSaving(false);
  }
};

Recommended Fix:

const handleSavePrd = async (content: string, changeSummary: string) => {
  if (!prd || !workspacePath) return;
  setIsSaving(true);
  try {
    const updated = await prdApi.createVersion(...);
    mutatePrd(updated, false);
  } catch (err) {
    const apiError = err as ApiError;
    console.error('[PRD] Save failed:', apiError.detail);
    // TODO: Show error toast/banner to user
  } finally {
    setIsSaving(false);
  }
};

Why This Fix Works:
Prevents unhandled promise rejections and records the failure in the console instead of dropping it. The logging alone does not give the user feedback; a toast/notification system would be the ideal UX, but at minimum logging prevents silent failures.


2. Misuse of useState as initializer in DiscoveryPanel

Location: web-ui/src/components/prd/DiscoveryPanel.tsx:68-70
Severity: Major
Category: Reliability / Correctness

Problem:
useState is being used with a callback to auto-start the discovery session on mount. This is an unconventional pattern — useState's initializer runs during the first render (synchronously), but here it triggers an async side effect (startSession()). This works coincidentally because React state initializers run once, but:

  1. It violates React's rules — side effects should use useEffect
  2. The async call fires during render, not after mount
  3. React StrictMode in development will call it twice

Current Code:

useState(() => {
  if (!sessionId) startSession();
});

Recommended Fix:

useEffect(() => {
  if (!sessionId) startSession();
  // eslint-disable-next-line react-hooks/exhaustive-deps
}, []);

Why This Fix Works:
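useEffect with [] runs after the component mounts, which is the correct lifecycle for firing API calls. The eslint-disable is needed because startSession and sessionId are intentionally excluded (we only want to run on mount). Note that React 18 StrictMode also runs mount effects twice in development, so startSession should tolerate a repeated call.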


Priority 3 Issues - Minor 📝

1. sessionId interpolated directly into URL path

Location: web-ui/src/lib/api.ts:258, 270
Severity: Minor
Category: A03 - Injection (Defense in depth)

Recommendation:
sessionId is interpolated into the URL path via template literal: `/api/v2/discovery/${sessionId}/answer`. While sessionId comes from the server (not user input), encoding it would add defense-in-depth against future misuse.

Suggested Approach:

`/api/v2/discovery/${encodeURIComponent(sessionId)}/answer`

This is a nitpick — the backend validates the session ID format, and the value originates from the server. No immediate risk.


2. No file size limit on upload

Location: web-ui/src/components/prd/UploadPRDModal.tsx:45-67
Severity: Minor
Category: Reliability

Recommendation:
The file upload handler reads the entire file into memory via FileReader.readAsText() without checking file size. A user could accidentally select a very large file (e.g., a binary misnamed .md), causing browser memory issues.

Suggested Approach:
Add a size check before reading:

const MAX_FILE_SIZE = 5 * 1024 * 1024; // 5 MB
if (file.size > MAX_FILE_SIZE) {
  setError('File too large (max 5 MB)');
  return;
}

3. genResp variable unused in handleGeneratePrd

Location: web-ui/src/components/prd/DiscoveryPanel.tsx:124
Severity: Minor
Category: Code Quality

Recommendation:
The response from discoveryApi.generatePrd() is assigned to genResp but never used — the function immediately fetches the full PRD via prdApi.getLatest(). This is correct behavior (the generate endpoint returns a preview, not the full PRD), but the unused variable should be removed for clarity.

Suggested Approach:

await discoveryApi.generatePrd(sessionId, workspacePath);
const fullPrd = await prdApi.getLatest(workspacePath);

4. DiscoveryPanel and PRDView lack test coverage

Location: web-ui/src/components/prd/DiscoveryPanel.tsx, PRDView.tsx
Severity: Minor
Category: Test Coverage

Recommendation:
DiscoveryPanel (0% coverage) and PRDView (0% coverage) are the two orchestrator components that tie everything together. While their child components are well-tested (93-100%), the orchestrators contain the async API flow logic (session start, answer submission, PRD generation) that is most likely to break in production.

Suggested Approach:
Add tests that mock @/lib/api and SWR, then verify:

  • DiscoveryPanel: session auto-start, answer submission flow, error display
  • PRDView: loading/empty/content state rendering, discovery toggle

Positive Findings ✨

Excellent Practices

  • Consistent error pattern: All API-facing components (DiscoveryPanel, UploadPRDModal, DiscoveryInput) use try/catch/finally with typed ApiError extraction — follows the project's established normalizeErrorDetail pattern.
  • SWR optimistic updates: mutatePrd(newPrd, false) correctly uses false for revalidate to avoid redundant refetches after mutations.
  • Proper React patterns: Stable callback refs in useEventSource prevent unnecessary effect re-runs. useCallback used consistently for handlers passed to children.

Good Architectural Decisions

  • Incremental commits: Each of the 9 commits is independently reviewable and represents a testable increment — excellent for bisecting bugs.
  • Component separation: Clear responsibility split (PRDView = layout, PRDHeader = actions, MarkdownEditor = content, DiscoveryPanel = chat lifecycle). Each component is independently testable.
  • Generic + specific hook pattern: useEventSource (generic SSE) wrapping into useTaskStream (typed for task events) is a clean, reusable pattern.

Security Wins

  • react-markdown 10.1.0: Uses micromark parser which does not support raw HTML by default — safe against XSS in markdown content without needing rehype-raw or rehype-sanitize.
  • accept attribute on file input: Limits file picker to .md,.markdown,.txt — client-side defense against wrong file types.
  • API client withCredentials: true: Already configured in the existing axios instance — cookies sent with cross-origin requests, matching backend auth pattern.
  • Zero npm audit vulnerabilities: npm audit --production shows 0 vulnerabilities.

Team Collaboration Needed

Handoffs to Other Agents

Architecture Agent:

  • The AppSidebar reads workspace state from localStorage independently of the workspace page's own state management. If the workspace is deselected on the home page, the sidebar relies on a storage event listener to update. Consider a shared React context for workspace state to ensure consistency.

UX Designer Agent:

  • Error feedback for handleSavePrd and handleGenerateTasks failures currently has no visual indicator — user sees spinner stop but no message. A toast/notification system should be prioritized.
  • Disabled nav items in the sidebar show as dimmed text with no tooltip on mobile (icon-only mode). Consider adding title tooltips on the icon-only view.

Testing Recommendations

Unit Tests Needed

  • PRDHeader (10 tests) ✅
  • AssociatedTasksSummary (4 tests) ✅
  • MarkdownEditor (8 tests) ✅
  • DiscoveryTranscript (6 tests) ✅
  • DiscoveryInput (9 tests) ✅
  • AppSidebar (7 tests) ✅
  • useEventSource (9 tests) ✅
  • useTaskStream (9 tests) ✅
  • DiscoveryPanel (mock API, test session lifecycle)
  • PRDView (test state-driven rendering)
  • UploadPRDModal (mock prdApi.create, test submit flow)

Integration Tests

  • Full discovery flow: mount → auto-start → answer questions → generate PRD
  • Upload PRD via paste → verify editor populated
  • Task generation → verify AssociatedTasksSummary updates

Future Considerations

Patterns for Project Evolution

  • Toast/notification system: Multiple components need user-facing error feedback. Consider adding a lightweight toast (e.g., Sonner or Shadcn Toast) before building more views.
  • Workspace context: As more pages are added (Tasks, Execution, Blockers, Review), workspace state should move from localStorage + per-page hooks to a shared React context.

Technical Debt Items


Compliance & Best Practices

Security Standards Met

  • ✅ No raw HTML rendering in markdown (react-markdown default config)
  • ✅ File upload restricted by accept attribute
  • ✅ API client uses withCredentials for cookie-based auth
  • ✅ No secrets or credentials in frontend code
  • ✅ Zero npm audit vulnerabilities
  • ✅ User input sent to API via POST body (not URL path), except workspace_path which is from localStorage

Enterprise Best Practices

  • ✅ TypeScript strict types for all API responses
  • ✅ Consistent error handling pattern across components
  • ✅ 64 tests with 93-100% coverage on tested components
  • ⚠️ Two orchestrator components (DiscoveryPanel, PRDView) at 0% coverage

Action Items Summary

Immediate (Before Merge - Recommended)

  1. Add catch blocks to handleSavePrd and handleGenerateTasks in page.tsx
  2. Replace useState() with useEffect() for auto-start in DiscoveryPanel.tsx

Short-term (Next Sprint)

  1. Add toast/notification system for error feedback
  2. Add tests for DiscoveryPanel and PRDView orchestrator components
  3. Add file size validation to UploadPRDModal

Long-term (Backlog)

  1. Workspace state context (replace localStorage reads per-component)
  2. encodeURIComponent for path-interpolated IDs in API client
  3. Remove unused genResp variable

Conclusion

This is a well-executed PR that delivers a complete PRD View with strong component architecture, comprehensive tests for leaf components, and proper security defaults. The two major issues (missing error handling and misused useState) are straightforward fixes that don't require architectural changes. The codebase follows established patterns consistently across all 20 source files.

Recommendation: Fix the 2 major issues, then merge. Short-term items can be addressed in follow-up.


Appendix

Tools Used for Review

  • Manual code review of all 34 changed files
  • npm audit --production — 0 vulnerabilities
  • npx tsc --noEmit — 0 new type errors
  • npx jest — 152/152 tests passing
  • react-markdown version check (v10.1.0 — safe defaults)

References

  • OWASP Top 10 Web Application Security (A03, A06, A07)
  • React Rules of Hooks documentation
  • react-markdown security model (micromark parser, no raw HTML by default)

Metrics

  • Lines of Code Reviewed: ~2,400 (source), ~880 (tests)
  • Components Reviewed: 10 new components, 2 hooks, 1 API client extension
  • Security Patterns Checked: 6 (XSS, injection, file upload, auth headers, dependency audit, resource cleanup)

@claude

claude Bot commented Mar 17, 2026


Code Review: feat(replay): debug and replay mode

@coderabbitai coderabbitai Bot left a comment

Contributor


🧹 Nitpick comments (1)
tests/core/test_execution_recording.py (1)

141-157: Decouple buffering test from default flush_interval.

At Line 142, this test relies on the current default flush threshold being greater than 4. Make the threshold explicit in-test to avoid future brittle failures if defaults change.

♻️ Proposed change
-        recorder = ExecutionRecorder(workspace=workspace, run_id="run-1")
+        recorder = ExecutionRecorder(workspace=workspace, run_id="run-1", flush_interval=100)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/core/test_execution_recording.py` around lines 141 - 157, The test
test_flush_writes_buffered_records assumes the default flush threshold is >4;
make it explicit by constructing the ExecutionRecorder with a known flush
interval (e.g., flush_interval=10) so the four buffered records remain in-memory
until recorder.flush() is called; update the recorder instantiation in
test_flush_writes_buffered_records (the ExecutionRecorder(...) call) to include
the flush_interval argument and keep the rest of the assertions unchanged.
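The point of the nitpick can be shown with a self-contained sketch. `BufferedRecorder` below is a hypothetical stand-in that only mimics threshold-based flushing; it is not the project's ExecutionRecorder, whose constructor and storage are not shown in this thread.

```python
class BufferedRecorder:
    """Hypothetical recorder that batches writes until a threshold is reached."""

    def __init__(self, flush_interval: int = 10) -> None:
        self.flush_interval = flush_interval
        self.buffer: list[str] = []
        self.persisted: list[str] = []

    def record(self, item: str) -> None:
        self.buffer.append(item)
        # Auto-flush once the buffer reaches the configured threshold.
        if len(self.buffer) >= self.flush_interval:
            self.flush()

    def flush(self) -> None:
        self.persisted.extend(self.buffer)
        self.buffer.clear()


def test_flush_writes_buffered_records() -> None:
    # The threshold is stated in the test, so a changed default cannot break it.
    recorder = BufferedRecorder(flush_interval=100)
    for i in range(4):
        recorder.record(f"step-{i}")
    assert recorder.persisted == []  # still buffered, nothing written yet
    recorder.flush()
    assert len(recorder.persisted) == 4
```

If the test relied on the default instead, lowering that default to 4 or less would auto-flush during the loop and the "still buffered" assertion would fail for reasons unrelated to flush().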

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 14081693-3a50-45de-9ef5-adb55940e293

📥 Commits

Reviewing files that changed from the base of the PR and between de742c7 and dc2ad33.

📒 Files selected for processing (1)
  • tests/core/test_execution_recording.py

@claude

claude Bot commented Mar 17, 2026


CodeFRAME


The IDE of the future is not a better text editor with AI autocomplete. It is a project delivery system where writing code is a subprocess.


The Problem

Coding agents are getting remarkably good at writing code. But shipping software is not the same as writing code.

Before code gets written, someone has to figure out what to build, decompose it into tasks that an agent can execute, and resolve ambiguities. After code gets written, someone has to verify it actually works, catch regressions, and deploy with confidence. Today, that "someone" is still you.

CodeFRAME owns the edges of the pipeline -- everything that happens before and after the code gets written. The actual coding is delegated to frontier agents (Claude Code, Codex, OpenCode, or CodeFRAME's built-in ReAct agent) that are better at it than any custom agent could be.

Think. Build. Prove. Ship.

THINK    What are you building? How should it be broken down?
           cf prd generate         Socratic requirements gathering
           cf prd stress-test      Recursive decomposition, surface ambiguities  [planned]
           cf tasks generate       Atomic tasks with dependency graphs

BUILD    Delegate to the best coding agent for the job
           cf work start --engine  Claude Code, Codex, OpenCode, or built-in
           CodeFRAME owns: verification gates, self-correction, stall detection

PROVE    Is the output any good?
           cf proof run            9-gate evidence-based quality system           [planned]
           cf proof capture        Glitch becomes a permanent requirement         [planned]

SHIP     Deploy with confidence
           cf pr create            PR with proof report attached
           cf pr merge             Only merges if proof passes

THE CLOSED LOOP
  Glitch in production
    -> cf proof capture
    -> New requirement
    -> Enforced on every future build
    = Quality compounding interest

Why CodeFRAME

Nobody else does the full upstream pipeline. Most orchestrators assume issues and specs already exist. CodeFRAME generates them through AI-guided Socratic discovery and recursive decomposition.

Agent-agnostic execution. CodeFRAME does not compete with Claude Code or Codex. It orchestrates them. The built-in ReAct agent is a capable fallback, not the point.

Quality memory (PROOF9). Every failure becomes a permanent proof obligation across 9 verification gates. Not just test coverage -- evidence-based verification that compounds over time. The closed loop is what turns a project into a learning system.

Radical simplicity. Single CLI binary, SQLite, no daemons, no infrastructure. Install and start building in under a minute.


Quick Start

# Install
git clone https://github.com/frankbria/codeframe.git
cd codeframe
curl -LsSf https://astral.sh/uv/install.sh | sh
uv venv && source .venv/bin/activate && uv sync
export ANTHROPIC_API_KEY="your-key"

# Initialize a project
cd /path/to/your/project
cf init . --detect

# Generate requirements through AI-guided discovery
cf prd generate

# Decompose into atomic tasks
cf tasks generate

# Execute (delegates to the agent engine)
cf work start <task-id> --execute

# Ship
cf pr create

That is the entire workflow. Everything else is optional.


Architecture

    YOU
     |
     v
  +-THINK---------------------------------------------+
  |  cf prd generate    Socratic requirements          |
  |  cf tasks generate  Atomic decomposition           |
  +----------------------------+-----------------------+
                               |
                               v
  +-BUILD---------------------------------------------+
  |  cf work start --engine <agent>                    |
  |                                                    |
  |  +-- Claude Code / Codex / OpenCode / ReactAgent   |
  |  |                                                 |
  |  +-- Verification gates (ruff, pytest, BUILD)      |
  |  +-- Self-correction loop (up to 5 retries)        |
  |  +-- Stall detection -> retry / blocker / fail     |
  +----------------------------+-----------------------+
                               |
                               v
  +-PROVE---------------------------------------------+
  |  cf proof run       9-gate quality system [planned]|
  |  cf review          Verification gates             |
  +----------------------------+-----------------------+
                               |
                               v
  +-SHIP----------------------------------------------+
  |  cf pr create       PR with proof report           |
  |  cf pr merge        Merge if proof passes          |
  +---------------------------------------------------+
                               |
            Glitch in production?
                               |
                               v
            cf proof capture -> new requirement
            -> enforced forever (closed loop)

The core domain is headless and runs entirely from the CLI. The FastAPI server and web UI are optional adapters for teams that want a dashboard.


CLI Reference

THINK -- Requirements and Planning

# Workspace
cf init <path>                        # Initialize workspace
cf init <path> --detect               # Auto-detect tech stack
cf status                             # Workspace status

# Requirements
cf prd generate                       # AI-guided Socratic PRD creation
cf prd generate --template lean       # Use a specific template
cf prd add <file.md>                  # Import existing PRD
cf prd show                           # Display current PRD

# Task decomposition
cf tasks generate                     # Generate tasks from PRD (LLM-powered)
cf tasks list                         # List all tasks
cf tasks list --status READY          # Filter by status
cf tasks show <id>                    # Task details with dependencies

# Scheduling
cf schedule show                      # Task schedule with dependencies
cf schedule predict                   # Completion date estimates
cf schedule bottlenecks               # Identify blocking tasks

BUILD -- Execution

# Single task
cf work start <id> --execute          # Execute with default engine (ReAct)
cf work start <id> --execute --engine plan   # Use legacy plan engine
cf work start <id> --execute --verbose       # Detailed progress output
cf work start <id> --execute --dry-run       # Preview without applying
cf work start <id> --execute --stall-timeout 120   # Custom stall timeout (seconds)
cf work start <id> --execute --stall-action retry  # Auto-retry on stall (blocker|retry|fail)
cf work follow <id>                   # Stream live output
cf work stop <id>                     # Cancel a run
cf work resume <id>                   # Resume after answering blockers

# Batch execution
cf work batch run --all-ready                # All READY tasks
cf work batch run --strategy parallel        # Parallel execution
cf work batch run --strategy auto            # LLM-inferred dependencies
cf work batch run --retry 3                  # Auto-retry failures
cf work batch status [batch_id]              # Batch progress
cf work batch resume <batch_id>              # Re-run failed tasks

# Blockers (human-in-the-loop)
cf blocker list                       # Questions the agent needs answered
cf blocker show <id>                  # Blocker details
cf blocker answer <id> "answer"       # Unblock the agent

# Diagnostics
cf work diagnose <id>                 # AI-powered failure analysis
cf env check                          # Validate environment
cf env doctor                         # Comprehensive health check

PROVE -- Verification

cf review                             # Run verification gates
cf checkpoint create "milestone"      # Snapshot project state
cf checkpoint list                    # List checkpoints
cf checkpoint restore <id>            # Roll back to checkpoint

SHIP -- Delivery

cf pr create                          # Create PR from current branch
cf pr status                          # PR status and review state
cf pr checks                          # CI check results
cf pr merge                           # Merge approved PR
cf commit                             # Commit verified changes
cf patch export                       # Export changes as patch

What Works Today

CodeFRAME v2 (Phase 2.5 complete) delivers the full Think-Build-Ship loop:

  • THINK: Socratic PRD generation, LLM-powered task decomposition with dependency graphs, 5 PRD templates, 7 task templates, CPM-based scheduling
  • BUILD: ReAct agent with 7 tools, self-correction with loop prevention, verification gates (ruff/pytest/BUILD), stall detection with configurable recovery (retry/blocker/fail), batch execution (serial/parallel/auto), human-in-the-loop blockers, checkpointing, state persistence
  • SHIP: GitHub PR workflow, environment validation, task self-diagnosis
  • Server layer (optional): FastAPI with 15 v2 routers, API key auth, rate limiting, SSE streaming, OpenAPI docs
  • Web UI (Phase 3, partial): Workspace view, PRD view with discovery, Task board with Kanban and batch execution, Blocker resolution, Review and commit with diff viewer
  • Test suite: 4200+ tests, 88% coverage

Roadmap

THINK (upstream pipeline)

  • cf prd stress-test -- Recursive decomposition that surfaces ambiguities before execution
  • Multi-round PRD refinement with domain-specific probes
  • Specification-level dependency analysis

BUILD (agent adapters)

  • Agent adapter architecture -- delegate to Claude Code, Codex, OpenCode via workspace hooks
  • Worktree isolation for parallel agent execution
  • Engine performance tracking and automatic routing
  • Reconciliation layer for multi-agent output

PROVE (quality memory)

  • PROOF9 -- 9-gate evidence-based quality system
  • cf proof capture -- Glitch-to-requirement closed loop
  • Quality compounding: every failure becomes a permanent proof obligation
  • Per-engine quality scoring

SHIP (delivery confidence)

  • Proof report attached to PRs
  • Merge gating on PROOF9 pass
  • Unified configuration (cf config)
  • Deployment hooks

Web UI

  • Blocker Resolution view
  • Review and Commit view with diff viewer
  • Execution Monitor view

Configuration

# Required
export ANTHROPIC_API_KEY=sk-ant-...

# Optional
export DATABASE_PATH=./codeframe.db         # Default: in-memory SQLite
export RATE_LIMIT_ENABLED=true              # API rate limiting
export RATE_LIMIT_DEFAULT=100/minute        # Default limit

For server configuration, rate limiting options, and API key setup, see docs/PHASE_2_DEVELOPER_GUIDE.md.


Testing

uv run pytest                          # All tests
uv run pytest -m v2                    # v2 tests only
uv run pytest tests/core/             # Core module tests
uv run pytest --cov=codeframe --cov-report=html   # With coverage

Documentation


Contributing

  1. Fork and clone the repository
  2. Install dependencies: uv sync
  3. Install pre-commit hooks: pre-commit install
  4. Run tests: uv run pytest
  5. Submit PR with tests and clear description

Code standards: PEP 8, ruff for linting, type hints required, 85%+ test coverage.


License

AGPL-3.0 -- Free to use, modify, and distribute. Derivative works and network services must release source code under the same license.


Built by Frank Bria

Issues | Discussions | Documentation

@claude

claude Bot commented Mar 17, 2026

Code Review: feat(replay): debug and replay mode

Well-structured addition with good test coverage. Core architecture is solid (headless replay.py, optional recorder in ReactAgent, clean CLI/data separation). A few issues need attention before merge.

Bug: work_diff crashes on empty step list

In app.py, the default to_step calculation raises ValueError if trace.steps is empty:
step_b = to_step if to_step is not None else max(s.step_number for s in trace.steps)
If a run recorded zero steps (agent failed immediately), this max() call crashes. Use max((s.step_number for s in trace.steps), default=0) or guard with an early return.
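A minimal sketch of the guarded default follows. The Step dataclass and the resolve_to_step helper are hypothetical stand-ins for illustration; only step_number and the max(..., default=0) idea come from the comment above.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Step:
    # Hypothetical stand-in for a recorded step; only step_number is used here.
    step_number: int


def resolve_to_step(to_step: Optional[int], steps: Sequence[Step]) -> int:
    """Return the explicit to_step, else the highest recorded step number (0 if none)."""
    if to_step is not None:
        return to_step
    # default=0 keeps max() from raising ValueError on an empty iterable.
    return max((s.step_number for s in steps), default=0)
```

With this shape, an empty trace resolves to step 0 instead of raising, and an explicit to_step always wins over the computed default.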

Gap: Recorder never wired up in builtin.py

_build_agent() in codeframe/core/adapters/builtin.py constructs ReactAgent without passing execution_recorder, so no traces are actually recorded during normal cf work start --execute runs. The recorder parameter exists but has no production callsite.

The PR notes this is intentional, but it means cf work replay will always return 'no trace found' for real runs. A follow-up ticket to wire this up from builtin.py (or an explicit --record flag) would prevent user confusion.

Style: format parameter shadows built-in

In work_export_trace, format: str = typer.Option() shadows Python's built-in. Rename to output_format or fmt.

Maint: Schema DDL duplicated in init and upgrade paths

The three new table DDL statements are copy-pasted verbatim into both _init_database and _ensure_schema_upgrades. Since the upgrade path already uses CREATE TABLE IF NOT EXISTS, the tables only need to appear there.

UX: work_rerun command name implies execution it does not do

The command shows what would be re-run but does not actually execute anything. Consider cf work replay-plan or add an --apply flag stub with a clear 'not yet implemented' message.

Minor: delete operation type in schema is unreachable

The DB schema enforces CHECK (operation_type IN ('create', 'edit', 'delete')) but react_agent.py only hooks edit_file and create_file. Dead schema -- either add delete recording or drop the variant from the constraint.

Minor: work_rerun has no CLI tests

test_replay_commands.py covers replay, diff, and export-trace but skips rerun. Add at minimum a happy-path and a step-not-found error case.

Nit: flush() eats all exceptions at DEBUG level

Correct to not crash the agent, but a schema error or disk-full will be invisible without debug logging. Consider logger.warning instead.
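One way this could look, as a rough sketch rather than the actual ExecutionRecorder: the BufferedRecorder class, the steps table, and its columns are all assumptions made up for illustration. The sketch logs at WARNING and also keeps the buffered rows when the write fails, which is a design choice beyond what this nit asks for.

```python
import logging
import sqlite3
from typing import List, Tuple

logger = logging.getLogger("recorder_sketch")


class BufferedRecorder:
    """Hypothetical recorder: buffers rows and writes them in one transaction."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        self.pending: List[Tuple[int, str]] = []

    def record(self, step_number: int, summary: str) -> None:
        self.pending.append((step_number, summary))

    def flush(self) -> bool:
        """Write buffered rows; keep them for a later retry if the write fails."""
        if not self.pending:
            return True
        try:
            with self.conn:  # commits on success, rolls back on exception
                self.conn.executemany(
                    "INSERT INTO steps (step_number, summary) VALUES (?, ?)",
                    self.pending,
                )
        except sqlite3.Error as exc:
            # WARNING rather than DEBUG so schema or disk errors are visible.
            logger.warning("flush failed, keeping %d rows: %s", len(self.pending), exc)
            return False
        self.pending.clear()
        return True
```

The agent never sees an exception from flush(), but the failure shows up at the default log level and the data is still available if a later flush succeeds.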

Summary table:

  • Bug: work_diff max() crash
  • Gap: recorder not wired in production
  • Style: format shadows built-in
  • Maint: schema DDL duplication
  • UX: rerun name misleads
  • Minor x2: dead delete schema, missing rerun tests
  • Nit: DEBUG vs WARNING

Fix the max() bug and track the recorder wiring before merge.

…format shadow (#315)

- Fix data loss in ExecutionRecorder.flush(): only clear buffers on
  success, retain data for retry on failure, log at WARNING not DEBUG
- Fix edit_file recording: read actual file content after edit instead
  of capturing search/replace snippet (create_file still uses tool input)
- Rename format parameter to output_format to avoid shadowing builtin
@claude

claude Bot commented Mar 17, 2026

Code Review: feat(replay): debug and replay mode (#449)

Overall this is a well-structured addition. The feature is clearly scoped, the module is headless (no FastAPI), the ExecutionRecorder integration on ReactAgent is entirely opt-in, and the test coverage (78 new tests across unit, integration, and CLI layers) is thorough. A few issues worth addressing before merge, ranging from a potential crash to data-quality gaps.


Critical

work_diff crashes on empty trace steps

In codeframe/cli/app.py, the to_step default calculation uses an unguarded max() call:

step_b = to_step if to_step is not None else max(s.step_number for s in trace.steps)

load_execution_trace returns None for missing runs (handled above), but trace.steps could legitimately be an empty list if a run was created but never iterated. max() on an empty sequence raises ValueError. Suggested fix:

step_b = to_step if to_step is not None else (
    max(s.step_number for s in trace.steps) if trace.steps else 0
)

Important

content_before is always None for edit_file operations

In codeframe/core/react_agent.py, when recording a file edit the before-content is hardcoded to None. The after-content is correctly read from disk post-edit, but the before-content should be read from disk before execute_tool is called. Without it, compare_steps and get_step_snapshot produce correct cumulative state (they use content_after), but the stored content_before on FileOperation is misleading for the edit case. The _seed_three_step_trace test helper manually populates a non-None content_before for the edit step, so this gap is not caught by the test suite. Either read the file before executing the edit tool, or add a comment documenting this known limitation.
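To illustrate why the cumulative state stays correct even when content_before is missing, here is a simplified sketch of state reconstruction by replaying file operations. The FileOp dataclass, its field names, and snapshot_at are assumptions for illustration and are not the real get_step_snapshot; the delete branch is included only for completeness.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass
class FileOp:
    # Hypothetical, simplified record; field names are assumptions.
    step_number: int
    path: str
    op_type: str  # "create", "edit", or "delete"
    content_after: Optional[str]


def snapshot_at(ops: Iterable[FileOp], step: int) -> Dict[str, str]:
    """Replay operations up to and including `step` to rebuild file contents."""
    state: Dict[str, str] = {}
    # sorted() is stable, so operations within one step keep their recorded order.
    for op in sorted(ops, key=lambda o: o.step_number):
        if op.step_number > step:
            break
        if op.op_type == "delete":
            state.pop(op.path, None)
        elif op.content_after is not None:
            state[op.path] = op.content_after
    return state
```

Only content_after is read during the fold, so a None content_before affects what is stored on each record, not the reconstructed snapshot.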

_maybe_flush threshold counts across all three buffers combined

The auto-flush fires when total >= flush_interval (default 10). This means 9 file operations alone will not trigger a flush, but adding 1 step will. An agent that produces many file ops per iteration may buffer an unexpectedly large number of records. Consider per-buffer thresholds or at least document this as intentional.
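The difference between the two policies can be sketched as follows. FlushPolicy, its buffer names, and the per_buffer switch are hypothetical; only flush_interval and its default of 10 come from the comment above.

```python
from typing import Dict, List


class FlushPolicy:
    """Hypothetical buffer bookkeeping comparing combined vs per-buffer thresholds."""

    def __init__(self, flush_interval: int = 10, per_buffer: bool = False) -> None:
        self.flush_interval = flush_interval
        self.per_buffer = per_buffer
        self.buffers: Dict[str, List[object]] = {"steps": [], "llm": [], "file_ops": []}

    def add(self, kind: str, record: object) -> bool:
        """Buffer a record and report whether a flush should fire now."""
        self.buffers[kind].append(record)
        return self.should_flush()

    def should_flush(self) -> bool:
        sizes = [len(b) for b in self.buffers.values()]
        if self.per_buffer:
            # Fire as soon as any single buffer reaches the interval.
            return max(sizes) >= self.flush_interval
        # Combined policy: fire when the total across all buffers reaches it.
        return sum(sizes) >= self.flush_interval
```

Under the combined policy, nine file operations plus one step trigger a flush; under the per-buffer policy the same sequence does not, and the flush waits for the tenth file operation.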


Minor / Nice-to-have

Schema duplication between _init_database and _ensure_schema_upgrades

The three new tables and their six indexes are copy-pasted verbatim in both functions in workspace.py (~60 lines duplicated). Both use CREATE TABLE IF NOT EXISTS so it works correctly, but extracting a shared helper would make future schema changes easier to maintain.
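A shared helper might look roughly like this. The table names come from the PR summary, but the columns, the index name, and ensure_replay_schema itself are trimmed-down assumptions, not the DDL in workspace.py.

```python
import sqlite3

# Hypothetical, trimmed-down DDL; the real column lists live in workspace.py.
REPLAY_SCHEMA = (
    "CREATE TABLE IF NOT EXISTS execution_steps ("
    " id INTEGER PRIMARY KEY, run_id TEXT NOT NULL, step_number INTEGER NOT NULL)",
    "CREATE TABLE IF NOT EXISTS llm_interactions ("
    " id INTEGER PRIMARY KEY, step_id INTEGER NOT NULL)",
    "CREATE TABLE IF NOT EXISTS file_operations ("
    " id INTEGER PRIMARY KEY, step_id INTEGER NOT NULL, path TEXT NOT NULL)",
    "CREATE INDEX IF NOT EXISTS idx_execution_steps_run ON execution_steps (run_id)",
)


def ensure_replay_schema(conn: sqlite3.Connection) -> None:
    """Idempotently create the replay tables; safe from both init and upgrade paths."""
    with conn:
        for statement in REPLAY_SCHEMA:
            conn.execute(statement)
```

Both _init_database and _ensure_schema_upgrades could then call the one helper, so a future column or index change is made in a single place.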

work_diff coerces None to "" without a comment

The before = change["before"] or "" coercion is used only to satisfy difflib.unified_diff, but the display branch above it correctly checks if change["before"] is None. A short inline comment would clarify intent.
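The intent of the coercion can be made explicit in a small helper. This is a sketch under assumptions: render_diff and the /dev/null labels are illustrative choices, not the code in work_diff; difflib.unified_diff is the standard-library call already referenced above.

```python
import difflib
from typing import Optional


def render_diff(path: str, before: Optional[str], after: Optional[str]) -> str:
    """Return a unified diff; None means the file did not exist on that side."""
    # unified_diff needs sequences of lines, so a missing side becomes "".
    before_lines = (before or "").splitlines(keepends=True)
    after_lines = (after or "").splitlines(keepends=True)
    return "".join(
        difflib.unified_diff(
            before_lines,
            after_lines,
            fromfile=f"a/{path}" if before is not None else "/dev/null",
            tofile=f"b/{path}" if after is not None else "/dev/null",
        )
    )
```

The None check is kept for the labels while the empty string is used only as diff input, which mirrors the split between the display branch and the coercion described above.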

ReplaySession is not used by any CLI command

The class is well-designed and tested, but all four CLI commands build their own ops_by_step / llm_by_step lookups directly instead of using ReplaySession. A brief docstring note like "intended for future interactive / TUI use" would prevent confusion about why it exists.

--show-llm truncates without indicating truncation

The display slice at 200 chars is fine, but appending "..." when the content is longer would improve UX.
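A small helper along these lines would cover it. The truncate name and its marker parameter are hypothetical; the 200-character limit is the one mentioned above.

```python
def truncate(text: str, limit: int = 200, marker: str = "...") -> str:
    """Shorten text to `limit` characters, appending a marker only when cut."""
    if len(text) <= limit:
        return text
    return text[:limit] + marker
```

Content at or under the limit is returned unchanged, so the marker only ever signals that something was actually cut.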


Conventions Check (CLAUDE.md)

  • codeframe/core/replay.py has no FastAPI/HTTP imports - headless requirement satisfied.
  • ExecutionRecorder integration is fully guarded with if self.execution_recorder is not None - backward compatibility maintained.
  • CLI commands work without a running server - Golden Path compliance satisfied.
  • ExecutionRecorder is referenced via TYPE_CHECKING guard in react_agent.py - clean import.
  • New tables use CREATE TABLE IF NOT EXISTS plus _ensure_schema_upgrades - existing workspace migration handled correctly.

The two items to fix before merge are the max() crash on empty step lists (critical) and the missing content_before for edit operations (data quality). Everything else is non-blocking.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (4)
codeframe/cli/app.py (3)

3378-3405: ⚠️ Potential issue | 🟠 Major

export-trace currently exports a partial trace shape.

This command claims full trace export, but codeframe/core/replay.py exporters currently omit trace.llm_interactions and file before/after contents, so offline analysis/reconstruction is incomplete.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@codeframe/cli/app.py` around lines 3378 - 3405, The exporters used by the
export-trace command (export_trace_json and export_trace_markdown in
codeframe.core.replay) currently omit trace.llm_interactions and file
before/after contents, so update those functions to include the full trace
shape: add trace.llm_interactions (preserving interaction metadata and tokens)
to the JSON/Markdown output and include file before_contents and after_contents
(or full file diffs) for each file change entry; ensure load_execution_trace
continues to return these fields and that export_trace_markdown formats/embeds
the file contents or diffs (not just filenames) so offline consumers can
reconstruct the run.
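As a rough illustration of a fuller export shape, the sketch below serializes LLM interactions and file contents alongside the run id. The dataclasses, their field names, and export_full_trace_json are assumptions for illustration and do not reflect the actual models or exporters in codeframe/core/replay.py.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


# Hypothetical, simplified trace shapes; field names are assumptions.
@dataclass
class LLMInteraction:
    step_number: int
    model: str
    tokens_used: int
    prompt_summary: str


@dataclass
class FileChange:
    step_number: int
    path: str
    content_before: Optional[str]
    content_after: Optional[str]


@dataclass
class Trace:
    run_id: str
    llm_interactions: List[LLMInteraction] = field(default_factory=list)
    file_operations: List[FileChange] = field(default_factory=list)


def export_full_trace_json(trace: Trace) -> str:
    """Serialize every field, including LLM calls and file contents."""
    return json.dumps(asdict(trace), indent=2, sort_keys=True)
```

Because asdict walks nested dataclasses, new fields added to the models show up in the export without a separate change to the exporter.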

3138-3246: ⚠️ Potential issue | 🟠 Major

work replay still isn’t a replay session.

Without --step, Line 3219 just prints all steps and exits. There’s no interactive next/prev/jump navigation, so users still can’t step through execution as replay mode implies.

Suggested direction
-        for s in steps_to_show:
-            ...
+        if step is None:
+            # start interactive replay loop: next/prev/jump/show-llm/quit
+            # render one current step at a time
+            ...
+        else:
+            # single-step render
+            ...
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@codeframe/cli/app.py` around lines 3138 - 3246, The command currently prints
all steps when no --step is given instead of an interactive replay; update
work_replay so that when step is None it enters an interactive loop letting the
user navigate next/prev/jump/quit (commands like n/p/j <num>/q), updates a
current index over trace.steps, and renders only the current step. To implement:
extract the per-step rendering logic (the block that prints status, files and
LLM output using ops_by_step, llm_by_step, show_files and show_llm) into a
helper (e.g., render_step) and call it from the interactive loop; accept user
input via console.input (or typer.prompt), adjust the index on n/p/j commands,
validate bounds and show helpful prompts, and exit the loop on q. Keep existing
single-step behavior (when --step is provided) unchanged.
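The navigation part of such a loop can be sketched as a pure function, which keeps it testable without a terminal. The navigate helper is hypothetical; a real command would feed it input read from console.input or typer.prompt and call the extracted render helper for each returned index.

```python
from typing import Iterable, List


def navigate(step_count: int, commands: Iterable[str], start: int = 0) -> List[int]:
    """Apply n/p/j <num>/q commands and return the indexes that would be rendered.

    Indexes are 0-based positions; out-of-range moves leave the cursor in place.
    """
    if step_count <= 0:
        return []
    current = start
    rendered = [current]
    for raw in commands:
        parts = raw.strip().split()
        if not parts:
            continue
        cmd = parts[0].lower()
        if cmd == "q":
            break
        if cmd == "n":
            target = current + 1
        elif cmd == "p":
            target = current - 1
        elif cmd == "j" and len(parts) == 2 and parts[1].isdigit():
            target = int(parts[1]) - 1  # users type 1-based step numbers
        else:
            continue  # unknown input: ignore and wait for the next command
        if 0 <= target < step_count:
            current = target
            rendered.append(current)
    return rendered
```

Bounds checking lives in one place, so next at the last step and prev at the first step are simply ignored rather than raising.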

3411-3463: ⚠️ Potential issue | 🟠 Major

work rerun prepares state but never reruns.

Line 3442 only calls prepare_rerun() and then prints metadata; it does not restore workspace file state nor start a new execution run, so the command behavior doesn’t match rerun semantics.

Suggested direction
-        rerun_info = prepare_rerun(workspace, run_id, from_step)
-        # print only
+        rerun_info = prepare_rerun(workspace, run_id, from_step)
+        # restore rerun_info["file_state"] to workspace
+        # create/start a new run for rerun_info["task_id"]
+        # invoke runtime.execute_agent(...) and report status
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@codeframe/cli/app.py` around lines 3411 - 3463, work_rerun currently only
calls prepare_rerun and prints info but never restores files or starts
execution; update work_rerun to accept an explicit flag (e.g., --apply /
--execute) and when set: 1) obtain file_state from prepare_rerun and restore
those files into the workspace (use get_workspace(path) and the workspace API to
write/overwrite each path in rerun_info["file_state"]), and 2) invoke the replay
execution routine (call a function such as
codeframe.core.replay.execute_rerun(workspace, run_id, from_step) or the
existing runner API to start/resume the run) and print the run outcome; keep
prepare_rerun, get_workspace and rerun_info keys ("file_state",
"remaining_steps", "task_id") as the reference points for locating and applying
the changes.
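The file restoration step could be sketched like this, assuming file_state maps repo-relative paths to full file contents. The restore_file_state helper is hypothetical, and execute_rerun in the prompt above is likewise a suggested name rather than an existing function; this sketch covers only the write-back, not the new execution.

```python
from pathlib import Path
from typing import Dict, List


def restore_file_state(repo_root: Path, file_state: Dict[str, str]) -> List[Path]:
    """Write each recorded file under repo_root, refusing paths that escape it."""
    root = repo_root.resolve()
    written: List[Path] = []
    for rel_path, content in file_state.items():
        target = (root / rel_path).resolve()
        # Guard against "../" or absolute paths stored in a trace.
        if root not in target.parents:
            raise ValueError(f"path escapes workspace: {rel_path}")
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content)
        written.append(target)
    return written
```

Resolving each target before writing means a trace containing a path outside the workspace fails loudly instead of overwriting unrelated files.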
codeframe/core/react_agent.py (1)

465-488: ⚠️ Potential issue | 🟠 Major

Trace recording still skips the verification/fix execution path.

Line 465 instruments only _react_loop(), but _run_final_verification() also performs LLM calls and tool executions. Runs needing verification retries will export/replay incomplete traces and reconstruct the wrong checkpoint state.

Suggested direction
+# in _run_final_verification(), for each correction turn:
+# 1) record_iteration(...)
+# 2) record_llm_call(...)
+# 3) record_file_operation(...) for successful create/edit/delete tools
+# Reuse the same recording helper used by _react_loop() to keep behavior consistent.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@codeframe/core/react_agent.py` around lines 465 - 488, The trace recording
currently only wraps `_react_loop()` LLM/tool interactions, so calls made inside
`_run_final_verification()` are not recorded and lead to incomplete traces;
update the logic where `_rec_step_id` and subsequent
`execution_recorder.record_iteration(...)` / `record_llm_call(...)` are invoked
to also run for LLM/tool responses originating from `_run_final_verification()`
(or factor the recording into a helper used by both `_react_loop` and
`_run_final_verification`), ensuring you pass the same fields
(step_number/iterations, tool_names from `response.tool_calls`,
llm_response_summary, prompt_summary, model, tokens_used using
`response.input_tokens + response.output_tokens`, and purpose="execution") so
verification retries are included in exported/replayed traces.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@codeframe/core/react_agent.py`:
- Around line 569-597: The execution recording currently uses the create_file
input content and omits the pre-change state for edits, so update the recording
logic in the block handling execution_recorder/_rec_step_id for tc.name in
("edit_file","create_file") to: for create_file, read the actual on-disk file
content after the tool runs (using self.workspace.repo_path / path) and use that
as _op_after instead of tc.input.get("content"); for edit_file, read and supply
both the pre-change content (before) by reading the file prior to the edit (if
present) and the post-change content (after) by reading the file after the edit
into _op_after; then call
execution_recorder.record_file_operation(step_id=_rec_step_id, op_type=_op_type,
path=_op_path, before=<pre-content-or-None>, after=_op_after) so edits have a
non-None before value and creates reflect actual disk state.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 3cd06411-ca01-41bf-943c-74ee7739abe5

📥 Commits

Reviewing files that changed from the base of the PR and between dc2ad33 and c46c439.

📒 Files selected for processing (3)
  • codeframe/cli/app.py
  • codeframe/core/react_agent.py
  • codeframe/core/replay.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • codeframe/core/replay.py

Comment on lines +569 to +597
# --- Execution recording: file operations ---
if (
    self.execution_recorder is not None
    and _rec_step_id is not None
    and tc.name in ("edit_file", "create_file")
    and not result.is_error
):
    _op_type = "create" if tc.name == "create_file" else "edit"
    _op_path = tc.input.get("path", "")
    if tc.name == "create_file":
        # create_file input has the full content
        _op_after = tc.input.get("content", "")
    else:
        # edit_file uses search/replace snippets — read the
        # actual file content after the edit for accurate state.
        _op_after = None
        try:
            _full_path = self.workspace.repo_path / _op_path
            if _full_path.is_file():
                _op_after = _full_path.read_text(errors="replace")
        except OSError:
            pass
    self.execution_recorder.record_file_operation(
        step_id=_rec_step_id,
        op_type=_op_type,
        path=_op_path,
        before=None,
        after=_op_after,
    )

⚠️ Potential issue | 🟠 Major

File-operation snapshots are still not fully faithful to on-disk state.

At Line 580, create_file uses input content instead of reading the post-tool file, so autofix/lint mutations are missed. At Line 595, edits are recorded with before=None, which drops pre-change state expected by ExecutionRecorder.record_file_operation() for edit operations.

Suggested fix
-                    if tc.name == "create_file":
-                        # create_file input has the full content
-                        _op_after = tc.input.get("content", "")
-                    else:
-                        # edit_file uses search/replace snippets — read the
-                        # actual file content after the edit for accurate state.
-                        _op_after = None
-                        try:
-                            _full_path = self.workspace.repo_path / _op_path
-                            if _full_path.is_file():
-                                _op_after = _full_path.read_text(errors="replace")
-                        except OSError:
-                            pass
+                    _op_after = None
+                    try:
+                        _full_path = (self.workspace.repo_path / _op_path).resolve()
+                        if _full_path.is_file():
+                            _op_after = _full_path.read_text(errors="replace")
+                    except OSError:
+                        pass
                     self.execution_recorder.record_file_operation(
                         step_id=_rec_step_id,
                         op_type=_op_type,
                         path=_op_path,
-                        before=None,
+                        before=_op_before,
                         after=_op_after,
                     )
-                result = self._execute_tool_with_lint(tc)
+                _op_before = None
+                if tc.name == "edit_file":
+                    try:
+                        _before_path = (self.workspace.repo_path / tc.input.get("path", "")).resolve()
+                        if _before_path.is_file():
+                            _op_before = _before_path.read_text(errors="replace")
+                    except OSError:
+                        pass
+                result = self._execute_tool_with_lint(tc)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@codeframe/core/react_agent.py` around lines 569 - 597, The execution
recording currently uses the create_file input content and omits the pre-change
state for edits, so update the recording logic in the block handling
execution_recorder/_rec_step_id for tc.name in ("edit_file","create_file") to:
for create_file, read the actual on-disk file content after the tool runs (using
self.workspace.repo_path / path) and use that as _op_after instead of
tc.input.get("content"); for edit_file, read and supply both the pre-change
content (before) by reading the file prior to the edit (if present) and the
post-change content (after) by reading the file after the edit into _op_after;
then call execution_recorder.record_file_operation(step_id=_rec_step_id,
op_type=_op_type, path=_op_path, before=<pre-content-or-None>, after=_op_after)
so edits have a non-None before value and creates reflect actual disk state.

@frankbria frankbria merged commit cd3e985 into main Mar 17, 2026
13 checks passed
@frankbria frankbria deleted the feature/issue-315-debug-replay-mode branch March 24, 2026 23:28
Development

Successfully merging this pull request may close these issues.

[Phase 5] Debug and replay mode

1 participant