feat(metrics): engine performance tracking and comparison (#419)#443
Conversation
…419) Add run_engine_log and engine_stats tables for tracking per-run engine metrics and computing aggregate performance statistics. Implements record_run, get_engine_stats, and get_run_log functions with 11 tests.
…ands (#419 steps 3-4) - Record engine performance metrics (duration, tokens, status) after each execute_agent() call via engine_stats.record_run(), with graceful failure handling so stats errors never crash execution - Add `cf engines stats` command with text/JSON output and engine filter - Add `cf engines compare` command showing all engines sorted by success rate - 12 new tests: 4 for runtime integration, 8 for CLI commands
WalkthroughAdds engine performance tracking: new DB tables for per-run and aggregate metrics, a headless engine_stats module to record/query stats, runtime instrumentation to record runs, CLI commands Changes
Sequence DiagramsequenceDiagram
participant Agent as Agent Adapter
participant Runtime as runtime.py
participant Stats as engine_stats.py
participant DB as SQLite
participant CLI as CLI User
CLI->>Runtime: trigger execution (engine, task)
Runtime->>Runtime: record monotonic start_ms
Agent->>Runtime: return AgentResult (status, token_usage)
Runtime->>Runtime: compute duration_ms
Runtime->>Stats: record_run(workspace, run_id, engine, task_id, status, duration_ms, tokens_used, ...)
Stats->>DB: INSERT INTO run_engine_log
Stats->>DB: SELECT aggregates FROM run_engine_log
Stats->>DB: UPSERT INTO engine_stats
CLI->>Stats: get_engine_stats(workspace, engine?)
Stats->>DB: SELECT FROM engine_stats
Stats-->>CLI: return aggregated stats
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
📝 Coding Plan
Comment |
Code Review: feat(metrics): engine performance tracking and comparison (#419)Overall this is a well-structured, focused addition. The architecture respects the core-first, headless pattern from CLAUDE.md, the fail-safe wiring in Critical
# current (two connections)
conn = get_db_connection(workspace)
try:
conn.execute("INSERT INTO run_engine_log ...")
conn.commit()
finally:
conn.close()
_update_aggregate_stats(workspace, engine) # opens its own connection
# better (one connection)
conn = get_db_connection(workspace)
try:
conn.execute("INSERT INTO run_engine_log ...")
_update_aggregate_stats_conn(conn, workspace.id, engine)
conn.commit()
finally:
conn.close()ImportantDeferred import of try:
from codeframe.core import engine_stats
...
except Exception:
logger.debug("Engine stats recording failed", exc_info=True)The deferred import is unnecessary —
The timer starts after workspace validation, env config loading, tech stack probing, hook setup — all of which can be slow. This means
status=agent_status.value.upper(),
avg_tokens_per_task = (
completed_tokens / completed_count if completed_count > 0 else 0.0
)The denominator is Nice-to-haveSchema duplication between Both functions now contain identical
format: str = typer.Option("text", "--format", "-f", help="Output format: text or json")Using
Both
Test helper imports inside test methods (tests/core/test_engine_stats.py) def test_record_run_inserts_row(self, workspace, run_id, task_id):
from codeframe.core.engine_stats import record_run
...Importing inside every test method is unusual and makes the test file harder to skim. Moving imports to the top of each class or to the module level is cleaner and is the pattern used in the rest of the test suite. Compliance with CLAUDE.md
The deferred import in runtime and the double-connection pattern in |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (4)
tests/core/test_engine_stats.py (1)
35-53: Consider documenting the_recordhelper return value.The helper returns the
run_idwhich is used in some tests but not others. A brief docstring addition clarifying this would improve readability.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/core/test_engine_stats.py` around lines 35 - 53, Update the _record helper's docstring to mention that it returns the generated run_id string; specifically, in the _record function (which calls record_run) add a concise sentence describing that the function creates a fresh run_id and returns that run_id (string) so tests can reference it.codeframe/core/engine_stats.py (1)
82-97: Minor redundancy in SQL query.Lines 85 and 93 both compute
COUNT(CASE WHEN status = 'COMPLETED' THEN 1 END), resulting inrow[1]androw[9]containing the same value. This doesn't affect correctness but adds minor overhead.Simplified query
row = cur.execute( "SELECT " " COUNT(*), " " COUNT(CASE WHEN status = 'COMPLETED' THEN 1 END), " " COUNT(CASE WHEN status = 'FAILED' THEN 1 END), " " COUNT(CASE WHEN gates_passed = 1 THEN 1 END), " " COUNT(CASE WHEN gates_passed IS NOT NULL THEN 1 END), " " COUNT(CASE WHEN self_corrections > 0 THEN 1 END), " " AVG(CASE WHEN duration_ms IS NOT NULL THEN duration_ms END), " - " SUM(tokens_used), " - " SUM(CASE WHEN status = 'COMPLETED' THEN tokens_used ELSE 0 END), " - " COUNT(CASE WHEN status = 'COMPLETED' THEN 1 END) " + " SUM(tokens_used), " + " SUM(CASE WHEN status = 'COMPLETED' THEN tokens_used ELSE 0 END) " "FROM run_engine_log " "WHERE engine = ? AND workspace_id = ?", (engine, ws_id), ).fetchone() total = row[0] completed = row[1] failed = row[2] gates_pass_count = row[3] gates_total = row[4] self_corr_count = row[5] avg_duration = row[6] total_tokens = row[7] or 0 completed_tokens = row[8] or 0 - completed_count = row[9] + completed_count = completed # Reuse row[1]🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@codeframe/core/engine_stats.py` around lines 82 - 97, The SQL SELECT duplicates COUNT(CASE WHEN status = 'COMPLETED' THEN 1 END) (producing row[1] and row[9]) in the cur.execute(...) against run_engine_log; remove the redundant expression from the query (or consolidate by using an alias once) and update any code that referenced the removed index (e.g., uses of row[9]) to use the single completed-count column (row[1] or the chosen alias) so the tuple indices remain correct.codeframe/core/runtime.py (1)
675-677: Consider moving the time import to the module level.The
timemodule import inside the function adds minor overhead on each call. Moving it to the module-level imports would be cleaner.Suggested change
Add to the module-level imports at the top of the file:
import timeThen update lines 675-677:
- import time as _time_mod - _perf_start_ms = int(_time_mod.monotonic() * 1000) + _perf_start_ms = int(time.monotonic() * 1000)And update line 811:
- _perf_duration_ms = int(_time_mod.monotonic() * 1000) - _perf_start_ms + _perf_duration_ms = int(time.monotonic() * 1000) - _perf_start_ms🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@codeframe/core/runtime.py` around lines 675 - 677, Move the time import out of the function and into module-level imports: add "import time" at the top of the file, remove the inline "import time as _time_mod", and update all places referencing _time_mod.monotonic() (including the code that sets _perf_start_ms and the later use at the end of the runtime where _time_mod is referenced) to call time.monotonic() instead so the module-level import is used consistently.codeframe/cli/engines_commands.py (1)
170-188: Minor:formatparameter shadows built-in.The parameter
formatshadows Python's built-informatfunction. While this works fine, renaming tooutput_formatorfmtwould be slightly cleaner.Suggested change
`@engines_app.command`("stats") def stats( engine: Optional[str] = typer.Option(None, "--engine", "-e", help="Filter by engine name"), - format: str = typer.Option("text", "--format", "-f", help="Output format: text or json"), + output_format: str = typer.Option("text", "--format", "-f", help="Output format: text or json"), ) -> None: """Show engine performance statistics.""" workspace = _get_current_workspace() data = engine_stats.get_engine_stats(workspace, engine=engine) if not data: console.print("[yellow]No engine stats recorded yet.[/yellow]") return - if format == "json": + if output_format == "json": console.print(_json.dumps(data, indent=2)) return table = _build_stats_table(data, title="Engine Performance Stats") console.print(table)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@codeframe/cli/engines_commands.py` around lines 170 - 188, The CLI command function stats currently uses a parameter named format which shadows the built-in format; rename the parameter to output_format (or fmt) in the stats function signature and update all internal references (the typer.Option declaration, the comparison if format == "json", and the console.print branches) to use output_format so behavior remains identical but avoids shadowing; keep the CLI flag name "--format" in the typer.Option so the external CLI stays the same and update the docstring/variable names accordingly (function: stats, local vars: output_format).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/core/test_runtime_engine_stats.py`:
- Around line 36-44: The test creates a Run with started_at as a string but
Run.started_at is typed as a datetime; change the test to pass a datetime object
(e.g., use datetime.fromisoformat or datetime.strptime to parse
"2026-03-14T00:00:00Z") and ensure the test file imports datetime; update the
Run instantiation in tests/core/test_runtime_engine_stats.py (the Run(...) call)
to set started_at=<datetime object> instead of the string.
---
Nitpick comments:
In `@codeframe/cli/engines_commands.py`:
- Around line 170-188: The CLI command function stats currently uses a parameter
named format which shadows the built-in format; rename the parameter to
output_format (or fmt) in the stats function signature and update all internal
references (the typer.Option declaration, the comparison if format == "json",
and the console.print branches) to use output_format so behavior remains
identical but avoids shadowing; keep the CLI flag name "--format" in the
typer.Option so the external CLI stays the same and update the
docstring/variable names accordingly (function: stats, local vars:
output_format).
In `@codeframe/core/engine_stats.py`:
- Around line 82-97: The SQL SELECT duplicates COUNT(CASE WHEN status =
'COMPLETED' THEN 1 END) (producing row[1] and row[9]) in the cur.execute(...)
against run_engine_log; remove the redundant expression from the query (or
consolidate by using an alias once) and update any code that referenced the
removed index (e.g., uses of row[9]) to use the single completed-count column
(row[1] or the chosen alias) so the tuple indices remain correct.
In `@codeframe/core/runtime.py`:
- Around line 675-677: Move the time import out of the function and into
module-level imports: add "import time" at the top of the file, remove the
inline "import time as _time_mod", and update all places referencing
_time_mod.monotonic() (including the code that sets _perf_start_ms and the later
use at the end of the runtime where _time_mod is referenced) to call
time.monotonic() instead so the module-level import is used consistently.
In `@tests/core/test_engine_stats.py`:
- Around line 35-53: Update the _record helper's docstring to mention that it
returns the generated run_id string; specifically, in the _record function
(which calls record_run) add a concise sentence describing that the function
creates a fresh run_id and returns that run_id (string) so tests can reference
it.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 786ff895-524f-48b9-bbb9-394e26805d64
📒 Files selected for processing (7)
codeframe/cli/engines_commands.pycodeframe/core/engine_stats.pycodeframe/core/runtime.pycodeframe/core/workspace.pytests/cli/test_engines_stats_commands.pytests/core/test_engine_stats.pytests/core/test_runtime_engine_stats.py
| run = Run( | ||
| id="run-001", | ||
| workspace_id=workspace.id, | ||
| task_id="task-001", | ||
| status=RunStatus.RUNNING, | ||
| started_at="2026-03-14T00:00:00Z", | ||
| completed_at=None, | ||
| ) | ||
| return workspace, run |
There was a problem hiding this comment.
Type mismatch: started_at should be a datetime object, not a string.
The Run dataclass in runtime.py (line 58) declares started_at: datetime, but this mock uses a string. While the current tests pass because complete_run, fail_run, and block_run are patched, this could cause issues if the test setup is reused elsewhere or if patches are modified.
Suggested fix
+from datetime import datetime, timezone
+
def _make_workspace_and_run(tmp_path):
"""Create a real workspace and a mock Run for testing."""
from codeframe.core.workspace import create_or_load_workspace
from codeframe.core.runtime import Run, RunStatus
repo = tmp_path / "test_repo"
repo.mkdir()
workspace = create_or_load_workspace(repo)
run = Run(
id="run-001",
workspace_id=workspace.id,
task_id="task-001",
status=RunStatus.RUNNING,
- started_at="2026-03-14T00:00:00Z",
+ started_at=datetime(2026, 3, 14, tzinfo=timezone.utc),
completed_at=None,
)
return workspace, run🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/core/test_runtime_engine_stats.py` around lines 36 - 44, The test
creates a Run with started_at as a string but Run.started_at is typed as a
datetime; change the test to pass a datetime object (e.g., use
datetime.fromisoformat or datetime.strptime to parse "2026-03-14T00:00:00Z") and
ensure the test file imports datetime; update the Run instantiation in
tests/core/test_runtime_engine_stats.py (the Run(...) call) to set
started_at=<datetime object> instead of the string.
- Use single DB connection in record_run (fixes TOCTOU/double connection) - Move engine_stats import to module level in runtime.py - Upgrade stats failure logging from debug to warning - Rename format→output_format to avoid shadowing builtin - Keep .upper() on status values for SQL convention consistency
There was a problem hiding this comment.
🧹 Nitpick comments (3)
codeframe/cli/engines_commands.py (1)
124-128: Duration formatting could handle edge cases more gracefully.The function handles values < 1000ms and >= 1000ms, but negative values (which shouldn't occur but could indicate a bug upstream) would display incorrectly.
🛡️ Optional: Add defensive check for negative durations
def _format_duration(ms: float) -> str: """Format duration in human-readable form.""" + if ms < 0: + return "N/A" if ms < 1000: return f"{ms:.0f}ms" return f"{ms / 1000:.1f}s"🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@codeframe/cli/engines_commands.py` around lines 124 - 128, The _format_duration function should defensively handle negative inputs; update _format_duration(ms: float) to clamp negative ms to 0 (e.g., if ms < 0: ms = 0) before formatting so upstream timing bugs don't produce misleading negative durations, then keep the existing branch logic for <1000ms and >=1000ms to return "0ms" or the seconds string as appropriate.codeframe/core/engine_stats.py (2)
92-107: Minor SQL redundancy:completedcount is computed twice.Lines 95 and 103 both count
status = 'COMPLETED'. While functionally correct, this is slightly redundant.♻️ Optional: Remove redundant column
row = cur.execute( "SELECT " " COUNT(*), " " COUNT(CASE WHEN status = 'COMPLETED' THEN 1 END), " " COUNT(CASE WHEN status = 'FAILED' THEN 1 END), " " COUNT(CASE WHEN gates_passed = 1 THEN 1 END), " " COUNT(CASE WHEN gates_passed IS NOT NULL THEN 1 END), " " COUNT(CASE WHEN self_corrections > 0 THEN 1 END), " " AVG(CASE WHEN duration_ms IS NOT NULL THEN duration_ms END), " " SUM(tokens_used), " - " SUM(CASE WHEN status = 'COMPLETED' THEN tokens_used ELSE 0 END), " - " COUNT(CASE WHEN status = 'COMPLETED' THEN 1 END) " + " SUM(CASE WHEN status = 'COMPLETED' THEN tokens_used ELSE 0 END) " "FROM run_engine_log " "WHERE engine = ? AND workspace_id = ?", (engine, ws_id), ).fetchone() total = row[0] completed = row[1] failed = row[2] gates_pass_count = row[3] gates_total = row[4] self_corr_count = row[5] avg_duration = row[6] total_tokens = row[7] or 0 completed_tokens = row[8] or 0 - completed_count = row[9] + completed_count = completed # Reuse the count from row[1]🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@codeframe/core/engine_stats.py` around lines 92 - 107, The SQL in the cur.execute call in engine_stats.py selects the same COUNT(CASE WHEN status = 'COMPLETED' THEN 1 END) twice; remove the duplicate column from that SELECT (the second occurrence near the end of the list) and then adjust any code that reads the resulting row tuple (references to row indices/variables that expect the extra column) so the mapping of counts/averages/sums to variables remains correct. Use the existing cur.execute(...) invocation and the enclosing function that processes row to find and update all dependent index usages.
14-66: Consider emitting an event for observability on run recording.Per coding guidelines, core modules should "emit events for state transitions via
core/events.pyfor audit and observability." Recording an engine run is arguably a state transition that could benefit from event emission for audit trails.However, this may be intentionally omitted since:
- The recording itself is already a form of persistence/audit
- The runtime already handles fail-safe recording with warning logs
If events are desired, they could be added without blocking this PR.
The single-connection approach correctly addresses the TOCTOU concern mentioned in the commit message. The
try/finallyensures the connection is always closed.As per coding guidelines: "All core modules must emit events for state transitions via
core/events.pyfor audit and observability"🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@codeframe/core/engine_stats.py` around lines 14 - 66, The record_run function currently persists engine runs but does not emit an observability event; import the event emitter from core/events.py (e.g. emit_event) and, after the INSERT and after calling _update_aggregate_stats_conn but before conn.commit(), call the emitter to emit a concise "engine_run_recorded" (or similar) event containing run_id, engine, task_id, workspace_id (workspace.id), status, duration_ms, tokens_used, gates_passed, self_corrections and created_at (now) so consumers can audit the state transition without changing the DB transaction semantics.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@codeframe/cli/engines_commands.py`:
- Around line 124-128: The _format_duration function should defensively handle
negative inputs; update _format_duration(ms: float) to clamp negative ms to 0
(e.g., if ms < 0: ms = 0) before formatting so upstream timing bugs don't
produce misleading negative durations, then keep the existing branch logic for
<1000ms and >=1000ms to return "0ms" or the seconds string as appropriate.
In `@codeframe/core/engine_stats.py`:
- Around line 92-107: The SQL in the cur.execute call in engine_stats.py selects
the same COUNT(CASE WHEN status = 'COMPLETED' THEN 1 END) twice; remove the
duplicate column from that SELECT (the second occurrence near the end of the
list) and then adjust any code that reads the resulting row tuple (references to
row indices/variables that expect the extra column) so the mapping of
counts/averages/sums to variables remains correct. Use the existing
cur.execute(...) invocation and the enclosing function that processes row to
find and update all dependent index usages.
- Around line 14-66: The record_run function currently persists engine runs but
does not emit an observability event; import the event emitter from
core/events.py (e.g. emit_event) and, after the INSERT and after calling
_update_aggregate_stats_conn but before conn.commit(), call the emitter to emit
a concise "engine_run_recorded" (or similar) event containing run_id, engine,
task_id, workspace_id (workspace.id), status, duration_ms, tokens_used,
gates_passed, self_corrections and created_at (now) so consumers can audit the
state transition without changing the DB transaction semantics.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: be1b3109-0ab6-42a9-a4ef-05502231fbc2
📒 Files selected for processing (3)
codeframe/cli/engines_commands.pycodeframe/core/engine_stats.pycodeframe/core/runtime.py
🚧 Files skipped from review as they are similar to previous changes (1)
- codeframe/core/runtime.py
|
Follow-up Review Good progress since the initial review. Here is a status update on what was addressed and what remains. Resolved:
Still Open:
Both this review and CodeRabbit flagged this. The local alias _time_mod is unnecessary. Move to module-level:
row[1] (completed) and row[9] (completed_count) are computed by the same COUNT(CASE WHEN status = COMPLETED ...) expression. CodeRabbit also flagged this. Replace completed_count = row[9] with completed_count = completed and remove the redundant tenth SELECT column.
From my prior review - still unaddressed. The denominator is completed_count but completed_tokens is SUM(tokens_used WHERE status=COMPLETED), so token cost from failed runs is silently excluded. The metric name gives no hint of this. Either rename to avg_tokens_per_completed_task or compute across all runs so total cost (including failures) is visible.
CodeRabbit flagged this and it is still in the current diff. The field is typed datetime but the test passes a raw string: started_at="2026-03-14T00:00:00Z". Fix: use started_at=datetime(2026, 3, 14, tzinfo=timezone.utc) and add from datetime import datetime, timezone. Still Nice-to-Have (from prior review, not blocking):
The critical fix landed cleanly. The import time, duplicate SQL column, avg_tokens_per_task naming, and started_at type mismatch are the remaining items worth resolving before merge. |
Summary
Closes #419
run_engine_logandengine_statstables via_ensure_schema_upgrades()(idempotent migration)codeframe/core/engine_stats.pywithrecord_run(), aggregate computation,get_engine_stats(),get_run_log()execute_agent()call, record engine name, duration, tokens, status torun_engine_log(fail-safe, never crashes execution)cf engines stats [--engine NAME] [--format text|json]andcf engines compare(side-by-side comparison sorted by success rate)Files Changed
core/workspace.pycore/engine_stats.py(NEW)core/runtime.pycli/engines_commands.pystatsandcomparecommandsTest plan
cf engines statswith seeded datacf engines compareshows sorted comparisonSummary by CodeRabbit
New Features
statsCLI to view engine performance (success rate, gate pass rate, avg duration, token usage) with engine filtering and JSON output.compareCLI to rank and compare engines by success rate.Tests