feat(metrics): engine performance tracking and comparison (#419) by frankbria · Pull Request #443 · frankbria/codeframe

frankbria · 2026-03-14T19:50:05Z

Summary

Closes #419

Schema extension: Add run_engine_log and engine_stats tables via _ensure_schema_upgrades() (idempotent migration)
Core module: codeframe/core/engine_stats.py with record_run(), aggregate computation, get_engine_stats(), get_run_log()
Runtime wiring: After each execute_agent() call, record engine name, duration, tokens, status to run_engine_log (fail-safe, never crashes execution)
CLI commands: cf engines stats [--engine NAME] [--format text|json] and cf engines compare (side-by-side comparison sorted by success rate)

Files Changed

Area	Files	What
Schema	`core/workspace.py`	Two new tables + indexes
Core	`core/engine_stats.py` (NEW)	Record, aggregate, query engine metrics
Runtime	`core/runtime.py`	Hook recording after agent execution
CLI	`cli/engines_commands.py`	`stats` and `compare` commands
Tests	3 new test files	23 new tests

Test plan

23 new tests covering schema, aggregation, runtime wiring, CLI
All 2306 v2 tests pass (0 regressions)
Ruff lint clean
Demo cf engines stats with seeded data
Demo cf engines compare shows sorted comparison

Summary by CodeRabbit

New Features
- Added stats CLI to view engine performance (success rate, gate pass rate, avg duration, token usage) with engine filtering and JSON output.
- Added compare CLI to rank and compare engines by success rate.
- Automatic recording and aggregation of per-run engine performance for reporting.
Tests
- Added comprehensive tests for CLI commands and engine performance tracking.

…419) Add run_engine_log and engine_stats tables for tracking per-run engine metrics and computing aggregate performance statistics. Implements record_run, get_engine_stats, and get_run_log functions with 11 tests.

…ands (#419 steps 3-4) - Record engine performance metrics (duration, tokens, status) after each execute_agent() call via engine_stats.record_run(), with graceful failure handling so stats errors never crash execution - Add `cf engines stats` command with text/JSON output and engine filter - Add `cf engines compare` command showing all engines sorted by success rate - 12 new tests: 4 for runtime integration, 8 for CLI commands

coderabbitai · 2026-03-14T19:50:19Z

Walkthrough

Adds engine performance tracking: new DB tables for per-run and aggregate metrics, a headless engine_stats module to record/query stats, runtime instrumentation to record runs, CLI commands engines stats/engines compare, and tests covering core, runtime, and CLI paths.

Changes

Cohort / File(s)	Summary
Schema & Workspace `codeframe/core/workspace.py`	Adds `run_engine_log` and `engine_stats` tables plus indexes; integrates creation into setup/upgrade paths.
Engine Stats Core `codeframe/core/engine_stats.py`	New module providing `record_run`, `get_engine_stats`, `get_run_log`, and internal aggregate update/upsert logic computing rates/averages with safe handling.
Runtime Instrumentation `codeframe/core/runtime.py`	Records monotonic start time and calls `engine_stats.record_run` after agent execution with duration, tokens, status; wraps call in exception handling.
CLI Commands `codeframe/cli/engines_commands.py`	Adds `stats` (filterable by engine, text/json output) and `compare` commands; includes helpers to compute success rate, format durations, and build display tables.
Tests `tests/core/test_engine_stats.py`, `tests/core/test_runtime_engine_stats.py`, `tests/cli/test_engines_stats_commands.py`	New tests for DB logging, aggregate computation, run-log queries, runtime recording behavior (success/failure/token edge cases), and CLI output/formatting/filtering.

Sequence Diagram

sequenceDiagram
    participant Agent as Agent Adapter
    participant Runtime as runtime.py
    participant Stats as engine_stats.py
    participant DB as SQLite
    participant CLI as CLI User

    CLI->>Runtime: trigger execution (engine, task)
    Runtime->>Runtime: record monotonic start_ms
    Agent->>Runtime: return AgentResult (status, token_usage)
    Runtime->>Runtime: compute duration_ms
    Runtime->>Stats: record_run(workspace, run_id, engine, task_id, status, duration_ms, tokens_used, ...)
    Stats->>DB: INSERT INTO run_engine_log
    Stats->>DB: SELECT aggregates FROM run_engine_log
    Stats->>DB: UPSERT INTO engine_stats
    CLI->>Stats: get_engine_stats(workspace, engine?)
    Stats->>DB: SELECT FROM engine_stats
    Stats-->>CLI: return aggregated stats

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

feat(core): engine registry and runtime selection (#414) #433: Modifies codeframe/cli/engines_commands.py; touches the same CLI namespace and engine-related commands.

Poem

🐰 I hopped through logs both near and far,
I counted runs by engine and star,
Duration, tokens, success so bright,
SQLite stores them through the night.
Now stats and comparisons dance — hop, delight! 🥕

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'feat(metrics): engine performance tracking and comparison (`#419`)' accurately reflects the main change: implementing engine performance tracking and comparison capabilities, the primary focus of this PR.
Linked Issues check	✅ Passed	The PR fully addresses all coding requirements from issue `#419`: run outcomes recorded [runtime.py], stats/compare commands implemented [cli/engines_commands.py], token usage tracked [runtime.py], gate pass/self-correction data structure prepared [workspace.py/engine_stats.py], and metrics persisted [engine_stats.py].
Out of Scope Changes check	✅ Passed	All changes are directly scoped to issue `#419`: schema tables [workspace.py], stats recording and querying [engine_stats.py], runtime integration [runtime.py], CLI commands [cli/engines_commands.py], and comprehensive tests. No unrelated modifications detected.
Docstring Coverage	✅ Passed	Docstring coverage is 95.45% which is sufficient. The required threshold is 80.00%.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings (stacked PR)
📝 Generate docstrings (commit on current branch)

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch feature/issue-419-engine-perf-tracking

📝 Coding Plan

Generate coding plan for human review comments

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

claude · 2026-03-14T19:51:48Z

Code Review: feat(metrics): engine performance tracking and comparison (#419)

Overall this is a well-structured, focused addition. The architecture respects the core-first, headless pattern from CLAUDE.md, the fail-safe wiring in runtime.py is the right call, and the 23-test coverage across schema, aggregation, and CLI is solid. A few issues worth addressing before merge.

Critical

record_run opens two DB connections sequentially (engine_stats.py lines ~191–214 and ~225–289)

record_run calls get_db_connection, inserts and commits, closes, then immediately calls _update_aggregate_stats which opens a second connection to recompute all metrics. On a busy system this is a TOCTOU window — another thread can insert a row between the two connections, causing the aggregate to include data not counted in the just-closed transaction. More practically, it doubles connection overhead on every single task run. The fix is to pass the open connection into _update_aggregate_stats or restructure to do the INSERT and the aggregate recompute in a single connection scope.

# current (two connections)
conn = get_db_connection(workspace)
try:
    conn.execute("INSERT INTO run_engine_log ...")
    conn.commit()
finally:
    conn.close()
_update_aggregate_stats(workspace, engine)   # opens its own connection

# better (one connection)
conn = get_db_connection(workspace)
try:
    conn.execute("INSERT INTO run_engine_log ...")
    _update_aggregate_stats_conn(conn, workspace.id, engine)
    conn.commit()
finally:
    conn.close()

Important

Deferred import of engine_stats inside execute_agent (runtime.py line ~394)

try:
    from codeframe.core import engine_stats
    ...
except Exception:
    logger.debug("Engine stats recording failed", exc_info=True)

The deferred import is unnecessary — engine_stats.py has no circular dependency risk since it only imports from workspace.py. Moving the import to the module top saves a repeated import lookup on every task run and makes the dependency graph visible. The bare except Exception swallowing the error at logger.debug is fine for the fail-safe intent, but should at minimum log exc_info=True at warning level so ops teams can detect persistent DB issues without turning on debug logging. debug will be invisible in production.

_perf_start_ms is placed after the large pre-execution block (runtime.py line ~383)

The timer starts after workspace validation, env config loading, tech stack probing, hook setup — all of which can be slow. This means duration_ms only captures the adapter .run() call, not the full execute_agent cost. That may be intentional (adapter execution time only), but the field name duration_ms with no qualifier in the schema and docs will mislead users who expect wall-clock task duration. Either rename the column to adapter_duration_ms or move the timer to the very top of execute_agent.

agent_status.value.upper() is fragile (runtime.py line ~405)

status=agent_status.value.upper(),

AgentStatus values are already uppercase strings (COMPLETED, FAILED, BLOCKED) based on the codebase conventions, so .upper() is a no-op that only masks the case if the enum value changes. More importantly, agent_status here is assigned from the adapter result's .status field that was mapped earlier in the function. If the result comes back with an unexpected string (e.g. "complete" from a future adapter), the recorded status will be a non-canonical value that won't match any filter in _update_aggregate_stats (WHERE status = 'COMPLETED'). Better to validate against known values or map through the AgentStatus enum explicitly.

avg_tokens_per_task counts only completed-task tokens (engine_stats.py line ~264–266)

avg_tokens_per_task = (
    completed_tokens / completed_count if completed_count > 0 else 0.0
)

The denominator is completed_count but completed_tokens is SUM(tokens_used WHERE status='COMPLETED'). This means failed runs with token consumption are silently excluded from the average. The metric name gives no hint of this. Either rename it avg_tokens_per_completed_task or compute the average across all runs so token cost for failures is visible.

Nice-to-have

Schema duplication between _init_database and _ensure_schema_upgrades (workspace.py)

Both functions now contain identical CREATE TABLE IF NOT EXISTS run_engine_log and engine_stats DDL plus indexes. If the schema evolves (e.g., adding a column), the developer must update two places. Consider extracting the DDL into module-level constants and referencing them from both functions, which is already the cleanest pattern for this kind of idempotent migration.

format shadows the built-in (engines_commands.py line ~110)

format: str = typer.Option("text", "--format", "-f", help="Output format: text or json")

Using format as a parameter name shadows builtins.format. Prefer output_format or fmt to avoid the shadow and any lint warnings (A002 in ruff).

compare command is functionally identical to stats (engines_commands.py lines ~128–142)

Both stats and compare call get_engine_stats(workspace) and pass the result to the same _build_stats_table. The only difference is the table title string. Users running cf engines compare and cf engines stats will see the same data. Consider differentiating them — for example, compare could normalize metrics to a baseline engine, show delta columns, or render a side-by-side view. As-is the duplication will confuse users and the redundancy will accumulate technical debt.

get_run_log is public API with no CLI surface (engine_stats.py line ~331)

get_run_log is exported but there is no cf engines log or cf engines history command that uses it. Either wire it into a CLI command (good candidate: cf engines stats --raw) or mark it _get_run_log (private) until it is actually used, to avoid confusing the public API surface.

Test helper imports inside test methods (tests/core/test_engine_stats.py)

def test_record_run_inserts_row(self, workspace, run_id, task_id):
    from codeframe.core.engine_stats import record_run
    ...

Importing inside every test method is unusual and makes the test file harder to skim. Moving imports to the top of each class or to the module level is cleaner and is the pattern used in the rest of the test suite.

Compliance with CLAUDE.md

core/engine_stats.py is correctly headless (no FastAPI imports).
CLI layer calls core cleanly.
Schema migration uses _ensure_schema_upgrades for existing databases — idempotent.
The bare except Exception in runtime.py is the right pattern for fail-safe hooks per the existing convention in the codebase.

The deferred import in runtime and the double-connection pattern in record_run are the two things I'd flag before merge.

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (4)

tests/core/test_engine_stats.py (1)

35-53: Consider documenting the _record helper return value.

The helper returns the run_id which is used in some tests but not others. A brief docstring addition clarifying this would improve readability.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/core/test_engine_stats.py` around lines 35 - 53, Update the _record
helper's docstring to mention that it returns the generated run_id string;
specifically, in the _record function (which calls record_run) add a concise
sentence describing that the function creates a fresh run_id and returns that
run_id (string) so tests can reference it.

codeframe/core/engine_stats.py (1)

82-97: Minor redundancy in SQL query.

Lines 85 and 93 both compute COUNT(CASE WHEN status = 'COMPLETED' THEN 1 END), resulting in row[1] and row[9] containing the same value. This doesn't affect correctness but adds minor overhead.

Simplified query

         row = cur.execute(
             "SELECT "
             "  COUNT(*), "
             "  COUNT(CASE WHEN status = 'COMPLETED' THEN 1 END), "
             "  COUNT(CASE WHEN status = 'FAILED' THEN 1 END), "
             "  COUNT(CASE WHEN gates_passed = 1 THEN 1 END), "
             "  COUNT(CASE WHEN gates_passed IS NOT NULL THEN 1 END), "
             "  COUNT(CASE WHEN self_corrections > 0 THEN 1 END), "
             "  AVG(CASE WHEN duration_ms IS NOT NULL THEN duration_ms END), "
-            "  SUM(tokens_used), "
-            "  SUM(CASE WHEN status = 'COMPLETED' THEN tokens_used ELSE 0 END), "
-            "  COUNT(CASE WHEN status = 'COMPLETED' THEN 1 END) "
+            "  SUM(tokens_used), "
+            "  SUM(CASE WHEN status = 'COMPLETED' THEN tokens_used ELSE 0 END) "
             "FROM run_engine_log "
             "WHERE engine = ? AND workspace_id = ?",
             (engine, ws_id),
         ).fetchone()

         total = row[0]
         completed = row[1]
         failed = row[2]
         gates_pass_count = row[3]
         gates_total = row[4]
         self_corr_count = row[5]
         avg_duration = row[6]
         total_tokens = row[7] or 0
         completed_tokens = row[8] or 0
-        completed_count = row[9]
+        completed_count = completed  # Reuse row[1]

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@codeframe/core/engine_stats.py` around lines 82 - 97, The SQL SELECT
duplicates COUNT(CASE WHEN status = 'COMPLETED' THEN 1 END) (producing row[1]
and row[9]) in the cur.execute(...) against run_engine_log; remove the redundant
expression from the query (or consolidate by using an alias once) and update any
code that referenced the removed index (e.g., uses of row[9]) to use the single
completed-count column (row[1] or the chosen alias) so the tuple indices remain
correct.

codeframe/core/runtime.py (1)

675-677: Consider moving the time import to the module level.

The time module import inside the function adds minor overhead on each call. Moving it to the module-level imports would be cleaner.

Suggested change

Add to the module-level imports at the top of the file:

import time

Then update lines 675-677:

-    import time as _time_mod
-    _perf_start_ms = int(_time_mod.monotonic() * 1000)
+    _perf_start_ms = int(time.monotonic() * 1000)

And update line 811:

-            _perf_duration_ms = int(_time_mod.monotonic() * 1000) - _perf_start_ms
+            _perf_duration_ms = int(time.monotonic() * 1000) - _perf_start_ms

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@codeframe/core/runtime.py` around lines 675 - 677, Move the time import out
of the function and into module-level imports: add "import time" at the top of
the file, remove the inline "import time as _time_mod", and update all places
referencing _time_mod.monotonic() (including the code that sets _perf_start_ms
and the later use at the end of the runtime where _time_mod is referenced) to
call time.monotonic() instead so the module-level import is used consistently.

codeframe/cli/engines_commands.py (1)

170-188: Minor: format parameter shadows built-in.

The parameter format shadows Python's built-in format function. While this works fine, renaming to output_format or fmt would be slightly cleaner.

Suggested change

 `@engines_app.command`("stats")
 def stats(
     engine: Optional[str] = typer.Option(None, "--engine", "-e", help="Filter by engine name"),
-    format: str = typer.Option("text", "--format", "-f", help="Output format: text or json"),
+    output_format: str = typer.Option("text", "--format", "-f", help="Output format: text or json"),
 ) -> None:
     """Show engine performance statistics."""
     workspace = _get_current_workspace()
     data = engine_stats.get_engine_stats(workspace, engine=engine)

     if not data:
         console.print("[yellow]No engine stats recorded yet.[/yellow]")
         return

-    if format == "json":
+    if output_format == "json":
         console.print(_json.dumps(data, indent=2))
         return

     table = _build_stats_table(data, title="Engine Performance Stats")
     console.print(table)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@codeframe/cli/engines_commands.py` around lines 170 - 188, The CLI command
function stats currently uses a parameter named format which shadows the
built-in format; rename the parameter to output_format (or fmt) in the stats
function signature and update all internal references (the typer.Option
declaration, the comparison if format == "json", and the console.print branches)
to use output_format so behavior remains identical but avoids shadowing; keep
the CLI flag name "--format" in the typer.Option so the external CLI stays the
same and update the docstring/variable names accordingly (function: stats, local
vars: output_format).

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/core/test_runtime_engine_stats.py`:
- Around line 36-44: The test creates a Run with started_at as a string but
Run.started_at is typed as a datetime; change the test to pass a datetime object
(e.g., use datetime.fromisoformat or datetime.strptime to parse
"2026-03-14T00:00:00Z") and ensure the test file imports datetime; update the
Run instantiation in tests/core/test_runtime_engine_stats.py (the Run(...) call)
to set started_at=<datetime object> instead of the string.

---

Nitpick comments:
In `@codeframe/cli/engines_commands.py`:
- Around line 170-188: The CLI command function stats currently uses a parameter
named format which shadows the built-in format; rename the parameter to
output_format (or fmt) in the stats function signature and update all internal
references (the typer.Option declaration, the comparison if format == "json",
and the console.print branches) to use output_format so behavior remains
identical but avoids shadowing; keep the CLI flag name "--format" in the
typer.Option so the external CLI stays the same and update the
docstring/variable names accordingly (function: stats, local vars:
output_format).

In `@codeframe/core/engine_stats.py`:
- Around line 82-97: The SQL SELECT duplicates COUNT(CASE WHEN status =
'COMPLETED' THEN 1 END) (producing row[1] and row[9]) in the cur.execute(...)
against run_engine_log; remove the redundant expression from the query (or
consolidate by using an alias once) and update any code that referenced the
removed index (e.g., uses of row[9]) to use the single completed-count column
(row[1] or the chosen alias) so the tuple indices remain correct.

In `@codeframe/core/runtime.py`:
- Around line 675-677: Move the time import out of the function and into
module-level imports: add "import time" at the top of the file, remove the
inline "import time as _time_mod", and update all places referencing
_time_mod.monotonic() (including the code that sets _perf_start_ms and the later
use at the end of the runtime where _time_mod is referenced) to call
time.monotonic() instead so the module-level import is used consistently.

In `@tests/core/test_engine_stats.py`:
- Around line 35-53: Update the _record helper's docstring to mention that it
returns the generated run_id string; specifically, in the _record function
(which calls record_run) add a concise sentence describing that the function
creates a fresh run_id and returns that run_id (string) so tests can reference
it.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 786ff895-524f-48b9-bbb9-394e26805d64

📥 Commits

Reviewing files that changed from the base of the PR and between d2f2660 and 16fe673.

📒 Files selected for processing (7)

codeframe/cli/engines_commands.py
codeframe/core/engine_stats.py
codeframe/core/runtime.py
codeframe/core/workspace.py
tests/cli/test_engines_stats_commands.py
tests/core/test_engine_stats.py
tests/core/test_runtime_engine_stats.py

coderabbitai · 2026-03-14T19:53:12Z

+    run = Run(
+        id="run-001",
+        workspace_id=workspace.id,
+        task_id="task-001",
+        status=RunStatus.RUNNING,
+        started_at="2026-03-14T00:00:00Z",
+        completed_at=None,
+    )
+    return workspace, run


⚠️ Potential issue | 🟡 Minor

Type mismatch: started_at should be a datetime object, not a string.

The Run dataclass in runtime.py (line 58) declares started_at: datetime, but this mock uses a string. While the current tests pass because complete_run, fail_run, and block_run are patched, this could cause issues if the test setup is reused elsewhere or if patches are modified.

Suggested fix

+from datetime import datetime, timezone + def _make_workspace_and_run(tmp_path): """Create a real workspace and a mock Run for testing.""" from codeframe.core.workspace import create_or_load_workspace from codeframe.core.runtime import Run, RunStatus repo = tmp_path / "test_repo" repo.mkdir() workspace = create_or_load_workspace(repo) run = Run( id="run-001", workspace_id=workspace.id, task_id="task-001", status=RunStatus.RUNNING, - started_at="2026-03-14T00:00:00Z", + started_at=datetime(2026, 3, 14, tzinfo=timezone.utc), completed_at=None, ) return workspace, run

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tests/core/test_runtime_engine_stats.py` around lines 36 - 44, The test creates a Run with started_at as a string but Run.started_at is typed as a datetime; change the test to pass a datetime object (e.g., use datetime.fromisoformat or datetime.strptime to parse "2026-03-14T00:00:00Z") and ensure the test file imports datetime; update the Run instantiation in tests/core/test_runtime_engine_stats.py (the Run(...) call) to set started_at=<datetime object> instead of the string.

- Use single DB connection in record_run (fixes TOCTOU/double connection) - Move engine_stats import to module level in runtime.py - Upgrade stats failure logging from debug to warning - Rename format→output_format to avoid shadowing builtin - Keep .upper() on status values for SQL convention consistency

coderabbitai

🧹 Nitpick comments (3)

codeframe/cli/engines_commands.py (1)

124-128: Duration formatting could handle edge cases more gracefully.

The function handles values < 1000ms and >= 1000ms, but negative values (which shouldn't occur but could indicate a bug upstream) would display incorrectly.

🛡️ Optional: Add defensive check for negative durations

 def _format_duration(ms: float) -> str:
     """Format duration in human-readable form."""
+    if ms < 0:
+        return "N/A"
     if ms < 1000:
         return f"{ms:.0f}ms"
     return f"{ms / 1000:.1f}s"

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@codeframe/cli/engines_commands.py` around lines 124 - 128, The
_format_duration function should defensively handle negative inputs; update
_format_duration(ms: float) to clamp negative ms to 0 (e.g., if ms < 0: ms = 0)
before formatting so upstream timing bugs don't produce misleading negative
durations, then keep the existing branch logic for <1000ms and >=1000ms to
return "0ms" or the seconds string as appropriate.

codeframe/core/engine_stats.py (2)

92-107: Minor SQL redundancy: completed count is computed twice.

Lines 95 and 103 both count status = 'COMPLETED'. While functionally correct, this is slightly redundant.

♻️ Optional: Remove redundant column

     row = cur.execute(
         "SELECT "
         "  COUNT(*), "
         "  COUNT(CASE WHEN status = 'COMPLETED' THEN 1 END), "
         "  COUNT(CASE WHEN status = 'FAILED' THEN 1 END), "
         "  COUNT(CASE WHEN gates_passed = 1 THEN 1 END), "
         "  COUNT(CASE WHEN gates_passed IS NOT NULL THEN 1 END), "
         "  COUNT(CASE WHEN self_corrections > 0 THEN 1 END), "
         "  AVG(CASE WHEN duration_ms IS NOT NULL THEN duration_ms END), "
         "  SUM(tokens_used), "
-        "  SUM(CASE WHEN status = 'COMPLETED' THEN tokens_used ELSE 0 END), "
-        "  COUNT(CASE WHEN status = 'COMPLETED' THEN 1 END) "
+        "  SUM(CASE WHEN status = 'COMPLETED' THEN tokens_used ELSE 0 END) "
         "FROM run_engine_log "
         "WHERE engine = ? AND workspace_id = ?",
         (engine, ws_id),
     ).fetchone()

     total = row[0]
     completed = row[1]
     failed = row[2]
     gates_pass_count = row[3]
     gates_total = row[4]
     self_corr_count = row[5]
     avg_duration = row[6]
     total_tokens = row[7] or 0
     completed_tokens = row[8] or 0
-    completed_count = row[9]
+    completed_count = completed  # Reuse the count from row[1]

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@codeframe/core/engine_stats.py` around lines 92 - 107, The SQL in the
cur.execute call in engine_stats.py selects the same COUNT(CASE WHEN status =
'COMPLETED' THEN 1 END) twice; remove the duplicate column from that SELECT (the
second occurrence near the end of the list) and then adjust any code that reads
the resulting row tuple (references to row indices/variables that expect the
extra column) so the mapping of counts/averages/sums to variables remains
correct. Use the existing cur.execute(...) invocation and the enclosing function
that processes row to find and update all dependent index usages.

14-66: Consider emitting an event for observability on run recording.

Per coding guidelines, core modules should "emit events for state transitions via core/events.py for audit and observability." Recording an engine run is arguably a state transition that could benefit from event emission for audit trails.

However, this may be intentionally omitted since:

The recording itself is already a form of persistence/audit
The runtime already handles fail-safe recording with warning logs

If events are desired, they could be added without blocking this PR.

The single-connection approach correctly addresses the TOCTOU concern mentioned in the commit message. The try/finally ensures the connection is always closed.

As per coding guidelines: "All core modules must emit events for state transitions via core/events.py for audit and observability"

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@codeframe/core/engine_stats.py` around lines 14 - 66, The record_run function
currently persists engine runs but does not emit an observability event; import
the event emitter from core/events.py (e.g. emit_event) and, after the INSERT
and after calling _update_aggregate_stats_conn but before conn.commit(), call
the emitter to emit a concise "engine_run_recorded" (or similar) event
containing run_id, engine, task_id, workspace_id (workspace.id), status,
duration_ms, tokens_used, gates_passed, self_corrections and created_at (now) so
consumers can audit the state transition without changing the DB transaction
semantics.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@codeframe/cli/engines_commands.py`:
- Around line 124-128: The _format_duration function should defensively handle
negative inputs; update _format_duration(ms: float) to clamp negative ms to 0
(e.g., if ms < 0: ms = 0) before formatting so upstream timing bugs don't
produce misleading negative durations, then keep the existing branch logic for
<1000ms and >=1000ms to return "0ms" or the seconds string as appropriate.

In `@codeframe/core/engine_stats.py`:
- Around line 92-107: The SQL in the cur.execute call in engine_stats.py selects
the same COUNT(CASE WHEN status = 'COMPLETED' THEN 1 END) twice; remove the
duplicate column from that SELECT (the second occurrence near the end of the
list) and then adjust any code that reads the resulting row tuple (references to
row indices/variables that expect the extra column) so the mapping of
counts/averages/sums to variables remains correct. Use the existing
cur.execute(...) invocation and the enclosing function that processes row to
find and update all dependent index usages.
- Around line 14-66: The record_run function currently persists engine runs but
does not emit an observability event; import the event emitter from
core/events.py (e.g. emit_event) and, after the INSERT and after calling
_update_aggregate_stats_conn but before conn.commit(), call the emitter to emit
a concise "engine_run_recorded" (or similar) event containing run_id, engine,
task_id, workspace_id (workspace.id), status, duration_ms, tokens_used,
gates_passed, self_corrections and created_at (now) so consumers can audit the
state transition without changing the DB transaction semantics.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: be1b3109-0ab6-42a9-a4ef-05502231fbc2

📥 Commits

Reviewing files that changed from the base of the PR and between 16fe673 and 7eb3e13.

📒 Files selected for processing (3)

codeframe/cli/engines_commands.py
codeframe/core/engine_stats.py
codeframe/core/runtime.py

🚧 Files skipped from review as they are similar to previous changes (1)

codeframe/core/runtime.py

claude · 2026-03-14T20:02:53Z

Follow-up Review

Good progress since the initial review. Here is a status update on what was addressed and what remains.

Resolved:

Double DB connection (Critical): record_run now calls _update_aggregate_stats_conn within the same open connection before committing. The TOCTOU window and double overhead are gone.
Deferred engine_stats import: Now correctly at the module level in runtime.py.
format shadows built-in: Renamed to output_format in stats.

Still Open:

import time as _time_mod inside the function body (runtime.py ~line 675)

Both this review and CodeRabbit flagged this. The local alias _time_mod is unnecessary. Move to module-level:

import time
_perf_start_ms = int(time.monotonic() * 1000)
_perf_duration_ms = int(time.monotonic() * 1000) - _perf_start_ms

Duplicate SQL COUNT expression in _update_aggregate_stats_conn (engine_stats.py ~line 82-97)

row[1] (completed) and row[9] (completed_count) are computed by the same COUNT(CASE WHEN status = COMPLETED ...) expression. CodeRabbit also flagged this. Replace completed_count = row[9] with completed_count = completed and remove the redundant tenth SELECT column.

avg_tokens_per_task excludes failed-run token consumption (engine_stats.py ~line 104-106)

From my prior review - still unaddressed. The denominator is completed_count but completed_tokens is SUM(tokens_used WHERE status=COMPLETED), so token cost from failed runs is silently excluded. The metric name gives no hint of this. Either rename to avg_tokens_per_completed_task or compute across all runs so total cost (including failures) is visible.

Run.started_at type mismatch in test (tests/core/test_runtime_engine_stats.py ~line 44)

CodeRabbit flagged this and it is still in the current diff. The field is typed datetime but the test passes a raw string: started_at="2026-03-14T00:00:00Z". Fix: use started_at=datetime(2026, 3, 14, tzinfo=timezone.utc) and add from datetime import datetime, timezone.

Still Nice-to-Have (from prior review, not blocking):

compare and stats render identical output - consider differentiating (delta columns, baseline normalization)
get_run_log is public API with no CLI surface - mark private or add cf engines log subcommand
Test helper imports inside each test method - move to module/class level per the rest of the test suite

The critical fix landed cleanly. The import time, duplicate SQL column, avg_tokens_per_task naming, and started_at type mismatch are the remaining items worth resolving before merge.

Test User added 2 commits March 14, 2026 12:23

coderabbitai Bot reviewed Mar 14, 2026

View reviewed changes

frankbria merged commit cf3fa89 into main Mar 14, 2026
11 checks passed

Repository owner deleted a comment from claude Bot Mar 15, 2026

frankbria deleted the feature/issue-419-engine-perf-tracking branch March 24, 2026 23:28

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(metrics): engine performance tracking and comparison (#419)#443

feat(metrics): engine performance tracking and comparison (#419)#443
frankbria merged 3 commits into
mainfrom
feature/issue-419-engine-perf-tracking

frankbria commented Mar 14, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Mar 14, 2026 •

edited

Loading

Uh oh!

claude Bot commented Mar 14, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Mar 14, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

claude Bot commented Mar 14, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

frankbria commented Mar 14, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Files Changed

Test plan

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Mar 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Possibly related PRs

Poem

Uh oh!

claude Bot commented Mar 14, 2026

Code Review: feat(metrics): engine performance tracking and comparison (#419)

Critical

Important

Nice-to-have

Compliance with CLAUDE.md

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Mar 14, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

claude Bot commented Mar 14, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

frankbria commented Mar 14, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Mar 14, 2026 •

edited

Loading