Skip to content

feat(metrics): engine performance tracking and comparison (#419)#443

Merged
frankbria merged 3 commits into
mainfrom
feature/issue-419-engine-perf-tracking
Mar 14, 2026
Merged

feat(metrics): engine performance tracking and comparison (#419)#443
frankbria merged 3 commits into
mainfrom
feature/issue-419-engine-perf-tracking

Conversation

@frankbria

@frankbria frankbria commented Mar 14, 2026

Copy link
Copy Markdown
Owner

Summary

Closes #419

  • Schema extension: Add run_engine_log and engine_stats tables via _ensure_schema_upgrades() (idempotent migration)
  • Core module: codeframe/core/engine_stats.py with record_run(), aggregate computation, get_engine_stats(), get_run_log()
  • Runtime wiring: After each execute_agent() call, record engine name, duration, tokens, status to run_engine_log (fail-safe, never crashes execution)
  • CLI commands: cf engines stats [--engine NAME] [--format text|json] and cf engines compare (side-by-side comparison sorted by success rate)

Files Changed

Area Files What
Schema core/workspace.py Two new tables + indexes
Core core/engine_stats.py (NEW) Record, aggregate, query engine metrics
Runtime core/runtime.py Hook recording after agent execution
CLI cli/engines_commands.py stats and compare commands
Tests 3 new test files 23 new tests

Test plan

  • 23 new tests covering schema, aggregation, runtime wiring, CLI
  • All 2306 v2 tests pass (0 regressions)
  • Ruff lint clean
  • Demo cf engines stats with seeded data
  • Demo cf engines compare shows sorted comparison

Summary by CodeRabbit

  • New Features

    • Added stats CLI to view engine performance (success rate, gate pass rate, avg duration, token usage) with engine filtering and JSON output.
    • Added compare CLI to rank and compare engines by success rate.
    • Automatic recording and aggregation of per-run engine performance for reporting.
  • Tests

    • Added comprehensive tests for CLI commands and engine performance tracking.

Test User added 2 commits March 14, 2026 12:23
…419)

Add run_engine_log and engine_stats tables for tracking per-run engine
metrics and computing aggregate performance statistics. Implements
record_run, get_engine_stats, and get_run_log functions with 11 tests.
…ands (#419 steps 3-4)

- Record engine performance metrics (duration, tokens, status) after
  each execute_agent() call via engine_stats.record_run(), with graceful
  failure handling so stats errors never crash execution
- Add `cf engines stats` command with text/JSON output and engine filter
- Add `cf engines compare` command showing all engines sorted by success rate
- 12 new tests: 4 for runtime integration, 8 for CLI commands
@coderabbitai

coderabbitai Bot commented Mar 14, 2026

Copy link
Copy Markdown
Contributor

Walkthrough

Adds engine performance tracking: new DB tables for per-run and aggregate metrics, a headless engine_stats module to record/query stats, runtime instrumentation to record runs, CLI commands engines stats/engines compare, and tests covering core, runtime, and CLI paths.

Changes

Cohort / File(s) Summary
Schema & Workspace
codeframe/core/workspace.py
Adds run_engine_log and engine_stats tables plus indexes; integrates creation into setup/upgrade paths.
Engine Stats Core
codeframe/core/engine_stats.py
New module providing record_run, get_engine_stats, get_run_log, and internal aggregate update/upsert logic computing rates/averages with safe handling.
Runtime Instrumentation
codeframe/core/runtime.py
Records monotonic start time and calls engine_stats.record_run after agent execution with duration, tokens, status; wraps call in exception handling.
CLI Commands
codeframe/cli/engines_commands.py
Adds stats (filterable by engine, text/json output) and compare commands; includes helpers to compute success rate, format durations, and build display tables.
Tests
tests/core/test_engine_stats.py, tests/core/test_runtime_engine_stats.py, tests/cli/test_engines_stats_commands.py
New tests for DB logging, aggregate computation, run-log queries, runtime recording behavior (success/failure/token edge cases), and CLI output/formatting/filtering.

Sequence Diagram

sequenceDiagram
    participant Agent as Agent Adapter
    participant Runtime as runtime.py
    participant Stats as engine_stats.py
    participant DB as SQLite
    participant CLI as CLI User

    CLI->>Runtime: trigger execution (engine, task)
    Runtime->>Runtime: record monotonic start_ms
    Agent->>Runtime: return AgentResult (status, token_usage)
    Runtime->>Runtime: compute duration_ms
    Runtime->>Stats: record_run(workspace, run_id, engine, task_id, status, duration_ms, tokens_used, ...)
    Stats->>DB: INSERT INTO run_engine_log
    Stats->>DB: SELECT aggregates FROM run_engine_log
    Stats->>DB: UPSERT INTO engine_stats
    CLI->>Stats: get_engine_stats(workspace, engine?)
    Stats->>DB: SELECT FROM engine_stats
    Stats-->>CLI: return aggregated stats
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

Poem

🐰 I hopped through logs both near and far,
I counted runs by engine and star,
Duration, tokens, success so bright,
SQLite stores them through the night.
Now stats and comparisons dance — hop, delight! 🥕

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'feat(metrics): engine performance tracking and comparison (#419)' accurately reflects the main change: implementing engine performance tracking and comparison capabilities, the primary focus of this PR.
Linked Issues check ✅ Passed The PR fully addresses all coding requirements from issue #419: run outcomes recorded [runtime.py], stats/compare commands implemented [cli/engines_commands.py], token usage tracked [runtime.py], gate pass/self-correction data structure prepared [workspace.py/engine_stats.py], and metrics persisted [engine_stats.py].
Out of Scope Changes check ✅ Passed All changes are directly scoped to issue #419: schema tables [workspace.py], stats recording and querying [engine_stats.py], runtime integration [runtime.py], CLI commands [cli/engines_commands.py], and comprehensive tests. No unrelated modifications detected.
Docstring Coverage ✅ Passed Docstring coverage is 95.45% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch feature/issue-419-engine-perf-tracking
📝 Coding Plan
  • Generate coding plan for human review comments

Comment @coderabbitai help to get the list of available commands and usage tips.

@claude

claude Bot commented Mar 14, 2026

Copy link
Copy Markdown

Code Review: feat(metrics): engine performance tracking and comparison (#419)

Overall this is a well-structured, focused addition. The architecture respects the core-first, headless pattern from CLAUDE.md, the fail-safe wiring in runtime.py is the right call, and the 23-test coverage across schema, aggregation, and CLI is solid. A few issues worth addressing before merge.


Critical

record_run opens two DB connections sequentially (engine_stats.py lines ~191–214 and ~225–289)

record_run calls get_db_connection, inserts and commits, closes, then immediately calls _update_aggregate_stats which opens a second connection to recompute all metrics. On a busy system this is a TOCTOU window — another thread can insert a row between the two connections, causing the aggregate to include data not counted in the just-closed transaction. More practically, it doubles connection overhead on every single task run. The fix is to pass the open connection into _update_aggregate_stats or restructure to do the INSERT and the aggregate recompute in a single connection scope.

# current (two connections)
conn = get_db_connection(workspace)
try:
    conn.execute("INSERT INTO run_engine_log ...")
    conn.commit()
finally:
    conn.close()
_update_aggregate_stats(workspace, engine)   # opens its own connection

# better (one connection)
conn = get_db_connection(workspace)
try:
    conn.execute("INSERT INTO run_engine_log ...")
    _update_aggregate_stats_conn(conn, workspace.id, engine)
    conn.commit()
finally:
    conn.close()

Important

Deferred import of engine_stats inside execute_agent (runtime.py line ~394)

try:
    from codeframe.core import engine_stats
    ...
except Exception:
    logger.debug("Engine stats recording failed", exc_info=True)

The deferred import is unnecessary — engine_stats.py has no circular dependency risk since it only imports from workspace.py. Moving the import to the module top saves a repeated import lookup on every task run and makes the dependency graph visible. The bare except Exception swallowing the error at logger.debug is fine for the fail-safe intent, but should at minimum log exc_info=True at warning level so ops teams can detect persistent DB issues without turning on debug logging. debug will be invisible in production.

_perf_start_ms is placed after the large pre-execution block (runtime.py line ~383)

The timer starts after workspace validation, env config loading, tech stack probing, hook setup — all of which can be slow. This means duration_ms only captures the adapter .run() call, not the full execute_agent cost. That may be intentional (adapter execution time only), but the field name duration_ms with no qualifier in the schema and docs will mislead users who expect wall-clock task duration. Either rename the column to adapter_duration_ms or move the timer to the very top of execute_agent.

agent_status.value.upper() is fragile (runtime.py line ~405)

status=agent_status.value.upper(),

AgentStatus values are already uppercase strings (COMPLETED, FAILED, BLOCKED) based on the codebase conventions, so .upper() is a no-op that only masks the case if the enum value changes. More importantly, agent_status here is assigned from the adapter result's .status field that was mapped earlier in the function. If the result comes back with an unexpected string (e.g. "complete" from a future adapter), the recorded status will be a non-canonical value that won't match any filter in _update_aggregate_stats (WHERE status = 'COMPLETED'). Better to validate against known values or map through the AgentStatus enum explicitly.

avg_tokens_per_task counts only completed-task tokens (engine_stats.py line ~264–266)

avg_tokens_per_task = (
    completed_tokens / completed_count if completed_count > 0 else 0.0
)

The denominator is completed_count but completed_tokens is SUM(tokens_used WHERE status='COMPLETED'). This means failed runs with token consumption are silently excluded from the average. The metric name gives no hint of this. Either rename it avg_tokens_per_completed_task or compute the average across all runs so token cost for failures is visible.


Nice-to-have

Schema duplication between _init_database and _ensure_schema_upgrades (workspace.py)

Both functions now contain identical CREATE TABLE IF NOT EXISTS run_engine_log and engine_stats DDL plus indexes. If the schema evolves (e.g., adding a column), the developer must update two places. Consider extracting the DDL into module-level constants and referencing them from both functions, which is already the cleanest pattern for this kind of idempotent migration.

format shadows the built-in (engines_commands.py line ~110)

format: str = typer.Option("text", "--format", "-f", help="Output format: text or json")

Using format as a parameter name shadows builtins.format. Prefer output_format or fmt to avoid the shadow and any lint warnings (A002 in ruff).

compare command is functionally identical to stats (engines_commands.py lines ~128–142)

Both stats and compare call get_engine_stats(workspace) and pass the result to the same _build_stats_table. The only difference is the table title string. Users running cf engines compare and cf engines stats will see the same data. Consider differentiating them — for example, compare could normalize metrics to a baseline engine, show delta columns, or render a side-by-side view. As-is the duplication will confuse users and the redundancy will accumulate technical debt.

get_run_log is public API with no CLI surface (engine_stats.py line ~331)

get_run_log is exported but there is no cf engines log or cf engines history command that uses it. Either wire it into a CLI command (good candidate: cf engines stats --raw) or mark it _get_run_log (private) until it is actually used, to avoid confusing the public API surface.

Test helper imports inside test methods (tests/core/test_engine_stats.py)

def test_record_run_inserts_row(self, workspace, run_id, task_id):
    from codeframe.core.engine_stats import record_run
    ...

Importing inside every test method is unusual and makes the test file harder to skim. Moving imports to the top of each class or to the module level is cleaner and is the pattern used in the rest of the test suite.


Compliance with CLAUDE.md

  • core/engine_stats.py is correctly headless (no FastAPI imports).
  • CLI layer calls core cleanly.
  • Schema migration uses _ensure_schema_upgrades for existing databases — idempotent.
  • The bare except Exception in runtime.py is the right pattern for fail-safe hooks per the existing convention in the codebase.

The deferred import in runtime and the double-connection pattern in record_run are the two things I'd flag before merge.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (4)
tests/core/test_engine_stats.py (1)

35-53: Consider documenting the _record helper return value.

The helper returns the run_id which is used in some tests but not others. A brief docstring addition clarifying this would improve readability.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/core/test_engine_stats.py` around lines 35 - 53, Update the _record
helper's docstring to mention that it returns the generated run_id string;
specifically, in the _record function (which calls record_run) add a concise
sentence describing that the function creates a fresh run_id and returns that
run_id (string) so tests can reference it.
codeframe/core/engine_stats.py (1)

82-97: Minor redundancy in SQL query.

Lines 85 and 93 both compute COUNT(CASE WHEN status = 'COMPLETED' THEN 1 END), resulting in row[1] and row[9] containing the same value. This doesn't affect correctness but adds minor overhead.

Simplified query
         row = cur.execute(
             "SELECT "
             "  COUNT(*), "
             "  COUNT(CASE WHEN status = 'COMPLETED' THEN 1 END), "
             "  COUNT(CASE WHEN status = 'FAILED' THEN 1 END), "
             "  COUNT(CASE WHEN gates_passed = 1 THEN 1 END), "
             "  COUNT(CASE WHEN gates_passed IS NOT NULL THEN 1 END), "
             "  COUNT(CASE WHEN self_corrections > 0 THEN 1 END), "
             "  AVG(CASE WHEN duration_ms IS NOT NULL THEN duration_ms END), "
-            "  SUM(tokens_used), "
-            "  SUM(CASE WHEN status = 'COMPLETED' THEN tokens_used ELSE 0 END), "
-            "  COUNT(CASE WHEN status = 'COMPLETED' THEN 1 END) "
+            "  SUM(tokens_used), "
+            "  SUM(CASE WHEN status = 'COMPLETED' THEN tokens_used ELSE 0 END) "
             "FROM run_engine_log "
             "WHERE engine = ? AND workspace_id = ?",
             (engine, ws_id),
         ).fetchone()

         total = row[0]
         completed = row[1]
         failed = row[2]
         gates_pass_count = row[3]
         gates_total = row[4]
         self_corr_count = row[5]
         avg_duration = row[6]
         total_tokens = row[7] or 0
         completed_tokens = row[8] or 0
-        completed_count = row[9]
+        completed_count = completed  # Reuse row[1]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@codeframe/core/engine_stats.py` around lines 82 - 97, The SQL SELECT
duplicates COUNT(CASE WHEN status = 'COMPLETED' THEN 1 END) (producing row[1]
and row[9]) in the cur.execute(...) against run_engine_log; remove the redundant
expression from the query (or consolidate by using an alias once) and update any
code that referenced the removed index (e.g., uses of row[9]) to use the single
completed-count column (row[1] or the chosen alias) so the tuple indices remain
correct.
codeframe/core/runtime.py (1)

675-677: Consider moving the time import to the module level.

The time module import inside the function adds minor overhead on each call. Moving it to the module-level imports would be cleaner.

Suggested change

Add to the module-level imports at the top of the file:

import time

Then update lines 675-677:

-    import time as _time_mod
-    _perf_start_ms = int(_time_mod.monotonic() * 1000)
+    _perf_start_ms = int(time.monotonic() * 1000)

And update line 811:

-            _perf_duration_ms = int(_time_mod.monotonic() * 1000) - _perf_start_ms
+            _perf_duration_ms = int(time.monotonic() * 1000) - _perf_start_ms
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@codeframe/core/runtime.py` around lines 675 - 677, Move the time import out
of the function and into module-level imports: add "import time" at the top of
the file, remove the inline "import time as _time_mod", and update all places
referencing _time_mod.monotonic() (including the code that sets _perf_start_ms
and the later use at the end of the runtime where _time_mod is referenced) to
call time.monotonic() instead so the module-level import is used consistently.
codeframe/cli/engines_commands.py (1)

170-188: Minor: format parameter shadows built-in.

The parameter format shadows Python's built-in format function. While this works fine, renaming to output_format or fmt would be slightly cleaner.

Suggested change
 `@engines_app.command`("stats")
 def stats(
     engine: Optional[str] = typer.Option(None, "--engine", "-e", help="Filter by engine name"),
-    format: str = typer.Option("text", "--format", "-f", help="Output format: text or json"),
+    output_format: str = typer.Option("text", "--format", "-f", help="Output format: text or json"),
 ) -> None:
     """Show engine performance statistics."""
     workspace = _get_current_workspace()
     data = engine_stats.get_engine_stats(workspace, engine=engine)

     if not data:
         console.print("[yellow]No engine stats recorded yet.[/yellow]")
         return

-    if format == "json":
+    if output_format == "json":
         console.print(_json.dumps(data, indent=2))
         return

     table = _build_stats_table(data, title="Engine Performance Stats")
     console.print(table)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@codeframe/cli/engines_commands.py` around lines 170 - 188, The CLI command
function stats currently uses a parameter named format which shadows the
built-in format; rename the parameter to output_format (or fmt) in the stats
function signature and update all internal references (the typer.Option
declaration, the comparison if format == "json", and the console.print branches)
to use output_format so behavior remains identical but avoids shadowing; keep
the CLI flag name "--format" in the typer.Option so the external CLI stays the
same and update the docstring/variable names accordingly (function: stats, local
vars: output_format).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/core/test_runtime_engine_stats.py`:
- Around line 36-44: The test creates a Run with started_at as a string but
Run.started_at is typed as a datetime; change the test to pass a datetime object
(e.g., use datetime.fromisoformat or datetime.strptime to parse
"2026-03-14T00:00:00Z") and ensure the test file imports datetime; update the
Run instantiation in tests/core/test_runtime_engine_stats.py (the Run(...) call)
to set started_at=<datetime object> instead of the string.

---

Nitpick comments:
In `@codeframe/cli/engines_commands.py`:
- Around line 170-188: The CLI command function stats currently uses a parameter
named format which shadows the built-in format; rename the parameter to
output_format (or fmt) in the stats function signature and update all internal
references (the typer.Option declaration, the comparison if format == "json",
and the console.print branches) to use output_format so behavior remains
identical but avoids shadowing; keep the CLI flag name "--format" in the
typer.Option so the external CLI stays the same and update the
docstring/variable names accordingly (function: stats, local vars:
output_format).

In `@codeframe/core/engine_stats.py`:
- Around line 82-97: The SQL SELECT duplicates COUNT(CASE WHEN status =
'COMPLETED' THEN 1 END) (producing row[1] and row[9]) in the cur.execute(...)
against run_engine_log; remove the redundant expression from the query (or
consolidate by using an alias once) and update any code that referenced the
removed index (e.g., uses of row[9]) to use the single completed-count column
(row[1] or the chosen alias) so the tuple indices remain correct.

In `@codeframe/core/runtime.py`:
- Around line 675-677: Move the time import out of the function and into
module-level imports: add "import time" at the top of the file, remove the
inline "import time as _time_mod", and update all places referencing
_time_mod.monotonic() (including the code that sets _perf_start_ms and the later
use at the end of the runtime where _time_mod is referenced) to call
time.monotonic() instead so the module-level import is used consistently.

In `@tests/core/test_engine_stats.py`:
- Around line 35-53: Update the _record helper's docstring to mention that it
returns the generated run_id string; specifically, in the _record function
(which calls record_run) add a concise sentence describing that the function
creates a fresh run_id and returns that run_id (string) so tests can reference
it.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 786ff895-524f-48b9-bbb9-394e26805d64

📥 Commits

Reviewing files that changed from the base of the PR and between d2f2660 and 16fe673.

📒 Files selected for processing (7)
  • codeframe/cli/engines_commands.py
  • codeframe/core/engine_stats.py
  • codeframe/core/runtime.py
  • codeframe/core/workspace.py
  • tests/cli/test_engines_stats_commands.py
  • tests/core/test_engine_stats.py
  • tests/core/test_runtime_engine_stats.py

Comment on lines +36 to +44
run = Run(
id="run-001",
workspace_id=workspace.id,
task_id="task-001",
status=RunStatus.RUNNING,
started_at="2026-03-14T00:00:00Z",
completed_at=None,
)
return workspace, run

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Type mismatch: started_at should be a datetime object, not a string.

The Run dataclass in runtime.py (line 58) declares started_at: datetime, but this mock uses a string. While the current tests pass because complete_run, fail_run, and block_run are patched, this could cause issues if the test setup is reused elsewhere or if patches are modified.

Suggested fix
+from datetime import datetime, timezone
+
 def _make_workspace_and_run(tmp_path):
     """Create a real workspace and a mock Run for testing."""
     from codeframe.core.workspace import create_or_load_workspace
     from codeframe.core.runtime import Run, RunStatus

     repo = tmp_path / "test_repo"
     repo.mkdir()
     workspace = create_or_load_workspace(repo)

     run = Run(
         id="run-001",
         workspace_id=workspace.id,
         task_id="task-001",
         status=RunStatus.RUNNING,
-        started_at="2026-03-14T00:00:00Z",
+        started_at=datetime(2026, 3, 14, tzinfo=timezone.utc),
         completed_at=None,
     )
     return workspace, run
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/core/test_runtime_engine_stats.py` around lines 36 - 44, The test
creates a Run with started_at as a string but Run.started_at is typed as a
datetime; change the test to pass a datetime object (e.g., use
datetime.fromisoformat or datetime.strptime to parse "2026-03-14T00:00:00Z") and
ensure the test file imports datetime; update the Run instantiation in
tests/core/test_runtime_engine_stats.py (the Run(...) call) to set
started_at=<datetime object> instead of the string.

- Use single DB connection in record_run (fixes TOCTOU/double connection)
- Move engine_stats import to module level in runtime.py
- Upgrade stats failure logging from debug to warning
- Rename format→output_format to avoid shadowing builtin
- Keep .upper() on status values for SQL convention consistency

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (3)
codeframe/cli/engines_commands.py (1)

124-128: Duration formatting could handle edge cases more gracefully.

The function handles values < 1000ms and >= 1000ms, but negative values (which shouldn't occur but could indicate a bug upstream) would display incorrectly.

🛡️ Optional: Add defensive check for negative durations
 def _format_duration(ms: float) -> str:
     """Format duration in human-readable form."""
+    if ms < 0:
+        return "N/A"
     if ms < 1000:
         return f"{ms:.0f}ms"
     return f"{ms / 1000:.1f}s"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@codeframe/cli/engines_commands.py` around lines 124 - 128, The
_format_duration function should defensively handle negative inputs; update
_format_duration(ms: float) to clamp negative ms to 0 (e.g., if ms < 0: ms = 0)
before formatting so upstream timing bugs don't produce misleading negative
durations, then keep the existing branch logic for <1000ms and >=1000ms to
return "0ms" or the seconds string as appropriate.
codeframe/core/engine_stats.py (2)

92-107: Minor SQL redundancy: completed count is computed twice.

Lines 95 and 103 both count status = 'COMPLETED'. While functionally correct, this is slightly redundant.

♻️ Optional: Remove redundant column
     row = cur.execute(
         "SELECT "
         "  COUNT(*), "
         "  COUNT(CASE WHEN status = 'COMPLETED' THEN 1 END), "
         "  COUNT(CASE WHEN status = 'FAILED' THEN 1 END), "
         "  COUNT(CASE WHEN gates_passed = 1 THEN 1 END), "
         "  COUNT(CASE WHEN gates_passed IS NOT NULL THEN 1 END), "
         "  COUNT(CASE WHEN self_corrections > 0 THEN 1 END), "
         "  AVG(CASE WHEN duration_ms IS NOT NULL THEN duration_ms END), "
         "  SUM(tokens_used), "
-        "  SUM(CASE WHEN status = 'COMPLETED' THEN tokens_used ELSE 0 END), "
-        "  COUNT(CASE WHEN status = 'COMPLETED' THEN 1 END) "
+        "  SUM(CASE WHEN status = 'COMPLETED' THEN tokens_used ELSE 0 END) "
         "FROM run_engine_log "
         "WHERE engine = ? AND workspace_id = ?",
         (engine, ws_id),
     ).fetchone()

     total = row[0]
     completed = row[1]
     failed = row[2]
     gates_pass_count = row[3]
     gates_total = row[4]
     self_corr_count = row[5]
     avg_duration = row[6]
     total_tokens = row[7] or 0
     completed_tokens = row[8] or 0
-    completed_count = row[9]
+    completed_count = completed  # Reuse the count from row[1]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@codeframe/core/engine_stats.py` around lines 92 - 107, The SQL in the
cur.execute call in engine_stats.py selects the same COUNT(CASE WHEN status =
'COMPLETED' THEN 1 END) twice; remove the duplicate column from that SELECT (the
second occurrence near the end of the list) and then adjust any code that reads
the resulting row tuple (references to row indices/variables that expect the
extra column) so the mapping of counts/averages/sums to variables remains
correct. Use the existing cur.execute(...) invocation and the enclosing function
that processes row to find and update all dependent index usages.

14-66: Consider emitting an event for observability on run recording.

Per coding guidelines, core modules should "emit events for state transitions via core/events.py for audit and observability." Recording an engine run is arguably a state transition that could benefit from event emission for audit trails.

However, this may be intentionally omitted since:

  1. The recording itself is already a form of persistence/audit
  2. The runtime already handles fail-safe recording with warning logs

If events are desired, they could be added without blocking this PR.

The single-connection approach correctly addresses the TOCTOU concern mentioned in the commit message. The try/finally ensures the connection is always closed.

As per coding guidelines: "All core modules must emit events for state transitions via core/events.py for audit and observability"

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@codeframe/core/engine_stats.py` around lines 14 - 66, The record_run function
currently persists engine runs but does not emit an observability event; import
the event emitter from core/events.py (e.g. emit_event) and, after the INSERT
and after calling _update_aggregate_stats_conn but before conn.commit(), call
the emitter to emit a concise "engine_run_recorded" (or similar) event
containing run_id, engine, task_id, workspace_id (workspace.id), status,
duration_ms, tokens_used, gates_passed, self_corrections and created_at (now) so
consumers can audit the state transition without changing the DB transaction
semantics.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@codeframe/cli/engines_commands.py`:
- Around line 124-128: The _format_duration function should defensively handle
negative inputs; update _format_duration(ms: float) to clamp negative ms to 0
(e.g., if ms < 0: ms = 0) before formatting so upstream timing bugs don't
produce misleading negative durations, then keep the existing branch logic for
<1000ms and >=1000ms to return "0ms" or the seconds string as appropriate.

In `@codeframe/core/engine_stats.py`:
- Around line 92-107: The SQL in the cur.execute call in engine_stats.py selects
the same COUNT(CASE WHEN status = 'COMPLETED' THEN 1 END) twice; remove the
duplicate column from that SELECT (the second occurrence near the end of the
list) and then adjust any code that reads the resulting row tuple (references to
row indices/variables that expect the extra column) so the mapping of
counts/averages/sums to variables remains correct. Use the existing
cur.execute(...) invocation and the enclosing function that processes row to
find and update all dependent index usages.
- Around line 14-66: The record_run function currently persists engine runs but
does not emit an observability event; import the event emitter from
core/events.py (e.g. emit_event) and, after the INSERT and after calling
_update_aggregate_stats_conn but before conn.commit(), call the emitter to emit
a concise "engine_run_recorded" (or similar) event containing run_id, engine,
task_id, workspace_id (workspace.id), status, duration_ms, tokens_used,
gates_passed, self_corrections and created_at (now) so consumers can audit the
state transition without changing the DB transaction semantics.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: be1b3109-0ab6-42a9-a4ef-05502231fbc2

📥 Commits

Reviewing files that changed from the base of the PR and between 16fe673 and 7eb3e13.

📒 Files selected for processing (3)
  • codeframe/cli/engines_commands.py
  • codeframe/core/engine_stats.py
  • codeframe/core/runtime.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • codeframe/core/runtime.py

@claude

claude Bot commented Mar 14, 2026

Copy link
Copy Markdown

Follow-up Review

Good progress since the initial review. Here is a status update on what was addressed and what remains.

Resolved:

  • Double DB connection (Critical): record_run now calls _update_aggregate_stats_conn within the same open connection before committing. The TOCTOU window and double overhead are gone.
  • Deferred engine_stats import: Now correctly at the module level in runtime.py.
  • format shadows built-in: Renamed to output_format in stats.

Still Open:

  1. import time as _time_mod inside the function body (runtime.py ~line 675)

Both this review and CodeRabbit flagged this. The local alias _time_mod is unnecessary. Move to module-level:

import time
_perf_start_ms = int(time.monotonic() * 1000)
_perf_duration_ms = int(time.monotonic() * 1000) - _perf_start_ms
  1. Duplicate SQL COUNT expression in _update_aggregate_stats_conn (engine_stats.py ~line 82-97)

row[1] (completed) and row[9] (completed_count) are computed by the same COUNT(CASE WHEN status = COMPLETED ...) expression. CodeRabbit also flagged this. Replace completed_count = row[9] with completed_count = completed and remove the redundant tenth SELECT column.

  1. avg_tokens_per_task excludes failed-run token consumption (engine_stats.py ~line 104-106)

From my prior review - still unaddressed. The denominator is completed_count but completed_tokens is SUM(tokens_used WHERE status=COMPLETED), so token cost from failed runs is silently excluded. The metric name gives no hint of this. Either rename to avg_tokens_per_completed_task or compute across all runs so total cost (including failures) is visible.

  1. Run.started_at type mismatch in test (tests/core/test_runtime_engine_stats.py ~line 44)

CodeRabbit flagged this and it is still in the current diff. The field is typed datetime but the test passes a raw string: started_at="2026-03-14T00:00:00Z". Fix: use started_at=datetime(2026, 3, 14, tzinfo=timezone.utc) and add from datetime import datetime, timezone.


Still Nice-to-Have (from prior review, not blocking):

  • compare and stats render identical output - consider differentiating (delta columns, baseline normalization)
  • get_run_log is public API with no CLI surface - mark private or add cf engines log subcommand
  • Test helper imports inside each test method - move to module/class level per the rest of the test suite

The critical fix landed cleanly. The import time, duplicate SQL column, avg_tokens_per_task naming, and started_at type mismatch are the remaining items worth resolving before merge.

@frankbria frankbria merged commit cf3fa89 into main Mar 14, 2026
11 checks passed
Repository owner deleted a comment from claude Bot Mar 15, 2026
Repository owner deleted a comment from claude Bot Mar 15, 2026
@frankbria frankbria deleted the feature/issue-419-engine-perf-tracking branch March 24, 2026 23:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Phase 5] Engine Performance Tracking and Comparison

1 participant