
Add latency to trace #1340

Merged
chiang-daniel merged 11 commits into main from KIL-542/perf
Apr 30, 2026

Conversation

@chiang-daniel
Contributor

@chiang-daniel chiang-daniel commented Apr 23, 2026

What does this PR do?

Add per-LLM-call latency tracking to Trace.

  • Add total_llm_latency_ms to the usage model to track overall LLM latency, alongside cost and tokens.
  • Add mean_total_llm_latency_ms to the eval API for comparison.
  • Update the run page, compare table, radar chart, and metric correlation chart.

Checklists

  • Tests have been run locally and passed
  • New tests have been added to any work in /lib

chiang-daniel and others added 3 commits April 23, 2026 14:08
Add per-call latency timing (latency_ms) to trace messages and
cumulative total_llm_latency_ms to the Usage model. Instrument both
LiteLlmAdapter and AdapterStream to measure wall-clock time of each
LLM API call using time.monotonic, accumulate across multi-turn tool
call loops, and propagate latency into trace messages. Update
Usage.__add__ to support the new field, regenerate the OpenAPI schema,
and add comprehensive tests for field validation, addition semantics,
backwards compatibility, and adapter-level latency tracking.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
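The commit above describes two pieces: timing each LLM call with a monotonic clock and teaching Usage addition about the new field. A minimal sketch of both, with simplified, illustrative names (the real Usage model is a Pydantic class with many more fields, and its exact None-handling in `__add__` is an assumption here):

```python
import time


class Usage:
    """Minimal stand-in for the Usage model; illustrative only."""

    def __init__(self, total_llm_latency_ms=None):
        self.total_llm_latency_ms = total_llm_latency_ms

    def __add__(self, other):
        # Assumed semantics: None means "no data", so adding two empty
        # usages stays None rather than silently becoming 0.
        if self.total_llm_latency_ms is None and other.total_llm_latency_ms is None:
            total = None
        else:
            total = (self.total_llm_latency_ms or 0) + (other.total_llm_latency_ms or 0)
        return Usage(total_llm_latency_ms=total)


def timed_call(fn):
    """Wall-clock one LLM call; time.monotonic() is immune to system clock changes."""
    start = time.monotonic()
    result = fn()
    latency_ms = int((time.monotonic() - start) * 1000)
    return result, latency_ms
```

In a multi-turn tool-call loop, each call's latency_ms would be accumulated into a single running total before being stored on the trace.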
Add mean_total_llm_latency_ms to MeanUsage model and wire up
accumulators in eval_api.py to compute average latency across eval
runs, respecting the existing 50% data-availability threshold.
Regenerate OpenAPI schema and add tests for both above- and
below-threshold latency scenarios.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
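The averaging rule above can be sketched as follows. This is a hedged illustration with a hypothetical function name, not the eval_api.py implementation; in particular, whether exactly 50% coverage qualifies is an assumption (the boundary semantics live in the real threshold check):

```python
def mean_latency_ms(per_run_latencies):
    """Mean of the available latencies, or None when fewer than half
    of the runs have latency data (assumed: exactly 50% qualifies)."""
    if not per_run_latencies:
        return None
    with_data = [v for v in per_run_latencies if v is not None]
    coverage = len(with_data) / len(per_run_latencies)
    if coverage < 0.5:
        return None
    return sum(with_data) / len(with_data)
```

Returning None rather than 0 keeps "insufficient data" distinguishable from "genuinely zero latency" in downstream charts.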
Add latency tracking to all frontend display surfaces:
- Run page: "Total Latency" and "Subtasks Latency" in usage properties
- Compare table: "Latency" row in the usage section
- Radar chart: "Speed" axis with lower-is-better scoring
- Scatter plot: latency available as selectable axis

Rename compare section from "Average Usage & Cost" to
"Average Usage, Cost & Latency". Refactor radar chart to support
multiple lower-is-better metrics with independent normalization.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@coderabbitai
Contributor

coderabbitai Bot commented Apr 23, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Adds end-to-end LLM latency tracking: adapters measure per-call/per-message latency and accumulate into Usage.total_llm_latency_ms; eval aggregation exposes MeanUsage.mean_total_llm_latency_ms (computed only when ≥50% of runs have data); UI/schema/charting is updated to surface and format latency.

Changes

Cohort / File(s) Summary
Core Data Model & Types
libs/core/kiln_ai/datamodel/task_run.py, libs/core/kiln_ai/utils/open_ai_types.py, libs/core/kiln_ai/datamodel/test_example_models.py, libs/core/kiln_ai/utils/test_open_ai_types.py
Add optional Usage.total_llm_latency_ms (non-negative) and ChatCompletionAssistantMessageParamWrapper.latency_ms; update aggregation and add validation/tests.
Model Adapter Instrumentation & Tests
libs/core/kiln_ai/adapters/model_adapters/adapter_stream.py, libs/core/kiln_ai/adapters/model_adapters/litellm_adapter.py, libs/core/kiln_ai/adapters/model_adapters/test_litellm_adapter.py
Time streaming/completion iterations with time.monotonic(), accumulate per-call latencies into usage.total_llm_latency_ms, record per-message latencies, and thread latency maps into traces; add tests for timing and accumulation.
Eval API & Tests
app/desktop/studio_server/eval_api.py, app/desktop/studio_server/test_eval_api.py
Extend MeanUsage with mean_total_llm_latency_ms; aggregate per-run usage.total_llm_latency_ms across runs and compute mean only when sample coverage meets existing ≥50% threshold; add tests for pass/fail cases.
API Schema (TypeScript)
app/web_ui/src/lib/api_schema.d.ts
Expose new optional fields: ChatCompletionAssistantMessageParamWrapper.*.latency_ms, Usage.total_llm_latency_ms, and MeanUsage.mean_total_llm_latency_ms.
UI Formatting Utility
app/web_ui/src/lib/utils/formatters.ts
Add exported formatLatency(ms: number): string to render ms or seconds.
Compare Charts & Radar
app/web_ui/src/lib/components/compare_chart.svelte, app/web_ui/src/lib/components/compare_radar_chart.svelte
Add latency-specific formatting to chart tooltips/axis; generalize radar scoring (costToScore → metricToScore) to treat latency as lower-is-better and show scored values.
Run Details & Compare Page UI
app/web_ui/src/routes/(app)/run/run.svelte, app/web_ui/src/routes/(app)/specs/[project_id]/[task_id]/compare/+page.svelte
Accumulate/propagate subtask_latency_ms, add guarded UI state for latency requests, compute/display "Total Latency" and "Subtasks Latency" using formatLatency, and surface latency metric in compare page.
Client Type Definitions (repeat)
app/web_ui/src/lib/api_schema.d.ts
Ensure generated client types include new latency fields used by UI.
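The formatLatency helper in the table above renders milliseconds or seconds depending on magnitude. A Python analogue of that formatting rule (the shipped helper is TypeScript in formatters.ts; rounding before unit selection is one plausible implementation that avoids displaying "1000ms" for inputs like 999.6):

```python
def format_latency(ms):
    """Render sub-second values as whole milliseconds, longer ones as
    seconds with one decimal. Rounding first keeps near-second values
    from printing as "1000ms"."""
    rounded = round(ms)
    if rounded < 1000:
        return f"{rounded}ms"
    return f"{ms / 1000:.1f}s"
```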

Sequence Diagram

sequenceDiagram
    participant UI as Client (UI)
    participant API as Eval API Server
    participant DB as Database
    participant Adapter as Model Adapter
    participant LLM as LLM Provider

    UI->>API: GET eval scores for run config
    API->>DB: Query eval run records (including usage)
    DB-->>API: Return eval run records

    loop per eval run with usage
        Note over Adapter,LLM: original runs recorded latencies
        Adapter-->>API: Stored usage includes total_llm_latency_ms and per-message latency map
    end

    API->>API: Aggregate usage across runs (sum, count samples_with_data)
    API->>API: If samples_with_data >= 50% → compute mean_total_llm_latency_ms else → null
    API-->>UI: Return MeanUsage including mean_total_llm_latency_ms
    UI->>UI: formatLatency(mean_total_llm_latency_ms) and render charts/tables

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • scosman
  • tawnymanticore
  • leonardmq

"i am a rabbit, timing hops in my paws,
counting milliseconds between each applause.
adapters tick, traces glow, charts hum with glee,
fifty-percent nods — the mean appears for all to see.
🐇⏱️✨"

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 61.29%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.

✅ Passed checks (4 passed)

  • Title check — ✅ Passed: The title 'Add latency to trace' clearly and concisely describes the main objective of the changeset: adding latency tracking to the trace system.
  • Description check — ✅ Passed: The description covers what the PR does, includes the checkboxes for tests and new test additions, and follows the required template, omitting only the optional 'Related Issues' section.
  • Linked Issues check — ✅ Passed: Skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check — ✅ Passed: Skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces LLM latency tracking across the application, including data model updates, backend measurement in model adapters, and frontend visualizations in charts and run summaries. A significant issue was identified in the streaming adapter where latency measurements would be inflated by the consumer's processing time; a suggestion was provided to measure only the time spent waiting for chunks. Additionally, it is recommended to rename the costToScore function to reflect its broader use for all lower-is-better metrics and to centralize the duplicated latency formatting logic into a shared utility.
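The streaming concern raised above — that timing the whole loop charges the provider for the consumer's processing time — can be sketched in isolation. This is an illustrative Python sketch with hypothetical names, not the AdapterStream code: it times only the awaits on the upstream iterator, including the terminal wait that raises StopAsyncIteration:

```python
import asyncio
import time


async def timed_chunks(stream, latency_ms_box):
    """Re-yield chunks from `stream`, accumulating into latency_ms_box[0]
    only the time spent awaiting each chunk, not the time the consumer
    spends processing yielded chunks."""
    it = stream.__aiter__()
    waited = 0.0
    while True:
        start = time.monotonic()
        try:
            chunk = await it.__anext__()
        except StopAsyncIteration:
            waited += time.monotonic() - start  # include the terminal wait
            break
        waited += time.monotonic() - start
        yield chunk  # consumer runs here; its time is not counted
    latency_ms_box[0] += int(waited * 1000)


async def demo():
    async def slow_stream():
        for i in range(3):
            await asyncio.sleep(0.05)  # simulated provider delay
            yield i

    box = [0]
    out = []
    t0 = time.monotonic()
    async for chunk in timed_chunks(slow_stream(), box):
        await asyncio.sleep(0.05)  # consumer work; should NOT be counted
        out.append(chunk)
    wall_ms = int((time.monotonic() - t0) * 1000)
    return out, box[0], wall_ms
```

In the demo, provider delays and consumer work each total roughly 150 ms of wall time, but only the provider side lands in the accumulated latency.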

Comment thread libs/core/kiln_ai/adapters/model_adapters/adapter_stream.py
Comment thread app/web_ui/src/lib/components/compare_radar_chart.svelte Outdated
Comment thread app/web_ui/src/routes/(app)/run/run.svelte Outdated
@github-actions

github-actions Bot commented Apr 23, 2026

📊 Coverage Report

Overall Coverage: 45%

Diff: origin/main...HEAD

  • libs/core/kiln_ai/datamodel/task_run.py (100%)
  • libs/core/kiln_ai/utils/open_ai_types.py (100%)

Summary

  • Total: 3 lines
  • Missing: 0 lines
  • Coverage: 100%

Fix streaming path to only measure time waiting on LLM chunks, not
time the consumer spends processing yielded chunks. Also fix misleading
test that claimed to test multi-call accumulation but only exercised a
single LLM call.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Function now handles both cost and latency metrics, so the name should
reflect its broader purpose.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@chiang-daniel chiang-daniel marked this pull request as ready for review April 24, 2026 00:04
@chiang-daniel
Contributor Author

@coderabbitai review

@coderabbitai
Contributor

coderabbitai Bot commented Apr 24, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 9

🧹 Nitpick comments (3)
app/desktop/studio_server/test_eval_api.py (1)

2202-2359: Add an exact-50% threshold test to lock boundary semantics.

You currently test >50% and <50%, but not exactly 50%, which is where regressions usually slip in.

🧪 Suggested boundary test
+@pytest.mark.asyncio
+async def test_get_run_config_eval_scores_latency_at_threshold(
+    client, mock_task_from_id, mock_task, mock_eval, mock_eval_config, mock_run_config
+):
+    """At exactly 50% availability, latency inclusion should follow intended threshold semantics."""
+    mock_task_from_id.return_value = mock_task
+
+    # 2 runs total, exactly 1 has latency => 50%
+    task_run_1 = TaskRun(
+        input="test input 1",
+        input_source=DataSource(
+            type=DataSourceType.synthetic,
+            properties={
+                "model_name": "gpt-4",
+                "model_provider": "openai",
+                "adapter_name": "langchain_adapter",
+            },
+        ),
+        output=TaskOutput(output="test output 1"),
+        usage=Usage(input_tokens=100, output_tokens=50, total_tokens=150, cost=0.005, total_llm_latency_ms=500),
+        parent=mock_task,
+    )
+    task_run_1.save_to_file()
+
+    task_run_2 = TaskRun(
+        input="test input 2",
+        input_source=DataSource(
+            type=DataSourceType.synthetic,
+            properties={
+                "model_name": "gpt-4",
+                "model_provider": "openai",
+                "adapter_name": "langchain_adapter",
+            },
+        ),
+        output=TaskOutput(output="test output 2"),
+        usage=Usage(input_tokens=200, output_tokens=100, total_tokens=300, cost=0.010),
+        parent=mock_task,
+    )
+    task_run_2.save_to_file()
+
+    eval_run_1 = EvalRun(task_run_config_id=mock_run_config.id, scores={"score1": 4.0, "overall_rating": 4.0}, input="i1", output="o1", dataset_id=task_run_1.id, task_run_usage=task_run_1.usage, parent=mock_eval_config)
+    eval_run_1.save_to_file()
+    eval_run_2 = EvalRun(task_run_config_id=mock_run_config.id, scores={"score1": 4.5, "overall_rating": 4.5}, input="i2", output="o2", dataset_id=task_run_2.id, task_run_usage=task_run_2.usage, parent=mock_eval_config)
+    eval_run_2.save_to_file()
+
+    # ...same mock/patch pattern as neighboring tests...
+    # Assert explicit expected behavior at exactly 50%:
+    # assert mean_usage["mean_total_llm_latency_ms"] is <expected per threshold contract>
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/desktop/studio_server/test_eval_api.py` around lines 2202 - 2359, Add a
boundary test to cover the exact 50% case: create a new test (e.g.
test_get_run_config_eval_scores_latency_at_50_percent) that mirrors
test_get_run_config_eval_scores_latency_below_threshold but creates 4
TaskRuns/EvalRuns with exactly 2 of the TaskRun.usage entries having
total_llm_latency_ms set (2/4 = 50%), patch
task_from_id/eval_from_id/task_run_config_from_id and dataset_ids_in_filter the
same way, call the same endpoint, and assert the behaviour of
mean_usage["mean_total_llm_latency_ms"] for the exact-50% case (set the expected
assertion to match the current implementation's semantics—i.e., assert not None
if the code treats >=50% as sufficient, or assert None if it requires >50%).
app/web_ui/src/lib/components/compare_radar_chart.svelte (2)

165-189: Consolidate the COST_KEY / LATENCY_KEY tooltip branches.

The two branches are structurally identical aside from the label/format of the raw value. A small per-metric formatter map keeps future lower-is-better additions one-line changes and prevents subtle drift between the two branches.

♻️ Illustrative refactor
-      } else if (key === COST_KEY) {
-        const displayValue = metricToScore(
-          rawValue,
-          lowerIsBetterValues[key] || [],
-        )
-        html += `<div>${label}: ${displayValue.toFixed(1)} <span style="color: #888;">(Mean Cost: $${rawValue.toFixed(6)})</span></div>`
-      } else if (key === LATENCY_KEY) {
-        const displayValue = metricToScore(
-          rawValue,
-          lowerIsBetterValues[key] || [],
-        )
-        const formatted =
-          rawValue < 1000
-            ? `${Math.round(rawValue)}ms`
-            : `${(rawValue / 1000).toFixed(1)}s`
-        html += `<div>${label}: ${displayValue.toFixed(1)} <span style="color: #888;">(Mean Latency: ${formatted})</span></div>`
+      } else if (isLowerIsBetterMetric(key)) {
+        const displayValue = metricToScore(
+          rawValue,
+          lowerIsBetterValues[key] || [],
+        )
+        const rawLabel = key === COST_KEY ? "Mean Cost" : "Mean Latency"
+        const rawFormatted =
+          key === COST_KEY ? `$${rawValue.toFixed(6)}` : formatLatency(rawValue)
+        html += `<div>${label}: ${displayValue.toFixed(1)} <span style="color: #888;">(${rawLabel}: ${rawFormatted})</span></div>`
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/web_ui/src/lib/components/compare_radar_chart.svelte` around lines 165 -
189, The two branches handling COST_KEY and LATENCY_KEY inside dataKeys.forEach
are duplicated; extract a per-metric raw-value formatter map and use a single
branch that calls metricToScore and then formats the raw value via the map
before composing html. Specifically, update the logic in dataKeys.forEach to
call getKeyLabel and getModelValueRaw as before, compute displayValue via
metricToScore using lowerIsBetterValues[key], and replace the
COST_KEY/LATENCY_KEY conditionals with a formatter lookup (e.g.,
formatters[COST_KEY], formatters[LATENCY_KEY]) that returns the formatted raw
string (like currency with 6 decimals or ms/s formatting) and fall back to
rawValue.toFixed(3); then build the html with the unified template that inserts
label, displayValue.toFixed(1), and the formatted raw string.

46-78: Rename parameters to match the generalized helper.

After renaming costToScore → metricToScore, the parameters are still cost / costs, which reads misleadingly when the function is invoked for latency. Renaming them tightens readability without changing behavior.

♻️ Proposed refactor
-  export function metricToScore(
-    cost: number,
-    costs: number[],
+  export function metricToScore(
+    value: number,
+    values: number[],
     {
       padding = 10, // keep endpoints away from 0/100
       relFull = 0.7, // when (hi-lo)/|hi| reaches this, use full spread (k=1)
     }: {
       padding?: number
       relFull?: number
     } = {},
   ): number {
-    const lo = Math.min(...costs)
-    const hi = Math.max(...costs)
+    const lo = Math.min(...values)
+    const hi = Math.max(...values)

     const range = hi - lo
     if (range <= 0) return 50

     // 1) range-based normalized position
-    const t = (cost - lo) / range
+    const t = (value - lo) / range

-    // 2) raw padded linear score (lower cost = higher score)
+    // 2) raw padded linear score (lower value = higher score)
     const raw = padding + (1 - t) * (100 - 2 * padding)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/web_ui/src/lib/components/compare_radar_chart.svelte` around lines 46 -
78, The parameter names cost and costs in the exported function metricToScore
are misleading after renaming; rename them to generic names (e.g., value and
values or metric and metrics) across the function signature and all internal
references inside metricToScore to improve readability without changing
behavior, keeping the same default behavior and return logic in metricToScore.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@app/web_ui/src/lib/components/compare_radar_chart.svelte`:
- Around line 181-185: Replace the inline latency formatting in
compare_radar_chart.svelte with the shared helper: import and call
formatLatency(ms: number) from app/web_ui/src/lib/utils/formatters.ts instead of
computing `formatted` manually (the block that sets `formatted` and uses it in
the HTML string); update the code that builds `html` to use
formatLatency(rawValue) where the previous `${formatted}` was inserted so all
latency display follows the single source-of-truth.

In `@app/web_ui/src/lib/utils/formatters.ts`:
- Around line 330-333: The formatLatency function can output "1000ms" for values
like 999.6 due to rounding before unit selection; change the logic so unit
selection uses the rounded ms value: compute roundedMs = Math.round(ms) and if
roundedMs < 1000 return `${roundedMs}ms`, otherwise return seconds using the
original ms divided by 1000 (e.g., `${(ms/1000).toFixed(1)}s`) so values that
round to 1000ms become "1.0s" instead of "1000ms".

In `@app/web_ui/src/routes/(app)/specs/[project_id]/[task_id]/compare/+page.svelte`:
- Around line 496-501: The latency formatting is leaking into the diff
calculation: when category === "cost" and scoreKey like
"mean_total_llm_latency_ms" you should keep using formatLatency(value) only for
display, and use getModelValueRaw() (the raw millisecond value) when computing
the percent-difference/badge so comparisons remain numeric; update the
percent-difference calculation to call getModelValueRaw() for both sides when
scoreKey corresponds to latency (e.g., "mean_total_llm_latency_ms") and leave
formatLatency() exclusively in the cell rendering path.

In `@libs/core/kiln_ai/adapters/model_adapters/adapter_stream.py`:
- Around line 171-180: The stream latency currently sums time only for
iterations that yield a chunk and misses the final await after the last chunk;
change the loop in adapter_stream.py to measure each await of the stream instead
of only time inside the async for body: replace the "async for chunk in stream"
with an explicit loop that records time before awaiting stream.__anext__(), adds
the elapsed time to call_latency_seconds after the await (so the final tail wait
is included), yields the chunk, and on StopAsyncIteration records the final
await duration before breaking; keep existing variables call_latency_seconds and
chunk_wait_start semantics so consumer processing time remains excluded.

In `@specs/projects/performance-tracking/architecture.md`:
- Line 222: The documentation references a renamed helper; replace uses of
metricToScore() with the canonical costToScore() so the docs match
functional_spec.md and the implementation. Update the "Speed" radar axis note
(and any other occurrences in this section) to call costToScore() and ensure the
wording still notes "lower is better". Verify the symbol name is exactly
costToScore() wherever metricToScore() appears.

In `@specs/projects/performance-tracking/functional_spec.md`:
- Around line 154-156: The spec still references the old helper name
costToScore(); update the documentation to use the new helper name
metricToScore() and, while here, confirm the spec notes that isCostMetric() now
also matches latency (both are lower-is-better) and that the radar label is
"Speed" (higher = better on chart) to match compare_radar_chart.svelte's current
behavior.

In `@specs/projects/performance-tracking/phase_plans/phase_3.md`:
- Line 13: The spec text incorrectly implies formatLatency() was created in this
phase; update the Phase 3 description for run.svelte to say you used or reused
the existing formatLatency() helper (instead of "added formatLatency()") and
confirm that calculate_subtask_usage(), the new subtask_latency_ms state
variable, and the additions to get_usage_properties() are the actual new changes
in this phase.
- Line 17: The spec incorrectly references costToScore(); update the
documentation to use the actual function name metricToScore() and ensure
references to symbols in compare_radar_chart.svelte (e.g., LATENCY_KEY,
isLowerIsBetterMetric(), lowerIsBetterValues) remain accurate — replace any
mention of costToScore() with metricToScore() and confirm wording indicates
per-key normalization via metricToScore() for cost and latency.

In `@specs/projects/performance-tracking/project_overview.md`:
- Line 30: The doc incorrectly states latency is stored in seconds; update the
spec to say latency is stored in milliseconds (integer/float as applicable) and
adjust examples/units accordingly; reference the observed implementation using
the _ms fields (latency_ms, total_llm_latency_ms, mean_total_llm_latency_ms) and
the formatLatency(ms: number) helper which renders "Xms" for values < 1000 to
ensure the documentation aligns with the code representation and displayed
units.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 470f0af0-7876-452e-b65a-aecd47423ea4

📥 Commits

Reviewing files that changed from the base of the PR and between 7f69c4c and 1a6582a.

📒 Files selected for processing (22)
  • app/desktop/studio_server/eval_api.py
  • app/desktop/studio_server/test_eval_api.py
  • app/web_ui/src/lib/api_schema.d.ts
  • app/web_ui/src/lib/components/compare_chart.svelte
  • app/web_ui/src/lib/components/compare_radar_chart.svelte
  • app/web_ui/src/lib/utils/formatters.ts
  • app/web_ui/src/routes/(app)/run/run.svelte
  • app/web_ui/src/routes/(app)/specs/[project_id]/[task_id]/compare/+page.svelte
  • libs/core/kiln_ai/adapters/model_adapters/adapter_stream.py
  • libs/core/kiln_ai/adapters/model_adapters/litellm_adapter.py
  • libs/core/kiln_ai/adapters/model_adapters/test_litellm_adapter.py
  • libs/core/kiln_ai/datamodel/task_run.py
  • libs/core/kiln_ai/datamodel/test_example_models.py
  • libs/core/kiln_ai/utils/open_ai_types.py
  • libs/core/kiln_ai/utils/test_open_ai_types.py
  • specs/projects/performance-tracking/architecture.md
  • specs/projects/performance-tracking/functional_spec.md
  • specs/projects/performance-tracking/implementation_plan.md
  • specs/projects/performance-tracking/phase_plans/phase_1.md
  • specs/projects/performance-tracking/phase_plans/phase_2.md
  • specs/projects/performance-tracking/phase_plans/phase_3.md
  • specs/projects/performance-tracking/project_overview.md

Comment thread app/web_ui/src/lib/components/compare_radar_chart.svelte
Comment thread app/web_ui/src/lib/utils/formatters.ts
Comment thread libs/core/kiln_ai/adapters/model_adapters/adapter_stream.py Outdated
Comment thread specs/projects/performance-tracking/architecture.md Outdated
Comment thread specs/projects/performance-tracking/functional_spec.md Outdated
Comment thread specs/projects/performance-tracking/phase_plans/phase_3.md Outdated
Comment thread specs/projects/performance-tracking/phase_plans/phase_3.md Outdated
Comment thread specs/projects/performance-tracking/project_overview.md Outdated
Extract duplicated latency formatting logic into shared formatLatency()
in formatters.ts. Rename costToScore to metricToScore since it now
handles both cost and latency metrics.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
app/web_ui/src/lib/components/compare_radar_chart.svelte (1)

46-78: Rename cost/costs parameters now that the helper is generalized.

Since metricToScore now scores both cost and latency (and any future lower-is-better metric), the cost-specific parameter names are misleading. Consider value/values (or metric/metrics) for readability — purely cosmetic, no behavior change.

♻️ Proposed rename
-  export function metricToScore(
-    cost: number,
-    costs: number[],
+  export function metricToScore(
+    value: number,
+    values: number[],
     {
       padding = 10, // keep endpoints away from 0/100
       relFull = 0.7, // when (hi-lo)/|hi| reaches this, use full spread (k=1)
     }: {
       padding?: number
       relFull?: number
     } = {},
   ): number {
-    const lo = Math.min(...costs)
-    const hi = Math.max(...costs)
+    const lo = Math.min(...values)
+    const hi = Math.max(...values)

     const range = hi - lo
     if (range <= 0) return 50

     // 1) range-based normalized position
-    const t = (cost - lo) / range
+    const t = (value - lo) / range

-    // 2) raw padded linear score (lower cost = higher score)
+    // 2) raw padded linear score (lower value = higher score)
     const raw = padding + (1 - t) * (100 - 2 * padding)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/web_ui/src/lib/components/compare_radar_chart.svelte` around lines 46 -
78, Rename the cost-specific parameters in metricToScore to neutral names to
reflect general-purpose use: change the parameter names cost and costs to value
and values (or metric and metrics) in the metricToScore function signature and
update all internal references (lo, hi, t, raw, etc. that use those params)
accordingly; also update any call sites that pass arguments to metricToScore to
use the new parameter names if they reference them by name (or simply ensure
positional calls still work), and run the test/build to verify no unresolved
identifiers remain.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: aa3a98bb-f410-442e-8d7d-d4fc9c6b199f

📥 Commits

Reviewing files that changed from the base of the PR and between 1a6582a and 286c0a1.

📒 Files selected for processing (6)
  • app/web_ui/src/lib/components/compare_chart.svelte
  • app/web_ui/src/lib/components/compare_radar_chart.svelte
  • app/web_ui/src/lib/utils/formatters.ts
  • app/web_ui/src/routes/(app)/run/run.svelte
  • app/web_ui/src/routes/(app)/specs/[project_id]/[task_id]/compare/+page.svelte
  • libs/core/kiln_ai/adapters/model_adapters/adapter_stream.py
✅ Files skipped from review due to trivial changes (1)
  • app/web_ui/src/lib/utils/formatters.ts
🚧 Files skipped from review as they are similar to previous changes (3)
  • app/web_ui/src/lib/components/compare_chart.svelte
  • libs/core/kiln_ai/adapters/model_adapters/adapter_stream.py
  • app/web_ui/src/routes/(app)/run/run.svelte

Comment thread libs/core/kiln_ai/adapters/model_adapters/adapter_stream.py Outdated
Comment thread libs/core/kiln_ai/adapters/model_adapters/adapter_stream.py Outdated
Comment thread libs/core/kiln_ai/adapters/model_adapters/litellm_adapter.py Outdated
Comment thread libs/core/kiln_ai/adapters/model_adapters/litellm_adapter.py Outdated
Contributor

@coderabbitai coderabbitai Bot left a comment


♻️ Duplicate comments (1)
libs/core/kiln_ai/adapters/model_adapters/adapter_stream.py (1)

174-177: ⚠️ Potential issue | 🟠 Major

Latency timing window includes downstream consumer processing time

Lines 174–177 measure the entire async loop duration, which includes time spent after each yield while the downstream caller processes chunks. This overstates call_latency_ms if the intent is to measure only provider latency.

Suggested fix
             stream = StreamingCompletion(**completion_kwargs)
-            start = time.monotonic()
-            async for chunk in stream:
-                yield chunk
-            call_latency_ms = int((time.monotonic() - start) * 1000)
+            call_latency_seconds = 0.0
+            chunk_wait_start = time.monotonic()
+            async for chunk in stream:
+                # Time spent waiting for the next chunk from the provider
+                call_latency_seconds += time.monotonic() - chunk_wait_start
+                yield chunk
+                # Reset after resuming; excludes downstream consumer processing time
+                chunk_wait_start = time.monotonic()
+            # Include terminal wait for final __anext__() / StopAsyncIteration
+            call_latency_seconds += time.monotonic() - chunk_wait_start
+            call_latency_ms = int(call_latency_seconds * 1000)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@libs/core/kiln_ai/adapters/model_adapters/adapter_stream.py` around lines 174
- 177, The current timing (start / call_latency_ms) wraps the whole async for
and thus includes downstream consumer time; change it to measure only provider
latency by timing each await that fetches the next chunk from the async iterator
and accumulating those durations into call_latency_ms (i.e., for symbol stream
in adapter_stream.py, record a timestamp immediately before awaiting/advancing
the iterator, stop the timer immediately after the chunk is received, add that
delta to the latency accumulator, then yield the chunk), ensuring you don't
include time spent after yield in the measured window.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@libs/core/kiln_ai/adapters/model_adapters/adapter_stream.py`:
- Around line 174-177: The current timing (start / call_latency_ms) wraps the
whole async for and thus includes downstream consumer time; change it to measure
only provider latency by timing each await that fetches the next chunk from the
async iterator and accumulating those durations into call_latency_ms (i.e., for
symbol stream in adapter_stream.py, record a timestamp immediately before
awaiting/advancing the iterator, stop the timer immediately after the chunk is
received, add that delta to the latency accumulator, then yield the chunk),
ensuring you don't include time spent after yield in the measured window.
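The distinction the reviewer is drawing can be demonstrated in isolation. The sketch below — a hypothetical stand-in, not the real `adapter_stream.py` code — times the same stream two ways: wrapping the whole `async for` (the flagged pattern) versus accumulating only the per-chunk waits, with a slow consumer in between:

```python
import asyncio
import time


async def fake_provider_stream(n_chunks: int = 3, provider_delay: float = 0.05):
    # Simulates a provider that takes provider_delay to produce each chunk.
    for i in range(n_chunks):
        await asyncio.sleep(provider_delay)
        yield f"chunk-{i}"


async def measure(consumer_delay: float = 0.05) -> tuple[int, int]:
    whole_start = time.monotonic()  # flagged pattern: wraps the whole loop
    provider_seconds = 0.0  # suggested pattern: only provider waits
    wait_start = time.monotonic()
    async for _chunk in fake_provider_stream():
        provider_seconds += time.monotonic() - wait_start
        await asyncio.sleep(consumer_delay)  # downstream consumer work
        wait_start = time.monotonic()  # reset after the consumer resumes us
    # Terminal wait for the final __anext__ that ends the stream.
    provider_seconds += time.monotonic() - wait_start
    whole_ms = int((time.monotonic() - whole_start) * 1000)
    provider_ms = int(provider_seconds * 1000)
    return whole_ms, provider_ms


whole_ms, provider_ms = asyncio.run(measure())
# whole_ms includes the consumer sleeps; provider_ms does not.
print(whole_ms, provider_ms)
```

With three 50 ms provider waits and three 50 ms consumer sleeps, the whole-loop figure roughly doubles the per-chunk figure, which is exactly the inflation the review describes.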

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 6bd5b33f-0037-49a4-8f98-bd831c1a1d21

📥 Commits

Reviewing files that changed from the base of the PR and between 0c4ea5e and eb6bcf3.

📒 Files selected for processing (1)
  • libs/core/kiln_ai/adapters/model_adapters/adapter_stream.py

Contributor

@coderabbitai coderabbitai Bot left a comment

♻️ Duplicate comments (1)
libs/core/kiln_ai/adapters/model_adapters/adapter_stream.py (1)

174-177: ⚠️ Potential issue | 🟠 Major

Streaming latency currently includes consumer processing time.

At Line 174-Line 177, the timer wraps the entire async for loop, so elapsed time includes time while the generator is paused after yield chunk. That inflates LLM latency whenever consumers process chunks slowly.

Suggested fix
-            stream = StreamingCompletion(**completion_kwargs)
-            start = time.monotonic()
-            async for chunk in stream:
-                yield chunk
-            call_latency_ms = int((time.monotonic() - start) * 1000)
+            stream = StreamingCompletion(**completion_kwargs)
+            call_latency_seconds = 0.0
+            chunk_wait_start = time.monotonic()
+            async for chunk in stream:
+                call_latency_seconds += time.monotonic() - chunk_wait_start
+                yield chunk
+                # reset after consumer resumes us; excludes consumer processing time
+                chunk_wait_start = time.monotonic()
+            # include terminal wait for the final __anext__ that ends the stream
+            call_latency_seconds += time.monotonic() - chunk_wait_start
+            call_latency_ms = int(call_latency_seconds * 1000)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@libs/core/kiln_ai/adapters/model_adapters/adapter_stream.py` around lines 174
- 177, The timer currently wraps the entire async for (start = time.monotonic()
before the loop) so consumer processing after each yield inflates latency;
instead, start timing when awaiting the stream and compute call_latency_ms
immediately after receiving each chunk and before yielding it (i.e., set start
when you begin waiting, then inside async for compute call_latency_ms =
int((time.monotonic() - start) * 1000) right after getting chunk and before
yield chunk, and reset start before awaiting the next chunk if you need
per-chunk measurements) so consumer pause time is excluded; update the logic
around the async for over stream in adapter_stream.py accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@libs/core/kiln_ai/adapters/model_adapters/adapter_stream.py`:
- Around line 174-177: The timer currently wraps the entire async for (start =
time.monotonic() before the loop) so consumer processing after each yield
inflates latency; instead, start timing when awaiting the stream and compute
call_latency_ms immediately after receiving each chunk and before yielding it
(i.e., set start when you begin waiting, then inside async for compute
call_latency_ms = int((time.monotonic() - start) * 1000) right after getting
chunk and before yield chunk, and reset start before awaiting the next chunk if
you need per-chunk measurements) so consumer pause time is excluded; update the
logic around the async for over stream in adapter_stream.py accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 6f5d05c9-9ca0-4507-aeab-2730f8d593af

📥 Commits

Reviewing files that changed from the base of the PR and between eb6bcf3 and 6bd53b6.

📒 Files selected for processing (3)
  • libs/core/kiln_ai/adapters/model_adapters/adapter_stream.py
  • libs/core/kiln_ai/adapters/model_adapters/litellm_adapter.py
  • libs/core/kiln_ai/adapters/model_adapters/test_litellm_adapter.py

@chiang-daniel chiang-daniel merged commit 9a1c0ee into main Apr 30, 2026
14 checks passed
@chiang-daniel chiang-daniel deleted the KIL-542/perf branch April 30, 2026 10:37