Add latency to trace #1340
Add per-call latency timing (latency_ms) to trace messages and cumulative total_llm_latency_ms to the Usage model. Instrument both LiteLlmAdapter and AdapterStream to measure wall-clock time of each LLM API call using time.monotonic, accumulate across multi-turn tool call loops, and propagate latency into trace messages. Update Usage.__add__ to support the new field, regenerate the OpenAPI schema, and add comprehensive tests for field validation, addition semantics, backwards compatibility, and adapter-level latency tracking. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
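The measurement pattern this commit describes can be sketched as a minimal illustration (`fake_llm_call` and the loop shape are hypothetical stand-ins, not the actual `LiteLlmAdapter` code):

```python
import time

def fake_llm_call(messages):
    # Hypothetical stand-in for a provider API call.
    return {"content": "ok", "tool_calls": None}

def run_with_latency(messages, max_turns=3):
    """Accumulate wall-clock latency across a multi-turn tool-call loop."""
    total_llm_latency_ms = 0
    trace = []
    for _ in range(max_turns):
        # monotonic clock: immune to wall-clock adjustments (NTP, DST)
        start = time.monotonic()
        response = fake_llm_call(messages)
        call_latency_ms = int((time.monotonic() - start) * 1000)
        total_llm_latency_ms += call_latency_ms
        # per-message latency propagated into the trace
        trace.append({"content": response["content"], "latency_ms": call_latency_ms})
        if not response["tool_calls"]:
            break
    return trace, total_llm_latency_ms
```

The key design choice mirrored here is accumulating across turns: one task run may issue several LLM calls in a tool-call loop, and the cumulative figure is what lands in `Usage`.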
Add mean_total_llm_latency_ms to MeanUsage model and wire up accumulators in eval_api.py to compute average latency across eval runs, respecting the existing 50% data-availability threshold. Regenerate OpenAPI schema and add tests for both above- and below-threshold latency scenarios. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
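A sketch of the 50% data-availability rule, assuming exactly-50% counts as sufficient (the boundary semantics are not pinned down in this thread); the function name is hypothetical, not the `eval_api.py` implementation:

```python
def mean_latency(latencies):
    """Return the mean of available latencies, or None when fewer than
    50% of runs reported a value (same rule as the other usage means).
    Assumption: exactly 50% availability is treated as sufficient."""
    total_runs = len(latencies)
    with_data = [ms for ms in latencies if ms is not None]
    if total_runs == 0 or len(with_data) / total_runs < 0.5:
        return None
    return sum(with_data) / len(with_data)
```

Returning `None` rather than a partial mean keeps sparsely-instrumented runs from reporting a misleading average.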
Add latency tracking to all frontend display surfaces:
- Run page: "Total Latency" and "Subtasks Latency" in usage properties
- Compare table: "Latency" row in the usage section
- Radar chart: "Speed" axis with lower-is-better scoring
- Scatter plot: latency available as selectable axis

Rename compare section from "Average Usage & Cost" to "Average Usage, Cost & Latency". Refactor radar chart to support multiple lower-is-better metrics with independent normalization. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the review settings. Use the following commands to manage reviews:
Walkthrough

Adds end-to-end LLM latency tracking: adapters measure per-call/per-message latency and accumulate it into `Usage.total_llm_latency_ms`.

Changes
Sequence Diagram

```mermaid
sequenceDiagram
    participant UI as Client (UI)
    participant API as Eval API Server
    participant DB as Database
    participant Adapter as Model Adapter
    participant LLM as LLM Provider
    UI->>API: GET eval scores for run config
    API->>DB: Query eval run records (including usage)
    DB-->>API: Return eval run records
    loop per eval run with usage
        Note over Adapter,LLM: original runs recorded latencies
        Adapter-->>API: Stored usage includes total_llm_latency_ms and per-message latency map
    end
    API->>API: Aggregate usage across runs (sum, count samples_with_data)
    API->>API: If samples_with_data >= 50% → compute mean_total_llm_latency_ms else → null
    API-->>UI: Return MeanUsage including mean_total_llm_latency_ms
    UI->>UI: formatLatency(mean_total_llm_latency_ms) and render charts/tables
```
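The `formatLatency` step at the end of the diagram maps raw milliseconds to a display string. Its unit rule is sketched here in Python (the real helper is TypeScript in `formatters.ts`); rounding before the unit check is the behavior the review asks for, so a value like 999.6 renders as "1.0s" rather than "1000ms":

```python
def format_latency(ms: float) -> str:
    """Render latency as "Xms" under one second, else "X.Ys".
    Round BEFORE the unit comparison so values that round up to
    1000 ms switch to seconds instead of displaying "1000ms"."""
    rounded = round(ms)
    if rounded < 1000:
        return f"{rounded}ms"
    return f"{ms / 1000:.1f}s"
```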
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Code Review
This pull request introduces LLM latency tracking across the application, including data model updates, backend measurement in model adapters, and frontend visualizations in charts and run summaries. A significant issue was identified in the streaming adapter where latency measurements would be inflated by the consumer's processing time; a suggestion was provided to measure only the time spent waiting for chunks. Additionally, it is recommended to rename the costToScore function to reflect its broader use for all lower-is-better metrics and to centralize the duplicated latency formatting logic into a shared utility.
📊 Coverage Report

Overall Coverage: 45%
Diff: origin/main...HEAD
Summary
Fix streaming path to only measure time waiting on LLM chunks, not time the consumer spends processing yielded chunks. Also fix misleading test that claimed to test multi-call accumulation but only exercised a single LLM call. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1d2a084 to f6391f0 (compare)
Function now handles both cost and latency metrics, so the name should reflect its broader purpose. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@coderabbitai review
✅ Actions performed

Review triggered.
Actionable comments posted: 9
🧹 Nitpick comments (3)
app/desktop/studio_server/test_eval_api.py (1)
2202-2359: Add an exact-50% threshold test to lock boundary semantics. You currently test `>50%` and `<50%`, but not exactly `50%`, which is where regressions usually slip in.

🧪 Suggested boundary test
```diff
+@pytest.mark.asyncio
+async def test_get_run_config_eval_scores_latency_at_threshold(
+    client, mock_task_from_id, mock_task, mock_eval, mock_eval_config, mock_run_config
+):
+    """At exactly 50% availability, latency inclusion should follow intended threshold semantics."""
+    mock_task_from_id.return_value = mock_task
+
+    # 2 runs total, exactly 1 has latency => 50%
+    task_run_1 = TaskRun(
+        input="test input 1",
+        input_source=DataSource(
+            type=DataSourceType.synthetic,
+            properties={
+                "model_name": "gpt-4",
+                "model_provider": "openai",
+                "adapter_name": "langchain_adapter",
+            },
+        ),
+        output=TaskOutput(output="test output 1"),
+        usage=Usage(input_tokens=100, output_tokens=50, total_tokens=150, cost=0.005, total_llm_latency_ms=500),
+        parent=mock_task,
+    )
+    task_run_1.save_to_file()
+
+    task_run_2 = TaskRun(
+        input="test input 2",
+        input_source=DataSource(
+            type=DataSourceType.synthetic,
+            properties={
+                "model_name": "gpt-4",
+                "model_provider": "openai",
+                "adapter_name": "langchain_adapter",
+            },
+        ),
+        output=TaskOutput(output="test output 2"),
+        usage=Usage(input_tokens=200, output_tokens=100, total_tokens=300, cost=0.010),
+        parent=mock_task,
+    )
+    task_run_2.save_to_file()
+
+    eval_run_1 = EvalRun(task_run_config_id=mock_run_config.id, scores={"score1": 4.0, "overall_rating": 4.0}, input="i1", output="o1", dataset_id=task_run_1.id, task_run_usage=task_run_1.usage, parent=mock_eval_config)
+    eval_run_1.save_to_file()
+    eval_run_2 = EvalRun(task_run_config_id=mock_run_config.id, scores={"score1": 4.5, "overall_rating": 4.5}, input="i2", output="o2", dataset_id=task_run_2.id, task_run_usage=task_run_2.usage, parent=mock_eval_config)
+    eval_run_2.save_to_file()
+
+    # ...same mock/patch pattern as neighboring tests...
+    # Assert explicit expected behavior at exactly 50%:
+    # assert mean_usage["mean_total_llm_latency_ms"] is <expected per threshold contract>
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@app/desktop/studio_server/test_eval_api.py` around lines 2202 - 2359, Add a boundary test to cover the exact 50% case: create a new test (e.g. test_get_run_config_eval_scores_latency_at_50_percent) that mirrors test_get_run_config_eval_scores_latency_below_threshold but creates 4 TaskRuns/EvalRuns with exactly 2 of the TaskRun.usage entries having total_llm_latency_ms set (2/4 = 50%), patch task_from_id/eval_from_id/task_run_config_from_id and dataset_ids_in_filter the same way, call the same endpoint, and assert the behaviour of mean_usage["mean_total_llm_latency_ms"] for the exact-50% case (set the expected assertion to match the current implementation's semantics—i.e., assert not None if the code treats >=50% as sufficient, or assert None if it requires >50%).

app/web_ui/src/lib/components/compare_radar_chart.svelte (2)
165-189: Consolidate the `COST_KEY`/`LATENCY_KEY` tooltip branches. The two branches are structurally identical aside from the label/format of the raw value. A small per-metric formatter map keeps future lower-is-better additions one-line changes and prevents subtle drift between the two branches.
♻️ Illustrative refactor
```diff
-      } else if (key === COST_KEY) {
-        const displayValue = metricToScore(
-          rawValue,
-          lowerIsBetterValues[key] || [],
-        )
-        html += `<div>${label}: ${displayValue.toFixed(1)} <span style="color: #888;">(Mean Cost: $${rawValue.toFixed(6)})</span></div>`
-      } else if (key === LATENCY_KEY) {
-        const displayValue = metricToScore(
-          rawValue,
-          lowerIsBetterValues[key] || [],
-        )
-        const formatted =
-          rawValue < 1000
-            ? `${Math.round(rawValue)}ms`
-            : `${(rawValue / 1000).toFixed(1)}s`
-        html += `<div>${label}: ${displayValue.toFixed(1)} <span style="color: #888;">(Mean Latency: ${formatted})</span></div>`
+      } else if (isLowerIsBetterMetric(key)) {
+        const displayValue = metricToScore(
+          rawValue,
+          lowerIsBetterValues[key] || [],
+        )
+        const rawLabel = key === COST_KEY ? "Mean Cost" : "Mean Latency"
+        const rawFormatted =
+          key === COST_KEY ? `$${rawValue.toFixed(6)}` : formatLatency(rawValue)
+        html += `<div>${label}: ${displayValue.toFixed(1)} <span style="color: #888;">(${rawLabel}: ${rawFormatted})</span></div>`
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@app/web_ui/src/lib/components/compare_radar_chart.svelte` around lines 165 - 189, The two branches handling COST_KEY and LATENCY_KEY inside dataKeys.forEach are duplicated; extract a per-metric raw-value formatter map and use a single branch that calls metricToScore and then formats the raw value via the map before composing html. Specifically, update the logic in dataKeys.forEach to call getKeyLabel and getModelValueRaw as before, compute displayValue via metricToScore using lowerIsBetterValues[key], and replace the COST_KEY/LATENCY_KEY conditionals with a formatter lookup (e.g., formatters[COST_KEY], formatters[LATENCY_KEY]) that returns the formatted raw string (like currency with 6 decimals or ms/s formatting) and fall back to rawValue.toFixed(3); then build the html with the unified template that inserts label, displayValue.toFixed(1), and the formatted raw string.
46-78: Rename parameters to match the generalized helper. After renaming `costToScore` → `metricToScore`, the parameters are still `cost`/`costs`, which reads misleadingly when the function is invoked for latency. Tightens readability without behavior change.

♻️ Proposed refactor
```diff
-  export function metricToScore(
-    cost: number,
-    costs: number[],
+  export function metricToScore(
+    value: number,
+    values: number[],
     {
       padding = 10, // keep endpoints away from 0/100
       relFull = 0.7, // when (hi-lo)/|hi| reaches this, use full spread (k=1)
     }: {
       padding?: number
       relFull?: number
     } = {},
   ): number {
-    const lo = Math.min(...costs)
-    const hi = Math.max(...costs)
+    const lo = Math.min(...values)
+    const hi = Math.max(...values)
     const range = hi - lo
     if (range <= 0) return 50

     // 1) range-based normalized position
-    const t = (cost - lo) / range
+    const t = (value - lo) / range

-    // 2) raw padded linear score (lower cost = higher score)
+    // 2) raw padded linear score (lower value = higher score)
     const raw = padding + (1 - t) * (100 - 2 * padding)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@app/web_ui/src/lib/components/compare_radar_chart.svelte` around lines 46 - 78, The parameter names cost and costs in the exported function metricToScore are misleading after renaming; rename them to generic names (e.g., value and values or metric and metrics) across the function signature and all internal references inside metricToScore to improve readability without changing behavior, keeping the same default behavior and return logic in metricToScore.
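The normalization in the diff above reduces to a short function. A Python sketch of the same math, omitting the `relFull` spread-compression step for brevity (so this is the raw padded score only, not the full implementation):

```python
def metric_to_score(value: float, values: list[float], padding: float = 10) -> float:
    """Map a lower-is-better metric onto a 0-100 radar score.
    The lowest value scores near 100, the highest near 0, with
    `padding` keeping endpoints away from the extremes."""
    lo, hi = min(values), max(values)
    rng = hi - lo
    if rng <= 0:
        return 50.0  # all models tied: neutral midpoint
    t = (value - lo) / rng  # 0 = best (lowest), 1 = worst (highest)
    return padding + (1 - t) * (100 - 2 * padding)
```

Because each metric is normalized against only its own set of values, cost and latency get independent scales on the radar, which is the "independent normalization" the commit message describes.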
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@app/web_ui/src/lib/components/compare_radar_chart.svelte`:
- Around line 181-185: Replace the inline latency formatting in
compare_radar_chart.svelte with the shared helper: import and call
formatLatency(ms: number) from app/web_ui/src/lib/utils/formatters.ts instead of
computing `formatted` manually (the block that sets `formatted` and uses it in
the HTML string); update the code that builds `html` to use
formatLatency(rawValue) where the previous `${formatted}` was inserted so all
latency display follows the single source-of-truth.
In `@app/web_ui/src/lib/utils/formatters.ts`:
- Around line 330-333: The formatLatency function can output "1000ms" for values
like 999.6 due to rounding before unit selection; change the logic so unit
selection uses the rounded ms value: compute roundedMs = Math.round(ms) and if
roundedMs < 1000 return `${roundedMs}ms`, otherwise return seconds using the
original ms divided by 1000 (e.g., `${(ms/1000).toFixed(1)}s`) so values that
round to 1000ms become "1.0s" instead of "1000ms".
In `@app/web_ui/src/routes/(app)/specs/[project_id]/[task_id]/compare/+page.svelte`:
- Around line 496-501: The latency formatting is leaking into the diff
calculation: when category === "cost" and scoreKey like
"mean_total_llm_latency_ms" you should keep using formatLatency(value) only for
display, and use getModelValueRaw() (the raw millisecond value) when computing
the percent-difference/badge so comparisons remain numeric; update the
percent-difference calculation to call getModelValueRaw() for both sides when
scoreKey corresponds to latency (e.g., "mean_total_llm_latency_ms") and leave
formatLatency() exclusively in the cell rendering path.
In `@libs/core/kiln_ai/adapters/model_adapters/adapter_stream.py`:
- Around line 171-180: The stream latency currently sums time only for
iterations that yield a chunk and misses the final await after the last chunk;
change the loop in adapter_stream.py to measure each await of the stream instead
of only time inside the async for body: replace the "async for chunk in stream"
with an explicit loop that records time before awaiting stream.__anext__(), adds
the elapsed time to call_latency_seconds after the await (so the final tail wait
is included), yields the chunk, and on StopAsyncIteration records the final
await duration before breaking; keep existing variables call_latency_seconds and
chunk_wait_start semantics so consumer processing time remains excluded.
In `@specs/projects/performance-tracking/architecture.md`:
- Line 222: The documentation references a renamed helper; replace uses of
metricToScore() with the canonical costToScore() so the docs match
functional_spec.md and the implementation. Update the "Speed" radar axis note
(and any other occurrences in this section) to call costToScore() and ensure the
wording still notes "lower is better". Verify the symbol name is exactly
costToScore() wherever metricToScore() appears.
In `@specs/projects/performance-tracking/functional_spec.md`:
- Around line 154-156: The spec still references the old helper name
costToScore(); update the documentation to use the new helper name
metricToScore() and, while here, confirm the spec notes that isCostMetric() now
also matches latency (both are lower-is-better) and that the radar label is
"Speed" (higher = better on chart) to match compare_radar_chart.svelte's current
behavior.
In `@specs/projects/performance-tracking/phase_plans/phase_3.md`:
- Line 13: The spec text incorrectly implies formatLatency() was created in this
phase; update the Phase 3 description for run.svelte to say you used or reused
the existing formatLatency() helper (instead of "added formatLatency()") and
confirm that calculate_subtask_usage(), the new subtask_latency_ms state
variable, and the additions to get_usage_properties() are the actual new changes
in this phase.
- Line 17: The spec incorrectly references costToScore(); update the
documentation to use the actual function name metricToScore() and ensure
references to symbols in compare_radar_chart.svelte (e.g., LATENCY_KEY,
isLowerIsBetterMetric(), lowerIsBetterValues) remain accurate — replace any
mention of costToScore() with metricToScore() and confirm wording indicates
per-key normalization via metricToScore() for cost and latency.
In `@specs/projects/performance-tracking/project_overview.md`:
- Line 30: The doc incorrectly states latency is stored in seconds; update the
spec to say latency is stored in milliseconds (integer/float as applicable) and
adjust examples/units accordingly; reference the observed implementation using
the _ms fields (latency_ms, total_llm_latency_ms, mean_total_llm_latency_ms) and
the formatLatency(ms: number) helper which renders "Xms" for values < 1000 to
ensure the documentation aligns with the code representation and displayed
units.
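The display-vs-computation separation the compare-page comment above asks for can be illustrated simply (hypothetical function name; the real code is Svelte/TypeScript):

```python
def percent_diff(baseline_ms: float, other_ms: float) -> float:
    """Compute the comparison badge from raw milliseconds.
    Formatting (e.g. "1.2s") is display-only; feeding formatted
    strings into arithmetic would silently break comparisons."""
    if baseline_ms == 0:
        return 0.0
    return (other_ms - baseline_ms) / baseline_ms * 100
```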
---
Nitpick comments:
In `@app/desktop/studio_server/test_eval_api.py`:
- Around line 2202-2359: Add a boundary test to cover the exact 50% case: create
a new test (e.g. test_get_run_config_eval_scores_latency_at_50_percent) that
mirrors test_get_run_config_eval_scores_latency_below_threshold but creates 4
TaskRuns/EvalRuns with exactly 2 of the TaskRun.usage entries having
total_llm_latency_ms set (2/4 = 50%), patch
task_from_id/eval_from_id/task_run_config_from_id and dataset_ids_in_filter the
same way, call the same endpoint, and assert the behaviour of
mean_usage["mean_total_llm_latency_ms"] for the exact-50% case (set the expected
assertion to match the current implementation's semantics—i.e., assert not None
if the code treats >=50% as sufficient, or assert None if it requires >50%).
In `@app/web_ui/src/lib/components/compare_radar_chart.svelte`:
- Around line 165-189: The two branches handling COST_KEY and LATENCY_KEY inside
dataKeys.forEach are duplicated; extract a per-metric raw-value formatter map
and use a single branch that calls metricToScore and then formats the raw value
via the map before composing html. Specifically, update the logic in
dataKeys.forEach to call getKeyLabel and getModelValueRaw as before, compute
displayValue via metricToScore using lowerIsBetterValues[key], and replace the
COST_KEY/LATENCY_KEY conditionals with a formatter lookup (e.g.,
formatters[COST_KEY], formatters[LATENCY_KEY]) that returns the formatted raw
string (like currency with 6 decimals or ms/s formatting) and fall back to
rawValue.toFixed(3); then build the html with the unified template that inserts
label, displayValue.toFixed(1), and the formatted raw string.
- Around line 46-78: The parameter names cost and costs in the exported function
metricToScore are misleading after renaming; rename them to generic names (e.g.,
value and values or metric and metrics) across the function signature and all
internal references inside metricToScore to improve readability without changing
behavior, keeping the same default behavior and return logic in metricToScore.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 470f0af0-7876-452e-b65a-aecd47423ea4
📒 Files selected for processing (22)
- app/desktop/studio_server/eval_api.py
- app/desktop/studio_server/test_eval_api.py
- app/web_ui/src/lib/api_schema.d.ts
- app/web_ui/src/lib/components/compare_chart.svelte
- app/web_ui/src/lib/components/compare_radar_chart.svelte
- app/web_ui/src/lib/utils/formatters.ts
- app/web_ui/src/routes/(app)/run/run.svelte
- app/web_ui/src/routes/(app)/specs/[project_id]/[task_id]/compare/+page.svelte
- libs/core/kiln_ai/adapters/model_adapters/adapter_stream.py
- libs/core/kiln_ai/adapters/model_adapters/litellm_adapter.py
- libs/core/kiln_ai/adapters/model_adapters/test_litellm_adapter.py
- libs/core/kiln_ai/datamodel/task_run.py
- libs/core/kiln_ai/datamodel/test_example_models.py
- libs/core/kiln_ai/utils/open_ai_types.py
- libs/core/kiln_ai/utils/test_open_ai_types.py
- specs/projects/performance-tracking/architecture.md
- specs/projects/performance-tracking/functional_spec.md
- specs/projects/performance-tracking/implementation_plan.md
- specs/projects/performance-tracking/phase_plans/phase_1.md
- specs/projects/performance-tracking/phase_plans/phase_2.md
- specs/projects/performance-tracking/phase_plans/phase_3.md
- specs/projects/performance-tracking/project_overview.md
Extract duplicated latency formatting logic into shared formatLatency() in formatters.ts. Rename costToScore to metricToScore since it now handles both cost and latency metrics. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1a6582a to 286c0a1 (compare)
🧹 Nitpick comments (1)
app/web_ui/src/lib/components/compare_radar_chart.svelte (1)
46-78: Rename `cost`/`costs` parameters now that the helper is generalized. Since `metricToScore` now scores both cost and latency (and any future lower-is-better metric), the cost-specific parameter names are misleading. Consider `value`/`values` (or `metric`/`metrics`) for readability — purely cosmetic, no behavior change.

♻️ Proposed rename
```diff
-  export function metricToScore(
-    cost: number,
-    costs: number[],
+  export function metricToScore(
+    value: number,
+    values: number[],
     {
       padding = 10, // keep endpoints away from 0/100
       relFull = 0.7, // when (hi-lo)/|hi| reaches this, use full spread (k=1)
     }: {
       padding?: number
       relFull?: number
     } = {},
   ): number {
-    const lo = Math.min(...costs)
-    const hi = Math.max(...costs)
+    const lo = Math.min(...values)
+    const hi = Math.max(...values)
     const range = hi - lo
     if (range <= 0) return 50

     // 1) range-based normalized position
-    const t = (cost - lo) / range
+    const t = (value - lo) / range

-    // 2) raw padded linear score (lower cost = higher score)
+    // 2) raw padded linear score (lower value = higher score)
     const raw = padding + (1 - t) * (100 - 2 * padding)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@app/web_ui/src/lib/components/compare_radar_chart.svelte` around lines 46 - 78, Rename the cost-specific parameters in metricToScore to neutral names to reflect general-purpose use: change the parameter names cost and costs to value and values (or metric and metrics) in the metricToScore function signature and update all internal references (lo, hi, t, raw, etc. that use those params) accordingly; also update any call sites that pass arguments to metricToScore to use the new parameter names if they reference them by name (or simply ensure positional calls still work), and run the test/build to verify no unresolved identifiers remain.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@app/web_ui/src/lib/components/compare_radar_chart.svelte`:
- Around line 46-78: Rename the cost-specific parameters in metricToScore to
neutral names to reflect general-purpose use: change the parameter names cost
and costs to value and values (or metric and metrics) in the metricToScore
function signature and update all internal references (lo, hi, t, raw, etc. that
use those params) accordingly; also update any call sites that pass arguments to
metricToScore to use the new parameter names if they reference them by name (or
simply ensure positional calls still work), and run the test/build to verify no
unresolved identifiers remain.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: aa3a98bb-f410-442e-8d7d-d4fc9c6b199f
📒 Files selected for processing (6)
- app/web_ui/src/lib/components/compare_chart.svelte
- app/web_ui/src/lib/components/compare_radar_chart.svelte
- app/web_ui/src/lib/utils/formatters.ts
- app/web_ui/src/routes/(app)/run/run.svelte
- app/web_ui/src/routes/(app)/specs/[project_id]/[task_id]/compare/+page.svelte
- libs/core/kiln_ai/adapters/model_adapters/adapter_stream.py
✅ Files skipped from review due to trivial changes (1)
- app/web_ui/src/lib/utils/formatters.ts
🚧 Files skipped from review as they are similar to previous changes (3)
- app/web_ui/src/lib/components/compare_chart.svelte
- libs/core/kiln_ai/adapters/model_adapters/adapter_stream.py
- app/web_ui/src/routes/(app)/run/run.svelte
♻️ Duplicate comments (1)
libs/core/kiln_ai/adapters/model_adapters/adapter_stream.py (1)
174-177: ⚠️ Potential issue | 🟠 Major: Latency timing window includes downstream consumer processing time
Lines 174–177 measure the entire async loop duration, which includes time spent after each `yield` while the downstream caller processes chunks. This overstates `call_latency_ms` if the intent is to measure only provider latency.

Suggested fix
```diff
     stream = StreamingCompletion(**completion_kwargs)
-    start = time.monotonic()
-    async for chunk in stream:
-        yield chunk
-    call_latency_ms = int((time.monotonic() - start) * 1000)
+    call_latency_seconds = 0.0
+    chunk_wait_start = time.monotonic()
+    async for chunk in stream:
+        # Time spent waiting for the next chunk from the provider
+        call_latency_seconds += time.monotonic() - chunk_wait_start
+        yield chunk
+        # Reset after resuming; excludes downstream consumer processing time
+        chunk_wait_start = time.monotonic()
+    # Include terminal wait for final __anext__() / StopAsyncIteration
+    call_latency_seconds += time.monotonic() - chunk_wait_start
+    call_latency_ms = int(call_latency_seconds * 1000)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@libs/core/kiln_ai/adapters/model_adapters/adapter_stream.py` around lines 174 - 177, The current timing (start / call_latency_ms) wraps the whole async for and thus includes downstream consumer time; change it to measure only provider latency by timing each await that fetches the next chunk from the async iterator and accumulating those durations into call_latency_ms (i.e., for symbol stream in adapter_stream.py, record a timestamp immediately before awaiting/advancing the iterator, stop the timer immediately after the chunk is received, add that delta to the latency accumulator, then yield the chunk), ensuring you don't include time spent after yield in the measured window.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@libs/core/kiln_ai/adapters/model_adapters/adapter_stream.py`:
- Around line 174-177: The current timing (start / call_latency_ms) wraps the
whole async for and thus includes downstream consumer time; change it to measure
only provider latency by timing each await that fetches the next chunk from the
async iterator and accumulating those durations into call_latency_ms (i.e., for
symbol stream in adapter_stream.py, record a timestamp immediately before
awaiting/advancing the iterator, stop the timer immediately after the chunk is
received, add that delta to the latency accumulator, then yield the chunk),
ensuring you don't include time spent after yield in the measured window.
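The fix both review rounds converge on, timing only the awaits that fetch chunks, can be demonstrated with a toy wrapper (illustrative only; `slow_stream` and `TimedStream` are hypothetical stand-ins, not the `adapter_stream.py` code):

```python
import asyncio
import time

async def slow_stream():
    # Stand-in for a provider stream: each chunk takes ~20 ms to arrive.
    for i in range(3):
        await asyncio.sleep(0.02)
        yield i

class TimedStream:
    """Wrap an async iterator, accumulating only time spent awaiting chunks."""

    def __init__(self, stream):
        self._stream = stream.__aiter__()
        self.wait_seconds = 0.0

    def __aiter__(self):
        return self

    async def __anext__(self):
        start = time.monotonic()
        try:
            chunk = await self._stream.__anext__()
        finally:
            # count the await even for the terminal StopAsyncIteration
            self.wait_seconds += time.monotonic() - start
        return chunk

async def main():
    timed = TimedStream(slow_stream())
    async for _ in timed:
        await asyncio.sleep(0.05)  # slow consumer; must not inflate the timer
    return int(timed.wait_seconds * 1000)

wait_ms = asyncio.run(main())
```

Wrapping each `__anext__` means the consumer's ~150 ms of processing between chunks is excluded, while the roughly 60 ms the provider takes to produce three chunks is still captured, which is exactly the behavior the review asks for.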
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 6bd5b33f-0037-49a4-8f98-bd831c1a1d21
📒 Files selected for processing (1)
libs/core/kiln_ai/adapters/model_adapters/adapter_stream.py
♻️ Duplicate comments (1)
libs/core/kiln_ai/adapters/model_adapters/adapter_stream.py (1)
174-177: ⚠️ Potential issue | 🟠 Major: Streaming latency currently includes consumer processing time.
At lines 174–177, the timer wraps the entire `async for` loop, so elapsed time includes time while the generator is paused after `yield chunk`. That inflates LLM latency whenever consumers process chunks slowly.

Suggested fix
```diff
-    stream = StreamingCompletion(**completion_kwargs)
-    start = time.monotonic()
-    async for chunk in stream:
-        yield chunk
-    call_latency_ms = int((time.monotonic() - start) * 1000)
+    stream = StreamingCompletion(**completion_kwargs)
+    call_latency_seconds = 0.0
+    chunk_wait_start = time.monotonic()
+    async for chunk in stream:
+        call_latency_seconds += time.monotonic() - chunk_wait_start
+        yield chunk
+        # reset after consumer resumes us; excludes consumer processing time
+        chunk_wait_start = time.monotonic()
+    # include terminal wait for the final __anext__ that ends the stream
+    call_latency_seconds += time.monotonic() - chunk_wait_start
+    call_latency_ms = int(call_latency_seconds * 1000)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@libs/core/kiln_ai/adapters/model_adapters/adapter_stream.py` around lines 174 - 177, The timer currently wraps the entire async for (start = time.monotonic() before the loop) so consumer processing after each yield inflates latency; instead, start timing when awaiting the stream and compute call_latency_ms immediately after receiving each chunk and before yielding it (i.e., set start when you begin waiting, then inside async for compute call_latency_ms = int((time.monotonic() - start) * 1000) right after getting chunk and before yield chunk, and reset start before awaiting the next chunk if you need per-chunk measurements) so consumer pause time is excluded; update the logic around the async for over stream in adapter_stream.py accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@libs/core/kiln_ai/adapters/model_adapters/adapter_stream.py`:
- Around line 174-177: The timer currently wraps the entire async for (start =
time.monotonic() before the loop) so consumer processing after each yield
inflates latency; instead, start timing when awaiting the stream and compute
call_latency_ms immediately after receiving each chunk and before yielding it
(i.e., set start when you begin waiting, then inside async for compute
call_latency_ms = int((time.monotonic() - start) * 1000) right after getting
chunk and before yield chunk, and reset start before awaiting the next chunk if
you need per-chunk measurements) so consumer pause time is excluded; update the
logic around the async for over stream in adapter_stream.py accordingly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 6f5d05c9-9ca0-4507-aeab-2730f8d593af
📒 Files selected for processing (3)
- libs/core/kiln_ai/adapters/model_adapters/adapter_stream.py
- libs/core/kiln_ai/adapters/model_adapters/litellm_adapter.py
- libs/core/kiln_ai/adapters/model_adapters/test_litellm_adapter.py
What does this PR do?
Add per-LLM-call latency tracking to Trace.
- `total_llm_latency_ms` added to the usage model to track overall LLM latency alongside cost and tokens.
- `mean_total_llm_latency_ms` added to the eval API for comparison.

Checklists
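The `Usage.__add__` change referenced above can be sketched as follows; this is a simplified hypothetical stand-in for the Pydantic model (field set trimmed), showing the None-propagating addition an optional field needs so missing data is not silently treated as zero:

```python
from dataclasses import dataclass
from typing import Optional

def _add_optional(a: Optional[float], b: Optional[float]) -> Optional[float]:
    # Treat "no data" as None rather than 0 so downstream means stay honest.
    if a is None and b is None:
        return None
    return (a or 0) + (b or 0)

@dataclass
class Usage:
    input_tokens: Optional[int] = None
    output_tokens: Optional[int] = None
    cost: Optional[float] = None
    total_llm_latency_ms: Optional[float] = None

    def __add__(self, other: "Usage") -> "Usage":
        return Usage(
            input_tokens=_add_optional(self.input_tokens, other.input_tokens),
            output_tokens=_add_optional(self.output_tokens, other.output_tokens),
            cost=_add_optional(self.cost, other.cost),
            total_llm_latency_ms=_add_optional(
                self.total_llm_latency_ms, other.total_llm_latency_ms
            ),
        )
```

Keeping the field optional preserves backwards compatibility: usage records written before this PR simply sum as `None` rather than a spurious zero.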