Turn and Conversation Metrics Plan #722

Here is an AI-generated proposal for new turn and conversation metrics to add to GuideLLM:

# Conversation-Level and Turn-by-Turn Metrics

## Current State

Today, all metrics in `GenerativeMetrics.compile()` treat requests as a flat bag -- no grouping by `conversation_id` or awareness of `turn_index`. The raw data for grouping already exists (`RequestInfo.conversation_id`, `RequestInfo.turn_index`, `RequestTimings.*`), but nothing aggregates it.
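As a rough illustration of the missing grouping step, the sketch below buckets requests by `conversation_id` and orders each bucket by `turn_index`. `RequestRecord` and `group_by_conversation` are hypothetical names used only for this example; they are not GuideLLM classes, and the real data lives on `RequestInfo` and `RequestTimings`.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestRecord:
    """Hypothetical flattened view of one request (not a GuideLLM class)."""

    conversation_id: str
    turn_index: int
    request_start: float  # seconds
    request_end: float  # seconds


def group_by_conversation(records):
    """Return {conversation_id: [records sorted by turn_index]}."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record.conversation_id].append(record)
    for turns in grouped.values():
        turns.sort(key=lambda r: r.turn_index)
    return dict(grouped)


# Made-up sample data, deliberately out of order.
records = [
    RequestRecord("a", 1, 2.0, 3.5),
    RequestRecord("b", 0, 0.5, 1.0),
    RequestRecord("a", 0, 0.0, 1.5),
]
conversations = group_by_conversation(records)
```

Every metric proposed below could be computed from a structure like `conversations`, either by slicing across conversations at a given turn position or by aggregating within each conversation.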

---

## Turn-by-Turn Metrics

These are per-request metrics that become meaningful when **sliced by turn position**. The key insight: turn position affects performance (history grows, KV cache state changes, scheduling dependencies emerge).

### Timing metrics sliced by turn position

- **TTFT for first turns only** (`turn_index == 0`): Cold-start latency with no conversation history. This is the "first impression" latency.
- **TTFT for last turns only** (`turn_index` equal to the maximum for that conversation): Latency with maximum history context. Shows how high TTFT gets once the context has fully grown.
- **TTFT by turn index** (distribution per index 0, 1, 2, ...): Shows how TTFT degrades (or stays flat with good prefix caching) as conversation progresses.
- **Request latency by turn position**: Same slicing as TTFT -- first, last, per-index. Captures total turn time, not just time-to-first-token.
- **ITL (inter-token latency) by turn position**: Does decoding speed degrade as history grows?
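The slicing described above could look roughly like the following sketch. It assumes each sample is a `(conversation_id, turn_index, ttft_ms)` tuple; that layout, the `slice_ttft` helper, and the sample values are all made up for illustration. The same slicing would apply to request latency or ITL.

```python
from collections import defaultdict
from statistics import median

# (conversation_id, turn_index, ttft_ms); values are invented for the example.
samples = [
    ("a", 0, 100.0), ("a", 1, 140.0), ("a", 2, 190.0),
    ("b", 0, 120.0), ("b", 1, 160.0),
]


def slice_ttft(samples):
    """Split TTFT samples into per-index, first-turn, and last-turn groups."""
    by_index = defaultdict(list)
    last_index = {}
    for conv, idx, ttft in samples:
        by_index[idx].append(ttft)
        # Track the highest turn index seen for each conversation.
        last_index[conv] = max(idx, last_index.get(conv, idx))
    first_turns = by_index.get(0, [])
    last_turns = [t for conv, idx, t in samples if idx == last_index[conv]]
    return dict(by_index), first_turns, last_turns


by_index, first_turns, last_turns = slice_ttft(samples)
median_by_index = {idx: median(vals) for idx, vals in sorted(by_index.items())}
```

Note that a single-turn conversation contributes the same sample to both the first-turn and last-turn groups, which is one reason the single-turn case needs an explicit decision.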

### Throughput metrics sliced by turn position

- **Output tokens/sec by turn position**: Does generation throughput drop with longer contexts?
- **Prompt tokens by turn position**: Quantifies context growth. First turn has minimal prompt; last turn has full history. Useful for understanding the "cost of history."

### Turn scheduling metrics

- **Turn queue/wait time by position**: Later turns in a conversation depend on earlier turns completing. How long do they wait?
- **Turn gap time**: Time between one turn's `request_end` and the next turn's `request_start` within the same conversation. Captures scheduling overhead for dependent turns.
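Turn gap time, as defined above, can be sketched as the difference between consecutive turns' timestamps within one conversation. The tuple layout `(turn_index, request_start, request_end)` and the `turn_gaps` helper are assumptions for this example only.

```python
def turn_gaps(turns):
    """Gap between each turn's request_end and the next turn's request_start.

    turns: list of (turn_index, request_start, request_end) for ONE conversation.
    """
    ordered = sorted(turns)  # tuples sort by turn_index first
    return [nxt[1] - prev[2] for prev, nxt in zip(ordered, ordered[1:])]


# Made-up timings in seconds.
conversation = [(0, 0.0, 1.5), (1, 2.0, 3.5), (2, 3.75, 5.0)]
gaps = turn_gaps(conversation)
```

A conversation with N turns yields N - 1 gaps, so single-turn conversations contribute nothing to this distribution.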

---

## Conversation-Level Metrics

These aggregate across all turns of a single conversation, then produce distributions across all conversations.

### Time metrics

- **End-to-end conversation time**: Wall-clock from first turn's `request_start` (or `targeted_start`) to last turn's `request_end` (or `resolve_end`). The total time a "user" would experience.
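- **Total server processing time**: Sum of `request_latency` across all turns in the conversation. Approximates time spent waiting on the server; because `request_latency` is measured per request, it also includes any server-side queueing within a turn, so it is an upper bound on pure compute time.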
- **Total idle/gap time**: `e2e_time - total_server_time`. Time spent not generating -- scheduling overhead, queue waits between turns, etc.
- **Percent time on server**: `total_server_time / e2e_time`. High values mean the server is the bottleneck; low values mean scheduling/queueing overhead dominates.
- **Percent time idle**: `1 - percent_on_server`. Useful for identifying whether multi-turn overhead is significant.
- **Average inter-turn gap**: Mean of gap times between consecutive turns. Captures the "think time" imposed by the system (not the user).
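The time metrics above are simple arithmetic once a conversation's turns are ordered. The sketch below assumes each turn is a `(request_start, request_end)` pair in seconds; the function name and the dictionary keys are illustrative, not proposed field names.

```python
def conversation_time_metrics(turns):
    """turns: list of (request_start, request_end) pairs, ordered by turn."""
    e2e = turns[-1][1] - turns[0][0]
    server = sum(end - start for start, end in turns)
    idle = e2e - server
    gaps = [nxt[0] - prev[1] for prev, nxt in zip(turns, turns[1:])]
    return {
        "e2e_time": e2e,
        "total_server_time": server,
        "total_idle_time": idle,
        # Reported as fractions in [0, 1]; multiply by 100 for percentages.
        "fraction_on_server": server / e2e if e2e > 0 else 0.0,
        "fraction_idle": idle / e2e if e2e > 0 else 0.0,
        "avg_inter_turn_gap": sum(gaps) / len(gaps) if gaps else 0.0,
    }


# Made-up timings: 4.0 s on the server within a 5.0 s conversation.
metrics = conversation_time_metrics([(0.0, 1.5), (2.0, 3.5), (4.0, 5.0)])
```

For this example, server time is 4.0 s of a 5.0 s conversation, so 80% of the time is spent on the server and 20% is idle, with an average inter-turn gap of 0.5 s.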

### Token metrics (per conversation)

- **Total prompt tokens**: Sum across turns. Shows the true cost of a conversation (history re-processing).
- **Total output tokens**: Sum across turns. Total useful generation.
- **Total tokens**: Combined.
- **Prompt token growth rate**: How much does prompt size increase per turn? (Linear = naive history append; sublinear = summarization/truncation.)
- **Effective conversation throughput**: `total_output_tokens / e2e_time`. Unlike per-request throughput, this accounts for inter-turn gaps and gives a realistic "useful output rate."
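A minimal sketch of the per-conversation token metrics follows. It assumes each turn is a `(prompt_tokens, output_tokens)` pair and that growth rate means the average increase in prompt size per turn; the proposal does not fix a definition, so that choice is an assumption here.

```python
def conversation_token_metrics(turns, e2e_time):
    """turns: list of (prompt_tokens, output_tokens) pairs, ordered by turn."""
    prompt_total = sum(prompt for prompt, _ in turns)
    output_total = sum(output for _, output in turns)
    # Assumed definition: average prompt growth per turn, (last - first) / (n - 1).
    if len(turns) > 1:
        growth = (turns[-1][0] - turns[0][0]) / (len(turns) - 1)
    else:
        growth = 0.0
    return {
        "total_prompt_tokens": prompt_total,
        "total_output_tokens": output_total,
        "total_tokens": prompt_total + output_total,
        "prompt_growth_per_turn": growth,
        "effective_throughput": output_total / e2e_time if e2e_time > 0 else 0.0,
    }


# Made-up token counts for a three-turn conversation lasting 5.0 s.
tokens = conversation_token_metrics([(100, 50), (200, 50), (300, 60)], e2e_time=5.0)
```

Here the prompt grows by 100 tokens per turn, and the effective throughput is 160 output tokens over 5.0 s, or 32 tokens/sec.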

### Conversation success/completion metrics

- **Conversation completion rate**: Fraction of conversations where all turns completed successfully (no errors, no cancellations).
- **Turns completed per conversation**: Distribution. For errored conversations, shows how far they got.
- **First error turn index**: For conversations with errors, which turn failed?
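The completion metrics could be derived from per-turn statuses as sketched below. The status strings `"completed"`, `"errored"`, and `"cancelled"` are placeholders for whatever status values the benchmark actually records, and `completion_metrics` is a hypothetical helper.

```python
def completion_metrics(conversations):
    """conversations: {conversation_id: [status per turn, ordered by turn]}."""
    completed_turns = {}
    first_error_turn = {}
    fully_completed = 0
    for cid, statuses in conversations.items():
        completed_turns[cid] = sum(s == "completed" for s in statuses)
        errors = [i for i, s in enumerate(statuses) if s == "errored"]
        if errors:
            first_error_turn[cid] = errors[0]
        if all(s == "completed" for s in statuses):
            fully_completed += 1
    rate = fully_completed / len(conversations) if conversations else 0.0
    return rate, completed_turns, first_error_turn


# Made-up statuses: "b" errors on its second turn and the rest is cancelled.
rate, completed_turns, first_error_turn = completion_metrics({
    "a": ["completed", "completed", "completed"],
    "b": ["completed", "errored", "cancelled"],
})
```

In this example half the conversations complete, and conversation `b` finishes one turn before failing at turn index 1.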

### Conversation count/shape metrics

- **Turns per conversation**: Distribution of conversation lengths in the benchmark run.
- **Conversation count**: Total conversations in the measurement window.

---

### New schema objects needed

- **`ConversationMetrics`** (new `StandardBaseDict`): holds conversation-level distribution summaries (e2e time, server time, idle time, completion rate, etc.)
- **`TurnPositionMetrics`** (new `StandardBaseDict`): holds turn-position-sliced distributions (TTFT by position, latency by position, etc.)
- Both would be fields on `GenerativeMetrics`
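To make the shape of these objects concrete, here is a rough sketch using plain dataclasses as stand-ins. It does not use `StandardBaseDict` or GuideLLM's existing distribution summary types, and every class and field name below is hypothetical; the real schema would follow the project's conventions.

```python
import statistics
from dataclasses import dataclass, field


@dataclass
class DistributionSketch:
    """Stand-in for a distribution summary (hypothetical, heavily reduced)."""

    mean: float
    median: float
    count: int

    @classmethod
    def from_values(cls, values):
        return cls(statistics.mean(values), statistics.median(values), len(values))


@dataclass
class ConversationMetricsSketch:
    """Stand-in for the proposed ConversationMetrics."""

    e2e_time: DistributionSketch
    total_server_time: DistributionSketch
    completion_rate: float


@dataclass
class TurnPositionMetricsSketch:
    """Stand-in for the proposed TurnPositionMetrics."""

    ttft_first_turn: DistributionSketch
    ttft_last_turn: DistributionSketch
    # Maps turn_index -> distribution of TTFT at that position.
    ttft_by_turn_index: dict = field(default_factory=dict)


# Made-up values: two conversations with e2e times of 4.0 s and 6.0 s.
conv_metrics = ConversationMetricsSketch(
    e2e_time=DistributionSketch.from_values([4.0, 6.0]),
    total_server_time=DistributionSketch.from_values([3.0, 5.0]),
    completion_rate=1.0,
)
```

Keeping the per-index distributions in a mapping keyed by `turn_index` avoids fixing a maximum conversation length in the schema.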



### CSV output changes

New columns would be added to [`src/guidellm/benchmark/outputs/csv.py`](src/guidellm/benchmark/outputs/csv.py) in a new "Conversation Metrics" group, following the existing pattern of `_add_stats_for_metric`.

---

## Open Questions

- **Single-turn conversations**: When every conversation has exactly 1 turn, conversation metrics collapse to per-request metrics. Should we skip conversation metrics entirely for single-turn runs, or show them as a degenerate case?
- **Incomplete conversations**: If turn 2 of 5 errors and turns 3-5 are cancelled, how do we handle conversation e2e time? Use last attempted turn? Last completed turn?
- **Turn gap attribution**: The gap between turns includes scheduler queue time + multi-turn dependency wait. Should we try to separate these, or report the combined gap?

