22 commits
1327302
feat: add multi-turn dataset manager with flat JSONL support
tianmu-li Apr 23, 2026
ba1cce8
feat: add ConversationManager and MultiTurnStrategy
tianmu-li Apr 23, 2026
057600b
test: add multi-turn unit and integration tests
tianmu-li Apr 24, 2026
109434d
feat: wire multi-turn into benchmark execution pipeline
tianmu-li Apr 24, 2026
d481c3c
docs: add multi-turn quickstart, examples, and conversion scripts
tianmu-li Apr 25, 2026
aca5431
fix: replace hardcoded /model/ path in validate_jsonl_schema.py docst…
tianmu-li Apr 25, 2026
0a7ad37
chore: move multi_turn_dataset_schema.json into scripts/ and update d…
tianmu-li Apr 25, 2026
1140361
fix: address PR #285 review comments for multi-turn implementation
tianmu-li Apr 28, 2026
3b9dd1e
fix: improve multi-turn PromptData text and add concurrent stress test
tianmu-li Apr 28, 2026
8ab45a1
refactor: replace semaphore with worker-pool concurrency in MultiTurn…
tianmu-li May 4, 2026
0d66900
fix: address remaining PR #285 review comments for multi-turn impleme…
tianmu-li May 4, 2026
0621eb8
fix: address remaining PR #285 review comments
tianmu-li May 4, 2026
9c7dcda
refactor: replace worker-pool with event-driven model in MultiTurnStr…
tianmu-li May 4, 2026
5ce558b
fix: address PR #285 review comments for multi-turn implementation
tianmu-li May 6, 2026
a5d4a87
Import fix
tianmu-li May 6, 2026
fbe7d33
fix: revert out-of-scope live-history tool_call_id rewriting
tianmu-li May 6, 2026
1b32543
Fix issue with tool call accumulation and reasoning content
tianmu-li May 6, 2026
b8de93b
feat: account for tool-call tokens in OSL / TPOT / TPS metrics
tianmu-li May 7, 2026
65bff7f
fix: correct chat-template tokenization for tool-call messages
tianmu-li May 7, 2026
8025c45
docs: fix stale references and tool-row format in multi-turn docs
tianmu-li May 8, 2026
80a88bf
feat: pre-compute ISL token counts for multi-turn dataset-history mode
tianmu-li May 8, 2026
07658ba
fix: unwrap BatchEncoding from apply_chat_template for Qwen3 tokenizer
tianmu-li May 8, 2026
287 changes: 287 additions & 0 deletions docs/MULTI_TURN_QUICKSTART.md
@@ -0,0 +1,287 @@
# Multi-Turn Conversation Benchmarking - Quick Start Guide

## Quick Start in 5 Minutes

### 1. Prepare Your Dataset

Create a JSONL file with your conversations. All rows for a given `conversation_id` must appear
**consecutively** in the file (no interleaving with other conversations):

```jsonl
{"conversation_id": "c1", "turn": 1, "role": "user", "content": "Hello!", "system": "You are a helpful assistant"}
{"conversation_id": "c1", "turn": 2, "role": "assistant", "content": "Hi! How can I help?"}
{"conversation_id": "c1", "turn": 3, "role": "user", "content": "What's 2+2?"}
{"conversation_id": "c1", "turn": 4, "role": "assistant", "content": "2+2 equals 4."}
```

**Rules**:

- Alternate between "user" and "assistant" roles (plain chat; agentic tool-use sequences are described under Troubleshooting)
- Start with "user" role
- Sequential turn numbers (1, 2, 3, ...)
- Same `conversation_id` for all turns in a conversation
- All rows for the same `conversation_id` must be grouped together
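
If your conversations live in a nested structure, a short script can flatten them into this row format. The sketch below is illustrative only: the output fields match the example above, but the input structure (`conversations` with `id`, `system`, and `messages`) is a hypothetical shape you would adapt to your own data.

```python
import json

# Hypothetical input: one entry per conversation, messages already in order.
conversations = [
    {
        "id": "c1",
        "system": "You are a helpful assistant",
        "messages": [
            {"role": "user", "content": "Hello!"},
            {"role": "assistant", "content": "Hi! How can I help?"},
        ],
    },
]

with open("conversations.jsonl", "w", encoding="utf-8") as f:
    for conv in conversations:
        # Writing each conversation in one pass keeps its rows grouped together.
        for turn, msg in enumerate(conv["messages"], start=1):
            row = {
                "conversation_id": conv["id"],
                "turn": turn,
                "role": msg["role"],
                "content": msg["content"],
            }
            # Attach the system prompt to the first row, as in the example above.
            if turn == 1 and conv.get("system"):
                row["system"] = conv["system"]
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
```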

### 2. Create Configuration File

Save as `multi_turn_config.yaml`:

```yaml
name: "my-multi-turn-benchmark"
version: "1.0"
type: "online"

model_params:
name: "your-model-name"
temperature: 0.7
max_new_tokens: 256

datasets:
- name: my_conversations
type: performance
path: path/to/your/conversations.jsonl
multi_turn: # ← Presence of this block enables multi-turn mode
turn_timeout_s: 300 # ← Max wait for prev turn

settings:
load_pattern:
type: multi_turn # ← Use multi-turn scheduler
target_concurrency: 32 # ← Required: max simultaneous conversations

client:
workers: 4

endpoint_config:
endpoints:
- "http://your-endpoint:8000"
api_type: openai

report_dir: logs/my_multi_turn_benchmark
```

Results are written to `report_dir` (here: `logs/my_multi_turn_benchmark/`).

### 3. Run Benchmark

```bash
inference-endpoint benchmark from-config --config multi_turn_config.yaml
```

That's it! Your benchmark will now:

- ✅ Enforce turn ordering (turn N+1 waits for turn N)
- ✅ Include conversation history in each request
- ✅ Track per-turn and per-conversation metrics
- ✅ Log all turns with conversation metadata

---

## Understanding Results

After the benchmark completes, check the directory configured via `report_dir`:

### Events Log

The `events.jsonl` file contains one JSON record per line:

- Standard fields: `sample_uuid`, `event_type`, `timestamp_ns`
- **New fields**: `conversation_id`, `turn_number`

Query examples:

```bash
# All events for a specific conversation
grep '"conversation_id": "c1"' logs/my_multi_turn_benchmark/events.jsonl

# With jq for structured output
jq 'select(.conversation_id == "c1") | {conversation_id, turn_number, event_type, timestamp_ns}' \
  logs/my_multi_turn_benchmark/events.jsonl
```

### Metrics

Currently available:

- **Per-turn metrics**: Latency, TTFT, TPOT for each turn
- **Conversation tracking**: All events tagged with conversation_id

_Note: Per-conversation aggregation (e.g., "conversations/sec") is coming in a future update._
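
Until that lands, you can derive rough per-conversation figures yourself from `events.jsonl`. The sketch below relies only on the fields documented above (`conversation_id`, `turn_number`, `timestamp_ns`); your log will contain several events per turn, and the exact `event_type` values may vary, so treat this as a starting point rather than an official metric.

```python
import json
from collections import defaultdict

turns = defaultdict(set)  # conversation_id -> turn numbers seen
first_ts = {}             # conversation_id -> earliest timestamp_ns
last_ts = {}              # conversation_id -> latest timestamp_ns

with open("logs/my_multi_turn_benchmark/events.jsonl", encoding="utf-8") as f:
    for line in f:
        ev = json.loads(line)
        cid = ev.get("conversation_id")
        if cid is None:
            continue
        turns[cid].add(ev.get("turn_number"))
        ts = ev["timestamp_ns"]
        first_ts[cid] = min(first_ts.get(cid, ts), ts)
        last_ts[cid] = max(last_ts.get(cid, ts), ts)

for cid in sorted(turns):
    duration_s = (last_ts[cid] - first_ts[cid]) / 1e9
    print(f"{cid}: {len(turns[cid])} turns, {duration_s:.2f}s wall-clock")
```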

---

## Concurrency Control

`target_concurrency` is **required** for the `multi_turn` load pattern. It controls how many
conversations are active simultaneously. Each active conversation has exactly one in-flight turn
at a time — a worker issues turn N, waits for the response, then issues turn N+1. A new
conversation starts only after a worker finishes all turns of its current one.

```yaml
settings:
  load_pattern:
    type: multi_turn
    target_concurrency: 32  # ← 32 conversations active simultaneously
```
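
As a mental model only (not the benchmark's actual scheduler), the behaviour described above corresponds to the asyncio sketch below: at most `target_concurrency` conversations hold a slot at once, and each conversation issues its turns strictly one after another. `send_turn` and the `user_turns` field are placeholders, not real APIs.

```python
import asyncio

TARGET_CONCURRENCY = 32  # mirrors settings.load_pattern.target_concurrency

async def send_turn(conversation, turn):
    """Placeholder for a real request to your endpoint."""
    await asyncio.sleep(0.1)

async def run_conversation(conversation, sem):
    # A conversation occupies one concurrency slot for its whole lifetime.
    async with sem:
        for turn in conversation["user_turns"]:
            # Turn N+1 is only issued after turn N's response arrives.
            await send_turn(conversation, turn)

async def main(conversations):
    sem = asyncio.Semaphore(TARGET_CONCURRENCY)
    await asyncio.gather(*(run_conversation(c, sem) for c in conversations))

# asyncio.run(main(loaded_conversations))
```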

---

## Troubleshooting

### Validate Your Dataset Before Running

Use the bundled validation script to check your JSONL file for schema errors before benchmarking:

```bash
python scripts/validate_jsonl_schema.py path/to/your/conversations.jsonl
```

This catches per-row schema errors (missing required fields, wrong types,
malformed `tool_results`). Cross-row invariants (consecutive turn numbers,
valid role sequences, grouped conversations) are enforced by
`MultiTurnDataset` at load time and will surface at benchmark startup.
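
If you want to catch the cross-row invariants before benchmark startup as well, a short standalone check along these lines works. This is an illustrative sketch, not part of the bundled script; it assumes only the row fields shown in this guide, and it skips role-sequence validation for brevity.

```python
import json
import sys

seen_done = set()    # conversation_ids whose block has already ended
expected_turn = {}   # conversation_id -> next expected turn number
current = None       # conversation_id of the block currently being read

with open(sys.argv[1], encoding="utf-8") as f:
    for lineno, line in enumerate(f, start=1):
        row = json.loads(line)
        cid, turn = row["conversation_id"], row["turn"]
        if cid != current:
            # A conversation_id reappearing after its block ended means interleaving.
            if cid in seen_done:
                sys.exit(f"line {lineno}: rows for conversation {cid!r} are not consecutive")
            if current is not None:
                seen_done.add(current)
            current = cid
        want = expected_turn.get(cid, 1)
        if turn != want:
            sys.exit(f"line {lineno}: expected turn {want} for {cid!r}, got {turn}")
        expected_turn[cid] = turn + 1

print("cross-row checks passed")
```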

### "Conversation has invalid role sequence"

**Problem**: Your dataset doesn't follow a valid role sequence.

**Fix**: Check your JSONL. Valid sequences:

- Plain chat: `user → assistant → user → assistant → ...`
- Agentic (tool-use): `user → assistant → tool → assistant → tool → ... → user`

Conversations may also end with a `tool` row (the model's response to the final tool call is the benchmark target).

Comment on lines +145 to +147

Collaborator

Is it possible to add a utility that parses the dataset to make sure it is compliant, so devs can use it instead of running the benchmark for testing?

Collaborator Author

Added documentation to use scripts/validate_jsonl_schema.py for validation.

### "Rows for conversation X are not consecutive"

**Problem**: Rows for the same `conversation_id` are interleaved with rows from other conversations.

**Fix**: Sort your JSONL so all rows for each conversation appear together.
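
One way to regroup an interleaved file is a stable sort by conversation ID and turn number, for example (input and output file names here are hypothetical):

```python
import json

with open("conversations_interleaved.jsonl", encoding="utf-8") as f:
    rows = [json.loads(line) for line in f]

# Group rows by conversation_id and keep turns in ascending order within each group.
rows.sort(key=lambda r: (r["conversation_id"], r["turn"]))

with open("conversations_sorted.jsonl", "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")
```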

### "Turn timed out waiting for prev turn"

**Problem**: Previous turn took longer than `turn_timeout_s`.

**Fixes**:

1. Increase `turn_timeout_s` in config
2. Check if your endpoint is slow or unresponsive
3. Look for errors in the endpoint logs

### Dataset not loading

**Problem**: MultiTurnDataset not recognized.

**Fix**: Ensure the `multi_turn:` block is present in the dataset config. The file format
is auto-detected from the `.jsonl` extension; no `format` field is needed:

```yaml
datasets:
  - path: your_file.jsonl
    multi_turn: {}
```

---

## Example Datasets

### Simple 2-Turn Conversation

```jsonl
{"conversation_id": "c1", "turn": 1, "role": "user", "content": "Hi"}
{"conversation_id": "c1", "turn": 2, "role": "assistant", "content": "Hello!"}
```

### With System Prompt

```jsonl
{"conversation_id": "c1", "turn": 1, "role": "user", "content": "Who won?", "system": "You are a sports expert"}
{"conversation_id": "c1", "turn": 2, "role": "assistant", "content": "The Lakers won."}
```

### Multiple Conversations

```jsonl
{"conversation_id": "c1", "turn": 1, "role": "user", "content": "Hi"}
{"conversation_id": "c1", "turn": 2, "role": "assistant", "content": "Hello!"}
{"conversation_id": "c2", "turn": 1, "role": "user", "content": "Hey"}
{"conversation_id": "c2", "turn": 2, "role": "assistant", "content": "Hi there!"}
```

### With Model Override

```jsonl
{"conversation_id": "c1", "turn": 1, "role": "user", "content": "Summarize this", "model": "gpt-4"}
{"conversation_id": "c1", "turn": 2, "role": "assistant", "content": "Here's the summary..."}
```

---

## Testing Your Setup

### 1. Use the Example Dataset

```bash
cd examples/09_MultiTurn
inference-endpoint benchmark from-config --config multi_turn_benchmark.yaml
```

### 2. Check the Logs

```bash
cat logs/multi_turn_test/benchmark.log
# Look for: "Turn X of conversation_id issued"
```

### 3. Verify Event Recording

```bash
# List all unique conversation IDs in the events log
jq -r '.conversation_id' logs/multi_turn_test/events.jsonl | sort -u
# Should show your conversation IDs
```

---

## Tips & Best Practices

### Dataset Design

- **Keep conversations realistic**: 2-10 turns is typical
- **Test edge cases**: 1-turn conversations, very long conversations
- **Include system prompts**: helps the model understand context

### Performance

- **Workers**: `client.workers` controls HTTP worker processes, independent of `target_concurrency`. The default (`-1`) auto-tunes based on NUMA topology.
- **Timeout**: Set `turn_timeout_s` to roughly 2x your longest expected turn latency
- **Memory**: roughly 1KB per turn; for example, 10,000 conversations of 10 turns each needs about 100 MB, so plan accordingly for large datasets

### Debugging

- **Start small**: Test with 1-2 conversations first
- **Single conversation**: Use `target_concurrency: 1`
- **Check events.jsonl**: Verify turn ordering with `jq`
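
If you prefer a scripted check over ad-hoc `jq` queries, the sketch below scans `events.jsonl` and flags any conversation whose recorded turn numbers go backwards. It assumes only the fields documented earlier (`conversation_id`, `turn_number`) and is a convenience helper, not a bundled tool.

```python
import json
from collections import defaultdict

last_turn = defaultdict(int)

with open("logs/multi_turn_test/events.jsonl", encoding="utf-8") as f:
    for line in f:
        ev = json.loads(line)
        cid, turn = ev.get("conversation_id"), ev.get("turn_number")
        if cid is None or turn is None:
            continue
        # Events are appended as turns are issued, so turn numbers per
        # conversation should never decrease.
        if turn < last_turn[cid]:
            print(f"out-of-order event: {cid} turn {turn} after turn {last_turn[cid]}")
        last_turn[cid] = max(last_turn[cid], turn)

print("turn-ordering check complete")
```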

---

## More Information

- **Full Documentation**: See `examples/09_MultiTurn/README.md`
- **Architecture**: See `AGENTS.md` (Multi-Turn section)

---

## Checklist

Before running your first multi-turn benchmark:

- [ ] Dataset follows format (user/assistant alternation, or agentic user→assistant→tool sequences)
- [ ] All rows for each conversation_id are grouped together
- [ ] Config has `multi_turn:` block in the dataset section
- [ ] Config has `load_pattern.type: multi_turn`
- [ ] Endpoint is running and reachable
- [ ] File uses `.jsonl` extension (format is auto-detected)
- [ ] Conversation IDs are unique per conversation
- [ ] Turn numbers are sequential (1, 2, 3, ...)

Happy benchmarking!