Commit 95b2939

cxs: Set benchmark name and correct eval func (#25)
Signed-off-by: Sohei Koyama <skoyama@ddn.com>
1 parent 7c1097e commit 95b2939

2 files changed

Lines changed: 38 additions & 18 deletions

real-multi-round-qa/README.md

Lines changed: 31 additions & 7 deletions
@@ -1,13 +1,37 @@
-# Real Multi-Round QA Benchmark
+# CxS: Real Multi-Round QA Benchmark

 ## Overview

-This benchmark is designed to identify **the maximum number of user sessions ($C\times S$) that can be kept active while maintaining a steady-state TTFT ≤ 2 s (95-th percentile)**. By sweeping the concurrency (C) and sequential (S) independently, it isolates whether compute capacity or KV-cache pressure is the first limiting factor.
+This benchmark is designed to identify **the maximum harmonic mean of user sessions $(C,S)$ that can be kept active while maintaining a steady-state TTFT ≤ 2 s (95-th percentile)**. By sweeping the concurrency ($C$) and sequential ($S$) independently, it isolates whether compute capacity or KV-cache pressure is the first limiting factor.
+

 We highly recommend monitoring vLLM/LMCache/GPU/storage metrics at the same time.

 This benchmark feeds full‑length novels to your LLM server and asks many follow‑up questions, just like a book critic. It is handy for testing long‑context handling and KV‑cache tools such as LMCache.

+The benchmark is called CxS (pronounced six for simplicity), referring to the product of Concurrent $\times$ Sequential users.
+
+### Definition
+
+Let us define the set of candidate pairs:
+
+$$
+\mathcal{D} = \{ (C_i, S_i) \mid \mathrm{TTFT}_{95}^{(i)} \leq 2 \}
+$$
+
+### Objective
+
+More precisely, we aim to find the pair that maximizes the harmonic mean among all candidates in $\mathcal{D}$:
+
+
+$$
+\underset{(C_i, S_i) \in \mathcal{D}}{\arg\max} \left( \frac{2 C_i S_i}{C_i + S_i} \right)
+$$
+
+We use the harmonic mean to compare scores.
+As a business metric, we report the product, CxS.
+For example, we say "Our system can keep up to {C×S} user sessions active!"
+
 ## Two simple knobs

 | Option | What it means |
@@ -88,8 +112,8 @@ $ python plot.py ./bench_dir_vllm vllm.png
 13 3 2 0.393902
 14 3 1 0.364927
 15 1 1 0.379049
-Max (C x S) where TTFT_95 <= 2s: 12
-=> C=4.0, S=3.0
+Max harmonic mean (C,S) where TTFT_95 <= 2s: 3.43
+=> C=4.0, S=3.0, CxS=12.0
 $ python plot.py ./bench_dir_lmcache lmcache.png
 num_users_concurrent num_users_sequential ttft_95
 0 1 1 0.524989
@@ -108,11 +132,11 @@ $ python plot.py ./bench_dir_lmcache lmcache.png
 13 4 2 0.586223
 14 1 2 0.477946
 15 2 2 0.457463
-Max (C x S) where TTFT_95 <= 2s: 16
-=> C=4.0, S=4.0
+Max harmonic mean (C,S) where TTFT_95 <= 2s: 4.00
+=> C=4.0, S=4.0, CxS=16.0
 ```

-LMCache allows 1.3x increase in the number of user sessions kept active at least.
+LMCache allows 1.17x increase in the number of user sessions kept active at least.

 Note: LMCache has not yet reached its limit in this case,
 so we can aim to further improve the score by changing C and S.
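
The figures in the updated README output can be checked by hand. The sketch below recomputes the harmonic mean $2CS/(C+S)$ for the two best pairs reported in the example output, (4, 3) for vLLM and (4, 4) for LMCache. It suggests that the new 1.17x figure is the ratio of harmonic means, while the ratio of the CxS products is still about 1.33x. The variable names are illustrative and not taken from the repository.

```python
def harmonic_mean(c: float, s: float) -> float:
    """Harmonic mean of C and S: 2CS / (C + S)."""
    return 2 * c * s / (c + s)


# Best (C, S) pairs as reported in the README example output.
vllm = (4, 3)
lmcache = (4, 4)

hm_vllm = harmonic_mean(*vllm)        # 24 / 7
hm_lmcache = harmonic_mean(*lmcache)  # 32 / 8

cxs_vllm = vllm[0] * vllm[1]
cxs_lmcache = lmcache[0] * lmcache[1]

print(f"vLLM:    harmonic mean = {hm_vllm:.2f}, CxS = {cxs_vllm}")
print(f"LMCache: harmonic mean = {hm_lmcache:.2f}, CxS = {cxs_lmcache}")
print(f"harmonic-mean ratio = {hm_lmcache / hm_vllm:.2f}")
print(f"CxS ratio           = {cxs_lmcache / cxs_vllm:.2f}")
```

Read this way, the sentence "1.17x increase in the number of user sessions" describes the change in the comparison score rather than in the session count itself.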

real-multi-round-qa/plot.py

Lines changed: 7 additions & 11 deletions
@@ -72,20 +72,16 @@ def main():
     ax.invert_xaxis()
     plt.savefig(args.output)

-    # Max product under 2s TTFT
+    # Max harmonic mean under 2s TTFT
     summary_under_2s = summary_df[summary_df["ttft_95"] <= 2].copy()
-    summary_under_2s["product"] = (
-        summary_under_2s["num_users_concurrent"] * summary_under_2s["num_users_sequential"]
-    )
     if not summary_under_2s.empty:
-        summary_under_2s["product"] = (
-            summary_under_2s["num_users_concurrent"] * summary_under_2s["num_users_sequential"]
+        summary_under_2s["harmonic_mean"] = 2 * summary_under_2s["num_users_concurrent"] * summary_under_2s["num_users_sequential"] / (
+            summary_under_2s["num_users_concurrent"] + summary_under_2s["num_users_sequential"]
         )
-        max_product = summary_under_2s["product"].max()
-        candidates = summary_under_2s[summary_under_2s["product"] == max_product]
-        best_row = candidates.sort_values("num_users_concurrent", ascending=False).iloc[0]
-        print(f"Max (C x S) where TTFT_95 <= 2s: {max_product}")
-        print(f" => C={best_row['num_users_concurrent']}, S={best_row['num_users_sequential']}")
+        best_row = summary_under_2s.sort_values("harmonic_mean", ascending=False).iloc[0]
+        product = best_row["num_users_concurrent"] * best_row["num_users_sequential"]
+        print(f"Max harmonic mean (C,S) where TTFT_95 <= 2s: {best_row['harmonic_mean']:.2f}")
+        print(f" => C={best_row['num_users_concurrent']}, S={best_row['num_users_sequential']}, CxS={product}")
     else:
         print("No data points with TTFT_95 <= 2s.")
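
One behavioural difference in this hunk may be worth noting. The removed code broke ties by preferring the highest `num_users_concurrent`, whereas the new code takes the first row after sorting on `harmonic_mean` alone. Pairs such as (3, 4) and (4, 3) have the same harmonic mean, so which one is reported depends on how the sort orders equal values. The following is a minimal, pandas-free sketch of the same selection with an explicit tie-break. The `Row` type, the `best_pair` helper, and the sample measurements are hypothetical and not part of `plot.py`.

```python
from typing import NamedTuple, Optional, Sequence


class Row(NamedTuple):
    c: int          # num_users_concurrent
    s: int          # num_users_sequential
    ttft_95: float  # 95th-percentile TTFT in seconds


def harmonic_mean(c: float, s: float) -> float:
    """Harmonic mean of C and S: 2CS / (C + S)."""
    return 2 * c * s / (c + s)


def best_pair(rows: Sequence[Row], limit: float = 2.0) -> Optional[Row]:
    """Return the row with the largest harmonic mean among rows with
    ttft_95 <= limit, preferring higher concurrency on ties.
    Returns None when no row meets the limit."""
    candidates = [r for r in rows if r.ttft_95 <= limit]
    if not candidates:
        return None
    # Tuple key: compare harmonic mean first, then C as the tie-breaker.
    return max(candidates, key=lambda r: (harmonic_mean(r.c, r.s), r.c))


# Made-up measurements, for illustration only.
rows = [
    Row(c=3, s=4, ttft_95=1.2),
    Row(c=4, s=3, ttft_95=1.5),
    Row(c=4, s=4, ttft_95=2.6),  # exceeds the 2 s limit, so excluded
    Row(c=1, s=1, ttft_95=0.4),
]

best = best_pair(rows)
if best is None:
    print("No data points with TTFT_95 <= 2s.")
else:
    print(f"C={best.c}, S={best.s}, CxS={best.c * best.s}, "
          f"harmonic mean={harmonic_mean(best.c, best.s):.2f}")
```

In the pandas version, a comparable effect could be had by sorting on both `harmonic_mean` and `num_users_concurrent` in descending order before taking the first row.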
