Commit 8710077

refactor: unify async openai compatible naming and update v0.6 docs (#1083)
* add adaptive/static/baseline for throughput check
* add adaptive/static/baseline for throughput check
* Fix formatting and add molmo throughput compare script
* fix(vllm): pass chat_template directly for lint stability
* remove compare scripts
* Fix async_openai lint and adaptive concurrency path
* fix: address P0 concurrency control and lint issues
* fix: harden P1 concurrency control and scheduling
* feat: add prefix-aware queueing for openai and async models
* refactor: unify async openai compatible module naming
* docs: add adaptive control feature notes in v0.6
* docs: expand intro and performance details with 7x api speedup
* docs: clarify adaptive v2 speedup attribution
* docs: note canonical naming change to openai / async_openai in v0.6
1 parent 12c4a91 commit 8710077

7 files changed

Lines changed: 240 additions & 31 deletions

docs/lmms-eval-0.6.md

Lines changed: 118 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -11,9 +11,11 @@ Building on lmms-eval's existing framework, v0.6 transforms evaluation from a on
1111
- **During training**: Evaluation runs as a standalone service, decoupled from the training loop. Submit checkpoints for async evaluation without blocking GPU training.
1212
- **Post-training**: Rapid, comprehensive evaluation across all modalities with statistical guarantees on the results.
1313

14+
From the engineering side, v0.6 also ships a substantial API-throughput upgrade. With the latest API control-path updates (adaptive concurrency, refill scheduling, prefix-aware queueing, and retry/backoff decoupling), we observe about **7.5x throughput improvement** on a fixed `LIMIT=100` benchmark (`0.3278 -> 2.4584 req/s`), while preserving metric outputs for the same task/model setup.
15+
1416
| Area | Key Features |
1517
|------|--------------|
16-
| **Performance** | Fully async and decoupled inference; data layer optimization for high-throughput multimodal access |
18+
| **Performance** | Fully async and decoupled inference; adaptive API concurrency control; prefix-aware queueing; measured ~7x+ throughput gain on API benchmark path |
1719
| **Evaluation as a Service** | Async job submission without blocking GPU training; separately hosted eval service on dedicated GPUs |
1820
| **Statistical Rigor** | Confidence intervals, clustered standard errors, baseline-anchored paired comparison |
1921
| **Frontier Evaluation** | Long video, spatial intelligence, and agentic scenarios |
@@ -51,11 +53,120 @@ Supported backends:
5153
- SGLang
5254
- API Models (OpenAI, Anthropic, Groq, etc.)
5355

56+
**Unified API model interfaces**
57+
58+
v0.6 unifies API-backed model evaluation under two interfaces:
59+
60+
| Interface | `--model` name | Backend | Recommended |
61+
|-----------|---------------|---------|-------------|
62+
| Async | `async_openai` | `asyncio` + `AsyncOpenAI` | **Yes** |
63+
| Sync | `openai` | `ThreadPoolExecutor` + `OpenAI` | Fallback |
64+
65+
We recommend `async_openai` for all API-backed evaluation — it uses native async I/O and achieves significantly higher throughput.
66+
67+
Both resolve to **chat mode by default** via Model Registry V2. The simple mode (`doc_to_visual` + `doc_to_text`) is deprecated and will be removed in a future release. See [Model Registry V2](#model-registry-v2) below for details.
68+
69+
> **Naming change in v0.6**: the canonical model names have been shortened from `openai_compatible` / `async_openai_compatible` to `openai` / `async_openai`. These are the names used in filenames, registry keys, and `@register_model` decorators. The old names (`openai_compatible`, `openai_compatible_chat`, `async_openai_compatible`, `async_openai_compatible_chat`) continue to work as aliases via `MODEL_ALIASES` in `__init__.py`, so existing scripts are not affected.
70+
71+
#### Adaptive Concurrency Control
72+
73+
v0.6 adds adaptive concurrency for API-backed evaluation (`async_openai`, `openai`).
74+
75+
The controller continuously adjusts in-flight request count using three online signals:
76+
- request failure rate
77+
- rate-limit hit rate (e.g., 429 / throttling)
78+
- p95 latency against a target latency budget
79+
80+
Execution also uses refill-style scheduling (no full-window barrier), so completed requests immediately release slots for new work.
81+
82+
For API models with repeated prompt prefixes, v0.6 also supports prefix-aware queueing to improve prefill-cache hit opportunities by dispatching same-prefix requests close together.
83+
84+
| Control | Meaning |
85+
|---------|---------|
86+
| `adaptive_concurrency` | Enable/disable adaptive mode |
87+
| `adaptive_min_concurrency` | Lower bound for concurrency |
88+
| `adaptive_max_concurrency` | Upper bound for concurrency |
89+
| `adaptive_target_latency_s` | Target p95 latency budget |
90+
| `adaptive_increase_step` | Additive growth step when healthy |
91+
| `adaptive_decrease_factor` | Multiplicative decrease on pressure |
92+
| `adaptive_failure_threshold` | Failure-rate threshold for backoff |
93+
| `retry_backoff_s` | Retry sleep interval (separate from request timeout) |
94+
| `prefix_aware_queue` | Group dispatch by prefix hash |
95+
| `prefix_hash_chars` | Prefix length used for hashing |
96+
97+
Example (sync API backend):
98+
```bash
99+
python -m lmms_eval \
100+
--model openai \
101+
--model_args model_version=<model>,num_concurrent=16,adaptive_concurrency=true,adaptive_min_concurrency=1,adaptive_max_concurrency=64,adaptive_target_latency_s=15.0,adaptive_increase_step=0.15,adaptive_decrease_factor=0.75,adaptive_failure_threshold=0.05,retry_backoff_s=1.0,prefix_aware_queue=true,prefix_hash_chars=256
102+
```
103+
104+
Example (async API backend, recommended):
105+
```bash
106+
python -m lmms_eval \
107+
--model async_openai \
108+
--model_args model_version=<model>,num_cpus=16,adaptive_concurrency=true,adaptive_min_concurrency=1,adaptive_max_concurrency=64,adaptive_target_latency_s=15.0,adaptive_increase_step=0.15,adaptive_decrease_factor=0.75,adaptive_failure_threshold=0.05,retry_backoff_s=1.0,prefix_aware_queue=true,prefix_hash_chars=256
109+
```
110+
111+
#### Performance Snapshot: Latest API Path
112+
113+
To make the performance claim auditable, we keep a concrete benchmark trail in this repo:
114+
- Historical comparison file: `logs/openrouter_molmo_throughput/throughput_comparison.csv`
115+
- Latest-vs-previous comparison file: `logs/openrouter_molmo_throughput/throughput_comparison_latest_vs_prev.csv`
116+
117+
Benchmark setup used for this snapshot:
118+
- Task: `mme`
119+
- Limit: `100`
120+
- Model backend: `openai` / `async_openai` API path
121+
- API endpoint family: OpenRouter-compatible
122+
- Model: `bytedance-seed/seed-1.6-flash`
123+
- Baseline control: static single concurrency (`num_concurrent=1`)
124+
- Latest control: adaptive + refill scheduling + prefix-aware queueing + explicit retry backoff
125+
126+
Result summary (`requests_per_sec`):
127+
128+
| Run Type | Concurrency | RPS | Wall Time (s) | Relative to Baseline |
129+
|----------|-------------|-----|---------------|----------------------|
130+
| baseline | 1 | 0.327836 | 305.030740 | 1.00x |
131+
| static | 24 | 1.926987 | 51.894473 | 5.88x |
132+
| adaptive (v1) | 16 | 2.404706 | 41.585121 | 7.33x |
133+
| adaptive (v2) | 16 | 2.458435 | 40.676279 | 7.50x |
134+
135+
Interpretation:
136+
- The latest API control path reaches about **7.5x throughput** over baseline on the same `LIMIT=100` setup.
137+
- Compared to the previous adaptive run (`v1`), the latest adaptive run (`v2`) still improves (`2.4047 -> 2.4584 req/s`, `+2.23%`). This is a small but measurable delta in a noisy environment (shared network + provider-side scheduling), so the right takeaway is not "a new ceiling", but "less overhead and better utilization under the same constraints."
138+
- The core point: this speedup is not from changing benchmark difficulty. We keep the same task (`mme`), model (`bytedance-seed/seed-1.6-flash`), limit (`100`), and evaluation prompts/settings. The gain comes from changes in the API request scheduling/control path.
139+
- What `adaptive (v2)` means in practice:
140+
- Refill scheduling (no window barrier): maintain a steady pool of in-flight requests and immediately dispatch new work as soon as a request completes. This reduces idle gaps and prevents the slowest request in a window from gating progress.
141+
- Rolling controller updates: adjust concurrency based on a rolling batch of completions (failure rate, rate-limit hits, and p95 latency vs target) rather than only after fixed windows. This makes the controller more responsive and less sensitive to outliers.
142+
- Hysteresis for stability: use separate "reduce" vs "increase" conditions (and minimum sample thresholds) to avoid oscillating on a single transient 429 or a brief latency spike.
143+
- Retry/backoff decoupling: `retry_backoff_s` is explicitly separate from request timeout, so retries don't sleep for long timeouts and tie up worker slots.
144+
- Prefix-aware queueing (when enabled): reorder dispatch by prefix hash so same-prefix requests are sent close together, improving prefill-cache hit opportunities on providers that support prefix caching. (Some routing layers may dilute this benefit; the mechanism is still safe.)
145+
54146
```bash
55147
python -m lmms_eval --model vllm --model_args pretrained=Qwen/Qwen2.5-VL-7B
56148
python -m lmms_eval --model sglang --model_args pretrained=Qwen/Qwen2.5-VL-7B
57149
```
58150

151+
#### Model Registry V2
152+
153+
v0.6 introduces `ModelRegistryV2` — a unified model registry that replaces the previous ad-hoc import system. All model names (`--model X`) resolve through a single path.
154+
155+
**How it works**
156+
157+
Two dicts in `lmms_eval/models/__init__.py` declare available models:
158+
159+
- `AVAILABLE_SIMPLE_MODELS`: maps `model_id` -> `ClassName` for simple (legacy) models in `models/simple/`
160+
- `AVAILABLE_CHAT_TEMPLATE_MODELS`: maps `model_id` -> `ClassName` for chat models in `models/chat/`
161+
162+
At startup, the registry merges both dicts into `ModelManifest` objects. Each manifest holds a `model_id` and up to two class paths (simple + chat). Class paths are auto-constructed: `lmms_eval.models.{type}.{model_id}.{ClassName}`, so the dict key **must match the filename**.
163+
164+
**Resolution**: chat is always preferred over simple (unless `force_simple=True`). This means `--model openai` transparently resolves to the chat implementation.
165+
166+
**Aliasing**: backward-compatible names are supported via `MODEL_ALIASES` in `__init__.py` and via `ModelManifest.aliases`. Old names like `openai_compatible`, `openai_compatible_chat`, `async_openai_compatible`, and `async_openai_compatible_chat` continue to work.
167+
168+
**Simple mode deprecation**: the simple model interface (`doc_to_visual` + `doc_to_text`) for API models is deprecated. New integrations should always use chat (`doc_to_messages` + `ChatMessages`). The simple implementations in `models/simple/openai.py` will be removed in a future release.
169+
59170
### 1.2 Data Layer
60171

61172
#### Storage Format
@@ -317,20 +428,20 @@ Same accuracy, but Model A is 3× more stable.
317428

318429
---
319430

320-
## 3. Frontier Multimodal Evaluation (TODO)
431+
## 3. Evaluating Multimodal Models in 2026 (TODO)
321432

322-
> **Note**: This section outlines planned frontier evaluation features. Implementation is in progress.
433+
> **Note**: This section outlines planned evaluation features. Implementation is in progress.
323434
324-
### 3.1 Why Frontier Scenarios Matter
435+
### 3.1 More features are expected
325436

326-
Static image QA benchmarks are saturating. Building frontier multimodal systems requires setting more challenging tasks and evaluating them in more realistic scenarios:
437+
Static image QA benchmarks are saturating. Building multimodal systems requires setting more challenging tasks and evaluating them in more realistic scenarios:
327438

328439
| Capability | Challenge | Current Gap |
329440
|------------|-----------|-------------|
330441
| Long video understanding | 10min+ videos, 1000+ frames | Most benchmarks use <128 frames |
331-
| High temporal resolution | Event detection at 30fps | Sparse sampling loses fine-grained actions |
442+
| High motion | Objects movement at 30fps | Sparse sampling loses fine-grained actions |
332443
| Spatial reasoning | 3D world understanding | 2D perception ≠ physical grounding |
333-
| Agentic interaction (streaming input) | Multi-step task execution | Static QA can't measure planning/tool use |
444+
| Agentic interaction | Multi-step task execution and feedback | Static QA can't measure planning/tool use well |
334445

335446
**Key insight**: These capabilities require **in-environment evaluation**, the model must interact with simulators, receive feedback, and adapt. Static input-output pairs cannot capture this.
336447
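Reviewer note: the controller described in the docs above (additive growth when healthy, multiplicative decrease under pressure, bounded by min/max) can be illustrated with a minimal sketch. This is not the repository's `decide_next_concurrency`; the names `SketchConfig` and `next_concurrency` are hypothetical, treating `increase_step` as a fraction of the current concurrency is an assumption, and the hysteresis and minimum-sample thresholds mentioned in the docs are left out for brevity.

```python
import math
from dataclasses import dataclass


@dataclass
class SketchConfig:
    """Hypothetical config; field names only loosely mirror the documented controls."""
    min_concurrency: int = 1
    max_concurrency: int = 64
    target_latency_s: float = 15.0
    increase_step: float = 0.15      # assumption: fraction of current concurrency
    decrease_factor: float = 0.75
    failure_threshold: float = 0.05


def p95(latencies):
    """Nearest-rank 95th percentile; returns 0.0 for an empty sample."""
    if not latencies:
        return 0.0
    ordered = sorted(latencies)
    rank = max(1, math.ceil(0.95 * len(ordered)))
    return ordered[rank - 1]


def next_concurrency(current, cfg, total, failed, rate_limited, latencies):
    """Return the next in-flight limit from one batch of completions."""
    if total == 0:
        return current  # nothing observed yet, keep the current limit
    failure_rate = failed / total
    under_pressure = (
        rate_limited > 0
        or failure_rate > cfg.failure_threshold
        or p95(latencies) > cfg.target_latency_s
    )
    if under_pressure:
        proposed = math.floor(current * cfg.decrease_factor)
    else:
        proposed = current + max(1, math.ceil(current * cfg.increase_step))
    # Clamp to the configured bounds.
    return max(cfg.min_concurrency, min(cfg.max_concurrency, proposed))
```

With the example values from the docs (`adaptive_increase_step=0.15`, `adaptive_decrease_factor=0.75`), a healthy batch at concurrency 16 would grow the limit by 3 in this sketch, while a single rate-limit hit would cut it to 12.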

examples/models/async_openai_compatible.sh

Lines changed: 4 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -1,9 +1,9 @@
11
#!/bin/bash
22

33
# ============================================================================
4-
# Async OpenAI Compatible Model Example
4+
# Async OpenAI Model Example
55
# ============================================================================
6-
# This script demonstrates how to use the async_openai_compatible model with:
6+
# This script demonstrates how to use the async_openai model with:
77
# 1. Basic video/image evaluation (without MCP tools)
88
# 2. Tool-enabled evaluation (with MCP client)
99
#
@@ -61,7 +61,7 @@ accelerate launch --num_processes=1 --main_process_port 12345 -m lmms_eval \
6161
# 7. The result is sent back to the model for continuation
6262
# 8. Steps 3-7 repeat in a loop until the model produces final text output
6363
#
64-
# Tool Calling Loop in Code (from async_openai.py):
64+
# Tool Calling Loop in Code (from async_openai_compatible.py):
6565
# ──────────────────────────────────────────────────
6666
# while response.choices[0].finish_reason == "tool_calls":
6767
# for tool_call in response.choices[0].message.tool_calls:
@@ -103,4 +103,4 @@ accelerate launch --num_processes=1 --main_process_port 12345 -m lmms_eval \
103103
# nframes : Fixed number of frames for video (default: 64)
104104
# is_qwen3_vl : Enable Qwen3-VL specific formatting (default: False)
105105
# mcp_server_path : Path to MCP server script for tool calling (optional)
106-
# work_dir : Working directory for MCP tools (default: /tmp/...)
106+
# work_dir : Working directory for MCP tools (default: /tmp/...)
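Reviewer note: the tool-calling loop quoted in the script comments (steps 3-7 repeating until the model produces final text) can be sketched roughly as follows. The response shape (`choices[0].finish_reason`, `message.tool_calls`) follows the OpenAI chat completions format quoted in the comments; `run_tool_loop`, `call_model`, `run_tool`, and `stub_response` are hypothetical names, and a real client would also append the assistant message that carries the tool calls before the tool results.

```python
import json
from types import SimpleNamespace


def stub_response(finish_reason, content=None, tool_calls=None):
    """Hypothetical helper: build an object shaped like a chat completion response."""
    message = SimpleNamespace(content=content, tool_calls=tool_calls or [])
    choice = SimpleNamespace(finish_reason=finish_reason, message=message)
    return SimpleNamespace(choices=[choice])


def run_tool_loop(first_response, call_model, run_tool, max_rounds=8):
    """Execute requested tools and re-query the model until it returns text."""
    messages = []
    response = first_response
    rounds = 0
    while response.choices[0].finish_reason == "tool_calls" and rounds < max_rounds:
        for tool_call in response.choices[0].message.tool_calls:
            arguments = json.loads(tool_call.function.arguments)
            result = run_tool(tool_call.function.name, arguments)
            # Feed each tool result back to the model as a "tool" message.
            messages.append(
                {"role": "tool", "tool_call_id": tool_call.id, "content": str(result)}
            )
        response = call_model(messages)
        rounds += 1
    return response.choices[0].message.content
```

The `max_rounds` cap is an assumption added here so that a model which keeps requesting tools cannot loop forever.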

lmms_eval/models/__init__.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -119,7 +119,7 @@
 
 MODEL_ALIASES: dict[str, tuple[str, ...]] = {
     "openai": ("openai_compatible", "openai_compatible_chat"),
-    "async_openai": ("async_openai_compatible_chat",),
+    "async_openai": ("async_openai_compatible_chat", "async_openai_compatible"),
 }

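Reviewer note: this hunk is what makes the old `async_openai_compatible` name resolve again. A minimal sketch of how such an alias table could be flattened into a lookup is shown below. The dict contents are copied from the diff; `build_alias_index` and `resolve_model_name` are hypothetical helpers, not the registry's actual API.

```python
MODEL_ALIASES: dict[str, tuple[str, ...]] = {
    "openai": ("openai_compatible", "openai_compatible_chat"),
    "async_openai": ("async_openai_compatible_chat", "async_openai_compatible"),
}


def build_alias_index(aliases):
    """Flatten canonical -> old-names into a single old-name -> canonical map."""
    index = {}
    for canonical, old_names in aliases.items():
        index[canonical] = canonical
        for old_name in old_names:
            # Refuse ambiguous aliases instead of silently overwriting one.
            if index.get(old_name, canonical) != canonical:
                raise ValueError(f"alias {old_name!r} maps to two models")
            index[old_name] = canonical
    return index


def resolve_model_name(name, index):
    """Return the canonical model name, or the input unchanged if unknown."""
    return index.get(name, name)
```

Before this commit the tuple for `async_openai` lacked `async_openai_compatible`, so a lookup like the one above would have left that name unresolved.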

lmms_eval/models/chat/async_openai.py

Lines changed: 23 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -20,7 +20,9 @@
 from lmms_eval.models.model_utils.concurrency_control import (
     AdaptiveConcurrencyConfig,
     decide_next_concurrency,
+    extract_text_prefix_from_chat_messages,
     is_rate_limit_error,
+    make_prefix_hash,
     parse_bool,
 )
 from lmms_eval.protocol import ChatMessages
@@ -60,6 +62,8 @@ def __init__(
         adaptive_increase_step: float = 0.1,
         adaptive_decrease_factor: float = 0.7,
         adaptive_failure_threshold: float = 0.05,
+        prefix_aware_queue: bool = True,
+        prefix_hash_chars: int = 256,
         **kwargs,
     ) -> None:
         super().__init__()
@@ -91,6 +95,8 @@ def __init__(
             decrease_factor=adaptive_decrease_factor,
             failure_threshold=adaptive_failure_threshold,
         )
+        self.prefix_aware_queue = parse_bool(prefix_aware_queue)
+        self.prefix_hash_chars = max(32, int(prefix_hash_chars))
         if mcp_server_path is not None:
             from lmms_eval.mcp import MCPClient

@@ -272,6 +278,18 @@ async def run():
272278
if self.adaptive_concurrency
273279
else max(1, self.num_cpus)
274280
)
281+
dispatch_order = list(range(len(requests)))
282+
if self.prefix_aware_queue:
283+
prefix_hashes = {}
284+
for idx in dispatch_order:
285+
request_obj = requests[idx]
286+
prefix_text = request_obj.args[0] if isinstance(request_obj.args[0], str) else ""
287+
if not prefix_text:
288+
_, doc_to_messages, _, doc_id, task, split = request_obj.args
289+
chat_messages_raw = doc_to_messages(self.task_dict[task][split][doc_id])
290+
prefix_text = extract_text_prefix_from_chat_messages(chat_messages_raw, self.prefix_hash_chars)
291+
prefix_hashes[idx] = make_prefix_hash(prefix_text, self.prefix_hash_chars)
292+
dispatch_order.sort(key=lambda idx: (prefix_hashes[idx], idx))
275293
cursor = 0
276294

277295
async def _process(req, idx):
@@ -343,10 +361,11 @@ def maybe_update_concurrency(force: bool = False) -> None:
343361
request_latencies = []
344362
completed_since_adapt = 0
345363

346-
while cursor < len(requests) or in_flight:
347-
while cursor < len(requests) and len(in_flight) < max(1, current_concurrency):
348-
task = asyncio.create_task(_process(requests[cursor], cursor))
349-
in_flight[task] = cursor
364+
while cursor < len(dispatch_order) or in_flight:
365+
while cursor < len(dispatch_order) and len(in_flight) < max(1, current_concurrency):
366+
request_index = dispatch_order[cursor]
367+
task = asyncio.create_task(_process(requests[request_index], request_index))
368+
in_flight[task] = request_index
350369
cursor += 1
351370

352371
if not in_flight:
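Reviewer note: the loop in this hunk is the refill scheduler, where a slot is reused as soon as any request completes and requests are started in `dispatch_order` rather than list order. A self-contained sketch of that pattern with `asyncio` follows. `run_with_refill` and `worker` are hypothetical names, and the real method also handles retries, caching, and controller updates that are omitted here.

```python
import asyncio


async def run_with_refill(items, worker, concurrency, dispatch_order=None):
    """Run worker(item) for every item with at most `concurrency` in flight.

    Results come back in the original item order, whatever the dispatch order.
    """
    if dispatch_order is None:
        order = list(range(len(items)))
    else:
        order = list(dispatch_order)
    results = [None] * len(items)
    in_flight = {}  # task -> original item index
    cursor = 0
    while cursor < len(order) or in_flight:
        # Refill: start new work until every slot is taken.
        while cursor < len(order) and len(in_flight) < max(1, concurrency):
            index = order[cursor]
            task = asyncio.create_task(worker(items[index]))
            in_flight[task] = index
            cursor += 1
        # Wake as soon as one request finishes; there is no full-window barrier.
        done, _ = await asyncio.wait(
            list(in_flight), return_when=asyncio.FIRST_COMPLETED
        )
        for task in done:
            results[in_flight.pop(task)] = task.result()
    return results
```

Keying `in_flight` by task and storing the original index is what lets the dispatch order change without scrambling the result order, which matches the `in_flight[task] = request_index` change in the hunk.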

lmms_eval/models/chat/openai.py

Lines changed: 22 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -11,7 +11,9 @@
 from lmms_eval.imports import optional_import
 from lmms_eval.models.model_utils.concurrency_control import (
     decide_next_concurrency,
+    extract_text_prefix_from_chat_messages,
     is_rate_limit_error,
+    make_prefix_hash,
 )
 from lmms_eval.models.model_utils.gen_metrics import log_metrics
 from lmms_eval.models.simple.openai import OpenAICompatible as OpenAICompatibleSimple
@@ -45,6 +47,18 @@ def generate_until(self, requests) -> List[str]:
4547
self.num_concurrent,
4648
self.adaptive_config.max_concurrency,
4749
)
50+
dispatch_order = list(range(len(reordered_requests)))
51+
if self.prefix_aware_queue:
52+
prefix_hashes = {}
53+
for idx in dispatch_order:
54+
req = reordered_requests[idx]
55+
prefix_text = req.args[0] if isinstance(req.args[0], str) else ""
56+
if not prefix_text:
57+
_, doc_to_messages, _, doc_id, task, split = req.args
58+
chat_messages_raw = doc_to_messages(self.task_dict[task][split][doc_id])
59+
prefix_text = extract_text_prefix_from_chat_messages(chat_messages_raw, self.prefix_hash_chars)
60+
prefix_hashes[idx] = make_prefix_hash(prefix_text, self.prefix_hash_chars)
61+
dispatch_order.sort(key=lambda idx: (prefix_hashes[idx], idx))
4862
cursor = 0
4963
failed_requests = 0
5064
rate_limited_requests = 0
@@ -165,18 +179,19 @@ def build_payload_for_index(global_index: int):
165179
return doc_uuid, None, payload
166180

167181
with ThreadPoolExecutor(max_workers=max_workers) as executor:
168-
while cursor < len(reordered_requests) or in_flight:
169-
while cursor < len(reordered_requests) and len(in_flight) < max(1, current_concurrency):
170-
doc_uuid, cached_response, payload = build_payload_for_index(cursor)
171-
doc_uuids[cursor] = doc_uuid
182+
while cursor < len(dispatch_order) or in_flight:
183+
while cursor < len(dispatch_order) and len(in_flight) < max(1, current_concurrency):
184+
request_index = dispatch_order[cursor]
185+
doc_uuid, cached_response, payload = build_payload_for_index(request_index)
186+
doc_uuids[request_index] = doc_uuid
172187
if cached_response is not None:
173-
responses[cursor] = cached_response
188+
responses[request_index] = cached_response
174189
pbar.update(1)
175190
cursor += 1
176191
continue
177192

178-
future = executor.submit(process_single_request, cursor, payload)
179-
in_flight[future] = cursor
193+
future = executor.submit(process_single_request, request_index, payload)
194+
in_flight[future] = request_index
180195
cursor += 1
181196

182197
if not in_flight:
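Reviewer note: both backends build `dispatch_order` the same way, sorting request indices by `(prefix hash, index)` so that same-prefix requests are dispatched next to each other. The sketch below shows that ordering in isolation. `sketch_prefix_hash` is a hypothetical stand-in that uses SHA-256 over the first N characters; the actual `make_prefix_hash` implementation is not shown in this diff and may differ.

```python
import hashlib


def sketch_prefix_hash(text, prefix_chars=256):
    """Hash only the leading characters so prompts sharing a prefix collide."""
    return hashlib.sha256(text[:prefix_chars].encode("utf-8")).hexdigest()


def prefix_aware_order(prompts, prefix_chars=256):
    """Return request indices grouped by prefix hash, stable within a group."""
    order = list(range(len(prompts)))
    hashes = {idx: sketch_prefix_hash(prompts[idx], prefix_chars) for idx in order}
    # The secondary key `idx` keeps the original order inside each group.
    order.sort(key=lambda idx: (hashes[idx], idx))
    return order
```

Note that the committed code clamps `prefix_hash_chars` to at least 32, which this sketch does not do. The relative order of the groups depends on the hash values, so only adjacency within a group is guaranteed.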
