refactor: unify async openai compatible naming and update v0.6 docs (#1083)

Luodian · web-flow · commit 871007727620 · 2026-02-16T12:37:09.000+08:00
* add adaptive/static/baseline for throughput check

* add adaptive/static/baseline for throughput check

* Fix formatting and add molmo throughput compare script

* fix(vllm): pass chat_template directly for lint stability

* remove compare scripts

* Fix async_openai lint and adaptive concurrency path

* fix: address P0 concurrency control and lint issues

* fix: harden P1 concurrency control and scheduling

* feat: add prefix-aware queueing for openai and async models

* refactor: unify async openai compatible module naming

* docs: add adaptive control feature notes in v0.6

* docs: expand intro and performance details with 7x api speedup

* docs: clarify adaptive v2 speedup attribution

* docs: note canonical naming change to openai / async_openai in v0.6
diff --git a/docs/lmms-eval-0.6.md b/docs/lmms-eval-0.6.md
@@ -11,9 +11,11 @@ Building on lmms-eval's existing framework, v0.6 transforms evaluation from a on
 - **During training**: Evaluation runs as a standalone service, decoupled from the training loop. Submit checkpoints for async evaluation without blocking GPU training.
 - **Post-training**: Rapid, comprehensive evaluation across all modalities with statistical guarantees on the results.
 
+From the engineering side, v0.6 also ships a substantial API-throughput upgrade. With the latest API control-path updates (adaptive concurrency, refill scheduling, prefix-aware queueing, and retry/backoff decoupling), we observe roughly a **7.5x throughput improvement** on a fixed `LIMIT=100` benchmark (`0.3278 -> 2.4584 req/s`), while preserving metric outputs for the same task/model setup.
+
 | Area | Key Features |
 |------|--------------|
-| **Performance** | Fully async and decoupled inference; data layer optimization for high-throughput multimodal access |
+| **Performance** | Fully async and decoupled inference; adaptive API concurrency control; prefix-aware queueing; measured ~7.5x throughput gain on the API benchmark path |
 | **Evaluation as a Service** | Async job submission without blocking GPU training; separately hosted eval service on dedicated GPUs |
 | **Statistical Rigor** | Confidence intervals, clustered standard errors, baseline-anchored paired comparison |
 | **Frontier Evaluation** | Long video, spatial intelligence, and agentic scenarios |
@@ -51,11 +53,120 @@ Supported backends:
 - SGLang
 - API Models (OpenAI, Anthropic, Groq, etc.)
 
+**Unified API model interfaces**
+
+v0.6 unifies API-backed model evaluation under two interfaces:
+
+| Interface | `--model` name | Backend | Recommended |
+|-----------|---------------|---------|-------------|
+| Async | `async_openai` | `asyncio` + `AsyncOpenAI` | **Yes** |
+| Sync | `openai` | `ThreadPoolExecutor` + `OpenAI` | Fallback |
+
+We recommend `async_openai` for all API-backed evaluation — it uses native async I/O and achieves significantly higher throughput.
+
+Both resolve to **chat mode by default** via Model Registry V2. The simple mode (`doc_to_visual` + `doc_to_text`) is deprecated and will be removed in a future release. See [Model Registry V2](#model-registry-v2) below for details.
+
+> **Naming change in v0.6**: the canonical model names have been shortened from `openai_compatible` / `async_openai_compatible` to `openai` / `async_openai`. These are the names used in filenames, registry keys, and `@register_model` decorators. The old names (`openai_compatible`, `openai_compatible_chat`, `async_openai_compatible`, `async_openai_compatible_chat`) continue to work as aliases via `MODEL_ALIASES` in `__init__.py`, so existing scripts are not affected.
+
+#### Adaptive Concurrency Control
+
+v0.6 adds adaptive concurrency for API-backed evaluation (`async_openai`, `openai`).
+
+The controller continuously adjusts in-flight request count using three online signals:
+- request failure rate
+- rate-limit hit rate (e.g., 429 / throttling)
+- p95 latency against a target latency budget
+
+Execution also uses refill-style scheduling (no full-window barrier), so completed requests immediately release slots for new work.
+
+For API models with repeated prompt prefixes, v0.6 also supports prefix-aware queueing to improve prefill-cache hit opportunities by dispatching same-prefix requests close together.
+
+| Control | Meaning |
+|---------|---------|
+| `adaptive_concurrency` | Enable/disable adaptive mode |
+| `adaptive_min_concurrency` | Lower bound for concurrency |
+| `adaptive_max_concurrency` | Upper bound for concurrency |
+| `adaptive_target_latency_s` | Target p95 latency budget |
+| `adaptive_increase_step` | Additive growth step when healthy |
+| `adaptive_decrease_factor` | Multiplicative decrease on pressure |
+| `adaptive_failure_threshold` | Failure-rate threshold for backoff |
+| `retry_backoff_s` | Retry sleep interval (separate from request timeout) |
+| `prefix_aware_queue` | Group dispatch by prefix hash |
+| `prefix_hash_chars` | Prefix length used for hashing |
+
+Example (sync API backend):
+```bash
+python -m lmms_eval \
+  --model openai \
+  --model_args model_version=<model>,num_concurrent=16,adaptive_concurrency=true,adaptive_min_concurrency=1,adaptive_max_concurrency=64,adaptive_target_latency_s=15.0,adaptive_increase_step=0.15,adaptive_decrease_factor=0.75,adaptive_failure_threshold=0.05,retry_backoff_s=1.0,prefix_aware_queue=true,prefix_hash_chars=256
+```
+
+Example (async API backend, recommended):
+```bash
+python -m lmms_eval \
+  --model async_openai \
+  --model_args model_version=<model>,num_cpus=16,adaptive_concurrency=true,adaptive_min_concurrency=1,adaptive_max_concurrency=64,adaptive_target_latency_s=15.0,adaptive_increase_step=0.15,adaptive_decrease_factor=0.75,adaptive_failure_threshold=0.05,retry_backoff_s=1.0,prefix_aware_queue=true,prefix_hash_chars=256
+```
+
+#### Performance Snapshot: Latest API Path
+
+To make the performance claim auditable, we keep a concrete benchmark trail in this repo:
+- Historical comparison file: `logs/openrouter_molmo_throughput/throughput_comparison.csv`
+- Latest-vs-previous comparison file: `logs/openrouter_molmo_throughput/throughput_comparison_latest_vs_prev.csv`
+
+Benchmark setup used for this snapshot:
+- Task: `mme`
+- Limit: `100`
+- Model backend: `openai` / `async_openai` API path
+- API endpoint family: OpenRouter-compatible
+- Model: `bytedance-seed/seed-1.6-flash`
+- Baseline control: static single concurrency (`num_concurrent=1`)
+- Latest control: adaptive + refill scheduling + prefix-aware queueing + explicit retry backoff
+
+Result summary (`requests_per_sec`):
+
+| Run Type | Concurrency | RPS | Wall Time (s) | Relative to Baseline |
+|----------|-------------|-----|---------------|----------------------|
+| baseline | 1 | 0.327836 | 305.030740 | 1.00x |
+| static | 24 | 1.926987 | 51.894473 | 5.88x |
+| adaptive (v1) | 16 | 2.404706 | 41.585121 | 7.33x |
+| adaptive (v2) | 16 | 2.458435 | 40.676279 | 7.50x |
+
+Interpretation:
+- The latest API control path reaches about **7.5x throughput** over baseline on the same `LIMIT=100` setup.
+- Compared to the previous adaptive run (`v1`), the latest adaptive run (`v2`) still improves (`2.4047 -> 2.4584 req/s`, `+2.23%`). This is a small but measurable delta in a noisy environment (shared network + provider-side scheduling), so the right takeaway is not "a new ceiling", but "less overhead and better utilization under the same constraints."
+- The core point: this speedup is not from changing benchmark difficulty. We keep the same task (`mme`), model (`bytedance-seed/seed-1.6-flash`), limit (`100`), and evaluation prompts/settings. The gain comes from changes in the API request scheduling/control path.
+- What `adaptive (v2)` means in practice:
+  - Refill scheduling (no window barrier): maintain a steady pool of in-flight requests and immediately dispatch new work as soon as a request completes. This reduces idle gaps and prevents the slowest request in a window from gating progress.
+  - Rolling controller updates: adjust concurrency based on a rolling batch of completions (failure rate, rate-limit hits, and p95 latency vs target) rather than only after fixed windows. This makes the controller more responsive and less sensitive to outliers.
+  - Hysteresis for stability: use separate "reduce" vs "increase" conditions (and minimum sample thresholds) to avoid oscillating on a single transient 429 or a brief latency spike.
+  - Retry/backoff decoupling: `retry_backoff_s` is explicitly separate from request timeout, so retries don't sleep for long timeouts and tie up worker slots.
+  - Prefix-aware queueing (when enabled): reorder dispatch by prefix hash so same-prefix requests are sent close together, improving prefill-cache hit opportunities on providers that support prefix caching. (Some routing layers may dilute this benefit; the mechanism is still safe.)
+
 ```bash
 python -m lmms_eval --model vllm --model_args pretrained=Qwen/Qwen2.5-VL-7B
 python -m lmms_eval --model sglang --model_args pretrained=Qwen/Qwen2.5-VL-7B
 ```
 
+#### Model Registry V2
+
+v0.6 introduces `ModelRegistryV2` — a unified model registry that replaces the previous ad-hoc import system. All model names (`--model X`) resolve through a single path.
+
+**How it works**
+
+Two dicts in `lmms_eval/models/__init__.py` declare available models:
+
+- `AVAILABLE_SIMPLE_MODELS`: maps `model_id` -> `ClassName` for simple (legacy) models in `models/simple/`
+- `AVAILABLE_CHAT_TEMPLATE_MODELS`: maps `model_id` -> `ClassName` for chat models in `models/chat/`
+
+At startup, the registry merges both dicts into `ModelManifest` objects. Each manifest holds a `model_id` and up to two class paths (simple + chat). Class paths are auto-constructed: `lmms_eval.models.{type}.{model_id}.{ClassName}`, so the dict key **must match the filename**.
+
+**Resolution**: chat is always preferred over simple (unless `force_simple=True`). This means `--model openai` transparently resolves to the chat implementation.
+
+**Aliasing**: backward-compatible names are supported via `MODEL_ALIASES` in `__init__.py` and via `ModelManifest.aliases`. Old names like `openai_compatible`, `openai_compatible_chat`, `async_openai_compatible`, and `async_openai_compatible_chat` continue to work.
+
+**Simple mode deprecation**: the simple model interface (`doc_to_visual` + `doc_to_text`) for API models is deprecated. New integrations should always use chat (`doc_to_messages` + `ChatMessages`). The simple implementations in `models/simple/openai.py` will be removed in a future release.
+
 ### 1.2 Data Layer
 
 #### Storage Format
@@ -317,20 +428,20 @@ Same accuracy, but Model A is 3× more stable.
 
 ---
 
-## 3. Frontier Multimodal Evaluation (TODO)
+## 3. Evaluating Multimodal Models in 2026 (TODO)
 
-> **Note**: This section outlines planned frontier evaluation features. Implementation is in progress.
+> **Note**: This section outlines planned evaluation features. Implementation is in progress.
 
-### 3.1 Why Frontier Scenarios Matter
+### 3.1 More features are expected
 
-Static image QA benchmarks are saturating. Building frontier multimodal systems requires setting more challenging tasks and evaluating them in more realistic scenarios:
+Static image QA benchmarks are saturating. Building multimodal systems requires setting more challenging tasks and evaluating them in more realistic scenarios:
 
 | Capability | Challenge | Current Gap |
 |------------|-----------|-------------|
 | Long video understanding | 10min+ videos, 1000+ frames | Most benchmarks use <128 frames |
-| High temporal resolution | Event detection at 30fps | Sparse sampling loses fine-grained actions |
+| High motion | Object movement at 30fps | Sparse sampling loses fine-grained actions |
 | Spatial reasoning | 3D world understanding | 2D perception ≠ physical grounding |
-| Agentic interaction (streaming input) | Multi-step task execution | Static QA can't measure planning/tool use |
+| Agentic interaction | Multi-step task execution and feedback | Static QA can't measure planning/tool use well |
 
 **Key insight**: These capabilities require **in-environment evaluation**, in which the model must interact with simulators, receive feedback, and adapt. Static input-output pairs cannot capture this.
 
diff --git a/examples/models/async_openai_compatible.sh b/examples/models/async_openai_compatible.sh
@@ -1,9 +1,9 @@
 #!/bin/bash
 
 # ============================================================================
-# Async OpenAI Compatible Model Example
+# Async OpenAI Model Example
 # ============================================================================
-# This script demonstrates how to use the async_openai_compatible model with:
+# This script demonstrates how to use the async_openai model with:
 # 1. Basic video/image evaluation (without MCP tools)
 # 2. Tool-enabled evaluation (with MCP client)
 #
@@ -61,7 +61,7 @@ accelerate launch --num_processes=1 --main_process_port 12345 -m lmms_eval \
 # 7. The result is sent back to the model for continuation
 # 8. Steps 3-7 repeat in a loop until the model produces final text output
 #
-# Tool Calling Loop in Code (from async_openai.py):
+# Tool Calling Loop in Code (from async_openai_compatible.py):
 # ──────────────────────────────────────────────────
 # while response.choices[0].finish_reason == "tool_calls":
 #     for tool_call in response.choices[0].message.tool_calls:
@@ -103,4 +103,4 @@ accelerate launch --num_processes=1 --main_process_port 12345 -m lmms_eval \
 # nframes                : Fixed number of frames for video (default: 64)
 # is_qwen3_vl            : Enable Qwen3-VL specific formatting (default: False)
 # mcp_server_path        : Path to MCP server script for tool calling (optional)
-# work_dir               : Working directory for MCP tools (default: /tmp/...)
+# work_dir               : Working directory for MCP tools (default: /tmp/...)
diff --git a/lmms_eval/models/__init__.py b/lmms_eval/models/__init__.py
@@ -119,7 +119,7 @@
 
 MODEL_ALIASES: dict[str, tuple[str, ...]] = {
     "openai": ("openai_compatible", "openai_compatible_chat"),
-    "async_openai": ("async_openai_compatible_chat",),
+    "async_openai": ("async_openai_compatible_chat", "async_openai_compatible"),
 }
 
 
diff --git a/lmms_eval/models/chat/async_openai.py b/lmms_eval/models/chat/async_openai.py
@@ -20,7 +20,9 @@
 from lmms_eval.models.model_utils.concurrency_control import (
     AdaptiveConcurrencyConfig,
     decide_next_concurrency,
+    extract_text_prefix_from_chat_messages,
     is_rate_limit_error,
+    make_prefix_hash,
     parse_bool,
 )
 from lmms_eval.protocol import ChatMessages
@@ -60,6 +62,8 @@ def __init__(
         adaptive_increase_step: float = 0.1,
         adaptive_decrease_factor: float = 0.7,
         adaptive_failure_threshold: float = 0.05,
+        prefix_aware_queue: bool = True,
+        prefix_hash_chars: int = 256,
         **kwargs,
     ) -> None:
         super().__init__()
@@ -91,6 +95,8 @@ def __init__(
             decrease_factor=adaptive_decrease_factor,
             failure_threshold=adaptive_failure_threshold,
         )
+        self.prefix_aware_queue = parse_bool(prefix_aware_queue)
+        self.prefix_hash_chars = max(32, int(prefix_hash_chars))
         if mcp_server_path is not None:
             from lmms_eval.mcp import MCPClient
 
@@ -272,6 +278,18 @@ async def run():
                 if self.adaptive_concurrency
                 else max(1, self.num_cpus)
             )
+            dispatch_order = list(range(len(requests)))
+            if self.prefix_aware_queue:
+                prefix_hashes = {}
+                for idx in dispatch_order:
+                    request_obj = requests[idx]
+                    prefix_text = request_obj.args[0] if isinstance(request_obj.args[0], str) else ""
+                    if not prefix_text:
+                        _, doc_to_messages, _, doc_id, task, split = request_obj.args
+                        chat_messages_raw = doc_to_messages(self.task_dict[task][split][doc_id])
+                        prefix_text = extract_text_prefix_from_chat_messages(chat_messages_raw, self.prefix_hash_chars)
+                    prefix_hashes[idx] = make_prefix_hash(prefix_text, self.prefix_hash_chars)
+                dispatch_order.sort(key=lambda idx: (prefix_hashes[idx], idx))
             cursor = 0
 
             async def _process(req, idx):
@@ -343,10 +361,11 @@ def maybe_update_concurrency(force: bool = False) -> None:
                 request_latencies = []
                 completed_since_adapt = 0
 
-            while cursor < len(requests) or in_flight:
-                while cursor < len(requests) and len(in_flight) < max(1, current_concurrency):
-                    task = asyncio.create_task(_process(requests[cursor], cursor))
-                    in_flight[task] = cursor
+            while cursor < len(dispatch_order) or in_flight:
+                while cursor < len(dispatch_order) and len(in_flight) < max(1, current_concurrency):
+                    request_index = dispatch_order[cursor]
+                    task = asyncio.create_task(_process(requests[request_index], request_index))
+                    in_flight[task] = request_index
                     cursor += 1
 
                 if not in_flight:
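
The loop in the hunk above follows a refill pattern: keep a bounded pool of in-flight tasks and top it up as soon as any task completes, rather than waiting for a whole window to drain. A minimal standalone sketch of that pattern is shown below. `run_with_refill` and `fake_request` are hypothetical names used only for illustration; the real method also handles retries, metrics, and adaptive resizing, none of which appear here.

```python
import asyncio
import random


async def run_with_refill(items, worker, concurrency):
    """Keep up to `concurrency` tasks in flight; refill a slot as soon as one finishes."""
    results = [None] * len(items)
    in_flight = {}  # task -> index of the item it is processing
    cursor = 0
    while cursor < len(items) or in_flight:
        # Top up the pool until it is full or the queue is drained.
        while cursor < len(items) and len(in_flight) < max(1, concurrency):
            task = asyncio.create_task(worker(items[cursor]))
            in_flight[task] = cursor
            cursor += 1
        # Wake on the first completion instead of waiting for the whole window.
        done, _ = await asyncio.wait(set(in_flight), return_when=asyncio.FIRST_COMPLETED)
        for task in done:
            # Store each result at its original index so output order is stable.
            results[in_flight.pop(task)] = task.result()
    return results


async def fake_request(x):
    # Hypothetical stand-in for an API call with variable latency.
    await asyncio.sleep(random.uniform(0.001, 0.01))
    return x * 2


if __name__ == "__main__":
    print(asyncio.run(run_with_refill(list(range(10)), fake_request, concurrency=3)))
```

Because results are written back by index, completion order does not affect the returned list, which is the same property the diff preserves by keying `in_flight` on `request_index`.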
diff --git a/lmms_eval/models/chat/openai.py b/lmms_eval/models/chat/openai.py
@@ -11,7 +11,9 @@
 from lmms_eval.imports import optional_import
 from lmms_eval.models.model_utils.concurrency_control import (
     decide_next_concurrency,
+    extract_text_prefix_from_chat_messages,
     is_rate_limit_error,
+    make_prefix_hash,
 )
 from lmms_eval.models.model_utils.gen_metrics import log_metrics
 from lmms_eval.models.simple.openai import OpenAICompatible as OpenAICompatibleSimple
@@ -45,6 +47,18 @@ def generate_until(self, requests) -> List[str]:
             self.num_concurrent,
             self.adaptive_config.max_concurrency,
         )
+        dispatch_order = list(range(len(reordered_requests)))
+        if self.prefix_aware_queue:
+            prefix_hashes = {}
+            for idx in dispatch_order:
+                req = reordered_requests[idx]
+                prefix_text = req.args[0] if isinstance(req.args[0], str) else ""
+                if not prefix_text:
+                    _, doc_to_messages, _, doc_id, task, split = req.args
+                    chat_messages_raw = doc_to_messages(self.task_dict[task][split][doc_id])
+                    prefix_text = extract_text_prefix_from_chat_messages(chat_messages_raw, self.prefix_hash_chars)
+                prefix_hashes[idx] = make_prefix_hash(prefix_text, self.prefix_hash_chars)
+            dispatch_order.sort(key=lambda idx: (prefix_hashes[idx], idx))
         cursor = 0
         failed_requests = 0
         rate_limited_requests = 0
@@ -165,18 +179,19 @@ def build_payload_for_index(global_index: int):
             return doc_uuid, None, payload
 
         with ThreadPoolExecutor(max_workers=max_workers) as executor:
-            while cursor < len(reordered_requests) or in_flight:
-                while cursor < len(reordered_requests) and len(in_flight) < max(1, current_concurrency):
-                    doc_uuid, cached_response, payload = build_payload_for_index(cursor)
-                    doc_uuids[cursor] = doc_uuid
+            while cursor < len(dispatch_order) or in_flight:
+                while cursor < len(dispatch_order) and len(in_flight) < max(1, current_concurrency):
+                    request_index = dispatch_order[cursor]
+                    doc_uuid, cached_response, payload = build_payload_for_index(request_index)
+                    doc_uuids[request_index] = doc_uuid
                     if cached_response is not None:
-                        responses[cursor] = cached_response
+                        responses[request_index] = cached_response
                         pbar.update(1)
                         cursor += 1
                         continue
 
-                    future = executor.submit(process_single_request, cursor, payload)
-                    in_flight[future] = cursor
+                    future = executor.submit(process_single_request, request_index, payload)
+                    in_flight[future] = request_index
                     cursor += 1
 
                 if not in_flight:
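
The dispatch reordering added in both backends amounts to a stable sort of request indices by a hash of each prompt's leading characters. The sketch below illustrates the idea in isolation. `prefix_key` and `prefix_aware_order` are hypothetical helpers, not the `make_prefix_hash` / `extract_text_prefix_from_chat_messages` functions imported in the diff, and the choice of SHA-256 is an assumption made for the example.

```python
import hashlib


def prefix_key(text: str, prefix_chars: int = 256) -> str:
    """Hash only the first `prefix_chars` characters of a prompt (hypothetical helper)."""
    return hashlib.sha256(text[:prefix_chars].encode("utf-8")).hexdigest()


def prefix_aware_order(prompts, prefix_chars=256):
    """Return dispatch indices grouped by prefix hash, keeping original order within a group."""
    keys = [prefix_key(p, prefix_chars) for p in prompts]
    # The index is the tie-breaker, so requests with the same prefix stay in input order.
    return sorted(range(len(prompts)), key=lambda i: (keys[i], i))


prompts = [
    "SYSTEM A: describe the image. Q1",
    "SYSTEM B: answer yes or no. Q1",
    "SYSTEM A: describe the image. Q2",
    "SYSTEM B: answer yes or no. Q2",
]
order = prefix_aware_order(prompts, prefix_chars=16)

# Dispatch in `order`, but store each response at its original index.
responses = [None] * len(prompts)
for idx in order:
    responses[idx] = f"response to {prompts[idx]!r}"
```

Only the dispatch sequence changes; writing results back by original index keeps the output aligned with the input requests, as the `responses[request_index]` assignments in the hunk do.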
diff --git a/lmms_eval/models/model_utils/concurrency_control.py b/lmms_eval/models/model_utils/concurrency_control.py
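
No hunks for `concurrency_control.py` appear in this view. For orientation, the v0.6 docs describe a controller that grows concurrency by a small step when healthy and shrinks it multiplicatively under pressure (failure rate, rate-limit hits, p95 latency against a target). The sketch below is one way such a rule could look. Every name in it is hypothetical and it is not the actual `decide_next_concurrency` or `AdaptiveConcurrencyConfig` API; in particular, treating the increase step as a fraction of the current level and reusing the failure threshold for rate-limit hits are assumptions.

```python
import math
from dataclasses import dataclass


@dataclass
class ControllerConfig:
    # Illustrative fields that mirror the documented knobs, not the real config class.
    min_concurrency: int = 1
    max_concurrency: int = 64
    target_latency_s: float = 15.0
    increase_step: float = 0.15
    decrease_factor: float = 0.75
    failure_threshold: float = 0.05


def p95(latencies):
    """Nearest-rank 95th percentile; returns 0.0 for an empty sample."""
    if not latencies:
        return 0.0
    ordered = sorted(latencies)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def next_concurrency(current, failures, rate_limited, latencies, total, cfg):
    """Back off multiplicatively under pressure, otherwise grow by a small step."""
    if total == 0:
        return current  # no completions observed, nothing to decide on
    under_pressure = (
        failures / total > cfg.failure_threshold
        # Assumption: rate-limit hits share the failure threshold.
        or rate_limited / total > cfg.failure_threshold
        or p95(latencies) > cfg.target_latency_s
    )
    if under_pressure:
        proposed = math.floor(current * cfg.decrease_factor)
    else:
        # Assumption: the step is a fraction of the current level, at least +1.
        proposed = current + max(1, math.ceil(current * cfg.increase_step))
    # Clamp to the configured bounds.
    return max(cfg.min_concurrency, min(cfg.max_concurrency, proposed))
```

With the default values above, a healthy batch at concurrency 16 proposes a modest increase, while a batch whose failure rate or p95 latency exceeds its limit drops to 12; both results are then clamped to the `[min_concurrency, max_concurrency]` range.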
diff --git a/lmms_eval/models/simple/openai.py b/lmms_eval/models/simple/openai.py

`125`	`125`