
Commit 4c785b2

nv-alicheng, claude, and gemini-code-assist[bot] authored
Disable warmup temporarily, bugfixes. (#288)
* chore: address PR review comments — fix tmpfs cleanup, use metric enums, fix templates
  - Fix premature tmpfs deletion in _write_scoring_artifacts (cleanup now owned solely by run_benchmark's finally block)
  - Replace hardcoded metric key strings in _setup_kv_reader with MetricCounterKey/MetricSeriesKey enum iteration
  - Fix config templates: replace placeholder URLs with http://localhost:8000, remove nonexistent record_worker_events field, fix YAML indentation
  - Replace internal hostname in gptoss_test.yaml with localhost

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Disable warmup by default until rules are defined

* Remove stray config file

* Add TPOT_NS to _STREAMING_ONLY metrics set

  Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* chore: address second round of PR #282 review comments
  - docs(report_design): add "reports reproducible from event log" principle
  - refactor(metrics_table): rename subtract_field -> delta_start_fieldname
  - docs(metrics_table): reword ISL token_ids/text comments so SGLang/OpenAI are examples, not defining conditions
  - test(kv_store): pin empty SeriesStats min/max sentinels; add snapshot isolation regression test
  - test(aggregator): add explanatory messages to tracking-window asserts
  - test(report_builder): pin max/std_dev on empty compute_summary
  - test(sample_order): parametrize over [3, 100, 10_000] dataset sizes
  - test(zmq_pool_transport): collapse two pool transport classes into one parametrized over (num_workers, create_publisher)

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: clarify _check_tokenizer_exists is a probe, consumers are examples

  Addresses PR #282 thread T33: reframe the docstring so MetricsAggregator and Harmony read as examples rather than a closed list of consumers, and state explicitly that this function never loads or downloads the tokenizer.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
1 parent 3c07790 commit 4c785b2

17 files changed

Lines changed: 191 additions & 202 deletions

configs/gptoss_test.yaml

Lines changed: 0 additions & 66 deletions
This file was deleted.

docs/metrics/report_design.md

Lines changed: 13 additions & 0 deletions
```diff
@@ -38,6 +38,19 @@ this complexity is needed when the input is a `list[float]` from the KVStore.
 The entire rollup is a single function: `compute_summary(values) → dict`.
 It calls numpy for percentiles and histograms. No classes, no state.
 
+**Reports are reproducible from the event log.**
+
+The KVStore is lossy aggregation — it stores per-metric series, not per-sample
+provenance. The authoritative record of what happened during a run is the event
+log written by the `EventLoggerService`. Every number in a `Report` can be
+recomputed by replaying the event log through the same aggregator logic: if a
+production report shows a TTFT spike, the event log is the ground truth a user
+can mine to attribute the spike to specific samples or time windows.
+
+New metrics must preserve this property: the aggregator may only derive values
+from event fields, never from out-of-band state. If a metric cannot be rebuilt
+from the event log alone, it does not belong in the KVStore.
+
 ## Components
 
 ### `compute_summary(values, percentiles, n_histogram_buckets) → dict`
```
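The design doc describes `compute_summary` as a single stateless function that calls numpy for percentiles and histograms. A minimal sketch of such a rollup follows; the exact field names, defaults, and empty-series sentinels here are assumptions for illustration, not the project's actual implementation:

```python
import numpy as np

def compute_summary(values, percentiles=(50, 90, 99), n_histogram_buckets=10):
    """Single stateless rollup: a series of values in, a plain dict of stats out."""
    arr = np.asarray(list(values), dtype=float)
    if arr.size == 0:
        # Pin sentinels for the empty series so report code never sees None.
        return {"count": 0, "min": 0.0, "max": 0.0, "mean": 0.0, "std_dev": 0.0,
                "percentiles": {p: 0.0 for p in percentiles}, "histogram": None}
    counts, edges = np.histogram(arr, bins=n_histogram_buckets)
    return {
        "count": int(arr.size),
        "min": float(arr.min()),
        "max": float(arr.max()),
        "mean": float(arr.mean()),
        "std_dev": float(arr.std()),
        "percentiles": {p: float(np.percentile(arr, p)) for p in percentiles},
        "histogram": {"counts": counts.tolist(), "bin_edges": edges.tolist()},
    }
```

Pinning explicit sentinels on the empty input matches the commit's test work ("pin max/std_dev on empty compute_summary"), though the sentinel values chosen here are a guess.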

src/inference_endpoint/async_utils/services/metrics_aggregator/metrics_table.py

Lines changed: 17 additions & 12 deletions
```diff
@@ -150,18 +150,20 @@ def fire(
 
 
 class TimeDeltaTrigger(EmitTrigger):
-    """Sync trigger: emits ev_rec.timestamp_ns - pre_change[required_field].
+    """Sync trigger: emits ev_rec.timestamp_ns - pre_change[delta_start_fieldname].
 
-    Subclass only needs to set metric_name and the required field name.
-    Skips silently if the required field is None (event hasn't occurred yet).
+    The emitted metric is a time delta: the firing event marks the end of the
+    delta, and ``delta_start_fieldname`` names the SampleField whose timestamp
+    marks the start. Skips silently if the start field is None (the delta has
+    not yet opened for this sample).
     """
 
-    def __init__(self, metric_name: str, kv_store: KVStore, subtract_field: str):
-        super().__init__(metric_name, kv_store, requires=(subtract_field,))
-        self._subtract_field = subtract_field
+    def __init__(self, metric_name: str, kv_store: KVStore, delta_start_fieldname: str):
+        super().__init__(metric_name, kv_store, requires=(delta_start_fieldname,))
+        self._delta_start_fieldname = delta_start_fieldname
 
     def fire(self, ev_rec, row, pre_change):
-        baseline = pre_change.get(self._subtract_field)
+        baseline = pre_change.get(self._delta_start_fieldname)
         if baseline is not None:
             self.kv_store.update(self.metric_name, ev_rec.timestamp_ns - baseline)
         return None
@@ -235,7 +237,9 @@ class TtftTrigger(TimeDeltaTrigger):
 
     def __init__(self, kv_store: KVStore):
         super().__init__(
-            MetricSeriesKey.TTFT_NS, kv_store, subtract_field=SampleField.ISSUED_NS
+            MetricSeriesKey.TTFT_NS,
+            kv_store,
+            delta_start_fieldname=SampleField.ISSUED_NS,
         )
 
 
@@ -249,7 +253,7 @@ def __init__(self, kv_store: KVStore):
         super().__init__(
             MetricSeriesKey.CHUNK_DELTA_NS,
             kv_store,
-            subtract_field=SampleField.LAST_RECV_NS,
+            delta_start_fieldname=SampleField.LAST_RECV_NS,
         )
 
 
@@ -260,7 +264,7 @@ def __init__(self, kv_store: KVStore):
         super().__init__(
             MetricSeriesKey.SAMPLE_LATENCY_NS,
             kv_store,
-            subtract_field=SampleField.ISSUED_NS,
+            delta_start_fieldname=SampleField.ISSUED_NS,
        )
 
 
@@ -281,11 +285,12 @@ def __init__(
         super().__init__(MetricSeriesKey.ISL, kv_store, tokenize_pool, loop)
 
     def fire(self, ev_rec, row, pre_change):
-        # Sync fast path: pre-tokenized IDs (SGLang)
+        # Sync fast path: any backend that pre-populates token_ids (e.g. SGLang).
         if isinstance(ev_rec.data, PromptData) and ev_rec.data.token_ids is not None:
             self.kv_store.update(self.metric_name, len(ev_rec.data.token_ids))
             return None
-        # Async path: tokenize raw text (OpenAI) — handled by base class
+        # Async path: tokenize raw text — used when token_ids are unavailable
+        # (e.g. OpenAI-compatible endpoints). Handled by the base class.
         return super().fire(ev_rec, row, pre_change)
 
     def _extract_text(self, ev_rec, row, pre_change):
```
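The renamed `delta_start_fieldname` contract can be shown with a stripped-down, standalone sketch. The trigger logic mirrors the diff; the `KVStore` and event record here are toy stand-ins (the real `EmitTrigger` base class and `SampleField` enum are omitted):

```python
from dataclasses import dataclass, field

@dataclass
class KVStore:
    """Toy stand-in: collects updated values per metric name."""
    series: dict = field(default_factory=dict)

    def update(self, metric_name, value):
        self.series.setdefault(metric_name, []).append(value)

@dataclass
class EvRec:
    """Toy event record: only the timestamp matters to this trigger."""
    timestamp_ns: int

class TimeDeltaTrigger:
    """Emits ev_rec.timestamp_ns - pre_change[delta_start_fieldname].

    The firing event marks the end of the delta; delta_start_fieldname names
    the field whose timestamp marks the start. Skips silently if the start
    field is None (the delta has not yet opened for this sample).
    """

    def __init__(self, metric_name, kv_store, delta_start_fieldname):
        self.metric_name = metric_name
        self.kv_store = kv_store
        self._delta_start_fieldname = delta_start_fieldname

    def fire(self, ev_rec, row, pre_change):
        baseline = pre_change.get(self._delta_start_fieldname)
        if baseline is not None:
            self.kv_store.update(self.metric_name, ev_rec.timestamp_ns - baseline)
        return None

kv = KVStore()
trigger = TimeDeltaTrigger("ttft_ns", kv, "issued_ns")
trigger.fire(EvRec(timestamp_ns=5_000), row=None, pre_change={"issued_ns": 2_000})  # records 3000
trigger.fire(EvRec(timestamp_ns=6_000), row=None, pre_change={"issued_ns": None})   # skipped
```

The rename makes the direction of the subtraction explicit at every subclass call site, which is exactly what the `TtftTrigger`/`ChunkDeltaTrigger`/`SampleLatencyTrigger` changes above read like.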

src/inference_endpoint/commands/benchmark/execute.py

Lines changed: 26 additions & 24 deletions
```diff
@@ -48,9 +48,15 @@
     ServiceConfig,
     ServiceLauncher,
 )
+from inference_endpoint.async_utils.services.metrics_aggregator.aggregator import (
+    MetricCounterKey,
+)
 from inference_endpoint.async_utils.services.metrics_aggregator.kv_store import (
     BasicKVStoreReader,
 )
+from inference_endpoint.async_utils.services.metrics_aggregator.metrics_table import (
+    MetricSeriesKey,
+)
 from inference_endpoint.async_utils.transport.zmq.context import ManagedZMQContext
 from inference_endpoint.config.runtime_settings import RuntimeSettings
 from inference_endpoint.config.schema import (
@@ -176,8 +182,11 @@ def _check_tokenizer_exists(model_name: str) -> bool:
     """Check if a HuggingFace tokenizer exists for the model (API only, no download).
 
     Returns True if the model repo exists and has tokenizer files, False otherwise.
-    The actual tokenizer is loaded later by the MetricsAggregator subprocess and
-    by Harmony transforms (each loads their own instance as needed).
+    This function is a probe — it never loads or downloads the tokenizer itself.
+    Downstream consumers that need tokenization (e.g. the MetricsAggregator
+    subprocess for ISL/OSL/TPOT, Harmony transforms for prompt preprocessing,
+    and any future plugin with its own tokenization need) each load their own
+    instance as required.
     """
     try:
         info = model_info(model_name)
```
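The probe contract (check repo metadata for tokenizer files, never download anything) can be sketched by factoring the filename check out as a pure function. `model_info` and the `siblings`/`rfilename` fields are the real `huggingface_hub` API; the helper names and the set of tokenizer filenames below are illustrative assumptions:

```python
TOKENIZER_FILES = {
    # Common HF tokenizer artifacts; an illustrative set, not an exhaustive one.
    "tokenizer.json",
    "tokenizer_config.json",
    "tokenizer.model",
    "vocab.json",
}

def has_tokenizer_files(repo_filenames):
    """Pure check behind the probe: does the repo listing contain a tokenizer file?"""
    return any(name in TOKENIZER_FILES for name in repo_filenames)

def check_tokenizer_exists(model_name):
    """Metadata-only probe (sketch): lists repo files via the HF API, downloads nothing."""
    try:
        from huggingface_hub import model_info  # network call: repo metadata only
        info = model_info(model_name)
        return has_tokenizer_files(s.rfilename for s in info.siblings)
    except Exception:
        # Missing repo, gated model, or network failure all read as "no tokenizer".
        return False
```

Keeping the filename check separate from the network call makes the probe testable without hitting the Hub, which fits the docstring's promise that the function itself never loads a tokenizer.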
```diff
@@ -374,24 +383,19 @@ def _setup_kv_reader(
 ) -> BasicKVStoreReader:
     """Create a KVStoreReader pre-registered with all metric keys."""
     reader = BasicKVStoreReader(metrics_dir)
-    # Counter keys (from MetricCounterKey enum)
-    for key in [
-        "total_samples_issued",
-        "total_samples_completed",
-        "total_samples_failed",
-        "tracked_samples_issued",
-        "tracked_samples_completed",
-        "tracked_duration_ns",
-        "total_duration_ns",
-    ]:
-        reader.register_key(key, "counter")
-    # Series keys (from MetricSeriesKey enum)
-    for key in ["isl", "osl", "sample_latency_ns"]:
-        reader.register_key(key, "series")
-    reader.register_key("tpot_ns", "series", dtype=float)
-    if streaming:
-        for key in ["ttft_ns", "chunk_delta_ns"]:
-            reader.register_key(key, "series")
+    for counter_key in MetricCounterKey:
+        reader.register_key(counter_key.value, "counter")
+    _STREAMING_ONLY = {
+        MetricSeriesKey.TTFT_NS,
+        MetricSeriesKey.CHUNK_DELTA_NS,
+        MetricSeriesKey.TPOT_NS,
+    }
+    _FLOAT_SERIES = {MetricSeriesKey.TPOT_NS}
+    for series_key in MetricSeriesKey:
+        if series_key in _STREAMING_ONLY and not streaming:
+            continue
+        dtype = float if series_key in _FLOAT_SERIES else int
+        reader.register_key(series_key.value, "series", dtype=dtype)
     return reader
 
 
```
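The enum-iteration refactor above replaces hardcoded key strings with a single source of truth. A self-contained sketch of the same pattern, with a toy enum and a stand-in reader (the real `MetricSeriesKey` and `BasicKVStoreReader` live in the modules imported by the diff):

```python
from enum import Enum

class MetricSeriesKey(str, Enum):
    """Toy subset standing in for the real enum; member names follow the diff."""
    ISL = "isl"
    OSL = "osl"
    SAMPLE_LATENCY_NS = "sample_latency_ns"
    TTFT_NS = "ttft_ns"
    CHUNK_DELTA_NS = "chunk_delta_ns"
    TPOT_NS = "tpot_ns"

class FakeReader:
    """Stand-in for BasicKVStoreReader: just records (kind, dtype) per key."""
    def __init__(self):
        self.registered = {}

    def register_key(self, key, kind, dtype=int):
        self.registered[key] = (kind, dtype)

# Mirrors the diff: streaming-only and float-valued series are named sets, so a
# new enum member is registered everywhere automatically instead of being
# re-listed in string form at each call site.
_STREAMING_ONLY = {MetricSeriesKey.TTFT_NS, MetricSeriesKey.CHUNK_DELTA_NS,
                   MetricSeriesKey.TPOT_NS}
_FLOAT_SERIES = {MetricSeriesKey.TPOT_NS}

def register_series_keys(reader, streaming):
    for series_key in MetricSeriesKey:
        if series_key in _STREAMING_ONLY and not streaming:
            continue  # TTFT/chunk-delta/TPOT are undefined for non-streaming runs
        dtype = float if series_key in _FLOAT_SERIES else int
        reader.register_key(series_key.value, "series", dtype=dtype)
    return reader
```

Adding `TPOT_NS` to `_STREAMING_ONLY` (one of the commit's bullet points) is then a one-line change that both the registration loop and any gating logic pick up together.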
```diff
@@ -591,12 +595,10 @@ def _write_scoring_artifacts(
         f.write(msgspec.json.format(msgspec.json.encode(sample_idx_map), indent=2))
     logger.debug(f"Wrote {map_path}")
 
-    # Copy events.jsonl from tmpfs to report_dir
+    # Copy events.jsonl from tmpfs to report_dir.
+    # Tmpfs cleanup is handled by run_benchmark()'s finally block.
     _salvage_tmpfs(ctx.report_dir, tmpfs_dir)
 
-    # Clean up tmpfs
-    shutil.rmtree(tmpfs_dir, ignore_errors=True)
-
 
 def _salvage_tmpfs(report_dir: Path, tmpfs_dir: Path) -> None:
     """Copy all salvageable artifacts from tmpfs to report_dir.
```
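The bug fixed here is double ownership: a helper deleted tmpfs that the caller's `finally` block also owned, so artifacts could vanish before salvage on the error path. A minimal sketch of the single-owner pattern, with simplified function bodies standing in for the real `run_benchmark`/`_write_scoring_artifacts`:

```python
import shutil
import tempfile
from pathlib import Path

def _write_scoring_artifacts(report_dir: Path, tmpfs_dir: Path) -> None:
    """Copies artifacts out of tmpfs; deliberately does NOT delete tmpfs."""
    report_dir.mkdir(parents=True, exist_ok=True)
    for artifact in tmpfs_dir.glob("*.jsonl"):
        shutil.copy2(artifact, report_dir / artifact.name)

def run_benchmark(report_dir: Path) -> Path:
    """Sole owner of tmpfs teardown: cleanup lives in this finally block only."""
    tmpfs_dir = Path(tempfile.mkdtemp(prefix="bench_tmpfs_"))
    try:
        # Stand-in for the real run: the event logger writes into tmpfs.
        (tmpfs_dir / "events.jsonl").write_text('{"event": "sample_issued"}\n')
        _write_scoring_artifacts(report_dir, tmpfs_dir)
    finally:
        # Runs even if the benchmark raised, so tmpfs never leaks and is
        # never deleted out from under a helper mid-run.
        shutil.rmtree(tmpfs_dir, ignore_errors=True)
    return tmpfs_dir  # returned only so callers can observe that it is gone
```

With one owner, helpers stay idempotent copy operations and the lifetime of the scratch directory is readable from a single function.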

src/inference_endpoint/config/templates/concurrency_template.yaml

Lines changed: 1 addition & 1 deletion
```diff
@@ -18,4 +18,4 @@ settings:
   target_concurrency: 32  # Concurrent requests
 endpoint_config:
   endpoints:  # Endpoint URL(s)
-    - '<ENDPOINT_URL eg: http://localhost:8000>'
+    - 'http://localhost:8000'
```

src/inference_endpoint/config/templates/concurrency_template_full.yaml

Lines changed: 1 addition & 2 deletions
```diff
@@ -49,7 +49,6 @@ settings:
   target_concurrency: 32  # Concurrent requests
   client:
     num_workers: -1  # Worker processes (-1=auto)
-    record_worker_events: false  # Record per-worker events
     log_level: INFO  # Worker log level
     warmup_connections: -1  # Pre-establish TCP connections (-1=auto, 0=disabled)
     max_connections: -1  # Max TCP connections (-1=unlimited)
@@ -77,7 +76,7 @@ metrics:
   - tpot
 endpoint_config:
   endpoints:  # Endpoint URL(s)
-    - '<ENDPOINT_URL eg: http://localhost:8000>'
+    - 'http://localhost:8000'
   api_key: null  # API key
   api_type: openai  # API type: openai or sglang | options: openai, sglang
 report_dir: null  # Report output directory
```

src/inference_endpoint/config/templates/offline_template.yaml

Lines changed: 1 addition & 1 deletion
```diff
@@ -15,4 +15,4 @@ settings:
   n_samples_to_issue: null  # Sample count override
 endpoint_config:
   endpoints:  # Endpoint URL(s)
-    - '<ENDPOINT_URL eg: http://localhost:8000>'
+    - 'http://localhost:8000'
```

src/inference_endpoint/config/templates/offline_template_full.yaml

Lines changed: 1 addition & 2 deletions
```diff
@@ -49,7 +49,6 @@ settings:
   target_concurrency: null  # Concurrent requests
   client:
     num_workers: -1  # Worker processes (-1=auto)
-    record_worker_events: false  # Record per-worker events
     log_level: INFO  # Worker log level
     warmup_connections: -1  # Pre-establish TCP connections (-1=auto, 0=disabled)
     max_connections: -1  # Max TCP connections (-1=unlimited)
@@ -77,7 +76,7 @@ metrics:
   - tpot
 endpoint_config:
   endpoints:  # Endpoint URL(s)
-    - '<ENDPOINT_URL eg: http://localhost:8000>'
+    - 'http://localhost:8000'
   api_key: null  # API key
   api_type: openai  # API type: openai or sglang | options: openai, sglang
 report_dir: null  # Report output directory
```

src/inference_endpoint/config/templates/online_template.yaml

Lines changed: 1 addition & 1 deletion
```diff
@@ -18,4 +18,4 @@ settings:
   target_qps: 10.0  # Target QPS
 endpoint_config:
   endpoints:  # Endpoint URL(s)
-    - '<ENDPOINT_URL eg: http://localhost:8000>'
+    - 'http://localhost:8000'
```

src/inference_endpoint/config/templates/online_template_full.yaml

Lines changed: 1 addition & 2 deletions
```diff
@@ -49,7 +49,6 @@ settings:
   target_concurrency: null  # Concurrent requests
   client:
     num_workers: -1  # Worker processes (-1=auto)
-    record_worker_events: false  # Record per-worker events
     log_level: INFO  # Worker log level
     warmup_connections: -1  # Pre-establish TCP connections (-1=auto, 0=disabled)
     max_connections: -1  # Max TCP connections (-1=unlimited)
@@ -77,7 +76,7 @@ metrics:
   - tpot
 endpoint_config:
   endpoints:  # Endpoint URL(s)
-    - '<ENDPOINT_URL eg: http://localhost:8000>'
+    - 'http://localhost:8000'
   api_key: null  # API key
   api_type: openai  # API type: openai or sglang | options: openai, sglang
 report_dir: null  # Report output directory
```

0 commit comments