🚀 The feature, motivation and pitch
trtllm_cache_config_info (added in #12564) is only populated after the first inference request. The iteration stats pipeline is set up by _maybe_initialize_iteration_results(), which is only called on request submission, so get_stats_async() returns nothing until then.
External scrapers like the Kubernetes Inference Gateway EPP need cache_config_info (with block_size and num_gpu_blocks labels) at startup to make routing decisions. Without it, pods that haven't received traffic get lower scores, so they never get routed to.
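To make the scraper side concrete, here is a minimal sketch of how a client might read these labels from a metrics scrape. The metric and label names come from this issue; the exact exposition line, the sample values, and the helper name parse_cache_config are assumptions for illustration, not the EPP's actual code.

```python
import re
from typing import Optional

# Metric and label names are taken from the issue text. The exact line
# format emitted by trtllm-serve is an assumption (Prometheus text format).
_METRIC = "trtllm_cache_config_info"
_LINE_RE = re.compile(r"^" + re.escape(_METRIC) + r"\{(?P<labels>[^}]*)\}\s+\S+$")
_LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_cache_config(metrics_text: str) -> Optional[dict]:
    """Return block_size and num_gpu_blocks, or None if the metric is absent."""
    for line in metrics_text.splitlines():
        match = _LINE_RE.match(line.strip())
        if not match:
            continue  # comments, other metrics, blank lines
        labels = dict(_LABEL_RE.findall(match.group("labels")))
        if "block_size" in labels and "num_gpu_blocks" in labels:
            return {
                "block_size": int(labels["block_size"]),
                "num_gpu_blocks": int(labels["num_gpu_blocks"]),
            }
    return None
```

With today's behavior, a scrape taken before the first request would hit the None branch, which is the case the scraper cannot score properly.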
Proposed fix: emit a one-time stats snapshot from PyExecutor.__init__ (after the KV cache manager is initialized) containing max_num_blocks and tokens_per_block, and wake the stats collector loop on server startup so it processes the initial stats immediately. The data is already available via kv_cache_manager.get_kv_cache_stats() at init time; it just isn't surfaced through the stats pipeline until the first request.
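A rough sketch of the idea, using toy stand-ins rather than the real classes. ToyExecutor, FakeKvCacheManager, KvCacheStats, the queue, and the event are all hypothetical; only get_kv_cache_stats(), max_num_blocks, and tokens_per_block are names from this issue, and the real stats pipeline is likely structured differently.

```python
import queue
import threading
from dataclasses import dataclass


@dataclass
class KvCacheStats:
    # Hypothetical stand-in for the object returned by get_kv_cache_stats().
    max_num_blocks: int
    tokens_per_block: int


class FakeKvCacheManager:
    """Toy KV cache manager whose stats are known as soon as it is built."""

    def __init__(self, max_num_blocks: int, tokens_per_block: int):
        self._stats = KvCacheStats(max_num_blocks, tokens_per_block)

    def get_kv_cache_stats(self) -> KvCacheStats:
        return self._stats


class ToyExecutor:
    """Toy executor that publishes a stats snapshot at init, not on first request."""

    def __init__(self, kv_cache_manager: FakeKvCacheManager):
        self.kv_cache_manager = kv_cache_manager
        self.stats_queue: queue.Queue = queue.Queue()
        # A collector loop could wait on this event instead of on a request.
        self.stats_ready = threading.Event()
        self._emit_initial_stats()

    def _emit_initial_stats(self) -> None:
        stats = self.kv_cache_manager.get_kv_cache_stats()
        self.stats_queue.put({
            "max_num_blocks": stats.max_num_blocks,
            "tokens_per_block": stats.tokens_per_block,
        })
        self.stats_ready.set()  # wake the collector right away

    def get_stats(self) -> list:
        """Drain and return every snapshot queued so far."""
        drained = []
        while True:
            try:
                drained.append(self.stats_queue.get_nowait())
            except queue.Empty:
                return drained
```

The point of the sketch is only the ordering: the snapshot is queued and the collector is signalled during construction, so the first stats read succeeds before any request has been submitted.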
Alternatives
The alternative is to have external systems (e.g. the Inference Gateway EPP) send dummy warmup requests to each pod on discovery. This works but is fragile: it doesn't survive pod restarts, it adds negligible but unnecessary inference load, and it is somewhat hacky. Emitting the stats at init is cleaner since the data is already available.
Additional context
A comparison of trtllm-serve vs other model server metrics and the full gap analysis can be found in kubernetes-sigs/gateway-api-inference-extension#2596.