Type of Change
Refactoring
Problem Statement
Feature request
Add programmatic guardrails so user-supplied engine arguments (YAML, --extra-engine-args, --override-engine-args) cannot be silently clobbered by Dynamo defaults when building the TRT-LLM arg_map in components/src/dynamo/trtllm/workers/llm_worker.py.
Describe the problem you're encountering
The Dynamo→TRT-LLM config bridge in init_llm_worker is leaky. User YAML is parsed into arg_map, then arg_map is mutated in several places (if config.publish_events_and_metrics: block, dict↔KvCacheConfig conversion, override_engine_args deep-update) before being handed to LLM(...). Nothing verifies that user-supplied values actually survive.
This same six-line block has regressed three times in the past few releases:
- 84c7d1e234 (Sep 2025): the dict→KvCacheConfig refactor dropped the if not event_buffer_max_size guard, causing user-supplied buffer sizes to be silently overwritten with the 1024 default.
- PR #5198 (Jan 2026): partially fixed. It preserved most KvCacheConfig settings via model_dump, but did not restore the conditional on event_buffer_max_size itself.
- PR #9284: fixes event_buffer_max_size properly, but in the same change drops arg_map["return_perf_metrics"] = config.publish_events_and_metrics. After that PR, vanilla --publish-events-and-metrics deployments (without OTEL launch scripts that explicitly inject return_perf_metrics: true) silently lose TRT-LLM's PerfMetricsManager instrumentation: GPU forward/sample timing, step_metrics, ctx_chunk_metrics, time_breakdown analysis, and OTEL forward/sample spans. None of this is caught by existing tests.
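The guard that these changes dropped or restored amounts to "apply a Dynamo default only when the user has not set the key". A minimal sketch of that pattern follows. It models kv_cache_config as a plain dict rather than the real KvCacheConfig object, and the names apply_publish_defaults and DEFAULT_EVENT_BUFFER_MAX_SIZE are hypothetical; the actual code in llm_worker.py may be structured differently. Using setdefault for return_perf_metrics is also an assumption here, since the original line was an unconditional assignment.

```python
import copy

# The 1024 default mentioned in the regression history above.
DEFAULT_EVENT_BUFFER_MAX_SIZE = 1024


def apply_publish_defaults(arg_map: dict, publish_events_and_metrics: bool) -> dict:
    """Return a copy of arg_map with defaults filled in, never overwriting user values.

    Hypothetical helper for illustration only; kv_cache_config is a plain dict here.
    """
    result = copy.deepcopy(arg_map)
    if not publish_events_and_metrics:
        return result

    kv_cache = result.setdefault("kv_cache_config", {})
    # Mirror the `if not event_buffer_max_size` guard: only fill in the default
    # when the user left the value unset (note that 0 also counts as unset).
    if not kv_cache.get("event_buffer_max_size"):
        kv_cache["event_buffer_max_size"] = DEFAULT_EVENT_BUFFER_MAX_SIZE

    # Forward the publish flag to the engine unless the user already chose a value.
    result.setdefault("return_perf_metrics", True)
    return result
```

With this shape, a YAML value such as event_buffer_max_size: 4096 passes through unchanged, while an empty config still receives both defaults.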
The recurring pattern: Dynamo's internal massaging of arg_map clobbers or fails to forward user intent, and the failure mode is silent (no warning, no exception, no log), so it ships and is only discovered when someone notices missing data in production.
Describe alternatives you've tried
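docs/backends/trtllm/trtllm-observability.md:212 documents the return_perf_metrics contract, but documentation does not enforce it.
A systemic guardrail in the worker init pipeline would prevent the next instance of this bug class.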
Proposed Solution
Concretely, two complementary mechanisms:
- Audit log of user-vs-default provenance. Either snapshot the post-YAML arg_map and diff it against the final arg_map before LLM(**arg_map), or track per-key provenance ({"event_buffer_max_size": "user_yaml", "backend": "dynamo_default", ...}). Emit a warning whenever a Dynamo-default code path overwrites a user-supplied key. The existing _warn_override_collisions helper at llm_worker.py:98 already implements this pattern in one direction (warning when override_engine_args clobbers a value); apply it across the whole pipeline.
- Regression test pinning. Feed YAML configs with non-default values for the historically fragile fields (kv_cache_config.event_buffer_max_size, kv_cache_config.free_gpu_memory_fraction, kv_cache_config.cache_transceiver_config, return_perf_metrics, backend, enable_iter_perf_stats) and assert they survive end-to-end into the final arg_map produced by the worker init.
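The snapshot-and-diff variant of the audit could look roughly like the sketch below. All function names (find_clobbered_keys, warn_on_clobbered, assert_user_args_survive) are hypothetical and do not exist in llm_worker.py. The sketch assumes both maps are plain nested dicts, so a KvCacheConfig object would first need converting (for example via model_dump). It also assumes the snapshot already includes every user source, so that --override-engine-args values are not reported as clobbering the YAML.

```python
import logging

logger = logging.getLogger(__name__)

# Marker reported when a user-supplied key is absent from the final map.
MISSING = "<missing>"


def _flatten(mapping: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into {"a.b": value} so keys can be compared directly."""
    flat = {}
    for key, value in mapping.items():
        path = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(_flatten(value, path))
        else:
            flat[path] = value
    return flat


def find_clobbered_keys(user_args: dict, final_args: dict) -> dict:
    """Return {path: (user_value, final_value)} for user keys that changed or vanished."""
    user_flat = _flatten(user_args)
    final_flat = _flatten(final_args)
    clobbered = {}
    for path, user_value in user_flat.items():
        if path not in final_flat:
            clobbered[path] = (user_value, MISSING)
        elif final_flat[path] != user_value:
            clobbered[path] = (user_value, final_flat[path])
    return clobbered


def warn_on_clobbered(user_args: dict, final_args: dict) -> dict:
    """Audit-log variant: emit one warning per overwritten user key."""
    clobbered = find_clobbered_keys(user_args, final_args)
    for path, (user_value, final_value) in sorted(clobbered.items()):
        logger.warning(
            "User-supplied %s=%r became %r during worker init",
            path, user_value, final_value,
        )
    return clobbered


def assert_user_args_survive(user_args: dict, final_args: dict) -> None:
    """Regression-test variant: fail loudly instead of logging."""
    clobbered = find_clobbered_keys(user_args, final_args)
    assert not clobbered, f"user-supplied engine args were clobbered: {clobbered}"
```

A pinning test would then build final_args by running the real worker init path on a YAML with non-default values for the fragile fields and pass both maps to assert_user_args_survive. Keys that Dynamo adds on its own are ignored, since only paths present in the user snapshot are checked.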
Estimated PR Size
M (51-200 lines)
Files/Components Affected
TRTLLM