[CONTRIBUTION]: [FEATURE]: Enforce user-config preservation in TRT-LLM worker arg_map #9288

@indrajit96

Description

Type of Change

Refactoring

Problem Statement

Feature request

Add programmatic guardrails so user-supplied engine arguments (YAML, --extra-engine-args, --override-engine-args) cannot be silently clobbered by Dynamo defaults when building the TRT-LLM arg_map in components/src/dynamo/trtllm/workers/llm_worker.py.

Describe the problem you're encountering

The Dynamo→TRT-LLM config bridge in init_llm_worker does not reliably preserve user settings. User YAML is parsed into arg_map, then arg_map is mutated in several places (the if config.publish_events_and_metrics: block, the dict↔KvCacheConfig conversion, and the override_engine_args deep-update) before being handed to LLM(...). Nothing verifies that user-supplied values actually survive these mutations.

This same six-line block has regressed three times in the past few releases:

  1. 84c7d1e234 (Sep 2025): the dict→KvCacheConfig refactor dropped the if not event_buffer_max_size guard, so user-supplied buffer sizes were silently overwritten with the 1024 default.
  2. PR #5198 (Jan 2026): a partial fix. It preserved most KvCacheConfig settings via model_dump, but did not restore the conditional on event_buffer_max_size itself.
  3. PR #9284: fixes event_buffer_max_size properly, but in the same change drops arg_map["return_perf_metrics"] = config.publish_events_and_metrics. After that PR, vanilla --publish-events-and-metrics deployments (without OTEL launch scripts that explicitly inject return_perf_metrics: true) silently lose TRT-LLM's PerfMetricsManager instrumentation: GPU forward/sample timing, step_metrics, ctx_chunk_metrics, time_breakdown analysis, and OTEL forward/sample spans. None of this is caught by existing tests.
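
To make the regressed logic concrete, here is a minimal sketch of the guard pattern the three changes kept losing. It is not the code in llm_worker.py: the helper name apply_publish_defaults is hypothetical, and it assumes kv_cache_config is still a plain dict at this point (in the real worker it may already be a KvCacheConfig object). Only the 1024 default, the field names, and the "if not event_buffer_max_size" guard come from the description above.

```python
from copy import deepcopy

# Default buffer size named in this issue.
DEFAULT_EVENT_BUFFER_MAX_SIZE = 1024


def apply_publish_defaults(arg_map: dict, publish_events_and_metrics: bool) -> dict:
    """Hypothetical helper: add Dynamo defaults without touching user values.

    Assumes arg_map and arg_map["kv_cache_config"] are plain dicts.
    """
    result = deepcopy(arg_map)
    if not publish_events_and_metrics:
        return result

    # Tolerate a missing or null kv_cache_config section in the user YAML.
    kv_cache_config = dict(result.get("kv_cache_config") or {})

    # Apply the default only when the user left the field unset
    # (missing, None, or 0), mirroring the dropped guard.
    if not kv_cache_config.get("event_buffer_max_size"):
        kv_cache_config["event_buffer_max_size"] = DEFAULT_EVENT_BUFFER_MAX_SIZE
    result["kv_cache_config"] = kv_cache_config

    # Forward the flag to TRT-LLM unless the user already chose a value.
    result.setdefault("return_perf_metrics", True)
    return result
```

With this shape, a user YAML that sets kv_cache_config.event_buffer_max_size to 4096 keeps that value, while an empty config still gets the 1024 default and return_perf_metrics enabled. Using setdefault for return_perf_metrics is a design choice of this sketch; the original code assigned the flag unconditionally.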

The recurring pattern: Dynamo's internal massaging of arg_map clobbers or fails to forward user intent, and the failure mode is silent (no warning, no exception, no log), so it ships and is only discovered when someone notices missing data in production.

Describe alternatives you've tried

A systemic guardrail in the worker init pipeline would prevent the next instance of this bug class.

Proposed Solution

Concretely, two complementary mechanisms:

  1. Audit log of user-vs-default provenance. Either snapshot the post-YAML arg_map and diff it against the final arg_map before LLM(**arg_map), or track per-key provenance ({"event_buffer_max_size": "user_yaml", "backend": "dynamo_default", ...}). Emit a warning whenever a Dynamo-default code path overwrites a user-supplied key. The existing _warn_override_collisions helper at llm_worker.py:98 already implements this pattern in one direction (warning when override_engine_args clobbers a value). The proposal is to apply it across the whole pipeline.
  2. Regression test pinning. Feed YAML configs with non-default values for the historically-fragile fields (kv_cache_config.event_buffer_max_size, kv_cache_config.free_gpu_memory_fraction, kv_cache_config.cache_transceiver_config, return_perf_metrics, backend, enable_iter_perf_stats) and assert they survive end-to-end into the final arg_map produced by the worker init.
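
The snapshot-and-diff variant of the first mechanism could look roughly like the sketch below, and the same helper can back the regression tests in the second. All names here (flatten, find_clobbered_keys, warn_on_clobbered_keys) are hypothetical and not part of Dynamo or TRT-LLM. The sketch assumes both snapshots are nested plain dicts, so a KvCacheConfig object would need to be converted first (for example with model_dump, as mentioned above).

```python
from __future__ import annotations

import logging
from typing import Any, Iterator

logger = logging.getLogger(__name__)

# Marker shown when a user-supplied key is absent from the final arg_map.
MISSING = "<missing>"


def flatten(mapping: dict, prefix: str = "") -> Iterator[tuple[str, Any]]:
    """Yield (dotted_key, value) for every non-dict leaf of a nested dict."""
    for key, value in mapping.items():
        path = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            yield from flatten(value, path)
        else:
            yield path, value


def find_clobbered_keys(user_args: dict, final_args: dict) -> dict[str, tuple[Any, Any]]:
    """Map each user-supplied key that changed or vanished to (user, final)."""
    final_flat = dict(flatten(final_args))
    clobbered: dict[str, tuple[Any, Any]] = {}
    for path, user_value in flatten(user_args):
        if path not in final_flat:
            clobbered[path] = (user_value, MISSING)
        elif final_flat[path] != user_value:
            clobbered[path] = (user_value, final_flat[path])
    return clobbered


def warn_on_clobbered_keys(user_args: dict, final_args: dict) -> dict[str, tuple[Any, Any]]:
    """Log one warning per clobbered key and return the same mapping."""
    clobbered = find_clobbered_keys(user_args, final_args)
    for path, (user_value, final_value) in clobbered.items():
        logger.warning(
            "User-supplied engine arg %s=%r was replaced by %r",
            path, user_value, final_value,
        )
    return clobbered
```

In the worker, the user snapshot would be taken after all user inputs (YAML, --extra-engine-args, --override-engine-args) are merged, so that intentional overrides are not reported. In a regression test, the assertion would simply be that find_clobbered_keys(user_snapshot, final_arg_map) is empty for a config that sets non-default values for the fragile fields listed above.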

Estimated PR Size

M (51-200 lines)

Files/Components Affected

TRTLLM

Metadata

Labels

approved-for-pr: Issue approved by Dynamo team - ready for PR submission
backend::trtllm: Relates to the trtllm backend
contribution-request: External contributor proposing to implement a change
enhancement: New feature or request
good first issue: Good for newcomers
observability: Related to metrics, tracing, logging