[CONTRIBUTION]: [FEATURE]: Enforce user-config preservation in TRT-LLM worker arg_map

### Type of Change

Refactoring

### Problem Statement

  ## Feature request

  Add programmatic guardrails so user-supplied engine arguments (YAML, `--extra-engine-args`, `--override-engine-args`) cannot be silently clobbered by Dynamo defaults when building the TRT-LLM `arg_map` in `components/src/dynamo/trtllm/workers/llm_worker.py`.                                                                                                                                                                                                                                                                                                            
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
  ## Describe the problem you're encountering
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
  The Dynamo→TRT-LLM config bridge in `init_llm_worker` is leaky. User YAML is parsed into `arg_map`, then `arg_map` is mutated in several places (`if config.publish_events_and_metrics:` block, dict↔`KvCacheConfig` conversion, `override_engine_args` deep-update) before being handed to `LLM(...)`. Nothing verifies that user-supplied values actually survive.                                                                                                                                
   
  This same six-line block has regressed three times in the past few releases:                                                                                                                                                                                                                                                                                                                                                                                                                        
                                                                  
  1. `84c7d1e234` (Sep 2025): the dict→`KvCacheConfig` refactor dropped the `if not event_buffer_max_size` guard, causing user-supplied buffer sizes to be silently overwritten with the 1024 default.
  2. PR [#5198](https://github.com/ai-dynamo/dynamo/pull/5198) (Jan 2026): a partial fix. It preserved most `KvCacheConfig` settings via `model_dump`, but did not restore the conditional on `event_buffer_max_size` itself.
  3. PR [#9284](https://github.com/ai-dynamo/dynamo/pull/9284): fixes `event_buffer_max_size` properly, but in the same change drops `arg_map["return_perf_metrics"] = config.publish_events_and_metrics`. After that PR, vanilla `--publish-events-and-metrics` deployments (without OTEL launch scripts that explicitly inject `return_perf_metrics: true`) silently lose TRT-LLM's `PerfMetricsManager` instrumentation: GPU forward/sample timing, `step_metrics`, `ctx_chunk_metrics`, `time_breakdown` analysis, and OTEL forward/sample spans. None of this is caught by existing tests.
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
  The recurring pattern: Dynamo's internal massaging of `arg_map` clobbers or fails to forward user intent, and the failure mode is silent (no warning, no exception, no log), so it ships and is only discovered when someone notices missing data in production.                                                                                                                                                                                                                                    
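
To make the bug class concrete, here is a minimal sketch that reproduces the first regression with plain dictionaries. It is illustrative only: `apply_defaults_unguarded`, `apply_defaults_guarded`, and `DEFAULT_EVENT_BUFFER_MAX_SIZE` are hypothetical names, not code from `llm_worker.py`, and a plain dict stands in for `KvCacheConfig`.

```python
import copy

# Hypothetical constant standing in for the 1024 default described above.
DEFAULT_EVENT_BUFFER_MAX_SIZE = 1024


def apply_defaults_unguarded(arg_map: dict) -> dict:
    """Regressed shape: the default is written unconditionally."""
    out = copy.deepcopy(arg_map)
    kv = out.setdefault("kv_cache_config", {})
    kv["event_buffer_max_size"] = DEFAULT_EVENT_BUFFER_MAX_SIZE
    return out


def apply_defaults_guarded(arg_map: dict) -> dict:
    """Guarded shape: the default only fills in a missing or falsy value."""
    out = copy.deepcopy(arg_map)
    kv = out.setdefault("kv_cache_config", {})
    if not kv.get("event_buffer_max_size"):
        kv["event_buffer_max_size"] = DEFAULT_EVENT_BUFFER_MAX_SIZE
    return out


# A user config with a non-default buffer size.
user_arg_map = {"kv_cache_config": {"event_buffer_max_size": 4096}}

unguarded = apply_defaults_unguarded(user_arg_map)  # user value is lost
guarded = apply_defaults_guarded(user_arg_map)      # user value survives
```

Both functions return without raising or logging, which is why the unguarded variant can ship unnoticed.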
                                                                  
  ## Describe alternatives you've tried                                                                                                                                                                                                                                                                                                                                                                                                                                                               
                                                                  
  - Adding individual conditional guards per field (the approach used in #5198 and #9284) — fixes one field at a time; doesn't prevent the next regression on a different field.                                                                                                                                                                                                                                                                                                                      
  - Documentation (`docs/backends/trtllm/trtllm-observability.md:212` documents the `return_perf_metrics` contract) — does not enforce.
  - Code review — has not caught the recurring class of bug.                                                                                                                                                                                                                                                                                                                                                                                                                                          
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
  A systemic guardrail in the worker init pipeline would prevent the next instance of this bug class.                                                                                                                                                                                                                                                                                                                                                                                                 
                                                                                                            

### Proposed Solution

Concretely, two complementary mechanisms:                                                                                                                                                                                                                                                                                                                                                                                                                                                           
                                                                  
  1. **Audit log of user-vs-default provenance.** Either snapshot the post-YAML `arg_map` and diff against the final `arg_map` before `LLM(**arg_map)`, or track per-key provenance (`{"event_buffer_max_size": "user_yaml", "backend": "dynamo_default", ...}`). Emit a warning whenever a Dynamo-default code path overwrites a user-supplied key. The existing `_warn_override_collisions` helper at `llm_worker.py:98` already implements this pattern in one direction (warning when `override_engine_args` clobbers a value); apply it across the whole pipeline.
  2. **Regression test pinning.** Feed YAML configs with non-default values for the historically fragile fields (`kv_cache_config.event_buffer_max_size`, `kv_cache_config.free_gpu_memory_fraction`, `kv_cache_config.cache_transceiver_config`, `return_perf_metrics`, `backend`, `enable_iter_perf_stats`) and assert they survive end-to-end into the final `arg_map` produced by the worker init.
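
The snapshot-and-diff variant of mechanism 1 could look roughly like the sketch below. It is only a sketch under stated assumptions: `flatten`, `find_clobbered_user_keys`, and `MISSING` are hypothetical names, not existing helpers in `llm_worker.py`, and both sides of the comparison are assumed to be plain nested dicts (for example, config objects dumped to dicts before comparing). Keys changed on purpose by `--override-engine-args` would need to be excluded, or the snapshot taken after that step.

```python
import copy
import logging

logger = logging.getLogger(__name__)

# Sentinel marking a user-supplied key that is absent from the final arg_map.
MISSING = object()


def flatten(d: dict, prefix: str = "") -> dict:
    """Flatten nested dicts to dotted keys: {'a': {'b': 1}} -> {'a.b': 1}."""
    flat = {}
    for key, value in d.items():
        path = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def find_clobbered_user_keys(user_snapshot: dict, final_arg_map: dict) -> dict:
    """Map each changed or dropped user leaf to (user_value, final_value)."""
    final_flat = flatten(final_arg_map)
    clobbered = {}
    for path, user_value in flatten(user_snapshot).items():
        final_value = final_flat.get(path, MISSING)
        if final_value is MISSING or final_value != user_value:
            clobbered[path] = (user_value, final_value)
    return clobbered


# Snapshot taken right after parsing the user's YAML (values are examples).
user_snapshot = {
    "return_perf_metrics": True,
    "kv_cache_config": {
        "event_buffer_max_size": 4096,
        "free_gpu_memory_fraction": 0.8,
    },
}

# Simulate the later pipeline stages mutating arg_map.
final_arg_map = copy.deepcopy(user_snapshot)
final_arg_map["kv_cache_config"]["event_buffer_max_size"] = 1024  # clobbered
final_arg_map.pop("return_perf_metrics")                          # dropped
final_arg_map["backend"] = "pytorch"  # placeholder default; user never set it

clobbered = find_clobbered_user_keys(user_snapshot, final_arg_map)
for path, (user_value, final_value) in clobbered.items():
    shown = "<missing>" if final_value is MISSING else final_value
    logger.warning(
        "user-supplied %s=%r did not survive (final: %r)", path, user_value, shown
    )
```

The same helper could back mechanism 2: a regression test would build a config with non-default values for each fragile field, run it through the worker's `arg_map` construction, and assert that `find_clobbered_user_keys(...)` returns an empty dict. Keys that only Dynamo sets (such as `backend` in the simulation above) are not reported, since the check walks user-supplied keys only.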

### Estimated PR Size

M (51-200 lines)

### Files/Components Affected

TRTLLM
