NVIDIA
diff --git a/‎.claude/agents/ad-conf-check-update.md‎
Lines changed: 95 additions & 0 deletions b/‎.claude/agents/ad-conf-check-update.md‎
Lines changed: 95 additions & 0 deletions
diff --git a/‎.claude/agents/ad-debug-agent.md‎
Lines changed: 1 addition & 1 deletion b/‎.claude/agents/ad-debug-agent.md‎
Lines changed: 1 addition & 1 deletion
diff --git a/‎.claude/agents/ad-onboard-reviewer.md‎
Lines changed: 1 addition & 1 deletion b/‎.claude/agents/ad-onboard-reviewer.md‎
Lines changed: 1 addition & 1 deletion
diff --git a/‎.claude/agents/ad-run-agent.md‎
Lines changed: 1 addition & 1 deletion b/‎.claude/agents/ad-run-agent.md‎
Lines changed: 1 addition & 1 deletion
diff --git a/‎.claude/skills/ad-add-fusion-transformation/SKILL.md‎
Lines changed: 220 additions & 0 deletions b/‎.claude/skills/ad-add-fusion-transformation/SKILL.md‎
Lines changed: 220 additions & 0 deletions
@@ -0,0 +1,95 @@
+---
+name: ad-conf-check-update
+description: >
+  Updates the ad-conf-check skill's references/config_log_patterns.md
+  by comparing it against the latest TensorRT-LLM AutoDeploy source code.
+  Checks for new/removed/renamed configs in default.yaml and verifies that
+  log patterns still match the actual source code. Edits the reference doc
+  in-place if anything changed.
+tools: ["Read", "Write", "Edit", "Bash", "Grep", "Glob"]
+model: sonnet
+license: Apache-2.0
+metadata:
+  author: NVIDIA Corporation
+---
+
+You are a reference-doc updater for the `ad-conf-check` skill. Your job is to ensure that `references/config_log_patterns.md` is up-to-date with the latest TensorRT-LLM AutoDeploy source code.
+
+You will receive two paths:
+- `<trtllm_src>` — the TensorRT-LLM repo root
+- `<skill_dir>` — the ad-conf-check skill directory (contains `references/config_log_patterns.md`)
+
+## Procedure
+
+### Phase 1: Detect drift
+
+#### 1a. Config drift — compare `default.yaml` against the reference doc
+
+1. Read `<trtllm_src>/tensorrt_llm/_torch/auto_deploy/config/default.yaml` to get the authoritative list of all config keys and their defaults.
+2. Read `<skill_dir>/references/config_log_patterns.md` to get the currently documented config keys.
+3. Compare the two:
+   - **New configs**: keys in `default.yaml` not documented in `config_log_patterns.md`
+   - **Removed configs**: keys documented in `config_log_patterns.md` but no longer in `default.yaml`
+   - **Renamed configs**: keys that appear removed but have an obvious successor (e.g., `cuda_graph_batch_sizes` → `cuda_graph_config.batch_sizes`)
+   - **Changed defaults**: keys whose default value changed
+
+#### 1b. Log-pattern drift — verify patterns against source code
+
+1. For each config documented in `config_log_patterns.md`, grep the TRT-LLM source for the quoted log strings (success/failure indicators).
+   - Focus on: `<trtllm_src>/tensorrt_llm/_torch/auto_deploy/` (transforms, compilers, config loaders)
+   - Key directories to search:
+     - `<trtllm_src>/tensorrt_llm/_torch/auto_deploy/transforms/`
+     - `<trtllm_src>/tensorrt_llm/_torch/auto_deploy/compile/`
+     - `<trtllm_src>/tensorrt_llm/_torch/auto_deploy/sharding/`
+     - `<trtllm_src>/tensorrt_llm/_torch/auto_deploy/`
+2. Flag patterns that no longer appear in the source code (stale patterns).
+3. Find new log messages in the source that are not yet documented (new patterns).
+   - Search for `logger.info`, `logger.warning`, `logger.error`, and `print` calls in the auto_deploy directory.
+   - Focus on messages related to config application, transform results, and failure/fallback.
+
+### Phase 2: Update the reference doc
+
+If Phase 1 found any drift, edit `<skill_dir>/references/config_log_patterns.md` in-place:
+
+1. **Add new config sections** for configs found in `default.yaml` but missing from the doc.
+   - Place them in the appropriate section (Top-Level, kv_cache_config, or Transform Parameters).
+   - Include the verification source tags (`[log]`, `[graph]`, `[nsys]`) based on what log patterns exist.
+   - Document the log patterns found in the source code.
+
+2. **Remove or mark deprecated** configs that no longer exist in `default.yaml`.
+   - If a config was renamed, update the section header and add a deprecation note.
+   - If a config was fully removed, delete its section.
+
+3. **Update stale log patterns** where the source code has changed.
+   - Replace old quoted strings with the current ones from the source.
+   - Add newly discovered log patterns.
+
+4. **Preserve the existing structure** and formatting conventions:
+   - Section hierarchy: Top-Level → kv_cache_config → Transform Parameters → General Failure Patterns
+   - Each config has: header with verification tags, values/transform key, success/failure indicators
+   - Use the same markdown style as existing entries.
+
+### Phase 3: Report
+
+After updating (or confirming no changes needed), output a summary:
+
+```
+## Reference Doc Update Summary
+
+**Status**: UPDATED / NO CHANGES NEEDED
+**TRT-LLM source**: <trtllm_src path>
+**Reference doc**: <skill_dir>/references/config_log_patterns.md
+
+### Changes made:
+- Added configs: <list or "none">
+- Removed configs: <list or "none">
+- Updated patterns: <list or "none">
+- Renamed configs: <list or "none">
+```
+
+## Important rules
+
+- **Do NOT fabricate log patterns.** Every quoted string must come from the actual source code. If you cannot find a log pattern for a config, document it with "No explicit log" as existing entries do.
+- **Do NOT change the overall document structure** (section order, heading levels) unless adding/removing sections.
+- **Be conservative**: if you're unsure whether a pattern is still valid, keep it and add a note rather than removing it.
+- **Preserve existing verification source tags** (`[log]`, `[graph]`, `[nsys]`) and only modify them if evidence supports the change.
@@ -1,7 +1,7 @@
 ---
 name: ad-debug-agent
 description: Debug the AutoDeploy model onboarding process
-tools: Read, Grep, Glob, Bash, Edit, Write
+tools: ["Read", "Grep", "Glob", "Bash", "Edit", "Write"]
 model: sonnet
 license: Apache-2.0
 metadata:
 
@@ -1,7 +1,7 @@
 ---
 name: ad-onboard-reviewer
 description: Independent reviewer for AutoDeploy model onboarding. Validates created model and test files against all onboarding requirements. Use after completing model onboarding work.
-tools: Read, Grep, Glob
+tools: ["Read", "Grep", "Glob"]
 model: sonnet
 license: Apache-2.0
 metadata:
 
@@ -1,7 +1,7 @@
 ---
 name: ad-run-agent
 description: Run AutoDeploy build and run command for a given model
-tools: Read, Grep, Glob, Bash, Write, Edit
+tools: ["Read", "Grep", "Glob", "Bash", "Write", "Edit"]
 model: sonnet
 license: Apache-2.0
 metadata:
 
@@ -0,0 +1,220 @@
+---
+name: ad-add-fusion-transformation
+description: >
+  Claude Code skill (trtllm-agent-toolkit): implement or extend TensorRT-LLM AutoDeploy fusion
+  transforms under transform/library/ in a TensorRT-LLM checkout. Prefer existing kernels and custom
+  ops; use Triton only when no viable existing-kernel path exists. Use ad-graph-dump for
+  AD_DUMP_GRAPHS_DIR workflows. Covers TRT-LLM paths, registry, default.yaml registration, graph
+  validation, tests, and a review checklist — without prescribing profiling tools or throughput
+  targets.
+license: Apache-2.0
+tags:
+  - tensorrt-llm
+  - autodeploy
+  - fusion
+  - graph-transform
+  - optimization
+metadata:
+  author: NVIDIA Corporation
+---
+
+# Autodeploy: Add Fusion Transformation Pass
+
+## Where this skill applies
+
+This file lives in the **trtllm-agent-toolkit** plugin. Paths such as `tensorrt_llm/...`, `examples/auto_deploy/...`, and `tests/...` are relative to a **TensorRT-LLM source checkout** on the user’s machine — not the plugin tree.
+
+After installing the plugin (see the toolkit `README.md`), skills use the `trtllm-agent-toolkit:` prefix (for example `trtllm-agent-toolkit:ad-add-fusion-transformation`).
+
+## Related skills in this plugin
+
+| Skill | Use it for |
+|-------|------------|
+| **ad-graph-dump** | Enabling `AD_DUMP_GRAPHS_DIR`, dump file layout, and how to read SSA graph output. |
+| **trtllm-codebase-exploration** | Mapping existing transforms, custom ops, and search patterns before writing a pass. |
+| **trtllm-code-contribution** | TensorRT-LLM pre-commit, tests, DCO sign-off, and PR expectations. |
+| **triton-kernel-writing** | Implementing a **Triton** op only after existing-kernel lookup fails. |
+| **triton-tileir-optimization** | Tuning **existing** Triton kernels for the TileIR backend when that path applies. |
+| **cuda-kernel-writing** | Raw CUDA extension work if the viable path is a PyTorch C++ extension (not Triton). |
+| **cute-kernel-writing** / **cudeepy-kernel-writing** | CuTe DSL/LIR or CuDeepy-generated kernels when that is the chosen integration path. |
+
+Use this skill when you already know **which subgraph or pattern** you are targeting (from graph dumps, logs, or code reading). For dump capture and file semantics, follow **ad-graph-dump** first.
+
+## When to use this skill
+
+- Adding, extending, or reviewing a fusion under AutoDeploy transforms in a TensorRT-LLM tree.
+
+### Workflow (concise)
+
+1. Confirm the pattern in **current** graph dumps (see **ad-graph-dump**).
+2. Search for an existing kernel or custom-op path before new Triton or CUDA.
+3. Implement the smallest change that proves correctness and matching; add tests.
+4. Re-run dumps and tests; if outputs drift, separate matching issues from metadata loss from numeric differences.
+
+## Finding fusion candidates (lightweight)
+
+Do this before writing a new pass so you work on real graph structure.
+
+### Inputs
+
+- Graph dump directory from a run with `AD_DUMP_GRAPHS_DIR` set (see **ad-graph-dump**).
+- Model id and active AutoDeploy config (registry YAML, `default.yaml` overlays).
+- TensorRT-LLM source tree for kernel and transform lookup.
+
+### Outputs
+
+- Ordered list of candidates with: graph evidence, existing-kernel lookup (`found` / `not_found`), recommendation (`use_existing_kernel`, `needs_triton_fallback`, `defer`), and trade-offs (complexity, correctness risk).
+
+### Discovery workflow
+
+1. Parse dumps for repeated unfused patterns (element-wise chains, norm chains, epilogues, attention-adjacent ops).
+2. Search the tree for equivalent transforms or custom ops; record file/symbol evidence.
+3. If nothing fits, mark Triton or other kernel work as a deliberate fallback.
+4. Prefer candidates with clear recurrence, existing support, and lower numerical risk.
+
+### Per-candidate template
+
+```text
+Candidate: <short-name>
+Affected graph pattern: <pattern>
+Existing kernel lookup: <found|not_found>
+Evidence: <path/symbol>
+Recommendation: <use_existing_kernel|needs_triton_fallback|defer>
+Strengths / weaknesses / risks:
+- ...
+```
+
+### Guardrails
+
+- Do not skip existing-kernel lookup.
+- Do not default to Triton when a viable existing op already exists.
+- If uncertain, `defer` and narrow the question with one more dump or test.
+
+---
+
+## Inputs (implementation)
+
+- Chosen candidate or concrete subgraph.
+- Active model and config files.
+- Fresh graph dumps when available.
+- Current baseline: match counts from logs, unit test status, any accuracy notes you already maintain.
+
+## Outputs (implementation)
+
+- Pass design or patch: registered transform, `default.yaml` entry, optional model-registry YAML.
+- Path decision: `existing_kernel_path` vs `triton_fallback_path` (or other kernel stack).
+- Validation notes: graph evidence, `[SUMMARY] matches=...` before/after from AutoDeploy logs, test results.
+
+## Implementation workflow
+
+1. Align the pass with **observed** graph structure from dumps — not assumed op names from docs alone.
+2. Search `transform/library/`, `custom_ops/`, `torch.ops.auto_deploy.*`, and related tests for reuse.
+3. Integrate an existing op when possible; otherwise delegate kernel work to the appropriate skill (**triton-kernel-writing**, **cuda-kernel-writing**, etc.).
+4. Keep one logical change per patch; extend tests in the same change.
+5. Re-read dumps after the change; if match counts collapse, suspect pattern availability or metadata propagation.
+
+## Where fusion passes live
+
+- Transforms: `tensorrt_llm/_torch/auto_deploy/transform/library/`
+- Registry / base behavior: `tensorrt_llm/_torch/auto_deploy/transform/interface.py`
+- Default transform list: `tensorrt_llm/_torch/auto_deploy/config/default.yaml`
+- Dump helper: `tensorrt_llm/_torch/auto_deploy/utils/graph_writer.py`
+- Graph utilities: `tensorrt_llm/_torch/auto_deploy/utils/node_utils.py`, `tensorrt_llm/_torch/auto_deploy/utils/_graph.py`
+- Custom ops: `tensorrt_llm/_torch/auto_deploy/custom_ops/`
+
+Tests (typical):
+
+- `tests/unittest/auto_deploy/singlegpu/transformations/library/`
+- `tests/integration/defs/accuracy/test_llm_api_autodeploy.py` (when behavior or numerics may change)
+
+## How to add a transform
+
+### Implement the pass
+
+Create or update a module under `transform/library/` and register the class:
+
+```python
+@TransformRegistry.register("my_transform_key")
+class MyTransform(BaseTransform):
+    @classmethod
+    def get_config_class(cls):
+        return MyTransformConfig
+```
+
+Use a dedicated config class only when the pass needs parameters beyond the base transform config.
+
+### Register in `default.yaml`
+
+Add a key under `transforms:` in `tensorrt_llm/_torch/auto_deploy/config/default.yaml`. **Copy the field set from the closest existing transform** in the same section of the file (required keys depend on the transform config class and on how peers are declared). New experimental passes should stay **`enabled: false`** until covered by tests and dumps.
+
+### Enable for a specific model
+
+For targeted rollout, adjust registry YAMLs under `examples/auto_deploy/model_registry/configs/` rather than turning on unproven passes globally.
+
+## Implementation rules
+
+- Prefer existing AutoDeploy / TRT-LLM ops and `torch.ops.auto_deploy` entries.
+- Prefer stable, backend-neutral graph contracts; avoid hiding real dataflow in `node.meta` when an edge should carry it.
+- Use metadata for observable tensor facts (shape, dtype) and preserve it across rewrites when replacements should remain traceable.
+- **One hypothesis per patch** — do not mix unrelated fusions.
+
+## Existing kernel first, Triton second
+
+Before Triton:
+
+1. Search `transform/library/` and `custom_ops/`.
+2. Search `torch.ops.auto_deploy.*` and TRT-LLM custom op definitions.
+3. Read tests for similar integrations.
+
+Use **triton-kernel-writing** only when no suitable op exists and you accept owning kernel + integration work.
+
+## Validation order
+
+1. Graph dumps — pattern present, rewrite visible (see **ad-graph-dump**).
+2. Unit tests for the transform.
+3. Integration or accuracy checks when numerics or end-to-end behavior may change.
+
+## Match counts
+
+AutoDeploy logs `[SUMMARY] matches=<n>` (or `skipped` / `disabled`) per transform. Compare before and after your change; a large drop usually indicates pattern or metadata issues, not “slow runs.”
+
+## Testing expectations
+
+Follow **trtllm-code-contribution** for repo conventions. Cover:
+
+- Happy-path micrograph or exported-graph rewrites.
+- Failure modes that must **not** fuse (multiple consumers, mixed consumers).
+- Metadata preservation when an upstream pass feeds your pattern.
+
+Primary unittest location for library transforms:
+
+- `tests/unittest/auto_deploy/singlegpu/transformations/library/`
+
+## Review checklist
+
+- Target structure appears in current dumps.
+- Transform registered and listed in `default.yaml` consistently with peer entries.
+- Model-registry toggles are intentional.
+- Non-zero `matches` where expected, or `skipped` is explained.
+- Before/after dump snippets or diffs saved for the review thread.
+- Tests cover both success and intentional non-match cases.
+- If outputs change, classify match loss vs metadata loss vs acceptable numeric drift.
+
+## Guardrails
+
+- Do not bundle unrelated passes in one change.
+- If dumps contradict expectations, document what you observed before chasing unrelated hypotheses.
+
+## Iteration note (template)
+
+```text
+Candidate: <name>
+Path: <existing_kernel_path|triton_fallback_path|other>
+Rationale:
+- ...
+Graph validation: <pass|fail — what files / ops>
+Summary logs: <matches before / after>
+Tests: <what ran>
+Open risks:
+- ...
+```