| 1 | +--- |
| 2 | +name: ad-add-fusion-transformation |
| 3 | +description: > |
| 4 | + Claude Code skill (trtllm-agent-toolkit): implement or extend TensorRT-LLM AutoDeploy fusion |
| 5 | + transforms under transform/library/ in a TensorRT-LLM checkout. Prefer existing kernels and custom |
| 6 | + ops; use Triton only when no viable existing-kernel path exists. Use ad-graph-dump for |
| 7 | + AD_DUMP_GRAPHS_DIR workflows. Covers TRT-LLM paths, registry, default.yaml registration, graph |
| 8 | + validation, tests, and a review checklist — without prescribing profiling tools or throughput |
| 9 | + targets. |
| 10 | +license: Apache-2.0 |
| 11 | +tags: |
| 12 | + - tensorrt-llm |
| 13 | + - autodeploy |
| 14 | + - fusion |
| 15 | + - graph-transform |
| 16 | + - optimization |
| 17 | +metadata: |
| 18 | + author: NVIDIA Corporation |
| 19 | +--- |
| 20 | + |
| 21 | +# AutoDeploy: Add Fusion Transformation Pass |
| 22 | + |
| 23 | +## Where this skill applies |
| 24 | + |
| 25 | +This file lives in the **trtllm-agent-toolkit** plugin. Paths such as `tensorrt_llm/...`, `examples/auto_deploy/...`, and `tests/...` are relative to a **TensorRT-LLM source checkout** on the user’s machine — not the plugin tree. |
| 26 | + |
| 27 | +After installing the plugin (see the toolkit `README.md`), skills use the `trtllm-agent-toolkit:` prefix (for example `trtllm-agent-toolkit:ad-add-fusion-transformation`). |
| 28 | + |
| 29 | +## Related skills in this plugin |
| 30 | + |
| 31 | +| Skill | Use it for | |
| 32 | +|-------|------------| |
| 33 | +| **ad-graph-dump** | Enabling `AD_DUMP_GRAPHS_DIR`, dump file layout, and how to read SSA graph output. | |
| 34 | +| **trtllm-codebase-exploration** | Mapping existing transforms, custom ops, and search patterns before writing a pass. | |
| 35 | +| **trtllm-code-contribution** | TensorRT-LLM pre-commit, tests, DCO sign-off, and PR expectations. | |
| 36 | +| **triton-kernel-writing** | Implementing a **Triton** op only after existing-kernel lookup fails. | |
| 37 | +| **triton-tileir-optimization** | Tuning **existing** Triton kernels for the TileIR backend when that path applies. | |
| 38 | +| **cuda-kernel-writing** | Raw CUDA extension work if the viable path is a PyTorch C++ extension (not Triton). | |
| 39 | +| **cute-kernel-writing** / **cudeepy-kernel-writing** | CuTe DSL/LIR or CuDeepy-generated kernels when that is the chosen integration path. | |
| 40 | + |
| 41 | +Use this skill when you already know **which subgraph or pattern** you are targeting (from graph dumps, logs, or code reading). For dump capture and file semantics, follow **ad-graph-dump** first. |
| 42 | + |
| 43 | +## When to use this skill |
| 44 | + |
| 45 | +- Adding, extending, or reviewing a fusion under AutoDeploy transforms in a TensorRT-LLM tree. |
| 46 | + |
| 47 | +### Workflow (concise) |
| 48 | + |
| 49 | +1. Confirm the pattern in **current** graph dumps (see **ad-graph-dump**). |
| 50 | +2. Search for an existing kernel or custom-op path before new Triton or CUDA. |
| 51 | +3. Implement the smallest change that proves correctness and matching; add tests. |
| 52 | +4. Re-run dumps and tests; if outputs drift, classify the cause as a matching issue, metadata loss, or a numeric difference. |
| 53 | + |
| 54 | +## Finding fusion candidates (lightweight) |
| 55 | + |
| 56 | +Do this before writing a new pass so you work on real graph structure. |
| 57 | + |
| 58 | +### Inputs |
| 59 | + |
| 60 | +- Graph dump directory from a run with `AD_DUMP_GRAPHS_DIR` set (see **ad-graph-dump**). |
| 61 | +- Model id and active AutoDeploy config (registry YAML, `default.yaml` overlays). |
| 62 | +- TensorRT-LLM source tree for kernel and transform lookup. |
| 63 | + |
| 64 | +### Outputs |
| 65 | + |
| 66 | +- Ordered list of candidates with: graph evidence, existing-kernel lookup (`found` / `not_found`), recommendation (`use_existing_kernel`, `needs_triton_fallback`, `defer`), and trade-offs (complexity, correctness risk). |
| 67 | + |
| 68 | +### Discovery workflow |
| 69 | + |
| 70 | +1. Parse dumps for repeated unfused patterns (element-wise chains, norm chains, epilogues, attention-adjacent ops). |
| 71 | +2. Search the tree for equivalent transforms or custom ops; record file/symbol evidence. |
| 72 | +3. If nothing fits, mark Triton or other kernel work as a deliberate fallback. |
| 73 | +4. Prefer candidates with clear recurrence, existing support, and lower numerical risk. |
| 74 | + |
| 75 | +### Per-candidate template |
| 76 | + |
| 77 | +```text |
| 78 | +Candidate: <short-name> |
| 79 | +Affected graph pattern: <pattern> |
| 80 | +Existing kernel lookup: <found|not_found> |
| 81 | +Evidence: <path/symbol> |
| 82 | +Recommendation: <use_existing_kernel|needs_triton_fallback|defer> |
| 83 | +Strengths / weaknesses / risks: |
| 84 | +- ... |
| 85 | +``` |
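The per-candidate template above could also be kept as structured data so the ordered list in "Outputs" can be produced mechanically. The following is a minimal sketch, not part of TensorRT-LLM: `FusionCandidate`, the `occurrences` field, the sample candidates, and the ranking order (existing kernel first, then recurrence, then fewer risks) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


# Hypothetical record type mirroring the template fields; not a TensorRT-LLM class.
@dataclass
class FusionCandidate:
    name: str
    pattern: str
    kernel_lookup: str      # "found" or "not_found"
    evidence: str           # path/symbol, empty when nothing was found
    recommendation: str     # "use_existing_kernel", "needs_triton_fallback", "defer"
    occurrences: int = 0    # assumed metric: how often the pattern recurs in the dump
    risks: list = field(default_factory=list)


# Assumed priority: existing kernel first, Triton fallback second, defer last.
_PRIORITY = {"use_existing_kernel": 0, "needs_triton_fallback": 1, "defer": 2}


def rank(candidates):
    """Order by recommendation, then more occurrences, then fewer listed risks."""
    return sorted(
        candidates,
        key=lambda c: (_PRIORITY[c.recommendation], -c.occurrences, len(c.risks)),
    )


# Made-up candidates for illustration only.
candidates = [
    FusionCandidate("gelu_tail", "linear -> gelu -> mul", "not_found", "",
                    "needs_triton_fallback", occurrences=32, risks=["new kernel"]),
    FusionCandidate("norm_chain", "add -> rms_norm", "found", "<path/symbol>",
                    "use_existing_kernel", occurrences=64),
    FusionCandidate("odd_epilogue", "matmul -> clamp", "not_found", "",
                    "defer", occurrences=2, risks=["unclear numerics"]),
]
ranked_names = [c.name for c in rank(candidates)]
print(ranked_names)
```

The sort key is only one possible reading of "clear recurrence, existing support, and lower numerical risk"; adjust the weights to whatever your review process actually values.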
| 86 | + |
| 87 | +### Guardrails |
| 88 | + |
| 89 | +- Do not skip existing-kernel lookup. |
| 90 | +- Do not default to Triton when a viable existing op already exists. |
| 91 | +- If uncertain, `defer` and narrow the question with one more dump or test. |
| 92 | + |
| 93 | +--- |
| 94 | + |
| 95 | +## Inputs (implementation) |
| 96 | + |
| 97 | +- Chosen candidate or concrete subgraph. |
| 98 | +- Active model and config files. |
| 99 | +- Fresh graph dumps when available. |
| 100 | +- Current baseline: match counts from logs, unit test status, any accuracy notes you already maintain. |
| 101 | + |
| 102 | +## Outputs (implementation) |
| 103 | + |
| 104 | +- Pass design or patch: registered transform, `default.yaml` entry, optional model-registry YAML. |
| 105 | +- Path decision: `existing_kernel_path` vs `triton_fallback_path` (or other kernel stack). |
| 106 | +- Validation notes: graph evidence, `[SUMMARY] matches=...` before/after from AutoDeploy logs, test results. |
| 107 | + |
| 108 | +## Implementation workflow |
| 109 | + |
| 110 | +1. Align the pass with **observed** graph structure from dumps — not assumed op names from docs alone. |
| 111 | +2. Search `transform/library/`, `custom_ops/`, `torch.ops.auto_deploy.*`, and related tests for reuse. |
| 112 | +3. Integrate an existing op when possible; otherwise delegate kernel work to the appropriate skill (**triton-kernel-writing**, **cuda-kernel-writing**, etc.). |
| 113 | +4. Keep one logical change per patch; extend tests in the same change. |
| 114 | +5. Re-read dumps after the change; if match counts collapse, suspect pattern availability or metadata propagation. |
| 115 | + |
| 116 | +## Where fusion passes live |
| 117 | + |
| 118 | +- Transforms: `tensorrt_llm/_torch/auto_deploy/transform/library/` |
| 119 | +- Registry / base behavior: `tensorrt_llm/_torch/auto_deploy/transform/interface.py` |
| 120 | +- Default transform list: `tensorrt_llm/_torch/auto_deploy/config/default.yaml` |
| 121 | +- Dump helper: `tensorrt_llm/_torch/auto_deploy/utils/graph_writer.py` |
| 122 | +- Graph utilities: `tensorrt_llm/_torch/auto_deploy/utils/node_utils.py`, `tensorrt_llm/_torch/auto_deploy/utils/_graph.py` |
| 123 | +- Custom ops: `tensorrt_llm/_torch/auto_deploy/custom_ops/` |
| 124 | + |
| 125 | +Tests (typical): |
| 126 | + |
| 127 | +- `tests/unittest/auto_deploy/singlegpu/transformations/library/` |
| 128 | +- `tests/integration/defs/accuracy/test_llm_api_autodeploy.py` (when behavior or numerics may change) |
| 129 | + |
| 130 | +## How to add a transform |
| 131 | + |
| 132 | +### Implement the pass |
| 133 | + |
| 134 | +Create or update a module under `transform/library/` and register the class: |
| 135 | + |
| 136 | +```python |
| 137 | +@TransformRegistry.register("my_transform_key") |
| 138 | +class MyTransform(BaseTransform): |
| 139 | + @classmethod |
| 140 | + def get_config_class(cls): |
| 141 | + return MyTransformConfig |
| 142 | +``` |
| 143 | + |
| 144 | +Use a dedicated config class only when the pass needs parameters beyond the base transform config. |
| 145 | + |
| 146 | +### Register in `default.yaml` |
| 147 | + |
| 148 | +Add a key under `transforms:` in `tensorrt_llm/_torch/auto_deploy/config/default.yaml`. **Copy the field set from the closest existing transform** in the same section of the file (required keys depend on the transform config class and on how peers are declared). New experimental passes should stay **`enabled: false`** until covered by tests and dumps. |
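To see why the registered key and the `default.yaml` key have to agree, here is a toy model of a string-keyed registry driven by a config mapping. It is a sketch under stated assumptions: `ToyRegistry`, `build_enabled`, and the plain-dict config are stand-ins invented for this example, and the real `TransformRegistry` in `interface.py` may behave differently (for example in how it reports disabled entries).

```python
# Toy stand-ins, NOT the real TransformRegistry/BaseTransform from interface.py.
class ToyRegistry:
    _classes = {}

    @classmethod
    def register(cls, key):
        def decorator(transform_cls):
            if key in cls._classes:
                raise ValueError(f"duplicate transform key: {key}")
            cls._classes[key] = transform_cls
            return transform_cls
        return decorator

    @classmethod
    def get(cls, key):
        return cls._classes[key]


@ToyRegistry.register("my_transform_key")
class MyTransform:
    def __init__(self, config):
        self.config = config


# Plain-dict stand-in for the `transforms:` section; only `enabled` is taken
# from the text above, every other detail is assumed.
transforms_cfg = {
    "my_transform_key": {"enabled": False},
}


def build_enabled(cfg):
    """Instantiate registered transforms whose config entry has enabled=True."""
    built = []
    for key, entry in cfg.items():
        # Raises KeyError when a config key has no matching registration.
        transform_cls = ToyRegistry.get(key)
        if entry.get("enabled", False):
            built.append(transform_cls(entry))
    return built


disabled_result = build_enabled(transforms_cfg)
enabled_result = build_enabled({"my_transform_key": {"enabled": True}})
print(len(disabled_result), len(enabled_result))
```

In this toy, a typo in either the decorator argument or the config key surfaces as a `KeyError`, which is the kind of mismatch worth checking for when a new pass silently never runs.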
| 149 | + |
| 150 | +### Enable for a specific model |
| 151 | + |
| 152 | +For targeted rollout, adjust registry YAMLs under `examples/auto_deploy/model_registry/configs/` rather than turning on unproven passes globally. |
| 153 | + |
| 154 | +## Implementation rules |
| 155 | + |
| 156 | +- Prefer existing AutoDeploy / TRT-LLM ops and `torch.ops.auto_deploy` entries. |
| 157 | +- Prefer stable, backend-neutral graph contracts; avoid hiding real dataflow in `node.meta` when an edge should carry it. |
| 158 | +- Use metadata for observable tensor facts (shape, dtype) and preserve it across rewrites when replacements should remain traceable. |
| 159 | +- **One hypothesis per patch** — do not mix unrelated fusions. |
| 160 | + |
| 161 | +## Existing kernel first, Triton second |
| 162 | + |
| 163 | +Before Triton: |
| 164 | + |
| 165 | +1. Search `transform/library/` and `custom_ops/`. |
| 166 | +2. Search `torch.ops.auto_deploy.*` and TRT-LLM custom op definitions. |
| 167 | +3. Read tests for similar integrations. |
| 168 | + |
| 169 | +Use **triton-kernel-writing** only when no suitable op exists and you accept owning kernel + integration work. |
| 170 | + |
| 171 | +## Validation order |
| 172 | + |
| 173 | +1. Graph dumps — pattern present, rewrite visible (see **ad-graph-dump**). |
| 174 | +2. Unit tests for the transform. |
| 175 | +3. Integration or accuracy checks when numerics or end-to-end behavior may change. |
| 176 | + |
| 177 | +## Match counts |
| 178 | + |
| 179 | +AutoDeploy logs `[SUMMARY] matches=<n>` (or `skipped` / `disabled`) per transform. Compare the counts before and after your change; a large drop usually means the pattern no longer matches or metadata was lost upstream, not that the run got slower. |
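The before/after comparison can be scripted once both logs are saved to text. The sketch below assumes that each summary line names its transform as `transform=<key>` somewhere before the `[SUMMARY]` tag; that surrounding layout is an assumption, not something stated here, so adapt the regular expression to the lines your build actually prints. The log excerpts and the helper names `parse_summary` and `diff_matches` are made up for illustration.

```python
import re

# Assumption: each summary line contains `transform=<key>` before `[SUMMARY]`.
SUMMARY_RE = re.compile(
    r"transform=(?P<key>[\w.-]+).*?"
    r"\[SUMMARY\]\s+(?P<status>matches=(?P<n>\d+)|skipped|disabled)"
)


def parse_summary(log_text):
    """Map transform key -> match count (int) or the status string."""
    result = {}
    for line in log_text.splitlines():
        m = SUMMARY_RE.search(line)
        if m:
            result[m["key"]] = int(m["n"]) if m["n"] is not None else m["status"]
    return result


def diff_matches(before, after):
    """Return {key: (before, after)} for every transform whose summary changed."""
    keys = sorted(set(before) | set(after))
    return {
        k: (before.get(k), after.get(k))
        for k in keys
        if before.get(k) != after.get(k)
    }


# Made-up log excerpts for illustration only.
before_log = """\
[transform=fuse_a] [SUMMARY] matches=32
[transform=fuse_b] [SUMMARY] skipped
"""
after_log = """\
[transform=fuse_a] [SUMMARY] matches=0
[transform=fuse_b] [SUMMARY] skipped
[transform=my_transform_key] [SUMMARY] matches=32
"""
changes = diff_matches(parse_summary(before_log), parse_summary(after_log))
print(changes)
```

In this invented example the new pass reaches 32 matches while `fuse_a` falls to 0, which would suggest the new rewrite consumed or altered the structure that `fuse_a` relied on.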
| 180 | + |
| 181 | +## Testing expectations |
| 182 | + |
| 183 | +Follow **trtllm-code-contribution** for repo conventions. Cover: |
| 184 | + |
| 185 | +- Happy-path micrograph or exported-graph rewrites. |
| 186 | +- Failure modes that must **not** fuse (multiple consumers, mixed consumers). |
| 187 | +- Metadata preservation when an upstream pass feeds your pattern. |
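The "must not fuse" case for multiple consumers can be illustrated with a toy graph. This is a sketch only: `Node` here is a hand-rolled stand-in rather than `torch.fx.Node`, the `add -> norm` pattern is hypothetical, and it assumes the fused replacement would expose only the norm output, so any second consumer of the `add` result would lose its input.

```python
from dataclasses import dataclass, field


# Toy graph node, NOT torch.fx.Node; `users` plays the role of consumer edges.
# eq=False avoids recursive comparisons across the cyclic input/user links.
@dataclass(eq=False)
class Node:
    name: str
    op: str
    inputs: list = field(default_factory=list)
    users: list = field(default_factory=list)


def connect(producer, consumer):
    """Add a dataflow edge producer -> consumer."""
    consumer.inputs.append(producer)
    producer.users.append(consumer)


def can_fuse_add_into_norm(add_node):
    """Fuse only when the add feeds exactly one consumer and it is a norm."""
    return (
        add_node.op == "add"
        and len(add_node.users) == 1
        and add_node.users[0].op == "norm"
    )


# Case 1: add -> norm with a single consumer: eligible.
add1, norm1 = Node("add1", "add"), Node("norm1", "norm")
connect(add1, norm1)

# Case 2: add -> norm and add -> mul: the unfused add result is still needed.
add2, norm2, mul2 = Node("add2", "add"), Node("norm2", "norm"), Node("mul2", "mul")
connect(add2, norm2)
connect(add2, mul2)

single_ok = can_fuse_add_into_norm(add1)
multi_ok = can_fuse_add_into_norm(add2)
print(single_ok, multi_ok)
```

A unit test for a real pass would build the equivalent two small graphs, run the transform on each, and assert a non-zero match count for the first and zero for the second.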
| 188 | + |
| 189 | +Primary unittest location for library transforms: |
| 190 | + |
| 191 | +- `tests/unittest/auto_deploy/singlegpu/transformations/library/` |
| 192 | + |
| 193 | +## Review checklist |
| 194 | + |
| 195 | +- Target structure appears in current dumps. |
| 196 | +- Transform registered and listed in `default.yaml` consistently with peer entries. |
| 197 | +- Model-registry toggles are intentional. |
| 198 | +- Non-zero `matches` where expected, or `skipped` is explained. |
| 199 | +- Before/after dump snippets or diffs saved for the review thread. |
| 200 | +- Tests cover both success and intentional non-match cases. |
| 201 | +- If outputs change, classify match loss vs metadata loss vs acceptable numeric drift. |
| 202 | + |
| 203 | +## Guardrails |
| 204 | + |
| 205 | +- Do not bundle unrelated passes in one change. |
| 206 | +- If dumps contradict expectations, document what you observed before chasing unrelated hypotheses. |
| 207 | + |
| 208 | +## Iteration note (template) |
| 209 | + |
| 210 | +```text |
| 211 | +Candidate: <name> |
| 212 | +Path: <existing_kernel_path|triton_fallback_path|other> |
| 213 | +Rationale: |
| 214 | +- ... |
| 215 | +Graph validation: <pass|fail — what files / ops> |
| 216 | +Summary logs: <matches before / after> |
| 217 | +Tests: <what ran> |
| 218 | +Open risks: |
| 219 | +- ... |
| 220 | +``` |