Commit 26aaeeb
Hoonkyung Cho
feat: Add TensorRT Edge-LLM AttentionPlugin backend support
Add a plugin backend as an alternative to the default SDPA lowering for
LLM inference, providing a ~1.5x-1.8x speedup over SDPA and ~8x-11x
over PyTorch eager execution.
Supported Models: Llama 3.x (3.1/3.2), Qwen 2.5, Qwen 3
Changes:
- examples/dynamo/attention_plugin_example.py: Standalone plugin demo
with correctness validation against PyTorch SDPA
- examples/dynamo/end_to_end_llm_generation_example.py: End-to-end LLM
generation example with plugin integration and benchmarks
- tools/llm/plugin_utils.py: Model-agnostic plugin utilities, including
op registration (tensorrt_edge_llm::xqa_attn), a TensorRT converter, the
PluginAttention module, LLMPluginWrapper, and compilation and generation
- tools/llm/run_llm.py: Add --backend plugin/sdpa selection with plugin
workflow integration
- tools/llm/README.md: Plugin backend documentation with build guide,
usage examples, and performance summary
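The standalone demo validates plugin output against PyTorch SDPA. As a rough illustration of what such a correctness check compares against, here is a minimal NumPy reference for scaled dot-product attention; the helper name and tensor shapes are illustrative, not taken from the repo:

```python
import numpy as np

def reference_sdpa(q, k, v):
    """Reference scaled dot-product attention (hypothetical helper).

    q, k, v: arrays of shape [..., seq, head_dim]; returns [..., seq_q, head_dim].
    """
    scale = 1.0 / np.sqrt(q.shape[-1])
    scores = (q @ np.swapaxes(k, -1, -2)) * scale
    # Numerically stable softmax over the key axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

A plugin validation would run the plugin and this reference on identical inputs and assert the outputs agree within a small tolerance.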
Plugin library built from TensorRT-Edge-LLM 0.4.0:
https://github.com/chohk88/TensorRT-Edge-LLM/tree/feature/torch-tensorrt-python-runtime

Parent: 95497ab
5 files changed, +2428 -59 lines changed