
Commit 26aaeeb

Author: Hoonkyung Cho

feat: Add TensorRT Edge-LLM AttentionPlugin backend support
Add a plugin backend as an alternative to the default SDPA lowering for LLM inference, providing roughly 1.5x-1.8x speedup over SDPA and roughly 8x-11x over PyTorch eager execution.

Supported models: Llama 3.x (3.1/3.2), Qwen 2.5, Qwen 3

Changes:
- examples/dynamo/attention_plugin_example.py: standalone plugin demo with correctness validation against PyTorch SDPA
- examples/dynamo/end_to_end_llm_generation_example.py: end-to-end LLM generation example with plugin integration and benchmarks
- tools/llm/plugin_utils.py: model-agnostic plugin utilities, including op registration (tensorrt_edge_llm::xqa_attn), a TensorRT converter, the PluginAttention module, LLMPluginWrapper, compilation, and generation
- tools/llm/run_llm.py: add --backend plugin/sdpa selection with plugin workflow integration
- tools/llm/README.md: plugin backend documentation with a build guide, usage examples, and a performance summary

Plugin library built from TensorRT-Edge-LLM 0.4.0: https://github.com/chohk88/TensorRT-Edge-LLM/tree/feature/torch-tensorrt-python-runtime
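For context on the correctness validation mentioned above: the plugin's attention op is checked against PyTorch SDPA, i.e. softmax(QK^T / sqrt(d))V. The following is a minimal pure-Python reference sketch of that computation (single head, no masking or KV cache; not code from this commit):

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def sdpa(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V.

    Q, K, V are lists of row vectors (lists of floats); returns one
    output row per query row.
    """
    d = len(Q[0])  # head dimension used for the 1/sqrt(d) scaling
    out = []
    for q in Q:
        # Attention scores of this query against every key row.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        w = softmax(scores)
        # Weighted sum of value rows.
        out.append([sum(wi * vi[j] for wi, vi in zip(w, V))
                    for j in range(len(V[0]))])
    return out
```

A plugin backend is expected to match this reference numerically (up to floating-point tolerance) on the same inputs, which is what the standalone example's validation step exercises.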
1 parent 95497ab commit 26aaeeb
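The run_llm.py change adds a --backend flag selecting between the SDPA and plugin paths. A minimal argparse sketch of how such a flag could look (the flag name and choices come from the commit message; everything else is an assumption, not the actual run_llm.py code):

```python
import argparse

def build_parser():
    # Hypothetical parser fragment: only the backend-selection flag is shown.
    p = argparse.ArgumentParser(description="LLM runner (backend selection sketch)")
    p.add_argument(
        "--backend",
        choices=["sdpa", "plugin"],
        default="sdpa",
        help="Attention lowering: default SDPA or the TensorRT Edge-LLM "
             "AttentionPlugin backend",
    )
    return p

args = build_parser().parse_args(["--backend", "plugin"])
```

Restricting the flag with `choices` makes argparse reject unknown backends at parse time instead of failing later inside the compilation workflow.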

File tree

5 files changed: +2428 −59 lines changed

0 commit comments