feat(amd): EP intra-node normal and low-latency kernels with mori shmem (#164)
* feat(amd): EP intra-node normal and low-latency kernels with mori shmem
- Implement EP intra-node dispatch/combine kernels using mori shmem P2P (putmem_signal_warp) on AMD MI325X
- Add Low Latency EP v1 (raw all-to-all) and v2 (online FP8 quant + combine with topk weighted reduce)
- Fix shfl_up/shfl_down_sync implementation and golden reference calculation in test_language_extra.py
- Fix mixed-bitwidth ld/st implementation and add kernel test coverage
- Update mori submodule to main with JIT bitcode compilation, replacing manual hipcc/llvm-link build
- Simplify `build_mori_shmem.sh` to use mori JIT (`mori.ir.bitcode.find_bitcode()`)
- Add AlgoBW and BusBW metrics to EP A2A benchmark output
- Add CI tests for EP A2A (correctness + perf) and LL v2 (correctness + perf at M=64/128)
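
For readers unfamiliar with the AlgoBW and BusBW metrics mentioned above, here is a minimal sketch of how they are commonly derived for an all-to-all. It assumes the nccl-tests convention (AlgoBW = bytes per rank / time, BusBW = AlgoBW * (n - 1) / n); the benchmark in this commit may define them differently, and the function name and parameters are hypothetical, not taken from the repository.

```python
from typing import Tuple


def a2a_bandwidth_gbps(bytes_per_rank: int, elapsed_s: float, world_size: int) -> Tuple[float, float]:
    """Return (AlgoBW, BusBW) in GB/s for one all-to-all call.

    Assumption: nccl-tests style definitions, not necessarily the ones
    used by the EP A2A benchmark in this repository.
    """
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    if world_size < 1:
        raise ValueError("world_size must be at least 1")

    # Algorithm bandwidth: payload handled by one rank divided by wall time.
    algo_bw = bytes_per_rank / elapsed_s / 1e9

    # Bus bandwidth: in an all-to-all, 1/n of each rank's data stays local,
    # so only (n - 1)/n of the payload actually crosses the interconnect.
    bus_bw = algo_bw * (world_size - 1) / world_size
    return algo_bw, bus_bw


if __name__ == "__main__":
    algo, bus = a2a_bandwidth_gbps(8_000_000_000, 1.0, 8)
    print(f"AlgoBW={algo:.2f} GB/s BusBW={bus:.2f} GB/s")
```

Under this convention BusBW is the figure to compare against the hardware link rate, since it discounts the share of data that never leaves the local device.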
---------
Co-authored-by: Wu, Yutong <yutong.wu@amd.com>
* fix(ci): use non-recursive submodule checkout to avoid pulling mori's nested submodules
---------
Co-authored-by: Wu, Yutong <yutong.wu@amd.com>