deeepep normal support shmem with asymmetric tensor#328
deeepep normal support shmem with asymmetric tensor#328zuje123 wants to merge 23 commits intosgl-project:shmem_verfrom
Conversation
Summary of ChangesHello @zuje123, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly overhauls the shared memory (shmem) infrastructure within the Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces support for shared memory with asymmetric tensors in deeepep. The changes involve a significant refactoring of how shared memory is managed, moving the allocation and initialization outside of the Buffer class and passing the memory address directly. The communication and synchronization logic within the kernels has also been substantially updated.
My review focuses on improving code maintainability and removing leftover debugging artifacts. The key points are:
- Code Duplication: There is a considerable amount of duplicated code for synchronization and address calculation across different kernel headers. I've suggested refactoring this into a shared utility.
- Debugging Output: A debug print statement was found and should be removed.
- Code Comments: Some comments are in Chinese and should be translated to English for better readability and consistency.
Overall, the changes look solid and the addition of tests is a great step towards ensuring correctness.
| __aicore__ inline GM_ADDR GetMetaAddrByRankId(const int32_t rankId, const int metaType) | ||
| { | ||
| auto ptr = shmem_ptr(gva_gm, rankId); | ||
|
|
||
| return (GM_ADDR)(ptr) + WIN_MAGIC_OFFSET + NOTIFY_WIN_STATE_OFFSET + dataState * HALF_WIN_STATE_OFFSET; | ||
| switch (metaType) { | ||
| case STATE: // 存放通信结束的state | ||
| return (GM_ADDR)(ptr); | ||
| case ADDR: // 存放交换的共享地址 | ||
| return (GM_ADDR)(ptr) + DISPATCH_STATUS_OFFSET; | ||
| case FLAG: // 存放第一次清理state空间后的同步flag | ||
| return (GM_ADDR)(ptr) + META_FLAG_OFFSET; | ||
| default: | ||
| return (GM_ADDR)(ptr); | ||
| } | ||
| } |
There was a problem hiding this comment.
There is significant code duplication across several kernel header files. Functions like GetMetaAddrByRankId, SplitCoreCal, ResetMetaState, PutShareAddr, GetShareAddr, SetSyncFlag, and WaitSyncFlag are nearly identical in shmem_moe_dispatch_normal.h, shmem_moe_combine_normal.h, and shmem_notify_dispatch.h.
To improve maintainability and reduce redundancy, please consider refactoring this common logic into a shared utility header file.
Additionally, several comments within these duplicated functions are in Chinese. It would be beneficial to translate them to English for consistency.
| EP_HOST_ASSERT(rank == internode::init(rank, num_ranks, local_mem_size, "tcp://127.0.0.1:11222")); | ||
| shmem_ptr = internode::alloc(num_of_int32, ele_size); | ||
| shmem_ptr = meta_addr; | ||
| std::cout << "[deepep] rank: " << rank << ", shmem_ptr: " << shmem_ptr << std::endl; |
| constexpr uint64_t NOTIFY_META_SIZE = 60 * 1024UL; // notify(60KB) | ||
| constexpr uint64_t DISPATCH_META_SIZE = 60 * 1024UL; // dispatch(60KB) | ||
|
|
||
| constexpr uint64_t META_FLAG_OFFSET = 180 * 1024UL * 1024UL; // 预设meta空间200MB,从180MB开始用来写清理后的同步flag |
There was a problem hiding this comment.
The comment on this line is in Chinese. For better maintainability and to make the code accessible to a wider audience, please consider translating it to English.
| constexpr uint64_t META_FLAG_OFFSET = 180 * 1024UL * 1024UL; // 预设meta空间200MB,从180MB开始用来写清理后的同步flag | |
| constexpr uint64_t META_FLAG_OFFSET = 180 * 1024UL * 1024UL; // Preset meta space is 200MB, the sync flag after cleanup is written starting from 180MB |
| constexpr static int32_t UB_HEAD_OFFSET = 96; | ||
| constexpr static int32_t UB_MID_OFFSET = UB_HEAD_OFFSET + UB_SINGLE_PING_PONG_ADD_SIZE_MAX + UB_ALIGN; |
There was a problem hiding this comment.
The static keyword is redundant for constexpr variables at namespace scope. You can remove static for cleaner code.
| constexpr static int32_t UB_HEAD_OFFSET = 96; | |
| constexpr static int32_t UB_MID_OFFSET = UB_HEAD_OFFSET + UB_SINGLE_PING_PONG_ADD_SIZE_MAX + UB_ALIGN; | |
| constexpr int32_t UB_HEAD_OFFSET = 96; | |
| constexpr int32_t UB_MID_OFFSET = UB_HEAD_OFFSET + UB_SINGLE_PING_PONG_ADD_SIZE_MAX + UB_ALIGN; |
| int32_t sumVal = 0; // 所有rank中接收token最大的 | ||
| for (uint32_t srcRankId = 0; srcRankId < epWorldSize_; srcRankId++) { | ||
| ReorderRecvDataOutput(srcRankId, recvTokenLt, false); // localExpNum * ranks | ||
| // 只需要计算当前rank接收的token数 |
deeepep normal support shmem with asymmetric tensor

Testing Performance: