
deepep normal support shmem with asymmetric tensor #328

Open
zuje123 wants to merge 23 commits into sgl-project:shmem_ver from zuje123:g_shmem_dp

Conversation

@zuje123 (Collaborator) commented Jan 19, 2026

deepep normal support shmem with asymmetric tensor
Testing Performance:
[performance results screenshot]

@gemini-code-assist (Contributor)

Summary of Changes

Hello @zuje123, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly overhauls the shared memory (shmem) infrastructure within the deepep module to enhance support for asymmetric tensor operations. The core change involves externalizing the management of shared memory, allowing the Buffer class to operate with a pre-allocated shared memory region identified by a base address. This refactoring centralizes shared memory communication parameters and introduces a more robust synchronization mechanism, paving the way for improved performance and flexibility in distributed tensor processing.

Highlights

  • Shared Memory Management Refactor: The Buffer class no longer directly allocates or deallocates shared memory; instead, it receives a meta_addr (uint64_t) as a base address for shared memory access, externalizing its management.
  • Centralized Shared Memory Arguments: A new header file, shmem_comm_args.h, has been introduced to centralize common constants, enums, and metadata structures for shared memory communication across various kernels.
  • Asymmetric Tensor Support & Kernel Refactoring: The pull request enables robust support for asymmetric tensors in shared memory operations, with significant refactoring of shmem_moe_combine_normal.h, shmem_moe_dispatch_normal.h, and shmem_notify_dispatch.h kernels to utilize the new shared memory addressing scheme.
  • Enhanced Synchronization Mechanism: A more structured synchronization protocol has been implemented, replacing older shmem_barrier_all and magic-based methods with new ResetMetaState, PutShareAddr, GetShareAddr, SetSyncFlag, and WaitSyncFlag functions.
  • Python Integration and Testing: Python bindings for the Buffer class have been updated to accept the new meta_addr parameter, complemented by new Python utility and test scripts (shmem_utils.py, test_shmem_intranode_api.py) to validate the shared memory functionality.
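The handshake described in the highlights above can be sketched as follows. This is a minimal host-side Python simulation, not the actual AscendC implementation: the function names mirror the kernel functions listed above, while the MetaRegion class is a hypothetical stand-in for the per-rank meta region in device shared memory.

```python
# Host-side sketch of the synchronization protocol described above.
# ResetMetaState / PutShareAddr / GetShareAddr / SetSyncFlag / WaitSyncFlag
# mirror the kernel function names; MetaRegion is a simulated stand-in for
# the device shared-memory meta region.

class MetaRegion:
    """Simulated per-rank meta region: states, exchanged addresses, sync flags."""
    def __init__(self, num_ranks):
        self.state = [0] * num_ranks
        self.addrs = [None] * num_ranks
        self.flags = [0] * num_ranks

def reset_meta_state(meta, rank):
    meta.state[rank] = 0          # clear this rank's completion state

def put_share_addr(meta, rank, addr):
    meta.addrs[rank] = addr       # publish this rank's buffer base address

def get_share_addr(meta, rank):
    return meta.addrs[rank]       # read the address published by a peer rank

def set_sync_flag(meta, rank):
    meta.flags[rank] = 1          # signal that this rank finished its phase

def wait_sync_flag(meta, rank):
    # the real kernels spin on device memory; in this single-process
    # simulation the flag is already visible, so a check suffices
    assert meta.flags[rank] == 1

# Two ranks exchange addresses, then synchronize.
meta = MetaRegion(num_ranks=2)
for r in (0, 1):
    reset_meta_state(meta, r)
    put_share_addr(meta, r, 0x1000 + r * 0x100)
for r in (0, 1):
    set_sync_flag(meta, r)
for r in (0, 1):
    wait_sync_flag(meta, r)
peer_addr = get_share_addr(meta, 1)  # rank 0 reads rank 1's address
```

In the real kernels, WaitSyncFlag spins on device memory until the peer's flag becomes visible; the simulation collapses that wait into a single assertion.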


Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature | Command | Description
Code Review | /gemini review | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state.
Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help | /gemini help | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist (Bot) left a comment

Code Review

This pull request introduces support for shared memory with asymmetric tensors in deepep. The changes involve a significant refactoring of how shared memory is managed, moving the allocation and initialization outside of the Buffer class and passing the memory address directly. The communication and synchronization logic within the kernels has also been substantially updated.

My review focuses on improving code maintainability and removing leftover debugging artifacts. The key points are:

  • Code Duplication: There is a considerable amount of duplicated code for synchronization and address calculation across different kernel headers. I've suggested refactoring this into a shared utility.
  • Debugging Output: A debug print statement was found and should be removed.
  • Code Comments: Some comments are in Chinese and should be translated to English for better readability and consistency.

Overall, the changes look solid and the addition of tests is a great step towards ensuring correctness.

Comment on lines +55 to 69
__aicore__ inline GM_ADDR GetMetaAddrByRankId(const int32_t rankId, const int metaType)
{
    auto ptr = shmem_ptr(gva_gm, rankId);

    switch (metaType) {
        case STATE: // 存放通信结束的state (stores the state marking communication completion)
            return (GM_ADDR)(ptr);
        case ADDR: // 存放交换的共享地址 (stores the exchanged shared address)
            return (GM_ADDR)(ptr) + DISPATCH_STATUS_OFFSET;
        case FLAG: // 存放第一次清理state空间后的同步flag (stores the sync flag written after the first cleanup of the state space)
            return (GM_ADDR)(ptr) + META_FLAG_OFFSET;
        default:
            return (GM_ADDR)(ptr);
    }
}

Severity: high

There is significant code duplication across several kernel header files. Functions like GetMetaAddrByRankId, SplitCoreCal, ResetMetaState, PutShareAddr, GetShareAddr, SetSyncFlag, and WaitSyncFlag are nearly identical in shmem_moe_dispatch_normal.h, shmem_moe_combine_normal.h, and shmem_notify_dispatch.h.

To improve maintainability and reduce redundancy, please consider refactoring this common logic into a shared utility header file.

Additionally, several comments within these duplicated functions are in Chinese. It would be beneficial to translate them to English for consistency.

Comment thread csrc/deepep/deep_ep.cpp
-    EP_HOST_ASSERT(rank == internode::init(rank, num_ranks, local_mem_size, "tcp://127.0.0.1:11222"));
-    shmem_ptr = internode::alloc(num_of_int32, ele_size);
+    shmem_ptr = meta_addr;
+    std::cout << "[deepep] rank: " << rank << ", shmem_ptr: " << shmem_ptr << std::endl;

Severity: medium

This std::cout statement appears to be for debugging purposes. It's recommended to remove it before merging to keep the logs clean in a production environment.

constexpr uint64_t NOTIFY_META_SIZE = 60 * 1024UL;   // notify (60KB)
constexpr uint64_t DISPATCH_META_SIZE = 60 * 1024UL; // dispatch (60KB)

constexpr uint64_t META_FLAG_OFFSET = 180 * 1024UL * 1024UL; // 预设meta空间200MB,从180MB开始用来写清理后的同步flag (preset meta space is 200MB; the post-cleanup sync flag is written starting at 180MB)

Severity: medium

The comment on this line is in Chinese. For better maintainability and to make the code accessible to a wider audience, please consider translating it to English.

Suggested change
constexpr uint64_t META_FLAG_OFFSET = 180 * 1024UL * 1024UL; // 预设meta空间200MB,从180MB开始用来写清理后的同步flag
constexpr uint64_t META_FLAG_OFFSET = 180 * 1024UL * 1024UL; // Preset meta space is 200MB, the sync flag after cleanup is written starting from 180MB
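For orientation, the offsets quoted in this review can be sanity-checked against the 200MB meta space mentioned in the comment. A small sketch in Python (META_SPACE_SIZE is an assumption derived from the "200MB" figure; the other constants are copied from the snippets above):

```python
# Meta-region layout check. Constants are copied from the review snippets;
# META_SPACE_SIZE is an assumption taken from the "200MB" figure in the comment.
NOTIFY_META_SIZE = 60 * 1024            # notify (60KB)
DISPATCH_META_SIZE = 60 * 1024          # dispatch (60KB)
META_FLAG_OFFSET = 180 * 1024 * 1024    # sync flags start at 180MB
META_SPACE_SIZE = 200 * 1024 * 1024     # assumed total meta space (200MB)

# The notify/dispatch metadata must fit below the flag region.
assert NOTIFY_META_SIZE + DISPATCH_META_SIZE <= META_FLAG_OFFSET

# Space left for post-cleanup sync flags: 200MB - 180MB = 20MB.
flag_region_mb = (META_SPACE_SIZE - META_FLAG_OFFSET) // (1024 * 1024)
```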

Comment on lines +32 to +33
constexpr static int32_t UB_HEAD_OFFSET = 96;
constexpr static int32_t UB_MID_OFFSET = UB_HEAD_OFFSET + UB_SINGLE_PING_PONG_ADD_SIZE_MAX + UB_ALIGN;

Severity: medium

The static keyword is redundant for constexpr variables at namespace scope. You can remove static for cleaner code.

Suggested change
constexpr static int32_t UB_HEAD_OFFSET = 96;
constexpr static int32_t UB_MID_OFFSET = UB_HEAD_OFFSET + UB_SINGLE_PING_PONG_ADD_SIZE_MAX + UB_ALIGN;
constexpr int32_t UB_HEAD_OFFSET = 96;
constexpr int32_t UB_MID_OFFSET = UB_HEAD_OFFSET + UB_SINGLE_PING_PONG_ADD_SIZE_MAX + UB_ALIGN;

int32_t sumVal = 0; // 所有rank中接收token最大的 (the largest number of tokens received among all ranks)
for (uint32_t srcRankId = 0; srcRankId < epWorldSize_; srcRankId++) {
    ReorderRecvDataOutput(srcRankId, recvTokenLt, false); // localExpNum * ranks
    // 只需要计算当前rank接收的token数 (only the number of tokens received by the current rank needs to be computed)

Severity: medium

This comment is in Chinese. Please translate it to English to ensure consistency and improve code readability for all contributors.

// Only need to calculate the number of tokens received by the current rank
