
deepep normal support shmem with asymmetric tensor #328

Open
zuje123 wants to merge 23 commits into sgl-project:shmem_ver from zuje123:g_shmem_dp

Conversation

@zuje123 (Collaborator) commented Jan 19, 2026

deepep normal support shmem with asymmetric tensor
Testing Performance:
[performance results screenshot]

@gemini-code-assist (Contributor)

Summary of Changes

Hello @zuje123, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly overhauls the shared memory (shmem) infrastructure within the deepep module to enhance support for asymmetric tensor operations. The core change involves externalizing the management of shared memory, allowing the Buffer class to operate with a pre-allocated shared memory region identified by a base address. This refactoring centralizes shared memory communication parameters and introduces a more robust synchronization mechanism, paving the way for improved performance and flexibility in distributed tensor processing.

Highlights

  • Shared Memory Management Refactor: The Buffer class no longer directly allocates or deallocates shared memory; instead, it receives a meta_addr (uint64_t) as a base address for shared memory access, externalizing its management.
  • Centralized Shared Memory Arguments: A new header file, shmem_comm_args.h, has been introduced to centralize common constants, enums, and metadata structures for shared memory communication across various kernels.
  • Asymmetric Tensor Support & Kernel Refactoring: The pull request enables robust support for asymmetric tensors in shared memory operations, with significant refactoring of shmem_moe_combine_normal.h, shmem_moe_dispatch_normal.h, and shmem_notify_dispatch.h kernels to utilize the new shared memory addressing scheme.
  • Enhanced Synchronization Mechanism: A more structured synchronization protocol has been implemented, replacing older shmem_barrier_all and magic-based methods with new ResetMetaState, PutShareAddr, GetShareAddr, SetSyncFlag, and WaitSyncFlag functions.
  • Python Integration and Testing: Python bindings for the Buffer class have been updated to accept the new meta_addr parameter, complemented by new Python utility and test scripts (shmem_utils.py, test_shmem_intranode_api.py) to validate the shared memory functionality.
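The handshake described in the highlights above can be sketched as follows. This is a minimal host-side Python simulation, not the actual AscendC implementation: the function names mirror the kernel functions listed above, while the MetaRegion class is a hypothetical stand-in for the per-rank meta region in device shared memory.

```python
# Host-side sketch of the synchronization protocol described above.
# ResetMetaState / PutShareAddr / GetShareAddr / SetSyncFlag / WaitSyncFlag
# mirror the kernel function names; MetaRegion is a simulated stand-in for
# the device shared-memory meta region.

class MetaRegion:
    """Simulated per-rank meta region: states, exchanged addresses, sync flags."""
    def __init__(self, num_ranks):
        self.state = [0] * num_ranks
        self.addrs = [None] * num_ranks
        self.flags = [0] * num_ranks

def reset_meta_state(meta, rank):
    meta.state[rank] = 0          # clear this rank's completion state

def put_share_addr(meta, rank, addr):
    meta.addrs[rank] = addr       # publish this rank's buffer base address

def get_share_addr(meta, rank):
    return meta.addrs[rank]       # read the address published by a peer rank

def set_sync_flag(meta, rank):
    meta.flags[rank] = 1          # signal that this rank finished its phase

def wait_sync_flag(meta, rank):
    # the real kernels spin on device memory; in this single-process
    # simulation the flag is already visible, so a check suffices
    assert meta.flags[rank] == 1

# Two ranks exchange addresses, then synchronize.
meta = MetaRegion(num_ranks=2)
for r in (0, 1):
    reset_meta_state(meta, r)
    put_share_addr(meta, r, 0x1000 + r * 0x100)
for r in (0, 1):
    set_sync_flag(meta, r)
for r in (0, 1):
    wait_sync_flag(meta, r)
peer_addr = get_share_addr(meta, 1)  # rank 0 reads rank 1's address
```

In the real kernels, WaitSyncFlag spins on device memory until the peer's flag becomes visible; the simulation collapses that wait into a single assertion.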


Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature | Command | Description
Code Review | /gemini review | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state.
Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help | /gemini help | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist (Bot) left a comment

Code Review

This pull request introduces support for shared memory with asymmetric tensors in deepep. The changes involve a significant refactoring of how shared memory is managed, moving the allocation and initialization outside of the Buffer class and passing the memory address directly. The communication and synchronization logic within the kernels has also been substantially updated.

My review focuses on improving code maintainability and removing leftover debugging artifacts. The key points are:

  • Code Duplication: There is a considerable amount of duplicated code for synchronization and address calculation across different kernel headers. I've suggested refactoring this into a shared utility.
  • Debugging Output: A debug print statement was found and should be removed.
  • Code Comments: Some comments are in Chinese and should be translated to English for better readability and consistency.

Overall, the changes look solid and the addition of tests is a great step towards ensuring correctness.

Comment on lines +55 to 69
__aicore__ inline GM_ADDR GetMetaAddrByRankId(const int32_t rankId, const int metaType)
{
    auto ptr = shmem_ptr(gva_gm, rankId);

    switch (metaType) {
        case STATE: // 存放通信结束的state (stores the state marking communication completion)
            return (GM_ADDR)(ptr);
        case ADDR: // 存放交换的共享地址 (stores the exchanged shared address)
            return (GM_ADDR)(ptr) + DISPATCH_STATUS_OFFSET;
        case FLAG: // 存放第一次清理state空间后的同步flag (stores the sync flag written after the first cleanup of the state space)
            return (GM_ADDR)(ptr) + META_FLAG_OFFSET;
        default:
            return (GM_ADDR)(ptr);
    }
}

Severity: high

There is significant code duplication across several kernel header files. Functions like GetMetaAddrByRankId, SplitCoreCal, ResetMetaState, PutShareAddr, GetShareAddr, SetSyncFlag, and WaitSyncFlag are nearly identical in shmem_moe_dispatch_normal.h, shmem_moe_combine_normal.h, and shmem_notify_dispatch.h.

To improve maintainability and reduce redundancy, please consider refactoring this common logic into a shared utility header file.

Additionally, several comments within these duplicated functions are in Chinese. It would be beneficial to translate them to English for consistency.

Comment thread csrc/deepep/deep_ep.cpp
-    EP_HOST_ASSERT(rank == internode::init(rank, num_ranks, local_mem_size, "tcp://127.0.0.1:11222"));
-    shmem_ptr = internode::alloc(num_of_int32, ele_size);
+    shmem_ptr = meta_addr;
+    std::cout << "[deepep] rank: " << rank << ", shmem_ptr: " << shmem_ptr << std::endl;

Severity: medium

This std::cout statement appears to be for debugging purposes. It's recommended to remove it before merging to keep the logs clean in a production environment.

constexpr uint64_t NOTIFY_META_SIZE = 60 * 1024UL;   // notify (60KB)
constexpr uint64_t DISPATCH_META_SIZE = 60 * 1024UL; // dispatch (60KB)

constexpr uint64_t META_FLAG_OFFSET = 180 * 1024UL * 1024UL; // 预设meta空间200MB,从180MB开始用来写清理后的同步flag (preset meta space is 200MB; the post-cleanup sync flag is written starting at 180MB)

Severity: medium

The comment on this line is in Chinese. For better maintainability and to make the code accessible to a wider audience, please consider translating it to English.

Suggested change
constexpr uint64_t META_FLAG_OFFSET = 180 * 1024UL * 1024UL; // 预设meta空间200MB,从180MB开始用来写清理后的同步flag
constexpr uint64_t META_FLAG_OFFSET = 180 * 1024UL * 1024UL; // Preset meta space is 200MB, the sync flag after cleanup is written starting from 180MB
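For orientation, the offsets quoted in this review can be sanity-checked against the 200MB meta space mentioned in the comment. A small sketch in Python (META_SPACE_SIZE is an assumption derived from the "200MB" figure; the other constants are copied from the snippets above):

```python
# Meta-region layout check. Constants are copied from the review snippets;
# META_SPACE_SIZE is an assumption taken from the "200MB" figure in the comment.
NOTIFY_META_SIZE = 60 * 1024            # notify (60KB)
DISPATCH_META_SIZE = 60 * 1024          # dispatch (60KB)
META_FLAG_OFFSET = 180 * 1024 * 1024    # sync flags start at 180MB
META_SPACE_SIZE = 200 * 1024 * 1024     # assumed total meta space (200MB)

# The notify/dispatch metadata must fit below the flag region.
assert NOTIFY_META_SIZE + DISPATCH_META_SIZE <= META_FLAG_OFFSET

# Space left for post-cleanup sync flags: 200MB - 180MB = 20MB.
flag_region_mb = (META_SPACE_SIZE - META_FLAG_OFFSET) // (1024 * 1024)
```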

Comment on lines +32 to +33
constexpr static int32_t UB_HEAD_OFFSET = 96;
constexpr static int32_t UB_MID_OFFSET = UB_HEAD_OFFSET + UB_SINGLE_PING_PONG_ADD_SIZE_MAX + UB_ALIGN;

Severity: medium

The static keyword is redundant for constexpr variables at namespace scope. You can remove static for cleaner code.

Suggested change
constexpr static int32_t UB_HEAD_OFFSET = 96;
constexpr static int32_t UB_MID_OFFSET = UB_HEAD_OFFSET + UB_SINGLE_PING_PONG_ADD_SIZE_MAX + UB_ALIGN;
constexpr int32_t UB_HEAD_OFFSET = 96;
constexpr int32_t UB_MID_OFFSET = UB_HEAD_OFFSET + UB_SINGLE_PING_PONG_ADD_SIZE_MAX + UB_ALIGN;

int32_t sumVal = 0; // 所有rank中接收token最大的 (the largest number of tokens received among all ranks)
for (uint32_t srcRankId = 0; srcRankId < epWorldSize_; srcRankId++) {
    ReorderRecvDataOutput(srcRankId, recvTokenLt, false); // localExpNum * ranks
    // 只需要计算当前rank接收的token数 (only the number of tokens received by the current rank needs to be computed)

Severity: medium

This comment is in Chinese. Please translate it to English to ensure consistency and improve code readability for all contributors.

// Only need to calculate the number of tokens received by the current rank
