Issue/900 CUDA and CUDA-like embedding by wooway777 · Pull Request #902 · InfiniTensor/InfiniCore

wooway777 · 2026-01-09T02:29:17Z

resolves #900
resolves #846
includes #859

Iluvatar

Metax

Moore

Copilot

Pull request overview

This pull request adds device-side embedding support for CUDA and CUDA-like platforms (NVIDIA, Metax, Moore) to enable CUDA Graph recording. The key improvement is removing synchronous CPU transfers that previously prevented graph recording, replacing them with fully asynchronous device-side kernel implementations.

Key changes:

- Implements device-side embedding kernels for NVIDIA, Metax, and Moore platforms with optimized vectorized memory access
- Removes synchronous to(cpu_device) operations from test and production code
- Adds comprehensive test suite for validating CUDA Graph recording support
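For orientation, the operation these kernels perform is a row gather: each index selects one row of the weight table and copies it to the output. The file summary describes the CPU implementation as a memcpy-based lookup; the following is a minimal host-side sketch of that idea, not the PR's code. The function name `embeddingLookup`, the `float` element type, and the `int64_t` index type are assumptions made for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

// Hypothetical reference lookup: gathers rows of `weight`
// (vocab_size x embedding_dim, row-major) selected by `indices`
// into one contiguous output buffer.
std::vector<float> embeddingLookup(const std::vector<float> &weight,
                                   std::size_t vocab_size,
                                   std::size_t embedding_dim,
                                   const std::vector<std::int64_t> &indices) {
    if (embedding_dim == 0 || weight.size() != vocab_size * embedding_dim) {
        throw std::invalid_argument("weight size does not match vocab_size * embedding_dim");
    }
    std::vector<float> out(indices.size() * embedding_dim);
    for (std::size_t i = 0; i < indices.size(); ++i) {
        const std::int64_t idx = indices[i];
        // Reject indices outside the table instead of reading out of bounds.
        if (idx < 0 || static_cast<std::size_t>(idx) >= vocab_size) {
            throw std::out_of_range("embedding index out of range");
        }
        // One row copy per index; a device kernel would do this per thread block
        // without any host round trip.
        std::memcpy(out.data() + i * embedding_dim,
                    weight.data() + static_cast<std::size_t>(idx) * embedding_dim,
                    embedding_dim * sizeof(float));
    }
    return out;
}
```

A device-side version keeps both the indices and the weight table on the accelerator, which is what removes the synchronous CPU transfer described in the overview.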

Reviewed changes

Copilot reviewed 25 out of 25 changed files in this pull request and generated 8 comments.

File	Description
test/infinicore/ops/embedding.py	Removed CPU conversion logic, now supports device-side input directly
test/infinicore/nn/embedding.py	Removed CPU conversion logic for testing
test/infinicore/nn/test_embedding_graph_recording.py	New comprehensive test suite for CUDA Graph recording validation
test/infinicore/nn/HOW_TO_USE_GRAPH_RECORDING_TEST.md	Documentation on how to use and interpret the graph recording tests
test/infinicore/nn/EMBEDDING_GRAPH_RECORDING_COMPARISON.md	Technical comparison of before/after implementation details
src/infiniop/ops/embedding/operator.cc	Multi-platform dispatcher for embedding operations
src/infiniop/ops/embedding/embedding.h	Common descriptor macro definition
src/infiniop/ops/embedding/cpu/embedding_cpu.h	CPU implementation header
src/infiniop/ops/embedding/cpu/embedding_cpu.cc	CPU implementation with memcpy-based embedding lookup
src/infiniop/ops/embedding/cuda/embedding_kernel.cuh	Shared CUDA kernel helper functions with vectorized memory access
src/infiniop/ops/embedding/nvidia/embedding_nvidia.cuh	NVIDIA platform header
src/infiniop/ops/embedding/nvidia/embedding_nvidia.cu	NVIDIA CUDA kernel implementation with float4/half2/bfloat162 optimization
src/infiniop/ops/embedding/metax/embedding_metax.cuh	Metax platform header
src/infiniop/ops/embedding/metax/embedding_metax.maca	Metax MACA kernel implementation
src/infiniop/ops/embedding/moore/embedding_moore.h	Moore platform header
src/infiniop/ops/embedding/moore/embedding_moore_kernel.h	Moore-specific kernel helpers
src/infiniop/ops/embedding/moore/embedding_moore.mu	Moore MUSA kernel implementation
src/infinicore/ops/embedding/embedding_infiniop.cc	InfiniOP wrapper with descriptor caching
src/infinicore/ops/embedding/embedding.cc	Op dispatcher and device synchronization logic
src/infinicore/nn/embedding.cc	Simplified forward method using new device-side ops
python/infinicore/nn/functional/embedding.py	Removed CPU-only assertion to support device inputs
include/infiniop/ops/embedding.h	Public API for embedding operations
include/infiniop.h	Added embedding.h include
include/infinicore/ops/embedding.hpp	Added Embedding class with dispatcher pattern
include/infinicore/ops.hpp	Added embedding.hpp include

Copilot · 2026-01-09T02:34:54Z

+            } else if constexpr (std::is_same_v<T, cuda_bfloat16>) {
+                // Use bfloat162 for vectorized access
+                if (embedding_dim >= 2 && embedding_dim % 2 == 0) {
+                    copyVectorizedBFloat162<IndexType>(dst, src, embedding_dim);
+                } else {
+                    copyScalar<T, IndexType>(dst, src, embedding_dim);
+                }


The embeddingKernel template checks for the cuda_bfloat16 type on line 48, but this is Metax platform code, which should use __hpcc_bfloat16 instead. Because of this type mismatch, the vectorized bfloat16 path (lines 48-54) will never execute on the Metax platform. The kernel should check for the Metax-specific type __hpcc_bfloat16 so that the vectorized path is actually taken.
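The effect the reviewer describes can be reproduced on the host with stand-in types. The sketch below is illustrative only: `FakeCudaBf16` and `FakeHpccBf16` are hypothetical placeholders for the vendor bfloat16 types, and `selectCopyPath` is an invented helper that mirrors the `if constexpr` branch in the quoted diff. It shows that comparing against the wrong type silently routes every call to the scalar path.

```cpp
#include <cstddef>
#include <type_traits>

// Stand-in types for illustration only; real code would use the
// vendor-provided bfloat16 types named in the review comment.
struct FakeCudaBf16 { unsigned short bits; };
struct FakeHpccBf16 { unsigned short bits; };

enum class CopyPath { Vectorized, Scalar };

// PlatformBf16 is the type the kernel source compares against;
// T is the element type the kernel is actually instantiated with.
template <typename PlatformBf16, typename T>
constexpr CopyPath selectCopyPath(std::size_t embedding_dim) {
    if constexpr (std::is_same_v<T, PlatformBf16>) {
        // Same condition as the quoted diff: pairs of elements need an even length.
        return (embedding_dim >= 2 && embedding_dim % 2 == 0) ? CopyPath::Vectorized
                                                             : CopyPath::Scalar;
    } else {
        // Type mismatch: the vectorized branch is discarded at compile time.
        return CopyPath::Scalar;
    }
}
```

One way to avoid this class of bug, under the same assumptions, is for each backend to define a single alias for its bfloat16 type and have shared kernel code compare against that alias rather than a hard-coded vendor name.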

Copilot · 2026-01-09T02:34:54Z

+    } else if (_embedding_dim >= 1024) {
+        block_size = 128; // Large embedding_dim: use smaller block to reduce register pressure
+    } else if (_embedding_dim <= 256) {
+        block_size = 384; // Medium embedding_dim: balanced configuration


The block size adjustment logic has an inconsistency. When _embedding_dim <= 256, the code sets block_size = 384 (line 91), but this branch is only reached when _embedding_dim is greater than 64 (because line 87 handles <= 64). As a result, embedding_dim values from 65 to 256 use block_size = 384, which contradicts the default of 256 that was already set. The logic should be restructured to avoid overlapping conditions and ensure proper block size selection.

Suggested change

-    } else if (_embedding_dim >= 1024) {
-        block_size = 128; // Large embedding_dim: use smaller block to reduce register pressure
-    } else if (_embedding_dim <= 256) {
-        block_size = 384; // Medium embedding_dim: balanced configuration
+    } else if (_embedding_dim <= 256) {
+        block_size = 384; // Medium embedding_dim: balanced configuration
+    } else if (_embedding_dim >= 1024) {
+        block_size = 128; // Large embedding_dim: use smaller block to reduce register pressure
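One way to make the ranges easier to audit is to write the selection as a single function with ascending thresholds, so each branch covers exactly one interval. The sketch below is an assumption-laden illustration, not the PR's code: the 384, 256, and 128 values and the 256/1024 thresholds come from the quoted diff and comment, while `kSmallDimBlock` is a placeholder because the block size used for embedding_dim <= 64 is not shown on this page.

```cpp
#include <cstddef>

// Placeholder (assumption): the value used for embedding_dim <= 64
// is not visible in the quoted diff.
constexpr unsigned kSmallDimBlock = 512;

// Hypothetical helper: maps embedding_dim to a thread-block size using
// ascending, non-overlapping ranges.
unsigned selectBlockSize(std::size_t embedding_dim) {
    if (embedding_dim <= 64) {
        return kSmallDimBlock; // small rows
    }
    if (embedding_dim <= 256) {
        return 384; // 65..256: "balanced configuration" per the diff comment
    }
    if (embedding_dim < 1024) {
        return 256; // 257..1023: the default mentioned in the review
    }
    return 128; // >= 1024: smaller block to reduce register pressure
}
```

As far as the quoted lines show, the `<= 256` and `>= 1024` conditions cannot both hold, so swapping their order in an else-if chain should not change which block size is selected; the ascending form mainly makes the intervals explicit.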

wooway777 · 2026-01-19T01:44:50Z

CPU:

…graph recording
- Ensure embedding tensors are on the same device. Change format.
- Optimize embedding kernel with vectorized memory access and __ldg
- Add vectorized memory access using float4/float2, half2, and bfloat162
- Use __ldg instruction for read-only weight and indices access
- Add memory alignment checks to enable vectorized paths
- Add __restrict__ keywords for better compiler optimization
- Implement dynamic block size selection based on embedding_dim
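The alignment checks mentioned in the commit message can be illustrated on the host. Vector loads such as float4 require the address to be a multiple of the vector size, so a kernel typically tests both pointers and the row length before choosing a path. The sketch below is a guess at the general shape of such a check; `isAligned` and `floatVectorWidth` are hypothetical names, and the 16-byte and 8-byte figures are simply the sizes of four and two floats.

```cpp
#include <cstddef>
#include <cstdint>

// True when `ptr` lies on a `bytes`-byte boundary (`bytes` must be a power of two).
inline bool isAligned(const void *ptr, std::size_t bytes) {
    return (reinterpret_cast<std::uintptr_t>(ptr) & (bytes - 1)) == 0;
}

// Hypothetical helper: returns how many floats can be moved per access
// (4 for a float4-style copy, 2 for float2, 1 for the scalar fallback).
inline std::size_t floatVectorWidth(const float *dst, const float *src,
                                    std::size_t embedding_dim) {
    if (embedding_dim >= 4 && embedding_dim % 4 == 0 &&
        isAligned(dst, 4 * sizeof(float)) && isAligned(src, 4 * sizeof(float))) {
        return 4;
    }
    if (embedding_dim >= 2 && embedding_dim % 2 == 0 &&
        isAligned(dst, 2 * sizeof(float)) && isAligned(src, 2 * sizeof(float))) {
        return 2;
    }
    return 1; // misaligned pointer or odd length: fall back to scalar copies
}
```

When the table base address is aligned and embedding_dim is a multiple of the vector width, every row start is aligned as well, so under those assumptions the check only needs to run once per launch rather than once per row.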

wooway777 requested review from a team and Copilot January 9, 2026 02:29

Copilot started reviewing on behalf of wooway777 January 9, 2026 02:29

Copilot AI reviewed Jan 9, 2026

wooway777 force-pushed the issue/900-temp branch 3 times, most recently from 7997d89 to 2c02d73 Compare January 9, 2026 09:26

wooway777 requested review from PanZezhong1725, gongchensu and whjthu January 9, 2026 10:07

wooway777 force-pushed the issue/900-temp branch from 2c02d73 to 170328a Compare January 15, 2026 11:59

PanZezhong1725 requested changes Jan 16, 2026

Comment thread test/infinicore/nn/test_embedding_graph_recording.py Outdated

Comment thread test/infinicore/nn/EMBEDDING_GRAPH_RECORDING_COMPARISON.md Outdated

PanZezhong1725 requested changes Jan 16, 2026

Comment thread src/infinicore/ops/embedding/embedding_infiniop.cc Outdated

wooway777 force-pushed the issue/900-temp branch from 170328a to 4866ddf Compare January 16, 2026 03:08

wooway777 requested a review from PanZezhong1725 January 16, 2026 03:08

wooway777 force-pushed the issue/900-temp branch from 4866ddf to 1eac07c Compare January 16, 2026 10:53

PanZezhong1725 approved these changes Jan 16, 2026

PanZezhong1725 requested changes Jan 19, 2026

Comment thread src/infinicore/nn/embedding.cc Outdated

wooway777 requested a review from PanZezhong1725 January 19, 2026 01:45

wooway777 force-pushed the issue/900-temp branch 4 times, most recently from 8c49f1c to db1bdfc Compare January 22, 2026 10:29

gongchensu and others added 4 commits January 26, 2026 06:58

issue/900 - support embedding on iluvatar, metax, and moore

4615ecf

issue/900 - adapt to graph and adjust test script

d3bae33

issue/900 - maintains classic embedding for devices yet to be worked on

bf120b2

wooway777 force-pushed the issue/900-temp branch from db1bdfc to bf120b2 Compare January 26, 2026 06:58

wooway777 closed this Mar 20, 2026

wooway777 wants to merge 4 commits into main from issue/900-temp

4 participants
