Issue/900 cuda及类cuda embedding#902
Conversation
There was a problem hiding this comment.
Pull request overview
This pull request adds device-side embedding support for CUDA and CUDA-like platforms (NVIDIA, Metax, Moore) to enable CUDA Graph recording. The key improvement is removing synchronous CPU transfers that previously prevented graph recording, replacing them with fully asynchronous device-side kernel implementations.
Key changes:
- Implements device-side embedding kernels for NVIDIA, Metax, and Moore platforms with optimized vectorized memory access
- Removes synchronous
to(cpu_device)operations from test and production code - Adds comprehensive test suite for validating CUDA Graph recording support
Reviewed changes
Copilot reviewed 25 out of 25 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| test/infinicore/ops/embedding.py | Removed CPU conversion logic, now supports device-side input directly |
| test/infinicore/nn/embedding.py | Removed CPU conversion logic for testing |
| test/infinicore/nn/test_embedding_graph_recording.py | New comprehensive test suite for CUDA Graph recording validation |
| test/infinicore/nn/HOW_TO_USE_GRAPH_RECORDING_TEST.md | Documentation on how to use and interpret the graph recording tests |
| test/infinicore/nn/EMBEDDING_GRAPH_RECORDING_COMPARISON.md | Technical comparison of before/after implementation details |
| src/infiniop/ops/embedding/operator.cc | Multi-platform dispatcher for embedding operations |
| src/infiniop/ops/embedding/embedding.h | Common descriptor macro definition |
| src/infiniop/ops/embedding/cpu/embedding_cpu.h | CPU implementation header |
| src/infiniop/ops/embedding/cpu/embedding_cpu.cc | CPU implementation with memcpy-based embedding lookup |
| src/infiniop/ops/embedding/cuda/embedding_kernel.cuh | Shared CUDA kernel helper functions with vectorized memory access |
| src/infiniop/ops/embedding/nvidia/embedding_nvidia.cuh | NVIDIA platform header |
| src/infiniop/ops/embedding/nvidia/embedding_nvidia.cu | NVIDIA CUDA kernel implementation with float4/half2/bfloat162 optimization |
| src/infiniop/ops/embedding/metax/embedding_metax.cuh | Metax platform header |
| src/infiniop/ops/embedding/metax/embedding_metax.maca | Metax MACA kernel implementation |
| src/infiniop/ops/embedding/moore/embedding_moore.h | Moore platform header |
| src/infiniop/ops/embedding/moore/embedding_moore_kernel.h | Moore-specific kernel helpers |
| src/infiniop/ops/embedding/moore/embedding_moore.mu | Moore MUSA kernel implementation |
| src/infinicore/ops/embedding/embedding_infiniop.cc | InfiniOP wrapper with descriptor caching |
| src/infinicore/ops/embedding/embedding.cc | Op dispatcher and device synchronization logic |
| src/infinicore/nn/embedding.cc | Simplified forward method using new device-side ops |
| python/infinicore/nn/functional/embedding.py | Removed CPU-only assertion to support device inputs |
| include/infiniop/ops/embedding.h | Public API for embedding operations |
| include/infiniop.h | Added embedding.h include |
| include/infinicore/ops/embedding.hpp | Added Embedding class with dispatcher pattern |
| include/infinicore/ops.hpp | Added embedding.hpp include |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| } else if constexpr (std::is_same_v<T, cuda_bfloat16>) { | ||
| // Use bfloat162 for vectorized access | ||
| if (embedding_dim >= 2 && embedding_dim % 2 == 0) { | ||
| copyVectorizedBFloat162<IndexType>(dst, src, embedding_dim); | ||
| } else { | ||
| copyScalar<T, IndexType>(dst, src, embedding_dim); | ||
| } |
There was a problem hiding this comment.
The embeddingKernel template checks for cuda_bfloat16 type on line 48, but this is Metax platform code which should use __hpcc_bfloat16 instead. This type mismatch means the vectorized bfloat16 path (lines 48-54) will never execute on Metax platform. The kernel should check for the Metax-specific type __hpcc_bfloat16 to enable proper vectorization.
| } else if (_embedding_dim >= 1024) { | ||
| block_size = 128; // Large embedding_dim: use smaller block to reduce register pressure | ||
| } else if (_embedding_dim <= 256) { | ||
| block_size = 384; // Medium embedding_dim: balanced configuration |
There was a problem hiding this comment.
The block size adjustment logic has an inconsistency. When _embedding_dim <= 256, the code sets block_size = 384 (line 91), but this case only applies if _embedding_dim is greater than 64 (because line 87 handles <= 64). This means embedding_dim values between 65-256 will use block_size=384. However, this contradicts the default of 256 which was already set. The logic should be restructured to avoid overlapping conditions and ensure proper block size selection.
| } else if (_embedding_dim >= 1024) { | |
| block_size = 128; // Large embedding_dim: use smaller block to reduce register pressure | |
| } else if (_embedding_dim <= 256) { | |
| block_size = 384; // Medium embedding_dim: balanced configuration | |
| } else if (_embedding_dim <= 256) { | |
| block_size = 384; // Medium embedding_dim: balanced configuration | |
| } else if (_embedding_dim >= 1024) { | |
| block_size = 128; // Large embedding_dim: use smaller block to reduce register pressure |
7997d89 to
2c02d73
Compare
2c02d73 to
170328a
Compare
170328a to
4866ddf
Compare
4866ddf to
1eac07c
Compare
8c49f1c to
db1bdfc
Compare
…graph recording - Ensure embedding tensors are on the same device. Change format. - Optimize embedding kernel with vectorized memory access and __ldg - Add vectorized memory access using float4/float2, half2, and bfloat162 - Use __ldg instruction for read-only weight and indices access - Add memory alignment checks to enable vectorized paths - Add __restrict__ keywords for better compiler optimization - Implement dynamic block size selection based on embedding_dim
db1bdfc to
bf120b2
Compare

resolves #900
resolves #846
includes #859
天数


沐曦

摩尔
