Skip to content

dora manage pinned memory about 1000MB/S#1623

Closed
tang-canran wants to merge 15 commits into
dora-rs:mainfrom
tang-canran:main
Closed

dora manage pinned memory about 1000MB/S#1623
tang-canran wants to merge 15 commits into
dora-rs:mainfrom
tang-canran:main

Conversation

@tang-canran
Copy link
Copy Markdown

No description provided.

@phil-opp
Copy link
Copy Markdown
Collaborator

Please use English in commit messages, code comments, and docs. Please also add a PR description to tell us what this change is about. Thanks!

@phil-opp phil-opp added the waiting-for-author The pull request requires adjustments by the PR author. label Apr 23, 2026
@heyong4725
Copy link
Copy Markdown
Collaborator

Hi @tang-canran, thanks for the substantial engineering effort. After reading the actual code (not just the design doc), the technical idea has real value but this specific PR isn't the right vehicle. Closing this — please read carefully, the path forward is in the follow-up issue linked at the bottom.

What works architecturally

After reading binaries/daemon/src/memory_manager.rs, the new DaemonRequest variants in libraries/message/src/node_to_daemon.rs, and the control_channel.rs additions, the actual design is side-channel, not replacement:

  • 3 new opt-in protocol variants (RegisterPinnedMemory, ReadPinnedMemory, FreePinnedMemory) — additive, no modification to existing message types
  • New MemoryPoolManager in the daemon — metadata-only HashMap, isolated module
  • Bulk data lives in /dev/shm/dora_pool_* shared-memory files; daemon brokers metadata exchange
  • Existing send_output / recv_event paths are untouched. Zenoh SHM, Arrow IPC, mpsc routing, cudaIpc — all unchanged

That's a much friendlier shape than docs/design.md's "replace message-passing with shared-state" rhetoric suggests. The code is opt-in; the docs aren't.

Where the real value sits

There's a concrete gap in dora today that this addresses: CPU-pinned host buffer → cudaMemcpyHostToDevice at high throughput. Zenoh SHM (the ≥4 KiB transport) is page-aligned but NOT pinned, so feeding CUDA from a zenoh-SHM payload requires either (a) an extra cudaMemcpyHostToDevice from non-pinned memory (slower than DMA from pinned) or (b) manual pinning that the application has to manage.

For high-rate CPU→GPU workloads — camera frames, sensor batches, real-time perception — that overhead is real. The 3-5× speedup claim against the non-pinned baseline is plausible on that specific path.

cudaIpc handles solve GPU↔GPU sharing but don't help when the producer is on CPU. So pinned-host-buffer is a real gap, not a synthetic benchmark.

Why this specific PR can't land

Five issues, in order of severity:

  1. Implementation sprawl. The feature surface (RPC + memory manager + opt-in API) is reasonably ~300-500 lines. This PR adds 2,717. The 1,092-line Python binding additions in apis/python/node/src/lib.rs and the 348/-60 rewrite of apis/python/node/dora/cuda.py are 7× larger than the feature needs. A focused implementation would fit in a reviewable PR; this one cannot.
  2. Design doc framing. docs/design.md is structured as a EuroSys paper draft ("面向 EuroSys 投稿的系统设计与创新...3.7× 加速比") arguing dora should replace its data plane. That's not a feature proposal — it's a research artifact. A future PR's design doc has to be in English and framed as "opt-in additional transport for pinned-host-buffer CUDA workloads," not "replace upstream architecture."
  3. /dev/shm/dora_pool_* is Linux-only with no Windows/macOS story. dora supports all three. Cross-platform isn't optional — needs at least a graceful "this feature is unavailable on platform X" path.
  4. No comparison against existing alternatives. The design doc doesn't address: why not extend zenoh-shm to support optional pinning? Why not a CUDA-aware Arrow IPC variant? Why not improve cudaIpc reuse with handle caching? Those are the questions a reviewer needs answered before evaluating a new transport.
  5. Maintainer language feedback unresolved. @phil-opp asked on 2026-04-23 for English commits/comments/docs. Commits after that date are still in Chinese ("dora统一管理页锁内存", "完成自释放功能", etc.) and docs/design.md is entirely in Chinese. Project working language is English.

Path forward

I'm opening a follow-up issue: #1872 describes the CPU-pinned-host-buffer opportunity in English, doesn't commit to an architecture, and invites focused proposals. If you (or anyone else) want to engage with the opportunity at the right scope and in English, that's the place to start.

Specifically, a reviewable future PR would look like:

  • An issue or RFC first, in English, proposing the architecture and addressing the four "why not X" alternatives above
  • After maintainer feedback, a code PR roughly: ~150 lines of Rust (protocol + daemon + control_channel), ~100 lines of Python wrapper, ~50 lines of cross-platform gating, ~50 lines of focused test, and an English design note explaining what it is and what it isn't
  • Cross-platform fallback (clear "feature unavailable" on non-Linux, or graceful degradation)
  • Comparison numbers against zenoh-shm + manual pinning, not just against zenoh-shm without pinning

Thanks again for the engineering. The paper is yours to pursue on your fork regardless of what upstream decides. If you do engage with the follow-up issue, please write in English and start with the smaller architectural slice.

cc @phil-opp

@heyong4725 heyong4725 closed this May 19, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

waiting-for-author The pull request requires adjustments by the PR author.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants