prov/shm: add IOV fast path with cirque-based completion #12279
Open
yinliaws wants to merge 10 commits into
Place the lighter protocol fields in the same cache line as the atomic-queue cmd_entry. This way, on a CPU without prefetching algorithms (which would grab the adjacent cache line for us), access to the fields needed by the lightweight/fast protocols is optimized. The heavier/slower protocols, which use the fields in the second cache line, are unaffected by this change on older CPUs since they already need to access both cache lines. Signed-off-by: Zach Dworkin <zachary.dworkin@intel.com> Signed-off-by: Alexia Ingerson <alexia.ingerson@intel.com>
Replace hdr.status with hdr.smr_flags to indicate any error. An error is signaled with the SMR_OP_ERROR flag, which the sender processes when the command is returned. Signed-off-by: Zach Dworkin <zachary.dworkin@intel.com> Signed-off-by: Alexia Ingerson <alexia.ingerson@intel.com>
Remove the parallel command-inject resources and revert to using a lock-unlock inject buffer pool. Update the inject protocol to use the old method.

There is a performance regression when using the "new shm" command-inject parallel data structure. This is due to the sender not being able to complete its transmission until the receiver returns the sender's command to the sender's return queue. In the old lock-unlock method the sender would allocate receiver-side resources, copy its data into the receiver inject buffer, and then complete. The old method allows MPIs and applications to assume that their inject message transmissions will complete quickly. Since the new method does not complete as quickly, it is likely the reason for this regression.

Remove the smr_format_inject function. We need to try to get a tx_buf first and, if the pool is exhausted, fail as early as possible so the caller can try again. Since we always need a tx_buf, this avoids the ofi_buf_alloc call when there are no more inject buffers. With that code pulled out, format_inject would only copy into the tx_buf and set the proto to inject, which does not need to be its own function. Instead, those operations are placed in more optimal locations inside the do_inject function.

This also reverts to the "old-shm" method of buffering all unexpected inject messages.

Signed-off-by: Zach Dworkin <zachary.dworkin@intel.com> Signed-off-by: Alexia Ingerson <alexia.ingerson@intel.com>
Command stack is less likely to be used in the inject protocol when resources are on the receiver side. If the inject pool is above it then we have to jump less, and do not have to jump over the command stack, when accessing it. Signed-off-by: Zach Dworkin <zachary.dworkin@intel.com>
SAR should never be handling 0 byte copies anymore since the inject protocol can handle delivery complete. Instead we will check and WARN the user if we accidentally do a 0-byte copy in SAR. Signed-off-by: Zach Dworkin <zachary.dworkin@intel.com>
Remove smr_proto_inject as a selectable proto for ofi_op_read_req. This reverts the behavior to match 2.4.x for this protocol. We want to choose iov here because it avoids the bottleneck that occurs when a receiver runs out of inject buffers, and because the command can be used to do an RMA operation instead of the copy-in/copy-out operation of the inject protocol. Try to grab an inject buffer before formatting anything else. This is one of the most expensive lookups we need to do and it is not guaranteed to always get a buffer. Since we always need one, try to get it first so that if it fails, the failure path is as short as possible. Signed-off-by: Zach Dworkin <zachary.dworkin@intel.com>
Reuse the cmd entry field as the command pointer. The uses of the entry and the ptr are mutually exclusive, so the field can be reused. This allows casting the cmd to an entry and dereferencing the cmd to get the ptr. It lets us maximize header field caching and increases the maximum inline payload by 8 bytes. Signed-off-by: Alexia Ingerson <alexia.ingerson@intel.com>
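To illustrate the field reuse described in this commit, here is a minimal sketch. The struct names, member names, and payload size are hypothetical and do not reflect the real smr_cmd layout; the only idea taken from the commit message is that the entry word and the command pointer share storage at the start of the command.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified layout; NOT the real struct smr_cmd. */
struct demo_cmd;

struct demo_cmd_hdr {
	union {
		uint64_t entry;        /* queue-entry word while the slot is queued */
		struct demo_cmd *ptr;  /* command pointer once the entry is consumed */
	} u;                           /* the two uses never overlap in time */
	uint64_t op_flags;
};

struct demo_cmd {
	struct demo_cmd_hdr hdr;       /* header is the first member (offset 0) */
	uint8_t payload[48];           /* illustrative inline payload */
};

/* Because the union sits at offset 0, a cmd can be viewed as an entry. */
static inline uint64_t *demo_cmd_to_entry(struct demo_cmd *cmd)
{
	return &cmd->hdr.u.entry;
}

/* Reading the same storage as a pointer yields the referenced command. */
static inline struct demo_cmd *demo_cmd_get_ptr(struct demo_cmd *cmd)
{
	return cmd->hdr.u.ptr;
}
```

Sharing the storage is what frees a separate 8-byte pointer field in this sketch; whether the real header saves the space in exactly this way is an assumption.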
The return queue was re-using the atomic queue implementation since it already existed. The AQ implementation is a circular queue with a read and write position. This can cause a lot of contention when all peers are trying to return to the sender in a one-to-all case. This switches to a more efficient dlist implementation to reduce peer contention. The command queue cannot use this implementation because the commands can come from any peer and, in order to work, the dlist implementation requires all peers to be mapped to each other, which we cannot guarantee in shm. Only the return queue can use this implementation because all of the commands in the return queue belong to the same process (the sender), which all peers are guaranteed to have mapped. This implementation is adapted from the sm2 fifo implementation, which was in turn adapted from the OMPI implementation, so it retains the OMPI header. Signed-off-by: Alexia Ingerson <alexia.ingerson@intel.com>
For RMA reads <= SMR_INJECT_SIZE, allow smr_rma_fast (sender-side CMA) regardless of FI_DELIVERY_COMPLETE. Delivery is inherently complete when process_vm_readv returns — data is in the local buffer. The target memory is always a registered MR with pinned pages, so the syscall cannot stall. This avoids the receiver-side CMA round-trip that regresses 20-25% on Graviton and AMD platforms at sizes 1-4096B. Reads > SMR_INJECT_SIZE already use smr_rma_fast via the existing total_len > SMR_INJECT_SIZE condition. Signed-off-by: Yin Li <yinliq@amazon.com>
Replace the fifo-based return queue for IOV operations with per-operation status slots and an atomic completion counter. This eliminates the expensive atomic_swap + memory barriers on the receiver side (~150ns on ARM) while supporting out-of-order completion (no head-of-line blocking).

Design:
- Sender allocates a resp slot (bitmap) and stores index in cmd
- Receiver writes slot.status = 1 (simple store) + increments comp_count
- Sender progress checks comp_count (one load), scans completed slots
- Any slot can complete independently (out-of-order safe)
- MPSC safe (multiple receivers write to different slots)

The fifo is retained for SAR/IPC protocols. Only IOV uses resp slots.

Performance: tagged_bw 8K recovers from 88% to 96% of v2.4.x on Graviton. writedata_bw 1M recovers to 114% of v2.4.x.

Signed-off-by: Yin Li <yinliq@amazon.com>
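The design bullets in this commit can be sketched roughly as follows. This is a single-file illustration using C11 atomics, not the provider code: every name (demo_resp_region, demo_slot_alloc, and so on), the slot count, and the choice of memory orders are assumptions made for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define DEMO_NUM_SLOTS 64              /* hypothetical: one bit per slot */

struct demo_resp_slot {
	_Atomic uint32_t status;       /* 0 = pending, 1 = done */
};

struct demo_resp_region {
	struct demo_resp_slot slot[DEMO_NUM_SLOTS];
	_Atomic uint64_t comp_count;   /* bumped by receivers per completion */
	uint64_t in_use;               /* sender-private allocation bitmap */
	uint64_t seen;                 /* sender-private: completions reaped */
};

static void demo_region_init(struct demo_resp_region *r)
{
	for (int i = 0; i < DEMO_NUM_SLOTS; i++)
		atomic_init(&r->slot[i].status, 0);
	atomic_init(&r->comp_count, 0);
	r->in_use = 0;
	r->seen = 0;
}

/* Sender: claim a free slot; the index would be stored in the command. */
static int demo_slot_alloc(struct demo_resp_region *r)
{
	for (int i = 0; i < DEMO_NUM_SLOTS; i++) {
		uint64_t bit = UINT64_C(1) << i;
		if (!(r->in_use & bit)) {
			r->in_use |= bit;
			atomic_store_explicit(&r->slot[i].status, 0,
					      memory_order_relaxed);
			return i;
		}
	}
	return -1;                     /* no free slot */
}

/* Receiver: after the data copy, mark the slot done and bump the counter. */
static void demo_slot_complete(struct demo_resp_region *r, int idx)
{
	atomic_store_explicit(&r->slot[idx].status, 1, memory_order_release);
	atomic_fetch_add_explicit(&r->comp_count, 1, memory_order_release);
}

/* Sender progress: one load in the idle case, then scan for done slots. */
static int demo_progress(struct demo_resp_region *r)
{
	uint64_t count = atomic_load_explicit(&r->comp_count,
					      memory_order_acquire);
	int reaped = 0;

	if (count == r->seen)
		return 0;              /* nothing new since the last call */

	for (int i = 0; i < DEMO_NUM_SLOTS; i++) {
		uint64_t bit = UINT64_C(1) << i;
		if ((r->in_use & bit) &&
		    atomic_load_explicit(&r->slot[i].status,
					 memory_order_acquire) == 1) {
			r->in_use &= ~bit;   /* slot may complete in any order */
			reaped++;
		}
	}
	r->seen += (uint64_t)reaped;
	return reaped;
}
```

Because each receiver writes only its own slot, the slots need no compare-and-swap in this sketch; the shared counter is the one read-modify-write, and the sender reads it once per progress call.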
Stacked on #12109. First 6 commits are from #12109 unchanged; only the final commit is new. Will rebase once #12109 merges.
Summary
Adds a lightweight cirque-based completion path for IOV/CMA operations in the SHM provider, replacing the atomic return queue with a single store + pointer comparison for the high-concurrency case. Recovers v2.4.x performance for many-to-many RMA reads without regressing low-concurrency workloads.
Design
Sender (smr_do_iov_fast): uses &ce->cmd directly, reserves a resp slot via cirque, allocates a lightweight smr_iov_pend, encodes the resp slot offset in cmd->hdr.proto_data, and does not set SMR_RETURN_CMD.
Receiver (smr_progress_iov): writes resp->status after the CMA copy when SMR_RETURN_CMD is clear. The status write happens-after the CMA copy in program order, satisfying FI_DELIVERY_COMPLETE.
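A rough sketch of the slot handshake, assuming a power-of-two ring and C11 atomics. The names (demo_ring, demo_reserve, demo_complete, demo_progress_resp), the ring size, and the release/acquire pairing are illustrative assumptions rather than the PR's actual code; the drain loop mirrors the head-first processing that stops at the first slot still marked busy.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define DEMO_RING_SIZE 8               /* hypothetical; power of two */

enum { DEMO_FREE = 0, DEMO_BUSY = 1, DEMO_DONE = 2 };

struct demo_resp {
	_Atomic uint32_t status;
};

struct demo_ring {
	struct demo_resp resp[DEMO_RING_SIZE];
	uint32_t head;                 /* oldest outstanding slot (sender only) */
	uint32_t tail;                 /* next slot to reserve (sender only) */
	uint32_t pending_resp_cnt;     /* outstanding responses (sender only) */
};

static void demo_ring_init(struct demo_ring *q)
{
	for (int i = 0; i < DEMO_RING_SIZE; i++)
		atomic_init(&q->resp[i].status, DEMO_FREE);
	q->head = q->tail = q->pending_resp_cnt = 0;
}

/* Sender: reserve the next slot; the index stands in for the offset
 * that would be encoded in the command header. Returns -1 when full. */
static int demo_reserve(struct demo_ring *q)
{
	if (q->pending_resp_cnt == DEMO_RING_SIZE)
		return -1;
	uint32_t idx = q->tail++ & (DEMO_RING_SIZE - 1);
	atomic_store_explicit(&q->resp[idx].status, DEMO_BUSY,
			      memory_order_relaxed);
	q->pending_resp_cnt++;
	return (int)idx;
}

/* Receiver: called only after the data copy has finished, so the
 * release store publishes the copied data before the status. */
static void demo_complete(struct demo_ring *q, int idx)
{
	atomic_store_explicit(&q->resp[idx].status, DEMO_DONE,
			      memory_order_release);
}

/* Sender progress: drain from the head, stop at the first busy slot. */
static int demo_progress_resp(struct demo_ring *q)
{
	int done = 0;

	while (q->pending_resp_cnt) {  /* cheap guard when nothing is pending */
		uint32_t idx = q->head & (DEMO_RING_SIZE - 1);
		if (atomic_load_explicit(&q->resp[idx].status,
					 memory_order_acquire) != DEMO_DONE)
			break;
		atomic_store_explicit(&q->resp[idx].status, DEMO_FREE,
				      memory_order_relaxed);
		q->head++;
		q->pending_resp_cnt--;
		done++;
	}
	return done;
}
```

In this sketch a slot that finishes early waits behind an unfinished head slot, which keeps the sender side to a compare against the head at the cost of strict in-order reaping.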
Sender progress (smr_progress_resp): in-order head processing, breaks on first BUSY slot. Guarded by pending_resp_cnt.
Dispatch heuristic (peer fan-out): the fast path engages only when recent ops are to different peers. A fan_out_score saturates on peer change and decays on repeats, so all-to-all workloads use the fast path while pair-based workloads stay on the slow path.
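One way the saturate-and-decay heuristic could look is sketched below. The constants, the threshold comparison, and the names (demo_fanout, demo_use_fast_path) are assumptions for illustration; the PR text only states that the score saturates on a peer change and decays on repeats.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_SCORE_MAX       8   /* hypothetical saturation value */
#define DEMO_SCORE_THRESHOLD 4   /* hypothetical fast-path cutoff */

struct demo_fanout {
	int64_t last_peer;       /* peer targeted by the previous op */
	uint32_t score;          /* high = recent ops spread across peers */
};

/* Update the score for an op to `peer` and report whether the
 * fast path would be chosen under this sketch's threshold. */
static bool demo_use_fast_path(struct demo_fanout *f, int64_t peer)
{
	if (peer != f->last_peer) {
		f->score = DEMO_SCORE_MAX;   /* saturate on peer change */
		f->last_peer = peer;
	} else if (f->score > 0) {
		f->score--;                  /* decay on repeated peer */
	}
	return f->score >= DEMO_SCORE_THRESHOLD;
}
```

With these example constants, an op stream that keeps switching peers stays on the fast path, while a stream pinned to one peer falls back after five consecutive repeats.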
Performance
Severe IOV regressions resolved:
| Test | Without fast path | With fast path |
|---|---|---|
| All_get_all_aggregate 1024 | 78% | 127% |
| All_get_all_aggregate 4096 | 82% | 124% |
| All_get_all_non_aggregate 4096 | 81% | 125% |
Low-concurrency workloads (Exchange_get pairs) match the v2.5.x baseline rather than degrading from cirque overhead.