
prov/shm: add IOV fast path with cirque-based completion#12279

Open
yinliaws wants to merge 10 commits into
ofiwg:mainfrom
yinliaws:iov-fast-path-stacked

@yinliaws
Contributor

│ Stacked on #12109. First 6 commits are from #12109 unchanged; only the final commit is new. Will rebase once #12109 merges.

Summary

Adds a lightweight cirque-based completion path for IOV/CMA operations in the SHM provider, replacing the atomic return queue with a single store + pointer comparison for the high-concurrency case. Recovers v2.4.x performance for many-to-many RMA reads without regressing low-concurrency workloads.

Design

Sender (smr_do_iov_fast): uses &ce->cmd directly, reserves a resp slot via cirque, allocates a lightweight smr_iov_pend, encodes the resp slot offset in cmd->hdr.proto_data, and does not set SMR_RETURN_CMD.

Receiver (smr_progress_iov): writes resp->status after the CMA copy when SMR_RETURN_CMD is clear. The status write happens-after the CMA copy in program order, satisfying FI_DELIVERY_COMPLETE.
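The sender/receiver handshake above can be sketched roughly as follows. This is an illustration, not the provider's code: `resp_slot`, the `RESP_*` values, and the three helper functions are hypothetical names, `memcpy` stands in for the CMA copy, and the release/acquire pairing is this sketch's own choice for making the copy visible before the status.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical names throughout; not the provider's definitions. */
enum { RESP_FREE = 0, RESP_BUSY = 1, RESP_DONE = 2 };

struct resp_slot {
	_Atomic uint32_t status;
};

/* Sender: mark the slot in flight before posting the command. */
static void sender_reserve(struct resp_slot *slot)
{
	atomic_store_explicit(&slot->status, RESP_BUSY, memory_order_relaxed);
}

/* Receiver: finish the copy first (memcpy stands in for the CMA copy),
 * then publish completion with a single store. */
static void receiver_complete(struct resp_slot *slot, void *dst,
			      const void *src, size_t len)
{
	assert(slot && dst && src);
	memcpy(dst, src, len);
	atomic_store_explicit(&slot->status, RESP_DONE, memory_order_release);
}

/* Sender: returns 1 once the receiver has published completion. */
static int sender_poll(struct resp_slot *slot)
{
	return atomic_load_explicit(&slot->status,
				    memory_order_acquire) == RESP_DONE;
}
```

Because the status store is ordered after the copy, a sender that observes `RESP_DONE` can report the operation complete without a returned command.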

Sender progress (smr_progress_resp): in-order head processing, breaks on first BUSY slot. Guarded by pending_resp_cnt.

Dispatch heuristic (peer fan-out): the fast path engages only when recent ops are to different peers. A fan_out_score saturates on peer change and decays on repeats, so all-to-all workloads use the fast path while pair-based workloads stay on the slow path.
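A minimal sketch of the fan-out heuristic, under stated assumptions: the struct, the function name, and both constants are hypothetical, and "saturates on peer change" is read here as jumping straight to the maximum. The real score width, threshold, and decay rate may differ.

```c
#include <assert.h>
#include <stdint.h>

#define FAN_OUT_MAX    8  /* assumed saturation value */
#define FAN_OUT_THRESH 4  /* assumed fast-path threshold */

/* Hypothetical per-endpoint state for the dispatch decision. */
struct fan_out {
	int64_t last_peer;  /* peer id of the previous operation */
	uint8_t score;      /* high when recent ops targeted different peers */
};

/* Returns 1 if this operation should take the fast path. */
static int fan_out_use_fast_path(struct fan_out *f, int64_t peer)
{
	assert(f);
	if (peer != f->last_peer) {
		f->score = FAN_OUT_MAX;  /* saturate on peer change */
		f->last_peer = peer;
	} else if (f->score > 0) {
		f->score--;              /* decay on repeated peer */
	}
	return f->score >= FAN_OUT_THRESH;
}
```

With these constants, a stream that keeps switching peers always takes the fast path, while a stream pinned to one peer falls back to the slow path after a few repeats.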

Performance
Severe IOV regressions resolved:

│ Test │ Without fast path │ With fast path │
│ All_get_all_aggregate 1024 │ 78% │ 127% │
│ All_get_all_aggregate 4096 │ 82% │ 124% │
│ All_get_all_non_aggregate 4096 │ 81% │ 125% │

Low-concurrency workloads (Exchange_get pairs) match the v2.5.x baseline rather than degrading from cirque overhead.

@yinliaws yinliaws force-pushed the iov-fast-path-stacked branch 30 times, most recently from c17c48a to 5012aa3 Compare May 28, 2026 20:04
zachdworkin and others added 8 commits May 28, 2026 13:09
Place lighter protocol fields in the same cache-line grab
as the atomic-queue cmd_entry grab. This way, if we are
using a CPU without prefetching algorithms (to grab the
adjacent cacheline for us) we are optimizing the access
of the fields we need for the lightweight/fast protocols.
The heavier/slower protocols which use the second cache
line fields will be unaffected by this change on older
cpus since they already need to access both cache-lines.

Signed-off-by: Zach Dworkin <zachary.dworkin@intel.com>
Signed-off-by: Alexia Ingerson <alexia.ingerson@intel.com>
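
The cache-line placement described in the commit above can be illustrated with a layout check. This is a sketch only: `cmd_sketch` and its field names are hypothetical, and the 64-byte line size is an assumption; the real command layout is not shown in this conversation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64  /* assumed cache-line size */

/* Hypothetical layout; field names are illustrative only. */
struct cmd_sketch {
	/* First cache line: queue entry plus fields the light protocols use. */
	uint64_t entry;
	uint64_t op_and_flags;
	uint64_t tag;
	uint64_t size;
	uint8_t  inline_data[CACHE_LINE - 32];
	/* Second cache line: fields only the heavier protocols touch. */
	uint64_t rma_key;
	uint64_t rma_addr;
	uint8_t  pad[CACHE_LINE - 16];
};

_Static_assert(offsetof(struct cmd_sketch, rma_key) == CACHE_LINE,
	       "heavy fields must start on the second cache line");
_Static_assert(sizeof(struct cmd_sketch) == 2 * CACHE_LINE,
	       "command must span exactly two cache lines");

/* Returns 1 if a field at [off, off + len) lies within the first line. */
static int in_first_line(size_t off, size_t len)
{
	return off + len <= CACHE_LINE;
}
```

Compile-time assertions like these keep a later field addition from silently pushing a hot field onto the second line.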
Replace hdr.status with hdr.smr_flags to indicate any
error. This error will use the flag SMR_OP_ERROR for the
sender to process its errors on return cmd.

Signed-off-by: Zach Dworkin <zachary.dworkin@intel.com>
Signed-off-by: Alexia Ingerson <alexia.ingerson@intel.com>
Remove the parallel command-inject resources and revert
to using a lock-unlock inject buffer pool.

Update the inject protocol to use the old method.

There is a performance regression when using the "new shm"
command-inject parallel data structure. This is due to the
sender not being able to complete its transmission until
the receiver returns the sender's command to the sender's
return queue. In the old lock-unlock method the sender
would allocate receiver side resources, copy its data into
the receiver inject buffer and then complete. The old
method allows MPIs and applications to assume that their
inject message transmissions will complete quickly and
since the new method does not complete `as` quickly it is
likely the reason for this regression.

Remove the smr_format_inject function. We need to try to
get a tx_buf and try again if we run out as soon as possible.
Since we always need a tx_buf we are avoiding the ofi_buf_alloc
call in the case where there are no more inject buffers.
Pulling this code out of the format_inject function makes that
function copy into the tx_buf and set the proto to inject which
does not need to be its own function. Instead we can order the
operations from that function in more optimal locations inside
of do_inject function.

This will also revert to the "old-shm" method of buffering
all unexpected inject messages.

Signed-off-by: Zach Dworkin <zachary.dworkin@intel.com>
Signed-off-by: Alexia Ingerson <alexia.ingerson@intel.com>
Command stack is less likely to be used in the inject
protocol when resources are on the receiver side. If
the inject pool is above it then we have to jump less,
and do not have to jump over the command stack, when
accessing it.

Signed-off-by: Zach Dworkin <zachary.dworkin@intel.com>
SAR should never be handling 0 byte copies anymore
since the inject protocol can handle delivery complete.
Instead we will check and WARN the user if we
accidentally do a 0-byte copy in SAR.

Signed-off-by: Zach Dworkin <zachary.dworkin@intel.com>
Remove smr_proto_inject as a selectable proto for ofi_op_read_req.
This will revert the behavior to be the same as 2.4.x for this
protocol. We want to choose iov here because it avoids the
bottleneck that occurs when a receiver runs out of inject buffers,
and we can use the command to do an RMA operation instead of
the copy-in copy-out operation of the inject protocol.

Try to grab an inject buffer before formatting anything else.
This is one of the most expensive lookups we need to do and
it is not guaranteed to always get a buffer. Since we always
need one, try to get it first so that if it fails, the fail
case is quicker at failing.

Signed-off-by: Zach Dworkin <zachary.dworkin@intel.com>
Reuse the cmd entry field as the command pointer. The entry and the ptr
are never in use at the same time, so the field can be shared. This allows
casting the cmd to an entry and dereferencing the cmd to get the ptr.
It maximizes header field caching and increases the
maximum inline payload by 8 bytes.

Signed-off-by: Alexia Ingerson <alexia.ingerson@intel.com>
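
The field reuse described in the commit above might look roughly like this. The type and helper names are hypothetical; the sketch only assumes that the shared field is the first member of the command, which is what makes the cast-and-dereference trick valid in C.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical header; only the shared first field matters here. */
struct cmd_hdr_sketch {
	union {
		uintptr_t entry;  /* queue linkage while the cmd is queued */
		uintptr_t ptr;    /* command pointer at all other times */
	};
	uint64_t op;
};

/* Because the union is the first member, a pointer to the command is
 * also a valid pointer to the shared field. */
static uintptr_t cmd_get_ptr(const struct cmd_hdr_sketch *cmd)
{
	assert(cmd);
	return *(const uintptr_t *)cmd;
}
```

Sharing one word for both roles is what frees the 8 bytes mentioned above for inline payload.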
The return queue was re-using the atomic queue implementation since it
already existed. The AQ implementation is a circular queue with a read and
write position. This can cause a lot of contention when all peers are trying
to return to the sender in a one to all case. This switches to a more efficient
dlist implementation to reduce peer contention. The command queue cannot use
this implementation because the commands can come from any peer and in order to
work, the dlist implementation requires all peers be mapped to each other, which
we cannot guarantee in shm. Only the return queue can use this implementation
because all of the commands in the return queue belong to the same process (the
sender) which all peers are guaranteed to have mapped.

This implementation is adapted from the sm2 fifo implementation which was in turn
adapted from the OMPI implementation so it retains the OMPI header.

Signed-off-by: Alexia Ingerson <alexia.ingerson@intel.com>
@yinliaws yinliaws force-pushed the iov-fast-path-stacked branch 7 times, most recently from 7b945cf to 8993c13 Compare May 28, 2026 21:48
For RMA reads <= SMR_INJECT_SIZE, allow smr_rma_fast (sender-side CMA)
regardless of FI_DELIVERY_COMPLETE. Delivery is inherently complete when
process_vm_readv returns — data is in the local buffer. The target memory
is always a registered MR with pinned pages, so the syscall cannot stall.

This avoids the receiver-side CMA round-trip that regresses 20-25% on
Graviton and AMD platforms at sizes 1-4096B. Reads > SMR_INJECT_SIZE
already use smr_rma_fast via the existing total_len > SMR_INJECT_SIZE
condition.

Signed-off-by: Yin Li <yinliq@amazon.com>
@yinliaws yinliaws force-pushed the iov-fast-path-stacked branch 2 times, most recently from 2483259 to c09c08d Compare May 29, 2026 17:53
Replace the fifo-based return queue for IOV operations with per-operation
status slots and an atomic completion counter. This eliminates the
expensive atomic_swap + memory barriers on the receiver side (~150ns on
ARM) while supporting out-of-order completion (no head-of-line blocking).

Design:
- Sender allocates a resp slot (bitmap) and stores index in cmd
- Receiver writes slot.status = 1 (simple store) + increments comp_count
- Sender progress checks comp_count (one load), scans completed slots
- Any slot can complete independently (out-of-order safe)
- MPSC safe (multiple receivers write to different slots)

The fifo is retained for SAR/IPC protocols. Only IOV uses resp slots.

Performance: tagged_bw 8K recovers from 88% to 96% of v2.4.x on Graviton.
writedata_bw 1M recovers to 114% of v2.4.x.

Signed-off-by: Yin Li <yinliq@amazon.com>
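
The resp-slot design in the commit above can be sketched as follows. All names, the 64-slot pool size, and the memory orderings are assumptions made for illustration; the sketch keeps the bitmap sender-private and lets receivers touch only `status` and `comp_count`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define RESP_SLOTS 64  /* assumed pool size: one bit per slot */

/* Hypothetical pool; names are illustrative only. */
struct resp_pool {
	uint64_t in_use;                      /* sender-private bitmap */
	uint64_t seen;                        /* sender-private: slots reaped */
	_Atomic uint32_t status[RESP_SLOTS];  /* written by receivers */
	_Atomic uint64_t comp_count;          /* incremented by receivers */
};

/* Sender: claim the lowest free slot, or return -1 if none is free. */
static int resp_alloc(struct resp_pool *p)
{
	for (int i = 0; i < RESP_SLOTS; i++) {
		uint64_t bit = UINT64_C(1) << i;
		if (!(p->in_use & bit)) {
			p->in_use |= bit;
			atomic_store_explicit(&p->status[i], 0,
					      memory_order_relaxed);
			return i;
		}
	}
	return -1;
}

/* Receiver: plain store to its own slot, then bump the shared counter. */
static void resp_complete(struct resp_pool *p, int idx)
{
	assert(idx >= 0 && idx < RESP_SLOTS);
	atomic_store_explicit(&p->status[idx], 1, memory_order_release);
	atomic_fetch_add_explicit(&p->comp_count, 1, memory_order_release);
}

/* Sender: one load in the common idle case; scan only when the counter
 * moved. Slots may complete in any order. Returns slots reaped. */
static int resp_progress(struct resp_pool *p)
{
	uint64_t count = atomic_load_explicit(&p->comp_count,
					      memory_order_acquire);
	int reaped = 0;

	if (count == p->seen)
		return 0;
	for (int i = 0; i < RESP_SLOTS; i++) {
		uint64_t bit = UINT64_C(1) << i;
		if ((p->in_use & bit) &&
		    atomic_load_explicit(&p->status[i],
					 memory_order_acquire) == 1) {
			p->in_use &= ~bit;
			reaped++;
		}
	}
	p->seen += (uint64_t)reaped;
	return reaped;
}
```

Each receiver writes a different slot, so the only shared write is the counter increment, and a slow peer does not block completions from the others.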


3 participants