
prov/rxm: add pluggable multi-QP routing with round-robin selector #12234

Open
imasuari wants to merge 4 commits into ofiwg:main from imasuari:mqp01-rr-selector


@imasuari
Contributor

Summary

  • Introduce a pluggable rxm_qp_selector interface that maps each TX operation to a msg_ep index, replacing the single scalar msg_ep with a msg_eps[] array per connection
  • Implement a round-robin selector that steers RMA/rendezvous-RMA traffic across msg_ep[1..N-1] while pinning control-plane ops to msg_ep[0], with per-msg_id SAR segment pinning for in-order delivery
  • Add FI_OFI_RXM_NUM_MSG_EPS env var (default 1, clamped to [1, 255]) for runtime configuration. Currently all slots alias a single real msg_ep (mock multi-QP), and values above 1 are restricted to the verbs provider
  • Add debug trace logging for QP selection at FI_LOG_EP_DATA level
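
The clamping rule for FI_OFI_RXM_NUM_MSG_EPS can be sketched as below. This is only an illustration of the described behaviour, not the code in this PR: sketch_parse_num_msg_eps() and the is_verbs flag are hypothetical names, and the real provider would read the value through libfabric's own parameter handling rather than a raw string.

```c
#include <stdlib.h>

#define SKETCH_NUM_MSG_EPS_MIN 1
#define SKETCH_NUM_MSG_EPS_MAX 255

/*
 * Hypothetical helper: turn the text of FI_OFI_RXM_NUM_MSG_EPS into a
 * usable count. Missing or unparsable input yields the default of 1.
 * Non-verbs providers are forced to 1, as described in the summary.
 */
static int sketch_parse_num_msg_eps(const char *text, int is_verbs)
{
	char *end;
	long val;

	if (!text || !*text)
		return SKETCH_NUM_MSG_EPS_MIN;

	/* strtol saturates on overflow, so the clamp below still applies */
	val = strtol(text, &end, 10);
	if (end == text || *end != '\0')
		return SKETCH_NUM_MSG_EPS_MIN;

	if (val < SKETCH_NUM_MSG_EPS_MIN)
		val = SKETCH_NUM_MSG_EPS_MIN;
	if (val > SKETCH_NUM_MSG_EPS_MAX)
		val = SKETCH_NUM_MSG_EPS_MAX;

	/* Multi-ep is only meaningful where each msg_ep maps to its own QP */
	if (!is_verbs)
		return SKETCH_NUM_MSG_EPS_MIN;

	return (int) val;
}
```

A caller would pass getenv("FI_OFI_RXM_NUM_MSG_EPS") as the text argument; a result of 1 corresponds to the single_qp selector and anything larger to round_robin.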

Design

The selector is a simple vtable (select + destroy callbacks) attached to each rxm_conn. Two implementations are provided:

  • single_qp — stateless, always returns index 0 (used when num_msg_eps == 1)
  • round_robin — heap-allocated, steers RMA/RNDV_RMA round-robin across data QPs with a SAR pin table to maintain segment ordering

Wire behaviour is unchanged in this series — all msg_eps[] slots point to the same underlying endpoint. This lays the groundwork for real multi-QP connections in a follow-up series.
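
For readers unfamiliar with the pattern, a selector vtable with select and destroy callbacks might look roughly like the sketch below. All sketch_* names, the field layout, and the reduced op-type enum are assumptions made for illustration; they are not the definitions in this series.

```c
#include <stddef.h>
#include <stdint.h>

/* Reduced, hypothetical op-type list; the series defines its own enum */
enum sketch_op_type {
	SKETCH_OP_EAGER,
	SKETCH_OP_RMA,
};

struct sketch_selector;

/* The vtable: one callback to pick an index, one to release state */
struct sketch_selector_ops {
	size_t (*select)(struct sketch_selector *sel,
			 enum sketch_op_type op, uint64_t msg_id);
	void (*destroy)(struct sketch_selector *sel);
};

struct sketch_selector {
	const struct sketch_selector_ops *ops;
};

/* single_qp: stateless, always index 0 */
static size_t sketch_single_select(struct sketch_selector *sel,
				   enum sketch_op_type op, uint64_t msg_id)
{
	(void) sel;
	(void) op;
	(void) msg_id;
	return 0;
}

/* Nothing to free: the single_qp selector is a static object */
static void sketch_single_destroy(struct sketch_selector *sel)
{
	(void) sel;
}

static const struct sketch_selector_ops sketch_single_ops = {
	.select = sketch_single_select,
	.destroy = sketch_single_destroy,
};

static struct sketch_selector sketch_single_qp = {
	.ops = &sketch_single_ops,
};

/* Stand-in for a connection that owns a selector */
struct sketch_conn {
	struct sketch_selector *selector;
};

/* Dispatch through the vtable to get the msg_eps[] index for one TX op */
static size_t sketch_conn_pick(struct sketch_conn *conn,
			       enum sketch_op_type op, uint64_t msg_id)
{
	return conn->selector->ops->select(conn->selector, op, msg_id);
}
```

Because single_qp carries no state it can be shared by every connection, while a stateful policy would allocate its own object and free it in destroy.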

Contributor

@ooststep left a comment


The underlying transport is mostly transparent to rxm, which deals in endpoint types (msg_ep) rather than sockets, QPs, etc. In that respect, this is more of a prioritized multi-ep/multi-rail scheme than something specific to multi-QP, though I understand that to be the primary intent.


struct rxm_conn;

enum rxm_op_type {
Contributor


We already have rxm_ctrl_* to mark each message in the pkt. Can those not be used rather than adding another enum and function argument to track the same information?

imasuari added 4 commits May 28, 2026 06:10

Introduce a pluggable rxm_ep_selector interface that maps each TX
operation to a msg_ep index. The selector receives the op type
(EAGER, SAR_FIRST/MIDDLE/LAST, RNDV_CTRL/RMA, RMA, ATOMIC) and the SAR
msg_id, allowing future policies to steer traffic across multiple
underlying msg endpoints.

Wire the selector and a msg_eps[] array into rxm_conn, replacing the
single scalar msg_ep alias. All TX paths (eager/tagged sends, SAR,
RNDV control and payload, pass-through RMA, atomics) are updated to
call rxm_conn_msg_ep() instead of accessing msg_ep directly.

The initial implementation ships rxm_selector_single_ep, which always
returns index 0 and preserves current single-msg_ep behaviour.

Signed-off-by: Itai Masuari <imasuari@habana.ai>

Implement a round-robin msg_ep selector policy and SAR segment
pinning.

RR policy routes RMA and rendezvous-RMA ops round-robin across
msg_ep[1..N-1] while keeping control-plane traffic (eager, tagged,
RNDV_CTRL, atomics) pinned to msg_ep[0]. Falls back to msg_ep[0]
when num_msg_eps == 1.

SAR pinning ensures in-order delivery: SAR_FIRST stays on msg_ep[0];
the first SAR_MIDDLE picks a data msg_ep via RR and records the
choice in an index_map keyed by msg_id; subsequent MIDDLEs inherit
the pin; SAR_LAST clears the entry.
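
The routing and pinning rules above can be sketched as follows. This is a simplified model, not the code in this commit: the sk_* names are hypothetical, a small direct-indexed array stands in for the index_map, and it assumes SAR_LAST is sent on the pinned index before the entry is cleared, which the commit message does not state explicitly.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum sk_op {
	SK_EAGER, SK_SAR_FIRST, SK_SAR_MIDDLE, SK_SAR_LAST,
	SK_RNDV_CTRL, SK_RNDV_RMA, SK_RMA, SK_ATOMIC,
};

#define SK_MAX_MSG_IDS 256	/* assumption: small msg_id space for the sketch */

struct sk_rr {
	size_t num_msg_eps;
	size_t next;			/* next data index, in [1, num_msg_eps - 1] */
	uint8_t pins[SK_MAX_MSG_IDS];	/* 0 means "no pin" (data indices are >= 1) */
};

static void sk_rr_init(struct sk_rr *rr, size_t num_msg_eps)
{
	memset(rr, 0, sizeof(*rr));
	rr->num_msg_eps = num_msg_eps;
	rr->next = 1;
}

/* Hand out data indices 1, 2, ..., N-1, then wrap back to 1 */
static size_t sk_rr_next(struct sk_rr *rr)
{
	size_t idx = rr->next;

	rr->next = (idx + 1 < rr->num_msg_eps) ? idx + 1 : 1;
	return idx;
}

static size_t sk_rr_select(struct sk_rr *rr, enum sk_op op, uint64_t msg_id)
{
	size_t idx;

	if (rr->num_msg_eps <= 1)
		return 0;	/* fall back to msg_ep[0] */

	switch (op) {
	case SK_RMA:
	case SK_RNDV_RMA:
		return sk_rr_next(rr);
	case SK_SAR_MIDDLE:
		if (msg_id >= SK_MAX_MSG_IDS)
			return 0;	/* outside the sketch's table: stay on 0 */
		if (!rr->pins[msg_id])
			rr->pins[msg_id] = (uint8_t) sk_rr_next(rr);
		return rr->pins[msg_id];
	case SK_SAR_LAST:
		if (msg_id >= SK_MAX_MSG_IDS)
			return 0;
		idx = rr->pins[msg_id];	/* 0 if no MIDDLE was ever sent */
		rr->pins[msg_id] = 0;	/* clear the pin */
		return idx;
	default:
		return 0;	/* EAGER, SAR_FIRST, RNDV_CTRL, ATOMIC */
	}
}
```

Using 0 as the "no pin" marker works here only because data traffic never lands on index 0; a real map keyed by msg_id would track presence separately.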

A destroy hook is added to the selector vtable so stateful selectors
(like the RR selector with its sar_pins map) can own their cleanup.

Signed-off-by: Itai Masuari <imasuari@habana.ai>

Add runtime configuration for multi-msg_ep routing via the
FI_OFI_RXM_NUM_MSG_EPS environment variable (default 1, clamped to
[1, 255]).

When num_msg_eps == 1, use the stateless single-ep selector to avoid
an unnecessary heap allocation and SAR-pin map. When num_msg_eps > 1,
allocate the round-robin selector.

Currently all msg_eps[] slots alias the single real msg endpoint so
wire behaviour is unchanged; the selector still picks among indices,
paving the way for real multi-msg_ep support (e.g. multi-QP over
verbs).

Multi-ep (num_msg_eps > 1) is restricted to the verbs provider, where
each msg_ep maps to a distinct QP; other providers clamp to 1 with an
info-level log.

Signed-off-by: Itai Masuari <imasuari@habana.ai>

Log conn, op type, msg_id, and selected msg_ep index at FI_LOG_EP_DATA
level to aid tracing multi-ep path selection.

Signed-off-by: Itai Masuari <imasuari@habana.ai>

@gad-arbel force-pushed the mqp01-rr-selector branch from de7b1e8 to 4146705 May 28, 2026 13:16

2 participants