fabtests/efa: Add multi-QP support for efa_rma_bw#12298
Conversation
Add support for multiple endpoints (-q / --num-eps) with per-EP flow control and FI_MORE doorbell batching. Each EP independently tracks its posted/completed counts via an efa_rma_bw_ctx embedded in the operation context, with pool slots partitioned per-EP to prevent cross-EP reuse. The benchmark loop follows the perftest pattern: iterating all EPs each pass to avoid starvation, with per-EP window checks gating both posting and FI_MORE. The writedata rx path uses fi_recvmsg with per-EP tracking when FI_RX_CQ_DATA is set. Signed-off-by: Shi Jin <sjina@amazon.com>
| * Each posted operation carries an efa_rma_bw_ctx containing the EP index. | ||
| * On completion, the CQ entry's op_context is used (via container_of) to | ||
| * recover the efa_rma_bw_ctx and attribute the completion to the correct EP. | ||
| * The ep_idx field is placed before fi_context2 in the struct to prevent |
There was a problem hiding this comment.
The provider shouldn't clobber any memory beyond fi_context2 anyway
There was a problem hiding this comment.
Lame comments generated by AI, I will fix it
| * | ||
| * For the writedata receiver (unsolicited write-with-imm without posted | ||
| * receives, i.e. !FI_RX_CQ_DATA), completions have no user context, so | ||
| * per_ep_completed is passed as NULL and only global counting is performed. |
There was a problem hiding this comment.
Is the benchmark actually useful in this case? With only a global counter, some QPs can be idle because other QPs are waiting for completions. Should the test just fail when multiple EPs are used without FI_RX_CQ_DATA?
There was a problem hiding this comment.
That completion counter is only used as credits for each EP to repost rx buffer, when unsolicited write recv is used, such credit doesn't exist by nature
| * Options: | ||
| * --high-pps Enable FI_EFA_WR_HIGH_PPS flag on writes. | ||
| * -o write|writedata|read Select RMA operation (default: write). | ||
| * --post-list <n> Batch n posts per doorbell using FI_MORE (default: 1). |
There was a problem hiding this comment.
Should the default be min(window_size, tx_queue_depth)?
There was a problem hiding this comment.
good catch, will do
| per_ep_posted[ep_idx]++; | ||
|
|
||
| ret = fi_recvmsg(eps[ep_idx], &msg, rx_flags); | ||
| if (ret) { |
There was a problem hiding this comment.
will do. We have a large while loop until completed_cnt reaching the total cnt, so as long as we don't return early in the function for EAGAIN we finally retry
|
|
||
| # Testing the batch mode of fi_efa_rma_bw (--post-list) which batch multiple WQEs with FI_MORE | ||
| @pytest.mark.pr_ci | ||
| @pytest.mark.fabric(params=["efa", "efa-direct"]) |
There was a problem hiding this comment.
Since you don't retry on EAGAIN, this test won't work well for the EFA protocol path. You could either set the homogeneous peers option or fail the test for the protocol path.
There was a problem hiding this comment.
It should retry on EAGAIN, I just missed the handling on fi_recvmsg, will add that
There was a problem hiding this comment.
post_rma also does not retry on EAGAIN https://github.com/shijin-aws/libfabric/blob/d164f90ddea65ab012cea46ce5ab479faaa2f603/fabtests/prov/efa/src/efa_rma_bw.c#L134-L147
If you do want to retry on EAGAIN, then it's probably better to re-use the FT_POST macro
There was a problem hiding this comment.
ret = post_rma(eps[ep_idx],
remote_addrs[ep_idx],
buf, opts.transfer_size,
remote,
&tx_ctx_pool[slot].context,
flags);
if (ret == -FI_EAGAIN) {
per_ep_posted[ep_idx]--;
break;
}
This finally get retried in the outmost while
while (posted_cnt < total_posts ||
completed_cnt < total_posts) {
I intend to use such semantic to keep post and poll separated in the while loop while both finally get progressed
Add support for multiple endpoints (-q / --num-eps) with per-EP flow control and FI_MORE doorbell batching. Each EP independently tracks its posted/completed counts via an efa_rma_bw_ctx embedded in the operation context, with pool slots partitioned per-EP to prevent cross-EP reuse. The benchmark loop follows the perftest pattern: iterating all EPs each pass to avoid starvation, with per-EP window checks gating both posting and FI_MORE. The writedata rx path uses fi_recvmsg with per-EP tracking when FI_RX_CQ_DATA is set.