Skip to content

fabtests/efa: Add multi-QP support for efa_rma_bw#12298

Open
shijin-aws wants to merge 1 commit into
ofiwg:mainfrom
shijin-aws:pps-dev
Open

fabtests/efa: Add multi-QP support for efa_rma_bw#12298
shijin-aws wants to merge 1 commit into
ofiwg:mainfrom
shijin-aws:pps-dev

Conversation

@shijin-aws
Copy link
Copy Markdown
Contributor

Add support for multiple endpoints (-q / --num-eps) with per-EP flow control and FI_MORE doorbell batching. Each EP independently tracks its posted/completed counts via an efa_rma_bw_ctx embedded in the operation context, with pool slots partitioned per-EP to prevent cross-EP reuse. The benchmark loop follows the perftest pattern: iterating all EPs each pass to avoid starvation, with per-EP window checks gating both posting and FI_MORE. The writedata rx path uses fi_recvmsg with per-EP tracking when FI_RX_CQ_DATA is set.

@shijin-aws shijin-aws requested a review from a team May 28, 2026 00:51
Add support for multiple endpoints (-q / --num-eps) with per-EP flow
control and FI_MORE doorbell batching. Each EP independently tracks its
posted/completed counts via an efa_rma_bw_ctx embedded in the operation
context, with pool slots partitioned per-EP to prevent cross-EP reuse.
The benchmark loop follows the perftest pattern: iterating all EPs each
pass to avoid starvation, with per-EP window checks gating both posting
and FI_MORE. The writedata rx path uses fi_recvmsg with per-EP tracking
when FI_RX_CQ_DATA is set.

Signed-off-by: Shi Jin <sjina@amazon.com>
* Each posted operation carries an efa_rma_bw_ctx containing the EP index.
* On completion, the CQ entry's op_context is used (via container_of) to
* recover the efa_rma_bw_ctx and attribute the completion to the correct EP.
* The ep_idx field is placed before fi_context2 in the struct to prevent
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The provider shouldn't clobber any memory beyond fi_context2 anyway

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lame comments generated by AI, I will fix it

*
* For the writedata receiver (unsolicited write-with-imm without posted
* receives, i.e. !FI_RX_CQ_DATA), completions have no user context, so
* per_ep_completed is passed as NULL and only global counting is performed.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the benchmark actually useful in this case? With only a global counter, some QPs can be idle because other QPs are waiting for completions. Should the test just fail when multiple EPs are used without FI_RX_CQ_DATA?

Copy link
Copy Markdown
Contributor Author

@shijin-aws shijin-aws May 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That completion counter is only used as credits for each EP to repost rx buffer, when unsolicited write recv is used, such credit doesn't exist by nature

* Options:
* --high-pps Enable FI_EFA_WR_HIGH_PPS flag on writes.
* -o write|writedata|read Select RMA operation (default: write).
* --post-list <n> Batch n posts per doorbell using FI_MORE (default: 1).
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the default be min(window_size, tx_queue_depth)?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch, will do

per_ep_posted[ep_idx]++;

ret = fi_recvmsg(eps[ep_idx], &msg, rx_flags);
if (ret) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No retry on EAGAIN?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will do. We have a large while loop until completed_cnt reaching the total cnt, so as long as we don't return early in the function for EAGAIN we finally retry


# Testing the batch mode of fi_efa_rma_bw (--post-list) which batch multiple WQEs with FI_MORE
@pytest.mark.pr_ci
@pytest.mark.fabric(params=["efa", "efa-direct"])
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you don't retry on EAGAIN, this test won't work well for the EFA protocol path. You could either set the homogeneous peers option or fail the test for the protocol path.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should retry on EAGAIN, I just missed the handling on fi_recvmsg, will add that

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

post_rma also does not retry on EAGAIN https://github.com/shijin-aws/libfabric/blob/d164f90ddea65ab012cea46ce5ab479faaa2f603/fabtests/prov/efa/src/efa_rma_bw.c#L134-L147

If you do want to retry on EAGAIN, then it's probably better to re-use the FT_POST macro

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ret = post_rma(eps[ep_idx],
							remote_addrs[ep_idx],
							buf, opts.transfer_size,
							remote,
							&tx_ctx_pool[slot].context,
							flags);
					if (ret == -FI_EAGAIN) {
						per_ep_posted[ep_idx]--;
						break;
					}

This finally get retried in the outmost while

while (posted_cnt < total_posts ||
		       completed_cnt < total_posts) {

I intend to use such semantic to keep post and poll separated in the while loop while both finally get progressed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants