[RFC] Introduce Completion Counters verbs #1701
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,246 @@ | ||
| --- | ||
| date: 2026-02-09 | ||
| footer: libibverbs | ||
| header: "Libibverbs Programmer's Manual" | ||
| layout: page | ||
| license: 'Licensed under the OpenIB.org BSD license (FreeBSD Variant) - See COPYING.md' | ||
| section: 3 | ||
| title: ibv_create_comp_cntr | ||
| tagline: Verbs | ||
| --- | ||
|
|
||
| # NAME | ||
|
|
||
| **ibv_create_comp_cntr**, **ibv_destroy_comp_cntr** - Create or destroy a | ||
| completion counter | ||
|
|
||
| **ibv_set_comp_cntr**, **ibv_set_err_comp_cntr** - Set the value of a | ||
| completion or error counter | ||
|
|
||
| **ibv_inc_comp_cntr**, **ibv_inc_err_comp_cntr** - Increment a completion or | ||
| error counter | ||
|
|
||
| **ibv_read_comp_cntr**, **ibv_read_err_comp_cntr** - Read the value of a | ||
| completion or error counter | ||
|
|
||
| # SYNOPSIS | ||
|
|
||
| ```c | ||
| #include <infiniband/verbs.h> | ||
|
|
||
| struct ibv_comp_cntr *ibv_create_comp_cntr(struct ibv_context *context, | ||
| struct ibv_comp_cntr_init_attr *cc_attr); | ||
|
|
||
| int ibv_destroy_comp_cntr(struct ibv_comp_cntr *comp_cntr); | ||
|
|
||
| int ibv_set_comp_cntr(struct ibv_comp_cntr *comp_cntr, uint64_t value); | ||
| int ibv_set_err_comp_cntr(struct ibv_comp_cntr *comp_cntr, uint64_t value); | ||
| int ibv_inc_comp_cntr(struct ibv_comp_cntr *comp_cntr, uint64_t amount); | ||
| int ibv_inc_err_comp_cntr(struct ibv_comp_cntr *comp_cntr, uint64_t amount); | ||
| int ibv_read_comp_cntr(struct ibv_comp_cntr *comp_cntr, uint64_t *value); | ||
| int ibv_read_err_comp_cntr(struct ibv_comp_cntr *comp_cntr, uint64_t *value); | ||
| ``` | ||
|
|
||
| # DESCRIPTION | ||
|
|
||
| Completion counters provide a lightweight completion mechanism as an | ||
| alternative or extension to completion queues (CQs). Rather than generating | ||
| individual completion queue entries, a completion counter tracks the aggregate | ||
| number of completed operations. This makes them well suited for applications | ||
| that need to know how many requests have completed without requiring | ||
| per-request details, such as credit based flow control or tracking responses | ||
| from remote peers. | ||
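To make the credit-based flow control use case concrete, here is a minimal sketch in plain C. It assumes the application tracks how many requests it has posted and obtains the completed count from a counter read. `credits_available` and its parameters are hypothetical names used only for illustration, not part of the proposed API, and counter wrap-around is ignored.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of the proposed verbs API).
 * window:    maximum number of requests allowed in flight
 * posted:    requests the application has posted so far
 * completed: completion count, e.g. as obtained from a counter read
 * Returns how many more requests may be posted right now.
 */
static uint64_t credits_available(uint64_t window, uint64_t posted,
				  uint64_t completed)
{
	uint64_t in_flight;

	assert(completed <= posted);	/* wrap-around is not handled here */
	in_flight = posted - completed;
	return in_flight >= window ? 0 : window - in_flight;
}
```

With a window of 8, 10 requests posted and 6 completed, 4 requests are in flight and 4 credits remain. No per-request completion details are needed to make that decision.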
|
|
||
| Each completion counter maintains two distinct 64-bit values: a completion | ||
| count that is incremented on successful completions, and an error count that | ||
| is incremented when operations complete in error. | ||
|
|
||
| **ibv_create_comp_cntr**() allocates a new completion counter for the RDMA | ||
| device context *context*. The properties of the counter are defined by | ||
| *cc_attr*. The maximum number of completion counters a device supports is | ||
| reported by the *max_comp_cntr* field of **ibv_device_attr_ex**. | ||
|
|
||
| **ibv_destroy_comp_cntr**() releases all resources associated with the | ||
| completion counter *comp_cntr*. The counter must not be attached to any QP | ||
| when destroyed. | ||
|
|
||
| **ibv_set_comp_cntr**() sets the completion count of *comp_cntr* to *value*. | ||
|
|
||
| **ibv_set_err_comp_cntr**() sets the error count of *comp_cntr* to *value*. | ||
|
|
||
| **ibv_inc_comp_cntr**() increments the completion count of *comp_cntr* by | ||
| *amount*. | ||
|
|
||
| **ibv_inc_err_comp_cntr**() increments the error count of *comp_cntr* by | ||
| *amount*. | ||
|
|
||
| **ibv_read_comp_cntr**() reads the current completion count of *comp_cntr* | ||
| into *value*. | ||
|
|
||
| **ibv_read_err_comp_cntr**() reads the current error count of *comp_cntr* | ||
| into *value*. | ||
|
|
||
| ## External memory | ||
|
|
||
| By default, the memory backing the counter values is allocated internally. | ||
| When the **IBV_COMP_CNTR_INIT_WITH_EXTERNAL_MEM** flag is set in | ||
| *ibv_comp_cntr_init_attr.flags*, the application provides its own memory for | ||
| the completion and error counts via the *comp_cntr_ext_mem* and | ||
| *err_cntr_ext_mem* fields. The external memory is described by an | ||
| **ibv_memory_location** structure which supports two modes: a virtual address | ||
| (**IBV_MEMORY_LOCATION_VA**), where the application supplies a direct pointer, or | ||
| a DMA-BUF reference (**IBV_MEMORY_LOCATION_DMABUF**), where the application | ||
| supplies a file descriptor and offset into an exported DMA-BUF. When using | ||
| DMA-BUF, the *ptr* field may also be set to provide a process-accessible | ||
| mapping of the memory, which may enable more efficient counter reads. Using | ||
| external memory allows the counter values to | ||
| reside in application-managed buffers or in memory exported through DMA-BUF, | ||
| enabling zero-copy observation of completion progress by co-located processes | ||
| or devices. | ||
|
Contributor
I still question trying to support external memory for counters as a standard API feature. This is similar to supporting external memory for a CQ. I don't see a benefit for supporting this for host memory. GPU/device memory could have a benefit, but, again, this is forcing a specific NIC implementation.

libfabric took a different approach for NIC to GPU interactions in the 1.x release (removed from 2.x because of a lack of implementation). That involved a trigger API, where once a counter hit a certain threshold, memory on the GPU would be updated. That's not the same as keeping the actual counter value on the GPU.

https://github.com/ofiwg/libfabric/blob/v1.22.x/include/rdma/fi_trigger.h

The advantage that approach had is that the trigger could come from the NIC or CPU, based on the vendor implementation. And even the counters could exist totally in SW. I would remove external memory support and defer discussion for how to best support completions to a GPU.
||
|
|
||
| # ARGUMENTS | ||
|
|
||
| ## ibv_comp_cntr | ||
|
|
||
| ```c | ||
| struct ibv_comp_cntr { | ||
| struct ibv_context *context; | ||
| uint32_t handle; | ||
| uint64_t comp_count_max_value; | ||
| uint64_t err_count_max_value; | ||
|
Contributor
I would move these to device attributes, not set per counter. An app may need to know this prior to creating the counters, rather than after the fact.
||
| }; | ||
| ``` | ||
|
|
||
| *context* | ||
| : Device context associated with the completion counter. | ||
|
|
||
| *handle* | ||
| : Kernel object handle for the completion counter. | ||
|
|
||
| *comp_count_max_value* | ||
| : The maximum value the completion count can hold. A subsequent | ||
| increment that would exceed this value wraps the counter to zero. | ||
|
|
||
| *err_count_max_value* | ||
| : The maximum value the error count can hold. A subsequent increment | ||
| that would exceed this value wraps the counter to zero. | ||
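Because the counts wrap, an application that compares two reads needs a wrap-aware difference. The sketch below is one way to compute it, under two assumptions that the text above does not state outright: the counter behaves modulo `max_value + 1`, and it wraps at most once between the two reads. `cntr_delta` is a hypothetical helper name, not part of the proposed API.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: number of increments between two reads of a counter
 * that wraps to zero after max_value.
 * Assumes modulo (max_value + 1) behavior and at most one wrap between reads.
 */
static uint64_t cntr_delta(uint64_t prev, uint64_t cur, uint64_t max_value)
{
	assert(prev <= max_value && cur <= max_value);
	if (cur >= prev)
		return cur - prev;
	/* Wrapped: steps from prev up to max_value, one step to 0, then cur. */
	return (max_value - prev) + cur + 1;
}
```

For example, with `max_value` 255, reads of 250 and then 3 correspond to 9 increments. The arithmetic does not overflow even when `max_value` is `UINT64_MAX`, since the wrapped branch is only taken when `cur < prev`.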
|
|
||
| ## ibv_comp_cntr_init_attr | ||
|
|
||
| ```c | ||
| struct ibv_comp_cntr_init_attr { | ||
| uint32_t comp_mask; | ||
| uint32_t flags; | ||
| struct ibv_memory_location comp_cntr_ext_mem; | ||
| struct ibv_memory_location err_cntr_ext_mem; | ||
| }; | ||
|
Contributor
libfabric attributes include an enum for events to count (transfers or bytes). UEC lists support for both as mandatory. Adding an enum here, even if only transfers are included to start with, would make this more extensible.

libfabric also supports blocking counter read calls. It's possible this might align with a verbs ibv_comp_channel (or ibv_cntr_channel?) object. I don't see any statement in UEC regarding blocking counters (or CQs). However, since libfabric counters include blocking support, they may simply not have thought it important to call that support out separately. (I don't think MPI uses blocking counters.) CXI implements blocking reads by setting a triggered threshold. HW writes the counter value to memory once it hits the trigger. SW in this case spins until the counter is triggered. At least based on UEC 1.0, I would defer adding wait or trigger APIs until there's a larger discussion.
Contributor
Author
Regarding events vs. bytes I think it's already easy to extend by adding another flag like For your comment about blocking counters, I would try keep
Contributor
I would still remove external memory support, particularly in the initial API submission. I would also include the enum to select the counter type (bytes vs. completions). This isn't something that should be a flag, as only one is usable at a time. Even if the enum only has 1 option to start with, introducing it now makes compatibility much easier, both from the perspective of verbs, but also the user.
||
| ``` | ||
|
|
||
| *comp_mask* | ||
| : Bitmask specifying what fields in the structure are valid. | ||
|
|
||
| *flags* | ||
| : Creation flags. The following flags are supported: | ||
|
|
||
| **IBV_COMP_CNTR_INIT_WITH_EXTERNAL_MEM** - Use application-provided | ||
| memory for the counter values, as specified by *comp_cntr_ext_mem* | ||
| and *err_cntr_ext_mem*. | ||
|
|
||
| *comp_cntr_ext_mem* | ||
| : Memory location for the completion count when using external memory. | ||
|
|
||
| *err_cntr_ext_mem* | ||
| : Memory location for the error count when using external memory. | ||
|
|
||
| ## ibv_memory_location | ||
|
|
||
| ```c | ||
| enum ibv_memory_location_type { | ||
| IBV_MEMORY_LOCATION_VA, | ||
| IBV_MEMORY_LOCATION_DMABUF, | ||
| }; | ||
|
|
||
| struct ibv_memory_location { | ||
| uint8_t *ptr; | ||
| struct { | ||
| uint64_t offset; | ||
| int32_t fd; | ||
| uint32_t reserved; | ||
| } dmabuf; | ||
| uint8_t type; | ||
| uint8_t reserved[7]; | ||
| }; | ||
| ``` | ||
|
|
||
| *type* | ||
| : The type of memory location. **IBV_MEMORY_LOCATION_VA** for a virtual | ||
| address, or **IBV_MEMORY_LOCATION_DMABUF** for a DMA-BUF reference. | ||
|
|
||
| *ptr* | ||
| : Virtual address pointer. Required when type is | ||
| **IBV_MEMORY_LOCATION_VA**. When type is | ||
| **IBV_MEMORY_LOCATION_DMABUF**, may optionally be set to provide a | ||
| process-accessible mapping of the DMA-BUF memory. Otherwise should be | ||
| NULL. | ||
|
|
||
| *dmabuf.fd* | ||
| : DMA-BUF file descriptor (used when type is | ||
| **IBV_MEMORY_LOCATION_DMABUF**). | ||
|
|
||
| *dmabuf.offset* | ||
| : Offset within the DMA-BUF. | ||
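As an illustration of the virtual-address mode, the sketch below fills an `ibv_memory_location` for one counter slot. The type definitions are local copies of the ones shown above, so the snippet builds without a libibverbs that carries this RFC. `memory_location_va` is a hypothetical helper, and setting `dmabuf.fd` to -1 is an assumption: the text does not say what the unused DMA-BUF fields should hold in VA mode.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local copies of the proposed definitions, for a self-contained sketch. */
enum ibv_memory_location_type {
	IBV_MEMORY_LOCATION_VA,
	IBV_MEMORY_LOCATION_DMABUF,
};

struct ibv_memory_location {
	uint8_t *ptr;
	struct {
		uint64_t offset;
		int32_t fd;
		uint32_t reserved;
	} dmabuf;
	uint8_t type;
	uint8_t reserved[7];
};

/* Hypothetical helper: describe an application-owned 64-bit counter slot. */
static struct ibv_memory_location memory_location_va(uint64_t *counter_slot)
{
	struct ibv_memory_location loc;

	assert(counter_slot != NULL);
	memset(&loc, 0, sizeof(loc));	/* zero the reserved fields */
	loc.type = IBV_MEMORY_LOCATION_VA;
	loc.ptr = (uint8_t *)counter_slot;
	loc.dmabuf.fd = -1;		/* assumption: mark the fd as unused */
	return loc;
}
```

Under the proposal as written, the result would be assigned to *comp_cntr_ext_mem* or *err_cntr_ext_mem* together with the **IBV_COMP_CNTR_INIT_WITH_EXTERNAL_MEM** flag.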
|
|
||
| # RETURN VALUE | ||
|
|
||
| **ibv_create_comp_cntr**() returns a pointer to the allocated ibv_comp_cntr | ||
| object, or NULL if the request fails (and sets errno to indicate the failure | ||
| reason). | ||
|
|
||
| **ibv_destroy_comp_cntr**(), **ibv_set_comp_cntr**(), | ||
| **ibv_set_err_comp_cntr**(), **ibv_inc_comp_cntr**(), | ||
| **ibv_inc_err_comp_cntr**(), **ibv_read_comp_cntr**(), and | ||
| **ibv_read_err_comp_cntr**() return 0 on success, or the value of errno on | ||
| failure (which indicates the failure reason). | ||
|
|
||
| # ERRORS | ||
|
|
||
| ENOTSUP | ||
| : Completion counters are not supported on this device, or the | ||
| requested operation is not supported for the given counter | ||
| configuration. | ||
|
|
||
| ENOMEM | ||
| : Not enough resources to create the completion counter. | ||
|
|
||
| EINVAL | ||
| : Invalid argument(s) passed. | ||
|
|
||
| EBUSY | ||
| : The completion counter is still attached to a QP | ||
| (**ibv_destroy_comp_cntr**() only). | ||
|
|
||
| # NOTES | ||
|
|
||
| Counter values must only be updated using **ibv_set_comp_cntr**(), | ||
| **ibv_set_err_comp_cntr**(), **ibv_inc_comp_cntr**(), or | ||
| **ibv_inc_err_comp_cntr**(). Counter memory supplied by the application | ||
| must not be modified directly. | ||
|
|
||
| Updates made to counter values (e.g. via **ibv_set_comp_cntr**() or | ||
| **ibv_inc_comp_cntr**()) may not be immediately visible when reading the | ||
| counter via **ibv_read_comp_cntr**() or **ibv_read_err_comp_cntr**(). A small | ||
| delay may occur between the update and the observed value. However, the final | ||
| updated value will eventually be reflected. | ||
|
|
||
| Applications should ensure that the counter value is stable before calling | ||
| **ibv_set_comp_cntr**() or **ibv_set_err_comp_cntr**(). Otherwise, concurrent | ||
| updates may be lost. | ||
|
|
||
| # SEE ALSO | ||
|
|
||
| **ibv_qp_attach_comp_cntr**(3), **ibv_create_cq**(3), | ||
| **ibv_create_cq_ex**(3), **ibv_create_qp**(3) | ||
|
|
||
| # AUTHORS | ||
|
|
||
| Michael Margolin <mrgolin@amazon.com> | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,120 @@ | ||
| --- | ||
| date: 2026-02-09 | ||
| footer: libibverbs | ||
| header: "Libibverbs Programmer's Manual" | ||
| layout: page | ||
| license: 'Licensed under the OpenIB.org BSD license (FreeBSD Variant) - See COPYING.md' | ||
| section: 3 | ||
| title: ibv_qp_attach_comp_cntr | ||
| tagline: Verbs | ||
| --- | ||
|
|
||
| # NAME | ||
|
|
||
| **ibv_qp_attach_comp_cntr** - Attach a completion counter to a QP | ||
|
|
||
| # SYNOPSIS | ||
|
|
||
| ```c | ||
| #include <infiniband/verbs.h> | ||
|
|
||
| int ibv_qp_attach_comp_cntr(struct ibv_qp *qp, | ||
| struct ibv_comp_cntr *comp_cntr, | ||
| struct ibv_comp_cntr_attach_attr *attr); | ||
| ``` | ||
|
|
||
| # DESCRIPTION | ||
|
|
||
| **ibv_qp_attach_comp_cntr**() attaches the completion counter *comp_cntr* to | ||
| the queue pair *qp*. The *attr* argument specifies which operation types | ||
| should update the counter. | ||
|
|
||
| The QP must be in **IBV_QPS_RESET** or **IBV_QPS_INIT** state when attaching | ||
| a completion counter. Attempting to attach a counter to a QP in any other | ||
| state will fail with EINVAL. | ||
|
|
||
| The completion counter starts collecting values for the specified QP once | ||
| attached. Attaching the same completion counter to multiple QPs will | ||
| accumulate values from all attached QPs into the same counter. | ||
|
|
||
| The *op_mask* field controls which operation completions are counted. Local | ||
| operations (**IBV_COMP_CNTR_ATTACH_OP_SEND**, **IBV_COMP_CNTR_ATTACH_OP_RECV**, | ||
| **IBV_COMP_CNTR_ATTACH_OP_RDMA_READ**, **IBV_COMP_CNTR_ATTACH_OP_RDMA_WRITE**) | ||
| count completions initiated by the local QP. Remote operations | ||
| (**IBV_COMP_CNTR_ATTACH_OP_REMOTE_RDMA_READ**, | ||
| **IBV_COMP_CNTR_ATTACH_OP_REMOTE_RDMA_WRITE**) count completions of incoming | ||
| RDMA operations initiated by the remote side. Supported *op_mask* values may | ||
| vary by device; unsupported values will result in an ENOTSUP error. | ||
|
|
||
| Multiple completion counters can be attached to the same QP, provided their | ||
| *op_mask* values do not overlap. Each QP and operation type pair can be | ||
| associated with at most one completion counter. Attempting to attach a | ||
| counter with an *op_mask* that conflicts with an already attached counter | ||
| will fail. | ||
|
|
||
| There is no explicit detach operation. A completion counter is implicitly | ||
| detached when the QP it is attached to is destroyed. A completion counter | ||
| cannot be destroyed while it is still attached to any QP; the QP must be | ||
| destroyed first. | ||
|
|
||
| # ARGUMENTS | ||
|
|
||
| *qp* | ||
| : The queue pair to attach the completion counter to. | ||
|
|
||
| *comp_cntr* | ||
| : The completion counter to attach, previously created with | ||
| **ibv_create_comp_cntr**(). | ||
|
|
||
| *attr* | ||
| : Attach attributes specifying which operation types update the counter. | ||
|
|
||
| ## ibv_comp_cntr_attach_attr | ||
|
|
||
| ```c | ||
| enum ibv_comp_cntr_attach_op { | ||
| IBV_COMP_CNTR_ATTACH_OP_SEND = 1 << 0, | ||
| IBV_COMP_CNTR_ATTACH_OP_RECV = 1 << 1, | ||
| IBV_COMP_CNTR_ATTACH_OP_RDMA_READ = 1 << 2, | ||
| IBV_COMP_CNTR_ATTACH_OP_REMOTE_RDMA_READ = 1 << 3, | ||
| IBV_COMP_CNTR_ATTACH_OP_RDMA_WRITE = 1 << 4, | ||
| IBV_COMP_CNTR_ATTACH_OP_REMOTE_RDMA_WRITE = 1 << 5, | ||
| }; | ||
|
|
||
| struct ibv_comp_cntr_attach_attr { | ||
| uint32_t comp_mask; | ||
| uint32_t op_mask; | ||
| }; | ||
| ``` | ||
|
Contributor
This behavior seems aligned with libfabric, so it should work with UEC.
||
|
|
||
| *comp_mask* | ||
| : Bitmask specifying what fields in the structure are valid. | ||
|
|
||
| *op_mask* | ||
| : Bitmask of **ibv_comp_cntr_attach_op** values specifying which | ||
| operation types should update the counter. | ||
|
|
||
| # RETURN VALUE | ||
|
|
||
| **ibv_qp_attach_comp_cntr**() returns 0 on success, or the value of errno on | ||
| failure (which indicates the failure reason). | ||
|
|
||
| # ERRORS | ||
|
|
||
| EINVAL | ||
| : Invalid argument(s) passed. | ||
|
|
||
| ENOTSUP | ||
| : Requested operation is not supported on this device. | ||
|
|
||
| EBUSY | ||
| : The *op_mask* overlaps with a completion counter already attached | ||
| to this QP. | ||
|
|
||
| # SEE ALSO | ||
|
|
||
| **ibv_create_comp_cntr**(3), **ibv_create_qp**(3) | ||
|
|
||
| # AUTHORS | ||
|
|
||
| Michael Margolin <mrgolin@amazon.com> | ||
libfabric and UEC call out 2 separate types of counters. One counts completed data transfer operations. The other counts the number of application bytes carried by a data transfer. For example, the number of bytes written into a RDMA write target buffer.
Thanks for your feedback, replied in another comment below.