infiniband: add EFA (AWS Elastic Fabric Adapter) support #3668

Open
jyizheng wants to merge 1 commit into prometheus:master from basetenlabs:feat-efa-collector-support

@jyizheng

What

Add support for AWS Elastic Fabric Adapter (EFA) NICs in the infiniband collector.

Why

EFA devices show up under /sys/class/infiniband/ like IB HCAs but don't
follow the IB spec: they store byte and packet counts in
hw_counters/{tx,rx}_{bytes,pkts} instead of counters/port_xmit_data etc.
As a result, the collector emits state_id and rate_bytes_per_second for
rdmap* devices, but port_data_* / port_packets_* are silently absent on
AWS p4d/p5/p6 instances.

How

  • Detect EFA by PCI vendor 0x1d0f.
  • For EFA devices, read hw_counters/ and emit under the same existing
    port_data_* / port_packets_* metric names so IB dashboards just work.
  • Add 8 EFA-only diagnostic counters under efa_* prefix
    (retrans / rx_drops / rdma_read / rdma_write / ...).
  • IB devices are untouched.

Tests

  • 22 unit tests; 100% coverage of the new helpers, 93% of Update().
  • Verified on a real p6-b200.48xlarge node: exporter output matches
    cat /sys/class/infiniband/rdmap*/ports/1/hw_counters/tx_bytes
    byte-for-byte.

Docs: docs/INFINIBAND_EFA.md.

infiniband: add EFA collector unit tests

docs: document EFA support in the infiniband collector
Signed-off-by: Yizheng Jiao <jyizheng@gmail.com>
@jyizheng jyizheng force-pushed the feat-efa-collector-support branch from 66dd657 to fab5e77 Compare May 28, 2026 01:50