UCT/IB: Propagate traffic class in address for symmetric QoS#11309

Open
ndg8743 wants to merge 2 commits into openucx:master from ndg8743:fix/symmetric-traffic-class

ndg8743 commented Mar 30, 2026

Summary

  • Pack the configured traffic class into the IB address exchanged during connection setup
  • When the remote address carries a traffic class and the local interface has none configured, apply the remote value so both directions use the same TC
  • Update the DevX RC QP connect path to use ah_attr->grh.traffic_class instead of always reading from local iface config

Fixes #10325
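
The behavior described in the summary can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: every `sketch_` and `SKETCH_` name, the two-byte layout, the `-1` "unset" sentinel, and the fallback to 0 when neither side has a traffic class are assumptions made for the example. Only the idea (pack the local TC behind a flag bit, and let the receiver adopt it when it has no TC of its own) comes from the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bit marking "a traffic class byte follows". */
#define SKETCH_ADDR_FLAG_TRAFFIC_CLASS 0x80u

/* Hypothetical sentinel: no traffic class configured on this side. */
#define SKETCH_TC_UNSET (-1)

/* Pack a flags byte, followed by one traffic class byte if configured.
 * Returns the number of bytes written (1 or 2). */
static size_t sketch_addr_pack(uint8_t *buf, uint8_t base_flags, int local_tc)
{
    assert(buf != NULL);
    buf[0] = base_flags & (uint8_t)~SKETCH_ADDR_FLAG_TRAFFIC_CLASS;
    if (local_tc == SKETCH_TC_UNSET) {
        return 1;
    }
    buf[0] |= SKETCH_ADDR_FLAG_TRAFFIC_CLASS;
    buf[1]  = (uint8_t)local_tc;
    return 2;
}

/* Return the traffic class carried by a packed address, or
 * SKETCH_TC_UNSET if the flag bit is clear. */
static int sketch_addr_unpack_tc(const uint8_t *buf)
{
    assert(buf != NULL);
    return (buf[0] & SKETCH_ADDR_FLAG_TRAFFIC_CLASS) ? (int)buf[1]
                                                     : SKETCH_TC_UNSET;
}

/* Pick the traffic class for the connection: a locally configured value
 * wins, otherwise the remote value is adopted. The final fallback to 0
 * is an assumption of this sketch. */
static uint8_t sketch_resolve_tc(int local_tc, int remote_tc)
{
    if (local_tc != SKETCH_TC_UNSET) {
        return (uint8_t)local_tc;
    }
    if (remote_tc != SKETCH_TC_UNSET) {
        return (uint8_t)remote_tc;
    }
    return 0;
}
```

With this shape, a side that has no traffic class configured and receives an address packed with TC 106 resolves to 106, while a side configured with TC 32 keeps 32 regardless of what the peer sent.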

Pack the traffic class into the IB address exchanged during connection
setup so the remote side can apply it when the local interface has no
traffic class configured. This fixes one-directional DSCP/TC when
UCX_IB_TRAFFIC_CLASS is set, ensuring both sides of an RC connection
use the same traffic class value.

Also update the DevX QP connect path to read traffic class from
ah_attr (which may carry the remote value) instead of always using
the local iface config.

Closes openucx#10325
ndg8743 force-pushed the fix/symmetric-traffic-class branch from 20b1e67 to 0c982d8 on March 31, 2026 02:16

ndg8743 commented Apr 2, 2026

CI failures are infrastructure flakes on RoCE worker 3 (mlx5_1 device degraded), unrelated to traffic class changes:

  • Tests roce worker 3: rc_mlx5/test_cqe_zipping.zcopy/1 — send_cnt (512) != recv_cnt (287), not all completions received. Then dcx/test_ucp_am_nbx_closed_ep.rx_long_am_on_closed_ep/3 timed out and aborted remaining tests.
  • ASAN roce worker 3: udx/test_ucp_loopback.envelope variants (put_bw, get_bw, amo_cswap, put_lat, amo_fadd) — all hit ibv_poll_cq(UMR CQ) timeout on mlx5_1. Then rc_verbs/uct_p2p_mix_test.mix_10000/2 timed out.

All failures trace back to mlx5_1 UMR CQ timeouts on the same worker node. Retriggering CI.

ndg8743 force-pushed the fix/symmetric-traffic-class branch from d68a22e to 0c982d8 on April 3, 2026 04:44
The UCT_IB_ADDRESS_FLAG_TRAFFIC_CLASS at BIT(7) overlaps with the
RoCE version bits in the flags byte. The version extraction does
flags >> 5, which includes bit 7, corrupting the RoCE version when
traffic class is packed. Mask out the traffic class flag before
extracting the version.

Also initialize traffic_class in test pack_params to avoid reading
uninitialized memory when the traffic class flag is set.
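
The bit overlap described in this commit message can be reproduced in isolation. The sketch below is illustrative only: the `sketch_` and `SKETCH_` names are hypothetical, and it assumes that after masking the version occupies bits 5 and 6 of the flags byte. The BIT(7) position and the `flags >> 5` extraction are taken from the message above.

```c
#include <assert.h> /* only needed for the checks in the usage note */
#include <stdint.h>

#define SKETCH_FLAG_TRAFFIC_CLASS 0x80u /* BIT(7), per the commit message */
#define SKETCH_ROCE_VERSION_SHIFT 5     /* per "flags >> 5" */

/* Original extraction: bit 7 leaks into the result. */
static unsigned sketch_roce_version_unmasked(uint8_t flags)
{
    return (unsigned)(flags >> SKETCH_ROCE_VERSION_SHIFT);
}

/* Fixed extraction: clear the traffic class flag before shifting. */
static unsigned sketch_roce_version_masked(uint8_t flags)
{
    return (unsigned)((flags & (uint8_t)~SKETCH_FLAG_TRAFFIC_CLASS) >>
                      SKETCH_ROCE_VERSION_SHIFT);
}
```

For a flags byte of `0xC0` (version value 2 in bits 5 and 6, plus the traffic class flag), the unmasked shift yields 6 while the masked one yields 2; `assert(sketch_roce_version_masked(0xC0) == 2)` holds. When the traffic class flag is clear, both functions agree.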
ndg8743 force-pushed the fix/symmetric-traffic-class branch from 0c982d8 to 8962d27 on April 4, 2026 16:44
openucx deleted a comment from svc-nixl on May 4, 2026

Development

Successfully merging this pull request may close these issues.

Traffic class not fully applied via UCX_IB_TRAFFIC_CLASS in one direction