Commit bcb98fa

Adding RDMA fields to Hash Object
Signed-off-by: Satheesh Kumar Karra <skarra@marvell.com>
1 parent f4695e9 commit bcb98fa

4 files changed

Lines changed: 216 additions & 0 deletions

Lines changed: 164 additions & 0 deletions
@@ -0,0 +1,164 @@
# [SAI] Hashing Enhancements for Efficient RoCE Traffic Distribution
-------------------------------------------------------------------------------
 Title       | Hashing Enhancements for Efficient RoCE Traffic Distribution
-------------|-----------------------------------------------------------------
 Authors     | Satheesh Kumar Karra, Ravindranath C K (Marvell)
 Status      | In review
 Type        | Standards track
 Created     | 2025-02-27
 SAI-Version | 1.16
-------------------------------------------------------------------------------

## 1.0 Introduction


SAI (Switch Abstraction Interface) supports customization of hash field
selection through the `saihash` object, allowing users to define hash fields
based on network requirements. Configured `saihash` objects can be applied
to different ECMP (Equal-Cost Multi-Path) traffic flows using the following
SAI switch attributes:


1) SAI_SWITCH_ATTR_ECMP_HASH_IPV4 - Specifies the hash object for IPv4 packets in ECMP.
2) SAI_SWITCH_ATTR_ECMP_HASH_IPV4_IN_IPV4 - Specifies the hash object for IPv4-in-IPv4 encapsulated packets in ECMP.
3) SAI_SWITCH_ATTR_ECMP_HASH_IPV6 - Specifies the hash object for IPv6 packets in ECMP.

These attributes allow fine-tuned ECMP hashing, so that traffic distribution
can be optimized for application needs. Network administrators can create
custom hash field lists from SAI native hash fields and bind them to the
switch attributes above. SAI provides equivalent attributes for LAG (Link
Aggregation Group) hashing, which help balance traffic across member links
and reduce congestion.

Currently, Remote Direct Memory Access over Converged Ethernet (RoCE)
traffic uses the same ECMP and LAG hash objects as standard IP traffic.
This can lead to traffic polarization, especially when multiple RoCE
streams share the same IP endpoints.


## 2.0 Motivation

The packet fields up to the L4 header are mostly identical across different
RDMA streams between the same endpoints, so all of these streams tend to
hash to the same ECMP or LAG member. To improve hash distribution for RDMA
traffic, modern NPUs natively support hashing on RDMA header fields.

This proposal adds SAI native hash fields for the following fields in the
RDMA Base Transport Header (BTH):

- Destination Queue Pair (QP) number
- RDMA opcode (operation type)

54+
## 3.0 SAI Enhancements
55+
56+
1) New Hash fields to support RoCE :
57+
```c
58+
59+
/**
60+
* @brief Attribute data for SAI native hash fields
61+
*/
62+
typedef enum _sai_native_hash_field_t
63+
{
64+
65+
...
66+
/** Native hash field RDMA packet BTH(Base Transport Header) opcode */
67+
SAI_NATIVE_HASH_FIELD_RDMA_BTH_OPCODE,
68+
69+
/** Native hash field RDMA packet BTH destination queue pair */
70+
SAI_NATIVE_HASH_FIELD_RDMA_BTH_DEST_QP,
71+
72+
} sai_native_hash_field_t;
73+
```
74+
2) Switch Attributes to configure Hashing for RoCE Traffic:
75+
```c
76+
/**
77+
* @brief Attribute Id in sai_set_switch_attribute() and
78+
* sai_get_switch_attribute() calls.
79+
*/
80+
typedef enum _sai_switch_attr_t
81+
{
82+
...
83+
/**
84+
* @brief The hash object for IPv4 RDMA packets going through ECMP
85+
*
86+
* @type sai_object_id_t
87+
* @flags CREATE_AND_SET
88+
* @objects SAI_OBJECT_TYPE_HASH
89+
* @allownull true
90+
* @default SAI_NULL_OBJECT_ID
91+
*/
92+
SAI_SWITCH_ATTR_ECMP_HASH_IPV4_RDMA,
93+
94+
/**
95+
* @brief The hash object for IPv6 RDMA packets going through ECMP
96+
*
97+
* @type sai_object_id_t
98+
* @flags CREATE_AND_SET
99+
* @objects SAI_OBJECT_TYPE_HASH
100+
* @allownull true
101+
* @default SAI_NULL_OBJECT_ID
102+
*/
103+
SAI_SWITCH_ATTR_ECMP_HASH_IPV6_RDMA,
104+
105+
/**
106+
* @brief The hash object for IPv4 RDMA packets going through LAG
107+
*
108+
* @type sai_object_id_t
109+
* @flags CREATE_AND_SET
110+
* @objects SAI_OBJECT_TYPE_HASH
111+
* @allownull true
112+
* @default SAI_NULL_OBJECT_ID
113+
*/
114+
SAI_SWITCH_ATTR_LAG_HASH_IPV4_RDMA,
115+
116+
/**
117+
* @brief The hash object for IPv6 RDMA packets going through LAG
118+
*
119+
* @type sai_object_id_t
120+
* @flags CREATE_AND_SET
121+
* @objects SAI_OBJECT_TYPE_HASH
122+
* @allownull true
123+
* @default SAI_NULL_OBJECT_ID
124+
*/
125+
SAI_SWITCH_ATTR_LAG_HASH_IPV6_RDMA,
126+
...
127+
} sai_switch_attr_t;
128+
```
129+
130+
131+
## 4.0 API Example
132+
133+
### Create Hash Object
134+
135+
```c
136+
137+
hash_count = 0;
138+
sai_attr_list[0].id = SAI_HASH_ATTR_NATIVE_HASH_FIELD_LIST;
139+
...(Other hash fileds)
140+
sai_attr_list[0].value.s32list.list[hash_count++] =
141+
SAI_NATIVE_HASH_FIELD_RDMA_BTH_OPCODE;
142+
sai_attr_list[0].value.s32list.list[hash_count++] =
143+
SAI_NATIVE_HASH_FIELD_RDMA_BTH_DEST_QP;
144+
sai_attr_list[0].value.s32list.count = hash_count;
145+
attr_count =1
146+
147+
sai_create_hash_fn(
148+
&hash_rdma_v4_oid,
149+
switch_id,
150+
attr_count,
151+
sai_attr_list);
152+
```
153+
154+
### Configure RDMA Hash on Switch
155+
156+
```c
157+
attr_count = 0
158+
sai_attr_list[attr_count].id = SAI_SWITCH_ATTR_ECMP_HASH_IPV4_RDMA;
159+
sai_attr_list[attr_count].value.oid = hash_rdma_v4_oid;
160+
161+
sai_set_switch_attribute_fn(
162+
switch_id,
163+
sai_attr_list);
164+
```

inc/saihash.h

Lines changed: 6 additions & 0 deletions
@@ -169,6 +169,12 @@ typedef enum _sai_native_hash_field_t
     /** Native hash field IPv6 flow label */
     SAI_NATIVE_HASH_FIELD_IPV6_FLOW_LABEL = 0x00000018,
 
+    /** Native hash field RDMA packet BTH(Base Transport Header) opcode */
+    SAI_NATIVE_HASH_FIELD_RDMA_BTH_OPCODE = 0x00000022,
+
+    /** Native hash field RDMA packet BTH destination queue pair */
+    SAI_NATIVE_HASH_FIELD_RDMA_BTH_DEST_QP = 0x00000023,
+
     /** No field - for compatibility, must be last */
     SAI_NATIVE_HASH_FIELD_NONE = 0x00000021,
 

inc/saiswitch.h

Lines changed: 44 additions & 0 deletions
@@ -3215,6 +3215,50 @@ typedef enum _sai_switch_attr_t
      */
     SAI_SWITCH_ATTR_PORT_PTP_MODE,
 
+    /**
+     * @brief The hash object for IPv4 RDMA packets going through ECMP
+     *
+     * @type sai_object_id_t
+     * @flags CREATE_AND_SET
+     * @objects SAI_OBJECT_TYPE_HASH
+     * @allownull true
+     * @default SAI_NULL_OBJECT_ID
+     */
+    SAI_SWITCH_ATTR_ECMP_HASH_IPV4_RDMA,
+
+    /**
+     * @brief The hash object for IPv6 RDMA packets going through ECMP
+     *
+     * @type sai_object_id_t
+     * @flags CREATE_AND_SET
+     * @objects SAI_OBJECT_TYPE_HASH
+     * @allownull true
+     * @default SAI_NULL_OBJECT_ID
+     */
+    SAI_SWITCH_ATTR_ECMP_HASH_IPV6_RDMA,
+
+    /**
+     * @brief The hash object for IPv4 RDMA packets going through LAG
+     *
+     * @type sai_object_id_t
+     * @flags CREATE_AND_SET
+     * @objects SAI_OBJECT_TYPE_HASH
+     * @allownull true
+     * @default SAI_NULL_OBJECT_ID
+     */
+    SAI_SWITCH_ATTR_LAG_HASH_IPV4_RDMA,
+
+    /**
+     * @brief The hash object for IPv6 RDMA packets going through LAG
+     *
+     * @type sai_object_id_t
+     * @flags CREATE_AND_SET
+     * @objects SAI_OBJECT_TYPE_HASH
+     * @allownull true
+     * @default SAI_NULL_OBJECT_ID
+     */
+    SAI_SWITCH_ATTR_LAG_HASH_IPV6_RDMA,
+
     /**
      * @brief End of attributes
      */

meta/acronyms.txt

Lines changed: 2 additions & 0 deletions
@@ -13,6 +13,7 @@ BFDV6 - Bidirectional Forwarding Detection for IPv6
 BGP - Border Gateway Protocol
 BMTOR - Behavioral Model Top-of-Rack
 BOS - Bottom Of Stack
+BTH - Base Transport Header
 BW - Bandwidth
 CAM - Content Addressable Memory
 CAUI - 100 Gigabit Attachment Unit Interface
@@ -125,6 +126,7 @@ PSP - Penultimate Segment Pop
 PTP - Precision time protocol
 QOS - Quality of Service
 RARP - Reverse Address Resolution Protocol
+RDMA - Remote Direct Memory Access
 RFC - Request For Comment
 RPC - Remote Procedure Call
 RPF - Reverse Path Forwarding
