Commit 8a99a47

Adding RDMA fields to Hash Object
Signed-off-by: Satheesh Kumar Karra <skarra@marvell.com>
1 parent 750cafa commit 8a99a47

5 files changed

Lines changed: 218 additions & 0 deletions

Lines changed: 164 additions & 0 deletions
@@ -0,0 +1,164 @@
# [SAI] Hashing Enhancements for Efficient RoCE Traffic Distribution
-------------------------------------------------------------------------------
 Title       | Hashing Enhancements for Efficient RoCE Traffic Distribution
-------------|-----------------------------------------------------------------
 Authors     | Satheesh Kumar Karra, Ravindranath C K (Marvell)
 Status      | In review
 Type        | Standards track
 Created     | 2025-02-27
 SAI-Version | 1.16
-------------------------------------------------------------------------------

## 1.0 Introduction


SAI (Switch Abstraction Interface) supports customization of hash field
selection through the `saihash` object, allowing users to define hash fields
based on network requirements. Configured `saihash` objects can be applied
to different ECMP (Equal-Cost Multi-Path) traffic flows using the following
SAI switch attributes:


1) SAI_SWITCH_ATTR_ECMP_HASH_IPV4 – Specifies the hash object for IPv4 packets in ECMP.
2) SAI_SWITCH_ATTR_ECMP_HASH_IPV4_IN_IPV4 – Specifies the hash object for IPv4-in-IPv4 encapsulated packets in ECMP.
3) SAI_SWITCH_ATTR_ECMP_HASH_IPV6 – Specifies the hash object for IPv6 packets in ECMP.

These attributes allow fine-tuned ECMP hashing, optimizing traffic
distribution based on application needs. Network administrators can create
custom hash field lists from the SAI native hash fields and bind them to the
switch attributes above. SAI provides equivalent attributes for LAG (Link
Aggregation Group) hashing, which balance traffic across member links,
reducing congestion and improving overall network efficiency.

Currently, Remote Direct Memory Access over Converged Ethernet (RoCE)
traffic uses the same ECMP and LAG hash objects as standard IP traffic.
This can lead to traffic polarization, especially when multiple RoCE
streams share the same IP endpoints.


## 2.0 Motivation

For different RDMA streams between the same endpoints, the packet fields up
to the L4 header are mostly identical, so all of these streams hash to the
same member. To improve the hash distribution for RDMA traffic, modern NPUs
natively support hashing on RDMA header fields.

This proposal adds SAI native hash field support for the following fields in
the RDMA Base Transport Header (BTH):

- Destination Queue Pair (QP) number
- RDMA opcode (operation type)

## 3.0 SAI Enhancements

1) New hash fields to support RoCE:
```c

/**
 * @brief Attribute data for SAI native hash fields
 */
typedef enum _sai_native_hash_field_t
{

    ...
    /** Native hash field RDMA packet BTH(Base Transport Header) opcode */
    SAI_NATIVE_HASH_FIELD_RDMA_BTH_OPCODE,

    /** Native hash field RDMA packet BTH destination queue pair */
    SAI_NATIVE_HASH_FIELD_RDMA_BTH_DEST_QP,

} sai_native_hash_field_t;
```
2) Switch attributes to configure hashing for RoCE traffic:
```c
/**
 * @brief Attribute Id in sai_set_switch_attribute() and
 * sai_get_switch_attribute() calls.
 */
typedef enum _sai_switch_attr_t
{
    ...
    /**
     * @brief The hash object for IPv4 RDMA packets going through ECMP
     *
     * @type sai_object_id_t
     * @flags CREATE_AND_SET
     * @objects SAI_OBJECT_TYPE_HASH
     * @allownull true
     * @default SAI_NULL_OBJECT_ID
     */
    SAI_SWITCH_ATTR_ECMP_HASH_IPV4_RDMA,

    /**
     * @brief The hash object for IPv6 RDMA packets going through ECMP
     *
     * @type sai_object_id_t
     * @flags CREATE_AND_SET
     * @objects SAI_OBJECT_TYPE_HASH
     * @allownull true
     * @default SAI_NULL_OBJECT_ID
     */
    SAI_SWITCH_ATTR_ECMP_HASH_IPV6_RDMA,

    /**
     * @brief The hash object for IPv4 RDMA packets going through LAG
     *
     * @type sai_object_id_t
     * @flags CREATE_AND_SET
     * @objects SAI_OBJECT_TYPE_HASH
     * @allownull true
     * @default SAI_NULL_OBJECT_ID
     */
    SAI_SWITCH_ATTR_LAG_HASH_IPV4_RDMA,

    /**
     * @brief The hash object for IPv6 RDMA packets going through LAG
     *
     * @type sai_object_id_t
     * @flags CREATE_AND_SET
     * @objects SAI_OBJECT_TYPE_HASH
     * @allownull true
     * @default SAI_NULL_OBJECT_ID
     */
    SAI_SWITCH_ATTR_LAG_HASH_IPV6_RDMA,
    ...
} sai_switch_attr_t;
```


## 4.0 API Example

### Create Hash Object

```c

hash_count = 0;
sai_attr_list[0].id = SAI_HASH_ATTR_NATIVE_HASH_FIELD_LIST;
/* ... (other hash fields) ... */
sai_attr_list[0].value.s32list.list[hash_count++] =
    SAI_NATIVE_HASH_FIELD_RDMA_BTH_OPCODE;
sai_attr_list[0].value.s32list.list[hash_count++] =
    SAI_NATIVE_HASH_FIELD_RDMA_BTH_DEST_QP;
sai_attr_list[0].value.s32list.count = hash_count;
attr_count = 1;

sai_create_hash_fn(
    &hash_rdma_v4_oid,
    switch_id,
    attr_count,
    sai_attr_list);
```

### Configure RDMA Hash on Switch

```c
attr_count = 0;
sai_attr_list[attr_count].id = SAI_SWITCH_ATTR_ECMP_HASH_IPV4_RDMA;
sai_attr_list[attr_count].value.oid = hash_rdma_v4_oid;

sai_set_switch_attribute_fn(
    switch_id,
    sai_attr_list);
```

inc/saihash.h

Lines changed: 6 additions & 0 deletions
@@ -169,6 +169,12 @@ typedef enum _sai_native_hash_field_t
     /** Native hash field IPv6 flow label */
     SAI_NATIVE_HASH_FIELD_IPV6_FLOW_LABEL = 0x00000018,

+    /** Native hash field RDMA packet BTH(Base Transport Header) opcode */
+    SAI_NATIVE_HASH_FIELD_RDMA_BTH_OPCODE = 0x00000022,
+
+    /** Native hash field RDMA packet BTH destination queue pair */
+    SAI_NATIVE_HASH_FIELD_RDMA_BTH_DEST_QP = 0x00000023,
+
     /** No field - for compatibility, must be last */
     SAI_NATIVE_HASH_FIELD_NONE = 0x00000021,

inc/saiswitch.h

Lines changed: 44 additions & 0 deletions
@@ -3203,6 +3203,50 @@ typedef enum _sai_switch_attr_t
      */
     SAI_SWITCH_ATTR_SHARED_BUFFER_CELL_SIZE,

+    /**
+     * @brief The hash object for IPv4 RDMA packets going through ECMP
+     *
+     * @type sai_object_id_t
+     * @flags CREATE_AND_SET
+     * @objects SAI_OBJECT_TYPE_HASH
+     * @allownull true
+     * @default SAI_NULL_OBJECT_ID
+     */
+    SAI_SWITCH_ATTR_ECMP_HASH_IPV4_RDMA,
+
+    /**
+     * @brief The hash object for IPv6 RDMA packets going through ECMP
+     *
+     * @type sai_object_id_t
+     * @flags CREATE_AND_SET
+     * @objects SAI_OBJECT_TYPE_HASH
+     * @allownull true
+     * @default SAI_NULL_OBJECT_ID
+     */
+    SAI_SWITCH_ATTR_ECMP_HASH_IPV6_RDMA,
+
+    /**
+     * @brief The hash object for IPv4 RDMA packets going through LAG
+     *
+     * @type sai_object_id_t
+     * @flags CREATE_AND_SET
+     * @objects SAI_OBJECT_TYPE_HASH
+     * @allownull true
+     * @default SAI_NULL_OBJECT_ID
+     */
+    SAI_SWITCH_ATTR_LAG_HASH_IPV4_RDMA,
+
+    /**
+     * @brief The hash object for IPv6 RDMA packets going through LAG
+     *
+     * @type sai_object_id_t
+     * @flags CREATE_AND_SET
+     * @objects SAI_OBJECT_TYPE_HASH
+     * @allownull true
+     * @default SAI_NULL_OBJECT_ID
+     */
+    SAI_SWITCH_ATTR_LAG_HASH_IPV6_RDMA,
+
     /**
      * @brief End of attributes
      */

meta/acronyms.txt

Lines changed: 2 additions & 0 deletions
@@ -13,6 +13,7 @@ BFDV6 - Bidirectional Forwarding Detection for IPv6
 BGP - Border Gateway Protocol
 BMTOR - Behavioral Model Top-of-Rack
 BOS - Bottom Of Stack
+BTH - Base Transport Header
 BW - Bandwidth
 CAM - Content Addressable Memory
 CAUI - 100 Gigabit Attachment Unit Interface
@@ -125,6 +126,7 @@ PSP - Penultimate Segment Pop
 PTP - Precision time protocol
 QOS - Quality of Service
 RARP - Reverse Address Resolution Protocol
+RDMA - Remote Direct Memory Access
 RFC - Request For Comment
 RPC - Remote Procedure Call
 RPF - Reverse Path Forwarding

meta/ancestry.pl

Lines changed: 2 additions & 0 deletions
@@ -178,6 +178,7 @@ sub BuildCommitHistory
 next if $enumName eq "SAI_OBJECT_TYPE_MAX";
 next if $enumName eq "SAI_PORT_INTERFACE_TYPE_MAX";
 next if $enumName eq "SAI_PORT_BREAKOUT_MODE_TYPE_MAX";
+next if $enumName eq "SAI_NATIVE_HASH_FIELD_NONE";

 LogError "wrong initializer on $enumName $enumValue" if not $enumValue =~ /^0x[0-9a-f]{8}$/;

@@ -209,6 +210,7 @@ sub BuildCommitHistory
 #print "elsif (defined $enumName $IGNORED{$enumName} and $IGNORED{$enumName} eq $HISTORY{$enumTypeName}{$enumName}{name})";

 next if $HISTORY{$enumTypeName}{$enumValue} eq "SAI_PORT_BREAKOUT_MODE_TYPE_MAX";
+next if $HISTORY{$enumTypeName}{$enumValue} eq "SAI_NATIVE_HASH_FIELD_NONE";
 LogWarning "Both enums have the same value $enumName and $HISTORY{$enumTypeName}{$enumValue} = $enumValue";
 }
 }
