SAI API Performance Monitoring #2279
base: master
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,243 @@ | ||
| # Performance Monitoring SAI Specification | ||
| ------------------------------------------------------------------------------- | ||
| Title | SAI support for Performance Monitoring | ||
| :-------------|:----------------------------------------------------------------- | ||
| Authors | Jai Kumar, Broadcom Inc | ||
| Status | In review | ||
| Type | Standards track | ||
| Created | 03/18/2026: Initial Draft | ||
| SAI-Version | 1.19 | ||
| ------------------------------------------------------------------------------- | ||
|
|
||
|
|
||
| ## 1.0 Introduction | ||
| As network fabric scale increases and data centers require regional spine connectivity, the number of downlinks for cluster connectivity grows. This leads to more LAGs, more prefixes, and larger ECMP groups. The same holds for large scale-up and scale-across fabrics for AI/ML. | ||
|
|
||
| This increasing scale mandates that SAI be scalable, reliable, and high-performance. This specification addresses the performance component of SAI by introducing a new set of metrics to accurately measure the performance of various components within the SAI layer and below, such as SDK and hardware updates. | ||
|
|
||
| Using these metrics, deployments can isolate components impacting performance and focus on their optimization. | ||
|
|
||
|
|
||
|
|
||
| ## 2.0 Terms and Acronyms | ||
|
|
||
| | Term| Description | | ||
| |:---|:---| | ||
| | perfmon | Performance Monitoring | | ||
|
|
||
| ## 3.0 Overview | ||
| The SAI infrastructure exposes a set of APIs as a standard interface to the upper layer. | ||
|
|
||
| These APIs are synchronous and blocking, making the completion time of any given API a critical performance measure. Note that application-specific callbacks are not addressed by this specification. | ||
|
|
||
| ``` | ||
| /** | ||
| * @brief SAI common API type | ||
| */ | ||
| typedef enum _sai_common_api_t | ||
| { | ||
| SAI_COMMON_API_CREATE = 0, | ||
| SAI_COMMON_API_REMOVE = 1, | ||
| SAI_COMMON_API_SET = 2, | ||
| SAI_COMMON_API_GET = 3, | ||
| SAI_COMMON_API_BULK_CREATE = 4, | ||
| SAI_COMMON_API_BULK_REMOVE = 5, | ||
| SAI_COMMON_API_BULK_SET = 6, | ||
| SAI_COMMON_API_BULK_GET = 7, | ||
| SAI_COMMON_API_MAX = 8, | ||
| } sai_common_api_t; | ||
|
|
||
| ``` | ||
|
|
||
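Because the APIs are synchronous and blocking, completion time can be taken as the difference between a timestamp read just before the call and one read just after it returns. The following is a minimal sketch of that idea, not the adapter's actual implementation. All names (`status_t`, `clock_fn_t`, `blocking_api_fn_t`, `timed_call`, `fake_clock`, `failing_api`) are hypothetical stand-ins and are not part of the SAI headers. The latency is recorded whether or not the call succeeds, in line with section 3.1.

```c
#include <stdint.h>

/* Hypothetical stand-ins; none of these names come from the SAI headers. */
typedef int status_t;                              /* stand-in for sai_status_t */
typedef uint64_t (*clock_fn_t)(void);              /* returns a monotonic timestamp */
typedef status_t (*blocking_api_fn_t)(void *ctx);  /* any synchronous, blocking call */

/*
 * Run a blocking call and report how long it took.
 * The latency is written for success and failure alike.
 */
status_t timed_call(blocking_api_fn_t api, void *ctx,
                    clock_fn_t now, uint64_t *latency_out)
{
    uint64_t start = now();
    status_t status = api(ctx);   /* blocks until the operation completes */
    uint64_t end = now();

    *latency_out = end - start;   /* not discounted when status is an error */
    return status;
}

/* Deterministic stand-ins used only to exercise the wrapper. */
uint64_t fake_clock(void)
{
    static uint64_t ticks = 0;
    ticks += 100;                 /* each read advances the clock by 100 units */
    return ticks;
}

status_t failing_api(void *ctx)
{
    (void)ctx;
    return -1;                    /* simulate an API call that returns an error */
}
```

Passing the clock as a function pointer keeps the sketch deterministic; a real adapter would more likely read a monotonic system clock at the same two points.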
| This specification proposes the following API performance metrics: | ||
| 1. Average Latency | ||
| 2. Instantaneous Latency | ||
| 3. Maximum Latency | ||
|
|
||
| ### 3.1 Average, Instantaneous, and Maximum Latency | ||
| API completion time consists of the time spent in the SAI adapter and the SDK, including hardware update or query time. Time is measured irrespective of the status of the API call: if the call completes with an error status, the adapter still accounts for the measured latency during the metric computation interval. The NOS tracks the return status of API calls and can account for errors as needed. Discounting latency for specific error statuses would result in inconsistent measurements, requiring metric subscribers to implement manual workarounds for those cases. | ||
|
|
||
| These metrics can be used to: | ||
| - Improve SAI adapter and SDK implementations | ||
| - Provide a baseline for comparing different hardware | ||
|
|
||
| The metrics are defined as follows: | ||
| - Instantaneous: Provides [time, n], where n > 1 is the number of objects in a bulk API call and n = 1 denotes the last observed latency for a single object | ||
| - Maximum: The highest value observed across the last n invocations | ||
| - Average: The average value over the last n invocations | ||
|
|
||
|
|
||
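To make the three metrics concrete, here is a small sketch that derives them from a window of recorded latencies. It is an illustration under assumptions, not the adapter's algorithm: the names `latency_summary_t` and `summarize_latency` are hypothetical, nanoseconds are assumed as the unit because the specification does not state one, and the average uses integer division.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result holder; not part of the proposed SAI headers. */
typedef struct {
    uint64_t inst_ns; /* latency of the most recent invocation */
    uint64_t max_ns;  /* highest latency in the window */
    uint64_t avg_ns;  /* integer mean over the window */
} latency_summary_t;

/*
 * Summarize the last n invocation latencies, ordered oldest first.
 * Returns 0 on success, -1 if the arguments are invalid or n is 0.
 */
int summarize_latency(const uint64_t *samples_ns, size_t n,
                      latency_summary_t *out)
{
    if (samples_ns == NULL || out == NULL || n == 0) {
        return -1;
    }

    uint64_t sum = 0;
    uint64_t max = 0;
    for (size_t i = 0; i < n; i++) {
        sum += samples_ns[i];
        if (samples_ns[i] > max) {
            max = samples_ns[i];
        }
    }

    out->inst_ns = samples_ns[n - 1];
    out->max_ns = max;
    out->avg_ns = sum / n;
    return 0;
}
```

For the samples 120, 300, and 90, this yields an instantaneous value of 90, a maximum of 300, and an average of 170.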
| ## 4.0 SAI Specification | ||
| A new perfmon object is introduced. Each perfmon object specifies the object type of interest, the API to monitor, and the metrics to be measured for that API. | ||
|
|
||
|
|
||
| Each perfmon object created has a binding to the switch object. | ||
|
|
||
| ### 4.3 Perfmon Object | ||
| The perfmon object specifies the API and the metrics of interest. | ||
|
|
||
| #### 4.3.1 Metrics | ||
| Each API can be measured for a specific performance metric, as specified in sai_perfmon_metrics_t. | ||
|
|
||
| ``` | ||
| /** | ||
| * @brief Performance Monitoring Metrics | ||
| */ | ||
| typedef enum _sai_perfmon_metrics_t | ||
| { | ||
| /** | ||
| * @brief None | ||
| */ | ||
| SAI_PERFMON_METRICS_NONE, | ||
|
|
||
| /** | ||
| * @brief Maximum latency observed | ||
| */ | ||
| SAI_PERFMON_METRICS_MAX_LATENCY, | ||
|
|
||
| /** | ||
| * @brief Average latency observed | ||
| */ | ||
| SAI_PERFMON_METRICS_AVERAGE_LATENCY, | ||
|
|
||
| /** | ||
| * @brief Instantaneous latency observed | ||
| */ | ||
| SAI_PERFMON_METRICS_INST_LATENCY, | ||
|
|
||
| } sai_perfmon_metrics_t; | ||
|
|
||
| ``` | ||
|
|
||
| #### 4.3.2 Perfmon Object Attributes | ||
| The type of API to be monitored and its associated attributes are specified in the perfmon object attributes. | ||
|
|
||
| ``` | ||
| /** | ||
| * @brief Performance Monitoring Attributes | ||
| */ | ||
| typedef enum _sai_perfmon_attr_t | ||
| { | ||
| /** | ||
| * @brief Start of Attributes | ||
| */ | ||
| SAI_PERFMON_ATTR_START, | ||
|
|
||
| /** | ||
| * @brief Object to be monitored | ||
| * | ||
| * @type sai_object_type_t | ||
| * @flags MANDATORY_ON_CREATE | CREATE_ONLY | ||
| */ | ||
| SAI_PERFMON_ATTR_OBJECT_TYPE = SAI_PERFMON_ATTR_START, | ||
|
|
||
| /** | ||
| * @brief API to be monitored | ||
| * | ||
| * @type sai_common_api_t | ||
| * @flags CREATE_AND_SET | ||
| */ | ||
| SAI_PERFMON_ATTR_COMMON_API, | ||
|
|
||
| /** | ||
| * @brief Performance metrics to be collected | ||
| * | ||
| * @type sai_perfmon_metrics_t | ||
| * @flags CREATE_AND_SET | ||
| * @default SAI_PERFMON_METRICS_NONE | ||
| */ | ||
| SAI_PERFMON_ATTR_PERFMON_METRICS, | ||
|
|
||
| /** | ||
| * @brief Performance data as collected. This is clear on read. | ||
| * Performance data is computed once enabled and is cleared once read. | ||
| * | ||
| * @type sai_uint64_t | ||
| * @flags READ_ONLY | ||
| */ | ||
| SAI_PERFMON_ATTR_PERFDATA, | ||
|
Contributor review comment: Please specify the units of this data. |
||
|
|
||
| /** | ||
| * @brief End of Performance Monitoring attributes | ||
| */ | ||
| SAI_PERFMON_ATTR_END, | ||
|
|
||
| /** Custom range base value */ | ||
| SAI_PERFMON_ATTR_CUSTOM_RANGE_START = 0x10000000, | ||
|
|
||
| /** End of custom range base */ | ||
| SAI_PERFMON_ATTR_CUSTOM_RANGE_END | ||
|
|
||
| } sai_perfmon_attr_t; | ||
|
|
||
| ``` | ||
|
|
||
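The clear-on-read behavior described for SAI_PERFMON_ATTR_PERFDATA can be sketched as follows. This is one possible reading, not the specified implementation: `metric_t`, `perfmon_state_t`, `perfmon_record`, and `perfmon_read_and_clear` are hypothetical names, the enum only mirrors sai_perfmon_metrics_t, and returning 0 when nothing has been recorded since the last read is an assumption.

```c
#include <stdint.h>

/* Stand-in mirroring sai_perfmon_metrics_t from the proposal. */
typedef enum {
    METRIC_NONE,
    METRIC_MAX_LATENCY,
    METRIC_AVERAGE_LATENCY,
    METRIC_INST_LATENCY
} metric_t;

/* Hypothetical adapter-side state for one perfmon object. */
typedef struct {
    metric_t metric;  /* metric selected on the perfmon object */
    uint64_t count;   /* invocations since the last read */
    uint64_t sum;     /* sum of latencies since the last read */
    uint64_t max;     /* highest latency since the last read */
    uint64_t last;    /* most recent latency */
} perfmon_state_t;

/* Account one completed API invocation. */
void perfmon_record(perfmon_state_t *s, uint64_t latency)
{
    s->count++;
    s->sum += latency;
    s->last = latency;
    if (latency > s->max) {
        s->max = latency;
    }
}

/* Return the selected metric, then reset the accumulated data. */
uint64_t perfmon_read_and_clear(perfmon_state_t *s)
{
    uint64_t value = 0;

    switch (s->metric) {
    case METRIC_MAX_LATENCY:
        value = s->max;
        break;
    case METRIC_AVERAGE_LATENCY:
        value = (s->count != 0) ? (s->sum / s->count) : 0;
        break;
    case METRIC_INST_LATENCY:
        value = s->last;
        break;
    case METRIC_NONE:
    default:
        value = 0;
        break;
    }

    /* Clear on read: the next read only reflects later invocations. */
    s->count = 0;
    s->sum = 0;
    s->max = 0;
    s->last = 0;
    return value;
}
```

With the average metric selected, recording latencies of 10 and 30 and then reading returns 20; an immediate second read returns 0 because the data was cleared.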
| #### 4.3.3 Perfmon Object Switch Binding | ||
| A list of perfmon objects can be bound to the switch object. This binding can be done as a SET operation on the switch object once the perfmon object has been created. | ||
|
|
||
| ``` | ||
| /** | ||
| * @brief Performance Monitoring enabled on the switch | ||
| * | ||
| * @type sai_object_list_t | ||
| * @flags CREATE_AND_SET | ||
| * @objects SAI_OBJECT_TYPE_PERFMON | ||
| * @default empty | ||
| */ | ||
| SAI_SWITCH_ATTR_PERFMON_LIST, | ||
| ``` | ||
|
|
||
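The effect of the binding can be modeled with a small mock: a perfmon object is only considered active while its ID appears in the list set on the switch. This is a sketch under assumptions rather than adapter code. `object_id_t`, `mock_switch_t`, `mock_switch_set_perfmon_list`, `mock_perfmon_is_active`, and the list capacity are all hypothetical, and treating a SET as replacing the whole list is an assumption about list-attribute semantics.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t object_id_t;      /* stand-in for sai_object_id_t */

#define MAX_BOUND_PERFMON 8        /* arbitrary capacity for the sketch */

/* Hypothetical switch state holding the bound perfmon list. */
typedef struct {
    object_id_t bound[MAX_BOUND_PERFMON];
    size_t bound_count;
} mock_switch_t;

/*
 * Model a SET of the switch perfmon list.
 * Assumption: the new list replaces the previous one entirely.
 */
int mock_switch_set_perfmon_list(mock_switch_t *sw,
                                 const object_id_t *list, size_t count)
{
    if (sw == NULL || count > MAX_BOUND_PERFMON ||
        (count > 0 && list == NULL)) {
        return -1;
    }
    for (size_t i = 0; i < count; i++) {
        sw->bound[i] = list[i];
    }
    sw->bound_count = count;
    return 0;
}

/* Monitoring is active only for perfmon objects bound to the switch. */
bool mock_perfmon_is_active(const mock_switch_t *sw, object_id_t perfmon)
{
    for (size_t i = 0; i < sw->bound_count; i++) {
        if (sw->bound[i] == perfmon) {
            return true;
        }
    }
    return false;
}
```

In this model, setting an empty list unbinds every perfmon object and therefore stops monitoring for all of them.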
|
|
||
| ## 5.0 Sample Workflow | ||
|
|
||
| This section describes how to enable performance monitoring for a given API and metric. | ||
|
|
||
| ### 5.1 Create perfmon object | ||
| - Each perfmon object supports a single API and a single set of metrics. To monitor additional metrics for the same API or to monitor a different API, a new perfmon object must be created. | ||
| - Monitoring in the SAI adapter will only begin once the perfmon object is bound to the switch object. | ||
|
|
||
| ``` | ||
| /* | ||
| * Create a perfmon object that measures the average latency of the BULK_SET API on route entries | ||
| */ | ||
|
|
||
| // Specify the object type of interest | ||
| sai_attr_list[0].id = SAI_PERFMON_ATTR_OBJECT_TYPE; | ||
| sai_attr_list[0].value.s32 = SAI_OBJECT_TYPE_ROUTE_ENTRY; | ||
|
|
||
| // Specify the API of interest | ||
| sai_attr_list[1].id = SAI_PERFMON_ATTR_COMMON_API; | ||
| sai_attr_list[1].value.s32 = SAI_COMMON_API_BULK_SET; | ||
|
|
||
| // Configure metrics to be measured | ||
| sai_attr_list[2].id = SAI_PERFMON_ATTR_PERFMON_METRICS; | ||
| sai_attr_list[2].value.s32 = SAI_PERFMON_METRICS_AVERAGE_LATENCY; | ||
|
|
||
| // Configure Time Interval in msec | ||
| sai_attr_list[3].id = SAI_PERFMON_ATTR_METRICS_TIME_INTERVAL; | ||
|
Contributor review comment: The SAI_PERFMON_ATTR_METRICS_TIME_INTERVAL attribute is missing, and its functionality is NOT specified. The spec has defined that the interval is always between the two invocations for a given ObjType+API_type. |
||
| sai_attr_list[3].value.u32 = 2048; | ||
|
|
||
|
|
||
| // Create perfmon object | ||
| attr_count = 4; | ||
| create_perfmon( | ||
| &sai_perfmon_object, | ||
| switch_id, | ||
| attr_count, | ||
| sai_attr_list); | ||
| ``` | ||
|
|
||
| ### 5.2 Read perfmon Metrics | ||
|
|
||
| Read the perfmon attribute for getting the API related metrics. | ||
|
|
||
| ``` | ||
| // Specify the read attribute | ||
| sai_attr_list[0].id = SAI_PERFMON_ATTR_PERFDATA; | ||
|
|
||
| // Read perfmon metrics | ||
| attr_count = 1; | ||
| get_perfmon_attribute( | ||
| sai_perfmon_object, | ||
| attr_count, | ||
| sai_attr_list); | ||
| ... | ||
|
|
||
I think the 'n' part is not kept any more. Can simplify the description to "last observed latency for API call".