[thermalctld] Remove SFP temperature polling#808
Conversation
SFP/transceiver temperature was being read from Redis (TRANSCEIVER_DOM_* tables populated by xcvrd) in thermalctld's update loop. On high-port-count platforms (e.g. 512 SFPs), this caused 3-5 Redis round-trips per SFP per cycle, which could delay fast-polling sensors like ASIC temperature. Remove SFP temperature polling from thermalctld entirely. SFP/transceiver temperature reporting is owned by xcvrd and exposed via the TRANSCEIVER_DOM_SENSOR table; thermalctld no longer needs to duplicate this path. Changes: - Drop the chassis-level and per-module SFP enumeration loops in TemperatureUpdater.update(). - Remove the now-unnecessary 'SFP not in parent_name' guard around PHYSICAL_ENTITY_INFO updates in _collect_thermals(). - Remove the SfpUtilHelper init, Redis-based SFP temperature/threshold readers, and _collect_sfp_thermals() helper. - Update unit tests: drop SFP-specific tests and adjust the module thermal count expectation (no longer counts the SFP thermal). Signed-off-by: Vasundhara Volam <vvolam@microsoft.com>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
@Junchao-Mellanox Please review this PR, with this change, |
|
@vvolam removing transceiver temperatures from |
@aditya-nexthop The approach was to remove the transceiver temperature polling from thermalctld and use data polled by xcvrd. Xcvrd does this in two ways already
We had this comment earlier : So does nexthop platforms show sfp module temperature in "show plat temp" ? adding @Junchao-Mellanox too. |
I think that getting the SFP temperature inline in thermalctld ensures that the values are fresh. For platforms that have moved to using [xcvrd] Add new thread to poll temperatures more aggressively #685, this should be fine, but there might be other platforms that don't. Empirically, transceiver DOM (including temperature) updates at a frequency of 1 minute or slower (#758) when polling via the DOM manager thread.
Yes our platforms currently do.
|
|
@aditya-nexthop just to bring context back again -- Based on the usecase and our internal lab testing where there could be a single device with like 500+ interfaces and sfp module temperature pooling loop in thermalctld affects its performance and application of cooling algorithm -- the approach is to move SFP module temperature polling out of thermalctld and let it be updated in xcvrd which anyways fetch this temperature info. |
|
Thanks @judyjoseph for the context.
But now the platform SFP objects will make the redis calls to get the SFP temperature from the TRANSCEIVER_DOM_* tables, right? Just having some trouble understanding how this results in a speedup. Also thanks @vvolam for raising the PR to retain the SFP temperature values in |
|
@copilot resolve the merge conflicts in this pull request |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Description
Remove SFP/transceiver temperature polling from thermalctld. SFP temperature reporting is owned by xcvrd and exposed via the
TRANSCEIVER_DOM_SENSOR,TANSCEIVER_DOM_TEMPERATUREtables, so duplicating it in thermalctld is unnecessary.Changes:
TemperatureUpdater.update().'SFP' not in parent_nameguard aroundPHYSICAL_ENTITY_INFOupdates in_collect_thermals()._collect_sfp_thermals()helper.Companion PR:
tempershowreads SFP temperature directly from xcvrd-managedTRANSCEIVER_DOM_TEMPERATURE/TRANSCEIVER_DOM_THRESHOLD/TRANSCEIVER_DOM_FLAGtables instead ofTEMPERATURE_INFO, soshow platform temperaturecontinues to display SFP modules after this change.Motivation and Context
SFP/transceiver temperature was being read from Redis (
TRANSCEIVER_DOM_*tables populated by xcvrd) inside thermalctld's update loop. On high-port-count platforms (e.g. 512 SFPs), this caused 3-5 Redis round-trips per SFP per cycle, which could delay fast-polling sensors (such as ASIC temperature) and added duplicate ownership of transceiver telemetry between xcvrd and thermalctld.Other consumers of
TEMPERATURE_INFOwere reviewed and are unaffected:PhysicalSensorTableMIBUpdater/ThermalCacheUpdateralready filter out non-chassis thermals, and transceiver temperature is exposed via the dedicatedTRANSCEIVER_DOM_SENSORpath (update_xcvr_dom_data).hardware_checkeronly reads theASICentry.How Has This Been Tested?
pytest sonic-thermalctld/tests/test_thermalctld.py— all 89 tests pass.TEMPERATURE_INFOno longer contains SFP entries written by thermalctld, while xcvrd-ownedTRANSCEIVER_DOM_SENSORcontinues to report transceiver temperature unchanged, andshow platform temperature(via the updatedtempershowin [tempershow]: Read SFP temperature from xcvrd-managed tables sonic-utilities#4522) lists all SFP modules correctly.Additional Information (Optional)
N/A