Skip to content

[thermalctld] Remove SFP temperature polling#808

Merged
judyjoseph merged 2 commits into
sonic-net:masterfrom
vvolam:remove-sfp-polling-thermalctld
May 13, 2026
Merged

[thermalctld] Remove SFP temperature polling#808
judyjoseph merged 2 commits into
sonic-net:masterfrom
vvolam:remove-sfp-polling-thermalctld

Conversation

@vvolam
Copy link
Copy Markdown
Contributor

@vvolam vvolam commented Apr 30, 2026

Description

Remove SFP/transceiver temperature polling from thermalctld. SFP temperature reporting is owned by xcvrd and exposed via the TRANSCEIVER_DOM_SENSOR, TANSCEIVER_DOM_TEMPERATURE tables, so duplicating it in thermalctld is unnecessary.

Changes:

  • Drop the chassis-level and per-module SFP enumeration loops in TemperatureUpdater.update().
  • Remove the now-unnecessary 'SFP' not in parent_name guard around PHYSICAL_ENTITY_INFO updates in _collect_thermals().
  • Remove the SfpUtilHelper init, Redis-based SFP temperature/threshold readers, and _collect_sfp_thermals() helper.
  • Update unit tests: drop SFP-specific tests and adjust the module thermal count expectation (no longer counts the SFP thermal).

Companion PR:

Motivation and Context

SFP/transceiver temperature was being read from Redis (TRANSCEIVER_DOM_* tables populated by xcvrd) inside thermalctld's update loop. On high-port-count platforms (e.g. 512 SFPs), this caused 3-5 Redis round-trips per SFP per cycle, which could delay fast-polling sensors (such as ASIC temperature) and added duplicate ownership of transceiver telemetry between xcvrd and thermalctld.

Other consumers of TEMPERATURE_INFO were reviewed and are unaffected:

  • sonic-snmpagent PhysicalSensorTableMIBUpdater / ThermalCacheUpdater already filter out non-chassis thermals, and transceiver temperature is exposed via the dedicated TRANSCEIVER_DOM_SENSOR path (update_xcvr_dom_data).
  • system-health hardware_checker only reads the ASIC entry.

How Has This Been Tested?

  • pytest sonic-thermalctld/tests/test_thermalctld.py — all 89 tests pass.
  • Verified on a chassis with SFPs: TEMPERATURE_INFO no longer contains SFP entries written by thermalctld, while xcvrd-owned TRANSCEIVER_DOM_SENSOR continues to report transceiver temperature unchanged, and show platform temperature (via the updated tempershow in [tempershow]: Read SFP temperature from xcvrd-managed tables sonic-utilities#4522) lists all SFP modules correctly.

Additional Information (Optional)

N/A

SFP/transceiver temperature was being read from Redis (TRANSCEIVER_DOM_*
tables populated by xcvrd) in thermalctld's update loop. On high-port-count
platforms (e.g. 512 SFPs), this caused 3-5 Redis round-trips per SFP per
cycle, which could delay fast-polling sensors like ASIC temperature.

Remove SFP temperature polling from thermalctld entirely. SFP/transceiver
temperature reporting is owned by xcvrd and exposed via the
TRANSCEIVER_DOM_SENSOR table; thermalctld no longer needs to duplicate
this path. Changes:

- Drop the chassis-level and per-module SFP enumeration loops in
  TemperatureUpdater.update().
- Remove the now-unnecessary 'SFP not in parent_name' guard around
  PHYSICAL_ENTITY_INFO updates in _collect_thermals().
- Remove the SfpUtilHelper init, Redis-based SFP temperature/threshold
  readers, and _collect_sfp_thermals() helper.
- Update unit tests: drop SFP-specific tests and adjust the module
  thermal count expectation (no longer counts the SFP thermal).

Signed-off-by: Vasundhara Volam <vvolam@microsoft.com>
@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@vvolam
Copy link
Copy Markdown
Contributor Author

vvolam commented Apr 30, 2026

@Junchao-Mellanox Please review this PR, with this change, show platform temperature will no longer display the SFPs

@aditya-nexthop
Copy link
Copy Markdown
Contributor

@vvolam removing transceiver temperatures from show platform temperature is a big change.
Is there an HLD for all the recent changes in this area and was there a community discussion?
If yes, please can you share the recording link or date? I might have missed it. Thanks!

@judyjoseph
Copy link
Copy Markdown
Contributor

judyjoseph commented May 5, 2026

@vvolam removing transceiver temperatures from show platform temperature is a big change. Is there an HLD for all the recent changes in this area and was there a community discussion? If yes, please can you share the recording link or date? I might have missed it. Thanks!

@aditya-nexthop The approach was to remove the transceiver temperature polling from thermalctld and use data polled by xcvrd. Xcvrd does this in two ways already

  1. dom manager thread, storing it in TRANSCEIVER_DOM_SENSOR_TABLE
  2. new thread (introduced by PR : [xcvrd] Add new thread to poll temperatures more aggressively #685), storing it in TRANSCEIVER_DOM_TEMPERATURE_TABLE

We had this comment earlier :

# TODO: This Redis-based approach for reading SFP temperature is temporary.

So does nexthop platforms show sfp module temperature in "show plat temp" ?

adding @Junchao-Mellanox too.

@aditya-nexthop
Copy link
Copy Markdown
Contributor

@vvolam removing transceiver temperatures from show platform temperature is a big change. Is there an HLD for all the recent changes in this area and was there a community discussion? If yes, please can you share the recording link or date? I might have missed it. Thanks!

@aditya-nexthop The approach was to remove the transceiver temperature polling from thermalctld and use data polled by xcvrd. Xcvrd does this in two ways already

  1. dom manager thread, storing it in TRANSCEIVER_DOM_SENSOR_TABLE
  2. new thread (introduced by PR : [xcvrd] Add new thread to poll temperatures more aggressively #685), storing it in TRANSCEIVER_DOM_TEMPERATURE_TABLE

We had this comment earlier :

# TODO: This Redis-based approach for reading SFP temperature is temporary.

I think that getting the SFP temperature inline in thermalctld ensures that the values are fresh.
Going through the database assumes that xcvrd is able to poll temperatures at a rate matching or faster than thermalctld consumes them.

For platforms that have moved to using [xcvrd] Add new thread to poll temperatures more aggressively #685, this should be fine, but there might be other platforms that don't. Empirically, transceiver DOM (including temperature) updates at a frequency of 1 minute or slower (#758) when polling via the DOM manager thread.
This might result in a change of behavior on platforms that poll transceiver temperature directly from thermalctld.

CC @prgeor @mihirpat1

So does nexthop platforms show sfp module temperature in "show plat temp" ?

Yes our platforms currently do.

adding @Junchao-Mellanox too.

@judyjoseph
Copy link
Copy Markdown
Contributor

judyjoseph commented May 6, 2026

@aditya-nexthop just to bring context back again -- Based on the usecase and our internal lab testing where there could be a single device with like 500+ interfaces and sfp module temperature pooling loop in thermalctld affects its performance and application of cooling algorithm -- the approach is to move SFP module temperature polling out of thermalctld and let it be updated in xcvrd which anyways fetch this temperature info.

@aditya-nexthop
Copy link
Copy Markdown
Contributor

Thanks @judyjoseph for the context.

SFP/transceiver temperature was being read from Redis (TRANSCEIVER_DOM_* tables populated by xcvrd) inside thermalctld's update loop. On high-port-count platforms (e.g. 512 SFPs), this caused 3-5 Redis round-trips per SFP per cycle, which could delay fast-polling sensors (such as ASIC temperature) and added duplicate ownership of transceiver telemetry between xcvrd and thermalctld.

But now the platform SFP objects will make the redis calls to get the SFP temperature from the TRANSCEIVER_DOM_* tables, right?

temperature = try_get(thermal.get_temperature)

Just having some trouble understanding how this results in a speedup.

Also thanks @vvolam for raising the PR to retain the SFP temperature values in show platform temperature

@judyjoseph
Copy link
Copy Markdown
Contributor

@copilot resolve the merge conflicts in this pull request

judyjoseph
judyjoseph previously approved these changes May 8, 2026
Copy link
Copy Markdown
Contributor

@judyjoseph judyjoseph left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@Junchao-Mellanox
Copy link
Copy Markdown
Collaborator

Junchao-Mellanox commented May 9, 2026

@judyjoseph judyjoseph merged commit 5e12d7d into sonic-net:master May 13, 2026
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants