[tempershow]: Read SFP temperature from xcvrd-managed tables #4522

Merged
judyjoseph merged 3 commits into sonic-net:master from vvolam:tempershow-sfp-from-xcvrd
May 13, 2026


@vvolam vvolam commented May 6, 2026

What I did

Update tempershow (show platform temperature) to read SFP/transceiver temperatures from xcvrd-managed STATE_DB tables (TRANSCEIVER_DOM_TEMPERATURE for the value, TRANSCEIVER_DOM_THRESHOLD for thresholds, TRANSCEIVER_DOM_FLAG for the warning state), instead of relying on TEMPERATURE_INFO entries published by thermalctld.

This is the CLI-side companion to the thermalctld change that stops polling and publishing per-SFP temperature data into TEMPERATURE_INFO (sonic-platform-daemons). After that change, SFP rows would silently disappear from show platform temperature unless the CLI sources them directly from xcvrd's tables.

Field mapping from xcvrd tables to tempershow columns:

tempershow column    xcvrd table                    xcvrd field
-----------------    ---------------------------    -----------------------------------------------
Temperature          TRANSCEIVER_DOM_TEMPERATURE    temperature
High TH              TRANSCEIVER_DOM_THRESHOLD      temphighwarning
Low TH               TRANSCEIVER_DOM_THRESHOLD      templowwarning
Crit High TH         TRANSCEIVER_DOM_THRESHOLD      temphighalarm
Crit Low TH          TRANSCEIVER_DOM_THRESHOLD      templowalarm
Warning              TRANSCEIVER_DOM_FLAG           tempHWarn / tempLWarn / tempHAlarm / tempLAlarm

Note: xcvrd's TRANSCEIVER_DOM_FLAG table uses camelCase field names such as tempHWarn (per the SFP API get_transceiver_dom_flags()), while TRANSCEIVER_DOM_THRESHOLD uses all-lowercase field names such as temphighwarning (per get_transceiver_threshold_info()). tempershow handles both naming conventions.
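
The column mapping above can be read as a small lookup table. The following is a minimal sketch and not the code from this PR: THRESHOLD_FIELDS and build_sfp_row are hypothetical names, and plain dictionaries stand in for the hashes that tempershow would read from STATE_DB.

```python
NOT_AVAILABLE = "N/A"

# tempershow column -> field name in TRANSCEIVER_DOM_THRESHOLD (from the table above)
THRESHOLD_FIELDS = {
    "High TH": "temphighwarning",
    "Low TH": "templowwarning",
    "Crit High TH": "temphighalarm",
    "Crit Low TH": "templowalarm",
}


def build_sfp_row(name, temperature_entry, threshold_entry):
    """Return the value columns of one SFP row; missing fields become N/A."""
    row = {
        "Sensor": name,
        "Temperature": temperature_entry.get("temperature", NOT_AVAILABLE),
    }
    for column, field in THRESHOLD_FIELDS.items():
        # dict.get avoids a KeyError when xcvrd has not published a field
        row[column] = threshold_entry.get(field, NOT_AVAILABLE)
    return row


# Hypothetical sample data: the low thresholds are deliberately absent.
row = build_sfp_row(
    "xSFP module 1 Temp",
    {"temperature": "46.715"},
    {"temphighwarning": "75.0", "temphighalarm": "80.0"},
)
print(row)
```

Using dict.get with a default is one simple way to get the "N/A instead of KeyError" behavior described in this PR; the actual implementation may differ.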

Additional behavior:

  • Sensor name parity with the legacy thermalctld output: SFP rows are labelled xSFP module <N> Temp, mapped via the shared utilities_common.platform_sfputil_helper (used by sfpshow / sfputil), so multi-ASIC port-config handling stays consistent across CLIs. Falls back to the logical port name if the helper is unavailable.
  • Warning column is True when any of the four temperature flags (tempHWarn, tempLWarn, tempHAlarm, tempLAlarm) in TRANSCEIVER_DOM_FLAG is asserted; False when all four are present and de-asserted; N/A when the flag table has no temperature flags for the port. This matches the legacy thermalctld semantics.
  • Timestamp column uses the current wall-clock time formatted as YYYYMMDD HH:MM:SS, matching the legacy thermalctld behavior of stamping every poll with the current time.
  • Multi-ASIC support: Calls SonicDBConfig.load_sonic_global_db_config() to initialize global DB config before creating namespace-scoped connections. SFP rows are collected by iterating every front-end ASIC namespace's STATE_DB (via multi_asic.get_front_end_namespaces() and multi_asic.connect_to_all_dbs_for_ns()) and merging the results, so transceiver rows from per-namespace tables are not missed on multi-ASIC platforms.
  • Existing platform thermal sensors (chassis/PSU/fan/CPU/SODIMM) continue to be displayed from TEMPERATURE_INFO.
  • Missing fields default to N/A instead of raising KeyError.
  • Tightened the table-key glob to <table>|* so unrelated keys are not accidentally matched.
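
The three-state Warning rule described in the list above could be sketched as follows. This is an illustration only: derive_sfp_warning is a hypothetical name, the flags are assumed to be stored as the strings "True" and "False", and the handling of a partially populated flag entry (returning N/A) is an assumption that the PR text does not spell out.

```python
TEMP_FLAGS = ("tempHWarn", "tempLWarn", "tempHAlarm", "tempLAlarm")


def derive_sfp_warning(flag_entry):
    """Map the four temperature flags to the strings 'True', 'False' or 'N/A'."""
    # Collect only the temperature flags that are actually present.
    values = [flag_entry[name] for name in TEMP_FLAGS if name in flag_entry]
    # Any asserted flag means the row is in a warning state.
    if any(str(value).strip().lower() == "true" for value in values):
        return "True"
    # All four flags present and none asserted.
    if len(values) == len(TEMP_FLAGS):
        return "False"
    # No flags, or (by assumption) only some of them, are populated.
    return "N/A"


all_clear = {name: "False" for name in TEMP_FLAGS}
print(derive_sfp_warning(all_clear))
print(derive_sfp_warning(dict(all_clear, tempHAlarm="True")))
print(derive_sfp_warning({}))
```

Returning strings rather than booleans keeps the Warning value the same type as the values read from TEMPERATURE_INFO for platform sensor rows.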

How I did it

  • Added _collect_platform_sensors() (existing TEMPERATURE_INFO data path) and a new _collect_sfp_sensors() that reads TRANSCEIVER_DOM_TEMPERATURE|* plus the matching TRANSCEIVER_DOM_THRESHOLD|<port> and TRANSCEIVER_DOM_FLAG|<port> rows.
  • Added _init_sfp_util_helper() / _sfp_display_name() that delegate to utilities_common.platform_sfputil_helper for logical -> physical port mapping, reusing the shared multi-ASIC-aware helper instead of re-implementing the porttab init logic.
  • Added _get_sfp_db_connections() to return one STATE_DB connection per front-end ASIC namespace on multi-ASIC platforms (and a single host connection on single-ASIC), so transceiver tables are read from every namespace.
  • Added _derive_sfp_warning() to translate the four temperature flags in TRANSCEIVER_DOM_FLAG into the True / False / N/A warning state.
  • Added SonicDBConfig.load_sonic_global_db_config() call (guarded by isGlobalInit()) in __init__ to initialize global DB config for multi-ASIC namespace resolution. Without this, connect_to_all_dbs_for_ns() fails with validateNamespace: Initialize global DB config.
  • Compute the Timestamp once per show invocation using datetime.now().strftime('%Y%m%d %H:%M:%S') so the format matches the legacy TEMPERATURE_INFO.timestamp column written by thermalctld.
  • show() merges platform + SFP rows; both tabular and JSON (-j) outputs are supported with the same column set as before.
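
Two of the steps above, the display-name fallback and the per-invocation timestamp, can be sketched in a few lines. This is a hedged illustration: sfp_display_name and format_timestamp are hypothetical names, and a plain dictionary stands in for the logical-to-physical port lookup that the PR delegates to utilities_common.platform_sfputil_helper.

```python
from datetime import datetime


def format_timestamp(now):
    """Format a datetime the way the Timestamp column expects."""
    return now.strftime("%Y%m%d %H:%M:%S")


def sfp_display_name(logical_port, logical_to_physical):
    """Return 'xSFP module <N> Temp', or the logical port name as a fallback.

    logical_to_physical is a stand-in mapping of logical port name to a list
    of physical port numbers; None models an unavailable helper.
    """
    if not logical_to_physical:
        return logical_port
    physical_ports = logical_to_physical.get(logical_port)
    if not physical_ports:
        return logical_port
    return "xSFP module {} Temp".format(physical_ports[0])


# Compute the timestamp once, then reuse it for every SFP row.
timestamp = format_timestamp(datetime.now())
port_map = {"Ethernet0": [1], "Ethernet4": [2]}  # hypothetical mapping
for port in ("Ethernet0", "Ethernet4", "Ethernet8"):
    print(sfp_display_name(port, port_map), timestamp)
```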

How to verify it

  1. Install the updated tempershow and the companion thermalctld (which no longer publishes SFP entries to TEMPERATURE_INFO).
  2. Run show platform temperature.
  3. Confirm:
    • SFP rows are present and labelled xSFP module <N> Temp.
    • SFP temperature values match sonic-db-cli -n asic0 STATE_DB hget 'TRANSCEIVER_DOM_TEMPERATURE|<port>' temperature (on multi-ASIC) or redis-cli -n 6 hget 'TRANSCEIVER_DOM_TEMPERATURE|<port>' temperature (on single-ASIC).
    • SFP thresholds match the corresponding TRANSCEIVER_DOM_THRESHOLD|<port> fields.
    • Warning reflects the temperature flag state in TRANSCEIVER_DOM_FLAG|<port> (False when all four temp flags are de-asserted, True if any is asserted, N/A if the flags are not yet populated).
    • Timestamp is in the same YYYYMMDD HH:MM:SS format as the platform sensor rows and reflects the time the command was run.
    • Chassis / PSU / fan / CPU / SODIMM rows still appear and remain sourced from TEMPERATURE_INFO.
    • show platform temperature -j still produces well-formed JSON with Sensor / Temperature keys.
  4. Verify there are no TEMPERATURE_INFO|*SFP* keys remaining: redis-cli -n 6 keys 'TEMPERATURE_INFO|*SFP*'.
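
The JSON check in step 3 can be automated with a short script. The sketch below assumes that the -j output is a JSON list of objects, one per sensor row, which this PR does not state explicitly; check_temperature_json is a hypothetical helper, and the sample text is made up for illustration.

```python
import json


def check_temperature_json(text):
    """Return a list of problems found in the JSON text; empty means OK."""
    rows = json.loads(text)  # raises ValueError if the text is not valid JSON
    problems = []
    for index, row in enumerate(rows):
        for key in ("Sensor", "Temperature"):
            if key not in row:
                problems.append("row {}: missing {}".format(index, key))
    return problems


# Hypothetical sample resembling two rows of output; the second is malformed.
sample = json.dumps([
    {"Sensor": "xSFP module 1 Temp", "Temperature": "48.617"},
    {"Sensor": "xSFP module 2 Temp"},
])
print(check_temperature_json(sample))
```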

Tested on

  • Single-ASIC: Mellanox SN2700 (SONiC master) — SFP rows displayed correctly
  • Multi-ASIC: Arista 7280DR3AM-36 (ASIC Count: 2, Broadcom DNX) — SFP rows from both asic0 and asic1 namespaces displayed correctly after SonicDBConfig.load_sonic_global_db_config() fix

Previous command output (if the output of a command-line utility has changed)

$ show platform temperature
                Sensor    Temperature    High TH    Low TH    Crit High TH    Crit Low TH    Warning          Timestamp
----------------------  -------------  ---------  --------  --------------  -------------  ---------  -----------------
                  ASIC           71.0        105       N/A             120            N/A      False  20260506 18:28:08
 Ambient Fan Side Temp         32.562        N/A       N/A             N/A            N/A      False  20260506 18:28:08
Ambient Port Side Temp         32.562        N/A       N/A             N/A            N/A      False  20260506 18:28:08
         CPU Pack Temp         34.375       95.0       N/A           100.0            N/A      False  20260506 18:28:08
            PSU-1 Temp            N/A        N/A       N/A             N/A            N/A      False  20260506 18:28:08
            PSU-2 Temp           24.5       63.0       N/A             N/A            N/A      False  20260506 18:28:08
            PSU-3 Temp            N/A        N/A       N/A             N/A            N/A      False  20260506 18:28:08
            PSU-4 Temp           24.5       63.0       N/A             N/A            N/A      False  20260506 18:28:08
         SODIMM 2 Temp           35.5       85.0       N/A            95.0            N/A      False  20260506 18:28:08
    xSFP module 1 Temp         46.715       75.0      -5.0            80.0          -10.0      False  20260506 18:28:08
    xSFP module 2 Temp         53.297       75.0      -5.0            80.0          -10.0      False  20260506 18:28:08
    ... (SFP modules 3..64, populated transceivers)
   xSFP module 65 Temp            0.0       70.0       0.0            75.0           -5.0      False  20260506 18:28:08
   xSFP module 66 Temp            0.0       70.0       0.0            75.0           -5.0      False  20260506 18:28:08

(SFP rows sourced from TEMPERATURE_INFO populated by thermalctld.)

New command output (if the output of a command-line utility has changed)

$ show platform temperature
                Sensor    Temperature    High TH    Low TH    Crit High TH    Crit Low TH    Warning          Timestamp
----------------------  -------------  ---------  --------  --------------  -------------  ---------  -----------------
                  ASIC           72.0        105       N/A             120            N/A      False  20260506 22:06:16
 Ambient Fan Side Temp         33.812        N/A       N/A             N/A            N/A      False  20260506 22:05:58
Ambient Port Side Temp         33.562        N/A       N/A             N/A            N/A      False  20260506 22:05:58
         CPU Pack Temp          34.75       95.0       N/A           100.0            N/A      False  20260506 22:05:58
            PSU-1 Temp            N/A        N/A       N/A             N/A            N/A      False  20260506 22:05:58
            PSU-2 Temp           25.0       63.0       N/A             N/A            N/A      False  20260506 22:05:58
            PSU-3 Temp            N/A        N/A       N/A             N/A            N/A      False  20260506 22:05:58
            PSU-4 Temp           25.0       63.0       N/A             N/A            N/A      False  20260506 22:05:58
         SODIMM 2 Temp           36.0       85.0       N/A            95.0            N/A      False  20260506 22:05:58
    xSFP module 1 Temp         48.617       75.0      -5.0            80.0          -10.0      False  20260506 22:06:19
    xSFP module 2 Temp         54.832       75.0      -5.0            80.0          -10.0      False  20260506 22:06:19
   ... (SFP modules 3..64, populated transceivers)
   xSFP module 65 Temp            0.0       70.0       0.0            75.0           -5.0        N/A  20260506 22:06:19
   xSFP module 66 Temp            0.0       70.0       0.0            75.0           -5.0        N/A  20260506 22:06:19

Notes:

  • SFP rows now come from TRANSCEIVER_DOM_TEMPERATURE / TRANSCEIVER_DOM_THRESHOLD (values + thresholds) and TRANSCEIVER_DOM_FLAG (warning state) instead of TEMPERATURE_INFO.
  • Sensor labels are unchanged (xSFP module <N> Temp), so dashboards and parsers depending on the legacy naming continue to work.
  • Warning and Timestamp columns are populated for SFP rows with the same semantics and format as the legacy thermalctld output. Modules 65 and 66 show N/A for Warning because no TRANSCEIVER_DOM_FLAG entry is populated for those ports (no transceiver is inserted in them on this device).

Copilot AI review requested due to automatic review settings May 6, 2026 19:38
@mssonicbld
/azp run

@azure-pipelines
Azure Pipelines successfully started running 1 pipeline(s).

Copilot AI left a comment

Pull request overview

Updates the tempershow utility (invoked by show platform temperature) so SFP/transceiver temperature rows are sourced directly from xcvrd-managed STATE_DB tables, avoiding dependence on thermalctld publishing per-SFP rows into TEMPERATURE_INFO.

Changes:

  • Add collection path for platform sensors from TEMPERATURE_INFO and new SFP sensor collection from TRANSCEIVER_DOM_TEMPERATURE + TRANSCEIVER_DOM_THRESHOLD.
  • Add SFP display-name mapping via SfpUtilHelper to preserve legacy xSFP module <N> Temp labels (with fallback behavior).
  • Make missing DB fields render as N/A and tighten DB key matching to <TABLE>|*.

@vvolam vvolam force-pushed the tempershow-sfp-from-xcvrd branch from a3ac7e8 to 0b2ded0 Compare May 6, 2026 20:02
thermalctld no longer publishes per-SFP entries into TEMPERATURE_INFO.
Update tempershow to additionally read SFP temperatures from the
xcvrd-managed TRANSCEIVER_DOM_TEMPERATURE table, with thresholds from
TRANSCEIVER_DOM_THRESHOLD:

  High TH       <- temphighwarning
  Low TH        <- templowwarning
  Crit High TH  <- temphighalarm
  Crit Low TH   <- templowalarm

Existing TEMPERATURE_INFO consumers (chassis/PSU/fan sensors) continue
to be displayed. Missing fields default to N/A instead of raising
KeyError; the table key glob is also tightened to '<table>|*'.

Signed-off-by: Vasundhara Volam <vvolam@microsoft.com>
Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 4 comments.

- _init_sfp_util_helper: also catch SystemExit raised by
  platform_sfputil_helper load functions so the CLI falls back to
  logical port names instead of terminating.
- _get_sfp_db_connections: keep successful per-namespace STATE_DB
  connections and skip only the failing namespace, instead of
  dropping all connections on any failure.
- _collect_sfp_sensors: prefetch TRANSCEIVER_DOM_THRESHOLD and
  TRANSCEIVER_DOM_FLAG once per namespace via new _prefetch_table()
  to avoid an N+1 hgetall pattern per port.
- _derive_sfp_warning: return 'True'/'False'/'N/A' strings to keep
  the JSON output's Warning field type consistent with platform
  sensor rows from TEMPERATURE_INFO.

Signed-off-by: Vasundhara Volam <vvolam@microsoft.com>
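
Two of the fixes in the commit message above, keeping the namespaces that connect successfully and reading each table once per namespace, could look roughly like this. It is a sketch under stated assumptions: FakeStateDb is a simplified stand-in whose keys() and get_all() take fewer arguments than the real connector, and collect_connections and prefetch_table are hypothetical names rather than the PR's helpers.

```python
class FakeStateDb:
    """Simplified stand-in for a STATE_DB connection (illustration only)."""

    def __init__(self, data):
        self.data = data

    def keys(self, pattern):
        prefix = pattern.rstrip("*")
        return [key for key in self.data if key.startswith(prefix)]

    def get_all(self, key):
        return dict(self.data.get(key, {}))


def collect_connections(namespaces, connect):
    """Connect to every namespace; skip only the ones that fail."""
    connections = []
    for namespace in namespaces:
        try:
            connections.append((namespace, connect(namespace)))
        except Exception:
            continue  # keep the connections that already succeeded
    return connections


def prefetch_table(db, table):
    """Read every '<table>|<port>' hash in one pass; return {port: fields}."""
    entries = {}
    for key in db.keys(table + "|*"):
        port = key.split("|", 1)[1]
        entries[port] = db.get_all(key)
    return entries


def connect(namespace):
    # Hypothetical connector: asic1 fails, asic0 returns sample data.
    if namespace == "asic1":
        raise RuntimeError("namespace unavailable")
    return FakeStateDb({
        "TRANSCEIVER_DOM_THRESHOLD|Ethernet0": {"temphighalarm": "80.0"},
        "TRANSCEIVER_DOM_FLAG|Ethernet0": {"tempHAlarm": "False"},
    })


connections = collect_connections(["asic0", "asic1"], connect)
thresholds = prefetch_table(connections[0][1], "TRANSCEIVER_DOM_THRESHOLD")
print([namespace for namespace, _ in connections], thresholds)
```

After the prefetch, each port's thresholds and flags are plain dictionary lookups, so the per-port loop no longer needs to query the database for them.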
judyjoseph previously approved these changes May 8, 2026
@judyjoseph judyjoseph left a comment

LGTM

prgeor previously approved these changes May 11, 2026
…e resolution

On multi-ASIC platforms, SonicDBConfig.load_sonic_global_db_config()
must be called before creating namespace-scoped SonicV2Connector
instances. Without this, connect_to_all_dbs_for_ns() fails with
'validateNamespace: Initialize global DB config' and SFP temperatures
from per-ASIC namespace STATE_DBs are not displayed.

Add the standard isGlobalInit/load_sonic_global_db_config guard in
TemperShow.__init__(), matching the pattern used by fdbshow, fdbclear,
and nbrshow.

Signed-off-by: Vasundhara Volam <vvolam@microsoft.com>
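
The guard described in the commit message above is an "initialize once" check. So that the sketch runs off-device, it takes the config class as a parameter and uses a stand-in: FakeDbConfig and ensure_global_db_config are hypothetical, and on a SONiC device the real SonicDBConfig class would presumably be passed instead.

```python
class FakeDbConfig:
    """Stand-in exposing the two class methods named in the commit message."""

    loaded = False
    load_calls = 0

    @classmethod
    def isGlobalInit(cls):
        return cls.loaded

    @classmethod
    def load_sonic_global_db_config(cls):
        cls.loaded = True
        cls.load_calls += 1


def ensure_global_db_config(db_config):
    """Load the global DB config only if it has not been loaded yet."""
    if not db_config.isGlobalInit():
        db_config.load_sonic_global_db_config()


# Calling the guard twice loads the config only once.
ensure_global_db_config(FakeDbConfig)
ensure_global_db_config(FakeDbConfig)
print(FakeDbConfig.load_calls)
```

The check makes the call safe to repeat, which matters when several CLI code paths may each try to initialize the global config before opening namespace-scoped connections.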
@vvolam vvolam dismissed stale reviews from Junchao-Mellanox, prgeor, and judyjoseph via 1c9ea92 May 12, 2026 22:21
@judyjoseph judyjoseph merged commit 1f0271d into sonic-net:master May 13, 2026
11 checks passed
@vvolam vvolam deleted the tempershow-sfp-from-xcvrd branch May 14, 2026 17:11

6 participants