Updates to BMC platform management and monitoring spec #2294

Open
judyjoseph wants to merge 11 commits into sonic-net:master from judyjoseph:pmon_bmc_updates

@judyjoseph
Contributor

@judyjoseph judyjoseph commented Apr 16, 2026

Updates to pmon design spec based on comments and further discussion with partners.

Summary

This PR consolidates Switch‑Host power‑control configuration into a unified schema, clarifies platform capability detection for liquid‑cooled systems, and updates default behaviors and CLI documentation for better consistency and predictability.

Key Changes

  • Add the RTC clock time synchronization approach

    • Plan to use the file "/usr/lib/clock-epoch", which is present when the system boots
    • Later, the chrony service syncs with the configured NTP server and updates the system time
  • Apply the power_on_delay only if the device went through a full power cycle

    • Check the last reboot cause using the get_reboot_cause() API
    • If the cause is power loss (which I assume is a cold boot/POR), add the delay to make sure liquid cooling is good.
  • Unified CONFIG_DB schema for Switch‑Host power control

    • Merged the previously separate SWITCH_HOST_POWER_ON_DELAY|default and SWITCH_HOST_SHUTDOWN_TIMEOUT|default tables into a single CHASSIS_MODULE|SWITCH-HOST entry.
    • Introduced an explicit admin_status field for the Switch‑Host. The default is set to down, ensuring the Switch‑Host remains powered off on initial device boot unless explicitly enabled.
  • Liquid‑cooling detection cleanup

    • Removed the liquid_cooled=true flag from platform.env.conf, since the same BMC platform/SKU can be reused across air‑cooled and liquid‑cooled chassis.
    • Added explicit platform APIs to avoid implicit configuration:
      • is_bmc() to identify whether the chassis has a BMC module.
      • is_liquid_cooled_chassis() to indicate liquid or hybrid cooling capability.
  • Power‑on delay default behavior

    • Updated the default power_on_delay from -1 (Switch‑Host stays off) to 0, so the Switch‑Host powers on immediately when admin_status is set to up.
    • This makes the default behavior more intuitive while still preserving explicit administrative control via admin_status.
  • CLI consistency and documentation updates

    • Standardized CLI command examples to use SWITCH-HOST (uppercase, fixed identifier) instead of the <switch-host> placeholder.
    • Clarified CLI help text and descriptions for power_on_delay and graceful_shutdown_timeout.
    • Updated documentation to reflect the new admin_status semantics and revised defaults.
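
The power-on rules in the list above can be sketched as a small decision function. This is a minimal sketch under stated assumptions, not the spec's implementation: the dict layout of the CHASSIS_MODULE|SWITCH-HOST entry, the helper name `switch_host_power_on_delay`, and the `"Power Loss"` cause string are hypothetical; only `admin_status`, `power_on_delay`, and `get_reboot_cause()` come from the description.

```python
from typing import Optional

# Assumed reboot-cause string for a full power cycle (cold boot / POR).
# The real value reported by get_reboot_cause() is platform dependent.
COLD_BOOT_CAUSE = "Power Loss"


def switch_host_power_on_delay(entry: dict, reboot_cause: str) -> Optional[int]:
    """Return seconds to wait before powering on the Switch-Host, or None to keep it off.

    entry: fields of the CHASSIS_MODULE|SWITCH-HOST row (hypothetical dict layout).
    reboot_cause: cause string, e.g. obtained from the platform's get_reboot_cause().
    """
    # admin_status defaults to "down": the Switch-Host stays off until enabled.
    if entry.get("admin_status", "down") != "up":
        return None
    # power_on_delay defaults to 0: power on immediately once admin_status is up.
    delay = int(entry.get("power_on_delay", 0))
    # The delay applies only after a full power cycle, so liquid cooling can settle.
    if reboot_cause != COLD_BOOT_CAUSE:
        return 0
    return delay


print(switch_host_power_on_delay({}, "Power Loss"))
print(switch_host_power_on_delay({"admin_status": "up", "power_on_delay": "120"}, "Power Loss"))
print(switch_host_power_on_delay({"admin_status": "up", "power_on_delay": "120"}, "Reboot"))
```

Returning `None` for "stay off" keeps that case distinct from a zero-second delay, which mirrors the move away from using -1 as a sentinel in `power_on_delay`.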

@mssonicbld
Collaborator

/azp run

@azure-pipelines

No pipelines are associated with this pull request.

@judyjoseph judyjoseph changed the title Revise BMC platform configuration details Revise BMC platform management and monitoring spec Apr 16, 2026
@judyjoseph
Contributor Author

@chander-nexthop f.y.i

@judyjoseph judyjoseph changed the title Revise BMC platform management and monitoring spec Update BMC platform management and monitoring spec Apr 16, 2026
@judyjoseph judyjoseph changed the title Update BMC platform management and monitoring spec Updates to BMC platform management and monitoring spec Apr 16, 2026

@judyjoseph judyjoseph marked this pull request as ready for review April 28, 2026 04:55
Comment thread doc/bmc/sonicBMC/pmon-bmc-design.md Outdated
```
Sleep for power_on_delay configured in CHASSIS_MODULE|SWITCH-HOST (this is configurable value in config_db)
This is to make sure the Rack Manager is up and Liquid flow rate is good.
if the previous reboot was a Cold Boot (Full Power Cycle)
```
Contributor

how do we determine that this is a full chassis power cycle?

Contributor Author

@manapalai @vivekverma-arista @chander-nexthop Could you check whether your platforms can report a reboot cause of "POWER-LOSS"? I am trying to see if we can tell that the device has gone through a power cycle/cold boot.

…. Remove the platform API requirement to retrieve the time from Switch-Host, as it will be difficult to do it before systemd starts

Clarified initialization of system time in BMC RTC section and added details about the clock epoch file update.

Comment thread doc/bmc/sonicBMC/pmon-bmc-design.md
Comment thread doc/bmc/sonicBMC/pmon-bmc-design.md Outdated
Comment thread doc/bmc/sonicBMC/pmon-bmc-design.md
Comment thread doc/bmc/sonicBMC/pmon-bmc-design.md
Comment thread doc/bmc/sonicBMC/pmon-bmc-design.md Outdated

@manapalai manapalai May 1, 2026

can we define the severity values...

This is what we have
typedef enum {
    SENSOR_OK_NO_LEAK,    /* inside normal band; no leak indication */
    SENSOR_OK_MILD_LEAK,  /* in mild/advisory leak band */
    SENSOR_OK_HIGH_LEAK,  /* in high/critical leak band */
    SENSOR_DISCONNECTED,  /* sensor absent / open-circuit / not wired */
    SENSOR_SHORT          /* electrical short / invalid ADC/faulted range */
} leak_sensor_state_t;

in general - let us define input/output of all these APIs

Contributor Author

@judyjoseph judyjoseph May 1, 2026

@manapalai This already maps to what is defined today in platform API

https://github.com/sonic-net/sonic-platform-common/blob/41af585ee6e359ea9adf6c5f9f422bbfd750f5bb/sonic_platform_base/liquid_cooling_base.py#L85

        LeakSeverity: LeakSeverity.CRITICAL or LeakSeverity.MINOR, or None if no leak

The other two should not be returned as leak severity enum values; they should only indicate whether the sensor is faulty, which maps to another platform API:

def is_leak_sensor_ok(self) -> bool:

How will it tell that the sensor is faulty? We call get_leak_severity(), so should it return a value such as 'FAULT'?

Contributor Author

We can check is_leak_sensor_ok() first and, depending on whether the sensor is OK or faulty, use or discard the get_leak_severity() data. Could you check this PR: sonic-net/sonic-platform-daemons#776
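
The check order suggested here could look like the sketch below. It is illustrative only: `LeakSeverity` is re-declared locally with just the `MINOR` and `CRITICAL` members quoted above, and `FakeLeakSensor`, `read_leak`, and the returned status strings are hypothetical stand-ins, not the classes in sonic-platform-common or the code in the referenced PR.

```python
from enum import Enum
from typing import Optional


class LeakSeverity(Enum):
    # Local stand-in with the two members quoted in this thread.
    MINOR = "minor"
    CRITICAL = "critical"


class FakeLeakSensor:
    """Hypothetical sensor exposing the two platform calls discussed above."""

    def __init__(self, ok: bool, severity: Optional[LeakSeverity]):
        self._ok = ok
        self._severity = severity

    def is_leak_sensor_ok(self) -> bool:
        return self._ok

    def get_leak_severity(self) -> Optional[LeakSeverity]:
        return self._severity


def read_leak(sensor) -> str:
    """Check sensor health first; use the severity only when the sensor is healthy."""
    if not sensor.is_leak_sensor_ok():
        # Disconnected or shorted sensor: discard whatever severity it reports.
        return "SENSOR_FAULT"
    severity = sensor.get_leak_severity()
    return "NO_LEAK" if severity is None else severity.name


print(read_leak(FakeLeakSensor(True, None)))
print(read_leak(FakeLeakSensor(True, LeakSeverity.CRITICAL)))
print(read_leak(FakeLeakSensor(False, LeakSeverity.MINOR)))
```

Under this split, the SENSOR_DISCONNECTED and SENSOR_SHORT states from the C enum would both surface as a failed is_leak_sensor_ok() rather than as extra severity values.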

Contributor Author

@manapalai @oleksandrivantsiv @vivekverma-arista @fraserg-arista @chander-nexthop @roger-nexthop I have also updated the wording a bit in the latest commit to make it clear that we take the zone/location of the leak sensor into consideration when getting the leak severity.

@oleksandrivantsiv
Contributor

@Yakiv-Huryk, please review

Contributor Author

@judyjoseph judyjoseph left a comment

NOT ready to merge yet, as there could be some change in the initial state of the Switch when the switch is powered on in Buildout.

7 participants