Updates to BMC platform management and monitoring spec #2294
judyjoseph wants to merge 11 commits into
|
/azp run |
|
No pipelines are associated with this pull request. |
|
@chander-nexthop f.y.i |
…use() to check if it is cold boot
Sleep for power_on_delay configured in CHASSIS_MODULE|SWITCH-HOST (this is a configurable value in config_db)
This is to make sure the Rack Manager is up and the liquid flow rate is good.
if the previous reboot was a Cold Boot (Full Power Cycle)
how do we determine that this is a full chassis power cycle?
@manapalai @vivekverma-arista @chander-nexthop Could you check on your platforms whether we can identify a reboot cause of "POWER-LOSS"? We are trying to determine whether the device has gone through a power cycle/cold boot.
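A minimal sketch of the cold-boot check discussed in this thread, assuming the platform reports a reboot cause string such as "POWER-LOSS". The function names, the `COLD_BOOT_CAUSES` set, and the injectable `sleep` parameter are hypothetical and are not part of the spec or of any existing platform API.

```python
import time

# Assumption: "POWER-LOSS" is the label a platform reports after a full
# power cycle. Real platforms may use a different string.
COLD_BOOT_CAUSES = {"POWER-LOSS"}


def power_on_delay_to_apply(reboot_cause, power_on_delay):
    """Return how many seconds to wait before powering on the Switch-Host.

    The delay applies only when the previous reboot was a cold boot.
    """
    if reboot_cause in COLD_BOOT_CAUSES and power_on_delay > 0:
        return power_on_delay
    return 0


def wait_before_power_on(reboot_cause, power_on_delay, sleep=time.sleep):
    """Sleep for the applicable delay and return the seconds waited."""
    delay = power_on_delay_to_apply(reboot_cause, power_on_delay)
    if delay:
        sleep(delay)
    return delay
```

Passing `sleep` as a parameter keeps the decision logic testable without actually waiting.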
…. Remove the platform API requirement to retrieve the time from Switch-Host, as it will be difficult to do it before systemd starts
Clarified initialization of system time in BMC RTC section and added details about the clock epoch file update.
can we define the severity values...
This is what we have
typedef enum {
    SENSOR_OK_NO_LEAK,    /* inside normal band; no leak indication */
    SENSOR_OK_MILD_LEAK,  /* in mild/advisory leak band */
    SENSOR_OK_HIGH_LEAK,  /* in high/critical leak band */
    SENSOR_DISCONNECTED,  /* sensor absent / open-circuit / not wired */
    SENSOR_SHORT          /* electrical short / invalid ADC / faulted range */
} leak_sensor_state_t;
in general - let us define input/output of all these APIs
@manapalai This already maps to what is defined today in platform API
LeakSeverity: LeakSeverity.CRITICAL or LeakSeverity.MINOR, or None if no leak
The other two should not be part of the leak severity enum; they only indicate whether the sensor is faulty, which maps to another platform API:
def is_leak_sensor_ok(self) -> bool:
How will it tell that the sensor is faulty? We call get_leak_severity(); should it return a value such as 'FAULT'?
We can check is_leak_sensor_ok() first and, depending on whether the sensor is OK or faulty, use or discard the get_leak_severity() data. Could you check this PR: sonic-net/sonic-platform-daemons#776
@manapalai @oleksandrivantsiv @vivekverma-arista @fraserg-arista @chander-nexthop @roger-nexthop I have also updated the wording a bit in latest commit -- to make sure it is clear that we take zone/location of leak sensor into consideration when getting the leak severity
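The ordering proposed in this thread (check sensor health first, then trust or discard the severity) could look roughly like the sketch below. The method names `is_leak_sensor_ok()` and `get_leak_severity()` and the `LeakSeverity.MINOR`/`LeakSeverity.CRITICAL` members come from the comments above; the enum values, the `FakeLeakSensor` stand-in, and `read_leak_state()` are assumptions made only for illustration.

```python
from enum import Enum
from typing import Optional, Tuple


class LeakSeverity(Enum):
    # Member names follow the thread; the string values are assumed.
    MINOR = "minor"
    CRITICAL = "critical"


class FakeLeakSensor:
    """Stand-in sensor for illustration; not the real platform class."""

    def __init__(self, ok: bool, severity: Optional[LeakSeverity]):
        self._ok = ok
        self._severity = severity

    def is_leak_sensor_ok(self) -> bool:
        return self._ok

    def get_leak_severity(self) -> Optional[LeakSeverity]:
        return self._severity


def read_leak_state(sensor) -> Tuple[bool, Optional[LeakSeverity]]:
    """Return (sensor_ok, severity).

    The severity is discarded (None) when the sensor itself is faulty,
    so a disconnected or shorted sensor never reports a leak level.
    """
    if not sensor.is_leak_sensor_ok():
        return False, None
    return True, sensor.get_leak_severity()
```

With this split, a caller can raise a sensor-fault alarm separately from a leak alarm, and `None` keeps its meaning of "no leak" only when the first element is `True`.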
|
@Yakiv-Huryk, please review |
…ling as a future enhancement
Updates to pmon design spec based on comments and further discussion with partners.
Summary
This PR consolidates Switch‑Host power‑control configuration into a unified schema, clarifies platform capability detection for liquid‑cooled systems, and updates default behaviors and CLI documentation for better consistency and predictability.
Key Changes
Add the RTC clock time synchronization approach
Apply the power_on_delay only if the device went through a full power cycle
Unified CONFIG_DB schema for Switch‑Host power control
  - … SWITCH_HOST_POWER_ON_DELAY|default and SWITCH_HOST_SHUTDOWN_TIMEOUT|default tables into a single CHASSIS_MODULE|SWITCH-HOST entry.
  - … admin_status field for the Switch‑Host. The default is set to down, ensuring the Switch‑Host remains powered off on initial device boot unless explicitly enabled.
Liquid‑cooling detection cleanup
  - … liquid_cooled=true flag from platform.env.conf, since the same BMC platform/SKU can be reused across air‑cooled and liquid‑cooled chassis.
  - … is_bmc() to identify whether the chassis has a BMC module.
  - … is_liquid_cooled_chassis() to indicate liquid or hybrid cooling capability.
Power‑on delay default behavior
  - … power_on_delay from -1 (Switch‑Host stays off) to 0, so the Switch‑Host powers on immediately when admin_status is set to up.
  - … admin_status.
CLI consistency and documentation updates
  - … SWITCH-HOST (uppercase, fixed identifier) instead of the <switch-host> placeholder.
  - … power_on_delay and graceful_shutdown_timeout.
  - … admin_status semantics and revised defaults.
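To illustrate the unified entry and its revised defaults, here is a rough sketch that models CHASSIS_MODULE|SWITCH-HOST as a plain dictionary. The field names admin_status, power_on_delay, and graceful_shutdown_timeout and the defaults (down, 0) come from the description above; the helper names and the dictionary representation are assumptions, and no default is assumed for graceful_shutdown_timeout because the description does not state one.

```python
# Defaults stated in the PR description: the Switch-Host stays off until
# admin_status is set to "up", and power-on is immediate (delay 0).
SWITCH_HOST_DEFAULTS = {"admin_status": "down", "power_on_delay": 0}


def resolve_switch_host_entry(entry):
    """Overlay a CHASSIS_MODULE|SWITCH-HOST entry on the defaults.

    `entry` is a dict of field -> value; values may be strings, as is
    typical for database-backed configuration.
    """
    resolved = dict(SWITCH_HOST_DEFAULTS)
    resolved.update(entry)
    resolved["power_on_delay"] = int(resolved["power_on_delay"])
    return resolved


def should_power_on(entry):
    """Return True only when the Switch-Host is administratively up."""
    return resolve_switch_host_entry(entry)["admin_status"] == "up"
```

For example, an empty entry resolves to a powered-off Switch-Host, while `{"admin_status": "up"}` powers it on with no delay.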