Skip to content

Commit 68ebd3f

Browse files
authored
cosmo_seq: mask RAA229620A Iout overcurrent warnings (#2511)
See #2510 --- evidentially, AMD requires that the SVI3 fast sum overcurrent threshold for VRMs supplying SVI3 power rails to Turin CPUs be set at a threshold of 90% of 90% of the part's EDC amperage. This results in the VRM setting the overcurrent warning bit in `STATUS_IOUT` and tugging on its fault pin to request attention pretty incessantly when an IRM Group G CPU is under heavy (but importantly, safe!) load. AMD suggests that BMC's "mask or ignore" this warning, since the threshold has to be set to a value that results in it being asserted during normal operation due to reasons I am no doubt to provincial to understand. Therefore, this commit makes Hubris mask out the `STATUS_IOUT` bit for overcurrent warnings on the RAA229620A `VDDCR_CPU0` and `VDDCR_CPU1` SVI3 regulators. This felt like the best solution because it (a) shouldn't suppress any fault bits that actually *do* represent something bad, and (b) unlike ignoring this specific warning, doesn't result in Hubris taking a bunch of IRQs that it just ultimately decides to do nothing about. With help from @ericaasen, we have reproduced the warnings by manually adjusting the threshold to a lower value, as the CPU load testing script I was running wasn't actually able to make a 9755 hit the 297 amp threshold. I've validated that masking the relevant bit in the `SMBALERT_MASK` for the `STATUS_IOUT` register makes the IRQing go away. So that's good. I don't _love_ the function I wrote for masking this bit: it's specific for setting `SMBALERT_MASK` for `STATUS_IOUT` and not for any other register. I feel like it ought to be possible to write a single function for setting `SMBALERT_MASK` for *any* status register, but the abstractions provided by `pmbus` at time of writing made it difficult to do so generically --- or perhaps I'm holding it wrong! Anyway, I'd really like to get the immediate fix in first and then try to come up with a nicer abstraction for `SMBALERT_MASK` later, since I'd like to stop the ereport spew in R20 if possible. In addition, I have made some changes to the `initialize_pmbus_alerts` (née `initialize_pmbus_warnings`) method in `Vcore`. Since that method now both sets the input undervoltage warning threshold _and_ masks the output overcurrent warning, I've changed it to not return a `Result` so that it cannot accidentally `?` out early without trying to set _all_ the PMBus configurations it tries to set. We probably want to attempt to set the `SMBALERT_MASK` even if we encountered an amount of I2C Weather that caused setting the UV threshold for one of the VRMs to fail. So now we put the error in the ringbuf and keep going there. Fixes #2510.
1 parent 1754c74 commit 68ebd3f

3 files changed

Lines changed: 95 additions & 21 deletions

File tree

drv/cosmo-seq-server/src/main.rs

Lines changed: 5 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -762,18 +762,11 @@ impl ServerImpl {
762762
notifications::SEQ_IRQ_MASK,
763763
sys_api::IrqControl::Enable,
764764
);
765-
// Enable the undervoltage warning PMBus alert from the Vcore
766-
// regulators.
767-
//
768-
// Yes, we just ignore the error here --- while that seems a bit
769-
// sketchy, but what else can we do? It seems pretty bad to panic and
770-
// say "nope, the computer won't turn on" because we weren't able to do
771-
// an I2C transaction to turn on an interrupt that we only use for
772-
// monitoring for faults. The initialize method will retry internally a
773-
// few times, so we should power through any transient I2C messiness,
774-
// and any I2C errors that occur get logged in the `vcore` module's
775-
// ringbuf.
776-
let _ = self.vcore.initialize_uv_warning();
765+
// Configure PMBus alerts from the Vcore regulators. This will attempt
766+
// to set a limit for the input undervoltage warning, and mask out the
767+
// (spurious) output overcurrent warning (which must be set to a
768+
// threshold that is not indicative of a real fault).
769+
self.vcore.initialize_pmbus_alerts();
777770
self.seq.ier.modify(|m| {
778771
m.set_fanfault(true);
779772
m.set_thermtrip(true);

drv/cosmo-seq-server/src/vcore.rs

Lines changed: 60 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -18,6 +18,7 @@ use drv_i2c_api::ResponseCode;
1818
use drv_i2c_devices::raa229620a::{self, Raa229620A};
1919
use ereports::pwr::{PmbusAlert, PmbusStatus};
2020
use fixedstr::FixedStr;
21+
use pmbus::commands::raa229620a::STATUS_IOUT;
2122
use pmbus::commands::raa229620a::STATUS_WORD;
2223
use ringbuf::*;
2324
use userlib::{TaskId, sys_get_timer, units};
@@ -41,6 +42,7 @@ pub(crate) enum Rail {
4142
#[derive(Copy, Clone, PartialEq)]
4243
enum PmbusCmd {
4344
LoadLimit,
45+
SetStatusIoutMask,
4446
ClearFaults,
4547
ReadVin,
4648
Status,
@@ -51,7 +53,12 @@ enum Trace {
5153
None,
5254
Initializing,
5355
Initialized,
54-
LimitsLoaded,
56+
LimitsLoaded {
57+
all_ok: bool,
58+
},
59+
StatusIoutMaskSet {
60+
all_ok: bool,
61+
},
5562
PmbusAlert {
5663
timestamp: u64,
5764
alerted: Vrms,
@@ -198,17 +205,63 @@ impl VCore {
198205
self.faulted.pwr_cont1 || self.faulted.pwr_cont2
199206
}
200207

201-
pub fn initialize_uv_warning(&mut self) -> Result<(), ResponseCode> {
208+
pub fn initialize_pmbus_alerts(&mut self) {
202209
ringbuf_entry!(Trace::Initializing);
203210

211+
// Yes, we just ignore errors here --- that may seem a bit sketchy, but
212+
// what else can we do? It seems pretty bad to panic and say "nope, the
213+
// computer won't turn on" because we weren't able to do an I2C
214+
// transaction to turn on an interrupt that we only use for monitoring.
215+
// Each step will retry internally a few times, so we should power
216+
// through any transient I2C messiness, and any I2C errors that occur
217+
// get logged in the ringbuf. We also do *not* bail out early if any
218+
// other step fails, because we would still like to do every other thing
219+
// if possible.
220+
204221
// Set our warn limit
205-
retry_i2c_txn(Rail::VddcrCpu0, PmbusCmd::LoadLimit, || {
222+
let mut all_ok = true;
223+
all_ok &= retry_i2c_txn(Rail::VddcrCpu0, PmbusCmd::LoadLimit, || {
206224
self.vddcr_cpu0.set_vin_uv_warn_limit(VCORE_UV_WARN_LIMIT)
207-
})?;
208-
retry_i2c_txn(Rail::VddcrCpu1, PmbusCmd::LoadLimit, || {
225+
})
226+
.is_ok();
227+
all_ok &= retry_i2c_txn(Rail::VddcrCpu1, PmbusCmd::LoadLimit, || {
209228
self.vddcr_cpu1.set_vin_uv_warn_limit(VCORE_UV_WARN_LIMIT)
210-
})?;
211-
ringbuf_entry!(Trace::LimitsLoaded);
229+
})
230+
.is_ok();
231+
ringbuf_entry!(Trace::LimitsLoaded { all_ok });
232+
233+
let iout_mask = {
234+
let mut mask = STATUS_IOUT::CommandData(0);
235+
// Mask out SMBus alerts for output overcurrent warnings on the
236+
// RAA229620A. This is necessary because the AMD power design
237+
// guidelines for Turin require that the SVI3 fast overcurrent
238+
// response warning threshold be 90% of the EDC for the CPU, which
239+
// results in the VRMs asserting output overcurrent warnings on
240+
// their SVI3 rail during normal operation of the CPU. AMD
241+
// recommends that these warnings be masked out and/or ignored, so
242+
// we disable them here.
243+
//
244+
// We don't expect to see overcurrent warnings on non-SVI3 rails, as
245+
// the default value of the PMBus `IOUT_OC_WARN_LIMIT` register
246+
// on the RAA229620A is 3000A, which is *well above* the overcurrent
247+
// *fault* threshold.
248+
mask.set_output_overcurrent_warning(
249+
STATUS_IOUT::OutputOvercurrentWarning::Warning,
250+
);
251+
mask
252+
};
253+
let mut all_ok = true;
254+
all_ok &=
255+
retry_i2c_txn(Rail::VddcrCpu0, PmbusCmd::SetStatusIoutMask, || {
256+
self.vddcr_cpu0.set_status_iout_smbalert_mask(iout_mask)
257+
})
258+
.is_ok();
259+
all_ok &=
260+
retry_i2c_txn(Rail::VddcrCpu1, PmbusCmd::SetStatusIoutMask, || {
261+
self.vddcr_cpu1.set_status_iout_smbalert_mask(iout_mask)
262+
})
263+
.is_ok();
264+
ringbuf_entry!(Trace::StatusIoutMaskSet { all_ok });
212265

213266
// Clear our faults
214267
self.try_to_clear_faults(Vrms {
@@ -220,8 +273,6 @@ impl VCore {
220273
// our guys.
221274

222275
ringbuf_entry!(Trace::Initialized);
223-
224-
Ok(())
225276
}
226277

227278
pub fn can_we_unmask_any_vrm_irqs_again(&mut self) -> Vrms {

drv/i2c-devices/src/raa229620a.rs

Lines changed: 30 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -143,6 +143,36 @@ impl Raa229620A {
143143
pmbus_rail_write!(self.device, self.rail, VIN_UV_WARN_LIMIT, vin)
144144
}
145145

146+
/// Set the `SMBALERT_MASK` for the `STATUS_IOUT` register.
147+
///
148+
/// Any bits set in `mask` will be masked, suppressing SMBus alerts when
149+
/// those bits in `STATUS_IOUT` become set.
150+
pub fn set_status_iout_smbalert_mask(
151+
&self,
152+
mask: STATUS_IOUT::CommandData,
153+
) -> Result<(), Error> {
154+
// Unfortunately, SMBALERT_MASK is kinda hard to use with the
155+
// `pmbus_write!` macro family, due to not having its own `CommandData`
156+
// type, and because it requires writing both the name of the status
157+
// register being masked *and* the value of that register (as the mask).
158+
// It's a bit odd. Probably it deserves its own
159+
// `pmbus_smbalert_mask_write!` macro or something, but for now, we'll
160+
// just do it manually.
161+
self.device
162+
.write_write(
163+
&[PAGE::CommandData::code(), self.rail],
164+
&[
165+
CommandCode::SMBALERT_MASK as u8,
166+
CommandCode::STATUS_IOUT as u8,
167+
mask.0,
168+
],
169+
)
170+
.map_err(|code| Error::BadWrite {
171+
cmd: CommandCode::SMBALERT_MASK as u8,
172+
code,
173+
})
174+
}
175+
146176
pub fn read_vin(&self) -> Result<Volts, Error> {
147177
let vin = pmbus_rail_read!(self.device, self.rail, READ_VIN)?;
148178
Ok(Volts(vin.get()?.0))

0 commit comments

Comments
 (0)