Skip to content

Add fmd inventory to the database#10345

Open
smklein wants to merge 35 commits into
mainfrom
add-fmd-inventory-db
Open

Add fmd inventory to the database#10345
smklein wants to merge 35 commits into
mainfrom
add-fmd-inventory-db

Conversation

@smklein
Copy link
Copy Markdown
Collaborator

@smklein smklein commented Apr 29, 2026

Adds local-to-sled fault management info, from FMD, to the database.

Builds on #10283 , which propagated this info out of the Sled Agent API.

In the database, this propagates:

  • inv_fmd_status: The per-sled identification of whether or not fmd data was collected successfully
  • inv_fmd_host_case: information about individual cases on a sled
  • inv_fmd_resource: information about individual resources that exist on a sled

smklein added 19 commits April 17, 2026 14:13
Exposes illumos Fault Management Daemon (FMD) data through the
sled-agent inventory endpoint. This lets the control plane see
diagnosed hardware/software faults on each sled.

New API version 35 adds an `fmd: Option<FmdInventory>` field to
`Inventory`. When present, it contains:
- Cases: diagnosed faults with UUID, diagnostic code, URL, and the
  full event nvlist serialized as JSON
- Resources: affected components with FMRI, fault status flags

On illumos, sled-agent queries FMD on each inventory request. On
non-illumos (sim, tests), the field is None. Database storage is
not included — that's a follow-up.
The inventory endpoint is async, but FMD queries go through door calls
to fmd(1M) that can stall the calling thread. Move the work onto
spawn_blocking so it doesn't occupy a Tokio worker; surface any
JoinError as FmdInventory::Error.
The FmdCase.url docstring now uses backticks instead of quotes around the
example URL, which changes the schema description and thus the spec hash.
The verify-libraries xtask checks that binaries don't link against
unexpected libraries. Add libfmd_adm.so.1 to the allowlist for the
binaries that legitimately need it (sled-agent, sled-agent-sim, and
omicron-dev which spins up sled-agent for tests).
oxidecomputer/fmd-adm#2 replaced the bool argument on
`FmdAdm::resources()` with an `InvisibleResources` enum to make
callsites self-documenting (per hawk's review feedback). Bump to the
post-merge rev and update the call in sled-agent.
The Err arm was dead code: omicron compiles with panic="abort", so a
panic inside the blocking task aborts the process before the JoinHandle
can return Err. Switch to .expect with a descriptive message — if the
invariant ever changes, the panic will say what happened.

Addresses the take-it-or-leave-it nit at #10283
(comment r3112399887).
The illumos module is cfg-gated, so cargo check on Linux skips it
entirely — the missing trait import wasn't visible locally. CI on helios
caught it: three E0599 errors for from_untyped_uuid on FmdHostCaseUuid
and FmdResourceUuid. Verified fix on atrium.
Adds three tables for persisting per-sled FMD data:
  inv_fmd_status      — per-sled outcome of FMD collection
  inv_fmd_host_case   — diagnosed cases (event payload as JSONB)
  inv_fmd_resource    — resources affected by cases

Bumps SCHEMA_VERSION to 254 with directory schema/crdb/inv-fmd. Adds
diesel table! entries, db-model structs, and From impls for the read
path. No callers yet — write/read/display follow in subsequent commits.
Wires the fmd field added to sled-agent's Inventory by the parent PR
through into Nexus's in-memory inventory representation. The collector
builder copies inventory.fmd verbatim. The DB read path will populate it
from the inv_fmd_* tables in a follow-on commit; for now, the read path
substitutes Available(empty) so existing tests round-trip cleanly.
Insert one InvFmdStatus row per sled in each inventory collection,
plus a row per case and resource when collection succeeded. Wire the
three tables into the existing prune transaction so old collections
clean up after themselves.
Loads inv_fmd_status, inv_fmd_host_case, and inv_fmd_resource for the
collection and reconstructs SledAgent.fmd. Status row's NULL
error_message indicates Available; non-NULL becomes Error{error}. A
missing status row falls back to Available with whatever cases/resources
were found (defensive, in case of historical data predating this PR).
Adds Display wrappers for FmdInventoryResult/FmdInventory/FmdHostCase/
FmdResource on the sled-agent types. Wires them into
nexus/types/src/inventory/display.rs::display_sleds so that
`omdb db inventory collections show` (and reconfigurator-cli scripts
that print sled inventories) include the FMD section.

The FmdHostCase event payload is the FMD nvlist serialized to JSON; we
intentionally don't interpret the schema, so it's pretty-printed
verbatim under the case heading.

Also seeds the representative test inventory (nexus/inventory examples)
with a single fault case + resource so the inv_fmd_* tables get rows
under test_representative_collection_populates_database. The
reconfigurator-cli golden outputs grow a 'fmd:' section accordingly.
@smklein smklein changed the title Add fmd inventory db Add fmd inventory to the database Apr 29, 2026
@smklein smklein marked this pull request as ready for review April 30, 2026 00:02
@smklein smklein requested review from hawkw and mergeconflict April 30, 2026 00:02
Comment on lines +4202 to +4203
// Load all FMD inventory rows. We expect at most ~tens of cases or
// resources per sled, so we don't bother paginating.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we get into some sort of trouble if this assumption is wrong (due to a bug, for example)?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the silver lining is that inventories are periodically trimmed, so this wouldn't last forever.

Dealing with this in a "pathological case" is complicated.
Suppose we have a bug which emits hundreds of thousands of faults on a host.

If we use an artificial cap of "N", it's possible that "the first N faults are spew, the N+1'th one is actually useful". But it's also possible that "N" doesn't fit in an inventory collection? So the caller may want to know "I saw N ereports, of M total, (for N < M), and I need some way of accessing the rest of them". But that is more complicated too, because then we're sorta eyeing inventory and sorta side-stepping it?

It probably does make sense to handle this in the limit, but are you okay with side-stepping this in the short-term, so we can get the flow working end-to-end? I think we'll probably "really want to see all faults" for the short-term future, but I can file an issue

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely, just asking questions. If there isn't an easy solution to this one, it seems fair to punt.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... I had a vision of a broken system with a million faults, and I re-visited this.

The majority of the change exists in the parent PR, in e43ae1f , but I also merged that into this PR as d9d3da3

TL;DR: I'm setting an upper bound of "1000" for cases/resources, and if we cross that threshold, we'll return an explicit error. Since this upper bound is considered pathological, it should be fine to do this, and if that bound is too small, it should be fine to make it larger.

Comment thread schema/crdb/dbinit.sql
);

CREATE TABLE IF NOT EXISTS omicron.public.inv_fmd_status (
inv_collection_id UUID NOT NULL,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit (here and below): need a comment like:

-- (foreign key into `inv_collection` table)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 040b815

smklein added 6 commits May 14, 2026 12:38
If a build.rs calls configure_default_omicron_rpaths() but contributes no
RPATH entries on illumos, the caller almost certainly forgot a direct
`*-sys` dep in Cargo.toml. Catching this at build time avoids silently
producing a binary with NEEDED libfmd_adm.so.1 (or libpq.so.5) but no
RPATH, which would only surface as an ld.so.1 failure at process startup.

The check is illumos-only because fmd-adm-sys is illumos-gated; on Linux
a caller can legitimately contribute nothing.
hawkw asked why we were reinventing Result. The answer for the wire format
is openapi-lint requires snake_case property names, and Result's default
serde representation produces 'Ok'/'Err'. Use omicron_common::snake_case_result
(already used for other Result fields in the inventory) to keep the Rust
API as a normal Result while presenting snake_case on the wire.
Match the convention used by other inv_* tables in dbinit.sql.
Addresses review feedback on #10345.
Captures the full `source()` chain on the three error paths in
`fmd::illumos::collect` (FMD open / list cases / list resources). The
underlying `fmd_adm::Error` variants `Nul(NulError)` and
`Uuid(uuid::Error)` carry inner errors via `#[from]` that the bare
`Display` impl can drop, so `InlineErrorChain` actually surfaces the
extra context. Matches the existing pattern used elsewhere in
`omicron-sled-agent` (e.g. `instance_manager.rs:989`).
smklein added 3 commits May 14, 2026 15:55
Replaces `Result<FmdInventory, String>` with a typed `FmdInventoryError`
(addresses hawkw's r3173718186 on PR #10283) and introduces cap constants
`FMD_MAX_CASES` / `FMD_MAX_RESOURCES` (1000 each).

The error type is flat — a `kind: FmdInventoryErrorKind` discriminator plus
a free-form `message`. Three variants:

- `FmdError`: catch-all for FMD daemon failures (open / list cases /
  list resources, or platform doesn't support FMD).
- `TooManyCases` / `TooManyResources`: bound exceeded; producer refuses
  to report a partial set because a count this high is itself a signal
  operators should investigate directly.

Counts/limits are included in the `message` string, so downstream display
shows them without needing structured fields.
Merges the typed FmdInventoryError from add-fmd-inventory and updates
the database side to match:

- Schema: add CRDB ENUM `fmd_inventory_error_kind`. `inv_fmd_status`
  gains an `error_kind` column (ENUM type) alongside the existing
  `error_message` (now documented as informational only). A CHECK
  constraint ensures the two columns agree on NULL.

- Diesel binding: `FmdInventoryErrorKindEnum` registered in db-schema;
  `FmdInventoryErrorKind` Diesel enum in db-model with From conversions
  to/from the API type in sled-agent-types.

- `InvFmdStatus::new` now takes `&Result<FmdInventory, FmdInventoryError>`
  and projects to (kind, message) for storage.

- Write path stores both columns; read path reconstructs the API
  `FmdInventoryError` from them.

- Display wrapper (`FmdInventoryResultDisplay`) updated to wrap the typed
  result. `Display` formats the error via its `Display` impl, which is
  the `message` field (set by the producer to include counts when
  relevant).

- `nexus_types::SledAgent.fmd` field type updated to match.
The schema test `test_migration_verification_files` enforces at most one
DDL statement per up*.sql file for schema versions after 220. The previous
up01.sql had two (CREATE TYPE + CREATE TABLE). Split into:

- up01.sql: CREATE TYPE fmd_inventory_error_kind
- up02.sql: CREATE TABLE inv_fmd_status
- up03.sql: CREATE TABLE inv_fmd_host_case  (was up02)
- up04.sql: CREATE TABLE inv_fmd_resource   (was up03)
Base automatically changed from add-fmd-inventory to main May 15, 2026 18:58
Comment thread nexus/db-model/src/inventory.rs
Comment thread nexus/db-model/src/inventory.rs
Comment on lines +4189 to +4191
// Load all FMD inventory rows. The producer's per-sled bounds
// (`FMD_MAX_CASES` / `FMD_MAX_RESOURCES`) keep this size predictable,
// so we don't paginate.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we know that cases will disappear from the hosts inventory once they are resolved/fixed/whatever? i am concerned that we may be accidentally be enforcing an upper bound on the total number of cases a system can ever report...

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good question - but one I'd like to punt until after this PR merges. Frankly, a lot of this plumbing is speculative; I want to see what these faults look like from real systems, because I'm not sure I grok the lifecycle fully.

In short:

  • https://docs.rs/fmd-adm/0.3.0/fmd_adm/struct.FmdAdm.html (really, just wrapping the C API) exposes a few ways to cope with resources gone awry - it provides APIs to "acquit" resources/cases. But I also think there may be some degree of that happening automatically?
  • So: if there are "host faults" that need to be resolved, I think you're right -- this backlog would fill up in inventory, until we reach this limit.

At that point: inventory would start throwing errors, saying "hey you have thousands of cases/resources that are faulting. Fix me".

HOWEVER!

The inventory collection would not fail - it would just fail to report this resource/case info specifically.

IMO this is a better state than the alternative: having no bound, and potentially causing inventory to fail because it grows unbounded. At least, with this back-stop (set as "whatever we consider pathological") we don't disrupt other subsystems.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HOWEVER!

The inventory collection would not fail - it would just fail to report this resource/case info specifically.

IMO this is a better state than the alternative: having no bound, and potentially causing inventory to fail because it grows unbounded. At least, with this back-stop (set as "whatever we consider pathological") we don't disrupt other subsystems.

I think I agree. I would want to do some additional research into whether we must be proactively telling the host system to get rid of cases which have been resolved, or which we know to have been resolved even if the host fmd cannot determine this. But I think it's fine to do that subsequently.

Comment on lines +4210 to +4213
_ => unreachable!(
"inv_fmd_status CHECK constraint enforces \
error_kind and error_message agree on NULL"
),
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i kinda feel like we should bail out with an error here rather than panicking? i don't love code where a check constraint violation makes nexus panic, amplifying the problem...

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in fa93ec8

.or_default()
.insert_unique(row.into())
.map_err(|err| {
Error::internal_error(&format!(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

&format! makes me sad, personally this is a case where i would just ignore the constructor and manually initialize internal_message, but you do you...

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in fa93ec8

Comment thread schema/crdb/dbinit.sql Outdated
url TEXT NOT NULL,
-- The full FMD fault event payload as JSON, if present. Stored as
-- JSONB without parsing — Nexus does not interpret the FMD event
-- schema; it round-trips verbatim for downstream tooling (e.g. omdb).
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

technically it doesn't actually; JSONB does not preserve original whitespace 🤓

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated in fa93ec8

}
}

/// a displayer for the FMD inventory result on a sled
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would love some examples (expectorate?) of this output!

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a test in fa93ec8

Copy link
Copy Markdown
Member

@hawkw hawkw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, this basically seems good to me, I left a few last little notes, but none of them are terribly important.

Comment on lines +4248 to +4252
internal_message: format!(
"unexpected duplicate FMD case: {}",
InlineErrorChain::new(&err)
),
})?;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm, this looks like we will just completely bail out here, which feels a bit odd; would it be better to just log it and ignore that particular case?

on the other hand, this probably does not actually matter as I think this is a unique key in the database, right? so if we see this it should then be indicative of a uniqueness violation in the DB and "should never happen" and maybe throwing away the whole inventory is a correct response?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the loading side, this does indicate corrupt database data, so I think I'm okay "failing hard and fast" here?

Comment on lines +7 to +10
-- limit; no partial data is recorded.
'too_many_cases',
-- Number of FMD resources reported by the sled exceeded the limit.
'too_many_resources'
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's also true that "no partial data is recorded" for too many resources too, right? the fact that we document that for one but not the other feels a bit odd...

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding comments for consistency in 1e81977

Comment on lines +982 to +990
// The event payload is the FMD nvlist serialized to JSON. We
// intentionally do not interpret it; round-trip pretty-printing
// is enough to make it human-readable.
if let Some(event) = event {
match serde_json::to_string_pretty(event) {
Ok(rendered) => {
writeln!(f, " event:")?;
let mut indent = IndentWriter::new(" ", &mut *f);
writeln!(indent, "{rendered}")?;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, and it's entirely up to you whether you want to use this, but I have written a formatter for serde_json::Values which formats them in a way that makes them look Sort of Similarish to nvlists: https://github.com/oxidecomputer/erebor/blob/eff27fb3a652e993a62c8babe327108538fcde53/tests/output/bad_ashift_default_indent.txt

if you wanted to be cute, you could use that here, but it's up to you.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am down to use this but can I punt to a different PR? Maybe we should have a more broad look at converting to_string_pretty to using erebor, so we can be consistent through omicron?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, no worries! i am going to be rolling this out across a few places in the near future and i can come mess with this when i do that!

writeln!(f, " fmri: {fmri}")?;
writeln!(
f,
" faulty: {faulty}, unusable: {unusable}, invisible: {invisible}"
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unimportant turbo-nit: should we make these all print the bool padded to five characters (the length of the string false) so that the "faulty:", "unusable:", and "invisible:" always line up when printing multiple resources?

if you don't care about this, please feel free to tell me to fuck off :)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, done in 1e81977

use uuid::Uuid;

#[test]
fn fmd_inventory_result_display_snapshot() {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would maybe be nice to see a version of this which includes at least two cases and resources, but not a big deal really...

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 1e81977

- schema/crdb/inv-fmd/up01.sql + dbinit.sql: document "no partial
  data is recorded" for too_many_resources too (matching too_many_cases).
- sled-agent/types/versions/src/impls/inventory.rs: pad bools in
  FmdResourceDisplay to width 5 so faulty/unusable/invisible columns
  line up across multiple resources.
- Snapshot test: cover a second case (event: None) and a second resource
  with the bool flags flipped, so the aligned bool output is visible in
  the expectorate file.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants