Skip to content

[fm] more analysis plumbing#10258

Merged
hawkw merged 3 commits intomainfrom
eliza/de
Apr 14, 2026
Merged

[fm] more analysis plumbing#10258
hawkw merged 3 commits intomainfrom
eliza/de

Conversation

@hawkw
Copy link
Copy Markdown
Member

@hawkw hawkw commented Apr 10, 2026

Depends on #10255.

This commit does a bit more of the necessary plumbing to allow diagnosis engines defined in the nexus-fm crate to run in the fm_analysis background task. Now, there's a synchronous nexus_fm::diagnosis::analyze function which, given a set of analysis inputs and a sitrep builder, does...something. In the future, this function will run the diagnosis engines for various components, classes of faults, subsystems, and so on, passing them all the new ereports from the inputs and the sitrep builder so that they can make changes to it. Right now, it just fails with the error "FM analysis is not implemented yet", which, technically, means that this PR makes no functional change from the previous state of this code. However, if that function were not to do that, we would then go on and attempt to actually commit the new sitrep. So that's cool. Furthermore, we now also take the analysis report from #10255 and jam it into the bg task output.

@hawkw hawkw added the fault-management Everything related to the fault-management initiative (RFD480 and others) label Apr 10, 2026
Base automatically changed from eliza/analysis-report to main April 10, 2026 19:25
@hawkw hawkw marked this pull request as ready for review April 13, 2026 21:58
@hawkw hawkw requested review from mergeconflict and smklein April 13, 2026 21:59
Comment thread dev-tools/omdb/src/bin/omdb/nexus.rs Outdated
report,
outcome: status::AnalysisOutcome::Unchanged,
};
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The compiler doesn't yell at you about code after this line being unreachable?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I wondered the same - feels a little sketchy to have "impossible-to-reach" code down here that's definitely untested, but also this is probably fine for now as long as the build isn't broken?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah...I couldn't really think of a better way to do this incrementally while the diffing code and the sitrep-generating code both currently don't exist. I suppose we could just not add this part at all, but I wanted to make sure to remember to put it there when the necessary stuff is added. Once we actually have enough code to start trying to produce sitreps, any tests for that will start failing due to this until the diffing code is added, which felt like the right way to make sure we remembered to do that part...

What do you guys think?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

eh. I vote move forward for now. don't want to get hung around the wheel too much, but agree that we should not leave it in this weird intermediate state for too long.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also like to avoid that! :)

Comment thread nexus/src/app/background/tasks/fm_analysis.rs
Comment thread dev-tools/omdb/tests/successes.out Outdated
report,
outcome: status::AnalysisOutcome::Unchanged,
};
}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I wondered the same - feels a little sketchy to have "impossible-to-reach" code down here that's definitely untested, but also this is probably fine for now as long as the build isn't broken?

@hawkw hawkw merged commit d7aa33c into main Apr 14, 2026
16 checks passed
@hawkw hawkw deleted the eliza/de branch April 14, 2026 20:13
hawkw added a commit that referenced this pull request Apr 17, 2026
Depends on #10258.

As described in #10260 (and [§8.2 RFD 603]), **only _you_ can prevent
inventory collection time travel**: because each Nexus that might
participate in FM analysis has its own copy of the inventory which it
considers to be current (loaded by its `inventory_load` background
task). Since we must allow for the possibility of a particular Nexus
being arbitrarily behind, we can easily imagine a scenario where a Nexus
attempts to perform FM analysis and generate a sitrep while operating on
a copy of the inventory that is actually *older* than the inventory that
was used to produce the current sitrep. If this happens, we might
incorrectly think that the state of the system has _changed_ from the
state in the current sitrep, while actually, we are just operating on an
outdated understanding. This branch fixes that.

In #10260, I proposed addressing this by making the CTE that inserts a
sitrep into the sitrep history only succeed if the inventory collection
UUID in that sitrep's metadata is newer than the one in the current
sitrep's metadata. This would probably have worked, but putting that
check in the already somewhat hairy CTE was not actually necessary.
Instead, we can just perform this check *before* actually doing any
analysis, which feels a lot nicer. In order to do this, it was necessary
to add a query to just go and fetch only the timestamps from an
inventory by its UUID, which was not a huge deal. I think it is
plausible that the "is strictly newer than" check for inventory
collections may be useful elsewhere as well.

Fixes #10260 

[§8.2 RFD 603]: https://rfd.shared.oxide.computer/rfd/0603#_inventory
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

fault-management Everything related to the fault-management initiative (RFD480 and others)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants