docs: add an operator guide for boot interfaces and DPU modes #2797

Merged
chet merged 1 commit into NVIDIA:main from chet:operator-guide-boot-interface on Jun 23, 2026

Conversation

@chet

@chet chet commented Jun 23, 2026

Contributor

What

The operator's guide for boot-interface selection and DPU modes — the documentation capstone of the boot-interface standardization epic (#2660). New page docs/provisioning/boot-interfaces-and-dpu-modes.md, a deep companion to ingesting-hosts.md (which stays the basic ingest-flow page), wired into index.yml under Provisioning (Day 0).

It's written operator-first — leads with "what to set and what the defaults do," then scenario recipes, then the admin-cli / gRPC / web-UI surfaces, then the behind-the-scenes flow for debugging:

  1. The two independent axes (dpu_mode vs. the boot/primary interface) + segment types (Admin overlay / HostInband / Underlay / Tenant).
  2. Configuring via Expected Machines — dpu_mode, host_nics (primary, network_segment_type), and the default behavior when nothing is set.
  3. Scenario recipes — zero-DPU, NIC mode, integrated-NIC-with-managed-DPUs, flipping a DPU to NIC mode.
  4. admin-cli + Forge gRPC reference, and the web UI (incl. the note that DPU-mode switching is CLI/Expected-Machines only).
  5. The machine-controller state machine + how a boot device is chosen and set (predictions drive boot before the first lease; the owned interface supersedes after).
  6. The predicted → managed → retained data model and the selection precedence.
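
The precedence described in item 5 (predictions drive boot before the first lease; the owned interface supersedes after) can be sketched roughly as follows. This is a minimal illustration, not the machine-controller's actual logic: the `Interface` type, the `mac` field, and `choose_boot_interface` are hypothetical names, and the real guide may define further tiers (for example, retained interfaces) that are not modeled here. Only the `primary` flag is taken from the description above.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Interface:
    """Hypothetical stand-in for a host NIC record."""
    mac: str
    primary: bool = False  # mirrors the `primary` knob on host_nics


def choose_boot_interface(
    owned: Optional[Interface],
    predicted: Sequence[Interface],
) -> Optional[Interface]:
    """Pick a boot interface under the assumed two-tier precedence.

    1. An owned (managed) interface, once it exists, always wins.
    2. Otherwise fall back to the predicted interface declared primary.
    3. If neither exists, return None (assumption: the caller decides).
    """
    if owned is not None:
        return owned
    for nic in predicted:
        if nic.primary:
            return nic
    return None


# Before the first lease: only predictions exist.
predictions = [Interface("aa:aa:aa:aa:aa:01"), Interface("aa:aa:aa:aa:aa:02", primary=True)]
before_lease = choose_boot_interface(None, predictions)

# After the first lease: the owned interface supersedes the prediction.
after_lease = choose_boot_interface(Interface("aa:aa:aa:aa:aa:01"), predictions)
```

Keeping the rule in one pure function makes the "owned beats predicted" ordering easy to exercise in isolation, which may help when QA tests these knobs.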

Notes

  • Docs-only, no code. Grounded in the merged epic code (the dpu_mode / network_segment_type / primary config, the predicted_machine_interfaces → machine_interfaces → retained_boot_interfaces flow, the machine-controller boot config, and the admin-cli/gRPC/web-UI surfaces) via a focused research pass.
  • All cross-link targets verified to exist; new page added to the index.yml nav.
  • Per discussion, this is expected to iterate over time — it's a comprehensive first draft (also intended to support QA exercising these knobs).

Test plan

  • Docs-only; no build/tests. Spot-check the rendered page and deep anchors in the docs site.

Part of #2740 (epic #2660).

@copy-pr-bot

copy-pr-bot Bot commented Jun 23, 2026


Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Jun 23, 2026

Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 2731a0fd-62a3-43b2-bf8f-1b40d8dc8de6

📥 Commits

Reviewing files that changed from the base of the PR and between 1ff1cd1 and 7b8d05a.

📒 Files selected for processing (2)
  • docs/index.yml
  • docs/provisioning/boot-interfaces-and-dpu-modes.md
✅ Files skipped from review due to trivial changes (1)
  • docs/index.yml
🚧 Files skipped from review as they are similar to previous changes (1)
  • docs/provisioning/boot-interfaces-and-dpu-modes.md

Summary by CodeRabbit

  • Documentation
    • Added a comprehensive provisioning guide on boot interface selection and DPU management. Includes network segment type definitions, configuration options, supported DPU modes, operational workflows with CLI and RPC command references, Web UI capabilities, internal boot orchestration details, data structures, and detailed troubleshooting guidance keyed by observable symptoms and remediation approaches.

Walkthrough

A new provisioning guide, docs/provisioning/boot-interfaces-and-dpu-modes.md, is added covering NICo's boot interface and DPU mode configuration model, scenario recipes, admin CLI/Web UI reference, boot orchestration internals, the boot-interface data model, and troubleshooting. The guide is registered in docs/index.yml under the Provisioning (Day 0) navigation section.

Changes

Boot Interfaces and DPU Modes Documentation

| Layer / File(s) | Summary |
|---|---|
| Navigation index registration (docs/index.yml) | Adds the new guide as a navigation entry under Provisioning (Day 0). |
| Core concepts: two-axis model and configuration reference (docs/provisioning/boot-interfaces-and-dpu-modes.md) | Introduces the boot interface vs DPU management axes, segment-type mapping, Expected Machines declarative model, dpu_mode values with resolution precedence, host_nics per-NIC fields, and concrete JSON/CLI examples. |
| Scenario recipes and DPU mode flip workflow (docs/provisioning/boot-interfaces-and-dpu-modes.md) | Documents scenario-based recipes for zero-DPU, NIC-mode, and integrated-NIC-boot-while-managed hosts, plus the re-ingest procedure for flipping DPU mode with retained boot-interface behavior. |
| Admin CLI and Web UI reference (docs/provisioning/boot-interfaces-and-dpu-modes.md) | Provides admin-cli to Forge RPC mapping tables for expected-machine CRUD, boot/primary interface operations, and ingestion controls; enumerates Web UI pages and available actions. |
| Boot orchestration internals and data model (docs/provisioning/boot-interfaces-and-dpu-modes.md) | Explains site-explorer prediction and machine-controller state machine lifecycle, boot-interface resolution precedence, boot order application per host type, and the predicted/managed/retained boot-interface data model with promotion and selection mechanics. |
| Verification, troubleshooting, and related pages (docs/provisioning/boot-interfaces-and-dpu-modes.md) | Adds inspection guidance, a symptom/action troubleshooting table, and cross-references to related documentation pages. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related issues

Possibly related PRs

  • NVIDIA/infra-controller#2722: Both changes document the same declared primary/boot-interface selection logic and precedence rules that govern how NICo resolves the host's boot interface across network segments and NIC types.
  • NVIDIA/infra-controller#2778: The main PR documents the decoupled integrated-NIC boot behavior while maintaining DPU management—the exact pattern implemented in #2778.
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately and concisely describes the primary change: adding an operator guide documentation for boot interfaces and DPU modes, which aligns with the changeset. |
| Description check | ✅ Passed | The description is comprehensive and directly addresses the changeset, detailing the new documentation page, its structure, and integration into the documentation index. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 markdownlint-cli2 (0.22.1)
docs/provisioning/boot-interfaces-and-dpu-modes.md

markdownlint-cli2 v0.22.1 (markdownlint v0.40.0)
Error: Unable to use configuration file '/coderabbit-0.markdownlint-cli2.jsonc'; ENOENT: no such file or directory, open '/coderabbit-0.markdownlint-cli2.jsonc'
at throwForConfigurationFile (file:///usr/local/lib/node_modules/markdownlint-cli2/markdownlint-cli2.mjs:48:9)
at readOptionsOrConfig (file:///usr/local/lib/node_modules/markdownlint-cli2/markdownlint-cli2.mjs:169:5)
at async main (file:///usr/local/lib/node_modules/markdownlint-cli2/markdownlint-cli2.mjs:927:21)
at async file:///usr/local/lib/node_modules/markdownlint-cli2/markdownlint-cli2-bin.mjs:14:22 {
[cause]: Error: ENOENT: no such file or directory, open '/coderabbit-0.markdownlint-cli2.jsonc'
at async open (node:internal/fs/promises:640:25)
at async Object.readFile (node:internal/fs/promises:1287:14)
at async readOptionsOrConfig (file:///usr/local/lib/node_modules/markdownlint-cli2/markdownlint-cli2.mjs:141:17)
at async main (file:///usr/local/lib/node_modules/markdownlint-cli2/markdownlint-cli2.mjs:927:21)
at async file:///usr/local/lib/node_modules/markdownlint-cli2/markdownlint-cli2-bin.mjs:14:22 {
errno: -2,
code: 'ENOENT',
syscall: 'open',
path: '/coderabbit-0.markdownlint-cli2.jsonc'
}
}


Comment @coderabbitai help to get the list of available commands.

@chet

chet commented Jun 23, 2026

Contributor Author

@coderabbitai PTAL, thank you!

@chet

chet commented Jun 23, 2026

Contributor Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Jun 23, 2026

Contributor
✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai

coderabbitai Bot commented Jun 23, 2026

Contributor

@chet: Acknowledged — initiating a review of the documentation PR now. I will examine the new docs/provisioning/boot-interfaces-and-dpu-modes.md page and its integration into index.yml and provide detailed feedback shortly.

(ᴖ‿ᴖ)🐇

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@chet

chet commented Jun 23, 2026

Contributor Author

@coderabbitai full review

@coderabbitai

coderabbitai Bot commented Jun 23, 2026

Contributor
✅ Action performed

Full review finished.

@coderabbitai coderabbitai Bot left a comment

Contributor


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/provisioning/boot-interfaces-and-dpu-modes.md`:
- Around line 245-255: The fenced code block containing the state diagram
starting with "Created → DpuDiscoveringState → HostInit → Validation → Ready" is
missing a language identifier after the opening triple backticks. Add
"plaintext" as the language specifier immediately after the opening triple
backticks to improve accessibility and enable proper syntax highlighting for the
diagram.
- Around line 1-351: Update all internal anchor link references in the document
to match GitHub Markdown's auto-generated anchor format by adding a leading dash
before the section number. Find all links like
`#2-configuring-via-expected-machines--and-the-defaults`, `#3-scenarios`,
`#34-boot-an-integrated-nic-while-keeping-the-dpus-managed`,
`#35-flipping-a-dpu-to-nic-mode`, `#5-web-ui`,
`#6-behind-the-scenes-how-a-boot-device-is-chosen-and-set`,
`#7-the-boot-interface-data-model`, and `#73-retained-boot-interfaces` and change
them to `#-2-configuring-via-expected-machines--and-the-defaults`, `#-3-scenarios`,
`#-34-boot-an-integrated-nic-while-keeping-the-dpus-managed`,
`#-35-flipping-a-dpu-to-nic-mode`, `#-5-web-ui`,
`#-6-behind-the-scenes-how-a-boot-device-is-chosen-and-set`,
`#-7-the-boot-interface-data-model`, and `#-73-retained-boot-interfaces`
respectively. These links appear in the introductory callout and throughout the
document for cross-referencing.
- Line 175: The link reference at line 175 contains a broken anchor that does
not match the header it targets. The link `[Section
7.3](`#73-retained-boot-interfaces`)` points to an anchor that does not exist
because the header `### 7.3 Retained (`retained_boot_interfaces`)` at line 299
contains backticks which affect anchor generation in markdown. Either update the
link anchor to match the actual generated anchor from the header (which includes
the `retained_` prefix due to the backticks), or simplify the header by removing
the backticks around `retained_boot_interfaces` so the anchor matches the link
target `#73-retained-boot-interfaces`.
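
To check anchor findings like the ones above before editing links, a rough approximation of GitHub's heading-to-anchor rule can be sketched in a few lines. This is an assumption-laden sketch, not GitHub's actual implementation: `github_anchor` is a hypothetical helper, and it ignores details such as de-duplicating repeated headings. It lowercases the heading, drops every character that is not a word character, space, or hyphen, and turns each remaining space into a hyphen.

```python
import re


def github_anchor(heading: str) -> str:
    """Approximate the anchor GitHub generates for a Markdown heading.

    Assumed rule (not the official implementation): lowercase, remove
    punctuation other than hyphens and underscores, spaces become hyphens.
    """
    text = heading.strip().lower()
    # Backticks, dots, and parentheses are removed; underscores survive via \w.
    text = re.sub(r"[^\w\- ]", "", text)
    return text.replace(" ", "-")


# Heading text quoted in the finding for line 175.
print(github_anchor("7.3 Retained (`retained_boot_interfaces`)"))
# prints: 73-retained-retained_boot_interfaces
```

Under this rule the backticks themselves vanish but the identifier inside them stays, which is why a link to `#73-retained-boot-interfaces` would not resolve against that heading.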
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 2809389d-47d9-4d29-9fb4-28e71be6b0aa

📥 Commits

Reviewing files that changed from the base of the PR and between 89c0b75 and 1ff1cd1.

📒 Files selected for processing (2)
  • docs/index.yml
  • docs/provisioning/boot-interfaces-and-dpu-modes.md

Comment thread docs/provisioning/boot-interfaces-and-dpu-modes.md
Comment thread docs/provisioning/boot-interfaces-and-dpu-modes.md
Comment thread docs/provisioning/boot-interfaces-and-dpu-modes.md Outdated
Operators have historically configured little around DPU modes and network
segments -- the defaults handled the common managed-DPU host. Now that NICo
supports DPUs in NIC mode, integrated and zero-DPU NICs, booting an
integrated NIC while keeping its DPUs managed, and live DPU-mode flips,
operators need to understand both the knobs and what the defaults do when
nothing is set.

This adds docs/provisioning/boot-interfaces-and-dpu-modes.md, the deep
companion to ingesting-hosts.md. It covers:

- The two independent axes -- DPU management (dpu_mode) vs. the boot/primary
  interface -- and the network segment types (Admin overlay, HostInband,
  Underlay, Tenant), with "the host-OS segment follows the boot mode."
- Configuring via Expected Machines: the dpu_mode and host_nics (primary,
  network_segment_type) knobs, and the default behavior when nothing is set.
- Scenario recipes: zero-DPU, DPU in NIC mode, an integrated NIC with the
  DPUs still managed, and flipping a DPU to NIC mode.
- The admin-cli + Forge gRPC surfaces and the web UI capabilities.
- Behind the scenes: the machine-controller state machine and how a boot
  device is chosen and set, plus the predicted -> managed -> retained
  boot-interface data model and the selection precedence.

Wired into docs/index.yml under Provisioning (Day 0).

Part of NVIDIA#2740 (epic NVIDIA#2660).

Signed-off-by: Chet Nichols III <chetn@nvidia.com>
@chet chet force-pushed the operator-guide-boot-interface branch from 1ff1cd1 to 7b8d05a on June 23, 2026 18:40
@chet

chet commented Jun 23, 2026

Contributor Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Jun 23, 2026

Contributor
✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@chet chet marked this pull request as ready for review June 23, 2026 18:55
@chet chet requested a review from a team as a code owner June 23, 2026 18:55
@github-actions


🔍 Container Scan Summary

No Grype artifacts were found to aggregate.

@chet chet merged commit 40c7ca4 into NVIDIA:main Jun 23, 2026
57 checks passed
2 participants