[bug/5614743] fix: scout repeatedly fails machine discovery with 'AttestKeyInfo is not populated' error by prbinu-nvidia · Pull Request #2670 · NVIDIA/infra-controller

prbinu-nvidia · 2026-06-16T22:41:30Z

Description

Theforge-scout repeatedly fails machine discovery against carbide-api with the hosts remain stuck in bootingwithdiscoveryimage state, and with the error:.

AttestKeyInfo is not populated

Manual workaround: Run tpm2_clear (or equivalent TPM clear) and reboot the host. Discovery then succeeds.

Root cause: In register.rs, scout decided whether it was running on a DPU vs. a managed host using TPM EK certificate presence:

let is_dpu = hardware_info.tpm_ek_certificate.is_none();

On affected DGX hosts, hardware enumeration can leave tpm_ek_certificate unset even though the machine is a normal x86 host (not a BlueField DPU). Scout then treated the host as a DPU, skipped attestation key setup (create_attest_key_info), and sent registration data without AttestKeyInfo. carbide-api correctly rejected the request.

Impact: Host discovery cannot complete without operator intervention (TPM clear + reboot). Affected platforms include DGX H100/GB200 class systems where EK cert enumeration is missing or incomplete.

Fixes Applied

correct host vs. DPU detection (register.rs): Replaced TPM-based DPU inference with SMBIOS platform detection. Hosts without an EK certificate are no longer misclassified as DPUs. Attestation key creation runs on managed hosts as intended, so AttestKeyInfo is populated before registration.
Shared platform helper (platform.rs, new)
Added platform::is_host() that reads SMBIOS system information and returns false when the product name contains "bluefield" (DPU), true otherwise.
automated TPM self-heal (tpm.rs, register.rs) For genuine TPM corruption, added optional one-shot recovery when local TPM setup fails. Recovery (tpm2_clear and reboot) is invoked from register.rs when create_context_from_path or create_attest_key_info fails with a recoverable TPM error.

Files changed

File	Change
`crates/scout/src/register.rs`	Fix `is_dpu` logic; wire TPM recovery on setup failure
`crates/scout/src/platform.rs`	New shared SMBIOS-based `is_host()`
`crates/scout/src/tpm.rs`	Add recovery helpers and tests
`crates/scout/src/deprovision/scrabbing.rs`	Use shared `platform::is_host()`
`crates/scout/src/main.rs`	Add `mod platform`

Expected outcome

DGX H100/GB200 hosts with missing EK cert material proceed through attestation and discovery without manual TPM clear.
If TPM state is genuinely corrupted and local setup fails, scout attempts one automated TPM2_Clear + reboot before giving up.

Type of Change

Add - New feature or capability
Change - Changes in existing functionality
Fix - Bug fixes
Remove - Removed features or deprecated functionality
Internal - Internal changes (refactoring, tests, docs, etc.)

Testing

Unit tests added/updated
Integration tests added/updated
Manual testing performed
No testing required (docs, internal refactor, etc.)

copy-pr-bot · 2026-06-16T22:41:34Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

coderabbitai · 2026-06-16T22:41:38Z

Summary by CodeRabbit

New Features
- Added an automated TPM recovery flow for certain attestation-key failures, including a one-time-per-boot-cycle reboot marker.
- Improved host vs. accelerator (DPU) detection using centralized platform logic.
Bug Fixes
- Refined TPM error classification and recovery triggering to better match “recoverable” vs “not supported” scenarios during registration.
Refactor
- Updated deprovisioning and registration flows to use the shared platform detection for consistent cleanup/skip behavior.

Walkthrough

A new platform module centralises host/DPU detection via SMBIOS product-name inspection. The deprovision scrubbing and registration flows migrate to platform::is_host(). Registration gains structured TPM error recovery for attestation key provisioning: recoverable errors trigger a one-shot TPM clear and system reboot via systemctl reboot, guarded by a /run/scout/tpm_recovery_reboot_attempted marker file to prevent boot-cycle looping.

Changes

Platform Detection Centralisation and TPM Auto-Recovery

Layer / File(s)	Summary
New `platform` module: SMBIOS-based `is_host()` `crates/scout/src/main.rs`, `crates/scout/src/platform.rs`	`platform.rs` is registered as a crate submodule and implements `pub(crate) fn is_host_from_product_names()` to detect host vs. DPU via case-insensitive "bluefield" substring matching in product names, and `pub(crate) fn is_host()` to load SMBIOS system information and delegate to the helper with true-on-load-failure fallback. Expanded unit test coverage validates multiple product-name cases.
Migrate deprovision scrubbing to `platform::is_host()` `crates/scout/src/deprovision/scrabbing.rs`	Imports updated to include `platform`; in-file SMBIOS-based `is_host()` removed. Both `run()` and `run_no_api()` now gate cleanup behavior using `platform::is_host()`, preserving the same non-host early-return semantics.
Migrate `register.rs` DPU detection to `platform::is_host()` `crates/scout/src/register.rs`	`platform` import added; `is_dpu` is now derived from `!platform::is_host()`, replacing the previous TPM EK certificate absence heuristic.
TPM recovery mechanism with marker-based idempotency `crates/scout/src/tpm.rs`	Adds `TPM_RECOVERY_ATTEMPTED_PATH` constant (`/run/scout/tpm_recovery_reboot_attempted`), `should_attempt_tpm_recovery_for_attest_key_failure()` (classifies errors via lowercased message inspection; denies recovery if "not supported" substring present), and `claim_tpm_recovery_attempt()` to atomically create the boot-cycle marker. `recover_tpm_and_reboot()` now uses the marker-claim helper, invokes `clear_tpm`, and triggers `systemctl reboot` with error handling for non-success exits. Unit tests validate message-based classification for recoverable vs. non-recoverable cases.
Integrate TPM recovery into registration attestation key provisioning `crates/scout/src/register.rs`	Error handling for `attest::create_attest_key_info` refactored from `map_err`/`?` chains into `match` blocks. On failure, `should_attempt_tpm_recovery_for_attest_key_failure()` classifies the error; if recovery is warranted, invoke `recover_tpm_and_reboot()` before returning `CarbideClientError::TpmError`. Successful results populate `hardware_info.attest_key_info` and cached handles as before.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title directly addresses the core bug fix: scout's repeated failures in machine discovery due to missing AttestKeyInfo, with a clear reference to the issue identifier.
Description check	✅ Passed	The description is comprehensive and clearly related to the changeset, explaining root cause, fixes applied, files changed, and expected outcomes for the bug fix.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands.}

coderabbitai

Actionable comments posted: 4

🧹 Nitpick comments (2)

crates/scout/src/platform.rs (1)
34-41: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Strengthen tests to validate classifier behavior (table-driven), not just non-panicking call.

The current test at Line 39 only checks invocation. Please add table-driven cases for at least BlueField/non-BlueField/empty product-name paths (preferably by extracting a pure helper for product-name classification).

As per coding guidelines, "Prefer table-driven tests for any function that maps inputs to outputs, errors, or other observable results" and "Use carbide_test_support::value_scenarios! macro for total operations."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/scout/src/platform.rs` around lines 34 - 41, The test
`is_host_returns_bool_without_panicking` only verifies the function doesn't
panic rather than validating actual behavior. Replace this with a table-driven
test that tests multiple product name scenarios including BlueField,
non-BlueField, and empty product name cases, and assert the correct boolean
return value for each case instead of just invoking the function. Consider
extracting a pure helper function for product-name classification logic to make
it testable in isolation, and use the table-driven test pattern or
carbide_test_support::value_scenarios! macro as recommended in the coding
guidelines.
Source: Coding guidelines
crates/scout/src/tpm.rs (1)
137-147: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Expand recoverability tests into a table and cover all classifier branches.

Current tests validate only one recoverable phrase and one non-TPM error. Add explicit cases for "Could not create context" and "TPM2_Clear" to lock classifier behavior and reduce regression risk.

As per coding guidelines, "Prefer table-driven tests for any function that maps inputs to outputs" and "Reach for a table whenever two or more tests call the same operation with different inputs."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/scout/src/tpm.rs` around lines 137 - 147, The current tests for the
is_recoverable_tpm_client_error function use individual test functions but
should be refactored into a table-driven test to provide comprehensive coverage
of all classifier branches. Convert the
recoverable_tpm_errors_include_attest_key_info_failures and
non_tpm_client_errors_are_not_recoverable test functions into a single
table-driven test that includes test cases for all the error phrases the
classifier checks: "Could not create AttestKeyInfo", "Could not create context",
and "TPM2_Clear" as recoverable TPM errors, plus cases for non-recoverable
errors like GenericError. This will ensure all branches of the
is_recoverable_tpm_client_error classifier are tested and locked in against
regressions.
Source: Coding guidelines

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/scout/src/platform.rs`:
- Around line 23-29: The is_host() predicate at lines 23-29 uses any() with a
negated contains("bluefield") check, which incorrectly returns true if any
single system record is not BlueField. The intended behavior is to return true
only if no system records indicate BlueField. Change the logic by removing the
negation (!) from inside the closure where
product_name().to_string().to_lowercase().contains("bluefield") is checked, and
instead negate the entire any() expression, so that the method returns true when
none of the system records contain "bluefield".

In `@crates/scout/src/register.rs`:
- Around line 60-69: The recoverability check in the error handling for
create_context_from_path and the similar code for AttestKeyInfo (around lines
74-84) are checking the wrapped CarbideClientError::TpmError message instead of
the underlying TPM error. Refactor these blocks to check recoverability against
the underlying error `e` directly before wrapping it in
CarbideClientError::TpmError, rather than checking the wrapped error. This
ensures that recovery decisions are based on the actual TPM error signal, not on
the user-facing error message prefix that gets matched by the recoverability
helper.

In `@crates/scout/src/tpm.rs`:
- Around line 90-96: The is_recoverable_tpm_client_error function uses overly
broad string prefix matching, particularly the "Could not create context" check,
which can incorrectly classify non-recoverable TPM errors as recoverable and
trigger unnecessary destructive recovery behavior. Refactor the error
classification in is_recoverable_tpm_client_error to use structured error causes
or a dedicated enum from call sites instead of relying on broad string prefixes,
ensuring that non-recoverable causes like invalid TPM path or runtime setup
issues are properly excluded from being marked as recoverable.
- Around line 103-116: The TPM recovery marker file check is vulnerable to
TOCTOU attacks because it checks existence separately and then creates the file
non-atomically in a world-writable directory. Remove the separate exists() check
on TPM_RECOVERY_ATTEMPTED_PATH, change the TPM_RECOVERY_ATTEMPTED_PATH constant
to point to a root-owned runtime directory (such as /run/...) instead of /tmp,
and replace the File::create() call with File::create_new() to atomically create
the file and fail if it already exists, which eliminates the race condition and
symlink abuse vulnerability.

---

Nitpick comments:
In `@crates/scout/src/platform.rs`:
- Around line 34-41: The test `is_host_returns_bool_without_panicking` only
verifies the function doesn't panic rather than validating actual behavior.
Replace this with a table-driven test that tests multiple product name scenarios
including BlueField, non-BlueField, and empty product name cases, and assert the
correct boolean return value for each case instead of just invoking the
function. Consider extracting a pure helper function for product-name
classification logic to make it testable in isolation, and use the table-driven
test pattern or carbide_test_support::value_scenarios! macro as recommended in
the coding guidelines.

In `@crates/scout/src/tpm.rs`:
- Around line 137-147: The current tests for the is_recoverable_tpm_client_error
function use individual test functions but should be refactored into a
table-driven test to provide comprehensive coverage of all classifier branches.
Convert the recoverable_tpm_errors_include_attest_key_info_failures and
non_tpm_client_errors_are_not_recoverable test functions into a single
table-driven test that includes test cases for all the error phrases the
classifier checks: "Could not create AttestKeyInfo", "Could not create context",
and "TPM2_Clear" as recoverable TPM errors, plus cases for non-recoverable
errors like GenericError. This will ensure all branches of the
is_recoverable_tpm_client_error classifier are tested and locked in against
regressions.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: d2a5d386-ca15-4bb2-8793-461e9c50d0c0

📥 Commits

Reviewing files that changed from the base of the PR and between 09809c2 and 6d04f46.

📒 Files selected for processing (5)

crates/scout/src/deprovision/scrabbing.rs
crates/scout/src/main.rs
crates/scout/src/platform.rs
crates/scout/src/register.rs
crates/scout/src/tpm.rs

github-actions · 2026-06-22T20:44:31Z

🔍 Container Scan Summary

Service	Total	Critical	High	Medium	Low	Other
boot-artifacts-aarch64	3	0	0	3	0	0
boot-artifacts-x86_64	3	0	0	3	0	0
forge-admin-cli-x86_64	264	6	23	99	6	130
machine-validation-runner	714	34	183	266	35	196
machine_validation	714	34	183	266	35	196
nvmetal-carbide	714	34	183	266	35	196
TOTAL	2412	108	572	903	111	718

Per-CVE detail lives in the per-service grype-* artifacts (JSON + SARIF). Severity counts only — no CVE IDs published here.

…estKeyInfo is not populated' error

github-actions · 2026-06-22T22:49:26Z

🌿 Preview your docs: https://nvidia-preview-pull-request-2670.docs.buildwithfern.com/infra-controller

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

crates/scout/src/register.rs (1)

60-76: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Apply TPM recovery to context creation failures as well.

Line 60 still returns immediately on create_context_from_path failure, so the new self-heal path only covers failures after a TSS context already exists. That leaves one of the PR’s stated local TPM setup failure modes unrecovered.

Proposed fix

-        let mut tss_ctx = attest::create_context_from_path(tpm_path)
-            .map_err(|e| CarbideClientError::TpmError(format!("Could not create context: {e}")))?;
+        let mut tss_ctx = match attest::create_context_from_path(tpm_path) {
+            Ok(tss_ctx) => tss_ctx,
+            Err(e) => {
+                if tpm::should_attempt_tpm_recovery_for_attest_key_failure(&*e) {
+                    tpm::recover_tpm_and_reboot(tpm_path)?;
+                }
+                return Err(CarbideClientError::TpmError(format!(
+                    "Could not create context: {e}"
+                )));
+            }
+        };

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/scout/src/register.rs` around lines 60 - 76, The
create_context_from_path call at line 60 returns immediately on failure without
attempting TPM recovery, while the later create_attest_key_info call properly
checks for recovery conditions and calls tpm::recover_tpm_and_reboot before
returning. Apply the same recovery pattern to create_context_from_path: replace
the map_err approach with a match statement that catches errors, checks if TPM
recovery should be attempted using the appropriate condition (similar to
should_attempt_tpm_recovery_for_attest_key_failure but for context creation),
calls tpm::recover_tpm_and_reboot if recovery is applicable, and then returns
the error if recovery isn't applicable.

♻️ Duplicate comments (1)

crates/scout/src/tpm.rs (1)
94-99: 🩺 Stability & Availability | 🟠 Major | 🏗️ Heavy lift

Recovery classifier is still too permissive for a destructive action.

should_attempt_tpm_recovery_for_attest_key_failure currently treats every error except "not supported" as recoverable. That can still route configuration/path/runtime failures into clear_tpm + reboot, which is an unsafe fallback boundary for ambiguous errors.

Please gate recovery on structured error causes (or a dedicated enum/signal from the call site) instead of near-default true.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/scout/src/tpm.rs` around lines 94 - 99, The
`should_attempt_tpm_recovery_for_attest_key_failure` function is too permissive
by defaulting to true for recovery and only excluding errors containing "not
supported", which allows configuration, path, and runtime errors to trigger the
destructive clear_tpm and reboot action. Refactor this function to be more
restrictive by checking for specific, known-recoverable error types or causes
instead of using a simple exclusionary string match. Consider examining the
error's source chain or implementing a structured error classification mechanism
that explicitly allows recovery only for specific, validated error conditions
that are genuinely safe to recover from.
Source: Path instructions

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@crates/scout/src/register.rs`:
- Around line 60-76: The create_context_from_path call at line 60 returns
immediately on failure without attempting TPM recovery, while the later
create_attest_key_info call properly checks for recovery conditions and calls
tpm::recover_tpm_and_reboot before returning. Apply the same recovery pattern to
create_context_from_path: replace the map_err approach with a match statement
that catches errors, checks if TPM recovery should be attempted using the
appropriate condition (similar to
should_attempt_tpm_recovery_for_attest_key_failure but for context creation),
calls tpm::recover_tpm_and_reboot if recovery is applicable, and then returns
the error if recovery isn't applicable.

---

Duplicate comments:
In `@crates/scout/src/tpm.rs`:
- Around line 94-99: The `should_attempt_tpm_recovery_for_attest_key_failure`
function is too permissive by defaulting to true for recovery and only excluding
errors containing "not supported", which allows configuration, path, and runtime
errors to trigger the destructive clear_tpm and reboot action. Refactor this
function to be more restrictive by checking for specific, known-recoverable
error types or causes instead of using a simple exclusionary string match.
Consider examining the error's source chain or implementing a structured error
classification mechanism that explicitly allows recovery only for specific,
validated error conditions that are genuinely safe to recover from.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 99793a6a-c0e7-4260-a39d-c6cb5a9df1e9

📥 Commits

Reviewing files that changed from the base of the PR and between 6d04f46 and 43ff5ed.

📒 Files selected for processing (5)

crates/scout/src/deprovision/scrabbing.rs
crates/scout/src/main.rs
crates/scout/src/platform.rs
crates/scout/src/register.rs
crates/scout/src/tpm.rs

✅ Files skipped from review due to trivial changes (1)

crates/scout/src/main.rs

🚧 Files skipped from review as they are similar to previous changes (1)

crates/scout/src/deprovision/scrabbing.rs

prbinu-nvidia force-pushed the bug/5614743 branch from 143533f to 6d04f46 Compare June 22, 2026 18:43

prbinu-nvidia marked this pull request as ready for review June 22, 2026 19:15

prbinu-nvidia requested a review from a team as a code owner June 22, 2026 19:15

prbinu-nvidia enabled auto-merge (squash) June 22, 2026 19:15

coderabbitai Bot reviewed Jun 22, 2026

View reviewed changes

Comment thread crates/scout/src/platform.rs Outdated

Comment thread crates/scout/src/register.rs Outdated

Comment thread crates/scout/src/tpm.rs Outdated

Comment thread crates/scout/src/tpm.rs Outdated

prbinu-nvidia added 2 commits June 22, 2026 22:48

[bug/5614743] fix: scout repeatedly fails machine discovery with 'Att…

2d06c9b

…estKeyInfo is not populated' error

[bug/5614743] fix: coderabbit fixes

43ff5ed

prbinu-nvidia force-pushed the bug/5614743 branch from 6d04f46 to 43ff5ed Compare June 22, 2026 22:48

coderabbitai Bot reviewed Jun 22, 2026

View reviewed changes

rwthompsonii approved these changes Jun 23, 2026

View reviewed changes

prbinu-nvidia merged commit cf883d5 into NVIDIA:main Jun 23, 2026
56 checks passed

This was referenced Jun 24, 2026

refactor(machine): separate Machine fields into config/status #2847

Draft

fix(scout): skip attestation on hosts with no TPM device #2900

Open

fix: GBx00 misclassified as DPU during PXE boot #2932

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[bug/5614743] fix: scout repeatedly fails machine discovery with 'AttestKeyInfo is not populated' error#2670

[bug/5614743] fix: scout repeatedly fails machine discovery with 'AttestKeyInfo is not populated' error#2670
prbinu-nvidia merged 2 commits into
NVIDIA:mainfrom
prbinu-nvidia:bug/5614743

prbinu-nvidia commented Jun 16, 2026

Uh oh!

copy-pr-bot Bot commented Jun 16, 2026

Uh oh!

coderabbitai Bot commented Jun 16, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented Jun 22, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jun 22, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

prbinu-nvidia commented Jun 16, 2026

Description

Fixes Applied

Files changed

Expected outcome

Type of Change

Testing

Uh oh!

copy-pr-bot Bot commented Jun 16, 2026

Uh oh!

coderabbitai Bot commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Walkthrough

Changes

Estimated code review effort

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented Jun 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔍 Container Scan Summary

Uh oh!

github-actions Bot commented Jun 22, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

coderabbitai Bot commented Jun 16, 2026 •

edited

Loading

github-actions Bot commented Jun 22, 2026 •

edited

Loading