Skip to content

[bug/5614743] fix: scout repeatedly fails machine discovery with 'AttestKeyInfo is not populated' error#2670

Merged
prbinu-nvidia merged 2 commits into
NVIDIA:mainfrom
prbinu-nvidia:bug/5614743
Jun 23, 2026
Merged

[bug/5614743] fix: scout repeatedly fails machine discovery with 'AttestKeyInfo is not populated' error#2670
prbinu-nvidia merged 2 commits into
NVIDIA:mainfrom
prbinu-nvidia:bug/5614743

Conversation

@prbinu-nvidia

Copy link
Copy Markdown
Contributor

Description

Theforge-scout repeatedly fails machine discovery against carbide-api with the hosts remain stuck in bootingwithdiscoveryimage state, and with the error:.

AttestKeyInfo is not populated

Manual workaround: Run tpm2_clear (or equivalent TPM clear) and reboot the host. Discovery then succeeds.

Root cause: In register.rs, scout decided whether it was running on a DPU vs. a managed host using TPM EK certificate presence:

let is_dpu = hardware_info.tpm_ek_certificate.is_none();

On affected DGX hosts, hardware enumeration can leave tpm_ek_certificate unset even though the machine is a normal x86 host (not a BlueField DPU). Scout then treated the host as a DPU, skipped attestation key setup (create_attest_key_info), and sent registration data without AttestKeyInfo. carbide-api correctly rejected the request.

Impact: Host discovery cannot complete without operator intervention (TPM clear + reboot). Affected platforms include DGX H100/GB200 class systems where EK cert enumeration is missing or incomplete.

Fixes Applied

  1. correct host vs. DPU detection (register.rs): Replaced TPM-based DPU inference with SMBIOS platform detection. Hosts without an EK certificate are no longer misclassified as DPUs. Attestation key creation runs on managed hosts as intended, so AttestKeyInfo is populated before registration.
  2. Shared platform helper (platform.rs, new)
    Added platform::is_host() that reads SMBIOS system information and returns false when the product name contains "bluefield" (DPU), true otherwise.
  3. automated TPM self-heal (tpm.rs, register.rs) For genuine TPM corruption, added optional one-shot recovery when local TPM setup fails. Recovery (tpm2_clear and reboot) is invoked from register.rs when create_context_from_path or create_attest_key_info fails with a recoverable TPM error.

Files changed

File Change
crates/scout/src/register.rs Fix is_dpu logic; wire TPM recovery on setup failure
crates/scout/src/platform.rs New shared SMBIOS-based is_host()
crates/scout/src/tpm.rs Add recovery helpers and tests
crates/scout/src/deprovision/scrabbing.rs Use shared platform::is_host()
crates/scout/src/main.rs Add mod platform

Expected outcome

  • DGX H100/GB200 hosts with missing EK cert material proceed through attestation and discovery without manual TPM clear.
  • If TPM state is genuinely corrupted and local setup fails, scout attempts one automated TPM2_Clear + reboot before giving up.

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

@copy-pr-bot

copy-pr-bot Bot commented Jun 16, 2026

Copy link
Copy Markdown

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Summary by CodeRabbit

  • New Features

    • Added an automated TPM recovery flow for certain attestation-key failures, including a one-time-per-boot-cycle reboot marker.
    • Improved host vs. accelerator (DPU) detection using centralized platform logic.
  • Bug Fixes

    • Refined TPM error classification and recovery triggering to better match “recoverable” vs “not supported” scenarios during registration.
  • Refactor

    • Updated deprovisioning and registration flows to use the shared platform detection for consistent cleanup/skip behavior.

Walkthrough

A new platform module centralises host/DPU detection via SMBIOS product-name inspection. The deprovision scrubbing and registration flows migrate to platform::is_host(). Registration gains structured TPM error recovery for attestation key provisioning: recoverable errors trigger a one-shot TPM clear and system reboot via systemctl reboot, guarded by a /run/scout/tpm_recovery_reboot_attempted marker file to prevent boot-cycle looping.

Changes

Platform Detection Centralisation and TPM Auto-Recovery

Layer / File(s) Summary
New platform module: SMBIOS-based is_host()
crates/scout/src/main.rs, crates/scout/src/platform.rs
platform.rs is registered as a crate submodule and implements pub(crate) fn is_host_from_product_names() to detect host vs. DPU via case-insensitive "bluefield" substring matching in product names, and pub(crate) fn is_host() to load SMBIOS system information and delegate to the helper with true-on-load-failure fallback. Expanded unit test coverage validates multiple product-name cases.
Migrate deprovision scrubbing to platform::is_host()
crates/scout/src/deprovision/scrabbing.rs
Imports updated to include platform; in-file SMBIOS-based is_host() removed. Both run() and run_no_api() now gate cleanup behavior using platform::is_host(), preserving the same non-host early-return semantics.
Migrate register.rs DPU detection to platform::is_host()
crates/scout/src/register.rs
platform import added; is_dpu is now derived from !platform::is_host(), replacing the previous TPM EK certificate absence heuristic.
TPM recovery mechanism with marker-based idempotency
crates/scout/src/tpm.rs
Adds TPM_RECOVERY_ATTEMPTED_PATH constant (/run/scout/tpm_recovery_reboot_attempted), should_attempt_tpm_recovery_for_attest_key_failure() (classifies errors via lowercased message inspection; denies recovery if "not supported" substring present), and claim_tpm_recovery_attempt() to atomically create the boot-cycle marker. recover_tpm_and_reboot() now uses the marker-claim helper, invokes clear_tpm, and triggers systemctl reboot with error handling for non-success exits. Unit tests validate message-based classification for recoverable vs. non-recoverable cases.
Integrate TPM recovery into registration attestation key provisioning
crates/scout/src/register.rs
Error handling for attest::create_attest_key_info refactored from map_err/? chains into match blocks. On failure, should_attempt_tpm_recovery_for_attest_key_failure() classifies the error; if recovery is warranted, invoke recover_tpm_and_reboot() before returning CarbideClientError::TpmError. Successful results populate hardware_info.attest_key_info and cached handles as before.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title directly addresses the core bug fix: scout's repeated failures in machine discovery due to missing AttestKeyInfo, with a clear reference to the issue identifier.
Description check ✅ Passed The description is comprehensive and clearly related to the changeset, explaining root cause, fixes applied, files changed, and expected outcomes for the bug fix.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@prbinu-nvidia prbinu-nvidia marked this pull request as ready for review June 22, 2026 19:15
@prbinu-nvidia prbinu-nvidia requested a review from a team as a code owner June 22, 2026 19:15
@prbinu-nvidia prbinu-nvidia enabled auto-merge (squash) June 22, 2026 19:15

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Nitpick comments (2)
crates/scout/src/platform.rs (1)

34-41: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Strengthen tests to validate classifier behavior (table-driven), not just non-panicking call.

The current test at Line 39 only checks invocation. Please add table-driven cases for at least BlueField/non-BlueField/empty product-name paths (preferably by extracting a pure helper for product-name classification).

As per coding guidelines, "Prefer table-driven tests for any function that maps inputs to outputs, errors, or other observable results" and "Use carbide_test_support::value_scenarios! macro for total operations."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/scout/src/platform.rs` around lines 34 - 41, The test
`is_host_returns_bool_without_panicking` only verifies the function doesn't
panic rather than validating actual behavior. Replace this with a table-driven
test that tests multiple product name scenarios including BlueField,
non-BlueField, and empty product name cases, and assert the correct boolean
return value for each case instead of just invoking the function. Consider
extracting a pure helper function for product-name classification logic to make
it testable in isolation, and use the table-driven test pattern or
carbide_test_support::value_scenarios! macro as recommended in the coding
guidelines.

Source: Coding guidelines

crates/scout/src/tpm.rs (1)

137-147: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Expand recoverability tests into a table and cover all classifier branches.

Current tests validate only one recoverable phrase and one non-TPM error. Add explicit cases for "Could not create context" and "TPM2_Clear" to lock classifier behavior and reduce regression risk.

As per coding guidelines, "Prefer table-driven tests for any function that maps inputs to outputs" and "Reach for a table whenever two or more tests call the same operation with different inputs."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/scout/src/tpm.rs` around lines 137 - 147, The current tests for the
is_recoverable_tpm_client_error function use individual test functions but
should be refactored into a table-driven test to provide comprehensive coverage
of all classifier branches. Convert the
recoverable_tpm_errors_include_attest_key_info_failures and
non_tpm_client_errors_are_not_recoverable test functions into a single
table-driven test that includes test cases for all the error phrases the
classifier checks: "Could not create AttestKeyInfo", "Could not create context",
and "TPM2_Clear" as recoverable TPM errors, plus cases for non-recoverable
errors like GenericError. This will ensure all branches of the
is_recoverable_tpm_client_error classifier are tested and locked in against
regressions.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/scout/src/platform.rs`:
- Around line 23-29: The is_host() predicate at lines 23-29 uses any() with a
negated contains("bluefield") check, which incorrectly returns true if any
single system record is not BlueField. The intended behavior is to return true
only if no system records indicate BlueField. Change the logic by removing the
negation (!) from inside the closure where
product_name().to_string().to_lowercase().contains("bluefield") is checked, and
instead negate the entire any() expression, so that the method returns true when
none of the system records contain "bluefield".

In `@crates/scout/src/register.rs`:
- Around line 60-69: The recoverability check in the error handling for
create_context_from_path and the similar code for AttestKeyInfo (around lines
74-84) are checking the wrapped CarbideClientError::TpmError message instead of
the underlying TPM error. Refactor these blocks to check recoverability against
the underlying error `e` directly before wrapping it in
CarbideClientError::TpmError, rather than checking the wrapped error. This
ensures that recovery decisions are based on the actual TPM error signal, not on
the user-facing error message prefix that gets matched by the recoverability
helper.

In `@crates/scout/src/tpm.rs`:
- Around line 90-96: The is_recoverable_tpm_client_error function uses overly
broad string prefix matching, particularly the "Could not create context" check,
which can incorrectly classify non-recoverable TPM errors as recoverable and
trigger unnecessary destructive recovery behavior. Refactor the error
classification in is_recoverable_tpm_client_error to use structured error causes
or a dedicated enum from call sites instead of relying on broad string prefixes,
ensuring that non-recoverable causes like invalid TPM path or runtime setup
issues are properly excluded from being marked as recoverable.
- Around line 103-116: The TPM recovery marker file check is vulnerable to
TOCTOU attacks because it checks existence separately and then creates the file
non-atomically in a world-writable directory. Remove the separate exists() check
on TPM_RECOVERY_ATTEMPTED_PATH, change the TPM_RECOVERY_ATTEMPTED_PATH constant
to point to a root-owned runtime directory (such as /run/...) instead of /tmp,
and replace the File::create() call with File::create_new() to atomically create
the file and fail if it already exists, which eliminates the race condition and
symlink abuse vulnerability.

---

Nitpick comments:
In `@crates/scout/src/platform.rs`:
- Around line 34-41: The test `is_host_returns_bool_without_panicking` only
verifies the function doesn't panic rather than validating actual behavior.
Replace this with a table-driven test that tests multiple product name scenarios
including BlueField, non-BlueField, and empty product name cases, and assert the
correct boolean return value for each case instead of just invoking the
function. Consider extracting a pure helper function for product-name
classification logic to make it testable in isolation, and use the table-driven
test pattern or carbide_test_support::value_scenarios! macro as recommended in
the coding guidelines.

In `@crates/scout/src/tpm.rs`:
- Around line 137-147: The current tests for the is_recoverable_tpm_client_error
function use individual test functions but should be refactored into a
table-driven test to provide comprehensive coverage of all classifier branches.
Convert the recoverable_tpm_errors_include_attest_key_info_failures and
non_tpm_client_errors_are_not_recoverable test functions into a single
table-driven test that includes test cases for all the error phrases the
classifier checks: "Could not create AttestKeyInfo", "Could not create context",
and "TPM2_Clear" as recoverable TPM errors, plus cases for non-recoverable
errors like GenericError. This will ensure all branches of the
is_recoverable_tpm_client_error classifier are tested and locked in against
regressions.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: d2a5d386-ca15-4bb2-8793-461e9c50d0c0

📥 Commits

Reviewing files that changed from the base of the PR and between 09809c2 and 6d04f46.

📒 Files selected for processing (5)
  • crates/scout/src/deprovision/scrabbing.rs
  • crates/scout/src/main.rs
  • crates/scout/src/platform.rs
  • crates/scout/src/register.rs
  • crates/scout/src/tpm.rs

Comment thread crates/scout/src/platform.rs Outdated
Comment thread crates/scout/src/register.rs Outdated
Comment thread crates/scout/src/tpm.rs Outdated
Comment thread crates/scout/src/tpm.rs Outdated
@github-actions

github-actions Bot commented Jun 22, 2026

Copy link
Copy Markdown

🔍 Container Scan Summary

Service Total Critical High Medium Low Other
boot-artifacts-aarch64 3 0 0 3 0 0
boot-artifacts-x86_64 3 0 0 3 0 0
forge-admin-cli-x86_64 264 6 23 99 6 130
machine-validation-runner 714 34 183 266 35 196
machine_validation 714 34 183 266 35 196
nvmetal-carbide 714 34 183 266 35 196
TOTAL 2412 108 572 903 111 718

Per-CVE detail lives in the per-service grype-* artifacts (JSON + SARIF). Severity counts only — no CVE IDs published here.

@github-actions

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
crates/scout/src/register.rs (1)

60-76: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Apply TPM recovery to context creation failures as well.

Line 60 still returns immediately on create_context_from_path failure, so the new self-heal path only covers failures after a TSS context already exists. That leaves one of the PR’s stated local TPM setup failure modes unrecovered.

Proposed fix
-        let mut tss_ctx = attest::create_context_from_path(tpm_path)
-            .map_err(|e| CarbideClientError::TpmError(format!("Could not create context: {e}")))?;
+        let mut tss_ctx = match attest::create_context_from_path(tpm_path) {
+            Ok(tss_ctx) => tss_ctx,
+            Err(e) => {
+                if tpm::should_attempt_tpm_recovery_for_attest_key_failure(&*e) {
+                    tpm::recover_tpm_and_reboot(tpm_path)?;
+                }
+                return Err(CarbideClientError::TpmError(format!(
+                    "Could not create context: {e}"
+                )));
+            }
+        };
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/scout/src/register.rs` around lines 60 - 76, The
create_context_from_path call at line 60 returns immediately on failure without
attempting TPM recovery, while the later create_attest_key_info call properly
checks for recovery conditions and calls tpm::recover_tpm_and_reboot before
returning. Apply the same recovery pattern to create_context_from_path: replace
the map_err approach with a match statement that catches errors, checks if TPM
recovery should be attempted using the appropriate condition (similar to
should_attempt_tpm_recovery_for_attest_key_failure but for context creation),
calls tpm::recover_tpm_and_reboot if recovery is applicable, and then returns
the error if recovery isn't applicable.
♻️ Duplicate comments (1)
crates/scout/src/tpm.rs (1)

94-99: 🩺 Stability & Availability | 🟠 Major | 🏗️ Heavy lift

Recovery classifier is still too permissive for a destructive action.

should_attempt_tpm_recovery_for_attest_key_failure currently treats every error except "not supported" as recoverable. That can still route configuration/path/runtime failures into clear_tpm + reboot, which is an unsafe fallback boundary for ambiguous errors.

Please gate recovery on structured error causes (or a dedicated enum/signal from the call site) instead of near-default true.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/scout/src/tpm.rs` around lines 94 - 99, The
`should_attempt_tpm_recovery_for_attest_key_failure` function is too permissive
by defaulting to true for recovery and only excluding errors containing "not
supported", which allows configuration, path, and runtime errors to trigger the
destructive clear_tpm and reboot action. Refactor this function to be more
restrictive by checking for specific, known-recoverable error types or causes
instead of using a simple exclusionary string match. Consider examining the
error's source chain or implementing a structured error classification mechanism
that explicitly allows recovery only for specific, validated error conditions
that are genuinely safe to recover from.

Source: Path instructions

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@crates/scout/src/register.rs`:
- Around line 60-76: The create_context_from_path call at line 60 returns
immediately on failure without attempting TPM recovery, while the later
create_attest_key_info call properly checks for recovery conditions and calls
tpm::recover_tpm_and_reboot before returning. Apply the same recovery pattern to
create_context_from_path: replace the map_err approach with a match statement
that catches errors, checks if TPM recovery should be attempted using the
appropriate condition (similar to
should_attempt_tpm_recovery_for_attest_key_failure but for context creation),
calls tpm::recover_tpm_and_reboot if recovery is applicable, and then returns
the error if recovery isn't applicable.

---

Duplicate comments:
In `@crates/scout/src/tpm.rs`:
- Around line 94-99: The `should_attempt_tpm_recovery_for_attest_key_failure`
function is too permissive by defaulting to true for recovery and only excluding
errors containing "not supported", which allows configuration, path, and runtime
errors to trigger the destructive clear_tpm and reboot action. Refactor this
function to be more restrictive by checking for specific, known-recoverable
error types or causes instead of using a simple exclusionary string match.
Consider examining the error's source chain or implementing a structured error
classification mechanism that explicitly allows recovery only for specific,
validated error conditions that are genuinely safe to recover from.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 99793a6a-c0e7-4260-a39d-c6cb5a9df1e9

📥 Commits

Reviewing files that changed from the base of the PR and between 6d04f46 and 43ff5ed.

📒 Files selected for processing (5)
  • crates/scout/src/deprovision/scrabbing.rs
  • crates/scout/src/main.rs
  • crates/scout/src/platform.rs
  • crates/scout/src/register.rs
  • crates/scout/src/tpm.rs
✅ Files skipped from review due to trivial changes (1)
  • crates/scout/src/main.rs
🚧 Files skipped from review as they are similar to previous changes (1)
  • crates/scout/src/deprovision/scrabbing.rs

@prbinu-nvidia prbinu-nvidia merged commit cf883d5 into NVIDIA:main Jun 23, 2026
56 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants