[bug/5614743] fix: scout repeatedly fails machine discovery with 'AttestKeyInfo is not populated' error#2670
Conversation
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
Summary by CodeRabbit
WalkthroughA new ChangesPlatform Detection Centralisation and TPM Auto-Recovery
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
143533f to
6d04f46
Compare
There was a problem hiding this comment.
Actionable comments posted: 4
🧹 Nitpick comments (2)
crates/scout/src/platform.rs (1)
34-41: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick winStrengthen tests to validate classifier behavior (table-driven), not just non-panicking call.
The current test at Line 39 only checks invocation. Please add table-driven cases for at least BlueField/non-BlueField/empty product-name paths (preferably by extracting a pure helper for product-name classification).
As per coding guidelines, "Prefer table-driven tests for any function that maps inputs to outputs, errors, or other observable results" and "Use
carbide_test_support::value_scenarios!macro for total operations."🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/scout/src/platform.rs` around lines 34 - 41, The test `is_host_returns_bool_without_panicking` only verifies the function doesn't panic rather than validating actual behavior. Replace this with a table-driven test that tests multiple product name scenarios including BlueField, non-BlueField, and empty product name cases, and assert the correct boolean return value for each case instead of just invoking the function. Consider extracting a pure helper function for product-name classification logic to make it testable in isolation, and use the table-driven test pattern or carbide_test_support::value_scenarios! macro as recommended in the coding guidelines.Source: Coding guidelines
crates/scout/src/tpm.rs (1)
137-147: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick winExpand recoverability tests into a table and cover all classifier branches.
Current tests validate only one recoverable phrase and one non-TPM error. Add explicit cases for
"Could not create context"and"TPM2_Clear"to lock classifier behavior and reduce regression risk.As per coding guidelines, "Prefer table-driven tests for any function that maps inputs to outputs" and "Reach for a table whenever two or more tests call the same operation with different inputs."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/scout/src/tpm.rs` around lines 137 - 147, The current tests for the is_recoverable_tpm_client_error function use individual test functions but should be refactored into a table-driven test to provide comprehensive coverage of all classifier branches. Convert the recoverable_tpm_errors_include_attest_key_info_failures and non_tpm_client_errors_are_not_recoverable test functions into a single table-driven test that includes test cases for all the error phrases the classifier checks: "Could not create AttestKeyInfo", "Could not create context", and "TPM2_Clear" as recoverable TPM errors, plus cases for non-recoverable errors like GenericError. This will ensure all branches of the is_recoverable_tpm_client_error classifier are tested and locked in against regressions.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/scout/src/platform.rs`:
- Around line 23-29: The is_host() predicate at lines 23-29 uses any() with a
negated contains("bluefield") check, which incorrectly returns true if any
single system record is not BlueField. The intended behavior is to return true
only if no system records indicate BlueField. Change the logic by removing the
negation (!) from inside the closure where
product_name().to_string().to_lowercase().contains("bluefield") is checked, and
instead negate the entire any() expression, so that the method returns true when
none of the system records contain "bluefield".
In `@crates/scout/src/register.rs`:
- Around line 60-69: The recoverability check in the error handling for
create_context_from_path and the similar code for AttestKeyInfo (around lines
74-84) are checking the wrapped CarbideClientError::TpmError message instead of
the underlying TPM error. Refactor these blocks to check recoverability against
the underlying error `e` directly before wrapping it in
CarbideClientError::TpmError, rather than checking the wrapped error. This
ensures that recovery decisions are based on the actual TPM error signal, not on
the user-facing error message prefix that gets matched by the recoverability
helper.
In `@crates/scout/src/tpm.rs`:
- Around line 90-96: The is_recoverable_tpm_client_error function uses overly
broad string prefix matching, particularly the "Could not create context" check,
which can incorrectly classify non-recoverable TPM errors as recoverable and
trigger unnecessary destructive recovery behavior. Refactor the error
classification in is_recoverable_tpm_client_error to use structured error causes
or a dedicated enum from call sites instead of relying on broad string prefixes,
ensuring that non-recoverable causes like invalid TPM path or runtime setup
issues are properly excluded from being marked as recoverable.
- Around line 103-116: The TPM recovery marker file check is vulnerable to
TOCTOU attacks because it checks existence separately and then creates the file
non-atomically in a world-writable directory. Remove the separate exists() check
on TPM_RECOVERY_ATTEMPTED_PATH, change the TPM_RECOVERY_ATTEMPTED_PATH constant
to point to a root-owned runtime directory (such as /run/...) instead of /tmp,
and replace the File::create() call with File::create_new() to atomically create
the file and fail if it already exists, which eliminates the race condition and
symlink abuse vulnerability.
---
Nitpick comments:
In `@crates/scout/src/platform.rs`:
- Around line 34-41: The test `is_host_returns_bool_without_panicking` only
verifies the function doesn't panic rather than validating actual behavior.
Replace this with a table-driven test that tests multiple product name scenarios
including BlueField, non-BlueField, and empty product name cases, and assert the
correct boolean return value for each case instead of just invoking the
function. Consider extracting a pure helper function for product-name
classification logic to make it testable in isolation, and use the table-driven
test pattern or carbide_test_support::value_scenarios! macro as recommended in
the coding guidelines.
In `@crates/scout/src/tpm.rs`:
- Around line 137-147: The current tests for the is_recoverable_tpm_client_error
function use individual test functions but should be refactored into a
table-driven test to provide comprehensive coverage of all classifier branches.
Convert the recoverable_tpm_errors_include_attest_key_info_failures and
non_tpm_client_errors_are_not_recoverable test functions into a single
table-driven test that includes test cases for all the error phrases the
classifier checks: "Could not create AttestKeyInfo", "Could not create context",
and "TPM2_Clear" as recoverable TPM errors, plus cases for non-recoverable
errors like GenericError. This will ensure all branches of the
is_recoverable_tpm_client_error classifier are tested and locked in against
regressions.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: d2a5d386-ca15-4bb2-8793-461e9c50d0c0
📒 Files selected for processing (5)
crates/scout/src/deprovision/scrabbing.rscrates/scout/src/main.rscrates/scout/src/platform.rscrates/scout/src/register.rscrates/scout/src/tpm.rs
🔍 Container Scan Summary
Per-CVE detail lives in the per-service |
…estKeyInfo is not populated' error
6d04f46 to
43ff5ed
Compare
|
🌿 Preview your docs: https://nvidia-preview-pull-request-2670.docs.buildwithfern.com/infra-controller |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
crates/scout/src/register.rs (1)
60-76: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick winApply TPM recovery to context creation failures as well.
Line 60 still returns immediately on
create_context_from_pathfailure, so the new self-heal path only covers failures after a TSS context already exists. That leaves one of the PR’s stated local TPM setup failure modes unrecovered.Proposed fix
- let mut tss_ctx = attest::create_context_from_path(tpm_path) - .map_err(|e| CarbideClientError::TpmError(format!("Could not create context: {e}")))?; + let mut tss_ctx = match attest::create_context_from_path(tpm_path) { + Ok(tss_ctx) => tss_ctx, + Err(e) => { + if tpm::should_attempt_tpm_recovery_for_attest_key_failure(&*e) { + tpm::recover_tpm_and_reboot(tpm_path)?; + } + return Err(CarbideClientError::TpmError(format!( + "Could not create context: {e}" + ))); + } + };🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/scout/src/register.rs` around lines 60 - 76, The create_context_from_path call at line 60 returns immediately on failure without attempting TPM recovery, while the later create_attest_key_info call properly checks for recovery conditions and calls tpm::recover_tpm_and_reboot before returning. Apply the same recovery pattern to create_context_from_path: replace the map_err approach with a match statement that catches errors, checks if TPM recovery should be attempted using the appropriate condition (similar to should_attempt_tpm_recovery_for_attest_key_failure but for context creation), calls tpm::recover_tpm_and_reboot if recovery is applicable, and then returns the error if recovery isn't applicable.
♻️ Duplicate comments (1)
crates/scout/src/tpm.rs (1)
94-99: 🩺 Stability & Availability | 🟠 Major | 🏗️ Heavy liftRecovery classifier is still too permissive for a destructive action.
should_attempt_tpm_recovery_for_attest_key_failurecurrently treats every error except"not supported"as recoverable. That can still route configuration/path/runtime failures intoclear_tpm + reboot, which is an unsafe fallback boundary for ambiguous errors.Please gate recovery on structured error causes (or a dedicated enum/signal from the call site) instead of near-default
true.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/scout/src/tpm.rs` around lines 94 - 99, The `should_attempt_tpm_recovery_for_attest_key_failure` function is too permissive by defaulting to true for recovery and only excluding errors containing "not supported", which allows configuration, path, and runtime errors to trigger the destructive clear_tpm and reboot action. Refactor this function to be more restrictive by checking for specific, known-recoverable error types or causes instead of using a simple exclusionary string match. Consider examining the error's source chain or implementing a structured error classification mechanism that explicitly allows recovery only for specific, validated error conditions that are genuinely safe to recover from.Source: Path instructions
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@crates/scout/src/register.rs`:
- Around line 60-76: The create_context_from_path call at line 60 returns
immediately on failure without attempting TPM recovery, while the later
create_attest_key_info call properly checks for recovery conditions and calls
tpm::recover_tpm_and_reboot before returning. Apply the same recovery pattern to
create_context_from_path: replace the map_err approach with a match statement
that catches errors, checks if TPM recovery should be attempted using the
appropriate condition (similar to
should_attempt_tpm_recovery_for_attest_key_failure but for context creation),
calls tpm::recover_tpm_and_reboot if recovery is applicable, and then returns
the error if recovery isn't applicable.
---
Duplicate comments:
In `@crates/scout/src/tpm.rs`:
- Around line 94-99: The `should_attempt_tpm_recovery_for_attest_key_failure`
function is too permissive by defaulting to true for recovery and only excluding
errors containing "not supported", which allows configuration, path, and runtime
errors to trigger the destructive clear_tpm and reboot action. Refactor this
function to be more restrictive by checking for specific, known-recoverable
error types or causes instead of using a simple exclusionary string match.
Consider examining the error's source chain or implementing a structured error
classification mechanism that explicitly allows recovery only for specific,
validated error conditions that are genuinely safe to recover from.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 99793a6a-c0e7-4260-a39d-c6cb5a9df1e9
📒 Files selected for processing (5)
crates/scout/src/deprovision/scrabbing.rscrates/scout/src/main.rscrates/scout/src/platform.rscrates/scout/src/register.rscrates/scout/src/tpm.rs
✅ Files skipped from review due to trivial changes (1)
- crates/scout/src/main.rs
🚧 Files skipped from review as they are similar to previous changes (1)
- crates/scout/src/deprovision/scrabbing.rs
Description
The
forge-scoutrepeatedly fails machine discovery againstcarbide-apiwith the hosts remain stuck inbootingwithdiscoveryimagestate, and with the error:.Manual workaround: Run
tpm2_clear(or equivalent TPM clear) and reboot the host. Discovery then succeeds.Root cause: In register.rs, scout decided whether it was running on a DPU vs. a managed host using TPM EK certificate presence:
On affected DGX hosts, hardware enumeration can leave
tpm_ek_certificateunset even though the machine is a normal x86 host (not a BlueField DPU). Scout then treated the host as a DPU, skipped attestation key setup (create_attest_key_info), and sent registration data withoutAttestKeyInfo. carbide-api correctly rejected the request.Impact: Host discovery cannot complete without operator intervention (TPM clear + reboot). Affected platforms include DGX H100/GB200 class systems where EK cert enumeration is missing or incomplete.
Fixes Applied
AttestKeyInfois populated before registration.Added
platform::is_host()that reads SMBIOS system information and returns false when the product name contains "bluefield" (DPU), true otherwise.tpm2_clearand reboot) is invoked from register.rs whencreate_context_from_pathorcreate_attest_key_infofails with a recoverable TPM error.Files changed
crates/scout/src/register.rsis_dpulogic; wire TPM recovery on setup failurecrates/scout/src/platform.rsis_host()crates/scout/src/tpm.rscrates/scout/src/deprovision/scrabbing.rsplatform::is_host()crates/scout/src/main.rsmod platformExpected outcome
TPM2_Clear+ reboot before giving up.Type of Change
Testing