feat: support storing secrets/credentials in Postgres by chet · Pull Request #2665 · NVIDIA/infra-controller

chet · 2026-06-16T20:45:10Z

Description

This supports #353.

This introduces a new secrets/credentials storage backend of sorts, such that NICo can now keep its credentials in Postgres instead of Vault, encrypted (with envelope encryption, leaning on #939) so that a copy of the database on its own gives up nothing. This entirely feature/flagged and configurable, currently not enabled at all, but will let us:

Use it alongside Vault.
Use it to migrate away from Vault.

The encryption is layered. Every credential gets its own data encryption key (DEK), and that DEK is wrapped by a key encryption key (KEK) that lives outside the database -- in NICo itself or whatever KmsProvider we choose. The credential's path is mixed into the encryption as well. The KEK is the only thing an operator has to protect; everything else lives in Postgres (some reference here and here.

PostgresCredentialManager implements the same CredentialManager traits the rest of the system already uses, so nothing downstream had to change. It keeps the two reader behaviors callers learned from Vault: the newest write for a path wins, and an empty-password entry reads as "no credential" (several delete paths still record that tombstone). The store is an append-only journal -- one row per write, newest-per-path on read -- which the rotation work going on in #367 can use for history, rollback, rotation, etc.

Where the wrapping keys come from is pluggable through KmsBackend: local key material (integrated), or Vault/OpenBao Transit secrets engine API, and more than one provider at once. [secrets.routing] maps path prefixes to the KEK that encrypts new writes under them. Startup cross-checks the routing against the providers, so a misspelled, duplicated, or colliding key fails at startup.

Moving an existing site off Vault could be:

A one-time import at startup -- deliberately all-or-nothing: any list or read failure, or an empty listing, aborts the boot rather than recording a half-finished import as done, and only one replica runs it at a time behind a Postgres advisory lock. Once it completes, Vault is out of the credential chain entirely -- with [secrets] set, the chain is env -> file -> postgres and nothing falls back to Vault. Prerequisites that live outside this process (services that still read Vault directly, mixed-fleet rolling upgrades) are spelled out on SecretsConfig.
A slow migration -- we would introduce it as the primary CredentialWriter, such that all new credentials get written to Postgres, with Vault still in the ChainedCredentialReader path, allowing us to slowly migrate over.

BUT, migration is NOT a part of this PR. This PR is for getting some initial code in place, and then we will iterate on it, and then do subsequent work for migration planning.

Rotating a KEK is a config change plus one command: point the route at the new key and run carbide-admin-cli secrets re-wrap. Only the per-row data-key wrapping is redone; the encrypted values are never touched. The re-wrap makes its KMS calls outside the write transaction, runs one at a time, and reports how many rows still sit on a retired key so you know when the old KEK is safe to remove.

Tests cover the manager round-trip, journal write-order under equal timestamps, rotation rollback by deleting the newest entry, tombstone reads, create conflicts, re-wrap idempotence and counting, and a path-binding regression that confirms a transplanted row will not decrypt -- plus an end-to-end Vault import against a real Vault dev server.

Signed-off-by: Chet Nichols III chetn@nvidia.com

Related issues

Type of Change

Add - New feature or capability
Change - Changes in existing functionality
Fix - Bug fixes
Remove - Removed features or deprecated functionality
Internal - Internal changes (refactoring, tests, docs, etc.)

Breaking Changes

This PR contains breaking changes

Testing

Unit tests added/updated
Integration tests added/updated
Manual testing performed
No testing required (docs, internal refactor, etc.)

Additional Notes

copy-pr-bot · 2026-06-16T20:45:14Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

coderabbitai · 2026-06-16T20:45:19Z

Summary by CodeRabbit

New Features
- Added secrets command to the admin CLI, including secrets re-wrap to re-encrypt stale secrets to the currently active KMS key routing (supports optional --batch-size).
- Added Postgres-backed encrypted secrets storage with configurable KMS provider routing.
- Added one-time Vault-to-Postgres import with completion tracking.
Improvements
- Added dedicated metrics for secrets operations and expanded RBAC coverage for re-wrapping.
- Strengthened encryption with authenticated associated data to prevent cross-context decryption.

Walkthrough

This change introduces a Postgres-backed secrets backend as an alternative to Vault KV, featuring append-only journal semantics with AES-256-GCM envelope encryption per row, multi-provider KMS with prefix-based KEK routing, a one-time Vault import flow, a re_wrap_stale operation for DEK key rotation, updated KMS crypto and provider contracts (AAD binding, kek_ids() method, cancellable token renewal), a gRPC ReWrapSecrets RPC with batch-size control and single-flight advisory locking, and an admin-cli secrets re-wrap command with user-facing completion reporting.

Changes

Postgres Secrets Backend with KMS, Re-wrap, and Admin CLI

Layer / File(s)	Summary
Secret identity, schema, and database layer `crates/uuid/src/secret.rs`, `crates/uuid/src/lib.rs`, `crates/api-model/src/secrets.rs`, `crates/api-model/src/lib.rs`, `crates/api-db/migrations/20260416000305_create_secrets_table.sql`, `crates/api-db/src/secrets.rs`, `crates/api-db/src/lib.rs`	`SecretId` typed UUID and `SecretRow` SQLx struct are defined. The `secrets` migration creates an append-only journal table with identity-backed `seq`, envelope-encryption columns (`encrypted_value`, `nonce`, `kek_id`, `encrypted_dek`, `dek_nonce`), created-at timestamp, and two performance indexes. The DB access layer implements all read (latest, history, KEK-scoped, paginated), write (insert, insert-if-missing with transaction/session advisory locking), deletion, and DEK-wrap-only update operations.
KMS provider contract: AAD binding, kek_ids method, Transit cancellation `crates/kms-provider/src/crypto.rs`, `crates/kms-provider/src/lib.rs`, `crates/kms-provider/src/providers/integrated.rs`, `crates/kms-provider/src/providers/multi.rs`, `crates/kms-provider/src/providers/transit.rs`, `crates/kms-provider/Cargo.toml`	`encrypt`/`decrypt` gain an `aad: &[u8]` parameter for AES-256-GCM authenticated data binding, enabling ciphertext binding to storage context (e.g., secret path). `KmsBackend` trait gains a required `kek_ids()` method for startup KEK deduplication validation. All providers implement the new contract. `TransitKmsProvider::start_token_renewal` is replaced by a cancellable `run_token_renewal(cancel)` future. Key source handling adds zeroization wrappers and file permission warnings on Unix platforms.
Vault client enumeration modes and raw settings helper `crates/secrets/src/forge_vault.rs`, `crates/secrets/src/lib.rs`	`EnumerationMode` (BestEffort/Strict) controls Vault KV enumeration and read error handling. `get_secrets_strict` performs full KV mount enumeration with strict semantics (failing on first non-404 error). Methods are refactored to support both modes. `create_raw_vault_client_settings` constructs a standalone `VaultClientSettings` instance for narrow provider-specific Vault client bootstrap, complementing the existing ForgeVaultClient.
Secrets domain logic: routing, credential manager, encryption, metrics, import, and re-wrap `crates/api-core/src/secrets/routing.rs`, `crates/api-core/src/secrets/mod.rs`, `crates/api-core/src/secrets/metrics.rs`, `crates/api-core/src/secrets/import.rs`, `crates/api-core/src/secrets/re_wrap.rs`, `crates/api-core/src/lib.rs`, `crates/api-core/src/secrets/tests.rs`	`SecretRouting` implements longest-prefix-first KEK selection with catch-all fallback, enforces strict configuration validation (required `/` catch-all, no empty prefixes or KEKs, no normalized-route collisions, no leading-slash non-catch-all prefixes). `PostgresCredentialManager` implements `CredentialReader` and `CredentialWriter` traits with AES-256-GCM envelope encryption using path-bound AAD, Vault tombstone semantics (empty password → `Ok(None)`), create-only enforcement, and delete-by-id rotation. `SecretsMetrics` and `OperationTimer` instrument operations via OpenTelemetry counters and histograms. `import_secrets` supports `MissingOnly` (transactional check-then-insert) and `All` (unconditional appends) strategies. `re_wrap_stale` rotates DEKs in batches under a Postgres session advisory lock, re-encrypting rows whose stored KEK differs from the routed KEK, tracking `re_wrapped`, `already_current`, and `stale_remaining` counts with single-flight enforcement. Comprehensive test suite covers journal semantics, tombstones, AAD path binding, rotation rollback, and re-wrap correctness (idempotency, per-row classification).
Configuration types and startup orchestration `crates/api-core/src/cfg/file.rs`, `crates/api-core/src/run.rs`, `crates/api-core/src/setup.rs`, `crates/api-core/src/test_support/builder.rs`, `crates/api-core/src/test_support/default_config.rs`, `crates/api-core/Cargo.toml`	`CarbideConfig` gains an optional `[secrets]` section with `SecretsConfig`, `KmsConfig`, `ProviderConfig` (serde-tagged `integrated` and `transit` variants with variant-specific fields), and `ImportSource` types, all with strict unknown-field denial and comprehensive TOML validation tests. `run.rs` bifurcates on `carbide_config.secrets` presence: when configured, builds `SecretRouting`, constructs and validates `MultiKmsProvider` (ensuring no KEK ID duplication across providers), instantiates `PostgresCredentialManager` with optional metrics, conditionally performs strict one-time Vault import guarded by a DB session advisory lock and completion marker; otherwise falls back to `CARBIDE_CREDENTIAL_STORE` (vault/memory). `setup.rs` refactors `start_api` to accept pre-built `PgPool` and optional `SecretsContext` rather than creating the pool internally. Test support builder threads `secrets_context` into test `Api` instances.
RPC surface, API struct, handler wiring, and RBAC `crates/rpc/proto/forge.proto`, `crates/api-core/src/api.rs`, `crates/api-core/src/handlers/secrets.rs`, `crates/api-core/src/handlers/mod.rs`, `crates/api-core/src/auth/internal_rbac_rules.rs`	`ReWrapSecrets` RPC is added to the `Forge` service with `ReWrapSecretsRequest` (optional `batch_size`) and `ReWrapSecretsResponse` (counters: `re_wrapped`, `already_current`, `stale_remaining`). `Api` struct gains `secrets_context: Option<SecretsContext>` field. The `re_wrap_secrets` handler validates secrets context precondition, parses and clamps `batch_size` to an inclusive safe range, invokes `re_wrap_stale`, maps `PgSecretsError::ReWrapInProgress` to `FailedPrecondition`, and returns response with operation counts. `ForgeAdminCLI` principal is granted `ReWrapSecrets` RBAC permission.
Admin CLI secrets command tree and re-wrap subcommand `crates/admin-cli/src/cfg/cli_options.rs`, `crates/admin-cli/src/main.rs`, `crates/admin-cli/src/secrets/mod.rs`, `crates/admin-cli/src/secrets/re_wrap/args.rs`, `crates/admin-cli/src/secrets/re_wrap/cmd.rs`, `crates/admin-cli/src/secrets/re_wrap/mod.rs`, `crates/admin-cli/cli_domains.yaml`	A `secrets` top-level subcommand is added to `CliCommand` with clap configuration and dispatch routing. The `re-wrap` subcommand accepts `--batch-size` (optional `u32`) for controlling journal walk batch size, calls the `ReWrapSecrets` RPC via `ApiClient`, and prints user-facing output: counts of re-wrapped and already-current rows; if `stale_remaining` is zero, outputs a success message, otherwise prints a warning with the stale count and re-run guidance.
Vault-to-Postgres import integration test `crates/api-integration-tests/tests/secrets_import.rs`, `crates/api-integration-tests/Cargo.toml`	End-to-end test exercises full Vault→Postgres import flow: allocates a temporary database, starts a real Vault dev server, generates credential variants covering multiple `CredentialKey` types, writes to Vault KV mount, imports via `ImportApproach::All` with direct round-trip verification from `PostgresCredentialManager`, re-imports with `ImportApproach::MissingOnly` and asserts idempotency, validates import completion marker via `is_vault_import_complete`. Test dependencies include `carbide-kms-provider`, `carbide-api-core`/`carbide-api-db` (no defaults), and `base64`.

Sequence Diagram(s)

sequenceDiagram
  participant CLI as admin-cli
  participant API as api-core Forge gRPC
  participant Handler as handlers::secrets
  participant ReWrap as secrets::re_wrap_stale
  participant KMS as MultiKmsProvider
  participant DB as Postgres secrets table

  CLI->>API: ReWrapSecretsRequest { batch_size }
  API->>Handler: re_wrap_secrets(request)
  Handler->>Handler: clamp batch_size to [1, MAX]
  Handler->>Handler: validate secrets_context present
  Handler->>ReWrap: re_wrap_stale(pool, kms, routing, batch_size)
  ReWrap->>DB: try_lock_path_session (session advisory lock)
  loop each batch
    ReWrap->>DB: find_batch_after(after_seq, limit)
    DB-->>ReWrap: Vec~SecretRow~
    loop each row
      alt kek_id matches routed KEK
        ReWrap->>ReWrap: increment already_current
      else KEK mismatch
        ReWrap->>KMS: decrypt_dek(row.kek_id, row.encrypted_dek)
        KMS-->>ReWrap: dek
        ReWrap->>KMS: encrypt_dek(routed_kek_id, dek)
        KMS-->>ReWrap: new_encrypted_dek
        ReWrap->>ReWrap: add to pending updates
      end
    end
    alt pending updates exist
      ReWrap->>DB: update_dek_wrap (batch transaction)
      ReWrap->>ReWrap: increment re_wrapped
    end
  end
  ReWrap->>DB: count_wrapped_outside(routing_kek_ids)
  DB-->>ReWrap: stale_remaining
  ReWrap-->>Handler: ReWrapStaleResult
  Handler-->>API: ReWrapSecretsResponse { re_wrapped, already_current, stale_remaining }
  API-->>CLI: Response
  CLI->>CLI: print completion summary

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'feat: support storing secrets/credentials in Postgres' directly and accurately reflects the main objective of the changeset—introducing PostgreSQL as a credential storage backend.
Description check	✅ Passed	The description provides comprehensive context for the Postgres credential storage feature, including architectural decisions, encryption design, configuration, migration strategy, KEK rotation mechanics, and testing coverage. It is directly related to the changeset.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands.}

chet · 2026-06-16T20:45:40Z

@coderabbitai PTAL, thanks!

coderabbitai · 2026-06-16T20:45:53Z

@chet: Absolutely, I will conduct a thorough review of the PR now. Thank you for the ping.

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai

Actionable comments posted: 11

🧹 Nitpick comments (9)

crates/api-db/src/secrets.rs (1)

124-130: ⚡ Quick win

Use query_scalar for single-boolean reads in try_lock_path and exists.

Both queries return exactly one scalar bool; query_scalar is clearer and removes tuple unpacking.

Proposed refactor

 pub async fn try_lock_path(txn: &mut PgConnection, path: &str) -> DatabaseResult<bool> {
     let sql = "SELECT pg_try_advisory_xact_lock(hashtextextended('secrets:' || $1, 0))";
-    let row: (bool,) = sqlx::query_as(sql)
+    let row: bool = sqlx::query_scalar(sql)
         .bind(path)
         .fetch_one(txn)
         .await
         .map_err(|e| DatabaseError::query(sql, e))?;
-    Ok(row.0)
+    Ok(row)
 }
 
 /// Whether any entries exist for the path.
 pub async fn exists(txn: impl DbReader<'_>, path: &str) -> DatabaseResult<bool> {
     let sql = "SELECT EXISTS(SELECT 1 FROM secrets WHERE path = $1)";
-    let row: (bool,) = sqlx::query_as(sql)
+    let row: bool = sqlx::query_scalar(sql)
         .bind(path)
         .fetch_one(txn)
         .await
         .map_err(|e| DatabaseError::query(sql, e))?;
-    Ok(row.0)
+    Ok(row)
 }

Based on learnings: in crates/api-db, prefer sqlx::query_scalar(query).bind(...).fetch_one(...) when fetching exactly one scalar value.

Also applies to: 135-141

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/api-db/src/secrets.rs` around lines 124 - 130, Replace the query
pattern in both the `try_lock_path` and `exists` functions that use
`sqlx::query_as` to fetch a single boolean value wrapped in a tuple. Instead of
using `query_as` with tuple unpacking, use `sqlx::query_scalar` which directly
fetches the scalar boolean value without requiring tuple unpacking. This makes
the code clearer and more idiomatic for single scalar value queries.

Source: Learnings

crates/admin-cli/src/secrets/re_wrap/args.rs (1)

20-27: ⚡ Quick win

Add worked EXAMPLES: help to the re-wrap args struct.

Line 21 defines a new operator-facing command args struct but omits the required #[command(after_long_help = "...")] examples block.

Suggested patch

 #[derive(Parser, Debug, Clone)]
+#[command(
+    after_long_help = "EXAMPLES:\n\
+\n\
+Run re-wrap with server defaults:\n\
+    $ nico-admin-cli secrets re-wrap\n\
+\n\
+Run re-wrap with an explicit batch size:\n\
+    $ nico-admin-cli secrets re-wrap --batch-size 1000\n\
+\n"
+)]
 pub struct Args {
     #[clap(
         long,
         help = "Rows scanned per batch during the walk. The server applies its own default and limits."
     )]
     pub batch_size: Option<u32>,
 }

As per coding guidelines, crates/admin-cli/**/*.rs should attach #[command(after_long_help = "...")] with a concrete EXAMPLES: section for realistic usage.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/admin-cli/src/secrets/re_wrap/args.rs` around lines 20 - 27, The Args
struct in crates/admin-cli/src/secrets/re_wrap/args.rs is missing the required
#[command(after_long_help = "...")] attribute with an EXAMPLES: section. Add
this macro attribute to the Args struct definition (right after the
#[derive(Parser, Debug, Clone)] line) containing a concrete EXAMPLES: block that
demonstrates realistic usage of the re-wrap command with the batch_size
parameter, following the coding guidelines established for admin-cli command
argument structs.

Source: Coding guidelines

crates/admin-cli/src/secrets/re_wrap/cmd.rs (1)

22-26: ⚡ Quick win

Wrap the RPC failure with operator-facing context.

Line 25 should add a context segment so failure chains read as a coherent “while attempting to …” story.

Suggested patch

 use crate::errors::CarbideCliResult;
+use eyre::WrapErr as _;
 
 use crate::rpc::ApiClient;
 
 pub async fn re_wrap(api_client: &ApiClient, batch_size: Option<u32>) -> CarbideCliResult<()> {
     let request = ::rpc::forge::ReWrapSecretsRequest { batch_size };
 
-    let resp = api_client.0.re_wrap_secrets(request).await?;
+    let resp = api_client
+        .0
+        .re_wrap_secrets(request)
+        .await
+        .wrap_err("while attempting to re-wrap stale secrets via Forge API")?;

As per coding guidelines, admin-cli error chains should include .context()/wrap_err() phrased “while attempting to …” for actionable operator troubleshooting.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/admin-cli/src/secrets/re_wrap/cmd.rs` around lines 22 - 26, The RPC
call to re_wrap_secrets in the re_wrap function does not wrap its error with
operator-facing context. Modify the await result of
api_client.0.re_wrap_secrets(request).await to chain a .context() or .wrap_err()
call with a message phrased as "while attempting to ..." (such as "while
attempting to re-wrap secrets") before the ? operator, so that error chains
provide coherent, actionable troubleshooting information for operators as per
admin-cli coding guidelines.

Source: Coding guidelines

crates/secrets/src/forge_vault.rs (3)

770-780: ⚡ Quick win

Derive Eq with PartialEq to keep the new enum clippy-clean.

EnumerationMode has only fieldless variants, so Eq is correct and avoids clippy noise in a warnings-as-errors workspace. As per coding guidelines, “we enable all clippy lints by default, and treat all warnings as errors.”

♻️ Proposed fix

-#[derive(Clone, Copy, PartialEq)]
+#[derive(Clone, Copy, PartialEq, Eq)]
 enum EnumerationMode {

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/secrets/src/forge_vault.rs` around lines 770 - 780, The
EnumerationMode enum derives Clone, Copy, and PartialEq, but since it only
contains fieldless variants, it should also derive Eq to satisfy clippy
requirements in the warnings-as-errors workspace. Add Eq to the derive attribute
list for the EnumerationMode enum to keep it clippy-clean and avoid warnings.

Source: Coding guidelines

1105-1108: ⚡ Quick win

Make the static-token requirement explicit in the returned error.

For Transit KMS on service-account-auth deployments, vault_config.token()? currently reports only VAULT_TOKEN. Add context so startup points directly at the unsupported auth mode instead of looking like a missing generic Vault setting.

♻️ Proposed fix

+    let token = vault_config.token().wrap_err(
+        "transit KMS requires vault.token or VAULT_TOKEN; Kubernetes service-account auth \
+         is not supported by create_raw_vault_client_settings yet",
+    )?;
     let mut builder = VaultClientSettingsBuilder::default();
     builder
-        .token(vault_config.token()?)
+        .token(token)
         .address(vault_config.address()?)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/secrets/src/forge_vault.rs` around lines 1105 - 1108, The error
returned by vault_config.token()? when building the VaultClientSettingsBuilder
is too generic and doesn't clarify the root cause. Enhance the error handling
around the vault_config.token()? call to provide a more explicit error message
that indicates static-token authentication is required but unsupported for
Transit KMS on service-account-auth deployments, so users understand the
authentication mode limitation rather than treating it as a missing generic
Vault setting.

833-835: ⚡ Quick win

Log Vault enumeration errors as structured fields.

These new warnings interpolate e into the message, which makes error filtering harder. Put the error in an attribute so logfmt consumers can search by error, prefix, and path. As per coding guidelines, “prefer placing common fields as attributes passed to tracing function, instead of using string interpolation.”

♻️ Proposed fix

                     tracing::warn!(
                         prefix = %dir,
-                        "failed to list vault path: {e}"
+                        error = %e,
+                        "failed to list vault path"
                     );

                     tracing::warn!(
                         path = %path,
-                        "failed to read: {e}"
+                        error = %e,
+                        "failed to read vault secret"
                     );

Also applies to: 937-939

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/secrets/src/forge_vault.rs` around lines 833 - 835, The warning logs
in the vault path enumeration are interpolating the error `e` directly into the
message string using `{e}`, which makes error filtering difficult for logfmt
consumers. Refactor the tracing::warn! calls to pass the error as a structured
attribute instead of including it in the message string. Remove the `{e}` from
the message and add `error = %e` (or similar attribute syntax) as a parameter to
the tracing::warn! macro, keeping the existing attributes like `prefix`. This
change needs to be applied at the location shown (lines 833-835 in the failed to
list vault path warning) and also at the similar warning around lines 937-939 as
mentioned in the comment.

Source: Coding guidelines

crates/secrets/src/lib.rs (1)

23-27: ⚡ Quick win

Update the re-export comment for the new Transit helper.

The block now exposes create_raw_vault_client_settings, which is not for CertificateProvider usage. Please broaden the comment so the public API intent remains accurate.

♻️ Proposed fix

-/// Exposed for `CertificateProvider` usage only. Credential operations should go
-/// through `create_credential_manager` instead of using the vault client directly.
+/// Exposed for direct Vault integrations such as `CertificateProvider` and
+/// Transit KMS setup. Credential operations should go through
+/// `create_credential_manager` instead of using the vault client directly.
 pub use crate::forge_vault::{
     ForgeVaultClient, VaultConfig, create_raw_vault_client_settings, create_vault_client,
 };

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/secrets/src/lib.rs` around lines 23 - 27, The comment above the
re-export block in crate::forge_vault states that the exports are only for
CertificateProvider usage, but this is no longer accurate now that
create_raw_vault_client_settings is included in the public API. Broaden the
comment to accurately describe the public API intent for all re-exported items
(ForgeVaultClient, VaultConfig, create_raw_vault_client_settings, and
create_vault_client), making it clear that these are exposed for multiple
purposes beyond just CertificateProvider usage.

crates/api-core/Cargo.toml (1)

57-61: ⚡ Quick win

Keep the new internal dependency in alphabetized order.

carbide-kms-provider should sit before carbide-libmlx, not after carbide-rpc-utils, to match the manifest’s ordering contract.

♻️ Proposed fix

 carbide-ipmi = { path = "../ipmi" }
 carbide-ipxe-renderer = { path = "../ipxe-renderer" }
+carbide-kms-provider = { path = "../kms-provider" }
 carbide-libmlx = { path = "../libmlx" }
@@
 carbide-rpc = { path = "../rpc", features = ["sqlx", "model"] }
 carbide-rpc-utils = { path = "../rpc-utils" }
-carbide-kms-provider = { path = "../kms-provider" }
 carbide-secrets = { path = "../secrets" }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/api-core/Cargo.toml` around lines 57 - 61, The internal dependencies
in the Cargo.toml file are not in alphabetical order. Move the
carbide-kms-provider dependency to its correct alphabetical position, which
should be before carbide-redfish, not after carbide-rpc-utils. Reorder all the
dependencies shown (carbide-redfish, carbide-rpc, carbide-rpc-utils,
carbide-kms-provider, carbide-secrets) so they appear in strict alphabetical
order by their crate names.

crates/api-core/src/test_support/builder.rs (1)

51-65: ⚡ Quick win

Allow tests to build a secrets-enabled Api.

Hard-coding secrets_context: None means handler tests for the new secrets/re-wrap path cannot use TestApiBuilder. Add an optional builder field and with_secrets_context method so tests can exercise both enabled and disabled cases.

♻️ Proposed fix

 pub struct TestApiBuilder {
@@
     ib_fabric_manager: Option<Arc<dyn IBFabricManager>>,
     component_manager: Option<Arc<component_manager::component_manager::ComponentManager>>,
+    secrets_context: Option<crate::secrets::SecretsContext>,
 }
@@
             ib_fabric_manager: None,
             component_manager: None,
+            secrets_context: None,
         }
     }
@@
     pub fn with_component_manager(
         self,
         component_manager: Arc<component_manager::component_manager::ComponentManager>,
     ) -> Self {
@@
         }
     }
+
+    pub fn with_secrets_context(self, secrets_context: crate::secrets::SecretsContext) -> Self {
+        Self {
+            secrets_context: Some(secrets_context),
+            ..self
+        }
+    }
@@
-            secrets_context: None,
+            secrets_context: self.secrets_context,
         }
     }
 }

Also applies to: 73-87, 153-161, 254-254

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/api-core/src/test_support/builder.rs` around lines 51 - 65, The
TestApiBuilder struct currently hard-codes secrets_context to None, preventing
tests from building an Api with secrets enabled. Add a new optional field (of
appropriate secrets context type) to the TestApiBuilder struct definition,
implement a with_secrets_context builder method that allows setting this field,
and ensure the field is properly passed through and used when the Api is
constructed so that tests can exercise both enabled and disabled cases for the
secrets/re-wrap path.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/admin-cli/src/cfg/cli_options.rs`:
- Around line 205-206: The new Secrets variant added to the CLI options enum in
cli_options.rs at lines 205–206 lacks a corresponding domain mapping entry in
cli_domains.yaml, which will cause the every_command_has_a_domain test to fail
during CI. Add a secrets entry under the appropriate domain list (admin or
tenant, based on the operational context of the secrets command) in
cli_domains.yaml to ensure the command is properly registered in the domain
configuration.

In `@crates/api-core/src/cfg/file.rs`:
- Around line 863-891: Add the `#[serde(deny_unknown_fields)]` attribute to the
SecretsConfig struct to reject misspelled configuration keys like `import_fom`
instead of silently ignoring them. Additionally, apply the same
`#[serde(deny_unknown_fields)]` attribute to the KmsConfig struct (located
around lines 903-910) for consistency. Finally, add a regression test that
verifies the configuration parser correctly rejects a misspelled optional field
(for example, `import_fom` instead of `import_from`) in the `[secrets]` section,
ensuring startup validation prevents such typos.

In `@crates/api-core/src/run.rs`:
- Around line 515-560: The code holds the transaction lock_txn (which contains a
pooled connection and advisory lock) while performing long-running operations
like vault_client.get_secrets_strict().await and
crate::secrets::import_secrets().await. This creates a deadlock risk during
concurrent replica startup where other waiters can block on the same lock while
occupying remaining pool connections. Restructure the code to use short
transactions: acquire the lock and check completion status in one short
transaction, then release it before calling
vault_client.get_secrets_strict().await and the import_secrets function, and
finally use another short transaction (reusing the same connection if possible,
or using a distinct lock key) to mark the import as complete. This applies at
the location shown (lines 515-560, specifically around the vault enumeration and
import_secrets calls) and also at lines 569-576 where similar long-running
operations may be awaited while holding a transaction.

In `@crates/api-core/src/secrets/re_wrap.rs`:
- Around line 72-88: The re-wrap operation holds a lock transaction from the
pool while the loop in `find_batch_after` and subsequent batch operations try to
acquire additional connections, causing potential deadlock under pool
saturation. Replace the transaction-based locking approach with a single
dedicated connection obtained from the pool at the start. Use a session advisory
lock (via db::secrets::try_lock_path) on that dedicated connection instead of a
transaction, then execute all scan and update operations on that same connection
throughout the loop to avoid holding the transaction open across awaited work.
Apply the same fix at the second location mentioned (lines 126-129) where
similar transaction holding occurs.

In `@crates/api-core/src/secrets/routing.rs`:
- Around line 65-91: In the from_config method's validation loop (where prefix
and kek_id are validated), add a check to reject any prefix that starts with "/"
unless it matches the CATCH_ALL constant. This prevents prefixes like
"/machines/bmc" from being silently misrouted to the catch-all, which could
cause writes to be encrypted under the wrong KEK. Add this validation check
after the empty prefix check and before the normalized prefix processing.

In `@crates/api-db/migrations/20260416000305_create_secrets_table.sql`:
- Around line 21-22: The `seq` column in the migration file uses `BIGINT
GENERATED ALWAYS AS IDENTITY` but is missing a `UNIQUE` constraint. Since
PostgreSQL's GENERATED ALWAYS AS IDENTITY does not enforce uniqueness by itself,
duplicates can occur from sequence resets or manual insertions, breaking the
pagination contract used by functions like `find_batch_after` that rely on `seq
> $1` ordering. Add the `UNIQUE` constraint to the `seq` column definition in
the CREATE TABLE statement to ensure deterministic batch ordering and pagination
behavior.

In `@crates/api-integration-tests/tests/secrets_import.rs`:
- Around line 31-35: The allocate_port() function at lines 31-35 releases the
TcpListener immediately after binding, creating a race condition where another
process can claim the port before it's used at lines 237-238. Fix this by either
keeping the listener alive and passing it to the Vault startup code instead of
just the port number, or by implementing a retry mechanism that attempts to bind
to the port immediately before starting Vault (lines 237-238), retrying if the
port becomes unavailable due to race conditions. This ensures the port
reservation and Vault startup happen atomically without releasing the binding in
between.
- Around line 216-219: The call to db::migrations::MIGRATOR.run() uses .expect()
which causes a panic that bypasses the cleanup logic if migrations fail, leaking
temporary databases. Replace the .expect() call with a fallible pattern using
the ? operator, and restructure the code to ensure the drop/cleanup logic
(currently at lines 225-228) always executes after setup and exercise,
regardless of whether migrations succeed or fail. This typically means using a
guard pattern or ensuring cleanup runs in a finally-like block that is
guaranteed to execute even when earlier operations return errors.
- Around line 212-214: The string concatenation used to build test_db_url at
line 212 breaks when the original db_url contains a database path or query
parameters. Instead of using format!("{db_url}/{test_db_name}"), parse the
db_url into its connection components using the appropriate PostgreSQL URL
parsing utilities, modify the database name field in the parsed options to use
test_db_name, and then reconstruct the connection URL from the parsed options.
This ensures existing query parameters and path components in the original URL
are preserved when building the test database connection string.

In `@crates/rpc/proto/forge.proto`:
- Around line 8550-8554: The `batch_size` field in the `ReWrapSecretsRequest`
message lacks documentation about its validation contract. Update the comment
for the `batch_size` field to explicitly document the semantics for when the
field is unset (its default behavior), what happens if zero is provided, and any
upper-bound constraints the server enforces. This ensures that generated clients
understand the expected validation rules and can properly handle edge cases.
- Around line 946-947: In the crates/rpc/build.rs file, add explicit serde
Serialize annotations for the ReWrapSecretsRequest and ReWrapSecretsResponse
message types using the .type_attribute() method calls, similar to how
comparable gRPC Request/Response types like ListOsImageRequest and
GetBmcCredentialsRequest are configured. This will ensure that these types have
the necessary Serialize trait implementation required by the log_request_data()
function called in the handler at crates/api-core/src/handlers/secrets.rs,
enabling proper request logging and serialization.

---

Nitpick comments:
In `@crates/admin-cli/src/secrets/re_wrap/args.rs`:
- Around line 20-27: The Args struct in
crates/admin-cli/src/secrets/re_wrap/args.rs is missing the required
#[command(after_long_help = "...")] attribute with an EXAMPLES: section. Add
this macro attribute to the Args struct definition (right after the
#[derive(Parser, Debug, Clone)] line) containing a concrete EXAMPLES: block that
demonstrates realistic usage of the re-wrap command with the batch_size
parameter, following the coding guidelines established for admin-cli command
argument structs.

In `@crates/admin-cli/src/secrets/re_wrap/cmd.rs`:
- Around line 22-26: The RPC call to re_wrap_secrets in the re_wrap function
does not wrap its error with operator-facing context. Modify the await result of
api_client.0.re_wrap_secrets(request).await to chain a .context() or .wrap_err()
call with a message phrased as "while attempting to ..." (such as "while
attempting to re-wrap secrets") before the ? operator, so that error chains
provide coherent, actionable troubleshooting information for operators as per
admin-cli coding guidelines.

In `@crates/api-core/Cargo.toml`:
- Around line 57-61: The internal dependencies in the Cargo.toml file are not in
alphabetical order. Move the carbide-kms-provider dependency to its correct
alphabetical position, which should be before carbide-redfish, not after
carbide-rpc-utils. Reorder all the dependencies shown (carbide-redfish,
carbide-rpc, carbide-rpc-utils, carbide-kms-provider, carbide-secrets) so they
appear in strict alphabetical order by their crate names.

In `@crates/api-core/src/test_support/builder.rs`:
- Around line 51-65: The TestApiBuilder struct currently hard-codes
secrets_context to None, preventing tests from building an Api with secrets
enabled. Add a new optional field (of appropriate secrets context type) to the
TestApiBuilder struct definition, implement a with_secrets_context builder
method that allows setting this field, and ensure the field is properly passed
through and used when the Api is constructed so that tests can exercise both
enabled and disabled cases for the secrets/re-wrap path.

In `@crates/api-db/src/secrets.rs`:
- Around line 124-130: Replace the query pattern in both the `try_lock_path` and
`exists` functions that use `sqlx::query_as` to fetch a single boolean value
wrapped in a tuple. Instead of using `query_as` with tuple unpacking, use
`sqlx::query_scalar` which directly fetches the scalar boolean value without
requiring tuple unpacking. This makes the code clearer and more idiomatic for
single scalar value queries.

In `@crates/secrets/src/forge_vault.rs`:
- Around line 770-780: The EnumerationMode enum derives Clone, Copy, and
PartialEq, but since it only contains fieldless variants, it should also derive
Eq to satisfy clippy requirements in the warnings-as-errors workspace. Add Eq to
the derive attribute list for the EnumerationMode enum to keep it clippy-clean
and avoid warnings.
- Around line 1105-1108: The error returned by vault_config.token()? when
building the VaultClientSettingsBuilder is too generic and doesn't clarify the
root cause. Enhance the error handling around the vault_config.token()? call to
provide a more explicit error message that indicates static-token authentication
is required but unsupported for Transit KMS on service-account-auth deployments,
so users understand the authentication mode limitation rather than treating it
as a missing generic Vault setting.
- Around line 833-835: The warning logs in the vault path enumeration are
interpolating the error `e` directly into the message string using `{e}`, which
makes error filtering difficult for logfmt consumers. Refactor the
tracing::warn! calls to pass the error as a structured attribute instead of
including it in the message string. Remove the `{e}` from the message and add
`error = %e` (or similar attribute syntax) as a parameter to the tracing::warn!
macro, keeping the existing attributes like `prefix`. This change needs to be
applied at the location shown (lines 833-835 in the failed to list vault path
warning) and also at the similar warning around lines 937-939 as mentioned in
the comment.

In `@crates/secrets/src/lib.rs`:
- Around line 23-27: The comment above the re-export block in crate::forge_vault
states that the exports are only for CertificateProvider usage, but this is no
longer accurate now that create_raw_vault_client_settings is included in the
public API. Broaden the comment to accurately describe the public API intent for
all re-exported items (ForgeVaultClient, VaultConfig,
create_raw_vault_client_settings, and create_vault_client), making it clear that
these are exposed for multiple purposes beyond just CertificateProvider usage.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 9bd7cccb-fd3d-47e1-9d17-4ccbe5fee75a

📥 Commits

Reviewing files that changed from the base of the PR and between 7eb1969 and b171e20.

⛔ Files ignored due to path filters (1)

Cargo.lock is excluded by !**/*.lock

📒 Files selected for processing (41)

crates/admin-cli/src/cfg/cli_options.rs
crates/admin-cli/src/main.rs
crates/admin-cli/src/secrets/mod.rs
crates/admin-cli/src/secrets/re_wrap/args.rs
crates/admin-cli/src/secrets/re_wrap/cmd.rs
crates/admin-cli/src/secrets/re_wrap/mod.rs
crates/api-core/Cargo.toml
crates/api-core/src/api.rs
crates/api-core/src/auth/internal_rbac_rules.rs
crates/api-core/src/cfg/file.rs
crates/api-core/src/handlers/mod.rs
crates/api-core/src/handlers/secrets.rs
crates/api-core/src/lib.rs
crates/api-core/src/run.rs
crates/api-core/src/secrets/import.rs
crates/api-core/src/secrets/metrics.rs
crates/api-core/src/secrets/mod.rs
crates/api-core/src/secrets/re_wrap.rs
crates/api-core/src/secrets/routing.rs
crates/api-core/src/secrets/tests.rs
crates/api-core/src/setup.rs
crates/api-core/src/test_support/builder.rs
crates/api-core/src/test_support/default_config.rs
crates/api-db/migrations/20260416000305_create_secrets_table.sql
crates/api-db/src/lib.rs
crates/api-db/src/secrets.rs
crates/api-integration-tests/Cargo.toml
crates/api-integration-tests/tests/secrets_import.rs
crates/api-model/src/lib.rs
crates/api-model/src/secrets.rs
crates/kms-provider/Cargo.toml
crates/kms-provider/src/crypto.rs
crates/kms-provider/src/lib.rs
crates/kms-provider/src/providers/integrated.rs
crates/kms-provider/src/providers/multi.rs
crates/kms-provider/src/providers/transit.rs
crates/rpc/proto/forge.proto
crates/secrets/src/forge_vault.rs
crates/secrets/src/lib.rs
crates/uuid/src/lib.rs
crates/uuid/src/secret.rs

coderabbitai · 2026-06-16T20:58:17Z

+/// allocate_port picks a free port by binding to port 0.
+fn allocate_port() -> SocketAddr {
+    let listener = TcpListener::bind("127.0.0.1:0").expect("bind to free port");
+    listener.local_addr().expect("local addr")
+}


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Port reservation is racy and can cause intermittent test failures.

Line 31-35 releases the listener before Line 238 starts Vault, so another process can claim the port in between. This is a classic TOCTOU race and a common CI flake source for integration tests. Prefer a helper that binds and starts atomically (or retries on bind conflicts).

As per coding guidelines, integration tests should be deterministic and avoid flaky behavior.

Also applies to: 237-238

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/api-integration-tests/tests/secrets_import.rs` around lines 31 - 35, The allocate_port() function at lines 31-35 releases the TcpListener immediately after binding, creating a race condition where another process can claim the port before it's used at lines 237-238. Fix this by either keeping the listener alive and passing it to the Vault startup code instead of just the port number, or by implementing a retry mechanism that attempts to bind to the port immediately before starting Vault (lines 237-238), retrying if the port becomes unavailable due to race conditions. This ensures the port reservation and Vault startup happen atomically without releasing the binding in between.

Source: Coding guidelines

@coderabbitai Fair -- it is a real TOCTOU, but the atomic bind-and-launch would have to live in how api_test_helper spawns the external Vault process, not in this test, so it's out of scope for this PR. Leaving as-is. Thanks for checking!

@chet, understood — the atomic bind-and-launch fix is appropriately scoped to the api_test_helper process spawner, not this test file. Happy to leave it as-is here.

Would you like me to open a follow-up GitHub issue to track the api_test_helper improvement so it is not lost?

@coderabbitai Yes please! And if you can associate it as a sub-task of #353 that would be great. Thank you!

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -euo pipefail repo="NVIDIA/infra-controller" parent_number="353" child_url=$(gh issue create \ --repo "$repo" \ --title "api_test_helper: atomic port reservation for external process spawning" \ --assignee "chet" \ --body "## Background During review of PR `#2665` (Postgres secrets backend), a TOCTOU race was identified in the integration test helper \`allocate_port()\` in \`crates/api-integration-tests/tests/secrets_import.rs\`: \`\`\`rust fn allocate_port() -> SocketAddr { let listener = TcpListener::bind(\"127.0.0.1:0\").expect(\"bind to free port\"); listener.local_addr().expect(\"local addr\") } // listener is dropped here, releasing the port \`\`\` The listener is dropped immediately after obtaining the address, so another process can claim the port between this call and when the external process (e.g., Vault dev server) is actually started. This is a well-known source of intermittent CI failures. ## Problem The fix requires changes to the \`api_test_helper\` process spawner so that port reservation and external process startup happen atomically — either by passing the live \`TcpListener\` (held open) to the spawner, or by having the spawner accept a bound socket directly. ## Proposed Fix In \`api_test_helper\`, update the Vault dev server launch path to: 1. Accept a \`TcpListener\` (already bound) instead of a plain port number, **or** 2. Internally bind, pass the socket fd to the child process, and release only after the child has confirmed it is listening. This keeps the reservation live until the child process owns the port, eliminating the race. ## References - PR: https://github.com/NVIDIA/infra-controller/pull/2665 - Review comment: https://github.com/NVIDIA/infra-controller/pull/2665#discussion_r3423994822 - Requested by: `@chet`") echo "Created issue: $child_url" child_number="${child_url##*/}" echo "Child issue number: $child_number" parent_id=$(gh api graphql \ -F owner="NVIDIA" \ -F name="infra-controller" \ -F parentNumber="$parent_number" \ -f query=' query($owner: String!, $name: String!, $parentNumber: Int!) { repository(owner: $owner, name: $name) { issue(number: $parentNumber) { id } } }' \ --jq '.data.repository.issue.id') child_id=$(gh api graphql \ -F owner="NVIDIA" \ -F name="infra-controller" \ -F childNumber="$child_number" \ -f query=' query($owner: String!, $name: String!, $childNumber: Int!) { repository(owner: $owner, name: $name) { issue(number: $childNumber) { id } } }' \ --jq '.data.repository.issue.id') gh api graphql \ -F parentId="$parent_id" \ -F childId="$child_id" \ -f query=' mutation($parentId: ID!, $childId: ID!) { addSubIssue(input: {issueId: $parentId, subIssueId: $childId}) { issue { url } subIssue { url } } }'

Length of output: 3214

@chet, done! Issue #2777 — "api_test_helper: atomic port reservation for external process spawning" — has been created, assigned to you, and linked as a sub-task of #353.

chet · 2026-06-22T18:23:17Z

@coderabbitai PTAL again, fixed some conflicts, thanks!

coderabbitai · 2026-06-22T18:23:32Z

@chet: Understood — I will conduct a fresh review pass now, taking the conflict resolution into account. Thank you for the update.

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

chet · 2026-06-22T18:29:54Z

@coderabbitai review

coderabbitai · 2026-06-22T18:30:01Z

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (1)

crates/api-db/src/secrets.rs (1)

61-78: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Return the generated SecretId from insert.

insert creates the row identity and then discards it. Returning it lets rollback code call delete_by_id on the exact row it appended, instead of rediscovering by path/order under concurrency.

Proposed API adjustment

-pub async fn insert(txn: &mut PgConnection, entry: &NewSecretEntry<'_>) -> DatabaseResult<()> {
+pub async fn insert(txn: &mut PgConnection, entry: &NewSecretEntry<'_>) -> DatabaseResult<SecretId> {
     let secret_id = SecretId::new();
     let sql = "INSERT INTO secrets
          (secret_id, path, encrypted_value, nonce,
           kek_id, encrypted_dek, dek_nonce)
          VALUES ($1, $2, $3, $4, $5, $6, $7)";
@@
         .await
         .map_err(|e| DatabaseError::query(sql, e))?;
-    Ok(())
+    Ok(secret_id)
 }

Based on PR objectives, rotation rollback is a supported behavior for this journal.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/api-db/src/secrets.rs` around lines 61 - 78, The `insert` function
currently generates a `SecretId` but discards it after inserting the row,
returning `DatabaseResult<()>` instead. Change the return type of the `insert`
function from `DatabaseResult<()>` to `DatabaseResult<SecretId>`, and replace
the final `Ok(())` statement with `Ok(secret_id)` after the successful query
execution. This enables rollback code to delete the inserted row using the exact
`secret_id` that was created, avoiding the need to rediscover it by path under
concurrent conditions.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/api-db/src/secrets.rs`:
- Around line 150-157: The unlock_path_session function currently ignores the
boolean return value from PostgreSQL's pg_advisory_unlock() function, which
indicates whether the lock was successfully released (returns true) or failed to
release (returns false). Change the function signature to return
DatabaseResult<bool> instead of DatabaseResult<()>, and replace sqlx::query()
with sqlx::query_scalar() and the execute() call with fetch_one() to capture and
return the boolean result from pg_advisory_unlock(). This ensures that
lock-release failures are observable to callers and aligns with the established
pattern used in try_lock_path_session() within the same module.
- Around line 170-178: The `delete_all` function in this file performs
destructive deletion which violates the append-only journal contract documented
in secrets/mod.rs. Instead of using `delete_all` in the `delete_credentials`
method, write an empty-password tombstone entry for the path to preserve history
and maintain append-only semantics, similar to how other credential types (UFM,
SiteWideBmc) already implement deletion. Alternatively, if physical deletion is
only needed for test fixtures, gate the `delete_all` function with
`#[cfg(test)]` to restrict its usage to test-only code paths.
- Around line 81-96: The function `insert_if_missing` currently accepts a plain
PgConnection parameter which could allow autocommit connections and cause the
advisory lock to release prematurely, creating a race condition where concurrent
callers could both insert entries. Change the type signature of
`insert_if_missing` to require `&mut Transaction<'_, Postgres>` instead of `&mut
PgConnection` to ensure the lock is held for the transaction's duration. Apply
the same type tightening to the `lock_path` function as well to maintain
consistency and safety.

---

Nitpick comments:
In `@crates/api-db/src/secrets.rs`:
- Around line 61-78: The `insert` function currently generates a `SecretId` but
discards it after inserting the row, returning `DatabaseResult<()>` instead.
Change the return type of the `insert` function from `DatabaseResult<()>` to
`DatabaseResult<SecretId>`, and replace the final `Ok(())` statement with
`Ok(secret_id)` after the successful query execution. This enables rollback code
to delete the inserted row using the exact `secret_id` that was created,
avoiding the need to rediscover it by path under concurrent conditions.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 4ba2c609-05ec-4ffe-b4f7-3a2aa64aa4e0

📥 Commits

Reviewing files that changed from the base of the PR and between b171e20 and 04013cf.

⛔ Files ignored due to path filters (1)

Cargo.lock is excluded by !**/*.lock

📒 Files selected for processing (42)

crates/admin-cli/cli_domains.yaml
crates/admin-cli/src/cfg/cli_options.rs
crates/admin-cli/src/main.rs
crates/admin-cli/src/secrets/mod.rs
crates/admin-cli/src/secrets/re_wrap/args.rs
crates/admin-cli/src/secrets/re_wrap/cmd.rs
crates/admin-cli/src/secrets/re_wrap/mod.rs
crates/api-core/Cargo.toml
crates/api-core/src/api.rs
crates/api-core/src/auth/internal_rbac_rules.rs
crates/api-core/src/cfg/file.rs
crates/api-core/src/handlers/mod.rs
crates/api-core/src/handlers/secrets.rs
crates/api-core/src/lib.rs
crates/api-core/src/run.rs
crates/api-core/src/secrets/import.rs
crates/api-core/src/secrets/metrics.rs
crates/api-core/src/secrets/mod.rs
crates/api-core/src/secrets/re_wrap.rs
crates/api-core/src/secrets/routing.rs
crates/api-core/src/secrets/tests.rs
crates/api-core/src/setup.rs
crates/api-core/src/test_support/builder.rs
crates/api-core/src/test_support/default_config.rs
crates/api-db/migrations/20260416000305_create_secrets_table.sql
crates/api-db/src/lib.rs
crates/api-db/src/secrets.rs
crates/api-integration-tests/Cargo.toml
crates/api-integration-tests/tests/secrets_import.rs
crates/api-model/src/lib.rs
crates/api-model/src/secrets.rs
crates/kms-provider/Cargo.toml
crates/kms-provider/src/crypto.rs
crates/kms-provider/src/lib.rs
crates/kms-provider/src/providers/integrated.rs
crates/kms-provider/src/providers/multi.rs
crates/kms-provider/src/providers/transit.rs
crates/rpc/proto/forge.proto
crates/secrets/src/forge_vault.rs
crates/secrets/src/lib.rs
crates/uuid/src/lib.rs
crates/uuid/src/secret.rs

💤 Files with no reviewable changes (11)

crates/uuid/src/lib.rs
crates/kms-provider/src/providers/multi.rs
crates/uuid/src/secret.rs
crates/kms-provider/Cargo.toml
crates/secrets/src/lib.rs
crates/kms-provider/src/lib.rs
crates/rpc/proto/forge.proto
crates/kms-provider/src/providers/integrated.rs
crates/secrets/src/forge_vault.rs
crates/kms-provider/src/crypto.rs
crates/kms-provider/src/providers/transit.rs

✅ Files skipped from review due to trivial changes (3)

crates/api-core/src/lib.rs
crates/api-core/src/handlers/mod.rs
crates/api-core/src/test_support/default_config.rs

🚧 Files skipped from review as they are similar to previous changes (25)

crates/api-db/src/lib.rs
crates/admin-cli/src/secrets/re_wrap/args.rs
crates/admin-cli/src/secrets/re_wrap/mod.rs
crates/api-core/src/handlers/secrets.rs
crates/api-model/src/secrets.rs
crates/admin-cli/src/cfg/cli_options.rs
crates/api-core/src/auth/internal_rbac_rules.rs
crates/api-model/src/lib.rs
crates/admin-cli/src/secrets/mod.rs
crates/api-core/src/api.rs
crates/api-core/src/secrets/import.rs
crates/api-core/src/secrets/metrics.rs
crates/api-core/src/secrets/routing.rs
crates/admin-cli/src/main.rs
crates/admin-cli/src/secrets/re_wrap/cmd.rs
crates/api-db/migrations/20260416000305_create_secrets_table.sql
crates/api-integration-tests/Cargo.toml
crates/api-core/src/setup.rs
crates/api-core/src/secrets/mod.rs
crates/api-integration-tests/tests/secrets_import.rs
crates/api-core/src/secrets/re_wrap.rs
crates/api-core/src/secrets/tests.rs
crates/api-core/src/run.rs
crates/api-core/src/cfg/file.rs
crates/api-core/Cargo.toml

wminckler · 2026-06-22T18:49:05Z

+--
+-- The table is an append-only journal: every write inserts a new row, and a
+-- read returns the newest row for the path. seq is the journal order --
+-- created_at cannot be, because Postgres fixes now() per transaction, so


I've seen us pass in the actual "now" time independently of the transaction "now" time to avoid this. still not a guarantee, though...

github-actions · 2026-06-22T19:49:09Z

🔍 Container Scan Summary

Service	Total	Critical	High	Medium	Low	Other
boot-artifacts-aarch64	3	0	0	3	0	0
boot-artifacts-x86_64	3	0	0	3	0	0
forge-admin-cli-x86_64	264	6	23	99	6	130
machine-validation-runner	714	34	183	266	35	196
machine_validation	714	34	183	266	35	196
nvmetal-carbide	714	34	183	266	35	196
TOTAL	2412	108	572	903	111	718

Per-CVE detail lives in the per-service grype-* artifacts (JSON + SARIF). Severity counts only — no CVE IDs published here.

Carbide can now keep its credentials in Postgres instead of Vault, encrypted so that a copy of the database on its own gives up nothing. This swaps out Vault KV as the credential store; Vault still issues the PKI certificates. The encryption is layered. Every credential gets its own data key, and that data key is wrapped by a key encryption key (KEK) that lives outside the database -- in carbide's own config or an external KMS. The credential's path is mixed into the encryption as well, so a row lifted out and dropped onto another path will not decrypt. The KEK is the only thing an operator has to protect; everything else lives in Postgres. `PostgresCredentialManager` implements the same `CredentialManager` traits the rest of the system already uses, so nothing downstream had to change. It keeps the two reader behaviors callers learned from Vault: the newest write for a path wins, and an empty-password entry reads as "no credential" (several delete paths still record that tombstone). The store is an append-only journal -- one row per write, newest-per-path on read -- which is what the upcoming rotation work uses for history and rollback. Where the wrapping keys come from is pluggable through `KmsBackend`: local key material (`integrated`), or Vault/OpenBao Transit, and more than one provider at once. `[secrets.routing]` maps path prefixes to the KEK that encrypts new writes under them. Startup cross-checks the routing against the providers, so a misspelled, duplicated, or colliding key fails the boot instead of some write hours later. Moving an existing site off Vault is a one-time import at startup, and it is deliberately all-or-nothing: any list or read failure, or an empty listing, aborts the boot rather than recording a half-finished import as done, and only one replica runs it at a time behind a Postgres advisory lock. Once it completes, Vault is out of the credential chain entirely -- with `[secrets]` set, the chain is env -> file -> postgres and nothing falls back to Vault. Prerequisites that live outside this process (services that still read Vault directly, mixed-fleet rolling upgrades) are spelled out on `SecretsConfig`. Rotating a KEK is a config change plus one command: point the route at the new key and run `carbide-admin-cli secrets re-wrap`. Only the per-row data-key wrapping is redone; the encrypted values are never touched. The re-wrap makes its KMS calls outside the write transaction, runs one at a time, and reports how many rows still sit on a retired key so you know when the old KEK is safe to remove. Tests cover the manager round-trip, journal write-order under equal timestamps, rotation rollback by deleting the newest entry, tombstone reads, create conflicts, re-wrap idempotence and counting, and a path-binding regression that confirms a transplanted row will not decrypt -- plus an end-to-end Vault import against a real Vault dev server. This supports NVIDIA#353. Signed-off-by: Chet Nichols III <chetn@nvidia.com>

coderabbitai

🧹 Nitpick comments (1)

crates/kms-provider/src/providers/integrated.rs (1)
152-152: 🔒 Security & Privacy | 🔵 Trivial | ⚡ Quick win

Consider binding DEK wraps to context via AAD.

The DEK encryption and decryption pass an empty byte slice as the associated data parameter (b""), leaving the DEK ciphertext unbound to any context such as the kek_id or path. While the kek_id is explicitly tracked in the database and validated at startup, cryptographically binding the wrap to the kek_id would provide defense in depth—if KEK IDs were ever mixed up due to a configuration error or migration bug, mismatched unwraps would fail at the crypto layer rather than succeeding with the wrong key.
🔐 Proposed enhancement for defense in depth
 async fn encrypt_dek(&self, kek_id: &str, dek: &[u8; 32]) -> Result<EncryptedDek, KmsError> {
     let kek = self
         .keys
         .get(kek_id)
         .ok_or_else(|| KmsError::KeyNotFound(kek_id.to_string()))?;
-    let (ciphertext, nonce) = crypto::encrypt(kek, dek, b"")?;
+    let (ciphertext, nonce) = crypto::encrypt(kek, dek, kek_id.as_bytes())?;
     Ok(EncryptedDek { ciphertext, nonce })
 }

 async fn decrypt_dek(
     &self,
     kek_id: &str,
     encrypted: &EncryptedDek,
 ) -> Result<Zeroizing<[u8; 32]>, KmsError> {
     let kek = self
         .keys
         .get(kek_id)
         .ok_or_else(|| KmsError::KeyNotFound(kek_id.to_string()))?;
-    let mut plaintext = crypto::decrypt(kek, &encrypted.nonce, &encrypted.ciphertext, b"")?;
+    let mut plaintext = crypto::decrypt(kek, &encrypted.nonce, &encrypted.ciphertext, kek_id.as_bytes())?;
     let len = plaintext.len();
This change would need to be applied consistently across all KmsBackend implementations (integrated, transit, multi) to maintain the contract.
Also applies to: 165-165
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/kms-provider/src/providers/integrated.rs` at line 152, The
crypto::encrypt call at line 152 and the corresponding decrypt operation at line
165 currently pass an empty byte slice as the associated authenticated data
(AAD) parameter, leaving the DEK ciphertext unbound to any context. Modify both
the encryption and decryption calls to bind the DEK wrap to the kek_id by
passing the kek_id as the AAD parameter instead of b"". This change must be
applied consistently across all KmsBackend implementations (integrated, transit,
and multi) to maintain the consistent contract and provide defense in depth
against key ID confusion errors.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@crates/kms-provider/src/providers/integrated.rs`:
- Line 152: The crypto::encrypt call at line 152 and the corresponding decrypt
operation at line 165 currently pass an empty byte slice as the associated
authenticated data (AAD) parameter, leaving the DEK ciphertext unbound to any
context. Modify both the encryption and decryption calls to bind the DEK wrap to
the kek_id by passing the kek_id as the AAD parameter instead of b"". This
change must be applied consistently across all KmsBackend implementations
(integrated, transit, and multi) to maintain the consistent contract and provide
defense in depth against key ID confusion errors.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 29072d47-870b-4926-b3a2-833cb24c429b

📥 Commits

Reviewing files that changed from the base of the PR and between 5b6f8f5 and c5b4bf8.

⛔ Files ignored due to path filters (1)

Cargo.lock is excluded by !**/*.lock

📒 Files selected for processing (42)

crates/admin-cli/cli_domains.yaml
crates/admin-cli/src/cfg/cli_options.rs
crates/admin-cli/src/main.rs
crates/admin-cli/src/secrets/mod.rs
crates/admin-cli/src/secrets/re_wrap/args.rs
crates/admin-cli/src/secrets/re_wrap/cmd.rs
crates/admin-cli/src/secrets/re_wrap/mod.rs
crates/api-core/Cargo.toml
crates/api-core/src/api.rs
crates/api-core/src/auth/internal_rbac_rules.rs
crates/api-core/src/cfg/file.rs
crates/api-core/src/handlers/mod.rs
crates/api-core/src/handlers/secrets.rs
crates/api-core/src/lib.rs
crates/api-core/src/run.rs
crates/api-core/src/secrets/import.rs
crates/api-core/src/secrets/metrics.rs
crates/api-core/src/secrets/mod.rs
crates/api-core/src/secrets/re_wrap.rs
crates/api-core/src/secrets/routing.rs
crates/api-core/src/secrets/tests.rs
crates/api-core/src/setup.rs
crates/api-core/src/test_support/builder.rs
crates/api-core/src/test_support/default_config.rs
crates/api-db/migrations/20260416000305_create_secrets_table.sql
crates/api-db/src/lib.rs
crates/api-db/src/secrets.rs
crates/api-integration-tests/Cargo.toml
crates/api-integration-tests/tests/secrets_import.rs
crates/api-model/src/lib.rs
crates/api-model/src/secrets.rs
crates/kms-provider/Cargo.toml
crates/kms-provider/src/crypto.rs
crates/kms-provider/src/lib.rs
crates/kms-provider/src/providers/integrated.rs
crates/kms-provider/src/providers/multi.rs
crates/kms-provider/src/providers/transit.rs
crates/rpc/proto/forge.proto
crates/secrets/src/forge_vault.rs
crates/secrets/src/lib.rs
crates/uuid/src/lib.rs
crates/uuid/src/secret.rs

✅ Files skipped from review due to trivial changes (2)

crates/secrets/src/lib.rs
crates/api-core/src/handlers/mod.rs

🚧 Files skipped from review as they are similar to previous changes (38)

crates/api-model/src/lib.rs
crates/admin-cli/src/main.rs
crates/api-core/src/lib.rs
crates/kms-provider/src/lib.rs
crates/uuid/src/lib.rs
crates/api-integration-tests/Cargo.toml
crates/kms-provider/Cargo.toml
crates/admin-cli/src/secrets/mod.rs
crates/api-core/src/test_support/default_config.rs
crates/api-db/src/lib.rs
crates/api-core/src/auth/internal_rbac_rules.rs
crates/kms-provider/src/providers/multi.rs
crates/uuid/src/secret.rs
crates/admin-cli/src/secrets/re_wrap/cmd.rs
crates/api-core/src/handlers/secrets.rs
crates/api-core/src/secrets/metrics.rs
crates/admin-cli/cli_domains.yaml
crates/rpc/proto/forge.proto
crates/api-model/src/secrets.rs
crates/admin-cli/src/secrets/re_wrap/mod.rs
crates/admin-cli/src/cfg/cli_options.rs
crates/admin-cli/src/secrets/re_wrap/args.rs
crates/api-core/src/secrets/tests.rs
crates/kms-provider/src/crypto.rs
crates/api-core/src/secrets/re_wrap.rs
crates/api-core/Cargo.toml
crates/api-core/src/setup.rs
crates/api-core/src/api.rs
crates/kms-provider/src/providers/transit.rs
crates/api-db/migrations/20260416000305_create_secrets_table.sql
crates/api-integration-tests/tests/secrets_import.rs
crates/api-core/src/run.rs
crates/api-core/src/test_support/builder.rs
crates/secrets/src/forge_vault.rs
crates/api-core/src/cfg/file.rs
crates/api-core/src/secrets/mod.rs
crates/api-db/src/secrets.rs
crates/api-core/src/secrets/import.rs

chet requested a review from wminckler June 16, 2026 20:54

coderabbitai Bot reviewed Jun 16, 2026

View reviewed changes

chet force-pushed the carbide-postgres-secrets branch 2 times, most recently from db91806 to 04013cf Compare June 22, 2026 18:23

chet marked this pull request as ready for review June 22, 2026 18:23

chet requested a review from a team as a code owner June 22, 2026 18:23

coderabbitai Bot reviewed Jun 22, 2026

View reviewed changes

Comment thread crates/api-db/src/secrets.rs

Comment thread crates/api-db/src/secrets.rs

Comment thread crates/api-db/src/secrets.rs

wminckler reviewed Jun 22, 2026

View reviewed changes

chet force-pushed the carbide-postgres-secrets branch from 04013cf to 5b6f8f5 Compare June 22, 2026 19:36

wminckler approved these changes Jun 22, 2026

View reviewed changes

coderabbitai Bot mentioned this pull request Jun 23, 2026

api_test_helper: atomic port reservation for external process spawning #2777

Open

chet force-pushed the carbide-postgres-secrets branch from 5b6f8f5 to c5b4bf8 Compare June 23, 2026 00:30

coderabbitai Bot reviewed Jun 23, 2026

View reviewed changes

nick-vetro approved these changes Jun 23, 2026

View reviewed changes

chet merged commit ff893a7 into NVIDIA:main Jun 23, 2026
55 checks passed

chet deleted the carbide-postgres-secrets branch June 23, 2026 07:46

coderabbitai Bot mentioned this pull request Jun 25, 2026

feat(secrets): configurable credential chain and writer #2873

Merged

Uh oh!

Conversation

chet commented Jun 16, 2026

Description

Related issues

Type of Change

Breaking Changes

Testing

Additional Notes

Uh oh!

copy-pr-bot Bot commented Jun 16, 2026

Uh oh!

coderabbitai Bot commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Uh oh!

chet commented Jun 16, 2026

Uh oh!

coderabbitai Bot commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

chet Jun 16, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 16, 2026

Choose a reason for hiding this comment

Uh oh!

chet Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

chet commented Jun 22, 2026

Uh oh!

coderabbitai Bot commented Jun 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

chet commented Jun 22, 2026

Uh oh!

coderabbitai Bot commented Jun 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

wminckler Jun 22, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented Jun 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔍 Container Scan Summary

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot commented Jun 16, 2026 •

edited

Loading

coderabbitai Bot commented Jun 16, 2026 •

edited

Loading

coderabbitai Bot Jun 16, 2026 •

edited

Loading

coderabbitai Bot commented Jun 22, 2026 •

edited

Loading

coderabbitai Bot commented Jun 22, 2026 •

edited

Loading

github-actions Bot commented Jun 22, 2026 •

edited

Loading