Skip to content

feat: support storing secrets/credentials in Postgres#2665

Merged
chet merged 1 commit into
NVIDIA:mainfrom
chet:carbide-postgres-secrets
Jun 23, 2026
Merged

feat: support storing secrets/credentials in Postgres#2665
chet merged 1 commit into
NVIDIA:mainfrom
chet:carbide-postgres-secrets

Conversation

@chet

@chet chet commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Description

This supports #353.

This introduces a new secrets/credentials storage backend of sorts, such that NICo can now keep its credentials in Postgres instead of Vault, encrypted (with envelope encryption, leaning on #939) so that a copy of the database on its own gives up nothing. This entirely feature/flagged and configurable, currently not enabled at all, but will let us:

  • Use it alongside Vault.
  • Use it to migrate away from Vault.

The encryption is layered. Every credential gets its own data encryption key (DEK), and that DEK is wrapped by a key encryption key (KEK) that lives outside the database -- in NICo itself or whatever KmsProvider we choose. The credential's path is mixed into the encryption as well. The KEK is the only thing an operator has to protect; everything else lives in Postgres (some reference here and here.

PostgresCredentialManager implements the same CredentialManager traits the rest of the system already uses, so nothing downstream had to change. It keeps the two reader behaviors callers learned from Vault: the newest write for a path wins, and an empty-password entry reads as "no credential" (several delete paths still record that tombstone). The store is an append-only journal -- one row per write, newest-per-path on read -- which the rotation work going on in #367 can use for history, rollback, rotation, etc.

Where the wrapping keys come from is pluggable through KmsBackend: local key material (integrated), or Vault/OpenBao Transit secrets engine API, and more than one provider at once. [secrets.routing] maps path prefixes to the KEK that encrypts new writes under them. Startup cross-checks the routing against the providers, so a misspelled, duplicated, or colliding key fails at startup.

Moving an existing site off Vault could be:

  • A one-time import at startup -- deliberately all-or-nothing: any list or read failure, or an empty listing, aborts the boot rather than recording a half-finished import as done, and only one replica runs it at a time behind a Postgres advisory lock. Once it completes, Vault is out of the credential chain entirely -- with [secrets] set, the chain is env -> file -> postgres and nothing falls back to Vault. Prerequisites that live outside this process (services that still read Vault directly, mixed-fleet rolling upgrades) are spelled out on SecretsConfig.
  • A slow migration -- we would introduce it as the primary CredentialWriter, such that all new credentials get written to Postgres, with Vault still in the ChainedCredentialReader path, allowing us to slowly migrate over.

BUT, migration is NOT a part of this PR. This PR is for getting some initial code in place, and then we will iterate on it, and then do subsequent work for migration planning.

Rotating a KEK is a config change plus one command: point the route at the new key and run carbide-admin-cli secrets re-wrap. Only the per-row data-key wrapping is redone; the encrypted values are never touched. The re-wrap makes its KMS calls outside the write transaction, runs one at a time, and reports how many rows still sit on a retired key so you know when the old KEK is safe to remove.

Tests cover the manager round-trip, journal write-order under equal timestamps, rotation rollback by deleting the newest entry, tombstone reads, create conflicts, re-wrap idempotence and counting, and a path-binding regression that confirms a transplanted row will not decrypt -- plus an end-to-end Vault import against a real Vault dev server.

Signed-off-by: Chet Nichols III chetn@nvidia.com

Related issues

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Additional Notes

@copy-pr-bot

copy-pr-bot Bot commented Jun 16, 2026

Copy link
Copy Markdown

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Summary by CodeRabbit

  • New Features
    • Added secrets command to the admin CLI, including secrets re-wrap to re-encrypt stale secrets to the currently active KMS key routing (supports optional --batch-size).
    • Added Postgres-backed encrypted secrets storage with configurable KMS provider routing.
    • Added one-time Vault-to-Postgres import with completion tracking.
  • Improvements
    • Added dedicated metrics for secrets operations and expanded RBAC coverage for re-wrapping.
    • Strengthened encryption with authenticated associated data to prevent cross-context decryption.

Walkthrough

This change introduces a Postgres-backed secrets backend as an alternative to Vault KV, featuring append-only journal semantics with AES-256-GCM envelope encryption per row, multi-provider KMS with prefix-based KEK routing, a one-time Vault import flow, a re_wrap_stale operation for DEK key rotation, updated KMS crypto and provider contracts (AAD binding, kek_ids() method, cancellable token renewal), a gRPC ReWrapSecrets RPC with batch-size control and single-flight advisory locking, and an admin-cli secrets re-wrap command with user-facing completion reporting.

Changes

Postgres Secrets Backend with KMS, Re-wrap, and Admin CLI

Layer / File(s) Summary
Secret identity, schema, and database layer
crates/uuid/src/secret.rs, crates/uuid/src/lib.rs, crates/api-model/src/secrets.rs, crates/api-model/src/lib.rs, crates/api-db/migrations/20260416000305_create_secrets_table.sql, crates/api-db/src/secrets.rs, crates/api-db/src/lib.rs
SecretId typed UUID and SecretRow SQLx struct are defined. The secrets migration creates an append-only journal table with identity-backed seq, envelope-encryption columns (encrypted_value, nonce, kek_id, encrypted_dek, dek_nonce), created-at timestamp, and two performance indexes. The DB access layer implements all read (latest, history, KEK-scoped, paginated), write (insert, insert-if-missing with transaction/session advisory locking), deletion, and DEK-wrap-only update operations.
KMS provider contract: AAD binding, kek_ids method, Transit cancellation
crates/kms-provider/src/crypto.rs, crates/kms-provider/src/lib.rs, crates/kms-provider/src/providers/integrated.rs, crates/kms-provider/src/providers/multi.rs, crates/kms-provider/src/providers/transit.rs, crates/kms-provider/Cargo.toml
encrypt/decrypt gain an aad: &[u8] parameter for AES-256-GCM authenticated data binding, enabling ciphertext binding to storage context (e.g., secret path). KmsBackend trait gains a required kek_ids() method for startup KEK deduplication validation. All providers implement the new contract. TransitKmsProvider::start_token_renewal is replaced by a cancellable run_token_renewal(cancel) future. Key source handling adds zeroization wrappers and file permission warnings on Unix platforms.
Vault client enumeration modes and raw settings helper
crates/secrets/src/forge_vault.rs, crates/secrets/src/lib.rs
EnumerationMode (BestEffort/Strict) controls Vault KV enumeration and read error handling. get_secrets_strict performs full KV mount enumeration with strict semantics (failing on first non-404 error). Methods are refactored to support both modes. create_raw_vault_client_settings constructs a standalone VaultClientSettings instance for narrow provider-specific Vault client bootstrap, complementing the existing ForgeVaultClient.
Secrets domain logic: routing, credential manager, encryption, metrics, import, and re-wrap
crates/api-core/src/secrets/routing.rs, crates/api-core/src/secrets/mod.rs, crates/api-core/src/secrets/metrics.rs, crates/api-core/src/secrets/import.rs, crates/api-core/src/secrets/re_wrap.rs, crates/api-core/src/lib.rs, crates/api-core/src/secrets/tests.rs
SecretRouting implements longest-prefix-first KEK selection with catch-all fallback, enforces strict configuration validation (required / catch-all, no empty prefixes or KEKs, no normalized-route collisions, no leading-slash non-catch-all prefixes). PostgresCredentialManager implements CredentialReader and CredentialWriter traits with AES-256-GCM envelope encryption using path-bound AAD, Vault tombstone semantics (empty password → Ok(None)), create-only enforcement, and delete-by-id rotation. SecretsMetrics and OperationTimer instrument operations via OpenTelemetry counters and histograms. import_secrets supports MissingOnly (transactional check-then-insert) and All (unconditional appends) strategies. re_wrap_stale rotates DEKs in batches under a Postgres session advisory lock, re-encrypting rows whose stored KEK differs from the routed KEK, tracking re_wrapped, already_current, and stale_remaining counts with single-flight enforcement. Comprehensive test suite covers journal semantics, tombstones, AAD path binding, rotation rollback, and re-wrap correctness (idempotency, per-row classification).
Configuration types and startup orchestration
crates/api-core/src/cfg/file.rs, crates/api-core/src/run.rs, crates/api-core/src/setup.rs, crates/api-core/src/test_support/builder.rs, crates/api-core/src/test_support/default_config.rs, crates/api-core/Cargo.toml
CarbideConfig gains an optional [secrets] section with SecretsConfig, KmsConfig, ProviderConfig (serde-tagged integrated and transit variants with variant-specific fields), and ImportSource types, all with strict unknown-field denial and comprehensive TOML validation tests. run.rs bifurcates on carbide_config.secrets presence: when configured, builds SecretRouting, constructs and validates MultiKmsProvider (ensuring no KEK ID duplication across providers), instantiates PostgresCredentialManager with optional metrics, conditionally performs strict one-time Vault import guarded by a DB session advisory lock and completion marker; otherwise falls back to CARBIDE_CREDENTIAL_STORE (vault/memory). setup.rs refactors start_api to accept pre-built PgPool and optional SecretsContext rather than creating the pool internally. Test support builder threads secrets_context into test Api instances.
RPC surface, API struct, handler wiring, and RBAC
crates/rpc/proto/forge.proto, crates/api-core/src/api.rs, crates/api-core/src/handlers/secrets.rs, crates/api-core/src/handlers/mod.rs, crates/api-core/src/auth/internal_rbac_rules.rs
ReWrapSecrets RPC is added to the Forge service with ReWrapSecretsRequest (optional batch_size) and ReWrapSecretsResponse (counters: re_wrapped, already_current, stale_remaining). Api struct gains secrets_context: Option<SecretsContext> field. The re_wrap_secrets handler validates secrets context precondition, parses and clamps batch_size to an inclusive safe range, invokes re_wrap_stale, maps PgSecretsError::ReWrapInProgress to FailedPrecondition, and returns response with operation counts. ForgeAdminCLI principal is granted ReWrapSecrets RBAC permission.
Admin CLI secrets command tree and re-wrap subcommand
crates/admin-cli/src/cfg/cli_options.rs, crates/admin-cli/src/main.rs, crates/admin-cli/src/secrets/mod.rs, crates/admin-cli/src/secrets/re_wrap/args.rs, crates/admin-cli/src/secrets/re_wrap/cmd.rs, crates/admin-cli/src/secrets/re_wrap/mod.rs, crates/admin-cli/cli_domains.yaml
A secrets top-level subcommand is added to CliCommand with clap configuration and dispatch routing. The re-wrap subcommand accepts --batch-size (optional u32) for controlling journal walk batch size, calls the ReWrapSecrets RPC via ApiClient, and prints user-facing output: counts of re-wrapped and already-current rows; if stale_remaining is zero, outputs a success message, otherwise prints a warning with the stale count and re-run guidance.
Vault-to-Postgres import integration test
crates/api-integration-tests/tests/secrets_import.rs, crates/api-integration-tests/Cargo.toml
End-to-end test exercises full Vault→Postgres import flow: allocates a temporary database, starts a real Vault dev server, generates credential variants covering multiple CredentialKey types, writes to Vault KV mount, imports via ImportApproach::All with direct round-trip verification from PostgresCredentialManager, re-imports with ImportApproach::MissingOnly and asserts idempotency, validates import completion marker via is_vault_import_complete. Test dependencies include carbide-kms-provider, carbide-api-core/carbide-api-db (no defaults), and base64.

Sequence Diagram(s)

sequenceDiagram
  participant CLI as admin-cli
  participant API as api-core Forge gRPC
  participant Handler as handlers::secrets
  participant ReWrap as secrets::re_wrap_stale
  participant KMS as MultiKmsProvider
  participant DB as Postgres secrets table

  CLI->>API: ReWrapSecretsRequest { batch_size }
  API->>Handler: re_wrap_secrets(request)
  Handler->>Handler: clamp batch_size to [1, MAX]
  Handler->>Handler: validate secrets_context present
  Handler->>ReWrap: re_wrap_stale(pool, kms, routing, batch_size)
  ReWrap->>DB: try_lock_path_session (session advisory lock)
  loop each batch
    ReWrap->>DB: find_batch_after(after_seq, limit)
    DB-->>ReWrap: Vec~SecretRow~
    loop each row
      alt kek_id matches routed KEK
        ReWrap->>ReWrap: increment already_current
      else KEK mismatch
        ReWrap->>KMS: decrypt_dek(row.kek_id, row.encrypted_dek)
        KMS-->>ReWrap: dek
        ReWrap->>KMS: encrypt_dek(routed_kek_id, dek)
        KMS-->>ReWrap: new_encrypted_dek
        ReWrap->>ReWrap: add to pending updates
      end
    end
    alt pending updates exist
      ReWrap->>DB: update_dek_wrap (batch transaction)
      ReWrap->>ReWrap: increment re_wrapped
    end
  end
  ReWrap->>DB: count_wrapped_outside(routing_kek_ids)
  DB-->>ReWrap: stale_remaining
  ReWrap-->>Handler: ReWrapStaleResult
  Handler-->>API: ReWrapSecretsResponse { re_wrapped, already_current, stale_remaining }
  API-->>CLI: Response
  CLI->>CLI: print completion summary
Loading

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title 'feat: support storing secrets/credentials in Postgres' directly and accurately reflects the main objective of the changeset—introducing PostgreSQL as a credential storage backend.
Description check ✅ Passed The description provides comprehensive context for the Postgres credential storage feature, including architectural decisions, encryption design, configuration, migration strategy, KEK rotation mechanics, and testing coverage. It is directly related to the changeset.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@chet

chet commented Jun 16, 2026

Copy link
Copy Markdown
Contributor Author

@coderabbitai PTAL, thanks!

@coderabbitai

coderabbitai Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

@chet: Absolutely, I will conduct a thorough review of the PR now. Thank you for the ping.

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@chet chet requested a review from wminckler June 16, 2026 20:54

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 11

🧹 Nitpick comments (9)
crates/api-db/src/secrets.rs (1)

124-130: ⚡ Quick win

Use query_scalar for single-boolean reads in try_lock_path and exists.

Both queries return exactly one scalar bool; query_scalar is clearer and removes tuple unpacking.

Proposed refactor
 pub async fn try_lock_path(txn: &mut PgConnection, path: &str) -> DatabaseResult<bool> {
     let sql = "SELECT pg_try_advisory_xact_lock(hashtextextended('secrets:' || $1, 0))";
-    let row: (bool,) = sqlx::query_as(sql)
+    let row: bool = sqlx::query_scalar(sql)
         .bind(path)
         .fetch_one(txn)
         .await
         .map_err(|e| DatabaseError::query(sql, e))?;
-    Ok(row.0)
+    Ok(row)
 }
 
 /// Whether any entries exist for the path.
 pub async fn exists(txn: impl DbReader<'_>, path: &str) -> DatabaseResult<bool> {
     let sql = "SELECT EXISTS(SELECT 1 FROM secrets WHERE path = $1)";
-    let row: (bool,) = sqlx::query_as(sql)
+    let row: bool = sqlx::query_scalar(sql)
         .bind(path)
         .fetch_one(txn)
         .await
         .map_err(|e| DatabaseError::query(sql, e))?;
-    Ok(row.0)
+    Ok(row)
 }

Based on learnings: in crates/api-db, prefer sqlx::query_scalar(query).bind(...).fetch_one(...) when fetching exactly one scalar value.

Also applies to: 135-141

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/api-db/src/secrets.rs` around lines 124 - 130, Replace the query
pattern in both the `try_lock_path` and `exists` functions that use
`sqlx::query_as` to fetch a single boolean value wrapped in a tuple. Instead of
using `query_as` with tuple unpacking, use `sqlx::query_scalar` which directly
fetches the scalar boolean value without requiring tuple unpacking. This makes
the code clearer and more idiomatic for single scalar value queries.

Source: Learnings

crates/admin-cli/src/secrets/re_wrap/args.rs (1)

20-27: ⚡ Quick win

Add worked EXAMPLES: help to the re-wrap args struct.

Line 21 defines a new operator-facing command args struct but omits the required #[command(after_long_help = "...")] examples block.

Suggested patch
 #[derive(Parser, Debug, Clone)]
+#[command(
+    after_long_help = "EXAMPLES:\n\
+\n\
+Run re-wrap with server defaults:\n\
+    $ nico-admin-cli secrets re-wrap\n\
+\n\
+Run re-wrap with an explicit batch size:\n\
+    $ nico-admin-cli secrets re-wrap --batch-size 1000\n\
+\n"
+)]
 pub struct Args {
     #[clap(
         long,
         help = "Rows scanned per batch during the walk. The server applies its own default and limits."
     )]
     pub batch_size: Option<u32>,
 }

As per coding guidelines, crates/admin-cli/**/*.rs should attach #[command(after_long_help = "...")] with a concrete EXAMPLES: section for realistic usage.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/admin-cli/src/secrets/re_wrap/args.rs` around lines 20 - 27, The Args
struct in crates/admin-cli/src/secrets/re_wrap/args.rs is missing the required
#[command(after_long_help = "...")] attribute with an EXAMPLES: section. Add
this macro attribute to the Args struct definition (right after the
#[derive(Parser, Debug, Clone)] line) containing a concrete EXAMPLES: block that
demonstrates realistic usage of the re-wrap command with the batch_size
parameter, following the coding guidelines established for admin-cli command
argument structs.

Source: Coding guidelines

crates/admin-cli/src/secrets/re_wrap/cmd.rs (1)

22-26: ⚡ Quick win

Wrap the RPC failure with operator-facing context.

Line 25 should add a context segment so failure chains read as a coherent “while attempting to …” story.

Suggested patch
 use crate::errors::CarbideCliResult;
+use eyre::WrapErr as _;
 
 use crate::rpc::ApiClient;
 
 pub async fn re_wrap(api_client: &ApiClient, batch_size: Option<u32>) -> CarbideCliResult<()> {
     let request = ::rpc::forge::ReWrapSecretsRequest { batch_size };
 
-    let resp = api_client.0.re_wrap_secrets(request).await?;
+    let resp = api_client
+        .0
+        .re_wrap_secrets(request)
+        .await
+        .wrap_err("while attempting to re-wrap stale secrets via Forge API")?;

As per coding guidelines, admin-cli error chains should include .context()/wrap_err() phrased “while attempting to …” for actionable operator troubleshooting.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/admin-cli/src/secrets/re_wrap/cmd.rs` around lines 22 - 26, The RPC
call to re_wrap_secrets in the re_wrap function does not wrap its error with
operator-facing context. Modify the await result of
api_client.0.re_wrap_secrets(request).await to chain a .context() or .wrap_err()
call with a message phrased as "while attempting to ..." (such as "while
attempting to re-wrap secrets") before the ? operator, so that error chains
provide coherent, actionable troubleshooting information for operators as per
admin-cli coding guidelines.

Source: Coding guidelines

crates/secrets/src/forge_vault.rs (3)

770-780: ⚡ Quick win

Derive Eq with PartialEq to keep the new enum clippy-clean.

EnumerationMode has only fieldless variants, so Eq is correct and avoids clippy noise in a warnings-as-errors workspace. As per coding guidelines, “we enable all clippy lints by default, and treat all warnings as errors.”

♻️ Proposed fix
-#[derive(Clone, Copy, PartialEq)]
+#[derive(Clone, Copy, PartialEq, Eq)]
 enum EnumerationMode {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/secrets/src/forge_vault.rs` around lines 770 - 780, The
EnumerationMode enum derives Clone, Copy, and PartialEq, but since it only
contains fieldless variants, it should also derive Eq to satisfy clippy
requirements in the warnings-as-errors workspace. Add Eq to the derive attribute
list for the EnumerationMode enum to keep it clippy-clean and avoid warnings.

Source: Coding guidelines


1105-1108: ⚡ Quick win

Make the static-token requirement explicit in the returned error.

For Transit KMS on service-account-auth deployments, vault_config.token()? currently reports only VAULT_TOKEN. Add context so startup points directly at the unsupported auth mode instead of looking like a missing generic Vault setting.

♻️ Proposed fix
+    let token = vault_config.token().wrap_err(
+        "transit KMS requires vault.token or VAULT_TOKEN; Kubernetes service-account auth \
+         is not supported by create_raw_vault_client_settings yet",
+    )?;
     let mut builder = VaultClientSettingsBuilder::default();
     builder
-        .token(vault_config.token()?)
+        .token(token)
         .address(vault_config.address()?)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/secrets/src/forge_vault.rs` around lines 1105 - 1108, The error
returned by vault_config.token()? when building the VaultClientSettingsBuilder
is too generic and doesn't clarify the root cause. Enhance the error handling
around the vault_config.token()? call to provide a more explicit error message
that indicates static-token authentication is required but unsupported for
Transit KMS on service-account-auth deployments, so users understand the
authentication mode limitation rather than treating it as a missing generic
Vault setting.

833-835: ⚡ Quick win

Log Vault enumeration errors as structured fields.

These new warnings interpolate e into the message, which makes error filtering harder. Put the error in an attribute so logfmt consumers can search by error, prefix, and path. As per coding guidelines, “prefer placing common fields as attributes passed to tracing function, instead of using string interpolation.”

♻️ Proposed fix
                     tracing::warn!(
                         prefix = %dir,
-                        "failed to list vault path: {e}"
+                        error = %e,
+                        "failed to list vault path"
                     );
                     tracing::warn!(
                         path = %path,
-                        "failed to read: {e}"
+                        error = %e,
+                        "failed to read vault secret"
                     );

Also applies to: 937-939

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/secrets/src/forge_vault.rs` around lines 833 - 835, The warning logs
in the vault path enumeration are interpolating the error `e` directly into the
message string using `{e}`, which makes error filtering difficult for logfmt
consumers. Refactor the tracing::warn! calls to pass the error as a structured
attribute instead of including it in the message string. Remove the `{e}` from
the message and add `error = %e` (or similar attribute syntax) as a parameter to
the tracing::warn! macro, keeping the existing attributes like `prefix`. This
change needs to be applied at the location shown (lines 833-835 in the failed to
list vault path warning) and also at the similar warning around lines 937-939 as
mentioned in the comment.

Source: Coding guidelines

crates/secrets/src/lib.rs (1)

23-27: ⚡ Quick win

Update the re-export comment for the new Transit helper.

The block now exposes create_raw_vault_client_settings, which is not for CertificateProvider usage. Please broaden the comment so the public API intent remains accurate.

♻️ Proposed fix
-/// Exposed for `CertificateProvider` usage only. Credential operations should go
-/// through `create_credential_manager` instead of using the vault client directly.
+/// Exposed for direct Vault integrations such as `CertificateProvider` and
+/// Transit KMS setup. Credential operations should go through
+/// `create_credential_manager` instead of using the vault client directly.
 pub use crate::forge_vault::{
     ForgeVaultClient, VaultConfig, create_raw_vault_client_settings, create_vault_client,
 };
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/secrets/src/lib.rs` around lines 23 - 27, The comment above the
re-export block in crate::forge_vault states that the exports are only for
CertificateProvider usage, but this is no longer accurate now that
create_raw_vault_client_settings is included in the public API. Broaden the
comment to accurately describe the public API intent for all re-exported items
(ForgeVaultClient, VaultConfig, create_raw_vault_client_settings, and
create_vault_client), making it clear that these are exposed for multiple
purposes beyond just CertificateProvider usage.
crates/api-core/Cargo.toml (1)

57-61: ⚡ Quick win

Keep the new internal dependency in alphabetized order.

carbide-kms-provider should sit before carbide-libmlx, not after carbide-rpc-utils, to match the manifest’s ordering contract.

♻️ Proposed fix
 carbide-ipmi = { path = "../ipmi" }
 carbide-ipxe-renderer = { path = "../ipxe-renderer" }
+carbide-kms-provider = { path = "../kms-provider" }
 carbide-libmlx = { path = "../libmlx" }
@@
 carbide-rpc = { path = "../rpc", features = ["sqlx", "model"] }
 carbide-rpc-utils = { path = "../rpc-utils" }
-carbide-kms-provider = { path = "../kms-provider" }
 carbide-secrets = { path = "../secrets" }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/api-core/Cargo.toml` around lines 57 - 61, The internal dependencies
in the Cargo.toml file are not in alphabetical order. Move the
carbide-kms-provider dependency to its correct alphabetical position, which
should be before carbide-redfish, not after carbide-rpc-utils. Reorder all the
dependencies shown (carbide-redfish, carbide-rpc, carbide-rpc-utils,
carbide-kms-provider, carbide-secrets) so they appear in strict alphabetical
order by their crate names.
crates/api-core/src/test_support/builder.rs (1)

51-65: ⚡ Quick win

Allow tests to build a secrets-enabled Api.

Hard-coding secrets_context: None means handler tests for the new secrets/re-wrap path cannot use TestApiBuilder. Add an optional builder field and with_secrets_context method so tests can exercise both enabled and disabled cases.

♻️ Proposed fix
 pub struct TestApiBuilder {
@@
     ib_fabric_manager: Option<Arc<dyn IBFabricManager>>,
     component_manager: Option<Arc<component_manager::component_manager::ComponentManager>>,
+    secrets_context: Option<crate::secrets::SecretsContext>,
 }
@@
             ib_fabric_manager: None,
             component_manager: None,
+            secrets_context: None,
         }
     }
@@
     pub fn with_component_manager(
         self,
         component_manager: Arc<component_manager::component_manager::ComponentManager>,
     ) -> Self {
@@
         }
     }
+
+    pub fn with_secrets_context(self, secrets_context: crate::secrets::SecretsContext) -> Self {
+        Self {
+            secrets_context: Some(secrets_context),
+            ..self
+        }
+    }
@@
-            secrets_context: None,
+            secrets_context: self.secrets_context,
         }
     }
 }

Also applies to: 73-87, 153-161, 254-254

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/api-core/src/test_support/builder.rs` around lines 51 - 65, The
TestApiBuilder struct currently hard-codes secrets_context to None, preventing
tests from building an Api with secrets enabled. Add a new optional field (of
appropriate secrets context type) to the TestApiBuilder struct definition,
implement a with_secrets_context builder method that allows setting this field,
and ensure the field is properly passed through and used when the Api is
constructed so that tests can exercise both enabled and disabled cases for the
secrets/re-wrap path.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/admin-cli/src/cfg/cli_options.rs`:
- Around line 205-206: The new Secrets variant added to the CLI options enum in
cli_options.rs at lines 205–206 lacks a corresponding domain mapping entry in
cli_domains.yaml, which will cause the every_command_has_a_domain test to fail
during CI. Add a secrets entry under the appropriate domain list (admin or
tenant, based on the operational context of the secrets command) in
cli_domains.yaml to ensure the command is properly registered in the domain
configuration.

In `@crates/api-core/src/cfg/file.rs`:
- Around line 863-891: Add the `#[serde(deny_unknown_fields)]` attribute to the
SecretsConfig struct to reject misspelled configuration keys like `import_fom`
instead of silently ignoring them. Additionally, apply the same
`#[serde(deny_unknown_fields)]` attribute to the KmsConfig struct (located
around lines 903-910) for consistency. Finally, add a regression test that
verifies the configuration parser correctly rejects a misspelled optional field
(for example, `import_fom` instead of `import_from`) in the `[secrets]` section,
ensuring startup validation prevents such typos.

In `@crates/api-core/src/run.rs`:
- Around line 515-560: The code holds the transaction lock_txn (which contains a
pooled connection and advisory lock) while performing long-running operations
like vault_client.get_secrets_strict().await and
crate::secrets::import_secrets().await. This creates a deadlock risk during
concurrent replica startup where other waiters can block on the same lock while
occupying remaining pool connections. Restructure the code to use short
transactions: acquire the lock and check completion status in one short
transaction, then release it before calling
vault_client.get_secrets_strict().await and the import_secrets function, and
finally use another short transaction (reusing the same connection if possible,
or using a distinct lock key) to mark the import as complete. This applies at
the location shown (lines 515-560, specifically around the vault enumeration and
import_secrets calls) and also at lines 569-576 where similar long-running
operations may be awaited while holding a transaction.

In `@crates/api-core/src/secrets/re_wrap.rs`:
- Around line 72-88: The re-wrap operation holds a lock transaction from the
pool while the loop in `find_batch_after` and subsequent batch operations try to
acquire additional connections, causing potential deadlock under pool
saturation. Replace the transaction-based locking approach with a single
dedicated connection obtained from the pool at the start. Use a session advisory
lock (via db::secrets::try_lock_path) on that dedicated connection instead of a
transaction, then execute all scan and update operations on that same connection
throughout the loop to avoid holding the transaction open across awaited work.
Apply the same fix at the second location mentioned (lines 126-129) where
similar transaction holding occurs.

In `@crates/api-core/src/secrets/routing.rs`:
- Around line 65-91: In the from_config method's validation loop (where prefix
and kek_id are validated), add a check to reject any prefix that starts with "/"
unless it matches the CATCH_ALL constant. This prevents prefixes like
"/machines/bmc" from being silently misrouted to the catch-all, which could
cause writes to be encrypted under the wrong KEK. Add this validation check
after the empty prefix check and before the normalized prefix processing.

In `@crates/api-db/migrations/20260416000305_create_secrets_table.sql`:
- Around line 21-22: The `seq` column in the migration file uses `BIGINT
GENERATED ALWAYS AS IDENTITY` but is missing a `UNIQUE` constraint. Since
PostgreSQL's GENERATED ALWAYS AS IDENTITY does not enforce uniqueness by itself,
duplicates can occur from sequence resets or manual insertions, breaking the
pagination contract used by functions like `find_batch_after` that rely on `seq
> $1` ordering. Add the `UNIQUE` constraint to the `seq` column definition in
the CREATE TABLE statement to ensure deterministic batch ordering and pagination
behavior.

In `@crates/api-integration-tests/tests/secrets_import.rs`:
- Around line 31-35: The allocate_port() function at lines 31-35 releases the
TcpListener immediately after binding, creating a race condition where another
process can claim the port before it's used at lines 237-238. Fix this by either
keeping the listener alive and passing it to the Vault startup code instead of
just the port number, or by implementing a retry mechanism that attempts to bind
to the port immediately before starting Vault (lines 237-238), retrying if the
port becomes unavailable due to race conditions. This ensures the port
reservation and Vault startup happen atomically without releasing the binding in
between.
- Around line 216-219: The call to db::migrations::MIGRATOR.run() uses .expect()
which causes a panic that bypasses the cleanup logic if migrations fail, leaking
temporary databases. Replace the .expect() call with a fallible pattern using
the ? operator, and restructure the code to ensure the drop/cleanup logic
(currently at lines 225-228) always executes after setup and exercise,
regardless of whether migrations succeed or fail. This typically means using a
guard pattern or ensuring cleanup runs in a finally-like block that is
guaranteed to execute even when earlier operations return errors.
- Around line 212-214: The string concatenation used to build test_db_url at
line 212 breaks when the original db_url contains a database path or query
parameters. Instead of using format!("{db_url}/{test_db_name}"), parse the
db_url into its connection components using the appropriate PostgreSQL URL
parsing utilities, modify the database name field in the parsed options to use
test_db_name, and then reconstruct the connection URL from the parsed options.
This ensures existing query parameters and path components in the original URL
are preserved when building the test database connection string.

In `@crates/rpc/proto/forge.proto`:
- Around line 8550-8554: The `batch_size` field in the `ReWrapSecretsRequest`
message lacks documentation about its validation contract. Update the comment
for the `batch_size` field to explicitly document the semantics for when the
field is unset (its default behavior), what happens if zero is provided, and any
upper-bound constraints the server enforces. This ensures that generated clients
understand the expected validation rules and can properly handle edge cases.
- Around line 946-947: In the crates/rpc/build.rs file, add explicit serde
Serialize annotations for the ReWrapSecretsRequest and ReWrapSecretsResponse
message types using the .type_attribute() method calls, similar to how
comparable gRPC Request/Response types like ListOsImageRequest and
GetBmcCredentialsRequest are configured. This will ensure that these types have
the necessary Serialize trait implementation required by the log_request_data()
function called in the handler at crates/api-core/src/handlers/secrets.rs,
enabling proper request logging and serialization.

---

Nitpick comments:
In `@crates/admin-cli/src/secrets/re_wrap/args.rs`:
- Around line 20-27: The Args struct in
crates/admin-cli/src/secrets/re_wrap/args.rs is missing the required
#[command(after_long_help = "...")] attribute with an EXAMPLES: section. Add
this macro attribute to the Args struct definition (right after the
#[derive(Parser, Debug, Clone)] line) containing a concrete EXAMPLES: block that
demonstrates realistic usage of the re-wrap command with the batch_size
parameter, following the coding guidelines established for admin-cli command
argument structs.

In `@crates/admin-cli/src/secrets/re_wrap/cmd.rs`:
- Around line 22-26: The RPC call to re_wrap_secrets in the re_wrap function
does not wrap its error with operator-facing context. Modify the await result of
api_client.0.re_wrap_secrets(request).await to chain a .context() or .wrap_err()
call with a message phrased as "while attempting to ..." (such as "while
attempting to re-wrap secrets") before the ? operator, so that error chains
provide coherent, actionable troubleshooting information for operators as per
admin-cli coding guidelines.

In `@crates/api-core/Cargo.toml`:
- Around line 57-61: The internal dependencies in the Cargo.toml file are not in
alphabetical order. Move the carbide-kms-provider dependency to its correct
alphabetical position, which should be before carbide-redfish, not after
carbide-rpc-utils. Reorder all the dependencies shown (carbide-redfish,
carbide-rpc, carbide-rpc-utils, carbide-kms-provider, carbide-secrets) so they
appear in strict alphabetical order by their crate names.

In `@crates/api-core/src/test_support/builder.rs`:
- Around line 51-65: The TestApiBuilder struct currently hard-codes
secrets_context to None, preventing tests from building an Api with secrets
enabled. Add a new optional field (of appropriate secrets context type) to the
TestApiBuilder struct definition, implement a with_secrets_context builder
method that allows setting this field, and ensure the field is properly passed
through and used when the Api is constructed so that tests can exercise both
enabled and disabled cases for the secrets/re-wrap path.

In `@crates/api-db/src/secrets.rs`:
- Around line 124-130: Replace the query pattern in both the `try_lock_path` and
`exists` functions that use `sqlx::query_as` to fetch a single boolean value
wrapped in a tuple. Instead of using `query_as` with tuple unpacking, use
`sqlx::query_scalar` which directly fetches the scalar boolean value without
requiring tuple unpacking. This makes the code clearer and more idiomatic for
single scalar value queries.

In `@crates/secrets/src/forge_vault.rs`:
- Around line 770-780: The EnumerationMode enum derives Clone, Copy, and
PartialEq, but since it only contains fieldless variants, it should also derive
Eq to satisfy clippy requirements in the warnings-as-errors workspace. Add Eq to
the derive attribute list for the EnumerationMode enum to keep it clippy-clean
and avoid warnings.
- Around line 1105-1108: The error returned by vault_config.token()? when
building the VaultClientSettingsBuilder is too generic and doesn't clarify the
root cause. Enhance the error handling around the vault_config.token()? call to
provide a more explicit error message that indicates static-token authentication
is required but unsupported for Transit KMS on service-account-auth deployments,
so users understand the authentication mode limitation rather than treating it
as a missing generic Vault setting.
- Around line 833-835: The warning logs in the vault path enumeration are
interpolating the error `e` directly into the message string using `{e}`, which
makes error filtering difficult for logfmt consumers. Refactor the
tracing::warn! calls to pass the error as a structured attribute instead of
including it in the message string. Remove the `{e}` from the message and add
`error = %e` (or similar attribute syntax) as a parameter to the tracing::warn!
macro, keeping the existing attributes like `prefix`. This change needs to be
applied at the location shown (lines 833-835 in the failed to list vault path
warning) and also at the similar warning around lines 937-939 as mentioned in
the comment.

In `@crates/secrets/src/lib.rs`:
- Around line 23-27: The comment above the re-export block in crate::forge_vault
states that the exports are only for CertificateProvider usage, but this is no
longer accurate now that create_raw_vault_client_settings is included in the
public API. Broaden the comment to accurately describe the public API intent for
all re-exported items (ForgeVaultClient, VaultConfig,
create_raw_vault_client_settings, and create_vault_client), making it clear that
these are exposed for multiple purposes beyond just CertificateProvider usage.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 9bd7cccb-fd3d-47e1-9d17-4ccbe5fee75a

📥 Commits

Reviewing files that changed from the base of the PR and between 7eb1969 and b171e20.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (41)
  • crates/admin-cli/src/cfg/cli_options.rs
  • crates/admin-cli/src/main.rs
  • crates/admin-cli/src/secrets/mod.rs
  • crates/admin-cli/src/secrets/re_wrap/args.rs
  • crates/admin-cli/src/secrets/re_wrap/cmd.rs
  • crates/admin-cli/src/secrets/re_wrap/mod.rs
  • crates/api-core/Cargo.toml
  • crates/api-core/src/api.rs
  • crates/api-core/src/auth/internal_rbac_rules.rs
  • crates/api-core/src/cfg/file.rs
  • crates/api-core/src/handlers/mod.rs
  • crates/api-core/src/handlers/secrets.rs
  • crates/api-core/src/lib.rs
  • crates/api-core/src/run.rs
  • crates/api-core/src/secrets/import.rs
  • crates/api-core/src/secrets/metrics.rs
  • crates/api-core/src/secrets/mod.rs
  • crates/api-core/src/secrets/re_wrap.rs
  • crates/api-core/src/secrets/routing.rs
  • crates/api-core/src/secrets/tests.rs
  • crates/api-core/src/setup.rs
  • crates/api-core/src/test_support/builder.rs
  • crates/api-core/src/test_support/default_config.rs
  • crates/api-db/migrations/20260416000305_create_secrets_table.sql
  • crates/api-db/src/lib.rs
  • crates/api-db/src/secrets.rs
  • crates/api-integration-tests/Cargo.toml
  • crates/api-integration-tests/tests/secrets_import.rs
  • crates/api-model/src/lib.rs
  • crates/api-model/src/secrets.rs
  • crates/kms-provider/Cargo.toml
  • crates/kms-provider/src/crypto.rs
  • crates/kms-provider/src/lib.rs
  • crates/kms-provider/src/providers/integrated.rs
  • crates/kms-provider/src/providers/multi.rs
  • crates/kms-provider/src/providers/transit.rs
  • crates/rpc/proto/forge.proto
  • crates/secrets/src/forge_vault.rs
  • crates/secrets/src/lib.rs
  • crates/uuid/src/lib.rs
  • crates/uuid/src/secret.rs

Comment thread crates/admin-cli/src/cfg/cli_options.rs
Comment thread crates/api-core/src/cfg/file.rs
Comment thread crates/api-core/src/run.rs Outdated
Comment thread crates/api-core/src/secrets/re_wrap.rs Outdated
Comment thread crates/api-core/src/secrets/routing.rs
Comment on lines +31 to +35
/// allocate_port picks a free port by binding to port 0.
fn allocate_port() -> SocketAddr {
let listener = TcpListener::bind("127.0.0.1:0").expect("bind to free port");
listener.local_addr().expect("local addr")
}

@coderabbitai coderabbitai Bot Jun 16, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Port reservation is racy and can cause intermittent test failures.

Line 31-35 releases the listener before Line 238 starts Vault, so another process can claim the port in between. This is a classic TOCTOU race and a common CI flake source for integration tests. Prefer a helper that binds and starts atomically (or retries on bind conflicts).

As per coding guidelines, integration tests should be deterministic and avoid flaky behavior.

Also applies to: 237-238

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/api-integration-tests/tests/secrets_import.rs` around lines 31 - 35,
The allocate_port() function at lines 31-35 releases the TcpListener immediately
after binding, creating a race condition where another process can claim the
port before it's used at lines 237-238. Fix this by either keeping the listener
alive and passing it to the Vault startup code instead of just the port number,
or by implementing a retry mechanism that attempts to bind to the port
immediately before starting Vault (lines 237-238), retrying if the port becomes
unavailable due to race conditions. This ensures the port reservation and Vault
startup happen atomically without releasing the binding in between.

Source: Coding guidelines

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@coderabbitai Fair -- it is a real TOCTOU, but the atomic bind-and-launch would have to live in how api_test_helper spawns the external Vault process, not in this test, so it's out of scope for this PR. Leaving as-is. Thanks for checking!

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chet, understood — the atomic bind-and-launch fix is appropriately scoped to the api_test_helper process spawner, not this test file. Happy to leave it as-is here.

Would you like me to open a follow-up GitHub issue to track the api_test_helper improvement so it is not lost?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@coderabbitai Yes please! And if you can associate it as a sub-task of #353 that would be great. Thank you!

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

repo="NVIDIA/infra-controller"
parent_number="353"

child_url=$(gh issue create \
  --repo "$repo" \
  --title "api_test_helper: atomic port reservation for external process spawning" \
  --assignee "chet" \
  --body "## Background

During review of PR `#2665` (Postgres secrets backend), a TOCTOU race was identified in the integration test helper \`allocate_port()\` in \`crates/api-integration-tests/tests/secrets_import.rs\`:

\`\`\`rust
fn allocate_port() -> SocketAddr {
    let listener = TcpListener::bind(\"127.0.0.1:0\").expect(\"bind to free port\");
    listener.local_addr().expect(\"local addr\")
}  // listener is dropped here, releasing the port
\`\`\`

The listener is dropped immediately after obtaining the address, so another process can claim the port between this call and when the external process (e.g., Vault dev server) is actually started. This is a well-known source of intermittent CI failures.

## Problem

The fix requires changes to the \`api_test_helper\` process spawner so that port reservation and external process startup happen atomically — either by passing the live \`TcpListener\` (held open) to the spawner, or by having the spawner accept a bound socket directly.

## Proposed Fix

In \`api_test_helper\`, update the Vault dev server launch path to:
1. Accept a \`TcpListener\` (already bound) instead of a plain port number, **or**
2. Internally bind, pass the socket fd to the child process, and release only after the child has confirmed it is listening.

This keeps the reservation live until the child process owns the port, eliminating the race.

## References

- PR: https://github.com/NVIDIA/infra-controller/pull/2665
- Review comment: https://github.com/NVIDIA/infra-controller/pull/2665#discussion_r3423994822
- Requested by: `@chet`")

echo "Created issue: $child_url"
child_number="${child_url##*/}"
echo "Child issue number: $child_number"

parent_id=$(gh api graphql \
  -F owner="NVIDIA" \
  -F name="infra-controller" \
  -F parentNumber="$parent_number" \
  -f query='
query($owner: String!, $name: String!, $parentNumber: Int!) {
  repository(owner: $owner, name: $name) {
    issue(number: $parentNumber) { id }
  }
}' \
  --jq '.data.repository.issue.id')

child_id=$(gh api graphql \
  -F owner="NVIDIA" \
  -F name="infra-controller" \
  -F childNumber="$child_number" \
  -f query='
query($owner: String!, $name: String!, $childNumber: Int!) {
  repository(owner: $owner, name: $name) {
    issue(number: $childNumber) { id }
  }
}' \
  --jq '.data.repository.issue.id')

gh api graphql \
  -F parentId="$parent_id" \
  -F childId="$child_id" \
  -f query='
mutation($parentId: ID!, $childId: ID!) {
  addSubIssue(input: {issueId: $parentId, subIssueId: $childId}) {
    issue { url }
    subIssue { url }
  }
}'

Length of output: 3214



@chet, done! Issue #2777"api_test_helper: atomic port reservation for external process spawning" — has been created, assigned to you, and linked as a sub-task of #353.

Comment thread crates/api-integration-tests/tests/secrets_import.rs Outdated
Comment thread crates/api-integration-tests/tests/secrets_import.rs Outdated
Comment thread crates/rpc/proto/forge.proto
Comment thread crates/rpc/proto/forge.proto
@chet chet force-pushed the carbide-postgres-secrets branch 2 times, most recently from db91806 to 04013cf Compare June 22, 2026 18:23
@chet chet marked this pull request as ready for review June 22, 2026 18:23
@chet chet requested a review from a team as a code owner June 22, 2026 18:23
@chet

chet commented Jun 22, 2026

Copy link
Copy Markdown
Contributor Author

@coderabbitai PTAL again, fixed some conflicts, thanks!

@coderabbitai

coderabbitai Bot commented Jun 22, 2026

Copy link
Copy Markdown
Contributor

@chet: Understood — I will conduct a fresh review pass now, taking the conflict resolution into account. Thank you for the update.

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@chet

chet commented Jun 22, 2026

Copy link
Copy Markdown
Contributor Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Jun 22, 2026

Copy link
Copy Markdown
Contributor
✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (1)
crates/api-db/src/secrets.rs (1)

61-78: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Return the generated SecretId from insert.

insert creates the row identity and then discards it. Returning it lets rollback code call delete_by_id on the exact row it appended, instead of rediscovering by path/order under concurrency.

Proposed API adjustment
-pub async fn insert(txn: &mut PgConnection, entry: &NewSecretEntry<'_>) -> DatabaseResult<()> {
+pub async fn insert(txn: &mut PgConnection, entry: &NewSecretEntry<'_>) -> DatabaseResult<SecretId> {
     let secret_id = SecretId::new();
     let sql = "INSERT INTO secrets
          (secret_id, path, encrypted_value, nonce,
           kek_id, encrypted_dek, dek_nonce)
          VALUES ($1, $2, $3, $4, $5, $6, $7)";
@@
         .await
         .map_err(|e| DatabaseError::query(sql, e))?;
-    Ok(())
+    Ok(secret_id)
 }

Based on PR objectives, rotation rollback is a supported behavior for this journal.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/api-db/src/secrets.rs` around lines 61 - 78, The `insert` function
currently generates a `SecretId` but discards it after inserting the row,
returning `DatabaseResult<()>` instead. Change the return type of the `insert`
function from `DatabaseResult<()>` to `DatabaseResult<SecretId>`, and replace
the final `Ok(())` statement with `Ok(secret_id)` after the successful query
execution. This enables rollback code to delete the inserted row using the exact
`secret_id` that was created, avoiding the need to rediscover it by path under
concurrent conditions.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/api-db/src/secrets.rs`:
- Around line 150-157: The unlock_path_session function currently ignores the
boolean return value from PostgreSQL's pg_advisory_unlock() function, which
indicates whether the lock was successfully released (returns true) or failed to
release (returns false). Change the function signature to return
DatabaseResult<bool> instead of DatabaseResult<()>, and replace sqlx::query()
with sqlx::query_scalar() and the execute() call with fetch_one() to capture and
return the boolean result from pg_advisory_unlock(). This ensures that
lock-release failures are observable to callers and aligns with the established
pattern used in try_lock_path_session() within the same module.
- Around line 170-178: The `delete_all` function in this file performs
destructive deletion which violates the append-only journal contract documented
in secrets/mod.rs. Instead of using `delete_all` in the `delete_credentials`
method, write an empty-password tombstone entry for the path to preserve history
and maintain append-only semantics, similar to how other credential types (UFM,
SiteWideBmc) already implement deletion. Alternatively, if physical deletion is
only needed for test fixtures, gate the `delete_all` function with
`#[cfg(test)]` to restrict its usage to test-only code paths.
- Around line 81-96: The function `insert_if_missing` currently accepts a plain
PgConnection parameter which could allow autocommit connections and cause the
advisory lock to release prematurely, creating a race condition where concurrent
callers could both insert entries. Change the type signature of
`insert_if_missing` to require `&mut Transaction<'_, Postgres>` instead of `&mut
PgConnection` to ensure the lock is held for the transaction's duration. Apply
the same type tightening to the `lock_path` function as well to maintain
consistency and safety.

---

Nitpick comments:
In `@crates/api-db/src/secrets.rs`:
- Around line 61-78: The `insert` function currently generates a `SecretId` but
discards it after inserting the row, returning `DatabaseResult<()>` instead.
Change the return type of the `insert` function from `DatabaseResult<()>` to
`DatabaseResult<SecretId>`, and replace the final `Ok(())` statement with
`Ok(secret_id)` after the successful query execution. This enables rollback code
to delete the inserted row using the exact `secret_id` that was created,
avoiding the need to rediscover it by path under concurrent conditions.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 4ba2c609-05ec-4ffe-b4f7-3a2aa64aa4e0

📥 Commits

Reviewing files that changed from the base of the PR and between b171e20 and 04013cf.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (42)
  • crates/admin-cli/cli_domains.yaml
  • crates/admin-cli/src/cfg/cli_options.rs
  • crates/admin-cli/src/main.rs
  • crates/admin-cli/src/secrets/mod.rs
  • crates/admin-cli/src/secrets/re_wrap/args.rs
  • crates/admin-cli/src/secrets/re_wrap/cmd.rs
  • crates/admin-cli/src/secrets/re_wrap/mod.rs
  • crates/api-core/Cargo.toml
  • crates/api-core/src/api.rs
  • crates/api-core/src/auth/internal_rbac_rules.rs
  • crates/api-core/src/cfg/file.rs
  • crates/api-core/src/handlers/mod.rs
  • crates/api-core/src/handlers/secrets.rs
  • crates/api-core/src/lib.rs
  • crates/api-core/src/run.rs
  • crates/api-core/src/secrets/import.rs
  • crates/api-core/src/secrets/metrics.rs
  • crates/api-core/src/secrets/mod.rs
  • crates/api-core/src/secrets/re_wrap.rs
  • crates/api-core/src/secrets/routing.rs
  • crates/api-core/src/secrets/tests.rs
  • crates/api-core/src/setup.rs
  • crates/api-core/src/test_support/builder.rs
  • crates/api-core/src/test_support/default_config.rs
  • crates/api-db/migrations/20260416000305_create_secrets_table.sql
  • crates/api-db/src/lib.rs
  • crates/api-db/src/secrets.rs
  • crates/api-integration-tests/Cargo.toml
  • crates/api-integration-tests/tests/secrets_import.rs
  • crates/api-model/src/lib.rs
  • crates/api-model/src/secrets.rs
  • crates/kms-provider/Cargo.toml
  • crates/kms-provider/src/crypto.rs
  • crates/kms-provider/src/lib.rs
  • crates/kms-provider/src/providers/integrated.rs
  • crates/kms-provider/src/providers/multi.rs
  • crates/kms-provider/src/providers/transit.rs
  • crates/rpc/proto/forge.proto
  • crates/secrets/src/forge_vault.rs
  • crates/secrets/src/lib.rs
  • crates/uuid/src/lib.rs
  • crates/uuid/src/secret.rs
💤 Files with no reviewable changes (11)
  • crates/uuid/src/lib.rs
  • crates/kms-provider/src/providers/multi.rs
  • crates/uuid/src/secret.rs
  • crates/kms-provider/Cargo.toml
  • crates/secrets/src/lib.rs
  • crates/kms-provider/src/lib.rs
  • crates/rpc/proto/forge.proto
  • crates/kms-provider/src/providers/integrated.rs
  • crates/secrets/src/forge_vault.rs
  • crates/kms-provider/src/crypto.rs
  • crates/kms-provider/src/providers/transit.rs
✅ Files skipped from review due to trivial changes (3)
  • crates/api-core/src/lib.rs
  • crates/api-core/src/handlers/mod.rs
  • crates/api-core/src/test_support/default_config.rs
🚧 Files skipped from review as they are similar to previous changes (25)
  • crates/api-db/src/lib.rs
  • crates/admin-cli/src/secrets/re_wrap/args.rs
  • crates/admin-cli/src/secrets/re_wrap/mod.rs
  • crates/api-core/src/handlers/secrets.rs
  • crates/api-model/src/secrets.rs
  • crates/admin-cli/src/cfg/cli_options.rs
  • crates/api-core/src/auth/internal_rbac_rules.rs
  • crates/api-model/src/lib.rs
  • crates/admin-cli/src/secrets/mod.rs
  • crates/api-core/src/api.rs
  • crates/api-core/src/secrets/import.rs
  • crates/api-core/src/secrets/metrics.rs
  • crates/api-core/src/secrets/routing.rs
  • crates/admin-cli/src/main.rs
  • crates/admin-cli/src/secrets/re_wrap/cmd.rs
  • crates/api-db/migrations/20260416000305_create_secrets_table.sql
  • crates/api-integration-tests/Cargo.toml
  • crates/api-core/src/setup.rs
  • crates/api-core/src/secrets/mod.rs
  • crates/api-integration-tests/tests/secrets_import.rs
  • crates/api-core/src/secrets/re_wrap.rs
  • crates/api-core/src/secrets/tests.rs
  • crates/api-core/src/run.rs
  • crates/api-core/src/cfg/file.rs
  • crates/api-core/Cargo.toml

Comment thread crates/api-db/src/secrets.rs
Comment thread crates/api-db/src/secrets.rs
Comment thread crates/api-db/src/secrets.rs
--
-- The table is an append-only journal: every write inserts a new row, and a
-- read returns the newest row for the path. seq is the journal order --
-- created_at cannot be, because Postgres fixes now() per transaction, so

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've seen us pass in the actual "now" time independently of the transaction "now" time to avoid this. still not a guarantee, though...

@chet chet force-pushed the carbide-postgres-secrets branch from 04013cf to 5b6f8f5 Compare June 22, 2026 19:36
@github-actions

github-actions Bot commented Jun 22, 2026

Copy link
Copy Markdown

🔍 Container Scan Summary

Service Total Critical High Medium Low Other
boot-artifacts-aarch64 3 0 0 3 0 0
boot-artifacts-x86_64 3 0 0 3 0 0
forge-admin-cli-x86_64 264 6 23 99 6 130
machine-validation-runner 714 34 183 266 35 196
machine_validation 714 34 183 266 35 196
nvmetal-carbide 714 34 183 266 35 196
TOTAL 2412 108 572 903 111 718

Per-CVE detail lives in the per-service grype-* artifacts (JSON + SARIF). Severity counts only — no CVE IDs published here.

Carbide can now keep its credentials in Postgres instead of Vault,
encrypted so that a copy of the database on its own gives up nothing.
This swaps out Vault KV as the credential store; Vault still issues the
PKI certificates.

The encryption is layered. Every credential gets its own data key, and
that data key is wrapped by a key encryption key (KEK) that lives
outside the database -- in carbide's own config or an external KMS. The
credential's path is mixed into the encryption as well, so a row lifted
out and dropped onto another path will not decrypt. The KEK is the only
thing an operator has to protect; everything else lives in Postgres.

`PostgresCredentialManager` implements the same `CredentialManager`
traits the rest of the system already uses, so nothing downstream had to
change. It keeps the two reader behaviors callers learned from Vault:
the newest write for a path wins, and an empty-password entry reads as
"no credential" (several delete paths still record that tombstone). The
store is an append-only journal -- one row per write, newest-per-path on
read -- which is what the upcoming rotation work uses for history and
rollback.

Where the wrapping keys come from is pluggable through `KmsBackend`:
local key material (`integrated`), or Vault/OpenBao Transit, and more
than one provider at once. `[secrets.routing]` maps path prefixes to the
KEK that encrypts new writes under them. Startup cross-checks the routing
against the providers, so a misspelled, duplicated, or colliding key
fails the boot instead of some write hours later.

Moving an existing site off Vault is a one-time import at startup, and it
is deliberately all-or-nothing: any list or read failure, or an empty
listing, aborts the boot rather than recording a half-finished import as
done, and only one replica runs it at a time behind a Postgres advisory
lock. Once it completes, Vault is out of the credential chain entirely --
with `[secrets]` set, the chain is env -> file -> postgres and nothing
falls back to Vault. Prerequisites that live outside this process
(services that still read Vault directly, mixed-fleet rolling upgrades)
are spelled out on `SecretsConfig`.

Rotating a KEK is a config change plus one command: point the route at
the new key and run `carbide-admin-cli secrets re-wrap`. Only the
per-row data-key wrapping is redone; the encrypted values are never
touched. The re-wrap makes its KMS calls outside the write transaction,
runs one at a time, and reports how many rows still sit on a retired key
so you know when the old KEK is safe to remove.

Tests cover the manager round-trip, journal write-order under equal
timestamps, rotation rollback by deleting the newest entry, tombstone
reads, create conflicts, re-wrap idempotence and counting, and a
path-binding regression that confirms a transplanted row will not
decrypt -- plus an end-to-end Vault import against a real Vault dev
server.

This supports NVIDIA#353.

Signed-off-by: Chet Nichols III <chetn@nvidia.com>
@chet chet force-pushed the carbide-postgres-secrets branch from 5b6f8f5 to c5b4bf8 Compare June 23, 2026 00:30

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
crates/kms-provider/src/providers/integrated.rs (1)

152-152: 🔒 Security & Privacy | 🔵 Trivial | ⚡ Quick win

Consider binding DEK wraps to context via AAD.

The DEK encryption and decryption pass an empty byte slice as the associated data parameter (b""), leaving the DEK ciphertext unbound to any context such as the kek_id or path. While the kek_id is explicitly tracked in the database and validated at startup, cryptographically binding the wrap to the kek_id would provide defense in depth—if KEK IDs were ever mixed up due to a configuration error or migration bug, mismatched unwraps would fail at the crypto layer rather than succeeding with the wrong key.

🔐 Proposed enhancement for defense in depth
 async fn encrypt_dek(&self, kek_id: &str, dek: &[u8; 32]) -> Result<EncryptedDek, KmsError> {
     let kek = self
         .keys
         .get(kek_id)
         .ok_or_else(|| KmsError::KeyNotFound(kek_id.to_string()))?;
-    let (ciphertext, nonce) = crypto::encrypt(kek, dek, b"")?;
+    let (ciphertext, nonce) = crypto::encrypt(kek, dek, kek_id.as_bytes())?;
     Ok(EncryptedDek { ciphertext, nonce })
 }

 async fn decrypt_dek(
     &self,
     kek_id: &str,
     encrypted: &EncryptedDek,
 ) -> Result<Zeroizing<[u8; 32]>, KmsError> {
     let kek = self
         .keys
         .get(kek_id)
         .ok_or_else(|| KmsError::KeyNotFound(kek_id.to_string()))?;
-    let mut plaintext = crypto::decrypt(kek, &encrypted.nonce, &encrypted.ciphertext, b"")?;
+    let mut plaintext = crypto::decrypt(kek, &encrypted.nonce, &encrypted.ciphertext, kek_id.as_bytes())?;
     let len = plaintext.len();

This change would need to be applied consistently across all KmsBackend implementations (integrated, transit, multi) to maintain the contract.

Also applies to: 165-165

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/kms-provider/src/providers/integrated.rs` at line 152, The
crypto::encrypt call at line 152 and the corresponding decrypt operation at line
165 currently pass an empty byte slice as the associated authenticated data
(AAD) parameter, leaving the DEK ciphertext unbound to any context. Modify both
the encryption and decryption calls to bind the DEK wrap to the kek_id by
passing the kek_id as the AAD parameter instead of b"". This change must be
applied consistently across all KmsBackend implementations (integrated, transit,
and multi) to maintain the consistent contract and provide defense in depth
against key ID confusion errors.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@crates/kms-provider/src/providers/integrated.rs`:
- Line 152: The crypto::encrypt call at line 152 and the corresponding decrypt
operation at line 165 currently pass an empty byte slice as the associated
authenticated data (AAD) parameter, leaving the DEK ciphertext unbound to any
context. Modify both the encryption and decryption calls to bind the DEK wrap to
the kek_id by passing the kek_id as the AAD parameter instead of b"". This
change must be applied consistently across all KmsBackend implementations
(integrated, transit, and multi) to maintain the consistent contract and provide
defense in depth against key ID confusion errors.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 29072d47-870b-4926-b3a2-833cb24c429b

📥 Commits

Reviewing files that changed from the base of the PR and between 5b6f8f5 and c5b4bf8.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (42)
  • crates/admin-cli/cli_domains.yaml
  • crates/admin-cli/src/cfg/cli_options.rs
  • crates/admin-cli/src/main.rs
  • crates/admin-cli/src/secrets/mod.rs
  • crates/admin-cli/src/secrets/re_wrap/args.rs
  • crates/admin-cli/src/secrets/re_wrap/cmd.rs
  • crates/admin-cli/src/secrets/re_wrap/mod.rs
  • crates/api-core/Cargo.toml
  • crates/api-core/src/api.rs
  • crates/api-core/src/auth/internal_rbac_rules.rs
  • crates/api-core/src/cfg/file.rs
  • crates/api-core/src/handlers/mod.rs
  • crates/api-core/src/handlers/secrets.rs
  • crates/api-core/src/lib.rs
  • crates/api-core/src/run.rs
  • crates/api-core/src/secrets/import.rs
  • crates/api-core/src/secrets/metrics.rs
  • crates/api-core/src/secrets/mod.rs
  • crates/api-core/src/secrets/re_wrap.rs
  • crates/api-core/src/secrets/routing.rs
  • crates/api-core/src/secrets/tests.rs
  • crates/api-core/src/setup.rs
  • crates/api-core/src/test_support/builder.rs
  • crates/api-core/src/test_support/default_config.rs
  • crates/api-db/migrations/20260416000305_create_secrets_table.sql
  • crates/api-db/src/lib.rs
  • crates/api-db/src/secrets.rs
  • crates/api-integration-tests/Cargo.toml
  • crates/api-integration-tests/tests/secrets_import.rs
  • crates/api-model/src/lib.rs
  • crates/api-model/src/secrets.rs
  • crates/kms-provider/Cargo.toml
  • crates/kms-provider/src/crypto.rs
  • crates/kms-provider/src/lib.rs
  • crates/kms-provider/src/providers/integrated.rs
  • crates/kms-provider/src/providers/multi.rs
  • crates/kms-provider/src/providers/transit.rs
  • crates/rpc/proto/forge.proto
  • crates/secrets/src/forge_vault.rs
  • crates/secrets/src/lib.rs
  • crates/uuid/src/lib.rs
  • crates/uuid/src/secret.rs
✅ Files skipped from review due to trivial changes (2)
  • crates/secrets/src/lib.rs
  • crates/api-core/src/handlers/mod.rs
🚧 Files skipped from review as they are similar to previous changes (38)
  • crates/api-model/src/lib.rs
  • crates/admin-cli/src/main.rs
  • crates/api-core/src/lib.rs
  • crates/kms-provider/src/lib.rs
  • crates/uuid/src/lib.rs
  • crates/api-integration-tests/Cargo.toml
  • crates/kms-provider/Cargo.toml
  • crates/admin-cli/src/secrets/mod.rs
  • crates/api-core/src/test_support/default_config.rs
  • crates/api-db/src/lib.rs
  • crates/api-core/src/auth/internal_rbac_rules.rs
  • crates/kms-provider/src/providers/multi.rs
  • crates/uuid/src/secret.rs
  • crates/admin-cli/src/secrets/re_wrap/cmd.rs
  • crates/api-core/src/handlers/secrets.rs
  • crates/api-core/src/secrets/metrics.rs
  • crates/admin-cli/cli_domains.yaml
  • crates/rpc/proto/forge.proto
  • crates/api-model/src/secrets.rs
  • crates/admin-cli/src/secrets/re_wrap/mod.rs
  • crates/admin-cli/src/cfg/cli_options.rs
  • crates/admin-cli/src/secrets/re_wrap/args.rs
  • crates/api-core/src/secrets/tests.rs
  • crates/kms-provider/src/crypto.rs
  • crates/api-core/src/secrets/re_wrap.rs
  • crates/api-core/Cargo.toml
  • crates/api-core/src/setup.rs
  • crates/api-core/src/api.rs
  • crates/kms-provider/src/providers/transit.rs
  • crates/api-db/migrations/20260416000305_create_secrets_table.sql
  • crates/api-integration-tests/tests/secrets_import.rs
  • crates/api-core/src/run.rs
  • crates/api-core/src/test_support/builder.rs
  • crates/secrets/src/forge_vault.rs
  • crates/api-core/src/cfg/file.rs
  • crates/api-core/src/secrets/mod.rs
  • crates/api-db/src/secrets.rs
  • crates/api-core/src/secrets/import.rs

@chet chet merged commit ff893a7 into NVIDIA:main Jun 23, 2026
55 checks passed
@chet chet deleted the carbide-postgres-secrets branch June 23, 2026 07:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants