feat: support storing secrets/credentials in Postgres#2665
Conversation
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
Summary by CodeRabbit
WalkthroughThis change introduces a Postgres-backed secrets backend as an alternative to Vault KV, featuring append-only journal semantics with AES-256-GCM envelope encryption per row, multi-provider KMS with prefix-based KEK routing, a one-time Vault import flow, a ChangesPostgres Secrets Backend with KMS, Re-wrap, and Admin CLI
Sequence Diagram(s)sequenceDiagram
participant CLI as admin-cli
participant API as api-core Forge gRPC
participant Handler as handlers::secrets
participant ReWrap as secrets::re_wrap_stale
participant KMS as MultiKmsProvider
participant DB as Postgres secrets table
CLI->>API: ReWrapSecretsRequest { batch_size }
API->>Handler: re_wrap_secrets(request)
Handler->>Handler: clamp batch_size to [1, MAX]
Handler->>Handler: validate secrets_context present
Handler->>ReWrap: re_wrap_stale(pool, kms, routing, batch_size)
ReWrap->>DB: try_lock_path_session (session advisory lock)
loop each batch
ReWrap->>DB: find_batch_after(after_seq, limit)
DB-->>ReWrap: Vec~SecretRow~
loop each row
alt kek_id matches routed KEK
ReWrap->>ReWrap: increment already_current
else KEK mismatch
ReWrap->>KMS: decrypt_dek(row.kek_id, row.encrypted_dek)
KMS-->>ReWrap: dek
ReWrap->>KMS: encrypt_dek(routed_kek_id, dek)
KMS-->>ReWrap: new_encrypted_dek
ReWrap->>ReWrap: add to pending updates
end
end
alt pending updates exist
ReWrap->>DB: update_dek_wrap (batch transaction)
ReWrap->>ReWrap: increment re_wrapped
end
end
ReWrap->>DB: count_wrapped_outside(routing_kek_ids)
DB-->>ReWrap: stale_remaining
ReWrap-->>Handler: ReWrapStaleResult
Handler-->>API: ReWrapSecretsResponse { re_wrapped, already_current, stale_remaining }
API-->>CLI: Response
CLI->>CLI: print completion summary
Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
@coderabbitai PTAL, thanks! |
|
✅ Action performedReview finished.
|
There was a problem hiding this comment.
Actionable comments posted: 11
🧹 Nitpick comments (9)
crates/api-db/src/secrets.rs (1)
124-130: ⚡ Quick winUse
query_scalarfor single-boolean reads intry_lock_pathandexists.Both queries return exactly one scalar
bool;query_scalaris clearer and removes tuple unpacking.Proposed refactor
pub async fn try_lock_path(txn: &mut PgConnection, path: &str) -> DatabaseResult<bool> { let sql = "SELECT pg_try_advisory_xact_lock(hashtextextended('secrets:' || $1, 0))"; - let row: (bool,) = sqlx::query_as(sql) + let row: bool = sqlx::query_scalar(sql) .bind(path) .fetch_one(txn) .await .map_err(|e| DatabaseError::query(sql, e))?; - Ok(row.0) + Ok(row) } /// Whether any entries exist for the path. pub async fn exists(txn: impl DbReader<'_>, path: &str) -> DatabaseResult<bool> { let sql = "SELECT EXISTS(SELECT 1 FROM secrets WHERE path = $1)"; - let row: (bool,) = sqlx::query_as(sql) + let row: bool = sqlx::query_scalar(sql) .bind(path) .fetch_one(txn) .await .map_err(|e| DatabaseError::query(sql, e))?; - Ok(row.0) + Ok(row) }Based on learnings: in
crates/api-db, prefersqlx::query_scalar(query).bind(...).fetch_one(...)when fetching exactly one scalar value.Also applies to: 135-141
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/api-db/src/secrets.rs` around lines 124 - 130, Replace the query pattern in both the `try_lock_path` and `exists` functions that use `sqlx::query_as` to fetch a single boolean value wrapped in a tuple. Instead of using `query_as` with tuple unpacking, use `sqlx::query_scalar` which directly fetches the scalar boolean value without requiring tuple unpacking. This makes the code clearer and more idiomatic for single scalar value queries.Source: Learnings
crates/admin-cli/src/secrets/re_wrap/args.rs (1)
20-27: ⚡ Quick winAdd worked
EXAMPLES:help to the re-wrap args struct.Line 21 defines a new operator-facing command args struct but omits the required
#[command(after_long_help = "...")]examples block.Suggested patch
#[derive(Parser, Debug, Clone)] +#[command( + after_long_help = "EXAMPLES:\n\ +\n\ +Run re-wrap with server defaults:\n\ + $ nico-admin-cli secrets re-wrap\n\ +\n\ +Run re-wrap with an explicit batch size:\n\ + $ nico-admin-cli secrets re-wrap --batch-size 1000\n\ +\n" +)] pub struct Args { #[clap( long, help = "Rows scanned per batch during the walk. The server applies its own default and limits." )] pub batch_size: Option<u32>, }As per coding guidelines,
crates/admin-cli/**/*.rsshould attach#[command(after_long_help = "...")]with a concreteEXAMPLES:section for realistic usage.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/admin-cli/src/secrets/re_wrap/args.rs` around lines 20 - 27, The Args struct in crates/admin-cli/src/secrets/re_wrap/args.rs is missing the required #[command(after_long_help = "...")] attribute with an EXAMPLES: section. Add this macro attribute to the Args struct definition (right after the #[derive(Parser, Debug, Clone)] line) containing a concrete EXAMPLES: block that demonstrates realistic usage of the re-wrap command with the batch_size parameter, following the coding guidelines established for admin-cli command argument structs.Source: Coding guidelines
crates/admin-cli/src/secrets/re_wrap/cmd.rs (1)
22-26: ⚡ Quick winWrap the RPC failure with operator-facing context.
Line 25 should add a context segment so failure chains read as a coherent “while attempting to …” story.
Suggested patch
use crate::errors::CarbideCliResult; +use eyre::WrapErr as _; use crate::rpc::ApiClient; pub async fn re_wrap(api_client: &ApiClient, batch_size: Option<u32>) -> CarbideCliResult<()> { let request = ::rpc::forge::ReWrapSecretsRequest { batch_size }; - let resp = api_client.0.re_wrap_secrets(request).await?; + let resp = api_client + .0 + .re_wrap_secrets(request) + .await + .wrap_err("while attempting to re-wrap stale secrets via Forge API")?;As per coding guidelines, admin-cli error chains should include
.context()/wrap_err()phrased “while attempting to …” for actionable operator troubleshooting.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/admin-cli/src/secrets/re_wrap/cmd.rs` around lines 22 - 26, The RPC call to re_wrap_secrets in the re_wrap function does not wrap its error with operator-facing context. Modify the await result of api_client.0.re_wrap_secrets(request).await to chain a .context() or .wrap_err() call with a message phrased as "while attempting to ..." (such as "while attempting to re-wrap secrets") before the ? operator, so that error chains provide coherent, actionable troubleshooting information for operators as per admin-cli coding guidelines.Source: Coding guidelines
crates/secrets/src/forge_vault.rs (3)
770-780: ⚡ Quick winDerive
EqwithPartialEqto keep the new enum clippy-clean.
EnumerationModehas only fieldless variants, soEqis correct and avoids clippy noise in a warnings-as-errors workspace. As per coding guidelines, “we enable all clippy lints by default, and treat all warnings as errors.”♻️ Proposed fix
-#[derive(Clone, Copy, PartialEq)] +#[derive(Clone, Copy, PartialEq, Eq)] enum EnumerationMode {🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/secrets/src/forge_vault.rs` around lines 770 - 780, The EnumerationMode enum derives Clone, Copy, and PartialEq, but since it only contains fieldless variants, it should also derive Eq to satisfy clippy requirements in the warnings-as-errors workspace. Add Eq to the derive attribute list for the EnumerationMode enum to keep it clippy-clean and avoid warnings.Source: Coding guidelines
1105-1108: ⚡ Quick winMake the static-token requirement explicit in the returned error.
For Transit KMS on service-account-auth deployments,
vault_config.token()?currently reports onlyVAULT_TOKEN. Add context so startup points directly at the unsupported auth mode instead of looking like a missing generic Vault setting.♻️ Proposed fix
+ let token = vault_config.token().wrap_err( + "transit KMS requires vault.token or VAULT_TOKEN; Kubernetes service-account auth \ + is not supported by create_raw_vault_client_settings yet", + )?; let mut builder = VaultClientSettingsBuilder::default(); builder - .token(vault_config.token()?) + .token(token) .address(vault_config.address()?)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/secrets/src/forge_vault.rs` around lines 1105 - 1108, The error returned by vault_config.token()? when building the VaultClientSettingsBuilder is too generic and doesn't clarify the root cause. Enhance the error handling around the vault_config.token()? call to provide a more explicit error message that indicates static-token authentication is required but unsupported for Transit KMS on service-account-auth deployments, so users understand the authentication mode limitation rather than treating it as a missing generic Vault setting.
833-835: ⚡ Quick winLog Vault enumeration errors as structured fields.
These new warnings interpolate
einto the message, which makes error filtering harder. Put the error in an attribute so logfmt consumers can search byerror,prefix, andpath. As per coding guidelines, “prefer placing common fields as attributes passed to tracing function, instead of using string interpolation.”♻️ Proposed fix
tracing::warn!( prefix = %dir, - "failed to list vault path: {e}" + error = %e, + "failed to list vault path" );tracing::warn!( path = %path, - "failed to read: {e}" + error = %e, + "failed to read vault secret" );Also applies to: 937-939
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/secrets/src/forge_vault.rs` around lines 833 - 835, The warning logs in the vault path enumeration are interpolating the error `e` directly into the message string using `{e}`, which makes error filtering difficult for logfmt consumers. Refactor the tracing::warn! calls to pass the error as a structured attribute instead of including it in the message string. Remove the `{e}` from the message and add `error = %e` (or similar attribute syntax) as a parameter to the tracing::warn! macro, keeping the existing attributes like `prefix`. This change needs to be applied at the location shown (lines 833-835 in the failed to list vault path warning) and also at the similar warning around lines 937-939 as mentioned in the comment.Source: Coding guidelines
crates/secrets/src/lib.rs (1)
23-27: ⚡ Quick winUpdate the re-export comment for the new Transit helper.
The block now exposes
create_raw_vault_client_settings, which is not forCertificateProviderusage. Please broaden the comment so the public API intent remains accurate.♻️ Proposed fix
-/// Exposed for `CertificateProvider` usage only. Credential operations should go -/// through `create_credential_manager` instead of using the vault client directly. +/// Exposed for direct Vault integrations such as `CertificateProvider` and +/// Transit KMS setup. Credential operations should go through +/// `create_credential_manager` instead of using the vault client directly. pub use crate::forge_vault::{ ForgeVaultClient, VaultConfig, create_raw_vault_client_settings, create_vault_client, };🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/secrets/src/lib.rs` around lines 23 - 27, The comment above the re-export block in crate::forge_vault states that the exports are only for CertificateProvider usage, but this is no longer accurate now that create_raw_vault_client_settings is included in the public API. Broaden the comment to accurately describe the public API intent for all re-exported items (ForgeVaultClient, VaultConfig, create_raw_vault_client_settings, and create_vault_client), making it clear that these are exposed for multiple purposes beyond just CertificateProvider usage.crates/api-core/Cargo.toml (1)
57-61: ⚡ Quick winKeep the new internal dependency in alphabetized order.
carbide-kms-providershould sit beforecarbide-libmlx, not aftercarbide-rpc-utils, to match the manifest’s ordering contract.♻️ Proposed fix
carbide-ipmi = { path = "../ipmi" } carbide-ipxe-renderer = { path = "../ipxe-renderer" } +carbide-kms-provider = { path = "../kms-provider" } carbide-libmlx = { path = "../libmlx" } @@ carbide-rpc = { path = "../rpc", features = ["sqlx", "model"] } carbide-rpc-utils = { path = "../rpc-utils" } -carbide-kms-provider = { path = "../kms-provider" } carbide-secrets = { path = "../secrets" }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/api-core/Cargo.toml` around lines 57 - 61, The internal dependencies in the Cargo.toml file are not in alphabetical order. Move the carbide-kms-provider dependency to its correct alphabetical position, which should be before carbide-redfish, not after carbide-rpc-utils. Reorder all the dependencies shown (carbide-redfish, carbide-rpc, carbide-rpc-utils, carbide-kms-provider, carbide-secrets) so they appear in strict alphabetical order by their crate names.crates/api-core/src/test_support/builder.rs (1)
51-65: ⚡ Quick winAllow tests to build a secrets-enabled
Api.Hard-coding
secrets_context: Nonemeans handler tests for the new secrets/re-wrap path cannot useTestApiBuilder. Add an optional builder field andwith_secrets_contextmethod so tests can exercise both enabled and disabled cases.♻️ Proposed fix
pub struct TestApiBuilder { @@ ib_fabric_manager: Option<Arc<dyn IBFabricManager>>, component_manager: Option<Arc<component_manager::component_manager::ComponentManager>>, + secrets_context: Option<crate::secrets::SecretsContext>, } @@ ib_fabric_manager: None, component_manager: None, + secrets_context: None, } } @@ pub fn with_component_manager( self, component_manager: Arc<component_manager::component_manager::ComponentManager>, ) -> Self { @@ } } + + pub fn with_secrets_context(self, secrets_context: crate::secrets::SecretsContext) -> Self { + Self { + secrets_context: Some(secrets_context), + ..self + } + } @@ - secrets_context: None, + secrets_context: self.secrets_context, } } }Also applies to: 73-87, 153-161, 254-254
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/api-core/src/test_support/builder.rs` around lines 51 - 65, The TestApiBuilder struct currently hard-codes secrets_context to None, preventing tests from building an Api with secrets enabled. Add a new optional field (of appropriate secrets context type) to the TestApiBuilder struct definition, implement a with_secrets_context builder method that allows setting this field, and ensure the field is properly passed through and used when the Api is constructed so that tests can exercise both enabled and disabled cases for the secrets/re-wrap path.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/admin-cli/src/cfg/cli_options.rs`:
- Around line 205-206: The new Secrets variant added to the CLI options enum in
cli_options.rs at lines 205–206 lacks a corresponding domain mapping entry in
cli_domains.yaml, which will cause the every_command_has_a_domain test to fail
during CI. Add a secrets entry under the appropriate domain list (admin or
tenant, based on the operational context of the secrets command) in
cli_domains.yaml to ensure the command is properly registered in the domain
configuration.
In `@crates/api-core/src/cfg/file.rs`:
- Around line 863-891: Add the `#[serde(deny_unknown_fields)]` attribute to the
SecretsConfig struct to reject misspelled configuration keys like `import_fom`
instead of silently ignoring them. Additionally, apply the same
`#[serde(deny_unknown_fields)]` attribute to the KmsConfig struct (located
around lines 903-910) for consistency. Finally, add a regression test that
verifies the configuration parser correctly rejects a misspelled optional field
(for example, `import_fom` instead of `import_from`) in the `[secrets]` section,
ensuring startup validation prevents such typos.
In `@crates/api-core/src/run.rs`:
- Around line 515-560: The code holds the transaction lock_txn (which contains a
pooled connection and advisory lock) while performing long-running operations
like vault_client.get_secrets_strict().await and
crate::secrets::import_secrets().await. This creates a deadlock risk during
concurrent replica startup where other waiters can block on the same lock while
occupying remaining pool connections. Restructure the code to use short
transactions: acquire the lock and check completion status in one short
transaction, then release it before calling
vault_client.get_secrets_strict().await and the import_secrets function, and
finally use another short transaction (reusing the same connection if possible,
or using a distinct lock key) to mark the import as complete. This applies at
the location shown (lines 515-560, specifically around the vault enumeration and
import_secrets calls) and also at lines 569-576 where similar long-running
operations may be awaited while holding a transaction.
In `@crates/api-core/src/secrets/re_wrap.rs`:
- Around line 72-88: The re-wrap operation holds a lock transaction from the
pool while the loop in `find_batch_after` and subsequent batch operations try to
acquire additional connections, causing potential deadlock under pool
saturation. Replace the transaction-based locking approach with a single
dedicated connection obtained from the pool at the start. Use a session advisory
lock (via db::secrets::try_lock_path) on that dedicated connection instead of a
transaction, then execute all scan and update operations on that same connection
throughout the loop to avoid holding the transaction open across awaited work.
Apply the same fix at the second location mentioned (lines 126-129) where
similar transaction holding occurs.
In `@crates/api-core/src/secrets/routing.rs`:
- Around line 65-91: In the from_config method's validation loop (where prefix
and kek_id are validated), add a check to reject any prefix that starts with "/"
unless it matches the CATCH_ALL constant. This prevents prefixes like
"/machines/bmc" from being silently misrouted to the catch-all, which could
cause writes to be encrypted under the wrong KEK. Add this validation check
after the empty prefix check and before the normalized prefix processing.
In `@crates/api-db/migrations/20260416000305_create_secrets_table.sql`:
- Around line 21-22: The `seq` column in the migration file uses `BIGINT
GENERATED ALWAYS AS IDENTITY` but is missing a `UNIQUE` constraint. Since
PostgreSQL's GENERATED ALWAYS AS IDENTITY does not enforce uniqueness by itself,
duplicates can occur from sequence resets or manual insertions, breaking the
pagination contract used by functions like `find_batch_after` that rely on `seq
> $1` ordering. Add the `UNIQUE` constraint to the `seq` column definition in
the CREATE TABLE statement to ensure deterministic batch ordering and pagination
behavior.
In `@crates/api-integration-tests/tests/secrets_import.rs`:
- Around line 31-35: The allocate_port() function at lines 31-35 releases the
TcpListener immediately after binding, creating a race condition where another
process can claim the port before it's used at lines 237-238. Fix this by either
keeping the listener alive and passing it to the Vault startup code instead of
just the port number, or by implementing a retry mechanism that attempts to bind
to the port immediately before starting Vault (lines 237-238), retrying if the
port becomes unavailable due to race conditions. This ensures the port
reservation and Vault startup happen atomically without releasing the binding in
between.
- Around line 216-219: The call to db::migrations::MIGRATOR.run() uses .expect()
which causes a panic that bypasses the cleanup logic if migrations fail, leaking
temporary databases. Replace the .expect() call with a fallible pattern using
the ? operator, and restructure the code to ensure the drop/cleanup logic
(currently at lines 225-228) always executes after setup and exercise,
regardless of whether migrations succeed or fail. This typically means using a
guard pattern or ensuring cleanup runs in a finally-like block that is
guaranteed to execute even when earlier operations return errors.
- Around line 212-214: The string concatenation used to build test_db_url at
line 212 breaks when the original db_url contains a database path or query
parameters. Instead of using format!("{db_url}/{test_db_name}"), parse the
db_url into its connection components using the appropriate PostgreSQL URL
parsing utilities, modify the database name field in the parsed options to use
test_db_name, and then reconstruct the connection URL from the parsed options.
This ensures existing query parameters and path components in the original URL
are preserved when building the test database connection string.
In `@crates/rpc/proto/forge.proto`:
- Around line 8550-8554: The `batch_size` field in the `ReWrapSecretsRequest`
message lacks documentation about its validation contract. Update the comment
for the `batch_size` field to explicitly document the semantics for when the
field is unset (its default behavior), what happens if zero is provided, and any
upper-bound constraints the server enforces. This ensures that generated clients
understand the expected validation rules and can properly handle edge cases.
- Around line 946-947: In the crates/rpc/build.rs file, add explicit serde
Serialize annotations for the ReWrapSecretsRequest and ReWrapSecretsResponse
message types using the .type_attribute() method calls, similar to how
comparable gRPC Request/Response types like ListOsImageRequest and
GetBmcCredentialsRequest are configured. This will ensure that these types have
the necessary Serialize trait implementation required by the log_request_data()
function called in the handler at crates/api-core/src/handlers/secrets.rs,
enabling proper request logging and serialization.
---
Nitpick comments:
In `@crates/admin-cli/src/secrets/re_wrap/args.rs`:
- Around line 20-27: The Args struct in
crates/admin-cli/src/secrets/re_wrap/args.rs is missing the required
#[command(after_long_help = "...")] attribute with an EXAMPLES: section. Add
this macro attribute to the Args struct definition (right after the
#[derive(Parser, Debug, Clone)] line) containing a concrete EXAMPLES: block that
demonstrates realistic usage of the re-wrap command with the batch_size
parameter, following the coding guidelines established for admin-cli command
argument structs.
In `@crates/admin-cli/src/secrets/re_wrap/cmd.rs`:
- Around line 22-26: The RPC call to re_wrap_secrets in the re_wrap function
does not wrap its error with operator-facing context. Modify the await result of
api_client.0.re_wrap_secrets(request).await to chain a .context() or .wrap_err()
call with a message phrased as "while attempting to ..." (such as "while
attempting to re-wrap secrets") before the ? operator, so that error chains
provide coherent, actionable troubleshooting information for operators as per
admin-cli coding guidelines.
In `@crates/api-core/Cargo.toml`:
- Around line 57-61: The internal dependencies in the Cargo.toml file are not in
alphabetical order. Move the carbide-kms-provider dependency to its correct
alphabetical position, which should be before carbide-redfish, not after
carbide-rpc-utils. Reorder all the dependencies shown (carbide-redfish,
carbide-rpc, carbide-rpc-utils, carbide-kms-provider, carbide-secrets) so they
appear in strict alphabetical order by their crate names.
In `@crates/api-core/src/test_support/builder.rs`:
- Around line 51-65: The TestApiBuilder struct currently hard-codes
secrets_context to None, preventing tests from building an Api with secrets
enabled. Add a new optional field (of appropriate secrets context type) to the
TestApiBuilder struct definition, implement a with_secrets_context builder
method that allows setting this field, and ensure the field is properly passed
through and used when the Api is constructed so that tests can exercise both
enabled and disabled cases for the secrets/re-wrap path.
In `@crates/api-db/src/secrets.rs`:
- Around line 124-130: Replace the query pattern in both the `try_lock_path` and
`exists` functions that use `sqlx::query_as` to fetch a single boolean value
wrapped in a tuple. Instead of using `query_as` with tuple unpacking, use
`sqlx::query_scalar` which directly fetches the scalar boolean value without
requiring tuple unpacking. This makes the code clearer and more idiomatic for
single scalar value queries.
In `@crates/secrets/src/forge_vault.rs`:
- Around line 770-780: The EnumerationMode enum derives Clone, Copy, and
PartialEq, but since it only contains fieldless variants, it should also derive
Eq to satisfy clippy requirements in the warnings-as-errors workspace. Add Eq to
the derive attribute list for the EnumerationMode enum to keep it clippy-clean
and avoid warnings.
- Around line 1105-1108: The error returned by vault_config.token()? when
building the VaultClientSettingsBuilder is too generic and doesn't clarify the
root cause. Enhance the error handling around the vault_config.token()? call to
provide a more explicit error message that indicates static-token authentication
is required but unsupported for Transit KMS on service-account-auth deployments,
so users understand the authentication mode limitation rather than treating it
as a missing generic Vault setting.
- Around line 833-835: The warning logs in the vault path enumeration are
interpolating the error `e` directly into the message string using `{e}`, which
makes error filtering difficult for logfmt consumers. Refactor the
tracing::warn! calls to pass the error as a structured attribute instead of
including it in the message string. Remove the `{e}` from the message and add
`error = %e` (or similar attribute syntax) as a parameter to the tracing::warn!
macro, keeping the existing attributes like `prefix`. This change needs to be
applied at the location shown (lines 833-835 in the failed to list vault path
warning) and also at the similar warning around lines 937-939 as mentioned in
the comment.
In `@crates/secrets/src/lib.rs`:
- Around line 23-27: The comment above the re-export block in crate::forge_vault
states that the exports are only for CertificateProvider usage, but this is no
longer accurate now that create_raw_vault_client_settings is included in the
public API. Broaden the comment to accurately describe the public API intent for
all re-exported items (ForgeVaultClient, VaultConfig,
create_raw_vault_client_settings, and create_vault_client), making it clear that
these are exposed for multiple purposes beyond just CertificateProvider usage.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 9bd7cccb-fd3d-47e1-9d17-4ccbe5fee75a
⛔ Files ignored due to path filters (1)
Cargo.lockis excluded by!**/*.lock
📒 Files selected for processing (41)
crates/admin-cli/src/cfg/cli_options.rscrates/admin-cli/src/main.rscrates/admin-cli/src/secrets/mod.rscrates/admin-cli/src/secrets/re_wrap/args.rscrates/admin-cli/src/secrets/re_wrap/cmd.rscrates/admin-cli/src/secrets/re_wrap/mod.rscrates/api-core/Cargo.tomlcrates/api-core/src/api.rscrates/api-core/src/auth/internal_rbac_rules.rscrates/api-core/src/cfg/file.rscrates/api-core/src/handlers/mod.rscrates/api-core/src/handlers/secrets.rscrates/api-core/src/lib.rscrates/api-core/src/run.rscrates/api-core/src/secrets/import.rscrates/api-core/src/secrets/metrics.rscrates/api-core/src/secrets/mod.rscrates/api-core/src/secrets/re_wrap.rscrates/api-core/src/secrets/routing.rscrates/api-core/src/secrets/tests.rscrates/api-core/src/setup.rscrates/api-core/src/test_support/builder.rscrates/api-core/src/test_support/default_config.rscrates/api-db/migrations/20260416000305_create_secrets_table.sqlcrates/api-db/src/lib.rscrates/api-db/src/secrets.rscrates/api-integration-tests/Cargo.tomlcrates/api-integration-tests/tests/secrets_import.rscrates/api-model/src/lib.rscrates/api-model/src/secrets.rscrates/kms-provider/Cargo.tomlcrates/kms-provider/src/crypto.rscrates/kms-provider/src/lib.rscrates/kms-provider/src/providers/integrated.rscrates/kms-provider/src/providers/multi.rscrates/kms-provider/src/providers/transit.rscrates/rpc/proto/forge.protocrates/secrets/src/forge_vault.rscrates/secrets/src/lib.rscrates/uuid/src/lib.rscrates/uuid/src/secret.rs
| /// allocate_port picks a free port by binding to port 0. | ||
| fn allocate_port() -> SocketAddr { | ||
| let listener = TcpListener::bind("127.0.0.1:0").expect("bind to free port"); | ||
| listener.local_addr().expect("local addr") | ||
| } |
There was a problem hiding this comment.
Port reservation is racy and can cause intermittent test failures.
Line 31-35 releases the listener before Line 238 starts Vault, so another process can claim the port in between. This is a classic TOCTOU race and a common CI flake source for integration tests. Prefer a helper that binds and starts atomically (or retries on bind conflicts).
As per coding guidelines, integration tests should be deterministic and avoid flaky behavior.
Also applies to: 237-238
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@crates/api-integration-tests/tests/secrets_import.rs` around lines 31 - 35,
The allocate_port() function at lines 31-35 releases the TcpListener immediately
after binding, creating a race condition where another process can claim the
port before it's used at lines 237-238. Fix this by either keeping the listener
alive and passing it to the Vault startup code instead of just the port number,
or by implementing a retry mechanism that attempts to bind to the port
immediately before starting Vault (lines 237-238), retrying if the port becomes
unavailable due to race conditions. This ensures the port reservation and Vault
startup happen atomically without releasing the binding in between.
Source: Coding guidelines
There was a problem hiding this comment.
@coderabbitai Fair -- it is a real TOCTOU, but the atomic bind-and-launch would have to live in how api_test_helper spawns the external Vault process, not in this test, so it's out of scope for this PR. Leaving as-is. Thanks for checking!
There was a problem hiding this comment.
@chet, understood — the atomic bind-and-launch fix is appropriately scoped to the api_test_helper process spawner, not this test file. Happy to leave it as-is here.
Would you like me to open a follow-up GitHub issue to track the api_test_helper improvement so it is not lost?
There was a problem hiding this comment.
@coderabbitai Yes please! And if you can associate it as a sub-task of #353 that would be great. Thank you!
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
repo="NVIDIA/infra-controller"
parent_number="353"
child_url=$(gh issue create \
--repo "$repo" \
--title "api_test_helper: atomic port reservation for external process spawning" \
--assignee "chet" \
--body "## Background
During review of PR `#2665` (Postgres secrets backend), a TOCTOU race was identified in the integration test helper \`allocate_port()\` in \`crates/api-integration-tests/tests/secrets_import.rs\`:
\`\`\`rust
fn allocate_port() -> SocketAddr {
let listener = TcpListener::bind(\"127.0.0.1:0\").expect(\"bind to free port\");
listener.local_addr().expect(\"local addr\")
} // listener is dropped here, releasing the port
\`\`\`
The listener is dropped immediately after obtaining the address, so another process can claim the port between this call and when the external process (e.g., Vault dev server) is actually started. This is a well-known source of intermittent CI failures.
## Problem
The fix requires changes to the \`api_test_helper\` process spawner so that port reservation and external process startup happen atomically — either by passing the live \`TcpListener\` (held open) to the spawner, or by having the spawner accept a bound socket directly.
## Proposed Fix
In \`api_test_helper\`, update the Vault dev server launch path to:
1. Accept a \`TcpListener\` (already bound) instead of a plain port number, **or**
2. Internally bind, pass the socket fd to the child process, and release only after the child has confirmed it is listening.
This keeps the reservation live until the child process owns the port, eliminating the race.
## References
- PR: https://github.com/NVIDIA/infra-controller/pull/2665
- Review comment: https://github.com/NVIDIA/infra-controller/pull/2665#discussion_r3423994822
- Requested by: `@chet`")
echo "Created issue: $child_url"
child_number="${child_url##*/}"
echo "Child issue number: $child_number"
parent_id=$(gh api graphql \
-F owner="NVIDIA" \
-F name="infra-controller" \
-F parentNumber="$parent_number" \
-f query='
query($owner: String!, $name: String!, $parentNumber: Int!) {
repository(owner: $owner, name: $name) {
issue(number: $parentNumber) { id }
}
}' \
--jq '.data.repository.issue.id')
child_id=$(gh api graphql \
-F owner="NVIDIA" \
-F name="infra-controller" \
-F childNumber="$child_number" \
-f query='
query($owner: String!, $name: String!, $childNumber: Int!) {
repository(owner: $owner, name: $name) {
issue(number: $childNumber) { id }
}
}' \
--jq '.data.repository.issue.id')
gh api graphql \
-F parentId="$parent_id" \
-F childId="$child_id" \
-f query='
mutation($parentId: ID!, $childId: ID!) {
addSubIssue(input: {issueId: $parentId, subIssueId: $childId}) {
issue { url }
subIssue { url }
}
}'Length of output: 3214
@chet, done! Issue #2777 — "api_test_helper: atomic port reservation for external process spawning" — has been created, assigned to you, and linked as a sub-task of #353.
db91806 to
04013cf
Compare
|
@coderabbitai PTAL again, fixed some conflicts, thanks! |
|
✅ Action performedReview finished.
|
|
@coderabbitai review |
✅ Action performedReview finished.
|
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (1)
crates/api-db/src/secrets.rs (1)
61-78: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick winReturn the generated
SecretIdfrominsert.
insertcreates the row identity and then discards it. Returning it lets rollback code calldelete_by_idon the exact row it appended, instead of rediscovering by path/order under concurrency.Proposed API adjustment
-pub async fn insert(txn: &mut PgConnection, entry: &NewSecretEntry<'_>) -> DatabaseResult<()> { +pub async fn insert(txn: &mut PgConnection, entry: &NewSecretEntry<'_>) -> DatabaseResult<SecretId> { let secret_id = SecretId::new(); let sql = "INSERT INTO secrets (secret_id, path, encrypted_value, nonce, kek_id, encrypted_dek, dek_nonce) VALUES ($1, $2, $3, $4, $5, $6, $7)"; @@ .await .map_err(|e| DatabaseError::query(sql, e))?; - Ok(()) + Ok(secret_id) }Based on PR objectives, rotation rollback is a supported behavior for this journal.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/api-db/src/secrets.rs` around lines 61 - 78, The `insert` function currently generates a `SecretId` but discards it after inserting the row, returning `DatabaseResult<()>` instead. Change the return type of the `insert` function from `DatabaseResult<()>` to `DatabaseResult<SecretId>`, and replace the final `Ok(())` statement with `Ok(secret_id)` after the successful query execution. This enables rollback code to delete the inserted row using the exact `secret_id` that was created, avoiding the need to rediscover it by path under concurrent conditions.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/api-db/src/secrets.rs`:
- Around line 150-157: The unlock_path_session function currently ignores the
boolean return value from PostgreSQL's pg_advisory_unlock() function, which
indicates whether the lock was successfully released (returns true) or failed to
release (returns false). Change the function signature to return
DatabaseResult<bool> instead of DatabaseResult<()>, and replace sqlx::query()
with sqlx::query_scalar() and the execute() call with fetch_one() to capture and
return the boolean result from pg_advisory_unlock(). This ensures that
lock-release failures are observable to callers and aligns with the established
pattern used in try_lock_path_session() within the same module.
- Around line 170-178: The `delete_all` function in this file performs
destructive deletion which violates the append-only journal contract documented
in secrets/mod.rs. Instead of using `delete_all` in the `delete_credentials`
method, write an empty-password tombstone entry for the path to preserve history
and maintain append-only semantics, similar to how other credential types (UFM,
SiteWideBmc) already implement deletion. Alternatively, if physical deletion is
only needed for test fixtures, gate the `delete_all` function with
`#[cfg(test)]` to restrict its usage to test-only code paths.
- Around line 81-96: The function `insert_if_missing` currently accepts a plain
PgConnection parameter which could allow autocommit connections and cause the
advisory lock to release prematurely, creating a race condition where concurrent
callers could both insert entries. Change the type signature of
`insert_if_missing` to require `&mut Transaction<'_, Postgres>` instead of `&mut
PgConnection` to ensure the lock is held for the transaction's duration. Apply
the same type tightening to the `lock_path` function as well to maintain
consistency and safety.
---
Nitpick comments:
In `@crates/api-db/src/secrets.rs`:
- Around line 61-78: The `insert` function currently generates a `SecretId` but
discards it after inserting the row, returning `DatabaseResult<()>` instead.
Change the return type of the `insert` function from `DatabaseResult<()>` to
`DatabaseResult<SecretId>`, and replace the final `Ok(())` statement with
`Ok(secret_id)` after the successful query execution. This enables rollback code
to delete the inserted row using the exact `secret_id` that was created,
avoiding the need to rediscover it by path under concurrent conditions.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 4ba2c609-05ec-4ffe-b4f7-3a2aa64aa4e0
⛔ Files ignored due to path filters (1)
Cargo.lockis excluded by!**/*.lock
📒 Files selected for processing (42)
crates/admin-cli/cli_domains.yamlcrates/admin-cli/src/cfg/cli_options.rscrates/admin-cli/src/main.rscrates/admin-cli/src/secrets/mod.rscrates/admin-cli/src/secrets/re_wrap/args.rscrates/admin-cli/src/secrets/re_wrap/cmd.rscrates/admin-cli/src/secrets/re_wrap/mod.rscrates/api-core/Cargo.tomlcrates/api-core/src/api.rscrates/api-core/src/auth/internal_rbac_rules.rscrates/api-core/src/cfg/file.rscrates/api-core/src/handlers/mod.rscrates/api-core/src/handlers/secrets.rscrates/api-core/src/lib.rscrates/api-core/src/run.rscrates/api-core/src/secrets/import.rscrates/api-core/src/secrets/metrics.rscrates/api-core/src/secrets/mod.rscrates/api-core/src/secrets/re_wrap.rscrates/api-core/src/secrets/routing.rscrates/api-core/src/secrets/tests.rscrates/api-core/src/setup.rscrates/api-core/src/test_support/builder.rscrates/api-core/src/test_support/default_config.rscrates/api-db/migrations/20260416000305_create_secrets_table.sqlcrates/api-db/src/lib.rscrates/api-db/src/secrets.rscrates/api-integration-tests/Cargo.tomlcrates/api-integration-tests/tests/secrets_import.rscrates/api-model/src/lib.rscrates/api-model/src/secrets.rscrates/kms-provider/Cargo.tomlcrates/kms-provider/src/crypto.rscrates/kms-provider/src/lib.rscrates/kms-provider/src/providers/integrated.rscrates/kms-provider/src/providers/multi.rscrates/kms-provider/src/providers/transit.rscrates/rpc/proto/forge.protocrates/secrets/src/forge_vault.rscrates/secrets/src/lib.rscrates/uuid/src/lib.rscrates/uuid/src/secret.rs
💤 Files with no reviewable changes (11)
- crates/uuid/src/lib.rs
- crates/kms-provider/src/providers/multi.rs
- crates/uuid/src/secret.rs
- crates/kms-provider/Cargo.toml
- crates/secrets/src/lib.rs
- crates/kms-provider/src/lib.rs
- crates/rpc/proto/forge.proto
- crates/kms-provider/src/providers/integrated.rs
- crates/secrets/src/forge_vault.rs
- crates/kms-provider/src/crypto.rs
- crates/kms-provider/src/providers/transit.rs
✅ Files skipped from review due to trivial changes (3)
- crates/api-core/src/lib.rs
- crates/api-core/src/handlers/mod.rs
- crates/api-core/src/test_support/default_config.rs
🚧 Files skipped from review as they are similar to previous changes (25)
- crates/api-db/src/lib.rs
- crates/admin-cli/src/secrets/re_wrap/args.rs
- crates/admin-cli/src/secrets/re_wrap/mod.rs
- crates/api-core/src/handlers/secrets.rs
- crates/api-model/src/secrets.rs
- crates/admin-cli/src/cfg/cli_options.rs
- crates/api-core/src/auth/internal_rbac_rules.rs
- crates/api-model/src/lib.rs
- crates/admin-cli/src/secrets/mod.rs
- crates/api-core/src/api.rs
- crates/api-core/src/secrets/import.rs
- crates/api-core/src/secrets/metrics.rs
- crates/api-core/src/secrets/routing.rs
- crates/admin-cli/src/main.rs
- crates/admin-cli/src/secrets/re_wrap/cmd.rs
- crates/api-db/migrations/20260416000305_create_secrets_table.sql
- crates/api-integration-tests/Cargo.toml
- crates/api-core/src/setup.rs
- crates/api-core/src/secrets/mod.rs
- crates/api-integration-tests/tests/secrets_import.rs
- crates/api-core/src/secrets/re_wrap.rs
- crates/api-core/src/secrets/tests.rs
- crates/api-core/src/run.rs
- crates/api-core/src/cfg/file.rs
- crates/api-core/Cargo.toml
| -- | ||
| -- The table is an append-only journal: every write inserts a new row, and a | ||
| -- read returns the newest row for the path. seq is the journal order -- | ||
| -- created_at cannot be, because Postgres fixes now() per transaction, so |
There was a problem hiding this comment.
I've seen us pass in the actual "now" time independently of the transaction "now" time to avoid this. still not a guarantee, though...
04013cf to
5b6f8f5
Compare
🔍 Container Scan Summary
Per-CVE detail lives in the per-service |
Carbide can now keep its credentials in Postgres instead of Vault, encrypted so that a copy of the database on its own gives up nothing. This swaps out Vault KV as the credential store; Vault still issues the PKI certificates. The encryption is layered. Every credential gets its own data key, and that data key is wrapped by a key encryption key (KEK) that lives outside the database -- in carbide's own config or an external KMS. The credential's path is mixed into the encryption as well, so a row lifted out and dropped onto another path will not decrypt. The KEK is the only thing an operator has to protect; everything else lives in Postgres. `PostgresCredentialManager` implements the same `CredentialManager` traits the rest of the system already uses, so nothing downstream had to change. It keeps the two reader behaviors callers learned from Vault: the newest write for a path wins, and an empty-password entry reads as "no credential" (several delete paths still record that tombstone). The store is an append-only journal -- one row per write, newest-per-path on read -- which is what the upcoming rotation work uses for history and rollback. Where the wrapping keys come from is pluggable through `KmsBackend`: local key material (`integrated`), or Vault/OpenBao Transit, and more than one provider at once. `[secrets.routing]` maps path prefixes to the KEK that encrypts new writes under them. Startup cross-checks the routing against the providers, so a misspelled, duplicated, or colliding key fails the boot instead of some write hours later. Moving an existing site off Vault is a one-time import at startup, and it is deliberately all-or-nothing: any list or read failure, or an empty listing, aborts the boot rather than recording a half-finished import as done, and only one replica runs it at a time behind a Postgres advisory lock. Once it completes, Vault is out of the credential chain entirely -- with `[secrets]` set, the chain is env -> file -> postgres and nothing falls back to Vault. Prerequisites that live outside this process (services that still read Vault directly, mixed-fleet rolling upgrades) are spelled out on `SecretsConfig`. Rotating a KEK is a config change plus one command: point the route at the new key and run `carbide-admin-cli secrets re-wrap`. Only the per-row data-key wrapping is redone; the encrypted values are never touched. The re-wrap makes its KMS calls outside the write transaction, runs one at a time, and reports how many rows still sit on a retired key so you know when the old KEK is safe to remove. Tests cover the manager round-trip, journal write-order under equal timestamps, rotation rollback by deleting the newest entry, tombstone reads, create conflicts, re-wrap idempotence and counting, and a path-binding regression that confirms a transplanted row will not decrypt -- plus an end-to-end Vault import against a real Vault dev server. This supports NVIDIA#353. Signed-off-by: Chet Nichols III <chetn@nvidia.com>
5b6f8f5 to
c5b4bf8
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (1)
crates/kms-provider/src/providers/integrated.rs (1)
152-152: 🔒 Security & Privacy | 🔵 Trivial | ⚡ Quick winConsider binding DEK wraps to context via AAD.
The DEK encryption and decryption pass an empty byte slice as the associated data parameter (
b""), leaving the DEK ciphertext unbound to any context such as thekek_idor path. While thekek_idis explicitly tracked in the database and validated at startup, cryptographically binding the wrap to thekek_idwould provide defense in depth—if KEK IDs were ever mixed up due to a configuration error or migration bug, mismatched unwraps would fail at the crypto layer rather than succeeding with the wrong key.🔐 Proposed enhancement for defense in depth
async fn encrypt_dek(&self, kek_id: &str, dek: &[u8; 32]) -> Result<EncryptedDek, KmsError> { let kek = self .keys .get(kek_id) .ok_or_else(|| KmsError::KeyNotFound(kek_id.to_string()))?; - let (ciphertext, nonce) = crypto::encrypt(kek, dek, b"")?; + let (ciphertext, nonce) = crypto::encrypt(kek, dek, kek_id.as_bytes())?; Ok(EncryptedDek { ciphertext, nonce }) } async fn decrypt_dek( &self, kek_id: &str, encrypted: &EncryptedDek, ) -> Result<Zeroizing<[u8; 32]>, KmsError> { let kek = self .keys .get(kek_id) .ok_or_else(|| KmsError::KeyNotFound(kek_id.to_string()))?; - let mut plaintext = crypto::decrypt(kek, &encrypted.nonce, &encrypted.ciphertext, b"")?; + let mut plaintext = crypto::decrypt(kek, &encrypted.nonce, &encrypted.ciphertext, kek_id.as_bytes())?; let len = plaintext.len();This change would need to be applied consistently across all
KmsBackendimplementations (integrated, transit, multi) to maintain the contract.Also applies to: 165-165
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/kms-provider/src/providers/integrated.rs` at line 152, The crypto::encrypt call at line 152 and the corresponding decrypt operation at line 165 currently pass an empty byte slice as the associated authenticated data (AAD) parameter, leaving the DEK ciphertext unbound to any context. Modify both the encryption and decryption calls to bind the DEK wrap to the kek_id by passing the kek_id as the AAD parameter instead of b"". This change must be applied consistently across all KmsBackend implementations (integrated, transit, and multi) to maintain the consistent contract and provide defense in depth against key ID confusion errors.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@crates/kms-provider/src/providers/integrated.rs`:
- Line 152: The crypto::encrypt call at line 152 and the corresponding decrypt
operation at line 165 currently pass an empty byte slice as the associated
authenticated data (AAD) parameter, leaving the DEK ciphertext unbound to any
context. Modify both the encryption and decryption calls to bind the DEK wrap to
the kek_id by passing the kek_id as the AAD parameter instead of b"". This
change must be applied consistently across all KmsBackend implementations
(integrated, transit, and multi) to maintain the consistent contract and provide
defense in depth against key ID confusion errors.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 29072d47-870b-4926-b3a2-833cb24c429b
⛔ Files ignored due to path filters (1)
Cargo.lockis excluded by!**/*.lock
📒 Files selected for processing (42)
crates/admin-cli/cli_domains.yamlcrates/admin-cli/src/cfg/cli_options.rscrates/admin-cli/src/main.rscrates/admin-cli/src/secrets/mod.rscrates/admin-cli/src/secrets/re_wrap/args.rscrates/admin-cli/src/secrets/re_wrap/cmd.rscrates/admin-cli/src/secrets/re_wrap/mod.rscrates/api-core/Cargo.tomlcrates/api-core/src/api.rscrates/api-core/src/auth/internal_rbac_rules.rscrates/api-core/src/cfg/file.rscrates/api-core/src/handlers/mod.rscrates/api-core/src/handlers/secrets.rscrates/api-core/src/lib.rscrates/api-core/src/run.rscrates/api-core/src/secrets/import.rscrates/api-core/src/secrets/metrics.rscrates/api-core/src/secrets/mod.rscrates/api-core/src/secrets/re_wrap.rscrates/api-core/src/secrets/routing.rscrates/api-core/src/secrets/tests.rscrates/api-core/src/setup.rscrates/api-core/src/test_support/builder.rscrates/api-core/src/test_support/default_config.rscrates/api-db/migrations/20260416000305_create_secrets_table.sqlcrates/api-db/src/lib.rscrates/api-db/src/secrets.rscrates/api-integration-tests/Cargo.tomlcrates/api-integration-tests/tests/secrets_import.rscrates/api-model/src/lib.rscrates/api-model/src/secrets.rscrates/kms-provider/Cargo.tomlcrates/kms-provider/src/crypto.rscrates/kms-provider/src/lib.rscrates/kms-provider/src/providers/integrated.rscrates/kms-provider/src/providers/multi.rscrates/kms-provider/src/providers/transit.rscrates/rpc/proto/forge.protocrates/secrets/src/forge_vault.rscrates/secrets/src/lib.rscrates/uuid/src/lib.rscrates/uuid/src/secret.rs
✅ Files skipped from review due to trivial changes (2)
- crates/secrets/src/lib.rs
- crates/api-core/src/handlers/mod.rs
🚧 Files skipped from review as they are similar to previous changes (38)
- crates/api-model/src/lib.rs
- crates/admin-cli/src/main.rs
- crates/api-core/src/lib.rs
- crates/kms-provider/src/lib.rs
- crates/uuid/src/lib.rs
- crates/api-integration-tests/Cargo.toml
- crates/kms-provider/Cargo.toml
- crates/admin-cli/src/secrets/mod.rs
- crates/api-core/src/test_support/default_config.rs
- crates/api-db/src/lib.rs
- crates/api-core/src/auth/internal_rbac_rules.rs
- crates/kms-provider/src/providers/multi.rs
- crates/uuid/src/secret.rs
- crates/admin-cli/src/secrets/re_wrap/cmd.rs
- crates/api-core/src/handlers/secrets.rs
- crates/api-core/src/secrets/metrics.rs
- crates/admin-cli/cli_domains.yaml
- crates/rpc/proto/forge.proto
- crates/api-model/src/secrets.rs
- crates/admin-cli/src/secrets/re_wrap/mod.rs
- crates/admin-cli/src/cfg/cli_options.rs
- crates/admin-cli/src/secrets/re_wrap/args.rs
- crates/api-core/src/secrets/tests.rs
- crates/kms-provider/src/crypto.rs
- crates/api-core/src/secrets/re_wrap.rs
- crates/api-core/Cargo.toml
- crates/api-core/src/setup.rs
- crates/api-core/src/api.rs
- crates/kms-provider/src/providers/transit.rs
- crates/api-db/migrations/20260416000305_create_secrets_table.sql
- crates/api-integration-tests/tests/secrets_import.rs
- crates/api-core/src/run.rs
- crates/api-core/src/test_support/builder.rs
- crates/secrets/src/forge_vault.rs
- crates/api-core/src/cfg/file.rs
- crates/api-core/src/secrets/mod.rs
- crates/api-db/src/secrets.rs
- crates/api-core/src/secrets/import.rs
Description
This supports #353.
This introduces a new secrets/credentials storage backend of sorts, such that NICo can now keep its credentials in Postgres instead of Vault, encrypted (with envelope encryption, leaning on #939) so that a copy of the database on its own gives up nothing. This entirely feature/flagged and configurable, currently not enabled at all, but will let us:
The encryption is layered. Every credential gets its own data encryption key (DEK), and that DEK is wrapped by a key encryption key (KEK) that lives outside the database -- in NICo itself or whatever
KmsProviderwe choose. The credential's path is mixed into the encryption as well. The KEK is the only thing an operator has to protect; everything else lives in Postgres (some reference here and here.PostgresCredentialManagerimplements the sameCredentialManagertraits the rest of the system already uses, so nothing downstream had to change. It keeps the two reader behaviors callers learned from Vault: the newest write for a path wins, and an empty-password entry reads as "no credential" (several delete paths still record that tombstone). The store is an append-only journal -- one row per write, newest-per-path on read -- which the rotation work going on in #367 can use for history, rollback, rotation, etc.Where the wrapping keys come from is pluggable through
KmsBackend: local key material (integrated), or Vault/OpenBao Transit secrets engine API, and more than one provider at once.[secrets.routing]maps path prefixes to the KEK that encrypts new writes under them. Startup cross-checks the routing against the providers, so a misspelled, duplicated, or colliding key fails at startup.Moving an existing site off Vault could be:
[secrets]set, the chain is env -> file -> postgres and nothing falls back to Vault. Prerequisites that live outside this process (services that still read Vault directly, mixed-fleet rolling upgrades) are spelled out onSecretsConfig.CredentialWriter, such that all new credentials get written to Postgres, with Vault still in theChainedCredentialReaderpath, allowing us to slowly migrate over.BUT, migration is NOT a part of this PR. This PR is for getting some initial code in place, and then we will iterate on it, and then do subsequent work for migration planning.
Rotating a KEK is a config change plus one command: point the route at the new key and run
carbide-admin-cli secrets re-wrap. Only the per-row data-key wrapping is redone; the encrypted values are never touched. The re-wrap makes its KMS calls outside the write transaction, runs one at a time, and reports how many rows still sit on a retired key so you know when the old KEK is safe to remove.Tests cover the manager round-trip, journal write-order under equal timestamps, rotation rollback by deleting the newest entry, tombstone reads, create conflicts, re-wrap idempotence and counting, and a path-binding regression that confirms a transplanted row will not decrypt -- plus an end-to-end Vault import against a real Vault dev server.
Signed-off-by: Chet Nichols III chetn@nvidia.com
Related issues
Type of Change
Breaking Changes
Testing
Additional Notes