Commit ff893a7

feat: support storing secrets/credentials in Postgres (#2665)
## Description

This supports #353.

This introduces a new secrets/credentials storage backend, so that _NICo_ can keep its credentials in _Postgres_ instead of _Vault_, encrypted (with envelope encryption, leaning on #939) so that a copy of the database on its own gives up nothing.

This is entirely feature-flagged and configurable, and currently not enabled at all, but it will let us:

- Use it _alongside_ Vault.
- Use it to _migrate away_ from Vault.

The encryption is layered. Every credential gets its own _data encryption key_ (DEK), and that DEK is wrapped by a _key encryption key_ (KEK) that lives outside the database -- in NICo itself or whatever `KmsProvider` we choose. The credential's path is mixed into the encryption as well. The KEK is the only thing an operator has to protect; everything else lives in Postgres (some reference [here](https://docs.aws.amazon.com/encryption-sdk/latest/developer-guide/concepts.html#envelope-encryption) and [here](https://docs.cloud.google.com/kms/docs/envelope-encryption)).

`PostgresCredentialManager` implements the same `CredentialManager` traits the rest of the system already uses, so nothing downstream had to change. It keeps the two reader behaviors callers learned from Vault: the newest write for a path wins, and an empty-password entry reads as "no credential" (several delete paths still record that tombstone). The store is an append-only journal -- one row per write, newest-per-path on read -- which the rotation work going on in #367 can use for history, rollback, rotation, etc.

Where the wrapping keys come from is pluggable through `KmsBackend`: local key material (`integrated`), or the Vault/OpenBao [Transit secrets engine API](https://developer.hashicorp.com/vault/api-docs/secret/transit), and more than one provider can be configured at once. `[secrets.routing]` maps path prefixes to the KEK that encrypts new writes under them. Startup cross-checks the routing against the providers, so a misspelled, duplicated, or colliding key fails at startup.

Moving an existing site off _Vault_ could be:

- **A one-time import at startup** -- deliberately all-or-nothing: any list or read failure, or an empty listing, aborts the boot rather than recording a half-finished import as done, and only one replica runs it at a time behind a Postgres advisory lock. Once it completes, Vault is out of the credential chain entirely -- with `[secrets]` set, the chain is env -> file -> postgres and nothing falls back to Vault. Prerequisites that live outside this process (services that still read Vault directly, mixed-fleet rolling upgrades) are spelled out on `SecretsConfig`.
- **A slow migration** -- we would introduce it as the primary `CredentialWriter`, so that all new credentials get written to _Postgres_, with _Vault_ still in the `ChainedCredentialReader` path, allowing us to migrate over gradually.

**BUT, migration is NOT a part of this PR. This PR is for getting some initial code in place, and then we will iterate on it, and then do subsequent work for migration planning.**

Rotating a KEK is a config change plus one command: point the route at the new key and run `carbide-admin-cli secrets re-wrap`. Only the per-row data-key wrapping is redone; the encrypted values are never touched. The re-wrap makes its KMS calls outside the write transaction, runs one at a time, and reports how many rows still sit on a retired key so you know when the old KEK is safe to remove.

Tests cover the manager round-trip, journal write-order under equal timestamps, rotation rollback by deleting the newest entry, tombstone reads, create conflicts, re-wrap idempotence and counting, and a path-binding regression that confirms a transplanted row will not decrypt -- plus an end-to-end Vault import against a real Vault dev server.

Signed-off-by: Chet Nichols III <chetn@nvidia.com>

## Related issues

<!-- Refer to existing GitHub issues here -->

## Type of Change

<!-- Check one that best describes this PR -->

- [x] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [ ] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Breaking Changes

<!-- If checked, describe the breaking changes and migration steps -->
<!-- Breaking changes are not generally permitted, please discuss on a GitHub discussion or with the development team if you believe you need to break a backward compatibility guarantee -->

- [ ] **This PR contains breaking changes**

## Testing

<!-- How was this tested? Check all that apply -->

- [x] Unit tests added/updated
- [x] Integration tests added/updated
- [ ] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

## Additional Notes

<!-- Any additional context, deployment notes, or reviewer guidance -->
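The reader behavior described above (append-only journal, newest write per path wins, an empty password reads as "no credential", rollback by deleting the newest entry) can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: the `Journal` and `JournalEntry` types, the `seq` ordering column, and the method names are hypothetical and are not the PR's actual schema or API, and the password is kept in plaintext here purely to keep the example short.

```rust
/// Hypothetical journal row: every write appends, nothing is updated in place.
#[derive(Debug, Clone)]
struct JournalEntry {
    seq: u64,         // assumed monotonically increasing write order
    path: String,     // credential path
    password: String, // empty string acts as a tombstone
}

/// Hypothetical in-memory stand-in for the Postgres journal table.
#[derive(Default)]
struct Journal {
    rows: Vec<JournalEntry>,
    next_seq: u64,
}

impl Journal {
    /// Append one row per write.
    fn append(&mut self, path: &str, password: &str) {
        self.rows.push(JournalEntry {
            seq: self.next_seq,
            path: path.to_string(),
            password: password.to_string(),
        });
        self.next_seq += 1;
    }

    /// Newest write for a path wins; an empty password reads as "no credential".
    fn read(&self, path: &str) -> Option<&str> {
        self.rows
            .iter()
            .filter(|row| row.path == path)
            .max_by_key(|row| row.seq)
            .map(|row| row.password.as_str())
            .filter(|password| !password.is_empty())
    }

    /// Roll back by deleting the newest entry for a path, exposing the previous one.
    fn rollback_newest(&mut self, path: &str) -> bool {
        let newest = self
            .rows
            .iter()
            .enumerate()
            .filter(|(_, row)| row.path == path)
            .max_by_key(|(_, row)| row.seq)
            .map(|(index, _)| index);
        match newest {
            Some(index) => {
                self.rows.remove(index);
                true
            }
            None => false,
        }
    }
}

fn main() {
    let mut journal = Journal::default();
    journal.append("bmc/host-1", "old-secret");
    journal.append("bmc/host-1", "new-secret");
    println!("after two writes: {:?}", journal.read("bmc/host-1"));

    journal.append("bmc/host-1", ""); // a delete path recording a tombstone
    println!("after tombstone: {:?}", journal.read("bmc/host-1"));

    journal.rollback_newest("bmc/host-1");
    println!("after rollback: {:?}", journal.read("bmc/host-1"));
}
```

Because nothing is overwritten, history and rollback fall out of the data model; the trade-off is that every read has to select the newest row for its path.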
1 parent 5cfb65c commit ff893a7

43 files changed

Lines changed: 3449 additions & 158 deletions

Cargo.lock

Lines changed: 8 additions & 0 deletions

crates/admin-cli/cli_domains.yaml

Lines changed: 1 addition & 0 deletions
@@ -81,3 +81,4 @@ admin:
 - dev-env
 - ssh
 - jump
+- secrets

crates/admin-cli/src/cfg/cli_options.rs

Lines changed: 5 additions & 3 deletions
@@ -25,9 +25,9 @@ use crate::{
     ipxe_template, jump, machine, machine_interfaces, machine_validation, managed_host,
     managed_switch, mlx, network_devices, network_security_group, network_segment, nvl_domain,
     nvl_logical_partition, nvl_partition, nvlink_nmxc_endpoints, operating_system, os_image, ping,
-    power_shelf, rack, redfish, resource_pool, rms, route_server, scout_stream, set, site_explorer,
-    sku, spx_partition, ssh, switch, tenant, tenant_keyset, tpm_ca, trim_table, version, vpc,
-    vpc_peering, vpc_prefix,
+    power_shelf, rack, redfish, resource_pool, rms, route_server, scout_stream, secrets, set,
+    site_explorer, sku, spx_partition, ssh, switch, tenant, tenant_keyset, tpm_ca, trim_table,
+    version, vpc, vpc_peering, vpc_prefix,
 };
 
 #[derive(Parser, Debug)]
@@ -202,6 +202,8 @@ pub enum CliCommand {
     ExtensionService(extension_service::Cmd),
     #[clap(about = "Firmware related actions", subcommand)]
     Firmware(firmware::Cmd),
+    #[clap(about = "Secrets management", subcommand)]
+    Secrets(secrets::Cmd),
     #[clap(
         about = "Regenerate the docs/manuals/nico-admin-cli markdown reference",
         hide = true

crates/admin-cli/src/main.rs

Lines changed: 2 additions & 0 deletions
@@ -103,6 +103,7 @@ mod rms;
 mod route_server;
 mod rpc;
 mod scout_stream;
+mod secrets;
 mod set;
 mod site_explorer;
 mod sku;
@@ -274,6 +275,7 @@ async fn main() -> color_eyre::Result<()> {
         CliCommand::ResourcePool(cmd) => cmd.dispatch(ctx).await?,
         CliCommand::RouteServer(cmd) => cmd.dispatch(ctx).await?,
         CliCommand::ScoutStream(cmd) => cmd.dispatch(ctx).await?,
+        CliCommand::Secrets(cmd) => cmd.dispatch(ctx).await?,
         CliCommand::Set(cmd) => cmd.dispatch(ctx).await?,
         CliCommand::Ssh(cmd) => cmd.dispatch(ctx).await?,
         CliCommand::SiteExplorer(cmd) => cmd.dispatch(ctx).await?,
Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
+/*
+ * SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: Apache-2.0
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+mod re_wrap;
+
+use clap::Parser;
+
+use crate::cfg::dispatch::Dispatch;
+
+#[derive(Parser, Debug, Clone, Dispatch)]
+#[clap(rename_all = "kebab_case")]
+pub enum Cmd {
+    #[clap(about = "Re-wrap secret DEKs to use the \
+                    currently active KEK per routing \
+                    config")]
+    ReWrap(re_wrap::Args),
+}
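The "currently active KEK per routing config" in the help text above refers to `[secrets.routing]`, which the commit message says maps path prefixes to KEKs and is cross-checked against the configured providers at startup. A rough sketch of that idea follows. It assumes longest-prefix-wins matching and a simplified validation (duplicate prefixes and unknown KEK names only); neither detail is confirmed by this diff, and the `Routing` type and its methods are hypothetical.

```rust
use std::collections::HashSet;

/// Hypothetical routing table: (path prefix, KEK identifier) pairs.
struct Routing {
    routes: Vec<(String, String)>,
}

impl Routing {
    /// Build the table, failing early on a duplicated prefix or a KEK name
    /// that no configured provider knows about (a simplified stand-in for
    /// the startup cross-check described in the commit message).
    fn new(routes: &[(&str, &str)], known_keks: &[&str]) -> Result<Self, String> {
        let mut seen = HashSet::new();
        for (prefix, kek) in routes {
            if !seen.insert(*prefix) {
                return Err(format!("duplicate route prefix: {prefix}"));
            }
            if !known_keks.contains(kek) {
                return Err(format!("route {prefix} names unknown KEK {kek}"));
            }
        }
        Ok(Self {
            routes: routes
                .iter()
                .map(|(prefix, kek)| (prefix.to_string(), kek.to_string()))
                .collect(),
        })
    }

    /// Pick the KEK for new writes under `path`.
    /// Assumption: the longest matching prefix wins.
    fn kek_for(&self, path: &str) -> Option<&str> {
        self.routes
            .iter()
            .filter(|(prefix, _)| path.starts_with(prefix.as_str()))
            .max_by_key(|(prefix, _)| prefix.len())
            .map(|(_, kek)| kek.as_str())
    }
}

fn main() {
    let routing = Routing::new(
        &[("bmc/", "kek-a"), ("bmc/lab/", "kek-b")],
        &["kek-a", "kek-b"],
    )
    .expect("routing should validate");

    println!("{:?}", routing.kek_for("bmc/lab/host-1"));
    println!("{:?}", routing.kek_for("bmc/prod/host-1"));
    println!("{:?}", routing.kek_for("unrouted/path"));
}
```

Validating in the constructor mirrors the stated goal that a misspelled or duplicated key should stop the process at startup rather than surface later as a failed write.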
Lines changed: 38 additions & 0 deletions
@@ -0,0 +1,38 @@
+/*
+ * SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: Apache-2.0
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+use clap::Parser;
+
+#[derive(Parser, Debug, Clone)]
+#[command(after_long_help = "\
+EXAMPLES:
+
+Re-wrap every credential whose KEK no longer matches the routing config
+(run this after rotating a key in [secrets.routing]):
+$ nico-admin-cli secrets re-wrap
+
+Use a smaller batch size to lighten load on an external KMS:
+$ nico-admin-cli secrets re-wrap --batch-size 25
+
+")]
+pub struct Args {
+    #[clap(
+        long,
+        help = "Rows scanned per batch during the walk. The server applies its own default and limits."
+    )]
+    pub batch_size: Option<u32>,
+}
Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
+/*
+ * SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: Apache-2.0
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+use crate::errors::CarbideCliResult;
+use crate::rpc::ApiClient;
+
+pub async fn re_wrap(api_client: &ApiClient, batch_size: Option<u32>) -> CarbideCliResult<()> {
+    let request = ::rpc::forge::ReWrapSecretsRequest { batch_size };
+
+    let resp = api_client.0.re_wrap_secrets(request).await?;
+
+    println!(
+        "Re-wrap complete: {} re-wrapped, {} already current",
+        resp.re_wrapped, resp.already_current
+    );
+    if resp.stale_remaining == 0 {
+        println!(
+            "No rows remain on KEKs outside the routing config; unrouted KEKs can be retired."
+        );
+    } else {
+        println!(
+            "{} rows are still wrapped by KEKs outside the routing config -- \
+             concurrent writers likely landed rows mid-walk; run re-wrap again.",
+            resp.stale_remaining
+        );
+    }
+    Ok(())
+}
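The three counters printed above (`re_wrapped`, `already_current`, `stale_remaining`) come from the server-side handler, which is not visible in this part of the diff. As a rough mental model only, the walk could be pictured as below. Everything here is an assumption made for illustration: the `Row` and `Report` types, the longest-prefix routing, and the way `stale_remaining` is counted are hypothetical. The sketch also swaps only a KEK label per row; the real operation re-wraps the row's DEK through a KMS and leaves the encrypted value untouched.

```rust
/// Hypothetical stored row: only the fields the walk needs.
struct Row {
    path: String,
    kek_id: String, // which KEK currently wraps this row's DEK
}

/// Mirrors the three counters the CLI prints.
#[derive(Debug, Default, PartialEq)]
struct Report {
    re_wrapped: u64,
    already_current: u64,
    stale_remaining: u64,
}

/// Assumed routing rule: the longest matching prefix wins.
fn route_for<'a>(routes: &[(&'a str, &'a str)], path: &str) -> Option<&'a str> {
    routes
        .iter()
        .filter(|(prefix, _)| path.starts_with(*prefix))
        .max_by_key(|(prefix, _)| prefix.len())
        .map(|(_, kek)| *kek)
}

/// Walk all rows once, moving each routed row onto its target KEK.
fn re_wrap(rows: &mut [Row], routes: &[(&str, &str)]) -> Report {
    let mut report = Report::default();
    for row in rows.iter_mut() {
        if let Some(target) = route_for(routes, &row.path) {
            if row.kek_id == target {
                report.already_current += 1;
            } else {
                // The real system would unwrap the DEK with the old KEK and
                // wrap it with the new one here; the ciphertext is not touched.
                row.kek_id = target.to_string();
                report.re_wrapped += 1;
            }
        }
    }
    // Rows whose KEK is not named by any route still pin a retired key.
    report.stale_remaining = rows
        .iter()
        .filter(|row| !routes.iter().any(|(_, kek)| *kek == row.kek_id))
        .count() as u64;
    report
}

fn main() {
    let mut rows = vec![
        Row { path: "bmc/host-1".into(), kek_id: "kek-old".into() },
        Row { path: "bmc/host-2".into(), kek_id: "kek-new".into() },
    ];
    let routes = [("bmc/", "kek-new")];
    println!("first run:  {:?}", re_wrap(&mut rows, &routes));
    println!("second run: {:?}", re_wrap(&mut rows, &routes));
}
```

Running the walk twice shows the idempotence the tests mention: the second pass finds every routed row already current and re-wraps nothing.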
Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
+/*
+ * SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: Apache-2.0
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+pub mod args;
+mod cmd;
+
+pub use args::Args;
+
+use crate::cfg::run::Run;
+use crate::cfg::runtime::RuntimeContext;
+use crate::errors::CarbideCliResult;
+
+impl Run for Args {
+    async fn run(self, ctx: &mut RuntimeContext) -> CarbideCliResult<()> {
+        cmd::re_wrap(&ctx.api_client, self.batch_size).await
+    }
+}

crates/api-core/Cargo.toml

Lines changed: 3 additions & 0 deletions
@@ -42,6 +42,7 @@ carbide-ib-fabric = { path = "../ib-fabric" }
 carbide-ib-partition-controller = { path = "../ib-partition-controller" }
 carbide-ipmi = { path = "../ipmi" }
 carbide-ipxe-renderer = { path = "../ipxe-renderer" }
+carbide-kms-provider = { path = "../kms-provider" }
 carbide-libmlx = { path = "../libmlx" }
 carbide-machine-controller = { path = "../machine-controller" }
 carbide-measured-boot = { path = "../measured-boot", features = ["sqlx"] }
@@ -180,7 +181,9 @@ tracing-subscriber = { workspace = true, features = [
 tss-esapi = { workspace = true, optional = true }
 url = { workspace = true, features = ["serde"] }
 uuid = { workspace = true, features = ["v4", "serde"] }
+vaultrs = { workspace = true }
 x509-parser = { workspace = true, features = ["verify"] }
+zeroize = { workspace = true }
 
 [features]
 default = ["linux-build"]

crates/api-core/src/api.rs

Lines changed: 8 additions & 0 deletions
@@ -88,6 +88,7 @@ pub struct Api {
     pub(crate) metric_emitter: ApiMetricsEmitter,
     pub(crate) component_manager: Option<component_manager::component_manager::ComponentManager>,
     pub(crate) bms_client: OnceLock<Arc<BmsDsxExchangeHandle>>,
+    pub(crate) secrets_context: Option<crate::secrets::SecretsContext>,
 }
 
 pub(crate) type ScoutStreamType =
@@ -1444,6 +1445,13 @@ impl Forge for Api {
         crate::handlers::credential::delete_credential(self, request).await
     }
 
+    async fn re_wrap_secrets(
+        &self,
+        request: Request<rpc::ReWrapSecretsRequest>,
+    ) -> Result<Response<rpc::ReWrapSecretsResponse>, Status> {
+        crate::handlers::secrets::re_wrap_secrets(self, request).await
+    }
+
     /// get_route_servers returns a list of all configured route server
     /// entries for all source types.
     async fn get_route_servers(
