feat(zero-dpu): Allow flat VPC's to not belong to a network segment#2666
Conversation
In an environment without DPU's, we don't get dynamic network isolation, and the expectation is that there is a single, flat network address space, modeled as a single network segment. In this case, we still want to model VPC's for things like NVLink partitions, where tenants will have the understanding that the network is not isolated between VPC's. To make this work, a network segment can no longer be the "link" that links an instance to a VPC. Before, an instance interface maps to a machine interface, which maps to a network segment, which maps to a VPC. But if multiple VPC's are all sharing a network segment, this link doesn't exist (namely, network_segments.vpc_id will be null), and so we need the instance address itself to store the VPC ID on it. This change adds a vpc_id column to instance_addresses, so that we know the VPC of any given address without having to consult the network_segments table. It also changes the AllocateInstance API so that you can pass a `auto_config` object which allows specifying a VPC ID as part of the allocation request. (This replaces the `auto: bool` field, such that auto is implied if `auto_config` is set.)
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughThis PR replaces the ChangesFlat VPC / auto_config for zero-DPU instance allocation
Sequence Diagram(s)sequenceDiagram
participant CLI as admin-cli
participant RPC as build_instance_request
participant CoreAlloc as allocate_instance
participant ValidateVpc as validate_zero_dpu_auto_vpc
participant DB as instance_address::allocate
participant Resp as RPC response
CLI->>RPC: --flat-vpc-id=<VpcId>
RPC->>RPC: fetch VPC, validate Flat virtualization, build auto_config { vpc_id }
RPC->>CoreAlloc: InstanceConfig { auto_config: Some { vpc_id }, interfaces: [] }
CoreAlloc->>ValidateVpc: validate_zero_dpu_auto_vpc(vpc_id, tenant_org)
ValidateVpc-->>CoreAlloc: Ok (Flat, Nic, tenant-owned)
CoreAlloc->>CoreAlloc: validate HostInband segment binding vs vpc_id
CoreAlloc->>DB: allocate addresses with interface_vpc_id per interface
DB->>DB: INSERT instance_addresses(vpc_id)
DB-->>CoreAlloc: allocated addresses
CoreAlloc->>Resp: Instance { status.network.interfaces[*].vpc_id, network.auto_config.vpc_id }
Resp->>CLI: RPC response with vpc_id propagated
Estimated code review effort🎯 4 (Complex) | ⏱️ ~75 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 8
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (5)
crates/api-db/src/instance_network_config.rs (1)
75-80:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winUpdate the invalid-argument message to reference
auto_configinstead of deprecatedauto.Line 76 still reports
InstanceNetworkConfig.auto, which no longer matches the request contract and can mislead operators during troubleshooting.Suggested wording update
- "InstanceNetworkConfig.auto reached the resolver with {} \ + "InstanceNetworkConfig.auto_config reached the resolver with {} \ pre-existing interfaces; auto requests must arrive with an \ empty interfaces list",🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/api-db/src/instance_network_config.rs` around lines 75 - 80, The error message in the DatabaseError::InvalidArgument call references the deprecated field name `auto`, but the field has been renamed to `auto_config`. Update the error message string on line 76 to replace "InstanceNetworkConfig.auto" with "InstanceNetworkConfig.auto_config" so the error message accurately reflects the current field name and prevents misleading operators during troubleshooting.crates/api-core/src/instance/mod.rs (1)
935-944:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winUpdate stale
autoerror text toauto_config.These user-facing errors still tell operators to set
InstanceNetworkConfig.auto, but this path now requires/rejectsauto_config.Proposed fix
- // Zero-DPU hosts (no DPU, or DPU in NIC mode) MUST use `auto`, because + // Zero-DPU hosts (no DPU, or DPU in NIC mode) MUST use `auto_config`, because // their only valid attachments are HostInband segments, and NICo knows // which one(s) the host is on. Conversely, hosts with DPUs cannot use - // `auto`, and are expected to enumerate their interfaces explicitly. + // `auto_config`, and are expected to enumerate their interfaces explicitly. if !mh_snapshot.has_managed_dpus() { let Some(requested_auto_config) = request.config.network.auto_config else { return Err(CarbideError::InvalidArgument(format!( - "zero-DPU host {} requires `InstanceNetworkConfig.auto = true`; cannot allocate an instance with explicitly-listed interfaces or with `auto = false`", + "zero-DPU host {} requires `InstanceNetworkConfig.auto_config.vpc_id`; cannot allocate an instance without auto_config", mh_snapshot.host_snapshot.id, ))); };- // `auto` is only valid on zero-DPU hosts; DPU-managed hosts must + // `auto_config` is only valid on zero-DPU hosts; DPU-managed hosts must // list their interfaces explicitly. if request.config.network.auto_config.is_some() { return Err(CarbideError::InvalidArgument(format!( - "host {} has DPUs; `InstanceNetworkConfig.auto` is only valid on zero-DPU hosts", + "host {} has DPUs; `InstanceNetworkConfig.auto_config` is only valid on zero-DPU hosts", mh_snapshot.host_snapshot.id, ))); }Also applies to: 1022-1028
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/api-core/src/instance/mod.rs` around lines 935 - 944, The error message in the zero-DPU host validation block references the outdated field name `InstanceNetworkConfig.auto` but the code is now checking for `auto_config`. Update the error message text to correctly reference `InstanceNetworkConfig.auto_config` instead of `InstanceNetworkConfig.auto` to match the actual field being validated. Additionally, apply the same correction to any other similar error messages at lines 1022-1028 that still reference the old field name.crates/api-db/src/vpc.rs (1)
240-255:⚠️ Potential issue | 🟠 Major | ⚡ Quick winDo not return soft-deleted VPCs from
find_by_segment.
fetch_optionalnow models “segment is unbound”, but a segment whosevpc_idpoints at a soft-deleted VPC still returnsSome(vpc). That lets allocation/validation treat deleted VPC membership as active.Proposed fix
.filter_relation( &ObjectColumnFilter::One(network_segment::IdColumn, &segment_id), Some("s"), ); + query.push(" AND v.deleted IS NULL"); query.push(" LIMIT 1");🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/api-db/src/vpc.rs` around lines 240 - 255, The find_by_segment function in the VPC query builder returns soft-deleted VPCs because it lacks a filter to exclude them, allowing allocation/validation logic to treat deleted VPC membership as active. Add a filter to the FilterableQueryBuilder in find_by_segment to exclude soft-deleted VPCs by filtering on the appropriate soft-delete column (such as deleted_at or similar field) in the vpcs table, similar to how the function already filters by segment_id using filter_relation.crates/rpc/src/model/instance/config/network.rs (1)
1002-1029:⚠️ Potential issue | 🟠 Major | ⚡ Quick winTest fixture does not exercise the intended rejection path.
The test
test_auto_rejects_explicit_interfacessetsauto_config: None, which meansauto = config.auto_config.is_some()evaluates tofalse. Consequently, the check at line 142 (if auto && !config.interfaces.is_empty()) never triggers. The test currently passes because the conversion succeeds (no rejection occurs), and then theexpect_errcall fails with the wrong reason.To correctly test that
auto_configcombined with non-empty interfaces is rejected, supplyauto_config: Some(...):Proposed fix
#[test] fn test_auto_rejects_explicit_interfaces() { // `auto: true` and populated interfaces are mutually exclusive, // so this should error. + let vpc_id = VpcId::new(); let rpc_config = rpc::InstanceNetworkConfig { interfaces: vec![rpc::InstanceInterfaceConfig { function_type: rpc::InterfaceFunctionType::Physical as i32, network_segment_id: Some(NetworkSegmentId::new()), network_details: None, device: None, device_instance: 0, virtual_function_id: None, ip_address: None, ipv6_interface_config: None, routing_profile: None, }], #[allow(deprecated)] - auto: false, - auto_config: None, + auto: true, + auto_config: Some(forge::InstanceNetworkAutoConfig { + vpc_id: Some(vpc_id), + }), }; let result: Result<InstanceNetworkConfig, _> = rpc_config.try_into(); let err = result.expect_err("auto + non-empty interfaces should be rejected");🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/rpc/src/model/instance/config/network.rs` around lines 1002 - 1029, In the test function `test_auto_rejects_explicit_interfaces`, the `auto_config` field is currently set to `None`, which causes `auto` to be derived as `false` and prevents the intended rejection check from triggering. Change `auto_config: None` to `auto_config: Some(...)` (with any valid auto configuration structure) so that the condition `if auto && !config.interfaces.is_empty()` at line 142 is properly evaluated and the rejection behavior is actually tested.crates/admin-cli/src/instance/allocate/args.rs (1)
25-54:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winAdd a flat-VPC allocation example.
The command help now exposes
--flat-vpc-id, but the examples only cover subnet and VPC-prefix allocation, leaving the zero-DPU replacement path undiscoverable.Proposed help addition
Allocate one instance onto a subnet: $ nico-admin-cli instance allocate --prefix-name eth0 --subnet 192.0.2.0/24 +Allocate one instance into a flat VPC for a zero-DPU host: + $ nico-admin-cli instance allocate --prefix-name eth0 \ + --flat-vpc-id 12345678-1234-5678-90ab-cdef01234567 + Allocate several instances at once: $ nico-admin-cli instance allocate --number 4 --prefix-name eth0 \As per coding guidelines, admin CLI examples should cover each meaningful axis of variation.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/admin-cli/src/instance/allocate/args.rs` around lines 25 - 54, The help documentation for the instance allocate command is missing an example demonstrating the --flat-vpc-id flag, which is a key allocation option. Add a new example in the after_long_help string that shows how to allocate an instance using --flat-vpc-id instead of subnet or vpc-prefix-id. Follow the existing example format and include it alongside the other allocation examples to make this allocation path discoverable for users. This example should demonstrate the zero-DPU replacement path as mentioned in the comment.Source: Coding guidelines
🧹 Nitpick comments (4)
crates/api-core/src/tests/instance_config_update.rs (1)
901-903: ⚡ Quick winConsolidate deprecated-field suppression into a single test helper.
These repeated
#[allow(deprecated)] auto: falseinitializers are functionally correct but create high churn and spread lint suppression across the file. Prefer one helper that setsauto_config: Noneand contains the deprecation handling in one place.Proposed refactor
+#[allow(deprecated)] +fn test_network_config( + interfaces: Vec<rpc::InstanceInterfaceConfig>, +) -> rpc::InstanceNetworkConfig { + rpc::InstanceNetworkConfig { + interfaces, + auto: false, + auto_config: None, + } +} ... - let network = rpc::InstanceNetworkConfig { - interfaces: vec![...], - #[allow(deprecated)] - auto: false, - auto_config: None, - }; + let network = test_network_config(vec![...]);Also applies to: 969-971, 1028-1030, 1115-1117, 1188-1190, 1276-1278, 1344-1346, 1475-1477, 1543-1545, 1660-1662, 1702-1704, 1746-1748, 1884-1886, 1944-1946, 2004-2006
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/api-core/src/tests/instance_config_update.rs` around lines 901 - 903, Create a single test helper function that encapsulates the initialization of the deprecated `auto` field and `auto_config` field. Instead of repeating the `#[allow(deprecated)]` attribute and the pattern `auto: false, auto_config: None` at multiple test locations (lines 901-903, 969-971, 1028-1030, 1115-1117, 1188-1190, 1276-1278, 1344-1346, 1475-1477, 1543-1545, 1660-1662, 1702-1704, 1746-1748, 1884-1886, 1944-1946, 2004-2006), define a helper function that builds the configuration struct with the deprecated field handling contained in one place, then replace all occurrences of the scattered `#[allow(deprecated)] auto: false, auto_config: None` pattern with calls to this helper function.Source: Coding guidelines
crates/api-core/src/tests/instance.rs (2)
1033-1035: ⚡ Quick winCentralize the deprecated
autofield handling.The repeated local
#[allow(deprecated)]blocks make future removal noisy. Prefer..Default::default()for these explicit-interface configs if available, or move the deprecated-field initialization behind one small test helper with a single documented allow. As per coding guidelines, “Avoid using#[allow(...)]unless you have a strong reason to do so.”Representative cleanup
rpc::InstanceNetworkConfig { interfaces, - #[allow(deprecated)] - auto: false, - auto_config: None, + ..Default::default() }If the generated type cannot use
Defaultcleanly, centralize the suppression instead:#[allow(deprecated)] fn explicit_network_config( interfaces: Vec<rpc::InstanceInterfaceConfig>, ) -> rpc::InstanceNetworkConfig { rpc::InstanceNetworkConfig { interfaces, auto: false, auto_config: None, } }Also applies to: 1849-1851, 2442-2444, 3030-3032, 3416-3418, 3506-3508, 3593-3595, 3749-3751, 3827-3829, 3965-3967, 4143-4145, 4242-4244, 4384-4386, 4522-4524, 4589-4591, 4745-4747, 5333-5335
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/api-core/src/tests/instance.rs` around lines 1033 - 1035, Centralize the deprecated field handling across all instances in the test file. Create a single test helper function (e.g., explicit_network_config) that contains the `#[allow(deprecated)]` annotation and constructs the rpc::InstanceNetworkConfig with the deprecated auto and auto_config fields set to false and None respectively. Then replace all instances of the scattered #[allow(deprecated)] blocks at lines 1033-1035, 1849-1851, 2442-2444, 3030-3032, 3416-3418, 3506-3508, 3593-3595, 3749-3751, 3827-3829, 3965-3967, 4143-4145, 4242-4244, 4384-4386, 4522-4524, 4589-4591, 4745-4747, and 5333-5335 with calls to this centralized helper function, removing the redundant local allow annotations entirely.Source: Coding guidelines
167-178: ⚡ Quick winAssert persisted
instance_addresses.vpc_idin the record checks.These tests now compute expected VPC IDs, but the rows read from
instance_addressesstill only validate address/prefix. Add directrecord.vpc_idassertions so a regression that leaves the new DB association unset/wrong is caught even if config/status propagation still works. Based on PR objectives,instance_addresses.vpc_idis now the direct VPC association used by lookup paths.Proposed assertions
// This should the first IP. Algo does not look into machine_interface_addresses // table for used addresses for instance. assert_eq!(record.address.to_string(), "192.0.4.3"); + assert_eq!(record.vpc_id, Some(vpc_ids[0])); assert_eq!( &record.address, network_config.interfaces[0]// This should the first IP. Algo does not look into machine_interface_addresses // table for used addresses for instance. assert_eq!(record.address.to_string(), "192.0.4.3"); + assert_eq!(record.vpc_id, Some(vpc_id)); assert_eq!( &record.address, network_config.interfaces[0]// This should the first IP. Algo does not look into machine_interface_addresses // table for used addresses for instance. assert_eq!(record.address.to_string(), "10.217.5.225"); + assert_eq!(record.vpc_id, Some(vpc.id)); assert_eq!( &record.address, network_config.interfaces[0]Also applies to: 273-286, 355-359, 513-524, 2641-2641, 2671-2682
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/api-core/src/tests/instance.rs` around lines 167 - 178, The test code now computes expected VPC IDs from segments using the join_all and db::vpc::find_by_segment pattern, but the subsequent record assertions for instance_addresses rows only validate address and prefix fields. Add direct assertions that check record.vpc_id against the expected VPC IDs that were computed in this block to ensure the database associations are correctly set. Apply this fix at all locations where instance_addresses records are being validated after computing VPC IDs, including the primary location (167-178) and the other affected test locations mentioned in the comment (273-286, 355-359, 513-524, 2641-2641, 2671-2682).crates/rpc/proto/forge.proto (1)
2749-2755: ⚡ Quick winUpdate the contract comment to make
auto_configthe sole authoritative path.Line 2749-Line 2751 still describes mutual exclusivity using deprecated
auto. This can mislead API consumers and generated documentation; the rule should be documented againstauto_configinstead.Suggested doc fix
- // Mutually exclusive with `auto`: when `auto` is true, this - // MUST be empty, When `auto` is false, this lists the explicit - // interface configuration the caller wants applied. + // Mutually exclusive with `auto_config`: when `auto_config` is set, this + // MUST be empty. When `auto_config` is unset, this lists the explicit + // interface configuration the caller wants applied.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/rpc/proto/forge.proto` around lines 2749 - 2755, Update the contract comment for the `interfaces` field (lines 2749-2751) to remove references to the deprecated `auto` field and instead describe the mutual exclusivity relationship with `auto_config`. The comment should make clear that `auto_config` is the authoritative mechanism for controlling whether interfaces are automatically configured or explicitly specified, ensuring API consumers and generated documentation reflect the current contract rather than referencing deprecated fields.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/admin-cli/src/instance/show/cmd.rs`:
- Around line 201-230: The current implementation uses filter_map and flatten
operations that remove missing VPC lookup results, causing the remaining VPCs to
shift to earlier indices and misalign with their corresponding interfaces.
Replace the filter_map(s.vpc_id).map(...) and
filter_map(c.network_segment_id).map(...) patterns with approaches that preserve
an Option<Vpc> for each interface instead of discarding missing results.
Specifically, for the auto_network branch, map each interface status to an
Option containing either the looked-up VPC or None if vpc_id is missing; for the
else branch, map each interface config to an Option containing either the
looked-up VPC or None if segment_id is missing or the lookup fails. Remove the
flatten() call since you now have Vec<Option<Vpc>> with proper 1:1 alignment,
and adjust the vpcs.get(idx) lookups to handle the Option wrapper appropriately
when displaying results.
In `@crates/admin-cli/src/machine/show/cmd.rs`:
- Around line 441-443: The current condition checking if flat_vpc_id.is_some()
returns the first eligible machine without verifying it is suitable for zero-DPU
flat-VPC auto allocation. Add an additional eligibility check in the if
statement to ensure the machine is a static-membership machine before returning
Some(machine), since the server rejects auto_config on DPU-backed machines and
this prevents batch allocation from trying valid hosts.
In `@crates/admin-cli/src/rpc.rs`:
- Around line 1267-1274: The error checking in the flat-vpc-id validation block
(in the condition with vpc_prefix_id, vf_vpc_prefix_id, and vf_subnet) currently
only rejects VPC prefixes and VF subnets, but it must also reject subnet,
explicit IP addresses, and IPv6 selector fields to prevent them from being
silently ignored. Expand the if condition to include checks for the subnet
field, any explicit IP selector fields, and IPv6 selector fields in the
allocate_instance struct, adding each to the logical OR chain with the existing
checks.
In `@crates/api-core/src/handlers/dpu.rs`:
- Around line 258-263: The error handling in the vpc::find_by_segment lookup
block incorrectly uses CarbideError::InvalidArgument when the network segment is
not found to be a member of a VPC. This is a system-side database state issue,
not a validation failure on caller input. Change the error variant from
CarbideError::InvalidArgument to CarbideError::FailedPrecondition to properly
classify this as a system failure and ensure correct error ownership semantics
and retry/alert behavior according to the project's error handling guidelines.
In `@crates/api-core/src/instance/mod.rs`:
- Line 31: The VPC-attached address creation in instance_addresses inserts for
requested_auto_config.vpc_id lacks the necessary VpcRowLock::Mutation lock,
allowing a race condition where concurrent VPC deletions can soft-delete a VPC
while addresses are still being committed to it. Acquire the
VpcRowLock::Mutation lock before performing the instance_addresses inserts for
the requested VPC to ensure mutual exclusion with concurrent VPC deletion
operations and prevent dangling addresses from being created for deleted VPCs.
In `@crates/api-db/migrations/20260615120000_instance_addresses_vpc_id.sql`:
- Around line 1-8: The migration adds a nullable vpc_id column to
instance_addresses and only partially backfills it via an UPDATE statement that
joins with network_segments. This leaves room for NULL vpc_id values in existing
or newly-inserted rows during a mixed-version rollout, violating the intended
VPC-address association contract. After the UPDATE backfill statement completes,
add an ALTER TABLE statement to the migration to enforce the vpc_id column as
NOT NULL, ensuring data consistency and preventing future null values from being
inserted into this critical field.
In `@crates/api-db/src/vpc.rs`:
- Around line 286-293: The instance_address_count_query contains a redundant
EXISTS subquery that checks if the VPC still exists (is not deleted). Since
instance_addresses are hard-deleted during instance release, this defensive
check is unnecessary. Simplify the query string to remove the EXISTS clause and
only check WHERE vpc_id=$1, reducing the query to a single simple condition.
Update the instance_address_count_query variable to contain only the simplified
SQL without the subquery that validates VPC existence.
In `@crates/machine-a-tron/src/api_client.rs`:
- Around line 291-293: Remove the #[allow(deprecated)] annotation and refactor
the InstanceNetworkConfig construction to use Default::default() instead of
manually setting the deprecated fields auto and auto_config. Keep only the
non-deprecated fields that need to be explicitly set, allowing the deprecated
fields to use their default values without suppressing the deprecation warning.
This maintains clippy cleanliness and follows coding guidelines that avoid
unnecessary lint suppressions.
---
Outside diff comments:
In `@crates/admin-cli/src/instance/allocate/args.rs`:
- Around line 25-54: The help documentation for the instance allocate command is
missing an example demonstrating the --flat-vpc-id flag, which is a key
allocation option. Add a new example in the after_long_help string that shows
how to allocate an instance using --flat-vpc-id instead of subnet or
vpc-prefix-id. Follow the existing example format and include it alongside the
other allocation examples to make this allocation path discoverable for users.
This example should demonstrate the zero-DPU replacement path as mentioned in
the comment.
In `@crates/api-core/src/instance/mod.rs`:
- Around line 935-944: The error message in the zero-DPU host validation block
references the outdated field name `InstanceNetworkConfig.auto` but the code is
now checking for `auto_config`. Update the error message text to correctly
reference `InstanceNetworkConfig.auto_config` instead of
`InstanceNetworkConfig.auto` to match the actual field being validated.
Additionally, apply the same correction to any other similar error messages at
lines 1022-1028 that still reference the old field name.
In `@crates/api-db/src/instance_network_config.rs`:
- Around line 75-80: The error message in the DatabaseError::InvalidArgument
call references the deprecated field name `auto`, but the field has been renamed
to `auto_config`. Update the error message string on line 76 to replace
"InstanceNetworkConfig.auto" with "InstanceNetworkConfig.auto_config" so the
error message accurately reflects the current field name and prevents misleading
operators during troubleshooting.
In `@crates/api-db/src/vpc.rs`:
- Around line 240-255: The find_by_segment function in the VPC query builder
returns soft-deleted VPCs because it lacks a filter to exclude them, allowing
allocation/validation logic to treat deleted VPC membership as active. Add a
filter to the FilterableQueryBuilder in find_by_segment to exclude soft-deleted
VPCs by filtering on the appropriate soft-delete column (such as deleted_at or
similar field) in the vpcs table, similar to how the function already filters by
segment_id using filter_relation.
In `@crates/rpc/src/model/instance/config/network.rs`:
- Around line 1002-1029: In the test function
`test_auto_rejects_explicit_interfaces`, the `auto_config` field is currently
set to `None`, which causes `auto` to be derived as `false` and prevents the
intended rejection check from triggering. Change `auto_config: None` to
`auto_config: Some(...)` (with any valid auto configuration structure) so that
the condition `if auto && !config.interfaces.is_empty()` at line 142 is properly
evaluated and the rejection behavior is actually tested.
---
Nitpick comments:
In `@crates/api-core/src/tests/instance_config_update.rs`:
- Around line 901-903: Create a single test helper function that encapsulates
the initialization of the deprecated `auto` field and `auto_config` field.
Instead of repeating the `#[allow(deprecated)]` attribute and the pattern `auto:
false, auto_config: None` at multiple test locations (lines 901-903, 969-971,
1028-1030, 1115-1117, 1188-1190, 1276-1278, 1344-1346, 1475-1477, 1543-1545,
1660-1662, 1702-1704, 1746-1748, 1884-1886, 1944-1946, 2004-2006), define a
helper function that builds the configuration struct with the deprecated field
handling contained in one place, then replace all occurrences of the scattered
`#[allow(deprecated)] auto: false, auto_config: None` pattern with calls to this
helper function.
In `@crates/api-core/src/tests/instance.rs`:
- Around line 1033-1035: Centralize the deprecated field handling across all
instances in the test file. Create a single test helper function (e.g.,
explicit_network_config) that contains the `#[allow(deprecated)]` annotation and
constructs the rpc::InstanceNetworkConfig with the deprecated auto and
auto_config fields set to false and None respectively. Then replace all
instances of the scattered #[allow(deprecated)] blocks at lines 1033-1035,
1849-1851, 2442-2444, 3030-3032, 3416-3418, 3506-3508, 3593-3595, 3749-3751,
3827-3829, 3965-3967, 4143-4145, 4242-4244, 4384-4386, 4522-4524, 4589-4591,
4745-4747, and 5333-5335 with calls to this centralized helper function,
removing the redundant local allow annotations entirely.
- Around line 167-178: The test code now computes expected VPC IDs from segments
using the join_all and db::vpc::find_by_segment pattern, but the subsequent
record assertions for instance_addresses rows only validate address and prefix
fields. Add direct assertions that check record.vpc_id against the expected VPC
IDs that were computed in this block to ensure the database associations are
correctly set. Apply this fix at all locations where instance_addresses records
are being validated after computing VPC IDs, including the primary location
(167-178) and the other affected test locations mentioned in the comment
(273-286, 355-359, 513-524, 2641-2641, 2671-2682).
In `@crates/rpc/proto/forge.proto`:
- Around line 2749-2755: Update the contract comment for the `interfaces` field
(lines 2749-2751) to remove references to the deprecated `auto` field and
instead describe the mutual exclusivity relationship with `auto_config`. The
comment should make clear that `auto_config` is the authoritative mechanism for
controlling whether interfaces are automatically configured or explicitly
specified, ensuring API consumers and generated documentation reflect the
current contract rather than referencing deprecated fields.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: ae5d57a1-9f74-4c1a-ba99-6b896f9b9dea
📒 Files selected for processing (46)
crates/admin-cli/src/instance/allocate/args.rscrates/admin-cli/src/instance/allocate/cmd.rscrates/admin-cli/src/instance/show/cmd.rscrates/admin-cli/src/machine/show/cmd.rscrates/admin-cli/src/rpc.rscrates/agent/src/tests/full.rscrates/api-core/src/cfg/file.rscrates/api-core/src/ethernet_virtualization.rscrates/api-core/src/handlers/dpu.rscrates/api-core/src/handlers/instance.rscrates/api-core/src/instance/mod.rscrates/api-core/src/setup.rscrates/api-core/src/test_support/network_segment.rscrates/api-core/src/tests/client_resolution.rscrates/api-core/src/tests/common/api_fixtures/instance.rscrates/api-core/src/tests/common/api_fixtures/mod.rscrates/api-core/src/tests/common/api_fixtures/test_managed_host.rscrates/api-core/src/tests/common/network_segment.rscrates/api-core/src/tests/instance.rscrates/api-core/src/tests/instance_allocate.rscrates/api-core/src/tests/instance_config_update.rscrates/api-core/src/tests/machine_dhcp.rscrates/api-core/src/tests/machine_network.rscrates/api-core/src/tests/network_security_group.rscrates/api-core/src/tests/network_segment.rscrates/api-core/src/tests/vpc.rscrates/api-core/src/tests/vpc_find.rscrates/api-core/src/tests/vpc_peering.rscrates/api-core/src/tests/vpc_prefix.rscrates/api-db/migrations/20260615120000_instance_addresses_vpc_id.sqlcrates/api-db/src/dpa_interface.rscrates/api-db/src/instance.rscrates/api-db/src/instance_address.rscrates/api-db/src/instance_network_config.rscrates/api-db/src/network_security_group.rscrates/api-db/src/vpc.rscrates/api-model/src/instance/config/network.rscrates/api-model/src/instance/status/network.rscrates/api-model/src/instance_address.rscrates/machine-a-tron/src/api_client.rscrates/network/src/virtualization.rscrates/rpc/build.rscrates/rpc/proto/forge.protocrates/rpc/src/model/instance/config/network.rscrates/rpc/src/model/instance/status/network.rscrates/test-harness/src/network/controller.rs
🔍 Container Scan Summary
Per-CVE detail lives in the per-service |
Quick question: Why does instance_addresses matter at all for no-dpu? I assumed since the instance just uses any available underlay IP, there might not be an entry needed in that table. All the instance logic could just return the machines IPs directly. Is this route just because using instance_addresses simplifies something else and keeps it more in common with the DPU path? |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
crates/api-core/src/handlers/instance.rs (1)
1397-1401: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick winUpdate deprecated field wording in the zero-DPU auto-path error.
Line 1399 still says
auto: true; the public contract now usesauto_config, so this message is misleading during troubleshooting.Suggested fix
- "instance was allocated with `auto: true` but host {} is no longer zero-DPU; cannot update via the auto path", + "instance was allocated with `auto_config` but host {} is no longer zero-DPU; cannot update via the auto path", instance.machine_id,🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/api-core/src/handlers/instance.rs` around lines 1397 - 1401, The error message in the InvalidArgument error within the mh_snapshot.has_managed_dpus() conditional block uses the deprecated field name `auto: true` which no longer matches the public contract terminology. Update the error message string to replace `auto: true` with `auto_config` to provide accurate guidance to users during troubleshooting.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@crates/api-core/src/handlers/instance.rs`:
- Around line 1397-1401: The error message in the InvalidArgument error within
the mh_snapshot.has_managed_dpus() conditional block uses the deprecated field
name `auto: true` which no longer matches the public contract terminology.
Update the error message string to replace `auto: true` with `auto_config` to
provide accurate guidance to users during troubleshooting.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: dc5a1a29-6389-457b-a81c-43fadc3b382d
📒 Files selected for processing (26)
crates/admin-cli/src/instance/show/cmd.rscrates/admin-cli/src/machine/show/cmd.rscrates/admin-cli/src/rpc.rscrates/agent/src/tests/full.rscrates/api-core/src/cfg/file.rscrates/api-core/src/ethernet_virtualization.rscrates/api-core/src/handlers/dpu.rscrates/api-core/src/handlers/instance.rscrates/api-core/src/instance/mod.rscrates/api-core/src/setup.rscrates/api-core/src/tests/common/api_fixtures/mod.rscrates/api-core/src/tests/instance_allocate.rscrates/api-core/src/tests/machine_dhcp.rscrates/api-core/src/tests/machine_network.rscrates/api-core/src/tests/network_segment.rscrates/api-core/src/tests/vpc.rscrates/api-core/src/tests/vpc_find.rscrates/api-core/src/tests/vpc_peering.rscrates/api-core/src/tests/vpc_prefix.rscrates/api-db/migrations/20260615120000_instance_addresses_vpc_id.sqlcrates/api-db/src/dpa_interface.rscrates/api-db/src/instance_address.rscrates/api-db/src/vpc.rscrates/api-model/src/instance_address.rscrates/rpc/build.rscrates/rpc/proto/forge.proto
💤 Files with no reviewable changes (2)
- crates/rpc/build.rs
- crates/rpc/proto/forge.proto
✅ Files skipped from review due to trivial changes (2)
- crates/api-core/src/tests/machine_dhcp.rs
- crates/api-core/src/tests/vpc.rs
🚧 Files skipped from review as they are similar to previous changes (20)
- crates/api-db/migrations/20260615120000_instance_addresses_vpc_id.sql
- crates/api-core/src/handlers/dpu.rs
- crates/api-core/src/tests/vpc_find.rs
- crates/api-core/src/setup.rs
- crates/admin-cli/src/machine/show/cmd.rs
- crates/api-core/src/tests/common/api_fixtures/mod.rs
- crates/api-db/src/vpc.rs
- crates/api-core/src/cfg/file.rs
- crates/api-db/src/dpa_interface.rs
- crates/api-core/src/tests/vpc_prefix.rs
- crates/api-core/src/ethernet_virtualization.rs
- crates/api-core/src/tests/machine_network.rs
- crates/api-core/src/tests/network_segment.rs
- crates/api-core/src/tests/vpc_peering.rs
- crates/agent/src/tests/full.rs
- crates/admin-cli/src/rpc.rs
- crates/admin-cli/src/instance/show/cmd.rs
- crates/api-db/src/instance_address.rs
- crates/api-core/src/instance/mod.rs
- crates/api-core/src/tests/instance_allocate.rs
Basically, yes. It would be more complex to say "in these cases we query instance_addresses, and in these cases we query machine_interface_addresses", so for zero-dpu cases, we copy the machine addresses over to instance_addresses: see crates/api-db/src/instance_address.rs:337. As for why the VPC ID is being attached to instance_addresses at all, we need to store it somewhere on the instance, because it's the only way at all that we know what VPC an instance belongs to, if the network segment has no vpc_id. So if we want to know "what VPC(s) is this instance supposed to be in", we're using the instance address as the jumping off point. |
|
🌿 Preview your docs: https://nvidia-preview-pull-request-2666.docs.buildwithfern.com/infra-controller |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
crates/api-test-helper/src/instance.rs (1)
39-83: 🎯 Functional Correctness | 🟠 Major | 🏗️ Heavy liftThread the flat VPC ID through this helper.
The
segment_id == Nonepath still emits only deprecatedauto: true, but flat-VPC segments may have no segment-level VPC for the handler to hydrate from. Add aflat_vpc_id/auto_config.vpc_idpath here, and keep deprecatedautocoverage in a dedicated compatibility test if needed. As per path instructions, review test-support crates for “deterministic behavior, clear fixtures, reusable helpers.”🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/api-test-helper/src/instance.rs` around lines 39 - 83, The None branch of the segment_id match statement (zero-DPU path) currently only uses the deprecated auto: true in the network configuration, but flat-VPC segments require a vpc_id to be specified. Add a flat_vpc_id parameter to the function, then in the None branch where the instance_config is constructed, replace or supplement the "auto": true with an "auto_config" object that includes "vpc_id" set to the flat_vpc_id value. Keep the deprecated auto: true path available in a separate compatibility test if needed.Source: Path instructions
🧹 Nitpick comments (2)
crates/api-core/src/handlers/instance.rs (2)
187-192: 🚀 Performance & Scalability | 🔵 Trivial | ⚡ Quick winHydrate the batch inside one transaction.
This loop opens and commits a transaction per instance request before the actual batch allocation. Since hydration is a read-only pass over the batch, wrap the loop in one
with_txncall to avoid avoidable DB round trips and keep the hydration pass together.Proposed refactor
- for request in &mut batch_request.instance_requests { - api.with_txn(|txn| { - hydrate_instance_allocation_request_from_deprecated_fields(txn, request).boxed() - }) - .await??; - } + api.with_txn(|txn| { + async { + for request in &mut batch_request.instance_requests { + hydrate_instance_allocation_request_from_deprecated_fields(txn, request).await?; + } + + Ok(()) + } + .boxed() + }) + .await??;🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/api-core/src/handlers/instance.rs` around lines 187 - 192, The loop iterating over batch_request.instance_requests currently opens a separate transaction for each request when calling api.with_txn, causing unnecessary database round trips. Move the api.with_txn call outside the loop so that it wraps the entire for loop, allowing all calls to hydrate_instance_allocation_request_from_deprecated_fields to execute within a single read-only transaction. This keeps the hydration pass together and eliminates avoidable DB round trips.
1939-1966: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winKeep deprecated-hydration warnings attributable.
These warning paths can run while hydrating a batch, but the emitted records do not include the
machine_id, and the multi-segment case drops the segment count. Add those as structured fields so operators can identify the failing request without correlating nearby log lines. As per coding guidelines, “When writing log messages, prefer placing common fields as attributes passed to tracing functions,” and as per path instructions, reviewcrates/**/*.rsfor “structured tracing fields.”Proposed logging update
[] => { tracing::warn!( + %machine_id, "Assigned machine is not in a HostInband network, cannot auto-assign VPC ID" ); return Ok(()); } - _ => { + segments => { tracing::warn!( - "Assigned machine is in multiple a HostInband networks, cannot auto-assign VPC ID" + %machine_id, + count = segments.len(), + "Assigned machine is in multiple HostInband networks, cannot auto-assign VPC ID" ); return Ok(()); } @@ } else { - tracing::warn!(network_segment_id = %ns_id, "Network segment does not have a VPC ID, cannot auto-assign VPC ID"); + tracing::warn!(%machine_id, network_segment_id = %ns_id, "Network segment does not have a VPC ID, cannot auto-assign VPC ID"); }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/api-core/src/handlers/instance.rs` around lines 1939 - 1966, The warning messages in the network segment handling code are missing structured tracing fields needed for operators to identify failing requests. Add the machine_id as a structured field to all three warning calls: the single tracing::warn for machines not in a HostInband network, the multi-segment case which should also include the count of networks found, and the existing network_segment warning. Include these as attributes to the tracing::warn! macro calls following the pattern already shown in the info log for vpc_id assignment, ensuring machine_id is consistently available across all warning paths to avoid requiring log line correlation for debugging.Sources: Coding guidelines, Path instructions
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@crates/api-test-helper/src/instance.rs`:
- Around line 39-83: The None branch of the segment_id match statement (zero-DPU
path) currently only uses the deprecated auto: true in the network
configuration, but flat-VPC segments require a vpc_id to be specified. Add a
flat_vpc_id parameter to the function, then in the None branch where the
instance_config is constructed, replace or supplement the "auto": true with an
"auto_config" object that includes "vpc_id" set to the flat_vpc_id value. Keep
the deprecated auto: true path available in a separate compatibility test if
needed.
---
Nitpick comments:
In `@crates/api-core/src/handlers/instance.rs`:
- Around line 187-192: The loop iterating over batch_request.instance_requests
currently opens a separate transaction for each request when calling
api.with_txn, causing unnecessary database round trips. Move the api.with_txn
call outside the loop so that it wraps the entire for loop, allowing all calls
to hydrate_instance_allocation_request_from_deprecated_fields to execute within
a single read-only transaction. This keeps the hydration pass together and
eliminates avoidable DB round trips.
- Around line 1939-1966: The warning messages in the network segment handling
code are missing structured tracing fields needed for operators to identify
failing requests. Add the machine_id as a structured field to all three warning
calls: the single tracing::warn for machines not in a HostInband network, the
multi-segment case which should also include the count of networks found, and
the existing network_segment warning. Include these as attributes to the
tracing::warn! macro calls following the pattern already shown in the info log
for vpc_id assignment, ensuring machine_id is consistently available across all
warning paths to avoid requiring log line correlation for debugging.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 522e8444-254f-4c46-92ef-f4a2b259e077
📒 Files selected for processing (2)
crates/api-core/src/handlers/instance.rscrates/api-test-helper/src/instance.rs
Thanks for the explanation. I am fine either way: With this extension to instance_addresses or by changing other code not to use it. We won't get around special handling this in some way, so whatever you, @chet and @bcavnvidia and decide is fine 👍🏻 |
Building on #2666: Passing `auto=true` as part of InstanceNetworkConfig when allocating an Instance on NICo is now deprecated in favor of passing a VPC ID as part of a InstanceNetworkConfig.auto_config message, so that NICo knows what VPC ID to assign to the network interface if the network segment doesn't belong to any VPC. (This is so multiple "flat" VPC's can share a single network segment for single-dpu hosts: An instance's network interface now carries the VPC ID instead of the network segment.) Since the REST API's APIInstanceCreateRequest already requires callers to pass a VPC ID, we can pass it directly as part of the call to AllocateInstance, which makes change this pretty easy, and not require any changes for callers of the REST API. ## Related issues #2647 ## Type of Change - [ ] **Add** - New feature or capability - [ ] **Change** - Changes in existing functionality - [ ] **Fix** - Bug fixes - [ ] **Remove** - Removed features or deprecated functionality - [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.) ## Breaking Changes - [ ] **This PR contains breaking changes** ## Testing - [X] Unit tests added/updated - [ ] Integration tests added/updated - [ ] Manual testing performed - [ ] No testing required (docs, internal refactor, etc.) ## Additional Notes
This supports #2647.
In an environment without DPU's, we don't get dynamic network isolation, and the expectation is that there is a single, flat network address space, modeled as a single network segment. In this case, we still want to model VPC's for things like NVLink partitions, where tenants will have the understanding that the network is not isolated between VPC's.
To make this work, a network segment can no longer be the "link" that links an instance to a VPC. Before, an instance interface maps to a machine interface, which maps to a network segment, which maps to a VPC. But if multiple VPC's are all sharing a network segment, this link doesn't exist (namely, network_segments.vpc_id will be null), and so we need the instance address itself to store the VPC ID on it.
This change adds a vpc_id column to instance_addresses, so that we know the VPC of any given address without having to consult the network_segments table. It also changes the AllocateInstance API so that you can pass a
auto_configobject which allows specifying a VPC ID as part of the allocation request. (This replaces theauto: boolfield, such that auto is implied ifauto_configis set.)Related issues
Type of Change
Breaking Changes
The
--zero-dpuflag onnico-admin-cli instance allocateis now--flat-vpc-id=<vpc_id>. The old flag was just merged last week so it's doubtful anyone's relying on it yet (especially because the functionality isn't done yet.)Testing
Additional Notes