Support creating sole tenancy nodes #1410

Closed
djeebus wants to merge 8 commits into main from sole-tenancy-vm

Contributor

djeebus commented Oct 27, 2025

This will let us isolate ourselves from noisy neighbors.


Note

Adds an isolated client sole-tenant GCE node pool with new variables wired through Terraform, relaxes the Google provider constraint, and introduces a CI job to validate IaC.

  • Infrastructure (Terraform):
    • Isolated client node pool: Add sole-tenant resources in iac/provider-gcp/nomad-cluster/nodepool-client-isolated.tf (google_compute_node_template, google_compute_node_group, google_compute_instance_template, google_compute_region_instance_group_manager).
    • Configuration: Introduce client_node_type, isolated_client_cluster_target_size, and isolated_client_cluster_disk_count variables with defaults in iac/provider-gcp/variables.tf; plumb through iac/provider-gcp/main.tf to ./nomad-cluster and declare in iac/provider-gcp/nomad-cluster/variables.tf.
    • Provider: Relax google provider to ~> 6 in iac/provider-gcp/main.tf.
    • Remove obsolete cache disk size/type variables in iac/provider-gcp/variables.tf.
  • CI:
    • Add permissions: contents: read and new validate-iac job in .github/workflows/pr-tests.yml to terraform init -backend=false and terraform validate in iac/provider-gcp.
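
For orientation, the sole-tenant resources listed above might be wired together roughly as follows. This is a minimal sketch, not the PR's actual file: the `isolated_client` template label and both `name` values are hypothetical, treating `client_node_type` as a sole-tenant node type is an assumption, and `initial_size = 1` is purely illustrative.

```hcl
# Describes the kind of physical host to reserve.
resource "google_compute_node_template" "isolated_client" {
  name      = "isolated-client-template" # hypothetical name
  region    = var.gcp_region
  node_type = var.client_node_type # assumed to hold a sole-tenant node type
}

# Reserves hosts built from the template in a single zone.
resource "google_compute_node_group" "client" {
  name          = "isolated-client-group" # hypothetical name
  zone          = var.gcp_zone
  node_template = google_compute_node_template.isolated_client.id
  initial_size  = 1 # illustrative; the PR's real sizing may differ
}
```

The instance template and regional instance group manager would then place VMs onto this node group.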

Written by Cursor Bugbot for commit a45dda8.

region = var.gcp_region
distribution_policy_zones = [var.gcp_zone]

target_size = var.isolated_client_cluster_size < var.isolated_client_cluster_size_max ? null : var.isolated_client_cluster_size

Bug: target_size can be null with no autoscaler attached

The target_size logic for the isolated_client_pool instance group manager is misconfigured. When isolated_client_cluster_size is less than isolated_client_cluster_size_max, target_size becomes null. This implies autoscaling, but no autoscaling policies are defined, which can lead to the instance group manager failing or behaving unexpectedly.
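
Two possible ways to address this are sketched below, under assumptions. The Google provider documentation asks that `target_size` always be set explicitly unless the manager is attached to an autoscaler, so either the size is pinned unconditionally or an autoscaler is added. The autoscaler's resource label, its `name`, and the CPU target of 0.6 are hypothetical; the variable names are taken from the snippet under review.

```hcl
# Option A: no autoscaler exists, so pin the size unconditionally.
resource "google_compute_region_instance_group_manager" "isolated_client_pool" {
  # name, base_instance_name, version { ... } and other arguments omitted

  region                    = var.gcp_region
  distribution_policy_zones = [var.gcp_zone]
  target_size               = var.isolated_client_cluster_size
}

# Option B: keep target_size unset (drop the line above) and let an
# autoscaler own the size between the two bounds.
resource "google_compute_region_autoscaler" "isolated_client_pool" {
  name   = "isolated-client-autoscaler" # hypothetical name
  region = var.gcp_region
  target = google_compute_region_instance_group_manager.isolated_client_pool.id

  autoscaling_policy {
    min_replicas = var.isolated_client_cluster_size
    max_replicas = var.isolated_client_cluster_size_max

    cpu_utilization {
      target = 0.6 # hypothetical utilization target
    }
  }
}
```

The two options are alternatives, not a combined configuration: setting `target_size` while an autoscaler manages the group would have Terraform and the autoscaler competing over the size.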

Member

ValentaTomas left a comment


Let's tweak the defaults so that, ideally, the node type is defined but the pool size is 0.
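
A sketch of what such defaults could look like in `iac/provider-gcp/variables.tf`, assuming the variable names from the PR summary. The default node type and the descriptions are hypothetical placeholders, not values taken from the PR.

```hcl
variable "client_node_type" {
  description = "Sole-tenant node type for the isolated client pool."
  type        = string
  default     = "n1-node-96-624" # hypothetical default
}

variable "isolated_client_cluster_target_size" {
  description = "Number of isolated client instances; 0 leaves the pool empty."
  type        = number
  default     = 0
}
```

With a default of 0 the pool is defined but stays empty until a deployment opts in by raising the size.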


scheduling {
on_host_maintenance = "MIGRATE"
}

Bug: Missing node_affinities for sole-tenant scheduling

The scheduling block for sole tenant instances is missing the required node_affinities configuration. Instances created from this template won't be scheduled on the sole tenant node group (google_compute_node_group.client), defeating the purpose of sole tenancy. The scheduling block should include node_affinities that reference the node group to ensure instances are placed on the dedicated sole tenant nodes.
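
A minimal sketch of the suggested change, assuming the instance template should target the node group by name. `google_compute_node_group.client` is the resource named in this comment; the affinity key is the standard node-group label that Compute Engine attaches to sole-tenant nodes.

```hcl
scheduling {
  on_host_maintenance = "MIGRATE"

  # Pin instances to hosts that belong to the sole-tenant node group.
  node_affinities {
    key      = "compute.googleapis.com/node-group-name"
    operator = "IN"
    values   = [google_compute_node_group.client.name]
  }
}
```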


Comment thread .github/workflows/pr-tests.yml Fixed
Contributor Author

djeebus commented Nov 6, 2025

Putting this back into draft, as us-west1 has no available sole-tenant n1 resources that also have local SSDs.

djeebus marked this pull request as draft November 6, 2025 02:44
@ValentaTomas
Member

We will reopen after we have support for the new machine types.

ValentaTomas deleted the sole-tenancy-vm branch November 15, 2025 04:04