Skip to content

feat: add Topology Aware Scheduling support#6375

Merged
julienmancuso merged 16 commits into
mainfrom
jsm/tas-impl
Mar 21, 2026
Merged

feat: add Topology Aware Scheduling support#6375
julienmancuso merged 16 commits into
mainfrom
jsm/tas-impl

Conversation

@julienmancuso
Copy link
Copy Markdown
Contributor

@julienmancuso julienmancuso commented Feb 18, 2026

Overview:

add Topology Aware Scheduling support

according to proposal : ai-dynamo/enhancements#69

fixes DYN-775

Summary by CodeRabbit

  • New Features

    • Added Topology Aware Scheduling support, enabling optional topology constraints to control pod placement across topology domains (zones, racks, hosts, etc.) at deployment or service levels.
    • Includes topology domain hierarchy validation and inheritance support.
  • Documentation

    • Added comprehensive guide for configuring and troubleshooting Topology Aware Scheduling, including prerequisites, configuration examples, and common issues.

@julienmancuso julienmancuso requested a review from a team as a code owner February 18, 2026 15:58
@github-actions github-actions Bot added feat deployment::k8s Relates to dynamo deployment in kubernetes labels Feb 18, 2026
@github-actions github-actions Bot added the documentation Improvements or additions to documentation label Feb 18, 2026
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Feb 18, 2026

Walkthrough

This pull request introduces Topology Aware Scheduling (TAS) for Kubernetes deployments by adding TopologyConstraint and TopologyDomain types, extending CRD schemas, updating RBAC for Grove integration, implementing webhook validation with inheritance rules, and adding reconciliation logic for condition propagation. Changes span API definitions, controllers, validation, tests, and documentation.

Changes

Cohort / File(s) Summary
API Type Definitions
deploy/operator/api/v1alpha1/topology_types.go, deploy/operator/api/v1alpha1/dynamographdeployment_types.go, deploy/operator/api/v1alpha1/dynamocomponentdeployment_types.go, deploy/operator/api/v1alpha1/zz_generated.deepcopy.go
Introduces TopologyConstraint and TopologyDomain types with condition constants; adds TopologyConstraint field to DynamoGraphDeploymentSpec and DynamoComponentDeploymentSharedSpec; includes HasAnyTopologyConstraint() method; generates deep copy support.
CRD & Schema Definitions
deploy/operator/config/crd/bases/nvidia.com_dynamograph*.yaml, deploy/operator/config/crd/bases/nvidia.com_dynamocomponentdeployments.yaml, deploy/helm/charts/platform/components/operator/crds/nvidia.com_dynamograph*.yaml, deploy/helm/charts/platform/components/operator/crds/nvidia.com_dynamocomponentdeployments.yaml
Extends CRD schemas with topologyConstraint field at deployment, component, and service levels; defines packDomain validation patterns and topologyProfile inheritance semantics.
RBAC Configuration
deploy/helm/charts/platform/components/operator/templates/manager-rbac.yaml, deploy/operator/config/rbac/role.yaml
Updates ClusterRole and ClusterRoleBinding for Grove ClusterTopology access (get/list/watch); adds Kai-scheduler queue access rules (get/list); removes prior scheduling.run.ai queue access from manager Role.
Webhook Validation
deploy/operator/internal/webhook/validation/dynamographdeployment.go, deploy/operator/internal/webhook/validation/dynamographdeployment_handler.go
Adds comprehensive topology constraint validation including domain format checks, Grove ClusterTopology verification, inheritance hierarchy enforcement, and immutability rules; introduces validateTopologyConstraints(), validateTopologyDomainsAgainstGroveClusterTopology(), and validateTopologyConstraintImmutability() methods.
Controller Reconciliation
deploy/operator/internal/controller/dynamographdeployment_controller.go, deploy/operator/internal/dynamo/graph.go, deploy/operator/internal/dynamo/grove.go
Adds propagateTopologyCondition() method to propagate Grove topology conditions to DGD status; injects deployment-level topology constraints into Grove PCS template; handles single-node and multinode service propagation; adds toGroveTopologyConstraint() converter.
Test Coverage
deploy/operator/internal/controller/dynamographdeployment_controller_test.go, deploy/operator/internal/dynamo/graph_test.go, deploy/operator/internal/webhook/validation/dynamographdeployment_test.go
Adds TestPropagateTopologyCondition suite for condition propagation scenarios; introduces TestGenerateGrovePodCliqueSet_TopologyConstraints for PCS generation; expands validation tests with 337+ lines covering topology constraint edge cases and inheritance rules.
Documentation
docs/kubernetes/topology-aware-scheduling.md, docs/kubernetes/api-reference.md, docs/kubernetes/deployment/create-deployment.md, docs/kubernetes/deployment/multinode-deployment.md, docs/kubernetes/grove.md, docs/index.yml, docs/kubernetes/README.md
Introduces new TAS guide with prerequisites, topology domain semantics, packDomain constraints, multinode considerations, immutability rules, monitoring, and troubleshooting; updates API reference, deployment guides, and navigation.
Build & Cleanup
deploy/helm/charts/platform/Makefile
Minor formatting: adds newline at end of file.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 Topology domains now bloom, from host to zone so wide,
Constraints inherit like clover, with hierarchy as our guide,
Grove and scheduler in harmony, placement refined with care,
Pods pack closer together, in topology trees we share! 🌲

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Description check ⚠️ Warning The PR description is incomplete. It provides an overview and issue reference but is missing the 'Details' and 'Where should the reviewer start?' sections from the template. Add 'Details' section describing the changes made and 'Where should the reviewer start?' section identifying key files for review.
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (1 passed)
Check name Status Explanation
Title check ✅ Passed The title 'feat: add Topology Aware Scheduling support' clearly and accurately summarizes the main change in the PR, which is the addition of Topology Aware Scheduling functionality.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

📝 Coding Plan
  • Generate coding plan for human review comments

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Tip

You can disable the changed files summary in the walkthrough.

Disable the reviews.changed_files_summary setting to disable the changed files summary in the walkthrough.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 8

🧹 Nitpick comments (6)
docs/pages/kubernetes/deployment/create-deployment.md (1)

192-192: Optional: hyphenate "Topology Aware Scheduling" consistently across the PR.

The LanguageTool static analysis flag is technically correct — "Topology-Aware Scheduling" is the hyphenated adjective form. However, the term is used without a hyphen consistently everywhere in this PR (page title, nav entries, other section headings). If you do hyphenate, update it uniformly across all affected docs; otherwise leave it as-is.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/pages/kubernetes/deployment/create-deployment.md` at line 192, The
heading "Topology Aware Scheduling" is inconsistent with the LanguageTool
suggestion to hyphenate; decide whether to use "Topology-Aware Scheduling" or
keep "Topology Aware Scheduling" and then apply that choice uniformly across the
PR — update the page title, this heading, nav entries, and any other section
headings or references (search for the exact phrase "Topology Aware Scheduling")
to match the chosen hyphenation.
deploy/operator/api/v1alpha1/topology_types.go (1)

85-93: IsNarrowerOrEqual silently returns a potentially misleading result for invalid domains.

If either receiver or argument is not in topologyDomainOrder, the map lookup returns 0, causing silent miscomparisons (e.g., an unrecognized domain appears "broader than everything"). While the kubebuilder enum marker should prevent invalid values at the API boundary, these are exported helpers that could be called from tests or future code paths without prior validation.

Consider either returning an error or adding a validity guard:

♻️ Optional defensive approach
-func (d TopologyDomain) IsNarrowerOrEqual(other TopologyDomain) bool {
-	return topologyDomainOrder[d] >= topologyDomainOrder[other]
+func (d TopologyDomain) IsNarrowerOrEqual(other TopologyDomain) (bool, error) {
+	dOrder, dOk := topologyDomainOrder[d]
+	oOrder, oOk := topologyDomainOrder[other]
+	if !dOk || !oOk {
+		return false, fmt.Errorf("unknown topology domain(s): %q, %q", d, other)
+	}
+	return dOrder >= oOrder, nil
+}

This would require updating ValidateHierarchy and other callers accordingly.

Alternatively, if keeping the current bool signature, document the precondition that both domains must be valid (checked via IsValidTopologyDomain).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deploy/operator/api/v1alpha1/topology_types.go` around lines 85 - 93,
IsNarrowerOrEqual currently uses topologyDomainOrder lookup that yields 0 for
unknown domains, causing silent miscomparisons; update IsNarrowerOrEqual to
guard by calling IsValidTopologyDomain for both receiver and other and if either
is invalid return false (or alternatively change the signature to return (bool,
error) and return an error for invalid domains), and if you choose the latter
update all callers such as ValidateHierarchy to handle the new error return;
ensure references to topologyDomainOrder remain but are only used after validity
checks.
deploy/operator/internal/dynamo/graph_test.go (1)

7067-7093: Strengthen clique/PCSG coverage to avoid silent passes.
The loops skip any clique/PCSG not present in the expectation maps, so unexpected names or extra cliques can pass without validation. Consider asserting that the expected name set matches the actual set and error on unknown entries.

✅ Suggested tightening of assertions
-			// Verify clique-level TopologyConstraints
-			for _, clique := range pcs.Spec.Template.Cliques {
-				expectedTC, ok := tt.wantCliqueTC[clique.Name]
-				if !ok {
-					continue
-				}
+			// Verify clique-level TopologyConstraints
+			if tt.wantCliqueTC != nil {
+				expectedNames := make([]string, 0, len(tt.wantCliqueTC))
+				for name := range tt.wantCliqueTC {
+					expectedNames = append(expectedNames, name)
+				}
+				actualNames := make([]string, 0, len(pcs.Spec.Template.Cliques))
+				for _, clique := range pcs.Spec.Template.Cliques {
+					actualNames = append(actualNames, clique.Name)
+				}
+				assert.ElementsMatch(t, expectedNames, actualNames, "clique names mismatch")
+			}
+			for _, clique := range pcs.Spec.Template.Cliques {
+				expectedTC, ok := tt.wantCliqueTC[clique.Name]
+				if !ok {
+					t.Errorf("unexpected clique %q in PCS template", clique.Name)
+					continue
+				}
 				if expectedTC == nil {
 					assert.Nil(t, clique.TopologyConstraint, "clique %q: expected nil TopologyConstraint", clique.Name)
 				} else {
 					assert.NotNil(t, clique.TopologyConstraint, "clique %q: expected non-nil TopologyConstraint", clique.Name)
 					assert.Equal(t, expectedTC.PackDomain, clique.TopologyConstraint.PackDomain, "clique %q: packDomain mismatch", clique.Name)
 				}
 			}
 
 			// Verify PCSG-level TopologyConstraints
 			assert.Equal(t, tt.wantPCSGCount, len(pcs.Spec.Template.PodCliqueScalingGroupConfigs), "PCSG count mismatch")
+			if tt.wantPCSGTC != nil {
+				expectedNames := make([]string, 0, len(tt.wantPCSGTC))
+				for name := range tt.wantPCSGTC {
+					expectedNames = append(expectedNames, name)
+				}
+				actualNames := make([]string, 0, len(pcs.Spec.Template.PodCliqueScalingGroupConfigs))
+				for _, pcsg := range pcs.Spec.Template.PodCliqueScalingGroupConfigs {
+					actualNames = append(actualNames, pcsg.Name)
+				}
+				assert.ElementsMatch(t, expectedNames, actualNames, "PCSG names mismatch")
+			}
 			for _, pcsg := range pcs.Spec.Template.PodCliqueScalingGroupConfigs {
 				if tt.wantPCSGTC != nil {
 					expectedTC, ok := tt.wantPCSGTC[pcsg.Name]
 					if ok {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deploy/operator/internal/dynamo/graph_test.go` around lines 7067 - 7093, The
test currently only checks cliques and PCSGs that exist in the expectation maps
(wantCliqueTC, wantPCSGTC) which allows extra or missing entries to slip by;
update the checks around pcs.Spec.Template.Cliques and
pcs.Spec.Template.PodCliqueScalingGroupConfigs to first assert that the set of
actual names equals the set of expected names (for example by collecting names
from pcs.Spec.Template.Cliques and comparing to keys of wantCliqueTC, and
similarly for PodCliqueScalingGroupConfigs vs wantPCSGTC), then iterate and
perform the existing nil/non-nil and PackDomain comparisons (using the existing
assertions for clique.TopologyConstraint and pcsg.TopologyConstraint) so unknown
or extra entries fail the test rather than being skipped.
deploy/operator/internal/webhook/validation/dynamographdeployment_test.go (1)

1063-1357: Remove duplicated topology constraint test cases to keep the suite lean

The block starting at Line 1063 repeats the same cases already defined around Lines 768–1062, which doubles runtime and makes subtest output ambiguous. Consider deleting the duplicate block or extracting shared cases into a helper used once.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deploy/operator/internal/webhook/validation/dynamographdeployment_test.go`
around lines 1063 - 1357, The test suite contains a duplicated set of
topology-constraint subtests (same cases like "no topology constraints is valid
(backward compatible)", "valid topology constraints with spec and service
level", "service-only topology constraint (no spec-level) is valid", etc.)
within dynamographdeployment_test.go; remove the duplicated block of test cases
(the repeated slice entries used in the topology constraint validation table) or
refactor them into a single helper/fixture invoked once (e.g., consolidate into
the existing topology constraint test table or a shared function used by
TestValidateTopologyConstraints) so each case appears only once and the subtests
are not duplicated.
deploy/operator/internal/controller/dynamographdeployment_controller.go (1)

486-500: Avoid repeated warning events when the topology-unavailable condition persists

Line 499 emits a warning on every reconcile while the condition stays unavailable, which can flood the event stream. Consider emitting only on transition (status/reason/message change).

♻️ Suggested guard for event emission
-		logger.Info("Topology constraints no longer enforced", "reason", reason, "message", groveTopoCond.Message)
-		r.Recorder.Eventf(dgd, corev1.EventTypeWarning, reason, "Topology constraints no longer enforced: %s", groveTopoCond.Message)
+		prev := metav1.FindStatusCondition(dgd.Status.Conditions, nvidiacomv1alpha1.ConditionTypeTopologyConstraintsEnforced)
+		shouldEmit := prev == nil || prev.Status != metav1.ConditionFalse || prev.Reason != reason || prev.Message != groveTopoCond.Message
+		if shouldEmit {
+			logger.Info("Topology constraints no longer enforced", "reason", reason, "message", groveTopoCond.Message)
+			r.Recorder.Eventf(dgd, corev1.EventTypeWarning, reason, "Topology constraints no longer enforced: %s", groveTopoCond.Message)
+		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deploy/operator/internal/controller/dynamographdeployment_controller.go`
around lines 486 - 500, The code currently calls r.Recorder.Eventf and logs
inside the groveTopoCond.Status==True branch every reconcile, which floods
events; modify the block around the metav1.Condition construction for
ConditionTypeTopologyConstraintsEnforced (where groveTopoCond, reason,
dynamoCond and dgd are referenced) to emit the warning only when the condition
actually transitions or its reason/message/status changes: fetch the existing
condition from dgd.Status.Conditions (by Type ==
nvidiacomv1alpha1.ConditionTypeTopologyConstraintsEnforced), compare
Status/Reason/Message, and only call r.Recorder.Eventf and logger.Info when any
of those fields differ (also set LastTransitionTime appropriately on real
transitions); otherwise update the condition in status without emitting an
event.
deploy/operator/internal/webhook/validation/dynamographdeployment.go (1)

608-616: topologyConstraintsEqual could be more maintainable against future TopologyConstraint changes.

Currently, TopologyConstraint has only the PackDomain field, so the comparison is complete. However, if TopologyConstraint gains additional fields in the future, this function will silently miss them without a compile-time signal, potentially allowing unintended field changes during validation checks.

Consider using reflect.DeepEqual for structural completeness:

♻️ Option – use reflect.DeepEqual
+import "reflect"

 func topologyConstraintsEqual(a, b *nvidiacomv1alpha1.TopologyConstraint) bool {
 	if a == nil && b == nil {
 		return true
 	}
 	if a == nil || b == nil {
 		return false
 	}
-	return a.PackDomain == b.PackDomain
+	return reflect.DeepEqual(a, b)
 }

Alternatively, add a comment documenting the intentional field-limited scope for clarity to future maintainers.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deploy/operator/internal/webhook/validation/dynamographdeployment.go` around
lines 608 - 616, The current topologyConstraintsEqual function only compares
PackDomain and will miss future fields; update topologyConstraintsEqual to use
reflect.DeepEqual(a, b) (add import "reflect") so the comparison is structural
and future-proof, or if the intent is to only compare PackDomain, add a clear
comment above topologyConstraintsEqual explaining that the comparison is
intentionally limited to PackDomain and must be updated when other fields should
be considered.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@deploy/helm/charts/crds/templates/nvidia.com_dynamocomponentdeployments.yaml`:
- Around line 11055-11075: The CRD change to topologyConstraint.packDomain is in
an auto-generated file and will be overwritten; update the operator Go type
(e.g., the Topology spec type in topology_types.go or the struct that defines
topologyConstraint/packDomain) to include the desired enum/description, then run
the manifests generation step (make manifests) to regenerate the CRDs so the
Helm template (topologyConstraint / packDomain) is updated from the canonical
source rather than editing deploy/helm/charts/crds/templates/* directly.

In `@deploy/helm/charts/platform/components/operator/templates/manager-rbac.yaml`:
- Around line 127-134: The manager Role currently includes rules for the
cluster-scoped resource "clustertopologies" (apiGroup grove.io, resource
clustertopologies) which is ineffective when namespaceRestriction.enabled=true;
remove that rule from the conditional manager Role/RoleBinding block and instead
create an always-present ClusterRole and matching ClusterRoleBinding granting
read-only verbs (get,list,watch) on clustertopologies.grove.io, modeled after
the existing always-ClusterRole for "queues" (refer to the queue-reader
ClusterRole/ClusterRoleBinding section), and update the
topology-aware-scheduling.md prerequisite note to reflect that read-only RBAC
for clustertopologies is only included when this ClusterRole is present.

In `@deploy/operator/api/v1alpha1/dynamographdeployment_types.go`:
- Around line 70-74: Add a helper like resolveTopologyConstraint(parent
*DynamographDeployment, svcConstraint *TopologyConstraint) that returns
svcConstraint if non-nil otherwise parent.Spec.TopologyConstraint, and replace
direct uses of component.TopologyConstraint in GenerateGrovePodCliqueSet and
generateSingleDCD with calls to this helper so both Grove and DCD pathways
consistently inherit the spec-level topology constraint when a service has none.

In
`@deploy/operator/internal/controller/dynamographdeployment_controller_test.go`:
- Around line 2103-2106: The test references consts.KubeAnnotationEnableGrove
but this file uses the commonconsts import alias; update the annotation key to
commonconsts.KubeAnnotationEnableGrove in the ObjectMeta block (where
ObjectMeta: metav1.ObjectMeta{ Name: "test", Namespace: "default", Annotations:
... }) to fix the unresolved identifier compile error.

In `@deploy/operator/internal/webhook/validation/dynamographdeployment.go`:
- Around line 506-511: The redundant CRD-existence error occurs because
validateTopologyDomainsAgainstCRD is run even when some packDomain strings
already failed IsValidTopologyDomain, producing a misleading "domain does not
exist" message; update the call-site or the validator to avoid duplicate errors:
either (A) in the caller (where v.mgr != nil && v.isGrovePathway()) add a guard
to skip calling v.validateTopologyDomainsAgainstCRD if any domain-level
validation error from IsValidTopologyDomain exists, or (B) modify
validateTopologyDomainsAgainstCRD to filter its domainsToCheck input to only
include domains that passed IsValidTopologyDomain before querying the
ClusterTopology CRD (use the same IsValidTopologyDomain helper and packDomain
identifiers), so only syntactically valid domains are checked against the CRD
and spurious errors are not appended.
- Around line 538-554: Change domainsToCheck from
map[nvidiacomv1alpha1.TopologyDomain]string to
map[nvidiacomv1alpha1.TopologyDomain][]string, collect all fieldPath entries
(push into the slice) in the existing loops that build domainsToCheck (both the
TopologyConstraint on v.deployment.Spec and per-service loop), then when
validating (the loop that iterates domainsToCheck) iterate deterministicly by
extracting and sorting the domain keys, sort each domain's slice of field paths,
and emit an error for each field path (or include all sorted paths in the error)
when the domain is missing (keep using topologyLevelDomains(ct) for the
available list). This ensures no field paths are lost and error order is
deterministic.

In `@docs/pages/kubernetes/topology-aware-scheduling.md`:
- Around line 60-62: The DynamoGraphDeployment YAML examples incorrectly use the
Kubernetes key `nvidia.com/gpu: "1"`; update each example (the three
DynamoGraphDeployment blocks in this document) to use the Dynamo-specific
shorthand `gpu: "1"` under the resources.limits section so they match the
operator's CR format and the rest of the docs (e.g., replace `nvidia.com/gpu:
"1"` with `gpu: "1"` in each DynamoGraphDeployment example).
- Around line 15-17: Replace every occurrence of the incorrect Grove repository
URL "https://github.com/ai-dynamo/grove" in
docs/pages/kubernetes/topology-aware-scheduling.md with the canonical
"https://github.com/NVIDIA/grove" (this applies to the markdown links such as
the Grove Installation Guide and Grove documentation references); ensure all
three instances in the file are updated so links like [Grove Installation
Guide](https://github.com/ai-dynamo/grove/...) and any other references to
ai-dynamo/grove are changed to NVIDIA/grove.

---

Nitpick comments:
In `@deploy/operator/api/v1alpha1/topology_types.go`:
- Around line 85-93: IsNarrowerOrEqual currently uses topologyDomainOrder lookup
that yields 0 for unknown domains, causing silent miscomparisons; update
IsNarrowerOrEqual to guard by calling IsValidTopologyDomain for both receiver
and other and if either is invalid return false (or alternatively change the
signature to return (bool, error) and return an error for invalid domains), and
if you choose the latter update all callers such as ValidateHierarchy to handle
the new error return; ensure references to topologyDomainOrder remain but are
only used after validity checks.

In `@deploy/operator/internal/controller/dynamographdeployment_controller.go`:
- Around line 486-500: The code currently calls r.Recorder.Eventf and logs
inside the groveTopoCond.Status==True branch every reconcile, which floods
events; modify the block around the metav1.Condition construction for
ConditionTypeTopologyConstraintsEnforced (where groveTopoCond, reason,
dynamoCond and dgd are referenced) to emit the warning only when the condition
actually transitions or its reason/message/status changes: fetch the existing
condition from dgd.Status.Conditions (by Type ==
nvidiacomv1alpha1.ConditionTypeTopologyConstraintsEnforced), compare
Status/Reason/Message, and only call r.Recorder.Eventf and logger.Info when any
of those fields differ (also set LastTransitionTime appropriately on real
transitions); otherwise update the condition in status without emitting an
event.

In `@deploy/operator/internal/dynamo/graph_test.go`:
- Around line 7067-7093: The test currently only checks cliques and PCSGs that
exist in the expectation maps (wantCliqueTC, wantPCSGTC) which allows extra or
missing entries to slip by; update the checks around pcs.Spec.Template.Cliques
and pcs.Spec.Template.PodCliqueScalingGroupConfigs to first assert that the set
of actual names equals the set of expected names (for example by collecting
names from pcs.Spec.Template.Cliques and comparing to keys of wantCliqueTC, and
similarly for PodCliqueScalingGroupConfigs vs wantPCSGTC), then iterate and
perform the existing nil/non-nil and PackDomain comparisons (using the existing
assertions for clique.TopologyConstraint and pcsg.TopologyConstraint) so unknown
or extra entries fail the test rather than being skipped.

In `@deploy/operator/internal/webhook/validation/dynamographdeployment_test.go`:
- Around line 1063-1357: The test suite contains a duplicated set of
topology-constraint subtests (same cases like "no topology constraints is valid
(backward compatible)", "valid topology constraints with spec and service
level", "service-only topology constraint (no spec-level) is valid", etc.)
within dynamographdeployment_test.go; remove the duplicated block of test cases
(the repeated slice entries used in the topology constraint validation table) or
refactor them into a single helper/fixture invoked once (e.g., consolidate into
the existing topology constraint test table or a shared function used by
TestValidateTopologyConstraints) so each case appears only once and the subtests
are not duplicated.

In `@deploy/operator/internal/webhook/validation/dynamographdeployment.go`:
- Around line 608-616: The current topologyConstraintsEqual function only
compares PackDomain and will miss future fields; update topologyConstraintsEqual
to use reflect.DeepEqual(a, b) (add import "reflect") so the comparison is
structural and future-proof, or if the intent is to only compare PackDomain, add
a clear comment above topologyConstraintsEqual explaining that the comparison is
intentionally limited to PackDomain and must be updated when other fields should
be considered.

In `@docs/pages/kubernetes/deployment/create-deployment.md`:
- Line 192: The heading "Topology Aware Scheduling" is inconsistent with the
LanguageTool suggestion to hyphenate; decide whether to use "Topology-Aware
Scheduling" or keep "Topology Aware Scheduling" and then apply that choice
uniformly across the PR — update the page title, this heading, nav entries, and
any other section headings or references (search for the exact phrase "Topology
Aware Scheduling") to match the chosen hyphenation.

Comment thread deploy/helm/charts/platform/components/operator/templates/manager-rbac.yaml Outdated
Comment thread deploy/operator/api/v1alpha1/dynamographdeployment_types.go Outdated
Comment thread deploy/operator/internal/webhook/validation/dynamographdeployment.go Outdated
Comment thread deploy/operator/internal/webhook/validation/dynamographdeployment.go Outdated
Comment thread docs/pages/kubernetes/topology-aware-scheduling.md Outdated
Comment thread docs/pages/kubernetes/topology-aware-scheduling.md Outdated
@athreesh
Copy link
Copy Markdown
Contributor

few thoughts, non-blocking:

  • seems like topology constraints are immutable after creation, will we document this and then tie in how to use @tmonty12 rolling update support?
  • Is the multinode guidance clear enough? This is likely where our most sophisticated customers will use TAS (large model sharding).
  • The "Prerequisites" table lists KAI Scheduler as "recommended" — should it be "required"?

Comment thread deploy/helm/charts/platform/Makefile
@julienmancuso
Copy link
Copy Markdown
Contributor Author

  • seems like topology constraints are immutable after creation, will we document this and then tie in how to use @tmonty12 rolling update support?
  • Is the multinode guidance clear enough? This is likely where our most sophisticated customers will use TAS (large model sharding).
  • The "Prerequisites" table lists KAI Scheduler as "recommended" — should it be "required"?

immutability is already documented, the rolling update support introduced recently is currently not using grove.
doc was improved for multinode.
and yes it is required, doc updated.

Comment thread deploy/operator/internal/controller/dynamographdeployment_controller.go Outdated
Comment thread deploy/operator/internal/dynamo/graph.go Outdated
Comment thread deploy/operator/internal/dynamo/graph.go
Comment thread deploy/operator/internal/webhook/validation/dynamographdeployment.go Outdated
Comment thread deploy/operator/internal/webhook/validation/dynamographdeployment.go Outdated
Comment thread deploy/operator/internal/webhook/validation/dynamographdeployment.go Outdated
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Feb 20, 2026

Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
@julienmancuso julienmancuso requested a review from a team March 19, 2026 23:21
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Copy link
Copy Markdown
Contributor

@tmonty12 tmonty12 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Comment thread .github/workflows/detect_broken_links.py
Copy link
Copy Markdown
Contributor

@dillon-cullinan dillon-cullinan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving changes to broken_links script.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

actions deployment::k8s Relates to dynamo deployment in kubernetes documentation Improvements or additions to documentation feat size/XXL

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants