feat: add Topology Aware Scheduling support#6375
Conversation
d05ac57 to
a105d97
Compare
WalkthroughThis pull request introduces Topology Aware Scheduling (TAS) for Kubernetes deployments by adding TopologyConstraint and TopologyDomain types, extending CRD schemas, updating RBAC for Grove integration, implementing webhook validation with inheritance rules, and adding reconciliation logic for condition propagation. Changes span API definitions, controllers, validation, tests, and documentation. Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Poem
🚥 Pre-merge checks | ✅ 1 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. 📝 Coding Plan
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment Tip You can disable the changed files summary in the walkthrough.Disable the |
There was a problem hiding this comment.
Actionable comments posted: 8
🧹 Nitpick comments (6)
docs/pages/kubernetes/deployment/create-deployment.md (1)
192-192: Optional: hyphenate "Topology Aware Scheduling" consistently across the PR.The LanguageTool static analysis flag is technically correct — "Topology-Aware Scheduling" is the hyphenated adjective form. However, the term is used without a hyphen consistently everywhere in this PR (page title, nav entries, other section headings). If you do hyphenate, update it uniformly across all affected docs; otherwise leave it as-is.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/pages/kubernetes/deployment/create-deployment.md` at line 192, The heading "Topology Aware Scheduling" is inconsistent with the LanguageTool suggestion to hyphenate; decide whether to use "Topology-Aware Scheduling" or keep "Topology Aware Scheduling" and then apply that choice uniformly across the PR — update the page title, this heading, nav entries, and any other section headings or references (search for the exact phrase "Topology Aware Scheduling") to match the chosen hyphenation.deploy/operator/api/v1alpha1/topology_types.go (1)
85-93:IsNarrowerOrEqualsilently returns a potentially misleading result for invalid domains.If either receiver or argument is not in
topologyDomainOrder, the map lookup returns0, causing silent miscomparisons (e.g., an unrecognized domain appears "broader than everything"). While the kubebuilder enum marker should prevent invalid values at the API boundary, these are exported helpers that could be called from tests or future code paths without prior validation.Consider either returning an error or adding a validity guard:
♻️ Optional defensive approach
-func (d TopologyDomain) IsNarrowerOrEqual(other TopologyDomain) bool { - return topologyDomainOrder[d] >= topologyDomainOrder[other] +func (d TopologyDomain) IsNarrowerOrEqual(other TopologyDomain) (bool, error) { + dOrder, dOk := topologyDomainOrder[d] + oOrder, oOk := topologyDomainOrder[other] + if !dOk || !oOk { + return false, fmt.Errorf("unknown topology domain(s): %q, %q", d, other) + } + return dOrder >= oOrder, nil +}This would require updating
ValidateHierarchyand other callers accordingly.Alternatively, if keeping the current
boolsignature, document the precondition that both domains must be valid (checked viaIsValidTopologyDomain).🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@deploy/operator/api/v1alpha1/topology_types.go` around lines 85 - 93, IsNarrowerOrEqual currently uses topologyDomainOrder lookup that yields 0 for unknown domains, causing silent miscomparisons; update IsNarrowerOrEqual to guard by calling IsValidTopologyDomain for both receiver and other and if either is invalid return false (or alternatively change the signature to return (bool, error) and return an error for invalid domains), and if you choose the latter update all callers such as ValidateHierarchy to handle the new error return; ensure references to topologyDomainOrder remain but are only used after validity checks.deploy/operator/internal/dynamo/graph_test.go (1)
7067-7093: Strengthen clique/PCSG coverage to avoid silent passes.
The loops skip any clique/PCSG not present in the expectation maps, so unexpected names or extra cliques can pass without validation. Consider asserting that the expected name set matches the actual set and error on unknown entries.✅ Suggested tightening of assertions
- // Verify clique-level TopologyConstraints - for _, clique := range pcs.Spec.Template.Cliques { - expectedTC, ok := tt.wantCliqueTC[clique.Name] - if !ok { - continue - } + // Verify clique-level TopologyConstraints + if tt.wantCliqueTC != nil { + expectedNames := make([]string, 0, len(tt.wantCliqueTC)) + for name := range tt.wantCliqueTC { + expectedNames = append(expectedNames, name) + } + actualNames := make([]string, 0, len(pcs.Spec.Template.Cliques)) + for _, clique := range pcs.Spec.Template.Cliques { + actualNames = append(actualNames, clique.Name) + } + assert.ElementsMatch(t, expectedNames, actualNames, "clique names mismatch") + } + for _, clique := range pcs.Spec.Template.Cliques { + expectedTC, ok := tt.wantCliqueTC[clique.Name] + if !ok { + t.Errorf("unexpected clique %q in PCS template", clique.Name) + continue + } if expectedTC == nil { assert.Nil(t, clique.TopologyConstraint, "clique %q: expected nil TopologyConstraint", clique.Name) } else { assert.NotNil(t, clique.TopologyConstraint, "clique %q: expected non-nil TopologyConstraint", clique.Name) assert.Equal(t, expectedTC.PackDomain, clique.TopologyConstraint.PackDomain, "clique %q: packDomain mismatch", clique.Name) } } // Verify PCSG-level TopologyConstraints assert.Equal(t, tt.wantPCSGCount, len(pcs.Spec.Template.PodCliqueScalingGroupConfigs), "PCSG count mismatch") + if tt.wantPCSGTC != nil { + expectedNames := make([]string, 0, len(tt.wantPCSGTC)) + for name := range tt.wantPCSGTC { + expectedNames = append(expectedNames, name) + } + actualNames := make([]string, 0, len(pcs.Spec.Template.PodCliqueScalingGroupConfigs)) + for _, pcsg := range pcs.Spec.Template.PodCliqueScalingGroupConfigs { + actualNames = append(actualNames, pcsg.Name) + } + assert.ElementsMatch(t, expectedNames, actualNames, "PCSG names mismatch") + } for _, pcsg := range pcs.Spec.Template.PodCliqueScalingGroupConfigs { if tt.wantPCSGTC != nil { expectedTC, ok := tt.wantPCSGTC[pcsg.Name] if ok {🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@deploy/operator/internal/dynamo/graph_test.go` around lines 7067 - 7093, The test currently only checks cliques and PCSGs that exist in the expectation maps (wantCliqueTC, wantPCSGTC) which allows extra or missing entries to slip by; update the checks around pcs.Spec.Template.Cliques and pcs.Spec.Template.PodCliqueScalingGroupConfigs to first assert that the set of actual names equals the set of expected names (for example by collecting names from pcs.Spec.Template.Cliques and comparing to keys of wantCliqueTC, and similarly for PodCliqueScalingGroupConfigs vs wantPCSGTC), then iterate and perform the existing nil/non-nil and PackDomain comparisons (using the existing assertions for clique.TopologyConstraint and pcsg.TopologyConstraint) so unknown or extra entries fail the test rather than being skipped.deploy/operator/internal/webhook/validation/dynamographdeployment_test.go (1)
1063-1357: Remove duplicated topology constraint test cases to keep the suite leanThe block starting at Line 1063 repeats the same cases already defined around Lines 768–1062, which doubles runtime and makes subtest output ambiguous. Consider deleting the duplicate block or extracting shared cases into a helper used once.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@deploy/operator/internal/webhook/validation/dynamographdeployment_test.go` around lines 1063 - 1357, The test suite contains a duplicated set of topology-constraint subtests (same cases like "no topology constraints is valid (backward compatible)", "valid topology constraints with spec and service level", "service-only topology constraint (no spec-level) is valid", etc.) within dynamographdeployment_test.go; remove the duplicated block of test cases (the repeated slice entries used in the topology constraint validation table) or refactor them into a single helper/fixture invoked once (e.g., consolidate into the existing topology constraint test table or a shared function used by TestValidateTopologyConstraints) so each case appears only once and the subtests are not duplicated.deploy/operator/internal/controller/dynamographdeployment_controller.go (1)
486-500: Avoid repeated warning events when the topology-unavailable condition persistsLine 499 emits a warning on every reconcile while the condition stays unavailable, which can flood the event stream. Consider emitting only on transition (status/reason/message change).
♻️ Suggested guard for event emission
- logger.Info("Topology constraints no longer enforced", "reason", reason, "message", groveTopoCond.Message) - r.Recorder.Eventf(dgd, corev1.EventTypeWarning, reason, "Topology constraints no longer enforced: %s", groveTopoCond.Message) + prev := metav1.FindStatusCondition(dgd.Status.Conditions, nvidiacomv1alpha1.ConditionTypeTopologyConstraintsEnforced) + shouldEmit := prev == nil || prev.Status != metav1.ConditionFalse || prev.Reason != reason || prev.Message != groveTopoCond.Message + if shouldEmit { + logger.Info("Topology constraints no longer enforced", "reason", reason, "message", groveTopoCond.Message) + r.Recorder.Eventf(dgd, corev1.EventTypeWarning, reason, "Topology constraints no longer enforced: %s", groveTopoCond.Message) + }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@deploy/operator/internal/controller/dynamographdeployment_controller.go` around lines 486 - 500, The code currently calls r.Recorder.Eventf and logs inside the groveTopoCond.Status==True branch every reconcile, which floods events; modify the block around the metav1.Condition construction for ConditionTypeTopologyConstraintsEnforced (where groveTopoCond, reason, dynamoCond and dgd are referenced) to emit the warning only when the condition actually transitions or its reason/message/status changes: fetch the existing condition from dgd.Status.Conditions (by Type == nvidiacomv1alpha1.ConditionTypeTopologyConstraintsEnforced), compare Status/Reason/Message, and only call r.Recorder.Eventf and logger.Info when any of those fields differ (also set LastTransitionTime appropriately on real transitions); otherwise update the condition in status without emitting an event.deploy/operator/internal/webhook/validation/dynamographdeployment.go (1)
608-616:topologyConstraintsEqualcould be more maintainable against futureTopologyConstraintchanges.Currently,
TopologyConstrainthas only thePackDomainfield, so the comparison is complete. However, ifTopologyConstraintgains additional fields in the future, this function will silently miss them without a compile-time signal, potentially allowing unintended field changes during validation checks.Consider using
reflect.DeepEqualfor structural completeness:♻️ Option – use reflect.DeepEqual
+import "reflect" func topologyConstraintsEqual(a, b *nvidiacomv1alpha1.TopologyConstraint) bool { if a == nil && b == nil { return true } if a == nil || b == nil { return false } - return a.PackDomain == b.PackDomain + return reflect.DeepEqual(a, b) }Alternatively, add a comment documenting the intentional field-limited scope for clarity to future maintainers.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@deploy/operator/internal/webhook/validation/dynamographdeployment.go` around lines 608 - 616, The current topologyConstraintsEqual function only compares PackDomain and will miss future fields; update topologyConstraintsEqual to use reflect.DeepEqual(a, b) (add import "reflect") so the comparison is structural and future-proof, or if the intent is to only compare PackDomain, add a clear comment above topologyConstraintsEqual explaining that the comparison is intentionally limited to PackDomain and must be updated when other fields should be considered.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@deploy/helm/charts/crds/templates/nvidia.com_dynamocomponentdeployments.yaml`:
- Around line 11055-11075: The CRD change to topologyConstraint.packDomain is in
an auto-generated file and will be overwritten; update the operator Go type
(e.g., the Topology spec type in topology_types.go or the struct that defines
topologyConstraint/packDomain) to include the desired enum/description, then run
the manifests generation step (make manifests) to regenerate the CRDs so the
Helm template (topologyConstraint / packDomain) is updated from the canonical
source rather than editing deploy/helm/charts/crds/templates/* directly.
In `@deploy/helm/charts/platform/components/operator/templates/manager-rbac.yaml`:
- Around line 127-134: The manager Role currently includes rules for the
cluster-scoped resource "clustertopologies" (apiGroup grove.io, resource
clustertopologies) which is ineffective when namespaceRestriction.enabled=true;
remove that rule from the conditional manager Role/RoleBinding block and instead
create an always-present ClusterRole and matching ClusterRoleBinding granting
read-only verbs (get,list,watch) on clustertopologies.grove.io, modeled after
the existing always-ClusterRole for "queues" (refer to the queue-reader
ClusterRole/ClusterRoleBinding section), and update the
topology-aware-scheduling.md prerequisite note to reflect that read-only RBAC
for clustertopologies is only included when this ClusterRole is present.
In `@deploy/operator/api/v1alpha1/dynamographdeployment_types.go`:
- Around line 70-74: Add a helper like resolveTopologyConstraint(parent
*DynamographDeployment, svcConstraint *TopologyConstraint) that returns
svcConstraint if non-nil otherwise parent.Spec.TopologyConstraint, and replace
direct uses of component.TopologyConstraint in GenerateGrovePodCliqueSet and
generateSingleDCD with calls to this helper so both Grove and DCD pathways
consistently inherit the spec-level topology constraint when a service has none.
In
`@deploy/operator/internal/controller/dynamographdeployment_controller_test.go`:
- Around line 2103-2106: The test references consts.KubeAnnotationEnableGrove
but this file uses the commonconsts import alias; update the annotation key to
commonconsts.KubeAnnotationEnableGrove in the ObjectMeta block (where
ObjectMeta: metav1.ObjectMeta{ Name: "test", Namespace: "default", Annotations:
... }) to fix the unresolved identifier compile error.
In `@deploy/operator/internal/webhook/validation/dynamographdeployment.go`:
- Around line 506-511: The redundant CRD-existence error occurs because
validateTopologyDomainsAgainstCRD is run even when some packDomain strings
already failed IsValidTopologyDomain, producing a misleading "domain does not
exist" message; update the call-site or the validator to avoid duplicate errors:
either (A) in the caller (where v.mgr != nil && v.isGrovePathway()) add a guard
to skip calling v.validateTopologyDomainsAgainstCRD if any domain-level
validation error from IsValidTopologyDomain exists, or (B) modify
validateTopologyDomainsAgainstCRD to filter its domainsToCheck input to only
include domains that passed IsValidTopologyDomain before querying the
ClusterTopology CRD (use the same IsValidTopologyDomain helper and packDomain
identifiers), so only syntactically valid domains are checked against the CRD
and spurious errors are not appended.
- Around line 538-554: Change domainsToCheck from
map[nvidiacomv1alpha1.TopologyDomain]string to
map[nvidiacomv1alpha1.TopologyDomain][]string, collect all fieldPath entries
(push into the slice) in the existing loops that build domainsToCheck (both the
TopologyConstraint on v.deployment.Spec and per-service loop), then when
validating (the loop that iterates domainsToCheck) iterate deterministicly by
extracting and sorting the domain keys, sort each domain's slice of field paths,
and emit an error for each field path (or include all sorted paths in the error)
when the domain is missing (keep using topologyLevelDomains(ct) for the
available list). This ensures no field paths are lost and error order is
deterministic.
In `@docs/pages/kubernetes/topology-aware-scheduling.md`:
- Around line 60-62: The DynamoGraphDeployment YAML examples incorrectly use the
Kubernetes key `nvidia.com/gpu: "1"`; update each example (the three
DynamoGraphDeployment blocks in this document) to use the Dynamo-specific
shorthand `gpu: "1"` under the resources.limits section so they match the
operator's CR format and the rest of the docs (e.g., replace `nvidia.com/gpu:
"1"` with `gpu: "1"` in each DynamoGraphDeployment example).
- Around line 15-17: Replace every occurrence of the incorrect Grove repository
URL "https://github.com/ai-dynamo/grove" in
docs/pages/kubernetes/topology-aware-scheduling.md with the canonical
"https://github.com/NVIDIA/grove" (this applies to the markdown links such as
the Grove Installation Guide and Grove documentation references); ensure all
three instances in the file are updated so links like [Grove Installation
Guide](https://github.com/ai-dynamo/grove/...) and any other references to
ai-dynamo/grove are changed to NVIDIA/grove.
---
Nitpick comments:
In `@deploy/operator/api/v1alpha1/topology_types.go`:
- Around line 85-93: IsNarrowerOrEqual currently uses topologyDomainOrder lookup
that yields 0 for unknown domains, causing silent miscomparisons; update
IsNarrowerOrEqual to guard by calling IsValidTopologyDomain for both receiver
and other and if either is invalid return false (or alternatively change the
signature to return (bool, error) and return an error for invalid domains), and
if you choose the latter update all callers such as ValidateHierarchy to handle
the new error return; ensure references to topologyDomainOrder remain but are
only used after validity checks.
In `@deploy/operator/internal/controller/dynamographdeployment_controller.go`:
- Around line 486-500: The code currently calls r.Recorder.Eventf and logs
inside the groveTopoCond.Status==True branch every reconcile, which floods
events; modify the block around the metav1.Condition construction for
ConditionTypeTopologyConstraintsEnforced (where groveTopoCond, reason,
dynamoCond and dgd are referenced) to emit the warning only when the condition
actually transitions or its reason/message/status changes: fetch the existing
condition from dgd.Status.Conditions (by Type ==
nvidiacomv1alpha1.ConditionTypeTopologyConstraintsEnforced), compare
Status/Reason/Message, and only call r.Recorder.Eventf and logger.Info when any
of those fields differ (also set LastTransitionTime appropriately on real
transitions); otherwise update the condition in status without emitting an
event.
In `@deploy/operator/internal/dynamo/graph_test.go`:
- Around line 7067-7093: The test currently only checks cliques and PCSGs that
exist in the expectation maps (wantCliqueTC, wantPCSGTC) which allows extra or
missing entries to slip by; update the checks around pcs.Spec.Template.Cliques
and pcs.Spec.Template.PodCliqueScalingGroupConfigs to first assert that the set
of actual names equals the set of expected names (for example by collecting
names from pcs.Spec.Template.Cliques and comparing to keys of wantCliqueTC, and
similarly for PodCliqueScalingGroupConfigs vs wantPCSGTC), then iterate and
perform the existing nil/non-nil and PackDomain comparisons (using the existing
assertions for clique.TopologyConstraint and pcsg.TopologyConstraint) so unknown
or extra entries fail the test rather than being skipped.
In `@deploy/operator/internal/webhook/validation/dynamographdeployment_test.go`:
- Around line 1063-1357: The test suite contains a duplicated set of
topology-constraint subtests (same cases like "no topology constraints is valid
(backward compatible)", "valid topology constraints with spec and service
level", "service-only topology constraint (no spec-level) is valid", etc.)
within dynamographdeployment_test.go; remove the duplicated block of test cases
(the repeated slice entries used in the topology constraint validation table) or
refactor them into a single helper/fixture invoked once (e.g., consolidate into
the existing topology constraint test table or a shared function used by
TestValidateTopologyConstraints) so each case appears only once and the subtests
are not duplicated.
In `@deploy/operator/internal/webhook/validation/dynamographdeployment.go`:
- Around line 608-616: The current topologyConstraintsEqual function only
compares PackDomain and will miss future fields; update topologyConstraintsEqual
to use reflect.DeepEqual(a, b) (add import "reflect") so the comparison is
structural and future-proof, or if the intent is to only compare PackDomain, add
a clear comment above topologyConstraintsEqual explaining that the comparison is
intentionally limited to PackDomain and must be updated when other fields should
be considered.
In `@docs/pages/kubernetes/deployment/create-deployment.md`:
- Line 192: The heading "Topology Aware Scheduling" is inconsistent with the
LanguageTool suggestion to hyphenate; decide whether to use "Topology-Aware
Scheduling" or keep "Topology Aware Scheduling" and then apply that choice
uniformly across the PR — update the page title, this heading, nav entries, and
any other section headings or references (search for the exact phrase "Topology
Aware Scheduling") to match the chosen hyphenation.
|
few thoughts, non-blocking:
|
immutability is already documented, the rolling update support introduced recently is currently not using grove. |
111b729 to
e5d58c7
Compare
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
dillon-cullinan
left a comment
There was a problem hiding this comment.
Approving changes to broken_links script.
Overview:
add Topology Aware Scheduling support
according to proposal : ai-dynamo/enhancements#69
fixes DYN-775
Summary by CodeRabbit
New Features
Documentation