Bizz001/resize controlplane response messages#4786
Conversation
|
Skipping CI for Draft Pull Request. |
| return nil, err | ||
| } | ||
|
|
||
| return json.Marshal("All pre-flight checks passed") |
There was a problem hiding this comment.
Would it make sense to return the structured resizePreflightResult here as well instead of collapsing it back to a JSON string? We now compute per-check status and duration, and exposing that from the standalone prevalidation endpoint would make it more useful for SREs without requiring them to start the resize flow just to inspect the checks.
There was a problem hiding this comment.
Done in adca76b. The standalone prevalidation endpoint now returns the structured resizePreflightResult with per-check name, status, duration, and message.
| } | ||
|
|
||
| func buildResizePreflightResult(checks map[string]adminapi.ResizeControlPlaneCheck, duration time.Duration) *resizePreflightResult { | ||
| order := []string{ |
There was a problem hiding this comment.
Small maintainability nit: this fixed order list means a newly added pre-flight check can be silently dropped from the serialized result unless someone remembers to update this function too. Would it be safer to build the slice in-order as checks are recorded, or at least append unknown checks after the known ones so future validations don't disappear?
There was a problem hiding this comment.
Addressed in 8237efd. Extracted the order to a package-level var resizePreflightCheckOrder and added a fallback loop that appends any checks present in the map but missing from the order, so new checks are never silently dropped.
| FailedNode string `json:"failedNode,omitempty"` | ||
| FailedStep string `json:"failedStep,omitempty"` | ||
| NextAction string `json:"nextAction,omitempty"` | ||
| Nodes []ResizeControlPlaneNodeOperation `json:"nodes,omitempty"` |
There was a problem hiding this comment.
Small naming nit: Nodes []ResizeControlPlaneNodeOperation reads a bit like desired actions, but these entries are really execution results. Would something like NodeResults []ResizeControlPlaneNodeResult make the contract a little easier to scan for someone reading the payload without implementation context?
There was a problem hiding this comment.
I'd prefer to keep Nodes — the struct captures the full operation lifecycle (steps, failedStep, nextAction), not just a terminal result, so NodeResults would be slightly misleading. The parent type ResizeControlPlaneResponse already provides context.
| details = append(details, api.CloudErrorBody{ | ||
| Code: "ResizeRequest", | ||
| Target: report.ResourceID, | ||
| Message: fmt.Sprintf("Requested VM size %s with deallocateVM=%t. Elapsed time before failure: %s.", report.VMSize, report.DeallocateVM, formatResizeDurationMS(elapsedMS)), |
There was a problem hiding this comment.
Consistency nit: on success deallocateVM is exposed as a first-class JSON field, but on failure it only appears inside this prose ResizeRequest message. Would it be worth carrying it as its own structured field/detail in the error payload too, so consumers don't have to parse text to recover the original request parameters?
There was a problem hiding this comment.
Valid concern on the inconsistency. The error payload uses the standard ARM CloudErrorBody which only has Code, Target, Message, and nested Details — extending it would affect the entire RP. The deallocateVM value is captured in the ResizeRequest detail message, so it's still accessible to callers parsing the error.
|
|
||
| go f.Run(ctx, nil, nil) | ||
|
|
||
| requestURL := fmt.Sprintf("https://server/admin%s/resizecontrolplane?vmSize=%s&deallocateVM=%s", tt.resourceID, tt.vmSize, tt.deallocateVM) |
There was a problem hiding this comment.
Would it be worth trying t.Parallel() for these handler subtests once newTestInfra is known to be isolated? This table-driven suite is getting fairly large, so running the cases concurrently could both speed it up and help surface any hidden shared-state assumptions in the test/setup path.
There was a problem hiding this comment.
Added in 29eab89. Verified with -race -count=3 — newTestInfra creates fully isolated infrastructure per subtest (own listener, gomock controller, database clients, HTTP client) with no shared mutable state.
787f2cf to
7553ea5
Compare
There was a problem hiding this comment.
Pull request overview
This PR updates the admin resizecontrolplane endpoint response contract to make successful responses more compact by default, while preserving detailed failure output and allowing callers to opt into full success traces via verbose=true.
Changes:
- Introduces a structured
ResizeControlPlaneResponse(status/summary/preflight/nodes) and compacts successful responses unlessverbose=true. - Adds preflight check capture (with per-check status/duration/message) and includes it in the resize response.
- Improves failure reporting by wrapping errors with phase/node/step context and emitting detailed
CloudErrortraces.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| pkg/frontend/admin_openshiftcluster_vmresize_pre_validation.go | Refactors preflight validation to optionally return structured check results (for embedding into resize responses). |
| pkg/frontend/admin_openshiftcluster_resize_controlplane.go | Reworks handler/orchestrator to build structured responses, add verbose support, and enrich failure CloudError details with phase/node/step context. |
| pkg/frontend/admin_openshiftcluster_resize_controlplane_test.go | Updates/extends tests for compact vs. verbose success payloads and detailed failure shapes. |
| pkg/api/admin/resizecontrolplane.go | Adds new admin API types for resize control plane response payloads (statuses/phases/checks/nodes/steps). |
Comments suppressed due to low confidence (3)
pkg/frontend/admin_openshiftcluster_resize_controlplane_test.go:1011
- The KubeList mock for kube-apiserver pods omits the required label selector. validateAPIServerPods lists pods with selector "app=openshift-kube-apiserver"; update this expectation to include the selector (or match variadic args) so the verbose-response test can pass.
k.EXPECT().
KubeList(gomock.Any(), "Pod", "openshift-kube-apiserver").
Return(healthyKubeAPIServerPodsJSON(), nil).
AnyTimes()
pkg/frontend/admin_openshiftcluster_resize_controlplane_test.go:1314
- This test’s KubeList expectation for kube-apiserver pods doesn’t include the label selector argument. The production code calls KubeList with "app=openshift-kube-apiserver", so gomock will treat this as an unexpected call; add the selector (or match variadic args) to the expectation.
k.EXPECT().KubeGet(gomock.Any(), "ClusterOperator.config.openshift.io", "", "kube-apiserver").
Return(healthyKubeAPIServerJSON(), nil).AnyTimes()
k.EXPECT().KubeList(gomock.Any(), "Pod", "openshift-kube-apiserver").
Return(healthyKubeAPIServerPodsJSON(), nil).AnyTimes()
pkg/frontend/admin_openshiftcluster_resize_controlplane_test.go:1461
- The KubeList mock for kube-apiserver pods is missing the label selector parameter. validateAPIServerPods uses selector "app=openshift-kube-apiserver"; include it (or match variadic args) so this preflight-failure details test matches actual calls.
k.EXPECT().KubeGet(gomock.Any(), "ClusterOperator.config.openshift.io", "", "kube-apiserver").
Return(healthyKubeAPIServerJSON(), nil).AnyTimes()
k.EXPECT().KubeList(gomock.Any(), "Pod", "openshift-kube-apiserver").
Return(healthyKubeAPIServerPodsJSON(), nil).AnyTimes()
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
6cf51b5 to
e3208f4
Compare
| k.EXPECT().CheckAPIServerReadyz(gomock.Any()).Return(nil).AnyTimes() | ||
| k.EXPECT().KubeGet(gomock.Any(), "ClusterOperator.config.openshift.io", "", "kube-apiserver"). | ||
| Return(healthyKubeAPIServerJSON(), nil).AnyTimes() | ||
| k.EXPECT().KubeList(gomock.Any(), "Pod", "openshift-kube-apiserver", "app=openshift-kube-apiserver"). |
| k.EXPECT().CheckAPIServerReadyz(gomock.Any()).Return(nil).AnyTimes() | ||
| k.EXPECT().KubeGet(gomock.Any(), "ClusterOperator.config.openshift.io", "", "kube-apiserver"). | ||
| Return(healthyKubeAPIServerJSON(), nil).AnyTimes() | ||
| k.EXPECT().KubeList(gomock.Any(), "Pod", "openshift-kube-apiserver", "app=openshift-kube-apiserver"). |
| def ensure_platform_workload_identities_for_upgrade(cmd, client, resource_group_name, oc, oc_update, upgradeable_to): | ||
| oc_update.identity = oc.identity | ||
| oc_update.platform_workload_identity_profile.platform_workload_identities = \ | ||
| oc.platform_workload_identity_profile.platform_workload_identities | ||
|
|
| type resizePreflightResult struct { | ||
| Checks []adminapi.ResizeControlPlaneCheck | ||
| DurationMS int64 | ||
| } |
| export LOCATION="${LOCATION:-westeurope}" | ||
| export NO_CACHE=false | ||
| export AZURE_EXTENSION_DEV_SOURCES="$(pwd)/python" | ||
| export AZURE_TOKEN_CREDENTIALS="dev" |
|
|
||
| # Holmes admin API configuration | ||
| # You can override this image with your own build if needed | ||
| export HOLMES_IMAGE=arointsvc.azurecr.io/holmes:latest |
| // HolmesImage contains the location of the HolmesGPT investigation container image | ||
| func HolmesImage(acrDomain string) string { | ||
| return acrDomain + "/holmesgpt:latest" | ||
| } |
| rotateIfAfter := registryProfile.IssueDate.Add(ACRTokenRotateAfter) | ||
| validityEnd := registryProfile.IssueDate.Add(ACRTokenMaxLifetime) | ||
|
|
||
| shouldRotate = now.After(rotateIfAfter) | ||
| isValid = validityEnd.After(now) | ||
| timeUntilNextRotate = rotateIfAfter.Sub(now) | ||
| if isValid { | ||
| timeUntilTokenExpiry = validityEnd.Sub(now) | ||
| } |
|
|
||
| // Only rotate the token if required | ||
| shouldRotate, _, durationUntilRotate, validityRemaining := acrtoken.ShouldRotateToken(m.env, registryProfile) | ||
| m.log.Infof("token has %s validity remaining, should rotate in %s", validityRemaining.String(), durationUntilRotate.String()) |
| rp := acrtoken.GetRegistryProfileFromSlice(env, registryProfiles) | ||
| if rp != nil { | ||
| now := time.Now().UTC() | ||
| issueDate := rp.IssueDate | ||
|
|
||
| if issueDate == nil { | ||
| return mimo.TerminalError(errors.New("no issue date detected, please rotate token")) | ||
| } | ||
|
|
||
| daysInterval := int32(now.Sub(*issueDate).Hours() / 24) | ||
|
|
||
| switch { | ||
| case daysInterval > daysValid: | ||
| return mimo.TerminalError(fmt.Errorf("azure container registry (acr) token is not valid, %d days have passed", daysInterval)) | ||
| case daysInterval >= daysShouldRotate: | ||
| return mimo.TerminalError(fmt.Errorf("%d days have passed since azure container registry (acr) token was issued, please rotate the token now", daysInterval)) | ||
| default: | ||
| th.SetResultMessage("azure container registry (acr) token is valid") | ||
| } | ||
|
|
||
| if rp == nil || rp.IssueDate == nil { | ||
| return mimo.TerminalError(errors.New("no issue date detected, please rotate token")) | ||
| } |
| - operator.openshift.io | ||
| resources: | ||
| - '*' | ||
| verbs: | ||
| - get | ||
| - list | ||
| - watch |
| r.Get("/selectors", f.getAdminOpenShiftClusterSelectors) | ||
|
|
||
| r.Post("/investigate", f.postAdminOpenShiftClusterInvestigate) | ||
| }) |
| Checks []adminapi.ResizeControlPlaneCheck | ||
| DurationMS int64 |
| for name := range checks { | ||
| if !known[name] { | ||
| result.Checks = append(result.Checks, checks[name]) | ||
| } | ||
| } |
| @@ -162,55 +364,117 @@ func resizeControlPlane(ctx context.Context, log *logrus.Entry, k adminactions.K | |||
| // resizeControlPlaneNode performs the full resize sequence for a single | |||
There was a problem hiding this comment.
i think this needs update, it got me confused
There was a problem hiding this comment.
Good catch — updated the doc comment to match the renamed function in e551b5e.
| return nil | ||
| } | ||
|
|
||
| // resizeControlPlane orchestrates the full control plane resize operation, |
There was a problem hiding this comment.
are we keeping resizeControlPlane by any reason? if I'm not mistaken, it's not directly called anymore by the _postAdminResizeControlPlane func
I suppose that removing it would mean modifying quite a number of tests, that they still rely on that func... this PR is already big, so I'd vote to have a follow-up task to clean it. But not in this PR, neither this is a blocker. But it would be good to update this comment.
There was a problem hiding this comment.
Removed the wrapper entirely in 3837579. The 11 unit tests now call resizeControlPlaneWithReport directly with a real report, and all 9 if report != nil guards are gone.
What this PR does / why we need it:
resizecontrolplaneresponses easier for SREs to consume by returning a compact success payload by defaultverbose=truesupport so callers can opt into the full success trace, including phases, checks, and node stepsdeallocateVMin the response bodyDetails
status,summary,preflight, and per-node resultsphases/stepsfrom default successful responses, but include them whenverbose=trueCloudErrordetail output on failure, including phase, validation check, node, and step informationverbosequery parameter using the sametrue/falsehandling asdeallocateVMExample output
Default success response:
{ "status": "Succeeded", "message": "Control plane resize completed successfully.", "resourceId": "/subscriptions/.../resourceGroups/.../providers/Microsoft.RedHatOpenShift/openShiftClusters/cluster", "vmSize": "Standard_D16s_v5", "deallocateVM": false, "durationMs": 28412, "summary": { "totalNodes": 3, "nodesResized": 1, "nodesSkipped": 2, "executionOrder": [ "master-2", "master-1", "master-0" ] }, "preflight": { "status": "Succeeded", "durationMs": 221 }, "nodes": [ { "name": "master-2", "sourceVmSize": "Standard_D8s_v3", "targetVmSize": "Standard_D16s_v5", "status": "Succeeded", "durationMs": 28064, "message": "Node resized successfully." }, { "name": "master-1", "sourceVmSize": "Standard_D16s_v5", "targetVmSize": "Standard_D16s_v5", "status": "Skipped", "durationMs": 0, "message": "Node already running target VM size; no resize required." }, { "name": "master-0", "sourceVmSize": "Standard_D16s_v5", "targetVmSize": "Standard_D16s_v5", "status": "Skipped", "durationMs": 0, "message": "Node already running target VM size; no resize required." } ] }Verbose success response (
verbose=true) additionally includesphasesand per-nodesteps.Failure responses always include the full
CloudErrordetail trace, regardless ofverbose.Test plan for issue:
verbose=truesuccess output, and query parameter validationIs there any documentation that needs to be updated for this PR?
How do you know this will function as expected in production?
CloudErrordetail for troubleshooting, whileverbose=truepreserves the full success trace when needed.