Skip to content

Bizz001/resize controlplane response messages#4786

Open
bizz001 wants to merge 15 commits into
masterfrom
bizz001/resize-controlplane-response-messages
Open

Bizz001/resize controlplane response messages#4786
bizz001 wants to merge 15 commits into
masterfrom
bizz001/resize-controlplane-response-messages

Conversation

@bizz001
Copy link
Copy Markdown
Collaborator

@bizz001 bizz001 commented Apr 22, 2026

What this PR does / why we need it:

  • make resizecontrolplane responses easier for SREs to consume by returning a compact success payload by default
  • add verbose=true support so callers can opt into the full success trace, including phases, checks, and node steps
  • keep failure responses fully detailed to preserve troubleshooting context, and always serialize deallocateVM in the response body

Details

  • add top-level success fields such as status, summary, preflight, and per-node results
  • hide verbose phases / steps from default successful responses, but include them when verbose=true
  • preserve full CloudError detail output on failure, including phase, validation check, node, and step information
  • validate the new verbose query parameter using the same true / false handling as deallocateVM
  • update handler tests for compact success responses, verbose success responses, and request parameter validation

Example output

Default success response:

{
  "status": "Succeeded",
  "message": "Control plane resize completed successfully.",
  "resourceId": "/subscriptions/.../resourceGroups/.../providers/Microsoft.RedHatOpenShift/openShiftClusters/cluster",
  "vmSize": "Standard_D16s_v5",
  "deallocateVM": false,
  "durationMs": 28412,
  "summary": {
    "totalNodes": 3,
    "nodesResized": 1,
    "nodesSkipped": 2,
    "executionOrder": [
      "master-2",
      "master-1",
      "master-0"
    ]
  },
  "preflight": {
    "status": "Succeeded",
    "durationMs": 221
  },
  "nodes": [
    {
      "name": "master-2",
      "sourceVmSize": "Standard_D8s_v3",
      "targetVmSize": "Standard_D16s_v5",
      "status": "Succeeded",
      "durationMs": 28064,
      "message": "Node resized successfully."
    },
    {
      "name": "master-1",
      "sourceVmSize": "Standard_D16s_v5",
      "targetVmSize": "Standard_D16s_v5",
      "status": "Skipped",
      "durationMs": 0,
      "message": "Node already running target VM size; no resize required."
    },
    {
      "name": "master-0",
      "sourceVmSize": "Standard_D16s_v5",
      "targetVmSize": "Standard_D16s_v5",
      "status": "Skipped",
      "durationMs": 0,
      "message": "Node already running target VM size; no resize required."
    }
  ]
}

Verbose success response (verbose=true) additionally includes phases and per-node steps.

Failure responses always include the full CloudError detail trace, regardless of verbose.

Test plan for issue:

  • Added unit tests for the compact success shape, verbose=true success output, and query parameter validation
  • Tests are passing

Is there any documentation that needs to be updated for this PR?

  • No separate documentation update planned for this admin-only endpoint change.

How do you know this will function as expected in production?

  • The RP now owns the response contract for success and failure output, keeping downstream Geneva-side formatting minimal.
  • Failure responses retain full CloudError detail for troubleshooting, while verbose=true preserves the full success trace when needed.

@openshift-ci
Copy link
Copy Markdown

openshift-ci Bot commented Apr 22, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

return nil, err
}

return json.Marshal("All pre-flight checks passed")
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to return the structured resizePreflightResult here as well instead of collapsing it back to a JSON string? We now compute per-check status and duration, and exposing that from the standalone prevalidation endpoint would make it more useful for SREs without requiring them to start the resize flow just to inspect the checks.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in adca76b. The standalone prevalidation endpoint now returns the structured resizePreflightResult with per-check name, status, duration, and message.

}

func buildResizePreflightResult(checks map[string]adminapi.ResizeControlPlaneCheck, duration time.Duration) *resizePreflightResult {
order := []string{
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small maintainability nit: this fixed order list means a newly added pre-flight check can be silently dropped from the serialized result unless someone remembers to update this function too. Would it be safer to build the slice in-order as checks are recorded, or at least append unknown checks after the known ones so future validations don't disappear?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 8237efd. Extracted the order to a package-level var resizePreflightCheckOrder and added a fallback loop that appends any checks present in the map but missing from the order, so new checks are never silently dropped.

FailedNode string `json:"failedNode,omitempty"`
FailedStep string `json:"failedStep,omitempty"`
NextAction string `json:"nextAction,omitempty"`
Nodes []ResizeControlPlaneNodeOperation `json:"nodes,omitempty"`
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small naming nit: Nodes []ResizeControlPlaneNodeOperation reads a bit like desired actions, but these entries are really execution results. Would something like NodeResults []ResizeControlPlaneNodeResult make the contract a little easier to scan for someone reading the payload without implementation context?

Copy link
Copy Markdown
Collaborator Author

@bizz001 bizz001 May 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to keep Nodes — the struct captures the full operation lifecycle (steps, failedStep, nextAction), not just a terminal result, so NodeResults would be slightly misleading. The parent type ResizeControlPlaneResponse already provides context.

details = append(details, api.CloudErrorBody{
Code: "ResizeRequest",
Target: report.ResourceID,
Message: fmt.Sprintf("Requested VM size %s with deallocateVM=%t. Elapsed time before failure: %s.", report.VMSize, report.DeallocateVM, formatResizeDurationMS(elapsedMS)),
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consistency nit: on success deallocateVM is exposed as a first-class JSON field, but on failure it only appears inside this prose ResizeRequest message. Would it be worth carrying it as its own structured field/detail in the error payload too, so consumers don't have to parse text to recover the original request parameters?

Copy link
Copy Markdown
Collaborator Author

@bizz001 bizz001 May 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Valid concern on the inconsistency. The error payload uses the standard ARM CloudErrorBody which only has Code, Target, Message, and nested Details — extending it would affect the entire RP. The deallocateVM value is captured in the ResizeRequest detail message, so it's still accessible to callers parsing the error.


go f.Run(ctx, nil, nil)

requestURL := fmt.Sprintf("https://server/admin%s/resizecontrolplane?vmSize=%s&deallocateVM=%s", tt.resourceID, tt.vmSize, tt.deallocateVM)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be worth trying t.Parallel() for these handler subtests once newTestInfra is known to be isolated? This table-driven suite is getting fairly large, so running the cases concurrently could both speed it up and help surface any hidden shared-state assumptions in the test/setup path.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added in 29eab89. Verified with -race -count=3newTestInfra creates fully isolated infrastructure per subtest (own listener, gomock controller, database clients, HTTP client) with no shared mutable state.

Copilot AI review requested due to automatic review settings May 13, 2026 11:04
@bizz001 bizz001 force-pushed the bizz001/resize-controlplane-response-messages branch from 787f2cf to 7553ea5 Compare May 13, 2026 11:04
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the admin resizecontrolplane endpoint response contract to make successful responses more compact by default, while preserving detailed failure output and allowing callers to opt into full success traces via verbose=true.

Changes:

  • Introduces a structured ResizeControlPlaneResponse (status/summary/preflight/nodes) and compacts successful responses unless verbose=true.
  • Adds preflight check capture (with per-check status/duration/message) and includes it in the resize response.
  • Improves failure reporting by wrapping errors with phase/node/step context and emitting detailed CloudError traces.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File Description
pkg/frontend/admin_openshiftcluster_vmresize_pre_validation.go Refactors preflight validation to optionally return structured check results (for embedding into resize responses).
pkg/frontend/admin_openshiftcluster_resize_controlplane.go Reworks handler/orchestrator to build structured responses, add verbose support, and enrich failure CloudError details with phase/node/step context.
pkg/frontend/admin_openshiftcluster_resize_controlplane_test.go Updates/extends tests for compact vs. verbose success payloads and detailed failure shapes.
pkg/api/admin/resizecontrolplane.go Adds new admin API types for resize control plane response payloads (statuses/phases/checks/nodes/steps).
Comments suppressed due to low confidence (3)

pkg/frontend/admin_openshiftcluster_resize_controlplane_test.go:1011

  • The KubeList mock for kube-apiserver pods omits the required label selector. validateAPIServerPods lists pods with selector "app=openshift-kube-apiserver"; update this expectation to include the selector (or match variadic args) so the verbose-response test can pass.
				k.EXPECT().
					KubeList(gomock.Any(), "Pod", "openshift-kube-apiserver").
					Return(healthyKubeAPIServerPodsJSON(), nil).
					AnyTimes()

pkg/frontend/admin_openshiftcluster_resize_controlplane_test.go:1314

  • This test’s KubeList expectation for kube-apiserver pods doesn’t include the label selector argument. The production code calls KubeList with "app=openshift-kube-apiserver", so gomock will treat this as an unexpected call; add the selector (or match variadic args) to the expectation.
	k.EXPECT().KubeGet(gomock.Any(), "ClusterOperator.config.openshift.io", "", "kube-apiserver").
		Return(healthyKubeAPIServerJSON(), nil).AnyTimes()
	k.EXPECT().KubeList(gomock.Any(), "Pod", "openshift-kube-apiserver").
		Return(healthyKubeAPIServerPodsJSON(), nil).AnyTimes()

pkg/frontend/admin_openshiftcluster_resize_controlplane_test.go:1461

  • The KubeList mock for kube-apiserver pods is missing the label selector parameter. validateAPIServerPods uses selector "app=openshift-kube-apiserver"; include it (or match variadic args) so this preflight-failure details test matches actual calls.
	k.EXPECT().KubeGet(gomock.Any(), "ClusterOperator.config.openshift.io", "", "kube-apiserver").
		Return(healthyKubeAPIServerJSON(), nil).AnyTimes()
	k.EXPECT().KubeList(gomock.Any(), "Pod", "openshift-kube-apiserver").
		Return(healthyKubeAPIServerPodsJSON(), nil).AnyTimes()

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread pkg/frontend/admin_openshiftcluster_resize_controlplane_test.go Outdated
@bizz001 bizz001 force-pushed the bizz001/resize-controlplane-response-messages branch 2 times, most recently from 6cf51b5 to e3208f4 Compare May 13, 2026 13:00
Copilot AI review requested due to automatic review settings May 27, 2026 11:59
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

k.EXPECT().CheckAPIServerReadyz(gomock.Any()).Return(nil).AnyTimes()
k.EXPECT().KubeGet(gomock.Any(), "ClusterOperator.config.openshift.io", "", "kube-apiserver").
Return(healthyKubeAPIServerJSON(), nil).AnyTimes()
k.EXPECT().KubeList(gomock.Any(), "Pod", "openshift-kube-apiserver", "app=openshift-kube-apiserver").
k.EXPECT().CheckAPIServerReadyz(gomock.Any()).Return(nil).AnyTimes()
k.EXPECT().KubeGet(gomock.Any(), "ClusterOperator.config.openshift.io", "", "kube-apiserver").
Return(healthyKubeAPIServerJSON(), nil).AnyTimes()
k.EXPECT().KubeList(gomock.Any(), "Pod", "openshift-kube-apiserver", "app=openshift-kube-apiserver").
Copilot AI review requested due to automatic review settings May 27, 2026 13:47
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 60 out of 65 changed files in this pull request and generated 10 comments.

Files not reviewed (1)
  • portal/v2/package-lock.json: Language not supported

Comment on lines +1104 to +1108
def ensure_platform_workload_identities_for_upgrade(cmd, client, resource_group_name, oc, oc_update, upgradeable_to):
oc_update.identity = oc.identity
oc_update.platform_workload_identity_profile.platform_workload_identities = \
oc.platform_workload_identity_profile.platform_workload_identities

Comment on lines +116 to +119
type resizePreflightResult struct {
Checks []adminapi.ResizeControlPlaneCheck
DurationMS int64
}
Comment thread env.example
export LOCATION="${LOCATION:-westeurope}"
export NO_CACHE=false
export AZURE_EXTENSION_DEV_SOURCES="$(pwd)/python"
export AZURE_TOKEN_CREDENTIALS="dev"
Comment thread env.example

# Holmes admin API configuration
# You can override this image with your own build if needed
export HOLMES_IMAGE=arointsvc.azurecr.io/holmes:latest
Comment thread pkg/util/version/const.go
Comment on lines +86 to +89
// HolmesImage contains the location of the HolmesGPT investigation container image
func HolmesImage(acrDomain string) string {
return acrDomain + "/holmesgpt:latest"
}
Comment on lines +226 to +234
rotateIfAfter := registryProfile.IssueDate.Add(ACRTokenRotateAfter)
validityEnd := registryProfile.IssueDate.Add(ACRTokenMaxLifetime)

shouldRotate = now.After(rotateIfAfter)
isValid = validityEnd.After(now)
timeUntilNextRotate = rotateIfAfter.Sub(now)
if isValid {
timeUntilTokenExpiry = validityEnd.Sub(now)
}
Comment thread pkg/cluster/acrtoken.go

// Only rotate the token if required
shouldRotate, _, durationUntilRotate, validityRemaining := acrtoken.ShouldRotateToken(m.env, registryProfile)
m.log.Infof("token has %s validity remaining, should rotate in %s", validityRemaining.String(), durationUntilRotate.String())
Comment on lines 25 to 29
rp := acrtoken.GetRegistryProfileFromSlice(env, registryProfiles)
if rp != nil {
now := time.Now().UTC()
issueDate := rp.IssueDate

if issueDate == nil {
return mimo.TerminalError(errors.New("no issue date detected, please rotate token"))
}

daysInterval := int32(now.Sub(*issueDate).Hours() / 24)

switch {
case daysInterval > daysValid:
return mimo.TerminalError(fmt.Errorf("azure container registry (acr) token is not valid, %d days have passed", daysInterval))
case daysInterval >= daysShouldRotate:
return mimo.TerminalError(fmt.Errorf("%d days have passed since azure container registry (acr) token was issued, please rotate the token now", daysInterval))
default:
th.SetResultMessage("azure container registry (acr) token is valid")
}

if rp == nil || rp.IssueDate == nil {
return mimo.TerminalError(errors.New("no issue date detected, please rotate token"))
}
Comment on lines +150 to +156
- operator.openshift.io
resources:
- '*'
verbs:
- get
- list
- watch
Comment thread pkg/frontend/frontend.go
Comment on lines 431 to 434
r.Get("/selectors", f.getAdminOpenShiftClusterSelectors)

r.Post("/investigate", f.postAdminOpenShiftClusterInvestigate)
})
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Comment on lines +117 to +118
Checks []adminapi.ResizeControlPlaneCheck
DurationMS int64
Comment on lines +267 to +271
for name := range checks {
if !known[name] {
result.Checks = append(result.Checks, checks[name])
}
}
Comment thread pkg/api/admin/resizecontrolplane.go Outdated
Copilot AI review requested due to automatic review settings May 27, 2026 15:18
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated no new comments.

@@ -162,55 +364,117 @@ func resizeControlPlane(ctx context.Context, log *logrus.Entry, k adminactions.K
// resizeControlPlaneNode performs the full resize sequence for a single
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think this needs update, it got me confused

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — updated the doc comment to match the renamed function in e551b5e.

return nil
}

// resizeControlPlane orchestrates the full control plane resize operation,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we keeping resizeControlPlane by any reason? if I'm not mistaken, it's not directly called anymore by the _postAdminResizeControlPlane func

I suppose that removing it would mean modifying quite a number of tests, that they still rely on that func... this PR is already big, so I'd vote to have a follow-up task to clean it. But not in this PR, neither this is a blocker. But it would be good to update this comment.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the wrapper entirely in 3837579. The 11 unit tests now call resizeControlPlaneWithReport directly with a real report, and all 9 if report != nil guards are gone.

Copilot AI review requested due to automatic review settings May 28, 2026 14:53
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated no new comments.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants