WIP: feat(must-gather): Refactor to golang to remove oc dependency. LOG-9008:#3315
WIP: feat(must-gather): Refactor to golang to remove oc dependency. LOG-9008:#3315jcantrill wants to merge 22 commits into
Conversation
|
@jcantrill: This pull request references LOG-9008 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.8.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/hold |
📝 WalkthroughWalkthroughReplaces shell-based must-gather collection with a Go binary, new shared API types, a Kubernetes client wrapper, multiple collectors, a concurrent orchestrator, and a CLI entrypoint. Updates the Makefile and Dockerfiles to build and package the binary. ChangesMust-gather binary migration
Estimated code review effort🎯 4 (Complex) | ⏱️ ~75 minutes Poem
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: jcantrill The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Actionable comments posted: 11
🧹 Nitpick comments (1)
must-gather/cmd/main.go (1)
51-55: ⚖️ Poor tradeoffConsider adding signal handling for graceful cancellation.
Using
context.Background()means the gather operation cannot be cancelled via SIGINT/SIGTERM. For a potentially long-running collection process, signal-based cancellation would improve the user experience.This is optional for initial implementation but worth considering.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@must-gather/cmd/main.go` around lines 51 - 55, The gather.Run() call in main uses context.Background() which cannot be cancelled, preventing graceful shutdown on SIGINT/SIGTERM signals. Replace the context.Background() call with a context created via signal.NotifyContext() that listens for os.Interrupt and syscall.SIGTERM signals. This will allow the gather operation to detect cancellation requests and shut down gracefully when the user sends those signals.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@must-gather/cmd/main.go`:
- Around line 18-23: The `logFileName` variable is set from the `--log-file`
flag but the value is not being passed to the `NewGather` function, which
hardcodes the LogFileName field to "gather-debug.log". Pass the `logFileName`
variable as a parameter to the `NewGather` function call (located after the flag
parsing in main.go), and update the `NewGather` function signature in
`must-gather/gather.go` to accept and use this parameter value instead of
hardcoding it to "gather-debug.log".
- Line 39: The logFile.Close() call in the defer statement is not checking for
potential errors. Since Close() can fail when flushing buffered data to a file,
you need to handle the error it returns. Modify the defer statement to capture
and check the error returned by logFile.Close(), and handle it appropriately by
logging the error or returning it to the caller so failures are not silently
ignored.
In `@must-gather/gather.go`:
- Around line 82-85: The Run method currently returns nil unconditionally after
calling g.logResults(results), preventing callers from detecting when collectors
have failed. Instead of returning nil, you need to check the results for any
collector failures after logging, aggregate those errors into a single error
value, and return that aggregated error if any failures occurred, or nil if all
collectors succeeded. This allows callers to detect when the gather operation
was unsuccessful.
In `@must-gather/internal/api/logger.go`:
- Around line 20-24: The fmt.Fprintf call in the Log method of DefaultLogger is
not checking its error return value, which causes write failures to be silently
ignored and fails errcheck linting. Capture the error return value from the
fmt.Fprintf call and handle it explicitly by either returning the error from the
Log method signature, panicking on write failure, or logging the error to an
appropriate fallback output.
In `@must-gather/internal/client/client.go`:
- Around line 167-180: The ListResources function currently logs write errors
from WriteResourceToFile but continues processing and returns nil regardless of
whether any writes failed. Track whether any write errors occurred during the
loop by introducing an error variable that gets set when WriteResourceToFile
fails. Keep the existing warning log for debugging purposes, but after the loop
completes over listObj.Items, check if any errors were encountered and return an
error value instead of always returning nil to properly signal failure to the
caller.
In `@must-gather/internal/logstore/lokistack/collector.go`:
- Around line 47-50: The error handling in the LokiStack collector is
incorrectly treating all errors from DynamicClient.Resource(lokiGVR).List() as a
missing CRD and silently returning nil. Instead of catching all errors with a
generic message, examine the specific error type to determine if it truly
represents a missing CRD (like a not found or NoMatchError) versus other
failures like RBAC denials or transient API issues. For CRD-not-found errors,
continue with the current skip behavior, but for other errors, log them
appropriately and return the error to avoid silently dropping LokiStack
artifacts.
In `@must-gather/internal/metrics/collector.go`:
- Line 82: The WriteFile method call on line 82 returns an error that is
currently being ignored when writing to error.log. Capture the error return
value from the WriteFile call (in the line where
resultPath.Add("error.log").WriteFile is invoked) and handle it appropriately by
checking if the error is non-nil and taking appropriate action such as logging
or returning the error, rather than silently dropping the write failure.
In `@must-gather/internal/namespace/collector.go`:
- Around line 111-154: The collector creates unbounded goroutines in nested
loops: one for each namespace (starting at the outer loop around line 111) and
another for each resource within each namespace (in the inner loop around line
123). On large clusters, this causes runaway goroutine creation leading to API
saturation. Implement concurrency limiting using a semaphore pattern (such as a
buffered channel) to bound the number of concurrent goroutines for both the
namespace iteration loop and the resource iteration loop within it. Similarly,
the collectPodLogs method (called around line 151) also needs to be reviewed and
bounded if it creates additional unbounded goroutines. Apply the same semaphore
limiting pattern throughout to cap concurrent operations to a reasonable limit.
- Line 233: The defer statement for logs.Close() is currently ignoring the error
return value, which causes the errcheck static analysis tool to fail. Modify the
defer statement to capture and handle the error returned by logs.Close(), either
by logging the error or returning it from the function. This will satisfy the
errcheck linter requirement and ensure proper error handling during resource
cleanup.
- Around line 235-250: Replace the memory-buffering approach using io.ReadAll
with direct streaming to avoid memory pressure on large pod logs. Instead of
reading all log data into memory first and then writing it, create the
destination directory and file first, then stream the log data directly from the
logs reader to the file using io.Copy or similar streaming mechanism. This
eliminates the intermediate buffering step in the section where logData is
captured and later passed to logFile.WriteFile.
In `@must-gather/internal/ui/uiplugin_collector.go`:
- Around line 48-52: The error handling in the UIPlugin list operation treats
all errors as if the CRD is not available, which masks permission errors and
transient API failures. Modify the error handling block around the uiPluginList
list operation to check if the error specifically indicates a resource not found
condition (CRD absence) versus other types of failures. If the error is a
resource-not-found error, log it as CRD not available and return nil. For other
errors (permission denied, transient failures, etc.), either return the error or
log it at a warning/error level without skipping collection. Use appropriate
Kubernetes error checking utilities or type assertions to differentiate between
a missing CRD and other API/permission failures.
---
Nitpick comments:
In `@must-gather/cmd/main.go`:
- Around line 51-55: The gather.Run() call in main uses context.Background()
which cannot be cancelled, preventing graceful shutdown on SIGINT/SIGTERM
signals. Replace the context.Background() call with a context created via
signal.NotifyContext() that listens for os.Interrupt and syscall.SIGTERM
signals. This will allow the gather operation to detect cancellation requests
and shut down gracefully when the user sends those signals.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: e3b574ac-ae00-4437-8647-e8451f03e21e
📒 Files selected for processing (24)
DockerfileDockerfile.artMakefilemust-gather/cmd/main.gomust-gather/collection-scripts/commonmust-gather/collection-scripts/gathermust-gather/collection-scripts/gather_cluster_logging_operator_resourcesmust-gather/collection-scripts/gather_collection_resourcesmust-gather/collection-scripts/gather_logstore_resourcesmust-gather/collection-scripts/gather_monitoringmust-gather/collection-scripts/monitoring_common.shmust-gather/gather.gomust-gather/internal/api/logger.gomust-gather/internal/api/path.gomust-gather/internal/api/path_test.gomust-gather/internal/api/suite_test.gomust-gather/internal/api/types.gomust-gather/internal/client/client.gomust-gather/internal/cluster/collector.gomust-gather/internal/collection/collection_collector.gomust-gather/internal/logstore/lokistack/collector.gomust-gather/internal/metrics/collector.gomust-gather/internal/namespace/collector.gomust-gather/internal/ui/uiplugin_collector.go
💤 Files with no reviewable changes (7)
- must-gather/collection-scripts/gather_cluster_logging_operator_resources
- must-gather/collection-scripts/gather_collection_resources
- must-gather/collection-scripts/gather_monitoring
- must-gather/collection-scripts/monitoring_common.sh
- must-gather/collection-scripts/gather
- must-gather/collection-scripts/common
- must-gather/collection-scripts/gather_logstore_resources
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test/e2e/logforwarding/lokistack/forward_to_lokistack_test.go`:
- Line 136: The unconditional Fail(">>>>> REMOVE ME <<< TESTING MUST-GATHER")
call in the It block prevents the test from ever passing, even when the
forwarding functionality works correctly. Remove this Fail statement entirely
from the test flow, or if temporary debug output is needed for development
purposes, move it outside of the mergeable test code path behind explicit
debug-only conditional logic that would not execute in normal test runs.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 3ccd9a29-6c81-43f9-a46b-c462875645ee
📒 Files selected for processing (9)
must-gather/cmd/main.gomust-gather/gather.gomust-gather/internal/api/logger.gomust-gather/internal/client/client.gomust-gather/internal/logstore/lokistack/collector.gomust-gather/internal/metrics/collector.gomust-gather/internal/namespace/collector.gomust-gather/internal/ui/uiplugin_collector.gotest/e2e/logforwarding/lokistack/forward_to_lokistack_test.go
🚧 Files skipped from review as they are similar to previous changes (7)
- must-gather/cmd/main.go
- must-gather/internal/api/logger.go
- must-gather/internal/logstore/lokistack/collector.go
- must-gather/internal/ui/uiplugin_collector.go
- must-gather/internal/namespace/collector.go
- must-gather/internal/metrics/collector.go
- must-gather/internal/client/client.go
| found, err = lokistackReceiver.HasInfrastructureLogs(serviceAccount.Name, framework.DefaultWaitForLogsTimeout) | ||
| Expect(err).To(BeNil()) | ||
| Expect(found).To(BeTrue()) | ||
| Fail(">>>>> REMOVE ME <<< TESTING MUST-GATHER") |
There was a problem hiding this comment.
Remove unconditional failure in test path
Line 136 unconditionally fails the It block, so this scenario can never pass even when forwarding works. This should be removed (or guarded behind explicit debug-only logic outside mergeable test flow).
Suggested fix
- Fail(">>>>> REMOVE ME <<< TESTING MUST-GATHER")As per coding guidelines, "Review Ginkgo test code for quality..." requires production-quality test behavior, and this hard-stop violates that expectation.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| Fail(">>>>> REMOVE ME <<< TESTING MUST-GATHER") |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@test/e2e/logforwarding/lokistack/forward_to_lokistack_test.go` at line 136,
The unconditional Fail(">>>>> REMOVE ME <<< TESTING MUST-GATHER") call in the It
block prevents the test from ever passing, even when the forwarding
functionality works correctly. Remove this Fail statement entirely from the test
flow, or if temporary debug output is needed for development purposes, move it
outside of the mergeable test code path behind explicit debug-only conditional
logic that would not execute in normal test runs.
Source: Coding guidelines
|
/test images |
|
/test e2e-target |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@must-gather/internal/client/client.go`:
- Around line 190-193: Fail closed in WriteResourceToFile by propagating Secret
redaction failures instead of silently continuing. Update sanitizeResource to
return (runtime.Object, error), handle and return any errors from
unstructured.NestedMap and unstructured.SetNestedMap, and make
WriteResourceToFile stop and bubble up the error before marshaling when
redaction cannot be completed. Use the existing Client methods sanitizeResource
and WriteResourceToFile to locate and update the flow.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 65f1a653-1489-4578-a7ef-7017c1435f62
📒 Files selected for processing (1)
must-gather/internal/client/client.go
| func (c *Client) WriteResourceToFile(resource runtime.Object, destPath api.Path) error { | ||
| // Sanitize secrets before writing | ||
| sanitizedResource := c.sanitizeResource(resource) | ||
|
|
There was a problem hiding this comment.
🔒 Security & Privacy | 🟠 Major
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Description: Verify Secret sanitization does not discard NestedMap/SetNestedMap errors.
# Expected after the fix:
# - No `NestedMap(..., _)` discarded-error pattern in sanitizeResource.
# - Every `SetNestedMap` call is guarded by error handling.
sed -n '190,241p' must-gather/internal/client/client.go
rg -n 'NestedMap\(.*_, found, _|SetNestedMap\(' must-gather/internal/client/client.goRepository: openshift/cluster-logging-operator
Length of output: 1987
Fail closed when Secret redaction cannot be applied.
Lines 213 and 225 discard unstructured.NestedMap accessor errors, and lines 221 and 233 ignore unstructured.SetNestedMap return values. If either fails, WriteResourceToFile still marshals the deep copy with original Secret data intact. Change sanitizeResource to return (runtime.Object, error) and propagate errors up the call stack.
🛡️ Proposed fail-closed fix
func (c *Client) WriteResourceToFile(resource runtime.Object, destPath api.Path) error {
// Sanitize secrets before writing
- sanitizedResource := c.sanitizeResource(resource)
+ sanitizedResource, err := c.sanitizeResource(resource)
+ if err != nil {
+ return fmt.Errorf("failed to sanitize resource: %w", err)
+ }
// Marshal to YAML
yamlBytes, err := yaml.Marshal(sanitizedResource)
if err != nil {
return fmt.Errorf("failed to marshal resource to YAML: %w", err)
}
// Write to file (WriteFile creates parent directory automatically)
return destPath.WriteFile(yamlBytes)
}
// sanitizeResource redacts secret data values
-func (c *Client) sanitizeResource(resource runtime.Object) runtime.Object {
+func (c *Client) sanitizeResource(resource runtime.Object) (runtime.Object, error) {
// Check if this is a Secret
if unstructuredObj, ok := resource.(*unstructured.Unstructured); ok {
if unstructuredObj.GetKind() == "Secret" {
// Make a deep copy to avoid modifying the original
sanitized := unstructuredObj.DeepCopy()
// Redact all data values (base64 encoded)
- if data, found, _ := unstructured.NestedMap(sanitized.Object, "data"); found {
+ if data, found, err := unstructured.NestedMap(sanitized.Object, "data"); err != nil {
+ return nil, fmt.Errorf("failed to read Secret data: %w", err)
+ } else if found {
for key, val := range data {
length := 0
if strVal, ok := val.(string); ok {
length = len(strVal)
}
data[key] = fmt.Sprintf("REDACTED length=%d", length)
}
- unstructured.SetNestedMap(sanitized.Object, data, "data")
+ if err := unstructured.SetNestedMap(sanitized.Object, data, "data"); err != nil {
+ return nil, fmt.Errorf("failed to redact Secret data: %w", err)
+ }
}
// Redact all stringData values (plain text)
- if stringData, found, _ := unstructured.NestedMap(sanitized.Object, "stringData"); found {
+ if stringData, found, err := unstructured.NestedMap(sanitized.Object, "stringData"); err != nil {
+ return nil, fmt.Errorf("failed to read Secret stringData: %w", err)
+ } else if found {
for key, val := range stringData {
length := 0
if strVal, ok := val.(string); ok {
length = len(strVal)
}
stringData[key] = fmt.Sprintf("REDACTED length=%d", length)
}
- unstructured.SetNestedMap(sanitized.Object, stringData, "stringData")
+ if err := unstructured.SetNestedMap(sanitized.Object, stringData, "stringData"); err != nil {
+ return nil, fmt.Errorf("failed to redact Secret stringData: %w", err)
+ }
}
- return sanitized
+ return sanitized, nil
}
}
- return resource
+ return resource, nil
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| func (c *Client) WriteResourceToFile(resource runtime.Object, destPath api.Path) error { | |
| // Sanitize secrets before writing | |
| sanitizedResource := c.sanitizeResource(resource) | |
| func (c *Client) WriteResourceToFile(resource runtime.Object, destPath api.Path) error { | |
| // Sanitize secrets before writing | |
| sanitizedResource, err := c.sanitizeResource(resource) | |
| if err != nil { | |
| return fmt.Errorf("failed to sanitize resource: %w", err) | |
| } | |
| // Marshal to YAML | |
| yamlBytes, err := yaml.Marshal(sanitizedResource) | |
| if err != nil { | |
| return fmt.Errorf("failed to marshal resource to YAML: %w", err) | |
| } | |
| // Write to file (WriteFile creates parent directory automatically) | |
| return destPath.WriteFile(yamlBytes) | |
| } | |
| // sanitizeResource redacts secret data values | |
| func (c *Client) sanitizeResource(resource runtime.Object) (runtime.Object, error) { | |
| // Check if this is a Secret | |
| if unstructuredObj, ok := resource.(*unstructured.Unstructured); ok { | |
| if unstructuredObj.GetKind() == "Secret" { | |
| // Make a deep copy to avoid modifying the original | |
| sanitized := unstructuredObj.DeepCopy() | |
| // Redact all data values (base64 encoded) | |
| if data, found, err := unstructured.NestedMap(sanitized.Object, "data"); err != nil { | |
| return nil, fmt.Errorf("failed to read Secret data: %w", err) | |
| } else if found { | |
| for key, val := range data { | |
| length := 0 | |
| if strVal, ok := val.(string); ok { | |
| length = len(strVal) | |
| } | |
| data[key] = fmt.Sprintf("REDACTED length=%d", length) | |
| } | |
| if err := unstructured.SetNestedMap(sanitized.Object, data, "data"); err != nil { | |
| return nil, fmt.Errorf("failed to redact Secret data: %w", err) | |
| } | |
| } | |
| // Redact all stringData values (plain text) | |
| if stringData, found, err := unstructured.NestedMap(sanitized.Object, "stringData"); err != nil { | |
| return nil, fmt.Errorf("failed to read Secret stringData: %w", err) | |
| } else if found { | |
| for key, val := range stringData { | |
| length := 0 | |
| if strVal, ok := val.(string); ok { | |
| length = len(strVal) | |
| } | |
| stringData[key] = fmt.Sprintf("REDACTED length=%d", length) | |
| } | |
| if err := unstructured.SetNestedMap(sanitized.Object, stringData, "stringData"); err != nil { | |
| return nil, fmt.Errorf("failed to redact Secret stringData: %w", err) | |
| } | |
| } | |
| return sanitized, nil | |
| } | |
| } | |
| return resource, nil | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@must-gather/internal/client/client.go` around lines 190 - 193, Fail closed in
WriteResourceToFile by propagating Secret redaction failures instead of silently
continuing. Update sanitizeResource to return (runtime.Object, error), handle
and return any errors from unstructured.NestedMap and unstructured.SetNestedMap,
and make WriteResourceToFile stop and bubble up the error before marshaling when
redaction cannot be completed. Use the existing Client methods sanitizeResource
and WriteResourceToFile to locate and update the flow.
Source: Linters/SAST tools
|
@jcantrill: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Description
This PR:
Links
LOG-9007
cc @vparfonov @JoaoBraveCoding @xperimental
Summary by CodeRabbit
gatheris provided as a symlink to the executable.