Skip to content

Improve SupportBundleService performance, stability, and test coverage#25555

Open
todvora wants to merge 27 commits intomasterfrom
feature/support-bundle-tweaks
Open

Improve SupportBundleService performance, stability, and test coverage#25555
todvora wants to merge 27 commits intomasterfrom
feature/support-bundle-tweaks

Conversation

@todvora
Copy link
Copy Markdown
Contributor

@todvora todvora commented Apr 3, 2026

The support bundle service had several performance and correctness issues that became significant at scale: sequential API fan-outs, thread pool starvation risk, memory pressure from buffering large log files, and no test coverage beyond a few unit cases.

TODO: #25521 needs to be merged first, without it, the datanode.log content test fails.

Parallelism

Per-node data collection (thread dump, metrics, system stats, certificates) now runs as independent concurrent tasks instead of sequentially per node. Cluster-wide fan-out calls (requestOnAllNodes) are moved to a short-lived orchestration executor to avoid thread-pool starvation — previously wrapping them in supplyAsync on the shared executor could deadlock when the pool was exhausted by blocking fan-out wrappers waiting on sub-tasks submitted to the same pool.

Efficiency & correctness

Datanode log transfer switches from ResponseBody.bytes() (full heap buffer) to byteStream().transferTo() to avoid large allocations. Files.walk() in listBundles() is replaced with Files.list() since the directory is flat. closeEntry() in writeZipFile() is now guarded so it only runs if putNextEntry() succeeded, preventing a misleading second exception when zip entry creation fails. SimpleDateFormat (not thread-safe, per-call) is replaced with a static DateTimeFormatter. Several internal helpers are simplified or inlined, and public constants that have no external users are made private. Call timeout increased from 10 s to 30 s.

Error handling

When a support bundle is built, individual data collection tasks (per-node logs, metrics, thread dumps, certificates, cluster info, datanode info) now run independently and tolerate failures: instead of a single failed future aborting the entire bundle with an HTTP 500, each failure is caught, logged, and recorded as a BundleError entry (component path, error message, and full stacktrace). Once all tasks complete, the collected errors are written to errors.json at the root of the zip archive, giving support engineers a single file to inspect when diagnosing incomplete bundles.

Motivation and Context

Fixes or at least improves https://github.com/Graylog2/graylog-plugin-enterprise/issues/13625

How Has This Been Tested?

  • Added SupportBundleResourceIT — full backend integration test covering all endpoints: manifest, logfile download, bundle build/list/download/delete, permission checks (403s), path traversal guards (404s), and zip content verification
  • Zip content test asserts cluster.json, per-node thread-dump.txt/metrics.json/server.mem.log, and datanode-specific datanodes//datanode.log and datanodes//certificates.json
  • Test class annotated with @EnabledIfSearchServer(distribution = DATANODE) — runs against a datanode backend

Screenshots (if appropriate):

image image

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Refactoring (non-breaking change)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have requested a documentation update.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.

@todvora todvora changed the title Feature/support bundle tweaks Support bundle tweaks Apr 3, 2026
@todvora todvora changed the title Support bundle tweaks Improve SupportBundleService performance, stability, and test coverage Apr 3, 2026
@todvora todvora requested a review from Copilot April 3, 2026 10:53
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves the Support Bundle feature’s scalability by parallelizing node/cluster data collection, reducing memory pressure during log streaming, and adding full-backend integration tests to cover the end-to-end bundle lifecycle and permissions.

Changes:

  • Parallelized per-node and cluster-wide support bundle data collection via CompletableFuture to reduce latency at scale.
  • Reduced heap usage and improved streaming correctness when proxying datanode logs.
  • Added full-backend integration tests covering manifest/logfile endpoints and bundle build/list/download/delete flows (including security/path traversal checks).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
graylog2-server/src/main/java/org/graylog2/rest/resources/system/debug/bundle/SupportBundleService.java Refactors bundle collection to run more work concurrently; improves streaming and bundle listing implementation details.
full-backend-tests/src/test/java/org/graylog2/rest/resources/system/debug/bundle/SupportBundleResourceIT.java Adds end-to-end backend tests for support bundle APIs, permissions, and downloaded zip contents.
Comments suppressed due to low confidence (1)

graylog2-server/src/main/java/org/graylog2/rest/resources/system/debug/bundle/SupportBundleService.java:561

  • listBundles(): Files.list(bundleDir) may include directories; the previous implementation explicitly filtered them out. Without filtering, Files.size(f) can throw (and break listing) if any directory under bundleDir happens to start with the bundle prefix. Consider restoring a regular-file filter (e.g., !Files.isDirectory(p) / Files.isRegularFile(p)) before computing size.
    public List<BundleFile> listBundles() {
        try (var files = Files.list(bundleDir)) {
            return files
                    .filter(p -> p.getFileName().toString().startsWith(BUNDLE_NAME_PREFIX))
                    .map(f -> {
                        try {
                            return new BundleFile(f.getFileName().toString(), Files.size(f));
                        } catch (IOException e) {

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@todvora todvora marked this pull request as ready for review April 7, 2026 07:09
@todvora todvora requested a review from a team April 7, 2026 07:13
@patrickmann patrickmann requested review from patrickmann and removed request for a team April 9, 2026 08:12
.when()
.post(BUILD_URL)
.then()
.statusCode(202);
Copy link
Copy Markdown
Contributor

@patrickmann patrickmann Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is not an async endpoint, it should return 200.

SupportBundleResource.java:111:

  public Response buildBundle(@Context HttpHeaders httpHeaders) {                                              
      supportBundleService.buildBundle(httpHeaders, getSubject());                                             
      return Response.ok().build();
  }  

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be a (tiny) breaking change, the response code has been like this before, it's not a new method. IDK if it's worth fixing here and now.

Comment on lines +224 to +226
final CompletableFuture<Map<String, CallResult<SystemOverviewResponse>>> systemOverviewFuture =
CompletableFuture.supplyAsync(() -> proxiedResourceHelper.requestOnAllNodes(
RemoteSystemResource.class, RemoteSystemResource::system, CALL_TIMEOUT), orchestrationExecutor);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The four orchestration futures (systemOverviewFuture, jvmFuture, processBufferFuture, installedPluginsFuture) have no failure handler. If requestOnAllNodes throws on any of them, the graceful catch at line 264 is immediately undone by the .join() at lines 281-284.

The fix would be adding .exceptionally() to each orchestration future, similar to what datanodeInfoFuture
already does — returning an empty map on failure:

  final CompletableFuture<Map<String, CallResult<SystemOverviewResponse>>> systemOverviewFuture =
          CompletableFuture.supplyAsync(() -> proxiedResourceHelper.requestOnAllNodes(                         
                  RemoteSystemResource.class, RemoteSystemResource::system, CALL_TIMEOUT),
  orchestrationExecutor)                                                                                       
          .exceptionally(e -> {                             
              LOG.warn("Failed to collect system overview for support bundle", e);                             
              return Map.of();                              
          });

Same for jvmFuture, processBufferFuture, and installedPluginsFuture. Then the allOf().get() catch block
becomes unnecessary (since no future can fail), and cluster.json always gets written with whatever succeeded.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are right, there is still some room for improvements. The requestOnAllNodes won't fail if any of the nodes fails, will deliver error result instead. But the exceptionally logic still makes sense. I also added those node-level errors to the errors.json, so we see that there are some nodes failing. So far they have been silently ignored.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. The speed-up alone is a major improvement.

@todvora
Copy link
Copy Markdown
Contributor Author

todvora commented Apr 10, 2026

The build is still failing with

Error:    SupportBundleResourceIT.downloadedBundleFileContentsAreValid:419 [datanode log must be non-empty] 

Which is related to the, already merged, #25521

I guess we'll need to bump the datanode version somewhere, to force the build to get a fresh datanode container, right, @bernd? Thanks!

@bernd
Copy link
Copy Markdown
Member

bernd commented Apr 10, 2026

I guess we'll need to bump the datanode version somewhere, to force the build to get a fresh datanode container, right, @bernd? Thanks!

@todvora The DatanodeInstanceProvider builds a local container image for the Data Node during the test run. So that shouldn't be a problem. You moved the Data Node test instance classes to the graylog-storage-opensearch3 module, not sure if that's a problem.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants