Improve SupportBundleService performance, stability, and test coverage#25555
Conversation
Pull request overview
This PR improves the Support Bundle feature’s scalability by parallelizing node/cluster data collection, reducing memory pressure during log streaming, and adding full-backend integration tests to cover the end-to-end bundle lifecycle and permissions.
Changes:
- Parallelized per-node and cluster-wide support bundle data collection via `CompletableFuture` to reduce latency at scale.
- Reduced heap usage and improved streaming correctness when proxying datanode logs.
- Added full-backend integration tests covering manifest/logfile endpoints and bundle build/list/download/delete flows (including security/path traversal checks).
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| graylog2-server/src/main/java/org/graylog2/rest/resources/system/debug/bundle/SupportBundleService.java | Refactors bundle collection to run more work concurrently; improves streaming and bundle listing implementation details. |
| full-backend-tests/src/test/java/org/graylog2/rest/resources/system/debug/bundle/SupportBundleResourceIT.java | Adds end-to-end backend tests for support bundle APIs, permissions, and downloaded zip contents. |
Comments suppressed due to low confidence (1)
graylog2-server/src/main/java/org/graylog2/rest/resources/system/debug/bundle/SupportBundleService.java:561
- listBundles(): Files.list(bundleDir) may include directories; the previous implementation explicitly filtered them out. Without filtering, Files.size(f) can throw (and break listing) if any directory under bundleDir happens to start with the bundle prefix. Consider restoring a regular-file filter (e.g., !Files.isDirectory(p) / Files.isRegularFile(p)) before computing size.
```java
public List<BundleFile> listBundles() {
    try (var files = Files.list(bundleDir)) {
        return files
                .filter(p -> p.getFileName().toString().startsWith(BUNDLE_NAME_PREFIX))
                .map(f -> {
                    try {
                        return new BundleFile(f.getFileName().toString(), Files.size(f));
                    } catch (IOException e) {
```
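A corrected version along the lines of the suggestion might look like the sketch below. `BundleFile` and `BUNDLE_NAME_PREFIX` are simplified stand-ins for the real Graylog types; the key change is the `Files::isRegularFile` filter before `Files.size()` is called:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

class BundleListing {
    // Simplified stand-in for the real BundleFile DTO.
    record BundleFile(String fileName, long size) {}

    static final String BUNDLE_NAME_PREFIX = "graylog-support-bundle";

    static List<BundleFile> listBundles(Path bundleDir) {
        try (var files = Files.list(bundleDir)) {
            return files
                    .filter(p -> p.getFileName().toString().startsWith(BUNDLE_NAME_PREFIX))
                    // Skip directories so Files.size() cannot throw on them
                    // and break the whole listing.
                    .filter(Files::isRegularFile)
                    .map(f -> {
                        try {
                            return new BundleFile(f.getFileName().toString(), Files.size(f));
                        } catch (IOException e) {
                            throw new UncheckedIOException(e);
                        }
                    })
                    .toList();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```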
```java
.when()
.post(BUILD_URL)
.then()
.statusCode(202);
```
Since this is not an async endpoint, it should return 200.
SupportBundleResource.java:111:

```java
public Response buildBundle(@Context HttpHeaders httpHeaders) {
    supportBundleService.buildBundle(httpHeaders, getSubject());
    return Response.ok().build();
}
```
This would be a (tiny) breaking change, the response code has been like this before, it's not a new method. IDK if it's worth fixing here and now.
```java
final CompletableFuture<Map<String, CallResult<SystemOverviewResponse>>> systemOverviewFuture =
        CompletableFuture.supplyAsync(() -> proxiedResourceHelper.requestOnAllNodes(
                RemoteSystemResource.class, RemoteSystemResource::system, CALL_TIMEOUT), orchestrationExecutor);
```
The four orchestration futures (systemOverviewFuture, jvmFuture, processBufferFuture, installedPluginsFuture) have no failure handler. If requestOnAllNodes throws on any of them, the graceful catch at line 264 is immediately undone by the .join() at lines 281-284.
The fix would be adding .exceptionally() to each orchestration future, similar to what datanodeInfoFuture
already does — returning an empty map on failure:
```java
final CompletableFuture<Map<String, CallResult<SystemOverviewResponse>>> systemOverviewFuture =
        CompletableFuture.supplyAsync(() -> proxiedResourceHelper.requestOnAllNodes(
                        RemoteSystemResource.class, RemoteSystemResource::system, CALL_TIMEOUT),
                orchestrationExecutor)
                .exceptionally(e -> {
                    LOG.warn("Failed to collect system overview for support bundle", e);
                    return Map.of();
                });
```
Same for jvmFuture, processBufferFuture, and installedPluginsFuture. Then the allOf().get() catch block
becomes unnecessary (since no future can fail), and cluster.json always gets written with whatever succeeded.
You are right, there is still some room for improvement. `requestOnAllNodes` won't fail if any of the nodes fails; it delivers an error result instead. But the `exceptionally` logic still makes sense. I also added those node-level errors to errors.json, so we can see that some nodes are failing. So far they had been silently ignored.
Nice. The speed-up alone is a major improvement.
The build is still failing with an error related to the already merged #25521. I guess we'll need to bump the datanode version somewhere to force the build to get a fresh datanode container, right, @bernd? Thanks!
@todvora The
The support bundle service had several performance and correctness issues that became significant at scale: sequential API fan-outs, thread pool starvation risk, memory pressure from buffering large log files, and no test coverage beyond a few unit cases.
TODO: #25521 needs to be merged first, without it, the datanode.log content test fails.
Parallelism
Per-node data collection (thread dump, metrics, system stats, certificates) now runs as independent concurrent tasks instead of sequentially per node. Cluster-wide fan-out calls (requestOnAllNodes) are moved to a short-lived orchestration executor to avoid thread-pool starvation — previously wrapping them in supplyAsync on the shared executor could deadlock when the pool was exhausted by blocking fan-out wrappers waiting on sub-tasks submitted to the same pool.
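The starvation-avoidance pattern described above can be sketched in miniature with plain `CompletableFuture`. This is an illustrative reconstruction, not the actual Graylog code: `fanOut` stands in for a blocking `requestOnAllNodes`-style wrapper, and the point is that the blocking wrappers run on their own short-lived executor while their sub-tasks use the shared pool:

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class OrchestrationSketch {
    // Shared pool for the actual per-node sub-tasks.
    static final ExecutorService workerPool = Executors.newFixedThreadPool(4);

    // Simulates a fan-out wrapper that blocks until all its sub-tasks finish.
    static String fanOut(String call) {
        var subTasks = List.of("node-a", "node-b").stream()
                .map(node -> CompletableFuture.supplyAsync(() -> call + "@" + node, workerPool))
                .toList();
        // Blocking join: if this wrapper also ran on workerPool, enough
        // concurrent wrappers could exhaust the pool and deadlock, because
        // the sub-tasks they wait on could never be scheduled.
        return subTasks.stream().map(CompletableFuture::join)
                .reduce((a, b) -> a + "," + b).orElseThrow();
    }

    static List<String> collect() {
        // Short-lived executor dedicated to the blocking wrappers only.
        ExecutorService orchestrationExecutor = Executors.newCachedThreadPool();
        try {
            var futures = List.of("system", "jvm", "processbuffer").stream()
                    .map(call -> CompletableFuture.supplyAsync(() -> fanOut(call), orchestrationExecutor))
                    .toList();
            return futures.stream().map(CompletableFuture::join).toList();
        } finally {
            orchestrationExecutor.shutdown();
        }
    }
}
```

Because the orchestration executor only ever hosts wrappers (never the sub-tasks they wait on), no circular wait on pool threads can arise.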
Efficiency & correctness
Datanode log transfer switches from ResponseBody.bytes() (full heap buffer) to byteStream().transferTo() to avoid large allocations. Files.walk() in listBundles() is replaced with Files.list() since the directory is flat. closeEntry() in writeZipFile() is now guarded so it only runs if putNextEntry() succeeded, preventing a misleading second exception when zip entry creation fails. SimpleDateFormat (not thread-safe, per-call) is replaced with a static DateTimeFormatter. Several internal helpers are simplified or inlined, and public constants that have no external users are made private. Call timeout increased from 10 s to 30 s.
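The streaming change can be illustrated with the standard library alone. OkHttp's `ResponseBody.byteStream()` hands back an `InputStream`, so the copy reduces to `InputStream.transferTo` (Java 9+), which moves data in small fixed-size chunks; this sketch uses in-memory streams as stand-ins for the HTTP response and the zip entry:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

class LogStreaming {
    // Copies the response stream to the bundle output chunk by chunk,
    // instead of materializing the whole log file on the heap first
    // (the equivalent of calling ResponseBody.bytes()).
    static long copyLog(InputStream body, OutputStream target) throws IOException {
        // transferTo uses a small internal buffer, so peak memory usage
        // is independent of the size of the log file being proxied.
        return body.transferTo(target);
    }
}
```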
Error handling
When a support bundle is built, individual data collection tasks (per-node logs, metrics, thread dumps, certificates, cluster info, datanode info) now run independently and tolerate failures: instead of a single failed future aborting the entire bundle with an HTTP 500, each failure is caught, logged, and recorded as a BundleError entry (component path, error message, and full stacktrace). Once all tasks complete, the collected errors are written to errors.json at the root of the zip archive, giving support engineers a single file to inspect when diagnosing incomplete bundles.
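The failure-tolerant pattern can be sketched as follows. `BundleError` here is a simplified stand-in for the real entry written to errors.json, and the three tasks are placeholders for the per-node collection steps; the point is that each future handles its own failure via `exceptionally` instead of letting one failure abort the whole bundle:

```java
import java.util.List;
import java.util.Objects;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentLinkedQueue;

class ErrorTolerantCollection {
    // Simplified stand-in for the BundleError entries written to errors.json.
    record BundleError(String component, String message) {}

    static List<String> collectAll(Queue<BundleError> errors) {
        var tasks = List.of(
                CompletableFuture.supplyAsync(() -> "thread-dump"),
                CompletableFuture.<String>supplyAsync(() -> {
                    throw new IllegalStateException("node unreachable");
                }),
                CompletableFuture.supplyAsync(() -> "metrics"));

        // Each task tolerates its own failure: record a BundleError and
        // carry on, so one failed collection step no longer aborts the
        // entire bundle with an HTTP 500.
        var guarded = tasks.stream()
                .map(f -> f.exceptionally(e -> {
                    var cause = e.getCause() != null ? e.getCause() : e;
                    errors.add(new BundleError("cluster/nodes", cause.getMessage()));
                    return null; // this component is simply absent from the zip
                }))
                .toList();

        return guarded.stream()
                .map(CompletableFuture::join)
                .filter(Objects::nonNull)
                .toList();
    }
}
```

After all futures settle, the accumulated queue would be serialized into errors.json at the root of the archive.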
Motivation and Context
Fixes or at least improves https://github.com/Graylog2/graylog-plugin-enterprise/issues/13625
How Has This Been Tested?
Screenshots (if appropriate):
Types of changes
Checklist: