Improve SupportBundleService performance, stability, and test coverage#25555
Conversation
Pull request overview
This PR improves the Support Bundle feature’s scalability by parallelizing node/cluster data collection, reducing memory pressure during log streaming, and adding full-backend integration tests to cover the end-to-end bundle lifecycle and permissions.
Changes:
- Parallelized per-node and cluster-wide support bundle data collection via `CompletableFuture` to reduce latency at scale.
- Reduced heap usage and improved streaming correctness when proxying datanode logs.
- Added full-backend integration tests covering manifest/logfile endpoints and bundle build/list/download/delete flows (including security/path traversal checks).
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| graylog2-server/src/main/java/org/graylog2/rest/resources/system/debug/bundle/SupportBundleService.java | Refactors bundle collection to run more work concurrently; improves streaming and bundle listing implementation details. |
| full-backend-tests/src/test/java/org/graylog2/rest/resources/system/debug/bundle/SupportBundleResourceIT.java | Adds end-to-end backend tests for support bundle APIs, permissions, and downloaded zip contents. |
Comments suppressed due to low confidence (1)
graylog2-server/src/main/java/org/graylog2/rest/resources/system/debug/bundle/SupportBundleService.java:561
- listBundles(): Files.list(bundleDir) may include directories; the previous implementation explicitly filtered them out. Without filtering, Files.size(f) can throw (and break listing) if any directory under bundleDir happens to start with the bundle prefix. Consider restoring a regular-file filter (e.g., !Files.isDirectory(p) / Files.isRegularFile(p)) before computing size.
```java
public List<BundleFile> listBundles() {
    try (var files = Files.list(bundleDir)) {
        return files
                .filter(p -> p.getFileName().toString().startsWith(BUNDLE_NAME_PREFIX))
                .map(f -> {
                    try {
                        return new BundleFile(f.getFileName().toString(), Files.size(f));
                    } catch (IOException e) {
```
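A corrected version along the lines of the suggestion might look like the sketch below. `BundleFile` and `BUNDLE_NAME_PREFIX` are simplified stand-ins for the real Graylog types; the key change is the `Files::isRegularFile` filter before `Files.size()` is called:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

class BundleListing {
    // Simplified stand-in for the real BundleFile DTO.
    record BundleFile(String fileName, long size) {}

    static final String BUNDLE_NAME_PREFIX = "graylog-support-bundle";

    static List<BundleFile> listBundles(Path bundleDir) {
        try (var files = Files.list(bundleDir)) {
            return files
                    .filter(p -> p.getFileName().toString().startsWith(BUNDLE_NAME_PREFIX))
                    // Skip directories so Files.size() cannot throw on them
                    // and break the whole listing.
                    .filter(Files::isRegularFile)
                    .map(f -> {
                        try {
                            return new BundleFile(f.getFileName().toString(), Files.size(f));
                        } catch (IOException e) {
                            throw new UncheckedIOException(e);
                        }
                    })
                    .toList();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```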
```java
.when()
.post(BUILD_URL)
.then()
.statusCode(202);
```
Since this is not an async endpoint, it should return 200.
SupportBundleResource.java:111:

```java
public Response buildBundle(@Context HttpHeaders httpHeaders) {
    supportBundleService.buildBundle(httpHeaders, getSubject());
    return Response.ok().build();
}
```
This would be a (tiny) breaking change, the response code has been like this before, it's not a new method. IDK if it's worth fixing here and now.
```java
final CompletableFuture<Map<String, CallResult<SystemOverviewResponse>>> systemOverviewFuture =
        CompletableFuture.supplyAsync(() -> proxiedResourceHelper.requestOnAllNodes(
                RemoteSystemResource.class, RemoteSystemResource::system, CALL_TIMEOUT), orchestrationExecutor);
```
The four orchestration futures (systemOverviewFuture, jvmFuture, processBufferFuture, installedPluginsFuture) have no failure handler. If requestOnAllNodes throws on any of them, the graceful catch at line 264 is immediately undone by the .join() at lines 281-284.
The fix would be adding .exceptionally() to each orchestration future, similar to what datanodeInfoFuture
already does — returning an empty map on failure:
```java
final CompletableFuture<Map<String, CallResult<SystemOverviewResponse>>> systemOverviewFuture =
        CompletableFuture.supplyAsync(() -> proxiedResourceHelper.requestOnAllNodes(
                        RemoteSystemResource.class, RemoteSystemResource::system, CALL_TIMEOUT),
                orchestrationExecutor)
                .exceptionally(e -> {
                    LOG.warn("Failed to collect system overview for support bundle", e);
                    return Map.of();
                });
```
Same for jvmFuture, processBufferFuture, and installedPluginsFuture. Then the allOf().get() catch block
becomes unnecessary (since no future can fail), and cluster.json always gets written with whatever succeeded.
You are right, there is still some room for improvement. `requestOnAllNodes` won't fail if any of the nodes fails; it delivers an error result instead. But the `exceptionally` logic still makes sense. I also added those node-level errors to errors.json, so we can see that some nodes are failing. So far they had been silently ignored.
Nice. The speed-up alone is a major improvement.
The build is still failing with an error related to the already merged #25521. I guess we'll need to bump the datanode version somewhere to force the build to get a fresh datanode container, right, @bernd? Thanks!
@todvora The
The support bundle service had several performance and correctness issues that became significant at scale: sequential API fan-outs, thread pool starvation risk, memory pressure from buffering large log files, and no test coverage beyond a few unit cases.
TODO: #25521 needs to be merged first, without it, the datanode.log content test fails.
Parallelism
Per-node data collection (thread dump, metrics, system stats, certificates) now runs as independent concurrent tasks instead of sequentially per node. Cluster-wide fan-out calls (requestOnAllNodes) are moved to a short-lived orchestration executor to avoid thread-pool starvation — previously wrapping them in supplyAsync on the shared executor could deadlock when the pool was exhausted by blocking fan-out wrappers waiting on sub-tasks submitted to the same pool.
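The starvation-avoidance pattern described above can be sketched in miniature with plain `CompletableFuture`. This is an illustrative reconstruction, not the actual Graylog code: `fanOut` stands in for a blocking `requestOnAllNodes`-style wrapper, and the point is that the blocking wrappers run on their own short-lived executor while their sub-tasks use the shared pool:

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class OrchestrationSketch {
    // Shared pool for the actual per-node sub-tasks.
    static final ExecutorService workerPool = Executors.newFixedThreadPool(4);

    // Simulates a fan-out wrapper that blocks until all its sub-tasks finish.
    static String fanOut(String call) {
        var subTasks = List.of("node-a", "node-b").stream()
                .map(node -> CompletableFuture.supplyAsync(() -> call + "@" + node, workerPool))
                .toList();
        // Blocking join: if this wrapper also ran on workerPool, enough
        // concurrent wrappers could exhaust the pool and deadlock, because
        // the sub-tasks they wait on could never be scheduled.
        return subTasks.stream().map(CompletableFuture::join)
                .reduce((a, b) -> a + "," + b).orElseThrow();
    }

    static List<String> collect() {
        // Short-lived executor dedicated to the blocking wrappers only.
        ExecutorService orchestrationExecutor = Executors.newCachedThreadPool();
        try {
            var futures = List.of("system", "jvm", "processbuffer").stream()
                    .map(call -> CompletableFuture.supplyAsync(() -> fanOut(call), orchestrationExecutor))
                    .toList();
            return futures.stream().map(CompletableFuture::join).toList();
        } finally {
            orchestrationExecutor.shutdown();
        }
    }
}
```

Because the orchestration executor only ever hosts wrappers (never the sub-tasks they wait on), no circular wait on pool threads can arise.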
Efficiency & correctness
Datanode log transfer switches from ResponseBody.bytes() (full heap buffer) to byteStream().transferTo() to avoid large allocations. Files.walk() in listBundles() is replaced with Files.list() since the directory is flat. closeEntry() in writeZipFile() is now guarded so it only runs if putNextEntry() succeeded, preventing a misleading second exception when zip entry creation fails. SimpleDateFormat (not thread-safe, per-call) is replaced with a static DateTimeFormatter. Several internal helpers are simplified or inlined, and public constants that have no external users are made private. Call timeout increased from 10 s to 30 s.
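The streaming change can be illustrated with the standard library alone. OkHttp's `ResponseBody.byteStream()` hands back an `InputStream`, so the copy reduces to `InputStream.transferTo` (Java 9+), which moves data in small fixed-size chunks; this sketch uses in-memory streams as stand-ins for the HTTP response and the zip entry:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

class LogStreaming {
    // Copies the response stream to the bundle output chunk by chunk,
    // instead of materializing the whole log file on the heap first
    // (the equivalent of calling ResponseBody.bytes()).
    static long copyLog(InputStream body, OutputStream target) throws IOException {
        // transferTo uses a small internal buffer, so peak memory usage
        // is independent of the size of the log file being proxied.
        return body.transferTo(target);
    }
}
```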
Error handling
When a support bundle is built, individual data collection tasks (per-node logs, metrics, thread dumps, certificates, cluster info, datanode info) now run independently and tolerate failures: instead of a single failed future aborting the entire bundle with an HTTP 500, each failure is caught, logged, and recorded as a BundleError entry (component path, error message, and full stacktrace). Once all tasks complete, the collected errors are written to errors.json at the root of the zip archive, giving support engineers a single file to inspect when diagnosing incomplete bundles.
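The failure-tolerant pattern can be sketched as follows. `BundleError` here is a simplified stand-in for the real entry written to errors.json, and the three tasks are placeholders for the per-node collection steps; the point is that each future handles its own failure via `exceptionally` instead of letting one failure abort the whole bundle:

```java
import java.util.List;
import java.util.Objects;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentLinkedQueue;

class ErrorTolerantCollection {
    // Simplified stand-in for the BundleError entries written to errors.json.
    record BundleError(String component, String message) {}

    static List<String> collectAll(Queue<BundleError> errors) {
        var tasks = List.of(
                CompletableFuture.supplyAsync(() -> "thread-dump"),
                CompletableFuture.<String>supplyAsync(() -> {
                    throw new IllegalStateException("node unreachable");
                }),
                CompletableFuture.supplyAsync(() -> "metrics"));

        // Each task tolerates its own failure: record a BundleError and
        // carry on, so one failed collection step no longer aborts the
        // entire bundle with an HTTP 500.
        var guarded = tasks.stream()
                .map(f -> f.exceptionally(e -> {
                    var cause = e.getCause() != null ? e.getCause() : e;
                    errors.add(new BundleError("cluster/nodes", cause.getMessage()));
                    return null; // this component is simply absent from the zip
                }))
                .toList();

        return guarded.stream()
                .map(CompletableFuture::join)
                .filter(Objects::nonNull)
                .toList();
    }
}
```

After all futures settle, the accumulated queue would be serialized into errors.json at the root of the archive.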
Motivation and Context
Fixes or at least improves https://github.com/Graylog2/graylog-plugin-enterprise/issues/13625
How Has This Been Tested?
Screenshots (if appropriate):
Types of changes
Checklist: