Honor max_event_age in cluster event periodical and improve performance (7.0) (6.3) #25631
Merged
patrickmann merged 1 commit into `6.3` on Apr 14, 2026
…ce (`7.0`) (#25514)

* Honor max_event_age in cluster event periodical and improve performance (#25265)
* honor max_event_age
* CL
* add validation; migrate test format to junit5 style
* reduce default and min period
* revise default and min value
* add config param documentation

---------

Co-authored-by: Anton Ebel <anton.ebel@graylog.com>
(cherry picked from commit a2c2f52)

* Drop old cluster_events index when creating the new one

The original PR replaced the compound index from (timestamp, producer, consumers) to (consumers, timestamp) but didn't remove the old index. Drop it on startup if present.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* adjust for graylog collection wrapper

---------

Co-authored-by: Anton Ebel <anton.ebel@graylog.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Ismail Belkacim <xd4rker@users.noreply.github.com>
(cherry picked from commit a3e0792)
Note: This is a backport of #25514 from `7.0` to `6.3`. Clean cherry-pick without changes.

Resolves #25259

Problem
`ClusterEventCleanupPeriodical` had a hardcoded cleanup period of 1 day (86400s), regardless of the configured `max_event_age`. If `max_event_age` was set to e.g. 2 hours, stale cluster events could linger for up to ~25 hours before being cleaned up.

Behavioral Changes

* Dynamic cleanup period based on `max_event_age`: `getPeriodSeconds()` now returns the configured `max_event_age` duration (in seconds) instead of a fixed 1-day interval, clamped to a minimum of 1 hour to prevent excessive DB load. The default is reduced to 12 hours given the performance impact noted by customers.
* MongoDB index reordering (`ClusterEventPeriodical`): The compound index changed from `(timestamp, producer, consumers)` to `(consumers, timestamp)`. The `producer` field was removed since it's not used in the query predicate. This better matches the `eventsIterable()` query pattern, which filters on `consumers` (`$nin`) and sorts by `timestamp`. The new index allows MongoDB to satisfy both the filter and sort from a single index scan, whereas the old index order required scanning all timestamps first. Given the 1-second polling frequency across all cluster nodes, the cumulative effect is substantial: fewer documents scanned, no in-memory sorts, and a smaller index to maintain. On a cluster with N nodes, that's N queries/second against this collection, continuously. The improvement scales with both cluster size and event volume.
* Joda-Time → java.time migration: `ClusterEventCleanupPeriodical` now uses `java.time.Clock`/`Instant`/`Duration` instead of Joda's `DateTime`. The `Clock` is injected via constructor, making the class properly testable without global time mocking.

Motivation and Context
Relates to #25259
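
For reviewers unfamiliar with the `Clock` injection pattern mentioned under the java.time migration, here is a minimal sketch of why it helps testing. The class name `CleanupCutoffSketch`, the `cutoff()` method, and the idea that staleness is decided by comparing against "now minus `max_event_age`" are assumptions made for illustration. This is not the actual `ClusterEventCleanupPeriodical` code.

```java
import java.time.Clock;
import java.time.Duration;
import java.time.Instant;
import java.time.ZoneOffset;

// Hypothetical illustration only; names and logic are assumptions,
// not the real ClusterEventCleanupPeriodical implementation.
final class CleanupCutoffSketch {
    private final Clock clock;
    private final Duration maxEventAge;

    // The Clock comes in through the constructor, so a test can pass
    // a fixed clock instead of mocking global time.
    CleanupCutoffSketch(Clock clock, Duration maxEventAge) {
        this.clock = clock;
        this.maxEventAge = maxEventAge;
    }

    // Assumed rule: events older than this instant count as stale.
    Instant cutoff() {
        return Instant.now(clock).minus(maxEventAge);
    }

    // Example of wiring a fixed clock, as a unit test might do.
    static CleanupCutoffSketch atFixedTime(Instant now, Duration maxEventAge) {
        return new CleanupCutoffSketch(Clock.fixed(now, ZoneOffset.UTC), maxEventAge);
    }
}
```

With a fixed clock, the cutoff is fully deterministic, so a test can assert on an exact `Instant` rather than tolerating timing drift.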
Manual test
* Default behavior: Start server with default config (no `max_event_age` set). Verify in logs that `ClusterEventCleanupPeriodical` schedules at 43200s (12h), not 86400s (1d).
* Custom `max_event_age`: Set `max_event_age = 1h` in `server.conf`, restart. Confirm cleanup schedules at 3600s.
* Minimum clamp: Set `max_event_age = 30m`, restart. Confirm cleanup still schedules at 3600s (1h minimum).
* Index: Check `db.cluster_events.getIndexes()` in MongoDB. Should show `{ consumers: 1, timestamp: 1 }` (not the old `{ timestamp: 1, producer: 1, consumers: 1 }`).
* Smoke test `max_event_age = 1h`, and confirm `getInitialDelaySeconds()` returns 0).
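
The schedule values expected in the manual test (43200s, 3600s, 3600s) follow from a simple clamp on the configured duration. The sketch below shows that arithmetic. The class `CleanupPeriodSketch`, the method `periodSeconds`, and the constant `MIN_PERIOD` are hypothetical names, and passing 12 hours to stand in for the default is an assumption based on the description above. The real `getPeriodSeconds()` may be structured differently.

```java
import java.time.Duration;

// Hypothetical illustration of the clamping described in this PR;
// not the actual Graylog implementation.
final class CleanupPeriodSketch {
    // 1-hour floor, as stated in the PR description.
    static final Duration MIN_PERIOD = Duration.ofHours(1);

    // Returns the cleanup period in seconds: the configured
    // max_event_age, but never less than MIN_PERIOD.
    static long periodSeconds(Duration maxEventAge) {
        Duration period = maxEventAge.compareTo(MIN_PERIOD) < 0 ? MIN_PERIOD : maxEventAge;
        return period.getSeconds();
    }
}
```

Under these assumptions, `periodSeconds(Duration.ofHours(12))` gives 43200, `periodSeconds(Duration.ofHours(1))` gives 3600, and `periodSeconds(Duration.ofMinutes(30))` is raised to 3600 by the floor.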