Honor max_event_age in cluster event periodical and improve performance (`7.0`) (`6.3`) by graylog-internal-actions-access[bot] · Pull Request #25631 · Graylog2/graylog2-server

graylog-internal-actions-access · 2026-04-14T06:57:30Z

Note: This is a backport of #25514 from 7.0 to 6.3. Clean cherry-pick without changes.
Resolves #25259

Problem

ClusterEventCleanupPeriodical had a hardcoded cleanup period of 1 day (86400s), regardless of the configured max_event_age. If max_event_age was set to e.g. 2 hours, stale cluster events could linger for up to ~25 hours before being cleaned up.

Behavioral Changes

Dynamic cleanup period based on max_event_age: getPeriodSeconds() now returns the configured max_event_age duration (in seconds) instead of a fixed 1-day interval, clamped to a minimum of 1 hour to prevent excessive DB load. When max_event_age is not set, the period now defaults to 12 hours instead of 1 day, given the performance impact noted by customers.
MongoDB index reordering (ClusterEventPeriodical): The compound index changed from (timestamp, producer, consumers) to (consumers, timestamp). The producer field was removed since it's not used in the query predicate. This better matches the eventsIterable() query pattern, which filters on consumers ($nin) and sorts by timestamp. The new index allows MongoDB to satisfy both the filter and sort from a single index scan, whereas the old index order required scanning all timestamps first. Because every cluster node polls once per second, a cluster with N nodes runs N queries per second against this collection, continuously, so the cumulative effect is substantial: fewer documents scanned, no in-memory sorts, and a smaller index to maintain. The improvement scales with both cluster size and event volume.
Joda-Time → java.time migration: ClusterEventCleanupPeriodical now uses java.time.Clock / Instant / Duration instead of Joda's DateTime. The Clock is injected via constructor, making the class properly testable without global time mocking.
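
The clamping and the injected Clock described above can be illustrated with a minimal sketch. This is not the actual ClusterEventCleanupPeriodical source: the class name CleanupScheduleSketch, the methods periodSeconds() and cutoff(), and the constant MIN_PERIOD are hypothetical names chosen for illustration, and the assumption that the staleness cutoff is "now minus max_event_age" is inferred from the description rather than taken from the code.

```java
import java.time.Clock;
import java.time.Duration;
import java.time.Instant;
import java.time.ZoneOffset;

// Hypothetical illustration only; not the real Graylog class.
public class CleanupScheduleSketch {
    // Minimum period of 1 hour, as stated in the PR description.
    public static final Duration MIN_PERIOD = Duration.ofHours(1);

    private final Duration maxEventAge;
    private final Clock clock;

    // The Clock is passed in so tests can supply a fixed one.
    public CleanupScheduleSketch(Duration maxEventAge, Clock clock) {
        this.maxEventAge = maxEventAge;
        this.clock = clock;
    }

    // Period follows max_event_age but never drops below MIN_PERIOD.
    public long periodSeconds() {
        Duration period = maxEventAge.compareTo(MIN_PERIOD) < 0 ? MIN_PERIOD : maxEventAge;
        return period.getSeconds();
    }

    // Assumption: events with a timestamp before this instant count as stale.
    public Instant cutoff() {
        return clock.instant().minus(maxEventAge);
    }

    public static void main(String[] args) {
        Clock fixed = Clock.fixed(Instant.parse("2026-04-14T12:00:00Z"), ZoneOffset.UTC);
        Duration[] ages = {Duration.ofHours(12), Duration.ofHours(1), Duration.ofMinutes(30)};
        for (Duration age : ages) {
            CleanupScheduleSketch s = new CleanupScheduleSketch(age, fixed);
            System.out.println(age + " -> period " + s.periodSeconds() + "s, cutoff " + s.cutoff());
        }
    }
}
```

With a fixed Clock, the three durations in main correspond to the cases in the manual test: 12 hours yields 43200 seconds, 1 hour yields 3600 seconds, and 30 minutes is clamped to 3600 seconds.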

Motivation and Context

Relates to #25259

Manual test

Default behavior — Start server with default config (no max_event_age set). Verify in logs that ClusterEventCleanupPeriodical schedules at 43200s (12h), not 86400s (1d).
Custom max_event_age — Set max_event_age = 1h in server.conf, restart. Confirm cleanup schedules at 3600s.
Minimum clamp — Set max_event_age = 30m, restart. Confirm cleanup still schedules at 3600s (1h minimum).
Index — Check db.cluster_events.getIndexes() in MongoDB. Should show { consumers: 1, timestamp: 1 } (not the old { timestamp: 1, producer: 1, consumers: 1 }).

Smoke test

Insert a stale event manually:

  db.cluster_events.insertOne({
    timestamp: NumberLong(0),
    producer: "test",
    consumers: [],
    event_class: "java.lang.String",
    payload: "test"
  })

Restart with max_event_age = 1h and confirm that:

the event is deleted shortly after startup (getInitialDelaySeconds() returns 0);
this line appears in the log:

Starting [org.graylog2.events.ClusterEventCleanupPeriodical] periodical in [0s], polling every [3600s].
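
If the log check needs to be scripted, a small helper along these lines could extract the initial delay and polling period from that startup line. It is only a sketch: the class PeriodicalLogCheck and its parse method are hypothetical, and the regular expression is derived solely from the single log line quoted above, so it may not match other Graylog versions or log layouts.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the pattern is based only on the one log line quoted in this PR.
public class PeriodicalLogCheck {
    private static final Pattern STARTUP_LINE = Pattern.compile(
            "Starting \\[(?<name>[^\\]]+)\\] periodical in \\[(?<delay>\\d+)s\\], "
                    + "polling every \\[(?<period>\\d+)s\\]\\.");

    // Returns {initialDelaySeconds, periodSeconds} parsed from the log line.
    public static long[] parse(String logLine) {
        Matcher m = STARTUP_LINE.matcher(logLine);
        if (!m.find()) {
            throw new IllegalArgumentException("Not a periodical startup line: " + logLine);
        }
        return new long[] {Long.parseLong(m.group("delay")), Long.parseLong(m.group("period"))};
    }

    public static void main(String[] args) {
        String line = "Starting [org.graylog2.events.ClusterEventCleanupPeriodical] "
                + "periodical in [0s], polling every [3600s].";
        long[] values = parse(line);
        System.out.println("initial delay = " + values[0] + "s, period = " + values[1] + "s");
    }
}
```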

…ce (`7.0`) (#25514)

* Honor max_event_age in cluster event periodical and improve performance (#25265)
* honor max_event_age
* CL
* add validation; migrate test format to junit5 style
* reduce default and min period
* revise default and min value
* add config param documentation

---------

Co-authored-by: Anton Ebel <anton.ebel@graylog.com>
(cherry picked from commit a2c2f52)

* Drop old cluster_events index when creating the new one

The original PR replaced the compound index from (timestamp, producer, consumers) to (consumers, timestamp) but didn't remove the old index. Drop it on startup if present.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* adjust for graylog collection wrapper

---------

Co-authored-by: Anton Ebel <anton.ebel@graylog.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Ismail Belkacim <xd4rker@users.noreply.github.com>
(cherry picked from commit a3e0792)

xd4rker

LGTM!

graylog-internal-actions-access bot added the backport label Apr 14, 2026

graylog-internal-actions-access bot requested a review from xd4rker April 14, 2026 06:57

patrickmann marked this pull request as ready for review April 14, 2026 06:58

xd4rker approved these changes Apr 14, 2026

patrickmann merged commit 4b33cb4 into 6.3 Apr 14, 2026
21 checks passed

patrickmann deleted the backport-6.3/backport-7.0/fix/issue-25259 branch April 14, 2026 08:42
