Skip to content

Honor max_event_age in cluster event periodical and improve performance (7.0) (6.3)#25631

Merged
patrickmann merged 1 commit into6.3from
backport-6.3/backport-7.0/fix/issue-25259
Apr 14, 2026
Merged

Honor max_event_age in cluster event periodical and improve performance (7.0) (6.3)#25631
patrickmann merged 1 commit into6.3from
backport-6.3/backport-7.0/fix/issue-25259

Conversation

@graylog-internal-actions-access
Copy link
Copy Markdown

@graylog-internal-actions-access graylog-internal-actions-access bot commented Apr 14, 2026

Note: This is a backport of #25514 from 7.0 to 6.3. Clean cherry-pick without changes.
Resolves #25259

Problem

ClusterEventCleanupPeriodical had a hardcoded cleanup period of 1 day (86400s), regardless of the configured max_event_age. If max_event_age was set to e.g. 2 hours, stale cluster events could linger for up to ~25 hours before being cleaned up.

Behavioral Changes

  1. Dynamic cleanup period based on max_event_agegetPeriodSeconds() now returns the configured max_event_age duration (in seconds) instead of a fixed 1-day interval, clamped to a minimum of 1 hour to prevent excessive DB load. Default is reduced to 12 hours given the performance impact noted by customers.

  2. MongoDB index reordering (ClusterEventPeriodical) — The compound index changed from (timestamp, producer, consumers) to (consumers, timestamp). The producer field was removed since it's not used in the query predicate. This better matches the eventsIterable() query pattern, which filters on consumers ($nin) and sorts by timestamp. The new index allows MongoDB to satisfy both the filter and sort from a single index scan, whereas the old index order required scanning all timestamps first. Given the 1-second polling frequency across all cluster nodes, the cumulative effect is substantial — fewer documents scanned, no in-memory sorts, and a smaller index to maintain. On a cluster with N nodes, that's N queries/second against this collection, continuously. The improvement scales with both cluster size and event volume.

  3. Joda-Time → java.time migrationClusterEventCleanupPeriodical now uses java.time.Clock / Instant / Duration instead of Joda's DateTime. The Clock is injected via constructor, making the class properly testable without global time mocking.

Motivation and Context

Relates to #25259

Manual test

  1. Default behavior — Start server with default config (no max_event_age set). Verify in logs that ClusterEventCleanupPeriodical schedules at 43200s (12h), not 86400s (1d).

  2. Custom max_event_age — Set max_event_age = 1h in server.conf, restart. Confirm cleanup schedules at 3600s.

  3. Minimum clamp — Set max_event_age = 30m, restart. Confirm cleanup still schedules at 3600s (1h minimum).

  4. Index — Check db.cluster_events.getIndexes() in MongoDB. Should show { consumers: 1, timestamp: 1 } (not the old { timestamp: 1, producer: 1, consumers: 1 }).

Smoke test

  1. insert a stale event manually
  db.cluster_events.insertOne({                                                                           
    timestamp: NumberLong(0),                                                                             
    producer: "test",                                                                                     
    consumers: [],                                                                                        
    event_class: "java.lang.String",                                                                      
    payload: "test"                                                                                       
  })                                                                                                      
  1. restart with max_event_age = 1h, and confirm
  • the event is deleted shortly after startup (getInitialDelaySeconds() returns 0).
  • this line is in the log:

Starting [org.graylog2.events.ClusterEventCleanupPeriodical] periodical in [0s], polling every [3600s].

…ce (`7.0`) (#25514)

* Honor max_event_age in cluster event periodical and improve performance (#25265)

* honor max_event_age

* CL

* add validation; migrate test format to junit5 style

* reduce default and min period

* revise default and min value

* add config param documentation

---------

Co-authored-by: Anton Ebel <anton.ebel@graylog.com>
(cherry picked from commit a2c2f52)

* Drop old cluster_events index when creating the new one

The original PR replaced the compound index from
(timestamp, producer, consumers) to (consumers, timestamp)
but didn't remove the old index. Drop it on startup if present.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* adjust for graylog collection wrapper

---------

Co-authored-by: Anton Ebel <anton.ebel@graylog.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Ismail Belkacim <xd4rker@users.noreply.github.com>
(cherry picked from commit a3e0792)
@patrickmann patrickmann marked this pull request as ready for review April 14, 2026 06:58
Copy link
Copy Markdown
Contributor

@xd4rker xd4rker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@patrickmann patrickmann merged commit 4b33cb4 into 6.3 Apr 14, 2026
21 checks passed
@patrickmann patrickmann deleted the backport-6.3/backport-7.0/fix/issue-25259 branch April 14, 2026 08:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants