Skip to content

Honor max_event_age in cluster event periodical and improve performance (7.0)#25514

Merged
patrickmann merged 4 commits into7.0from
backport-7.0/fix/issue-25259
Apr 14, 2026
Merged

Honor max_event_age in cluster event periodical and improve performance (7.0)#25514
patrickmann merged 4 commits into7.0from
backport-7.0/fix/issue-25259

Conversation

@patrickmann
Copy link
Copy Markdown
Contributor

@patrickmann patrickmann commented Mar 31, 2026

Note: This is a backport of #25265 to 7.0.
However, a few changes were required to adapt from 7.1 (mainly affecting unit tests, see conflict resolution comment below). Also we add a missing dropIndex of the obsolete old cluster_events index.

Problem

ClusterEventCleanupPeriodical had a hardcoded cleanup period of 1 day (86400s), regardless of the configured max_event_age. If max_event_age was set to e.g. 2 hours, stale cluster events could linger for up to ~25 hours before being cleaned up.

Behavioral Changes

  1. Dynamic cleanup period based on max_event_agegetPeriodSeconds() now returns the configured max_event_age duration (in seconds) instead of a fixed 1-day interval, clamped to a minimum of 1 hour to prevent excessive DB load. Default is reduced to 12 hours given the performance impact noted by customers.

  2. MongoDB index reordering (ClusterEventPeriodical) — The compound index changed from (timestamp, producer, consumers) to (consumers, timestamp). The producer field was removed since it's not used in the query predicate. This better matches the eventsIterable() query pattern, which filters on consumers ($nin) and sorts by timestamp. The new index allows MongoDB to satisfy both the filter and sort from a single index scan, whereas the old index order required scanning all timestamps first. Given the 1-second polling frequency across all cluster nodes, the cumulative effect is substantial — fewer documents scanned, no in-memory sorts, and a smaller index to maintain. On a cluster with N nodes, that's N queries/second against this collection, continuously. The improvement scales with both cluster size and event volume.

  3. Joda-Time → java.time migrationClusterEventCleanupPeriodical now uses java.time.Clock / Instant / Duration instead of Joda's DateTime. The Clock is injected via constructor, making the class properly testable without global time mocking.

#25404 ## Motivation and Context
Relates to #25259

Manual test

  1. Default behavior — Start server with default config (no max_event_age set). Verify in logs that ClusterEventCleanupPeriodical schedules at 43200s (12h), not 86400s (1d).

  2. Custom max_event_age — Set max_event_age = 1h in server.conf, restart. Confirm cleanup schedules at 3600s.

  3. Minimum clamp — Set max_event_age = 30m, restart. Confirm cleanup still schedules at 3600s (1h minimum).

  4. Index — Check db.cluster_events.getIndexes() in MongoDB. Should show { consumers: 1, timestamp: 1 } (not the old { timestamp: 1, producer: 1, consumers: 1 }).

Smoke test

  1. insert a stale event manually
  db.cluster_events.insertOne({                                                                           
    timestamp: NumberLong(0),                                                                             
    producer: "test",                                                                                     
    consumers: [],                                                                                        
    event_class: "java.lang.String",                                                                      
    payload: "test"                                                                                       
  })                                                                                                      
  1. restart with max_event_age = 1h, and confirm
  • the event is deleted shortly after startup (getInitialDelaySeconds() returns 0).
  • this line is in the log:

Starting [org.graylog2.events.ClusterEventCleanupPeriodical] periodical in [0s], polling every [3600s].

…ce (#25265)

* honor max_event_age

* CL

* add validation; migrate test format to junit5 style

* reduce default and min period

* revise default and min value

* add config param documentation

---------

Co-authored-by: Anton Ebel <anton.ebel@graylog.com>
(cherry picked from commit a2c2f52)
@patrickmann
Copy link
Copy Markdown
Contributor Author

Conflict resolutions

Three files conflicted due to the JUnit 4 → 5 migration on master that hasn't landed on 7.0 yet.
All resolutions preserve 7.0's JUnit 4 test infrastructure while applying the PR's actual changes.

Configuration.java

  • Kept org.graylog2.configuration.Documentation (7.0's class) instead of DocumentationConstants (master rename, unrelated to this PR)
  • Discarded jadconfig Documentation/DocumentationSection imports (not present on 7.0)
  • Took the PR's field changes: @Documentation(...) annotation, PositiveJavaDurationValidator, default 12h

ClusterEventCleanupPeriodicalTest.java

  • Kept JUnit 4 imports/annotations (@Rule, @Before, @After) and MongoDBInstance rule
  • Applied PR changes: Instant replacing Joda DateTime, FIXED_CLOCK injection, DateTimeUtils removal
  • Adapted new getPeriodSeconds_* tests to construct MongoCollections manually (no JUnit 5 parameter injection on 7.0)

ClusterEventPeriodicalTest.java

  • Kept JUnit 4 lifecycle (@Before/@After) and MongoDBInstance rule
  • Applied PR changes: removed DateTimeUtils global time mocking, made class package-private

patrickmann and others added 2 commits March 31, 2026 14:47
The original PR replaced the compound index from
(timestamp, producer, consumers) to (consumers, timestamp)
but didn't remove the old index. Drop it on startup if present.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@patrickmann patrickmann marked this pull request as ready for review March 31, 2026 13:03
@patrickmann patrickmann requested review from a team and removed request for AntonEbel and dennisoelkers March 31, 2026 13:04
Copy link
Copy Markdown
Contributor

@xd4rker xd4rker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested on 7.0 and it worked as expected 👍

@patrickmann patrickmann merged commit a3e0792 into 7.0 Apr 14, 2026
15 checks passed
@patrickmann patrickmann deleted the backport-7.0/fix/issue-25259 branch April 14, 2026 06:55
garybot2 pushed a commit that referenced this pull request Apr 14, 2026
…ce (`7.0`) (#25514)

* Honor max_event_age in cluster event periodical and improve performance (#25265)

* honor max_event_age

* CL

* add validation; migrate test format to junit5 style

* reduce default and min period

* revise default and min value

* add config param documentation

---------

Co-authored-by: Anton Ebel <anton.ebel@graylog.com>
(cherry picked from commit a2c2f52)

* Drop old cluster_events index when creating the new one

The original PR replaced the compound index from
(timestamp, producer, consumers) to (consumers, timestamp)
but didn't remove the old index. Drop it on startup if present.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* adjust for graylog collection wrapper

---------

Co-authored-by: Anton Ebel <anton.ebel@graylog.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Ismail Belkacim <xd4rker@users.noreply.github.com>
(cherry picked from commit a3e0792)
patrickmann added a commit that referenced this pull request Apr 14, 2026
…ce (`7.0`) (#25514) (#25631)

* Honor max_event_age in cluster event periodical and improve performance (#25265)

* honor max_event_age

* CL

* add validation; migrate test format to junit5 style

* reduce default and min period

* revise default and min value

* add config param documentation

---------


(cherry picked from commit a2c2f52)

* Drop old cluster_events index when creating the new one

The original PR replaced the compound index from
(timestamp, producer, consumers) to (consumers, timestamp)
but didn't remove the old index. Drop it on startup if present.



* adjust for graylog collection wrapper

---------




(cherry picked from commit a3e0792)

Co-authored-by: Patrick Mann <patrickmann@users.noreply.github.com>
Co-authored-by: Anton Ebel <anton.ebel@graylog.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Ismail Belkacim <xd4rker@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants