Update microsoft-365 source pagination logic to not skip time range #5979

Merged
san81 merged 1 commit into opensearch-project:main from wjyao0316:m365-pagination on Aug 11, 2025

@wjyao0316

Contributor

Description

This commit updates the microsoft-365 source pagination so that it no longer skips events in certain cases.

PaginationCrawler saves an Item's lastModifiedAt as the next poll attempt's startTime in the coordinator table. Currently the m365 source sets lastModifiedAt (nextPollAttemptTime) to Instant.now, which can skip events that arrive between the last event's contentCreated time and the current timestamp.

With this change, lastModifiedAt is set to the contentCreated time (the original implementation) when there is a next page, and to eventTime+1ms when there is no further page. This ensures that no event is missed in any scenario and that no event is duplicated in the common scenario.
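
As a rough illustration of the rule described above, the following minimal sketch computes the next poll start time from the last event's contentCreated time. The class name `NextPollTimeSketch` and the helper `nextPollStartTime` are hypothetical and are not part of the plugin; only the ternary mirrors the change in this PR.

```java
import java.time.Instant;

public class NextPollTimeSketch {

    // Hypothetical helper, for illustration only: mirrors the rule in this PR.
    // Last page: resume 1 ms after the last event. Otherwise: resume at the event time itself.
    public static Instant nextPollStartTime(Instant lastEventContentCreated, boolean lastPage) {
        return lastPage ? lastEventContentCreated.plusMillis(1) : lastEventContentCreated;
    }

    public static void main(String[] args) {
        Instant lastEvent = Instant.parse("2025-08-07T14:06:56.164Z");
        // No further page: prints 2025-08-07T14:06:56.165Z
        System.out.println(nextPollStartTime(lastEvent, true));
        // A next page exists: prints 2025-08-07T14:06:56.164Z
        System.out.println(nextPollStartTime(lastEvent, false));
    }
}
```

The last-page result matches the start timestamp seen in the "after the fix" log lines of the manual test.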

Signed-off-by: Wenjie Yao <wjyao@amazon.com>

Manual Test
The last event's contentCreated time is 2025-08-07T14:06:56.164Z.

Before the fix, the crawler jumps directly from 2025-08-07T14:06:56.164Z to the processing time, 2025-08-07T21:24:50.120423Z. Any events that arrive during that window of roughly 7 hours would be skipped.

2025-08-07T14:26:01,877 [pool-7-thread-1] INFO  org.opensearch.dataprepper.plugins.source.microsoft_office365.Office365CrawlerClient - Starting to list Office 365 audit logs from 2025-08-07T21:24:50.120423Z
2025-08-07T14:26:01,878 [pool-7-thread-1] INFO  org.opensearch.dataprepper.plugins.source.microsoft_office365.Office365RestClient - Starting Office 365 subscriptions for audit logs
2025-08-07T14:26:03,549 [pool-7-thread-1] INFO  org.opensearch.dataprepper.plugins.source.microsoft_office365.Office365Iterator - Initializing Office 365 iterator from timestamp: 2025-08-07T21:24:50.120423Z
2025-08-07T14:26:03,550 [pool-7-thread-1] INFO  org.opensearch.dataprepper.plugins.source.source_crawler.base.PaginationCrawler - Starting to crawl the source with lastPollTime: 2025-08-07T21:24:50.120423Z
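
The size of the skipped window in this run can be worked out from the two timestamps above. This is only a small arithmetic sketch; `SkippedWindowSketch` and `skippedWindow` are hypothetical names used for illustration.

```java
import java.time.Duration;
import java.time.Instant;

public class SkippedWindowSketch {

    // Hypothetical helper: time range between the last ingested event and the saved checkpoint.
    public static Duration skippedWindow(Instant lastEventContentCreated, Instant checkpoint) {
        return Duration.between(lastEventContentCreated, checkpoint);
    }

    public static void main(String[] args) {
        Instant lastEvent = Instant.parse("2025-08-07T14:06:56.164Z");
        Instant checkpoint = Instant.parse("2025-08-07T21:24:50.120423Z");
        // Prints PT7H17M53.956423S, i.e. a little over 7 hours
        System.out.println(skippedWindow(lastEvent, checkpoint));
    }
}
```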

After the fix, the crawler polls from the next timestamp (the last event's contentCreated time + 1 ms).

...
2025-08-07T14:15:19,094 [pool-7-thread-1] INFO  org.opensearch.dataprepper.plugins.source.microsoft_office365.Office365CrawlerClient - Starting to list Office 365 audit logs from 2025-08-07T14:06:56.165Z
2025-08-07T14:15:19,097 [pool-7-thread-1] INFO  org.opensearch.dataprepper.plugins.source.microsoft_office365.Office365RestClient - Starting Office 365 subscriptions for audit logs
2025-08-07T14:15:20,507 [pool-7-thread-1] INFO  org.opensearch.dataprepper.plugins.source.microsoft_office365.Office365Iterator - Initializing Office 365 iterator from timestamp: 2025-08-07T14:06:56.165Z

Before the test, the last event's contentCreated time is

Check List

  • New functionality includes testing.
  • [N/A] New functionality has a documentation issue. Please link to it in this PR.
    • [N/A] New functionality has javadoc added
  • Commits are signed with a real name per the DCO

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

// 1 ms is m365's smallest time unit, so polling from the next millisecond does not skip any event.
// If this is not the last page, keep nextPollAttemptStartTime at the contentCreated time to avoid data loss in the rare case
// where multiple events share the same millisecond and are split across pages by nextPageUri.
Instant nextPollAttemptStartTime = lastPage ? eventTime.plusMillis(1) : eventTime;
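
The rare case mentioned in the comment can be illustrated with a simplified model. The sketch below assumes that the poll start time is inclusive and that a poll returns every event at or after it; both are assumptions made for illustration, and `PageSplitSketch` and `poll` are hypothetical names rather than plugin code.

```java
import java.time.Instant;
import java.util.List;
import java.util.stream.Collectors;

public class PageSplitSketch {

    // Simplified stand-in for a time-range query: returns events at or after `from` (inclusive start assumed).
    public static List<Instant> poll(List<Instant> allEvents, Instant from) {
        return allEvents.stream()
                .filter(t -> !t.isBefore(from))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Instant t = Instant.parse("2025-08-07T14:06:56.164Z");
        // Two events share the same millisecond; assume page 1 ended after the first one.
        List<Instant> allEvents = List.of(t, t);

        // Resuming from t re-reads the first event but still returns the second: prints 2
        System.out.println(poll(allEvents, t).size());
        // Resuming from t + 1 ms would miss the second event entirely: prints 0
        System.out.println(poll(allEvents, t.plusMillis(1)).size());
    }
}
```

In this model, keeping the checkpoint at the event time on a non-last page trades a possible duplicate for the guarantee that the second event is not lost.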

Member

Should we have a unit test for this?

@san81 san81 merged commit 1b2b295 into opensearch-project:main Aug 11, 2025
46 of 47 checks passed
@wjyao0316 wjyao0316 deleted the m365-pagination branch August 21, 2025 17:13