Update microsoft-365 source pagination logic to not skip time range #5979
Merged
Conversation
330be07 to a40f66a
graytaylor0 reviewed Aug 8, 2025
// 1ms is m365's smallest time unit so polling data starting from next 1ms would not skip any event.
// If not last page, keep nextPollAttemptStartTime to be contentCreated time so to avoid data loss in a rare scenario
// where 1ms have multiple events and split by nextPageUri
Instant nextPollAttemptStartTime = lastPage ? eventTime.plusMillis(1) : eventTime;
Member
Should we have a unit test for this?
a40f66a to 5ccec65
5ccec65 to 628658f
graytaylor0 approved these changes Aug 8, 2025
san81 approved these changes Aug 11, 2025
Description
This commit updates microsoft-365 source pagination so that events are not skipped in certain cases.
PaginationCrawler saves an item's lastModifiedAt in the coordinator table as the startTime of the next poll attempt. Currently the m365 source sets lastModifiedAt (nextPollAttemptTime) to Instant.now, which can skip events that arrive between the last event's contentCreated time and the current timestamp.
With this change, lastModifiedAt is set to the contentCreated time (the original implementation) when there is a next page, and to eventTime + 1 ms when there is no next page. This ensures that no event is missed in any scenario and that no event is duplicated in the common scenario.
Signed-off-by: Wenjie Yao <wjyao@amazon.com>
Manual Test
Before the test, the last event's contentCreated time is 2025-08-07T14:06:56.164Z.
Before the fix: the crawler jumps directly from 2025-08-07T14:06:56.164Z to the processing time, 2025-08-07T21:24:50.120423Z. That is a gap of around 7 hours, and any new events arriving during this period are skipped.
After the fix: the crawler polls data starting from the next timestamp.
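For illustration, the rule from the description can be replayed against the timestamps of the manual test. This is a minimal standalone sketch, not the plugin code: the class `NextPollStartTimeSketch` and the helper `nextPollAttemptStartTime(...)` are hypothetical names, and only the ternary expression is taken from the reviewed change.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical standalone sketch; the class and method names are assumptions,
// only the ternary expression mirrors the line under review.
class NextPollStartTimeSketch {

    // Last page: advance by 1 ms so the last event is not polled again.
    // More pages: stay on the same timestamp so events sharing that
    // millisecond but delivered on the next page are not lost.
    static Instant nextPollAttemptStartTime(Instant eventTime, boolean lastPage) {
        return lastPage ? eventTime.plusMillis(1) : eventTime;
    }

    public static void main(String[] args) {
        // Timestamps taken from the manual test.
        Instant lastEventCreated = Instant.parse("2025-08-07T14:06:56.164Z");
        Instant processingTime = Instant.parse("2025-08-07T21:24:50.120423Z");

        // Before the fix the next poll started at the processing time,
        // so this whole window was never polled.
        Duration skippedWindow = Duration.between(lastEventCreated, processingTime);
        System.out.println("window skipped before the fix: " + skippedWindow);

        // After the fix the next poll start depends on whether more pages remain.
        System.out.println("last page:  " + nextPollAttemptStartTime(lastEventCreated, true));
        System.out.println("more pages: " + nextPollAttemptStartTime(lastEventCreated, false));
    }
}
```

Keeping the start time unchanged while pages remain can re-deliver events from that millisecond, which matches the description's trade-off: no missing events in any scenario, no duplicates in the common one.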
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.