Add TokenPaginationCrawler for SAAS plugins#6008

Merged
dlvenable merged 4 commits into opensearch-project:main from bbenner7635:feature/token-pagination-crawler
Aug 25, 2025

Conversation

@bbenner7635

Contributor

Description

This PR adds TokenPaginationCrawler, a token-based pagination crawler that closely follows the PaginationCrawler logic. It creates partitions from pages of a fixed batch size; however, a token-based API requires pages to be retrieved sequentially before they can be processed.

The crawler can keep using PaginationCrawlerWorkerProgressState, since worker state is determined only by the items in the partition, not by a timestamp or token (src). The LeaderPartition fetches log IDs sequentially, while each WorkerPartition fetches the log contents for a batch created by the LeaderPartition.
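
The leader/worker split described above can be illustrated with a minimal sketch. This is not the code in this PR: `TokenPage`, `TokenPaginationSketch`, and `createBatches` are hypothetical names, and the page-fetching function stands in for whatever SaaS API a plugin would call. It only shows why the leader has to walk the token chain one page at a time while batches can then be handed out independently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical page shape: item IDs plus the token for the next page (null when there is none).
final class TokenPage {
    final List<String> itemIds;
    final String nextToken;

    TokenPage(List<String> itemIds, String nextToken) {
        this.itemIds = itemIds;
        this.nextToken = nextToken;
    }
}

final class TokenPaginationSketch {
    /**
     * Leader side (assumed behavior): follow the token chain sequentially and
     * cut the collected IDs into batches of at most batchSize. Each batch
     * stands in for one worker partition.
     */
    static List<List<String>> createBatches(Function<String, TokenPage> fetchPage,
                                            String startToken,
                                            int batchSize) {
        List<List<String>> batches = new ArrayList<>();
        List<String> current = new ArrayList<>();
        String token = startToken;
        do {
            // The next request needs the token returned by this one,
            // so page retrieval cannot be parallelized.
            TokenPage page = fetchPage.apply(token);
            for (String id : page.itemIds) {
                current.add(id);
                if (current.size() == batchSize) {
                    batches.add(current);
                    current = new ArrayList<>();
                }
            }
            token = page.nextToken;
        } while (token != null);
        if (!current.isEmpty()) {
            batches.add(current); // final, possibly smaller, batch
        }
        return batches;
    }
}
```

Each returned batch contains only item IDs, which matches the point that a worker's state depends on the items in its partition rather than on a timestamp or token.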

Issues Resolved

Resolves #6007

Check List

  • New functionality includes testing.
  • New functionality has a documentation issue. Please link to it in this PR.
    • New functionality has javadoc added
  • Commits are signed with a real name per the DCO

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Brendan Benner <bbenner@amazon.com>
Signed-off-by: Brendan Benner <bbenner@amazon.com>
@bbenner7635 bbenner7635 force-pushed the feature/token-pagination-crawler branch from cb6ddcf to 0e48019 Compare August 22, 2025 01:17
@JsonProperty("last_token")
private String lastToken;

private Instant lastPollTime;

Collaborator

Is this also needed here?

Contributor Author

Yes, because TokenPaginationCrawlerLeaderProgressState needs to implement LeaderProgressState, which has a required abstract method, setLastPollTime, for compatibility with the existing PaginationCrawler.
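
For readers following along, the constraint can be sketched roughly as below. `LeaderProgressStateSketch` is a simplified stand-in that declares only the setLastPollTime method mentioned in this reply; the real LeaderProgressState interface may declare more. `TokenLeaderStateSketch` is likewise a hypothetical name, and the Jackson annotation from the diff is left out so the snippet compiles without extra dependencies.

```java
import java.time.Instant;

// Simplified stand-in for the existing contract (assumed shape, not the real interface).
interface LeaderProgressStateSketch {
    void setLastPollTime(Instant lastPollTime);
}

// Token-based leader state: the token drives pagination, while lastPollTime
// exists only to satisfy the shared interface.
final class TokenLeaderStateSketch implements LeaderProgressStateSketch {
    private String lastToken;
    private Instant lastPollTime;

    String getLastToken() {
        return lastToken;
    }

    void setLastToken(String lastToken) {
        this.lastToken = lastToken;
    }

    Instant getLastPollTime() {
        return lastPollTime;
    }

    @Override
    public void setLastPollTime(Instant lastPollTime) {
        this.lastPollTime = lastPollTime;
    }
}
```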

…issues

Signed-off-by: Brendan Benner <bbenner@amazon.com>
@bbenner7635 bbenner7635 force-pushed the feature/token-pagination-crawler branch from b07991d to b0b0f09 Compare August 22, 2025 13:48
san81 previously approved these changes Aug 22, 2025

@san81 san81 left a comment

Collaborator

There is a lot of overlap between this crawler and the current last-poll-time-based crawler. We could generalize the two approaches and minimize redundant code by passing a custom state object to the crawler. If we define the behavior once and provide different implementations of the state object, that should fit both use cases while preserving backward compatibility.
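
One possible reading of this suggestion, as a rough sketch only: a single crawl loop parameterized by a cursor-holding state object, with one implementation for timestamps and one for tokens. Every name here (`CrawlStateSketch`, `CursorPage`, `GenericCrawlerSketch`, and the two state classes) is hypothetical and does not exist in the codebase.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical shared contract: the crawler only needs to read and advance a cursor.
interface CrawlStateSketch<C> {
    C getCursor();
    void setCursor(C cursor);
}

// Cursor is the last poll time.
final class TimestampStateSketch implements CrawlStateSketch<Instant> {
    private Instant lastPollTime;
    public Instant getCursor() { return lastPollTime; }
    public void setCursor(Instant cursor) { this.lastPollTime = cursor; }
}

// Cursor is the last pagination token.
final class TokenStateSketch implements CrawlStateSketch<String> {
    private String lastToken;
    public String getCursor() { return lastToken; }
    public void setCursor(String cursor) { this.lastToken = cursor; }
}

// One page of results plus the cursor for the next request (null when finished).
final class CursorPage<C> {
    final List<String> items;
    final C nextCursor;

    CursorPage(List<String> items, C nextCursor) {
        this.items = items;
        this.nextCursor = nextCursor;
    }
}

final class GenericCrawlerSketch {
    // The same loop serves both state types; only the cursor type differs.
    static <C> List<String> crawl(CrawlStateSketch<C> state,
                                  Function<C, CursorPage<C>> fetch) {
        List<String> all = new ArrayList<>();
        while (true) {
            CursorPage<C> page = fetch.apply(state.getCursor());
            all.addAll(page.items);
            if (page.nextCursor == null) {
                break; // keep the last valid cursor in the state
            }
            state.setCursor(page.nextCursor);
        }
        return all;
    }
}
```

Under this sketch, existing timestamp-based plugins would keep their current state fields, which is one way the backward compatibility mentioned above could be kept.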

@bbenner7635 bbenner7635 requested a review from dlvenable August 22, 2025 21:43
dlvenable previously approved these changes Aug 22, 2025
Signed-off-by: Brendan Benner <bbenner@amazon.com>
@bbenner7635 bbenner7635 dismissed stale reviews from dlvenable and san81 via bf84302 August 22, 2025 23:06
@bbenner7635 bbenner7635 force-pushed the feature/token-pagination-crawler branch from 143eff7 to bf84302 Compare August 22, 2025 23:06

@dlvenable dlvenable left a comment

Member

Thank you for the contribution, @bbenner7635!

@dlvenable dlvenable merged commit 8e37ee5 into opensearch-project:main Aug 25, 2025
49 of 51 checks passed

Development

Successfully merging this pull request may close these issues.

Add TokenPaginationCrawler for SAAS plugins

3 participants