Chunk queries for duplicate documents, add metrics, track duplicate documents found in queries #5847

Merged
graytaylor0 merged 2 commits into opensearch-project:main from graytaylor0:QueryFix on Jul 2, 2025

graytaylor0 (Member) commented Jun 30, 2025

Description

Makes the following improvements to querying for existing documents in the OpenSearch sink:

  • Chunks search requests into batches of 1,000, so that no single search request carries too many term values.
  • Improves the error log for failed search requests to include the full error rather than just the reason.
  • When two documents with the same term value are found in a query, logs the document id and adds a metric.
  • Adds a metric for the same term value being added to the query manager with addBulkOperation.
  • Adds a metric for the time taken to search for documents.
  • Fixes an issue where the metric tracking documents being queried was not decremented when events were dropped and released, which could cause process workers to block and deadlock forever.
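
The last fix in the list concerns a count of documents currently being queried. The following is a minimal sketch of the general pattern, not the actual Data Prepper code: every exit path, including the one where an event is dropped and released, has to decrement the count, otherwise a worker waiting for capacity never proceeds. The names `InFlightTracker`, `tryAdd`, `release`, and `InFlightExample` are hypothetical and chosen only for illustration.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration; these names do not come from the Data Prepper code base.
final class InFlightTracker {
    private final int maxInFlight;
    private final AtomicInteger inFlight = new AtomicInteger();

    InFlightTracker(final int maxInFlight) {
        this.maxInFlight = maxInFlight;
    }

    // Returns false when the limit is reached; the caller is expected to back off and retry.
    boolean tryAdd() {
        while (true) {
            final int current = inFlight.get();
            if (current >= maxInFlight) {
                return false;
            }
            if (inFlight.compareAndSet(current, current + 1)) {
                return true;
            }
        }
    }

    // Must be called on every exit path: indexed, failed, or dropped and released.
    void release() {
        inFlight.updateAndGet(n -> Math.max(0, n - 1));
    }

    int current() {
        return inFlight.get();
    }
}

final class InFlightExample {
    // Simulates handling one event. The finally block guarantees the decrement
    // even when the event is dropped part-way through.
    static boolean process(final InFlightTracker tracker, final boolean dropEvent) {
        if (!tracker.tryAdd()) {
            return false;
        }
        try {
            if (dropEvent) {
                throw new IllegalStateException("event dropped");
            }
            return true;
        } catch (IllegalStateException e) {
            return false;
        } finally {
            tracker.release();
        }
    }
}
```

If the `finally` block were missing, a tracker with a limit of 1 would refuse every later `tryAdd()` after a single dropped event, which is the kind of permanent blocking the fix describes.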

Check List

  • New functionality includes testing.
  • New functionality has a documentation issue. Please link to it in this PR.
    • New functionality has javadoc added
  • Commits are signed with a real name per the DCO

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

…documents if found

Signed-off-by: Taylor Gray <tylgry@amazon.com>
))
))
));
final int batchSize = 1000;
Collaborator

Do we want to make this configurable?

Member Author

We could make it configurable I suppose.
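
For reference, the batching discussed in this thread could be sketched as below, with the batch size passed in as a parameter so that it could later come from configuration. This is an assumption-laden illustration rather than the code in the PR: `TermBatcher`, `partition`, and `DEFAULT_BATCH_SIZE` are hypothetical names, and only the value 1000 comes from the diff.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of the Data Prepper code base.
final class TermBatcher {
    // Matches the hard-coded value in the diff above.
    static final int DEFAULT_BATCH_SIZE = 1000;

    // Splits the term values into consecutive batches of at most batchSize elements.
    static <T> List<List<T>> partition(final List<T> values, final int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        final List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < values.size(); start += batchSize) {
            final int end = Math.min(start + batchSize, values.size());
            // Copy the sub-list so each batch is independent of the source list.
            batches.add(new ArrayList<>(values.subList(start, end)));
        }
        return batches;
    }
}
```

With the default size, 2,500 term values would be split into three search requests carrying 1,000, 1,000, and 500 values.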

// Delete duplicate document
LOG.warn("Bulk operation for term value {} with id {} is null, potentially a duplicate document, deleting", queryTermValue, hit.id());
potentialDuplicatesDeleted.increment();
deleteDuplicateDocument(hit.index(), hit.id());
Collaborator

Are we supposed to delete existing documents?

Member Author

@graytaylor0 graytaylor0 Jun 30, 2025

This happens when we find 2 documents with the same term value, so it does clean up a duplicate. It technically shouldn't happen, but if you'd prefer, I can just keep the metric and the log that there is a duplicate rather than deleting it.
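
The alternative offered here, recording duplicates with a log entry and a metric instead of deleting them, could look roughly like the sketch below. It is a simplified stand-in and not the PR's implementation: `DuplicateTracker`, its nested `Hit` class, and the plain `AtomicLong` counter are hypothetical substitutes for the real search hit type and metrics objects.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical illustration; not the classes used in the OpenSearch sink.
final class DuplicateTracker {
    // Minimal stand-in for a search hit: a document id plus the term value it matched.
    static final class Hit {
        final String id;
        final String termValue;

        Hit(final String id, final String termValue) {
            this.id = id;
            this.termValue = termValue;
        }
    }

    // Stand-in for a metrics counter.
    private final AtomicLong duplicatesFound = new AtomicLong();

    // Returns the ids of hits whose term value was already seen earlier in the same
    // response. Nothing is deleted; the caller can log the returned ids.
    List<String> findDuplicateIds(final List<Hit> hits) {
        final Map<String, String> firstIdByTerm = new HashMap<>();
        final List<String> duplicateIds = new ArrayList<>();
        for (final Hit hit : hits) {
            final String firstId = firstIdByTerm.putIfAbsent(hit.termValue, hit.id);
            if (firstId != null) {
                duplicatesFound.incrementAndGet();
                duplicateIds.add(hit.id);
            }
        }
        return duplicateIds;
    }

    long duplicatesFound() {
        return duplicatesFound.get();
    }
}
```

The first hit for each term value is treated as the original and any later hit with the same value is counted as a duplicate; which of the two should be considered authoritative is a design decision the sketch does not settle.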

kkondaka previously approved these changes Jun 30, 2025
graytaylor0 changed the title from "Chunk queries for duplicate documents, add metrics, delete duplicate documents if found" to "Chunk queries for duplicate documents, add metrics, track duplicate documents found in queries" on Jul 1, 2025
Signed-off-by: Taylor Gray <tylgry@amazon.com>
@graytaylor0 graytaylor0 merged commit 54f8e29 into opensearch-project:main Jul 2, 2025
65 of 68 checks passed
JonahCalvo pushed a commit to JonahCalvo/os-data-prepper that referenced this pull request Jul 17, 2025
…ocuments found in queries (opensearch-project#5847)

Signed-off-by: Taylor Gray <tylgry@amazon.com>
Signed-off-by: Jonah Calvo <caljonah@amazon.com>