Chunk queries for duplicate documents, add metrics, track duplicate documents found in queries#5847
Merged
…documents if found Signed-off-by: Taylor Gray <tylgry@amazon.com>
kkondaka
reviewed
Jun 30, 2025
    ))
    ))
    ));
    final int batchSize = 1000;
Collaborator
Do we want to make this configurable?
Member
Author
We could make it configurable I suppose.
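For reference, a minimal sketch of what a configurable batch size could look like. This is not the code from this PR: the class `QueryChunker`, the method `chunk`, and the constant `DEFAULT_BATCH_SIZE` are hypothetical names used only for illustration, and the default of 1000 mirrors the hard-coded value in the diff above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the PR: splits query term values into
// batches so each existing-document query carries a bounded number of terms.
class QueryChunker {
    // Mirrors the hard-coded value in the diff; a config option could override it.
    static final int DEFAULT_BATCH_SIZE = 1000;

    static <T> List<List<T>> chunk(final List<T> items, final int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        final List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            final int end = Math.min(start + batchSize, items.size());
            // Copy the view so each batch is independent of the source list.
            batches.add(new ArrayList<>(items.subList(start, end)));
        }
        return batches;
    }
}
```

With this shape, the caller would pass the configured value (falling back to `DEFAULT_BATCH_SIZE`) and issue one query per returned batch.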
kkondaka
reviewed
Jun 30, 2025
    // Delete duplicate document
    LOG.warn("Bulk operation for term value {} with id {} is null, potentially a duplicate document, deleting", queryTermValue, hit.id());
    potentialDuplicatesDeleted.increment();
    deleteDuplicateDocument(hit.index(), hit.id());
Collaborator
Are we supposed to delete existing documents?
Member
Author
This happens when we find two documents with the same term value, so it does clean up a duplicate. It technically shouldn't happen, but if you'd prefer, I can just keep the metric and log that there is a duplicate rather than deleting it.
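To make the metric-and-log-only alternative concrete, here is a rough sketch. It is a simplification, not the PR's implementation: the PR detects the case through a null bulk operation, whereas this sketch simply tracks which term values have already been seen. `DuplicateTracker`, `Hit`, and `potentialDuplicatesFound` are hypothetical names, and an `AtomicLong` stands in for the real metrics counter.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: counts and reports hits that share a term value with
// an earlier hit, without deleting anything.
class DuplicateTracker {
    // Simplified stand-in for a search hit (index, document id, queried term value).
    static final class Hit {
        final String index;
        final String id;
        final String termValue;

        Hit(final String index, final String id, final String termValue) {
            this.index = index;
            this.id = id;
            this.termValue = termValue;
        }
    }

    // Stand-in for a metrics counter.
    final AtomicLong potentialDuplicatesFound = new AtomicLong();

    List<Hit> findDuplicates(final List<Hit> hits) {
        final Set<String> seenTermValues = new HashSet<>();
        final List<Hit> duplicates = new ArrayList<>();
        for (final Hit hit : hits) {
            // Set.add returns false when the term value was already seen.
            if (!seenTermValues.add(hit.termValue)) {
                potentialDuplicatesFound.incrementAndGet();
                duplicates.add(hit);
            }
        }
        return duplicates;
    }
}
```

The caller can then decide whether to only log the returned hits or also delete them, which keeps the destructive step separate from detection.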
kkondaka
previously approved these changes
Jun 30, 2025
Signed-off-by: Taylor Gray <tylgry@amazon.com>
kkondaka
approved these changes
Jul 2, 2025
sb2k16
approved these changes
Jul 2, 2025
JonahCalvo
pushed a commit
to JonahCalvo/os-data-prepper
that referenced
this pull request
Jul 17, 2025
…ocuments found in queries (opensearch-project#5847) Signed-off-by: Taylor Gray <tylgry@amazon.com> Signed-off-by: Jonah Calvo <caljonah@amazon.com>
Description
Makes the following improvements to querying for existing documents in the OpenSearch sink
addBulkOperation
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.