Optimize MongoDBExportPartitionSupplier for uniform _id type collections #6910

Open
dinujoh wants to merge 1 commit into opensearch-project:main from dinujoh:main

dinujoh (Member) commented Jun 7, 2026

Description

For collections with uniform _id types, replace the 8-clause $or query with a simple Filters.gt("_id", value) when finding partition boundaries. This lets DocumentDB use a single B-tree index seek instead of a multi-index scan.
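
As a rough illustration of the boundary walk described above, the sketch below models the _id index with an in-memory TreeSet rather than a real DocumentDB collection. The class and method names are hypothetical and are not taken from this PR. `higher()` stands in for a `Filters.gt("_id", value)` query sorted ascending with a limit of 1, and the `tailSet()` loop stands in for a `Filters.gte()` query with `skip()`. Using the last remaining _id as the end of a short final partition is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// In-memory model only: a TreeSet stands in for the B-tree index on _id.
class PartitionBoundarySketch {

    // Analogue of find(gt("_id", value)) sorted ascending with limit 1:
    // a single ordered seek to the first id strictly greater than value.
    static Long nextStart(NavigableSet<Long> index, long value) {
        return index.higher(value);
    }

    // Analogue of find(gte("_id", start)) sorted ascending with skip(size - 1).
    // Assumption: when fewer than `size` ids remain, the last id is the end.
    static long partitionEnd(NavigableSet<Long> index, long start, int size) {
        long end = start;
        int seen = 0;
        for (long id : index.tailSet(start, true)) {
            end = id;
            if (++seen == size) {
                break;
            }
        }
        return end;
    }

    // Walk the index and return inclusive {start, end} pairs.
    static List<long[]> partition(NavigableSet<Long> index, int size) {
        List<long[]> ranges = new ArrayList<>();
        Long start = index.isEmpty() ? null : index.first();
        while (start != null) {
            long end = partitionEnd(index, start, size);
            ranges.add(new long[] {start, end});
            start = nextStart(index, end);
        }
        return ranges;
    }

    public static void main(String[] args) {
        NavigableSet<Long> index =
                new TreeSet<>(Arrays.asList(1L, 2L, 5L, 8L, 13L, 21L, 34L));
        for (long[] range : partition(index, 3)) {
            System.out.println(Arrays.toString(range));
        }
    }
}
```

Each iteration issues one strict greater-than lookup for the next start and one skip-based lookup for the end, so the model never needs a type-aware predicate as long as every _id is directly comparable.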

Changes:

  • Add isUniformIdType(), which compares the _id types of the first and last documents
  • Add buildNextStartFilter(), which uses a simple $gt for uniform types and falls back to the $or-based query for mixed types
  • Build a fresh Filters.gte() query with skip() on each iteration to find the partition end
  • Extract addPartition() helper to reduce duplication
  • Make BsonHelper.isClassNumber() public for numeric type grouping
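
The uniform-type check and the filter selection in the list above can be sketched roughly as follows. This is not the PR's code: the class, enum, and method bodies are hypothetical, and plain Java objects stand in for BSON values. It assumes that documents sorted by _id are grouped by type, so matching types at both ends of the sorted order imply matching types in between, and it treats all numeric classes as one group, which is presumably the role of BsonHelper.isClassNumber().

```java
// Hypothetical stand-in for the uniform _id type check; plain Java objects
// are used in place of BSON values.
class UniformIdTypeSketch {

    // Which filter shape a caller would build for the next partition start.
    enum FilterKind { SIMPLE_GT, TYPE_AWARE_OR }

    // Collapse every numeric class into one group so that, for example,
    // Integer and Long ids count as the same type.
    static Class<?> typeGroup(Object id) {
        return (id instanceof Number) ? Number.class : id.getClass();
    }

    // Compare only the first and last ids in sorted order.
    // A missing id (for example, an empty collection) is treated as non-uniform.
    static boolean isUniformIdType(Object firstId, Object lastId) {
        if (firstId == null || lastId == null) {
            return false;
        }
        return typeGroup(firstId).equals(typeGroup(lastId));
    }

    // Uniform collections get the simple $gt filter; anything else keeps
    // the $or-based fallback.
    static FilterKind chooseFilter(Object firstId, Object lastId) {
        return isUniformIdType(firstId, lastId)
                ? FilterKind.SIMPLE_GT
                : FilterKind.TYPE_AWARE_OR;
    }

    public static void main(String[] args) {
        System.out.println(chooseFilter(1, 42L));      // both numeric
        System.out.println(chooseFilter(1, "zebra"));  // number and string
    }
}
```

Checking only two documents keeps the check cheap, at the cost of relying on the sort-order assumption stated above.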

Performance: 14M docs (10GB) partitioned in ~30 seconds.

Check List

  • New functionality includes testing.
  • New functionality has a documentation issue. Please link to it in this PR.
    • New functionality has javadoc added
  • Commits are signed with a real name per the DCO

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

For collections with uniform _id types, replace the 8-clause $or query
with a simple Filters.gt("_id", value) for finding partition boundaries.
This allows DocumentDB to use a single B-tree index seek instead of a
multi-index scan.

Changes:
- Add isUniformIdType() that checks first/last doc _id types
- Add buildNextStartFilter() with simple $gt for uniform types,
  falling back to $or-based query for mixed types
- Use fresh Filters.gte() + skip() per iteration for partition end
- Extract addPartition() helper to reduce duplication
- Make BsonHelper.isClassNumber() public for numeric type grouping

Performance: 14M docs (10GB) partitioned in ~30 seconds.

Signed-off-by: Dinu John <86094133+dinujoh@users.noreply.github.com>

github-actions Bot commented Jun 7, 2026

⚠️ License Header Violations Found

The following newly added files are missing required license headers:

  • data-prepper-plugins/mongodb/src/test/java/org/opensearch/dataprepper/plugins/mongo/export/MongoDBExportPartitionSupplierIsUniformIdTypeTest.java

Please add the appropriate license header to each file and push your changes.

See the license header requirements: https://github.com/opensearch-project/data-prepper/blob/main/CONTRIBUTING.md#license-headers
