Add BatchSize parameter to MongoDB source to prevent cursor timeouts on large collections#234
Add BatchSize parameter to MongoDB source to prevent cursor timeouts on large collections#234
Conversation
…rrors Co-authored-by: philnach <19275540+philnach@users.noreply.github.com>
Co-authored-by: philnach <19275540+philnach@users.noreply.github.com>
Co-authored-by: philnach <19275540+philnach@users.noreply.github.com>
|
@copilot Is this a good solution for largo MongoDB collecitons? Does the MongoDB drive/sdk support this scenario already? |
Yes, this is a good solution for large MongoDB collections. The MongoDB C# driver (v2.30.0) natively supports the How it works:
Why this solves the problem:
Alternative approaches considered:
The implementation follows MongoDB best practices and leverages native driver functionality. |
|
@markjbrown / @bowencode , this is an older issue, but it enables a scenario where there's large data coming from MongoDB. It allows the user to specify the BatchSize against the source. dmt already supported against the sink. |
philnach
left a comment
There was a problem hiding this comment.
Thanks for picking this up — the overall approach (using FindOptions.BatchSize to mitigate cursor timeouts on large MongoDB collections) is the right MongoDB primitive, and CI is green. Before merging I'd like to see three things addressed; details inline.
Blocking
- Silent-ignoring
BatchSize <= 0hides misconfiguration. A user typo like"BatchSize": -1will look configured but the cursor-timeout failure that motivated this PR will reappear unchanged. Please either log a warning when an invalid value is supplied, or validate at the settings layer with[Range(1, int.MaxValue)]so the user gets a clear error. - The new tests don't actually verify the plumbing. They exercise the auto-property setter on
MongoSourceSettings, but not the real contract — that whenBatchSizeis configured,FindOptions.BatchSizeis what gets passed to the driver. A small unit test onMongoRepository.FindAsyncusing aMock<IMongoCollection<BsonDocument>>capturing theFindOptionsargument would close that gap. - The basic README example now suggests
"BatchSize": 1000as a default. Users will read it as recommended/required. Please dropBatchSizefrom the basic example and rely on the dedicated subsection (which already explains it well) to introduce it.
Non-blocking suggestions
- Call out the
AsQueryable()→FindAsync(empty filter)switch in the PR description. It's a real (and probably good) async-correctness improvement that applies even whenBatchSizeisn't set — worth flagging for reviewers and future archaeology. LogBatchSizeIfSpecifiedis invoked from both branches ofEnumerateCollectionAsync; could be hoisted up to a single call there.- The comment in
MongoRepository.FindAsync("prevent exceptions during data migration") overstates the driver's behavior onBatchSize <= 0— please reword or remove. - One sentence in the README would help set expectations:
BatchSizemitigates cursor timeouts (smaller batches reset the idle timer between fetches) but does not disable them;NoCursorTimeoutis the actual disable, and isn't always honored on Azure Cosmos for MongoDB.
Nice-to-haves aside, I think items 1–3 are worth landing before merge. Happy to re-review quickly once those are addressed.
…ix README Agent-Logs-Url: https://github.com/AzureCosmosDB/data-migration-desktop-tool/sessions/a453d3dc-f1ec-499a-94b3-f45bf287abd2 Co-authored-by: philnach <19275540+philnach@users.noreply.github.com>
|
Thanks for the quick turnaround on
From my side this is good to go pending CI. Both workflows on |
Migrating MongoDB collections with 250k+ documents fails with
MongoCursorNotFoundExceptionbecause cursors timeout before all documents are fetched.Changes
BatchSizeproperty to control documents per batchFindAsyncto applyBatchSizeviaFindOptionswhen specifiedFindAsyncwith batch size support (previously unfiltered reads usedAsQueryable()which doesn't support batching). This change fromAsQueryable()toFindAsync(empty filter)provides async-correctness improvements (truly async, no thread-pool hop, supports server-side options) that apply even whenBatchSizeis not configured.BatchSize <= 0is specified to prevent silent misconfigurationNoCursorTimeoutwould, though it's not honored on Azure Cosmos DB)MongoRepositoryTests.cs) verifying thatFindOptions.BatchSizeis correctly passed to the MongoDB driverUsage
{ "ConnectionString": "mongodb://localhost:27017", "DatabaseName": "mydb", "Collection": "large_collection", "BatchSize": 1000 }If not specified, MongoDB's default batch size applies. Invalid values (
<= 0) trigger a warning log and are ignored to prevent migration failures.Validation
Invalid BatchSize values are logged with a warning:
Original prompt
💬 We'd love your input! Share your thoughts on Copilot coding agent in our 2 minute survey.