docs: document multi-worker DataLoader support for remote tables #260

Merged
prrao87 merged 1 commit into main from mintlify/f5da8d82
Jul 3, 2026

Conversation

@mintlify mintlify Bot commented Jun 1, 2026

Contributor

Summary

Document that remote LanceDB tables can now be used directly with multi-worker PyTorch DataLoaders, and refresh guidance on the optional with_connection_factory escape hatch.

Changes

  • Add a "Using multiple DataLoader workers" section to the PyTorch integration page covering num_workers, spawn, and persistent_workers.
  • Add a subsection showing remote (db://) tables working in worker processes out of the box.
  • Add a subsection showing how to provide a custom connection factory when credentials should be loaded inside the worker.

Context

Triggered by an upstream change that lets remote tables carry their connection state through pickling so they reopen correctly in PyTorch DataLoader workers, while keeping the connection factory available for custom credential loading.
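
To illustrate the general idea of carrying connection state through pickling, here is a minimal stdlib-only sketch. It is not LanceDB's implementation: `FakeConnection` and `RemoteTableHandle` are hypothetical names invented for this example, and the URI and key are placeholders. The handle pickles only its plain parameters and rebuilds the live connection when it is unpickled, which is what happens when a DataLoader worker receives its copy of the dataset.

```python
import pickle


class FakeConnection:
    """Hypothetical stand-in for a live network connection; not picklable."""

    def __init__(self, uri, api_key):
        self.uri = uri
        self.api_key = api_key

    def __reduce__(self):
        # Live sockets and clients generally cannot cross process boundaries.
        raise TypeError("live connections cannot be pickled")


class RemoteTableHandle:
    """Hypothetical table handle that reopens its connection after unpickling."""

    def __init__(self, uri, api_key, table_name):
        self._params = {"uri": uri, "api_key": api_key, "table_name": table_name}
        self._conn = FakeConnection(uri, api_key)

    def __getstate__(self):
        # Ship only the plain parameters, never the live connection.
        return {"_params": self._params}

    def __setstate__(self, state):
        # Runs in the receiving process: rebuild the connection from parameters.
        self._params = state["_params"]
        self._conn = FakeConnection(self._params["uri"], self._params["api_key"])


handle = RemoteTableHandle("db://example", "placeholder-key", "my_table")
clone = pickle.loads(pickle.dumps(handle))  # simulates the hand-off to a worker
```

Note that in this sketch the API key travels inside the pickled state. Loading credentials inside the worker instead is the kind of case the connection factory mentioned above is meant to cover.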

mintlify Bot commented Jun 1, 2026

Contributor Author

Preview deployment for your docs. Learn more about Mintlify Previews.

| Project | Status | Preview | Updated (UTC) |
| --- | --- | --- | --- |
| lancedb-bcbb4faf | 🟢 Ready | View Preview | Jun 1, 2026, 9:57 AM |

prrao87 approved these changes Jun 9, 2026

prrao87 commented Jun 9, 2026

Contributor

cc @westonpace - I presume these docs make sense based on your latest additions.

prrao87 commented Jul 2, 2026

Contributor

I pushed a follow-up commit to make the PyTorch examples match current LanceDB behavior.

What I tested:

  • Built the latest lancedb main locally from /Users/prrao/code/lancedb with maturin develop.
  • Installed torch==2.12.1.
  • Ran python/tests/test_torch.py: 9 passed, 2 skipped.
  • Added temporary docs-level repros in /private/tmp/test_lancedb_torch_docs_snippets.py: 4 passed.
  • Reproduced the direct DataLoader(table, ...) example without collate_fn; it fails with PyTorch default collation:
TypeError: default_collate: batch must contain tensors, numpy arrays, numbers, dicts or lists; found <class 'pyarrow.lib.ChunkedArray'>

Why this change is necessary:

  • Direct LanceDB Table datasets emit Arrow data, and PyTorch's default collate_fn does not accept Arrow batches directly.
  • The direct Table examples need collate_fn=tbl_to_tensor to produce tensors.
  • Permutation examples work as written with PyTorch's default collation because the default Permutation output is a list of Python dicts, which PyTorch batches into a dict of tensors.
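
The difference between the two cases can be modeled with a toy collate function. This is a deliberately simplified, pure-Python sketch and not PyTorch's `default_collate`: `toy_collate` and `OpaqueColumn` are hypothetical names, and the real function stacks values into tensors rather than lists. It only shows why a list of dicts collates cleanly while an unrecognized container type raises `TypeError`.

```python
from numbers import Number


def toy_collate(batch):
    """Simplified model of default collation: handles numbers and dicts only."""
    first = batch[0]
    if isinstance(first, Number):
        # Real collation would stack these into a tensor; a list stands in here.
        return list(batch)
    if isinstance(first, dict):
        # Collate each key across the batch: list of dicts -> dict of columns.
        return {key: toy_collate([row[key] for row in batch]) for key in first}
    raise TypeError(f"toy_collate: unsupported element type {type(first)}")


class OpaqueColumn:
    """Hypothetical stand-in for an Arrow object the collate step cannot handle."""


# Rows shaped like a list of Python dicts collate into a dict of columns.
rows = [{"id": 1, "score": 0.5}, {"id": 2, "score": 0.75}]
collated = toy_collate(rows)

# An unrecognized container is rejected, mirroring the error reported above.
try:
    toy_collate([OpaqueColumn()])
    rejected = False
except TypeError:
    rejected = True
```

A custom `collate_fn` fills the same role as the missing branch here: it converts the unsupported type into something the training loop can consume.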

prrao87 commented Jul 2, 2026

Contributor

@westonpace could you please review this PyTorch docs update when possible? While testing the code that the agent wrote, I found that some recently merged changes caused this error to surface:

TypeError: default_collate: batch must contain tensors, numpy arrays, numbers, dicts or lists; found <class 'pyarrow.lib.ChunkedArray'>

I've pushed a fix that uses tbl_to_tensor from the newly added utils, and it works now. I just want to be sure the explanation in the PyTorch page makes sense.

Also, are we missing anything in the documentation about how to handle these on RemoteTable specifically?

@prrao87 prrao87 requested a review from westonpace July 2, 2026 18:28

@westonpace westonpace left a comment

Contributor

Some suggestions, but nice examples.

Comment thread docs/training/torch.mdx Outdated
Comment on lines +54 to +56
significant performance penalty converting from Arrow, Lance's internal representation, to this default format. Use a
direct `Table` with `collate_fn` when you want Arrow-to-tensor conversion, or a `Permutation` when you want the default
PyTorch dict-of-tensors behavior.

Contributor

Hmm, Permutation can also support direct Arrow-to-tensor conversion. It just isn't the default. This makes it sound like you'd have to use a Table.

@prrao87 prrao87 Jul 3, 2026

Contributor

Yup, makes sense! Fixed.

Comment thread docs/training/torch.mdx Outdated

Set `num_workers > 0` to read from LanceDB in multiple PyTorch worker processes. LanceDB tables and `Permutation` objects are picklable, so each worker reopens the table after it starts.

Prefer the `spawn` start method when using multiple workers; LanceDB uses internal threads. See [the performance guide](/performance) for more multiprocessing guidance.

Contributor

Actually, forkserver is probably better than spawn (and will be the new Python default). I'd say that should be our preference.

Once the streaming dataset is available, my guidance would be to use forkserver with num_workers=1 unless you can prove you have GIL contention in your transform function.

@prrao87 prrao87 Jul 3, 2026

Contributor

Done. Switched the preference to forkserver (and added a note that it's POSIX-only, so Windows falls back to spawn).

On num_workers=1: since that guidance is gated on the streaming dataset, and that's still WIP / held until release (per #294), I've left the examples at num_workers=4 for now so this page doesn't document unshipped behavior. Tracking the num_workers=1 recommendation to land alongside the streaming dataset docs in #294 when it lands on main.
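
The forkserver-with-spawn-fallback preference can be sketched with the standard library alone. This is a hedged illustration, not code from the docs page: `pick_start_method` is a hypothetical helper, and the `DataLoader` line is left as a comment so the snippet runs without PyTorch installed.

```python
import multiprocessing as mp


def pick_start_method(preferred=("forkserver", "spawn")):
    """Return the first preferred start method this platform supports."""
    available = mp.get_all_start_methods()
    for method in preferred:
        if method in available:
            return method
    raise RuntimeError(f"none of {preferred} available; found {available}")


# forkserver where the platform offers it, spawn otherwise (e.g. on Windows).
method = pick_start_method()
ctx = mp.get_context(method)

# With PyTorch installed, the context can be passed to the loader, for example:
# DataLoader(dataset, num_workers=4, multiprocessing_context=ctx)
```

Using a context object rather than `multiprocessing.set_start_method` keeps the choice local to the loader and avoids changing the process-wide default.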

prrao87 added a commit that referenced this pull request Jul 3, 2026
Address review feedback on #260:
- Permutation can also produce tensors directly from Arrow; reword so it
  no longer implies a Table is required for Arrow-to-tensor conversion.
- Recommend the forkserver start method over spawn for DataLoader workers.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Address review feedback on #260:
- Clarify that Permutation can also produce tensors directly from Arrow,
  so a Table is not required for Arrow-to-tensor conversion.
- Recommend the forkserver start method over spawn for DataLoader workers.
- Note that forkserver is POSIX-only; Windows must use spawn.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@prrao87 prrao87 force-pushed the mintlify/f5da8d82 branch from 983aebe to 38dcaa3 July 3, 2026 21:49
@prrao87 prrao87 merged commit f609d88 into main Jul 3, 2026
2 checks passed
@prrao87 prrao87 deleted the mintlify/f5da8d82 branch July 3, 2026 21:57

Labels

needs_new_release Only merge once we release a new version of LanceDB

2 participants