docs: document multi-worker DataLoader support for remote tables #260
cc @westonpace - I presume these docs make sense based on your latest additions.
I pushed a follow-up commit to make the PyTorch examples match current LanceDB behavior. What I tested:
Why this change is necessary:
@westonpace could you please review this PyTorch docs update when possible? As I was testing the code that the agent wrote, some changes that we merged recently caused this error to surface:
TypeError: default_collate: batch must contain tensors, numpy arrays, numbers, dicts or lists; found <class 'pyarrow.lib.ChunkedArray'>
I've patched a fix that uses
Also, are we missing anything in the documentation about how to handle these on
westonpace
left a comment
Some suggestions but nice examples
significant performance penalty converting from Arrow, Lance's internal representation, to this default format. Use a
direct `Table` with `collate_fn` when you want Arrow-to-tensor conversion, or a `Permutation` when you want the default
PyTorch dict-of-tensors behavior.
Hmm, Permutation can also support direct Arrow-to-tensor conversion. It just isn't the default. This makes it sound like you'd have to use a Table.
Set `num_workers > 0` to read from LanceDB in multiple PyTorch worker processes. LanceDB tables and `Permutation` objects are picklable, so each worker reopens the table after it starts.
Prefer the `spawn` start method when using multiple workers; LanceDB uses internal threads. See [the performance guide](/performance) for more multiprocessing guidance.
Actually, forkserver is probably better than spawn (and will be the new Python default). I'd say that should be our preference.
Once the streaming dataset is available, my guidance would be to use forkserver and num_workers=1 unless you can prove you have GIL contention in your transform function.
Done. Switched the preference to forkserver (and added a note that it's POSIX-only, so Windows falls back to spawn).
On num_workers=1: since that guidance is gated on the streaming dataset, and that's still WIP / held until release (per #294), I've left the examples at num_workers=4 for now so this page doesn't document unshipped behavior. Tracking the num_workers=1 recommendation to land alongside the streaming dataset docs in #294 when it lands on main.
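For reference, the forkserver-with-spawn-fallback choice discussed above could be sketched with the standard library alone. This is a minimal illustration, not the code in the docs page: the helper names `preferred_start_method` and `make_worker_context` are hypothetical, and the sketch assumes the resulting context would be handed to PyTorch through `DataLoader`'s `multiprocessing_context` argument.

```python
import multiprocessing as mp


def preferred_start_method() -> str:
    """Return "forkserver" where the platform offers it, otherwise "spawn".

    Hypothetical helper: forkserver is only available on POSIX systems,
    so Windows ends up with spawn.
    """
    available = mp.get_all_start_methods()
    return "forkserver" if "forkserver" in available else "spawn"


def make_worker_context():
    """Build a multiprocessing context using the preferred start method.

    Assumption: the caller passes this to a DataLoader, for example
    DataLoader(dataset, num_workers=4, multiprocessing_context=ctx).
    """
    return mp.get_context(preferred_start_method())


if __name__ == "__main__":
    ctx = make_worker_context()
    print(ctx.get_start_method())
```

Using an explicit context keeps the choice local to the DataLoader instead of changing the process-wide default with `set_start_method`.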
Address review feedback on #260:
- Permutation can also produce tensors directly from Arrow; reword so it no longer implies a Table is required for Arrow-to-tensor conversion.
- Recommend the forkserver start method over spawn for DataLoader workers.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Address review feedback on #260:
- Clarify that Permutation can also produce tensors directly from Arrow, so a Table is not required for Arrow-to-tensor conversion.
- Recommend the forkserver start method over spawn for DataLoader workers.
- Note that forkserver is POSIX-only; Windows must use spawn.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
983aebe to 38dcaa3
Summary
Document that remote LanceDB tables can now be used directly with multi-worker PyTorch
DataLoaders, and refresh guidance on the optional with_connection_factory escape hatch.

Changes
num_workers, spawn, and persistent_workers.
db://) tables working in worker processes out of the box.

Context
Triggered by an upstream change that lets remote tables carry their connection state through pickling so they reopen correctly in PyTorch DataLoader workers, while keeping the connection factory available for custom credential loading.
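
As a rough illustration of what "carrying connection state through pickling" means, here is a stdlib-only sketch. It is not LanceDB's implementation: `ReopenableTable`, its fields, and the `db://example` URI are hypothetical stand-ins. The pattern shown is that only the connection parameters are pickled, and the live handle is rebuilt lazily inside the process that unpickles the object.

```python
import pickle


class ReopenableTable:
    """Hypothetical stand-in for a table handle; not LanceDB's actual class."""

    def __init__(self, uri: str, name: str):
        self.uri = uri
        self.name = name
        self._handle = None  # live connection; never included in pickled state

    def _open(self):
        # Placeholder for a real connect-and-open call.
        return {"uri": self.uri, "table": self.name}

    @property
    def handle(self):
        # Open lazily, so each worker process creates its own handle.
        if self._handle is None:
            self._handle = self._open()
        return self._handle

    def __getstate__(self):
        # Pickle only what is needed to reopen the table.
        return {"uri": self.uri, "name": self.name}

    def __setstate__(self, state):
        self.uri = state["uri"]
        self.name = state["name"]
        self._handle = None  # reopened on first access after unpickling


if __name__ == "__main__":
    table = ReopenableTable("db://example", "train")
    _ = table.handle  # open in the parent process
    clone = pickle.loads(pickle.dumps(table))
    print(clone._handle is None, clone.handle["table"])
```

A connection factory plays a similar role when credentials must be loaded by custom code inside each worker rather than carried in the pickled state.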