docs: refine multi-worker DataLoader guidance for PyTorch (#260)
Address review feedback on #260:
- Clarify that Permutation can also produce tensors directly from Arrow,
so a Table is not required for Arrow-to-tensor conversion.
- Recommend the forkserver start method over spawn for DataLoader workers.
- Note that forkserver is POSIX-only; Windows must use spawn.
Co-authored-by: prrao87 <35005448+prrao87@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
docs/training/torch.mdx
Lines changed: 12 additions & 7 deletions
@@ -51,9 +51,10 @@ column-major `torch.Tensor` with shape `(columns, rows)`.
 
 `Permutation` works differently: its default output is a list of Python dicts, which PyTorch's default collate function
 can batch into a dict of tensors. This is usually more convenient when you are getting started. However, there is a
-significant performance penalty converting from Arrow, Lance's internal representation, to this default format. Use a
-direct `Table` with `collate_fn` when you want Arrow-to-tensor conversion, or a `Permutation` when you want the default
-PyTorch dict-of-tensors behavior.
+significant performance penalty converting from Arrow, Lance's internal representation, to this default format. If you
+want the default PyTorch dict-of-tensors behavior, use a `Permutation` as-is; if you want direct Arrow-to-tensor
+conversion, either pass `lancedb.util.tbl_to_tensor` as `collate_fn` with a direct `Table` or configure a `Permutation`
+with one of the transform formats described below.
 
 To address this, the `Permutation` class provides a set of builtin transform functions that can be applied to map
 the Arrow data in different ways. The `arrow` and `polars` formats will always avoid data copies. However, `numpy`,
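To make the "list of Python dicts" to "dict of tensors" step in this hunk concrete, here is a small torch-free sketch of the reshaping that a default-style collate performs. `collate_rows` is a hypothetical helper written for illustration only; PyTorch's real default collate additionally converts each column into a tensor, and the row values below are made up.

```python
def collate_rows(rows):
    """Turn a list of per-row dicts into one dict of per-column lists.

    Illustration only: PyTorch's default collate does the same regrouping
    but stacks each column into a tensor instead of a plain list.
    """
    if not rows:
        return {}
    # Assume every row has the same keys as the first row.
    keys = list(rows[0].keys())
    return {key: [row[key] for row in rows] for key in keys}


# Two rows shaped like the default per-row dict output described above.
rows = [{"id": 1, "label": 0}, {"id": 2, "label": 1}]
batch = collate_rows(rows)
```

The per-row Python objects that this step consumes are what the changed paragraph describes as expensive to build from Arrow, which is why it points to `collate_fn` or a transform format for direct conversion.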
@@ -107,7 +108,11 @@ for batch in dataloader:
 
 Set `num_workers > 0` to read from LanceDB in multiple PyTorch worker processes. LanceDB tables and `Permutation` objects are picklable, so each worker reopens the table after it starts.
 
-Prefer the `spawn` start method when using multiple workers; LanceDB uses internal threads. See [the performance guide](/performance) for more multiprocessing guidance.
+Prefer the `forkserver` start method when using multiple workers. LanceDB uses internal threads, so the default `fork` method is unsafe; `forkserver` avoids that while being cheaper to start than `spawn`, and it is set to become the Python default. See [the performance guide](/performance) for more multiprocessing guidance.
+
+<Note>
+`forkserver` is only available on POSIX systems (Linux and macOS). On Windows, use `spawn` instead; it is the only start method available there.
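As a rough illustration of the start-method guidance in this hunk, the sketch below picks `forkserver` where the platform offers it and falls back to `spawn` otherwise. `pick_start_method` and `make_loader` are hypothetical helper names, not part of LanceDB or PyTorch, and `dataset` stands in for whatever LanceDB table or `Permutation` you already pass to the loader. The only PyTorch feature relied on is the `multiprocessing_context` argument of `DataLoader`.

```python
import multiprocessing as mp


def pick_start_method() -> str:
    """Return "forkserver" when the platform supports it, else "spawn"."""
    available = mp.get_all_start_methods()
    return "forkserver" if "forkserver" in available else "spawn"


def make_loader(dataset, batch_size=32, num_workers=4):
    """Build a DataLoader whose workers use the chosen start method.

    `dataset` is assumed to be picklable, as the docs state for LanceDB
    tables and `Permutation` objects. `num_workers` must be > 0, because
    `multiprocessing_context` only applies to multi-process loading.
    """
    # Imported lazily so the helper above works without PyTorch installed.
    from torch.utils.data import DataLoader

    return DataLoader(
        dataset,
        batch_size=batch_size,
        num_workers=num_workers,
        multiprocessing_context=pick_start_method(),
    )
```

Passing the method per loader avoids changing the process-wide default with `multiprocessing.set_start_method`, which can only be set once per program.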