Commit f609d88

mintlify[bot], prrao87, and claude authored
docs: refine multi-worker DataLoader guidance for PyTorch (#260)
Address review feedback on #260:

- Clarify that Permutation can also produce tensors directly from Arrow, so a Table is not required for Arrow-to-tensor conversion.
- Recommend the forkserver start method over spawn for DataLoader workers.
- Note that forkserver is POSIX-only; Windows must use spawn.

Co-authored-by: prrao87 <35005448+prrao87@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
1 parent 43d62fb commit f609d88

1 file changed

Lines changed: 12 additions & 7 deletions


docs/training/torch.mdx

@@ -51,9 +51,10 @@ column-major `torch.Tensor` with shape `(columns, rows)`.
 
 `Permutation` works differently: its default output is a list of Python dicts, which PyTorch's default collate function
 can batch into a dict of tensors. This is usually more convenient when you are getting started. However, there is a
-significant performance penalty converting from Arrow, Lance's internal representation, to this default format. Use a
-direct `Table` with `collate_fn` when you want Arrow-to-tensor conversion, or a `Permutation` when you want the default
-PyTorch dict-of-tensors behavior.
+significant performance penalty converting from Arrow, Lance's internal representation, to this default format. If you
+want the default PyTorch dict-of-tensors behavior, use a `Permutation` as-is; if you want direct Arrow-to-tensor
+conversion, either pass `lancedb.util.tbl_to_tensor` as `collate_fn` with a direct `Table` or configure a `Permutation`
+with one of the transform formats described below.
 
 To address this, the `Permutation` class provides a set of builtin transform functions that can be applied to map
 the Arrow data in different ways. The `arrow` and `polars` formats will always avoid data copies. However, `numpy`,
@@ -107,7 +108,11 @@ for batch in dataloader:
 
 Set `num_workers > 0` to read from LanceDB in multiple PyTorch worker processes. LanceDB tables and `Permutation` objects are picklable, so each worker reopens the table after it starts.
 
-Prefer the `spawn` start method when using multiple workers; LanceDB uses internal threads. See [the performance guide](/performance) for more multiprocessing guidance.
+Prefer the `forkserver` start method when using multiple workers. LanceDB uses internal threads, so the default `fork` method is unsafe; `forkserver` avoids that while being cheaper to start than `spawn`, and it is set to become the Python default. See [the performance guide](/performance) for more multiprocessing guidance.
+
+<Note>
+`forkserver` is only available on POSIX systems (Linux and macOS). On Windows, use `spawn` instead — it is the only start method available there.
+</Note>
 
 ```py Python icon=Python
 import torch
@@ -119,7 +124,7 @@ dataloader = torch.utils.data.DataLoader(
     batch_size=1024,
     shuffle=True,
     num_workers=4,
-    multiprocessing_context="spawn",
+    multiprocessing_context="forkserver",
     persistent_workers=True,
 )
 ```
@@ -144,7 +149,7 @@ dataloader = torch.utils.data.DataLoader(
     table,
     batch_size=512,
     num_workers=4,
-    multiprocessing_context="spawn",
+    multiprocessing_context="forkserver",
     collate_fn=tbl_to_tensor,
 )
 ```
@@ -180,6 +185,6 @@ dataloader = torch.utils.data.DataLoader(
     permutation,
     batch_size=512,
     num_workers=4,
-    multiprocessing_context="spawn",
+    multiprocessing_context="forkserver",
 )
 ```
