Document bulk ingestion and write parallelism (#193)
* Document bulk ingestion and write parallelism
`table.add()` now auto-parallelizes large writes, but the docs still showed
only the old iterator-based pattern. This rewrites the "Use Iterators" section
into "Loading Large Datasets" with guidance on `pyarrow.dataset` input, the
create-empty-then-add pattern, and auto-parallelism behavior. Updates the FAQ
to match.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* upgrade lancedb in Python
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
docs/faq/faq-oss.mdx (14 additions, 1 deletion)
@@ -46,7 +46,20 @@ For large-scale (>1M) or higher dimension vectors, it is beneficial to create a
### How can I speed up data inserts?

-It's highly recommended to perform bulk inserts via batches (for e.g., Pandas DataFrames or lists of dicts in Python) to speed up inserts for large datasets. Inserting records one at a time is slow and can result in suboptimal performance because each insert creates a new data fragment on disk. Batching inserts allows LanceDB to create larger fragments (and their associated manifests), which are more efficient to read and write.
+LanceDB auto-parallelizes large writes when you call `table.add()` with materialized
+data such as `pa.Table`, `pd.DataFrame`, or `pa.dataset()`. No extra configuration
+is needed — writes are automatically split into partitions of ~1M rows or 2GB.
+
+For best results:
+
+- **Create an empty table first**, then call `table.add()`. The `add()` path enables
+automatic write parallelism, while passing data directly to `create_table()` does not.
+- **For file-based data**, use `pyarrow.dataset.dataset("path/to/data/", format="parquet")`
+so LanceDB can stream from disk without loading everything into memory.
+- **Avoid inserting one row at a time.** Each insert creates a new data fragment on
+disk. Batch your data into Arrow tables, DataFrames, or use iterators.
+
+See [Loading Large Datasets](/tables/create#loading-large-datasets) for full examples.

### Do I need to set a refine factor when using an index?
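To make the FAQ guidance above concrete, here is a minimal sketch of the create-empty-then-add pattern. The connection URI, table name, and schema are illustrative assumptions, not part of this PR.

```python
import lancedb
import pandas as pd
import pyarrow as pa

# Illustrative URI, table name, and schema; adjust to your own data.
db = lancedb.connect("./example-lancedb")
schema = pa.schema(
    [
        pa.field("id", pa.int64()),
        pa.field("vector", pa.list_(pa.float32(), 128)),
    ]
)

# Create the table empty so the add() path, which auto-parallelizes large
# writes, does the work instead of create_table(name, data).
table = db.create_table("docs", schema=schema)

# One batched insert of materialized data (a DataFrame here) instead of a
# loop of single-row inserts, each of which would create a tiny fragment.
df = pd.DataFrame(
    {
        "id": range(100_000),
        "vector": [[0.0] * 128 for _ in range(100_000)],
    }
)
table.add(df)
```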
export const PyAlterColumnsDataType = "# Change price from int32 to int64 for larger numbers\ntable.alter_columns({\"path\": \"price\", \"data_type\": pa.int64()})\n";
export const PyAlterColumnsMultiple = "# Rename, change type, and make nullable in one operation\ntable.alter_columns(\n {\n\"path\": \"sale_price\",\n\"rename\": \"final_price\",\n\"data_type\": pa.float64(),\n\"nullable\": True,\n }\n)\n";
@@ -24,13 +26,13 @@ export const PyAlterColumnsWithExpression = "# For custom transforms, create a n
When you run this code, it should raise the `ValidationError`.
-### From Batch Iterators
+### Loading Large Datasets

-For bulk ingestion on large datasets, prefer batching instead of adding one row at a time. Python and Rust can create a table directly from Arrow batch iterators or readers. In TypeScript, the practical pattern today is to create an empty table and append Arrow batches in chunks.
+When ingesting large datasets, use `table.add()` on an existing table rather than
+passing all data to `create_table()`. The `add()` method auto-parallelizes large
+writes, while `create_table(name, data)` does not.
+
+<Tip>
+For best performance with large datasets, create an empty table first and then call
+`table.add()`. This enables automatic write parallelism for materialized data sources.
+</Tip>
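To illustrate the contrast this passage draws between the two paths, a short sketch follows; the URI, table name, and data are placeholder assumptions.

```python
import lancedb
import pyarrow as pa

# Illustrative URI, table name, and data; none of these come from the PR.
db = lancedb.connect("./example-lancedb")
big_table = pa.table(
    {
        "id": list(range(1_000_000)),
        "text": ["example text"] * 1_000_000,
    }
)

# Passing the data straight to create_table() writes it without auto-parallelism:
#   db.create_table("docs", data=big_table)

# Preferred: create the table empty, then add(), which parallelizes large writes.
tbl = db.create_table("docs", schema=big_table.schema)
tbl.add(big_table)
```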
+
+#### From files (Parquet, CSV, etc.)
+<Badge color="green">Python Only</Badge>
+
+For file-based data, pass a `pyarrow.dataset.Dataset` to `table.add()`. This streams
+data from disk without loading the entire dataset into memory.
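A minimal sketch of that file-based pattern; the URI, table name, and data directory are illustrative assumptions.

```python
import lancedb
import pyarrow.dataset as ds

# Illustrative URI, table name, and data directory.
db = lancedb.connect("./example-lancedb")
tbl = db.open_table("docs")  # an existing (possibly empty) table

# A pyarrow Dataset lets LanceDB stream record batches from disk rather than
# loading every Parquet file into memory before the write.
parquet_files = ds.dataset("path/to/data/", format="parquet")
tbl.add(parquet_files)
```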
Python can also consume iterators of other supported types like Pandas DataFrames or Python lists.
-### Write with Concurrency
-
-For Python users who want to speed up bulk ingest jobs, it is usually better to write from Arrow-native sources that already produce batches, such as readers, datasets, or scanners, instead of first materializing everything as one large Python list.
+#### Write parallelism

-This is most useful when you are writing large amounts of data from an existing Arrow pipeline or another batch-oriented source.
+<Note title="Automatic parallelism">
+For materialized data (`pa.Table`, `pd.DataFrame`, `pa.dataset()`), LanceDB
+automatically parallelizes large writes — no configuration needed. Auto-parallelism
+targets approximately 1M rows or 2GB per write partition.

-The current codebase also contains a lower-level ingest mechanism for describing a batch source together with extra metadata such as row counts and retry behavior. However, that path is not accepted by the released Python `create_table(...)` and `add(...)` workflow in `lancedb==0.30.0`, so we are not showing it as a docs example yet.
-
-In Rust, the same lower-level ingest mechanism is available, but the common batch-reader example above is usually the better starting point unless you specifically need to define your own batch source or provide size and retry hints. In TypeScript, this lower-level mechanism is not exposed publicly, so chunked Arrow batch writes remain the recommended pattern.
+For streaming sources (iterators, `RecordBatchReader`), LanceDB cannot determine
+total size upfront. A `parallelism` parameter to control this manually is planned
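For the streaming case just described, here is a sketch of an iterator-based ingest, under the assumption that `add()` accepts a `pyarrow.RecordBatchReader` as the added text implies; the schema, batch generator, and table name are illustrative.

```python
import lancedb
import pyarrow as pa

# Illustrative schema and batch generator; in practice the batches would come
# from your own pipeline (a parser, a message queue, an embedding job, ...).
schema = pa.schema(
    [
        pa.field("id", pa.int64()),
        pa.field("vector", pa.list_(pa.float32(), 128)),
    ]
)

def batches():
    for start in range(0, 1_000_000, 50_000):
        ids = pa.array(range(start, start + 50_000), type=pa.int64())
        vectors = pa.array(
            [[0.0] * 128] * 50_000, type=pa.list_(pa.float32(), 128)
        )
        yield pa.RecordBatch.from_arrays([ids, vectors], schema=schema)

db = lancedb.connect("./example-lancedb")
tbl = db.create_table("streamed", schema=schema)

# Wrap the generator in a RecordBatchReader so the schema travels with the
# stream; LanceDB consumes it batch by batch without knowing the total size.
reader = pa.RecordBatchReader.from_batches(schema, batches())
tbl.add(reader)
```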