`table.add()` now auto-parallelizes large writes, but the docs still showed
only the old iterator-based pattern. This rewrites the "Use Iterators" section
into "Loading Large Datasets" with guidance on `pyarrow.dataset` input, the
create-empty-then-add pattern, and auto-parallelism behavior. Updates the FAQ
to match.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
docs/faq/faq-oss.mdx (14 additions & 1 deletion)
@@ -44,7 +44,20 @@ For large-scale (>1M) or higher dimension vectors, it is beneficial to create a

### How can I speed up data inserts?

-It's highly recommended to perform bulk inserts via batches (for e.g., Pandas DataFrames or lists of dicts in Python) to speed up inserts for large datasets. Inserting records one at a time is slow and can result in suboptimal performance because each insert creates a new data fragment on disk. Batching inserts allows LanceDB to create larger fragments (and their associated manifests), which are more efficient to read and write.
+LanceDB auto-parallelizes large writes when you call `table.add()` with materialized
+data such as `pa.Table`, `pd.DataFrame`, or `pa.dataset()`. No extra configuration
+is needed — writes are automatically split into partitions of ~1M rows or 2GB.
+
+For best results:
+
+- **Create an empty table first**, then call `table.add()`. The `add()` path enables
+automatic write parallelism, while passing data directly to `create_table()` does not.
+- **For file-based data**, use `pyarrow.dataset.dataset("path/to/data/", format="parquet")`
+so LanceDB can stream from disk without loading everything into memory.
+- **Avoid inserting one row at a time.** Each insert creates a new data fragment on
+disk. Batch your data into Arrow tables, DataFrames, or use iterators.
+
+See [Loading Large Datasets](/tables/create#loading-large-datasets) for full examples.

### Do I need to set a refine factor when using an index?
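
A note on the FAQ hunk above: its "avoid inserting one row at a time" advice can be sketched as a small batching helper. This is only an illustration, not code from the PR. `batched_rows`, `add_in_batches`, and the default batch size of 10,000 are hypothetical. The only LanceDB call assumed is `table.add()` with a list of dicts, which the removed FAQ text names as a supported input.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List

Row = Dict[str, Any]


def batched_rows(rows: Iterable[Row], batch_size: int) -> Iterator[List[Row]]:
    """Yield consecutive lists holding at most `batch_size` rows each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # source exhausted
        yield batch


def add_in_batches(table: Any, rows: Iterable[Row], batch_size: int = 10_000) -> int:
    """Call table.add() once per batch rather than once per row.

    `table` is assumed to expose an add() method that accepts a list of
    dicts. Returns the total number of rows passed to add().
    """
    total = 0
    for batch in batched_rows(rows, batch_size):
        table.add(batch)  # one write per batch, not per row
        total += len(batch)
    return total
```

The batch size here is arbitrary. When the data already fits in memory as one Arrow table or DataFrame, the new FAQ text points to a single `table.add()` call instead, since that is the path described as auto-parallelized.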
export const TsQuickstartVectorSearch1 = "// Let's search for vectors similar to \"warrior\"\nlet queryVector = [0.8, 0.3, 0.8];\n\nlet result = await table.search(queryVector).limit(2).toArray();\nconsole.table(result);\n";
export const TsQuickstartVectorSearch2 = "// Let's search for vectors similar to \"wizard\"\nqueryVector = [0.7, 0.3, 0.5];\n\nconst results = await table.search(queryVector).limit(2).toArray();\nconsole.table(results);\n";
export const PyBasicHybridSearchAsync = "uri = \"data/sample-lancedb\"\nasync_db = await lancedb.connect_async(uri)\ndata = [\n {\"text\": \"rebel spaceships striking from a hidden base\"},\n {\"text\": \"have won their first victory against the evil Galactic Empire\"},\n {\"text\": \"during the battle rebel spies managed to steal secret plans\"},\n {\"text\": \"to the Empire's ultimate weapon the Death Star\"},\n]\nasync_tbl = await async_db.create_table(\"documents_async\", schema=Documents)\n# ingest docs with auto-vectorization\nawait async_tbl.add(data)\n# Create a fts index before the hybrid search\nawait async_tbl.create_index(\"text\", config=FTS())\ntext_query = \"flower moon\"\n# hybrid search with default re-ranker\nawait (await async_tbl.search(\"flower moon\", query_type=\"hybrid\")).to_pandas()\n";
export const PyAlterColumnsDataType = "# Change price from int32 to int64 for larger numbers\ntable.alter_columns({\"path\": \"price\", \"data_type\": pa.int64()})\n";
export const PyAlterColumnsMultiple = "# Rename, change type, and make nullable in one operation\ntable.alter_columns(\n {\n\"path\": \"sale_price\",\n\"rename\": \"final_price\",\n\"data_type\": pa.float64(),\n\"nullable\": True,\n }\n)\n";
@@ -24,13 +26,13 @@ export const PyAlterColumnsWithExpression = "# For custom transforms, create a n
When you run this code, it should raise the `ValidationError`.

-### Use Iterators / Write Large Datasets
+### Loading Large Datasets

-For large ingests, prefer batching instead of adding one row at a time. Python and Rust can create a table directly from Arrow batch iterators or readers. In TypeScript, the practical pattern today is to create an empty table and append Arrow batches in chunks.
+When ingesting large datasets, use `table.add()` on an existing table rather than
+passing all data to `create_table()`. The `add()` method auto-parallelizes large
+writes, while `create_table(name, data)` does not.
+
+<Tip>
+For best performance with large datasets, create an empty table first and then call
+`table.add()`. This enables automatic write parallelism for materialized data sources.
+</Tip>
+
+#### From files (Parquet, CSV, etc.)
+<Badge color="green">Python Only</Badge>
+
+For file-based data, pass a `pyarrow.dataset.Dataset` to `table.add()`. This streams
+data from disk without loading the entire dataset into memory.
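
The page cuts off before the example that follows this paragraph, so here is a minimal sketch of the create-empty-then-add pattern the hunk describes. It is not the example from the PR. `load_dataset_into_table`, `example_usage`, and the table name `"my_table"` are hypothetical. The sketch assumes `create_table()` accepts a `schema=` argument (as the `PyBasicHybridSearchAsync` snippet in this diff does) and that `table.add()` accepts a `pyarrow.dataset.Dataset`, as the new text states.

```python
from typing import Any


def load_dataset_into_table(db: Any, table_name: str, dataset: Any) -> Any:
    """Create an empty table from the dataset's schema, then add the dataset.

    `db` is assumed to be a LanceDB connection and `dataset` a
    pyarrow.dataset.Dataset. Both are duck-typed here.
    """
    # Step 1: create the table from the schema alone, with no data.
    table = db.create_table(table_name, schema=dataset.schema)
    # Step 2: hand the dataset to add(), the path the docs describe
    # as auto-parallelized.
    table.add(dataset)
    return table


def example_usage() -> Any:
    """Hypothetical wiring. Requires lancedb and pyarrow to be installed."""
    import lancedb
    import pyarrow.dataset as ds

    # URI borrowed from the snippets in this diff.
    db = lancedb.connect("data/sample-lancedb")
    # Placeholder path taken from the FAQ text.
    dataset = ds.dataset("path/to/data/", format="parquet")
    # "my_table" is a made-up table name.
    return load_dataset_into_table(db, "my_table", dataset)
```

The connection and dataset are passed in as arguments so the two-step pattern is separate from any particular URI or path.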