Skip to content

Commit 4a96c80

Browse files
wjones127claude
andcommitted
Document bulk ingestion and write parallelism
`table.add()` now auto-parallelizes large writes, but the docs still showed only the old iterator-based pattern. This rewrites the "Use Iterators" section into "Loading Large Datasets" with guidance on `pyarrow.dataset` input, the create-empty-then-add pattern, and auto-parallelism behavior. Updates the FAQ to match. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1 parent fd702f2 commit 4a96c80

6 files changed

Lines changed: 120 additions & 27 deletions

File tree

docs/faq/faq-oss.mdx

Lines changed: 14 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -44,7 +44,20 @@ For large-scale (>1M) or higher dimension vectors, it is beneficial to create a
4444

4545
### How can I speed up data inserts?
4646

47-
It's highly recommended to perform bulk inserts via batches (for e.g., Pandas DataFrames or lists of dicts in Python) to speed up inserts for large datasets. Inserting records one at a time is slow and can result in suboptimal performance because each insert creates a new data fragment on disk. Batching inserts allows LanceDB to create larger fragments (and their associated manifests), which are more efficient to read and write.
47+
LanceDB auto-parallelizes large writes when you call `table.add()` with materialized
48+
data such as `pa.Table`, `pd.DataFrame`, or `pa.dataset()`. No extra configuration
49+
is needed — writes are automatically split into partitions of ~1M rows or 2GB.
50+
51+
For best results:
52+
53+
- **Create an empty table first**, then call `table.add()`. The `add()` path enables
54+
automatic write parallelism, while passing data directly to `create_table()` does not.
55+
- **For file-based data**, use `pyarrow.dataset.dataset("path/to/data/", format="parquet")`
56+
so LanceDB can stream from disk without loading everything into memory.
57+
- **Avoid inserting one row at a time.** Each insert creates a new data fragment on
58+
disk. Batch your data into Arrow tables, DataFrames, or use iterators.
59+
60+
See [Loading Large Datasets](/tables/create#loading-large-datasets) for full examples.
4861

4962
### Do I need to set a refine factor when using an index?
5063

docs/snippets/quickstart.mdx

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -18,6 +18,8 @@ export const PyQuickstartVectorSearch1Async = "# Let's search for vectors simila
1818

1919
export const PyQuickstartVectorSearch2 = "# Let's search for vectors similar to \"wizard\"\nquery_vector = [0.7, 0.3, 0.5]\n\nresults = table.search(query_vector).limit(2).to_polars()\nprint(results)\n";
2020

21+
export const TsQuickstartOutputPandas = "result = await table.search(queryVector).limit(2).toArray();\n";
22+
2123
export const TsQuickstartAddData = "const moreData = [\n { id: \"7\", text: \"mage\", vector: [0.6, 0.3, 0.4] },\n { id: \"8\", text: \"bard\", vector: [0.3, 0.8, 0.4] },\n];\n\n// Add data to table\nawait table.add(moreData);\n";
2224

2325
export const TsQuickstartCreateTable = "const data = [\n { id: \"1\", text: \"knight\", vector: [0.9, 0.4, 0.8] },\n { id: \"2\", text: \"ranger\", vector: [0.8, 0.4, 0.7] },\n { id: \"9\", text: \"priest\", vector: [0.6, 0.2, 0.6] },\n { id: \"4\", text: \"rogue\", vector: [0.7, 0.4, 0.7] },\n];\nlet table = await db.createTable(\"adventurers\", data, { mode: \"overwrite\" });\n";
@@ -28,8 +30,6 @@ export const TsQuickstartOpenTable = "table = await db.openTable(\"adventurers\"
2830

2931
export const TsQuickstartOutputArray = "result = await table.search(queryVector).limit(2).toArray();\nconsole.table(result);\n";
3032

31-
export const TsQuickstartOutputPandas = "result = await table.search(queryVector).limit(2).toArray();\n";
32-
3333
export const TsQuickstartVectorSearch1 = "// Let's search for vectors similar to \"warrior\"\nlet queryVector = [0.8, 0.3, 0.8];\n\nlet result = await table.search(queryVector).limit(2).toArray();\nconsole.table(result);\n";
3434

3535
export const TsQuickstartVectorSearch2 = "// Let's search for vectors similar to \"wizard\"\nqueryVector = [0.7, 0.3, 0.5];\n\nconst results = await table.search(queryVector).limit(2).toArray();\nconsole.table(results);\n";

docs/snippets/search.mdx

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -8,10 +8,10 @@ export const PyBasicHybridSearch = "data = [\n {\"text\": \"rebel spaceships
88

99
export const PyBasicHybridSearchAsync = "uri = \"data/sample-lancedb\"\nasync_db = await lancedb.connect_async(uri)\ndata = [\n {\"text\": \"rebel spaceships striking from a hidden base\"},\n {\"text\": \"have won their first victory against the evil Galactic Empire\"},\n {\"text\": \"during the battle rebel spies managed to steal secret plans\"},\n {\"text\": \"to the Empire's ultimate weapon the Death Star\"},\n]\nasync_tbl = await async_db.create_table(\"documents_async\", schema=Documents)\n# ingest docs with auto-vectorization\nawait async_tbl.add(data)\n# Create a fts index before the hybrid search\nawait async_tbl.create_index(\"text\", config=FTS())\ntext_query = \"flower moon\"\n# hybrid search with default re-ranker\nawait (await async_tbl.search(\"flower moon\", query_type=\"hybrid\")).to_pandas()\n";
1010

11-
export const PyClassDefinition = "class Metadata(BaseModel):\n source: str\n timestamp: datetime\n\n\nclass Document(BaseModel):\n content: str\n meta: Metadata\n\n\nclass LanceSchema(LanceModel):\n id: str\n vector: Vector(1536)\n payload: Document\n";
12-
1311
export const PyClassDocuments = "class Documents(LanceModel):\n vector: Vector(embeddings.ndims()) = embeddings.VectorField()\n text: str = embeddings.SourceField()\n";
1412

13+
export const PyClassDefinition = "class Metadata(BaseModel):\n source: str\n timestamp: datetime\n\n\nclass Document(BaseModel):\n content: str\n meta: Metadata\n\n\nclass LanceSchema(LanceModel):\n id: str\n vector: Vector(1536)\n payload: Document\n";
14+
1515
export const PyCreateTableAsyncWithNestedSchema = "# Let's add 100 sample rows to our dataset\ndata = [\n LanceSchema(\n id=f\"id{i}\",\n vector=np.random.randn(1536),\n payload=Document(\n content=f\"document{i}\",\n meta=Metadata(source=f\"source{i % 10}\", timestamp=datetime.now()),\n ),\n )\n for i in range(100)\n]\n\nasync_tbl = await async_db.create_table(\n \"documents_async\", data=data, mode=\"overwrite\"\n)\n";
1616

1717
export const PyCreateTableWithNestedSchema = "# Let's add 100 sample rows to our dataset\ndata = [\n LanceSchema(\n id=f\"id{i}\",\n vector=np.random.randn(1536),\n payload=Document(\n content=f\"document{i}\",\n meta=Metadata(source=f\"source{i % 10}\", timestamp=datetime.now()),\n ),\n )\n for i in range(100)\n]\n\n# Synchronous client\ntbl = db.create_table(\"documents\", data=data, mode=\"overwrite\")\n";

docs/snippets/tables.mdx

Lines changed: 8 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -12,6 +12,8 @@ export const PyAddDataPydanticModel = "from lancedb.pydantic import LanceModel,
1212

1313
export const PyAddDataToTable = "import pyarrow as pa\n\n# create an empty table with schema\ndata = [\n {\"vector\": [3.1, 4.1], \"item\": \"foo\", \"price\": 10.0},\n {\"vector\": [5.9, 26.5], \"item\": \"bar\", \"price\": 20.0},\n {\"vector\": [10.2, 100.8], \"item\": \"baz\", \"price\": 30.0},\n {\"vector\": [1.4, 9.5], \"item\": \"fred\", \"price\": 40.0},\n]\n\nschema = pa.schema(\n [\n pa.field(\"vector\", pa.list_(pa.float32(), 2)),\n pa.field(\"item\", pa.utf8()),\n pa.field(\"price\", pa.float32()),\n ]\n)\n\ntable_name = \"basic_ingestion_example\"\ntable = db.create_table(table_name, schema=schema, mode=\"overwrite\")\n# Add data\ntable.add(data)\n";
1414

15+
export const PyAddFromDataset = "import pyarrow.dataset as ds\n\ndataset = ds.dataset(data_path, format=\"parquet\")\ndb = tmp_db\ntable = db.create_table(\"my_table\", schema=dataset.schema, mode=\"overwrite\")\ntable.add(dataset)\n";
16+
1517
export const PyAlterColumnsDataType = "# Change price from int32 to int64 for larger numbers\ntable.alter_columns({\"path\": \"price\", \"data_type\": pa.int64()})\n";
1618

1719
export const PyAlterColumnsMultiple = "# Rename, change type, and make nullable in one operation\ntable.alter_columns(\n {\n \"path\": \"sale_price\",\n \"rename\": \"final_price\",\n \"data_type\": pa.float64(),\n \"nullable\": True,\n }\n)\n";
@@ -24,13 +26,13 @@ export const PyAlterColumnsWithExpression = "# For custom transforms, create a n
2426

2527
export const PyAlterVectorColumn = "vector_dim = 768 # Your embedding dimension\ntable_name = \"vector_alter_example\"\ndb = tmp_db\ndata = [\n {\n \"id\": 1,\n \"embedding\": np.random.random(vector_dim).tolist(),\n },\n]\ntable = db.create_table(table_name, data, mode=\"overwrite\")\n\ntable.alter_columns(\n dict(path=\"embedding\", data_type=pa.list_(pa.float32(), vector_dim))\n)\n";
2628

27-
export const PyBatchDataInsertion = "import pyarrow as pa\n\ndef make_batches():\n for i in range(5): # Create 5 batches\n yield pa.RecordBatch.from_arrays(\n [\n pa.array([[3.1, 4.1], [5.9, 26.5]], pa.list_(pa.float32(), 2)),\n pa.array([f\"item{i*2+1}\", f\"item{i*2+2}\"]),\n pa.array([float((i * 2 + 1) * 10), float((i * 2 + 2) * 10)]),\n ],\n [\"vector\", \"item\", \"price\"],\n )\n\nschema = pa.schema(\n [\n pa.field(\"vector\", pa.list_(pa.float32(), 2)),\n pa.field(\"item\", pa.utf8()),\n pa.field(\"price\", pa.float32()),\n ]\n)\n# Create table with batches\ntable_name = \"batch_ingestion_example\"\ntable = db.create_table(table_name, make_batches(), schema=schema, mode=\"overwrite\")\n";
29+
export const PyBatchDataInsertion = "import pyarrow as pa\n\ndef make_batches():\n for i in range(5): # Create 5 batches\n yield pa.RecordBatch.from_arrays(\n [\n pa.array([[3.1, 4.1], [5.9, 26.5]], pa.list_(pa.float32(), 2)),\n pa.array([f\"item{i * 2 + 1}\", f\"item{i * 2 + 2}\"]),\n pa.array([float((i * 2 + 1) * 10), float((i * 2 + 2) * 10)]),\n ],\n [\"vector\", \"item\", \"price\"],\n )\n\nschema = pa.schema(\n [\n pa.field(\"vector\", pa.list_(pa.float32(), 2)),\n pa.field(\"item\", pa.utf8()),\n pa.field(\"price\", pa.float32()),\n ]\n)\n# Create table with batches\ntable_name = \"batch_ingestion_example\"\ntable = db.create_table(table_name, make_batches(), schema=schema, mode=\"overwrite\")\n";
2830

29-
export const PyConsistencyCheckoutLatest = "uri = str(tmp_db.uri)\nwriter_db = lancedb.connect(uri)\nreader_db = lancedb.connect(uri)\nwriter_table = writer_db.create_table(\"consistency_checkout_latest_table\", [{\"id\": 1}], mode=\"overwrite\")\nreader_table = reader_db.open_table(\"consistency_checkout_latest_table\")\n\nwriter_table.add([{\"id\": 2}])\nrows_before_refresh = reader_table.count_rows()\nprint(f\"Rows before checkout_latest: {rows_before_refresh}\")\n\nreader_table.checkout_latest()\nrows_after_refresh = reader_table.count_rows()\nprint(f\"Rows after checkout_latest: {rows_after_refresh}\")\n";
31+
export const PyConsistencyCheckoutLatest = "uri = str(tmp_db.uri)\nwriter_db = lancedb.connect(uri)\nreader_db = lancedb.connect(uri)\nwriter_table = writer_db.create_table(\n \"consistency_checkout_latest_table\", [{\"id\": 1}], mode=\"overwrite\"\n)\nreader_table = reader_db.open_table(\"consistency_checkout_latest_table\")\n\nwriter_table.add([{\"id\": 2}])\nrows_before_refresh = reader_table.count_rows()\nprint(f\"Rows before checkout_latest: {rows_before_refresh}\")\n\nreader_table.checkout_latest()\nrows_after_refresh = reader_table.count_rows()\nprint(f\"Rows after checkout_latest: {rows_after_refresh}\")\n";
3032

31-
export const PyConsistencyEventual = "from datetime import timedelta\n\nuri = str(tmp_db.uri)\nwriter_db = lancedb.connect(uri)\nreader_db = lancedb.connect(uri, read_consistency_interval=timedelta(seconds=3600))\nwriter_table = writer_db.create_table(\"consistency_eventual_table\", [{\"id\": 1}], mode=\"overwrite\")\nreader_table = reader_db.open_table(\"consistency_eventual_table\")\nwriter_table.add([{\"id\": 2}])\nrows_after_write = reader_table.count_rows()\nprint(f\"Rows visible before eventual refresh interval: {rows_after_write}\")\n";
33+
export const PyConsistencyEventual = "from datetime import timedelta\n\nuri = str(tmp_db.uri)\nwriter_db = lancedb.connect(uri)\nreader_db = lancedb.connect(uri, read_consistency_interval=timedelta(seconds=3600))\nwriter_table = writer_db.create_table(\n \"consistency_eventual_table\", [{\"id\": 1}], mode=\"overwrite\"\n)\nreader_table = reader_db.open_table(\"consistency_eventual_table\")\nwriter_table.add([{\"id\": 2}])\nrows_after_write = reader_table.count_rows()\nprint(f\"Rows visible before eventual refresh interval: {rows_after_write}\")\n";
3234

33-
export const PyConsistencyStrong = "from datetime import timedelta\n\nuri = str(tmp_db.uri)\nwriter_db = lancedb.connect(uri)\nreader_db = lancedb.connect(uri, read_consistency_interval=timedelta(0))\nwriter_table = writer_db.create_table(\"consistency_strong_table\", [{\"id\": 1}], mode=\"overwrite\")\nreader_table = reader_db.open_table(\"consistency_strong_table\")\nwriter_table.add([{\"id\": 2}])\nrows_after_write = reader_table.count_rows()\nprint(f\"Rows visible with strong consistency: {rows_after_write}\")\n";
35+
export const PyConsistencyStrong = "from datetime import timedelta\n\nuri = str(tmp_db.uri)\nwriter_db = lancedb.connect(uri)\nreader_db = lancedb.connect(uri, read_consistency_interval=timedelta(0))\nwriter_table = writer_db.create_table(\n \"consistency_strong_table\", [{\"id\": 1}], mode=\"overwrite\"\n)\nreader_table = reader_db.open_table(\"consistency_strong_table\")\nwriter_table.add([{\"id\": 2}])\nrows_after_write = reader_table.count_rows()\nprint(f\"Rows visible with strong consistency: {rows_after_write}\")\n";
3436

3537
export const PyCreateEmptyTable = "import pyarrow as pa\n\nschema = pa.schema(\n [\n pa.field(\"vector\", pa.list_(pa.float32(), 2)),\n pa.field(\"item\", pa.string()),\n pa.field(\"price\", pa.float32()),\n ]\n)\ndb = tmp_db\ntbl = db.create_table(\"test_empty_table\", schema=schema, mode=\"overwrite\")\n";
3638

@@ -60,11 +62,11 @@ export const PyDropColumnsSingle = "# Remove the first temporary column\ntable.d
6062

6163
export const PyDropTable = "db = tmp_db\n# Create a table first\ndata = [{\"vector\": [1.1, 1.2], \"lat\": 45.5}]\ndb.create_table(\"my_table\", data, mode=\"overwrite\")\n\n# Drop the table\ndb.drop_table(\"my_table\")\n";
6264

63-
export const PyInsertIfNotExists = "import pyarrow as pa\n\ntable = db.create_table(\n \"users_example\",\n data=pa.table(\n {\n \"id\": [1, 2],\n \"name\": [\"Alice\", \"Bob\"],\n \"login_count\": [10, 20],\n }\n ),\n mode=\"overwrite\",\n)\n\nincoming_users = pa.table(\n {\n \"id\": [2, 3],\n \"name\": [\"Bobby\", \"Charlie\"],\n \"login_count\": [21, 5],\n }\n)\n\n(\n table.merge_insert(\"id\")\n .when_not_matched_insert_all()\n .execute(incoming_users)\n)\n";
65+
export const PyInsertIfNotExists = "import pyarrow as pa\n\ntable = db.create_table(\n \"users_example\",\n data=pa.table(\n {\n \"id\": [1, 2],\n \"name\": [\"Alice\", \"Bob\"],\n \"login_count\": [10, 20],\n }\n ),\n mode=\"overwrite\",\n)\n\nincoming_users = pa.table(\n {\n \"id\": [2, 3],\n \"name\": [\"Bobby\", \"Charlie\"],\n \"login_count\": [21, 5],\n }\n)\n\n(table.merge_insert(\"id\").when_not_matched_insert_all().execute(incoming_users))\n";
6466

6567
export const PyMergeDeleteMissingBySource = "import pyarrow as pa\n\ntable = db.create_table(\n \"users_example\",\n data=pa.table(\n {\n \"id\": [1, 2, 3],\n \"name\": [\"Alice\", \"Bob\", \"Charlie\"],\n \"login_count\": [10, 20, 5],\n }\n ),\n mode=\"overwrite\",\n)\n\nincoming_users = pa.table(\n {\n \"id\": [2, 3],\n \"name\": [\"Bobby\", \"Charlie\"],\n \"login_count\": [21, 5],\n }\n)\n\n(\n table.merge_insert(\"id\")\n .when_matched_update_all()\n .when_not_matched_insert_all()\n .when_not_matched_by_source_delete()\n .execute(incoming_users)\n)\n";
6668

67-
export const PyMergeMatchedUpdateOnly = "import pyarrow as pa\n\ntable = db.create_table(\n \"users_example\",\n data=pa.table(\n {\n \"id\": [1, 2],\n \"name\": [\"Alice\", \"Bob\"],\n \"login_count\": [10, 20],\n }\n ),\n mode=\"overwrite\",\n)\n\nincoming_users = pa.table(\n {\n \"id\": [2, 3],\n \"name\": [\"Bobby\", \"Charlie\"],\n \"login_count\": [21, 5],\n }\n)\n\n(\n table.merge_insert(\"id\")\n .when_matched_update_all()\n .execute(incoming_users)\n)\n";
69+
export const PyMergeMatchedUpdateOnly = "import pyarrow as pa\n\ntable = db.create_table(\n \"users_example\",\n data=pa.table(\n {\n \"id\": [1, 2],\n \"name\": [\"Alice\", \"Bob\"],\n \"login_count\": [10, 20],\n }\n ),\n mode=\"overwrite\",\n)\n\nincoming_users = pa.table(\n {\n \"id\": [2, 3],\n \"name\": [\"Bobby\", \"Charlie\"],\n \"login_count\": [21, 5],\n }\n)\n\n(table.merge_insert(\"id\").when_matched_update_all().execute(incoming_users))\n";
6870

6971
export const PyMergePartialColumns = "import pyarrow as pa\n\ntable = db.create_table(\n \"users_example\",\n data=pa.table(\n {\n \"id\": [1, 2],\n \"name\": [\"Alice\", \"Bob\"],\n \"login_count\": [10, 20],\n }\n ),\n mode=\"overwrite\",\n)\n\nincoming_users = pa.table(\n {\n \"id\": [2, 3],\n \"name\": [\"Bobby\", \"Charlie\"],\n }\n)\n\n(\n table.merge_insert(\"id\")\n .when_matched_update_all()\n .when_not_matched_insert_all()\n .execute(incoming_users)\n)\n";
7072

docs/tables/create.mdx

Lines changed: 46 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -19,6 +19,7 @@ import {
1919
RsCreateTableFromArrow as RsCreateTableFromArrow,
2020
PyCreateTableFromPydantic as CreateTableFromPydantic,
2121
PyCreateTableNestedSchema as CreateTableNestedSchema,
22+
PyAddFromDataset as AddFromDataset,
2223
PyCreateTableFromIterator as CreateTableFromIterator,
2324
TsCreateTableFromIterator as TsCreateTableFromIterator,
2425
RsCreateTableFromIterator as RsCreateTableFromIterator,
@@ -217,9 +218,39 @@ for a `created_at` field.
217218

218219
When you run this code it, should raise the `ValidationError`.
219220

220-
### Use Iterators / Write Large Datasets
221+
### Loading Large Datasets
221222

222-
For large ingests, prefer batching instead of adding one row at a time. Python and Rust can create a table directly from Arrow batch iterators or readers. In TypeScript, the practical pattern today is to create an empty table and append Arrow batches in chunks.
223+
When ingesting large datasets, use `table.add()` on an existing table rather than
224+
passing all data to `create_table()`. The `add()` method auto-parallelizes large
225+
writes, while `create_table(name, data)` does not.
226+
227+
<Tip>
228+
For best performance with large datasets, create an empty table first and then call
229+
`table.add()`. This enables automatic write parallelism for materialized data sources.
230+
</Tip>
231+
232+
#### From files (Parquet, CSV, etc.)
233+
<Badge color="green">Python Only</Badge>
234+
235+
For file-based data, pass a `pyarrow.dataset.Dataset` to `table.add()`. This streams
236+
data from disk without loading the entire dataset into memory.
237+
238+
<CodeGroup>
239+
<CodeBlock filename="Python" language="Python" icon="python">
240+
{AddFromDataset}
241+
</CodeBlock>
242+
</CodeGroup>
243+
244+
<Note>
245+
`pa.dataset()` input is currently Python-only. TypeScript and Rust support for
246+
file-based dataset ingestion is tracked in
247+
[lancedb#3173](https://github.com/lancedb/lancedb/issues/3173).
248+
</Note>
249+
250+
#### From iterators (custom batch generation)
251+
252+
When you need custom batch logic — generating embeddings on the fly, transforming
253+
rows from an external source, etc. — use an iterator of `RecordBatch` objects.
223254

224255
<CodeGroup>
225256
<CodeBlock filename="Python" language="Python" icon="python">
@@ -237,6 +268,19 @@ For large ingests, prefer batching instead of adding one row at a time. Python a
237268

238269
Python can also consume iterators of other supported types like Pandas DataFrames or Python lists.
239270

271+
#### Write parallelism
272+
273+
<Note title="Automatic parallelism">
274+
For materialized data (`pa.Table`, `pd.DataFrame`, `pa.dataset()`), LanceDB
275+
automatically parallelizes large writes — no configuration needed. Auto-parallelism
276+
targets approximately 1M rows or 2GB per write partition.
277+
278+
For streaming sources (iterators, `RecordBatchReader`), LanceDB cannot determine
279+
total size upfront. A `parallelism` parameter to control this manually is planned
280+
but not yet exposed in Python or TypeScript
281+
([tracking issue](https://github.com/lancedb/lancedb/issues/3173)).
282+
</Note>
283+
240284
## Open existing tables
241285

242286
If you forget the name of your table, you can always get a listing of all table names.

0 commit comments

Comments
 (0)