You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Copy file name to clipboardExpand all lines: docs/src/partitioning-spec.md
+49-30Lines changed: 49 additions & 30 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -297,43 +297,62 @@ This design ensures backward compatibility while enabling partition strategy evo
297
297
Operations within a single partition table are ACID-compliant according to the Lance table specification.
298
298
Each partition is an independent Lance table, so reads and writes to a single partition follow standard Lance transaction semantics.
299
299
300
-
### Multi-Partition Transaction
300
+
### Multi-Partition Transaction (Weak)
301
301
302
302
By default, operations across multiple partitions have weaker guarantees:
303
303
304
304
-**Writes across partitions are not atomic or consistent**: A write that affects multiple partitions may partially succeed, leaving some partitions updated while others are not.
305
305
-**Reads across partitions are not isolated**: A read spanning multiple partitions may observe different versions of each partition, leading to inconsistent views.
306
306
307
-
To enable stronger transactional guarantees across partitions, the `__manifest`table can optionally include `read_version`, `read_branch`, and `read_tag` columns for a table.
308
-
These columns record which version of each partition table to read.
307
+
In multi-partition transaction weak mode, write operations are directly committed to the main branch of each table. It means users can always see the fresh state of each leaf
308
+
partition table without additional information from partitioned namespace.
309
309
310
-
#### Read Behavior
310
+
Users need to handle writes across partitions carefully because there is no ACID guarantees. One way is to use idempotent write like `insert overwrite` then retry
311
+
for whatever error. Another way is writing partitions one by one.
312
+
313
+
**ACID**
314
+
* Read Behavior: Readers should always read the latest version from the main branch.
315
+
316
+
* Write Behavior: Writers should always commit to the main branch.
311
317
312
-
Users should specify one of the following combinations:
318
+
* Conflict Resolution: No conflict resolution for operations across multiple partition.
313
319
314
-
1.**`read_version` only**: Read the specified version from the main branch.
315
-
2.**`read_branch` + `read_version`**: Read the specified version from the specified branch.
316
-
3.**`read_tag` only**: Read the version referenced by the specified tag.
320
+
### Multi-Partition Transaction (Strong)
317
321
318
-
When all columns are NULL or not present, readers should read the latest version from the main branch.
322
+
To enable stronger transactional guarantees across partitions, the `__manifest` table can optionally include `read_version` column for a table.
323
+
The `read_version` records which version of each partition table to read.
324
+
325
+
In multi-partition transaction strong mode, write operations will use detached commit to each table. A detached commit is invisible unless
326
+
the version is set, it makes sure the intermediate state of a transaction remains invisible. Users need to first get `read_version` from partitioned
327
+
namespace, then read the leaf partition table.
328
+
329
+
#### Read Behavior
330
+
331
+
1. Transaction starts, recording the current version of `__manifest` table as snapshot id(S0).
332
+
2. Search `__manifest` table with `version=S0` to collect `read_version`s for the partition tables to read.
333
+
3. Read the specified version from the partition table.
319
334
320
335
#### Commit Behavior
321
336
322
337
Multi-partition transactions are guarded by commits against the `__manifest` table. A typical multi-partition write follows this pattern:
323
338
324
-
1. Write data to each affected partition table independently
325
-
2. Atomically update the `read_version` (and optionally `read_branch` or `read_tag`) of all affected partitions in a single `__manifest` commit
339
+
1. Transaction starts, recording the current version of `__manifest` table as snapshot id(S0).
340
+
2. Write data to each affected partition table independently
341
+
3. Get current version of `__manifest` table as snapshot id(S1), detect/resolve conflicts if `S1` is not `S0`.
342
+
4. Atomically update the `read_version` from `S1` to `S2` in a single `__manifest` commit. `S2` includes detached versions of all affected partitions.
326
343
327
344
This ensures all-or-nothing visibility of changes across partitions.
328
345
329
346
#### Conflict Resolution
330
347
331
-
If concurrent commits have been committed to `__manifest` since the transaction began, the implementation must either:
332
-
333
-
1. Rebase the current commit onto the latest `__manifest` version and retry the commit, or
334
-
2. Fail the current commit and return an error to the caller
335
-
336
-
Implementations are responsible for ensuring the appropriate conflict detection and resolution strategy to guarantee ACID semantics during multi-partition transactions.
348
+
1. Based on `Change Data Feed` of `__manifest` table to fetch the changes between S0 and S1.
349
+
2. For each updated table in S2:
350
+
* collect the transactions between S0 to S1;
351
+
* detect/resolve conflicts between the collected transactions and S2;
352
+
3. Commit the partitioned namespace if all conflicts are resolved.
353
+
4. Otherwise
354
+
* Rebase the current commit onto the latest `__manifest` version and retry the commit, or
355
+
* Fail the current commit and return an error to the caller.
337
356
338
357
## Appendices
339
358
@@ -431,15 +450,15 @@ The namespaces (`v1`, `v1$k7m2n9p4q8r5s3t6`, etc.) are tracked in the `__manifes
431
450
432
451
The `__manifest` table for a partitioned namespace with partition fields `event_date` (v1), `event_year` (v2) and `country` (v2), showing entries from both spec versions:
Note: The root namespace properties (`partition_spec_v1`, `partition_spec_v2`, `schema`) are stored in the `__manifest` table's metadata, not as a row. The `object_id` uses `$` as the namespace path separator. Partition columns use the naming convention `partition_field_{field_id}` where `{field_id}` is the partition field's string identifier. Partition values are inherited from parent namespaces. When retrieving properties via API, partition values are converted to `partition.<field_id> = <value>` entries.
445
464
@@ -480,10 +499,10 @@ One example way to perform such substitution is:
- For partition spec v1, the `country = 'US'` filter cannot be pushed to partition pruning (v1 has no `country` partition), so it must be applied during the table scan
489
508
- For partition spec v2, both filters are pushed down: `partition_field_event_year = 2025` (computed from `year(event_date)`) and `partition_field_country = 'US'`
0 commit comments