|`read_version`|`uint64`| Table version for reads (optional, see [Transaction](#transaction)) |
+|`partition_field_{field_id}`|`<type>`| Partition value for the field (nullable, inherited from parent namespaces) |
+| ... | ... | Additional partition field columns as needed |
Partition values are inherited from parent namespaces - each row has all partition values from its ancestors.
See [Appendix C: Manifest Table Example](#appendix-c-manifest-table-example) for a complete example.
@@ -297,43 +295,62 @@ This design ensures backward compatibility while enabling partition strategy evo
Operations within a single partition table are ACID-compliant according to the Lance table specification.
Each partition is an independent Lance table, so reads and writes to a single partition follow standard Lance transaction semantics.
-### Multi-Partition Transaction
+### Multi-Partition Transaction (Weak)
By default, operations across multiple partitions have weaker guarantees:
- **Writes across partitions are not atomic or consistent**: A write that affects multiple partitions may partially succeed, leaving some partitions updated while others are not.
- **Reads across partitions are not isolated**: A read spanning multiple partitions may observe different versions of each partition, leading to inconsistent views.
-To enable stronger transactional guarantees across partitions, the `__manifest`table can optionally include `read_version`, `read_branch`, and `read_tag` columns for a table.
-These columns record which version of each partition table to read.
+In multi-partition transaction weak mode, write operations are committed directly to the main branch of each table, so users can always see the latest state of each leaf
+partition table without additional information from the partitioned namespace.
-#### Read Behavior
+Users need to handle writes across partitions carefully because there are no ACID guarantees. One way is to use an idempotent write such as `insert overwrite`, then retry
+on any error. Another way is to write partitions one by one.
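The retry approach described above can be sketched as follows. This is a minimal illustration, not part of the specification: `overwrite_with_retry` and the `overwrite(partition_id)` callable are hypothetical names, and the sketch assumes the overwrite is idempotent so that repeating it after a failure is safe.

```python
def overwrite_with_retry(partitions, overwrite, max_attempts=3):
    """Apply an idempotent overwrite to each partition, retrying failures.

    `overwrite(partition_id)` is a hypothetical callable that replaces the
    contents of one partition table. Because it is assumed to be idempotent,
    calling it again after an error cannot duplicate data.
    Returns the partitions that still failed after all attempts.
    """
    pending = list(partitions)
    for _ in range(max_attempts):
        failed = []
        for partition_id in pending:
            try:
                overwrite(partition_id)
            except Exception:
                # Weak mode has no cross-partition rollback, so remember
                # the failure and retry only this partition later.
                failed.append(partition_id)
        pending = failed
        if not pending:
            break
    return pending
```

A non-empty return value means some partitions were never updated, so the caller still has to handle that partial state.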
+
+**ACID**
+* Read Behavior: Readers should always read the latest version from the main branch.
+
+* Write Behavior: Writers should always commit to the main branch.
-Users should specify one of the following combinations:
+* Conflict Resolution: No conflict resolution for operations across multiple partitions.
-1. **`read_version` only**: Read the specified version from the main branch.
-2. **`read_branch` + `read_version`**: Read the specified version from the specified branch.
-3. **`read_tag` only**: Read the version referenced by the specified tag.
+### Multi-Partition Transaction (Strong)
-When all columns are NULL or not present, readers should read the latest version from the main branch.
+To enable stronger transactional guarantees across partitions, the `__manifest` table can optionally include a `read_version` column for a table.
+The `read_version` column records the version timeline of each partition table. The last version in the timeline is the current version to read.
+
+In multi-partition transaction strong mode, write operations use a detached commit for each table. A detached commit is invisible unless
+the version is set, which ensures that the intermediate state of a transaction remains invisible. Users need to first get `read_version` from the partitioned
+namespace, then read the leaf partition table using the current version.
+
+#### Read Behavior
+
+1. Transaction starts, recording the current version of the `__manifest` table as snapshot id (S0).
+2. Search the `__manifest` table with `version=S0` to collect the `read_version`s for the partition tables to read.
+3. Read the current version from the partition table.
#### Commit Behavior
Multi-partition transactions are guarded by commits against the `__manifest` table. A typical multi-partition write follows this pattern:
-1. Write data to each affected partition table independently
-2. Atomically update the `read_version` (and optionally `read_branch` or `read_tag`) of all affected partitions in a single `__manifest` commit
+1. Transaction starts, recording the current version of the `__manifest` table as snapshot id (S0).
+2. Write data to each affected partition table independently.
+3. Get the current version of the `__manifest` table as snapshot id (S1); detect and resolve conflicts if `S1` is not `S0`.
+4. Atomically update the `read_version` from `S1` to `S2` in a single `__manifest` commit. This updates the `read_version` timelines of all affected partitions.
This ensures all-or-nothing visibility of changes across partitions.
#### Conflict Resolution
-If concurrent commits have been committed to `__manifest` since the transaction began, the implementation must either:
-
-1. Rebase the current commit onto the latest `__manifest` version and retry the commit, or
-2. Fail the current commit and return an error to the caller
-
-Implementations are responsible for ensuring the appropriate conflict detection and resolution strategy to guarantee ACID semantics during multi-partition transactions.
+1. Use the `read_version` timeline to fetch the changes between S0 and S1.
+2. For each updated table in S2:
+   * collect the transactions between S0 and S1;
+   * detect/resolve conflicts between the collected transactions and S2;
+3. Commit the partitioned namespace if all conflicts are resolved.
+4. Otherwise
+   * Rebase the current commit onto the latest `__manifest` version and retry the commit, or
+   * Fail the current commit and return an error to the caller.
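The strong-mode read, commit, and conflict-resolution steps above can be sketched with a toy in-memory model. Everything here is an assumption made for illustration: the `Manifest` class, its methods, and the table ids are hypothetical and are not the Lance API. The sketch also picks the simplest conflict policy, failing when a concurrent commit touched the same partition table, instead of attempting a rebase.

```python
import threading


class ConflictError(Exception):
    """Raised when a concurrent commit touched the same partition table."""


class Manifest:
    """Toy in-memory stand-in for the `__manifest` table (hypothetical)."""

    def __init__(self):
        self._lock = threading.Lock()
        # _snapshots[i] maps partition table id -> read_version timeline.
        self._snapshots = [{}]

    def current_version(self):
        """Snapshot id of the latest manifest version (S0 or S1)."""
        with self._lock:
            return len(self._snapshots) - 1

    def read_versions(self, snapshot_id):
        """Return {table_id: current read_version} as of a snapshot."""
        with self._lock:
            snapshot = self._snapshots[snapshot_id]
            return {t: timeline[-1] for t, timeline in snapshot.items()}

    def commit(self, s0, updates):
        """Atomically publish `updates` ({table_id: detached version}).

        `s0` is the snapshot id recorded when the transaction started.
        Returns the new snapshot id (S2) or raises ConflictError.
        """
        with self._lock:
            s1 = len(self._snapshots) - 1
            latest = self._snapshots[s1]
            if s1 != s0:
                base = self._snapshots[s0]
                # Tables whose timeline changed between S0 and S1.
                changed = {t for t, tl in latest.items() if tl != base.get(t)}
                overlap = changed.intersection(updates)
                if overlap:
                    raise ConflictError(sorted(overlap))
            new_snapshot = {t: list(tl) for t, tl in latest.items()}
            for table_id, version in updates.items():
                new_snapshot.setdefault(table_id, []).append(version)
            self._snapshots.append(new_snapshot)
            return len(self._snapshots) - 1
```

In this model a writer records `s0 = manifest.current_version()`, performs its detached commits, and then calls `manifest.commit(s0, {...})`. Readers pinned to an older snapshot id keep seeing the older `read_version` values, which is the all-or-nothing visibility the section describes.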
## Appendices
@@ -431,15 +448,15 @@ The namespaces (`v1`, `v1$k7m2n9p4q8r5s3t6`, etc.) are tracked in the `__manifes
The `__manifest` table for a partitioned namespace with partition fields `event_date` (v1), `event_year` (v2) and `country` (v2), showing entries from both spec versions:
Note: The root namespace properties (`partition_spec_v1`, `partition_spec_v2`, `schema`) are stored in the `__manifest` table's metadata, not as a row. The `object_id` uses `$` as the namespace path separator. Partition columns use the naming convention `partition_field_{field_id}` where `{field_id}` is the partition field's string identifier. Partition values are inherited from parent namespaces. When retrieving properties via API, partition values are converted to `partition.<field_id> = <value>` entries.
@@ -459,7 +476,7 @@ WHERE event_date = '2025-12-10' AND country = 'US'
The engine translates this to the following `__manifest` DataFusion query plan to examine related partition tables.
- For partition spec v1, the `country = 'US'` filter cannot be pushed to partition pruning (v1 has no `country` partition), so it must be applied during the table scan
- For partition spec v2, both filters are pushed down: `partition_field_event_year = 2025` (computed from `year(event_date)`) and `partition_field_country = 'US'`
-- The engine reads each table at the version specified by `read_version`, `read_branch`, or `read_tag` for consistent snapshot reads
+- The engine reads each table at the version specified by `read_version` for consistent snapshot reads
### Appendix E: Runtime Namespace Properties Example