Skip to content

Commit 19ba3f9

Browse files
committed
feat: update transaction part in lance partitioning spec
1 parent 1915a12 commit 19ba3f9

1 file changed

Lines changed: 49 additions & 30 deletions

File tree

docs/src/partitioning-spec.md

Lines changed: 49 additions & 30 deletions
Original file line numberDiff line numberDiff line change
@@ -297,43 +297,62 @@ This design ensures backward compatibility while enabling partition strategy evo
297297
Operations within a single partition table are ACID-compliant according to the Lance table specification.
298298
Each partition is an independent Lance table, so reads and writes to a single partition follow standard Lance transaction semantics.
299299

300-
### Multi-Partition Transaction
300+
### Multi-Partition Transaction (Weak)
301301

302302
By default, operations across multiple partitions have weaker guarantees:
303303

304304
- **Writes across partitions are not atomic or consistent**: A write that affects multiple partitions may partially succeed, leaving some partitions updated while others are not.
305305
- **Reads across partitions are not isolated**: A read spanning multiple partitions may observe different versions of each partition, leading to inconsistent views.
306306

307-
To enable stronger transactional guarantees across partitions, the `__manifest` table can optionally include `read_version`, `read_branch`, and `read_tag` columns for a table.
308-
These columns record which version of each partition table to read.
307+
In multi-partition transaction weak mode, write operations are directly committed to the main branch of each table. It means users can always see the fresh state of each leaf
308+
partition table without additional information from partitioned namespace.
309309

310-
#### Read Behavior
310+
Users need to handle writes across partitions carefully because there is no ACID guarantees. One way is to use idempotent write like `insert overwrite` then retry
311+
for whatever error. Another way is writing partitions one by one.
312+
313+
**ACID**
314+
* Read Behavior: Readers should always read the latest version from the main branch.
315+
316+
* Write Behavior: Writers should always commit to the main branch.
311317

312-
Users should specify one of the following combinations:
318+
* Conflict Resolution: No conflict resolution for operations across multiple partition.
313319

314-
1. **`read_version` only**: Read the specified version from the main branch.
315-
2. **`read_branch` + `read_version`**: Read the specified version from the specified branch.
316-
3. **`read_tag` only**: Read the version referenced by the specified tag.
320+
### Multi-Partition Transaction (Strong)
317321

318-
When all columns are NULL or not present, readers should read the latest version from the main branch.
322+
To enable stronger transactional guarantees across partitions, the `__manifest` table can optionally include `read_version` column for a table.
323+
The `read_version` records which version of each partition table to read.
324+
325+
In multi-partition transaction strong mode, write operations will use detached commit to each table. A detached commit is invisible unless
326+
the version is set, it makes sure the intermediate state of a transaction remains invisible. Users need to first get `read_version` from partitioned
327+
namespace, then read the leaf partition table.
328+
329+
#### Read Behavior
330+
331+
1. Transaction starts, recording the current version of `__manifest` table as snapshot id(S0).
332+
2. Search `__manifest` table with `version=S0` to collect `read_version`s for the partition tables to read.
333+
3. Read the specified version from the partition table.
319334

320335
#### Commit Behavior
321336

322337
Multi-partition transactions are guarded by commits against the `__manifest` table. A typical multi-partition write follows this pattern:
323338

324-
1. Write data to each affected partition table independently
325-
2. Atomically update the `read_version` (and optionally `read_branch` or `read_tag`) of all affected partitions in a single `__manifest` commit
339+
1. Transaction starts, recording the current version of `__manifest` table as snapshot id(S0).
340+
2. Write data to each affected partition table independently
341+
3. Get current version of `__manifest` table as snapshot id(S1), detect/resolve conflicts if `S1` is not `S0`.
342+
4. Atomically update the `read_version` from `S1` to `S2` in a single `__manifest` commit. `S2` includes detached versions of all affected partitions.
326343

327344
This ensures all-or-nothing visibility of changes across partitions.
328345

329346
#### Conflict Resolution
330347

331-
If concurrent commits have been committed to `__manifest` since the transaction began, the implementation must either:
332-
333-
1. Rebase the current commit onto the latest `__manifest` version and retry the commit, or
334-
2. Fail the current commit and return an error to the caller
335-
336-
Implementations are responsible for ensuring the appropriate conflict detection and resolution strategy to guarantee ACID semantics during multi-partition transactions.
348+
1. Based on `Change Data Feed` of `__manifest` table to fetch the changes between S0 and S1.
349+
2. For each updated table in S2:
350+
* collect the transactions between S0 to S1;
351+
* detect/resolve conflicts between the collected transactions and S2;
352+
3. Commit the partitioned namespace if all conflicts are resolved.
353+
4. Otherwise
354+
* Rebase the current commit onto the latest `__manifest` version and retry the commit, or
355+
* Fail the current commit and return an error to the caller.
337356

338357
## Appendices
339358

@@ -431,15 +450,15 @@ The namespaces (`v1`, `v1$k7m2n9p4q8r5s3t6`, etc.) are tracked in the `__manifes
431450

432451
The `__manifest` table for a partitioned namespace with partition fields `event_date` (v1), `event_year` (v2) and `country` (v2), showing entries from both spec versions:
433452

434-
| object_id | object_type | metadata | read_version | read_branch | read_tag | partition_field_event_date | partition_field_event_year | partition_field_country |
435-
|-----------------------------------------------|-------------|----------|--------------|-------------|----------|----------------------------|----------------------------|-------------------------|
436-
| v1 | namespace | {} | NULL | NULL | NULL | NULL | NULL | NULL |
437-
| v1$k7m2n9p4q8r5s3t6 | namespace | {} | NULL | NULL | NULL | 2025-12-10 | NULL | NULL |
438-
| v1$k7m2n9p4q8r5s3t6$dataset | table | {} | 5 | NULL | NULL | 2025-12-10 | NULL | NULL |
439-
| v2 | namespace | {} | NULL | NULL | NULL | NULL | NULL | NULL |
440-
| v2$e9f0g1h2i3j4k5l6 | namespace | {} | NULL | NULL | NULL | NULL | 2025 | NULL |
441-
| v2$e9f0g1h2i3j4k5l6$m7n8o9p0q1r2s3t4 | namespace | {} | NULL | NULL | NULL | NULL | 2025 | US |
442-
| v2$e9f0g1h2i3j4k5l6$m7n8o9p0q1r2s3t4$dataset | table | {} | 3 | NULL | NULL | NULL | 2025 | US |
453+
| object_id | object_type | metadata | read_version | partition_field_event_date | partition_field_event_year | partition_field_country |
454+
|----------------------------------------------|-------------|----------|--------------|----------------------------|----------------------------|-------------------------|
455+
| v1 | namespace | {} | NULL | NULL | NULL | NULL |
456+
| v1$k7m2n9p4q8r5s3t6 | namespace | {} | NULL | 2025-12-10 | NULL | NULL |
457+
| v1$k7m2n9p4q8r5s3t6$dataset | table | {} | 5 | 2025-12-10 | NULL | NULL |
458+
| v2 | namespace | {} | NULL | NULL | NULL | NULL |
459+
| v2$e9f0g1h2i3j4k5l6 | namespace | {} | NULL | NULL | 2025 | NULL |
460+
| v2$e9f0g1h2i3j4k5l6$m7n8o9p0q1r2s3t4 | namespace | {} | NULL | NULL | 2025 | US |
461+
| v2$e9f0g1h2i3j4k5l6$m7n8o9p0q1r2s3t4$dataset | table | {} | 3 | NULL | 2025 | US |
443462

444463
Note: The root namespace properties (`partition_spec_v1`, `partition_spec_v2`, `schema`) are stored in the `__manifest` table's metadata, not as a row. The `object_id` uses `$` as the namespace path separator. Partition columns use the naming convention `partition_field_{field_id}` where `{field_id}` is the partition field's string identifier. Partition values are inherited from parent namespaces. When retrieving properties via API, partition values are converted to `partition.<field_id> = <value>` entries.
445464

@@ -480,10 +499,10 @@ One example way to perform such substitution is:
480499

481500
This query returns:
482501

483-
| object_id | location | read_version | read_branch | read_tag |
484-
|----------------------------------------------|-------------------------------------------------------|--------------|-------------|----------|
485-
| v1$k7m2n9p4q8r5s3t6$dataset | b4a3c2d1_v1$k7m2n9p4q8r5s3t6$dataset | 5 | NULL | NULL |
486-
| v2$e9f0g1h2i3j4k5l6$m7n8o9p0q1r2s3t4$dataset | aabbccdd_v2$e9f0g1h2i3j4k5l6$m7n8o9p0q1r2s3t4$dataset | 3 | NULL | NULL |
502+
| object_id | location | read_version |
503+
|----------------------------------------------|-------------------------------------------------------|--------------|
504+
| v1$k7m2n9p4q8r5s3t6$dataset | b4a3c2d1_v1$k7m2n9p4q8r5s3t6$dataset | 5 |
505+
| v2$e9f0g1h2i3j4k5l6$m7n8o9p0q1r2s3t4$dataset | aabbccdd_v2$e9f0g1h2i3j4k5l6$m7n8o9p0q1r2s3t4$dataset | 3 |
487506

488507
- For partition spec v1, the `country = 'US'` filter cannot be pushed to partition pruning (v1 has no `country` partition), so it must be applied during the table scan
489508
- For partition spec v2, both filters are pushed down: `partition_field_event_year = 2025` (computed from `year(event_date)`) and `partition_field_country = 'US'`

0 commit comments

Comments
 (0)