Commit da9d0a6

feat: update transaction part in lance partitioning spec
1 parent 1915a12 commit da9d0a6

1 file changed

Lines changed: 59 additions & 42 deletions

docs/src/partitioning-spec.md

@@ -211,16 +211,14 @@ The column name is `partition_field_{i}` where `{i}` is the partition field's `f
211211
This naming convention avoids potential conflicts with user-defined column names.
212212
When a new partition spec version is defined, the `__manifest` table schema is updated accordingly to include any new partition columns.
213213

214-
| Column | Type | Description |
215-
|------------------------------|----------|-----------------------------------------------------------------------------|
216-
| `object_id` | `string` | Full namespace path with `$` separator (existing) |
217-
| `object_type` | `string` | `"namespace"` or `"table"` (existing) |
218-
| `metadata` | `string` | JSON-encoded metadata/properties (existing) |
219-
| `read_version` | `uint64` | Table version for reads (optional, see [Transaction](#transaction)) |
220-
| `read_branch` | `string` | Table branch for reads (optional, see [Transaction](#transaction)) |
221-
| `read_tag` | `string` | Table tag for reads (optional, see [Transaction](#transaction)) |
222-
| `partition_field_{field_id}` | `<type>` | Partition value for the field (nullable, inherited from parent namespaces) |
223-
| ... | ... | Additional partition field columns as needed |
214+
| Column | Type | Description |
215+
|------------------------------|----------|----------------------------------------------------------------------------|
216+
| `object_id` | `string` | Full namespace path with `$` separator (existing) |
217+
| `object_type` | `string` | `"namespace"` or `"table"` (existing) |
218+
| `metadata` | `string` | JSON-encoded metadata/properties (existing) |
219+
| `read_version` | `uint64` | Table version for reads (optional, see [Transaction](#transaction)) |
220+
| `partition_field_{field_id}` | `<type>` | Partition value for the field (nullable, inherited from parent namespaces) |
221+
| ... | ... | Additional partition field columns as needed |
224222

225223
Partition values are inherited from parent namespaces - each row has all partition values from its ancestors.
226224
See [Appendix C: Manifest Table Example](#appendix-c-manifest-table-example) for a complete example.
@@ -297,43 +295,62 @@ This design ensures backward compatibility while enabling partition strategy evo
297295
Operations within a single partition table are ACID-compliant according to the Lance table specification.
298296
Each partition is an independent Lance table, so reads and writes to a single partition follow standard Lance transaction semantics.
299297

300-
### Multi-Partition Transaction
298+
### Multi-Partition Transaction (Weak)
301299

302300
By default, operations across multiple partitions have weaker guarantees:
303301

304302
- **Writes across partitions are not atomic or consistent**: A write that affects multiple partitions may partially succeed, leaving some partitions updated while others are not.
305303
- **Reads across partitions are not isolated**: A read spanning multiple partitions may observe different versions of each partition, leading to inconsistent views.
306304

307-
To enable stronger transactional guarantees across partitions, the `__manifest` table can optionally include `read_version`, `read_branch`, and `read_tag` columns for a table.
308-
These columns record which version of each partition table to read.
305+
In weak mode, write operations are committed directly to the main branch of each table, so users can always see the latest state of each leaf
306+
partition table without additional information from the partitioned namespace.
309307

310-
#### Read Behavior
308+
Users need to handle writes across partitions carefully because there are no ACID guarantees across partitions. One approach is to use an idempotent write such as `insert overwrite` and retry
309+
on any error. Another is to write partitions one by one.
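
A minimal sketch of the two approaches above, combined: partitions are written one by one, and each write is retried on any error. It assumes the supplied `write_partition` callable is idempotent (for example, it performs an `insert overwrite`). The names `write_partitions_one_by_one`, `write_partition`, `max_attempts`, and `backoff_seconds` are hypothetical and not part of the spec or of any Lance API.

```python
import time
from typing import Callable, Iterable


def write_partitions_one_by_one(
    write_partition: Callable[[str], None],
    partition_ids: Iterable[str],
    max_attempts: int = 3,
    backoff_seconds: float = 0.0,
) -> None:
    """Write each partition in turn, retrying an idempotent write on any error.

    `write_partition` is a hypothetical callable; it must be safe to call
    repeatedly for the same partition (e.g. an insert-overwrite).
    """
    for partition_id in partition_ids:
        for attempt in range(1, max_attempts + 1):
            try:
                write_partition(partition_id)
                break  # this partition is done, move on to the next one
            except Exception:
                if attempt == max_attempts:
                    raise  # give up; earlier partitions stay committed
                time.sleep(backoff_seconds)
```

If the final attempt for a partition fails, partitions written earlier remain committed, which is the partial-success outcome that weak mode allows.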
310+
311+
**ACID**
312+
* Read Behavior: Readers should always read the latest version from the main branch.
313+
314+
* Write Behavior: Writers should always commit to the main branch.
311315

312-
Users should specify one of the following combinations:
316+
* Conflict Resolution: No conflict resolution is performed for operations across multiple partitions.
313317

314-
1. **`read_version` only**: Read the specified version from the main branch.
315-
2. **`read_branch` + `read_version`**: Read the specified version from the specified branch.
316-
3. **`read_tag` only**: Read the version referenced by the specified tag.
318+
### Multi-Partition Transaction (Strong)
317319

318-
When all columns are NULL or not present, readers should read the latest version from the main branch.
320+
To enable stronger transactional guarantees across partitions, the `__manifest` table can optionally include a `read_version` column for a table.
321+
The `read_version` column records the version timeline of each partition table. The last version in the timeline is the current version to read.
322+
323+
In strong mode, write operations use a detached commit for each table. A detached commit is invisible unless
324+
the version is set, which ensures that the intermediate state of a transaction remains invisible. Users must first get `read_version` from the partitioned
325+
namespace, then read the leaf partition table at the current version.
326+
327+
#### Read Behavior
328+
329+
1. The transaction starts, recording the current version of the `__manifest` table as snapshot ID (S0).
330+
2. Search the `__manifest` table at `version=S0` to collect the `read_version` values for the partition tables to read.
331+
3. Read each partition table at its current version, that is, the last version in its `read_version` timeline.
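
The read steps above can be sketched as follows. This is an illustration only: `__manifest` is modelled as a plain in-memory dictionary keyed by manifest version, and `resolve_read_versions` is a hypothetical helper, not a Lance API. It assumes each `read_version` timeline is ordered oldest first, so the last entry is the current version.

```python
from typing import Dict, Iterable, List

# Hypothetical stand-in for `__manifest`:
# manifest version -> {object_id -> read_version timeline (oldest first)}
ManifestHistory = Dict[int, Dict[str, List[int]]]


def resolve_read_versions(
    history: ManifestHistory, s0: int, object_ids: Iterable[str]
) -> Dict[str, int]:
    """Return object_id -> table version to read, using the manifest pinned at S0."""
    pinned = history[s0]  # step 2: look only at the snapshot recorded in step 1
    resolved = {}
    for object_id in object_ids:
        timeline = pinned.get(object_id)
        if timeline:  # tables without a timeline at S0 are skipped in this sketch
            resolved[object_id] = timeline[-1]  # step 3: last entry is current
    return resolved


history = {
    1: {"v1$a$dataset": [11]},
    2: {"v1$a$dataset": [11, 12], "v2$b$dataset": [7]},
}
# A reader that started at S0 = 1 keeps seeing version 11, even after
# a later manifest commit appended version 12.
versions_at_s0 = resolve_read_versions(history, 1, ["v1$a$dataset", "v2$b$dataset"])
```

Pinning every lookup to S0 is what keeps a reader from mixing versions produced by different transactions.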
319332

320333
#### Commit Behavior
321334

322335
Multi-partition transactions are guarded by commits against the `__manifest` table. A typical multi-partition write follows this pattern:
323336

324-
1. Write data to each affected partition table independently
325-
2. Atomically update the `read_version` (and optionally `read_branch` or `read_tag`) of all affected partitions in a single `__manifest` commit
337+
1. The transaction starts, recording the current version of the `__manifest` table as snapshot ID (S0).
338+
2. Write data to each affected partition table independently.
339+
3. Get the current version of the `__manifest` table as snapshot ID (S1), and detect and resolve conflicts if `S1` is not `S0`.
340+
4. Atomically update `read_version` in a single `__manifest` commit, moving `__manifest` from `S1` to `S2`. This updates the `read_version` timelines of all affected partitions.
326341

327342
This ensures all-or-nothing visibility of changes across partitions.
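
A minimal sketch of the commit pattern above, under stated assumptions: `InMemoryManifest` is a hypothetical in-memory stand-in for `__manifest`, each value in `detached_writes` is a hypothetical callable that performs a detached commit and returns the new table version, and conflict resolution is reduced to raising an error.

```python
from typing import Callable, Dict, List, Optional


class InMemoryManifest:
    """Hypothetical stand-in for `__manifest`; each version maps
    object_id -> read_version timeline (oldest first)."""

    def __init__(self) -> None:
        self.versions: List[Dict[str, List[int]]] = [{}]  # version 0 is empty

    def current_version(self) -> int:
        return len(self.versions) - 1

    def commit(self, expected: int, new_versions: Dict[str, int]) -> Optional[int]:
        """Append one new manifest version, or return None if the manifest moved."""
        if self.current_version() != expected:
            return None
        latest = {oid: list(tl) for oid, tl in self.versions[-1].items()}
        for object_id, version in new_versions.items():
            latest.setdefault(object_id, []).append(version)
        self.versions.append(latest)  # all timelines become visible together
        return self.current_version()


def multi_partition_write(
    manifest: InMemoryManifest, detached_writes: Dict[str, Callable[[], int]]
) -> int:
    s0 = manifest.current_version()                                  # step 1
    new_versions = {oid: w() for oid, w in detached_writes.items()}  # step 2
    s1 = manifest.current_version()                                  # step 3
    if s1 != s0:
        raise RuntimeError("manifest changed; conflict resolution required")
    s2 = manifest.commit(s1, new_versions)                           # step 4
    if s2 is None:
        raise RuntimeError("manifest changed during commit; retry")
    return s2
```

Until step 4 succeeds, no manifest version references the newly written table versions, so readers that follow `read_version` never observe a partial write.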
328343

329344
#### Conflict Resolution
330345

331-
If concurrent commits have been committed to `__manifest` since the transaction began, the implementation must either:
332-
333-
1. Rebase the current commit onto the latest `__manifest` version and retry the commit, or
334-
2. Fail the current commit and return an error to the caller
335-
336-
Implementations are responsible for ensuring the appropriate conflict detection and resolution strategy to guarantee ACID semantics during multi-partition transactions.
346+
1. Use the `read_version` timelines to fetch the changes between S0 and S1.
347+
2. For each updated table in S2:
348+
* collect the transactions between S0 and S1;
349+
* detect/resolve conflicts between the collected transactions and S2;
350+
3. Commit the partitioned namespace if all conflicts are resolved.
351+
4. Otherwise, either:
352+
* Rebase the current commit onto the latest `__manifest` version and retry the commit, or
353+
* Fail the current commit and return an error to the caller.
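
The detection part of these steps can be sketched as below. The sketch assumes a deliberately conservative policy that is not mandated by the spec: any table touched both by concurrent commits (between S0 and S1) and by the pending commit (S2) counts as a conflict. A real implementation would inspect the individual table transactions instead. `changed_tables` and `decide` are hypothetical names.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical stand-in for `__manifest` history:
# index = manifest version, value = {object_id -> read_version timeline}
ManifestVersions = List[Dict[str, List[int]]]


def changed_tables(versions: ManifestVersions, s0: int, s1: int) -> Set[str]:
    """Object ids whose read_version timeline differs between S0 and S1."""
    before, after = versions[s0], versions[s1]
    return {
        oid for oid in set(before) | set(after)
        if before.get(oid) != after.get(oid)
    }


def decide(versions: ManifestVersions, s0: int, s1: int,
           pending_tables: Iterable[str]) -> str:
    """Return "commit" when no pending table was changed concurrently,
    otherwise "rebase_or_fail" (the caller picks one of the two options)."""
    conflicts = changed_tables(versions, s0, s1) & set(pending_tables)
    return "commit" if not conflicts else "rebase_or_fail"


versions = [
    {"a": [1]},                 # S0
    {"a": [1, 2], "b": [5]},    # S1: concurrent commits touched a and b
]
```

With this history, a pending commit that only touches table `c` can proceed, while one that touches `a` must be rebased or failed.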
337354

338355
## Appendices
339356

@@ -431,15 +448,15 @@ The namespaces (`v1`, `v1$k7m2n9p4q8r5s3t6`, etc.) are tracked in the `__manifes
431448

432449
The `__manifest` table for a partitioned namespace with partition fields `event_date` (v1), `event_year` (v2) and `country` (v2), showing entries from both spec versions:
433450

434-
| object_id | object_type | metadata | read_version | read_branch | read_tag | partition_field_event_date | partition_field_event_year | partition_field_country |
435-
|-----------------------------------------------|-------------|----------|--------------|-------------|----------|----------------------------|----------------------------|-------------------------|
436-
| v1 | namespace | {} | NULL | NULL | NULL | NULL | NULL | NULL |
437-
| v1$k7m2n9p4q8r5s3t6 | namespace | {} | NULL | NULL | NULL | 2025-12-10 | NULL | NULL |
438-
| v1$k7m2n9p4q8r5s3t6$dataset | table | {} | 5 | NULL | NULL | 2025-12-10 | NULL | NULL |
439-
| v2 | namespace | {} | NULL | NULL | NULL | NULL | NULL | NULL |
440-
| v2$e9f0g1h2i3j4k5l6 | namespace | {} | NULL | NULL | NULL | NULL | 2025 | NULL |
441-
| v2$e9f0g1h2i3j4k5l6$m7n8o9p0q1r2s3t4 | namespace | {} | NULL | NULL | NULL | NULL | 2025 | US |
442-
| v2$e9f0g1h2i3j4k5l6$m7n8o9p0q1r2s3t4$dataset | table | {} | 3 | NULL | NULL | NULL | 2025 | US |
451+
| object_id | object_type | metadata | read_version | partition_field_event_date | partition_field_event_year | partition_field_country |
452+
|----------------------------------------------|-------------|----------|--------------------------------------------|----------------------------|----------------------------|-------------------------|
453+
| v1 | namespace | {} | NULL | NULL | NULL | NULL |
454+
| v1$k7m2n9p4q8r5s3t6 | namespace | {} | NULL | 2025-12-10 | NULL | NULL |
455+
| v1$k7m2n9p4q8r5s3t6$dataset | table | {} | 13473201876543210951, 11120734598765432152 | 2025-12-10 | NULL | NULL |
456+
| v2 | namespace | {} | NULL | NULL | NULL | NULL |
457+
| v2$e9f0g1h2i3j4k5l6 | namespace | {} | NULL | NULL | 2025 | NULL |
458+
| v2$e9f0g1h2i3j4k5l6$m7n8o9p0q1r2s3t4 | namespace | {} | NULL | NULL | 2025 | US |
459+
| v2$e9f0g1h2i3j4k5l6$m7n8o9p0q1r2s3t4$dataset | table | {} | 16045690984833335022 | NULL | 2025 | US |
443460

444461
Note: The root namespace properties (`partition_spec_v1`, `partition_spec_v2`, `schema`) are stored in the `__manifest` table's metadata, not as a row. The `object_id` uses `$` as the namespace path separator. Partition columns use the naming convention `partition_field_{field_id}` where `{field_id}` is the partition field's string identifier. Partition values are inherited from parent namespaces. When retrieving properties via API, partition values are converted to `partition.<field_id> = <value>` entries.
445462

@@ -459,7 +476,7 @@ WHERE event_date = '2025-12-10' AND country = 'US'
459476
The engine translates this to the following `__manifest` DataFusion query plan to examine related partition tables.
460477

461478
```sql
462-
SELECT object_id, location, read_version, read_branch, read_tag
479+
SELECT object_id, location, read_version
463480
FROM __manifest
464481
WHERE object_type = 'table'
465482
AND (
@@ -480,14 +497,14 @@ One example way to perform such substitution is:
480497

481498
This query returns:
482499

483-
| object_id | location | read_version | read_branch | read_tag |
484-
|----------------------------------------------|-------------------------------------------------------|--------------|-------------|----------|
485-
| v1$k7m2n9p4q8r5s3t6$dataset | b4a3c2d1_v1$k7m2n9p4q8r5s3t6$dataset | 5 | NULL | NULL |
486-
| v2$e9f0g1h2i3j4k5l6$m7n8o9p0q1r2s3t4$dataset | aabbccdd_v2$e9f0g1h2i3j4k5l6$m7n8o9p0q1r2s3t4$dataset | 3 | NULL | NULL |
500+
| object_id | location | read_version |
501+
|----------------------------------------------|-------------------------------------------------------|----------------------|
502+
| v1$k7m2n9p4q8r5s3t6$dataset | b4a3c2d1_v1$k7m2n9p4q8r5s3t6$dataset | 18446744073709551615 |
503+
| v2$e9f0g1h2i3j4k5l6$m7n8o9p0q1r2s3t4$dataset | aabbccdd_v2$e9f0g1h2i3j4k5l6$m7n8o9p0q1r2s3t4$dataset | 16045690984833335022 |
487504

488505
- For partition spec v1, the `country = 'US'` filter cannot be pushed to partition pruning (v1 has no `country` partition), so it must be applied during the table scan
489506
- For partition spec v2, both filters are pushed down: `partition_field_event_year = 2025` (computed from `year(event_date)`) and `partition_field_country = 'US'`
490-
- The engine reads each table at the version specified by `read_version`, `read_branch`, or `read_tag` for consistent snapshot reads
507+
- The engine reads each table at the version specified by `read_version` for consistent snapshot reads
491508

492509
### Appendix E: Runtime Namespace Properties Example
493510
