Commit 6199e6c

westonpace and claude authored
docs: document DataFile column_indices changes in 2.1 format (#6416)
## Summary

- Add a 5.0.0 section to the migration guide documenting how `DataFile.column_indices` changed with data storage version 2.1: non-leaf fields (structs, lists) now get `-1` instead of sequential column indices
- Add an admonition to the table format spec's Data Files section noting the version difference
- Includes a concrete before/after example and opt-out instructions

Closes #6411

## Test plan

- [x] Docs build successfully with `mkdocs build`
- [ ] Verify rendered migration guide section and table format admonition look correct

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
1 parent efc9374 commit 6199e6c

2 files changed

Lines changed: 46 additions & 0 deletions

docs/src/format/table/index.md

Lines changed: 11 additions & 0 deletions
@@ -121,6 +121,17 @@ or independently of column indices due to variable encoding widths (for Lance fi
 
 </details>
 
+!!! note "Field-to-column mapping differs between data storage versions"
+
+In **2.0**, all fields (including non-leaf fields like struct and list containers) are assigned
+sequential column indices in `column_indices`.
+
+In **2.1+**, non-leaf fields (unpacked structs, list containers) are assigned `-1` in
+`column_indices` because their validity information is folded into repetition/definition
+levels. Only leaf fields and packed structs have column indices.
+
+See the [5.0.0 migration guide](../../guide/migration.md#500) for a detailed example.
+
 ## Deletion Files
 
 Deletion files (a.k.a. deletion vectors) track deleted rows without rewriting data files.
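
The admonition added in this hunk implies that code reading `column_indices` must skip `-1` entries under 2.1+. The following is a minimal sketch of that lookup, assuming the field IDs and `column_indices` of a data file are available as two parallel lists of integers. The helper name `field_to_column` and the sample values are hypothetical and are not part of the Lance API.

```python
from typing import Dict, List


def field_to_column(fields: List[int], column_indices: List[int]) -> Dict[int, int]:
    """Map field ID -> column index, skipping fields that have no column (-1).

    Hypothetical helper for illustration; assumes the two lists are parallel.
    """
    if len(fields) != len(column_indices):
        raise ValueError("fields and column_indices must have the same length")
    return {fid: col for fid, col in zip(fields, column_indices) if col >= 0}


# Illustrative values for a five-field schema with two container fields.
fields = [0, 1, 2, 3, 4]
mapping_20 = field_to_column(fields, [0, 1, 2, 3, 4])    # 2.0: every field has a column
mapping_21 = field_to_column(fields, [0, -1, 1, -1, 2])  # 2.1: containers are skipped
```

Under 2.0 the mapping covers every field, while under 2.1 only the fields that own a column appear in it.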

docs/src/guide/migration.md

Lines changed: 35 additions & 0 deletions
@@ -6,6 +6,41 @@ stable and breaking changes should generally be communicated (via warnings) for
 give users a chance to migrate. This page documents the breaking changes between releases and gives advice on how to
 migrate.
 
+## 5.0.0
+
+* The default data storage version changed from 2.0 to 2.1. This affects the `column_indices`
+field in the `DataFile` protobuf message. In 2.0, every field (including non-leaf fields like
+struct containers and list containers) was assigned a sequential column index. In 2.1, non-leaf
+fields (unpacked structs, list containers) are assigned `-1` instead since their validity
+information is now folded into repetition/definition levels. Only leaf fields and packed structs
+are assigned column indices.
+
+For example, given the schema:
+
+```
+x: i32, y: [f32], z: { a: i32 }
+```
+
+The fields (in depth-first order) are:
+
+| Field ID | Field |
+|----------|---------------|
+| 0 | `x` (i32) |
+| 1 | `y` (list) |
+| 2 | `y.item` (f32)|
+| 3 | `z` (struct) |
+| 4 | `z.a` (i32) |
+
+In **2.0**, `column_indices` = `[0, 1, 2, 3, 4]` — every field gets a column.
+
+In **2.1**, `column_indices` = `[0, -1, 1, -1, 2]` — non-leaf fields (`y` and `z`) get `-1`.
+
+* This change only affects advanced users who construct `DataFile` messages directly, for example
+when building operations by hand for `Dataset.commit`. Normal read and write paths are
+unaffected.
+
+* To opt back to 2.0 format, set `data_storage_version="2.0"` when creating a dataset.
+
 ## 1.0.0
 
 * The `SearchResult` returned by scalar indices must now output information about null values.
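
The numbering rule described in the 5.0.0 entry can be reproduced with a short depth-first walk. This is only a sketch under stated assumptions: the `Field` class and `assign_column_indices` function are hypothetical stand-ins rather than Lance APIs, and packed structs (which the entry says do receive a column) are not modeled.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Field:
    """Hypothetical schema node for illustration; not a Lance type."""
    name: str
    children: List["Field"] = field(default_factory=list)


def assign_column_indices(schema: List[Field], version: str) -> List[int]:
    """Return one entry per field, visiting fields in depth-first order.

    "2.0": every field takes the next sequential column index.
    "2.1": only leaf fields take a column; container fields get -1.
    """
    indices: List[int] = []
    next_column = 0

    def visit(node: Field) -> None:
        nonlocal next_column
        is_leaf = not node.children
        if version == "2.0" or is_leaf:
            indices.append(next_column)
            next_column += 1
        else:
            indices.append(-1)
        for child in node.children:
            visit(child)

    for top in schema:
        visit(top)
    return indices


# Schema from the migration entry: x: i32, y: [f32], z: { a: i32 }
schema = [
    Field("x"),
    Field("y", [Field("item")]),
    Field("z", [Field("a")]),
]

v20 = assign_column_indices(schema, "2.0")  # [0, 1, 2, 3, 4]
v21 = assign_column_indices(schema, "2.1")  # [0, -1, 1, -1, 2]
```

Anyone building `DataFile` messages by hand could compare their own `column_indices` against a walk like this before committing, keeping in mind that the real writer remains the source of truth.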
