- Start Date: 2026-03-04
- Tracking Issue: [vortex-data/vortex#0000](https://github.com/vortex-data/vortex/issues/0000)

## Summary

We would like to add a Tensor type to Vortex as an extension over `FixedSizeList`. This RFC proposes
the design of a fixed-shape tensor with contiguous backing memory.

## Motivation

#### Tensors in the wild

Tensors are multi-dimensional (n-dimensional) arrays that generalize vectors (1D) and matrices (2D)
to arbitrary dimensions. They are quite common in ML/AI and scientific computing applications. To
name just a few examples:

- Image or video data stored as `height x width x channels`
- Multi-dimensional sensor or time-series data
- Embedding vectors from language models and recommendation systems

#### Tensors in Vortex

In the current version of Vortex, there are two ways to represent fixed-shape tensors using the
`FixedSizeList` `DType`, and neither seems satisfactory.

The simplest approach is to flatten the tensor into a single `FixedSizeList<n>` whose size is the
product of all dimensions (this is what Apache Arrow does). However, this discards shape information
entirely: a `2x3` matrix and a `3x2` matrix would both become `FixedSizeList<6>`. Shape metadata
must be stored separately, and any dimension-aware operation (slicing along an axis, transposing,
etc.) reduces to manual index arithmetic with no type-level guarantees.

The alternative is to nest `FixedSizeList` types, e.g., `FixedSizeList<FixedSizeList<n>, m>` for a
matrix. This preserves some structure, but becomes unwieldy for higher-dimensional tensors.
Axis-specific slicing or indexing on individual tensors (tensor scalars, not tensor arrays) would
require custom expressions aware of the specific nesting depth, rather than operating on a single,
uniform tensor type.

Additionally, reshaping requires restructuring the entire nested type, and operations like
transposes would be difficult to implement correctly.

Beyond these structural issues, neither approach stores shape and stride metadata explicitly. This
makes interoperability awkward with external tensor libraries (NumPy, PyTorch, etc.) that expect
contiguous memory accompanied by exactly this metadata.

Thus, we propose a dedicated extension type that encapsulates tensor semantics (shape, strides,
dimension-aware operations) on top of contiguous, row-major (C-style) backing memory.

## Design

Since the design of extension types has not been fully settled yet (see
[RFC #0005](https://github.com/vortex-data/rfcs/pull/5)), the complete design of tensors cannot be
described here. However, we know enough to present the general idea.

### Storage Type

Extension types in Vortex require defining a canonical storage type that represents what the
extension array looks like when it is canonicalized. For tensors, we will want this storage type to
be a `FixedSizeList<p, s>`, where `p` is a numeric type (like `u8`, `f64`, or a decimal type), and
where `s` is the product of all dimensions of the tensor.

For example, if we want to represent a tensor of `i32` with dimensions `[2, 3, 4]`, the storage type
for this tensor would be `FixedSizeList<i32, 24>` since `2 x 3 x 4 = 24`.

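To make the rule concrete, here is a minimal sketch of the size computation (the function name is
ours for illustration, not an existing Vortex API):

```rust
/// Number of elements in one tensor, and therefore the size parameter of the
/// backing `FixedSizeList` storage type.
fn storage_size(shape: &[usize]) -> usize {
    // The empty product is 1, so a 0D tensor stores exactly one element.
    shape.iter().product()
}

fn main() {
    assert_eq!(storage_size(&[2, 3, 4]), 24); // -> FixedSizeList<i32, 24>
}
```
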
This is equivalent to the design of Arrow's canonical Tensor extension type. For discussion on why
we choose not to represent tensors as nested FSLs (for example
`FixedSizeList<FixedSizeList<FixedSizeList<i32, 2>, 3>, 4>`), see the [alternatives](#alternatives)
section.

### Element Type

We restrict tensor element types to `Primitive` and `Decimal`. Tensors are fundamentally about dense
numeric computation, and operations like transpose, reshape, and slicing rely on uniform, fixed-size
elements whose offsets are computable from strides.

Variable-size types (like strings) would break this model entirely. `Bool` is excluded as well
because Vortex bit-packs boolean arrays, which conflicts with byte-level stride arithmetic. This is
similar to PyTorch, which likewise restricts tensor elements to a small set of fixed-size dtypes.

Theoretically, we could allow more element types in the future, but it should remain a very low
priority.

### Validity

We define two layers of nullability for tensors: the tensor itself may be null (within a tensor
array), and individual elements within a tensor may be null. However, we do not support nulling out
entire sub-dimensions of a tensor (e.g., marking a whole row or slice as null).

The validity bitmap is flat (one bit per element) and follows the same contiguous layout as the
backing data (just like `FixedSizeList`). This keeps stride-based access straightforward while still
allowing sparse values within an otherwise dense tensor.

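One consequence worth spelling out: locating the validity bit for any element is pure index
arithmetic. A sketch (assuming the flat layout described above; `elements_per_tensor` is the product
of the shape):

```rust
/// Bit position, within the flat validity bitmap, of element `e` of the
/// tensor at position `t` in the tensor array.
fn validity_bit(t: usize, e: usize, elements_per_tensor: usize) -> usize {
    t * elements_per_tensor + e
}
```
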
Note that this design is specifically for a dense tensor. A sparse tensor would likely need a
different representation (or a different storage type) in order to compress better, likely `List` or
`ListView`, since those can compress runs of nulls very well.

### Metadata

Theoretically, we only need the dimensions of the tensor to have a useful Tensor type. However, we
likely also want two other pieces of information, the dimension names and the permutation order,
mirroring Arrow's canonical
[Fixed Shape Tensor](https://arrow.apache.org/docs/format/CanonicalExtensions.html#fixed-shape-tensor)
extension type.

Here is what the metadata of an extension Tensor type in Vortex will look like (in Rust):

```rust
/// Metadata for a [`Tensor`] extension type.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct TensorMetadata {
    /// The shape of the tensor.
    ///
    /// The shape is always defined over row-major storage. May be empty (0D scalar tensor) or
    /// contain dimensions of size 0 (degenerate tensor).
    shape: Vec<usize>,

    /// Optional names for each dimension. Each name corresponds to a dimension in the `shape`.
    ///
    /// If names exist, there must be an equal number of names to dimensions.
    dim_names: Option<Vec<String>>,

    /// The permutation of the tensor's dimensions, mapping each logical dimension to its
    /// corresponding physical dimension: `permutation[logical] = physical`.
    ///
    /// If this is `None`, then the logical and physical layout are equal, and the permutation is
    /// in-order `[0, 1, ..., N-1]`.
    permutation: Option<Vec<usize>>,
}
```

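The doc comments above imply a handful of invariants. As a rough illustration, validation might look
like the following (a hypothetical helper; the method name and error type are placeholders):

```rust
impl TensorMetadata {
    /// Checks the invariants described in the field docs. `String` stands in
    /// for whatever error type the real implementation would use.
    fn validate(&self) -> Result<(), String> {
        if let Some(names) = &self.dim_names {
            if names.len() != self.shape.len() {
                return Err("dim_names must have one name per dimension".into());
            }
        }
        if let Some(perm) = &self.permutation {
            if perm.len() != self.shape.len() {
                return Err("permutation must have one entry per dimension".into());
            }
            // Every physical dimension must appear exactly once.
            let mut seen = vec![false; perm.len()];
            for &p in perm {
                if p >= perm.len() || seen[p] {
                    return Err("permutation must be a bijection over 0..N".into());
                }
                seen[p] = true;
            }
        }
        Ok(())
    }
}
```
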
#### Stride

The stride of a tensor defines the number of elements to skip in memory to move one step along each
dimension. Rather than storing strides explicitly as metadata, we can efficiently derive them from
the shape and permutation. This is possible because the backing memory is always contiguous.

For a row-major tensor with shape `d = [d_0, d_1, ..., d_{n-1}]` and no permutation, the strides
are:

```
stride[n-1] = 1    (innermost dimension always has stride 1)
stride[i]   = d[i+1] * stride[i+1]
            = d[i+1] * d[i+2] * ... * d[n-1]
```

For example, a tensor with shape `[2, 3, 4]` and no permutation has strides `[12, 4, 1]`: moving one
step along dimension 0 skips 12 elements, along dimension 1 skips 4, and along dimension 2 skips 1.
The element at index `[i, j, k]` is located at memory offset `12*i + 4*j + k`.

When a permutation is present, the logical strides are simply the row-major strides permuted
accordingly. Continuing the `[2, 3, 4]` example with row-major strides `[12, 4, 1]`, applying the
permutation `[2, 0, 1]` yields logical strides `[1, 12, 4]`. This reorders which dimensions are
contiguous in memory without copying any data.

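A compact sketch of both derivations (helper names are ours, matching the formulas above; `main`
checks the worked example):

```rust
/// Row-major strides for a contiguous tensor with the given physical shape.
fn row_major_strides(shape: &[usize]) -> Vec<usize> {
    let mut strides = vec![1usize; shape.len()];
    // Walk outward from the innermost dimension: stride[i] = d[i+1] * stride[i+1].
    for i in (0..shape.len().saturating_sub(1)).rev() {
        strides[i] = shape[i + 1] * strides[i + 1];
    }
    strides
}

/// Logical strides under a permutation, where `permutation[logical] = physical`.
fn logical_strides(physical: &[usize], permutation: &[usize]) -> Vec<usize> {
    permutation.iter().map(|&p| physical[p]).collect()
}

fn main() {
    let physical = row_major_strides(&[2, 3, 4]);
    assert_eq!(physical, vec![12, 4, 1]);
    assert_eq!(logical_strides(&physical, &[2, 0, 1]), vec![1, 12, 4]);
    // Element [i, j, k] of the unpermuted tensor lives at offset 12*i + 4*j + k.
}
```
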
### Conversions

#### Arrow

Since our storage type and metadata are designed to match Arrow's Fixed Shape Tensor canonical
extension type, conversion to and from Arrow is zero-copy (for tensors with at least one dimension).
The `FixedSizeList` backing memory, shape, dimension names, and permutation all map directly between
the two representations.

#### NumPy and PyTorch

Libraries like NumPy and PyTorch store strides as an independent, first-class field on their tensor
objects. This allows them to represent non-contiguous views of memory.

For example, slicing every other row of a matrix produces a view with a doubled row stride, sharing
memory with the original without copying. However, this means that non-contiguous tensors can appear
anywhere, and kernels must handle arbitrary stride patterns. In PyTorch, many operations reportedly
require calling `.contiguous()` before proceeding.

Since Vortex tensors are always contiguous, we can always zero-copy _to_ NumPy and PyTorch, as both
libraries can construct a view from a pointer, shape, and strides. Going the other direction, we can
only zero-copy _from_ NumPy/PyTorch tensors that are already contiguous.

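The import gate is easy to state: an external tensor can be borrowed without copying exactly when
its strides are the row-major strides derived from its shape. A simplified sketch (reusing
`row_major_strides` from above; strides here are in elements, whereas NumPy reports them in bytes):

```rust
/// True if a (shape, strides) pair describes C-contiguous memory, i.e. the
/// layout Vortex tensors require for zero-copy import. A production check
/// would also ignore the strides of size-0 and size-1 dimensions.
fn is_c_contiguous(shape: &[usize], strides: &[usize]) -> bool {
    strides == row_major_strides(shape).as_slice()
}
```
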
Our proposed design for Vortex Tensors will handle non-contiguous operations differently than the
Python libraries. Rather than mutating strides to create non-contiguous views, operations like
slicing, indexing, and `.contiguous()` (after a permutation) would be expressed as lazy
`Expression`s over the tensor.

These expressions describe the operation without materializing it, and when evaluated, they produce
a new contiguous tensor. This fits naturally into Vortex's existing lazy compute system, where
compute is deferred and composed rather than eagerly applied.

The exact mechanism for defining expressions over extension types is still being designed (see
[RFC #0005](https://github.com/vortex-data/rfcs/pull/5)), but the intent is that tensor-specific
operations like axis slicing, indexing, and reshaping would be custom expressions registered for the
tensor extension type.

### Edge Cases: 0D and Size-0 Dimensions

We will support two edge cases that arise naturally from the tensor model. Recall that the number of
elements in a tensor is the product of its shape dimensions, and that the
[empty product](https://en.wikipedia.org/wiki/Empty_product) is 1 (the multiplicative identity).

#### 0-dimensional tensors

0D tensors have an empty shape `[]` and contain exactly one element (since the product of no
dimensions is 1). These represent scalar values wrapped in the tensor type. The storage type is
`FixedSizeList<p, 1>` (which is identical to a flat `PrimitiveArray` or `DecimalArray`).

#### Size-0 dimensions

Shapes may contain dimensions of size 0 (e.g., `[3, 0, 4]`), which produce tensors with zero
elements (since the product includes a 0 factor). The storage type is a degenerate
`FixedSizeList<p, 0>`, which Vortex already handles well.

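Both cases fall out of the `storage_size` sketch from the storage-type section with no special
handling:

```rust
fn main() {
    assert_eq!(storage_size(&[]), 1);        // 0D tensor: exactly one element
    assert_eq!(storage_size(&[3, 0, 4]), 0); // size-0 dimension: zero elements
}
```
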
#### Compatibility

Both NumPy and PyTorch support these cases. NumPy fully supports 0D arrays with shape `()`, and
dimensions of size 0 are valid (e.g., `np.zeros((3, 0, 4))`). PyTorch supports 0D tensors since
v0.4.0 and also allows size-0 dimensions.

Arrow's Fixed Shape Tensor spec, however, requires at least one dimension (`ndim >= 1`), so 0D
tensors would need special handling during Arrow conversion (we would likely just panic).

### Compression

Since the storage type is `FixedSizeList` over numeric types, Vortex's existing encodings (like ALP,
FastLanes, etc.) will be applied to the flattened primitive buffer transparently.

However, there may be tensor-specific compression opportunities we could take advantage of. We will
leave this as an open question.

### Scalar Representation

Once we add the `ScalarValue::Array` variant (see tracking issue
[vortex#6771](https://github.com/vortex-data/vortex/issues/6771)), we can easily pass around tensors
as `ArrayRef` scalars as well as lazily computed slices.

The `ExtVTable` also requires specifying an associated `NativeValue<'a>` Rust type that an extension
scalar can be unpacked into. We will want a `NativeTensor<'a>` type that references the backing
memory of the Tensor, and we can add useful operations to that type.

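As a rough illustration of the shape such a type could take (entirely hypothetical; the real
`NativeTensor<'a>` may be structured differently, and the stride helpers sketched earlier are
reused):

```rust
/// A borrowed, typed view over one tensor's backing memory.
pub struct NativeTensor<'a, T> {
    /// Flat, contiguous, row-major element buffer.
    data: &'a [T],
    /// Physical (row-major) shape.
    shape: &'a [usize],
    /// Logical-to-physical dimension mapping, if any.
    permutation: Option<&'a [usize]>,
}

impl<'a, T: Copy> NativeTensor<'a, T> {
    /// Element at a logical index, resolved through derived strides.
    pub fn get(&self, index: &[usize]) -> T {
        let physical = row_major_strides(self.shape);
        let strides = match self.permutation {
            Some(perm) => logical_strides(&physical, perm),
            None => physical,
        };
        let offset: usize = index.iter().zip(&strides).map(|(i, s)| i * s).sum();
        self.data[offset]
    }
}
```
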
## Compatibility

Since this is a new type built on an existing canonical type (`FixedSizeList`), there should be no
compatibility concerns.

## Drawbacks

- **Fixed shape only**: This design only supports tensors where every element in the array has the
  same shape. Variable-shape tensors (ragged arrays) are out of scope and would require a different
  type entirely.
- **Yet another crate**: We will likely implement this in a `vortex-tensor` crate, which means even
  more surface area than we already have.

## Alternatives

### Nested `FixedSizeList`

Rather than a flat `FixedSizeList` with metadata, we could represent tensors as nested
`FixedSizeList` types (e.g., `FixedSizeList<FixedSizeList<FixedSizeList<i32, 4>, 3>, 2>` for a
`[2, 3, 4]` tensor). This has several disadvantages:

- Each nesting level introduces its own validity bitmap, even though sub-dimensional nullability is
  not meaningful for tensors. This wastes space and complicates null-handling logic.
- This does not match Arrow's canonical Fixed Shape Tensor type, making zero-copy conversion
  impossible.
- Expressions would need to be aware of the nesting depth, and operations like transpose or reshape
  would require restructuring the type itself rather than updating metadata.

### Do nothing

Users could continue to use `FixedSizeList` directly with out-of-band shape metadata. This works
for simple storage, but as discussed in the [motivation](#motivation), it provides no type-level
support for tensor operations and makes interoperability with tensor libraries awkward.

## Prior Art

_Note: This section was Claude-researched._

### Columnar formats

- **Apache Arrow** defines a
  [Fixed Shape Tensor](https://arrow.apache.org/docs/format/CanonicalExtensions.html#fixed-shape-tensor)
  canonical extension type. Our design closely follows Arrow's approach: a flat `FixedSizeList`
  storage type with shape, dimension names, and permutation metadata. Arrow also defines a
  [Variable Shape Tensor](https://arrow.apache.org/docs/format/CanonicalExtensions.html#variable-shape-tensor)
  extension type for ragged tensors, which could inform future work.
- **Lance** delegates entirely to Arrow's type system, including extension types. Arrow extension
  metadata (and therefore tensor metadata) is preserved end-to-end through Lance's storage layer,
  which validates the approach of building tensor semantics as an extension on top of `FixedSizeList`
  storage.
- **Parquet** has no native `FixedSizeList` logical type. Arrow's `FixedSizeList` is stored as a
  regular `LIST` in Parquet, which adds conversion overhead via repetition levels. There is active
  discussion about introducing `FixedSizeList` as a Parquet logical type, partly motivated by
  tensor and embedding workloads.

### Database systems

- **DuckDB** has a native `ARRAY` type (fixed-size list) but no dedicated tensor type. Community
  discussions have proposed adding one, noting that nested `ARRAY` types can simulate
  multi-dimensional arrays but lack tensor-specific operations.
- **DataFusion** uses Arrow's type system directly and has no dedicated tensor type. There is open
  discussion about a logical type layer that could support extension types as first-class citizens.

### Tensor libraries

- **NumPy** and **PyTorch** both represent tensors as contiguous (or non-contiguous) memory with
  shape and stride metadata. Our design is a subset of this model — we always require contiguous
  memory and derive strides from shape and permutation, as discussed in the
  [conversions](#conversions) section.
- **[ndindex](https://quansight-labs.github.io/ndindex/index.html)** is a Python library that
  provides a unified interface for representing and manipulating NumPy array indices (slices,
  integers, ellipses, boolean arrays, etc.). It supports operations like canonicalization, shape
  inference, and re-indexing onto array chunks. We will want to implement tensor compute expressions
  in Vortex that are similar to the operations ndindex provides — for example, computing the result
  shape of a slice or translating a logical index into a physical offset.

### Academic work

- **TACO (Tensor Algebra Compiler)** separates the tensor storage format from the tensor program.
  Each dimension can independently be specified as dense or sparse, and dimensions can be reordered.
  The Vortex approach of storing tensors as flat contiguous memory with a permutation is one specific
  point in TACO's format space (all dimensions dense, with a specific dimension ordering).

## Unresolved Questions

- Are two tensors with different permutations but the same logical values considered equal? This
  affects deduplication and comparisons. The metadata would differ, yet every logical value would be
  equal, so treating the two tensors as unequal seems wrong.
- Are there potential tensor-specific compression schemes we can take advantage of?

## Future Possibilities

#### Variable-shape tensors

Arrow defines a
[Variable Shape Tensor](https://arrow.apache.org/docs/format/CanonicalExtensions.html#variable-shape-tensor)
extension type for arrays where each tensor can have a different shape. This would enable workloads
like batched sequences of different lengths.

#### Sparse tensors

A similar Sparse Tensor type could use `List` or `ListView` as its storage type to efficiently
represent tensors with many null or zero elements, as noted in the [validity](#validity) section.

#### Tensor-specific encodings

Beyond general-purpose compression, encodings tailored to tensor data (e.g., exploiting spatial
locality across dimensions) could improve compression ratios for specific workloads.

#### ndindex-style compute expressions

As the extension type expression system matures, we can implement a rich set of tensor indexing and
slicing operations inspired by [ndindex](https://quansight-labs.github.io/ndindex/index.html),
including slice canonicalization, shape inference, and chunk-level re-indexing.