Commit 7a6f9c0

Vortex Tensor (#24)

[Rendered](https://github.com/vortex-data/rfcs/blob/ct/tensor/proposed/0024-tensor.md) We would like to add a Tensor type to Vortex as an extension over `FixedSizeList`. This RFC proposes the design of a fixed-shape tensor with contiguous backing memory. Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>

1 parent a08d67c commit 7a6f9c0

1 file changed: proposed/0024-tensor.md (348 additions, 0 deletions)
- Start Date: 2026-03-04
- Tracking Issue: [vortex-data/vortex#0000](https://github.com/vortex-data/vortex/issues/0000)
## Summary

We would like to add a Tensor type to Vortex as an extension over `FixedSizeList`. This RFC proposes
the design of a fixed-shape tensor with contiguous backing memory.

## Motivation

#### Tensors in the wild

Tensors are multi-dimensional (n-dimensional) arrays that generalize vectors (1D) and matrices (2D)
to arbitrary dimensions. They are quite common in ML/AI and scientific computing applications. To
name just a few examples:

- Image or video data stored as `height x width x channels`
- Multi-dimensional sensor or time-series data
- Embedding vectors from language models and recommendation systems
#### Tensors in Vortex

In the current version of Vortex, there are two ways to represent fixed-shape tensors using the
`FixedSizeList` `DType`, and neither is satisfactory.

The simplest approach is to flatten the tensor into a single `FixedSizeList<n>` whose size is the
product of all dimensions (this is what Apache Arrow does). However, this discards shape information
entirely: a `2x3` matrix and a `3x2` matrix would both become `FixedSizeList<6>`. Shape metadata
must be stored separately, and any dimension-aware operation (slicing along an axis, transposing,
etc.) reduces to manual index arithmetic with no type-level guarantees.

The alternative is to nest `FixedSizeList` types, e.g., `FixedSizeList<FixedSizeList<n>, m>` for a
matrix. This preserves some structure, but becomes unwieldy for higher-dimensional tensors.
Axis-specific slicing or indexing on individual tensors (tensor scalars, not tensor arrays) would
require custom expressions aware of the specific nesting depth, rather than operating on a single,
uniform tensor type.

Additionally, reshaping requires restructuring the entire nested type, and operations like
transposes would be difficult to implement correctly.

Beyond these structural issues, neither approach stores shape and stride metadata explicitly, which
makes interoperability awkward with external tensor libraries (NumPy, PyTorch, etc.) that expect
contiguous memory accompanied by this metadata.
Thus, we propose a dedicated extension type that encapsulates tensor semantics (shape, strides,
dimension-aware operations) on top of contiguous, row-major (C-style) backing memory.

## Design

Since the design of extension types has not been fully settled (see
[RFC #0005](https://github.com/vortex-data/rfcs/pull/5)), the complete design of tensors cannot be
described here. However, we know enough to present the general idea.
### Storage Type

Extension types in Vortex require defining a canonical storage type that represents what the
extension array looks like when it is canonicalized. For tensors, we will want this storage type to
be a `FixedSizeList<p, s>`, where `p` is a numeric type (like `u8`, `f64`, or a decimal type), and
where `s` is the product of all dimensions of the tensor.

For example, if we want to represent a tensor of `i32` with dimensions `[2, 3, 4]`, the storage type
for this tensor would be `FixedSizeList<i32, 24>` since `2 x 3 x 4 = 24`.
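To make the shape-to-storage-size mapping concrete, here is a minimal sketch in plain Rust
(`storage_len` is illustrative, not a Vortex API); note that it also covers the 0D and size-0 edge
cases discussed later:

```rust
/// Number of elements in the flat `FixedSizeList` backing a tensor: the
/// product of all dimensions.
fn storage_len(shape: &[usize]) -> usize {
    shape.iter().product()
}

fn main() {
    assert_eq!(storage_len(&[2, 3, 4]), 24); // FixedSizeList<i32, 24>
    assert_eq!(storage_len(&[]), 1); // 0D tensor: the empty product is 1
    assert_eq!(storage_len(&[3, 0, 4]), 0); // degenerate size-0 tensor
}
```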
This is equivalent to the design of Arrow's canonical Tensor extension type. For discussion on why
we choose not to represent tensors as nested FSLs (for example
`FixedSizeList<FixedSizeList<FixedSizeList<i32, 4>, 3>, 2>`), see the [alternatives](#alternatives)
section.
### Element Type

We restrict tensor element types to `Primitive` and `Decimal`. Tensors are fundamentally about dense
numeric computation, and operations like transpose, reshape, and slicing rely on uniform, fixed-size
elements whose offsets are computable from strides.

Variable-size types (like strings) would break this model entirely. `Bool` is excluded as well
because Vortex bit-packs boolean arrays, which conflicts with byte-level stride arithmetic. This is
similar to PyTorch, which also restricts tensors to a fixed set of fixed-width numeric dtypes.

Theoretically, we could allow more element types in the future, but this should remain a very low
priority.
### Validity

We define two layers of nullability for tensors: the tensor itself may be null (within a tensor
array), and individual elements within a tensor may be null. However, we do not support nulling out
entire sub-dimensions of a tensor (e.g., marking a whole row or slice as null).

The validity bitmap is flat (one bit per element) and follows the same contiguous layout as the
backing data (just like `FixedSizeList`). This keeps stride-based access straightforward while still
allowing sparse values within an otherwise dense tensor.
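To illustrate, a minimal sketch of a validity lookup under this flat layout (assuming an Arrow-style
least-significant-bit-first bitmap; `is_valid` is a hypothetical helper, not a Vortex API):

```rust
/// Whether the element at flat buffer offset `offset` is valid, given a flat,
/// one-bit-per-element validity bitmap (LSB-first, as in Arrow).
fn is_valid(bitmap: &[u8], offset: usize) -> bool {
    (bitmap[offset / 8] >> (offset % 8)) & 1 == 1
}
```

The `offset` here is the same stride-derived flat offset used to address the backing data, as
described in the [metadata](#metadata) section below.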
Note that this design is specifically for a dense tensor. A sparse tensor would likely need a
different representation (or a different storage type) in order to compress better (likely `List` or
`ListView`, since those can compress runs of nulls very well).
### Metadata

Theoretically, we only need the dimensions of the tensor to have a useful Tensor type. However, we
likely also want two other pieces of information: the dimension names and the permutation order.
This mimics Arrow's
[Fixed Shape Tensor](https://arrow.apache.org/docs/format/CanonicalExtensions.html#fixed-shape-tensor)
canonical extension type.

Here is what the metadata of an extension Tensor type in Vortex will look like (in Rust):
```rust
/// Metadata for a [`Tensor`] extension type.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct TensorMetadata {
    /// The shape of the tensor.
    ///
    /// The shape is always defined over row-major storage. May be empty (0D scalar tensor) or
    /// contain dimensions of size 0 (degenerate tensor).
    shape: Vec<usize>,

    /// Optional names for each dimension. Each name corresponds to a dimension in the `shape`.
    ///
    /// If names exist, there must be exactly one name per dimension.
    dim_names: Option<Vec<String>>,

    /// The permutation of the tensor's dimensions, mapping each logical dimension to its
    /// corresponding physical dimension: `permutation[logical] = physical`.
    ///
    /// If this is `None`, then the logical and physical layout are equal, and the permutation is
    /// in-order `[0, 1, ..., N-1]`.
    permutation: Option<Vec<usize>>,
}
```
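These invariants are cheap to check at construction time. As a sketch, a hypothetical `try_new`
constructor (the actual constructor API is not specified by this RFC):

```rust
impl TensorMetadata {
    /// Hypothetical constructor that enforces the invariants documented above.
    pub fn try_new(
        shape: Vec<usize>,
        dim_names: Option<Vec<String>>,
        permutation: Option<Vec<usize>>,
    ) -> Result<Self, String> {
        if let Some(names) = &dim_names {
            if names.len() != shape.len() {
                return Err("dim_names must have exactly one name per dimension".to_string());
            }
        }
        if let Some(perm) = &permutation {
            // A valid permutation mentions each physical dimension exactly once.
            let mut seen = vec![false; shape.len()];
            let valid = perm.len() == shape.len()
                && perm
                    .iter()
                    .all(|&p| p < shape.len() && !std::mem::replace(&mut seen[p], true));
            if !valid {
                return Err("permutation must be a permutation of 0..shape.len()".to_string());
            }
        }
        Ok(Self { shape, dim_names, permutation })
    }
}
```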
#### Stride

The stride of a tensor defines the number of elements to skip in memory to move one step along each
dimension. Rather than storing strides explicitly as metadata, we can efficiently derive them from
the shape and permutation. This is possible because the backing memory is always contiguous.

For a row-major tensor with shape `d = [d_0, d_1, ..., d_{n-1}]` and no permutation, the strides
are:

```
stride[n-1] = 1                              (the innermost dimension always has stride 1)
stride[i]   = d[i+1] * stride[i+1]
            = d[i+1] * d[i+2] * ... * d[n-1]
```
For example, a tensor with shape `[2, 3, 4]` and no permutation has strides `[12, 4, 1]`: moving one
step along dimension 0 skips 12 elements, along dimension 1 skips 4, and along dimension 2 skips 1.
The element at index `[i, j, k]` is located at memory offset `12*i + 4*j + k`.

When a permutation is present, the logical strides are simply the row-major strides permuted
accordingly. Continuing the `[2, 3, 4]` example with row-major strides `[12, 4, 1]`, applying the
permutation `[2, 0, 1]` yields logical strides `[1, 12, 4]`. This reorders which dimensions are
contiguous in memory without copying any data.
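A minimal sketch of this derivation, reproducing the numbers above (the helper names are ours for
illustration, not a Vortex API):

```rust
/// Row-major strides (in elements) for a contiguous tensor with the given shape.
fn row_major_strides(shape: &[usize]) -> Vec<usize> {
    let mut strides = vec![1; shape.len()];
    for i in (0..shape.len().saturating_sub(1)).rev() {
        strides[i] = shape[i + 1] * strides[i + 1];
    }
    strides
}

/// Logical strides under a permutation, where `permutation[logical] = physical`.
fn logical_strides(physical: &[usize], permutation: &[usize]) -> Vec<usize> {
    permutation.iter().map(|&p| physical[p]).collect()
}

fn main() {
    let physical = row_major_strides(&[2, 3, 4]);
    assert_eq!(physical, [12, 4, 1]);
    // Element [i, j, k] lives at flat offset 12*i + 4*j + k.
    assert_eq!(logical_strides(&physical, &[2, 0, 1]), [1, 12, 4]);
}
```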
### Conversions

#### Arrow

Since our storage type and metadata are designed to match Arrow's Fixed Shape Tensor canonical
extension type, conversion to and from Arrow is zero-copy (for tensors with at least one dimension).
The `FixedSizeList` backing memory, shape, dimension names, and permutation all map directly between
the two representations.
#### NumPy and PyTorch

Libraries like NumPy and PyTorch store strides as an independent, first-class field on their tensor
objects. This allows them to represent non-contiguous views of memory.

For example, slicing every other row of a matrix produces a view with a doubled row stride, sharing
memory with the original without copying. However, this means that non-contiguous tensors can appear
anywhere, and kernels must handle arbitrary stride patterns. Many PyTorch operations (`Tensor.view`,
for example) require a `.contiguous()` call before they can proceed.

Since Vortex tensors are always contiguous, we can always zero-copy _to_ NumPy and PyTorch, since
both libraries can construct a view from a pointer, a shape, and strides. Going the other direction,
we can only zero-copy _from_ NumPy/PyTorch tensors that are already contiguous.
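One detail when exporting: NumPy expresses strides in bytes, while our derived strides (like
PyTorch's) count elements, so a zero-copy export to NumPy scales the strides by the element width.
A minimal sketch (`byte_strides` is our own illustrative helper):

```rust
/// Convert element strides to the byte strides a NumPy-style consumer expects.
fn byte_strides(element_strides: &[usize], element_width: usize) -> Vec<usize> {
    element_strides.iter().map(|s| s * element_width).collect()
}

fn main() {
    // An i32 tensor with shape [2, 3, 4]: element strides [12, 4, 1]
    // become byte strides [48, 16, 4] for a NumPy view.
    assert_eq!(byte_strides(&[12, 4, 1], 4), [48, 16, 4]);
}
```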
Our proposed design for Vortex Tensors will handle non-contiguous operations differently than the
Python libraries. Rather than mutating strides to create non-contiguous views, operations like
slicing, indexing, and `.contiguous()` (after a permutation) would be expressed as lazy
`Expression`s over the tensor.

These expressions describe the operation without materializing it, and when evaluated, they produce
a new contiguous tensor. This fits naturally into Vortex's existing lazy compute system, where
compute is deferred and composed rather than eagerly applied.

The exact mechanism for defining expressions over extension types is still being designed (see
[RFC #0005](https://github.com/vortex-data/rfcs/pull/5)), but the intent is that tensor-specific
operations like axis slicing, indexing, and reshaping would be custom expressions registered for the
tensor extension type.
### Edge Cases: 0D and Size-0 Dimensions

We will support two edge cases that arise naturally from the tensor model. Recall that the number of
elements in a tensor is the product of its shape dimensions, and that the
[empty product](https://en.wikipedia.org/wiki/Empty_product) is 1 (the multiplicative identity).

#### 0-dimensional tensors

0D tensors have an empty shape `[]` and contain exactly one element (since the product of no
dimensions is 1). These represent scalar values wrapped in the tensor type. The storage type is
`FixedSizeList<p, 1>` (which is identical to a flat `PrimitiveArray` or `DecimalArray`).

#### Size-0 dimensions

Shapes may contain dimensions of size 0 (e.g., `[3, 0, 4]`), which produce tensors with zero
elements (since the product includes a 0 factor). The storage type is a degenerate
`FixedSizeList<p, 0>`, which Vortex already handles well.

#### Compatibility

Both NumPy and PyTorch support these cases. NumPy fully supports 0D arrays with shape `()`, and
dimensions of size 0 are valid (e.g., `np.zeros((3, 0, 4))`). PyTorch has supported 0D tensors since
v0.4.0 and also allows size-0 dimensions.

Arrow's Fixed Shape Tensor spec, however, requires at least one dimension (`ndim >= 1`), so 0D
tensors would need special handling during Arrow conversion (we would likely just panic).
### Compression

Since the storage type is `FixedSizeList` over numeric types, Vortex's existing encodings (like ALP,
FastLanes, etc.) will be applied to the flattened primitive buffer transparently.

However, there may be tensor-specific compression opportunities we could take advantage of. We will
leave this as an open question.
### Scalar Representation

Once we add the `ScalarValue::Array` variant (see tracking issue
[vortex#6771](https://github.com/vortex-data/vortex/issues/6771)), we can easily pass around tensors
as `ArrayRef` scalars as well as lazily computed slices.

The `ExtVTable` also requires specifying an associated `NativeValue<'a>` Rust type that an extension
scalar can be unpacked into. We will want a `NativeTensor<'a>` type that references the backing
memory of the Tensor, and we can add useful operations to that type.
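As a sketch of what such a type might look like (all field and method names here are hypothetical;
the real design depends on the `ExtVTable` work):

```rust
/// Illustrative shape of a `NativeTensor` that borrows the backing memory.
pub struct NativeTensor<'a> {
    /// The contiguous, row-major backing buffer.
    data: &'a [u8],
    /// Logical strides, in elements, derived from shape and permutation.
    strides: Vec<usize>,
    /// Width of one element in bytes.
    elem_width: usize,
}

impl<'a> NativeTensor<'a> {
    /// One "useful operation": the raw bytes of the element at a logical index,
    /// using the same stride arithmetic described in the metadata section.
    pub fn element_bytes(&self, index: &[usize]) -> &'a [u8] {
        let offset: usize = index.iter().zip(&self.strides).map(|(i, s)| i * s).sum();
        let start = offset * self.elem_width;
        &self.data[start..start + self.elem_width]
    }
}
```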
## Compatibility

Since this is a new type built on an existing canonical type (`FixedSizeList`), there should be no
compatibility concerns.
## Drawbacks

- **Fixed shape only**: This design only supports tensors where every element in the array has the
  same shape. Variable-shape tensors (ragged arrays) are out of scope and would require a different
  type entirely.
- **Yet another crate**: We will likely implement this in a `vortex-tensor` crate, which means even
  more surface area than we already have.
## Alternatives

### Nested `FixedSizeList`

Rather than a flat `FixedSizeList` with metadata, we could represent tensors as nested
`FixedSizeList` types (e.g., `FixedSizeList<FixedSizeList<FixedSizeList<i32, 4>, 3>, 2>` for a
`[2, 3, 4]` tensor). This has several disadvantages:

- Each nesting level introduces its own validity bitmap, even though sub-dimensional nullability is
  not meaningful for tensors. This wastes space and complicates null-handling logic.
- This does not match Arrow's canonical Fixed Shape Tensor type, making zero-copy conversion
  impossible.
- Expressions would need to be aware of the nesting depth, and operations like transpose or reshape
  would require restructuring the type itself rather than updating metadata.

### Do nothing

Users could continue to use `FixedSizeList` directly with out-of-band shape metadata. This works
for simple storage, but as discussed in the [motivation](#motivation), it provides no type-level
support for tensor operations and makes interoperability with tensor libraries awkward.
## Prior Art

_Note: This section was Claude-researched._

### Columnar formats

- **Apache Arrow** defines a
  [Fixed Shape Tensor](https://arrow.apache.org/docs/format/CanonicalExtensions.html#fixed-shape-tensor)
  canonical extension type. Our design closely follows Arrow's approach: a flat `FixedSizeList`
  storage type with shape, dimension names, and permutation metadata. Arrow also defines a
  [Variable Shape Tensor](https://arrow.apache.org/docs/format/CanonicalExtensions.html#variable-shape-tensor)
  extension type for ragged tensors, which could inform future work.
- **Lance** delegates entirely to Arrow's type system, including extension types. Arrow extension
  metadata (and therefore tensor metadata) is preserved end-to-end through Lance's storage layer,
  which validates the approach of building tensor semantics as an extension on top of
  `FixedSizeList` storage.
- **Parquet** has no native `FixedSizeList` logical type. Arrow's `FixedSizeList` is stored as a
  regular `LIST` in Parquet, which adds conversion overhead via repetition levels. There is active
  discussion about introducing `FixedSizeList` as a Parquet logical type, partly motivated by
  tensor and embedding workloads.

### Database systems

- **DuckDB** has a native `ARRAY` type (fixed-size list) but no dedicated tensor type. Community
  discussions have proposed adding one, noting that nested `ARRAY` types can simulate
  multi-dimensional arrays but lack tensor-specific operations.
- **DataFusion** uses Arrow's type system directly and has no dedicated tensor type. There is open
  discussion about a logical type layer that could support extension types as first-class citizens.

### Tensor libraries

- **NumPy** and **PyTorch** both represent tensors as contiguous (or non-contiguous) memory with
  shape and stride metadata. Our design is a subset of this model: we always require contiguous
  memory and derive strides from shape and permutation, as discussed in the
  [conversions](#conversions) section.
- **[ndindex](https://quansight-labs.github.io/ndindex/index.html)** is a Python library that
  provides a unified interface for representing and manipulating NumPy array indices (slices,
  integers, ellipses, boolean arrays, etc.). It supports operations like canonicalization, shape
  inference, and re-indexing onto array chunks. We will want to implement tensor compute expressions
  in Vortex that are similar to the operations ndindex provides, for example computing the result
  shape of a slice or translating a logical index into a physical offset.

### Academic work

- **TACO (Tensor Algebra Compiler)** separates the tensor storage format from the tensor program.
  Each dimension can independently be specified as dense or sparse, and dimensions can be reordered.
  The Vortex approach of storing tensors as flat contiguous memory with a permutation is one
  specific point in TACO's format space (all dimensions dense, with a specific dimension ordering).
## Unresolved Questions

- Are two tensors with different permutations but the same logical values considered equal? This
  affects deduplication and comparisons. The type metadata would differ even though the logical
  tensor values are identical, so it seems strange to say that the tensors are not equal.
- Are there potential tensor-specific compression schemes we can take advantage of?
## Future Possibilities

#### Variable-shape tensors

Arrow defines a
[Variable Shape Tensor](https://arrow.apache.org/docs/format/CanonicalExtensions.html#variable-shape-tensor)
extension type for arrays where each tensor can have a different shape. This would enable workloads
like batched sequences of different lengths.

#### Sparse tensors

A similar Sparse Tensor type could use `List` or `ListView` as its storage type to efficiently
represent tensors with many null or zero elements, as noted in the [validity](#validity) section.

#### Tensor-specific encodings

Beyond general-purpose compression, encodings tailored to tensor data (e.g., exploiting spatial
locality across dimensions) could improve compression ratios for specific workloads.

#### ndindex-style compute expressions

As the extension type expression system matures, we can implement a rich set of tensor indexing and
slicing operations inspired by [ndindex](https://quansight-labs.github.io/ndindex/index.html),
including slice canonicalization, shape inference, and chunk-level re-indexing.
