Merged
Changes from 2 commits
117 changes: 71 additions & 46 deletions proposed/0024-tensor.md → accepted/0024-tensor.md
@@ -3,8 +3,8 @@

## Summary

We would like to add a Tensor type to Vortex as an extension over `FixedSizeList`. This RFC proposes
the design of a fixed-shape tensor with contiguous backing memory.
We would like to add a `FixedShapeTensor` type to Vortex as an extension over `FixedSizeList`. This
RFC proposes the design of a fixed-shape tensor with contiguous backing memory.

## Motivation

@@ -18,7 +18,7 @@ name just a few examples:
- Multi-dimensional sensor or time-series data
- Embedding vectors from language models and recommendation systems

#### Tensors in Vortex
#### Fixed-shape tensors in Vortex

In the current version of Vortex, there are two ways to represent fixed-shape tensors using the
`FixedSizeList` `DType`, and neither seems satisfactory.
@@ -54,44 +54,44 @@ fully described here. However, we do know enough that we can present the general
### Storage Type

Extension types in Vortex require defining a canonical storage type that represents what the
extension array looks like when it is canonicalized. For tensors, we will want this storage type to
be a `FixedSizeList<p, s>`, where `p` is a numeric type (like `u8`, `f64`, or a decimal type), and
where `s` is the product of all dimensions of the tensor.
extension array looks like when it is canonicalized. For fixed-shape tensors, we will want this
storage type to be a `FixedSizeList<p, s>`, where `p` is a primitive type (like `u8`, `f64`, etc.),
and where `s` is the product of all dimensions of the tensor.

For example, if we want to represent a tensor of `i32` with dimensions `[2, 3, 4]`, the storage type
for this tensor would be `FixedSizeList<i32, 24>` since `2 x 3 x 4 = 24`.

This is equivalent to the design of Arrow's canonical Tensor extension type. For discussion on why
we choose not to represent tensors as nested FSLs (for example
This is equivalent to the design of Arrow's canonical Fixed Shape Tensor extension type. For
discussion on why we choose not to represent tensors as nested FSLs (for example
`FixedSizeList<FixedSizeList<FixedSizeList<i32, 2>, 3>, 4>`), see the [alternatives](#alternatives)
section.
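
As a minimal sketch of the size rule above (the function name `storage_list_size` is hypothetical and not part of the Vortex API), the list size `s` is the product of the shape's dimensions. The overflow check is an added assumption about how an implementation might guard against very large shapes:

```rust
/// Number of elements in the flat `FixedSizeList` that backs one tensor:
/// the product of all dimensions of the shape.
///
/// Hypothetical helper for illustration only; returns `None` if the
/// product overflows `usize`.
fn storage_list_size(shape: &[usize]) -> Option<usize> {
    shape
        .iter()
        .try_fold(1usize, |acc, &dim| acc.checked_mul(dim))
}

fn main() {
    // A tensor with dimensions [2, 3, 4] is stored as a list of 24 elements.
    assert_eq!(storage_list_size(&[2, 3, 4]), Some(24));
    // Any size-0 dimension makes the product 0.
    assert_eq!(storage_list_size(&[2, 0, 4]), Some(0));
    // The empty product is 1.
    assert_eq!(storage_list_size(&[]), Some(1));
}
```

With a shape of `[2, 3, 4]` this yields 24, matching the `FixedSizeList<i32, 24>` example above.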

### Element Type

We restrict tensor element types to `Primitive` and `Decimal`. Tensors are fundamentally about dense
numeric computation, and operations like transpose, reshape, and slicing rely on uniform, fixed-size
We restrict tensor element types to `Primitive`. Tensors are fundamentally about dense numeric
computation, and operations like transpose, reshape, and slicing rely on uniform, fixed-size
elements whose offsets are computable from strides.
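
To illustrate what "offsets computable from strides" means, here is a small sketch assuming row-major storage and strides measured in elements rather than bytes. The helper names `row_major_strides` and `flat_offset` are hypothetical and do not come from the RFC:

```rust
/// Row-major strides (in elements) for a shape: the last dimension has
/// stride 1, and each earlier stride is the product of all later dimensions.
/// Hypothetical helper for illustration only.
fn row_major_strides(shape: &[usize]) -> Vec<usize> {
    let mut strides = vec![1usize; shape.len()];
    for axis in (0..shape.len().saturating_sub(1)).rev() {
        strides[axis] = strides[axis + 1] * shape[axis + 1];
    }
    strides
}

/// Position of a multi-dimensional index within the flat backing list.
/// Returns `None` if the index has the wrong rank or is out of bounds.
fn flat_offset(shape: &[usize], index: &[usize]) -> Option<usize> {
    let in_bounds = index.len() == shape.len()
        && index.iter().zip(shape).all(|(&i, &dim)| i < dim);
    if !in_bounds {
        return None;
    }
    let strides = row_major_strides(shape);
    Some(index.iter().zip(&strides).map(|(&i, &s)| i * s).sum::<usize>())
}

fn main() {
    let shape = [2, 3, 4];
    assert_eq!(row_major_strides(&shape), vec![12, 4, 1]);
    // 1 * 12 + 2 * 4 + 3 * 1 = 23, the last of the 24 elements.
    assert_eq!(flat_offset(&shape, &[1, 2, 3]), Some(23));
    assert_eq!(flat_offset(&shape, &[2, 0, 0]), None);
}
```

This arithmetic only works because every element has the same fixed size, which is the property the element type restriction is meant to protect.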

Variable-size types (like strings) would break this model entirely. `Bool` is excluded as well
because Vortex bit-packs boolean arrays, which conflicts with byte-level stride arithmetic. This
matches PyTorch, which also restricts tensors to numeric types.
Variable-size types (like strings) would break this model entirely. `Bool` is excluded because
Vortex bit-packs boolean arrays, which conflicts with byte-level stride arithmetic. `Decimal` is
excluded because there are no fast implementations of tensor operations (e.g., matmul) for
fixed-point types. This matches PyTorch, which also restricts tensors to floating-point and integer
primitive types.

Theoretically, we could allow more element types in the future, but it should remain a very low
priority.
We could allow more element types in the future if a compelling use case arises, but it should
remain a very low priority.

### Validity

We define two layers of nullability for tensors: the tensor itself may be null (within a tensor
array), and individual elements within a tensor may be null. However, we do not support nulling out
entire sub-dimensions of a tensor (e.g., marking a whole row or slice as null).
Nullability exists only at the tensor level: within a tensor array, an individual tensor may be
null, but elements within a tensor may not be. This is because tensor operations like matmul cannot
be efficiently implemented over nullable elements, and most tensor libraries (e.g., PyTorch) do not
support per-element nulls either.
Comment on lines +86 to +89
Collaborator

commenting here but maybe it should go on the previous PR?

IDK how arrow does it, but I don't think that's necessarily true.
Most vectorized compute just runs through null values that are zeroed out. IDK how you would matmul the validity itself, but I think that's a reasonable thing to do.

Contributor

I think interpretation of NULLs is context dependent. If NULL means "there was no data observed at this position" and you're doing a weighted sum of the features, treating NULLs as zero is probably the right choice. The result is indeed the count of what you observed. You can't infer anything about things you did not observe.

On the other hand, if NULL means "there is some data here but for technical reasons it was unrecoverable" and you're doing a linear regression, you probably want to replace NULL by a mean value over some dimension(s). I don't have a good linear regression example, but suppose you flip one hundred coins and record heads as 1 and tails as 0. Suppose further that you lose 10 coins before observing them. If you compute the sum of this vector with NULL as zeros you'll conclude the coins are tails-biased! If you compute the sum of this vector with NULL as the sample mean, you'll have an unbiased estimate of the coin's heads/tails probability.

IMO, matmul, sum, etc. should only be defined on tensors with non-nullable elements. I suppose null elements are fine if they're representable in torch (I think they are not?).

Numpy is able to represent them when you use the catchall-object-dtype, but if you request primitive types it converts them to NaNs.

In [8]: np.array([1., None])
Out[8]: array([1.0, None], dtype=object)

In [9]: np.array([1., None], dtype=float)
Out[9]: array([ 1., nan])

In [10]: np.array([1., None], dtype=np.dtype('f4'))
Out[10]: array([ 1., nan], dtype=float32)


The validity bitmap is flat (one bit per element) and follows the same contiguous layout as the
backing data (just like `FixedSizeList`). This keeps stride-based access straightforward while still
allowing sparse values within an otherwise dense tensor.
Since the storage type is `FixedSizeList`, the validity of the tensor array is inherited from the
`FixedSizeList`'s own validity bitmap (one bit per tensor, not per element).

Note that this design is specifically for a dense tensor. A sparse tensor would likely need to have
a different representation (or different storage type) in order to compress better (likely `List` or
`ListView` since it can compress runs of nulls very well).
This is a restriction we can relax in the future if a compelling use case arises.

### Metadata

@@ -100,12 +100,13 @@ likely also want two other pieces of information, the dimension names and the pe
which mimics the [Arrow Fixed Shape Tensor](https://arrow.apache.org/docs/format/CanonicalExtensions.html#fixed-shape-tensor)
type (which is a Canonical Extension type).

Here is what the metadata of an extension Tensor type in Vortex will look like (in Rust):
Here is what the metadata of the `FixedShapeTensor` extension type in Vortex will look like (in
Rust):

```rust
/// Metadata for a [`Tensor`] extension type.
/// Metadata for a [`FixedShapeTensor`] extension type.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct TensorMetadata {
pub struct FixedShapeTensorMetadata {
/// The shape of the tensor.
///
/// The shape is always defined over row-major storage. May be empty (0D scalar tensor) or
@@ -156,8 +157,8 @@ contiguous in memory without copying any data.

Since our storage type and metadata are designed to match Arrow's Fixed Shape Tensor canonical
extension type, conversion to and from Arrow is zero-copy (for tensors with at least one dimension).
The `FixedSizeList` backing memory, shape, dimension names, and permutation all map directly between
the two representations.
The `FixedSizeList` backing memory, shape, dimension names, and permutation all map directly
between the two representations.

#### NumPy and PyTorch

@@ -169,13 +170,13 @@ memory with the original without copying. However, this means that non-contiguou
anywhere, and kernels must handle arbitrary stride patterns. PyTorch supposedly requires many
operations to call `.contiguous()` before proceeding.

Since Vortex tensors are always contiguous, we can always zero-copy _to_ NumPy and PyTorch since
both libraries can construct a view from a pointer, shape, and strides. Going the other direction,
we can only zero-copy _from_ NumPy/PyTorch tensors that are already contiguous.
Since Vortex fixed-shape tensors are always contiguous, we can always zero-copy _to_ NumPy and
PyTorch since both libraries can construct a view from a pointer, shape, and strides. Going the
other direction, we can only zero-copy _from_ NumPy/PyTorch tensors that are already contiguous.
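
As a rough sketch of how such a view could be described, the following derives the logical shape and strides that would be handed to NumPy or PyTorch alongside the data pointer. It assumes the Arrow Fixed Shape Tensor convention that logical dimension `i` corresponds to physical dimension `permutation[i]`, and it measures strides in elements (NumPy expects bytes, so each stride would be multiplied by the element size). The function names are hypothetical:

```rust
/// Row-major strides (in elements) of the physical, contiguous storage.
/// Hypothetical helper for illustration only.
fn row_major_strides(shape: &[usize]) -> Vec<usize> {
    let mut strides = vec![1usize; shape.len()];
    for axis in (0..shape.len().saturating_sub(1)).rev() {
        strides[axis] = strides[axis + 1] * shape[axis + 1];
    }
    strides
}

/// Logical (shape, strides) of a view over row-major storage.
///
/// Assumption: logical dimension `i` maps to physical dimension
/// `permutation[i]`. Returns `None` if `permutation` is not a valid
/// permutation of `0..shape.len()`.
fn logical_view(shape: &[usize], permutation: &[usize]) -> Option<(Vec<usize>, Vec<usize>)> {
    if permutation.len() != shape.len() {
        return None;
    }
    let mut seen = vec![false; shape.len()];
    for &axis in permutation {
        if axis >= shape.len() || seen[axis] {
            return None;
        }
        seen[axis] = true;
    }
    let physical = row_major_strides(shape);
    let logical_shape: Vec<usize> = permutation.iter().map(|&a| shape[a]).collect();
    let logical_strides: Vec<usize> = permutation.iter().map(|&a| physical[a]).collect();
    Some((logical_shape, logical_strides))
}

fn main() {
    // Physical shape [2, 3, 4] has strides [12, 4, 1].
    let (shape, strides) = logical_view(&[2, 3, 4], &[2, 0, 1]).unwrap();
    assert_eq!(shape, vec![4, 2, 3]);
    assert_eq!(strides, vec![1, 12, 4]);
}
```

No data moves in this computation: only the shape and stride metadata change, which is why the export direction can always be zero-copy.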

Our proposed design for Vortex Tensors will handle non-contiguous operations differently than the
Python libraries. Rather than mutating strides to create non-contiguous views, operations like
slicing, indexing, and `.contiguous()` (after a permutation) would be expressed as lazy
Our proposed design for Vortex `FixedShapeTensor` will handle non-contiguous operations differently
than the Python libraries. Rather than mutating strides to create non-contiguous views, operations
like slicing, indexing, and `.contiguous()` (after a permutation) would be expressed as lazy
`Expression`s over the tensor.

These expressions describe the operation without materializing it, and when evaluated, they produce
@@ -197,7 +198,7 @@ elements in a tensor is the product of its shape dimensions, and that the

0D tensors have an empty shape `[]` and contain exactly one element (since the product of no
dimensions is 1). These represent scalar values wrapped in the tensor type. The storage type is
`FixedSizeList<p, 1>` (which is identical to a flat `PrimitiveArray` or `DecimalArray`).
`FixedSizeList<p, 1>` (which is identical to a flat `PrimitiveArray`).

#### Size-0 dimensions

@@ -225,12 +226,12 @@ leave this as an open question.
### Scalar Representation

Once we add the `ScalarValue::Array` variant (see tracking issue
[vortex#6771](https://github.com/vortex-data/vortex/issues/6771)), we can easily pass around tensors
as `ArrayRef` scalars as well as lazily computed slices.
[vortex#6771](https://github.com/vortex-data/vortex/issues/6771)), we can easily pass around
fixed-shape tensors as `ArrayRef` scalars as well as lazily computed slices.

The `ExtVTable` also requires specifying an associated `NativeValue<'a>` Rust type that an extension
scalar can be unpacked into. We will want a `NativeTensor<'a>` type that references the backing
memory of the Tensor, and we can add useful operations to that type.
scalar can be unpacked into. We will want a `NativeFixedShapeTensor<'a>` type that references the
backing memory of the tensor, and we can add useful operations to that type.

## Compatibility

@@ -242,8 +243,8 @@ compatibility concerns.
- **Fixed shape only**: This design only supports tensors where every element in the array has the
same shape. Variable-shape tensors (ragged arrays) are out of scope and would require a different
type entirely.
- **Yet another crate**: We will likely implement this in a `vortex-tensor` crate, which means even
more surface area than we already have.
- **Yet another crate**: We will likely implement this in a `vortex-tensor` crate, which means
even more surface area than we already have.

## Alternatives

@@ -301,6 +302,12 @@ _Note: This section was Claude-researched._
shape and stride metadata. Our design is a subset of this model — we always require contiguous
memory and derive strides from shape and permutation, as discussed in the
[conversions](#conversions) section.
- **[xarray](https://docs.xarray.dev/en/stable/)** extends NumPy with named dimensions and
coordinate labels. Its
[data model](https://docs.xarray.dev/en/stable/user-guide/terminology.html) attaches names to each
dimension and associates "coordinate" arrays along those dimensions (e.g., latitude and longitude
values for the rows and columns of a temperature matrix). Our `dim_names` metadata is a subset of
xarray's model; coordinate arrays could be a future extension.
- **[ndindex](https://quansight-labs.github.io/ndindex/index.html)** is a Python library that
provides a unified interface for representing and manipulating NumPy array indices (slices,
integers, ellipses, boolean arrays, etc.). It supports operations like canonicalization, shape
@@ -312,8 +319,8 @@ _Note: This section was Claude-researched._

- **TACO (Tensor Algebra Compiler)** separates the tensor storage format from the tensor program.
Each dimension can independently be specified as dense or sparse, and dimensions can be reordered.
The Vortex approach of storing tensors as flat contiguous memory with a permutation is one specific
point in TACO's format space (all dimensions dense, with a specific dimension ordering).
The Vortex approach of storing tensors as flat contiguous memory with a permutation is one
specific point in TACO's format space (all dimensions dense, with a specific dimension ordering).

## Unresolved Questions

@@ -333,8 +340,26 @@ like batched sequences of different lengths.

#### Sparse tensors

A similar Sparse Tensor type could use `List` or `ListView` as its storage type to efficiently
represent tensors with many null or zero elements, as noted in the [validity](#validity) section.
A sparse tensor type could use `List` or `ListView` as its storage type to efficiently represent
tensors with many zero or absent elements.

#### A unified `Tensor` type

This RFC proposes `FixedShapeTensor` as a single, concrete extension type. However, tensors
naturally vary along two axes: shape (fixed vs. variable) and density (dense vs. sparse). Both a
variable-shape tensor (fixed dimensionality, variable shape per tensor) and a sparse tensor would
need a different storage type (likely `List` or `ListView` for both): the former because the number
of elements varies per tensor, and the latter because it needs to efficiently skip over zero or
null regions.

Each combination would be its own extension type (`FixedShapeTensor`, `VariableShapeTensor`,
`SparseFixedShapeTensor`, etc.), but this proliferates types and fragments any shared tensor logic.
With the matching system on extension types, we could instead define a single unified `Tensor` type
that covers all combinations, dispatching to the appropriate storage type and metadata based on the
specific variant. This would be more complex to implement but would give users a single type to work
with and a single place to define tensor operations.

For now, `FixedShapeTensor` is the only variant we need. The others can be added incrementally
as use cases arise.

#### Tensor-specific encodings
