Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
26 changes: 26 additions & 0 deletions AGENTS.md
Original file line number Diff line number Diff line change
Expand Up @@ -72,6 +72,22 @@ Makefile target is used.

If you touch documentation run doc tests via `cargo test --doc`.

For Python binding changes under `vortex-python/`, run the narrow Python checks that match the
files touched before broader test suites. Useful checks include:

```bash
python -m py_compile <changed-python-files>
uv run --all-packages --reinstall-package vortex-data pytest <changed-python-tests>
```

If Python docstrings, `docs/api/python/`, or Sphinx configuration change, also run the docs checks
from a clean Sphinx environment:

```bash
uv run --all-packages make -C docs clean html
uv run --all-packages make -C docs clean doctest
```

## Linting, Formatting, and Generated Files

Run verification that matches the files changed. Do not run expensive Rust checks for changes that
Expand All @@ -80,6 +96,16 @@ with no Rust/API behavior impact. For docs/config-only changes, validate formatt
or with a targeted doc/config command, and verify symlink or path changes with `ls`, `find`, and
`git status`.

For Python binding changes under `vortex-python/`, run the relevant Python lint and type checks:

```bash
uv run basedpyright vortex-python
uv run ruff check <changed-python-files>
```

If PyO3 Rust files in `vortex-python/src/` change, include `cargo +nightly fmt --check -p
vortex-python`. Always finish Python binding work with `git diff --check`.

For Rust code, public API, feature flag, or generated-file changes, run these before stopping:

```bash
Expand Down
1 change: 1 addition & 0 deletions Cargo.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

10 changes: 3 additions & 7 deletions docs/api/python/arrays.rst
Original file line number Diff line number Diff line change
Expand Up @@ -128,18 +128,14 @@ Pluggable Encodings
-------------------

Subclasses of :class:`~vortex.PyArray` can be used to implement custom Vortex encodings in Python. These encodings
can be registered with the :attr:`~vortex.registry` so they are available to use when reading Vortex files.
can be registered with :func:`vortex.registry.register` so they are available to use when reading Vortex files.

.. autoclass:: vortex.PyArray
:members:


Registry and Serde
------------------

.. autodata:: vortex.registry

.. autofunction:: vortex.registry.register
Serde
-----

.. autoclass:: vortex.ArrayContext
:members:
Expand Down
1 change: 0 additions & 1 deletion docs/api/python/expr.rst
Original file line number Diff line number Diff line change
Expand Up @@ -64,4 +64,3 @@ the following expression represents the set of rows for which the `age` column l
... .to_pylist()
... )
[{'x': 1, 'y': {'yy': 'a'}}]

2 changes: 2 additions & 0 deletions docs/api/python/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -62,7 +62,9 @@ API Reference
arrays
expr
compress
registry
io
store
dataset
runtime
type_aliases
8 changes: 8 additions & 0 deletions docs/api/python/registry.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,8 @@
Registry
========

The Python registry module registers Python extension types with the process-wide Vortex registry.

.. automodule:: vortex.registry

.. autofunction:: vortex.registry.register
21 changes: 21 additions & 0 deletions docs/api/python/runtime.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,21 @@
Runtime
=======

Vortex drives async work on a shared background thread pool. The pool is sized on first use
to ``VORTEX_MAX_THREADS`` if that environment variable is set to a non-negative integer,
otherwise to the number of available CPU cores minus one. Use
:func:`vortex.set_worker_threads` to adjust the pool at runtime.

.. autosummary::
:nosignatures:

~vortex.set_worker_threads
~vortex.worker_threads

.. raw:: html

<hr>

.. autofunction:: vortex.set_worker_threads

.. autofunction:: vortex.worker_threads
3 changes: 3 additions & 0 deletions docs/getting-started/python.rst
Original file line number Diff line number Diff line change
Expand Up @@ -36,6 +36,7 @@ Use :func:`~vortex.io.write` to write the Vortex array to disk:
.. doctest::

>>> import vortex as vx
>>>
>>> vx.io.write(vtx, "example.vortex") # doctest: +SKIP

Small Vortex files (this one is just 71KiB) currently have substantial overhead relative to their
Expand All @@ -56,6 +57,7 @@ Use :func:`~vortex.open` to open and read the Vortex array from disk:
.. doctest::

>>> import vortex as vx
>>>
>>> cvtx = vx.open("example.vortex").scan().read_all() # doctest: +SKIP


Expand All @@ -68,6 +70,7 @@ IO and decoding and read just the data that is relevant to you:
.. doctest::

>>> import vortex as vx
>>>
>>> vf = vx.open("example.vortex") # doctest: +SKIP
>>> # row indices must be ordered and unique
>>> indices = vx.array([1, 2, 10])
Expand Down
2 changes: 2 additions & 0 deletions docs/user-guide/pandas.md
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,7 @@ convert:
```{doctest} pycon
>>> import vortex as vx
>>> import pyarrow.parquet as pq
>>>
>>> vx.io.write(pq.read_table("_static/example.parquet"), 'example.vortex')
>>>
>>> f = vx.open('example.vortex')
Expand Down Expand Up @@ -48,6 +49,7 @@ convert:

```{doctest} pycon
>>> import pandas as pd
>>>
>>> df = pd.DataFrame({'age': [25, 31, 33, 57], 'name': ['Joseph', 'Narendra', 'Angela', 'Mikhail']})
>>> vx.array(df).to_arrow_table()
pyarrow.Table
Expand Down
1 change: 1 addition & 0 deletions docs/user-guide/polars.md
Original file line number Diff line number Diff line change
Expand Up @@ -13,6 +13,7 @@ Vortex integrates with Polars via {meth}`.VortexFile.to_polars`, which returns a
```{doctest} pycon
>>> import vortex as vx
>>> import pyarrow.parquet as pq
>>>
>>> vx.io.write(pq.read_table("_static/example.parquet"), 'example.vortex')
>>>
>>> lf = vx.open('example.vortex').to_polars()
Expand Down
1 change: 1 addition & 0 deletions docs/user-guide/pyarrow.md
Original file line number Diff line number Diff line change
Expand Up @@ -66,6 +66,7 @@ The {func}`~vortex.array` function constructs a Vortex array from an Arrow array

```{doctest} pycon
>>> import pyarrow as pa
>>>
>>> arrow = pa.array([1, 2, None, 3])
>>> arr = vx.array(arrow)
>>> arr.dtype
Expand Down
4 changes: 2 additions & 2 deletions docs/user-guide/ray.md
Original file line number Diff line number Diff line change
Expand Up @@ -9,15 +9,15 @@ and then run `make -C docs doctest` -->
>>> import vortex as vx
>>> import pyarrow.parquet as pq
>>> import os
>>>
>>> os.makedirs("ray_data", exist_ok=True)
>>> table = pq.read_table("_static/example.parquet")
>>>
>>> vx.io.write(table, 'ray_data/example-01.vortex')
>>> vx.io.write(table, 'ray_data/example-02.vortex')
>>> vx.io.write(table, 'ray_data/example-03.vortex')
>>>
>>> from vortex.ray.datasource import VortexDatasource
>>> from ray.data import read_datasource
>>>
>>> ds = read_datasource(VortexDatasource(url='ray_data')) # doctest: +SKIP
>>> ds.to_pandas() # doctest: +SKIP
VendorID tpep_pickup_datetime ... congestion_surcharge Airport_fee
Expand Down
50 changes: 50 additions & 0 deletions docs/user-guide/vortex-python.md
Original file line number Diff line number Diff line change
Expand Up @@ -123,6 +123,7 @@ applied directly:

```{doctest} pycon
>>> import vortex.expr as ve
>>>
>>> arr = vx.array([
... {'name': 'Alice', 'age': 30},
... {'name': 'Bob', 'age': 25},
Expand All @@ -139,6 +140,7 @@ applied directly:

```{doctest} pycon
>>> import pyarrow.parquet as pq
>>>
>>> vx.io.write(pq.read_table("_static/example.parquet"), 'example.vortex')
>>>
>>> f = vx.open('example.vortex')
Expand Down Expand Up @@ -186,6 +188,54 @@ tip_amount: double
1000
```

## Threading Model

Vortex uses a shared runtime behind the Python API. When no background workers are configured, the
Python thread that is reading from a scan also polls the Vortex work needed to produce each batch.
This means multiple Python threads can make progress independently as long as each thread owns the
reader it is consuming:

```python
from concurrent.futures import ThreadPoolExecutor

import pyarrow.compute as pc
import vortex as vx


def sum_column(path: str, column: str) -> int | float:
reader = vx.open(path).to_arrow([column], batch_size=64_000)
total = 0

for batch in reader:
value = pc.sum(batch.column(column)).as_py()
if value is not None:
total += value

return total


columns = ["tip_amount", "fare_amount", "total_amount"]
with ThreadPoolExecutor(max_workers=len(columns)) as threads:
totals = list(threads.map(lambda column: sum_column("example.vortex", column), columns))
```

By default Vortex starts a background worker pool sized to `available_parallelism() - 1`.
Set `VORTEX_MAX_THREADS=n` to pin the pool to a specific size at startup. To adjust the pool
at runtime, use {func}`~vortex.set_worker_threads`; passing `None` resets it to the default:

```python
import vortex as vx

previous_workers = vx.worker_threads()
vx.set_worker_threads(None) # reset to available_parallelism() - 1

try:
reader = vx.open("example.vortex").to_arrow(batch_size=64_000)
table = reader.read_all()
finally:
vx.set_worker_threads(previous_workers)
```

## Conversion

Arrays convert to other formats:
Expand Down
7 changes: 4 additions & 3 deletions vortex-python/Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -27,12 +27,13 @@ default = ["extension-module", "tui"]
# Disable it when using vortex-python as a library dependency to avoid
# duplicate PyInit__lib symbols.
extension-module = []
tui = ["dep:vortex-tui"]
tui = ["dep:tokio", "dep:vortex-tui"]

[dependencies]
arrow-array = { workspace = true }
arrow-data = { workspace = true }
arrow-schema = { workspace = true }
async-fs = { workspace = true }
bytes = { workspace = true }
itertools = { workspace = true }
log = { workspace = true }
Expand All @@ -48,9 +49,9 @@ pyo3 = { workspace = true, features = ["abi3", "abi3-py311"] }
pyo3-bytes = { workspace = true }
pyo3-log = { workspace = true }
pyo3-object_store = { workspace = true }
tokio = { workspace = true, features = ["fs", "rt-multi-thread"] }
tokio = { workspace = true, features = ["rt-multi-thread"], optional = true }
url = { workspace = true }
vortex = { workspace = true, features = ["object_store", "tokio"] }
vortex = { workspace = true, features = ["object_store"] }
vortex-tui = { workspace = true, optional = true }

[dev-dependencies]
Expand Down
5 changes: 4 additions & 1 deletion vortex-python/benchmark/conftest.py
Original file line number Diff line number Diff line change
Expand Up @@ -17,7 +17,10 @@
params=[{"x"}, {"x", "y"}, {"x", "z"}, {"x", "y", "z"}],
ids=["int", "int_str", "int_float", "int_str_float"],
)
def vxf(tmpdir_factory: pytest.TempPathFactory, request: pytest.FixtureRequest) -> vx.VortexFile:
def vxf(
tmpdir_factory: pytest.TempPathFactory,
request: pytest.FixtureRequest,
) -> vx.VortexFile:
fname = tmpdir_factory.mktemp("data") / "foo.vortex"

if not os.path.exists(fname):
Expand Down
5 changes: 3 additions & 2 deletions vortex-python/benchmark/test_serialization.py
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,7 @@
# SPDX-FileCopyrightText: Copyright the Vortex contributors

import pickle
from typing import cast

import pytest
from pytest_benchmark.fixture import BenchmarkFixture # pyright: ignore[reportMissingTypeStubs]
Expand All @@ -24,6 +25,6 @@ def test_pickle(
benchmark(lambda: pickle.dumps(array_fixture, protocol=protocol))
elif operation == "loads":
pickled_data = pickle.dumps(array_fixture, protocol=protocol)
benchmark(lambda: pickle.loads(pickled_data)) # pyright: ignore[reportAny]
benchmark(lambda: cast(object, pickle.loads(pickled_data)))
elif operation == "roundtrip":
benchmark(lambda: pickle.loads(pickle.dumps(array_fixture, protocol=protocol))) # pyright: ignore[reportAny]
benchmark(lambda: cast(object, pickle.loads(pickle.dumps(array_fixture, protocol=protocol))))
9 changes: 8 additions & 1 deletion vortex-python/python/vortex/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -61,6 +61,10 @@
utf8,
)
from ._lib.iter import ArrayIterator # pyright: ignore[reportMissingModuleSource]
from ._lib.runtime import ( # pyright: ignore[reportMissingModuleSource]
set_worker_threads,
worker_threads,
)
from ._lib.scalar import ( # pyright: ignore[reportMissingModuleSource]
BinaryScalar,
BoolScalar,
Expand Down Expand Up @@ -178,7 +182,7 @@
# Serde
"ArrayContext",
"SerializedArray",
# Pickle
# Pickle support
"_unpickle_array",
# File
"VortexFile",
Expand All @@ -187,6 +191,9 @@
"ArrayIterator",
# Scan
"RepeatedScan",
# Runtime
"set_worker_threads",
"worker_threads",
# Version
"__version__",
]
2 changes: 1 addition & 1 deletion vortex-python/python/vortex/_lib/arrays.pyi
Original file line number Diff line number Diff line change
Expand Up @@ -24,7 +24,7 @@ class Array:
obj: pa.Array[pa.Scalar[pa.DataType]] | pa.ChunkedArray[pa.Scalar[pa.DataType]] | pa.Table,
) -> Array: ...
@staticmethod
def from_range(obj: range) -> Array: ...
def from_range(obj: range, *, dtype: DType | None = None) -> Array: ...
def to_arrow_array(self) -> pa.Array[pa.Scalar[pa.DataType]]: ...
@property
def id(self) -> str: ...
Expand Down
7 changes: 6 additions & 1 deletion vortex-python/python/vortex/_lib/file.pyi
Original file line number Diff line number Diff line change
Expand Up @@ -51,4 +51,9 @@ class VortexFile:
def to_polars(self) -> pl.LazyFrame: ...
def splits(self) -> list[tuple[int, int]]: ...

def open(path: str, *, store: ObjectStore | None = None, without_segment_cache: bool = False) -> VortexFile: ...
def open(
path: str,
*,
store: ObjectStore | None = None,
without_segment_cache: bool = False,
) -> VortexFile: ...
1 change: 1 addition & 0 deletions vortex-python/python/vortex/_lib/io.pyi
Original file line number Diff line number Diff line change
Expand Up @@ -13,6 +13,7 @@ def read_url(
projection: list[str] | list[int] | None = None,
row_filter: Expr | None = None,
indices: Array | None = None,
row_range: tuple[int, int] | None = None,
) -> Array: ...
def write(
iter: IntoArrayIterator,
Expand Down
5 changes: 5 additions & 0 deletions vortex-python/python/vortex/_lib/runtime.pyi
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright the Vortex contributors

def set_worker_threads(n: int | None = None) -> None: ...
def worker_threads() -> int: ...
Loading
Loading