feat(vortex-row): row-oriented byte encoder (size + encode passes) #8056

Closed
joseph-isaacs wants to merge 10 commits into develop from ji/row-pr1-base

@joseph-isaacs joseph-isaacs commented May 22, 2026


Summary

Adds vortex-row, a new crate that encodes one or more columnar Vortex arrays into a single
ListView<u8> whose per-row byte slices are lexicographically comparable. The byte order
matches tuple ordering of the input values under per-column sort options, so the output works
directly as a sort key / row key — the Vortex analogue of arrow-row.

This is the base of the row-encoding work: the byte codec, the two-pass converter, the
public API, tests, and benches. Codec hot-path performance tuning and the per-encoding
(Constant / Dict / Patched / RunEnd / BitPacked / FoR / Delta) fast-path kernels land in
follow-up PRs and are intentionally out of scope here.

Design

Encoding runs as two scalar functions wired behind the RowEncoder API:

  1. Size pass — RowSize. Walks the N input columns once, classifies each column as
    fixed- or variable-width, accumulates the fixed-width prefix per row, and lazily collects
    per-row variable lengths. Returns Struct { fixed: u32, var: u32 } so callers read per-row
    widths without materializing the constant fixed slot as a per-row buffer.
  2. Encode pass — RowEncode. Uses those sizes to compute totals, allocate one contiguous
    elements buffer, build per-row absolute offsets, then writes each column left-to-right into
    its per-row slot via a write cursor that doubles as the ListView sizes array — so no
    separate finalize step is needed.

The converter is effectively 2 passes for the pure-fixed-width case and 3 when
variable-length columns require the prefix-sum offsets pass.

Per-column ordering is controlled by RowSortField { descending, nulls_first }: descending
inverts (bitwise-complements) the encoded value bytes, and a leading sentinel byte
(0x00 / 0x01 / 0x02) places nulls before or after non-nulls independently of sort direction.
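
A minimal sketch of how the sentinel and the descending option could interact for a single nullable i64. The `SortField` struct and `encode_i64` function are hypothetical stand-ins, not the crate's types. The sentinel mapping (0x00 for nulls-first, 0x01 for non-null, 0x02 for nulls-last) and the bitwise complement for descending are assumptions modeled on how arrow-row handles the same problem.

```rust
/// Hypothetical mirror of the PR's `RowSortField`; not the crate's actual type.
#[derive(Clone, Copy)]
pub struct SortField {
    pub descending: bool,
    pub nulls_first: bool,
}

/// Encode one nullable i64 as 9 bytes: 1 sentinel byte + 8 value bytes.
pub fn encode_i64(value: Option<i64>, field: SortField) -> [u8; 9] {
    let mut out = [0u8; 9];
    match value {
        None => {
            // Assumed mapping: 0x00 sorts before non-null (0x01), 0x02 after it.
            // The value bytes stay zero so that all nulls compare equal.
            out[0] = if field.nulls_first { 0x00 } else { 0x02 };
        }
        Some(v) => {
            out[0] = 0x01;
            // Flip the sign bit so negatives sort below positives, then store
            // big-endian so byte order matches numeric order.
            let mut bytes = ((v as u64) ^ (1u64 << 63)).to_be_bytes();
            if field.descending {
                // Complement only the value bytes; the sentinel is untouched,
                // which keeps null placement independent of sort direction.
                for b in bytes.iter_mut() {
                    *b = !*b;
                }
            }
            out[1..].copy_from_slice(&bytes);
        }
    }
    out
}
```

Because the sentinel is written before the direction is applied, a descending column with `nulls_first: false` still places nulls after every non-null value.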

Public API

  • RowEncoder — primary entry point: new / with_options, encode, row_sizes, options.
  • convert_columns / compute_row_sizes — convenience helpers around RowEncoder.
  • RowEncodingOptions + RowSortField — per-column sort configuration.
  • initialize(session) — registers RowSize and RowEncode on a VortexSession so row
    encoding is reachable through the expression layer.

convert_columns and where the API lives

convert_columns(cols: &[ArrayRef], fields: &[RowSortField], ctx) -> VortexResult<ListViewArray>
is the one-shot entry point; RowEncoder is the reusable form (build once with options, encode
many). Each public item is defined in a single file:

| Item | File |
| --- | --- |
| RowEncoder, convert_columns(_with_options), compute_row_sizes(_with_options) | src/encoder.rs |
| RowEncode scalar fn + the 5-phase encode driver | src/encode.rs |
| RowSize scalar fn + the size/classify pass (compute_sizes) | src/size.rs |
| RowEncodingOptions, RowSortField | src/options.rs |
| per-dtype byte codec (field_size / field_encode) | src/codec.rs |
| initialize(session) + re-exports | src/lib.rs |

Implementation. convert_columns validates the columns (non-empty, equal lengths, one
RowSortField per column) and delegates to a RowEncoder, which runs RowSize then
RowEncode: the size pass canonicalizes each column once and classifies it fixed- vs
variable-width, accumulating a constant fixed_per_row and — only when needed — per-row
var_lengths; the encode pass sums those into the total byte length, allocates one contiguous
elements buffer, computes per-row offsets (i * fixed_per_row plus a varlen prefix sum), then
writes each column left-to-right into its per-row slot using a cursor that becomes the ListView
sizes array, producing the final ListView<u8> with no separate finalize step.
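
The flow above can be sketched for the pure-fixed-width case. This is an illustration only: `Rows` and `encode_u32_columns` are hypothetical names, the columns are plain non-nullable `Vec<u32>` rather than Vortex arrays, and the 5-byte slot (0x01 sentinel + 4 big-endian bytes) is an assumption about the layout.

```rust
/// Hypothetical stand-in for the ListView<u8> output: one shared buffer plus
/// per-row offsets and sizes.
pub struct Rows {
    pub elements: Vec<u8>,
    pub offsets: Vec<u32>,
    pub sizes: Vec<u32>,
}

impl Rows {
    /// Byte slice holding the encoded key of row `i`.
    pub fn row(&self, i: usize) -> &[u8] {
        let start = self.offsets[i] as usize;
        &self.elements[start..start + self.sizes[i] as usize]
    }
}

/// Encode non-nullable u32 columns into row-comparable keys (ascending).
pub fn encode_u32_columns(cols: &[Vec<u32>]) -> Rows {
    let n_rows = cols.first().map_or(0, |c| c.len());
    assert!(cols.iter().all(|c| c.len() == n_rows), "columns must have equal length");

    // Size pass: every column is fixed-width here, so the row width is a constant.
    let fixed_per_row = 5 * cols.len() as u32;

    // Encode pass: one contiguous buffer, offsets at i * fixed_per_row.
    let offsets: Vec<u32> = (0..n_rows as u32).map(|i| i * fixed_per_row).collect();
    let mut elements = vec![0u8; n_rows * fixed_per_row as usize];
    let mut cursors = vec![0u32; n_rows];

    // Write columns left-to-right; each write advances that row's cursor.
    for col in cols {
        for (i, v) in col.iter().enumerate() {
            let start = (offsets[i] + cursors[i]) as usize;
            elements[start] = 0x01; // non-null sentinel (assumed value)
            elements[start + 1..start + 5].copy_from_slice(&v.to_be_bytes());
            cursors[i] += 5;
        }
    }

    // After the last column, each cursor equals the row's total size, so the
    // cursor vector is reused as the sizes array.
    Rows { elements, offsets, sizes: cursors }
}
```

Sorting row indices by `rows.row(i)` then yields the same order as sorting the input tuples, which is the property the sort-order tests check.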

Type coverage

Supported: nulls, booleans, integer/float primitives, decimals up to 128 bits, UTF-8 and
binary, structs, fixed-size lists, and extensions whose storage type is supported. Variant,
union, and variable-size list arrays are rejected — this crate does not define an ordering for
them.

Testing

  • cargo nextest run -p vortex-row — sort-order round-trip tests (bool, i64 asc/desc, u32,
    f64, utf8, multi-column, nulls first/last, struct), the single-buffer ListView invariant, and
    RowSize output shape.
  • cargo bench -p vortex-row runs the row_encode divan benchmarks against an arrow-row
    baseline (primitive i64, utf8, struct_mixed).

claude added 8 commits May 17, 2026 22:00
Add an empty `vortex-row` crate with a minimal `initialize` stub so the
following commits can layer in the row-encoder, codec, scalar functions,
and per-encoding kernels without touching the workspace skeleton each
time. The crate is wired into the workspace members list and workspace
dependency table; `public-api.lock` is generated against the stub.

Signed-off-by: Claude <noreply@anthropic.com>
Introduce the per-column sort-field options and the variadic-function
options struct used by the upcoming RowSize / RowEncode scalar functions.

`RowEncodeOptions::fields` uses a `SmallVec<[SortField; 4]>` so typical
1-4 column keys avoid a heap allocation. Includes a compact serialize /
deserialize helper used later by the scalar-function metadata round-trip.

Signed-off-by: Claude <noreply@anthropic.com>
Add the byte-encoding kernels for the fixed-width portion of the row
encoder: Null, Bool, Primitive (12 PTypes), and Decimal (i8..i128). Each
encoder writes a 1-byte sentinel followed by the value's row-comparable
bytes (sign-flipped big-endian for signed ints, sign-aware mask for
floats, etc.).
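
For reference, a sketch of what a sign-aware mask for floats commonly looks like. This is the standard order-preserving transform for IEEE 754 bits, not code taken from this commit; `encode_f64_value` and `decode_f64_value` are hypothetical names, and the sentinel byte is omitted.

```rust
/// Map an f64 to 8 bytes whose lexicographic order follows numeric order.
pub fn encode_f64_value(v: f64) -> [u8; 8] {
    let bits = v.to_bits();
    // Negative floats: flip every bit, which reverses their order and moves
    // them below the positives. Non-negative floats: flip only the sign bit.
    let mask = if bits >> 63 == 1 { u64::MAX } else { 1u64 << 63 };
    (bits ^ mask).to_be_bytes()
}

/// Inverse of `encode_f64_value`, useful for round-trip checks.
pub fn decode_f64_value(bytes: [u8; 8]) -> f64 {
    let enc = u64::from_be_bytes(bytes);
    // After encoding, originally negative values have a clear top bit.
    let mask = if enc >> 63 == 0 { u64::MAX } else { 1u64 << 63 };
    f64::from_bits(enc ^ mask)
}
```

Under this transform -0.0 sorts immediately below 0.0; whether the crate treats the two as distinct keys is not stated in the commit message.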

The size pass is a constant `width-per-row` add for these types; the
encode pass walks rows and writes into the shared output buffer at
`offsets[i] + cursors[i]`. `row_width_for_dtype` classifies the column
based purely on its DType.

Scalar-level encoders (`encode_scalar_primitive` / `encode_scalar_bool`
/ `encode_scalar_null` / `encode_scalar` / `encoded_size_for_scalar`)
are included for the same fixed-width subset; varlen and nested
canonical variants bail with a clear "not yet supported" error and
land in follow-up commits.

The implementation is deliberately the simplest correct version:
bounds-checked array indexing, no `copy_nonoverlapping`, no validity
fast-path helper. Subsequent PRs evolve this toward the optimized form.

Signed-off-by: Claude <noreply@anthropic.com>
Extend the codec to handle Utf8/Binary via VarBinView arrays. Each value
encodes as a 1-byte sentinel followed by 32-byte chunks: every full
chunk has a 0xFF continuation marker; the final partial chunk pads with
zeros and writes the partial length (1..=32) as its trailing byte.
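
A rough sketch of the chunk layout described above, for ascending order and without the leading sentinel. The function names are hypothetical. How the commit encodes an empty value is not stated; this sketch emits no chunks for it and assumes the sentinel byte disambiguates that case.

```rust
const CHUNK: usize = 32;

/// Bytes appended by `encode_varlen` for a value of `len` bytes:
/// each chunk is 32 data bytes plus 1 trailing marker byte.
pub fn varlen_encoded_len(len: usize) -> usize {
    ((len + CHUNK - 1) / CHUNK) * (CHUNK + 1)
}

/// Append the chunked encoding of `value` to `out`.
pub fn encode_varlen(value: &[u8], out: &mut Vec<u8>) {
    let mut chunks = value.chunks(CHUNK).peekable();
    while let Some(chunk) = chunks.next() {
        out.extend_from_slice(chunk);
        if chunks.peek().is_some() {
            // Full chunk with more data to follow: continuation marker.
            out.push(0xFF);
        } else {
            // Final chunk: zero-pad to 32 bytes, then record how many of
            // those bytes were real data (1..=32).
            out.resize(out.len() + (CHUNK - chunk.len()), 0);
            out.push(chunk.len() as u8);
        }
    }
}

/// Convenience wrapper that returns a fresh buffer.
pub fn encode_varlen_to_vec(value: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(varlen_encoded_len(value.len()));
    encode_varlen(value, &mut out);
    out
}
```

The trailing byte is what keeps a value ordered before any longer value that shares its prefix: "ab" and "ab\0" produce identical data bytes but end in 2 and 3 respectively, and a full 32-byte value ends in 32, which is below the 0xFF continuation marker of any longer value.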

`encode_varlen_value` uses the simple byte-at-a-time XOR loop here; a
faster `copy_nonoverlapping` + stamped continuation version replaces it
in PR 2. `encode_varbinview` uses `arr.with_iterator(...)` for both the
nullable and non-nullable branches; a direct view walk for the no-nulls
branch lands in PR 2 too.

`row_width_for_dtype` now returns `Variable` for Utf8/Binary; the size
pass and encode dispatchers route through `add_size_varbinview` /
`encode_varbinview` correspondingly. The scalar encoder gains
`encode_scalar_varlen` and the matching Utf8/Binary arms.

Signed-off-by: Claude <noreply@anthropic.com>
Extend the codec to handle Struct, FixedSizeList, and Extension
canonical variants. Each nested row encodes as `outer_sentinel | child
bytes...`; for null rows the child bytes are zero-filled after the
recursive encoders run so two null rows compare equal regardless of
which non-null values would have been written by the children.

`row_width_for_dtype` recurses through Struct fields and FSL elements
to return `Fixed(w)` when every leaf is fixed; otherwise `Variable`.
Extension delegates to its storage dtype. List remains `Variable` and
ListView still bails (the row encoder's output is itself a ListView, so
nested ListView isn't a near-term use case). Variant and Union bail
explicitly.
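
The recursion can be illustrated with a toy type model. The `DType` enum below is a hypothetical, heavily reduced stand-in for Vortex's real dtype, and the concrete widths (a 1-byte sentinel per value, including one outer sentinel per nested value) are assumptions drawn from the commit messages rather than from the code.

```rust
/// Hypothetical, simplified dtype model; not Vortex's actual `DType`.
#[derive(Debug, Clone, PartialEq)]
pub enum DType {
    Null,
    Bool,
    I64,
    F64,
    Utf8,
    Struct(Vec<DType>),
    FixedSizeList(Box<DType>, u32),
    List(Box<DType>),
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RowWidth {
    Fixed(u32),
    Variable,
}

/// Classify a column purely from its dtype.
pub fn row_width_for_dtype(dtype: &DType) -> RowWidth {
    match dtype {
        DType::Null => RowWidth::Fixed(1),             // sentinel only
        DType::Bool => RowWidth::Fixed(2),             // sentinel + 1 byte
        DType::I64 | DType::F64 => RowWidth::Fixed(9), // sentinel + 8 bytes
        DType::Utf8 | DType::List(_) => RowWidth::Variable,
        DType::Struct(fields) => {
            // Outer sentinel plus every child; one variable child makes the
            // whole struct variable.
            let mut total = 1u32;
            for field in fields {
                match row_width_for_dtype(field) {
                    RowWidth::Fixed(w) => total += w,
                    RowWidth::Variable => return RowWidth::Variable,
                }
            }
            RowWidth::Fixed(total)
        }
        DType::FixedSizeList(elem, n) => match row_width_for_dtype(elem) {
            // Overflow is ignored in this sketch.
            RowWidth::Fixed(w) => RowWidth::Fixed(1 + w * n),
            RowWidth::Variable => RowWidth::Variable,
        },
    }
}
```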

Signed-off-by: Claude <noreply@anthropic.com>
Add the size-pass machinery used by both RowSize and the upcoming
RowEncode pipeline. `compute_sizes` walks the N input columns once,
classifying each via `row_width_for_dtype` and accumulating
fixed-width-prefix sums in `fixed_per_row` while pushing per-row sums
of variable-length columns into a lazily allocated `var_lengths` vec.

The classification result (`ColKind` + `SizePassResult`) is private to
the crate; RowEncode consumes it in a later commit to choose between
the arithmetic and cursor encode paths.

`RowSize` returns a `Struct { fixed: U32, var: U32 }` so callers can
read the per-row width without realizing the constant `fixed` slot as
a per-row buffer (it's a `ConstantArray`); the `var` slot is a
`ConstantArray(0)` when no varlen column is present.

`dispatch_size` is the fallback-only path for PR 1 (canonicalize, then
codec::field_size). The `RowSizeKernel` trait exists but is unused; per-
encoding fast paths and the inventory registry arrive in PR 3.

`initialize()` does NOT register RowSize yet - that lands once
RowEncode is in place, so the session-registered pair appears together.

Signed-off-by: Claude <noreply@anthropic.com>
Add the RowEncode variadic scalar function: encode N input columns into
a single ListView<u8> in a five-phase pipeline.

  Phase 1: size pass via `compute_sizes`.
  Phase 2: allocate a zero-initialized output buffer sized to fit every
           row's encoded bytes; bail if the total exceeds u32::MAX.
  Phase 3: build per-row `listview_offsets`: i * fixed_per_row for the
           pure-fixed case, or i * fixed_per_row + exclusive cumsum of
           varlen lengths otherwise. Uses the simple `Vec::push` +
           `checked_add` loop.
  Phase 4: walk columns left-to-right and call `dispatch_encode` for
           every column (cursor path for all). Each call writes its
           per-row bytes at `offsets[i] + cursors[i]` and advances the
           cursor.
  Phase 5: build the ListView<u8> via the validating `try_new`
           constructor.

`dispatch_encode` is the canonicalize-then-`codec::field_encode`
fallback; in-crate kernel arms and the inventory registry land in PR 3.
The `RowEncodeKernel` trait is defined but unused. PR 2 will iterate
on this pipeline (skip zero-init, skip ListView validation, auto-
vectorize the offsets loop, etc.).
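
Phases 2 and 3 can be sketched as a single checked loop. `build_offsets` is a hypothetical helper, not the function in `src/encode.rs`; it returns `None` where the commit describes bailing on a total above u32::MAX.

```rust
/// Compute per-row start offsets and the total encoded length.
///
/// `var_lengths` is `None` for the pure-fixed case; otherwise it holds the
/// summed variable-length bytes of each row. Returns `None` if the running
/// total would overflow `u32`.
pub fn build_offsets(
    n_rows: usize,
    fixed_per_row: u32,
    var_lengths: Option<&[u32]>,
) -> Option<(Vec<u32>, u32)> {
    let mut offsets = Vec::with_capacity(n_rows);
    let mut acc: u32 = 0;
    for i in 0..n_rows {
        // Exclusive prefix sum: a row starts where the previous rows end.
        offsets.push(acc);
        let var = var_lengths.map_or(0, |v| v[i]);
        acc = acc.checked_add(fixed_per_row)?.checked_add(var)?;
    }
    Some((offsets, acc))
}
```

With no variable-length columns the result reduces to `i * fixed_per_row`. With them, each offset is `i * fixed_per_row` plus the exclusive cumulative sum of the preceding rows' variable lengths.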

Signed-off-by: Claude <noreply@anthropic.com>
Wire the RowSize/RowEncode scalar functions to the user-facing API:

- `convert_columns` accepts a slice of input arrays and per-column
  SortFields, constructs `RowEncodeOptions` + `VecExecutionArgs`, and
  returns the encoded `ListViewArray<u8>`.
- `compute_row_sizes` returns just the per-row sizes (the `Struct
  { fixed: u32, var: u32 }` output of `RowSize`).
- `initialize()` now registers `RowSize` and `RowEncode` on the given
  session so they are reachable via the expression layer.

Tests cover sort-order round-trips for bool, primitive (i64 asc/desc,
u32, f64), utf8, multi-column, nulls_first/last, struct sort-order, the
single-buffer invariant of the ListView output, and the structural
shape of `RowSize`. Tests that exercise per-encoding fast paths
(`constant_path_matches_canonical`, `dict_path_matches_canonical`) land
together with their respective kernels in PR 3.

The bench file uses divan + mimalloc and reports throughput in GB/s of
encoded output bytes for primitive_i64, utf8, and struct_mixed. Each
has an `arrow_row` baseline and a `vortex` measurement. Per-encoding
fast-path scenarios (constant/dict/patched/bitpacked/for/delta) gain
their triplets in PR 3.

Baseline measurements at this commit (sample-count=10):
  primitive_i64_vortex  ~1.97 GB/s  (vs arrow-row 4.12 GB/s)
  utf8_vortex           ~0.87 GB/s  (vs arrow-row 1.56 GB/s)
  struct_mixed_vortex   ~0.95 GB/s  (vs arrow-row 1.19 GB/s)

PR 2 closes most of the gap by replacing the validating
`ListViewArray::try_new` with `new_unchecked`, skipping the buffer
zero-init, auto-vectorizing the offsets and varlen-block paths, etc.

Signed-off-by: Claude <noreply@anthropic.com>
codspeed-hq Bot commented May 22, 2026


Merging this PR will degrade performance by 11.29%

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚠️ Different runtime environments detected

Some benchmarks with significant performance changes were compared across different runtime environments,
which may affect the accuracy of the results.

Open the report in CodSpeed to investigate

❌ 1 regressed benchmark
✅ 1506 untouched benchmarks

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

| Mode | Benchmark | BASE | HEAD | Efficiency |
| --- | --- | --- | --- | --- |
| Simulation | baseline_lt[16, 65536] | 216.1 µs | 243.7 µs | -11.29% |

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.


Comparing ji/row-pr1-base (e5c07bb) with develop (3ac6c77)

@joseph-isaacs changed the title from "feat: vortex row crate" to "feat(vortex-row): row-oriented byte encoder (size + encode passes)" on Jun 4, 2026
t
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
t
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>