Tracking Issue: random access Lance parity

Lance is currently around 3-15x faster than Vortex on random access over feature vectors.

1. Selection is not used in repeated reads; solved by #8137 (2x improvement)
2. Most requests produce small splits where BTree overhead dominates: https://github.com/vortex-data/vortex/pull/8194 (1.3x improvement)
3. Re-parsing flatbuffers and reinitializing chunk offsets adds overhead: https://github.com/vortex-data/vortex/pull/8234 (likely 2x improvement)
4. Chunked layout children are not cached: https://github.com/vortex-data/vortex/pull/8209, #8244
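
For readers not familiar with items 3 and 4, here is a minimal sketch of the general idea: compute derived metadata (here, cumulative chunk offsets) once and reuse it across reads. `ChunkedLayout`, `chunk_lens` and `locate` are hypothetical names for illustration only; this is not the code from the linked PRs.

```rust
use std::sync::OnceLock;

/// Hypothetical stand-in for a chunked layout; not the Vortex type.
struct ChunkedLayout {
    /// Row count of each chunk, as it might be stored in the footer.
    chunk_lens: Vec<u64>,
    /// Cumulative row offsets, computed on first use and then reused.
    offsets: OnceLock<Vec<u64>>,
}

impl ChunkedLayout {
    fn new(chunk_lens: Vec<u64>) -> Self {
        Self { chunk_lens, offsets: OnceLock::new() }
    }

    /// Returns offsets [0, l0, l0 + l1, ...], computing them only once.
    fn offsets(&self) -> &[u64] {
        self.offsets.get_or_init(|| {
            let mut acc = 0u64;
            let mut out = Vec::with_capacity(self.chunk_lens.len() + 1);
            out.push(0);
            for len in &self.chunk_lens {
                acc += len;
                out.push(acc);
            }
            out
        })
    }

    /// Finds the chunk containing `row` and the row's position inside it.
    fn locate(&self, row: u64) -> Option<(usize, u64)> {
        let offsets = self.offsets();
        if row >= *offsets.last()? {
            return None;
        }
        // partition_point returns the first index whose offset is > row.
        let idx = offsets.partition_point(|&o| o <= row) - 1;
        Some((idx, row - offsets[idx]))
    }
}
```

Every random-access lookup after the first reuses the same offsets vector instead of rebuilding it per request.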

Big improvements:
1. Subsegment reads: on feature vectors, for example, we read 10x more data. https://github.com/vortex-data/vortex/pull/7368
2. io_uring: currently the cost of scheduling a single task for a small read exceeds the cost of the read itself.
3. Same as (2) from a different angle: vectored reads. If we can batch small read_at() calls for local disks, we'll save on tokio task scheduling.
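
A rough sketch of the batching idea in (3), assuming we coalesce nearby byte ranges and issue one read per merged range. `coalesce`, `read_batched`, `max_gap` and the `read_at` callback are hypothetical names for illustration; this is not an existing Vortex API and it does not use io_uring or tokio.

```rust
use std::io;
use std::ops::Range;

/// Merges byte ranges whose gap is at most `max_gap` bytes.
fn coalesce(mut ranges: Vec<Range<u64>>, max_gap: u64) -> Vec<Range<u64>> {
    ranges.sort_by_key(|r| r.start);
    let mut merged: Vec<Range<u64>> = Vec::new();
    for r in ranges {
        if let Some(last) = merged.last_mut() {
            if r.start <= last.end.saturating_add(max_gap) {
                last.end = last.end.max(r.end);
                continue;
            }
        }
        merged.push(r);
    }
    merged
}

/// Serves every requested range using one read per coalesced range.
/// `read_at(offset, buf)` is a hypothetical callback that fills `buf` completely.
fn read_batched<F>(
    requests: &[Range<u64>],
    max_gap: u64,
    mut read_at: F,
) -> io::Result<Vec<Vec<u8>>>
where
    F: FnMut(u64, &mut [u8]) -> io::Result<()>,
{
    let merged = coalesce(requests.to_vec(), max_gap);
    let mut buffers = Vec::with_capacity(merged.len());
    for m in &merged {
        let mut buf = vec![0u8; (m.end - m.start) as usize];
        read_at(m.start, &mut buf)?;
        buffers.push(buf);
    }
    // Slice each original request out of the merged buffer that covers it.
    let out = requests
        .iter()
        .map(|r| {
            let i = merged.partition_point(|m| m.start <= r.start) - 1;
            let lo = (r.start - merged[i].start) as usize;
            buffers[i][lo..lo + (r.end - r.start) as usize].to_vec()
        })
        .collect();
    Ok(out)
}
```

The trade-off is the usual one: a larger `max_gap` means fewer reads (and fewer tasks to schedule) at the cost of reading bytes nobody asked for.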

Moving the needle (won't do):
1. Using the Natural split heuristic on range count rather than on a constant range distance: 33% improvement on feature-vectors, 10% regression on nested lists, since this favours flat data.
2. For small split read tasks, the read time is marginal compared to tokio task scheduling and LazyScanStream initialization. This may be solved by a heuristic, but performance improves only marginally, and only for the main thread.
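
To illustrate the difference in (1), a toy comparison of the two split criteria over sorted ranges. Both functions and their parameters (`max_distance`, `max_ranges`) are hypothetical simplifications, not the actual Natural split heuristic.

```rust
use std::ops::Range;

/// Starts a new split whenever the gap to the previous range exceeds `max_distance`.
/// Assumes `ranges` is sorted by start.
fn split_by_distance(ranges: &[Range<u64>], max_distance: u64) -> Vec<Vec<Range<u64>>> {
    let mut splits: Vec<Vec<Range<u64>>> = Vec::new();
    for r in ranges {
        let start_new = match splits.last().and_then(|s| s.last()) {
            Some(prev) => r.start.saturating_sub(prev.end) > max_distance,
            None => true,
        };
        if start_new {
            splits.push(Vec::new());
        }
        splits.last_mut().unwrap().push(r.clone());
    }
    splits
}

/// Starts a new split once the current one holds `max_ranges` ranges.
fn split_by_count(ranges: &[Range<u64>], max_ranges: usize) -> Vec<Vec<Range<u64>>> {
    ranges.chunks(max_ranges.max(1)).map(|c| c.to_vec()).collect()
}
```

The count-based variant ignores how far apart the ranges are, so the same input can produce a different number of splits under each criterion.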


TODOs:

- feature-vectors/correlated is faster with footer reopen. Why?


Additional context:

We were measuring Vortex performance incorrectly; the measurements were likely influenced by Lance. See https://github.com/vortex-data/vortex/pull/8470

Tracking Issue: random access Lance parity #7915
