Tracking Issue: random access Lance parity #7915

@myrrc

Description

Lance currently outperforms Vortex on random-access reads of feature vectors by around 3-15x. Known causes:

  1. Selection is not used in repeated reads. Solved by "Use selection in repeated scans" #8137 (2x improvement).
  2. Most requests produce small splits, where BTree overhead dominates. See "Use Vec<u64> instead of BTreeSet for splits" #8194 (1.3x improvement).
  3. Re-parsing flatbuffers and reinitializing chunk offsets adds overhead. See "ViewedLayoutChildren child layout cache" #8234 (likely 2x improvement).
  4. Chunked layout children are not cached. See "Chunk reader children cache" #8209 and "Allow caching LayoutReader in VortexFile" #8244.
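
To illustrate item 2, here is a minimal sketch of keeping split points in a sorted, deduplicated `Vec<u64>` instead of a `BTreeSet<u64>`. It is not the code from #8194; the function names `splits_vec`, `splits_btree` and `split_ranges` are hypothetical and only show why the two containers are interchangeable when the set is built once and then read in order.

```rust
use std::collections::BTreeSet;
use std::ops::Range;

/// Collect split points into a sorted, deduplicated Vec<u64>.
/// The points end up in one contiguous allocation, which is cheap
/// to build and iterate when there are only a few of them.
pub fn splits_vec(points: impl IntoIterator<Item = u64>) -> Vec<u64> {
    let mut splits: Vec<u64> = points.into_iter().collect();
    splits.sort_unstable();
    splits.dedup(); // removes adjacent duplicates, so it must follow the sort
    splits
}

/// Baseline for comparison: the same result built through a BTreeSet,
/// which sorts and deduplicates on insertion.
pub fn splits_btree(points: impl IntoIterator<Item = u64>) -> Vec<u64> {
    let set: BTreeSet<u64> = points.into_iter().collect();
    set.into_iter().collect()
}

/// Turn sorted split points into half-open row ranges [start, end).
pub fn split_ranges(splits: &[u64]) -> Vec<Range<u64>> {
    splits.windows(2).map(|w| w[0]..w[1]).collect()
}
```

For example, `splits_vec(vec![8, 0, 4, 4, 8])` and `splits_btree(vec![8, 0, 4, 4, 8])` both yield `[0, 4, 8]`, and `split_ranges` turns that into the ranges `0..4` and `4..8`. Whether the `Vec` variant is actually faster depends on the number of split points and should be measured.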

Big improvements:

  1. Subsegment reads: on feature vectors, for example, we currently read 10x more data. See "Sub segment read p2" #7368.
  2. io_uring: currently the cost of scheduling a single task for a small read exceeds the cost of the read itself.
  3. Vectored reads, which address the same problem as (2) from a different angle: if we can batch small read_at() calls for local disks, we save on tokio task scheduling.
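
One way to batch small reads, sketched below, is to coalesce nearby byte ranges so that a single larger read serves several requests. This is range coalescing rather than true vectored I/O (such as `preadv`), and it is not taken from the Vortex code base: `coalesce`, `slice_out` and the `max_gap` threshold are hypothetical names chosen for the example.

```rust
use std::ops::Range;

/// Merge byte ranges whose gap is at most `max_gap` bytes, so that
/// several small reads can be served by one larger read.
pub fn coalesce(mut ranges: Vec<Range<u64>>, max_gap: u64) -> Vec<Range<u64>> {
    ranges.sort_unstable_by_key(|r| r.start);
    let mut merged: Vec<Range<u64>> = Vec::with_capacity(ranges.len());
    for r in ranges {
        if let Some(last) = merged.last_mut() {
            // Close enough to the previous range: extend it instead
            // of issuing a separate read.
            if r.start <= last.end.saturating_add(max_gap) {
                last.end = last.end.max(r.end);
                continue;
            }
        }
        merged.push(r);
    }
    merged
}

/// Slice the originally requested ranges back out of one coalesced buffer.
/// `base` is the file offset at which `buf` starts; every wanted range
/// must lie inside the buffer, otherwise the slicing panics.
pub fn slice_out<'a>(buf: &'a [u8], base: u64, wanted: &[Range<u64>]) -> Vec<&'a [u8]> {
    wanted
        .iter()
        .map(|r| &buf[(r.start - base) as usize..(r.end - base) as usize])
        .collect()
}
```

With `max_gap = 4`, the requests `100..110`, `0..10` and `12..20` collapse into two reads, `0..20` and `100..110`. The trade-off is reading the gap bytes in exchange for fewer scheduled tasks, so a suitable `max_gap` would have to be found by benchmarking.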

Moving the needle (won't do):

  1. Basing the natural split heuristic on the number of ranges rather than on a constant range distance: 33% improvement on feature-vectors, 10% regression on nested lists, since this favours flat data.
  2. For small split read tasks, the read time is marginal compared to tokio task scheduling and LazyScanStream initialization. This could be addressed by a heuristic, but performance improves only marginally, and only for the main thread.

TODOs:

  • feature-vectors/correlated is faster with footer reopen. Why?

Additional context:

We were incorrectly measuring performance for vortex, likely influenced by Lance #8470

Labels

tracking-issue: Shared implementation context for work likely to span multiple PRs.