[Proposal][C++]: Add LRU chunk cache to Arrow chunk readers to avoid redundant file I/O #860

@SYaoJun

Background

Currently, all Arrow chunk readers (VertexPropertyArrowChunkReader, AdjListArrowChunkReader, AdjListOffsetArrowChunkReader, AdjListPropertyArrowChunkReader) discard the loaded chunk_table_ every time the chunk position changes via seek(), next_chunk(), or seek_chunk_index(). As a result, if a user seeks back to a previously loaded chunk, the entire Parquet file must be re-opened, its metadata parsed, and its data decoded again, even though the data hasn't changed.

This is particularly costly in graph traversal workloads (BFS, PageRank, label filtering) where vertex/edge access patterns exhibit strong locality, causing the same chunks to be read repeatedly.

Proposal

Introduce a generic LruCache<Key, Value> and integrate it into all four chunk reader classes. When a chunk is loaded from disk, it is stored in the cache. On subsequent seeks to the same chunk, the cached arrow::Table is returned directly, avoiding file I/O entirely.
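
To make the idea concrete, here is a minimal sketch of what such a cache could look like. It is only an illustration under stated assumptions, not the proposed implementation: the method names (Get, Put, Size), the fixed entry-count capacity, and the lack of thread safety are all hypothetical choices. In the readers, Key would presumably be the chunk index and Value a std::shared_ptr<arrow::Table>; the sketch stays independent of Arrow so it compiles on its own with C++17.

```cpp
#include <cstddef>
#include <list>
#include <optional>
#include <unordered_map>
#include <utility>

// Hypothetical sketch of a generic LRU cache (not thread-safe).
// Capacity is counted in entries, which is an assumption of this sketch.
template <typename Key, typename Value>
class LruCache {
 public:
  explicit LruCache(std::size_t capacity) : capacity_(capacity) {}

  // Returns a copy of the cached value and marks the entry as most
  // recently used, or std::nullopt on a miss.
  std::optional<Value> Get(const Key& key) {
    auto it = index_.find(key);
    if (it == index_.end()) return std::nullopt;
    // splice moves the node to the front without invalidating iterators.
    entries_.splice(entries_.begin(), entries_, it->second);
    return it->second->second;
  }

  // Inserts or updates an entry; evicts the least recently used entry
  // when the cache is full. A capacity of zero disables caching.
  void Put(const Key& key, Value value) {
    if (capacity_ == 0) return;
    auto it = index_.find(key);
    if (it != index_.end()) {
      it->second->second = std::move(value);
      entries_.splice(entries_.begin(), entries_, it->second);
      return;
    }
    if (entries_.size() >= capacity_) {
      index_.erase(entries_.back().first);
      entries_.pop_back();
    }
    entries_.emplace_front(key, std::move(value));
    index_[key] = entries_.begin();
  }

  std::size_t Size() const { return entries_.size(); }

 private:
  using Entry = std::pair<Key, Value>;
  std::size_t capacity_;
  std::list<Entry> entries_;  // front = most recently used
  std::unordered_map<Key, typename std::list<Entry>::iterator> index_;
};
```

With a layout like this, a reader would call Get(chunk_index) before touching the file system and fall back to reading the Parquet file followed by Put(chunk_index, table) only on a miss. Because the value would be a shared_ptr, a hit copies a pointer rather than the table. A capacity expressed in bytes instead of entries may be a better fit for chunks of uneven size; that choice is left open here.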

Component(s)

C++

Labels

enhancement: New feature or request
