112 changes: 112 additions & 0 deletions docs/content/program-api/file-cache.md
@@ -0,0 +1,112 @@
---
title: "Local Disk Cache"
weight: 8
type: docs
aliases:
- /program-api/file-cache.html
- /pypaimon/file-cache.html
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# Local Disk Cache

When reading files from remote storage (S3, OSS, HDFS, etc.), every seek and read goes over the network. Paimon provides a block-level local disk cache that transparently stores the blocks it reads on local disk, which reduces remote I/O when the same data is read repeatedly.

## Cached File Types

The cache classifies files by type. By default, only `meta` and `global-index` types are cached. You can customize this via the `file-cache.whitelist` option.

| File Type | Config Name | Examples | Default Cached |
|-----------|-------------|----------|----------------|
| META | meta | snapshot, schema, manifest, statistics, tag | Yes |
| GLOBAL_INDEX | global-index | BTree, Lumina, Tantivy index files | Yes |
| BUCKET_INDEX | bucket-index | Hash, deletion vector index files | No |
| DATA | data | Data files (ORC, Parquet, etc.) | No |
| FILE_INDEX | file-index | Data-file level bloom filter, bitmap | No |

Any file type in the table above can be added to the whitelist, for example `meta,global-index,data` to also cache data files.
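
As an illustration of how a comma-separated whitelist maps to a per-file caching decision, here is a minimal sketch. The helper names `parse_whitelist` and `is_cached` are hypothetical and are not part of the Paimon or PyPaimon API; only the config names and the default value come from the table above.

```python
# Hypothetical helpers for illustration only; not part of Paimon or PyPaimon.
SUPPORTED_TYPES = {"meta", "global-index", "bucket-index", "data", "file-index"}
DEFAULT_WHITELIST = "meta,global-index"


def parse_whitelist(value=DEFAULT_WHITELIST):
    """Split a comma-separated whitelist into a set of config names."""
    names = {part.strip() for part in value.split(",") if part.strip()}
    unknown = names - SUPPORTED_TYPES
    if unknown:
        raise ValueError(f"unsupported file types: {sorted(unknown)}")
    return names


def is_cached(file_type, whitelist):
    """Return True when the given file type is in the whitelist."""
    return file_type in whitelist


default = parse_whitelist()
custom = parse_whitelist("meta, global-index, data")
print(is_cached("data", default))  # data files are not cached by default
print(is_cached("data", custom))   # cached once added to the whitelist
```

Rejecting unknown names in this sketch is a design choice for the example; how Paimon itself treats an unsupported value is not stated here.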

## Enable Cache

Use `table.copy()` to pass cache options as dynamic parameters:

{{< tabs "enable-cache" >}}

{{< tab "Java" >}}

```java
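import org.apache.paimon.catalog.Identifier;
import org.apache.paimon.table.Table;

import java.util.HashMap;
import java.util.Map;

// `catalog` is an existing org.apache.paimon.catalog.Catalog instance
Table table = catalog.getTable(Identifier.create("my_db", "my_table"));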

Map<String, String> options = new HashMap<>();
options.put("file-cache.enabled", "true");
// optional: customize cache directory and limits
options.put("file-cache.dir", "/tmp/paimon-file-cache");
options.put("file-cache.max-size", "2gb");
options.put("file-cache.block-size", "1mb");

// All subsequent reads on this table instance will use the cache
table = table.copy(options);
```

{{< /tab >}}

{{< tab "Python" >}}

```python
table = catalog.get_table("db.my_table")

# Enable cache with dynamic options
table = table.copy({
    "file-cache.enabled": "true",
    # optional: customize cache directory and limits
    "file-cache.dir": "/tmp/paimon-file-cache",
    "file-cache.max-size": "2gb",
    "file-cache.block-size": "1mb",
})

# All subsequent reads on this table instance will use the cache
```

{{< /tab >}}

{{< /tabs >}}

## Cache Options

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `file-cache.enabled` | Boolean | false | Whether to enable local disk block cache. |
| `file-cache.dir` | String | `<tmpdir>/paimon-file-cache` | Directory for storing cached blocks. |
| `file-cache.max-size` | MemorySize | unlimited | Maximum total size of the cache. When exceeded, the least recently used blocks are evicted. |
| `file-cache.block-size` | MemorySize | 1 mb | Block size for caching. Files are logically divided into fixed-size blocks and cached independently. |
| `file-cache.whitelist` | String | meta,global-index | Comma-separated list of file types to cache. Supported values: `meta`, `global-index`, `bucket-index`, `data`, `file-index`. |
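
To get a feel for how `file-cache.max-size` and `file-cache.block-size` interact, the following sketch works through the arithmetic for the values used in the example above. It assumes that `mb` and `gb` are binary units (1 mb = 1024 × 1024 bytes); the function `blocks_touched` is a hypothetical helper, not a Paimon API.

```python
# Assumption: size units are binary (1 mb = 1024 * 1024 bytes).
MIB = 1024 ** 2
GIB = 1024 ** 3

block_size = 1 * MIB  # file-cache.block-size (default)
max_size = 2 * GIB    # file-cache.max-size from the example above


def blocks_touched(offset, length, block_size):
    """Number of fixed-size blocks covered by a read of `length` bytes at `offset`."""
    first = offset // block_size
    last = (offset + length - 1) // block_size
    return last - first + 1


# How many full blocks fit before eviction has to start.
max_blocks = max_size // block_size
print(max_blocks)

# A small read inside one block touches a single block...
print(blocks_touched(5 * MIB, 10, block_size))
# ...while a read that straddles a block boundary touches two.
print(blocks_touched(MIB - 1, 2, block_size))
```

A smaller block size lowers the amount fetched for a tiny read but increases the number of cached blocks to track; a larger one does the opposite.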

## How It Works

- Files are logically divided into fixed-size blocks (default 1 MB).
- On the first read, blocks are downloaded from remote storage and saved to local disk.
- Subsequent reads of the same block are served from local disk, skipping remote I/O.
- Cache files are keyed by remote file path and block offset, so they persist across process restarts and can be reused.
- When the cache exceeds `file-cache.max-size`, the least recently used blocks are evicted automatically.
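
The steps above can be sketched as a toy block cache. This is a simplified illustration under stated assumptions, not Paimon's implementation: blocks are held in memory rather than on disk, and the class `BlockCache` and the `fetch` callback are hypothetical names introduced for the example.

```python
from collections import OrderedDict


class BlockCache:
    """Toy LRU block cache; keeps blocks in memory instead of on local disk."""

    def __init__(self, fetch, block_size, max_size):
        self.fetch = fetch           # fetch(path, offset, length) -> bytes
        self.block_size = block_size
        self.max_size = max_size
        self.blocks = OrderedDict()  # (path, block_offset) -> bytes
        self.used = 0

    def _block(self, path, block_offset):
        key = (path, block_offset)
        if key in self.blocks:
            self.blocks.move_to_end(key)  # mark as most recently used
            return self.blocks[key]
        data = self.fetch(path, block_offset, self.block_size)  # remote read
        self.blocks[key] = data
        self.used += len(data)
        # Evict least recently used blocks, but keep the block just loaded.
        while self.used > self.max_size and len(self.blocks) > 1:
            _, evicted = self.blocks.popitem(last=False)
            self.used -= len(evicted)
        return data

    def read(self, path, offset, length):
        out = bytearray()
        pos, end = offset, offset + length
        while pos < end:
            block_offset = pos - pos % self.block_size
            data = self._block(path, block_offset)
            start = pos - block_offset
            chunk = data[start:start + (end - pos)]
            if not chunk:
                break  # reached the end of the file
            out += chunk
            pos += len(chunk)
        return bytes(out)


# Usage with an in-memory stand-in for remote storage.
remote = {"f": bytes(range(256)) * 4}
calls = []


def fetch(path, offset, length):
    calls.append((path, offset))
    return remote[path][offset:offset + length]


cache = BlockCache(fetch, block_size=100, max_size=250)
first = cache.read("f", 95, 10)  # spans two blocks, so two remote reads
second = cache.read("f", 0, 50)  # served from the cached first block
```

Keying blocks by `(path, block_offset)` mirrors the keying described above; a disk-backed version would map the same key to a file name instead of a dictionary entry.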
136 changes: 136 additions & 0 deletions docs/content/pypaimon/global-index.md
@@ -0,0 +1,136 @@
---
title: "Global Index"
weight: 6
type: docs
aliases:
- /pypaimon/global-index.html
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# Global Index

PyPaimon supports querying global indexes built on Data Evolution (append) tables. Three index types are available:

- **BTree Index**: B-tree based index for scalar column lookups. Supports equality, IN, range, and combined predicates.
- **Vector Index (Lumina)**: Approximate nearest neighbor (ANN) index for vector similarity search.
- **Full-Text Index (Tantivy)**: Full-text search index for text retrieval with relevance scoring.

> Global indexes must be built beforehand (e.g., via Spark or Flink). See [Global Index]({{< ref "append-table/global-index" >}}) for how to create indexes.

## BTree Index

The BTree index is used automatically during a scan when a filter predicate references the indexed column. No special API is needed: set a filter on the read builder.

```python
import pypaimon

catalog = pypaimon.create_catalog(...)
table = catalog.get_table("db.my_table")

# BTree index is used automatically when filtering on indexed columns
read_builder = table.new_read_builder()
read_builder = read_builder.with_filter(
    pypaimon.PredicateBuilder(table.fields)
    .in_("name", ["a200", "a300"])
)

scan = read_builder.new_scan()
read = read_builder.new_read()
splits = scan.plan().splits
data = read.to_arrow(splits)
```

Supported predicates: `equal`, `not_equal`, `less_than`, `less_or_equal`, `greater_than`, `greater_or_equal`, `in_`, `not_in`, `between`, `is_null`, `is_not_null`.
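
To see why a sorted index can answer equality, IN, and range predicates without scanning every row, consider the conceptual sketch below. It uses a sorted list and binary search as a stand-in for a B-tree; it is not Paimon's on-disk index format, and the sample rows and function names are made up for the example.

```python
from bisect import bisect_left, bisect_right

# Toy sorted index over (value, row_id) pairs; a stand-in for a B-tree.
rows = ["a300", "a100", "a200", "a200", "a400"]
entries = sorted((value, row_id) for row_id, value in enumerate(rows))
keys = [value for value, _ in entries]


def equal(value):
    """Row IDs whose indexed value equals `value`."""
    lo, hi = bisect_left(keys, value), bisect_right(keys, value)
    return sorted(row_id for _, row_id in entries[lo:hi])


def between(low, high):
    """Row IDs whose indexed value lies in the inclusive range [low, high]."""
    lo, hi = bisect_left(keys, low), bisect_right(keys, high)
    return sorted(row_id for _, row_id in entries[lo:hi])


def in_(values):
    """Row IDs whose indexed value is any of `values`."""
    return sorted({row_id for v in values for row_id in equal(v)})


print(equal("a200"))
print(between("a200", "a300"))
print(in_(["a200", "a300"]))
```

Each lookup locates the matching key range with binary search and returns row IDs, which is the same kind of result the scan then uses to skip non-matching rows.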

## Vector Index (Lumina)

Use `VectorSearchBuilder` to perform approximate nearest neighbor search on a vector column, then read the matched rows.

```python
table = catalog.get_table("db.my_table")

# Step 1: vector search to get matching row IDs
builder = table.new_vector_search_builder()
index_result = (
    builder
    .with_vector_column("embedding")
    .with_query_vector([1.0, 2.0, 3.0, ...])
    .with_limit(10)
    .execute_local()
)

# Step 2: read actual data for matched rows
read_builder = table.new_read_builder()
scan = read_builder.new_scan()
scan.with_global_index_result(index_result)
read = read_builder.new_read()
data = read.to_arrow(scan.plan().splits)
```

You can also add a scalar filter to pre-filter rows before vector search:

```python
predicate = (
    pypaimon.PredicateBuilder(table.fields)
    .equal("category", "electronics")
)

index_result = (
    table.new_vector_search_builder()
    .with_vector_column("embedding")
    .with_query_vector([1.0, 2.0, 3.0, ...])
    .with_limit(10)
    .with_filter(predicate)
    .execute_local()
)

read_builder = table.new_read_builder()
scan = read_builder.new_scan()
scan.with_global_index_result(index_result)
read = read_builder.new_read()
data = read.to_arrow(scan.plan().splits)
```
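
To make the effect of pre-filtering concrete, the sketch below computes an exact top-k by brute force over a handful of made-up rows. It is only a conceptual reference: the Lumina index is approximate rather than exhaustive, and the L2 distance used here is an assumption, since the metric depends on how the index was built.

```python
import heapq
import math

# Hypothetical rows: (row_id, category, embedding)
rows = [
    (0, "electronics", [1.0, 0.0]),
    (1, "books", [0.9, 0.1]),
    (2, "electronics", [0.0, 1.0]),
    (3, "electronics", [0.8, 0.2]),
]


def l2(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query, k, keep=lambda row: True):
    """Exact k nearest rows among those accepted by the `keep` pre-filter."""
    candidates = [row for row in rows if keep(row)]
    nearest = heapq.nsmallest(k, candidates, key=lambda row: l2(row[2], query))
    return [row[0] for row in nearest]


query = [1.0, 0.0]
unfiltered = top_k(query, 2)
filtered = top_k(query, 2, keep=lambda row: row[1] == "electronics")
print(unfiltered, filtered)
```

Because the filter is applied before ranking, the limit counts only rows that satisfy the predicate, so a close but non-matching row does not use up one of the result slots.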

## Full-Text Index (Tantivy)

Use `FullTextSearchBuilder` to perform full-text search on a text column, then read the matched rows.

```python
table = catalog.get_table("db.my_table")

# Step 1: full-text search to get matching row IDs
builder = table.new_full_text_search_builder()
index_result = (
    builder
    .with_text_column("content")
    .with_query_text("search keywords")
    .with_limit(20)
    .execute_local()
)

# Step 2: read actual data for matched rows
read_builder = table.new_read_builder()
scan = read_builder.new_scan()
scan.with_global_index_result(index_result)
read = read_builder.new_read()
data = read.to_arrow(scan.plan().splits)
```
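
As a rough intuition for what "search with a limit, ranked by relevance" returns, the sketch below scores a few made-up documents by how often the query terms occur. This term-count score is a deliberate simplification and not Tantivy's scoring function; the documents and the `search` function are hypothetical.

```python
from collections import Counter

# Hypothetical documents keyed by row ID.
docs = {
    0: "paimon supports local disk cache",
    1: "global index and full text search",
    2: "full text search with text relevance scoring",
}


def search(query, limit):
    """Return up to `limit` row IDs, highest term-count score first."""
    terms = query.lower().split()
    scored = []
    for row_id, text in docs.items():
        counts = Counter(text.lower().split())
        score = sum(counts[term] for term in terms)
        if score > 0:
            scored.append((score, row_id))
    # Sort by descending score, breaking ties by ascending row ID.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [row_id for _, row_id in scored[:limit]]


print(search("text search", 10))
print(search("text search", 1))
```

The result is a ranked list of row IDs, which plays the same role as the index result that is passed to the scan in the example above.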

For better performance when reading from remote storage, consider enabling the [Local Disk Cache]({{< ref "program-api/file-cache" >}}).
30 changes: 30 additions & 0 deletions docs/layouts/shortcodes/generated/core_configuration.html
@@ -566,6 +566,36 @@
<td>String</td>
<td>Default aggregate function of all fields for partial-update and aggregate merge function.</td>
</tr>
<tr>
<td><h5>file-cache.block-size</h5></td>
<td style="word-wrap: break-word;">1 mb</td>
<td>MemorySize</td>
<td>Block size for local disk cache.</td>
</tr>
<tr>
<td><h5>file-cache.dir</h5></td>
<td style="word-wrap: break-word;">(none)</td>
<td>String</td>
<td>Directory for file block cache. Defaults to a 'paimon-file-cache' subdirectory under the system temp directory.</td>
</tr>
<tr>
<td><h5>file-cache.enabled</h5></td>
<td style="word-wrap: break-word;">false</td>
<td>Boolean</td>
<td>Whether to enable local disk block cache for file reads.</td>
</tr>
<tr>
<td><h5>file-cache.max-size</h5></td>
<td style="word-wrap: break-word;">(none)</td>
<td>MemorySize</td>
<td>Maximum total size of the local disk block cache. Unlimited by default.</td>
</tr>
<tr>
<td><h5>file-cache.whitelist</h5></td>
<td style="word-wrap: break-word;">"meta,global-index"</td>
<td>String</td>
<td>Comma-separated list of file types to cache. Supported values: meta, global-index, bucket-index, data, file-index.</td>
</tr>
<tr>
<td><h5>file-index.in-manifest-threshold</h5></td>
<td style="word-wrap: break-word;">500 bytes</td>
57 changes: 57 additions & 0 deletions paimon-api/src/main/java/org/apache/paimon/CoreOptions.java
@@ -710,6 +710,41 @@ public InlineElement getDescription() {
                    .defaultValue(MemorySize.parse("64 kb"))
                    .withDescription("Memory page size for caching.");

    public static final ConfigOption<Boolean> FILE_CACHE_ENABLED =
            key("file-cache.enabled")
                    .booleanType()
                    .defaultValue(false)
                    .withDescription("Whether to enable local disk block cache for file reads.");

    public static final ConfigOption<String> FILE_CACHE_DIR =
            key("file-cache.dir")
                    .stringType()
                    .noDefaultValue()
                    .withDescription(
                            "Directory for file block cache. "
                                    + "Defaults to a 'paimon-file-cache' subdirectory under the system temp directory.");

    public static final ConfigOption<MemorySize> FILE_CACHE_MAX_SIZE =
            key("file-cache.max-size")
                    .memoryType()
                    .noDefaultValue()
                    .withDescription(
                            "Maximum total size of the local disk block cache. Unlimited by default.");

    public static final ConfigOption<MemorySize> FILE_CACHE_BLOCK_SIZE =
            key("file-cache.block-size")
                    .memoryType()
                    .defaultValue(MemorySize.ofMebiBytes(1))
                    .withDescription("Block size for local disk cache.");

    public static final ConfigOption<String> FILE_CACHE_WHITELIST =
            key("file-cache.whitelist")
                    .stringType()
                    .defaultValue("meta,global-index")
                    .withDescription(
                            "Comma-separated list of file types to cache. "
                                    + "Supported values: meta, global-index, bucket-index, data, file-index.");

    public static final ConfigOption<MemorySize> TARGET_FILE_SIZE =
            key("target-file-size")
                    .memoryType()
@@ -2887,6 +2922,28 @@ public int cachePageSize() {
        return (int) options.get(CACHE_PAGE_SIZE).getBytes();
    }

    public boolean fileCacheEnabled() {
        return options.get(FILE_CACHE_ENABLED);
    }

    @Nullable
    public String fileCacheDir() {
        return options.get(FILE_CACHE_DIR);
    }

    @Nullable
    public MemorySize fileCacheMaxSize() {
        return options.get(FILE_CACHE_MAX_SIZE);
    }

    public MemorySize fileCacheBlockSize() {
        return options.get(FILE_CACHE_BLOCK_SIZE);
    }

    public String fileCacheWhitelist() {
        return options.get(FILE_CACHE_WHITELIST);
    }

    public MemorySize lookupCacheMaxMemory() {
        return options.get(LOOKUP_CACHE_MAX_MEMORY_SIZE);
    }