Commit 5ba15c1

eolivelli, dian-lun-lin, and MarkWolters authored
Port datastax#659: Streaming N:1 on-disk graph index compaction (#6)
* Add on-disk graph index compaction algorithm

Introduce OnDiskGraphIndexCompactor and PQRetrainer for streaming N:1 merging of on-disk HNSW indexes without full in-memory materialization. Supports deletion filtering via live-node bitsets, custom ordinal mapping, and PQ codebook retraining.

* Add compaction unit tests

Tests for OnDiskGraphIndexCompactor covering basic compaction, deletions, ordinal remapping, multi-source merging, and FusedPQ compaction scenarios.

* Add reporting and storage infrastructure for CompactorBenchmark

Add JFR recording, system stats collection, JSONL logging, git info capture, thread allocation tracking, dataset partitioning, and cloud storage layout utilities used by CompactorBenchmark. Switch jvector-examples logging from logback to log4j2 for consistency with benchmarks-jmh and to avoid duplicate SLF4J bindings in the fat jar.

* Add CompactorBenchmark and tooling

JMH-based benchmark with configurable workload modes (PARTITION_AND_COMPACT, PARTITION_ONLY, COMPACT_ONLY, BUILD_FROM_SCRATCH), recall measurement, JFR recording, and JSONL result logging. Includes BenchmarkParamCounter for progress tracking, EventLogAnalyzer for post-run analysis, GHA workflow, and exec-maven-plugin integration. Add forced vectorization provider property to VectorizationProvider for benchmark reproducibility.

* Update build config and project metadata for compaction

Add result file patterns to .gitignore, update rat-excludes for the new compaction workflow and catalog cache files.

* Fix JMH jar selection in run-compaction.yml

The benchmarks-jmh-*.jar glob matched the -javadoc jar first, which has no Main-Class. Select the shaded JMH jar explicitly by excluding -javadoc and -sources jars.

* Fix CompactorBenchmark invocation in run-compaction.yml

Use -cp with CompactorBenchmark.main() instead of -jar with JMH Main to avoid BenchmarkList discovery issues in CI's shaded jar.

* Address PR review feedback

- Extract CompactWriter into its own file to reduce OnDiskGraphIndexCompactor size
- Rewrite SystemStatsCollector to read /proc files directly in Java instead of spawning bash
- Clarify recall section description in docs/compaction.md

* Fix benchmark invocation in docs and default dataset

Use -cp instead of -jar in docs since the benchmarks-jmh-*.jar glob matches the -javadoc jar first. Change default dataset from glove-100-angular to ada002-100k. Note -Xmx should be adjusted to fit the dataset.

* Fix jar selection: use fixed output name compactor-benchmark.jar

The benchmarks-jmh-*.jar glob expands to multiple jars (shaded + javadoc), causing -cp to misinterpret the second jar as the main class. Configure shade plugin outputFile to produce a fixed compactor-benchmark.jar name. Update docs and CI workflow.

* Refactor workload modes and fix build-from-scratch timing

Simplify WorkloadMode enum: PARTITION_ONLY/COMPACT_ONLY/COMPACT_AND_RECALL/BUILD_FROM_SCRATCH collapsed into PARTITION/COMPACT/BUILD/PARTITION_AND_COMPACT plus a separate measureRecall flag. Fix buildFromScratch timing to include PQ computation and graph construction (previously only timed the write step). Add fair comparison guidelines to CompactorBenchmark.md.

* Add TIERED_10_90 and TIERED_1_99 split distributions

Support 10%/90% and 1%/99% partition splits for benchmarking compaction of a small new segment into a large existing index. Add split distribution reference table to CompactorBenchmark.md.

* fix for bug when fused pq is used with no hierarchy (datastax#664)

---------

Co-authored-by: dian-lun-lin <cyc4542000@gmail.com>
Co-authored-by: Mark Wolters <mwolters138@gmail.com>
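The jar-selection fixes described in the message come down to shell glob behavior: a pattern such as `benchmarks-jmh-*.jar` expands to every matching file, so `-cp` or `-jar` receives more than one word. The following is a minimal sketch of that effect, not the project's build. The file names and the `1.0` version are hypothetical stand-ins for what a Maven `target/` directory might contain.

```shell
#!/usr/bin/env bash
# Hypothetical reproduction: the file names only mimic a Maven target/ directory.
set -euo pipefail

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

touch "$workdir/benchmarks-jmh-1.0.jar" \
      "$workdir/benchmarks-jmh-1.0-javadoc.jar" \
      "$workdir/benchmarks-jmh-1.0-sources.jar"

# The glob expands to one word per matching file, not to a single jar.
jars=("$workdir"/benchmarks-jmh-*.jar)
echo "glob matched ${#jars[@]} jars"

# Filtering out the auxiliary jars leaves a single candidate.
shaded=()
for jar in "${jars[@]}"; do
  case "$jar" in
    *-javadoc.jar|*-sources.jar) ;;   # skip jars that carry no benchmark classes
    *) shaded+=("$jar") ;;
  esac
done
echo "candidates after filtering: ${#shaded[@]}"
```

A fixed output name, as the shade plugin `outputFile` setting in this commit provides, avoids the glob entirely, which is why the workflow can refer to `compactor-benchmark.jar` directly.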
1 parent 17cf5d9 commit 5ba15c1

34 files changed

Lines changed: 7696 additions & 17 deletions
Lines changed: 114 additions & 0 deletions
@@ -0,0 +1,114 @@
+name: Run Compaction Bench
+
+on:
+  workflow_dispatch:
+    inputs:
+      dataset:
+        description: 'Dataset name passed to CompactorBenchmark (-p datasetNames)'
+        required: false
+        default: 'ada002-100k'
+      branches:
+        description: 'Space-separated list of branches to benchmark'
+        required: false
+        default: 'main'
+  pull_request:
+    types: [opened, synchronize, ready_for_review]
+    branches:
+      - main
+    paths:
+      - '**/src/main/java/**'
+      - 'pom.xml'
+      - '**/pom.xml'
+
+jobs:
+  # Job to generate the matrix configuration
+  generate-matrix:
+    runs-on: ubuntu-latest
+    outputs:
+      matrix: ${{ steps.set-matrix.outputs.matrix }}
+    steps:
+      - name: Generate matrix
+        id: set-matrix
+        run: |
+          if [[ "${{ github.event_name }}" == "pull_request" ]]; then
+            BRANCHES='["main", "${{ github.head_ref }}"]'
+          elif [[ "${{ github.event_name }}" == "workflow_dispatch" && -n "${{ github.event.inputs.branches }}" ]]; then
+            BRANCHES_INPUT="${{ github.event.inputs.branches }}"
+            BRANCHES="["
+            for branch in $BRANCHES_INPUT; do
+              if [[ "$BRANCHES" != "[" ]]; then
+                BRANCHES="$BRANCHES, "
+              fi
+              BRANCHES="$BRANCHES\"$branch\""
+            done
+            BRANCHES="$BRANCHES]"
+          else
+            BRANCHES='["main"]'
+          fi
+
+          echo "matrix={\"jdk\":[24],\"isa\":[\"isa-avx512f\"],\"branch\":$BRANCHES}" >> $GITHUB_OUTPUT
+
+  test-compaction:
+    needs: generate-matrix
+    strategy:
+      matrix: ${{ fromJSON(needs.generate-matrix.outputs.matrix) }}
+    runs-on: ${{ matrix.isa }}
+    steps:
+      - name: Set up GCC
+        run: sudo apt install -y gcc
+      - uses: actions/checkout@v4
+      - name: Set up JDK ${{ matrix.jdk }}
+        uses: actions/setup-java@v3
+        with:
+          java-version: ${{ matrix.jdk }}
+          distribution: temurin
+          cache: maven
+
+      - name: Checkout branch
+        uses: actions/checkout@v4
+        with:
+          ref: ${{ matrix.branch }}
+          fetch-depth: 0
+
+      - name: Build branch
+        run: mvn -B -Punix-amd64-profile package --file pom.xml
+
+      - name: Run CompactorBenchmark
+        id: run-benchmark
+        run: |
+          TOTAL_MEM_GB=$(free -g | awk '/^Mem:/ {print $2}')
+          if [[ -z "$TOTAL_MEM_GB" ]] || [[ "$TOTAL_MEM_GB" -le 0 ]]; then
+            TOTAL_MEM_GB=16
+          fi
+          HALF_MEM_GB=$((TOTAL_MEM_GB / 2))
+          if [[ "$HALF_MEM_GB" -lt 1 ]]; then
+            HALF_MEM_GB=1
+          fi
+
+          DATASET="${{ github.event.inputs.dataset }}"
+          if [[ -z "$DATASET" ]]; then
+            DATASET="ada002-100k"
+          fi
+
+          SAFE_BRANCH=$(echo "${{ matrix.branch }}" | sed 's/[^A-Za-z0-9_-]/_/g')
+          echo "safe_branch=$SAFE_BRANCH" >> $GITHUB_OUTPUT
+
+          java --enable-native-access=ALL-UNNAMED --add-modules=jdk.incubator.vector \
+            -Djvector.experimental.enable_native_vectorization=true \
+            -Xmx${HALF_MEM_GB}g \
+            -cp benchmarks-jmh/target/compactor-benchmark.jar \
+            io.github.jbellis.jvector.bench.CompactorBenchmark \
+            -p workloadMode=PARTITION_AND_COMPACT \
+            -p datasetNames=$DATASET \
+            -p numPartitions=4 \
+            -p splitDistribution=FIBONACCI \
+            -p indexPrecision=FUSEDPQ \
+            -jvmArgsPrepend "-Xmx${HALF_MEM_GB}g" \
+            -wi 0 -i 1 -f 1
+
+      - name: Upload compaction results
+        uses: actions/upload-artifact@v4
+        with:
+          name: compaction-results-${{ matrix.isa }}-jdk${{ matrix.jdk }}-${{ steps.run-benchmark.outputs.safe_branch }}
+          path: target/benchmark-results/compactor-*/compactor-results.jsonl
+          if-no-files-found: warn
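The shell logic embedded in the workflow's `run` steps can be exercised locally without GitHub Actions. The sketch below lifts three pieces of it into functions: the branch-list-to-JSON loop, the branch name sanitization used for the artifact name, and the half-of-memory heap sizing. The function names (`branches_to_json`, `safe_branch`, `half_mem_gb`) and the sample branch `feature/compaction` are hypothetical and do not appear in the commit; only the bodies mirror the workflow.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Turn a space-separated branch list into a JSON array string,
# using the same loop as the "Generate matrix" step.
branches_to_json() {
  local out="[" branch
  for branch in $1; do          # unquoted on purpose: split on whitespace
    if [[ "$out" != "[" ]]; then
      out="$out, "
    fi
    out="$out\"$branch\""
  done
  echo "$out]"
}

# Replace every character outside [A-Za-z0-9_-] with an underscore,
# as the workflow does before using the branch in an artifact name.
safe_branch() {
  echo "$1" | sed 's/[^A-Za-z0-9_-]/_/g'
}

# Half of the total memory in GiB (integer division), never below 1.
half_mem_gb() {
  local half=$(( $1 / 2 ))
  if [[ "$half" -lt 1 ]]; then
    half=1
  fi
  echo "$half"
}

branches_to_json "main feature/compaction"   # prints ["main", "feature/compaction"]
safe_branch "feature/compaction"             # prints feature_compaction
half_mem_gb 1                                # prints 1
```

The string-concatenation approach produces valid JSON only when branch names contain no double quotes or backslashes, so it is worth keeping in mind as a limitation of the loop rather than a general-purpose JSON encoder.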

.gitignore

Lines changed: 4 additions & 0 deletions
@@ -6,6 +6,10 @@ local/
 dataset_
 **/local_datasets/**
 
+### Testing Results
+**results**.json
+**results**.jsonl
+
 ### Bench caches
 pq_cache/
 index_cache/

benchmarks-jmh/pom.xml

Lines changed: 39 additions & 1 deletion
@@ -15,6 +15,9 @@
         <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
         <maven.compiler.release>22</maven.compiler.release>
         <jmh.version>1.37</jmh.version>
+        <awssdk.version>2.21.10</awssdk.version>
+        <!-- Default benchmark arguments (empty) -->
+        <args></args>
     </properties>
 
     <dependencies>
@@ -53,6 +56,11 @@
             <artifactId>log4j-slf4j2-impl</artifactId>
             <version>2.24.3</version>
         </dependency>
+        <dependency>
+            <groupId>software.amazon.awssdk</groupId>
+            <artifactId>ec2</artifactId>
+            <version>${awssdk.version}</version>
+        </dependency>
 
     </dependencies>
 
@@ -85,6 +93,7 @@
                             <goal>shade</goal>
                         </goals>
                         <configuration>
+                            <outputFile>${project.build.directory}/compactor-benchmark.jar</outputFile>
                             <transformers>
                                 <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                     <mainClass>org.openjdk.jmh.Main</mainClass>
@@ -94,6 +103,35 @@
                     </execution>
                 </executions>
             </plugin>
+
+            <plugin>
+                <groupId>org.codehaus.mojo</groupId>
+                <artifactId>exec-maven-plugin</artifactId>
+                <executions>
+                    <execution>
+                        <id>compactor</id>
+                        <goals>
+                            <goal>exec</goal>
+                        </goals>
+                        <configuration>
+                            <skip>false</skip>
+                            <executable>java</executable>
+                            <commandlineArgs>--enable-native-access=ALL-UNNAMED --add-modules=jdk.incubator.vector -Djvector.experimental.enable_native_vectorization=true -cp %classpath io.github.jbellis.jvector.bench.CompactorBenchmark ${args}</commandlineArgs>
+                        </configuration>
+                    </execution>
+                    <execution>
+                        <id>analyze</id>
+                        <goals>
+                            <goal>exec</goal>
+                        </goals>
+                        <configuration>
+                            <skip>false</skip>
+                            <executable>java</executable>
+                            <commandlineArgs>-cp %classpath io.github.jbellis.jvector.bench.benchtools.EventLogAnalyzer ${args}</commandlineArgs>
+                        </configuration>
+                    </execution>
+                </executions>
+            </plugin>
         </plugins>
     </build>
-</project>
+</project>