Commit b7f0ef6

Merge pull request #110 from Hogglo/vector-index
Add Vector Index Sample
2 parents 1f827ae + 5f8042f commit b7f0ef6

4 files changed

Lines changed: 338 additions & 0 deletions

ai-vectors/vector-index/README.md

Lines changed: 213 additions & 0 deletions
@@ -0,0 +1,213 @@
# Vector Indexes in Db2 (Early Access Program)

This README provides step-by-step instructions for exploring the new Vector Indexes feature introduced in IBM Db2 as part of the Early Access Program (EAP).

Vector Indexes enable efficient similarity search over high-dimensional vector data, supporting use cases such as AI-powered retrieval, recommendation systems, and semantic search.

⚠️ Vector Index functionality is only available in the Db2 Early Access Program (EAP 97). It is not included in general availability (GA) releases.

## Before You Begin

To understand the capabilities, limitations, and prerequisites of the Vector Index feature in Db2, please read the official Early Access Program documentation.

## Workflow Overview

This guide walks you through the following steps:
1. Downloading sample vector data
2. Formatting the data for Db2 LOAD
3. Creating a vector table
4. Loading the vector data into Db2
5. Creating a vector index
6. Querying the vector index
7. Dropping the vector index

## Sample Dataset

The sample vector data used in this guide is the SIFT1M dataset, which is commonly used for benchmarking similarity search algorithms. SIFT1M consists of:
* 1 million vectors
* Each vector has 128 dimensions

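As a rough sizing aid, the figures above translate into a raw payload size. The sketch below assumes 4 bytes per dimension, matching the `VECTOR(128, FLOAT32)` column type used in this guide; the actual on-disk size in Db2 and the size of the generated CSV files will differ.

```python
# Raw payload size of SIFT1M when each dimension is stored as a 4-byte float.
# This is only the vector data itself, not Db2 storage or index overhead.
NUM_VECTORS = 1_000_000
DIMENSIONS = 128
BYTES_PER_FLOAT32 = 4  # assumption: FLOAT32 elements, as in VECTOR(128, FLOAT32)

raw_bytes = NUM_VECTORS * DIMENSIONS * BYTES_PER_FLOAT32
print(f"{raw_bytes:,} bytes (about {raw_bytes / 2**20:.0f} MiB)")
# prints: 512,000,000 bytes (about 488 MiB)
```
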
## Prerequisites and Environment Setup

Before running the example, ensure the following prerequisites are met:

* CPU: AMD64 with AVX2
* Operating System: RHEL 9.4
* Python: Version 3+ with pip
* Tools: curl (for downloading the dataset)
* Db2: Access to the Early Access Program (EAP 97)

Next, download all the files contained in this directory to your local machine.

## Step-by-Step Instructions

### Step 1: Download and Format Sample Vector Data

Run the provided shell script to download the SIFT1M dataset and convert it into a CSV format suitable for Db2 LOAD:

```bash
./downloadAndFormatVectorData.sh
```

Output:
* `sift_base.csv` containing 1M rows of 128-dimensional vectors.
* `sift_query_100.csv` containing 100 query vectors randomly selected from the SIFT1M query set.
* `sift_groundtruth_100.csv` containing the top 100 nearest neighbor IDs (from `sift_base.csv`) for each query, ordered by increasing squared Euclidean distance.

_Note: The script may take a couple of minutes to complete depending on your network speed and system performance._

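Before loading, it can help to sanity-check the generated files. The formatting script in this directory writes each row as an ID followed by a quoted, bracketed list of values (`id,"[v1,v2,...]"`). The sketch below is one possible check, not part of the sample itself; the helper names `parse_vector_row` and `check_rows` are hypothetical.

```python
import csv

EXPECTED_DIM = 128  # matches the VECTOR(128, FLOAT32) columns in this guide


def parse_vector_row(fields):
    """Convert one parsed CSV record [id, "[v1,v2,...]"] into (int id, list of floats)."""
    row_id, vector_text = fields
    values = [float(v) for v in vector_text.strip("[]").split(",")]
    return int(row_id), values


def check_rows(lines, expected_dim=EXPECTED_DIM):
    """Return the number of rows, raising ValueError on the first malformed one."""
    count = 0
    for fields in csv.reader(lines):
        row_id, values = parse_vector_row(fields)
        if len(values) != expected_dim:
            raise ValueError(
                f"row {row_id}: expected {expected_dim} values, got {len(values)}"
            )
        count += 1
    return count


# Tiny in-memory demonstration with 3-dimensional vectors.
sample = ['0,"[0.0e+00,1.6e+01,3.5e+01]"\n', '1,"[1.4e+01,3.5e+01,1.9e+01]"\n']
print(check_rows(sample, expected_dim=3))
```

To check a real file, pass an open file object instead of the sample list, for example `with open("sift_base.csv", newline="") as f: check_rows(f)`.
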
### Step 2: Enable Vector Index Feature in Db2

_Reminder: Make sure you've reviewed the EAP documentation to confirm your environment meets all prerequisites._

Set the required registry variable to enable vector indexing:

```bash
db2set DB2_VECTOR_INDEXING=TRUE
```

The setting takes effect without restarting the instance.

### Step 3: Create the Vector Tables and Load Data

This step sets up the tables for evaluating approximate nearest neighbor (ANN) search performance.

#### Create the Vector Table

Create a table with an ID and a vector column:

```sql
CREATE TABLE SIFT_BASE (
    ID INT NOT NULL,
    EMBEDDING VECTOR(128, FLOAT32) NOT NULL
)
```

Load the formatted CSV data into the table:

```sql
LOAD FROM sift_base.csv OF DEL
    INSERT INTO SIFT_BASE
```

#### Create the Query Table

```sql
CREATE TABLE SIFT_QUERY (
    ID INT NOT NULL,
    EMBEDDING VECTOR(128, FLOAT32) NOT NULL
)
```

Load the query vectors from the CSV file:

```sql
LOAD FROM sift_query_100.csv OF DEL
    INSERT INTO SIFT_QUERY
```

### Step 4: Create Vector Index and Collect Statistics

Create a vector index using Euclidean distance:

```sql
CREATE VECTOR INDEX SIFT_EUCLIDEAN
    ON SIFT_BASE (EMBEDDING)
    WITH DISTANCE EUCLIDEAN
```

_Note: Index creation can take a while to complete, depending on your system performance._

Run RUNSTATS to collect statistics so that queries can use the index instead of a brute-force search:

```sql
RUNSTATS ON TABLE SIFT_BASE FOR INDEXES ALL
```

### Step 5: Query Using Approximate Nearest Neighbor Search and Compare with Ground Truth

Retrieve the top 10 approximate nearest neighbors for a sample query (for example, the first row returned from the SIFT_QUERY table):

```sql
SELECT
    ID,
    VECTOR_DISTANCE(
        (SELECT EMBEDDING
         FROM SIFT_QUERY
         FETCH FIRST 1 ROWS ONLY),
        EMBEDDING,
        EUCLIDEAN)
    AS DISTANCE
FROM SIFT_BASE
ORDER BY DISTANCE
FETCH APPROX FIRST 10 ROWS ONLY
```

The `APPROX` keyword in `FETCH APPROX FIRST` requests an approximate search, which trades some accuracy for faster results.

### Step 6: Compare Brute-Force Search and Groundtruth vs. ANN Search

To run a brute-force search (exact nearest neighbors), use the `FETCH EXACT` clause:

```sql
SELECT
    ID,
    VECTOR_DISTANCE(
        (SELECT EMBEDDING
         FROM SIFT_QUERY
         FETCH FIRST 1 ROWS ONLY),
        EMBEDDING,
        EUCLIDEAN)
    AS DISTANCE
FROM SIFT_BASE
ORDER BY DISTANCE
FETCH EXACT FIRST 10 ROWS ONLY
```

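For intuition, a brute-force search simply computes the distance from the query to every stored vector and keeps the k smallest. The pure-Python sketch below illustrates that idea on a toy dataset; it is not how Db2 implements the search, and `exact_top_k` is a hypothetical helper. It uses plain Euclidean distance; ranking by squared Euclidean distance would give the same order.

```python
import heapq
import math


def exact_top_k(base, query, k):
    """Return the k (row_id, distance) pairs closest to `query`, nearest first."""
    # Score every vector; ties on distance are broken by the lower row ID.
    scored = ((math.dist(query, vector), row_id) for row_id, vector in enumerate(base))
    return [(row_id, distance) for distance, row_id in heapq.nsmallest(k, scored)]


# Toy 2-dimensional data standing in for SIFT_BASE.
base = [[0.0, 0.0], [3.0, 4.0], [1.0, 1.0], [6.0, 8.0]]
query = [0.0, 0.0]
neighbours = exact_top_k(base, query, k=2)
print(neighbours)
```

The cost grows linearly with the number of stored vectors, which is why an index-backed approximate search becomes attractive at the scale of SIFT1M.
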
Comparison:

* Compare the result set above with the ANN results from Step 5. Are the top-k neighbors the same?
* You can also verify against the ground truth. First, retrieve the ID of the query vector:

```sql
SELECT ID
FROM SIFT_QUERY
FETCH FIRST 1 ROWS ONLY
```

Then use the query ID to look up the expected nearest neighbors in the ground truth file:

```bash
grep -E "^<query_id>," sift_groundtruth_100.csv
```

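The same lookup can be done programmatically. The sketch below assumes the ground truth rows have the form `query_id,"[n1,n2,...]"`, as written by the formatting script in this directory; `load_groundtruth` is a hypothetical helper name.

```python
import csv


def load_groundtruth(lines):
    """Map each query ID to its list of expected neighbor IDs, nearest first."""
    truth = {}
    for query_id, neighbours in csv.reader(lines):
        truth[int(query_id)] = [int(n) for n in neighbours.strip("[]").split(",")]
    return truth


# In-memory stand-in for a few lines of sift_groundtruth_100.csv.
sample_lines = ['17,"[5,9,2]"\n', '4,"[8,1,3]"\n']
truth = load_groundtruth(sample_lines)
print(truth[17])  # [5, 9, 2]
```

For the real file, pass an open file object, for example `with open("sift_groundtruth_100.csv", newline="") as f: truth = load_groundtruth(f)`.
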
Evaluation:

* Accuracy: How many of the ANN results match the brute-force or ground truth results (e.g., recall@k)?
* Latency: Measure query execution time for each method.
* Resource Usage: Monitor CPU and memory consumption during query execution.

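Recall@k, mentioned under accuracy above, is the fraction of the true top-k neighbors that also appear in the approximate top-k. A minimal sketch follows, with `recall_at_k` as a hypothetical helper and made-up ID lists standing in for the two result sets:

```python
def recall_at_k(ann_ids, exact_ids, k):
    """Fraction of the true top-k neighbors that the ANN result also returned."""
    truth = set(exact_ids[:k])
    found = truth.intersection(ann_ids[:k])
    return len(found) / len(truth)


# Illustrative IDs only: the ANN result misses neighbor 8 and returns 9 instead.
ann = [3, 7, 1, 9, 4]
exact = [3, 1, 7, 8, 4]
print(recall_at_k(ann, exact, k=5))  # 0.8
```

Order within the top k does not affect this measure; only membership counts.
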
### Step 7: Cleanup

After completing your evaluation, you can clean up the environment by dropping the vector index and the tables created in Step 3:

```sql
DROP INDEX SIFT_EUCLIDEAN
```

```sql
DROP TABLE SIFT_BASE;
DROP TABLE SIFT_QUERY;
```

## Conclusion and Key Takeaways

This demo guided you through the process of using Vector Indexes in Db2, showcasing how to prepare vector data, enable the feature, perform similarity search using SQL, and compare against a brute-force search.

### Key Takeaways

* Vector Indexes introduce native support for high-dimensional similarity search in Db2, enabling AI-driven use cases without external tooling.
* The SIFT1M dataset serves as a practical benchmark for testing performance and accuracy of vector search.
* Approximate search using FETCH APPROX FIRST provides fast results, ideal for large-scale datasets where latency matters more than exact precision.
Lines changed: 51 additions & 0 deletions
@@ -0,0 +1,51 @@
#!/bin/bash

# Check that prerequisites are installed before running script
if ! command -v curl >/dev/null 2>&1; then
    echo "Error: curl is not installed. Exiting..."
    exit 1
fi

if ! command -v python >/dev/null 2>&1; then
    echo "Error: python is not installed. Exiting..."
    exit 1
fi

# Download the SIFT1M dataset; stop if curl itself reports a failure
echo "Downloading SIFT1M dataset..."
if ! curl -O "ftp://ftp.irisa.fr/local/texmex/corpus/sift.tar.gz"; then
    echo "Error: download failed. Exiting..."
    exit 1
fi

# Check that the archive exists and is not empty before extracting
if [ -s "sift.tar.gz" ]; then
    echo "Download successful. Extracting..."
    tar -xzf sift.tar.gz
else
    echo "Error: download failed. Exiting..."
    exit 1
fi

# Check if files exist and are readable
if [ -f "sift/sift_base.fvecs" ] && [ -r "sift/sift_base.fvecs" ] &&
   [ -f "sift/sift_groundtruth.ivecs" ] && [ -r "sift/sift_groundtruth.ivecs" ] &&
   [ -f "sift/sift_learn.fvecs" ] && [ -r "sift/sift_learn.fvecs" ] &&
   [ -f "sift/sift_query.fvecs" ] && [ -r "sift/sift_query.fvecs" ]; then
    echo "All required files exist and are readable."
else
    echo "Error: Not all required files exist or are readable. Exiting..."
    exit 1
fi

# Creating python virtual environment with dependencies
echo "Creating python virtual environment with dependencies..."
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# Format vectors into CSV
echo "Formatting vectors into CSV..."
if [ -f "prep_binary.py" ]; then
    python prep_binary.py
else
    echo "Error: missing file prep_binary.py. Exiting..."
    exit 1
fi
Lines changed: 73 additions & 0 deletions
@@ -0,0 +1,73 @@
import numpy as np


def read_fvecs(filename):
    """Read an .fvecs file: each record is an int32 dimension followed by that many float32 values."""
    with open(filename, 'rb') as f:
        dim = np.fromfile(f, dtype=np.int32, count=1)[0]
        f.seek(0)
        data = np.fromfile(f, dtype=np.float32)
    # Drop the leading dimension field of every record
    data = data.reshape(-1, dim + 1)[:, 1:]
    return data


def read_ivecs(filename):
    """Read an .ivecs file: same layout as .fvecs, but the values are int32."""
    with open(filename, 'rb') as f:
        dim = np.fromfile(f, dtype=np.int32, count=1)[0]
        f.seek(0)
        data = np.fromfile(f, dtype=np.int32)
    # Drop the leading dimension field of every record
    data = data.reshape(-1, dim + 1)[:, 1:]
    return data


def prepend_indexes_to_lines(file_path, indexes):
    """
    Rewrites a file in place so that each line becomes `index,"[line]"`,
    using the corresponding index from the indexes sequence.

    Parameters:
    - file_path (str): Path to the file to rewrite.
    - indexes (sequence): Indexes to prepend, one per line.

    Returns:
    - None: The file at file_path is overwritten with the modified lines.
    """
    with open(file_path, 'r') as file:
        lines = file.readlines()

    if len(lines) != len(indexes):
        raise ValueError("Number of indexes must match the number of lines in the file.")

    modified_lines = [f"{index},\"[{line.rstrip()}]\"\n" for index, line in zip(indexes, lines)]

    with open(file_path, 'w') as file:
        file.writelines(modified_lines)


# ==== Load data ====
print("Loading base vectors ...")
base = read_fvecs("sift/sift_base.fvecs")  # (1,000,000 x 128)

print("Loading queries and ground truth ...")
queries = read_fvecs("sift/sift_query.fvecs")  # (10,000 x 128)
groundtruth = read_ivecs("sift/sift_groundtruth.ivecs")  # (10,000 x 100)

# ==== Randomly sample 100 queries ====
selected_indices = np.random.choice(len(queries), size=100, replace=False)
queries_sampled = queries[selected_indices]
gt_sampled = groundtruth[selected_indices]

# ==== Export to CSV with brackets ====
print("Saving sift_base.csv ...")
np.savetxt("sift_base.csv", base, delimiter=",")
prepend_indexes_to_lines("sift_base.csv", range(len(base)))

print("Saving sift_query_100.csv ...")
np.savetxt("sift_query_100.csv", queries_sampled, delimiter=",")
prepend_indexes_to_lines("sift_query_100.csv", selected_indices)

print("Saving sift_groundtruth_100.csv ...")
np.savetxt("sift_groundtruth_100.csv", gt_sampled, fmt='%d', delimiter=",")
prepend_indexes_to_lines("sift_groundtruth_100.csv", selected_indices)

print("Done! Files saved:")
print("- sift_base.csv")
print("- sift_query_100.csv")
print("- sift_groundtruth_100.csv")
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
numpy
