Merge pull request #110 from Hogglo/vector-index

cgarciaarellano · web-flow · commit b7f0ef652543 · 2025-10-02T23:22:14.000-04:00
Add Vector Index Sample
diff --git a/ai-vectors/vector-index/README.md b/ai-vectors/vector-index/README.md
@@ -0,0 +1,213 @@
+# Vector Indexes in Db2 (Early Access Program)
+
+This README provides step-by-step instructions for exploring the new Vector Indexes feature introduced in IBM Db2 as part of the Early Access Program (EAP).
+
+Vector Indexes enable efficient similarity search over high-dimensional vector data, supporting use cases such as AI-powered retrieval, recommendation systems, and semantic search.
+
+⚠️ Vector Index functionality is only available in the Db2 Early Access Program (EAP 97). It is not included in general availability (GA) releases.
+
+## Before You Begin
+
+To understand the capabilities, limitations, and prerequisites of the Vector Index feature in Db2, please read the official Early Access Program documentation.
+
+## Workflow Overview
+
+This guide walks you through the following steps:
+1. Downloading sample vector data
+2. Formatting the data for Db2 LOAD
+3. Creating a vector table
+4. Loading the vector data into Db2
+5. Creating a vector index
+6. Querying the vector index
+7. Dropping the vector index
+
+## Sample Dataset
+
+The sample vector data used in this guide is the SIFT1M dataset, which is commonly used for benchmarking similarity search algorithms. SIFT1M consists of:
+* 1 million vectors
+* Each vector has 128 dimensions
+
+## Prerequisites and Environment Setup
+
+Before running the example, ensure the following prerequisites are met:
+
+* CPU: AMD64 with AVX2
+* Operating System: RHEL 9.4
+* Python: Version 3+ with pip
+* Tools: curl (for downloading the dataset)
+* Db2: Access to the Early Access Program (EAP 97)
+
+Next, download all the files contained in this directory to your local machine.
+
+## Step-by-Step Instructions
+
+### Step 1: Download and Format Sample Vector Data
+
+Run the provided shell script to download the SIFT1M dataset and convert it into a CSV format suitable for Db2 LOAD:
+
+```bash
+./downloadAndFormatVectorData.sh
+```
+
+Output:
+* `sift_base.csv` containing 1M rows of 128-dimensional vectors.
+* `sift_query_100.csv` containing 100 query vectors randomly sampled from the 10,000 vectors in the SIFT1M query set.
+* `sift_groundtruth_100.csv` containing the top 100 nearest neighbor IDs (from `sift_base.csv`) for each query, ordered by increasing squared Euclidean distance.
+
+_Note: The script may take a couple of minutes to complete depending on your network speed and system performance._
+
+### Step 2: Enable Vector Index Feature in Db2
+
+_Reminder: Make sure you've reviewed the EAP documentation to confirm your environment meets all prerequisites._
+
+Set the required registry variable to enable vector indexing:
+
+```bash
+db2set DB2_VECTOR_INDEXING=TRUE
+```
+
+The instance does not need to be restarted for the setting to take effect.
+
+### Step 3: Create the Vector Tables and Load Data
+
+This step sets up the tables for evaluating approximate nearest neighbor (ANN) search performance.
+
+#### Create the Vector Table
+
+Create a table with an ID and a vector column:
+
+```sql
+CREATE TABLE SIFT_BASE (
+   ID INT NOT NULL,
+   EMBEDDING VECTOR(128, FLOAT32) NOT NULL
+)
+```
+
+Load the formatted CSV data into the table:
+
+```sql
+LOAD FROM sift_base.csv OF DEL
+INSERT INTO SIFT_BASE
+```
+
+#### Create the Query Table
+
+```sql
+CREATE TABLE SIFT_QUERY (
+   ID INT NOT NULL,
+   EMBEDDING VECTOR(128, FLOAT32) NOT NULL
+)
+```
+
+Load the query vectors from the CSV file:
+
+```sql
+LOAD FROM sift_query_100.csv OF DEL
+INSERT INTO SIFT_QUERY
+```
+
+### Step 4: Create Vector Index and Collect Statistics
+
+Create a vector index using Euclidean distance:
+
+```sql
+CREATE VECTOR INDEX SIFT_EUCLIDEAN
+ON SIFT_BASE (EMBEDDING)
+WITH DISTANCE EUCLIDEAN
+```
+
+_Note: Index creation can take a while; the duration depends on your system performance._
+
+Run RUNSTATS so that the optimizer has the statistics it needs to choose the index over a brute-force search:
+
+```sql
+RUNSTATS ON TABLE SIFT_BASE FOR INDEXES ALL
+```
+
+### Step 5: Query Using Approximate Nearest Neighbor Search
+
+Retrieve the top 10 approximate nearest neighbors for a sample query (e.g., the first query in the SIFT_QUERY table):
+
+```sql
+SELECT
+   ID,
+   VECTOR_DISTANCE(
+      (SELECT EMBEDDING
+       FROM SIFT_QUERY
+       FETCH FIRST 1 ROWS ONLY),
+      EMBEDDING,
+      EUCLIDEAN)
+   AS DISTANCE
+FROM SIFT_BASE
+ORDER BY DISTANCE
+FETCH APPROX FIRST 10 ROWS ONLY
+```
+
+FETCH *APPROX* FIRST enables approximate search for faster results.
+
+### Step 6: Compare Brute-Force Search and Ground Truth vs. ANN Search
+
+To run a brute-force search (exact nearest neighbors), use the FETCH EXACT clause:
+
+```sql
+SELECT
+   ID,
+   VECTOR_DISTANCE(
+      (SELECT EMBEDDING
+       FROM SIFT_QUERY
+       FETCH FIRST 1 ROWS ONLY),
+      EMBEDDING,
+      EUCLIDEAN)
+   AS DISTANCE
+FROM SIFT_BASE
+ORDER BY DISTANCE
+FETCH EXACT FIRST 10 ROWS ONLY
+```
+
+Comparison:
+
+* Compare the result set above with the ANN results from Step 5. Are the top-k neighbors the same?
+* You can also verify against the ground truth by checking the query ID:
+
+```sql
+SELECT ID
+FROM SIFT_QUERY
+FETCH FIRST 1 ROWS ONLY
+```
+
+Then use the query ID to look up the expected nearest neighbors in the ground
+truth file:
+
+```bash
+grep -E "^<query_id>," sift_groundtruth_100.csv
+```
+
+Evaluation:
+
+* Accuracy: How many of the ANN results match the brute-force or ground truth results (e.g., recall@k)?
+* Latency: Measure query execution time for each method
+* Resource Usage: Monitor CPU and memory consumption during query execution
+
+### Step 7: Cleanup
+
+After completing your evaluation, you can clean up the environment by dropping the vector index and tables:
+
+```sql
+DROP INDEX SIFT_EUCLIDEAN
+```
+
+```sql
+DROP TABLE SIFT_BASE;
+DROP TABLE SIFT_QUERY;
+```
+
+## Conclusion and Key Takeaways
+
+This demo guided you through the process of using Vector Indexes in Db2: preparing vector data, enabling the feature, performing similarity search using SQL, and comparing the results against a brute-force search.
+
+### Key Takeaways
+
+* Vector Indexes introduce native support for high-dimensional similarity search in Db2, enabling AI-driven use cases without external tooling.
+* The SIFT1M dataset serves as a practical benchmark for testing performance and accuracy of vector search.
+* Approximate search using FETCH APPROX FIRST provides fast results, ideal for large-scale datasets where latency matters more than exact precision.
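
The Evaluation bullets in Step 6 of the README mention recall@k without showing how to compute it. The following is a minimal sketch, not part of the sample itself: `exact_top_k` and `recall_at_k` are hypothetical helper names, the five-vector data set is made up, and the ANN result list is invented for illustration. The brute-force ranking uses squared Euclidean distance, which orders neighbors the same way as Euclidean distance.

```python
import numpy as np


def exact_top_k(base, query, k):
    """Return the row indices of the k base vectors closest to query (brute force)."""
    diffs = base - query
    # Squared Euclidean distance of every base row to the query.
    sq_dist = np.einsum("ij,ij->i", diffs, diffs)
    # A stable sort keeps ties in index order, so the result is deterministic.
    return np.argsort(sq_dist, kind="stable")[:k]


def recall_at_k(candidate_ids, reference_ids, k):
    """Fraction of the first k reference IDs that appear in the first k candidates."""
    reference = {int(i) for i in reference_ids[:k]}
    candidates = {int(i) for i in candidate_ids[:k]}
    return len(reference & candidates) / k


# Tiny made-up data set: five 2-dimensional vectors and one query.
base = np.array(
    [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0], [5.0, 5.0]],
    dtype=np.float32,
)
query = np.array([0.1, 0.0], dtype=np.float32)

exact_ids = exact_top_k(base, query, k=3)
ann_ids = [0, 1, 3]  # hypothetical ANN output that misses one true neighbor

print("exact top-3:", exact_ids.tolist())
print("recall@3:", recall_at_k(ann_ids, exact_ids, k=3))
```

To apply the same idea to the sample, the candidate IDs would come from the `FETCH APPROX` query and the reference IDs from either the `FETCH EXACT` query or the matching row of `sift_groundtruth_100.csv`.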
diff --git a/ai-vectors/vector-index/downloadAndFormatVectorData.sh b/ai-vectors/vector-index/downloadAndFormatVectorData.sh
@@ -0,0 +1,51 @@
+#!/bin/bash
+
+# Check that prerequisites are installed before running script
+if ! command -v curl >/dev/null 2>&1; then
+  echo "Error: curl is not installed. Exiting..."
+  exit 1
+fi
+
+if ! command -v python >/dev/null 2>&1; then
+  echo "Error: python is not installed. Exiting..."
+  exit 1
+fi
+
+# Download the SIFT1M dataset
+echo "Downloading SIFT1M dataset..."
+# Check curl's exit status as well as the file: a failed transfer can leave a partial file behind
+if curl -O "ftp://ftp.irisa.fr/local/texmex/corpus/sift.tar.gz" && [ -f "sift.tar.gz" ]; then
+    echo "Download successful. Extracting..."
+    tar -xzf sift.tar.gz
+else
+    echo "Error: download failed. Exiting..."
+    exit 1
+fi
+
+# Check if files exist and are readable
+if [ -f "sift/sift_base.fvecs" ] && [ -r "sift/sift_base.fvecs" ] &&
+   [ -f "sift/sift_groundtruth.ivecs" ] && [ -r "sift/sift_groundtruth.ivecs" ] &&
+   [ -f "sift/sift_learn.fvecs" ] && [ -r "sift/sift_learn.fvecs" ] &&
+   [ -f "sift/sift_query.fvecs" ] && [ -r "sift/sift_query.fvecs" ]; then
+   echo "All required files exist and are readable."
+else
+   echo "Error: Not all required files exist or are readable. Exiting..."
+   exit 1
+fi
+
+# Creating python virtual environment with dependencies
+echo "Creating python virtual environment with dependencies..."
+python -m venv .venv
+source .venv/bin/activate
+pip install -r requirements.txt
+
+# Format vectors into CSV
+echo "Formatting vectors into CSV..."
+if [ -f "prep_binary.py" ]; then
+    python prep_binary.py
+else
+  echo "Error: missing file prep_binary.py. Exiting..."
+  exit 1
+fi
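
The script above only checks that the extracted files exist and are readable. A structural check can also catch a truncated archive. The sketch below is an illustration rather than part of the sample: `vecs_shape` is a hypothetical helper, and it assumes the `.fvecs`/`.ivecs` layout that `prep_binary.py` relies on (each record is a little-endian int32 dimension followed by that many 4-byte components).

```python
import os
import struct
import tempfile


def vecs_shape(path):
    """Return (num_vectors, dim) for a .fvecs/.ivecs file, or raise ValueError."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        header = f.read(4)
    if len(header) < 4:
        raise ValueError(f"{path}: too short to contain a dimension header")
    (dim,) = struct.unpack("<i", header)  # assumed little-endian int32
    if dim <= 0:
        raise ValueError(f"{path}: invalid dimension {dim}")
    record_bytes = 4 * (dim + 1)  # 4-byte header plus dim 4-byte components
    if size % record_bytes != 0:
        raise ValueError(f"{path}: size {size} is not a multiple of {record_bytes}")
    return size // record_bytes, dim


# Demonstration on a tiny synthetic file: two vectors of dimension 3.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.fvecs")
    with open(sample, "wb") as f:
        f.write(struct.pack("<i3f", 3, 1.0, 2.0, 3.0))
        f.write(struct.pack("<i3f", 3, 4.0, 5.0, 6.0))
    sample_shape = vecs_shape(sample)

print(sample_shape)
```

Going by the comments in `prep_binary.py`, `sift/sift_base.fvecs` should report 1,000,000 vectors of dimension 128; any other result suggests an incomplete download or extraction.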
diff --git a/ai-vectors/vector-index/prep_binary.py b/ai-vectors/vector-index/prep_binary.py
@@ -0,0 +1,73 @@
+import numpy as np
+
+def read_fvecs(filename):
+    with open(filename, 'rb') as f:
+        dim = np.fromfile(f, dtype=np.int32, count=1)[0]
+        f.seek(0)
+        data = np.fromfile(f, dtype=np.float32)
+        data = data.reshape(-1, dim + 1)[:, 1:]
+    return data
+
+def read_ivecs(filename):
+    with open(filename, 'rb') as f:
+        dim = np.fromfile(f, dtype=np.int32, count=1)[0]
+        f.seek(0)
+        data = np.fromfile(f, dtype=np.int32)
+        data = data.reshape(-1, dim + 1)[:, 1:]
+    return data
+
+def prepend_indexes_to_lines(file_path, indexes):
+    """
+    Rewrites a file in place, turning each line into `<index>,"[<line>]"`.
+
+    Parameters:
+    - file_path (str): Path to the file to rewrite.
+    - indexes (sequence): One index per line, in file order.
+
+    Raises:
+    - ValueError: If the number of indexes differs from the number of lines.
+    """
+    with open(file_path, 'r') as file:
+        lines = file.readlines()
+
+    if len(lines) != len(indexes):
+        raise ValueError("Number of indexes must match the number of lines in the file.")
+
+    modified_lines = [f"{index},\"[{line.rstrip()}]\"\n" for index, line in zip(indexes, lines)]
+
+    with open(file_path, 'w') as file:
+        file.writelines(modified_lines)
+
+# ==== Load data ====
+print("Loading base vectors ...")
+base = read_fvecs("sift/sift_base.fvecs")        # (1,000,000 x 128)
+
+print("Loading queries and ground truth ...")
+queries = read_fvecs("sift/sift_query.fvecs")   # (10,000 x 128)
+groundtruth = read_ivecs("sift/sift_groundtruth.ivecs")  # (10,000 x 100)
+
+# ==== Randomly sample 100 queries ====
+selected_indices = np.random.choice(len(queries), size=100, replace=False)
+queries_sampled = queries[selected_indices]
+gt_sampled = groundtruth[selected_indices]
+
+# ==== Export to CSV with brackets ====
+print("Saving sift_base.csv ...")
+np.savetxt("sift_base.csv", base, delimiter=",")
+prepend_indexes_to_lines("sift_base.csv", range(len(base)))
+
+print("Saving sift_query_100.csv ...")
+np.savetxt("sift_query_100.csv", queries_sampled, delimiter=",")
+prepend_indexes_to_lines("sift_query_100.csv", selected_indices)
+
+print("Saving sift_groundtruth_100.csv ...")
+np.savetxt("sift_groundtruth_100.csv", gt_sampled, fmt='%d', delimiter=",")
+prepend_indexes_to_lines("sift_groundtruth_100.csv", selected_indices)
+
+print("Done! Files saved:")
+print("- sift_base.csv")
+print("- sift_query_100.csv")
+print("- sift_groundtruth_100.csv")
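
`prep_binary.py` writes each CSV with `np.savetxt` and then rewrites the whole file to add the ID and brackets. The row format it produces is `<id>,"[v1,v2,...]"`. As an alternative sketch, the same rows can be written in a single pass. `format_row` and `write_vector_csv` are hypothetical names, not functions from the sample, and the default format string is chosen to match the `np.savetxt` default of `%.18e`.

```python
import io

import numpy as np


def format_row(index, vector, fmt="%.18e"):
    """Format one vector as a CSV line: <index>,"[v1,v2,...]"."""
    values = ",".join(fmt % v for v in vector)
    return f'{index},"[{values}]"\n'


def write_vector_csv(out, ids, vectors, fmt="%.18e"):
    """Write one line per vector to the open text stream `out`; return the row count."""
    if len(ids) != len(vectors):
        raise ValueError("Number of ids must match the number of vectors.")
    count = 0
    for index, vector in zip(ids, vectors):
        out.write(format_row(index, vector, fmt))
        count += 1
    return count


# Small demonstration with integer components, as in the ground truth file.
buffer = io.StringIO()
demo = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.int32)
rows_written = write_vector_csv(buffer, [10, 11], demo, fmt="%d")
print(buffer.getvalue(), end="")
```

Passing an open file instead of the `StringIO` buffer would stream the rows to disk without holding the formatted text in memory, which may matter for the 1M-row base file.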
diff --git a/ai-vectors/vector-index/requirements.txt b/ai-vectors/vector-index/requirements.txt
@@ -0,0 +1 @@
+numpy