Skip to content

Latest commit

 

History

History
751 lines (597 loc) · 22.3 KB

File metadata and controls

751 lines (597 loc) · 22.3 KB

Hands-On Lab - pgvector Embeddings

Prerequisites

  • Azure Database for PostgreSQL Flexible Server provisioned and accessible (from Module 01)
  • Azure OpenAI resource with a text-embedding-ada-002 or text-embedding-3-small deployment
  • Python 3.10+ with pip
  • psql client installed (or Azure Cloud Shell)
  • A terminal (bash or zsh)

Initial Setup

If you completed Module 01, reuse those variables. Otherwise, set them now:

# Reuse from Module 01 or set your values
SERVER_NAME="ai200pg12345"       # Your actual server name
ADMIN_USER="pgadmin"
ADMIN_PASSWORD="your-password"
DB_NAME="vectordb"

# Azure OpenAI settings
AOAI_ENDPOINT="https://your-openai.openai.azure.com/"
AOAI_DEPLOYMENT="text-embedding-ada-002"
AOAI_API_VERSION="2024-02-01"

Lab Objectives

  • Enable the pgvector extension on Azure PostgreSQL
  • Create a table with a vector column
  • Generate embeddings using Azure OpenAI
  • Insert embeddings and perform similarity search
  • Compare distance operators (cosine, L2, inner product)
  • Build and test a vector index

Exercise 1: Enable pgvector Extension

Step 1.1: Allowlist pgvector via Azure CLI

Azure Database for PostgreSQL requires extensions to be allowlisted before they can be installed.

az postgres flexible-server parameter set \
  --resource-group ai200-postgresql-lab \
  --server-name $SERVER_NAME \
  --name azure.extensions \
  --value VECTOR

Expected output (truncated):

{
  "name": "azure.extensions",
  "value": "VECTOR",
  "source": "user-override"
}

Step 1.2: Connect via psql and Enable the Extension

psql "host=$SERVER_NAME.postgres.database.azure.com \
  port=5432 dbname=$DB_NAME user=$ADMIN_USER \
  password=$ADMIN_PASSWORD sslmode=require"

Once connected:

-- Install the vector extension
CREATE EXTENSION IF NOT EXISTS vector;

Expected output:

CREATE EXTENSION

Step 1.3: Verify the Extension

-- Check installed version
SELECT extname, extversion FROM pg_extension WHERE extname = 'vector';

Expected output:

 extname | extversion
---------+------------
 vector  | 0.7.4
-- Verify vector type is available
SELECT typname FROM pg_type WHERE typname = 'vector';

Expected output:

 typname
---------
 vector

Type \q to exit psql.


Exercise 2: Create a Documents Table with Vector Column

Step 2.1: Create the Table

Connect to the database again if needed, then run:

CREATE TABLE documents (
    id          SERIAL PRIMARY KEY,
    title       VARCHAR(255) NOT NULL,
    content     TEXT NOT NULL,
    embedding   vector(1536),
    created_at  TIMESTAMPTZ DEFAULT NOW()
);

Expected output:

CREATE TABLE

The vector(1536) column stores embeddings with 1536 dimensions, which matches the output of text-embedding-ada-002 and text-embedding-3-small.

Step 2.2: Verify Table Structure

\d documents

Expected output:

                                         Table "public.documents"
   Column    |           Type           | Collation | Nullable |                Default
-------------+--------------------------+-----------+----------+---------------------------------------
 id          | integer                  |           | not null | nextval('documents_id_seq'::regclass)
 title       | character varying(255)   |           | not null |
 content     | text                     |           | not null |
 embedding   | vector(1536)             |           |          |
 created_at  | timestamp with time zone |           |          | now()
Indexes:
    "documents_pkey" PRIMARY KEY, btree (id)

Exercise 3: Generate Embeddings with Azure OpenAI

Step 3.1: Set Up the Python Environment

pip install "psycopg[binary]" pgvector openai azure-identity

Step 3.2: Generate Embeddings from Sample Texts

Create a file named generate_embeddings.py:

import os
from openai import AzureOpenAI

# Azure OpenAI configuration
client = AzureOpenAI(
    azure_endpoint=os.environ["AOAI_ENDPOINT"],
    api_key=os.environ["AOAI_API_KEY"],
    api_version=os.environ.get("AOAI_API_VERSION", "2024-02-01"),
)
DEPLOYMENT = os.environ["AOAI_DEPLOYMENT"]

# Sample documents about Azure services
SAMPLE_DOCS = [
    {
        "title": "Azure Functions",
        "content": (
            "Azure Functions is a serverless compute service that lets you "
            "run event-triggered code without having to explicitly provision "
            "or manage infrastructure. It supports multiple languages "
            "including C#, Python, JavaScript, and Java."
        ),
    },
    {
        "title": "Azure Cosmos DB",
        "content": (
            "Azure Cosmos DB is a globally distributed, multi-model database "
            "service designed for low-latency and elastic scalability. It "
            "supports document, key-value, graph, and column-family data "
            "models with turnkey global distribution."
        ),
    },
    {
        "title": "Azure Kubernetes Service",
        "content": (
            "Azure Kubernetes Service (AKS) simplifies deploying a managed "
            "Kubernetes cluster in Azure by offloading the operational "
            "overhead to Azure. It handles critical tasks like health "
            "monitoring, scaling, and maintenance of the control plane."
        ),
    },
    {
        "title": "Azure Blob Storage",
        "content": (
            "Azure Blob Storage is a massively scalable object storage "
            "service for unstructured data. It is optimized for storing "
            "large amounts of text or binary data such as images, videos, "
            "backups, and log files with tiered access levels."
        ),
    },
    {
        "title": "Azure Container Apps",
        "content": (
            "Azure Container Apps is a fully managed serverless container "
            "service for building and deploying modern apps at scale. It "
            "supports microservices, event-driven processing, and background "
            "tasks with built-in KEDA autoscaling and Dapr integration."
        ),
    },
]


def get_embedding(text: str) -> list[float]:
    """Generate an embedding for a single text using Azure OpenAI."""
    response = client.embeddings.create(input=text, model=DEPLOYMENT)
    return response.data[0].embedding


def main():
    print(f"Generating embeddings using deployment: {DEPLOYMENT}")
    print(f"Endpoint: {client._base_url}\n")

    for doc in SAMPLE_DOCS:
        embedding = get_embedding(doc["content"])
        doc["embedding"] = embedding
        print(
            f"  {doc['title']:30s} -> {len(embedding)} dimensions, "
            f"first 3 values: [{embedding[0]:.6f}, {embedding[1]:.6f}, {embedding[2]:.6f}]"
        )

    print(f"\nGenerated {len(SAMPLE_DOCS)} embeddings successfully.")
    print(f"Each embedding has {len(SAMPLE_DOCS[0]['embedding'])} dimensions.")


if __name__ == "__main__":
    main()

Step 3.3: Run the Script

# Set environment variables
export AOAI_ENDPOINT="https://your-openai.openai.azure.com/"
export AOAI_API_KEY="your-api-key"
export AOAI_DEPLOYMENT="text-embedding-ada-002"

python generate_embeddings.py

Expected output:

Generating embeddings using deployment: text-embedding-ada-002
Endpoint: https://your-openai.openai.azure.com/

  Azure Functions                -> 1536 dimensions, first 3 values: [-0.012345, 0.023456, -0.005678]
  Azure Cosmos DB                -> 1536 dimensions, first 3 values: [-0.008901, 0.015234, -0.003456]
  Azure Kubernetes Service       -> 1536 dimensions, first 3 values: [-0.011234, 0.019876, -0.007890]
  Azure Blob Storage             -> 1536 dimensions, first 3 values: [-0.009012, 0.021345, -0.004567]
  Azure Container Apps           -> 1536 dimensions, first 3 values: [-0.010123, 0.018765, -0.006789]

Generated 5 embeddings successfully.
Each embedding has 1536 dimensions.

The exact floating-point values will differ on each run -- the important thing is that each embedding has 1536 dimensions.


Exercise 4: Store Embeddings in PostgreSQL

Step 4.1: Write the Insertion Script

Create a file named store_embeddings.py:

import os
import psycopg
from pgvector.psycopg import register_vector
from openai import AzureOpenAI

# Azure OpenAI configuration
client = AzureOpenAI(
    azure_endpoint=os.environ["AOAI_ENDPOINT"],
    api_key=os.environ["AOAI_API_KEY"],
    api_version=os.environ.get("AOAI_API_VERSION", "2024-02-01"),
)
DEPLOYMENT = os.environ["AOAI_DEPLOYMENT"]

# PostgreSQL connection string
CONNINFO = (
    f"host={os.environ['SERVER_NAME']}.postgres.database.azure.com "
    f"port=5432 dbname=vectordb user={os.environ['ADMIN_USER']} "
    f"password={os.environ['ADMIN_PASSWORD']} sslmode=require"
)

# Same sample documents from Exercise 3
SAMPLE_DOCS = [
    {
        "title": "Azure Functions",
        "content": (
            "Azure Functions is a serverless compute service that lets you "
            "run event-triggered code without having to explicitly provision "
            "or manage infrastructure. It supports multiple languages "
            "including C#, Python, JavaScript, and Java."
        ),
    },
    {
        "title": "Azure Cosmos DB",
        "content": (
            "Azure Cosmos DB is a globally distributed, multi-model database "
            "service designed for low-latency and elastic scalability. It "
            "supports document, key-value, graph, and column-family data "
            "models with turnkey global distribution."
        ),
    },
    {
        "title": "Azure Kubernetes Service",
        "content": (
            "Azure Kubernetes Service (AKS) simplifies deploying a managed "
            "Kubernetes cluster in Azure by offloading the operational "
            "overhead to Azure. It handles critical tasks like health "
            "monitoring, scaling, and maintenance of the control plane."
        ),
    },
    {
        "title": "Azure Blob Storage",
        "content": (
            "Azure Blob Storage is a massively scalable object storage "
            "service for unstructured data. It is optimized for storing "
            "large amounts of text or binary data such as images, videos, "
            "backups, and log files with tiered access levels."
        ),
    },
    {
        "title": "Azure Container Apps",
        "content": (
            "Azure Container Apps is a fully managed serverless container "
            "service for building and deploying modern apps at scale. It "
            "supports microservices, event-driven processing, and background "
            "tasks with built-in KEDA autoscaling and Dapr integration."
        ),
    },
]


def get_embedding(text: str) -> list[float]:
    """Generate an embedding for a single text using Azure OpenAI."""
    response = client.embeddings.create(input=text, model=DEPLOYMENT)
    return response.data[0].embedding


def main():
    with psycopg.connect(CONNINFO) as conn:
        # Register the vector type with psycopg
        register_vector(conn)

        with conn.cursor() as cur:
            # Clear any existing data from previous runs
            cur.execute("DELETE FROM documents")

            # Generate and insert each document with its embedding
            for doc in SAMPLE_DOCS:
                embedding = get_embedding(doc["content"])
                cur.execute(
                    """
                    INSERT INTO documents (title, content, embedding)
                    VALUES (%s, %s, %s::vector)
                    """,
                    (doc["title"], doc["content"], str(embedding)),
                )
                print(f"  Inserted: {doc['title']}")

        conn.commit()
        print(f"\nInserted {len(SAMPLE_DOCS)} documents with embeddings.")

        # Verify the data
        with conn.cursor() as cur:
            cur.execute("SELECT count(*) FROM documents")
            count = cur.fetchone()[0]
            print(f"\nVerification: {count} rows in documents table.")

            cur.execute(
                "SELECT id, left(title, 30) AS title, vector_dims(embedding) AS dims "
                "FROM documents ORDER BY id"
            )
            rows = cur.fetchall()
            print(f"\n{'ID':<5} {'Title':<32} {'Dims':<6}")
            print("-" * 43)
            for row in rows:
                print(f"{row[0]:<5} {row[1]:<32} {row[2]:<6}")


if __name__ == "__main__":
    main()

Step 4.2: Run the Insertion Script

# Set PostgreSQL environment variables (in addition to the AOAI ones from Exercise 3)
export SERVER_NAME="ai200pg12345"
export ADMIN_USER="pgadmin"
export ADMIN_PASSWORD="your-password"

python store_embeddings.py

Expected output:

  Inserted: Azure Functions
  Inserted: Azure Cosmos DB
  Inserted: Azure Kubernetes Service
  Inserted: Azure Blob Storage
  Inserted: Azure Container Apps

Inserted 5 documents with embeddings.

Verification: 5 rows in documents table.

ID    Title                            Dims
-------------------------------------------
1     Azure Functions                  1536
2     Azure Cosmos DB                  1536
3     Azure Kubernetes Service         1536
4     Azure Blob Storage               1536
5     Azure Container Apps             1536

Step 4.3: Verify in psql

Connect via psql and run:

-- Row count
SELECT count(*) FROM documents;

-- Inspect stored embeddings (first 3 dimensions only)
SELECT id, left(title, 30) AS title, vector_dims(embedding) AS dims
FROM documents ORDER BY id;

Expected output:

 count
-------
     5

 id |             title              | dims
----+--------------------------------+------
  1 | Azure Functions                | 1536
  2 | Azure Cosmos DB                | 1536
  3 | Azure Kubernetes Service       | 1536
  4 | Azure Blob Storage             | 1536
  5 | Azure Container Apps           | 1536

Exercise 5: Similarity Search

Step 5.1: Create the Similarity Search Script

Create a file named similarity_search.py:

import os
import psycopg
from pgvector.psycopg import register_vector
from openai import AzureOpenAI

# Azure OpenAI configuration
client = AzureOpenAI(
    azure_endpoint=os.environ["AOAI_ENDPOINT"],
    api_key=os.environ["AOAI_API_KEY"],
    api_version=os.environ.get("AOAI_API_VERSION", "2024-02-01"),
)
DEPLOYMENT = os.environ["AOAI_DEPLOYMENT"]

# PostgreSQL connection string
CONNINFO = (
    f"host={os.environ['SERVER_NAME']}.postgres.database.azure.com "
    f"port=5432 dbname=vectordb user={os.environ['ADMIN_USER']} "
    f"password={os.environ['ADMIN_PASSWORD']} sslmode=require"
)


def get_embedding(text: str) -> list[float]:
    """Generate an embedding for a single text using Azure OpenAI."""
    response = client.embeddings.create(input=text, model=DEPLOYMENT)
    return response.data[0].embedding


def search(cur, query_text: str, operator: str, label: str, limit: int = 3):
    """Run a similarity search with the given distance operator."""
    query_embedding = get_embedding(query_text)
    query_vec = str(query_embedding)

    # Map operator to SQL expression
    op_map = {
        "<=>": ("embedding <=> %s::vector", "Cosine distance"),
        "<->": ("embedding <-> %s::vector", "L2 distance"),
        "<#>": ("(embedding <#> %s::vector) * -1", "Inner product (negated)"),
    }
    distance_expr, description = op_map[operator]

    print(f"\n{'='*60}")
    print(f"{label}: {description}")
    print(f"Query: \"{query_text}\"")
    print(f"Operator: {operator}")
    print(f"{'='*60}")

    cur.execute(
        f"""
        SELECT id, title, {distance_expr} AS distance
        FROM documents
        ORDER BY embedding {operator} %s::vector
        LIMIT %s
        """,
        (query_vec, query_vec, limit),
    )
    rows = cur.fetchall()
    for rank, row in enumerate(rows, 1):
        print(f"  #{rank}  {row[1]:<30s}  distance={row[2]:.6f}")


def main():
    query_text = "serverless event-driven application platform"

    with psycopg.connect(CONNINFO) as conn:
        register_vector(conn)

        with conn.cursor() as cur:
            # Step 5.2: Cosine distance search
            search(cur, query_text, "<=>", "Search A")

            # Step 5.3: L2 (Euclidean) distance search
            search(cur, query_text, "<->", "Search B")

            # Step 5.4: Inner product search
            search(cur, query_text, "<#>", "Search C")


if __name__ == "__main__":
    main()

Step 5.2: Run the Search Script

python similarity_search.py

Expected output:

============================================================
Search A: Cosine distance
Query: "serverless event-driven application platform"
Operator: <=>
============================================================
  #1  Azure Functions                 distance=0.134521
  #2  Azure Container Apps            distance=0.178432
  #3  Azure Kubernetes Service        distance=0.245678

============================================================
Search B: L2 distance
Query: "serverless event-driven application platform"
Operator: <->
============================================================
  #1  Azure Functions                 distance=0.519234
  #2  Azure Container Apps            distance=0.597345
  #3  Azure Kubernetes Service        distance=0.701234

============================================================
Search C: Inner product (negated)
Query: "serverless event-driven application platform"
Operator: <#>
============================================================
  #1  Azure Functions                 distance=0.865432
  #2  Azure Container Apps            distance=0.821567
  #3  Azure Kubernetes Service        distance=0.754321

The exact distances will vary, but the ranking should place Azure Functions and Azure Container Apps at the top because their content is most semantically related to "serverless event-driven application platform."

Key Observations

Operator SQL Distance type Notes
<=> a <=> b Cosine distance Range 0-2. Best for normalized embeddings.
<-> a <-> b L2 (Euclidean) Range 0+. Sensitive to vector magnitude.
<#> a <#> b Negative inner product Negate to get similarity. Used when magnitude matters.

For OpenAI embeddings, cosine distance (<=>) is the recommended operator because the embeddings are already normalized.


Exercise 6: Build and Test a Vector Index

Step 6.1: Baseline Query Without an Index

Connect via psql and run:

-- Check the query plan without an index
EXPLAIN ANALYZE
SELECT id, title, embedding <=> (
    SELECT embedding FROM documents WHERE id = 1
) AS distance
FROM documents
ORDER BY embedding <=> (
    SELECT embedding FROM documents WHERE id = 1
)
LIMIT 3;

Expected output (key line):

 Sort  (cost=... rows=5 width=...)
   Sort Key: ...
   Sort Method: top-N heapsort  Memory: ...
   ->  Seq Scan on documents  (cost=... rows=5 width=...)
         ...
 Planning Time: ... ms
 Execution Time: ... ms

Note the Seq Scan -- PostgreSQL is scanning every row because there is no vector index.

Step 6.2: Create an IVFFlat Index

-- Create an IVFFlat index for cosine distance
-- lists = number of clusters. For small tables use sqrt(row_count).
-- With 5 rows, 1 list is the minimum; using 1 for this lab.
CREATE INDEX idx_documents_embedding
ON documents
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 1);

Expected output:

CREATE INDEX

For production tables with more data, a general guideline is:

  • Up to 1M rows: lists = sqrt(row_count)
  • Over 1M rows: lists = sqrt(row_count) or experiment with higher values

Step 6.3: Verify the Index Is Used

-- Set probes (number of lists to search at query time)
SET ivfflat.probes = 1;

-- Re-run the same EXPLAIN ANALYZE
EXPLAIN ANALYZE
SELECT id, title, embedding <=> (
    SELECT embedding FROM documents WHERE id = 1
) AS distance
FROM documents
ORDER BY embedding <=> (
    SELECT embedding FROM documents WHERE id = 1
)
LIMIT 3;

Expected output (key line):

 ->  Index Scan using idx_documents_embedding on documents  (cost=... rows=5 width=...)
       Order By: (embedding <=> ...)
       ...
 Planning Time: ... ms
 Execution Time: ... ms

The plan now shows Index Scan using idx_documents_embedding instead of a sequential scan.

Step 6.4: Verify Index Details

-- Show the index details
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'documents' AND indexname = 'idx_documents_embedding';

Expected output:

       indexname         |                                    indexdef
-------------------------+--------------------------------------------------------------------------------
 idx_documents_embedding | CREATE INDEX idx_documents_embedding ON public.documents USING ivfflat ...

With only 5 rows the performance difference is negligible, but the query plan change confirms the index is working. On tables with thousands or millions of rows, an IVFFlat index reduces similarity search from O(n) to approximately O(sqrt(n)).


Cleanup

If you are done with all PostgreSQL labs and want to remove the resources:

az group delete --name ai200-postgresql-lab --yes --no-wait

echo "Cleanup initiated. Resource group deletion runs in the background."

If you plan to continue to Module 03 (Vector Search Optimization), keep the resources running.


Verification Checklist

  • pgvector extension installed and version confirmed
  • Documents table created with vector(1536) column
  • 5 embeddings generated via Azure OpenAI and stored in PostgreSQL
  • Cosine similarity search returns semantically ranked results
  • L2 and inner product searches run and results compared
  • IVFFlat index created and confirmed in query plan via EXPLAIN ANALYZE

Lab Summary

Exercise Key Skills Practiced
Exercise 1 Extension allowlisting via Azure CLI, CREATE EXTENSION vector, version verification
Exercise 2 Table design with vector(1536) column, schema inspection
Exercise 3 Azure OpenAI embeddings API, Python openai SDK, embedding dimensions
Exercise 4 psycopg + pgvector integration, register_vector, bulk insertion, vector_dims()
Exercise 5 Cosine (<=>), L2 (<->), and inner product (<#>) similarity search, result ranking
Exercise 6 EXPLAIN ANALYZE, IVFFlat index creation, vector_cosine_ops, query plan verification