Skip to content

Latest commit

 

History

History
181 lines (111 loc) · 11.4 KB

File metadata and controls

181 lines (111 loc) · 11.4 KB

Practice Questions - pgvector Embeddings

Question 1

Your application uses Azure Database for PostgreSQL Flexible Server. You need to store 1536-dimensional embeddings generated by Azure OpenAI. You attempt to run CREATE EXTENSION vector; but receive an error. What should you do first?

A. Upgrade the PostgreSQL version to 16
B. Add pgvector to the azure.extensions server parameter
C. Install pgvector from source on the server
D. Switch to a Memory Optimized compute tier

Show answer

Correct answer: B

On Azure Database for PostgreSQL Flexible Server, extensions must be allowlisted in the azure.extensions server parameter before they can be installed. Run az postgres flexible-server parameter set --name azure.extensions --value pgvector first, then retry CREATE EXTENSION vector;. You do not need to compile from source (Azure manages the extension binaries), and the error is not related to PostgreSQL version or compute tier.

Question 2

You are building a semantic search feature that stores text embeddings from Azure OpenAI's text-embedding-ada-002 model. You need to find the most semantically similar documents to a user query. Which pgvector operator should you use?

A. <-> (L2 distance)
B. <=> (cosine distance)
C. <#> (negative inner product)
D. <~> (Hamming distance)

Show answer

Correct answer: B

For text embeddings, cosine distance (<=>) is the standard choice because it measures the angle between vectors regardless of their magnitude. This is ideal for semantic similarity where the direction of the embedding matters more than its length. L2 (Euclidean) distance is sensitive to vector magnitude and is better suited to spatial data. Negative inner product (<#>) can work with normalized vectors but is less intuitive for similarity ranking. Hamming distance (<~>) does not exist in pgvector.

Question 3

A developer on your team creates a table with a vector(2048) column on Azure Database for PostgreSQL Flexible Server running pgvector 0.5.1. INSERT statements fail with a dimension error. What is the most likely cause?

A. The column type should be real[] instead of vector
B. pgvector 0.5.x supports a maximum of 2000 dimensions per vector
C. The embeddings must be normalized before insertion
D. The shared_preload_libraries parameter is missing the vector entry

Show answer

Correct answer: B

pgvector 0.5.x has a hard limit of 2000 dimensions per vector. This limit was increased to 16000 dimensions in pgvector 0.7.0. To use 2048-dimensional vectors, the team must upgrade to pgvector 0.7.0 or later. Alternatively, they could use a model that produces embeddings within the 2000-dimension limit, such as text-embedding-ada-002 (1536 dimensions) or text-embedding-3-small with a reduced dimensions parameter.

Question 4

You are choosing an Azure OpenAI embedding model for a new application. You need 256-dimensional embeddings to minimize storage costs while maintaining reasonable search quality. Which model and configuration should you use?

A. text-embedding-ada-002 with a dimensions parameter set to 256
B. text-embedding-3-small with a dimensions parameter set to 256
C. text-embedding-3-large with no dimensions parameter
D. text-embedding-ada-002 with output truncated to the first 256 floats

Show answer

Correct answer: B

The text-embedding-3-small and text-embedding-3-large models support a dimensions parameter that uses Matryoshka Representation Learning to produce shorter embeddings that still preserve semantic meaning. Setting dimensions=256 returns a valid 256-dimensional embedding. text-embedding-ada-002 does not support the dimensions parameter and always returns 1536 dimensions. Simply truncating output floats from ada-002 would destroy the embedding's semantic properties because that model was not trained with Matryoshka techniques.

Question 5

You need to estimate the storage required for 1 million embeddings from text-embedding-ada-002 (1536 dimensions) in a pgvector vector column. Which calculation is correct?

A. 1,000,000 x (4 bytes x 1536) = approximately 5.7 GB
B. 1,000,000 x (4 bytes x 1536 + 8 bytes) = approximately 5.7 GB
C. 1,000,000 x (8 bytes x 1536 + 8 bytes) = approximately 11.5 GB
D. 1,000,000 x (2 bytes x 1536) = approximately 2.9 GB

Show answer

Correct answer: B

pgvector stores each dimension as a 4-byte single-precision float (float32) and adds 8 bytes of overhead per vector for internal metadata (dimension count and storage header). The calculation is: 1,000,000 x ((4 x 1536) + 8) = 1,000,000 x 6,152 = 6,152,000,000 bytes, which is approximately 5.7 GB. This accounts only for the vector column data itself; row overhead, indexes, and other columns add to the total table size.

Question 6

You have a table with 5 million product embeddings and need to create an approximate nearest neighbor index to speed up similarity queries. The table already contains all the data. Which index type and reasoning is correct?

A. HNSW, because it supports concurrent inserts and does not require a training step
B. IVFFlat, because it partitions existing data into lists and builds faster on populated tables
C. HNSW, because it requires the table to be fully populated before index creation
D. IVFFlat, because it provides exact nearest neighbor results

Show answer

Correct answer: B

IVFFlat is often the better choice when the table already contains a representative dataset because it uses k-means clustering to partition vectors into inverted lists during index creation. This training step produces well-balanced clusters when run on existing data. HNSW does not require data to exist first and handles incremental inserts well, but it takes significantly longer to build and uses more memory. IVFFlat builds faster and works well when you can rebuild the index periodically. Note that both index types provide approximate (not exact) nearest neighbor results. If you need to index an empty table that will receive data incrementally, HNSW is the better choice since IVFFlat trained on empty or sparse data produces poor-quality clusters.

Question 7

You are writing a Python application using psycopg (version 3) to query pgvector embeddings. Your query returns vector data, but each result is a raw string like [0.1,0.2,0.3] instead of a Python list or numpy array. What should you do?

A. Cast the column to text[] in the SQL query
B. Call register_vector() from the pgvector.psycopg module on the connection
C. Set autocommit=True on the connection before querying
D. Parse the string manually with json.loads() in application code

Show answer

Correct answer: B

The pgvector Python package provides a register_vector() function that registers a custom type adapter with psycopg. After calling register_vector(conn), pgvector columns are automatically deserialized into numpy arrays (or Python lists if numpy is not installed). Without this registration, psycopg does not know how to handle the vector type and returns raw string representations. While json.loads() could parse the string, it is fragile and misses the type registration that also handles parameterized queries with vector inputs.

Question 8

You are building a batch pipeline that generates embeddings for 100,000 documents using the Azure OpenAI embeddings API. During testing, you frequently encounter HTTP 429 (Too Many Requests) errors. Which approach best addresses this?

A. Send all 100,000 documents in a single API call to minimize request count
B. Implement exponential backoff with jitter, batch documents into groups of up to 2048 tokens per request, and respect the Retry-After header
C. Create 10 separate Azure OpenAI deployments and distribute requests evenly across them
D. Switch to a self-hosted embedding model to avoid rate limits entirely

Show answer

Correct answer: B

The Azure OpenAI embeddings API enforces rate limits based on tokens per minute and requests per minute. The correct approach is to batch multiple texts per API call (the API accepts an array of inputs), implement exponential backoff with jitter when receiving 429 responses, and honor the Retry-After header value. Sending all documents in one call is not possible due to per-request token limits. Creating multiple deployments adds unnecessary complexity and cost, and may still hit subscription-level limits. Self-hosted models avoid rate limits but introduce significant operational overhead and may not match the quality of Azure OpenAI models.

Question 9

A developer stores embeddings in a pgvector column and notices that cosine distance queries return unexpected results, with documents that should be dissimilar receiving low distance scores. The embeddings come from a custom model that produces unnormalized vectors with widely varying magnitudes. What is the most likely cause and fix?

A. pgvector truncates vectors longer than 1024 dimensions; reduce the dimensionality
B. The float32 precision of pgvector introduces rounding errors; switch to double precision[]
C. Cosine distance in pgvector handles magnitude differences correctly; the issue is in the embedding model
D. The vectors should be L2-normalized before storage so that cosine distance produces meaningful results, or use the <-> (L2 distance) operator instead

Show answer

Correct answer: D

While pgvector's <=> operator does compute cosine distance (which theoretically accounts for magnitude), numerical precision issues with float32 storage can amplify when vector magnitudes vary dramatically. The best practice is to L2-normalize vectors (scale each vector to unit length) before inserting them into pgvector. With normalized vectors, cosine distance and inner product become equivalent, and results are more numerically stable. Alternatively, if normalization is not feasible, switching to L2 distance (<->) may produce more predictable results for unnormalized vectors, depending on the use case.

Question 10

You need to insert a 3-dimensional embedding vector into a pgvector column using a raw SQL statement. Which INSERT syntax is correct?

A. INSERT INTO items (embedding) VALUES (ARRAY[0.1, 0.2, 0.3]::vector);
B. INSERT INTO items (embedding) VALUES ('[0.1, 0.2, 0.3]');
C. INSERT INTO items (embedding) VALUES (vector(0.1, 0.2, 0.3));
D. INSERT INTO items (embedding) VALUES ({0.1, 0.2, 0.3}::vector);

Show answer

Correct answer: B

pgvector uses a string literal format with square brackets and comma-separated values: '[0.1, 0.2, 0.3]'. PostgreSQL automatically casts this string to the vector type when the target column is defined as vector. Option A would also work because pgvector supports casting from real[] arrays to vector, but the canonical and most commonly documented syntax is the string literal format in option B. Option C is invalid because vector() is not a constructor function in pgvector. Option D uses curly braces, which is PostgreSQL's array literal syntax and is not valid for the vector type without an intermediate array cast.