Comprehensive guide to understanding and working with embedding models in the Multi-Modal Academic Research System.
- What Are Embeddings?
- Current Implementation
- Choosing an Embedding Model
- Model Comparison
- Changing Embedding Models
- Fine-Tuning Embeddings
- Optimization Techniques
- Advanced Topics
Embeddings are dense vector representations of text that capture semantic meaning. Similar concepts have similar vectors (measured by cosine similarity or other distance metrics).
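As a minimal sketch of the cosine similarity mentioned above, the following uses only the standard library. The three-dimensional vectors and their names are made up for illustration; they are not output from any model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors: two related concepts and one unrelated concept
vec_a = [0.2, -0.4, 0.7]
vec_b = [0.25, -0.35, 0.65]
vec_c = [-0.6, 0.3, -0.1]

sim_related = cosine_similarity(vec_a, vec_b)
sim_unrelated = cosine_similarity(vec_a, vec_c)
print(f"related: {sim_related:.3f}, unrelated: {sim_unrelated:.3f}")
```

Values close to 1 indicate vectors pointing in nearly the same direction; values near or below 0 indicate unrelated or opposing directions.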
Example:
"machine learning" → [0.23, -0.45, 0.67, ..., 0.12] (384 dimensions)
"neural networks" → [0.21, -0.42, 0.69, ..., 0.15] (similar vector)
"pizza recipe" → [-0.67, 0.34, -0.12, ..., 0.89] (different vector)
Traditional keyword search has limitations:
- Exact matching only: "ML" won't match "machine learning"
- No semantic understanding: Can't understand synonyms or context
- Poor ranking: Simple term frequency doesn't capture relevance
Embeddings enable:
- Semantic search: Find conceptually similar content
- Context awareness: Understand meaning, not just words
- Better ranking: Relevance based on meaning
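The exact-matching limitation can be seen with a toy keyword scorer. This is an illustrative sketch only: `keyword_overlap` is a hypothetical helper, not part of the system, and real keyword engines are more sophisticated than plain term overlap.

```python
def keyword_overlap(query, document):
    """Count query terms that appear verbatim in the document (case-insensitive)."""
    query_terms = set(query.lower().split())
    doc_terms = set(document.lower().split())
    return len(query_terms & doc_terms)


doc = "An introduction to machine learning for researchers"

# The abbreviation shares no literal term with the document
score_abbrev = keyword_overlap("ML", doc)
# The spelled-out phrase matches both of its terms
score_exact = keyword_overlap("machine learning", doc)
print(score_abbrev, score_exact)
```

An embedding-based comparison does not depend on shared surface terms, which is why it can relate an abbreviation to its expansion when the model has seen both in similar contexts.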
The system uses sentence-transformers/all-MiniLM-L6-v2 by default.
Specifications:
- Parameters: 22.7 million
- Embedding dimension: 384
- Max sequence length: 256 tokens
- Speed: ~1000 sentences/second (CPU)
- Performance: Good balance of speed and quality
Location in code: multi_modal_rag/indexing/opensearch_manager.py
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
- During indexing:
def generate_embedding(self, text: str) -> List[float]:
    """Generate embedding vector for text."""
    embedding = self.model.encode(
        text,
        convert_to_numpy=True,
        normalize_embeddings=True
    )
    return embedding.tolist()
- During search:
def hybrid_search(self, query_text: str, size: int = 10):
    """Search using both keywords and embeddings."""
    # Generate query embedding
    query_embedding = self.generate_embedding(query_text)
    # Combine with keyword search
    query = {
        "query": {
            "bool": {
                "should": [
                    # Keyword search
                    {
                        "multi_match": {
                            "query": query_text,
                            "fields": ["title^2", "abstract", "content"]
                        }
                    },
                    # Semantic search
                    {
                        "knn": {
                            "embedding": {
                                "vector": query_embedding,
                                "k": size
                            }
                        }
                    }
                ]
            }
        }
    }

- Embedding dimension
  - Larger = more information, but slower and more memory
  - Smaller = faster, less memory, but less nuanced
- Model size
  - Larger models: Better quality, slower inference
  - Smaller models: Faster, use less memory
- Domain specialization
  - General-purpose vs. domain-specific (e.g., scientific text)
  - Pre-trained on similar data = better performance
- Speed requirements
  - Real-time search needs fast models
  - Batch processing can use slower, higher-quality models
- Hardware constraints
  - GPU available? Can use larger models
  - CPU only? Need efficient models
Do you need real-time search?
├─ Yes → Use small, fast model (MiniLM, TinyBERT)
└─ No → Do you have GPU?
    ├─ Yes → Use larger model (MPNet, BERT-large)
    └─ No → Use medium model (MiniLM, BERT-base)
Is your content domain-specific?
├─ Scientific → Use SciBERT, BioBERT
├─ Legal → Use Legal-BERT
├─ Code → Use CodeBERT
└─ General → Use general models (SBERT, MPNet)
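The two decision trees above can be written down as a small helper. This is a sketch for illustration: the function name `suggest_model_family` and its parameters are hypothetical, and the returned strings simply repeat the recommendations from the trees.

```python
def suggest_model_family(realtime: bool, has_gpu: bool, domain: str = "general") -> dict:
    """Mirror the size and domain decision trees as plain conditionals."""
    # First tree: latency requirement, then hardware
    if realtime:
        size = "small, fast model (MiniLM, TinyBERT)"
    elif has_gpu:
        size = "larger model (MPNet, BERT-large)"
    else:
        size = "medium model (MiniLM, BERT-base)"

    # Second tree: domain specialization, falling back to general models
    domain_models = {
        "scientific": "SciBERT, BioBERT",
        "legal": "Legal-BERT",
        "code": "CodeBERT",
        "general": "general models (SBERT, MPNet)",
    }
    specialization = domain_models.get(domain.lower(), domain_models["general"])
    return {"size": size, "specialization": specialization}


suggestion = suggest_model_family(realtime=False, has_gpu=True, domain="scientific")
print(suggestion)
```

The two answers are independent, so a real choice may still need a trade-off, for example when a domain-specific model is too slow for real-time search.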
| Model | Dimensions | Size (MB) | Speed* | Quality** |
|---|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | 80 | Fast | Good |
| all-mpnet-base-v2 | 768 | 420 | Medium | Excellent |
| all-distilroberta-v1 | 768 | 290 | Medium | Very Good |
| paraphrase-MiniLM-L3-v2 | 384 | 60 | Very Fast | Fair |
| msmarco-distilbert-base-v4 | 768 | 250 | Medium | Very Good |
| multi-qa-mpnet-base-dot-v1 | 768 | 420 | Medium | Excellent |
*Speed: Sentences per second on CPU

**Quality: Performance on semantic search benchmarks
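The Dimensions column also determines how much space the stored vectors need. As a rough sketch, assuming each value is stored as a 4-byte float32 and ignoring any overhead from the k-NN index structure itself, the raw size scales linearly with the dimension. The helper name `vector_storage_bytes` is hypothetical.

```python
def vector_storage_bytes(num_vectors: int, dimension: int, bytes_per_value: int = 4) -> int:
    """Raw storage for the vectors alone (assumes float32 by default)."""
    return num_vectors * dimension * bytes_per_value


# One million documents, one vector each
size_384 = vector_storage_bytes(1_000_000, 384)
size_768 = vector_storage_bytes(1_000_000, 768)

print(f"384 dims: {size_384 / 1024**3:.2f} GiB")
print(f"768 dims: {size_768 / 1024**3:.2f} GiB")
```

Doubling the dimension from 384 to 768 doubles the raw vector storage, which is worth weighing against the quality gain of a larger model.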
# all-MiniLM-L6-v2
# Pros:
# - Fast inference (~1000 sentences/sec)
# - Small memory footprint
# - Good general-purpose performance
# - Easy to run on CPU
# Cons:
# - Lower quality than larger models
# - 256 token limit (short sequences)
# - Not domain-specialized
# Best for:
# - General-purpose applications
# - CPU-only environments
# - Real-time search

# all-mpnet-base-v2
# Pros:
# - Top-tier quality among the models compared here
# - 384 token limit by default (longer sequences than MiniLM's 256)
# - Excellent on benchmarks
# - Still reasonably fast
# Cons:
# - Larger size (420MB vs 80MB)
# - Slower inference (~500 sentences/sec)
# - Needs more memory
# Best for:
# - Quality-focused applications
# - When you have GPU
# - Longer documents

# multi-qa-mpnet-base-dot-v1
# Pros:
# - Optimized for question-answering
# - Excellent for asymmetric search (query vs document)
# - High quality results
# - Uses dot product (faster than cosine)
# Cons:
# - Larger model size
# - Slower inference
# - Needs reindexing with dot product space
# Best for:
# - Question-answering systems (like this one!)
# - Asymmetric search scenarios
# - When GPU is available

# Model: allenai/scibert_scivocab_uncased
# Dimensions: 768
# Trained on: Scientific papers (1.14M papers)
# Best for: Academic/scientific content
# Use when: Most content is from scientific papers
# Note: the line below loads SPECTER, a SciBERT-based model packaged for sentence-transformers
model = SentenceTransformer('sentence-transformers/allenai-specter')

# Model: dmis-lab/biobert-base-cased-v1.2
# Trained on: PubMed articles
# Best for: Medical/biological research
model = SentenceTransformer('pritamdeka/BioBERT-mnli-snli-scinli-scitail-mednli-stsb')

# Option 1: Higher quality (recommended)
new_model_name = 'sentence-transformers/all-mpnet-base-v2'
new_dimension = 768
# Option 2: Faster inference
new_model_name = 'sentence-transformers/paraphrase-MiniLM-L3-v2'
new_dimension = 384
# Option 3: Scientific content
new_model_name = 'sentence-transformers/allenai-specter'
new_dimension = 768

# In __init__ method
class OpenSearchManager:
    def __init__(self, host='localhost', port=9200):
        # ... existing code ...
        # Load new model
        self.model = SentenceTransformer('all-mpnet-base-v2')  # Changed!
        # Update dimension
        self.embedding_dimension = 768  # Changed from 384!

    def create_index(self, index_name='research_assistant'):
        """Create index with updated dimension."""
        index_body = {
            "settings": {
                "index": {
                    "knn": True,
                    "knn.space_type": "cosinesimil"
                }
            },
            "mappings": {
                "properties": {
                    "embedding": {
                        "type": "knn_vector",
                        "dimension": 768  # Updated dimension!
                    },
                    # ... other fields ...
                }
            }
        }
        self.client.indices.create(index=index_name, body=index_body)

# Delete old index (vectors of the old dimension cannot be reused)
manager = OpenSearchManager()
manager.client.indices.delete(index='research_assistant')
# Create new index with updated dimension
manager.create_index('research_assistant')
# Reindex all documents
# Run your data collection and indexing pipeline

#!/usr/bin/env python3
"""
Script to migrate to a new embedding model.
"""
from opensearchpy import helpers
from sentence_transformers import SentenceTransformer

from multi_modal_rag.indexing.opensearch_manager import OpenSearchManager


def migrate_embeddings(
    old_index='research_assistant',
    new_index='research_assistant_v2',
    new_model_name='all-mpnet-base-v2',
    new_dimension=768
):
    """Migrate to new embedding model."""
    # Initialize
    manager = OpenSearchManager()
    new_model = SentenceTransformer(new_model_name)
    # Fail early if the model does not produce vectors of the expected size
    assert new_model.get_sentence_embedding_dimension() == new_dimension

    # Create new index
    # Note: create_index() must already map `embedding` with new_dimension
    print(f"Creating new index: {new_index}")
    manager.create_index(new_index)

    # Get all documents from old index
    print(f"Fetching documents from {old_index}")
    docs = helpers.scan(
        manager.client,
        index=old_index,
        query={"query": {"match_all": {}}}
    )

    # Reindex with new embeddings
    print("Reindexing with new embeddings...")
    batch = []
    total = 0  # Counted explicitly so the summary also works for an empty index
    for doc in docs:
        # Extract text
        text = doc['_source'].get('content', '')
        # Generate new embedding and update document
        doc['_source']['embedding'] = new_model.encode(text).tolist()
        batch.append({
            '_index': new_index,
            '_id': doc['_id'],
            '_source': doc['_source']
        })
        total += 1
        # Batch insert
        if len(batch) >= 100:
            helpers.bulk(manager.client, batch)
            print(f"Processed {total} documents...")
            batch = []
    # Insert remaining
    if batch:
        helpers.bulk(manager.client, batch)
    print(f"Migration complete! {total} documents reindexed.")

    # Verify (refresh first so the count includes the latest bulk writes)
    manager.client.indices.refresh(index=new_index)
    count = manager.client.count(index=new_index)
    print(f"New index has {count['count']} documents")

    # Switch alias (optional; assumes the "research" alias currently points at old_index)
    print("Switching alias...")
    manager.client.indices.update_aliases(body={
        "actions": [
            {"remove": {"index": old_index, "alias": "research"}},
            {"add": {"index": new_index, "alias": "research"}}
        ]
    })
    print("Done! You can now delete the old index:")
    print(f"  curl -X DELETE http://localhost:9200/{old_index}")


if __name__ == '__main__':
    migrate_embeddings()

Fine-tune when:
- Your domain is very specialized
- You have labeled training data
- Pre-trained models perform poorly
- You need maximum quality
# Triplet format: (anchor, positive, negative)
training_data = [
    {
        'anchor': 'What is machine learning?',
        'positive': 'Machine learning is a subset of AI...',
        'negative': 'The weather today is sunny...'
    },
    # ... more examples
]
# Or pairs format: (sentence1, sentence2, similarity_score)
training_pairs = [
    ('deep learning', 'neural networks', 0.9),
    ('deep learning', 'pizza recipe', 0.1),
    # ... more pairs
]

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader
# Load base model
model = SentenceTransformer('all-MiniLM-L6-v2')
# Prepare training data
train_examples = [
    InputExample(texts=[item['anchor'], item['positive'], item['negative']])
    for item in training_data
]
# Create dataloader
train_dataloader = DataLoader(
    train_examples,
    shuffle=True,
    batch_size=16
)
# Define loss
train_loss = losses.TripletLoss(model=model)
# Train
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    warmup_steps=100,
    output_path='./fine_tuned_model'
)
# Use fine-tuned model
custom_model = SentenceTransformer('./fine_tuned_model')

from sentence_transformers import SentenceTransformer
import torch
# Check GPU availability
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Using device: {device}")
# Load model on GPU
model = SentenceTransformer('all-MiniLM-L6-v2', device=device)
# Batch encoding (much faster)
texts = ["text 1", "text 2", "text 3"]  # Replace with your own list of strings
embeddings = model.encode(
    texts,
    batch_size=64,  # Adjust based on GPU memory
    show_progress_bar=True,
    convert_to_tensor=True  # Return a torch tensor on the model's device
)

import pickle
import hashlib


class EmbeddingCache:
    """Cache embeddings to avoid recomputation.

    Keys do not include the model name, so use one cache file per model.
    """

    def __init__(self, cache_file='embeddings_cache.pkl'):
        self.cache_file = cache_file
        self.cache = self._load_cache()

    def _load_cache(self):
        try:
            with open(self.cache_file, 'rb') as f:
                return pickle.load(f)
        except (FileNotFoundError, EOFError, pickle.UnpicklingError):
            # No usable cache on disk yet: start empty
            return {}

    def get_embedding(self, text, model):
        """Get embedding from cache or compute."""
        # Use a content hash as key. The built-in hash() is salted per
        # process, so its values would not match a cache loaded from disk.
        text_hash = hashlib.sha256(text.encode('utf-8')).hexdigest()
        if text_hash not in self.cache:
            self.cache[text_hash] = model.encode(text).tolist()
            self._save_cache()
        return self.cache[text_hash]

    def _save_cache(self):
        with open(self.cache_file, 'wb') as f:
            pickle.dump(self.cache, f)


# Usage
cache = EmbeddingCache()
embedding = cache.get_embedding("machine learning", model)

Reduce embedding dimension for faster search:
from sklearn.decomposition import PCA
import numpy as np
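# Get original embeddings
embeddings = model.encode(texts)  # Shape: (N, 384); PCA needs N >= n_components
# Reduce dimensions
pca = PCA(n_components=128)  # Reduce 384 → 128
reduced_embeddings = pca.fit_transform(embeddings)
# Apply the same fitted transform to query embeddings at search time:
# reduced_query = pca.transform(model.encode([query]))
# Quality vs speed tradeoff (rough estimates; measure on your own data):
# 384 → 256: ~99% quality, ~1.5x faster
# 384 → 128: ~95% quality, ~3x faster
# 384 → 64: ~85% quality, ~6x faster

Reduce memory and increase speed: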
import numpy as np


def quantize_embeddings(embeddings, bits=8):
    """Quantize embeddings to reduce memory."""
    # Normalize to [0, 1] (assumes the embeddings are not all one value)
    min_val = embeddings.min()
    max_val = embeddings.max()
    normalized = (embeddings - min_val) / (max_val - min_val)
    # Quantize; uint8 storage supports at most 8 bits per value
    scale = (2 ** bits) - 1
    quantized = np.round(normalized * scale).astype(np.uint8)
    return quantized, min_val, max_val


def dequantize_embeddings(quantized, min_val, max_val, bits=8):
    """Reconstruct approximate embeddings."""
    scale = (2 ** bits) - 1
    normalized = quantized.astype(np.float32) / scale
    return normalized * (max_val - min_val) + min_val


# Usage: 8-bit quantization reduces memory by 4x compared to float32
quantized, min_v, max_v = quantize_embeddings(embeddings, bits=8)

For long documents, use multiple embeddings:
def chunk_and_embed(text, model, chunk_size=256):
    """Split text into chunks and embed each."""
    # Split on whitespace. chunk_size counts words, so a chunk can still
    # exceed the model's token limit and be truncated.
    words = text.split()
    chunks = [
        ' '.join(words[i:i+chunk_size])
        for i in range(0, len(words), chunk_size)
    ]
    # Embed each chunk
    embeddings = model.encode(chunks)
    return embeddings

# Store in OpenSearch
chunk_embeddings = chunk_and_embed(long_text, model)
doc = {
    'content': long_text,
    'embeddings': chunk_embeddings.tolist(),  # List of vectors
    'num_chunks': len(chunk_embeddings)
}
# Search: Compare query to all chunks, take max similarity

Use cross-encoder for better ranking:
from sentence_transformers import CrossEncoder, SentenceTransformer

# Step 1: Fast bi-encoder retrieval (current method)
bi_encoder = SentenceTransformer('all-MiniLM-L6-v2')
# retrieve_top_100 is a placeholder for your own retrieval function
candidates = retrieve_top_100(query, bi_encoder)
# Step 2: Slow cross-encoder reranking
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-12-v2')
# Create query-document pairs
pairs = [[query, doc['content']] for doc in candidates]
# Score pairs
scores = cross_encoder.predict(pairs)
# Rerank
reranked = sorted(
    zip(candidates, scores),
    key=lambda x: x[1],
    reverse=True
)[:10]  # Top 10 after reranking

For multilingual support:
# Use multilingual model
model = SentenceTransformer('sentence-transformers/paraphrase-multilingual-mpnet-base-v2')
# Supports 50+ languages
# Same embedding space across languages!
# Example: Search in any language
query_en = "machine learning"
query_fr = "apprentissage automatique"
query_de = "maschinelles Lernen"
# All produce similar embeddings
emb_en = model.encode(query_en)
emb_fr = model.encode(query_fr)
emb_de = model.encode(query_de)
# Cosine similarity between these translations should be high (verify on your own data)

from sklearn.metrics.pairwise import cosine_similarity
import numpy as np


def evaluate_embeddings(model, test_queries, relevant_docs):
    """Evaluate embedding quality.

    Assumes relevant_docs[i] is the single relevant document for test_queries[i].
    """
    # Embed queries and documents
    query_embeddings = model.encode(test_queries)
    doc_embeddings = model.encode(relevant_docs)
    # Calculate similarities
    similarities = cosine_similarity(query_embeddings, doc_embeddings)
    # Calculate metrics
    # Mean Reciprocal Rank (MRR)
    reciprocal_ranks = []
    for i, sim_row in enumerate(similarities):
        # Get rank of correct document
        rank = np.where(np.argsort(sim_row)[::-1] == i)[0][0] + 1
        reciprocal_ranks.append(1.0 / rank)
    mrr = np.mean(reciprocal_ranks)
    # Mean Average Precision (MAP)
    # With exactly one relevant document per query, average precision is
    # 1 / rank, so MAP equals MRR in this setup
    map_score = mrr
    return {
        'mrr': mrr,
        'map': map_score
    }

import time
import numpy as np


def benchmark_model(model, texts, num_runs=5):
    """Benchmark embedding generation speed."""
    # Warmup
    model.encode(texts[:10])
    # Benchmark
    times = []
    for _ in range(num_runs):
        start = time.perf_counter()
        model.encode(texts, batch_size=32)
        times.append(time.perf_counter() - start)
    avg_time = np.mean(times)
    throughput = len(texts) / avg_time
    print(f"Average time: {avg_time:.2f}s")
    print(f"Throughput: {throughput:.0f} sentences/sec")
    return throughput

- Sentence Transformers Documentation
- Hugging Face Model Hub
- MTEB Leaderboard (Benchmark rankings)
- Hybrid Search Guide
- Performance Optimization
- Start with general models: all-MiniLM-L6-v2 or all-mpnet-base-v2
- Use GPU when possible: 10-100x speedup for large batches
- Cache embeddings: Don't recompute for same text
- Batch processing: Much faster than one-by-one
- Monitor quality: Use evaluation metrics
- Consider domain: Use specialized models when available
- Test before switching: Compare quality on your data
- Document your choice: Note why you chose a particular model
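To make "Monitor quality" and "Test before switching" concrete, one option is to compare ranked results from two models with recall@k on a small set of judged queries. The sketch below is an illustration, not the system's evaluation code: the helper names, query IDs, document IDs and rankings are all hypothetical.

```python
def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    top_k = set(ranked_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)


def mean_recall_at_k(results, judgments, k=10):
    """Average recall@k over all judged queries.

    results:   query id -> ranked list of document ids
    judgments: query id -> list of relevant document ids
    """
    scores = [recall_at_k(results[q], judgments[q], k) for q in judgments]
    return sum(scores) / len(scores)


# Hypothetical relevance judgments and rankings from two candidate models
judgments = {"q1": ["d1", "d2"], "q2": ["d3"]}
model_a = {"q1": ["d1", "d9", "d2"], "q2": ["d7", "d3", "d8"]}
model_b = {"q1": ["d9", "d1", "d8"], "q2": ["d7", "d8", "d3"]}

score_a = mean_recall_at_k(model_a, judgments, k=2)
score_b = mean_recall_at_k(model_b, judgments, k=2)
print(f"model A recall@2: {score_a:.2f}, model B recall@2: {score_b:.2f}")
```

Running the same judged queries against both indexes before switching gives a like-for-like number to record alongside the reason for the model choice.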