| layout | default |
|---|---|
| title | Chapter 4: Hybrid Search |
| parent | LanceDB Tutorial |
| nav_order | 4 |
Welcome to Chapter 4: Hybrid Search. In this part of LanceDB Tutorial: Serverless Vector Database for AI, you will build an intuitive mental model first, then move into concrete implementation details and practical production tradeoffs.
Combine vector similarity search with full-text search and SQL filters for powerful retrieval systems.
flowchart TD
A[Query] --> B[Vector search branch]
A --> C[Full-text search BM25 branch]
A --> D[Scalar filter branch]
B --> E[Vector scores]
C --> F[BM25 scores]
D --> G[Filtered candidates]
E --> H[RRF fusion]
F --> H
G --> H
H --> I[Hybrid results]
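The fusion step in the diagram boils down to one formula: each branch's ranked list contributes 1/(k + rank) to every item it returns, and items are re-sorted by the summed score. A minimal, library-free sketch:

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion over several ranked lists of item ids."""
    scores = {}
    for ranking in rankings:
        for rank, item_id in enumerate(ranking, start=1):
            scores[item_id] = scores.get(item_id, 0.0) + 1.0 / (k + rank)
    # Higher summed score = agreement across branches
    return sorted(scores, key=scores.get, reverse=True)
```

An item ranked reasonably well by both the vector and BM25 branches outranks one that appears near the top of only a single list, which is exactly why RRF is a robust default fusion method.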
Hybrid search combines multiple retrieval strategies to improve search quality. This chapter covers LanceDB's full-text search capabilities, combining vector and keyword search, and building effective hybrid retrieval pipelines.
import lancedb
db = lancedb.connect("./my_lancedb")
# Create table with text content
data = [
{"id": 1, "title": "Introduction to Machine Learning", "content": "Machine learning is a subset of AI..."},
{"id": 2, "title": "Deep Learning Fundamentals", "content": "Deep learning uses neural networks..."},
{"id": 3, "title": "Natural Language Processing", "content": "NLP deals with text and speech..."},
]
table = db.create_table("articles", data)
# Create full-text search index
table.create_fts_index("content") # Index the content column
# Also index title
table.create_fts_index("title")

# Search using full-text index
results = table.search("machine learning", query_type="fts") \
.limit(10) \
.to_pandas()
print(results)
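The `_score` column in FTS results is a BM25 relevance score. For intuition about what BM25 rewards, here is a stripped-down reimplementation over pre-tokenized documents (illustrative only; LanceDB's FTS index computes this internally):

```python
import math

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Toy BM25 over tokenized docs: rare terms and short docs score higher."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    scores = []
    for doc in docs:
        score = 0.0
        for term in query_terms:
            df = sum(1 for d in docs if term in d)           # document frequency
            idf = math.log(1 + (n - df + 0.5) / (df + 0.5))  # rarity weight
            tf = doc.count(term)                             # term frequency
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(score)
    return scores
```

Documents matching more (and rarer) query terms score higher; documents matching none score zero.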
# Returns: id, title, content, _score (BM25 score)

# Search specific columns
results = table.search("neural networks", query_type="fts", fts_columns=["title", "content"]) \
.limit(10) \
.to_pandas()
# Boolean queries
results = table.search("machine AND learning", query_type="fts") \
.limit(10) \
.to_pandas()
# Phrase search
results = table.search('"deep learning"', query_type="fts") \
.limit(10) \
.to_pandas()
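When phrase queries are built from user input, embedded double quotes can break the query syntax. A small hypothetical helper (not part of LanceDB; since quote-escaping support varies by query parser, this sketch simply drops embedded quotes):

```python
def phrase_query(text: str) -> str:
    """Wrap user text as an FTS phrase query.
    Embedded double quotes are replaced to keep the query parsable."""
    return '"' + text.replace('"', ' ').strip() + '"'
```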
# Fuzzy search: "machne~1" allows an edit distance of 1
results = table.search("machne~1", query_type="fts") \
.limit(10) \
.to_pandas()

import lancedb
import numpy as np
db = lancedb.connect("./my_lancedb")
table = db.open_table("articles")
# Query text
query_text = "machine learning basics"
# Get vector embedding (using your embedding function)
query_vector = embed_text(query_text)
# Hybrid search: combines vector and FTS in a single query
results = table.search(query_type="hybrid") \
.vector(query_vector) \
.text(query_text) \
.limit(10) \
.to_pandas()

def weighted_hybrid_search(
table,
query_text: str,
query_vector: list,
vector_weight: float = 0.5,
fts_weight: float = 0.5,
limit: int = 10
):
"""Perform weighted hybrid search."""
# Get vector search results
vector_results = table.search(query_vector) \
.limit(limit * 2) \
.to_list()
# Get full-text search results
fts_results = table.search(query_text, query_type="fts") \
.limit(limit * 2) \
.to_list()
# Normalize and combine scores
combined = {}
# Process vector results
if vector_results:
max_dist = max(r['_distance'] for r in vector_results)
for r in vector_results:
item_id = r['id']
# Convert distance to similarity score (0-1)
vector_score = 1 - (r['_distance'] / max_dist if max_dist > 0 else 0)
combined[item_id] = {
'item': r,
'vector_score': vector_score,
'fts_score': 0
}
# Process FTS results
if fts_results:
max_score = max(r.get('_score', 0) for r in fts_results)
for r in fts_results:
item_id = r['id']
fts_score = r.get('_score', 0) / max_score if max_score > 0 else 0
if item_id in combined:
combined[item_id]['fts_score'] = fts_score
else:
combined[item_id] = {
'item': r,
'vector_score': 0,
'fts_score': fts_score
}
# Calculate weighted scores
for item_id in combined:
combined[item_id]['final_score'] = (
vector_weight * combined[item_id]['vector_score'] +
fts_weight * combined[item_id]['fts_score']
)
# Sort by final score
sorted_results = sorted(
combined.values(),
key=lambda x: x['final_score'],
reverse=True
)
return [r['item'] for r in sorted_results[:limit]]
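This chapter uses two different distance-to-similarity conversions: the relative form above (scores the batch's worst candidate 0) and a reciprocal form in the two-stage reranker later (decays smoothly, never reaching 0). Side by side:

```python
def relative_similarity(distance: float, max_distance: float) -> float:
    """1.0 at distance 0, 0.0 at the batch's worst distance (batch-dependent)."""
    if max_distance <= 0:
        return 1.0
    return 1.0 - distance / max_distance

def reciprocal_similarity(distance: float) -> float:
    """1.0 at distance 0, decaying toward 0; comparable across batches."""
    return 1.0 / (1.0 + distance)
```

The relative form only makes sense within a single result batch; the reciprocal form is stable across queries, which matters if scores are cached or compared between searches.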
# Usage
results = weighted_hybrid_search(
table,
query_text="machine learning tutorial",
query_vector=embed_text("machine learning tutorial"),
vector_weight=0.7,
fts_weight=0.3,
limit=10
)

def rrf_hybrid_search(
table,
query_text: str,
query_vector: list,
k: int = 60,
limit: int = 10
):
"""Combine vector and FTS using Reciprocal Rank Fusion."""
# Get vector search results
vector_results = table.search(query_vector) \
.limit(limit * 3) \
.to_list()
# Get full-text search results
fts_results = table.search(query_text, query_type="fts") \
.limit(limit * 3) \
.to_list()
# Calculate RRF scores
scores = {}
for rank, r in enumerate(vector_results):
item_id = r['id']
if item_id not in scores:
scores[item_id] = {'item': r, 'score': 0}
scores[item_id]['score'] += 1 / (k + rank + 1)
for rank, r in enumerate(fts_results):
item_id = r['id']
if item_id not in scores:
scores[item_id] = {'item': r, 'score': 0}
scores[item_id]['score'] += 1 / (k + rank + 1)
# Sort by RRF score
sorted_results = sorted(
scores.values(),
key=lambda x: x['score'],
reverse=True
)
return [r['item'] for r in sorted_results[:limit]]

# Query table with SQL (SQL execution support varies by LanceDB version/edition)
results = db.execute_sql("""
SELECT id, title, content
FROM articles
WHERE category = 'technology'
ORDER BY created_at DESC
LIMIT 100
""").to_pandas()

# Vector search with SQL-like filters
results = table.search(query_vector) \
.where("""
category IN ('technology', 'science')
AND published_at >= '2024-01-01'
AND is_active = true
""") \
.limit(10) \
.to_pandas()

# Aggregations (via SQL)
stats = db.execute_sql("""
SELECT
category,
COUNT(*) as count,
AVG(view_count) as avg_views
FROM articles
GROUP BY category
ORDER BY count DESC
""").to_pandas()
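If SQL execution is unavailable in your LanceDB build, the same aggregation can be computed client-side with pandas (a sketch, assuming the table fits in memory and has `category` and `view_count` columns):

```python
import pandas as pd

def category_stats(df: pd.DataFrame) -> pd.DataFrame:
    """Client-side equivalent of the GROUP BY query above."""
    return (
        df.groupby("category")
        .agg(count=("category", "size"), avg_views=("view_count", "mean"))
        .reset_index()
        .sort_values("count", ascending=False, ignore_index=True)
    )
```

Load the frame via `table.to_pandas()` first; for large tables, prefer server-side aggregation or Arrow-based tooling.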
# Subqueries
results = db.execute_sql("""
SELECT *
FROM articles
WHERE author_id IN (
SELECT id FROM authors WHERE expertise = 'AI'
)
LIMIT 100
""").to_pandas()

def two_stage_retrieval(
table,
query_text: str,
query_vector: list,
first_stage_limit: int = 100,
final_limit: int = 10
):
"""
Stage 1: Fast vector search for candidates
Stage 2: Rerank with cross-encoder or FTS boost
"""
# Stage 1: Get candidates with vector search
candidates = table.search(query_vector) \
.limit(first_stage_limit) \
.to_list()
# Stage 2: Score candidates with multiple signals
scored = []
for item in candidates:
# Vector similarity (convert distance to score)
vector_score = 1 / (1 + item['_distance'])
# Keyword match score
title_match = query_text.lower() in item.get('title', '').lower()
content_match = query_text.lower() in item.get('content', '').lower()
keyword_score = 0.3 * title_match + 0.2 * content_match
# Recency boost
# recency_score = calculate_recency(item.get('created_at'))
# Combined score
final_score = vector_score + keyword_score
scored.append((final_score, item))
# Sort and return top results
scored.sort(key=lambda x: x[0], reverse=True)
return [item for score, item in scored[:final_limit]]

def multi_index_retrieval(
db,
query_text: str,
query_vector: list,
limit: int = 10
):
"""Search across multiple tables and combine results."""
# Search different content types
articles = db.open_table("articles").search(query_vector).limit(limit).to_list()
docs = db.open_table("documents").search(query_vector).limit(limit).to_list()
faqs = db.open_table("faqs").search(query_vector).limit(limit).to_list()
# Tag results with source
for r in articles:
r['_source'] = 'articles'
for r in docs:
r['_source'] = 'documents'
for r in faqs:
r['_source'] = 'faqs'
# Combine and sort; comparing _distance across tables is only meaningful
# if all tables use the same embedding model and distance metric
all_results = articles + docs + faqs
all_results.sort(key=lambda x: x['_distance'])
return all_results[:limit]

def rag_retrieve(
table,
query: str,
embed_fn,
top_k: int = 5
) -> list[str]:
"""Retrieve context for RAG."""
query_vector = embed_fn(query)
results = table.search(query_vector) \
.limit(top_k) \
.select(["content"]) \
.to_list()
return [r['content'] for r in results]
def rag_generate(query: str, context: list[str], llm) -> str:
"""Generate response using retrieved context."""
context_text = "\n\n".join(context)
prompt = f"""Based on the following context, answer the question.
Context:
{context_text}
Question: {query}
Answer:"""
return llm.generate(prompt)
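The snippets in this chapter call `embed_text` / `embed_fn` without defining them; any sentence-embedding model fits. To run the examples without a model, a deterministic stand-in can be used (purely hypothetical, not part of LanceDB):

```python
import hashlib
import math

def fake_embed(text: str, dim: int = 8) -> list[float]:
    """Deterministic toy embedding: hashes text into a unit vector.
    Swap in a real model (e.g. sentence-transformers) for meaningful results."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    vec = [digest[i % len(digest)] / 255.0 for i in range(dim)]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]
```

Because the output is deterministic, results are reproducible across runs, but the vectors carry no semantic meaning, so nearest-neighbor results will be arbitrary.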
# Usage
context = rag_retrieve(table, "What is LanceDB?", embed_fn)
response = rag_generate("What is LanceDB?", context, llm)

def hybrid_rag_retrieve(
table,
query: str,
embed_fn,
top_k: int = 5
) -> list[str]:
"""Hybrid retrieval for RAG."""
query_vector = embed_fn(query)
# Vector search
vector_results = table.search(query_vector) \
.limit(top_k * 2) \
.to_list()
# Full-text search
fts_results = table.search(query, query_type="fts") \
.limit(top_k * 2) \
.to_list()
# Combine with RRF (same scheme as rrf_hybrid_search above)
k = 60
scores = {}
for results in (vector_results, fts_results):
    for rank, r in enumerate(results):
        entry = scores.setdefault(r['id'], {'item': r, 'score': 0.0})
        entry['score'] += 1 / (k + rank + 1)
ranked = sorted(scores.values(), key=lambda x: x['score'], reverse=True)
return [r['item']['content'] for r in ranked[:top_k]]

def expand_query_with_synonyms(query: str, synonyms: dict) -> list[str]:
"""Expand query with synonyms."""
words = query.lower().split()
expanded_queries = [query]
for i, word in enumerate(words):
    for syn in synonyms.get(word, []):
        # Swap one word at a time; str.replace on the raw query would miss
        # case differences and hit substrings (e.g. "ml" inside "html")
        expanded_queries.append(" ".join(words[:i] + [syn] + words[i + 1:]))
return expanded_queries
# Usage
synonyms = {
"ml": ["machine learning", "artificial intelligence"],
"db": ["database", "data store"],
"fast": ["quick", "rapid", "speedy"]
}
queries = expand_query_with_synonyms("fast ml db", synonyms)
# ['fast ml db', 'quick ml db', 'rapid ml db', ...]
# Search with expanded queries
all_results = []
for q in queries:
results = table.search(q, query_type="fts").limit(10).to_list()
all_results.extend(results)
# Deduplicate by id, keeping the first occurrence of each result
seen = set()
unique_results = []
for r in all_results:
    if r['id'] not in seen:
        seen.add(r['id'])
        unique_results.append(r)

def llm_expand_query(query: str, llm) -> list[str]:
"""Use LLM to generate query variations."""
prompt = f"""Generate 3 alternative search queries for: "{query}"
Return only the queries, one per line, without numbering or explanation."""
response = llm.generate(prompt)
variations = [q.strip() for q in response.strip().split('\n') if q.strip()]
return [query] + variations[:3]

def choose_search_strategy(query: str) -> str:
"""Choose search strategy based on query characteristics."""
# Short queries might be keyword-heavy
if len(query.split()) <= 2:
return "hybrid"
# Questions often need semantic understanding
if query.endswith('?') or query.lower().startswith(('what', 'how', 'why', 'when')):
return "vector"
# Queries with special terms might need keyword match
if any(char in query for char in ['"', '+', '-']):
return "fts"
# Default to hybrid
return "hybrid"

# Good: selective filters first (the category filter alone removes ~90% of the data)
results = table.search(query_vector) \
.where("category = 'technology' AND year >= 2023") \
.limit(10) \
.to_pandas()
# Avoid: non-selective filters
# .where("is_active = true")  # if 99% of rows are active, this barely narrows the search

In this chapter, you've learned:
- Full-Text Search: Creating FTS indexes and BM25 search
- Hybrid Search: Combining vector and keyword search
- Fusion Methods: Weighted scoring and RRF
- SQL Integration: Complex queries and aggregations
- Multi-Stage: Two-stage retrieval and reranking
- RAG Integration: Building retrieval pipelines
- Query Expansion: Synonym and LLM-based expansion
- Hybrid is Often Best: Combine vector + FTS for best results
- RRF Works Well: Simple and effective fusion method
- Use Pre-filtering: More efficient than post-filtering
- Match to Use Case: Choose strategy based on query type
- Expand When Needed: Query expansion helps recall
Now that you understand hybrid search, let's explore Integrations in Chapter 5 for connecting LanceDB with LangChain, LlamaIndex, and other tools.
Ready for Chapter 5? Integrations
Generated for Awesome Code Docs
Most teams struggle here because the hard part is not writing more code, but drawing clear boundaries between searching, ranking, and filtering so behavior stays predictable as complexity grows.
In practical terms, this chapter helps you avoid three common failures:
- coupling core logic too tightly to one implementation path
- missing the handoff boundaries between setup, execution, and validation
- shipping changes without clear rollback or observability strategy
After working through this chapter, you should be able to reason about Chapter 4: Hybrid Search as an operating subsystem inside LanceDB Tutorial: Serverless Vector Database for AI, with explicit contracts for inputs, state transitions, and outputs.
Use the implementation notes around the search results, query text, and query vector as a checklist when adapting these patterns to your own repository.
Under the hood, Chapter 4: Hybrid Search usually follows a repeatable control path:
- Context bootstrap: initialize runtime config and prerequisites for search.
- Input normalization: shape incoming queries so each stage receives stable contracts.
- Core execution: run the main search branches and propagate intermediate state through the table.
- Policy and safety checks: enforce limits, auth scopes, and failure boundaries.
- Output composition: return canonical result payloads for downstream consumers.
- Operational telemetry: emit logs/metrics needed for debugging and performance tuning.
When debugging, walk this sequence in order and confirm each stage has explicit success/failure conditions.
Use the following upstream sources to verify implementation details while reading this chapter:
- Awesome Code Docs (github.com): the authoritative reference for this tutorial series.

Suggested trace strategy:
- search upstream code for `search` and `limit` to map concrete implementation paths
- compare docs claims against actual runtime/config code before reusing patterns in production