Created: 2025-01-08
Purpose: Consolidate industry best practices for RAG system design and implementation
Status: Active Implementation
This document consolidates best practices from leading research and production RAG systems:
- Best Practices for RAG Pipeline
- Optimizing RAG Retrieval - Google Cloud
- RAG for LLMs - Prompt Engineering Guide
┌─────────────────────────────────────────────────────────┐
│ Naive RAG │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Indexing │ → │ Retrieval│ → │Generation│ │
│ └──────────┘ └──────────┘ └──────────┘ │
│ │
│ Problems: Low precision, low recall, hallucination │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│ Advanced RAG │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │Pre-Retr. │ → │ Retrieval│ → │Post-Retr.│ │
│ │Optimize │ │+Enhanced │ │Re-ranking│ │
│ └──────────┘ └──────────┘ └──────────┘ │
│ │
│ Improvements: Better chunking, query rewriting, │
│ multi-signal scoring │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│ Modular RAG (Our System) │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Query │ → │ Multi- │ → │ Dedup & │ │
│ │ Rewrite │ │ Stage │ │ Rerank │ │
│ │ │ │ Retrieval│ │ │ │
│ └──────────┘ └──────────┘ └──────────┘ │
│ ↓ ↓ ↓ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Concept │ │ Hybrid │ │Position │ │
│ │Extract │ │ Search │ │Optimize │ │
│ └──────────┘ └──────────┘ └──────────┘ │
│ │
│ Features: Flexible modules, context engineering, │
│ transparent scoring, evaluation-ready │
└─────────────────────────────────────────────────────────┘
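The modular flow above can be expressed as a thin orchestration layer. The sketch below is illustrative only; the module names (rewriter, retriever, reranker, optimizer) are hypothetical stand-ins for the components described later in this document, not the actual class names in our codebase.

class ModularRAGPipeline:
    """Illustrative orchestration of the modules shown above (interfaces are hypothetical)."""

    def __init__(self, rewriter, retriever, reranker, optimizer):
        self.rewriter = rewriter      # Query Rewrite / Concept Extract
        self.retriever = retriever    # Multi-Stage Retrieval / Hybrid Search
        self.reranker = reranker      # Dedup & Rerank
        self.optimizer = optimizer    # Position Optimize

    async def run(self, query: str) -> list:
        variants = self.rewriter.rewrite(query)             # pre-retrieval
        candidates = await self.retriever.search(variants)  # retrieval
        ranked = self.reranker.rerank(candidates, query)    # post-retrieval
        return self.optimizer.reorder(ranked)                # context engineering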
Purpose: Determine if retrieval is needed (not all queries require RAG)
Best Practices:
- Classify queries into 15 task types
- Mark as "sufficient" if query has enough information
- Mark as "insufficient" if external knowledge needed
- Train automated classifier for production
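For the last item above, a minimal training sketch using scikit-learn is shown below; the labeled examples and the binary "needs retrieval" scheme are illustrative assumptions, not part of the current implementation.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labeled examples: 1 = needs retrieval, 0 = self-contained query
train_queries = ["latest news about transformers", "what is the result of 2 + 2"]
train_labels = [1, 0]

# Simple TF-IDF + logistic regression baseline for production query routing
retrieval_classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
retrieval_classifier.fit(train_queries, train_labels)

needs_rag = retrieval_classifier.predict(["best practices for chunking"])[0] == 1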
Implementation:
class QueryClassifier:
"""Classify if query needs RAG retrieval."""
SUFFICIENT_PATTERNS = [
"what is the result of", # Math/logic
"convert X to Y", # Transformation
"define X", # Simple definition
]
INSUFFICIENT_PATTERNS = [
"latest news about", # Current events
"how does X work in", # Domain-specific
"best practices for", # Expert knowledge
]
def needs_retrieval(self, query: str) -> bool:
"""Determine if query needs external retrieval."""
query_lower = query.lower()
# Check for sufficient patterns
if any(pattern in query_lower for pattern in self.SUFFICIENT_PATTERNS):
return False
# Check for insufficient patterns
if any(pattern in query_lower for pattern in self.INSUFFICIENT_PATTERNS):
return True
# Default: use retrieval (safer)
        return True

Research Findings:
- Optimal chunk size: 512 tokens (balances context and precision)
- Chunk overlap: 20-50 tokens (maintains continuity)
- Chunking level: Sentence-level (preserves meaning)
- Separators: ["\n\n", "\n", ". ", " ", ""] (natural boundaries)
Implementation:
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Optimal configuration based on research
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=512, # Optimal size for balance
chunk_overlap=50, # 20-50 token overlap
separators=["\n\n", "\n", ". ", " ", ""], # Sentence-level
length_function=len
)

Key Metrics:
- Faithfulness: response stays grounded in the retrieved chunks (no hallucination)
- Relevancy: retrieved text actually addresses the query
Research Winner: LLM-Embedder or BAAI/bge-large-en
Criteria:
- Performance vs. size trade-off
- Semantic understanding quality
- Inference speed
- Resource requirements
Our Choice:
from langchain_huggingface import HuggingFaceEmbeddings
# HuggingFace all-MiniLM-L6-v2: Fast, lightweight, effective
embeddings = HuggingFaceEmbeddings(
model_name="all-MiniLM-L6-v2",
model_kwargs={'device': 'cpu'},
encode_kwargs={'normalize_embeddings': True}
)

Research Winner: Milvus (for production scale)
Our Choice: Qdrant (local, auto-persistent, production-ready)
Comparison:
| Database | Performance | Ease of Use | Persistence | Production Ready |
|---|---|---|---|---|
| FAISS | Fast | Easy | Manual | No (in-memory) |
| Chroma | Good | Very Easy | Automatic | Yes |
| Qdrant | Excellent | Easy | Automatic | Yes |
| Milvus | Best | Complex | Automatic | Yes (scale) |
Why Qdrant:
- Auto-persistent (no manual save/load)
- Local embedded mode (no server needed)
- Production-ready with scaling path
- Excellent performance
- Clean API
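A minimal sketch of Qdrant's local embedded mode is shown below; the path and collection name are placeholders, and the vector size of 384 matches all-MiniLM-L6-v2.

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

# Embedded/local mode: data is persisted to disk automatically, no server required
client = QdrantClient(path="./qdrant_data")

client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),  # all-MiniLM-L6-v2 dimension
)

client.upsert(
    collection_name="documents",
    points=[PointStruct(id=1, vector=[0.0] * 384, payload={"content": "example chunk"})],
)

hits = client.search(collection_name="documents", query_vector=[0.0] * 384, limit=5)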
Research Winner: HyDE + Hybrid Search
Retrieval Strategies:
- Query Rewriting: Improve query formulation
- Query Decomposition: Break into sub-questions
- Pseudo-Document Generation (HyDE): Generate hypothetical documents
- Hybrid Search: Combine BM25 (sparse) + semantic (dense)
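Hybrid search is not yet part of our pipeline (see the roadmap below). A minimal sketch that fuses BM25 and dense rankings with reciprocal rank fusion is shown here; it assumes the rank_bm25 package and a pre-computed dense ranking (semantic_hits), both of which are illustrative.

from rank_bm25 import BM25Okapi

def hybrid_search(query: str, corpus: list[str], semantic_hits: list[int], k: int = 60) -> list[int]:
    """Fuse BM25 (sparse) and semantic (dense) rankings via reciprocal rank fusion."""
    # Sparse ranking over the whitespace-tokenized corpus
    bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
    scores = bm25.get_scores(query.lower().split())
    bm25_ranking = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)

    # Reciprocal rank fusion: each ranking contributes 1 / (k + rank)
    fused: dict[int, float] = {}
    for ranking in (bm25_ranking, semantic_hits):
        for rank, doc_id in enumerate(ranking):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(fused, key=fused.get, reverse=True)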
Our Implementation:
class AdvancedRetrieval:
"""Multi-stage retrieval with best practices."""
async def retrieve(self, query: str) -> List[Document]:
"""Execute advanced retrieval pipeline."""
# Stage 1: Query rewriting (multiple variants)
query_variants = self._rewrite_query(query)
# Stage 2: Retrieve with each variant
all_results = []
for variant in query_variants:
results = await self.semantic_search(variant, limit=5)
all_results.extend(results)
# Stage 3: Extract key concepts and search again
concepts = self._extract_key_concepts(query)
for concept in concepts[:3]:
results = await self.semantic_search(concept, limit=5)
all_results.extend(results)
# Stage 4: Deduplicate and re-rank
        return self._deduplicate_and_rerank(all_results, query)

Research Winner: monoT5 (balance) or RankLLaMA (best performance)
Our Implementation: Multi-signal scoring
def rerank_results(results: List[Dict], query: str) -> List[Dict]:
"""Multi-signal re-ranking."""
for result in results:
# Signal 1: Semantic relevance (50%)
semantic_score = result.get('relevance_score', 0.5)
# Signal 2: Keyword overlap - BM25-inspired (25%)
query_words = set(query.lower().split())
content_words = set(result['content'].lower().split())
keyword_score = len(query_words & content_words) / len(query_words)
# Signal 3: Content quality (15%)
quality_score = min(len(result['content']) / 1000, 1.0)
# Signal 4: Diversity bonus (10%)
diversity_score = calculate_diversity(result, results)
# Combined weighted score
result['combined_score'] = (
semantic_score * 0.50 +
keyword_score * 0.25 +
quality_score * 0.15 +
diversity_score * 0.10
)
# Sort by combined score
    return sorted(results, key=lambda x: x['combined_score'], reverse=True)

Research Finding: "Lost in the Middle" problem
Best Strategy: Reverse or Sides
- Place most relevant at beginning AND end
- Less relevant in middle
- LLMs attend to edges more than middle
Our Implementation:
def optimize_context_position(ranked_results: List[Dict]) -> List[Dict]:
    """
    Optimize result positions to avoid 'lost in the middle'.
    Strategy: Place most relevant at start and end, least relevant in the middle.
    """
    if len(ranked_results) <= 5:
        return ranked_results
    # Split by rank parity: even ranks fill the front, odd ranks fill the back
    front, back = [], []
    for i, result in enumerate(ranked_results):
        (front if i % 2 == 0 else back).append(result)
    # Reverse the back half so the higher-ranked results sit at the very end,
    # pushing the least relevant results toward the middle of the context
    return front + back[::-1]

Research Winner: Recomp (with LongLLMLingua as alternative)
Purpose: Reduce redundancy, prevent long prompts
Methods:
- Extractive: Select key sentences
- Generative: Synthesize information
- Selective Context: Remove redundant info
Our Approach: Let LLM handle with optimized context
# We provide top 15 results (already de-duplicated and re-ranked)
# LLM naturally focuses on most relevant
# No forced summarization unless context exceeds limits
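The "unless context exceeds limits" guard above could be sketched as a simple token-budget check; the ~4 characters-per-token estimate and the 12,000-token budget below are illustrative assumptions, not measured values.

from typing import Dict, List

MAX_CONTEXT_TOKENS = 12_000   # illustrative budget, not a measured model limit
CHARS_PER_TOKEN = 4           # rough heuristic for estimation only

def fit_context_to_budget(ranked_results: List[Dict]) -> List[Dict]:
    """Drop the lowest-ranked results only when the estimated context exceeds the budget."""
    kept, used = [], 0
    for result in ranked_results:              # already sorted best-first
        estimated = len(result["content"]) // CHARS_PER_TOKEN
        if used + estimated > MAX_CONTEXT_TOKENS and kept:
            break                              # stop before overflowing the window
        kept.append(result)
        used += estimated
    return kept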
- Create Golden Dataset
  - High-quality questions
  - Broad coverage of data
  - Real-world variations
  - Known-good answers
- Change One Variable at a Time
  - Isolate what's being tested
  - Keep evaluation consistent
  - Track improvements systematically
- Automated + Human Evaluation
  - Quantitative metrics for iteration
  - Qualitative feedback for quality
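A golden-dataset entry might be as simple as the structure below; the field names and the referenced document are illustrative, not part of an existing dataset.

GOLDEN_DATASET = [
    {
        "question": "What chunk size does the pipeline use and why?",
        "reference_answer": "512 tokens with 50-token overlap, balancing context and precision.",
        "expected_sources": ["rag_best_practices.md"],  # documents that should be retrieved
        "category": "configuration",                    # used to check coverage breadth
    },
    # ... more entries covering each document and real-world query variations
]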
Model-Based Metrics (Vertex AI Generative AI Evaluation):
class RAGEvaluationMetrics:
"""Comprehensive RAG evaluation metrics."""
def evaluate_response(
self,
query: str,
response: str,
context: List[str],
reference: str = None
) -> Dict[str, float]:
"""Evaluate RAG response quality."""
metrics = {}
# 1. Response Groundedness
# How well response aligns with retrieved context
metrics['groundedness'] = self._evaluate_groundedness(
response, context
)
# 2. Answer Relevance
# How relevant response is to query
metrics['relevance'] = self._evaluate_relevance(
query, response
)
# 3. Faithfulness
# Does response hallucinate or stay factual?
metrics['faithfulness'] = self._evaluate_faithfulness(
response, context
)
# 4. Context Relevance
# Are retrieved docs relevant to query?
metrics['context_relevance'] = self._evaluate_context_relevance(
query, context
)
# 5. Context Recall (if reference available)
# Did we retrieve all necessary information?
if reference:
metrics['context_recall'] = self._evaluate_context_recall(
reference, context
)
# 6. Answer Correctness (if reference available)
if reference:
metrics['answer_correctness'] = self._evaluate_correctness(
response, reference
)
        return metrics

Computation-Based Metrics:
from typing import Dict
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu
def evaluate_with_reference(response: str, reference: str) -> Dict:
"""Evaluate response against reference answer."""
# ROUGE Score (recall-oriented)
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'])
rouge_scores = scorer.score(reference, response)
# BLEU Score (precision-oriented)
bleu_score = sentence_bleu([reference.split()], response.split())
return {
'rouge1': rouge_scores['rouge1'].fmeasure,
'rouge2': rouge_scores['rouge2'].fmeasure,
'rougeL': rouge_scores['rougeL'].fmeasure,
'bleu': bleu_score
    }

Recommended Test Matrix:
TEST_EXPERIMENTS = {
'chunking': [
{'chunk_size': 256, 'overlap': 20},
{'chunk_size': 512, 'overlap': 50}, # Optimal
{'chunk_size': 1024, 'overlap': 100},
],
'retrieval_count': [5, 10, 15, 20],
'retrieval_methods': [
'semantic_only',
'semantic_with_rewrite',
'multi_stage', # Our current method
],
'reranking': [
'none',
'semantic_only',
'multi_signal', # Our current method
],
'models': [
'gemini-pro',
'gemini-1.5-pro',
'gemini-2.0-flash',
]
}

class RAGTransparency:
"""Make RAG decisions transparent and debuggable."""
def generate_transparency_report(
self,
query: str,
retrieval_results: List[Dict],
final_response: str
) -> Dict:
"""Generate comprehensive transparency report."""
return {
'query_analysis': {
'original_query': query,
'query_variants': self._get_query_variants(query),
'key_concepts': self._get_key_concepts(query),
'needs_retrieval': self._needs_retrieval(query)
},
'retrieval_details': {
'stages_executed': [
'query_rewriting',
'semantic_search',
'concept_extraction',
'deduplication',
'reranking'
],
'total_results_retrieved': len(retrieval_results),
'results_after_deduplication': self._count_unique(retrieval_results),
'top_results': [
{
'rank': i+1,
'content_preview': r['content'][:200],
'scores': r.get('scoring_details', {}),
'combined_score': r.get('combined_score', 0)
}
for i, r in enumerate(retrieval_results[:5])
]
},
'scoring_breakdown': {
'semantic_weight': 0.50,
'keyword_weight': 0.25,
'quality_weight': 0.15,
'diversity_weight': 0.10,
'explanation': 'Weights based on industry research'
},
'context_used': {
'num_documents': len(retrieval_results),
'total_characters': sum(len(r['content']) for r in retrieval_results),
'estimated_tokens': self._estimate_tokens(retrieval_results),
'position_optimization': 'reverse_packing'
},
'generation_details': {
'model': 'gemini-pro',
'temperature': 0.7,
'response_length': len(final_response),
'context_grounding': 'RAG-enhanced'
},
'quality_indicators': {
'context_relevance': self._assess_context_relevance(query, retrieval_results),
'response_groundedness': self._assess_groundedness(final_response, retrieval_results),
'estimated_confidence': self._estimate_confidence(retrieval_results)
}
        }

class RAGMonitoring:
"""Monitor RAG system performance in real-time."""
def __init__(self):
self.metrics = {
'queries_processed': 0,
'avg_retrieval_time': 0,
'avg_generation_time': 0,
'avg_results_retrieved': 0,
'retrieval_failures': 0,
'generation_failures': 0
}
def log_query_execution(self, execution_details: Dict):
"""Log query execution for monitoring."""
self.metrics['queries_processed'] += 1
# Update rolling averages
self._update_rolling_avg(
'avg_retrieval_time',
execution_details['retrieval_time']
)
self._update_rolling_avg(
'avg_generation_time',
execution_details['generation_time']
)
self._update_rolling_avg(
'avg_results_retrieved',
execution_details['results_count']
)
# Track failures
if execution_details.get('retrieval_failed'):
self.metrics['retrieval_failures'] += 1
if execution_details.get('generation_failed'):
self.metrics['generation_failures'] += 1
def get_health_report(self) -> Dict:
"""Generate RAG system health report."""
total = self.metrics['queries_processed']
return {
'total_queries': total,
'success_rate': 1 - (
(self.metrics['retrieval_failures'] +
self.metrics['generation_failures']) / total
) if total > 0 else 0,
'avg_retrieval_time_ms': self.metrics['avg_retrieval_time'] * 1000,
'avg_generation_time_ms': self.metrics['avg_generation_time'] * 1000,
'avg_results_per_query': self.metrics['avg_results_retrieved'],
'health_status': self._calculate_health_status()
        }

- Optimal chunking (512 tokens, 50 overlap, sentence-level)
- Qdrant vector database integration
- HuggingFace embeddings (all-MiniLM-L6-v2)
- Basic semantic search
- Query rewriting (multiple variants)
- Multi-stage retrieval (variants + concepts)
- Smart deduplication
- Multi-signal re-ranking
- Position optimization (lost in the middle mitigation)
- Evaluation metrics implementation
- Golden dataset creation
- Automated testing framework
- Transparency reporting
- Real-time monitoring dashboard
- Query classification (skip retrieval when not needed)
- Context caching
- Hybrid search (BM25 + semantic)
- Human evaluation framework
- A/B testing infrastructure
- Retrieval Time: < 500ms per query
- Total Response Time: < 2s end-to-end
- Throughput: > 50 queries/minute
- Groundedness: > 0.85
- Relevance: > 0.80
- Faithfulness: > 0.90
- Context Recall: > 0.85
- Success Rate: > 99%
- Retrieval Failures: < 1%
- Generation Failures: < 0.5%
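The latency and reliability targets above can be encoded as thresholds and checked against the RAGMonitoring health report; the sketch below covers only the fields that report exposes (quality targets such as groundedness would come from the evaluation pipeline instead), and the key names mirror get_health_report above.

PERFORMANCE_TARGETS = {
    "retrieval_time_ms": 500,        # upper bound per query
    "total_response_time_ms": 2000,  # upper bound end-to-end
    "success_rate": 0.99,            # lower bound
}

def check_targets(health_report: dict) -> dict:
    """Compare the latest health report against the targets above."""
    total_ms = (health_report["avg_retrieval_time_ms"]
                + health_report["avg_generation_time_ms"])
    return {
        "retrieval_time_ok": health_report["avg_retrieval_time_ms"] <= PERFORMANCE_TARGETS["retrieval_time_ms"],
        "total_time_ok": total_ms <= PERFORMANCE_TARGETS["total_response_time_ms"],
        "success_rate_ok": health_report["success_rate"] >= PERFORMANCE_TARGETS["success_rate"],
    }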
- Best Practices for RAG Pipeline - MasteringLLM
- Optimizing RAG Retrieval - Google Cloud
- RAG for LLMs - Prompt Engineering Guide
- Retrieval-Augmented Generation for Large Language Models: A Survey
Status: Living Document - Updated with Each Research Finding
Last Updated: 2025-01-08
Next Review: After Phase 3 completion