This document describes the enhanced Retrieval-Augmented Generation (RAG) features added to Infomaid, including TF-IDF, BM25, improved document chunking, and hybrid retrieval methods.
The enhanced RAG system provides multiple retrieval strategies and advanced document processing to improve the quality and relevance of responses when querying local document collections.
**Vector Search**

- Uses semantic embeddings for similarity search
- Best for: General semantic understanding
- Command: `--retrievalmethod vector`
**TF-IDF Search**

- Term Frequency-Inverse Document Frequency weighting with cosine similarity (see the sketch below)
- Best for: Keyword-based retrieval, technical documents
- Command: `--retrievalmethod tfidf`
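As a minimal sketch of how TF-IDF retrieval with cosine similarity can work, the snippet below uses sklearn's `TfidfVectorizer` (which the technical details section confirms Infomaid uses); the chunk texts, the query, and the exact constructor arguments are placeholders, not Infomaid's actual code.

```python
# Minimal sketch of TF-IDF retrieval scored with cosine similarity.
# Chunks and query are placeholders; Infomaid's internals may differ.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

chunks = [
    "Vector embeddings capture semantic similarity.",
    "BM25 is a ranking function used by search engines.",
    "TF-IDF weights terms by frequency and rarity.",
]

# ngram_range and stop-word filtering mirror the defaults described below.
vectorizer = TfidfVectorizer(ngram_range=(1, 3), stop_words="english")
chunk_matrix = vectorizer.fit_transform(chunks)

query_vector = vectorizer.transform(["How does TF-IDF weighting work?"])
scores = cosine_similarity(query_vector, chunk_matrix).ravel()

# Rank chunks by descending similarity score.
for idx in scores.argsort()[::-1]:
    print(f"{scores[idx]:.3f}  {chunks[idx]}")
```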
**BM25 Search**

- Best Match 25 ranking algorithm (used by search engines)
- Best for: Information retrieval, ranking by relevance
- Command: `--retrievalmethod bm25`
**Hybrid Search** (default)

- Combines vector embeddings and TF-IDF scores
- Best for: Balanced results across query types
- Command: `--retrievalmethod hybrid`
**Semantic Chunking**

- Respects sentence boundaries and maintains context
- Uses overlap for continuity between chunks
- Preserves meaning better than character-based splitting (see the sketch below)
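The following sketch shows one way sentence-aware chunking with overlap can be implemented. The regex sentence splitter and the `max_chars`/`overlap_sentences` parameters are illustrative assumptions standing in for Infomaid's actual settings.

```python
# Sketch of sentence-aware chunking with overlap (illustrative only;
# the real implementation may use a different sentence splitter).
import re

def semantic_chunks(text, max_chars=800, overlap_sentences=1):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, length = [], [], 0
    for sentence in sentences:
        if current and length + len(sentence) > max_chars:
            chunks.append(" ".join(current))
            # Carry the last sentence(s) forward for continuity between chunks.
            current = current[-overlap_sentences:]
            length = sum(len(s) for s in current)
        current.append(sentence)
        length += len(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks
```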
**Topic-Based Chunking**

- Groups related content using TF-IDF and K-means clustering (see the sketch below)
- Creates topically coherent chunks
- Automatically determines optimal number of topics
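A rough illustration of topic-based grouping with sklearn's `TfidfVectorizer` and `KMeans`. Here the number of topics is fixed by the caller (`n_topics`), whereas the real system determines it automatically.

```python
# Rough sketch: group sentences into topical chunks via TF-IDF + K-means.
# n_topics is fixed here; the real system picks the number automatically.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def topic_chunks(sentences, n_topics=3):
    matrix = TfidfVectorizer(stop_words="english").fit_transform(sentences)
    labels = KMeans(n_clusters=n_topics, n_init=10, random_state=0).fit_predict(matrix)
    groups = {}
    for sentence, label in zip(sentences, labels):
        groups.setdefault(label, []).append(sentence)
    # Each chunk is the concatenation of sentences assigned to one topic.
    return [" ".join(group) for group in groups.values()]
```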
**Hierarchical Chunking**

- Follows document structure (paragraphs → sentences)
- Preserves document organization
- Adapts to content size dynamically
**Adaptive Chunking**

- Adjusts chunk size based on content complexity (see the sketch below)
- Considers sentence length, punctuation, and structure
- Optimizes for different content types
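One hypothetical heuristic for adapting chunk size to complexity: the weighting of sentence length and punctuation density below is invented for illustration and is not Infomaid's actual formula.

```python
# Illustrative heuristic for adapting chunk size to content complexity;
# the actual scoring used by Infomaid is not shown here.
import re

def adaptive_chunk_size(text, base=800, minimum=200):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    avg_sentence_len = sum(len(s) for s in sentences) / max(len(sentences), 1)
    punctuation_density = len(re.findall(r"[,;:()]", text)) / max(len(text), 1)
    # Denser punctuation and longer sentences suggest complex content,
    # so shrink the chunk size toward the minimum.
    complexity = (avg_sentence_len / 100) + punctuation_density * 10
    return max(minimum, int(base / (1 + complexity)))
```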
**Query Expansion**

- Automatically adds related terms using TF-IDF vocabulary
- Improves recall for complex queries
- Uses co-occurrence analysis for term relationships (see the sketch below)
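A simplified sketch of co-occurrence-based expansion: terms that frequently appear in the same documents as the query terms are appended to the query. The `top_k` cutoff and the use of `CountVectorizer` are illustrative assumptions.

```python
# Sketch of co-occurrence-based query expansion over a learned vocabulary.
# Thresholds and weighting are illustrative assumptions.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def expand_query(query, documents, top_k=3):
    vectorizer = CountVectorizer(stop_words="english", binary=True)
    presence = vectorizer.fit_transform(documents)    # doc-term presence matrix
    cooccurrence = (presence.T @ presence).toarray()  # term-term doc counts
    vocab = vectorizer.get_feature_names_out()
    index = {term: i for i, term in enumerate(vocab)}

    expanded = set(query.lower().split())
    for term in list(expanded):
        if term in index:
            # The term itself ranks first; the set makes re-adding it harmless.
            related = np.argsort(cooccurrence[index[term]])[::-1]
            expanded.update(vocab[j] for j in related[:top_k])
    return " ".join(sorted(expanded))
```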
**Result Reranking**

- Applies additional scoring after initial retrieval (see the sketch below)
- Multiple reranking algorithms available
- Improves precision of top results
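As an example of what a reranking pass can look like, the sketch below boosts initially retrieved chunks by exact query-term overlap. The 0.7/0.3 weighting is an arbitrary illustration, not one of Infomaid's actual reranking algorithms.

```python
# One possible reranking pass: combine the initial retrieval score with
# exact query-term overlap. Weights are illustrative assumptions.
def rerank(query, scored_chunks):
    """scored_chunks: list of (chunk_text, initial_score) pairs."""
    terms = set(query.lower().split())

    def combined(item):
        text, score = item
        overlap = len(terms & set(text.lower().split())) / max(len(terms), 1)
        return 0.7 * score + 0.3 * overlap  # illustrative weighting

    return sorted(scored_chunks, key=combined, reverse=True)
```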
**Deduplication**

- Removes similar or duplicate chunks using cosine similarity (see the sketch below)
- Configurable similarity threshold
- Keeps longer, more informative chunks
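A compact sketch of threshold-based deduplication over TF-IDF cosine similarity. Checking longer chunks first implements the "keep longer, more informative chunks" preference; the 0.9 threshold is a placeholder for the configurable setting.

```python
# Sketch of near-duplicate removal with cosine similarity; the 0.9
# threshold stands in for the configurable setting.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def deduplicate(chunks, threshold=0.9):
    ordered = sorted(chunks, key=len, reverse=True)  # prefer longer chunks
    matrix = TfidfVectorizer().fit_transform(ordered)
    similarity = cosine_similarity(matrix)
    kept_indices = []
    for i in range(len(ordered)):
        # Keep a chunk only if it is not too similar to any kept chunk.
        if all(similarity[i, j] < threshold for j in kept_indices):
            kept_indices.append(i)
    return [ordered[i] for i in kept_indices]
```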
Install the additional dependencies for enhanced features:

```bash
poetry add scikit-learn numpy
```

Or update your existing installation:

```bash
poetry install
```

```bash
# Use enhanced RAG with hybrid search (recommended)
poetry run infomaid --useowndata --enhancedrag --prompt "Your question here"

# Use TF-IDF for keyword-heavy documents
poetry run infomaid --useowndata --enhancedrag --retrievalmethod tfidf --prompt "Find specific technical terms"

# Use BM25 for search engine-like ranking
poetry run infomaid --useowndata --enhancedrag --retrievalmethod bm25 --prompt "Rank by relevance"

# Use pure vector similarity
poetry run infomaid --useowndata --enhancedrag --retrievalmethod vector --prompt "Semantic similarity"

# Get 3 results using hybrid approach
poetry run infomaid --useowndata --enhancedrag --retrievalmethod hybrid --count 3 --prompt "Compare different approaches"
```

```bash
# For technical documentation
poetry run infomaid --resetdb --usepdf --enhancedrag --retrievalmethod tfidf

# For general documents
poetry run infomaid --resetdb --usepdf --enhancedrag --retrievalmethod hybrid

# For search/ranking tasks
poetry run infomaid --resetdb --usepdf --enhancedrag --retrievalmethod bm25
```

| Method | Keyword Matching | Semantic Understanding | Ranking Quality | Speed |
|---|---|---|---|---|
| Vector | Fair | Excellent | Good | Fast |
| TF-IDF | Excellent | Fair | Good | Very Fast |
| BM25 | Excellent | Fair | Excellent | Fast |
| Hybrid | Excellent | Excellent | Excellent | Moderate |
Chunking parameters:

- `max_chunk_size`: Maximum characters per chunk (default: 800)
- `min_chunk_size`: Minimum characters per chunk (default: 200)
- `chunk_overlap`: Overlap between chunks (default: 80)

TF-IDF parameters:

- `max_features`: Maximum vocabulary size (default: 10000)
- `ngram_range`: N-gram range for features (default: 1-3)
- `min_df`: Minimum document frequency (default: 2)
- `max_df`: Maximum document frequency (default: 0.95)

BM25 parameters:

- `k1`: Term frequency saturation (default: 1.5)
- `b`: Length normalization (default: 0.75)

Hybrid parameters:

- `alpha`: Weight for vector search (default: 0.5); vector weight = alpha, TF-IDF weight = (1 - alpha)
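For reference, here are the defaults above collected into a single mapping. The structure and name of this `ENHANCED_RAG_DEFAULTS` dictionary are hypothetical; only the values come from this documentation.

```python
# Defaults from this page gathered into one (hypothetical) settings mapping;
# Infomaid's actual configuration may use different names or structures.
ENHANCED_RAG_DEFAULTS = {
    "chunking": {"max_chunk_size": 800, "min_chunk_size": 200, "chunk_overlap": 80},
    "tfidf": {"max_features": 10000, "ngram_range": (1, 3), "min_df": 2, "max_df": 0.95},
    "bm25": {"k1": 1.5, "b": 0.75},
    "hybrid": {"alpha": 0.5},  # vector weight; TF-IDF weight is 1 - alpha
}
```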
- **Vector Search:**
  - General purpose queries
  - When semantic meaning is important
  - Cross-language similarity
- **TF-IDF Search:**
  - Technical documentation
  - Keyword-heavy content
  - When exact term matching is crucial
- **BM25 Search:**
  - Search engine-like behavior
  - Ranking multiple results
  - Information retrieval tasks
- **Hybrid Search:**
  - Best overall performance
  - When you need both semantic and keyword matching
  - Complex queries with multiple aspects
- For Technical Documents: Use TF-IDF or BM25 methods
- For Narrative Content: Use vector or hybrid methods
- For Mixed Content: Use hybrid with semantic chunking
- For Structured Documents: Use hierarchical chunking
- Include key terms you want to find
- Use specific language from your documents
- Try different retrieval methods for comparison
- Use multiple results (`--count`) to see variations
**Error:** "Enhanced RAG features not available"
**Solution:** Install dependencies with `poetry add scikit-learn numpy`
- Try different retrieval methods
- Check if your query terms appear in the documents
- Use hybrid search for balanced results
- Increase the number of results (`--count`)
- Use vector or TF-IDF instead of hybrid
- Reduce the number of chunks in your database
- Use smaller chunk sizes when populating database
**TF-IDF Implementation**

- Uses sklearn's TfidfVectorizer
- Supports n-grams (1-3 terms)
- Filters stop words and rare terms
- Cosine similarity for scoring
**BM25 Implementation**

- Classic BM25 formula with tunable parameters (see the sketch below)
- Document length normalization
- Term frequency saturation
- Inverse document frequency weighting
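The sketch below is a direct transcription of the classic BM25 formula with the `k1` and `b` parameters described above, using the common non-negative IDF variant. It is a pedagogical sketch, not Infomaid's internal code.

```python
# Classic BM25 scoring with tunable k1 (term frequency saturation) and
# b (document length normalization). Pedagogical sketch only.
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    """corpus: list of tokenized documents; doc_terms: the scored document."""
    avg_len = sum(len(d) for d in corpus) / len(corpus)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)  # document frequency
        # Non-negative IDF variant (the +1 keeps the log positive).
        idf = math.log((len(corpus) - df + 0.5) / (df + 0.5) + 1)
        tf = doc_terms.count(term)                # term frequency
        norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc_terms) / avg_len))
        score += idf * norm
    return score
```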
**Hybrid Scoring**

- Linear combination of normalized scores (see the sketch below)
- Min-max normalization for score alignment
- Configurable weighting between methods
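The combination step can be sketched as follows: min-max normalize each score list so the two methods are comparable, then take a weighted linear combination controlled by `alpha` (the vector weight), as described above.

```python
# Sketch of hybrid scoring: min-max normalize each score list, then take
# a weighted linear combination (alpha = vector weight, 1 - alpha = TF-IDF).
def hybrid_scores(vector_scores, tfidf_scores, alpha=0.5):
    def min_max(scores):
        low, high = min(scores), max(scores)
        span = (high - low) or 1.0  # avoid division by zero on flat scores
        return [(s - low) / span for s in scores]

    return [alpha * v + (1 - alpha) * t
            for v, t in zip(min_max(vector_scores), min_max(tfidf_scores))]
```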
**Chunking Strategies**

- Semantic: Sentence boundary awareness
- Topic: K-means clustering on TF-IDF vectors
- Hierarchical: Document structure preservation
- Adaptive: Content complexity analysis
To contribute enhancements to the RAG system:
- Focus on improving retrieval quality
- Add new chunking strategies
- Implement additional scoring methods
- Optimize performance for large document collections
- BM25: Robertson, S. & Zaragoza, H. (2009). The Probabilistic Relevance Framework: BM25 and Beyond
- TF-IDF: Salton, G. & Buckley, C. (1988). Term-weighting approaches in automatic text retrieval
- Vector Embeddings: Modern transformer-based embeddings for semantic similarity