45 changes: 45 additions & 0 deletions .gitignore.backup
@@ -0,0 +1,45 @@
js/sdk/**/*.js
js/tests/*
**/.DS_Store

.env
.gitignore
*.log
.DS_Store
*.gguf
logs/
workspace/
py/workspace/
uploads/
**/__pycache__
**/.mypy_cache
**/.pytest_cache
dump/*
.next
node_modules
.idea

coverage.xml
.coverage

**/*.sqlite*
**/*.sqlite3*

node_modules/
dist/
**/.data/*

*.exe
*.exe~
*.dll
*.so
*.dylib
*.test
go.work
go.work.sum

.vscode/
.python-version
.ruff_cache/
*.egg-info
.venv
299 changes: 299 additions & 0 deletions ADVANCED_FEATURES.md
@@ -0,0 +1,299 @@
# R2R Enhanced Template - Advanced Features

This template includes advanced features inspired by enterprise RAG implementations.

## 🧩 **Hierarchical Chunking**

### What It Is
A two-level chunking strategy that creates both **parent** and **child** chunks:
- **Child chunks**: Detailed, small chunks (512 tokens) for precise retrieval
- **Parent chunks**: Summarized sections (groups of 4 child chunks) for context

### When to Use
- **Long documents** where context matters
- **Technical documents** with hierarchical structure
- **Research papers** with sections and subsections
- **Legal documents** with nested clauses

### Configuration
```toml
[ingestion]
chunking_strategy = "hierarchical"
child_chunk_size = 512
child_overlap = 50
children_per_parent = 4
generate_summaries = true
```
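
The settings above can be illustrated with a small sketch. It is not the template's implementation: `make_child_chunks`, `group_into_parents`, and `ParentChunk` are hypothetical names, tokens are assumed to be an already-tokenized list of strings, and the summary is a placeholder where a real pipeline would presumably call an LLM when `generate_summaries = true`.

```python
from dataclasses import dataclass, field


@dataclass
class ParentChunk:
    """A group of consecutive child chunks, used for section-level context."""
    children: list[list[str]] = field(default_factory=list)
    summary: str = ""


def make_child_chunks(tokens: list[str], size: int = 512, overlap: int = 50) -> list[list[str]]:
    """Slide a window of `size` tokens over the input, stepping by `size - overlap`."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


def group_into_parents(children: list[list[str]], children_per_parent: int = 4) -> list[ParentChunk]:
    """Group consecutive child chunks into parents (hypothetical helper)."""
    parents = []
    for i in range(0, len(children), children_per_parent):
        group = children[i:i + children_per_parent]
        # Placeholder summary (assumption): a real pipeline would summarize the group here.
        preview = " ".join(group[0][:8])
        parents.append(ParentChunk(children=group, summary=f"{len(group)} chunks starting with: {preview}"))
    return parents
```

With the default values, each child chunk shares its last 50 tokens with the start of the next one, and the final parent may hold fewer than four children.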

### Benefits
- **Better context retention** - Parent chunks provide section-level context
- **Precise retrieval** - Child chunks provide exact source text
- **Improved RAG quality** - Multi-level context improves answer accuracy

## 📊 **Enhanced Spreadsheet Processing**

### What It Is
Advanced processing of Excel/CSV files that creates:
- **Narrative summaries** for natural language search
- **Structured data storage** for precise queries
- **Schema analysis** with column descriptions and statistics

### When to Use
- **Financial reports** with tables and data
- **Research datasets** requiring both search and analysis
- **Business documents** with embedded spreadsheets
- **Any structured data** that needs both RAG and SQL-like queries

### Configuration
```toml
[ingestion.spreadsheet_settings]
generate_narrative = true
store_structured_data = true
max_rows_for_narrative = 100
```
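
As a rough illustration of the dual output described above, the sketch below reads a CSV string with the standard library, keeps every row as structured data, and builds a schema and narrative from at most `max_rows_for_narrative` rows. The names `column_stats` and `analyze_csv` and the narrative wording are assumptions for illustration, not the template's API.

```python
import csv
import io


def column_stats(values):
    """Classify a column as "number", "text", or "empty" and add basic statistics."""
    non_empty = [v for v in values if v.strip()]
    stats = {"non_empty": len(non_empty)}
    if not non_empty:
        stats["type"] = "empty"
        return stats
    try:
        numbers = [float(v) for v in non_empty]
    except ValueError:
        stats["type"] = "text"
        return stats
    stats.update(type="number", min=min(numbers), max=max(numbers))
    return stats


def analyze_csv(text, max_rows_for_narrative=100):
    """Return a narrative, a per-column schema, and the full rows (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)                      # full structured data, kept for exact queries
    columns = reader.fieldnames or []
    sample = rows[:max_rows_for_narrative]   # bound the cost of narrative generation
    schema = {c: column_stats([(r.get(c) or "") for r in sample]) for c in columns}

    parts = [f"The table has {len(rows)} rows and {len(columns)} columns."]
    for name, stats in schema.items():
        if stats["type"] == "number":
            parts.append(f"Column '{name}' is numeric, ranging from {stats['min']} to {stats['max']}.")
        else:
            parts.append(f"Column '{name}' is {stats['type']}.")
    return {"narrative": " ".join(parts), "schema": schema, "rows": rows}
```

The narrative string is what would be embedded for search, while `rows` would go to structured storage. Statistics are computed on the sample only, which is the trade-off the row limit implies.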

### Benefits
- **Dual access patterns** - Search narratively OR query structured data
- **Automatic schema detection** - Understands data types and relationships
- **Performance optimized** - Limits narrative generation for large datasets
- **RAG-friendly** - Converts tables into searchable text descriptions

## 🔗 **Enhanced Citation System**

### What It Is
Comprehensive citation and metadata system that provides:
- **Precise source attribution** with deep links
- **Rich metadata extraction** from documents
- **Multiple citation styles** (detailed, compact, academic)
- **Confidence scoring** for citation quality

### When to Use
- **Research applications** requiring source verification
- **Legal/compliance** systems needing audit trails
- **Enterprise RAG** where trust and verifiability matter
- **Any application** where users need to verify information

### Configuration
```toml
[orchestration.citation_settings]
citation_style = "detailed" # "detailed", "compact", "academic"
max_citations = 5
min_citation_score = 0.7
generate_deep_links = true
```

### Benefits
- **Verifiable responses** - Every claim linked to source
- **Deep linking** - Direct links to specific document locations
- **Rich metadata** - Author, page, section, creation date
- **Confidence scoring** - Quality assessment for each citation
- **Multiple formats** - Academic, compact, or detailed citations

### Example Output
```json
{
  "response": "Apple Inc. was founded in 1976 [1] and is headquartered in Cupertino [2].",
  "citations": [
    {
      "index": 1,
      "text": "[1] Apple_History.pdf, by Walter Isaacson, page 15 (2023-01-15)",
      "confidence": 0.95,
      "deep_link": "/documents/view/doc123?page=15&highlight=founded%201976"
    }
  ]
}
```
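
One way a citation entry like the one above could be assembled is sketched below. The function names and the shape of the candidate dictionaries (`title`, `author`, `page`, `date`, `doc_id`, `highlight`) are assumptions made for illustration. Only `max_citations`, `min_citation_score`, and the output fields come from the configuration and example shown here.

```python
from urllib.parse import quote, urlencode


def build_deep_link(doc_id, page, highlight=None):
    """Build a viewer link; spaces are percent-encoded as %20, as in the example output."""
    params = {"page": page}
    if highlight:
        params["highlight"] = highlight
    return f"/documents/view/{doc_id}?" + urlencode(params, quote_via=quote)


def select_citations(candidates, max_citations=5, min_citation_score=0.7):
    """Drop low-confidence candidates, keep the best ones, and format them in a "detailed" style."""
    kept = [c for c in candidates if c["confidence"] >= min_citation_score]
    kept.sort(key=lambda c: c["confidence"], reverse=True)
    citations = []
    for index, c in enumerate(kept[:max_citations], start=1):
        citations.append({
            "index": index,
            "text": f"[{index}] {c['title']}, by {c['author']}, page {c['page']} ({c['date']})",
            "confidence": c["confidence"],
            "deep_link": build_deep_link(c["doc_id"], c["page"], c.get("highlight")),
        })
    return citations
```

Filtering before numbering keeps the indices contiguous, so the bracketed markers in the response text can be assigned after selection.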

## 🛠️ **Tool-Augmented Orchestration**

### What It Is
Intelligent workflow that can pivot between RAG and SQL queries based on data type:
- **Hybrid approach** - Uses both vector search AND SQL queries
- **Automatic detection** - Identifies when structured data is available
- **Text-to-SQL generation** - Converts natural language to SQL queries
- **Result fusion** - Combines RAG context with SQL results

### When to Use
- **Mixed content** - Documents containing both text and structured data
- **Precise queries** - Questions requiring exact data from spreadsheets
- **Business intelligence** - Analytical questions about numerical data
- **Financial reports** - Queries about specific metrics or trends

### Configuration
```toml
[orchestration.tool_augmented_settings]
enable_sql_workflow = true
sql_confidence_threshold = 0.7
max_sql_results = 50
structured_data_types = ["spreadsheet", "csv", "xlsx"]
```

### Workflow Example
1. **User asks**: "What was the revenue in Q3?"
2. **Vector search** finds relevant documents
3. **Metadata inspection** detects spreadsheet with financial data
4. **SQL generation**: `SELECT revenue FROM data WHERE quarter = 'Q3'`
5. **SQL execution** returns precise numerical results
6. **Result fusion** combines with document context
7. **Final response** includes both narrative and exact figures
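
The workflow steps above can be sketched with Python's built-in `sqlite3` module. This is a minimal illustration, not the template's implementation: `load_table`, `generate_sql`, and `answer` are hypothetical names, the `data` table layout is assumed from the example query, and the keyword-matching `generate_sql` stands in for an LLM-based text-to-SQL step.

```python
import sqlite3


def load_table(rows):
    """Load (quarter, revenue) rows into an in-memory table named `data`."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE data (quarter TEXT, revenue REAL)")
    conn.executemany("INSERT INTO data VALUES (?, ?)", rows)
    return conn


def generate_sql(question):
    """Toy stand-in for text-to-SQL (assumption): returns (sql, params, confidence)."""
    lowered = question.lower()
    if "revenue" in lowered:
        for quarter in ("Q1", "Q2", "Q3", "Q4"):
            if quarter.lower() in lowered:
                # Parameterized query: the quarter is bound, never interpolated into the SQL.
                return "SELECT revenue FROM data WHERE quarter = ?", (quarter,), 0.9
    return None, (), 0.0


def answer(question, conn, rag_context, sql_confidence_threshold=0.7, max_sql_results=50):
    """Run SQL only when the generated query is confident enough, then fuse with RAG context."""
    sql, params, confidence = generate_sql(question)
    sql_rows = []
    if sql is not None and confidence >= sql_confidence_threshold:
        sql_rows = conn.execute(sql, params).fetchmany(max_sql_results)
    return {"context": rag_context, "sql_rows": sql_rows}
```

When the confidence falls below `sql_confidence_threshold`, the sketch returns the RAG context alone, which mirrors the idea of pivoting between the two paths.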

### Benefits
- **Precise answers** - Exact data from structured sources
- **Contextual responses** - Combines with document narrative
- **Automatic workflow** - No manual intervention required
- **Flexible queries** - Handles both simple and complex questions

## 🌐 **Web Search Integration**

### What It Is
Intelligent web search integration with both automatic fallback and user control:
- **Smart fallback** - Automatically uses web search when RAG results are insufficient
- **User-controlled toggle** - Frontend apps can enable/disable web search
- **Quality assessment** - Evaluates RAG result quality to decide on web search
- **Source attribution** - Clear distinction between internal and web sources

### Configuration Options
```toml
[orchestration.web_search_settings]
enable_web_fallback = true
web_confidence_threshold = 0.6
min_rag_results = 2
web_search_provider = "serper" # "serper" or "tavily"
max_web_results = 5
```

### Frontend Integration
```javascript
// React/Next.js example
const searchWithWebControl = async (query, useWebSearch = false) => {
  const response = await fetch('/api/search', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      query,
      use_web_search: useWebSearch,   // Smart fallback
      force_web_search: false,        // User override
      web_search_provider: "serper"   // Provider choice
    })
  });
  return response.json();
};
```

### Automatic Fallback Triggers
- **No RAG results** found for the query
- **Low confidence scores** (< 0.5) from vector search
- **Insufficient content** (< 200 characters) from internal sources
- **Too few results** (< 2 relevant documents)
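
The triggers above could be combined into a single check, sketched below. `RagResult` and `web_fallback_reason` are hypothetical names, the thresholds are the values listed in the bullets, and treating "low confidence" as "the best score is below the threshold" is an assumption, since the aggregation is not specified.

```python
from dataclasses import dataclass


@dataclass
class RagResult:
    text: str
    score: float


def web_fallback_reason(results, min_score=0.5, min_chars=200, min_results=2):
    """Return why web search should be used, or None when the RAG results suffice."""
    if not results:
        return "no_rag_results"
    if len(results) < min_results:
        return "too_few_results"
    # Assumption: confidence is judged by the best-scoring result.
    if max(r.score for r in results) < min_score:
        return "low_confidence"
    if sum(len(r.text) for r in results) < min_chars:
        return "insufficient_content"
    return None
```

Returning a reason string rather than a boolean makes it straightforward to report why web search was used.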

### Response Format
```json
{
  "response": "Apple's revenue was $89.5B in Q3 [1], driven by strong iPhone sales [W1] and services growth [2].",
  "citations": [
    {
      "index": 1,
      "text": "[1] Q3_Financial_Report.xlsx, Revenue sheet, page 3 (2023-10-15)",
      "confidence": 0.95,
      "deep_link": "/documents/view/doc123?page=3&section=Revenue",
      "source_type": "internal_knowledge"
    },
    {
      "index": 2,
      "text": "[W1] Apple Reports Strong Q3 Results, from apple.com via Serper search (https://apple.com/newsroom/2023/10/apple-reports-q3-results/)",
      "confidence": 0.85,
      "deep_link": "https://apple.com/newsroom/2023/10/apple-reports-q3-results/",
      "source_type": "web_search"
    }
  ],
  "metadata": {
    "source_breakdown": {
      "internal_knowledge": 2,
      "web_search": 1
    },
    "web_search_reason": "Limited internal knowledge, supplemented with web search"
  }
}
```
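
The `source_breakdown` counts and the `[1]` / `[W1]` markers in this response could be derived from the `source_type` field, roughly as sketched here. Both function names are hypothetical, and the rule "web sources get a `W` prefix and their own counter" is inferred from the example rather than documented.

```python
from collections import Counter


def source_breakdown(citations):
    """Count citations per source type, e.g. internal knowledge vs. web search."""
    return dict(Counter(c["source_type"] for c in citations))


def label_citations(citations):
    """Assign "[n]" to internal sources and "[Wn]" to web sources, each numbered separately."""
    counters = Counter()
    labels = []
    for c in citations:
        kind = c["source_type"]
        counters[kind] += 1
        prefix = "W" if kind == "web_search" else ""
        labels.append(f"[{prefix}{counters[kind]}]")
    return labels
```

Keeping separate counters per source type lets a reader tell from the marker alone whether a claim rests on internal documents or on the web.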

### Benefits
- **Comprehensive coverage** - Web results fill gaps that internal documents leave open
- **Quality-driven** - Only uses web search when needed
- **User control** - Frontend apps can customize behavior
- **Source transparency** - Clear attribution for all information
- **Fallback reliability** - Graceful degradation when internal knowledge is limited

## 🔧 **Custom Extensions Framework**

This template is designed to be extended with custom providers:

### Custom Chunking Providers
```python
# Example: Domain-specific chunking
class LegalDocumentChunkingProvider(ChunkingProvider):
    def chunk(self, document):
        # Custom logic for legal documents
        pass
```

### Custom Parsing Providers
```python
# Example: Structured data parsing
class SpreadsheetParser(ParsingProvider):
    def parse(self, file):
        # Extract structured data + generate narrative
        pass
```

### Custom Orchestration Providers
```python
# Example: Tool-augmented RAG
class ToolAugmentedOrchestrationProvider(OrchestrationProvider):
    def orchestrate(self, query):
        # Add SQL queries, web search, etc.
        pass
```

## 🎯 **Enterprise Patterns**

### Multi-Modal Processing
- **Images**: Automatic narrative generation from images
- **Audio**: High-quality transcription with modern models
- **Spreadsheets**: Enhanced processing with narrative summaries + structured data storage

### Advanced Graph Features
- **Automatic entity extraction** with configurable types
- **Relationship mapping** with custom relationship types
- **Community detection** for knowledge clustering

### Integration Patterns
- **CrewAI integration** for multi-agent reasoning
- **Supabase integration** for structured data storage
- **Tool augmentation** for external data sources

## 🚀 **Scaling Considerations**

### Performance Optimizations
- **High-quality embeddings** (3072 dimensions)
- **Batch processing** for large document sets
- **Concurrent processing** for multi-modal content

### Security Features
- **API key protection** with proper .gitignore
- **Environment templates** for safe configuration
- **Pre-commit hooks** for secret detection

## 📚 **Learning Resources**

### Recommended Reading
- [R2R Official Documentation](https://r2r-docs.sciphi.ai/)
- [CrewAI Documentation](https://docs.crewai.com/)
- [Supabase Documentation](https://supabase.com/docs)

### Example Projects
- **Ellen V2**: Advanced intelligence platform (see `/docs` directory)
- **Enterprise RAG**: Multi-tenant RAG systems
- **Research Assistant**: Academic paper analysis

---

**This template provides the foundation for building sophisticated RAG applications while maintaining simplicity for basic use cases.**
15 changes: 15 additions & 0 deletions Dockerfile.custom
@@ -0,0 +1,15 @@
# Custom R2R with bug fixes
FROM sciphi/r2r:latest

# Copy your fixed core files
COPY py/core/main/services/graph_service.py /app/core/main/services/graph_service.py
COPY py/core/parsers/media/audio_parser.py /app/core/parsers/media/audio_parser.py
COPY py/core/agent/base.py /app/core/agent/base.py
COPY py/core/base/providers/llm.py /app/core/base/providers/llm.py

# Copy your custom configuration
COPY py/r2r/r2r.toml /app/r2r.toml

# Ensure proper permissions
RUN chmod +x /app/core/main/services/graph_service.py \
    /app/core/parsers/media/audio_parser.py