SciPhi-AI · chrisgscott · Sep 22, 2025 · Sep 22, 2025 · Sep 23, 2025
diff --git a/.gitignore.backup b/.gitignore.backup
@@ -0,0 +1,45 @@
+js/sdk/**/*.js
+js/tests/*
+**/.DS_Store
+
+.env
+.gitignore
+*.log
+.DS_Store
+*.gguf
+logs/
+workspace/
+py/workspace/
+uploads/
+**/__pycache__
+**/.mypy_cache
+**/.pytest_cache
+dump/*
+.next
+node_modules
+.idea
+
+coverage.xml
+.coverage
+
+**/*.sqlite*
+**/*.sqlite3*
+
+node_modules/
+dist/
+**/.data/*
+
+*.exe
+*.exe~
+*.dll
+*.so
+*.dylib
+*.test
+go.work
+go.work.sum
+
+.vscode/
+.python-version
+.ruff_cache/
+*.egg-info
+.venv
diff --git a/ADVANCED_FEATURES.md b/ADVANCED_FEATURES.md
@@ -0,0 +1,299 @@
+# R2R Enhanced Template - Advanced Features
+
+This template includes advanced features inspired by enterprise RAG implementations.
+
+## 🧩 **Hierarchical Chunking**
+
+### What It Is
+A two-level chunking strategy that creates both **parent** and **child** chunks:
+- **Child chunks**: Detailed, small chunks (512 tokens) for precise retrieval
+- **Parent chunks**: Summarized sections (groups of 4 child chunks) for context
+
+### When to Use
+- **Long documents** where context matters
+- **Technical documents** with hierarchical structure
+- **Research papers** with sections and subsections
+- **Legal documents** with nested clauses
+
+### Configuration
+```toml
+[ingestion]
+chunking_strategy = "hierarchical"
+child_chunk_size = 512
+child_overlap = 50
+children_per_parent = 4
+generate_summaries = true
+```
+
+### Benefits
+- **Better context retention** - Parent chunks provide section-level context
+- **Precise retrieval** - Child chunks provide exact source text
+- **Improved RAG quality** - Multi-level context improves answer accuracy
+
+## 📊 **Enhanced Spreadsheet Processing**
+
+### What It Is
+Advanced processing of Excel/CSV files that creates:
+- **Narrative summaries** for natural language search
+- **Structured data storage** for precise queries
+- **Schema analysis** with column descriptions and statistics
+
+### When to Use
+- **Financial reports** with tables and data
+- **Research datasets** requiring both search and analysis
+- **Business documents** with embedded spreadsheets
+- **Any structured data** that needs both RAG and SQL-like queries
+
+### Configuration
+```toml
+[ingestion.spreadsheet_settings]
+generate_narrative = true
+store_structured_data = true
+max_rows_for_narrative = 100
+```
+
+### Benefits
+- **Dual access patterns** - Search narratively OR query structured data
+- **Automatic schema detection** - Understands data types and relationships
+- **Performance optimized** - Limits narrative generation for large datasets
+- **RAG-friendly** - Converts tables into searchable text descriptions
+
+## 🔗 **Enhanced Citation System**
+
+### What It Is
+Comprehensive citation and metadata system that provides:
+- **Precise source attribution** with deep links
+- **Rich metadata extraction** from documents
+- **Multiple citation styles** (detailed, compact, academic)
+- **Confidence scoring** for citation quality
+
+### When to Use
+- **Research applications** requiring source verification
+- **Legal/compliance** systems needing audit trails
+- **Enterprise RAG** where trust and verifiability matter
+- **Any application** where users need to verify information
+
+### Configuration
+```toml
+[orchestration.citation_settings]
+citation_style = "detailed"  # "detailed", "compact", "academic"
+max_citations = 5
+min_citation_score = 0.7
+generate_deep_links = true
+```
+
+### Benefits
+- **Verifiable responses** - Every claim linked to source
+- **Deep linking** - Direct links to specific document locations
+- **Rich metadata** - Author, page, section, creation date
+- **Confidence scoring** - Quality assessment for each citation
+- **Multiple formats** - Academic, compact, or detailed citations
+
+### Example Output
+```json
+{
+  "response": "Apple Inc. was founded in 1976 [1] and is headquartered in Cupertino [2].",
+  "citations": [
+    {
+      "index": 1,
+      "text": "[1] Apple_History.pdf, by Walter Isaacson, page 15 (2023-01-15)",
+      "confidence": 0.95,
+      "deep_link": "/documents/view/doc123?page=15&highlight=founded%201976"
+    }
+  ]
+}
+```
+
+## 🛠️ **Tool-Augmented Orchestration**
+
+### What It Is
+Intelligent workflow that can pivot between RAG and SQL queries based on data type:
+- **Hybrid approach** - Uses both vector search AND SQL queries
+- **Automatic detection** - Identifies when structured data is available
+- **Text-to-SQL generation** - Converts natural language to SQL queries
+- **Result fusion** - Combines RAG context with SQL results
+
+### When to Use
+- **Mixed content** - Documents containing both text and structured data
+- **Precise queries** - Questions requiring exact data from spreadsheets
+- **Business intelligence** - Analytical questions about numerical data
+- **Financial reports** - Queries about specific metrics or trends
+
+### Configuration
+```toml
+[orchestration.tool_augmented_settings]
+enable_sql_workflow = true
+sql_confidence_threshold = 0.7
+max_sql_results = 50
+structured_data_types = ["spreadsheet", "csv", "xlsx"]
+```
+
+### Workflow Example
+1. **User asks**: "What was the revenue in Q3?"
+2. **Vector search** finds relevant documents
+3. **Metadata inspection** detects spreadsheet with financial data
+4. **SQL generation**: `SELECT revenue FROM data WHERE quarter = 'Q3'`
+5. **SQL execution** returns precise numerical results
+6. **Result fusion** combines with document context
+7. **Final response** includes both narrative and exact figures
+
+### Benefits
+- **Precise answers** - Exact data from structured sources
+- **Contextual responses** - Combines with document narrative
+- **Automatic workflow** - No manual intervention required
+- **Flexible queries** - Handles both simple and complex questions
+
+## 🌐 **Web Search Integration**
+
+### What It Is
+Intelligent web search integration with both automatic fallback and user control:
+- **Smart fallback** - Automatically uses web search when RAG results are insufficient
+- **User-controlled toggle** - Frontend apps can enable/disable web search
+- **Quality assessment** - Evaluates RAG result quality to decide on web search
+- **Source attribution** - Clear distinction between internal and web sources
+
+### Configuration Options
+```toml
+[orchestration.web_search_settings]
+enable_web_fallback = true
+web_confidence_threshold = 0.6
+min_rag_results = 2
+web_search_provider = "serper"  # "serper" or "tavily"
+max_web_results = 5
+```
+
+### Frontend Integration
+```javascript
+// React/Next.js example
+const searchWithWebControl = async (query, useWebSearch = false) => {
+  const response = await fetch('/api/search', {
+    method: 'POST',
+    body: JSON.stringify({
+      query,
+      use_web_search: useWebSearch,      // Smart fallback
+      force_web_search: false,           // User override
+      web_search_provider: "serper"      // Provider choice
+    })
+  });
+  return response.json();
+};
+```
+
+### Automatic Fallback Triggers
+- **No RAG results** found for the query
+- **Low confidence scores** (< 0.5) from vector search
+- **Insufficient content** (< 200 characters) from internal sources
+- **Too few results** (< 2 relevant documents)
+
+### Response Format
+```json
+{
+  "response": "Apple's revenue was $89.5B in Q3 [1], driven by strong iPhone sales [W1] and services growth [2].",
+  "citations": [
+    {
+      "index": 1,
+      "text": "[1] Q3_Financial_Report.xlsx, Revenue sheet, page 3 (2023-10-15)",
+      "confidence": 0.95,
+      "deep_link": "/documents/view/doc123?page=3&section=Revenue",
+      "source_type": "internal_knowledge"
+    },
+    {
+      "index": 2,
+      "text": "[W1] Apple Reports Strong Q3 Results, from apple.com via Serper search (https://apple.com/newsroom/2023/10/apple-reports-q3-results/)",
+      "confidence": 0.85,
+      "deep_link": "https://apple.com/newsroom/2023/10/apple-reports-q3-results/",
+      "source_type": "web_search"
+    }
+  ],
+  "metadata": {
+    "source_breakdown": {
+      "internal_knowledge": 2,
+      "web_search": 1
+    },
+    "web_search_reason": "Limited internal knowledge, supplemented with web search"
+  }
+}
+```
+
+### Benefits
+- **Comprehensive coverage** - Never leave questions unanswered
+- **Quality-driven** - Only uses web search when needed
+- **User control** - Frontend apps can customize behavior
+- **Source transparency** - Clear attribution for all information
+- **Fallback reliability** - Graceful degradation when internal knowledge is limited
+
+## 🔧 **Custom Extensions Framework**
+
+This template is designed to be extended with custom providers:
+
+### Custom Chunking Providers
+```python
+# Example: Domain-specific chunking
+class LegalDocumentChunkingProvider(ChunkingProvider):
+    def chunk(self, document):
+        # Custom logic for legal documents
+        pass
+```
+
+### Custom Parsing Providers
+```python
+# Example: Structured data parsing
+class SpreadsheetParser(ParsingProvider):
+    def parse(self, file):
+        # Extract structured data + generate narrative
+        pass
+```
+
+### Custom Orchestration Providers
+```python
+# Example: Tool-augmented RAG
+class ToolAugmentedOrchestrationProvider(OrchestrationProvider):
+    def orchestrate(self, query):
+        # Add SQL queries, web search, etc.
+        pass
+```
+
+## 🎯 **Enterprise Patterns**
+
+### Multi-Modal Processing
+- **Images**: Automatic narrative generation from images
+- **Audio**: High-quality transcription with modern models
+- **Spreadsheets**: Enhanced processing with narrative summaries + structured data storage
+
+### Advanced Graph Features
+- **Automatic entity extraction** with configurable types
+- **Relationship mapping** with custom relationship types
+- **Community detection** for knowledge clustering
+
+### Integration Patterns
+- **CrewAI integration** for multi-agent reasoning
+- **Supabase integration** for structured data storage
+- **Tool augmentation** for external data sources
+
+## 🚀 **Scaling Considerations**
+
+### Performance Optimizations
+- **High-quality embeddings** (3072 dimensions)
+- **Batch processing** for large document sets
+- **Concurrent processing** for multi-modal content
+
+### Security Features
+- **API key protection** with proper .gitignore
+- **Environment templates** for safe configuration
+- **Pre-commit hooks** for secret detection
+
+## 📚 **Learning Resources**
+
+### Recommended Reading
+- [R2R Official Documentation](https://r2r-docs.sciphi.ai/)
+- [CrewAI Documentation](https://docs.crewai.com/)
+- [Supabase Documentation](https://supabase.com/docs)
+
+### Example Projects
+- **Ellen V2**: Advanced intelligence platform (see `/docs` directory)
+- **Enterprise RAG**: Multi-tenant RAG systems
+- **Research Assistant**: Academic paper analysis
+
+---
+
+**This template provides the foundation for building sophisticated RAG applications while maintaining simplicity for basic use cases.**
diff --git a/Dockerfile.custom b/Dockerfile.custom
@@ -0,0 +1,15 @@
+# Custom R2R with bug fixes
+FROM sciphi/r2r:latest
+
+# Copy your fixed core files
+COPY py/core/main/services/graph_service.py /app/core/main/services/graph_service.py
+COPY py/core/parsers/media/audio_parser.py /app/core/parsers/media/audio_parser.py
+COPY py/core/agent/base.py /app/core/agent/base.py
+COPY py/core/base/providers/llm.py /app/core/base/providers/llm.py
+
+# Copy your custom configuration
+COPY py/r2r/r2r.toml /app/r2r.toml
+
+# Ensure proper permissions
+RUN chmod +x /app/core/main/services/graph_service.py
+RUN chmod +x /app/core/parsers/media/audio_parser.py