Comprehensive documentation for all modules in the Multi-Modal Academic Research System. This index provides quick access to detailed module documentation with code examples, API references, and integration guides.
File: docs/modules/data-collectors.md (16KB)
Key Sections:
- AcademicPaperCollector
  - collect_arxiv_papers() - Collect from ArXiv with PDF download
  - collect_pubmed_central() - PubMed Central Open Access papers
  - collect_semantic_scholar() - Semantic Scholar API integration
- YouTubeLectureCollector
  - search_youtube_lectures() - Search educational videos
  - collect_video_metadata() - Extract metadata and transcripts
  - extract_video_id() - Parse YouTube URLs
- PodcastCollector
  - collect_podcast_episodes() - RSS feed parsing
  - transcribe_audio() - Whisper-based transcription
  - get_educational_podcasts() - Curated podcast list
Topics Covered:
- API integration examples (ArXiv, YouTube, RSS)
- Rate limiting and best practices
- Error handling patterns
- Complete code examples for each collector
- Troubleshooting common issues
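As a rough illustration of the rate-limiting topic listed above, the sketch below enforces a minimum delay between consecutive API calls. `MinIntervalLimiter` is a hypothetical helper and not part of the documented collectors. The interval value is also an assumption, so check each provider's terms for the actual limit.

```python
import time
from typing import Callable, Optional


class MinIntervalLimiter:
    """Enforce a minimum delay between consecutive calls (hypothetical helper)."""

    def __init__(
        self,
        min_interval: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock      # injectable so the limiter can be tested without waiting
        self._sleep = sleep
        self._last_call: Optional[float] = None

    def wait(self) -> float:
        """Block until the next call is allowed and return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last_call is not None:
            elapsed = now - self._last_call
            delay = max(0.0, self.min_interval - elapsed)
        if delay > 0:
            self._sleep(delay)
        # Record the time at which the caller is released
        self._last_call = self._clock()
        return delay
```

A collector loop could call `limiter.wait()` immediately before each outgoing request, which keeps the pacing logic separate from the request code.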
File: docs/modules/data-processors.md (21KB)
Key Sections:
- PDFProcessor
  - extract_text_and_images() - PyMuPDF-based extraction
  - analyze_with_gemini() - AI-powered content analysis
  - Gemini Vision for diagram understanding
- VideoProcessor
  - analyze_video_content() - Transcript and metadata analysis
  - extract_key_frames() - Frame extraction (placeholder)
Topics Covered:
- PDF text and image extraction with PyMuPDF
- Gemini Vision integration for diagrams
- Multimodal AI prompts and responses
- Document structuring for indexing
- Performance optimization tips
- Gemini SDK usage patterns
File: docs/modules/indexing.md (25KB)
Key Sections:
- OpenSearchManager
  - create_index() - Index schema creation
  - index_document() - Single document indexing
  - bulk_index() - Efficient batch indexing
  - hybrid_search() - Multi-field text search
Topics Covered:
- Index schema design for multi-modal content
- Embedding generation with SentenceTransformer
- Hybrid search strategy (BM25 + vector search)
- Field boosting and relevance tuning
- Performance optimization
- Advanced search queries and aggregations
- kNN vector search concepts
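To make the hybrid search idea (BM25 plus vector search) more concrete, here is a minimal sketch of one common way to combine two score lists: min-max normalization followed by a weighted sum. This is an assumption for illustration only. `min_max_normalize`, `fuse_scores`, and the `alpha` weight are hypothetical names, and the actual strategy used by `hybrid_search()` is described in indexing.md.

```python
from typing import Dict


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to the range [0, 1]; all-equal inputs map to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def fuse_scores(
    bm25: Dict[str, float],
    vector: Dict[str, float],
    alpha: float = 0.5,
) -> Dict[str, float]:
    """Weighted sum of normalized BM25 and vector-similarity scores.

    A document missing from one result list contributes 0 for that signal.
    """
    bm25_n = min_max_normalize(bm25)
    vector_n = min_max_normalize(vector)
    doc_ids = set(bm25_n) | set(vector_n)
    return {
        doc_id: alpha * bm25_n.get(doc_id, 0.0)
        + (1 - alpha) * vector_n.get(doc_id, 0.0)
        for doc_id in doc_ids
    }


# Example with made-up scores for three documents
fused = fuse_scores({"a": 10.0, "b": 5.0, "c": 0.0}, {"a": 0.2, "b": 0.8})
ranked = sorted(fused, key=fused.get, reverse=True)
```

Normalizing first matters because raw BM25 scores are unbounded while cosine similarities are not, so an unnormalized sum would let one signal dominate.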
File: docs/modules/database.md (24KB)
Key Sections:
- CollectionDatabaseManager
  - add_collection() - Track new collections
  - add_paper(), add_video(), add_podcast() - Type-specific data
  - mark_as_indexed() - Update indexing status
  - get_statistics() - Analytics and reporting
  - search_collections() - Database search
Topics Covered:
- Complete database schema (SQLite)
- CRUD operations for all content types
- Collection statistics and analytics
- Search and filtering
- Data export (JSON, CSV)
- Performance considerations
- Backup and recovery
File: docs/modules/api.md (22KB)
Key Sections:
- FastAPI Endpoints
  - GET /api/collections - Retrieve collections with filtering
  - GET /api/collections/{id} - Get collection details
  - GET /api/statistics - Database statistics
  - GET /api/search - Search collections
  - GET /viz - Visualization dashboard
  - GET /health - Health check
Topics Covered:
- Complete REST API reference
- Request/response formats
- Query parameters and validation
- Error handling and status codes
- CORS configuration
- Frontend integration examples (React, Python)
- CLI tool example
- Deployment guides (Docker, Kubernetes)
- Security considerations
File: docs/modules/orchestration.md (24KB)
Key Sections:
- ResearchOrchestrator
  - process_query() - End-to-end query pipeline
  - format_context_with_citations() - Context formatting
  - extract_citations() - Citation extraction from responses
  - generate_related_queries() - Suggestion generation
- CitationTracker
  - add_citation() - Track citation usage
  - get_citation_report() - Analytics
  - export_bibliography() - BibTeX/APA export
Topics Covered:
- LangChain integration patterns
- Prompt engineering for research
- Citation extraction with regex
- Conversation memory management
- Related query generation
- Bibliography export formats
- Multi-turn conversations
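The regex-based citation extraction mentioned above can be sketched as follows. The citation style is an assumption: the example expects bracketed integers such as `[1]` or `[2, 3]`, and `extract_citation_numbers` is a hypothetical function, not the project's `extract_citations()`. See orchestration.md for the format the orchestrator actually emits.

```python
import re
from typing import List

# Assumed citation style: bracketed integers such as [1] or [2, 5].
CITATION_PATTERN = re.compile(r"\[(\d+(?:\s*,\s*\d+)*)\]")


def extract_citation_numbers(text: str) -> List[int]:
    """Return cited source numbers in order of first appearance, without duplicates."""
    seen: List[int] = []
    for match in CITATION_PATTERN.finditer(text):
        # A single bracket may hold several comma-separated numbers
        for part in match.group(1).split(","):
            number = int(part.strip())
            if number not in seen:
                seen.append(number)
    return seen


response = "Transformers rely on attention [1]. See also [2, 3] and again [1]."
cited = extract_citation_numbers(response)
```

Keeping first-appearance order makes it straightforward to map each number back to the matching entry in the context that was passed to the model.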
File: docs/modules/ui.md (23KB)
Key Sections:
- ResearchAssistantUI
  - create_interface() - Gradio app creation
  - handle_search() - Research query processing
  - handle_data_collection() - Content collection workflow
  - get_database_statistics() - Statistics display
Topics Covered:
- Complete Gradio interface guide
- 5 main tabs (Research, Data Collection, Citation Manager, Settings, Visualization)
- Event handlers and workflows
- UI customization and theming
- Launch configurations
- User workflows and examples
- Performance optimization
- Accessibility features
- Start with: data-collectors.md - Understand data sources
- Then: data-processors.md - Learn content processing
- Next: indexing.md - Understand search infrastructure
- Follow with: orchestration.md - See how queries work
- Finally: ui.md - Explore the user interface
- Architecture: indexing.md + database.md - Core infrastructure
- Data Flow: data-collectors.md → data-processors.md → indexing.md
- Query Pipeline: orchestration.md - LangChain integration
- Interfaces: ui.md + api.md - User/programmatic access
- API Reference: api.md - REST endpoints
- Data Structure: database.md - Schema and models
- Search: indexing.md - Query capabilities
- Integration: Examples in api.md
# Collect papers from ArXiv
from multi_modal_rag.data_collectors import AcademicPaperCollector
collector = AcademicPaperCollector()
papers = collector.collect_arxiv_papers("machine learning", max_results=50)
See: data-collectors.md - Section: AcademicPaperCollector
# Extract and analyze PDF
from multi_modal_rag.data_processors import PDFProcessor
processor = PDFProcessor(gemini_api_key="your_key")
content = processor.extract_text_and_images("paper.pdf")
analysis = processor.analyze_with_gemini(content)
See: data-processors.md - Section: PDFProcessor
# Hybrid search
from multi_modal_rag.indexing import OpenSearchManager
manager = OpenSearchManager()
results = manager.hybrid_search("research_assistant", "transformers", k=10)
See: indexing.md - Section: hybrid_search()
# Process research query
# Assumes api_key and opensearch_manager (an OpenSearchManager instance) are already defined
from multi_modal_rag.orchestration import ResearchOrchestrator
orchestrator = ResearchOrchestrator(api_key, opensearch_manager)
result = orchestrator.process_query("How do transformers work?", "research_assistant")
See: orchestration.md - Section: process_query()
# Fetch collections via API
import requests
response = requests.get("http://localhost:8000/api/collections?content_type=paper")
papers = response.json()['collections']
See: api.md - Section: GET /api/collections
Location in docs:
- Collection: data-collectors.md - Integration Example section
- Processing: data-processors.md - Integration Workflow section
- Indexing: indexing.md - Usage examples
- Querying: orchestration.md - Complete Research Workflow section
Location: ui.md - Section: Integration with Main Application
Example: main.py setup connecting all components
Location: api.md - Section: Integration Examples
Examples: React frontend, Python client, CLI tool
Data Sources → Collectors → Processors → Indexing → Search
↓
Database Tracking
Detailed docs:
- Collectors: data-collectors.md
- Processors: data-processors.md
- Indexing: indexing.md
- Database: database.md
User Query → Orchestrator → OpenSearch → Context Formatting
↓ ↓
Memory ←─────── Gemini LLM ←───────┘
↓
Citation Extraction
Detailed docs:
- Orchestrator: orchestration.md
- Search: indexing.md
- Citations: orchestration.md - CitationTracker section
Docs:
- data-collectors.md - collect_arxiv_papers()
- indexing.md - bulk_index()
- database.md - add_collection()

Docs:
- data-processors.md - extract_text_and_images()
- data-processors.md - analyze_with_gemini()

Docs:
- indexing.md - hybrid_search()
- api.md - Frontend Integration
- orchestration.md - process_query()

Docs:
- orchestration.md - CitationTracker
- orchestration.md - export_bibliography()

Docs:
- api.md - Deployment section
- api.md - Docker Deployment
- Data Collectors: data-collectors.md - Troubleshooting section
  - yt-dlp not found
  - YouTube transcript unavailable
  - RSS feed parsing fails
- Data Processors: data-processors.md - Troubleshooting section
  - Gemini API authentication
  - PDF extraction errors
  - Image analysis failures
- Indexing: indexing.md - Troubleshooting section
  - OpenSearch connection refused
  - SSL certificate errors
  - Search returns no results
- Database: database.md - Troubleshooting section
  - Database locked errors
  - Corrupted database recovery
  - JSON parse errors
- API: api.md - Troubleshooting section
  - Port already in use
  - CORS errors
  - 422 validation errors
- Orchestration: orchestration.md - Troubleshooting section
  - Related queries not JSON
  - Citations not extracted
  - Memory grows too large
- UI: ui.md - Troubleshooting section
  - Gradio won't launch
  - Share link doesn't work
  - UI freezes during collection
Doc: indexing.md - Performance Tuning section
Key topics:
- Bulk indexing vs single documents
- Batch embedding generation
- Shard configuration
- Query optimization
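The difference between bulk and single-document indexing mostly comes down to grouping documents into batches before sending them. The helper below is a minimal, generic sketch of that grouping step. `batched` is a hypothetical name and the batch size is an assumption to tune against your own cluster.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # the source is exhausted
        yield batch


# Example: seven documents grouped into batches of three
batches = list(batched(range(7), 3))
```

Each batch could then be handed to `bulk_index()`, whose exact signature is documented in indexing.md, or to a batch embedding call, so that per-request overhead is paid once per batch instead of once per document.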
Doc: indexing.md - Search Performance section
Key topics:
- Result size limits
- Filter optimization
- Field selection
- Caching strategies
Doc: data-processors.md - Performance Optimization section
Key topics:
- Image limit (5 max)
- Batch processing
- Result caching
- Transcript truncation
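Transcript truncation can be as simple as cutting the text to a character budget before it is sent to the model. The sketch below is one way to do that without splitting a word. `truncate_text`, the suffix, and the budget are all assumptions for illustration, and the limits actually applied by the processors are listed in data-processors.md.

```python
def truncate_text(text: str, max_chars: int, suffix: str = "...") -> str:
    """Shorten text to at most max_chars characters, cutting at a word boundary."""
    if max_chars < len(suffix):
        raise ValueError("max_chars must be at least len(suffix)")
    if len(text) <= max_chars:
        return text
    # Reserve room for the suffix, then back up to the last space if there is one
    cut = text[: max_chars - len(suffix)]
    head, sep, _ = cut.rpartition(" ")
    if sep:
        cut = head
    return cut.rstrip() + suffix


short = truncate_text("one two three four", 11)
```

A token-based budget would be more precise for LLM prompts, but a character budget needs no tokenizer and is often close enough for a first pass.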
Doc: ui.md - Performance Considerations section
Key topics:
- Loading states
- Result display limits
- Async operations
- Queue management
Doc: api.md - Security Considerations section
Topics:
- CORS configuration
- API authentication
- Rate limiting
- HTTPS deployment
Doc: database.md - Error Handling section
Topics:
- Transaction safety
- SQL injection prevention
- Backup strategies
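The transaction safety and SQL injection topics above can be illustrated with the standard-library `sqlite3` module. This is a minimal sketch under stated assumptions: the `collections` table, its columns, and the helper functions are hypothetical and do not reflect the real schema, which is documented in database.md.

```python
import sqlite3

# Hypothetical table and column names, chosen for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE collections ("
    "id INTEGER PRIMARY KEY, title TEXT NOT NULL, content_type TEXT NOT NULL)"
)


def add_item(conn: sqlite3.Connection, title: str, content_type: str) -> int:
    """Insert one row inside a transaction and return its row id."""
    # "with conn" commits on success and rolls back if an exception is raised
    with conn:
        cursor = conn.execute(
            "INSERT INTO collections (title, content_type) VALUES (?, ?)",
            (title, content_type),  # placeholders keep user input out of the SQL text
        )
    return cursor.lastrowid


def find_by_title(conn: sqlite3.Connection, fragment: str) -> list:
    """Return (id, title) rows whose title contains the given fragment."""
    rows = conn.execute(
        "SELECT id, title FROM collections WHERE title LIKE ? ORDER BY id",
        (f"%{fragment}%",),
    )
    return rows.fetchall()
```

Because values are bound through `?` placeholders, a title such as `x'; DROP TABLE collections; --` is stored as plain text and never executed as SQL.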
| Module | Key Dependencies | Doc Reference |
|---|---|---|
| Data Collectors | arxiv, yt-dlp, feedparser, whisper | data-collectors.md |
| Data Processors | google-generativeai, pypdf, pymupdf | data-processors.md |
| Indexing | opensearch-py, sentence-transformers | indexing.md |
| Database | sqlite3 (built-in) | database.md |
| API | fastapi, uvicorn | api.md |
| Orchestration | langchain, langchain-google-genai | orchestration.md |
| UI | gradio | ui.md |
# All dependencies
pip install -r requirements.txt
# By module (see individual docs for details)
pip install arxiv yt-dlp youtube-transcript-api feedparser openai-whisper
pip install google-generativeai pypdf pymupdf pillow
pip install opensearch-py sentence-transformers
pip install fastapi uvicorn
pip install langchain langchain-google-genai
pip install gradio
- GET / - API info → api.md
- GET /api/collections - List collections → api.md
- GET /api/collections/{id} - Get details → api.md
- GET /api/statistics - Statistics → api.md
- GET /api/search - Search → api.md
- GET /viz - Dashboard → api.md
- GET /health - Health check → api.md
- Collectors: data-collectors.md - Methods section
- Processors: data-processors.md - Methods section
- Indexing: indexing.md - Methods section
- Database: database.md - Methods section
- Orchestrator: orchestration.md - Methods section
- UI: ui.md - Event Handler Methods section
- data-collectors.md: 16KB
- data-processors.md: 21KB
- indexing.md: 25KB
- database.md: 24KB
- api.md: 22KB
- orchestration.md: 24KB
- ui.md: 23KB
Total: 155KB of comprehensive documentation
Each documentation file includes:
- Overview - Module purpose and architecture
- Class/Function Reference - Complete API documentation
- Parameters & Returns - Detailed type information
- Code Examples - Working examples for all features
- Integration Examples - How modules work together
- Error Handling - Common errors and solutions
- Performance Tips - Optimization strategies
- Troubleshooting - Common issues and fixes
- Dependencies - Required packages
- Future Enhancements - Planned features
When updating documentation:
- Follow existing structure and format
- Include code examples for all new features
- Add troubleshooting entries for common issues
- Update this index with new sections
- Keep examples tested and working
- Update file sizes if significantly changed
For documentation improvements or corrections:
- Create an issue with the documentation label
- Specify the file and section
- Provide suggested changes
- Include code examples if applicable
Last Updated: October 2024
Documentation Version: 1.0
Total Modules Documented: 7