Common questions and answers about the Multi-Modal Academic Research System.
- Installation & Setup
- Configuration
- Data Collection
- Search & Retrieval
- Performance
- Features & Capabilities
- Troubleshooting
- Advanced Topics
A: Minimum requirements:
- Python: 3.8 or higher
- RAM: 4GB minimum, 8GB recommended
- Disk: 10GB free space (more for storing papers/videos)
- Docker: For running OpenSearch
- Internet: Required for API calls and data collection
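Two of the requirements above can be checked from Python before installing anything. The following is a minimal sketch using only the standard library; the function name `check_requirements` and its parameters are illustrative and not part of the project, and the RAM and Docker checks are left out because they need platform-specific calls.

```python
import shutil
import sys


def check_requirements(path=".", min_python=(3, 8), min_free_gb=10):
    """Return pass/fail flags for the Python version and free disk space."""
    # Free space on the filesystem that contains `path`, in GiB.
    free_gb = shutil.disk_usage(path).free / 1024**3
    return {
        "python_ok": sys.version_info >= min_python,
        "disk_ok": free_gb >= min_free_gb,
        "free_gb": round(free_gb, 1),
    }


if __name__ == "__main__":
    print(check_requirements())
```

Running it from the project directory reports whether the interpreter and the disk holding that directory meet the listed minimums.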
A: No, the system uses free services:
- Gemini API: Free tier (60 requests/minute)
- OpenSearch: Self-hosted locally (no cost)
- ArXiv, YouTube, etc.: Free public APIs
- Embedding model: Runs locally (no API costs)
A: Docker is required only for OpenSearch. Alternatives:
- Install OpenSearch locally without Docker
- Use Elasticsearch instead (similar API)
- Connect to a remote OpenSearch instance
However, Docker is the recommended and easiest method.
A: Common reasons:
- Large dependencies: PyTorch, transformers (can be 1-2GB)
- Slow internet: Downloads from PyPI
- Compiling packages: Some packages compile from source
Speed up installation:
# Use binary wheels when available
pip install --only-binary=:all: -r requirements.txt
# Or use conda for pre-compiled packages
conda install pytorch sentence-transformers
A: Yes, but with some considerations:
- Use venv\Scripts\activate instead of source venv/bin/activate
- Docker Desktop required for Windows
- Path separators are different (\ vs /)
- Some shell commands may differ
Consider using WSL2 (Windows Subsystem for Linux) for better compatibility.
A: Follow these steps:
- Visit https://makersuite.google.com/app/apikey
- Sign in with your Google account
- Click "Create API Key"
- Copy the key to your .env file: GEMINI_API_KEY=your_key_here
A: Yes, the system can be adapted for other LLMs:
- OpenAI GPT: Modify research_orchestrator.py to use OpenAI API
- Local LLMs: Use Ollama or LM Studio
- Claude: Use Anthropic API
- Open-source models: Use Hugging Face transformers
See Advanced Topics for implementation details.
A: Collected files are stored in:
- PDFs: data/papers/
- Videos: data/videos/
- Podcasts: data/podcasts/
- Processed data: data/processed/
You can change locations in configuration files.
A: Update your .env file:
OPENSEARCH_HOST=localhost
OPENSEARCH_PORT=9201 # Change from default 9200
Then start OpenSearch with the new port:
docker run -p 9201:9200 opensearchproject/opensearch:latest
A: Yes, update .env:
OPENSEARCH_HOST=your-opensearch-host.com
OPENSEARCH_PORT=9200
OPENSEARCH_USE_SSL=true
OPENSEARCH_USER=admin
OPENSEARCH_PASSWORD=your_password
A: Current sources:
- Papers: ArXiv, PubMed Central, Semantic Scholar
- Videos: YouTube (educational channels)
- Podcasts: RSS feeds
See Custom Collectors to add more sources.
A: Limits:
- ArXiv: 2000 per query (API limit)
- Semantic Scholar: Rate-limited (100/minute)
- YouTube: No hard limit, but respect rate limits
- Local storage: Limited by disk space
Recommended: Start with 10-50 papers for testing.
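Staying under the per-minute limits above usually comes down to spacing requests out. Here is a minimal sketch of one way to do that; the class name `MinIntervalLimiter` is hypothetical, and the `clock` and `sleep` parameters exist only so the behavior can be tested without real waiting.

```python
import time


class MinIntervalLimiter:
    """Spaces calls at least 60 / per_minute seconds apart."""

    def __init__(self, per_minute, clock=time.monotonic, sleep=time.sleep):
        self.interval = 60.0 / per_minute
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # no call has been made yet

    def wait(self):
        """Block until the next call is allowed, then reserve the following slot."""
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self.interval
```

A collector would create one limiter per source, for example `MinIntervalLimiter(per_minute=100)` for the Semantic Scholar figure quoted above, and call `wait()` immediately before each request.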
A: Partially:
- ArXiv: Yes, filter by category (e.g., cs.LG for ML)
- PubMed: Yes, filter by journal name
- Semantic Scholar: Yes, use field filters
Example ArXiv query:
query = 'cat:cs.LG AND ti:"neural networks"'  # quote the phrase so both words apply to the title
A: Use date filters:
# ArXiv
query = 'submittedDate:[202301010000 TO 202312312359] AND all:"machine learning"'  # dates are YYYYMMDDHHMM
# In the UI, you can filter after collection
# Or modify the collector to add date filters
A: Common reasons:
- Access restrictions: Some papers require subscriptions
- Invalid URLs: Paper may have been removed
- Rate limiting: Too many requests too fast
- Network issues: Firewall or proxy blocking
The system focuses on open-access content (ArXiv, PMC).
A: Yes, you can:
- Place PDFs in data/papers/
- Use the processing pipeline:
from multi_modal_rag.data_processors import pdf_processor
processor = pdf_processor.PDFProcessor()
result = processor.process_pdf("path/to/paper.pdf")
- Index the processed content
A: Modify youtube_collector.py:
from googleapiclient.discovery import build

def collect_from_channel(channel_id, api_key, max_videos=50):
    """Collect videos from a specific channel."""
    youtube = build('youtube', 'v3', developerKey=api_key)
    request = youtube.search().list(
        part='snippet',
        channelId=channel_id,
        maxResults=min(max_videos, 50),  # the API accepts at most 50 per page
        type='video'
    )
    response = request.execute()
    return response['items']
A: Hybrid search combines:
- Keyword search: BM25 algorithm (like traditional search)
- Semantic search: Vector similarity using embeddings
Results from both are combined and ranked. See Hybrid Search Guide for details.
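The text above does not say how the two result lists are merged. One common technique is reciprocal rank fusion (RRF), sketched below purely as an illustration; it is not necessarily what this system does. The function name, the constant `k=60`, and the document IDs are all assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one list.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1; a higher total score ranks first.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


if __name__ == "__main__":
    keyword_hits = ["paper_a", "paper_b", "paper_c"]   # e.g. BM25 order
    semantic_hits = ["paper_b", "paper_d", "paper_a"]  # e.g. vector-similarity order
    print(reciprocal_rank_fusion([keyword_hits, semantic_hits]))
```

RRF only needs ranks, so it sidesteps the problem that BM25 scores and vector similarities are on different scales. A weighted sum of normalized scores is another option.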
A: Possible causes:
- Mismatch between query and content: Try different keywords
- Not enough indexed content: Add more documents
- Poor field weights: Adjust in opensearch_manager.py
- Wrong search mode: Try semantic-only or keyword-only
Tips:
- Use specific technical terms
- Try multiple phrasings
- Check if documents are actually indexed
A: Recommendations:
- Minimum: 10-20 papers for basic testing
- Good: 100-200 papers for decent coverage
- Excellent: 500+ papers for comprehensive results
More documents = better context, but slower indexing.
A: Yes, filter by content type:
query = {
    "query": {
        "bool": {
            "must": [
                {"match": {"content": "machine learning"}}
            ],
            "filter": [
                {"term": {"content_type": "paper"}}  # or "video", "podcast"
            ]
        }
    }
}
Or use the UI filters (if implemented).
A: Optimization strategies:
- Reduce result size: Return fewer documents
- Use filters: Pre-filter before searching
- Optimize OpenSearch: Increase memory, reduce shards
- Cache results: Cache common queries
- Use pagination: Don't load all results at once
See Performance Guide for details.
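Two of the strategies above, pagination and caching, can be sketched in a few lines. This is an illustration under assumptions: `paginated_query` and `make_cached_search` are hypothetical names, and `run_search` stands for whatever callable sends a request body to OpenSearch and returns the results.

```python
from functools import lru_cache


def paginated_query(text, page=0, page_size=10):
    """Build an OpenSearch request body that returns one page of matches."""
    return {
        "from": page * page_size,  # offset of the first hit on this page
        "size": page_size,         # number of hits per page
        "query": {"match": {"content": text}},
    }


def make_cached_search(run_search, maxsize=128):
    """Wrap a search callable so repeated (text, page) requests reuse earlier results."""

    @lru_cache(maxsize=maxsize)
    def cached(text, page=0):
        return run_search(paginated_query(text, page))

    return cached
```

Cached results become stale once new documents are indexed, so the cache should be cleared (`cached.cache_clear()`) after each indexing run.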
A: Yes:
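query = {
    "query": {
        "bool": {
            "must": [
                {"match": {"content": "neural networks"}}
            ],
            "filter": [
                # term matches exact values, so authors must be a keyword field;
                # use match_phrase instead if authors is analyzed text
                {"term": {"authors": "Geoffrey Hinton"}}
            ]
        }
    }
}
A: First query may be slow due to: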
- Model loading: Embedding model loaded into memory
- Index warming: OpenSearch caches are cold
- LLM initialization: Gemini client initialization
Subsequent queries are faster (cached models, warm indices).
A: Typical usage:
- Python application: 1-2GB
- OpenSearch: 2-4GB (configurable)
- Embedding model: 500MB-1GB
- Total: 4-8GB recommended
For large-scale usage, 16GB+ recommended.
A: Possible but not recommended:
- RAM limitation: Need at least 4GB
- CPU: Will be very slow
- Storage: Need sufficient space
Better options: Use cloud instance or local machine.
A: Speed improvements:
- Use GPU: For embedding generation
- Batch processing: Process multiple docs at once
- Reduce Gemini calls: Cache vision analysis results
- Parallel processing: Use multiprocessing
- Skip large PDFs: Set size limits
Example:
# Process in parallel
from multiprocessing import Pool

if __name__ == "__main__":  # required where worker processes are spawned (e.g. Windows)
    with Pool(processes=4) as pool:
        results = pool.map(process_pdf, pdf_files)
A: Common bottlenecks:
- Embedding generation: CPU-bound operation
- Gemini API calls: Rate-limited to 60/min
- OpenSearch indexing: Network and disk I/O
- PDF processing: Complex PDFs with many images
Solutions: See Performance Optimization.
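For the embedding bottleneck, passing texts to the model in batches rather than one at a time is usually the first thing to try. Below is a minimal sketch; `batched` and `embed_all` are hypothetical helpers, and `encode` stands for any callable that maps a list of texts to a list of vectors (the `encode` method of a sentence-transformers model accepts a list of sentences, for instance).

```python
from itertools import islice


def batched(items, batch_size):
    """Yield successive lists of up to `batch_size` items from any iterable."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def embed_all(texts, encode, batch_size=32):
    """Embed texts batch by batch and return the vectors in input order."""
    vectors = []
    for batch in batched(texts, batch_size):
        vectors.extend(encode(batch))
    return vectors
```

The batch size of 32 is an arbitrary starting point; larger batches help most on a GPU, while on CPU the gain is smaller and memory use grows.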
A: Limited support:
- Papers: Primarily English (ArXiv, PMC)
- Embeddings: Model supports multiple languages
- Gemini: Supports many languages
- Search: Works with non-English content
For full multilingual support, you'd need:
- Multilingual embedding model
- Language-specific data sources
- Translation capabilities
A: Yes, export options:
- Citations: Export via Citation Manager tab
- Results: Save as JSON, CSV, or BibTeX
- Summaries: Copy from UI or save to file
Implementation:
import json
# Export results
with open('results.json', 'w') as f:
    json.dump(search_results, f, indent=2)
A: Not built-in, but you could:
- Shared index: Multiple users connect to same OpenSearch
- Cloud deployment: Deploy on server, share URL
- Export/import: Share indexed content between instances
Future enhancement could add user accounts and sharing features.
A: Not directly, but possible:
- Export papers from Zotero
- Import PDFs into this system
- Use citation export to import back
Or build a custom integration using their APIs.
A: Yes, basic features:
- Track citations: From LLM responses
- View bibliography: In Citation Manager
- Export citations: BibTeX, JSON formats
For advanced features, use dedicated tools like Zotero.
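To show what a BibTeX export involves, here is a minimal sketch that formats one record. The record shape (`title`, `authors`, `year`, optional `url`) and the function name `to_bibtex` are assumptions made for illustration; the Citation Manager's actual data model is not shown here.

```python
def to_bibtex(entry):
    """Format one paper record as a BibTeX @article entry."""
    # Citation key: first author's surname plus year, e.g. "example2024".
    first_author_surname = entry["authors"][0].split()[-1].lower()
    key = f"{first_author_surname}{entry['year']}"
    fields = {
        "title": entry["title"],
        "author": " and ".join(entry["authors"]),
        "year": str(entry["year"]),
    }
    if entry.get("url"):
        fields["url"] = entry["url"]
    body = ",\n".join(f"  {name} = {{{value}}}" for name, value in fields.items())
    return f"@article{{{key},\n{body}\n}}"


if __name__ == "__main__":
    # Placeholder record, not a real publication.
    print(to_bibtex({"title": "Example Title",
                     "authors": ["Ada Example", "Bob Sample"],
                     "year": 2024}))
```

The sketch does not escape special characters such as `{`, `}` or `&` in titles, which a real exporter would need to handle.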
A: Not currently implemented, but could be added:
- Store notes in OpenSearch alongside documents
- Add UI for note-taking
- Include notes in search results
A: No, this is a retrieval and question-answering system, not a PDF reader. Use dedicated PDF tools for markup.
A: Troubleshooting steps:
- Check Docker is running: docker ps
- Check port availability: lsof -i :9200
- Check Docker logs: docker logs <container_id>
- Try different port: -p 9201:9200
- Increase memory: Add --memory=4g
See OpenSearch Troubleshooting for details.
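The port check above can also be done from Python, which helps on systems without lsof. This is a minimal standard-library sketch; the function name `is_port_open` is illustrative. Unlike lsof, it only reports whether something is listening, not which process.

```python
import socket


def is_port_open(host="localhost", port=9200, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False


if __name__ == "__main__":
    print("port 9200 in use:", is_port_open())
```

A `True` result before the container starts means another process already holds the port; a `False` result after it starts means OpenSearch is not reachable yet.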
A: Steps to fix:
- Verify key in .env file (no quotes, no spaces)
- Test key at https://makersuite.google.com/
- Generate new key if expired
- Check for typos or extra characters
- Ensure .env is in project root
A: Check these:
- Index exists: curl http://localhost:9200/_cat/indices
- Documents indexed: curl http://localhost:9200/research_assistant/_count
- Query format: Test with simple query
- Content mismatch: Verify indexed content matches query
A: Common issues:
- Port in use: Kill process using port 7860
- Python error: Check terminal for stack trace
- Missing dependencies: Reinstall requirements
- OpenSearch down: Start OpenSearch first
A: Add to your code:
import logging
logging.basicConfig(
    level=logging.DEBUG,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
Or set in main.py before launching.
A: Logs are in:
- Console output: Check terminal
- Log files: logs/ directory (if enabled)
- OpenSearch logs: docker logs opensearch-node
- Gradio logs: In console output
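If file logging is not already enabled, it can be switched on with the standard logging module. This is a minimal sketch that writes to the logs/ directory mentioned above; the function name `enable_file_logging` and the file name `app.log` are assumptions, not names taken from the project.

```python
import logging
from pathlib import Path


def enable_file_logging(log_dir="logs", filename="app.log", level=logging.DEBUG):
    """Attach a file handler to the root logger and return the log file path."""
    directory = Path(log_dir)
    directory.mkdir(parents=True, exist_ok=True)
    log_path = directory / filename
    handler = logging.FileHandler(log_path, encoding="utf-8")
    handler.setFormatter(
        logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
    )
    root = logging.getLogger()
    root.setLevel(level)
    root.addHandler(handler)
    return log_path
```

Calling it once at startup, before the application launches, sends every logger's output to the file in addition to any console handlers already configured.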
A: Yes, modify opensearch_manager.py:
from sentence_transformers import SentenceTransformer
# Change model
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
# Update dimension in index mapping (768 for above model)
A: Create a new collector:
- Create custom_collector.py in data_collectors/
- Implement collection logic
- Register in main.py
- Add UI option in gradio_app.py
A: Yes, adapt for any content:
- Modify collectors for your sources
- Adjust data schema in OpenSearch
- Update UI for your use case
The architecture is general-purpose RAG.
A: Adjust field weights:
query = {
    "query": {
        "multi_match": {
            "query": query_text,
            "fields": [
                "title^3",     # 3x weight
                "abstract^2",  # 2x weight
                "content^1",   # 1x weight
                "key_concepts^2"
            ]
        }
    }
}
Test different weights for your use case.
A: Yes, but consider:
- Authentication: Add user auth
- Scaling: Use managed OpenSearch (AWS, etc.)
- Rate limiting: Implement proper limits
- Monitoring: Add logging and metrics
- Security: Sanitize inputs, use HTTPS
- Costs: Monitor API usage
A: Backup strategies:
- OpenSearch snapshots (requires a registered snapshot repository, here named backup):
curl -X PUT "localhost:9200/_snapshot/backup/snapshot_1?wait_for_completion=true"
- File backup:
tar -czf backup.tar.gz data/
- Export to JSON:
# Export all documents (client is an existing OpenSearch client)
import json
from opensearchpy import helpers
docs = helpers.scan(client, index='research_assistant')
with open('backup.json', 'w') as f:
    json.dump(list(docs), f)
A: Yes! Contributions welcome:
- Submit bug reports
- Suggest features
- Submit pull requests
- Improve documentation
See CONTRIBUTING.md for guidelines (if available).
A: Key differences:
- Local control: You control the data and index
- Free: No subscription costs (except API usage)
- Customizable: Modify for your needs
- Private: Data stays local
- Academic focus: Specialized for research
- Multi-modal: Integrates papers, videos, podcasts
A: Potential enhancements:
- More data sources (Google Scholar, JSTOR)
- Better visualization (knowledge graphs)
- Collaborative features (sharing, comments)
- Mobile app
- Better citation management
- Integration with reference managers
- Support for more file formats
If your question isn't answered here:
- Check the Troubleshooting Guide
- Review the CLAUDE.md project overview
- Search GitHub issues
- Create a new issue with your question