Frequently Asked Questions (FAQ)

Common questions and answers about the Multi-Modal Academic Research System.

Installation & Setup
Configuration
Data Collection
Search & Retrieval
Performance
Features & Capabilities
Troubleshooting
Advanced Topics

Installation & Setup

Q: What are the system requirements?

A: Minimum requirements:

Python: 3.8 or higher
RAM: 4GB minimum, 8GB recommended
Disk: 10GB free space (more for storing papers/videos)
Docker: For running OpenSearch
Internet: Required for API calls and data collection

Q: Do I need to pay for any services?

A: No, the system uses free services:

Gemini API: Free tier (rate-limited; this guide assumes 60 requests/minute, but quotas vary by model, so check the current limits)
OpenSearch: Self-hosted locally (no cost)
ArXiv, YouTube, etc.: Free public APIs
Embedding model: Runs locally (no API costs)

Q: Can I run this without Docker?

A: Yes. Docker is only used to run OpenSearch. Alternatives:

Install OpenSearch locally without Docker
Use Elasticsearch instead (similar API, though client and query changes may be needed)
Connect to a remote OpenSearch instance

However, Docker is the recommended and easiest method.

Q: Why does installation take so long?

A: Common reasons:

Large dependencies: PyTorch, transformers (can be 1-2GB)
Slow internet: Downloads from PyPI
Compiling packages: Some packages compile from source

Speed up installation:

# Use binary wheels when available
pip install --only-binary=:all: -r requirements.txt

# Or use conda for pre-compiled packages (both are available on conda-forge)
conda install -c conda-forge pytorch sentence-transformers

Q: Can I use this on Windows?

A: Yes, but with some considerations:

Use venv\Scripts\activate instead of source venv/bin/activate
Docker Desktop required for Windows
Path separators are different (\ vs /)
Some shell commands may differ

Consider using WSL2 (Windows Subsystem for Linux) for better compatibility.

Configuration

Q: How do I get a Gemini API key?

A: Follow these steps:

Visit https://aistudio.google.com/app/apikey (Google AI Studio, formerly MakerSuite at https://makersuite.google.com/app/apikey)
Sign in with your Google account
Click "Create API Key"
Copy the key to your .env file:
```
GEMINI_API_KEY=your_key_here
```

Q: Can I use other LLMs instead of Gemini?

A: Yes, the system can be adapted for other LLMs:

OpenAI GPT: Modify research_orchestrator.py to use OpenAI API
Local LLMs: Use Ollama or LM Studio
Claude: Use Anthropic API
Open-source models: Use Hugging Face transformers

See Advanced Topics for implementation details.

Q: Where are my downloaded papers stored?

A: Collected content is stored under data/, by type:

PDFs: data/papers/
Videos: data/videos/
Podcasts: data/podcasts/
Processed data: data/processed/

You can change locations in configuration files.

Q: How do I change the OpenSearch port?

A: Update your .env file:

OPENSEARCH_HOST=localhost
OPENSEARCH_PORT=9201  # Change from default 9200

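Then start OpenSearch with host port 9201 mapped to the container's port 9200:

# OpenSearch 2.12+ also requires -e OPENSEARCH_INITIAL_ADMIN_PASSWORD=<strong_password>
docker run -p 9201:9200 -e "discovery.type=single-node" opensearchproject/opensearch:latest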

Q: Can I use a remote OpenSearch instance?

A: Yes, update .env:

OPENSEARCH_HOST=your-opensearch-host.com
OPENSEARCH_PORT=9200
OPENSEARCH_USE_SSL=true
OPENSEARCH_USER=admin
OPENSEARCH_PASSWORD=your_password
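
As a rough illustration of how these variables could be turned into client settings, here is a minimal sketch. The function name `opensearch_settings` is hypothetical, and how the project itself reads its .env file is an assumption; only the variable names come from the example above.

```python
import os


def opensearch_settings(env=None):
    """Build connection settings from OPENSEARCH_* variables (hypothetical helper)."""
    env = os.environ if env is None else env
    host = env.get("OPENSEARCH_HOST", "localhost")
    port = int(env.get("OPENSEARCH_PORT", "9200"))
    # Treat common truthy spellings as "use SSL"
    use_ssl = env.get("OPENSEARCH_USE_SSL", "false").strip().lower() in {"1", "true", "yes"}
    settings = {"hosts": [{"host": host, "port": port}], "use_ssl": use_ssl}
    user = env.get("OPENSEARCH_USER")
    password = env.get("OPENSEARCH_PASSWORD")
    if user and password:
        # Only send credentials when both parts are configured
        settings["http_auth"] = (user, password)
    return settings
```

With opensearch-py installed, the resulting dictionary can be passed as `OpenSearch(**opensearch_settings())`, since `hosts`, `use_ssl`, and `http_auth` are keyword arguments that client accepts.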

Data Collection

Q: What data sources are supported?

A: Current sources:

Papers: ArXiv, PubMed Central, Semantic Scholar
Videos: YouTube (educational channels)
Podcasts: RSS feeds

See Custom Collectors to add more sources.

Q: How many papers can I collect at once?

A: Limits:

ArXiv: Up to 2000 results per request (API limit); page with the start parameter for larger result sets
Semantic Scholar: Rate-limited (100/minute)
YouTube: No hard limit, but respect rate limits
Local storage: Limited by disk space

Recommended: Start with 10-50 papers for testing.
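
To stay under a per-minute limit such as the 100/minute figure above, a collector can space out its requests. The following is a minimal sketch of a fixed-interval throttle; the `Throttle` class is hypothetical and is not part of the project's collectors.

```python
import time


class Throttle:
    """Space calls so that at most `max_per_minute` happen per minute."""

    def __init__(self, max_per_minute, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / max_per_minute
        self._clock = clock    # injectable for testing
        self._sleep = sleep
        self._last = None      # time of the previous call, if any

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `throttle.wait()` immediately before each API request keeps the request rate at or below the configured limit.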

Q: Can I collect papers from specific journals?

A: Partially:

ArXiv: Yes, filter by category (e.g., cs.LG for ML)
PubMed: Yes, filter by journal name
Semantic Scholar: Yes, use field filters

Example ArXiv query:

query = 'cat:cs.LG AND ti:"neural networks"'  # quote the phrase so ti: applies to both words

Q: How do I collect papers published in a specific year?

A: Use date filters:

# ArXiv: dates use the YYYYMMDDHHMM format (GMT)
query = 'submittedDate:[202301010000 TO 202312312359] AND all:"machine learning"'

# In the UI, you can filter after collection
# Or modify the collector to add date filters
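
A date-filtered query can also be assembled programmatically. This sketch builds the query string and the request URL for the public arXiv API endpoint; the helper names `arxiv_year_query` and `arxiv_url` are hypothetical and are not taken from the project's collector.

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"


def arxiv_year_query(terms, year, category=None):
    """Build a search_query limited to one calendar year (times in GMT)."""
    parts = [
        f"submittedDate:[{year}01010000 TO {year}12312359]",
        f'all:"{terms}"',
    ]
    if category:
        parts.insert(0, f"cat:{category}")
    return " AND ".join(parts)


def arxiv_url(search_query, start=0, max_results=50):
    """Return the request URL; start and max_results control paging."""
    params = {"search_query": search_query, "start": start, "max_results": max_results}
    return f"{ARXIV_API}?{urlencode(params)}"
```

For example, `arxiv_url(arxiv_year_query("machine learning", 2023, category="cs.LG"))` yields a URL that can be fetched with any HTTP client.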

Q: Why can't I download some papers?

A: Common reasons:

Access restrictions: Some papers require subscriptions
Invalid URLs: Paper may have been removed
Rate limiting: Too many requests too fast
Network issues: Firewall or proxy blocking

The system focuses on open-access content (ArXiv, PMC).

Q: Can I import my own PDFs?

A: Yes, you can:

Place PDFs in data/papers/

Use the processing pipeline:

from multi_modal_rag.data_processors import pdf_processor

processor = pdf_processor.PDFProcessor()
result = processor.process_pdf("path/to/paper.pdf")

Index the processed content

Q: How do I collect videos from specific YouTube channels?

A: Modify youtube_collector.py:

def collect_from_channel(channel_id, api_key, max_videos=50):
    """Collect videos from a specific channel (one page of results)."""
    from googleapiclient.discovery import build

    # api_key is a YouTube Data API v3 key
    youtube = build('youtube', 'v3', developerKey=api_key)

    request = youtube.search().list(
        part='snippet',
        channelId=channel_id,
        maxResults=min(max_videos, 50),  # the API allows at most 50 per page
        type='video'
    )

    response = request.execute()
    return response['items']

Search & Retrieval

Q: How does hybrid search work?

A: Hybrid search combines:

Keyword search: BM25 algorithm (like traditional search)
Semantic search: Vector similarity using embeddings

Results from both are combined and ranked. See Hybrid Search Guide for details.
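
How the two rankings are merged depends on the implementation. One common technique is reciprocal rank fusion (RRF), in which each document scores 1 / (k + rank) in every list it appears in and the scores are summed. The sketch below illustrates RRF only; it is not necessarily the method this system uses, and the document IDs and k=60 are illustrative assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of document IDs using reciprocal rank fusion."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Documents near the top of any list contribute more
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order
    return sorted(scores, key=lambda d: (-scores[d], d))


keyword_hits = ["p1", "p2", "p3"]    # e.g. BM25 order
semantic_hits = ["p3", "p1", "p4"]   # e.g. vector-similarity order
fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
```

Documents that rank well in both lists (here `p1` and `p3`) rise above documents that appear in only one.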

Q: Why are my search results not relevant?

A: Possible causes:

Mismatch between query and content: Try different keywords
Not enough indexed content: Add more documents
Poor field weights: Adjust in opensearch_manager.py
Wrong search mode: Try semantic-only or keyword-only

Tips:

Use specific technical terms
Try multiple phrasings
Check if documents are actually indexed

Q: How many documents should I index for good results?

A: Recommendations:

Minimum: 10-20 papers for basic testing
Good: 100-200 papers for decent coverage
Excellent: 500+ papers for comprehensive results

More documents give the LLM better context but take longer to index.

Q: Can I search only specific content types?

A: Yes, filter by content type:

query = {
    "query": {
        "bool": {
            "must": [
                {"match": {"content": "machine learning"}}
            ],
            "filter": [
                {"term": {"content_type": "paper"}}  # or "video", "podcast"
            ]
        }
    }
}

Or use the UI filters (if implemented).

Q: How do I improve search speed?

A: Optimization strategies:

Reduce result size: Return fewer documents
Use filters: Pre-filter before searching
Optimize OpenSearch: Increase memory, reduce shards
Cache results: Cache common queries
Use pagination: Don't load all results at once

See Performance Guide for details.
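
Caching common queries, for instance, can be sketched with the standard library. Here `search_fn` stands in for whatever function actually queries OpenSearch; its signature is an assumption made for illustration.

```python
from functools import lru_cache


def make_cached_search(search_fn, maxsize=128):
    """Wrap a search function so repeated identical queries hit an in-memory cache."""

    @lru_cache(maxsize=maxsize)
    def cached(query_text, content_type=None, size=10):
        # Store results as a tuple so cached values cannot be mutated by callers
        return tuple(search_fn(query_text, content_type, size))

    return cached
```

Cached results become stale when new documents are indexed, so call `cached.cache_clear()` after each indexing run.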

Q: Can I search by author name?

A: Yes:

query = {
    "query": {
        "bool": {
            "must": [
                {"match": {"content": "neural networks"}}
            ],
            "filter": [
                {"match_phrase": {"authors": "Geoffrey Hinton"}}  # use "term" only if authors is mapped as keyword
            ]
        }
    }
}

Performance

Q: Why is the first query slow?

A: First query may be slow due to:

Model loading: Embedding model loaded into memory
Index warming: OpenSearch caches are cold
LLM initialization: Gemini client initialization

Subsequent queries are faster (cached models, warm indices).

Q: How much RAM does the system use?

A: Typical usage:

Python application: 1-2GB
OpenSearch: 2-4GB (configurable)
Embedding model: 500MB-1GB
Total: 4-8GB recommended

For large-scale usage, 16GB+ recommended.

Q: Can I run this on a Raspberry Pi?

A: Possible but not recommended:

RAM limitation: Need at least 4GB
CPU: Will be very slow
Storage: Need sufficient space

Better options: Use cloud instance or local machine.

Q: How do I process documents faster?

A: Speed improvements:

Use GPU: For embedding generation
Batch processing: Process multiple docs at once
Reduce Gemini calls: Cache vision analysis results
Parallel processing: Use multiprocessing
Skip large PDFs: Set size limits

Example:

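# Process in parallel
from multiprocessing import Pool

# process_pdf must be a top-level function so it can be sent to worker processes
if __name__ == "__main__":
    with Pool(processes=4) as pool:
        results = pool.map(process_pdf, pdf_files)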
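
Batch processing can be sketched in a similar way. The helper below groups texts into fixed-size batches before handing them to an embedding function; `embed_fn` is a placeholder for the real model call, and the batch size of 32 is an arbitrary assumption.

```python
from itertools import islice


def batched(items, batch_size):
    """Yield successive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def embed_in_batches(texts, embed_fn, batch_size=32):
    """Call embed_fn once per batch and collect the vectors in input order."""
    vectors = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_fn(batch))
    return vectors
```

With sentence-transformers, `SentenceTransformer.encode` already accepts a list of texts and a `batch_size` argument, so a helper like this is mainly useful for grouping other per-document work such as bulk indexing.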

Q: Why is indexing slow?

A: Common bottlenecks:

Embedding generation: CPU-bound operation
Gemini API calls: Rate-limited to 60/min
OpenSearch indexing: Network and disk I/O
PDF processing: Complex PDFs with many images

Solutions: See Performance Optimization.

Features & Capabilities

Q: Does the system support multiple languages?

A: Limited support:

Papers: Primarily English (ArXiv, PMC)
Embeddings: Model supports multiple languages
Gemini: Supports many languages
Search: Works with non-English content

For full multilingual support, you'd need:

Multilingual embedding model
Language-specific data sources
Translation capabilities

Q: Can I export search results?

A: Yes, export options:

Citations: Export via Citation Manager tab
Results: Save as JSON, CSV, or BibTeX
Summaries: Copy from UI or save to file

Implementation:

import json

# Export results
with open('results.json', 'w') as f:
    json.dump(search_results, f, indent=2)
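
A CSV export can be sketched along the same lines. The field names below (`title`, `authors`, `content_type`, `url`) are assumptions about what a result contains; adjust them to match the actual index schema.

```python
import csv
import io


def results_to_csv(results, fields=("title", "authors", "content_type", "url")):
    """Serialize a list of result dictionaries to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fields, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for row in results:
        flat = dict(row)
        # Join author lists into a single cell
        if isinstance(flat.get("authors"), (list, tuple)):
            flat["authors"] = "; ".join(flat["authors"])
        writer.writerow(flat)
    return buffer.getvalue()
```

Keys that are not listed in `fields` are dropped, and missing fields are written as empty cells.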

Q: Does the system support collaborative research?

A: Not built-in, but you could:

Shared index: Multiple users connect to same OpenSearch
Cloud deployment: Deploy on server, share URL
Export/import: Share indexed content between instances

Future enhancement could add user accounts and sharing features.

Q: Can I integrate this with Zotero or Mendeley?

A: Not directly, but possible:

Export papers from Zotero
Import PDFs into this system
Use citation export to import back

Or build a custom integration using their APIs.

Q: Does it support citation management?

A: Yes, basic features:

Track citations: From LLM responses
View bibliography: In Citation Manager
Export citations: BibTeX, JSON formats

For advanced features, use dedicated tools like Zotero.

Q: Can I add annotations or notes to papers?

A: Not currently implemented, but could be added:

Store notes in OpenSearch alongside documents
Add UI for note-taking
Include notes in search results

Q: Does it support PDF highlighting or markup?

A: No, this is a retrieval and question-answering system, not a PDF reader. Use dedicated PDF tools for markup.

Troubleshooting

Q: OpenSearch won't start. What should I do?

A: Troubleshooting steps:

Check Docker is running: docker ps
Check port availability: lsof -i :9200
Check Docker logs: docker logs <container_id>
Try different port: -p 9201:9200
Increase memory: Add --memory=4g

See OpenSearch Troubleshooting for details.

Q: I get "Gemini API key invalid". How do I fix this?

A: Steps to fix:

Verify key in .env file (no quotes, no spaces)
Test the key in Google AI Studio (https://aistudio.google.com/, formerly https://makersuite.google.com/)
Generate new key if expired
Check for typos or extra characters
Ensure .env is in project root
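
The formatting checks above can be automated with a small helper. This is a minimal sketch: `check_env_line` is a hypothetical function that only inspects how the line is written. It cannot tell whether the key itself is valid.

```python
def check_env_line(line, name="GEMINI_API_KEY"):
    """Return a list of formatting problems found in one .env line."""
    key, sep, value = line.partition("=")
    if not sep:
        return [f"no '=' found; expected {name}=<key>"]
    problems = []
    if key != name:
        problems.append(f"variable name is {key!r}, expected {name!r}")
    if value != value.strip():
        problems.append("value has leading or trailing whitespace")
    stripped = value.strip()
    if not stripped:
        problems.append("value is empty")
    elif stripped[0] in "'\"" or stripped[-1] in "'\"":
        problems.append("value is wrapped in quotes")
    if " " in stripped:
        problems.append("value contains a space")
    return problems
```

An empty list means the line passes these checks; the key may still be expired or mistyped.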

Q: Search returns no results. Why?

A: Check these:

Index exists: curl http://localhost:9200/_cat/indices
Documents indexed: curl http://localhost:9200/research_assistant/_count
Query format: Test with simple query
Content mismatch: Verify indexed content matches query

Q: The UI won't load. What's wrong?

A: Common issues:

Port in use: Kill process using port 7860
Python error: Check terminal for stack trace
Missing dependencies: Reinstall requirements
OpenSearch down: Start OpenSearch first

Q: How do I enable debug logging?

A: Add to your code:

import logging

logging.basicConfig(
    level=logging.DEBUG,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)

Or set in main.py before launching.

Q: Where can I find error logs?

A: Logs are in:

Console output: Check terminal
Log files: logs/ directory (if enabled)
OpenSearch logs: docker logs opensearch-node
Gradio logs: In console output

Advanced Topics

Q: Can I use a different embedding model?

A: Yes, modify opensearch_manager.py:

from sentence_transformers import SentenceTransformer

# Change model
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

# Update dimension in index mapping (768 for above model)

See Embedding Models Guide.
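
When the model changes, the vector field in the index mapping has to use the new dimension. The sketch below builds an OpenSearch k-NN mapping for two well-known models; the field name `embedding` and the helper `embedding_mapping` are assumptions, not the project's actual schema.

```python
# Output dimensions of two common sentence-transformers models
KNOWN_DIMENSIONS = {
    "sentence-transformers/all-MiniLM-L6-v2": 384,
    "sentence-transformers/all-mpnet-base-v2": 768,
}


def embedding_mapping(model_name, field="embedding"):
    """Return an index body whose knn_vector field matches the model's output size."""
    dimension = KNOWN_DIMENSIONS[model_name]
    return {
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                field: {"type": "knn_vector", "dimension": dimension}
            }
        },
    }
```

For models not listed here, `SentenceTransformer.get_sentence_embedding_dimension()` reports the size. Because the dimension of an existing vector field cannot be changed, switching models means creating a new index and re-indexing the documents.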

Q: How do I add a custom data source?

A: Create a new collector:

Create custom_collector.py in data_collectors/
Implement collection logic
Register in main.py
Add UI option in gradio_app.py

See Custom Collectors Guide.
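
The steps above could start from a skeleton like the following. The interface shown here (`BaseCollector`, `fetch`, `normalize`, `collect`) is hypothetical; the project's real collectors may be organized differently, so treat it as a starting point only.

```python
from abc import ABC, abstractmethod


class BaseCollector(ABC):
    """Hypothetical base class: subclasses only implement fetch()."""

    content_type = "paper"

    @abstractmethod
    def fetch(self, query, max_results):
        """Return an iterable of raw records from the source."""

    def normalize(self, record):
        """Map a raw record onto the fields the indexer expects (assumed names)."""
        return {
            "title": record.get("title", "").strip(),
            "url": record.get("url"),
            "content_type": self.content_type,
        }

    def collect(self, query, max_results=10):
        """Fetch, normalize, and cap the number of returned items."""
        items = []
        for record in self.fetch(query, max_results):
            items.append(self.normalize(record))
            if len(items) >= max_results:
                break
        return items


class StaticCollector(BaseCollector):
    """Toy source backed by an in-memory list, useful for testing the pipeline."""

    def __init__(self, records):
        self._records = records

    def fetch(self, query, max_results):
        return [r for r in self._records if query.lower() in r["title"].lower()]
```

A real collector would replace `StaticCollector.fetch` with calls to the new source's API and then be registered as described in the steps above.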

Q: Can I use this for non-academic content?

A: Yes, adapt for any content:

Modify collectors for your sources
Adjust data schema in OpenSearch
Update UI for your use case

The architecture is a general-purpose RAG (retrieval-augmented generation) pipeline.

Q: How do I fine-tune the search ranking?

A: Adjust field weights:

query = {
    "query": {
        "multi_match": {
            "query": query_text,
            "fields": [
                "title^3",      # 3x weight
                "abstract^2",   # 2x weight
                "content^1",    # 1x weight
                "key_concepts^2"
            ]
        }
    }
}

Test different weights for your use case.
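
To make such experiments easier, the query can be generated from a dictionary of weights. This is a small sketch; `weighted_multi_match` is a hypothetical helper, and the field names are the ones used in the example above.

```python
def weighted_multi_match(query_text, weights):
    """Build a multi_match query from a {field: boost} mapping."""
    for name, boost in weights.items():
        if boost <= 0:
            raise ValueError(f"boost for {name!r} must be positive")
    # "title^3" means matches in title count three times as much
    fields = [f"{name}^{boost:g}" for name, boost in weights.items()]
    return {"query": {"multi_match": {"query": query_text, "fields": fields}}}


baseline = weighted_multi_match(
    "graph neural networks",
    {"title": 3, "abstract": 2, "content": 1, "key_concepts": 2},
)
```

Keeping each weight set in one dictionary makes it straightforward to run the same test queries against several configurations and compare the rankings.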

Q: Can I deploy this to production?

A: Yes, but consider:

Authentication: Add user auth
Scaling: Use managed OpenSearch (AWS, etc.)
Rate limiting: Implement proper limits
Monitoring: Add logging and metrics
Security: Sanitize inputs, use HTTPS
Costs: Monitor API usage

Q: How do I backup my data?

A: Backup strategies:

OpenSearch snapshots (a snapshot repository must be registered first; here it is named backup):

curl -X PUT "localhost:9200/_snapshot/backup/snapshot_1?wait_for_completion=true"

File backup:

tar -czf backup.tar.gz data/

Export to JSON:

# Export all documents
import json
from opensearchpy import helpers

# client is an existing opensearchpy.OpenSearch instance
docs = helpers.scan(client, index='research_assistant')
with open('backup.json', 'w') as f:
    json.dump(list(docs), f)

Q: Can I contribute to the project?

A: Yes! Contributions welcome:

Submit bug reports
Suggest features
Submit pull requests
Improve documentation

See CONTRIBUTING.md for guidelines (if available).

Q: What's the difference between this and ChatGPT with plugins?

A: Key differences:

Local control: You control the data and index
Free: No subscription costs (except API usage)
Customizable: Modify for your needs
Private: Data stays local
Academic focus: Specialized for research
Multi-modal: Integrates papers, videos, podcasts

Q: Is there a roadmap for future features?

A: Potential enhancements:

More data sources (Google Scholar, JSTOR)
Better visualization (knowledge graphs)
Collaborative features (sharing, comments)
Mobile app
Better citation management
Integration with reference managers
Support for more file formats

Still Have Questions?

If your question isn't answered here:

Check the Troubleshooting Guide
Review the CLAUDE.md project overview
Search GitHub issues
Create a new issue with your question
