An end-to-end Retrieval-Augmented Generation (RAG) system that allows you to chat with PDF documents using Google's Gemini AI. This system intelligently stores embeddings to reduce API costs and supports multiple PDFs in storage.
- Overview
- Features
- How It Works
- Installation
- Setup
- Usage
- Architecture
- File Structure
- Cost Optimization
- Technical Details
This RAG (Retrieval-Augmented Generation) system enables natural language conversations with PDF documents. It uses vector embeddings to find relevant content from your PDFs and generates accurate answers using Google's Gemini AI models.
Key Capabilities:
- Extract and process text from PDF files
- Create vector embeddings for semantic search
- Store embeddings persistently to avoid re-computation
- Support multiple PDFs simultaneously
- Intelligent caching to reduce API costs
- Interactive chat interface
- PDF Text Extraction: Extracts text from all pages of PDF documents
- Text Chunking: Splits large documents into manageable chunks (800 words by default)
- Vector Embeddings: Creates embeddings using Google's `text-embedding-004` model
- Semantic Search: Uses cosine similarity to find the most relevant document chunks
- AI-Powered Answers: Generates answers using Gemini 2.5 Flash model
- Multiple PDF Support: Switch between different PDFs seamlessly
- Persistent Embedding Storage: Saves embeddings to disk in JSON format
- Smart Caching: Only re-embeds PDFs when they've been modified
- Metadata Tracking: Tracks creation date, update date, and modification time
- Automatic Detection: Detects PDF changes and re-embeds only when necessary
- Interactive Selection: Choose PDFs by number or filename
- Embedding Status: View all saved embeddings with metadata
- Continuous Chat: Ask multiple questions in a session
- Clear Feedback: Shows when embeddings are loaded vs. created
1. Document Processing
   - PDF text extraction using `pypdf`
   - Text chunking into 800-word segments
   - Creation of vector embeddings for each chunk
2. Embedding Storage
   - Embeddings saved to the `embeddings/` directory as JSON files
   - Metadata includes: PDF name, creation date, update date, modification time
   - Automatic detection of PDF changes
3. Query Processing
   - User question converted to a vector embedding
   - Cosine similarity search finds the top 3 most relevant chunks
   - Context assembled from the relevant chunks
4. Answer Generation
   - Context and question sent to Gemini 2.5 Flash
   - LLM generates an answer based on the retrieved context
   - Answer returned to the user
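As a rough illustration, the non-API parts of these steps could be sketched as below. Function names follow this README, the 800-word chunk size and top-3 retrieval match the defaults described above, and the exact prompt wording is an assumption, not the app's actual prompt:

```python
import numpy as np

def chunk_text(text, chunk_size=800):
    """Step 1: split extracted text into roughly chunk_size-word chunks."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]

def top_chunks(question_embedding, documents, k=3):
    """Step 3: cosine-similarity search over stored chunk embeddings."""
    q = np.asarray(question_embedding, dtype=float)
    m = np.asarray([d["embedding"] for d in documents], dtype=float)
    scores = (m @ q) / (np.linalg.norm(m, axis=1) * np.linalg.norm(q))
    best = np.argsort(scores)[::-1][:k]          # indices of highest scores
    return [documents[int(i)]["text"] for i in best]

def build_prompt(question, context_chunks):
    """Step 4 (input side): assemble retrieved context and the question."""
    context = "\n\n".join(context_chunks)
    return (f"Answer the question using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}")
```

The assembled prompt is then sent to Gemini via `client.models.generate_content(model="gemini-2.5-flash", contents=prompt)`, as shown in the customization section.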
```
PDF File
  ↓
Text Extraction
  ↓
Text Chunking (800 words/chunk)
  ↓
Check if embeddings exist
  ├─ Yes → Check modification time
  │    ├─ Modified → Re-embed
  │    └─ Not Modified → Load saved embeddings
  └─ No → Create embeddings
  ↓
Save embeddings with metadata
  ↓
User Question
  ↓
Question Embedding
  ↓
Similarity Search (Cosine)
  ↓
Top 3 Chunks Retrieved
  ↓
Context + Question → LLM
  ↓
Answer Generated
```
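The caching branch in the diagram comes down to a single check. A minimal sketch, assuming the saved JSON records the PDF's modification time under a `modified_time` key as described later:

```python
import json
import os

def needs_reembedding(pdf_path, embedding_path):
    """True when no saved embeddings exist, or the PDF changed since saving."""
    if not os.path.exists(embedding_path):
        return True                    # No saved file → create embeddings
    with open(embedding_path) as f:
        saved = json.load(f)
    # Saved file exists → re-embed only if the PDF is newer than the record
    return os.path.getmtime(pdf_path) > saved.get("modified_time", 0)
```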
- Python 3.7 or higher
- Google Gemini API key
```bash
pip install pypdf google-genai python-dotenv scikit-learn numpy
```

Or create a `requirements.txt` file:

```
pypdf
google-genai
python-dotenv
scikit-learn
numpy
```

Then install:
```bash
pip install -r requirements.txt
```

1. Get a Google Gemini API Key
   - Visit Google AI Studio
   - Create a new API key
   - Copy the API key
2. Create an Environment File
   - Create a `.env` file in the project root
   - Add your API key:

     ```
     GEMINI_API_KEY=your_api_key_here
     ```
3. Add PDF Files
   - Place your PDF files in the project root directory
   - The system will automatically detect all `.pdf` files
1. Run the Application

   ```bash
   python app.py
   ```
2. Select a PDF
   - The system will show all available PDFs
   - Enter the PDF number (e.g., `1`) or the filename
   - Press Enter
3. First-Time Processing
   - If embeddings don't exist, the system will:
     - Extract text from the PDF
     - Create chunks
     - Generate embeddings (may take a moment)
     - Save embeddings to the `embeddings/` directory
4. Subsequent Runs
   - If embeddings exist and the PDF hasn't changed, the system:
     - Loads saved embeddings instantly
     - Makes no API calls for embedding generation
     - Is ready to chat immediately
5. Chat with the PDF
   - Type your question and press Enter
   - Get AI-generated answers based on the PDF content
   - Type `quit`, `exit`, or `q` to stop
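The interactive loop can be sketched in a few lines; `answer_question` here is a hypothetical stand-in for the retrieval-plus-Gemini call, and the stop words match the list above:

```python
QUIT_WORDS = {"quit", "exit", "q"}

def should_quit(user_input):
    """True when the user typed one of the stop words (case-insensitive)."""
    return user_input.strip().lower() in QUIT_WORDS

def chat_loop(answer_question):
    """Keep answering questions until the user quits.

    answer_question(question) -> str is the RAG pipeline entry point.
    """
    while True:
        question = input("Ask a question: ")
        if should_quit(question):
            break
        print(answer_question(question))
```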
```
============================================================
PDF Chat with RAG - Multiple PDF Support
============================================================
Available PDFs:
1. document1.pdf
2. document2.pdf

Saved Embeddings:
1. document1.pdf (Created: 2024-01-15T10:30:00, Updated: 2024-01-15T10:30:00, Chunks: 45)
------------------------------------------------------------
Enter PDF number (1-2) or PDF filename: 1

Selected PDF: document1.pdf
Loading saved embeddings...
Loaded 45 chunks from saved embeddings
Embeddings created: 2024-01-15T10:30:00
Last updated: 2024-01-15T10:30:00
============================================================
Chat with your PDF! (Type 'quit' or 'exit' to stop)
============================================================

Ask a question: What is the main topic of this document?
------------------------------------------------------------
Answer:
The main topic of this document is...
------------------------------------------------------------
```
1. Text Processing Module
   - `chunk_text()`: Splits text into chunks
   - `process_pdf()`: Handles PDF extraction and processing

2. Embedding Management
   - `save_embeddings()`: Saves embeddings with metadata
   - `load_embeddings()`: Loads saved embeddings
   - `needs_reembedding()`: Checks if re-embedding is needed

3. PDF Management
   - `get_pdf_metadata()`: Extracts PDF file metadata
   - `list_available_pdfs()`: Lists all PDFs in the directory
   - `list_saved_embeddings()`: Lists all saved embeddings
4. RAG Pipeline
   - Question embedding generation
   - Cosine similarity search
   - Context retrieval
   - LLM answer generation
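Of the helpers above, the PDF-management ones are plain filesystem work. A minimal sketch, assuming the `document1_pdf.json` naming seen in the file structure means dots are replaced with underscores:

```python
import glob
import os
from datetime import datetime

def list_available_pdfs(directory="."):
    """All .pdf files in the project directory, sorted by name."""
    return sorted(os.path.basename(p)
                  for p in glob.glob(os.path.join(directory, "*.pdf")))

def embedding_path_for(pdf_name, out_dir="embeddings"):
    """Map document1.pdf -> embeddings/document1_pdf.json."""
    return os.path.join(out_dir, pdf_name.replace(".", "_") + ".json")

def get_pdf_metadata(pdf_path):
    """Filesystem metadata used for the re-embedding check."""
    mtime = os.path.getmtime(pdf_path)
    return {"pdf_name": os.path.basename(pdf_path),
            "modified_time": mtime,
            "modified_iso": datetime.fromtimestamp(mtime).isoformat()}
```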
```
PDF → Text → Chunks → Embeddings → Storage (JSON)
        ↓
User Question → Embedding → Similarity Search → Top Chunks
        ↓
Context + Question → LLM → Answer
```
```
Chat with PDF - RAG/
│
├── app.py              # Main application file
├── .env                # Environment variables (API key)
├── README.md           # This file
├── requirements.txt    # Python dependencies (optional)
│
├── embeddings/         # Generated embeddings directory
│   ├── document1_pdf.json
│   ├── document2_pdf.json
│   └── ...
│
└── *.pdf               # Your PDF files
```
Each embedding file (`embeddings/*.json`) contains:
```json
{
  "pdf_name": "document.pdf",
  "pdf_path": "/absolute/path/to/document.pdf",
  "created_at": "2024-01-15T10:30:00.123456",
  "updated_at": "2024-01-15T10:30:00.123456",
  "modified_time": 1705312200.123,
  "num_chunks": 45,
  "documents": [
    {
      "text": "chunk text content...",
      "embedding": [0.123, 0.456, ...]
    },
    ...
  ]
}
```

1. One-Time Embedding
   - Embeddings created once per PDF
   - Stored locally in JSON format
   - No re-embedding unless the PDF changes
2. Smart Caching
   - Modification time tracking
   - Automatic detection of PDF updates
   - Re-embedding only when necessary
3. Efficient Storage
   - JSON format for easy access
   - No database overhead
   - Fast loading times
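Putting the storage rules together, `save_embeddings()` might look roughly like this sketch. The payload fields match the JSON layout shown above; preserving `created_at` on updates is handled by re-reading any existing file first:

```python
import json
import os
from datetime import datetime

def save_embeddings(pdf_path, documents, out_dir="embeddings"):
    """Write chunk embeddings plus metadata so later runs can skip the API."""
    os.makedirs(out_dir, exist_ok=True)
    out_path = os.path.join(
        out_dir, os.path.basename(pdf_path).replace(".", "_") + ".json")
    now = datetime.now().isoformat()
    created_at = now
    if os.path.exists(out_path):       # updating: keep the original created_at
        with open(out_path) as f:
            created_at = json.load(f).get("created_at", now)
    payload = {
        "pdf_name": os.path.basename(pdf_path),
        "pdf_path": os.path.abspath(pdf_path),
        "created_at": created_at,
        "updated_at": now,
        "modified_time": os.path.getmtime(pdf_path),
        "num_chunks": len(documents),
        "documents": documents,        # [{"text": ..., "embedding": [...]}]
    }
    with open(out_path, "w") as f:
        json.dump(payload, f)
    return out_path
```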
Without Caching:
- Every run: Embedding API calls for all chunks
- Example: 100 chunks × $0.0001 = $0.01 per run
- 10 runs = $0.10
With Caching:
- First run: $0.01 (create embeddings)
- Subsequent runs: $0.00 (load from disk)
- 10 runs = $0.01 (90% savings!)
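The arithmetic above, checked in code; the per-chunk price of $0.0001 is the illustrative figure from the example, not a published rate:

```python
chunks = 100
price_per_chunk = 0.0001   # illustrative figure from the example above
runs = 10

without_caching = runs * chunks * price_per_chunk  # every run re-embeds
with_caching = 1 * chunks * price_per_chunk        # embeddings created once
savings = 1 - with_caching / without_caching

print(f"without: ${without_caching:.2f}, "
      f"with: ${with_caching:.2f}, savings: {savings:.0%}")
```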
- Embedding Model: `text-embedding-004` (Google Gemini)
- LLM Model: `gemini-2.5-flash` (Google Gemini)
- Chunk Size: 800 words (configurable in `chunk_text()`)
- Top Chunks: 3 most similar chunks retrieved
- Similarity Metric: Cosine similarity
- `pypdf`: PDF text extraction
- `google-genai`: Google Gemini API client
- `python-dotenv`: Environment variable management
- `scikit-learn`: Cosine similarity calculation
- `numpy`: Numerical operations
- Embedding Creation: ~1-5 seconds per 100 chunks (depends on API)
- Embedding Loading: <1 second (from local JSON)
- Similarity Search: <1 second for typical documents
- Answer Generation: ~2-5 seconds (depends on API)
Edit the `chunk_text()` function:

```python
def chunk_text(text, chunk_size=800):  # Change 800 to the desired size
    # ...
```

Edit the similarity search section:

```python
top_chunks = [chunk for _, chunk in best_chunks[:3]]  # Change 3 to the desired number
```

Edit the model name:

```python
response = client.models.generate_content(
    model="gemini-2.5-flash",  # Change to the desired model
    contents=prompt
)
```

- Embeddings are stored in the `embeddings/` directory
- Each PDF gets its own JSON file based on the filename
- The system automatically creates the `embeddings/` directory
- PDF modification time is checked to determine if re-embedding is needed
- The `created_at` timestamp is preserved when updating embeddings
- Solution: Ensure PDF files are in the same directory as `app.py`
- Solution: Check that the `.env` file exists and contains `GEMINI_API_KEY=your_key`
- Solution: Check that the `embeddings/` directory exists and contains JSON files
- Solution: Verify that the JSON files are not corrupted
- Solution: Slow first-time embedding is normal for large PDFs; the system caches the results for future use.
This project is open source and available for personal and educational use.
Feel free to submit issues, fork the repository, and create pull requests for any improvements.
Happy Chatting with Your PDFs! 📚💬