A Retrieval-Augmented Generation (RAG) chatbot that answers questions using information from your PDF documents and shows the source of each response (PDF name and page number).

Install the dependencies:

    pip install -r requirements.txt

Create a .env file in the root directory:

    # Cloud LLM API Key
    GEMINI_API_KEY=your_gemini_api_key # Provide your cloud provider's API key

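
As a rough illustration of how a script could read this key, here is a minimal stdlib-only sketch. It assumes the .env file contains simple KEY=value lines like the ones above. The helper names `load_env_file` and `get_api_key` are hypothetical and are not taken from this project; the python-dotenv package provides comparable behavior through `load_dotenv()`.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse simple KEY=value lines into a dict (hypothetical helper)."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, full-line comments, and lines without "=".
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop a trailing inline comment such as "value # note".
        value = value.split(" #", 1)[0].strip()
        values[key.strip()] = value
    return values


def get_api_key(path=".env"):
    """Prefer a real environment variable, then fall back to the .env file."""
    key = os.environ.get("GEMINI_API_KEY") or load_env_file(path).get("GEMINI_API_KEY")
    if not key:
        raise RuntimeError("GEMINI_API_KEY is not set; add it to .env")
    return key
```

Failing early with a clear error when the key is missing tends to be easier to debug than an authentication error raised later by the model provider.
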
Create a folder named pdfs in the root directory (if it does not exist). Place all PDFs inside the pdfs/ folder, then run:

    python 1_ingest.py

This will:
- Load all PDFs
- Split text into chunks
- Generate embeddings
- Store them in the vector database
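
The "split text into chunks" step above can be sketched as follows. This is only an illustration under stated assumptions: the `Chunk` class, the `split_page` function, and the sizes (500 characters with a 50-character overlap) are hypothetical and may differ from what 1_ingest.py actually does. PDF loading, embedding, and storage are not shown.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    source: str  # PDF file name, kept so answers can cite it
    page: int    # 1-based page number


def split_page(text, source, page, chunk_size=500, overlap=50):
    """Split one page of text into overlapping character windows."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size].strip()
        if piece:
            chunks.append(Chunk(piece, source, page))
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

Carrying the file name and page number on every chunk is what makes it possible to show the source of an answer later.
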

Start the chatbot:

    streamlit run 2_streamlit_app.py

This will:
- Launch the Streamlit chatbot interface
- Internally use rag.py for retrieval and response generation
- Display the source of each answer (PDF name and page number)
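
The retrieval and source-display steps above can be pictured with a small sketch. It assumes each stored entry is a dict with `vector`, `text`, `source`, and `page` keys and that ranking uses cosine similarity; both are assumptions for illustration, and the function names are hypothetical rather than the contents of rag.py.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_vector, index, top_k=3):
    """Return the top_k entries most similar to the query vector."""
    ranked = sorted(
        index,
        key=lambda entry: cosine_similarity(query_vector, entry["vector"]),
        reverse=True,
    )
    return ranked[:top_k]


def format_sources(entries):
    """Build a de-duplicated 'file (page N)' list in ranking order."""
    labels = []
    for entry in entries:
        label = f"{entry['source']} (page {entry['page']})"
        if label not in labels:
            labels.append(label)
    return ", ".join(labels)
```

In a full pipeline, the text of the retrieved entries would be passed to the language model as context, and the formatted source string would be shown alongside the answer.
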

Notes:
- Make sure ingestion has completed before running the chatbot.
- Add new PDFs to the pdfs/ folder and rerun 1_ingest.py to update the embeddings.
- Embeddings are currently stored locally in the root directory. The storage layer can be replaced with any vector database (FAISS, Chroma, Pinecone, etc.).
- The project currently uses Gemini via GEMINI_API_KEY. The cloud model provider can be changed as needed.
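
One way to keep the storage layer replaceable, as the note on embeddings suggests, is to hide it behind a small class. The sketch below is a hypothetical stand-in that persists entries to a local JSON file; the class name `JsonVectorStore` and its methods are assumptions for illustration and do not describe how this project stores embeddings.

```python
import json
from pathlib import Path


class JsonVectorStore:
    """Toy local store: keeps entries in memory and persists them as JSON."""

    def __init__(self, path):
        self.path = Path(path)
        self.entries = []
        # Reload previously saved entries if the file already exists.
        if self.path.exists():
            self.entries = json.loads(self.path.read_text(encoding="utf-8"))

    def add(self, vector, text, source, page):
        """Record one embedded chunk together with its source metadata."""
        self.entries.append(
            {"vector": list(vector), "text": text, "source": source, "page": page}
        )

    def save(self):
        """Write all entries to disk."""
        self.path.write_text(json.dumps(self.entries), encoding="utf-8")
```

A class backed by FAISS, Chroma, or Pinecone that exposes the same `add` and `save` methods could then be swapped in without changing the ingestion or chat code.
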

