RAG Document Q&A System

A production-grade Retrieval-Augmented Generation (RAG) system built with LangGraph, featuring:

DeepSeek OCR via vLLM for document text extraction
LangGraph for agentic workflow orchestration with validation loops
ChromaDB for persistent vector storage
Qwen3-Embedding-0.6B for embeddings and generation
Contextual Chunking based on Anthropic's research (49% retrieval improvement) (support included not tried so might have errors)
Semantic Chunking 768 tokens and 50 as sliding window
Streamlit web interface for document upload and chat

Demo

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        Streamlit UI                              │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐  │
│  │  Upload Page    │  │   Chat Page     │  │  Manage Page    │  │
│  └────────┬────────┘  └────────┬────────┘  └────────┬────────┘  │
└───────────┼─────────────────────┼─────────────────────┼──────────┘
            │                     │                     │
            ▼                     ▼                     ▼
┌───────────────────────┐  ┌──────────────────────────────────────────────┐
│   Ingestion Pipeline  │  │             LangGraph Workflow                 │
│  ┌─────────────────┐  │  │  ┌────────┐ ┌─────────┐ ┌─────────┐ ┌──────┐  │
│  │  PDF Processor  │  │  │  │Query   │→│Retriever│→│Generator│→│Valid.│  │
│  └────────┬────────┘  │  │  │Rewriter│ └─────────┘ └────┬────┘ └──┬───┘  │
│           ▼           │  │  └────────┘                  │         │      │
│  ┌─────────────────┐  │  │                              ◄─────────┘      │
│  │  DeepSeek OCR   │  │  │                           (retry if invalid)  │
│  └────────┬────────┘  │  │                              │                │
│           ▼           │  │                              ▼                │
│  ┌─────────────────┐  │  │                        ┌─────────┐            │
│  │ Text Cleaner    │  │  │                        │Response │            │
│  └────────┬────────┘  │  │                        └─────────┘            │
│           ▼           │  └──────────────────────────────────────────────┘
│  ┌─────────────────┐  │                     │
│  │Contextual Chunk │  │                     │
│  └────────┬────────┘  │                     │
└───────────┼───────────┘                     │
            │                                 │
            ▼                                 ▼
      ┌─────────────────────────────────────────────┐
      │              ChromaDB Vector Store           │
      └─────────────────────────────────────────────┘

Project Structure

assignment-langraph/
├── pyproject.toml              # Single config with all dependencies
├── .env.example                # Environment template
├── data/                       # Sample PDFs
│
├── common/                     # Shared utilities
│   ├── config.py               # Pydantic settings
│   └── logging.py              # structlog configuration
│
├── ingestion/                  # Document ingestion pipeline
│   ├── ocr/                    # DeepSeek OCR client
│   ├── processor/              # PDF, cleaner, chunker
│   ├── vectorstore/            # ChromaDB
│   ├── pipeline.py             # Main orchestrator
│   └── cli.py                  # CLI interface
│
├── agents/                     # LangGraph RAG workflow
│   ├── nodes/                  # Query Rewriter, Retriever, Generator, Validator, Response
│   ├── state.py                # Shared state schema (includes chat_history)
│   ├── graph.py                # LangGraph workflow
│   └── chat.py                 # Chat session with conversation history
│
├── app/                        # Streamlit UI
│   ├── main.py                 # Entry point
│   └── pages/                  # Upload, Chat pages
│
└── tests/

Prerequisites

Python 3.11+
UV package manager: https://docs.astral.sh/uv/
DeepSeek OCR vLLM server (for OCR):

The project includes a Docker-based infrastructure setup for running DeepSeek OCR locally.

Quick start:
```
cd infrastructure
# Download model weights (~35GB)
mkdir -p models
huggingface-cli download deepseek-ai/DeepSeek-OCR --local-dir models/deepseek-ai/DeepSeek-OCR

# Build and run
docker-compose up -d
```
See infrastructure/README.md for full setup instructions, prerequisites, and troubleshooting.

Setup

Clone and navigate:

git clone <repository-url>
cd assignment-langraph

Create environment file:

cp .env.example .env
# Edit .env with your API keys

Install dependencies:
```
uv sync
```

Usage

Command Line Interface

Ingest a document:

uv run ingestion ingest data/document.pdf

# With page range:
uv run ingestion ingest data/document.pdf --start-page 1 --end-page 10

List ingested documents:

uv run ingestion list

Search documents:

uv run ingestion search "What is the main topic?"

Delete a document:

uv run ingestion delete <document_id>

Streamlit Web Interface

uv run streamlit run app/main.py

Then open http://localhost:8501 in your browser.

Python API

from ingestion import IngestionPipeline
from agents import ChatSession

# Ingest a document
pipeline = IngestionPipeline()
result = pipeline.ingest("document.pdf", start_page=1, end_page=10)
print(f"Ingested {result['chunks_created']} chunks")

# Chat with documents
session = ChatSession()
response = session.chat("What is the main topic of the document?")
print(response.content)

LangGraph Workflow

The RAG workflow follows this pattern with query contextualization and a retry loop for validation:

START → Query Rewriter → Retriever → Generator → Validator
                                                      │
                                              [is_valid?]
                                             /           \
                                        Yes /             \ No (retry < 3)
                                           ↓               ↓
                                      Response ←── Generator (with feedback)
                                           ↓
                                          END

Agents:

Query Rewriter: Contextualizes queries using conversation history (e.g., "tell me more about it" → "tell me more about machine learning")
Retriever: Fetches top-k relevant chunks from ChromaDB using the contextualized query
Generator: Generates answer using GPT-4o-mini with retrieved context
Validator: Checks for hallucinations using structured output
Response: Formats final response with sources

Conversation Context:

The system maintains chat history across messages in a session
The Query Rewriter resolves references like "it", "that", "the previous topic" using conversation context
Original query is preserved separately from the contextualized query for logging/debugging

Observability

Langfuse (Recommended)

Self-host or use cloud for tracing:

# .env
LANGFUSE_PUBLIC_KEY=pk-...
LANGFUSE_SECRET_KEY=sk-...
LANGFUSE_HOST=https://cloud.langfuse.com

structlog

JSON logging for production:

# .env
LOG_FORMAT=json
LOG_LEVEL=INFO

Configuration

All settings via environment variables or .env:

Variable	Default	Description
`OPENAI_API_KEY`	(required)	OpenAI API key
`DEEPSEEK_OCR_URL`	`http://localhost:8000/`	vLLM server URL
`CHROMA_PERSIST_DIR`	`./chroma_db`	Vector store path
`CHUNK_SIZE`	`512`	Tokens per chunk
`CHUNK_OVERLAP`	`50`	Overlap tokens
`CHUNKING_STRATEGY`	`contextual`	`semantic` or `contextual`
`RETRIEVAL_TOP_K`	`5`	Chunks to retrieve
`MAX_RETRY_COUNT`	`3`	Max validation retries

Sample Data

The data/ directory contains sample PDFs for testing:

handwritten.pdf - Handwritten notes demonstrating OCR capabilities
normal.pdf - Historical text about ancient India

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

RAG Document Q&A System

Demo

Architecture

Project Structure

Prerequisites

Setup

Usage

Command Line Interface

Streamlit Web Interface

Python API

LangGraph Workflow

Observability

Langfuse (Recommended)

structlog

Configuration

Sample Data

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 26 Commits
agents		agents
app		app
common		common
data		data
demo		demo
infrastructure		infrastructure
ingestion		ingestion
tests		tests
.env.example		.env.example
.gitignore		.gitignore
INSTRUCTION.md		INSTRUCTION.md
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml

Folders and files

Latest commit

History

Repository files navigation

RAG Document Q&A System

Demo

Architecture

Project Structure

Prerequisites

Setup

Usage

Command Line Interface

Streamlit Web Interface

Python API

LangGraph Workflow

Observability

Langfuse (Recommended)

structlog

Configuration

Sample Data

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages