darknight054/Langraph-Agent

RAG Document Q&A System

A production-grade Retrieval-Augmented Generation (RAG) system built with LangGraph, featuring:

  • DeepSeek OCR via vLLM for document text extraction
  • LangGraph for agentic workflow orchestration with validation loops
  • ChromaDB for persistent vector storage
  • Qwen3-Embedding-0.6B for embeddings and GPT-4o-mini for generation
  • Contextual Chunking based on Anthropic's Contextual Retrieval research (Anthropic reports up to 49% fewer failed retrievals); support is included but untested, so it may contain errors
  • Semantic Chunking with 768-token chunks and a 50-token sliding-window overlap
  • Streamlit web interface for document upload and chat
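
The sliding-window part of the chunking feature can be illustrated with a minimal sketch. It only shows how a chunk size of 768 and an overlap of 50 interact; the function name `sliding_window_chunks` is hypothetical, pre-split tokens stand in for a real tokenizer, and the project's chunker (which also looks for semantic boundaries) may work differently.

```python
def sliding_window_chunks(tokens, chunk_size=768, overlap=50):
    """Split a token list into chunks of at most `chunk_size` tokens.

    Consecutive chunks share `overlap` tokens, so text that straddles
    a chunk boundary still appears whole in at least one chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the window already reached the end of the document
    return chunks


# Illustrative input: 2000 placeholder tokens instead of real tokenizer output.
tokens = [f"tok{i}" for i in range(2000)]
chunks = sliding_window_chunks(tokens)
print(len(chunks), [len(c) for c in chunks])  # 3 [768, 768, 564]
```

With these values the window advances 718 tokens at a time, so the last 50 tokens of each chunk reappear at the start of the next one.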

Demo

Download the demo video

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        Streamlit UI                              │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐  │
│  │  Upload Page    │  │   Chat Page     │  │  Manage Page    │  │
│  └────────┬────────┘  └────────┬────────┘  └────────┬────────┘  │
└───────────┼─────────────────────┼─────────────────────┼──────────┘
            │                     │                     │
            ▼                     ▼                     ▼
┌───────────────────────┐  ┌──────────────────────────────────────────────┐
│   Ingestion Pipeline  │  │             LangGraph Workflow                 │
│  ┌─────────────────┐  │  │  ┌────────┐ ┌─────────┐ ┌─────────┐ ┌──────┐  │
│  │  PDF Processor  │  │  │  │Query   │→│Retriever│→│Generator│→│Valid.│  │
│  └────────┬────────┘  │  │  │Rewriter│ └─────────┘ └────┬────┘ └──┬───┘  │
│           ▼           │  │  └────────┘                  │         │      │
│  ┌─────────────────┐  │  │                              ◄─────────┘      │
│  │  DeepSeek OCR   │  │  │                           (retry if invalid)  │
│  └────────┬────────┘  │  │                              │                │
│           ▼           │  │                              ▼                │
│  ┌─────────────────┐  │  │                        ┌─────────┐            │
│  │ Text Cleaner    │  │  │                        │Response │            │
│  └────────┬────────┘  │  │                        └─────────┘            │
│           ▼           │  └──────────────────────────────────────────────┘
│  ┌─────────────────┐  │                     │
│  │Contextual Chunk │  │                     │
│  └────────┬────────┘  │                     │
└───────────┼───────────┘                     │
            │                                 │
            ▼                                 ▼
      ┌─────────────────────────────────────────────┐
      │              ChromaDB Vector Store           │
      └─────────────────────────────────────────────┘

Project Structure

assignment-langraph/
├── pyproject.toml              # Single config with all dependencies
├── .env.example                # Environment template
├── data/                       # Sample PDFs
│
├── common/                     # Shared utilities
│   ├── config.py               # Pydantic settings
│   └── logging.py              # structlog configuration
│
├── ingestion/                  # Document ingestion pipeline
│   ├── ocr/                    # DeepSeek OCR client
│   ├── processor/              # PDF, cleaner, chunker
│   ├── vectorstore/            # ChromaDB
│   ├── pipeline.py             # Main orchestrator
│   └── cli.py                  # CLI interface
│
├── agents/                     # LangGraph RAG workflow
│   ├── nodes/                  # Query Rewriter, Retriever, Generator, Validator, Response
│   ├── state.py                # Shared state schema (includes chat_history)
│   ├── graph.py                # LangGraph workflow
│   └── chat.py                 # Chat session with conversation history
│
├── app/                        # Streamlit UI
│   ├── main.py                 # Entry point
│   └── pages/                  # Upload, Chat pages
│
└── tests/

Prerequisites

  1. Python 3.11+

  2. UV package manager: https://docs.astral.sh/uv/

  3. DeepSeek OCR vLLM server (for OCR):

    The project includes a Docker-based infrastructure setup for running DeepSeek OCR locally.

    Quick start:

    cd infrastructure
    # Download model weights (~35GB)
    mkdir -p models
    huggingface-cli download deepseek-ai/DeepSeek-OCR --local-dir models/deepseek-ai/DeepSeek-OCR
    
    # Build and run
    docker-compose up -d

    See infrastructure/README.md for full setup instructions, prerequisites, and troubleshooting.

Setup

  1. Clone and navigate:

    git clone <repository-url>
    cd assignment-langraph

  2. Create environment file:

    cp .env.example .env
    # Edit .env with your API keys

  3. Install dependencies:

    uv sync

Usage

Command Line Interface

Ingest a document:

uv run ingestion ingest data/document.pdf

# With page range:
uv run ingestion ingest data/document.pdf --start-page 1 --end-page 10

List ingested documents:

uv run ingestion list

Search documents:

uv run ingestion search "What is the main topic?"

Delete a document:

uv run ingestion delete <document_id>

Streamlit Web Interface

uv run streamlit run app/main.py

Then open http://localhost:8501 in your browser.

Python API

from ingestion import IngestionPipeline
from agents import ChatSession

# Ingest a document
pipeline = IngestionPipeline()
result = pipeline.ingest("document.pdf", start_page=1, end_page=10)
print(f"Ingested {result['chunks_created']} chunks")

# Chat with documents
session = ChatSession()
response = session.chat("What is the main topic of the document?")
print(response.content)

LangGraph Workflow

The RAG workflow follows this pattern with query contextualization and a retry loop for validation:

START → Query Rewriter → Retriever → Generator → Validator
                                                      │
                                              [is_valid?]
                                             /           \
                                        Yes /             \ No (retry < 3)
                                           ↓               ↓
                                      Response ←── Generator (with feedback)
                                           ↓
                                          END

Agents:

  1. Query Rewriter: Contextualizes queries using conversation history (e.g., "tell me more about it" → "tell me more about machine learning")
  2. Retriever: Fetches top-k relevant chunks from ChromaDB using the contextualized query
  3. Generator: Generates an answer using GPT-4o-mini with the retrieved context
  4. Validator: Checks the generated answer for hallucinations using structured output
  5. Response: Formats the final response with sources
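
The control flow of the five nodes above can be sketched in plain Python. This is not the project's `agents/graph.py`, which builds the flow with LangGraph; the function `run_workflow` and the stub nodes are hypothetical, and returning the last answer once retries are exhausted is an assumption, since the diagram only shows the `retry < 3` branch.

```python
MAX_RETRY_COUNT = 3  # mirrors the MAX_RETRY_COUNT setting


def run_workflow(query, rewrite, retrieve, generate, validate,
                 max_retries=MAX_RETRY_COUNT):
    """Rewrite -> retrieve -> generate -> validate, regenerating on failure."""
    contextualized = rewrite(query)
    chunks = retrieve(contextualized)
    feedback = None
    retries = 0
    while True:
        answer = generate(contextualized, chunks, feedback)
        is_valid, feedback = validate(answer, chunks)
        # Assumption: when retries run out, the last answer is returned as-is.
        if is_valid or retries >= max_retries:
            break
        retries += 1
    return {"answer": answer, "is_valid": is_valid,
            "retries": retries, "sources": chunks}


# Stub nodes for illustration only; real nodes would call ChromaDB and an LLM.
def rewrite(query):
    return query.strip()


def retrieve(query):
    return ["Chunk about ancient India."]


def generate(query, chunks, feedback):
    # This stub only grounds its answer after it has received feedback.
    return chunks[0] if feedback else "An unsupported claim."


def validate(answer, chunks):
    ok = answer in chunks
    return ok, (None if ok else "Answer is not supported by the retrieved chunks.")


result = run_workflow("  What is the main topic?  ", rewrite, retrieve, generate, validate)
print(result["is_valid"], result["retries"])  # True 1
```

Passing the validator's feedback back into the generator is what distinguishes this loop from a blind retry: each regeneration is told why the previous answer was rejected.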

Conversation Context:

  • The system maintains chat history across messages in a session
  • The Query Rewriter resolves references like "it", "that", "the previous topic" using conversation context
  • The original query is preserved separately from the contextualized query for logging/debugging
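
A minimal sketch of how a rewriter node might keep both queries in the shared state follows. The field names in `RAGState`, the prompt wording, and the `llm` callable are all assumptions made for illustration; the actual schema lives in `agents/state.py` and may differ.

```python
from typing import Callable, TypedDict


class RAGState(TypedDict):
    original_query: str                  # what the user typed, kept for logging
    contextualized_query: str            # standalone query used for retrieval
    chat_history: list[tuple[str, str]]  # (role, text) pairs, oldest first


def rewrite_query(state: RAGState, llm: Callable[[str], str]) -> RAGState:
    """Resolve references such as "it" using the conversation history."""
    if not state["chat_history"]:
        # Nothing to resolve against, so the query is used unchanged.
        return {**state, "contextualized_query": state["original_query"]}

    history = "\n".join(f"{role}: {text}" for role, text in state["chat_history"])
    prompt = (
        "Rewrite the final question so it can be understood without the "
        "conversation. Return only the rewritten question.\n\n"
        f"Conversation:\n{history}\n\nQuestion: {state['original_query']}"
    )
    return {**state, "contextualized_query": llm(prompt).strip()}


# A canned response stands in for a real model call.
def fake_llm(prompt: str) -> str:
    return "tell me more about machine learning"


state: RAGState = {
    "original_query": "tell me more about it",
    "contextualized_query": "",
    "chat_history": [
        ("user", "What is machine learning?"),
        ("assistant", "Machine learning is a field of AI."),
    ],
}
new_state = rewrite_query(state, fake_llm)
print(new_state["original_query"], "->", new_state["contextualized_query"])
```

Because the node returns a new dictionary rather than overwriting `original_query`, both versions remain available to later nodes and to the logs.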

Observability

Langfuse (Recommended)

Use a self-hosted Langfuse instance or Langfuse Cloud for tracing:

# .env
LANGFUSE_PUBLIC_KEY=pk-...
LANGFUSE_SECRET_KEY=sk-...
LANGFUSE_HOST=https://cloud.langfuse.com

structlog

JSON logging for production:

# .env
LOG_FORMAT=json
LOG_LEVEL=INFO

Configuration

All settings are configured via environment variables or a .env file:

Variable            Default                  Description
OPENAI_API_KEY      (required)               OpenAI API key
DEEPSEEK_OCR_URL    http://localhost:8000/   vLLM server URL
CHROMA_PERSIST_DIR  ./chroma_db              Vector store path
CHUNK_SIZE          512                      Tokens per chunk
CHUNK_OVERLAP       50                       Overlap tokens
CHUNKING_STRATEGY   contextual               semantic or contextual
RETRIEVAL_TOP_K     5                        Chunks to retrieve
MAX_RETRY_COUNT     3                        Max validation retries
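
How these variables and defaults map onto a typed settings object can be sketched with the standard library alone. The project itself uses Pydantic settings in `common/config.py`; the `Settings` dataclass and `load_settings` function below are hypothetical stand-ins, and reading a `.env` file is left out.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    openai_api_key: str
    deepseek_ocr_url: str = "http://localhost:8000/"
    chroma_persist_dir: str = "./chroma_db"
    chunk_size: int = 512
    chunk_overlap: int = 50
    chunking_strategy: str = "contextual"
    retrieval_top_k: int = 5
    max_retry_count: int = 3


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from environment variables, applying the table defaults."""
    env = os.environ if env is None else env
    if not env.get("OPENAI_API_KEY"):
        raise ValueError("OPENAI_API_KEY is required")

    strategy = env.get("CHUNKING_STRATEGY", Settings.chunking_strategy)
    if strategy not in {"semantic", "contextual"}:
        raise ValueError("CHUNKING_STRATEGY must be 'semantic' or 'contextual'")

    return Settings(
        openai_api_key=env["OPENAI_API_KEY"],
        deepseek_ocr_url=env.get("DEEPSEEK_OCR_URL", Settings.deepseek_ocr_url),
        chroma_persist_dir=env.get("CHROMA_PERSIST_DIR", Settings.chroma_persist_dir),
        chunk_size=int(env.get("CHUNK_SIZE", Settings.chunk_size)),
        chunk_overlap=int(env.get("CHUNK_OVERLAP", Settings.chunk_overlap)),
        chunking_strategy=strategy,
        retrieval_top_k=int(env.get("RETRIEVAL_TOP_K", Settings.retrieval_top_k)),
        max_retry_count=int(env.get("MAX_RETRY_COUNT", Settings.max_retry_count)),
    )


# A plain dict stands in for the process environment in this example.
settings = load_settings({"OPENAI_API_KEY": "sk-example", "CHUNK_SIZE": "768"})
print(settings.chunk_size, settings.chunk_overlap, settings.chunking_strategy)
```

Only the values that are set override the defaults, so the example above changes the chunk size while the overlap and strategy keep their table values.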

Sample Data

The data/ directory contains sample PDFs for testing:

  1. handwritten.pdf - Handwritten notes demonstrating OCR capabilities
  2. normal.pdf - Historical text about ancient India
