Skip to content

Latest commit

 

History

History
 
 

Folders and files

NameName
Last commit message
Last commit date

parent directory

..
 
 
 
 
 
 
 
 
 
 
 
 
 
 

README.md

Proxy-Pointer Banner

Proxy-Pointer DocComparator -- Versatile Cross-Document Comparison ⚖️

DocComparator — An extension of the Proxy-Pointer architecture that performs deep, section-by-section comparison across any two documents (from complex credit agreements to scientific research papers). Instead of simple keyword matching, it leverages Agentic RAG to untangle multi-layered trade-offs, identify missing clauses, and connect legal/academic text to real-world implications.


How It Works

This diagram illustrates the modular, three-tier architecture of the versatile Document Comparator. The core comparison pipeline remains entirely decoupled from domain-specific rules, which are dynamically handled by the Upstream Extraction and Downstream Reporting layers.

graph LR
    %% Upstream Extraction Block (Left)
    A["<b>1. UPSTREAM EXTRACTION</b><br>---<br>Raw Documents<br>⬇<br>extract_pdf_to_md.py<br>⬇<br>Clean Markdown<br>⬇<br>build_doc_index.py<br>⬇<br>Skeleton Tree"]

    %% Core Comparison Block (Center)
    B["<b>2. COMPARISON ENGINE</b><br>---<br>User Criteria Prompt<br>⬇<br>criteria_validator.py<br>⬇<br>section_selector.py<br>⬇<br>cross_retriever.py<br>⬇<br>section_comparator.py"]

    %% Downstream Presentation Block (Right)
    C["<b>3. DOWNSTREAM PRESENTATION</b><br>---<br>build_comparison_prompt<br>(Personas)<br>⬇<br>report_builder.py<br>⬇<br>Streamlit UI Report"]

    %% Transitions
    A ==> B ==> C

    %% Styling
    style A fill:#f9f9fa,stroke:#b0bec5,stroke-width:2px,color:#263238
    style B fill:#f0f7fc,stroke:#64b5f6,stroke-width:2px,color:#0d47a1
    style C fill:#fff8f6,stroke:#ffab91,stroke-width:2px,color:#bf360c
Loading

Architectural Tier Breakdown

1. Upstream Extraction Layer

  • Role: Converts any incoming raw document structure into a standardized, machine-readable hierarchy.
  • Programs Involved:
    • extract_pdf_to_md.py: Handles upstream ingestion, converting PDFs into clean, hierarchically formatted Markdown (bypassed if .md is directly uploaded).
    • build_doc_index.py: Parses Markdown headers, filters administrative noise, and builds the hierarchical JSON structure map (_structure.json).

2. Core Comparison Engine

  • Role: Coordinates semantic search over hierarchical document nodes.
  • Programs Involved:
    • criteria_validator.py: Performs an initial feasibility check on the user's comparison criteria and dynamically detects the document type.
    • section_selector.py: Implements Stage 1 PP Retrieval. It identifies and extracts the most relevant sections of Document 1 based on user criteria.
    • cross_retriever.py: Implements Stage 2 PP Retrieval. It performs a targeted semantic search within Document 2's vector space using the context of the selected Document 1 sections.
    • section_comparator.py: Coordinates pairwise evaluations of matching sections, passing them to the LLM to analyze alignments and discrepancies.

3. Downstream Presentation Layer

  • Role: Tailors the analytical output to the target audience and formats the final visualization.
  • Programs Involved:
    • build_comparison_prompt: Injects the appropriate analytical persona (e.g., Senior Academic Researcher with Shared-Foundation Dampening, or Senior Legal Counsel analyzing Risk-Direction).
    • report_builder.py: Renders the final comparison report using professional scales and readable layouts, ready for markdown export.

Architecture Deep Dive

For the full technical story behind the Proxy-Pointer architecture:

  1. Proxy-Pointer Framework for Structure-Aware Enterprise Document Intelligence — Hierarchical understanding and comparison of contracts, research papers, and more
  2. Proxy-Pointer RAG: Structure Meets Scale — 100% Accuracy with Smarter Retrieval — Scaling to multi-document, LLM re-ranking, and benchmark results
  3. Proxy-Pointer RAG: Achieving Vectorless Accuracy at Vector RAG Scale and Cost — Core architecture & the pointer-based retrieval idea

5-Minute Quickstart

1. Clone

git clone https://github.com/Proxy-Pointer/Proxy-Pointer-RAG.git
cd Proxy-Pointer-RAG
cd DocComparator

2. Create Virtual Environment & Install Dependencies

We strongly recommend creating a virtual environment first:

python -m venv venv
# Windows: venv\Scripts\activate | macOS/Linux: source venv/bin/activate

You can then install dependencies using standard pip or using uv (recommended for developers).

Option A: Standard pip

Install the package:

pip install "pprag[compare]"

Option B: For Developers (using uv)

If you want to tinker with the code, this project uses uv for lightning-fast dependency management.

pip install uv
uv sync --project DocComparator

# Remember to prefix commands with `uv run` if you use this method!

3. Configure API keys

cp .env.example .env
# Edit .env → add:
# 1. GOOGLE_API_KEY
# 2. LLAMA_CLOUD_API_KEY (For PDF extraction)
# Note: Also review other commented variables, especially the FAISS trust settings required for local index loading!

4. Start the UI

# Prefix with `uv run` if you installed via Option B
pprag compare serve

5. Test with pre-loaded documents

Simply upload the .md or .pdf files from the data/uploads/ directory directly into the UI. The system will automatically build the necessary indexes, trees, and markdown files on the first run.

  • Legal comparison: Compare the Emerson and TRoadhouse credit agreements on criteria like "dispositions" or "representations and warranties".
  • Academic comparison: Compare VectorFusion and VectorPainter on criteria like "pipeline architecture" or "canvas initialization strategies".

6. Test Results

If you want to view the output reports without running the system yourself, you can look into the results/ folder, which contains pre-generated artifact reports for the test cases mentioned above.

7. Bring Your Own Documents

You can upload and compare your own documents! However, please note:

  • Upstream Adjustments: The extraction script (extract_pdf_to_md.py) may need to be adjusted so that the generated markdown captures the proper section heading hierarchy of your specific documents. This is critical for accurate skeleton tree generation and downstream processing.
  • Downstream Adjustments: If your documents are not Legal Contracts or Academic Papers, the build_comparison_prompt function and report_builder.py may need adjustment to inject the proper persona, logic dampeners, and reporting format for your domain.

Project Structure

Proxy-Pointer/
├── DocComparator/                     # Workspace subproject
│   ├── data/                          # Unified Data Hub
│   │   └── uploads/                   # Raw PDFs and test documents
│   ├── results/                       # Artifact reports for the test cases tried
│   ├── README.md                      # This file
│   └── pyproject.toml                 # Workspace package configuration
├── src/
│   └── pprag_doc_comparator/          # Package source code (installed via pip)
│       ├── comparison/
│       │   ├── cross_retriever.py     # Stage 2 PP Retrieval (Doc 2)
│       │   ├── section_comparator.py  # Pairwise LLM evaluation engine
│       │   └── section_selector.py    # Stage 1 PP Retrieval (Doc 1)
│       ├── extraction/
│       │   └── extract_pdf_to_md.py   # LlamaParse PDF ingestion & formatting
│       ├── indexing/
│       │   └── build_doc_index.py     # Skeleton tree & FAISS vector builder
│       ├── report/
│       │   └── report_builder.py      # Markdown report generation logic
│       ├── validation/
│       │   └── criteria_validator.py  # Persona injection & criteria feasibility
│       ├── config.py                  # Core configurations and model definitions
│       └── app.py                     # Streamlit app code

Configuration

All configuration is centralized in src/pprag_doc_comparator/config.py. Override via environment variables:

Variable Default Description
GOOGLE_API_KEY (required) Gemini API key
LLAMA_CLOUD_API_KEY (required) LlamaParse API key for PDF extraction
DC_UPLOADS_DIR data/uploads/ Uploads and raw testing files
DC_DOCUMENTS_DIR data/documents/ Processed Markdown source directory
DC_TREES_DIR data/trees/ Structure tree directory
DC_INDEX_DIR data/index/ FAISS index directory
DC_EMBEDDING_BATCH_SIZE 20 Number of chunks embedded per Gemini request during indexing
DC_EMBEDDING_BATCH_DELAY 1 Seconds to wait between embedding batches during indexing
DC_COMPARE_CONCURRENCY 3 Maximum parallel LLM section comparisons per selected Doc 1 section

Indexing Throughput

DocComparator builds a shared FAISS index for the uploaded documents. Embedding requests are batched and retried on transient quota errors. Increase DC_EMBEDDING_BATCH_SIZE for faster indexing when quota allows it; lower the batch size or increase DC_EMBEDDING_BATCH_DELAY if Gemini returns 429 or resource-exhausted responses.

Comparison Latency

DocComparator compares each selected Doc 1 section against up to MAX_DOC2_MATCHES retrieved Doc 2 sections. These comparison calls now run with bounded concurrency controlled by DC_COMPARE_CONCURRENCY, while preserving input order in the final report. Lower the value if you hit LLM rate limits; increase it if your quota allows more parallel requests.


Design Decisions

Bypassing Surface-Level Keyword Matching

Instead of a simple keyword search, the Agentic pipeline performs deep, semantic contractual reasoning. It understands that the absence of a word ("safe harbor" or "projections") completely flips the legal risk profile for a borrower.

Enterprise-Value Preservation

The tool bridges text to real-world business strategy. When analyzing credit agreements, it reveals how mature industrial giants (like Emerson) tolerate entirely different disposition covenants compared to highly regulated mid-market growth companies (like Texas Roadhouse).

Auto-Detecting Modalities and Personas

The engine uses an LLM to evaluate the criteria and the document text. It dynamically detects whether the user is comparing legal contracts or academic research papers, and adjusts the prompt persona and logic dampeners (e.g. changing 🔴 Significant Discrepancy to 🟡 Moderate Difference when academic papers share similar underlying foundations).


Author

Partha Sarkar

Contact

  • GitHub Issues: For bug reports.
  • General Questions: For general questions, ideas, and enhancement requests, reach out to me on LinkedIn or Email.

License

© 2026 Partha Sarkar (Proxy-Pointer). Licensed under MIT.