Proxy-Pointer

Structural RAG for Document Analysis — A hierarchical, pointer-based RAG pipeline that retrieves full document sections using structural tree navigation instead of blind vector similarity.

How It Works

graph TD
    A[PDF Documents] -->|LlamaParse| B[Markdown Files]
    B -->|Skeleton Tree Builder| C[Structure Trees]
    C -->|LLM Noise Filter| D[Content Nodes]
    B --> E[Text Chunks]
    D --> E
    E -->|Gemini Embed 1536d| F[FAISS Index]

    G[User Query] -->|Gemini Embed| H[Vector Search k=200]
    F --> H
    H -->|Dedup by node_id| I[Top 50 Unique Candidates]
    I -->|LLM Re-Ranker| J[Top 5 Nodes]
    J -->|Load Full Sections| K[Rich Context]
    K -->|LLM Synthesizer| L[Answer]

Instead of retrieving small, context-less chunks, Proxy-Pointer:

Builds a structural tree of each document (like a table of contents) — pure Python, no external deps
Filters noise (TOC, abbreviations, foreword, etc.) using an LLM
Indexes structural pointers — each chunk carries metadata about its position in the document hierarchy
Re-ranks by structure — an LLM re-ranker selects the most relevant sections by their hierarchical path, not just embedding similarity
Loads full sections — the synthesizer sees complete document sections, not truncated 2000-char chunks

Architecture Deep Dive

For the full technical story behind the architecture:

Proxy-Pointer RAG: Achieving Vectorless Accuracy at Vector RAG Scale and Cost — Core architecture & the pointer-based retrieval idea
Proxy-Pointer RAG: Structure Meets Scale — 100% Accuracy with Smarter Retrieval — Scaling to multi-document, LLM re-ranking, and benchmark results

Benchmark Results

Proxy-Pointer has been evaluated against the FinanceBench dataset using four FY2022 10-K filings (AMD, American Express, Boeing, PepsiCo). In addition, I created a set of 40 significantly more complex questions requiring multi-step reasoning and calculations, causal and attribution analysis, and adversarial reasoning, which I call the Comprehensive benchmark. The results are as follows:

| Benchmark | Question types | k_final | Accuracy |
|---|---|---|---|
| FinanceBench (26 questions) | Qualitative + quantitative | k=5 | 100% (26/26) |
| FinanceBench (26 questions) | Qualitative + quantitative | k=3 | 96.2% (25/26) |
| Comprehensive (40 questions) | Complex financial calculations | k=5 | 100% (40/40) |
| Comprehensive (40 questions) | Complex financial calculations | k=3 | 92.5% (37/40) |

Full scorecards and comparison logs are available in data/Benchmark/. Refer to the Comprehensive k=5 and Comprehensive k=3 folders for the detailed logs and scorecards.
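The percentages in the results table follow directly from the correct/total counts. A quick sanity check in Python, where the `scores` dict is only a restatement of the table (it is not output from the benchmark script) and `accuracy_pct` is a helper name made up for this example:

```python
# Correct/total counts copied from the results table.
scores = {
    ("FinanceBench", 5): (26, 26),
    ("FinanceBench", 3): (25, 26),
    ("Comprehensive", 5): (40, 40),
    ("Comprehensive", 3): (37, 40),
}


def accuracy_pct(correct: int, total: int) -> float:
    """Accuracy as a percentage, rounded to one decimal place."""
    return round(100 * correct / total, 1)


for (name, k), (correct, total) in scores.items():
    print(f"{name} k={k}: {accuracy_pct(correct, total)}% ({correct}/{total})")
```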

5-Minute Quickstart

A pre-extracted Markdown file (AMD.md) for AMD's FY2022 10-K is included so you can build the index and start querying immediately — no PDF extraction required.

Want to try more companies? Additional Markdown files for American Express, Boeing, and PepsiCo are available in data/documents/md_files/. Copy them into data/documents/ and rebuild the index.

1. Clone

git clone https://github.com/Proxy-Pointer/Proxy-Pointer-RAG.git
cd Proxy-Pointer-RAG
cd Text-Only

2. Create Virtual Environment & Install Dependencies

We strongly recommend creating a virtual environment first:

python -m venv venv
# Windows: venv\Scripts\activate | macOS/Linux: source venv/bin/activate

You can then install dependencies using standard pip or using uv (recommended for developers).

Option A: Standard pip

Install the package:

pip install "pprag[text]"

Option B: For Developers (using uv)

If you want to tinker with the code, this project uses uv for lightning-fast dependency management.

pip install uv
uv sync --project Text-Only
# Or by package name: uv sync --package pprag-text

# Remember to prefix commands with `uv run` if you use this method!

3. Configure API keys

cp .env.example .env
# Edit .env → add your GOOGLE_API_KEY
# Note: Also review other commented variables, especially the FAISS trust settings required for local index loading!

4. Build the index

To build the FAISS index from scratch for the first time:

# Prefix with `uv run` if you installed via Option B
pprag text index --fresh

(Note: If you add more documents later, simply run pprag text index without the --fresh flag to incrementally append only the new files!)

5. Start querying

# Prefix with `uv run` if you installed via Option B
pprag text ask

Try a query like:

User >> What is AMD's quick ratio for FY2022? Quick ratio is quick assets (Cash and cash equivalents + Short-term investments + Accounts receivable, net) divided by current liabilities

Or a more advanced comparative one:

User >> For AMD, what percentage of total revenue growth from FY2021 to FY2022 is attributable to the Embedded segment (including Xilinx)?

Running Benchmarks

The benchmark script evaluates the pipeline against an Excel dataset containing Question and Answer (or Ground Truth) columns. It runs each question through the bot, uses an LLM-as-a-judge to grade the response, and generates a timestamped log and scorecard in data/results/.
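The evaluation loop described above can be sketched roughly as follows. This is a minimal illustration, not the code in benchmark.py: `ask` and `judge` are hypothetical callables standing in for the RAG bot and the LLM judge, and rows are plain dicts rather than a loaded Excel sheet.

```python
from typing import Callable, Iterable


def pick_answer_column(columns: Iterable[str]) -> str:
    """Return the ground-truth column name: 'Answer' or 'Ground Truth'."""
    columns = list(columns)
    for name in ("Answer", "Ground Truth"):
        if name in columns:
            return name
    raise KeyError("expected an 'Answer' or 'Ground Truth' column")


def run_benchmark(
    rows: list[dict],
    ask: Callable[[str], str],               # hypothetical: question -> bot response
    judge: Callable[[str, str, str], bool],  # hypothetical: (question, truth, response) -> pass?
) -> dict:
    """Run every question through the bot and grade it (rows must be non-empty)."""
    answer_col = pick_answer_column(rows[0].keys())
    results = []
    for row in rows:
        response = ask(row["Question"])
        correct = judge(row["Question"], row[answer_col], response)
        results.append({"question": row["Question"], "correct": correct})
    passed = sum(r["correct"] for r in results)
    return {"passed": passed, "total": len(results), "results": results}


# Usage with trivial stand-ins for the bot and the judge.
rows = [{"Question": "2+2?", "Ground Truth": "4"}, {"Question": "3+3?", "Ground Truth": "6"}]
card = run_benchmark(rows, ask=lambda q: "4", judge=lambda q, truth, resp: truth == resp)
print(f"{card['passed']}/{card['total']} correct")
```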

We provide a robust Comprehensive benchmark that evaluates the bot across all four dataset companies (AMD, AMEX, Boeing, and PepsiCo). Because only AMD is indexed by default in the quickstart, follow these steps to run the benchmark successfully:

Copy the remaining documents: Move all .md files from data/documents/md_files/ into the main data/documents/ directory.
Rebuild the index: Run the indexer from scratch to ensure the bot has the data required to answer questions for the other companies:
```
# Prefix with `uv run` if you installed via Option B
pprag text index --fresh
```
Execute the benchmark: Run the evaluation script against the provided Comprehensive dataset (or replace the path with your own Excel file):
```
# Prefix with `uv run` if you installed via Option B
pprag text benchmark "data/Benchmark/Comprehensive k=5/Test_Questions.xlsx"
```

Adding Your Own Documents

Option A: You already have Markdown files

Place .md files in data/documents/
Run the indexer (uses incremental indexing to only process new files):
```
# Prefix with `uv run` if you installed via Option B
pprag text index
```
(Add --fresh if you want to completely erase the existing index and rebuild from scratch).

Option B: Start from PDFs

Add LLAMA_CLOUD_API_KEY to your .env file
Place PDFs in data/pdf/

Extract to markdown:

# Prefix with `uv run` if you installed via Option B
pprag text extract

Build the index (incrementally adds the newly extracted PDFs):

# Prefix with `uv run` if you installed via Option B
pprag text index

Project Structure

Proxy-Pointer/
├── Text-Only/                         # Workspace subproject
│   ├── data/
│   │   ├── pdf/                       # Source PDFs (4 FinanceBench 10-Ks included)
│   │   ├── documents/
│   │   │   ├── AMD.md                 # Pre-extracted — ready for quickstart
│   │   │   └── md_files/              # Additional Markdown files (AMEX, Boeing, PepsiCo)
│   │   ├── trees/                     # Structure tree JSONs (auto-generated)
│   │   ├── index/                     # Generated FAISS index (gitignored)
│   │   └── Benchmark/                 # Benchmark results and scorecards
│   │       ├── FinanceBench/          # FinanceBench scorecards (k=3 and k=5)
│   │       ├── Comprehensive k=5/     # 40-question evaluation logs and scorecards
│   │       └── Comprehensive k=3/     # 40-question evaluation logs and scorecards
│   ├── examples/
│   │   └── sample_queries.md          # Example queries to try
│   ├── README.md                      # This file
│   └── pyproject.toml                 # Workspace package configuration
├── src/
│   └── pprag_text_only/               # Package source code (installed via pip)
│       ├── config.py                  # Centralized configuration
│       ├── extraction/
│       │   └── extract_pdf_to_md.py   # PDF → Markdown (LlamaParse)
│       ├── indexing/
│       │   ├── build_skeleton_trees.py # Markdown → structural tree (pure Python)
│       │   └── build_pp_index.py      # Noise filter + chunking + FAISS indexing
│       └── agent/
│           ├── pp_rag_bot.py          # Interactive RAG bot
│           └── benchmark.py           # Automated benchmarking with LLM-as-a-judge

Configuration

All configuration is centralized in src/pprag_text_only/config.py. Override via environment variables:

| Variable | Default | Description |
|---|---|---|
| `GOOGLE_API_KEY` | (required) | Gemini API key |
| `LLAMA_CLOUD_API_KEY` | (optional) | LlamaParse API key for PDF extraction |
| `PP_DATA_DIR` | `data/documents/` | Markdown source directory |
| `PP_TREES_DIR` | `data/trees/` | Structure tree directory |
| `PP_INDEX_DIR` | `data/index/` | FAISS index directory |
| `PP_RESULTS_DIR` | `data/results/` | Benchmark results directory |
| `PP_EMBEDDING_BATCH_SIZE` | `20` | Number of chunks embedded per Gemini request during indexing |
| `PP_EMBEDDING_BATCH_DELAY` | `1` | Seconds to wait between embedding batches during indexing |
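To illustrate the override pattern, here is a minimal sketch of reading a few of these variables with their documented defaults. The `Settings` class and `load_settings` function are names invented for this example; the real config.py may be organized differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    data_dir: str
    embedding_batch_size: int
    embedding_batch_delay: float


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from the environment, falling back to the documented defaults."""
    return Settings(
        data_dir=env.get("PP_DATA_DIR", "data/documents/"),
        embedding_batch_size=int(env.get("PP_EMBEDDING_BATCH_SIZE", "20")),
        embedding_batch_delay=float(env.get("PP_EMBEDDING_BATCH_DELAY", "1")),
    )


# Passing a dict instead of os.environ makes the override easy to see.
print(load_settings({"PP_EMBEDDING_BATCH_SIZE": "50"}))
```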

Indexing Throughput

The index builder embeds chunks in configurable Gemini batches. Increase PP_EMBEDDING_BATCH_SIZE to improve indexing throughput on higher-quota projects. If you see 429 or quota errors, lower the batch size or increase PP_EMBEDDING_BATCH_DELAY; embedding calls automatically retry transient rate-limit failures with exponential backoff.
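The interaction between batch size, batch delay, and retry can be sketched as below. This is an illustration under assumptions, not the project's implementation: `RateLimitError` and `embed_batch` are hypothetical stand-ins for the API's quota error and the Gemini embedding call, and the retry count and backoff base are made up.

```python
import time
from typing import Callable, Sequence


class RateLimitError(Exception):
    """Hypothetical stand-in for a 429 / quota error from the embedding API."""


def embed_in_batches(
    chunks: Sequence[str],
    embed_batch: Callable[[Sequence[str]], list],  # hypothetical embedding call
    batch_size: int = 20,
    batch_delay: float = 1.0,
    max_retries: int = 5,                          # assumed value
    sleep: Callable[[float], None] = time.sleep,
) -> list:
    vectors = []
    for start in range(0, len(chunks), batch_size):
        batch = chunks[start:start + batch_size]
        for attempt in range(max_retries):
            try:
                vectors.extend(embed_batch(batch))
                break
            except RateLimitError:
                if attempt == max_retries - 1:
                    raise
                sleep(2 ** attempt)  # exponential backoff: 1s, 2s, 4s, ...
        # Pause between batches, but not after the last one.
        if start + batch_size < len(chunks):
            sleep(batch_delay)
    return vectors
```

A larger `batch_size` means fewer requests and fewer delays; a larger `batch_delay` spaces requests out when quota is tight.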

Components

1. Extraction Layer (`src/pprag_text_only/extraction/`)

PDF → Markdown via LlamaParse.

Preserves document hierarchy (headings, tables, lists)
Outputs one .md file per PDF
Idempotent: skips already-extracted files
Note: Accurate Markdown with correctly identified headings is crucial for the pipeline to work. If the default cost_effective tier is not sufficient, select a higher tier by changing the LLAMA_PARSE_TIER setting (an environment variable read in config.py).

2. Indexing Layer (`src/pprag_text_only/indexing/`)

Three-stage pipeline:

Stage 0: Skeleton Tree Building (`build_skeleton_trees.py`)

Self-contained, pure-Python module (~150 lines) that parses Markdown headings into a hierarchical tree
Tree nodes represent document sections with node_id, title, and line_num
Zero external dependencies — no LLM calls, no subprocess invocations
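A minimal sketch of the idea, assuming ATX-style headings (`#` through `######`) and a zero-padded counter as `node_id`; the real module may use a different ID scheme and handle edge cases (such as `#` lines inside code fences) that this sketch ignores:

```python
import re

# One to six '#' characters, whitespace, then the heading title.
HEADING = re.compile(r"^(#{1,6})\s+(.+?)\s*$")


def build_skeleton_tree(markdown: str) -> list[dict]:
    """Parse Markdown headings into a nested list of section nodes."""
    roots: list[dict] = []
    stack: list[tuple[int, dict]] = []  # (heading level, node) of open ancestors
    counter = 0
    for line_num, line in enumerate(markdown.splitlines(), start=1):
        match = HEADING.match(line)
        if match is None:
            continue
        counter += 1
        level = len(match.group(1))
        node = {
            "node_id": f"{counter:04d}",  # assumed ID format
            "title": match.group(2),
            "line_num": line_num,
            "children": [],
        }
        # Close sections at the same or a deeper level than this heading.
        while stack and stack[-1][0] >= level:
            stack.pop()
        (stack[-1][1]["children"] if stack else roots).append(node)
        stack.append((level, node))
    return roots


doc = "\n".join(["# AMD", "intro", "## Financial Statements", "### Cash Flows", "## Risk Factors"])
tree = build_skeleton_tree(doc)
print([child["title"] for child in tree[0]["children"]])
```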

Stage 1: LLM Noise Filter

Sends the skeleton tree JSON to Gemini Flash Lite
Identifies noise nodes across 6 categories:
- Table of Contents
- Abbreviations / Glossary
- Acknowledgments
- Foreword / Preface
- Executive Summary
- References / Bibliography
Returns a set of node_ids to exclude
Temperature 0.0 for deterministic results
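The two mechanical parts of this stage, building the prompt and applying the returned IDs, might look like the sketch below. The prompt wording and both function names are assumptions, the LLM call itself is omitted, and dropping a noise node together with its descendants is a choice made for this example rather than documented behavior.

```python
import json

NOISE_CATEGORIES = [
    "Table of Contents",
    "Abbreviations / Glossary",
    "Acknowledgments",
    "Foreword / Preface",
    "Executive Summary",
    "References / Bibliography",
]


def build_noise_prompt(tree: list[dict]) -> str:
    """Assemble a prompt asking the LLM for noise node IDs (wording is illustrative)."""
    return (
        "Return a JSON list of node_id values for sections that belong to any of "
        "these categories: " + "; ".join(NOISE_CATEGORIES) + ".\n\nTree:\n"
        + json.dumps(tree, indent=2)
    )


def drop_noise(tree: list[dict], noise_ids: set[str]) -> list[dict]:
    """Return a copy of the tree without noise nodes (and, by assumption, their children)."""
    kept = []
    for node in tree:
        if node["node_id"] in noise_ids:
            continue
        kept.append({**node, "children": drop_noise(node.get("children", []), noise_ids)})
    return kept


tree = [
    {"node_id": "0001", "title": "Table of Contents", "children": []},
    {"node_id": "0002", "title": "Financial Statements", "children": [
        {"node_id": "0003", "title": "Cash Flows", "children": []},
    ]},
]
content = drop_noise(tree, noise_ids={"0001"})  # IDs as the LLM might return them
print([node["title"] for node in content])
```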

Stage 2: Chunk, Embed, Index

For each non-noise node:
- Extracts text between line_num boundaries
- Parent nodes: text stops at first child's line_num (no overlap with children)
- Splits into 2000-char chunks with 200-char overlap
- Enriches with hierarchical breadcrumb: [Parent > Child > Section]\nchunk_text
- Stores metadata: doc_id, node_id, title, breadcrumb, start_line, end_line
Embeds with gemini-embedding-001 at 1536 dimensions (half of default 3072)
Saves as FAISS index
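The chunking and breadcrumb enrichment can be sketched as follows. The function names are hypothetical, and treating `end_line` as 1-indexed and inclusive is an assumption; embedding and FAISS storage are left out.

```python
def split_with_overlap(text: str, size: int = 2000, overlap: int = 200) -> list[str]:
    """Split text into windows of `size` chars, each sharing `overlap` chars with the previous one."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break
    return chunks


def chunk_node(doc_id: str, node: dict, path: list[str], lines: list[str], next_start: int) -> list[dict]:
    """Chunk one node's own text.

    `next_start` is the line_num where this node's own text ends: the first
    child's heading for a parent, otherwise the next heading (or len(lines) + 1).
    """
    start_line, end_line = node["line_num"], next_start - 1
    text = "\n".join(lines[start_line - 1:end_line])
    breadcrumb = " > ".join(path)
    return [
        {
            "text": f"[{breadcrumb}]\n{piece}",
            "metadata": {
                "doc_id": doc_id, "node_id": node["node_id"], "title": node["title"],
                "breadcrumb": breadcrumb, "start_line": start_line, "end_line": end_line,
            },
        }
        for piece in split_with_overlap(text)
    ]


lines = ["# AMD", "intro", "## Cash Flows", "table rows"]
parent = {"node_id": "0001", "title": "AMD", "line_num": 1}
chunks = chunk_node("AMD", parent, ["AMD"], lines, next_start=3)  # stops before the child heading
print(chunks[0]["text"])
```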

3. Retrieval Layer (`src/pprag_text_only/agent/`)

Two-stage retrieval:

Stage 1: Broad Vector Recall

Embeds query with same 1536-dim model
FAISS similarity search returns top 200 chunks
Deduplicates by node_id and shortlists to the Top 50 unique candidate nodes
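The deduplication step can be sketched as below, assuming the search hits arrive as metadata dicts ordered best-first. The function name is made up, and keying on `node_id` alone assumes IDs are unique across documents; if they are not, the key would need to include `doc_id`.

```python
def shortlist_unique_nodes(hits: list[dict], limit: int = 50) -> list[dict]:
    """Keep the best-ranked hit per node_id, up to `limit` unique nodes."""
    seen: set[str] = set()
    unique = []
    for hit in hits:  # hits are assumed to be sorted by similarity, best first
        if hit["node_id"] in seen:
            continue
        seen.add(hit["node_id"])
        unique.append(hit)
        if len(unique) == limit:
            break
    return unique


hits = [
    {"node_id": "0007", "breadcrumb": "AMD > Financial Statements > Cash Flows"},
    {"node_id": "0007", "breadcrumb": "AMD > Financial Statements > Cash Flows"},
    {"node_id": "0003", "breadcrumb": "AMD > Risk Factors"},
]
print([h["node_id"] for h in shortlist_unique_nodes(hits)])
```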

Stage 2: LLM Structural Re-Ranker

Sends candidate hierarchical paths (breadcrumbs) to Gemini
LLM ranks by structural relevance, not embedding similarity
Returns top 5 unique node IDs
Fallback: if re-ranker fails, uses top 5 by similarity
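A sketch of the re-rank-with-fallback logic, where `rerank` is a hypothetical callable wrapping the Gemini call (it receives `(node_id, breadcrumb)` pairs and returns node IDs in ranked order). How the real bot validates the LLM output is not documented here, so the filtering below is an assumption.

```python
from typing import Callable


def select_top_nodes(
    candidates: list[dict],
    rerank: Callable[[list[tuple[str, str]]], list[str]],  # hypothetical LLM wrapper
    k_final: int = 5,
) -> list[str]:
    """Return up to k_final node IDs, falling back to similarity order on failure."""
    fallback = [c["node_id"] for c in candidates[:k_final]]
    try:
        ranked = rerank([(c["node_id"], c["breadcrumb"]) for c in candidates])
    except Exception:
        return fallback
    valid = {c["node_id"] for c in candidates}
    picked: list[str] = []
    for node_id in ranked:
        # Ignore IDs the LLM invented and keep each node only once.
        if node_id in valid and node_id not in picked:
            picked.append(node_id)
    return picked[:k_final] or fallback


candidates = [
    {"node_id": "0003", "breadcrumb": "AMD > Risk Factors"},
    {"node_id": "0007", "breadcrumb": "AMD > Financial Statements > Cash Flows"},
]
print(select_top_nodes(candidates, rerank=lambda pairs: ["0007", "0003"]))
```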

Synthesis

For each selected node: loads the full section text from the source .md file using start_line / end_line pointers
Injects breadcrumb as ### REFERENCE header for grounding
Gemini synthesizes a grounded answer citing sources
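Loading the full sections from line pointers can be sketched like this. The exact header format, the function names, and the treatment of `start_line` / `end_line` as 1-indexed and inclusive are assumptions; documents are passed in as pre-split line lists to keep the example self-contained.

```python
def load_section(md_lines: list[str], start_line: int, end_line: int) -> str:
    """Return the section text between 1-indexed, inclusive line pointers."""
    return "\n".join(md_lines[start_line - 1:end_line])


def build_context(selected: list[dict], documents: dict[str, list[str]]) -> str:
    """Concatenate full sections, each under a breadcrumb reference header."""
    blocks = []
    for node in selected:
        section = load_section(documents[node["doc_id"]], node["start_line"], node["end_line"])
        blocks.append(f"### REFERENCE: {node['breadcrumb']}\n{section}")
    return "\n\n".join(blocks)


documents = {"AMD": ["# AMD", "intro", "## Cash Flows", "Operating: ...", "Investing: ..."]}
selected = [{"doc_id": "AMD", "breadcrumb": "AMD > Cash Flows", "start_line": 3, "end_line": 5}]
print(build_context(selected, documents))
```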

Design Decisions

Why 1536 dimensions?

Gemini's gemini-embedding-001 defaults to 3072 dimensions. We use output_dimensionality=1536:

50% smaller FAISS index files
Faster similarity search
Minimal accuracy loss — for structural retrieval (breadcrumb matching), the re-ranker does the heavy lifting; embeddings just need to get the right candidates into the top 200
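The size saving is simple arithmetic if vectors are stored as uncompressed float32 values, which is an assumption about the index type. The 10,000-chunk corpus below is a hypothetical figure, and metadata overhead is ignored.

```python
BYTES_PER_FLOAT32 = 4


def raw_vector_bytes(num_vectors: int, dims: int) -> int:
    """Storage for uncompressed float32 vectors, excluding metadata."""
    return num_vectors * dims * BYTES_PER_FLOAT32


num_chunks = 10_000  # hypothetical corpus size
full = raw_vector_bytes(num_chunks, 3072)
half = raw_vector_bytes(num_chunks, 1536)
print(f"3072d: {full / 1e6:.1f} MB, 1536d: {half / 1e6:.1f} MB, ratio: {half / full:.0%}")
```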

Why LLM noise filter instead of regex?

Hardcoded title matching (NOISE_TITLES = {"contents", "foreword", ...}) breaks on:

Variations: "Note of Thanks" vs "Acknowledgments"
Formatting: "Table of Contents" vs "TABLE OF CONTENTS"
Language: concept-based matching catches semantic equivalents
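A small illustration of that brittleness, using a hypothetical exact-match filter. The set contents beyond "contents" and "foreword" are assumed for the example.

```python
# Hypothetical hardcoded list; only "contents" and "foreword" come from the text above.
NOISE_TITLES = {"contents", "foreword", "acknowledgments"}


def is_noise_by_title(title: str) -> bool:
    """Exact match against a fixed set of lowercased titles."""
    return title.strip().lower() in NOISE_TITLES


for title in ["Contents", "TABLE OF CONTENTS", "Note of Thanks", "Foreword"]:
    print(f"{title!r}: {is_noise_by_title(title)}")
# 'Contents': True
# 'TABLE OF CONTENTS': False
# 'Note of Thanks': False
# 'Foreword': True
```

Every missed variant needs another entry in the set, whereas a concept-based classifier can recognize "Note of Thanks" as an acknowledgments section without being told.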

Why structural re-ranker?

Standard vector RAG returns chunks by embedding similarity. A query about "AMD's cash flow" might surface a paragraph that mentions cash flow, but the actual Cash Flow Statement table is structurally elsewhere. The re-ranker sees AMD > Financial Statements > Cash Flows as a breadcrumb and knows it's the right section.

Why full-section loading?

The indexed chunk is max 2000 chars — often just a fragment of a table or section. The synthesizer needs the complete section (including headers, full tables, footnotes) for accurate answers. The chunk acts as a pointer; the full section is the payload.

Dependencies

You can replace each of the following with your preferred tools:

Gemini — Embeddings, noise filter, re-ranker, synthesis
LangChain + FAISS — Vector indexing
LlamaParse — PDF to Markdown extraction (optional)
Pandas — Benchmark data loading (optional)

Author

Partha Sarkar

Contact

For questions, feedback, or to report a bug, please use the following channels:

GitHub Issues: For bug reports.
General Questions: For general questions, ideas, and enhancement requests, reach out to me on LinkedIn or by email.
