Structural RAG for Document Analysis
A hierarchical, pointer-based RAG pipeline that retrieves full document sections using structural tree navigation instead of blind vector similarity.
graph TD
A[PDF Documents] -->|LlamaParse| B[Markdown Files]
B -->|Skeleton Tree Builder| C[Structure Trees]
C -->|LLM Noise Filter| D[Content Nodes]
B --> E[Text Chunks]
D --> E
E -->|Gemini Embed 1536d| F[FAISS Index]
G[User Query] -->|Gemini Embed| H[Vector Search k=200]
F --> H
H -->|Dedup by node_id| I[Top 50 Unique Candidates]
I -->|LLM Re-Ranker| J[Top 5 Nodes]
J -->|Load Full Sections| K[Rich Context]
K -->|LLM Synthesizer| L[Answer]
Instead of retrieving small, context-less chunks, Proxy-Pointer:
- Builds a structural tree of each document (like a table of contents) — pure Python, no external deps
- Filters noise (TOC, abbreviations, foreword, etc.) using an LLM
- Indexes structural pointers — each chunk carries metadata about its position in the document hierarchy
- Re-ranks by structure — an LLM re-ranker selects the most relevant sections by their hierarchical path, not just embedding similarity
- Loads full sections — the synthesizer sees complete document sections, not truncated 2000-char chunks
For the full technical story behind the architecture:
- Proxy-Pointer RAG: Achieving Vectorless Accuracy at Vector RAG Scale and Cost — Core architecture & the pointer-based retrieval idea
- Proxy-Pointer RAG: Structure Meets Scale — 100% Accuracy with Smarter Retrieval — Scaling to multi-document, LLM re-ranking, and benchmark results
Proxy-Pointer has been evaluated on the FinanceBench dataset using four FY2022 10-K filings (AMD, American Express, Boeing, PepsiCo). In addition, I created a set of 40 significantly more complex questions requiring multi-step reasoning and calculations, causal and attribution analysis, and adversarial reasoning, which I call the Comprehensive benchmark. The results are as follows:
| Benchmark | Question type | k_final | Accuracy |
|---|---|---|---|
| FinanceBench (26 questions) | Qualitative + quantitative | k=5 | 100% (26/26) |
| FinanceBench (26 questions) | Qualitative + quantitative | k=3 | 96.2% (25/26) |
| Comprehensive (40 questions) | Complex financial calculations | k=5 | 100% (40/40) |
| Comprehensive (40 questions) | Complex financial calculations | k=3 | 92.5% (37/40) |
Full scorecards and comparison logs are available in data/Benchmark/. Refer to the Comprehensive k=5 and Comprehensive k=3 folders for the detailed logs and scorecards.
A pre-extracted Markdown file (AMD.md) for AMD's FY2022 10-K is included so you can build the index and start querying immediately — no PDF extraction required.
Want to try more companies? Additional Markdown files for American Express, Boeing, and PepsiCo are available in `data/documents/md_files/`. Copy them into `data/documents/` and rebuild the index.
git clone https://github.com/Proxy-Pointer/Proxy-Pointer-RAG.git
cd Proxy-Pointer-RAG
cd Text-Only

We strongly recommend creating a virtual environment first:
python -m venv venv
# Windows: venv\Scripts\activate | macOS/Linux: source venv/bin/activate

You can then install dependencies using standard pip or using uv (recommended for developers).
Option A: install the package with pip:
pip install "pprag[text]"

Option B: if you want to tinker with the code, this project uses uv for fast dependency management:
pip install uv
uv sync --project Text-Only
# Or by package name: uv sync --package pprag-text
# Remember to prefix commands with `uv run` if you use this method!

cp .env.example .env
# Edit .env → add your GOOGLE_API_KEY
# Note: Also review other commented variables, especially the FAISS trust settings required for local index loading!

To build the FAISS index from scratch for the first time:
# Prefix with `uv run` if you installed via Option B
pprag text index --fresh

(Note: If you add more documents later, run `pprag text index` without the `--fresh` flag to incrementally append only the new files.)
# Prefix with `uv run` if you installed via Option B
pprag text ask

Try a query like:
User >> What is AMD's quick ratio for FY2022? Quick ratio is Quick Assets (Cash and cash equivalents + Short-term investments + Accounts receivable, net) by current liabilities
or a more advanced comparative one:
User >> For AMD, what percentage of total revenue growth from FY2021 to FY2022 is attributable to the Embedded segment (including Xilinx)?
The benchmark script evaluates the pipeline against an Excel dataset containing Question and Answer (or Ground Truth) columns. It runs each question through the bot, uses an LLM-as-a-judge to grade the response, and generates a timestamped log and scorecard in data/results/.
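The grading loop described above can be sketched roughly as follows. This is a minimal illustration, not the project's benchmark.py: `run_benchmark`, `scorecard`, `answer_fn`, and `judge_fn` are hypothetical names, the rows are assumed to be already loaded as dictionaries (for example from the Excel file), and the judge is a plain callable standing in for the LLM-as-a-judge call.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class GradedRow:
    question: str
    expected: str
    answer: str
    correct: bool


def run_benchmark(
    rows: Iterable[dict],
    answer_fn: Callable[[str], str],             # hypothetical: wraps the RAG bot
    judge_fn: Callable[[str, str, str], bool],   # hypothetical: wraps the LLM judge
) -> list:
    """Answer each question, then let the judge grade it against the ground truth."""
    graded = []
    for row in rows:
        question = row["Question"]
        # The dataset may name the reference column "Answer" or "Ground Truth".
        expected = row.get("Answer") or row.get("Ground Truth") or ""
        answer = answer_fn(question)
        graded.append(
            GradedRow(question, expected, answer, judge_fn(question, expected, answer))
        )
    return graded


def scorecard(graded: list) -> str:
    """Format accuracy as a percentage followed by (correct/total)."""
    correct = sum(row.correct for row in graded)
    total = len(graded)
    pct = 100.0 * correct / total if total else 0.0
    return f"{pct:.1f}% ({correct}/{total})"
```

Keeping the bot and the judge as injected callables means the loop can be exercised with stubs before any API key is configured.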
We provide a Comprehensive benchmark that evaluates the bot across all four dataset companies (AMD, AMEX, Boeing, and PepsiCo). Because only AMD is indexed by default in the quickstart, follow these steps to run the benchmark successfully:
- Copy the remaining documents: Move all `.md` files from `data/documents/md_files/` into the main `data/documents/` directory.
- Rebuild the index: Run the indexer from scratch to ensure the bot has the data required to answer questions for the other companies:
      # Prefix with `uv run` if you installed via Option B
      pprag text index --fresh
- Execute the benchmark: Run the evaluation script against the provided Comprehensive dataset (or replace the path with your own Excel file):
      # Prefix with `uv run` if you installed via Option B
      pprag text benchmark "data/Benchmark/Comprehensive k=5/Test_Questions.xlsx"

To index your own Markdown files:
- Place `.md` files in `data/documents/`
- Run the indexer (uses incremental indexing to only process new files):
      # Prefix with `uv run` if you installed via Option B
      pprag text index
  (Add `--fresh` if you want to completely erase the existing index and rebuild from scratch.)

To start from your own PDFs:
- Add `LLAMA_CLOUD_API_KEY` to your `.env` file
- Place PDFs in `data/pdf/`
- Extract to Markdown:
      # Prefix with `uv run` if you installed via Option B
      pprag text extract
- Build the index (incrementally adds the newly extracted PDFs):
      # Prefix with `uv run` if you installed via Option B
      pprag text index

Proxy-Pointer/
├── Text-Only/                            # Workspace subproject
│   ├── data/
│   │   ├── pdf/                          # Source PDFs (4 FinanceBench 10-Ks included)
│   │   ├── documents/
│   │   │   ├── AMD.md                    # Pre-extracted, ready for quickstart
│   │   │   └── md_files/                 # Additional Markdown files (AMEX, Boeing, PepsiCo)
│   │   ├── trees/                        # Structure tree JSONs (auto-generated)
│   │   ├── index/                        # Generated FAISS index (gitignored)
│   │   └── Benchmark/                    # Benchmark results and scorecards
│   │       ├── FinanceBench/             # FinanceBench scorecards (k=3 and k=5)
│   │       ├── Comprehensive k=5/        # 40-question evaluation logs and scorecards
│   │       └── Comprehensive k=3/        # 40-question evaluation logs and scorecards
│   ├── examples/
│   │   └── sample_queries.md             # Example queries to try
│   ├── README.md                         # This file
│   └── pyproject.toml                    # Workspace package configuration
├── src/
│   └── pprag_text_only/                  # Package source code (installed via pip)
│       ├── config.py                     # Centralized configuration
│       ├── extraction/
│       │   └── extract_pdf_to_md.py      # PDF → Markdown (LlamaParse)
│       ├── indexing/
│       │   ├── build_skeleton_trees.py   # Markdown → structural tree (pure Python)
│       │   └── build_pp_index.py         # Noise filter + chunking + FAISS indexing
│       └── agent/
│           ├── pp_rag_bot.py             # Interactive RAG bot
│           └── benchmark.py              # Automated benchmarking with LLM-as-a-judge

All configuration is centralized in src/pprag_text_only/config.py. Override via environment variables:
| Variable | Default | Description |
|---|---|---|
| `GOOGLE_API_KEY` | (required) | Gemini API key |
| `LLAMA_CLOUD_API_KEY` | (optional) | LlamaParse API key for PDF extraction |
| `PP_DATA_DIR` | `data/documents/` | Markdown source directory |
| `PP_TREES_DIR` | `data/trees/` | Structure tree directory |
| `PP_INDEX_DIR` | `data/index/` | FAISS index directory |
| `PP_RESULTS_DIR` | `data/results/` | Benchmark results directory |
| `PP_EMBEDDING_BATCH_SIZE` | `20` | Number of chunks embedded per Gemini request during indexing |
| `PP_EMBEDDING_BATCH_DELAY` | `1` | Seconds to wait between embedding batches during indexing |
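
As an illustration of the override pattern, the sketch below reads the documented variables from the environment and falls back to the documented defaults. It is an assumption about how such a loader could look, not a copy of config.py; `DEFAULTS` and `load_settings` are hypothetical names.

```python
import os
from pathlib import Path
from typing import Mapping

# Defaults copied from the table above; the real config.py may be organized differently.
DEFAULTS = {
    "PP_DATA_DIR": "data/documents/",
    "PP_TREES_DIR": "data/trees/",
    "PP_INDEX_DIR": "data/index/",
    "PP_RESULTS_DIR": "data/results/",
    "PP_EMBEDDING_BATCH_SIZE": "20",
    "PP_EMBEDDING_BATCH_DELAY": "1",
}


def load_settings(environ: Mapping[str, str] = os.environ) -> dict:
    """Return settings, preferring environment values over the defaults."""

    def get(name: str) -> str:
        value = environ.get(name, "")
        return value if value else DEFAULTS[name]

    settings = {name: Path(get(name)) for name in DEFAULTS if name.endswith("_DIR")}
    settings["PP_EMBEDDING_BATCH_SIZE"] = int(get("PP_EMBEDDING_BATCH_SIZE"))
    settings["PP_EMBEDDING_BATCH_DELAY"] = float(get("PP_EMBEDDING_BATCH_DELAY"))
    # Required by the pipeline; None here means the key still has to be set.
    settings["GOOGLE_API_KEY"] = environ.get("GOOGLE_API_KEY")
    return settings
```

Passing the environment as a mapping keeps the function easy to test with a plain dictionary.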
The index builder embeds chunks in configurable Gemini batches. Increase
PP_EMBEDDING_BATCH_SIZE to improve indexing throughput on higher-quota
projects. If you see 429 or quota errors, lower the batch size or increase
PP_EMBEDDING_BATCH_DELAY; embedding calls automatically retry transient
rate-limit failures with exponential backoff.
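
The batching and retry behavior can be pictured with the sketch below. It is a simplified stand-in rather than the index builder's code: `embed_in_batches`, `RateLimitError`, `max_retries`, and `base_backoff` are hypothetical, and `embed_batch` is any callable that turns a list of strings into a list of vectors.

```python
import time
from typing import Callable, List, Sequence


class RateLimitError(Exception):
    """Stand-in for a 429 / quota error raised by an embedding client."""


def embed_in_batches(
    texts: Sequence[str],
    embed_batch: Callable[[Sequence[str]], List[List[float]]],
    batch_size: int = 20,        # mirrors PP_EMBEDDING_BATCH_SIZE
    batch_delay: float = 1.0,    # mirrors PP_EMBEDDING_BATCH_DELAY
    max_retries: int = 5,        # assumption: not a documented setting
    base_backoff: float = 1.0,   # assumption: not a documented setting
    sleep: Callable[[float], None] = time.sleep,
) -> List[List[float]]:
    """Embed texts batch by batch, retrying rate-limit errors with exponential backoff."""
    vectors: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        for attempt in range(max_retries + 1):
            try:
                vectors.extend(embed_batch(batch))
                break
            except RateLimitError:
                if attempt == max_retries:
                    raise
                # Wait 1x, 2x, 4x, ... the base backoff before retrying.
                sleep(base_backoff * 2 ** attempt)
        # Pause between batches, but not after the last one.
        if start + batch_size < len(texts):
            sleep(batch_delay)
    return vectors
```

A smaller `batch_size` or a larger `batch_delay` lowers the request rate, which is the lever to reach for when quota errors persist after retries.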
PDF → Markdown via LlamaParse.
- Preserves document hierarchy (headings, tables, lists)
- Outputs one `.md` file per PDF
- Idempotent: skips already-extracted files
- Note: Accurate Markdown with correctly identified headings is crucial for the pipeline to work. If the default `cost_effective` tier is not sufficient, run with a higher tier by selecting a different value for the `LLAMA_PARSE_TIER` environment variable in config.py.
Three-stage pipeline:
Stage 1: Skeleton tree builder (build_skeleton_trees.py)
- Self-contained, pure-Python module (~150 lines) that parses Markdown headings into a hierarchical tree
- Tree nodes represent document sections with `node_id`, `title`, and `line_num`
- Zero external dependencies: no LLM calls, no subprocess invocations
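
A heading-to-tree parser of this kind could look roughly like the sketch below. It is not the project's build_skeleton_trees.py: the `nodes` key for children, the zero-padded `node_id` format, and the decision to skip headings inside fenced code are all assumptions made for illustration.

```python
import re

HEADING_RE = re.compile(r"^(#{1,6})\s+(.+?)\s*$")
FENCE = "`" * 3  # marker that opens or closes a fenced code block


def build_skeleton_tree(markdown: str) -> list:
    """Parse ATX headings into nested nodes with node_id, title, and line_num."""
    roots = []
    stack = []  # (heading level, node) for the currently open ancestors
    in_fence = False
    counter = 0
    for line_num, line in enumerate(markdown.splitlines(), start=1):
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
            continue
        if in_fence:
            continue  # a '#' inside fenced code is not a heading
        match = HEADING_RE.match(line)
        if not match:
            continue
        level = len(match.group(1))
        counter += 1
        node = {
            "node_id": f"{counter:04d}",  # assumed ID format
            "title": match.group(2),
            "line_num": line_num,         # 1-indexed line of the heading
            "nodes": [],
        }
        # Close every open section at the same or a deeper level.
        while stack and stack[-1][0] >= level:
            stack.pop()
        (stack[-1][1]["nodes"] if stack else roots).append(node)
        stack.append((level, node))
    return roots
```

The stack holds the chain of open ancestors, so each new heading attaches to the nearest shallower heading above it.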
Stage 2: LLM noise filter
- Sends the skeleton tree JSON to Gemini Flash Lite
- Identifies noise nodes across 6 categories:
  - Table of Contents
  - Abbreviations / Glossary
  - Acknowledgments
  - Foreword / Preface
  - Executive Summary
  - References / Bibliography
- Returns a set of `node_id`s to exclude
- Temperature 0.0 for deterministic results
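
Once the model has replied, applying the exclusion set is straightforward. The sketch below assumes the reply is a JSON array of node IDs and that children live under a `nodes` key; both are assumptions, as is the choice to exclude nothing when the reply cannot be parsed. `parse_noise_ids`, `iter_nodes`, and `content_nodes` are hypothetical helper names.

```python
import json


def parse_noise_ids(llm_response: str) -> set:
    """Parse a JSON array of node IDs; an unparseable reply excludes nothing."""
    try:
        data = json.loads(llm_response)
    except json.JSONDecodeError:
        return set()
    if not isinstance(data, list):
        return set()
    return {str(item) for item in data}


def iter_nodes(nodes):
    """Yield every node of the tree in document (pre-order) order."""
    for node in nodes:
        yield node
        yield from iter_nodes(node.get("nodes", []))


def content_nodes(tree: list, noise_ids: set) -> list:
    """Return the nodes that survive the noise filter, in document order."""
    return [node for node in iter_nodes(tree) if node["node_id"] not in noise_ids]
```

Filtering by ID only means a child of a noise node is kept unless it is listed itself; whether the project drops whole subtrees is not stated above.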
Stage 3: Chunking, embedding, and FAISS indexing
- For each non-noise node:
  - Extracts text between `line_num` boundaries
  - Parent nodes: text stops at first child's `line_num` (no overlap with children)
  - Splits into 2000-char chunks with 200-char overlap
  - Enriches with hierarchical breadcrumb: `[Parent > Child > Section]\nchunk_text`
  - Stores metadata: `doc_id`, `node_id`, `title`, `breadcrumb`, `start_line`, `end_line`
- Embeds with `gemini-embedding-001` at 1536 dimensions (half of default 3072)
- Saves as FAISS index
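
The chunking and enrichment step can be sketched as below. This is an illustrative reading of the description, not build_pp_index.py: `split_with_overlap` and `make_chunks` are hypothetical names, and the section text and line boundaries are assumed to have been extracted already.

```python
def split_with_overlap(text: str, size: int = 2000, overlap: int = 200) -> list:
    """Split text into windows of `size` chars, each starting `size - overlap` after the last."""
    if size <= overlap:
        raise ValueError("size must be larger than overlap")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reached the end of the text
    return chunks


def make_chunks(doc_id, node, path_titles, section_text, start_line, end_line) -> list:
    """Prefix each chunk with its breadcrumb and attach the pointer metadata."""
    breadcrumb = " > ".join(path_titles)
    metadata = {
        "doc_id": doc_id,
        "node_id": node["node_id"],
        "title": node["title"],
        "breadcrumb": breadcrumb,
        "start_line": start_line,  # pointers back to the full section
        "end_line": end_line,
    }
    return [
        {"text": f"[{breadcrumb}]\n{piece}", "metadata": dict(metadata)}
        for piece in split_with_overlap(section_text)
    ]
```

Every chunk of a section carries the same `start_line` and `end_line`, which is what lets any one of them act as a pointer to the whole section.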
Two-stage retrieval:
Stage 1: Vector search
- Embeds query with same 1536-dim model
- FAISS similarity search returns top 200 chunks
- Deduplicates by `node_id` and shortlists to the top 50 unique candidate nodes
Stage 2: LLM re-ranker
- Sends candidate hierarchical paths (breadcrumbs) to Gemini
- LLM ranks by structural relevance, not embedding similarity
- Returns top 5 unique node IDs
- Fallback: if re-ranker fails, uses top 5 by similarity
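
The deduplication and fallback logic can be pictured with the following sketch. It assumes the search hits arrive best-first as dictionaries with a `node_id` key, and that the re-ranker is a callable returning node IDs; `shortlist_unique_nodes`, `select_nodes`, and `rerank_fn` are hypothetical names rather than the bot's actual functions.

```python
def shortlist_unique_nodes(hits: list, limit: int = 50) -> list:
    """Keep the best-scoring hit per node_id, preserving similarity order."""
    seen = set()
    unique = []
    for hit in hits:  # hits are assumed to be sorted best-first
        node_id = hit["node_id"]
        if node_id in seen:
            continue
        seen.add(node_id)
        unique.append(hit)
        if len(unique) == limit:
            break
    return unique


def select_nodes(candidates: list, rerank_fn, k_final: int = 5) -> list:
    """Ask the re-ranker for the top nodes; fall back to similarity order on failure."""
    fallback = [c["node_id"] for c in candidates[:k_final]]
    try:
        ranked = rerank_fn(candidates)
    except Exception:
        return fallback
    known = {c["node_id"] for c in candidates}
    picked = []
    for node_id in ranked:
        # Ignore IDs the model made up and any repeats.
        if node_id in known and node_id not in picked:
            picked.append(node_id)
    return picked[:k_final] or fallback
```

Validating the returned IDs against the candidate set guards against a re-ranker reply that names nodes which were never shortlisted.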
Synthesis:
- For each selected node: loads the full section text from the source `.md` file using `start_line`/`end_line` pointers
- Injects breadcrumb as `### REFERENCE` header for grounding
- Gemini synthesizes a grounded answer citing sources
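
Dereferencing the pointers might look like the sketch below. It assumes `start_line` and `end_line` are 1-indexed and inclusive, and that the header takes the form `### REFERENCE: <breadcrumb>`; neither detail is confirmed above. `load_section` and `build_context` are hypothetical names, and the file is passed in as a list of lines (for example from `Path(...).read_text().splitlines()`).

```python
def load_section(md_lines: list, start_line: int, end_line: int) -> str:
    """Return the full section text for 1-indexed, inclusive line pointers."""
    return "\n".join(md_lines[start_line - 1:end_line])


def build_context(md_lines: list, selected: list) -> str:
    """Concatenate the selected sections, each under a breadcrumb reference header."""
    blocks = []
    for meta in selected:
        body = load_section(md_lines, meta["start_line"], meta["end_line"])
        # Assumed header format; the breadcrumb tells the model where the text came from.
        blocks.append(f"### REFERENCE: {meta['breadcrumb']}\n{body}")
    return "\n\n".join(blocks)
```

The synthesizer prompt then receives whole sections, however long, instead of the 2000-character chunks that were used for retrieval.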
Gemini's gemini-embedding-001 defaults to 3072 dimensions. We use output_dimensionality=1536:
- 50% smaller FAISS index files
- Faster similarity search
- Minimal accuracy loss — for structural retrieval (breadcrumb matching), the re-ranker does the heavy lifting; embeddings just need to get the right candidates into the top 200
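
The size claim follows from simple arithmetic. The sketch below assumes a flat index that stores raw float32 vectors (the index type is not stated above) and uses a made-up chunk count purely for illustration; `flat_index_bytes` is a hypothetical helper.

```python
def flat_index_bytes(num_vectors: int, dim: int, bytes_per_float: int = 4) -> int:
    """Approximate vector storage for a flat float32 index, ignoring file headers."""
    return num_vectors * dim * bytes_per_float


# Hypothetical corpus size, chosen only to make the comparison concrete.
NUM_CHUNKS = 10_000
full = flat_index_bytes(NUM_CHUNKS, 3072)   # default dimensionality
half = flat_index_bytes(NUM_CHUNKS, 1536)   # output_dimensionality=1536
print(f"3072d: {full / 1e6:.1f} MB, 1536d: {half / 1e6:.1f} MB, ratio {half / full:.0%}")
```

Because storage scales linearly with dimension, the 50% saving holds for any corpus size under this assumption.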
Hardcoded title matching (NOISE_TITLES = {"contents", "foreword", ...}) breaks on:
- Variations: "Note of Thanks" vs "Acknowledgments"
- Formatting: "Table of Contents" vs "TABLE OF CONTENTS"
- Language: concept-based matching catches semantic equivalents
Standard vector RAG returns chunks by embedding similarity. A query about "AMD's cash flow" might surface a paragraph that mentions cash flow, but the actual Cash Flow Statement table is structurally elsewhere. The re-ranker sees AMD > Financial Statements > Cash Flows as a breadcrumb and knows it's the right section.
The indexed chunk is max 2000 chars — often just a fragment of a table or section. The synthesizer needs the complete section (including headers, full tables, footnotes) for accurate answers. The chunk acts as a pointer; the full section is the payload.
You can replace each of the following with your preferred tools:
- Gemini — Embeddings, noise filter, re-ranker, synthesis
- LangChain + FAISS — Vector indexing
- LlamaParse — PDF to Markdown extraction (optional)
- Pandas — Benchmark data loading (optional)
Partha Sarkar
For questions, feedback, or to report a bug, please use the following channels:
- GitHub Issues: For bug reports.
- General Questions: For general questions, ideas, and enhancement requests, reach out to me on LinkedIn or Email.
© 2026 Partha Sarkar (Proxy-Pointer). Licensed under MIT.
