Commit cd7f589

feat(sdk): add local and cloud demo examples
1 parent 106afc2 commit cd7f589

3 files changed

Lines changed: 168 additions & 58 deletions

README.md

Lines changed: 52 additions & 58 deletions
@@ -16,10 +16,10 @@
 <p align="center"><b>Reasoning-based RAG&nbsp;&nbsp;No Vector DB&nbsp;&nbsp;No Chunking&nbsp;&nbsp;Human-like Retrieval</b></p>
 
 <h4 align="center">
-<a href="https://vectify.ai">🌐 Homepage</a>&nbsp;&nbsp;
+<a href="https://vectify.ai">🏠 Homepage</a>&nbsp;&nbsp;
 <a href="https://chat.pageindex.ai">🖥️ Chat Platform</a>&nbsp;&nbsp;
-<a href="https://pageindex.ai/developer">🔌 MCP & API</a>&nbsp;&nbsp;
-<a href="https://docs.pageindex.ai">📖 Docs</a>&nbsp;&nbsp;
+<a href="https://pageindex.ai/mcp">🔌 MCP</a>&nbsp;&nbsp;
+<a href="https://docs.pageindex.ai">📚 Docs</a>&nbsp;&nbsp;
 <a href="https://discord.com/invite/VuXuf29EUj">💬 Discord</a>&nbsp;&nbsp;
 <a href="https://ii2abc2jejf.typeform.com/to/tK3AXl8T">✉️ Contact</a>&nbsp;
 </h4>
@@ -28,16 +28,20 @@
 
 
 <details open>
-<summary><h2>📢 Updates</h2></summary>
-
-- 🔥 [**Agentic Vectorless RAG**](https://github.com/VectifyAI/PageIndex/blob/main/examples/agentic_vectorless_rag_demo.py) — A simple *agentic, vectorless RAG* [example](https://github.com/VectifyAI/PageIndex/blob/main/examples/agentic_vectorless_rag_demo.py) with self-hosted PageIndex, using OpenAI Agents SDK.
-- [PageIndex Chat](https://chat.pageindex.ai) — Human-like document analysis agent [platform](https://chat.pageindex.ai) for professional long documents. Also available via [MCP](https://pageindex.ai/developer) or [API](https://pageindex.ai/developer).
-- [PageIndex Framework](https://pageindex.ai/blog/pageindex-intro) — Deep dive into PageIndex: an *agentic, in-context tree index* that enables LLMs to perform *reasoning-based, human-like retrieval* over long documents.
-
-<!-- **🧪 Cookbooks:**
+<summary><h3>📢 Latest Updates</h3></summary>
+
+**🔥 Releases:**
+- [**PageIndex Chat**](https://chat.pageindex.ai): The first human-like document-analysis agent [platform](https://chat.pageindex.ai) built for professional long documents. Can also be integrated via [MCP](https://pageindex.ai/mcp) or [API](https://docs.pageindex.ai/quickstart) (beta).
+<!-- - [**PageIndex Chat API**](https://docs.pageindex.ai/quickstart): An API that brings PageIndex's advanced long-document intelligence directly into your applications and workflows. -->
+<!-- - [PageIndex MCP](https://pageindex.ai/mcp): Bring PageIndex into Claude, Cursor, or any MCP-enabled agent. Chat with long PDFs in a reasoning-based, human-like way. -->
+
+**📝 Articles:**
+- [**PageIndex Framework**](https://pageindex.ai/blog/pageindex-intro): Introduces the PageIndex framework — an *agentic, in-context* *tree index* that enables LLMs to perform *reasoning-based*, *human-like retrieval* over long documents, without vector DB or chunking.
+<!-- - [Do We Still Need OCR?](https://pageindex.ai/blog/do-we-need-ocr): Explores how vision-based, reasoning-native RAG challenges the traditional OCR pipeline, and why the future of document AI might be *vectorless* and *vision-based*. -->
+
+**🧪 Cookbooks:**
 - [Vectorless RAG](https://docs.pageindex.ai/cookbook/vectorless-rag-pageindex): A minimal, hands-on example of reasoning-based RAG using PageIndex. No vectors, no chunking, and human-like retrieval.
-- [Vision-based Vectorless RAG](https://docs.pageindex.ai/cookbook/vision-rag-pageindex): OCR-free, vision-only RAG with PageIndex's reasoning-native retrieval workflow that works directly over PDF page images. -->
-
+- [Vision-based Vectorless RAG](https://docs.pageindex.ai/cookbook/vision-rag-pageindex): OCR-free, vision-only RAG with PageIndex's reasoning-native retrieval workflow that works directly over PDF page images.
 </details>
 
 ---
@@ -58,38 +62,33 @@ It simulates how *human experts* navigate and extract knowledge from complex doc
 </a>
 </div>
 
-### 🎯 Core Features
+### 🎯 Core Features
 
 Compared to traditional vector-based RAG, **PageIndex** features:
 - **No Vector DB**: Uses document structure and LLM reasoning for retrieval, instead of vector similarity search.
 - **No Chunking**: Documents are organized into natural sections, not artificial chunks.
 - **Human-like Retrieval**: Simulates how human experts navigate and extract knowledge from complex documents.
 - **Better Explainability and Traceability**: Retrieval is based on reasoning — traceable and interpretable, with page and section references. No more opaque, approximate vector search (“vibe retrieval”).
 
-PageIndex powers a reasoning-based RAG system that achieved **state-of-the-art** [98.7% accuracy](https://github.com/VectifyAI/Mafin2.5-FinanceBench) on FinanceBench, demonstrating superior performance over vector-based RAG solutions in professional document analysis. See our [blog post](https://vectify.ai/blog/Mafin2.5) for details.
+PageIndex powers a reasoning-based RAG system that achieved **state-of-the-art** [98.7% accuracy](https://github.com/VectifyAI/Mafin2.5-FinanceBench) on FinanceBench, demonstrating superior performance over vector-based RAG solutions in professional document analysis (see our [blog post](https://vectify.ai/blog/Mafin2.5) for details).
 
 ### 📍 Explore PageIndex
 
-To learn more, please see a detailed introduction to the [PageIndex framework](https://pageindex.ai/blog/pageindex-intro). Check out this GitHub repo for open-source code, and the [cookbooks](https://docs.pageindex.ai/cookbook), [tutorials](https://docs.pageindex.ai/tutorials), and [blog](https://pageindex.ai/blog) for additional usage guides and examples.
+To learn more, please see a detailed introduction of the [PageIndex framework](https://pageindex.ai/blog/pageindex-intro). Check out this GitHub repo for open-source code, and the [cookbooks](https://docs.pageindex.ai/cookbook), [tutorials](https://docs.pageindex.ai/tutorials), and [blog](https://pageindex.ai/blog) for additional usage guides and examples.
 
-The PageIndex service is available as a ChatGPT-style [chat platform](https://chat.pageindex.ai), or can be integrated via [MCP](https://pageindex.ai/developer) or [API](https://pageindex.ai/developer).
+The PageIndex service is available as a ChatGPT-style [chat platform](https://chat.pageindex.ai), or can be integrated via [MCP](https://pageindex.ai/mcp) or [API](https://docs.pageindex.ai/quickstart).
 
 ### 🛠️ Deployment Options
 - Self-host — run locally with this open-source repo.
-- Cloud Service — try instantly with our [Chat Platform](https://chat.pageindex.ai/), or integrate via [MCP](https://pageindex.ai/developer) or [API](https://pageindex.ai/developer).
+- Cloud Service — try instantly with our [Chat Platform](https://chat.pageindex.ai/), or integrate with [MCP](https://pageindex.ai/mcp) or [API](https://docs.pageindex.ai/quickstart).
 - _Enterprise_ — private or on-prem deployment. [Contact us](https://ii2abc2jejf.typeform.com/to/tK3AXl8T) or [book a demo](https://calendly.com/pageindex/meet) for more details.
 
 ### 🧪 Quick Hands-on
 
-- 🔥 [**Agentic Vectorless RAG**](examples/agentic_vectorless_rag_demo.py) (**latest**) — a simple but complete **agentic vectorless RAG** [example](https://github.com/VectifyAI/PageIndex/blob/main/examples/agentic_vectorless_rag_demo.py) with *self-hosted* PageIndex, using OpenAI Agents SDK.
-- Try the [Vectorless RAG](https://github.com/VectifyAI/PageIndex/blob/main/cookbook/pageindex_RAG_simple.ipynb) notebook — a *minimal*, hands-on example of reasoning-based RAG using PageIndex.
-- Check out [Vision-based Vectorless RAG](https://github.com/VectifyAI/PageIndex/blob/main/cookbook/vision_RAG_pageindex.ipynb) — no OCR; a minimal, vision-based & reasoning-native RAG pipeline that works directly over page images.
+- Try the [**Vectorless RAG**](https://github.com/VectifyAI/PageIndex/blob/main/cookbook/pageindex_RAG_simple.ipynb) notebook — a *minimal*, hands-on example of reasoning-based RAG using PageIndex.
+- Experiment with [*Vision-based Vectorless RAG*](https://github.com/VectifyAI/PageIndex/blob/main/cookbook/vision_RAG_pageindex.ipynb) — no OCR; a minimal, reasoning-native RAG pipeline that works directly over page images.
 
 <div align="center">
-<a href="https://github.com/VectifyAI/PageIndex/blob/main/examples/agentic_vectorless_rag_demo.py" target="_blank" rel="noopener">
-<img src="https://img.shields.io/badge/View_on_GitHub-Agentic_Vectorless_RAG-blue?style=for-the-badge&logo=github" alt="View on GitHub: Agentic Vectorless RAG" />
-</a>
-<br/>
 <a href="https://colab.research.google.com/github/VectifyAI/PageIndex/blob/main/cookbook/pageindex_RAG_simple.ipynb" target="_blank" rel="noopener">
 <img src="https://img.shields.io/badge/Open_In_Colab-Vectorless_RAG-orange?style=for-the-badge&logo=googlecolab" alt="Open in Colab: Vectorless RAG" />
 </a>
@@ -102,10 +101,9 @@ The PageIndex service is available as a ChatGPT-style [chat platform](https://ch
 ---
 
 # 🌲 PageIndex Tree Structure
-
 PageIndex can transform lengthy PDF documents into a semantic **tree structure**, similar to a _"table of contents"_ but optimized for use with Large Language Models (LLMs). It's ideal for: financial reports, regulatory filings, academic textbooks, legal or technical manuals, and any document that exceeds LLM context limits.
 
-Below is an example PageIndex tree structure. Also see more example [documents](https://github.com/VectifyAI/PageIndex/tree/main/examples/documents) and generated [tree structures](https://github.com/VectifyAI/PageIndex/tree/main/examples/documents/results).
+Below is an example PageIndex tree structure. Also see more example [documents](https://github.com/VectifyAI/PageIndex/tree/main/tests/pdfs) and generated [tree structures](https://github.com/VectifyAI/PageIndex/tree/main/tests/results).
 
 ```jsonc
 ...
@@ -135,7 +133,7 @@ Below is an example PageIndex tree structure. Also see more example [documents](
 ...
 ```
 
-You can generate the PageIndex tree structure with this open-source repo, or use our [API](https://pageindex.ai/developer).
+You can generate the PageIndex tree structure with this open-source repo, or use our [API](https://docs.pageindex.ai/quickstart)
 
 ---
 
@@ -151,10 +149,12 @@ pip3 install --upgrade -r requirements.txt
 
 ### 2. Set your LLM API key
 
-Create a `.env` file in the root directory with your LLM API key, with multi-LLM support via [LiteLLM](https://docs.litellm.ai/docs/providers):
+Create a `.env` file in the root directory with your LLM API key::
 
 ```bash
 OPENAI_API_KEY=your_openai_key_here
+# or
+CHATGPT_API_KEY=your_openai_key_here # legacy, still supported
 ```
 
 ### 3. Generate PageIndex structure for your PDF
@@ -164,12 +164,12 @@ python3 run_pageindex.py --pdf_path /path/to/your/document.pdf
 ```
 
 <details>
-<summary>Optional parameters</summary>
+<summary><strong>Optional parameters</strong></summary>
 <br>
 You can customize the processing with additional optional arguments:
 
 ```
---model LLM model to use (default: gpt-4o-2024-11-20)
+--model OpenAI model to use (default: gpt-4o-2024-11-20)
 --toc-check-pages Pages to check for table of contents (default: 20)
 --max-pages-per-node Max pages per node (default: 10)
 --max-tokens-per-node Max tokens per node (default: 20000)
@@ -180,29 +180,31 @@ You can customize the processing with additional optional arguments:
 </details>
 
 <details>
-<summary>Markdown support</summary>
+<summary><strong>Markdown support</strong></summary>
 <br>
-We also provide markdown support for PageIndex. You can use the `--md_path` flag to generate a tree structure for a markdown file.
+We also provide markdown support for PageIndex. You can use the `-md_path` flag to generate a tree structure for a markdown file.
 
 ```bash
 python3 run_pageindex.py --md_path /path/to/your/document.md
 ```
 
-> Note: in this mode, we use "#" to determine node headings and their levels. For example, "##" is level 2, "###" is level 3, etc. Make sure your markdown file is formatted correctly. If your Markdown file was converted from a PDF or HTML, we don't recommend using this mode, since most existing conversion tools cannot preserve the original hierarchy. Instead, use our [PageIndex OCR](https://pageindex.ai/blog/ocr), which is designed to preserve the original hierarchy, to convert the PDF to a markdown file and then use this mode.
+> Note: in this function, we use "#" to determine node heading and their levels. For example, "##" is level 2, "###" is level 3, etc. Make sure your markdown file is formatted correctly. If your Markdown file was converted from a PDF or HTML, we don't recommend using this function, since most existing conversion tools cannot preserve the original hierarchy. Instead, use our [PageIndex OCR](https://pageindex.ai/blog/ocr), which is designed to preserve the original hierarchy, to convert the PDF to a markdown file and then use this function.
 </details>
 
-## Agentic Vectorless RAG: An Example
+### A Complete Agentic RAG Example
 
-For a simple, end-to-end _**agentic vectorless RAG**_ example using PageIndex with OpenAI Agents SDK, see [`examples/agentic_vectorless_rag_demo.py`](examples/agentic_vectorless_rag_demo.py).
+For a complete agent-based QA example using the [OpenAI Agents SDK](https://github.com/openai/openai-agents-python), see [`examples/openai_agents_demo.py`](examples/openai_agents_demo.py).
 
 ```bash
 # Install optional dependency
 pip3 install openai-agents
 
 # Run the demo
-python3 examples/agentic_vectorless_rag_demo.py
+python3 examples/openai_agents_demo.py
 ```
 
+---
+
 <!--
 # ☁️ Improved Tree Generation with PageIndex OCR

@@ -238,32 +240,24 @@ Explore the full [benchmark results](https://github.com/VectifyAI/Mafin2.5-Finan
 
 # 🧭 Resources
 
+* 🧪 [Cookbooks](https://docs.pageindex.ai/cookbook/vectorless-rag-pageindex): hands-on, runnable examples and advanced use cases.
+* 📖 [Tutorials](https://docs.pageindex.ai/doc-search): practical guides and strategies, including *Document Search* and *Tree Search*.
 * 📝 [Blog](https://pageindex.ai/blog): technical articles, research insights, and product updates.
-* 🔧 [Developer](https://pageindex.ai/developer): MCP setup, API docs, and integration guides.
-* 🧪 [Cookbooks](https://docs.pageindex.ai/cookbook): hands-on, runnable examples and advanced use cases.
-* 📖 [Tutorials](https://docs.pageindex.ai/tutorials): practical guides and strategies, including *Document Search* and *Tree Search*.
+* 🔌 [MCP setup](https://pageindex.ai/mcp#quick-setup) & [API docs](https://docs.pageindex.ai/quickstart): integration details and configuration options.
 
 ---
 
 # ⭐ Support Us
-
-Leave us a star 🌟 if you like our project. Thank you!
-
-<p>
-<img src="https://github.com/user-attachments/assets/eae4ff38-48ae-4a7c-b19f-eab81201d794" width="80%">
-</p>
-
 Please cite this work as:
 ```
 Mingtian Zhang, Yu Tang and PageIndex Team,
 "PageIndex: Next-Generation Vectorless, Reasoning-based RAG",
 PageIndex Blog, Sep 2025.
 ```
 
-<details>
-<summary>Or use the BibTeX citation.</summary>
+Or use the BibTeX citation:
 
-```bibtex
+```
 @article{zhang2025pageindex,
 author = {Mingtian Zhang and Yu Tang and PageIndex Team},
 title = {PageIndex: Next-Generation Vectorless, Reasoning-based RAG},
@@ -273,20 +267,20 @@ PageIndex Blog, Sep 2025.
 note = {https://pageindex.ai/blog/pageindex-intro},
 }
 ```
-</details>
 
+Leave us a star 🌟 if you like our project. Thank you!
 
-### Connect with Us
+<p>
+<img src="https://github.com/user-attachments/assets/eae4ff38-48ae-4a7c-b19f-eab81201d794" width="80%">
+</p>
 
-<div align="center">
+### Connect with Us
 
-[![Twitter](https://img.shields.io/badge/Twitter-000000?style=for-the-badge&logo=x&logoColor=white)](https://x.com/PageIndexAI)&ensp;
-[![LinkedIn](https://img.shields.io/badge/LinkedIn-0077B5?style=for-the-badge&logo=linkedin&logoColor=white)](https://www.linkedin.com/company/vectify-ai/)&ensp;
-[![Discord](https://img.shields.io/badge/Discord-5865F2?style=for-the-badge&logo=discord&logoColor=white)](https://discord.com/invite/VuXuf29EUj)&ensp;
+[![Twitter](https://img.shields.io/badge/Twitter-000000?style=for-the-badge&logo=x&logoColor=white)](https://x.com/PageIndexAI)&nbsp;
+[![LinkedIn](https://img.shields.io/badge/LinkedIn-0077B5?style=for-the-badge&logo=linkedin&logoColor=white)](https://www.linkedin.com/company/vectify-ai/)&nbsp;
+[![Discord](https://img.shields.io/badge/Discord-5865F2?style=for-the-badge&logo=discord&logoColor=white)](https://discord.com/invite/VuXuf29EUj)&nbsp;
 [![Contact Us](https://img.shields.io/badge/Contact_Us-3B82F6?style=for-the-badge&logo=envelope&logoColor=white)](https://ii2abc2jejf.typeform.com/to/tK3AXl8T)
 
-</div>
-
 ---
 
-© 2026 [Vectify AI](https://vectify.ai)
+© 2025 [Vectify AI](https://vectify.ai)
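
The README's Markdown-support note says that "#" markers determine node headings and their levels. As a rough illustration of that idea only, the sketch below nests headings by their "#" count. It is not PageIndex's implementation: the function name `heading_tree` and the keys `title`, `level`, and `nodes` are assumptions chosen for this example.

```python
import re

# One to six '#' characters, whitespace, then the heading text.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def heading_tree(markdown: str) -> list[dict]:
    """Nest headings by '#' count. Key names are illustrative, not PageIndex's schema."""
    roots: list[dict] = []
    stack: list[dict] = []  # currently open ancestors, shallowest first
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # ignore '#' comments inside fenced code
            continue
        match = None if in_fence else HEADING.match(line)
        if not match:
            continue
        node = {"title": match.group(2), "level": len(match.group(1)), "nodes": []}
        # Close every open heading that is at the same depth or deeper.
        while stack and stack[-1]["level"] >= node["level"]:
            stack.pop()
        (stack[-1]["nodes"] if stack else roots).append(node)
        stack.append(node)
    return roots
```

A file converted from PDF or HTML often has flat or inconsistent "#" levels, which is why a level-based approach like this yields a poor tree for such input, as the README warns.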

examples/cloud_demo.py

Lines changed: 56 additions & 0 deletions
@@ -0,0 +1,56 @@
+"""
+PageIndex Cloud Demo
+
+Usage:
+    pip install pageindex
+    export PAGEINDEX_API_KEY=your-api-key
+    python examples/cloud_demo.py
+"""
+import asyncio
+import os
+from pathlib import Path
+import requests
+from pageindex import CloudClient
+
+_DIR = Path(__file__).parent
+PDF_URL = "https://arxiv.org/pdf/1706.03762.pdf"
+PDF_PATH = _DIR / "documents" / "attention.pdf"
+
+# Download PDF if needed
+if not PDF_PATH.exists():
+    print(f"Downloading {PDF_URL} ...")
+    PDF_PATH.parent.mkdir(parents=True, exist_ok=True)
+    with requests.get(PDF_URL, stream=True, timeout=30) as r:
+        r.raise_for_status()
+        with open(PDF_PATH, "wb") as f:
+            for chunk in r.iter_content(chunk_size=8192):
+                if chunk:
+                    f.write(chunk)
+    print("Download complete.\n")
+
+client = CloudClient(api_key=os.environ["PAGEINDEX_API_KEY"])
+col = client.collection()
+
+doc_id = col.add(str(PDF_PATH))
+print(f"Indexed: {doc_id}\n")
+
+# Streaming query
+stream = col.query("What is the main contribution of this paper?", stream=True)
+
+async def main():
+    streamed_text = False
+    async for event in stream:
+        if event.type == "answer_delta":
+            print(event.data, end="", flush=True)
+            streamed_text = True
+        elif event.type == "tool_call":
+            if streamed_text:
+                print()
+                streamed_text = False
+            args = event.data.get("args", "")
+            print(f"[tool call] {event.data['name']}({args})")
+        elif event.type == "answer_done":
+            print()
+            streamed_text = False
+
+asyncio.run(main())
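
Both demo scripts write the PDF straight to its final path, so an interrupted download can leave a partial file that the `PDF_PATH.exists()` check accepts on the next run. The sketch below shows one possible way to avoid that, using only the standard library instead of `requests`: it downloads to a temporary file in the same directory and renames it into place once the download completes. The helper name `fetch_once` and the `.part` suffix are assumptions for this example and do not appear in the commit.

```python
import os
import shutil
import tempfile
import urllib.request
from pathlib import Path


def fetch_once(url: str, dest: Path, timeout: float = 30.0) -> bool:
    """Download url to dest unless dest already exists.

    Returns True if a download ran, False if dest was already present.
    """
    if dest.exists():
        return False
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Temp file lives next to dest so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=dest.parent, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as out:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                shutil.copyfileobj(resp, out)
        os.replace(tmp_name, dest)  # dest appears only after a full download
    except BaseException:
        # Clean up the partial file, then re-raise the original error.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return True
```

With this helper, the download step in either demo would reduce to a single call such as `fetch_once(PDF_URL, PDF_PATH)`.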

examples/local_demo.py

Lines changed: 60 additions & 0 deletions
@@ -0,0 +1,60 @@
+"""
+PageIndex Local Demo
+
+Usage:
+    pip install pageindex
+    python examples/local_demo.py
+"""
+import asyncio
+from pathlib import Path
+import requests
+from pageindex import LocalClient
+
+_DIR = Path(__file__).parent
+PDF_URL = "https://arxiv.org/pdf/1706.03762.pdf"
+PDF_PATH = _DIR / "documents" / "attention.pdf"
+WORKSPACE = _DIR / "workspace"
+
+# Download PDF if needed
+if not PDF_PATH.exists():
+    print(f"Downloading {PDF_URL} ...")
+    PDF_PATH.parent.mkdir(parents=True, exist_ok=True)
+    with requests.get(PDF_URL, stream=True, timeout=30) as r:
+        r.raise_for_status()
+        with open(PDF_PATH, "wb") as f:
+            for chunk in r.iter_content(chunk_size=8192):
+                if chunk:
+                    f.write(chunk)
+    print("Download complete.\n")
+
+client = LocalClient(storage_path=str(WORKSPACE))
+col = client.collection()
+
+doc_id = col.add(str(PDF_PATH))
+print(f"Indexed: {doc_id}\n")
+
+# Streaming query
+stream = col.query(
+    "What is the main architecture proposed in this paper and how does self-attention work?",
+    stream=True,
+)
+
+async def main():
+    streamed_text = False
+    async for event in stream:
+        if event.type == "answer_delta":
+            print(event.data, end="", flush=True)
+            streamed_text = True
+        elif event.type == "tool_call":
+            if streamed_text:
+                print()
+                streamed_text = False
+            print(f"[tool call] {event.data['name']}")
+        elif event.type == "tool_result":
+            preview = str(event.data)[:200] + "..." if len(str(event.data)) > 200 else event.data
+            print(f"[tool output] {preview}")
+        elif event.type == "answer_done":
+            print()
+            streamed_text = False
+
+asyncio.run(main())
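
The event loop in both demos can be exercised without the `pageindex` package or an API key by feeding it a stand-in stream. The sketch below is a hedged illustration, not the SDK's API: the `Event` dataclass, `fake_stream`, and the tool name `get_page` are hypothetical, and only the four event type strings (`answer_delta`, `tool_call`, `tool_result`, `answer_done`) come from the demo code. It collects output lines rather than printing them, and moves the 200-character preview into a helper that always returns a string.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, AsyncIterator


@dataclass
class Event:
    """Hypothetical stand-in for the SDK's streamed event objects."""
    type: str
    data: Any = None


def shorten(data: Any, limit: int = 200) -> str:
    """Return str(data), cut to `limit` characters with a trailing ellipsis."""
    text = str(data)
    return text[:limit] + "..." if len(text) > limit else text


async def render(stream: AsyncIterator[Event]) -> list[str]:
    """Turn a stream of events into output lines."""
    lines: list[str] = []
    answer: list[str] = []  # answer fragments not yet emitted as a line

    def flush() -> None:
        if answer:
            lines.append("".join(answer))
            answer.clear()

    async for event in stream:
        if event.type == "answer_delta":
            answer.append(event.data)
        elif event.type == "tool_call":
            flush()
            lines.append(f"[tool call] {event.data['name']}")
        elif event.type == "tool_result":
            flush()
            lines.append(f"[tool output] {shorten(event.data)}")
        elif event.type == "answer_done":
            flush()
    flush()  # emit any trailing text if the stream ends without answer_done
    return lines


async def fake_stream() -> AsyncIterator[Event]:
    """Made-up events for illustration; a real stream comes from col.query()."""
    yield Event("tool_call", {"name": "get_page"})
    yield Event("tool_result", "x" * 250)
    yield Event("answer_delta", "The ")
    yield Event("answer_delta", "Transformer.")
    yield Event("answer_done")


if __name__ == "__main__":
    for line in asyncio.run(render(fake_stream())):
        print(line)
```

Separating rendering from I/O this way makes the branching logic testable, at the cost of losing the token-by-token printing that the demos show.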
