VisoRAG

VisualDocQA Kit for local GPU document RAG.

Vision-first document RAG with ColQwen2, Qwen2.5-VL, Qdrant, and FastAPI for PDF, DOCX, and image question answering and field extraction.

VisoRAG is a notebook-originated visual document QA and extraction baseline. It renders documents as page images, embeds the pages with ColQwen2, retrieves relevant pages through an in-memory Qdrant MaxSim collection, and answers with local Qwen2.5-VL through a FastAPI API or CLI. It is meant for developers, students, and researchers who want a source-readable visual RAG reference rather than a hosted multi-user product.

Read This First

Area	Status	What it means
Dev verification	Ready	`.[dev]` installs everything needed to run the import, API, notebook, lint, and mocked tests without GPU model downloads.
Real inference	Requires CUDA	`.[gpu]`, model downloads, and a CUDA GPU are required for ColQwen2 and Qwen2.5-VL.
Production deployment	Not hardened	Use a gateway, TLS, restricted CORS, strong token, private logs, and GPU capacity planning first.

No custom web frontend is included in this clean release. The visible product surface is the FastAPI API, CLI, notebook, docs, and tests.

Preview

This is a real screenshot from the local FastAPI /docs route. It shows the actual public API surface.

This is generated conceptual artwork for repository sharing. It is not a product screenshot and does not show a real query result.

What It Solves

Traditional document RAG often starts by extracting text, which can lose layout, tables, charts, figures, and scanned-page context. VisoRAG keeps the document as images first: pages are rendered visually, embedded as page images, retrieved by visual similarity, and passed to a local vision-language model for answering or field extraction.

Good fits:

Exploring visual document RAG architecture.
Building a local GPU baseline for PDF/image QA.
Testing field extraction response contracts.
Learning how to package a notebook workflow as a FastAPI/CLI project.

Not good fits yet:

CPU-only inference.
Hosted multi-user SaaS.
Production document ingestion without additional security controls.
Persistent vector search over a long-lived document corpus.

Source of Truth

The authoritative implementation is VisoRAG_v8_Final (1).ipynb. This clean repository was built from that notebook only. Older sibling files from the original workspace are intentionally excluded because they were stale derived artifacts.

The Python package in src/visorag/ is the contributor-friendly extraction of the notebook behavior. It must preserve the notebook contracts unless a documented migration updates both the notebook and tests.

Features

Vision-first document QA without OCR-first text extraction.
PDF, DOCX, PNG, JPG, and JPEG input support in the real GPU runtime.
ColQwen2 page embeddings and Qdrant in-memory multivector MaxSim retrieval.
Local Qwen2.5-VL answer generation with deterministic generation settings.
Flat JSON responses for extraction mode.
FastAPI app factory and CLI for local serving and testing.
CI-friendly tests that mock heavy GPU/model work.
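
The MaxSim retrieval named in the feature list can be illustrated in a few lines. This is not VisoRAG's or Qdrant's code; it is a plain-Python sketch of late-interaction scoring, assuming each query and page is already a list of embedding vectors. The names `maxsim_score` and `top_k_pages` are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: List[Vector], page_vecs: List[Vector]) -> float:
    """For each query vector, take its best match among the page vectors,
    then sum those maxima (late-interaction scoring)."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


def top_k_pages(
    query_vecs: List[Vector], pages: List[List[Vector]], k: int
) -> List[Tuple[int, float]]:
    """Return (page_index, score) pairs for the k highest-scoring pages."""
    scored = [(i, maxsim_score(query_vecs, vecs)) for i, vecs in enumerate(pages)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy data (illustrative only): two query vectors, three pages of two vectors each.
query = [[1.0, 0.0], [0.0, 1.0]]
pages = [
    [[1.0, 0.0], [0.0, 1.0]],  # matches both query vectors
    [[1.0, 0.0], [1.0, 0.0]],  # matches only the first query vector
    [[0.0, 0.0], [0.0, 0.0]],  # matches neither
]
ranking = top_k_pages(query, pages, k=2)
print(ranking)
```

Because every query vector is matched against every page vector, a page can rank highly when only one region of it (a table cell, a figure label) is relevant to the question.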

Install Modes

Install	Use it for	Does not include
`python -m pip install -e ".[dev]"`	Static checks, notebook contract, API tests, mocked pipeline tests	ColQwen2/Qwen2.5-VL runtime validation
`python -m pip install -e ".[gpu]"`	Real visual retrieval and generation on CUDA	Production auth, hosting, monitoring
DOCX support	Requires LibreOffice `soffice` on `PATH`	Not needed for PDF/image input

Quickstart

Clone the repository and run the lightweight verification path:

git clone https://github.com/RossDmello2/visualdocqa-kit.git
cd visualdocqa-kit
python -m pip install -e ".[dev]"
python scripts/check_notebook_contract.py
python scripts/smoke_import.py
python -m pytest

What this proves: the notebook still matches the expected contract, the package imports without loading GPU models, and API/pipeline behavior passes mocked tests. It does not prove real model inference.

For real GPU inference, install the GPU extra in a CUDA environment:

python -m pip install -e ".[gpu]"

The notebook was designed around a CUDA GPU runtime such as Google Colab T4. The local Qwen2.5-VL generation path is not CPU-ready. No public sample document is bundled; use only synthetic or public documents when sharing screenshots or issues.

API Usage

Start the local API:

export VISORAG_API_TOKEN="change-me"
python -m visorag serve --host 127.0.0.1 --port 8000

PowerShell:

$env:VISORAG_API_TOKEN = "change-me"
python -m visorag serve --host 127.0.0.1 --port 8000

`change-me` and the built-in test token fallback are local-development placeholders, not deployment values. Always set a strong `VISORAG_API_TOKEN` before serving real documents.
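
One way to produce a strong value is the standard-library `secrets` module. The helper name `generate_api_token` is hypothetical and is not part of the VisoRAG CLI; this is a minimal sketch that prints a line you could paste into a shell.

```python
import secrets


def generate_api_token(num_bytes: int = 32) -> str:
    """Return a URL-safe random token built from num_bytes of randomness."""
    return secrets.token_urlsafe(num_bytes)


token = generate_api_token()
# Print an export line for a POSIX shell; adapt the syntax for PowerShell.
print(f'export VISORAG_API_TOKEN="{token}"')
```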

Health check:

python -m visorag health --url http://127.0.0.1:8000

Submit a document:

python -m visorag query \
  --url http://127.0.0.1:8000 \
  --token "$VISORAG_API_TOKEN" \
  --file /path/to/document.pdf \
  --query "Extract total due and invoice number" \
  --query-type extraction \
  --top-k 5

FastAPI routes:

GET / returns service metadata.
GET /health returns model-load flags and device info.
POST /query accepts multipart file plus query, query_type, document_type, and top_k.

See docs/API_REFERENCE.md for status codes, request fields, and both API-layer and pipeline error shapes.
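
For clients that do not use the bundled CLI, the sketch below builds a `POST /query` request with the standard library only. It rests on assumptions that `docs/API_REFERENCE.md` should confirm: that the upload field is named `file`, that the token travels in an `Authorization: Bearer` header, and that `document_type` may be omitted. The function name `build_query_request` is hypothetical.

```python
import uuid
from pathlib import Path
from typing import Optional
from urllib.request import Request


def build_query_request(
    base_url: str,
    token: str,
    file_path: str,
    file_bytes: bytes,
    query: str,
    query_type: str = "extraction",
    top_k: int = 5,
    document_type: Optional[str] = None,
) -> Request:
    """Build (but do not send) a multipart/form-data POST /query request."""
    boundary = uuid.uuid4().hex
    fields = {"query": query, "query_type": query_type, "top_k": str(top_k)}
    if document_type is not None:  # assumed optional
        fields["document_type"] = document_type

    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"'
            f"\r\n\r\n{value}\r\n".encode()
        )
    filename = Path(file_path).name
    parts.append(
        (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
            "Content-Type: application/octet-stream\r\n\r\n"
        ).encode()
        + file_bytes
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())

    return Request(
        url=f"{base_url.rstrip('/')}/query",
        data=b"".join(parts),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",  # assumed header format
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


req = build_query_request(
    "http://127.0.0.1:8000",
    "change-me",
    "invoice.pdf",
    b"%PDF-1.4 placeholder bytes",
    "Extract total due and invoice number",
)
```

Passing `req` to `urllib.request.urlopen` would send it to a running server; the example stops short of that so it can run without one.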

Response Contract

Extraction mode returns flat JSON:

{
  "total_due": 6610.95,
  "invoice_number": "BPXIN-00550"
}

Factual and summary modes return:

{
  "answer": "The total due is 6610.95."
}

Pipeline errors include `request_id`. API validation and auth errors may include only `error` and `detail`, because they are raised before the pipeline starts.
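
A client can tell these shapes apart by inspecting the keys of the decoded JSON body. The sketch below is one possible heuristic, not the project's own client logic; `classify_response` is a hypothetical name, and the error payloads in the usage lines are invented for illustration.

```python
from typing import Any, Dict


def classify_response(payload: Dict[str, Any]) -> str:
    """Label a decoded JSON body by the shape described in the contract."""
    if "error" in payload:
        # Pipeline errors carry request_id; earlier API-layer errors may not.
        return "pipeline_error" if "request_id" in payload else "api_error"
    if set(payload) == {"answer"}:
        return "answer"
    # Anything else is treated as a flat field-to-value extraction result.
    return "extraction"


print(classify_response({"total_due": 6610.95, "invoice_number": "BPXIN-00550"}))
print(classify_response({"answer": "The total due is 6610.95."}))
print(classify_response({"error": "unauthorized", "detail": "bad token"}))
```

The heuristic is ambiguous if an extraction query asks for a field literally named `answer` or `error`, so a client that knows which `query_type` it sent should prefer that information.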

Tech Stack

Layer	Technology
Language	Python 3.10+
API	FastAPI, Uvicorn
CLI	`argparse`, HTTPX
Visual retrieval	ColQwen2 via `colpali-engine`
Vector store	Qdrant in-memory multivector collection
Answer generation	Local `Qwen/Qwen2.5-VL-3B-Instruct`
Document rendering	PyMuPDF, Pillow, LibreOffice for DOCX conversion
Tests and checks	Pytest, Ruff, notebook JSON/AST contract checks

Architecture

flowchart LR
    A["Upload PDF, DOCX, or image"] --> B["Validate request"]
    B --> C["Render pages as images"]
    C --> D["Embed pages with ColQwen2"]
    D --> E["Index in per-request in-memory Qdrant"]
    F["User query"] --> G["Embed query with ColQwen2"]
    G --> H["Retrieve top visual pages"]
    E --> H
    H --> I["Generate answer with local Qwen2.5-VL"]
    I --> J["Flat JSON extraction or answer response"]

Runtime constraints:

Requests are serialized through a process-level query lock.
Uploads default to 25 MB.
Rendered documents default to 200 pages maximum.
Qdrant state is per request and in memory.
Uploaded files and rendered pages are temporary by default.

Longer architecture notes live in docs/ARCHITECTURE.md.

Documentation Map

Need	Start here
First setup	docs/GETTING_STARTED.md
API clients	docs/API_REFERENCE.md
Architecture	docs/ARCHITECTURE.md
Deployment	docs/DEPLOYMENT.md
Troubleshooting	docs/TROUBLESHOOTING.md
Contribution model	CONTRIBUTING.md and docs/feature-extension.md
Naming and SEO rationale	docs/NAMING_SEO_STRATEGY.md

Project Structure

.
|-- VisoRAG_v8_Final (1).ipynb   # authoritative notebook provenance
|-- src/visorag/                 # import-safe package extraction
|-- tests/                       # mocked unit/API/notebook contract tests
|-- scripts/                     # static notebook and import smoke checks
|-- docs/                        # architecture, API, deployment, and operations notes
|-- docs/assets/                 # README visuals and captured API screenshots
`-- .github/                     # CI, issue templates, PR template, Dependabot

Testing

python scripts/check_notebook_contract.py
python scripts/smoke_import.py
python -m ruff format --check .
python -m ruff check .
python -m pytest

Real model inference is intentionally outside the default CI path because it requires CUDA, model downloads, and enough GPU memory. CI proves package/import/API contracts, not production GPU throughput.
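
A mocked test in this style might look like the sketch below. The function `answer_query` and the `retrieve`/`generate` method names are hypothetical stand-ins, not the actual `src/visorag/` API; the point is only that `unittest.mock` objects replace the retriever and generator, so no model is loaded.

```python
from unittest.mock import MagicMock


def answer_query(retriever, generator, query: str, top_k: int = 5) -> dict:
    """Hypothetical pipeline glue: retrieve pages, then generate an answer."""
    pages = retriever.retrieve(query, top_k=top_k)
    return {"answer": generator.generate(query, pages)}


def test_answer_query_uses_retrieved_pages() -> None:
    # Stand-ins for the heavy ColQwen2 and Qwen2.5-VL components.
    retriever = MagicMock()
    retriever.retrieve.return_value = ["page-1", "page-3"]
    generator = MagicMock()
    generator.generate.return_value = "The total due is 6610.95."

    result = answer_query(retriever, generator, "What is the total due?", top_k=2)

    assert result == {"answer": "The total due is 6610.95."}
    retriever.retrieve.assert_called_once_with("What is the total due?", top_k=2)
    generator.generate.assert_called_once_with(
        "What is the total due?", ["page-1", "page-3"]
    )


test_answer_query_uses_retrieved_pages()
```

Such a test checks the wiring and the response shape only; answer quality still has to be evaluated on a CUDA machine with the real models.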

Deployment

VisoRAG is easiest to run as a long-running Python service on a GPU machine or notebook runtime. Do not expose the FastAPI server directly to the internet without a gateway, TLS, restricted CORS, a strong bearer token, request-log controls, and dependency/model-license review. See docs/DEPLOYMENT.md.

Security

Do not upload private documents to public issues, logs, or CI artifacts. The local API uses bearer-token auth but enables permissive CORS for notebook compatibility. The runtime may log filenames and query snippets, so disable or restrict logs before processing sensitive documents. See SECURITY.md and docs/security-model.md.

Support

Use SUPPORT.md for safe support requests and SECURITY.md for vulnerability reporting.

Contributing

Read CONTRIBUTING.md and docs/feature-extension.md before opening a pull request. Behavior changes must preserve the notebook contract or document the migration.

Citation

Use CITATION.cff for software citation metadata. VisoRAG builds on ColQwen2, Qwen2.5-VL, Qdrant, PyMuPDF, FastAPI, and the Python scientific/ML ecosystem; retain upstream license notices when redistributing.

See docs/THIRD_PARTY_NOTICES.md for dependency notice guidance.

Current Status

READY WITH GAPS. Verified: the source package, docs, CI checks, notebook contract, import smoke test, mocked tests, local FastAPI health endpoint, API docs rendering, and the approved GitHub rename to visualdocqa-kit. Not verified in CI: real GPU inference with ColQwen2 and Qwen2.5-VL, which is documented only. Internet-facing deployment requires additional hardening.

License

MIT. See LICENSE.
