An advanced Retrieval-Augmented Generation (RAG) architecture that fuses semantic vector search with a deterministic Neo4j Knowledge Graph, specifically engineered to map and synthesize academic literature in the computer vision domain.
This live demo shows the project’s interactive graph visualization, where academic papers, tasks, methods, and datasets are connected through Neo4j-powered relationships and explored in real time.
CV-Graph-RAG is a hybrid retrieval pipeline and synthesis engine. Unlike standard RAG systems that treat documents as flat bags of text, this system parses computer vision research papers into a strict relational topology. It maps explicit connections between Foundational Models, Methodologies, Datasets, and Vision Tasks, allowing a local Large Language Model (Llama 3) to traverse the academic landscape deterministically.
Standard vector databases are exceptional at finding semantic similarities, but they fundamentally fail at relational logic.
When researching complex architectures, a query like "Identify all object detection methods evaluated on MS-COCO that utilize bounding-box regression" requires a multi-hop logical intersection: every condition must hold for the same paper. Standard vector RAG retrieves chunks in which these keywords are loosely scattered across paragraphs, so a method from one paper can end up paired with a dataset from another. The result is transitivity errors and structural hallucinations, because similarity search cannot enforce boundaries between independent facts.
We needed a system that understands the exact boundaries of a paper's contribution.
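The gap between loose keyword matching and a strict relational intersection can be illustrated with a minimal sketch. It assumes a toy list of (paper, relation, entity) triples. The paper names, the `USES_METHOD` relation, and both helper functions are hypothetical and are not taken from the project code.

```python
from collections import defaultdict

# Hypothetical triples for illustration only; the paper names are invented.
TRIPLES = [
    ("Paper A", "SOLVES", "Object Detection"),
    ("Paper A", "EVALUATED_ON", "MS-COCO"),
    ("Paper A", "USES_METHOD", "Bounding-Box Regression"),
    ("Paper B", "SOLVES", "Object Detection"),
    ("Paper B", "EVALUATED_ON", "Pascal VOC"),
    ("Paper B", "USES_METHOD", "Bounding-Box Regression"),
    ("Paper C", "SOLVES", "Semantic Segmentation"),
    ("Paper C", "EVALUATED_ON", "MS-COCO"),
]


def papers_matching(triples, constraints):
    """Return papers that satisfy every (relation, entity) constraint."""
    facts = defaultdict(set)
    for paper, relation, entity in triples:
        facts[paper].add((relation, entity))
    required = set(constraints)
    return sorted(p for p, f in facts.items() if required <= f)


def keyword_union(triples, keywords):
    """Loose matching: any paper that mentions at least one keyword."""
    return sorted({p for p, _, e in triples if e in keywords})


query = [
    ("SOLVES", "Object Detection"),
    ("EVALUATED_ON", "MS-COCO"),
    ("USES_METHOD", "Bounding-Box Regression"),
]
strict = papers_matching(TRIPLES, query)
loose = keyword_union(TRIPLES, {entity for _, entity in query})
print(strict)  # ['Paper A']
print(loose)   # ['Paper A', 'Paper B', 'Paper C']
```

Only the strict intersection keeps each fact attached to the paper it came from. The loose match returns every paper that shares any keyword with the query.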
Navigating the dense, rapidly evolving landscape of computer vision research—from real-time tracking architectures to complex gesture recognition systems—requires tracking intricate dependencies. When compiling literature for academic writing or scoping out the feasibility of a new model, extracting these structural connections manually is a massive bottleneck.
We built CV-Graph-RAG to transform isolated PDF research papers into a strictly queryable network. By combining the high-recall capabilities of vector embeddings with the strict provenance mapping of a graph database, this engine acts as a rigorous research assistant. It reduces cross-document hallucinations and provides explicit, citation-backed evidence paths for its claims, which speeds up literature review and architecture planning.
[User Query]
│
├─► Vector Space (Top-K Semantic Chunks)
│ └─► Cross-Encoder Re-Ranking
│
├─► Neo4j Knowledge Graph (Cypher Edge Traversal)
│ └─► (Paper) ─[SOLVES]─► (Task) ─[EVALUATED_ON]─► (Dataset)
│
▼
[Hybrid Context Payload]
│
└─► Local LLM Synthesis Layer (Strict Guardrails)
│
└─► [Deterministic, Verifiable Output]
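The graph branch of the diagram above could be expressed as a parameterized Cypher query. The sketch below uses the node labels and relationship types from the diagram (`Paper`, `Task`, `Dataset`, `SOLVES`, `EVALUATED_ON`). The property names `name` and `title` and the function `build_traversal_query` are assumptions for illustration, and the project's actual schema may differ.

```python
def build_traversal_query(task_name: str, dataset_name: str, limit: int = 25):
    """Build a parameterized Cypher query and its parameter dict.

    Values are passed as parameters rather than interpolated into the
    query text, which avoids Cypher injection.
    """
    if limit < 1:
        raise ValueError("limit must be positive")
    query = (
        # Pattern taken from the pipeline diagram; property names are assumed.
        "MATCH (p:Paper)-[:SOLVES]->(t:Task)-[:EVALUATED_ON]->(d:Dataset) "
        "WHERE t.name = $task AND d.name = $dataset "
        "RETURN p.title AS paper, t.name AS task, d.name AS dataset "
        "LIMIT $limit"
    )
    params = {"task": task_name, "dataset": dataset_name, "limit": limit}
    return query, params


query, params = build_traversal_query("Object Detection", "MS-COCO")
print(query)
print(params)
```

With the official `neo4j` Python driver, the pair could then be passed to `session.run(query, params)` inside an open session.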
flowchart TD
subgraph Acquisition Layer
A["PDF Research Papers"]
B["arXiv API Abstracts"]
end
subgraph Extraction & Normalization
C["Structured Entity Extraction"]
D["Entity Resolution & Normalization"]
E["Embedding Generation"]
end
subgraph Storage & Indexing
F["Neo4j Knowledge Graph"]
G["Vector Space Index"]
end
subgraph Retrieval & Inference
H["Flask Server"]
I["Hybrid Retrieval"]
J["Ollama (Llama 3)"]
K["Evaluation Server"]
end
A --> C
B --> C
C --> D
D --> F
D --> E
E --> G
F --> I
G --> I
I --> H
H --> J
H --> K
K --> J
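The Hybrid Retrieval step in the flowchart merges two kinds of evidence before anything reaches the model. One way this might look is sketched below, assuming re-ranked chunks carry a numeric score and graph results arrive as (subject, relation, object) tuples. The `Chunk` class, `build_context`, and the section labels in the output are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    source_id: int  # citation index shown to the model, e.g. [1]
    text: str
    score: float    # re-ranker score; higher is better


def build_context(chunks, graph_paths, top_k=3):
    """Merge the best-scoring chunks and the graph paths into one prompt context."""
    best = sorted(chunks, key=lambda c: c.score, reverse=True)[:top_k]
    lines = ["GRAPH FACTS:"]
    lines += [f"- ({s} -> {r} -> {o})" for s, r, o in graph_paths]
    lines.append("TEXT EVIDENCE:")
    lines += [f"[{c.source_id}] {c.text}" for c in best]
    return "\n".join(lines)


# Toy inputs for illustration only.
chunks = [
    Chunk(1, "The model is benchmarked on standard vision tasks.", 0.91),
    Chunk(2, "Unrelated passage about training schedules.", 0.12),
]
paths = [("Paper X", "SOLVES", "Object Detection")]
print(build_context(chunks, paths, top_k=1))
```

Keeping graph facts and text evidence in separate, labeled sections makes it easier for the synthesis prompt to require that every claim cite one or the other.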
This system was benchmarked against two baselines:
- Naked LLM (No Retrieval)
- Standard Vector RAG
Compared to both baselines, CV-Graph-RAG demonstrated:
- Lower hallucination rates through graph-grounded retrieval.
- Higher relationship recovery accuracy between papers, tasks, methods, and datasets.
- Stronger evidence traceability via explicit graph traversal paths.
- Better multi-hop reasoning across connected research entities.
- More structured and interpretable outputs for literature synthesis tasks.
The detailed comparison is shown below.
| Metric | Naked LLM | Standard Vector RAG | CV-Graph-RAG |
|---|---|---|---|
| Task Extraction Accuracy | Low–Medium | Medium–High | High |
| Dataset Extraction Accuracy | Low–Medium | Medium–High | High |
| Entity Coverage (How many relevant tasks, datasets, methods are found) | Low | Medium | High |
| Relationship Accuracy (Task ↔ Dataset ↔ Method links) | Very Low | Medium | High |
| Hallucination Rate | High | Medium | Low |
| Evidence Traceability | None | Medium | High |
| Multi-hop Reasoning | Poor | Medium | High |
| Consistency Across Runs | Medium–Low | High | Very High |
| Structured Output Quality | Low | Medium | High |
| Interpretability | Low | Medium | Very High |
| Novel Relation Discovery | None | Low | High |
| Source Grounding | None | High | High |
| Graph Traversal Queries | Impossible | Poor | Excellent |
| Scalability to Large Literature Collections | Poor | Medium | High |
| Support for Explainable Retrieval | None | Limited | Strong |
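The table reports qualitative ratings, and the source does not state how they were derived. As one possible way to quantify a row such as Relationship Accuracy, the sketch below scores predicted (subject, relation, object) triples against a hand-labeled gold set. The function name and the toy triples are hypothetical, and this is not the project's evaluation code.

```python
def triple_scores(predicted, gold):
    """Precision, recall, and F1 over sets of (subject, relation, object) triples."""
    predicted, gold = set(predicted), set(gold)
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy data for illustration only.
gold = [
    ("Paper X", "SOLVES", "Object Detection"),
    ("Paper X", "EVALUATED_ON", "MS-COCO"),
]
predicted = [
    ("Paper X", "SOLVES", "Object Detection"),
    ("Paper X", "EVALUATED_ON", "CIFAR-10"),  # wrong dataset
]
print(triple_scores(predicted, gold))  # (0.5, 0.5, 0.5)
```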
Example query: Identify all tasks that are solved by the methods utilized in the paper evaluating GPT-4o, and list the datasets those tasks are typically evaluated on.
Tier 1: Naked LLM Baseline Response
The paper evaluating GPT-4o! That's a fascinating topic.
GPT-4o is a language model designed to generate human-like text based on input prompts. The paper evaluates its performance on various natural language processing (NLP) tasks, which I'll outline below:
1. **Text Classification**: GPT-4o can classify text into predefined categories (e.g., spam vs. non-spam emails). Typical datasets used for evaluation include:
* 20 Newsgroups
* IMDB Movie Reviews
* Stanford Sentiment Treebank
2. **Sentiment Analysis**: The model can analyze the sentiment of text (positive, negative, neutral) and predict whether a piece of text is subjective or objective. Common datasets used for this task include:
* IMDB Movie Reviews
* Stanford Sentiment Treebank
* Amazon Product Reviews
3. **Question Answering**: GPT-4o can answer questions based on the input prompt. This task typically involves evaluating the model's ability to extract relevant information from a given text passage. Popular datasets used for this task include:
* SQuAD (Stanford Question Answering Dataset)
* TriviaQA
* Natural Questions
4. **Named Entity Recognition (NER)**: The model can identify and classify named entities in unstructured text into categories like person, organization, location, date, time, etc. Typical datasets used for evaluation include:
* CoNLL-2003 Shared Task
* OntoNotes
* WikiAnn
5. **Part-of-Speech (POS) Tagging**: GPT-4o can identify the part of speech (noun, verb, adjective, adverb, etc.) for each word in a sentence. Common datasets used for this task include:
* Penn Treebank Corpus
* OntoNotes
* WikiAnn
6. **Language Modeling**: The model can predict the next word in a sequence given the context of the previous words. This task is often evaluated on large-scale text datasets, such as:
* BookCorpus
* WikiText-2
* Common Crawl
7. **Summarization**: GPT-4o can generate a concise summary of a given text passage. Typical datasets used for evaluation include:
* CNN/Daily Mail News Articles
* XSum
* T5-XSUM
Please note that these tasks and datasets are not exhaustive, as the paper may have evaluated the model on additional tasks or datasets specific to their research focus.
Tier 2: Standard Vector RAG Response
Tasks
1. Semantic Segmentation
- Evidence:[1]
- Context Summary: One of the standard computer vision tasks benchmarked using established datasets (e.g., COCO, ImageNet, etc.) to evaluate multimodal foundation models' visual understanding.
2. Object Detection
- Evidence:[1]
- Context Summary: Another standard computer vision task benchmarked using established datasets (e.g., COCO, ImageNet, etc.) to evaluate multimodal foundation models' visual understanding.
3. Image Classification
- Evidence:[1]
- Context Summary: A standard computer vision task benchmarked using established datasets (e.g., ImageNet, etc.) to evaluate multimodal foundation models' visual understanding.
4. Depth and Surface Normal Prediction
- Evidence:[1]
- Context Summary: A geometric task benchmarked using established datasets (e.g., COCO, etc.) to evaluate multimodal foundation models' visual understanding.
Datasets
1. COCO
- Evidence:[1]
- Context Summary: One of the established datasets used to evaluate multimodal foundation models' performance on various computer vision tasks.
2. ImageNet
- Evidence:[1]
- Context Summary: Another established dataset used to evaluate multimodal foundation models' performance on image classification and other computer vision tasks.
Note: The query does not specify a particular method, so I've listed the tasks that are solved by the methods utilized in the paper evaluating GPT-4o.
Tier 3: CV-Graph-RAG Response
Tasks
1. Object Detection
- Evidence: (How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks -> Solves Tasks -> Object Detection)
- Context Summary: Evaluated as one of the standard computer vision tasks to benchmark the multimodal foundation models' performance.[1]
2. Semantic Segmentation
- Evidence: (How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks -> Solves Tasks -> Semantic Segmentation)
- Context Summary: Evaluated as one of the standard computer vision tasks to benchmark the multimodal foundation models' performance.[1]
Datasets
1. ImageNet
- Evidence: (How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks -> Evaluated On Datasets -> ImageNet)
- Context Summary: One of the datasets used to evaluate the multimodal foundation models' performance in standard computer vision tasks.[1]
2. MS-COCO
- Evidence: (How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks -> Evaluated On Datasets -> MS-COCO)
- Context Summary: One of the datasets used to evaluate the multimodal foundation models' performance in standard computer vision tasks.[1]
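The evidence strings in the Tier 3 response above follow a `(paper -> relation -> entity)` pattern. A minimal sketch of how such strings could be rendered, and how an answer's claims could be checked against graph edges before display, is given below. The display labels are copied from that response, while the mapping from relationship types to labels and both function names are assumptions.

```python
# Display labels copied from the Tier 3 response; the mapping from
# relationship types to these labels is an assumption.
RELATION_LABELS = {
    "SOLVES": "Solves Tasks",
    "EVALUATED_ON": "Evaluated On Datasets",
}


def format_evidence(paper, relation, entity):
    """Render one graph edge as a human-readable evidence path."""
    label = RELATION_LABELS.get(relation, relation)
    return f"({paper} -> {label} -> {entity})"


def unsupported_claims(claims, graph_edges):
    """Return the claims that have no matching edge in the graph."""
    edges = set(graph_edges)
    return [c for c in claims if c not in edges]


# Toy edges and claims for illustration only.
edges = [("Paper X", "SOLVES", "Object Detection")]
claims = [
    ("Paper X", "SOLVES", "Object Detection"),
    ("Paper X", "SOLVES", "Depth Estimation"),  # not backed by any edge
]
print(format_evidence(*edges[0]))
print(unsupported_claims(claims, edges))
```

A guard of this kind would let the synthesis layer drop or flag any statement that cannot be traced to an explicit edge.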
cv-graph-rag/
├── data/ # Local storage directory for data assets and indices
│ ├── api_cache/ # Cached API network requests and responses
│ ├── extracted_entities/ # Raw extraction JSON arrays prior to database injection
│ ├── raw_abstracts/ # Unstructured source text packets and metadata from arXiv
│ └── vector_db/ # Persistent vector storage weights and embedding matrices
├── neo4j_data/ # Volume mounting space for the persistent Neo4j engine
├── neo4j_logs/ # System execution logs generated by the graph instance
├── notebooks/ # Prototyping and data exploration environments
│ └── 01_entity_eda.ipynb # Analysis notebook tracking extracted entity distributions
├── src/ # Core engineering application source code
│ ├── v1/ # Legacy monolithic prototype code
│ │ ├── build_graph.py # Initial baseline script compiling early graph nodes
│ │ ├── extract_entities.py # Prototype entity recognition script for source text
│ │ ├── hybrid_search.py # Early keyword-plus-vector hybrid search experiments
│ │ ├── ingest_arxiv.py # Legacy ingestion script communicating with the arXiv API
│ │ └── store_vectors.py # Initial vector compilation script for document tokens
│ └── v2/ # Production-grade modular GraphRAG framework
│ ├── generation/ # Prompt construction and local LLM chat controllers
│ ├── retrieval/ # Orchestrators balancing vector search, re-ranking, and graph traversals
│ ├── templates/ # HTML frontend assets serving the baseline app interface
│ ├── app.py # Primary Flask server managing live application execution
│ ├── batch_ingest.py # Bulk parallel ingestion processing pipeline
│ ├── generate_embeddings.py # Execution script computing high-density vectors for chunks
│ ├── ingest_to_neo4j.py # Graph client populating normalized entities into strict Cypher schemas
│ ├── normalize_graph_entities.py # Entity resolution module merging near-duplicate nodes
│ ├── run_rag.py # Production CLI pipeline interface to run the full architecture manually
│ └── structured_extraction.py # Pipeline forcing unstructured abstracts into JSON schematized tuples
├── tests/evaluation/ # Multi-tier benchmarking framework and verification modules
│ ├── templates/ # UI template files displaying side-by-side tier outputs
│ ├── app.py # Flask web engine powering the live evaluation dashboard
│ └── eval_backend.py # Execution controller routing queries through the three RAG tiers
├── .gitignore # System specifications excluding caches, data directories, and venv
├── docker-compose.yml # Automated orchestration engine provisioning the isolated Neo4j database
├── README.md # Architectural overview, benchmarking logs, and documentation
└── requirements.txt # Explicit Python package dependency configuration manifest
The architecture is split into decoupled layers that manage ingestion, graph construction and traversal, semantic vector search, and local generation.
- Core Framework & API Layer: Python 3.10+, Flask (engine routing and UI server)
- Graph Database Environment: Neo4j Community Server v5.x, Cypher Query Language
- Vector & Embedding Pipeline: Sentence-Transformers (dense embeddings), Cross-Encoder re-rankers (Hugging Face)
- Localized Inference Engine: Ollama API, Llama 3 (8B Instruct, with weights stored and served locally)
- Infrastructure Containerization: Docker, Docker Compose (isolated multi-volume state persistence)
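To show how the inference layer in this stack could be reached from Python, here is a minimal sketch that builds a non-streaming request for Ollama's `/api/generate` endpoint using only the standard library. It assumes Ollama is listening on its default local port, 11434. The helper names and the choice of `temperature=0.0` are illustrative assumptions, not settings confirmed by the project.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_generate_request(prompt, model="llama3", temperature=0.0):
    """Build (but do not send) a non-streaming request to Ollama's generate endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,  # ask for a single JSON object instead of a stream
        "options": {"temperature": temperature},  # assumed setting for repeatable output
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, timeout=120):
    """Send the request and return the generated text (requires a running Ollama server)."""
    request = build_generate_request(prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

Separating request construction from sending keeps the payload easy to inspect and test without a running model.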
Follow these steps to provision the infrastructure, build the vector indices, inject the graph topology, and boot the live evaluation suite locally.
Ensure your local environment meets the following specifications:
- Docker and Docker Compose installed and running
- Python 3.10 or higher configured
- Minimum 16 GB RAM recommended for local model inference
Clone the repository and initialize an isolated virtual environment:
# Clone the repository
git clone https://github.com/your-username/cv-graph-rag.git
cd cv-graph-rag
# Initialize and activate the virtual environment
python -m venv venv
source venv/bin/activate # On Windows use: venv\Scripts\activate
# Install deterministic production dependencies
pip install -r requirements.txt

Use Docker Compose to provision the network-isolated graph database. This handles port mapping and local volume mounting automatically.
# Spin up the containerized Neo4j database in detached mode
docker-compose up -d

Verify that the database is active by visiting the Neo4j Browser at http://localhost:7474. Use the default credentials configured in your docker-compose.yml file.
Ensure Ollama is running on your machine, then pull the target model to handle the localized synthesis layer:
# Pull the production-grade Llama 3 model weights locally
ollama pull llama3

Before running queries, populate the vector database and map the Neo4j topology from raw text documents.
# 1. Fetch, parse, and structure academic papers into JSON format
python src/v2/batch_ingest.py
# 2. Compute semantic vector chunks and build the localized index
python src/v2/generate_embeddings.py
# 3. Inject structured entities into Neo4j graph schemas
python src/v2/ingest_to_neo4j.py
# 4. Resolve entity naming conflicts and merge duplicate nodes
python src/v2/normalize_graph_entities.py

Once data injection is finalized, initialize the globally cached models and launch the Flask interface to start testing.
# Start the live interface server
python tests/evaluation/app.py
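Step 4 of the ingestion sequence merges near-duplicate nodes, for example "COCO" and "MS-COCO" referring to the same dataset. The sketch below shows one simple way such a merge could work, using a normalized lookup key and an alias table. The alias table and all function names are hypothetical, and the actual `normalize_graph_entities.py` may use a different strategy.

```python
import re

# Hypothetical alias table; a real pipeline would need a curated or learned mapping.
ALIASES = {
    "coco": "MS-COCO",
    "mscoco": "MS-COCO",
    "imagenet": "ImageNet",
}


def normalize_key(name):
    """Lowercase the name and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def canonical_name(name):
    """Map a raw entity name to its canonical form, or keep it if unknown."""
    return ALIASES.get(normalize_key(name), name.strip())


def merge_entities(names):
    """Group raw entity names under their canonical form."""
    groups = {}
    for raw in names:
        groups.setdefault(canonical_name(raw), []).append(raw)
    return groups


raw_names = ["COCO", "MS-COCO", "ms coco", "ImageNet", "Image-Net", "KITTI "]
print(merge_entities(raw_names))
```

Each group could then be collapsed into a single graph node, with the raw spellings retained as aliases for later lookups.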