A production-oriented academic assignment that implements an end-to-end multi-modal Retrieval-Augmented Generation (RAG) system with knowledge graph capabilities, tailored for e-commerce. The platform ingests text, PDF, and image files (such as product descriptions and photos), stores chunk embeddings in Pinecone, models relationships in NetworkX, and answers user questions through grounded retrieval plus local Ollama-backed generation.
This system lets a user:
- upload documents across multiple modalities
- convert each modality into retrieval-ready text
- store semantic chunks in a vector database
- connect related chunks and entities through a knowledge graph
- ask grounded questions over the uploaded knowledge base
- inspect retrieved evidence, graph insights, and generated answers in a polished frontend
The current implementation uses:
- `qwen2:0.5b` on local Ollama for final answer generation
- `sentence-transformers/clip-ViT-B-32` for unified cross-modal text and image embeddings
- `Pinecone` serverless vector database
- Multi-modal ingestion pipeline for text documents, PDFs, and images, with optional audio hooks.
- Vector retrieval using `Pinecone` for serverless cloud storage.
- Knowledge graph construction using `NetworkX` for document, chunk, entity, and cross-modal relationships.
- FastAPI backend exposing ingestion, querying, graph summary, inventory listing, and file deletion APIs.
- React + Vite frontend with a presentation-ready dashboard, upload workspace, query console, and evidence panels.
- File inventory management with per-file removal from the UI.
- Dockerized deployment with a single `docker-compose.yml`.
- Local Ollama integration using your host Ollama server.
- Graceful fallback behavior if cloud APIs are unavailable or disabled.
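
The fallback behavior in the last bullet can be pictured with a minimal sketch. It assumes the generator is any callable that may raise when the model is unreachable; the names `extractive_answer` and `answer_with_fallback`, and the word-overlap scoring, are illustrative assumptions rather than the project's actual code.

```python
from typing import Callable, Dict, Sequence


def _terms(text: str) -> set:
    """Lowercased words with surrounding punctuation removed."""
    words = {w.strip(".,?!").lower() for w in text.split()}
    words.discard("")
    return words


def extractive_answer(question: str, chunks: Sequence[str], max_sentences: int = 2) -> str:
    """Deterministic fallback: return the sentences sharing the most words with the question."""
    query_terms = _terms(question)
    scored = []
    for chunk in chunks:
        for sentence in chunk.split(". "):
            sentence = sentence.strip().rstrip(".")
            if sentence:
                scored.append((len(query_terms & _terms(sentence)), sentence))
    # sort() is stable, so equally scored sentences keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    best = [sentence for score, sentence in scored[:max_sentences] if score > 0]
    return ". ".join(best) + "." if best else "No grounded answer found."


def answer_with_fallback(
    question: str,
    chunks: Sequence[str],
    generate: Callable[[str, Sequence[str]], str],
) -> Dict[str, str]:
    """Try the model first; on any failure, answer extractively instead of crashing."""
    try:
        return {"answer": generate(question, chunks), "mode": "llm"}
    except Exception:  # e.g. timeout, connection refused, malformed response
        return {"answer": extractive_answer(question, chunks), "mode": "extractive"}
```

Returning the `mode` alongside the answer is one way to let a UI show which path produced the response.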
This implementation is intentionally aligned with the assignment rubric:
- System Design & Architecture
  - clear backend/frontend separation
  - explicit ingestion, retrieval, graph, and generation stages
- Multi-Modal Implementation
  - working ingestion for text, PDF, and image modalities
  - image understanding handled through a local vision model
- Functionality & Demo
  - interactive upload, query, retrieval, graph insights, and file removal flow
- Dockerization & Deployment
  - single `docker-compose.yml` entry point
- Code Quality & GitHub Usage
  - modular backend services and documented branching strategy
- Literature Survey
  - included in this README and tied back to design choices
- Presentation Quality
  - architecture diagram, demo script, and polished UI included
flowchart LR
A[User Uploads Files] --> B[FastAPI Ingestion API]
B --> C{Modality Router}
C -->|Text| D[Text Extractor]
C -->|PDF| E[PDF Parser]
C -->|Image| F[Ollama Vision Descriptor]
C -->|Audio Optional| G[Audio Transcriber]
D --> H[Chunking + Metadata]
E --> H
F --> H
G --> H
H --> I[Local Embedding Service]
I --> J[Pinecone Vector Store]
H --> K[Entity Extraction]
K --> L[NetworkX Knowledge Graph]
M[User Query] --> N[Query Planner]
N --> O[Local Query Embedding]
O --> J
J --> P[Top-K Retrieved Chunks]
P --> Q[Graph Expansion]
L --> Q
Q --> R[Ollama qwen2:0.5b Answer Generator]
R --> S[Grounded Answer + Citations + Graph Insights]
The editable Mermaid source is available in docs/architecture.mmd.
- Frontend: React, Vite, TypeScript, custom CSS
- Backend: FastAPI, Python 3.11
- Embeddings: `sentence-transformers/clip-ViT-B-32`
- Vector Database: Pinecone
- Knowledge Graph: NetworkX
- Answer Model: Ollama `qwen2:0.5b`
- Vision Model: Ollama `moondream`
- Parsing: `PyPDF2`, `Pillow`
- Deployment: Docker, Docker Compose, Nginx
.
- backend
  - app
    - api
    - services
    - utils
    - main.py
  - data
  - Dockerfile
  - requirements.txt
- frontend
  - src
  - Dockerfile
  - nginx.conf
  - package.json
- docs
  - architecture.mmd
  - demo-script.md
- docker-compose.yml
- .env.example
- README.md
Text documents:
- Extensions: `.txt`, `.md`, `.csv`, `.json`, `.html`
- Flow: read -> normalize -> chunk -> embed -> index
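
The normalize and chunk steps of this flow can be sketched as follows. This is a minimal illustration that assumes word-based windows with overlap; the default `size` and `overlap` values and the function names are hypothetical, not taken from the project.

```python
import re
from typing import List


def normalize(text: str) -> str:
    """Collapse runs of whitespace so chunk boundaries do not depend on layout."""
    return re.sub(r"\s+", " ", text).strip()


def chunk_words(text: str, size: int = 120, overlap: int = 20) -> List[str]:
    """Split text into windows of `size` words, each sharing `overlap` words with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    cleaned = normalize(text)
    words = cleaned.split(" ") if cleaned else []
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

Overlap keeps a sentence that straddles a boundary retrievable from either neighboring chunk, at the cost of some duplicated storage.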
PDF:
- Extension: `.pdf`
- Flow: extract page text with `PyPDF2` -> chunk -> embed -> index
Images:
- Extensions: `.png`, `.jpg`, `.jpeg`, `.bmp`, `.gif`, `.webp`
- Flow:
  - Extract a true visual embedding directly from the image using `sentence-transformers/clip-ViT-B-32`.
  - This allows semantic matching between text queries and image visuals (e.g. searching "red dress" returns an image of a red dress).
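
Cross-modal matching works because text and image vectors live in one shared embedding space and can be compared directly. The sketch below illustrates only the ranking step with cosine similarity. The three-dimensional vectors are hand-made toy values standing in for real CLIP embeddings, and the item IDs are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec: Sequence[float], index: Dict[str, Sequence[float]], k: int = 3) -> List[Tuple[str, float]]:
    """Rank every stored vector against the query and keep the k best."""
    scored = [(item_id, cosine(query_vec, vec)) for item_id, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy vectors only: image and text entries share one space, as CLIP embeddings would.
index = {
    "img:red_dress.jpg": [0.9, 0.1, 0.0],
    "img:blue_jeans.jpg": [0.1, 0.9, 0.0],
    "txt:red-dress-description": [0.8, 0.2, 0.1],
}
query = [1.0, 0.0, 0.0]  # stand-in for the embedding of the text query "red dress"

for item_id, score in top_k(query, index, k=2):
    print(f"{item_id}: {score:.3f}")
```

In the real pipeline the vector database performs this ranking; the sketch only shows why a text query can return an image hit.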
Audio (optional):
- Extensions: `.mp3`, `.wav`, `.m4a`, `.ogg`, `.flac`
- Flow:
  - with `LLM_PROVIDER=openai` and valid API access: Whisper transcription
  - otherwise: filename-level placeholder metadata
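
Routing an upload to the right pipeline can be as simple as a lookup on the file extension. The sketch below uses the extension lists given above; the mapping name, the function name, and the choice to raise `ValueError` for unknown types are assumptions made for illustration.

```python
from pathlib import Path

# Extension lists come from the ingestion notes above; the labels are illustrative.
MODALITY_BY_EXTENSION = {
    **{ext: "text" for ext in (".txt", ".md", ".csv", ".json", ".html")},
    ".pdf": "pdf",
    **{ext: "image" for ext in (".png", ".jpg", ".jpeg", ".bmp", ".gif", ".webp")},
    **{ext: "audio" for ext in (".mp3", ".wav", ".m4a", ".ogg", ".flac")},
}


def detect_modality(filename: str) -> str:
    """Return the modality label for a file, matching extensions case-insensitively."""
    ext = Path(filename).suffix.lower()
    try:
        return MODALITY_BY_EXTENSION[ext]
    except KeyError:
        raise ValueError(f"unsupported file type: {ext or filename!r}") from None
```

Rejecting unknown extensions at this point keeps unsupported files out of the chunking and embedding stages.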
- Query processing rewrites the question, extracts keywords, and infers modality filters.
- Query embeddings are generated locally using the CLIP model.
- Pinecone returns top-k relevant chunks (both text and images).
- NetworkX expands each hit with graph-neighbor insights such as related entities and cross-modal links.
- Ollama `qwen2:0.5b` generates the final answer from grounded retrieved context.
- If an external model path fails, the backend falls back to a deterministic extractive answer mode instead of crashing.
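
The first step in the list above (rewrite, keyword extraction, modality inference) might look roughly like the sketch below. It is a simplified illustration: the "rewrite" only normalizes whitespace, and the stopword list, the hint words, and the `QueryPlan` structure are all assumptions rather than the project's actual planner.

```python
import re
from dataclasses import dataclass
from typing import List

# Both tables are illustrative assumptions, not the project's real vocabulary.
STOPWORDS = {
    "a", "an", "and", "are", "find", "for", "in", "is", "like", "me", "of",
    "that", "the", "their", "these", "this", "to", "what", "with",
}
MODALITY_HINTS = {
    "image": {"image", "images", "photo", "picture", "look"},
    "pdf": {"pdf", "manual"},
}


@dataclass
class QueryPlan:
    rewritten: str
    keywords: List[str]
    modalities: List[str]  # an empty list means "do not filter by modality"


def plan_query(question: str) -> QueryPlan:
    """Normalize the question, pull out keywords, and guess modality filters."""
    rewritten = re.sub(r"\s+", " ", question).strip()
    tokens = re.findall(r"[a-z0-9]+", rewritten.lower())
    keywords: List[str] = []
    for token in tokens:
        if token not in STOPWORDS and token not in keywords:
            keywords.append(token)  # keep first-seen order, drop duplicates
    token_set = set(tokens)
    modalities = sorted(m for m, hints in MODALITY_HINTS.items() if hints & token_set)
    return QueryPlan(rewritten, keywords, modalities)
```

For example, `plan_query("Find me running shoes that look like this image.")` would carry an `image` filter, while a question with no hint words leaves `modalities` empty so every modality is searched.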
The graph contains:
- `document` nodes for uploaded assets
- `chunk` nodes for retrieval units
- `entity` nodes extracted from content
Relationships include:
- `HAS_CHUNK`
- `MENTIONS`
- `CONTAINS_ENTITY`
- `CROSS_MODAL_LINK`
This gives the project genuine graph-RAG behavior rather than plain vector search.
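
A minimal NetworkX sketch of this graph model follows. The node and relationship names are the ones listed above. The edge directions, the ID prefixes, and the rule that a cross-modal link joins chunks of different modalities mentioning the same entity are assumptions made for illustration.

```python
from typing import Dict, List, Set

import networkx as nx


def add_document(graph: nx.DiGraph, doc_id: str, modality: str, chunks: Dict[str, List[str]]) -> None:
    """Add one document; `chunks` maps chunk_id -> entity names found in that chunk."""
    graph.add_node(doc_id, kind="document", modality=modality)
    for chunk_id, entities in chunks.items():
        graph.add_node(chunk_id, kind="chunk", modality=modality)
        graph.add_edge(doc_id, chunk_id, relation="HAS_CHUNK")
        for name in entities:
            entity_id = f"entity:{name.lower()}"
            graph.add_node(entity_id, kind="entity")
            graph.add_edge(chunk_id, entity_id, relation="MENTIONS")
            graph.add_edge(doc_id, entity_id, relation="CONTAINS_ENTITY")


def mentioned_entities(graph: nx.DiGraph, chunk_id: str) -> Set[str]:
    """Entity nodes reached from a chunk through MENTIONS edges."""
    return {t for _, t, d in graph.out_edges(chunk_id, data=True) if d["relation"] == "MENTIONS"}


def link_cross_modal(graph: nx.DiGraph) -> None:
    """Connect chunks of different modalities that mention at least one common entity."""
    chunks = [n for n, d in graph.nodes(data=True) if d["kind"] == "chunk"]
    for i, a in enumerate(chunks):
        for b in chunks[i + 1:]:
            different = graph.nodes[a]["modality"] != graph.nodes[b]["modality"]
            if different and mentioned_entities(graph, a) & mentioned_entities(graph, b):
                graph.add_edge(a, b, relation="CROSS_MODAL_LINK")
                graph.add_edge(b, a, relation="CROSS_MODAL_LINK")


def neighbors_by_relation(graph: nx.DiGraph, node: str, relation: str) -> List[str]:
    """Targets of outgoing edges carrying the given relation, sorted for stable output."""
    return sorted(t for _, t, d in graph.out_edges(node, data=True) if d["relation"] == relation)
```

During retrieval, a helper like `neighbors_by_relation` is the kind of lookup that could expand a vector hit into related entities and cross-modal neighbors.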
- `GET /api/health` - service health
- `GET /api/documents` - list indexed files
- `DELETE /api/documents/{document_id}` - remove one indexed file from inventory, vectors, graph, and uploads
- `GET /api/graph` - graph summary for the frontend
- `POST /api/ingest` - upload and index one or more files
- `POST /api/query` - run a grounded multi-modal RAG query
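
A small standard-library sketch shows how a client might build requests for some of these endpoints. The base URL matches the backend address used in the Docker setup. The JSON field name `question` is an assumption, since the request schema is not specified here; the interactive docs at `/docs` are the place to confirm it.

```python
import json
from urllib.parse import quote
from urllib.request import Request

BASE_URL = "http://localhost:8000"  # backend address when running via Docker Compose


def health_request() -> Request:
    """Build the service health check request."""
    return Request(f"{BASE_URL}/api/health", method="GET")


def query_request(question: str) -> Request:
    """Build a query request; the `question` field name is an assumed schema."""
    body = json.dumps({"question": question}).encode("utf-8")
    return Request(
        f"{BASE_URL}/api/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def delete_request(document_id: str) -> Request:
    """Build a delete request, percent-encoding the ID so it stays one path segment."""
    return Request(f"{BASE_URL}/api/documents/{quote(document_id, safe='')}", method="DELETE")
```

Each object can be sent with `urllib.request.urlopen(request)` once the backend is running; building the request separately keeps the sketch usable without a live server.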
The frontend is designed for demo clarity and grading:
- branded dashboard header with live status cues
- upload panel with selected-file previews
- inventory cards with delete actions
- graph analytics section with modality distribution
- guided query workspace with quick prompt suggestions
- structured answer panel with citations, graph insights, and retrieved evidence
When the system is working correctly, the evaluator should observe:
- after file upload:
  - inventory cards appear immediately
  - modality labels are correct
  - graph node and edge counts increase
- after a query:
  - an answer is generated in the response panel
  - citations appear for retrieved sources
  - graph insights appear when relationships exist
  - evidence cards show top retrieved chunks
- after deleting a file:
  - the inventory card disappears
  - graph counts update
  - subsequent answers no longer use the removed file
- Copy `.env.example` to `.env`.
- Start Ollama on your machine.
- Make sure these models exist locally by running `ollama list`. You should see:
  - `qwen2:0.5b`
  - `moondream`
- If `moondream` is missing, run `ollama pull moondream`.
- Ensure `.env` contains:

LLM_PROVIDER=ollama
OLLAMA_BASE_URL=http://host.docker.internal:11434
OLLAMA_MODEL=qwen2:0.5b
OLLAMA_VISION_MODEL=moondream
OLLAMA_TIMEOUT_SECONDS=120
PINECONE_API_KEY=your_pinecone_api_key_here
PINECONE_INDEX_NAME=multimodal-rag

- Run `docker compose up --build`.
- Open:
  - Frontend: `http://localhost:3000`
  - Backend docs: `http://localhost:8000/docs`
  - Host Ollama API: `http://localhost:11434`
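
Before starting the containers, it can help to confirm that `.env` is complete. The sketch below parses simple `KEY=value` lines and reports problems. Treating all seven keys as required, and the specific checks performed, are assumptions for illustration rather than rules enforced by the backend.

```python
from typing import Dict, List

# Keys taken from the sample .env above; treating all of them as required is an assumption.
REQUIRED_KEYS = [
    "LLM_PROVIDER",
    "OLLAMA_BASE_URL",
    "OLLAMA_MODEL",
    "OLLAMA_VISION_MODEL",
    "OLLAMA_TIMEOUT_SECONDS",
    "PINECONE_API_KEY",
    "PINECONE_INDEX_NAME",
]
PLACEHOLDER_KEY = "your_pinecone_api_key_here"


def parse_env(text: str) -> Dict[str, str]:
    """Parse KEY=value lines, skipping blanks, comments, and lines without '='."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def env_problems(values: Dict[str, str]) -> List[str]:
    """Return human-readable problems; an empty list means the file looks usable."""
    problems = [f"missing {key}" for key in REQUIRED_KEYS if not values.get(key)]
    if values.get("PINECONE_API_KEY") == PLACEHOLDER_KEY:
        problems.append("PINECONE_API_KEY still holds the placeholder value")
    timeout = values.get("OLLAMA_TIMEOUT_SECONDS", "")
    if timeout and not timeout.isdigit():
        problems.append("OLLAMA_TIMEOUT_SECONDS must be a whole number")
    return problems
```

A typical use would be `env_problems(parse_env(open(".env").read()))`, printing each returned message before running Docker Compose.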
Run these checks before presenting:
ollama list
docker compose ps
docker compose logs --tail=50

You should confirm:
- `qwen2:0.5b` is available in Ollama
- `moondream` is available in Ollama
- there are no repeated traceback errors in backend logs
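
The model check can also be scripted. The sketch below assumes `ollama list` prints a header row followed by one row per model, with the model name in the first column, and that a model pulled without an explicit tag appears with `:latest`. Both function names are hypothetical.

```python
from typing import Iterable, List


def installed_models(ollama_list_output: str) -> List[str]:
    """Read model names from the first column, skipping the header row."""
    names = []
    for line in ollama_list_output.splitlines():
        parts = line.split()
        if parts and parts[0].upper() != "NAME":
            names.append(parts[0])
    return names


def missing_models(required: Iterable[str], installed: Iterable[str]) -> List[str]:
    """Return required models that are absent, treating `name` and `name:latest` as equal."""
    available = set(installed)
    available |= {name.split(":")[0] for name in available if name.endswith(":latest")}
    return [model for model in required if model not in available]
```

The text to parse could come from `subprocess.run(["ollama", "list"], capture_output=True, text=True).stdout`; an empty result from `missing_models(["qwen2:0.5b", "moondream"], ...)` means both models are present.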
Backend:
cd backend
python -m venv .venv
.venv\Scripts\activate
pip install -r requirements.txt
uvicorn app.main:app --reload

Frontend:
cd frontend
npm install
npm run dev

- Upload one `.txt` file.
- Upload one `.pdf`.
- Upload one `.jpg` or `.png`.
- Confirm:
  - inventory cards appear
  - graph metrics update
  - modality rows appear in the graph section
- Ask: `Find me running shoes that look like this image, and summarize their reviews.`
- Verify:
  - answer is generated
  - citations appear
  - graph insights appear
  - evidence cards appear
- Remove one file from inventory and confirm:
  - it disappears from the UI
  - graph counts update
  - it no longer contributes to future answers
Use questions like these during testing or presentation:
- `Find me running shoes that look like this image.`
- `What are the features of the black leather jacket?`
- `Compare the specs of these two products.`
A ready demo outline is available in docs/demo-script.md. Suggested flow:
- Show the architecture diagram.
- Explain the dual memory design: vector store + graph store.
- Upload one file from each supported modality.
- Highlight graph counts, modality breakdown, and inventory cards.
- Run a cross-modal query.
- Show retrieved evidence, graph insights, and grounded answer.
- Demonstrate file deletion to show inventory management.
Text and PDFs already contain text, but images do not. The solution was to normalize every modality into a text-centric intermediate representation before chunking and retrieval.
Neo4j would increase setup complexity. NetworkX was selected as a lightweight graph layer that still satisfies graph reasoning and explainability requirements.
The project was shifted to local Ollama-based generation and local embeddings by default, so the demo remains stable even without cloud model access.
The frontend was refined to include better hierarchy, evidence presentation, and inventory controls so the app is easier to present within a 10-minute academic demo.
- Image understanding quality depends on the capability of the local vision model and may be weaker than larger cloud multimodal models.
- Audio support remains optional and is stronger with a cloud transcription provider.
- Local deterministic embeddings are stable and demo-friendly, but they are less semantically rich than strong cloud embedding models.
- This project prioritizes reliability and academic presentation value over enterprise-scale optimization.
This project references the paper "A Survey on Large Language Model based Autonomous Agents" by Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, Wayne Xin Zhao, Zhewei Wei, and Ji-Rong Wen, published in Frontiers of Computer Science in 2024.
Source:
- Springer article: https://link.springer.com/article/10.1007/s11704-024-40231-1
Summary:
- The paper organizes LLM-based autonomous agents into modules such as profiling, memory, planning, and action.
- It argues that strong agent behavior comes from orchestration around the model, not only the model itself.
- That directly influenced this project: Pinecone acts as retrieval memory, the knowledge graph adds structured relational memory, and the query planner orchestrates retrieval before answer generation.
- The survey also highlights robustness, grounding, and evaluation as central concerns, which is why this project surfaces citations, graph insights, and evidence-backed responses.
Recommended branching model when publishing:
- `main` for stable demo-ready code
- `dev` for integration
- `feature/ingestion-pipeline`
- `feature/ollama-integration`
- `feature/frontend-polish`
- `feature/inventory-management`
Suggested workflow:
- Create feature branches from `dev`.
- Make focused commits with descriptive messages.
- Open Pull Requests into `dev`.
- Merge `dev` into `main` once the demo build is stable.
Example commit messages:
- `feat(backend): add multimodal ingestion and chroma indexing`
- `feat(llm): switch answer generation to local ollama models`
- `feat(frontend): redesign dashboard and evidence panels`
- `feat(inventory): add document deletion from ui and backend`
- `docs(readme): align setup and demo steps with ollama pipeline`
- Full-stack app with React frontend and FastAPI backend
- Multi-modal ingestion for at least three modalities
- Vector database retrieval
- Knowledge graph construction
- Local LLM-based answer generation
- Dockerfiles and `docker-compose.yml`
- `.env.example`
- README with architecture, setup, challenges, and literature survey
- The backend runtime targets Python 3.11, which matches the Dockerfile.
- Local Ollama must be running on the host machine for the Dockerized backend to reach `host.docker.internal:11434`.
- The repository is ready to be initialized or pushed to GitHub, but remote publishing must be done from an environment with git access configured.
- If a real OpenAI key was used during testing, rotate it before publishing the repository.