|
| 1 | +<!-- |
| 2 | +Licensed to the Apache Software Foundation (ASF) under one |
| 3 | +or more contributor license agreements. See the NOTICE file |
| 4 | +distributed with this work for additional information |
| 5 | +regarding copyright ownership. The ASF licenses this file |
| 6 | +to you under the Apache License, Version 2.0 (the |
| 7 | +"License"); you may not use this file except in compliance |
| 8 | +with the License. You may obtain a copy of the License at |
| 9 | +
|
| 10 | + http://www.apache.org/licenses/LICENSE-2.0 |
| 11 | +
|
| 12 | +Unless required by applicable law or agreed to in writing, |
| 13 | +software distributed under the License is distributed on an |
| 14 | +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| 15 | +KIND, either express or implied. See the License for the |
| 16 | +specific language governing permissions and limitations |
| 17 | +under the License. |
| 18 | +--> |
| 19 | + |
| 20 | +# Neo4j GraphRAG — TMDB Movies |
| 21 | + |
| 22 | +A full GraphRAG pipeline over a movie knowledge graph stored in Neo4j, |
| 23 | +built entirely with Apache Hamilton. Ingestion, embedding, and retrieval |
| 24 | +are each expressed as first-class Hamilton DAGs — dependencies declared |
| 25 | +through function signatures, execution graph built automatically. |
| 26 | + |
| 27 | +## Hamilton DAG visualisations |
| 28 | + |
| 29 | +Run `--visualise` on any mode to regenerate these from source without |
| 30 | +executing the pipeline. |
| 31 | + |
| 32 | +### Ingestion DAG |
| 33 | + |
| 34 | +```bash |
| 35 | +python run.py --mode ingest --visualise |
| 36 | +``` |
| 37 | + |
| 38 | + |
| 39 | + |
| 40 | +Raw TMDB JSON flows through parsing nodes into batched Neo4j writes. |
| 41 | +Hamilton automatically parallelises the four independent branches |
| 42 | +(movies, genres, companies, person edges) from the shared `raw_movies` |
| 43 | +and `raw_credits` inputs. |
| 44 | + |
| 45 | +--- |
| 46 | + |
| 47 | +### Embedding DAG |
| 48 | + |
| 49 | +```bash |
| 50 | +python run.py --mode embed --visualise |
| 51 | +``` |
| 52 | + |
| 53 | + |
| 54 | + |
| 55 | +Movie texts are fetched from Neo4j, batched through the OpenAI embeddings |
| 56 | +API, written back to Movie nodes, and a cosine vector index is created. |
| 57 | + |
| 58 | +--- |
| 59 | + |
| 60 | +### Retrieval + Generation DAG |
| 61 | + |
| 62 | +```bash |
| 63 | +python run.py --mode query --visualise |
| 64 | +``` |
| 65 | + |
| 66 | + |
| 67 | + |
| 68 | +The full 13-node RAG pipeline. Hamilton wires all dependencies from |
| 69 | +function signatures — no manual orchestration: |
| 70 | + |
| 71 | +``` |
| 72 | +user_query + openai_api_key + neo4j_driver |
| 73 | + -> query_intent classify into VECTOR / CYPHER / AGGREGATE / HYBRID |
| 74 | + -> entity_extraction extract persons, movies, genres, companies, filters |
| 75 | + -> entity_resolution fuzzy-match each entity against the live graph |
| 76 | + -> query_embedding embed query (VECTOR / HYBRID only) |
| 77 | + -> vector_results cosine similarity search (VECTOR / HYBRID only) |
| 78 | + -> cypher_query LLM generates Cypher from resolved entities |
| 79 | + -> cypher_results execute Cypher against Neo4j |
| 80 | + -> merged_results combine both retrieval paths |
| 81 | + -> retrieved_context format as numbered plain-text records |
| 82 | + -> system_prompt inject context into LLM system prompt |
| 83 | + -> prompt_messages assemble message list |
| 84 | + -> answer gpt-4o generates final answer |
| 85 | +``` |
| 86 | + |
| 87 | +## What it demonstrates |
| 88 | + |
| 89 | +**Ingestion DAG** (`ingest_module.py`) |
| 90 | +Loads TMDB JSON, parses entities and relationships, writes to Neo4j via |
| 91 | +batched Cypher `MERGE`. |
| 92 | + |
| 93 | +**Embedding DAG** (`embed_module.py`) |
| 94 | +Computes OpenAI `text-embedding-3-small` embeddings over title + overview, |
| 95 | +writes vectors to Movie nodes, creates a Neo4j cosine vector index. |
| 96 | + |
| 97 | +**Retrieval DAG** (`retrieval_module.py`) |
| 98 | +Classifies each query into one of four strategies, resolves named entities |
| 99 | +against the graph to get canonical names, then executes retrieval: |
| 100 | + |
| 101 | +| Strategy | When used | How it retrieves | |
| 102 | +|-------------|----------------------------------|-----------------------------------------------| |
| 103 | +| `VECTOR` | Thematic / semantic queries | Cosine vector search over Movie embeddings | |
| 104 | +| `CYPHER` | Relational / factual queries | LLM-generated Cypher with resolved entities | |
| 105 | +| `AGGREGATE` | Counting / ranking queries | Aggregation Cypher with popularity guard | |
| 106 | +| `HYBRID` | Filtered + semantic queries | CYPHER + VECTOR, results merged | |
| 107 | + |
| 108 | +The semantic entity resolution layer looks up every extracted entity in |
| 109 | +Neo4j before generating Cypher, so "Warner Bros movies" always resolves |
| 110 | +to the canonical `"Warner Bros."` name in the graph. |
| 111 | + |
| 112 | +**Generation DAG** (`generation_module.py`) |
| 113 | +Formats retrieved records into a grounded system prompt and calls gpt-4o. |
| 114 | + |
| 115 | +## Knowledge graph schema |
| 116 | + |
| 117 | +``` |
| 118 | +(:Movie {id, title, release_date, overview, popularity, vote_average}) |
| 119 | +(:Person {id, name}) |
| 120 | +(:Genre {name}) |
| 121 | +(:ProductionCompany {id, name}) |
| 122 | +
|
| 123 | +(:Person)-[:ACTED_IN {order, character}]->(:Movie) |
| 124 | +(:Person)-[:DIRECTED]->(:Movie) |
| 125 | +(:Movie)-[:IN_GENRE]->(:Genre) |
| 126 | +(:Movie)-[:PRODUCED_BY]->(:ProductionCompany) |
| 127 | +``` |
| 128 | + |
| 129 | +Dataset: 4,803 movies · 56,603 persons · 106,257 ACTED_IN · 5,166 DIRECTED · 20 genres · 5,047 companies |
| 130 | + |
| 131 | +## Prerequisites |
| 132 | + |
| 133 | +- Docker |
| 134 | +- Python 3.10+ |
| 135 | +- OpenAI API key (`gpt-4o` access) |
| 136 | +- TMDB dataset (see `data/README.md`) |
| 137 | + |
| 138 | +## Setup |
| 139 | + |
| 140 | +### 1. Start Neo4j |
| 141 | + |
| 142 | +```bash |
| 143 | +docker compose up -d |
| 144 | +``` |
| 145 | + |
| 146 | +Neo4j browser: http://localhost:7474 (user: `neo4j`, password: `password`) |
| 147 | + |
| 148 | +### 2. Install dependencies |
| 149 | + |
| 150 | +```bash |
| 151 | +python -m venv venv |
| 152 | +source venv/bin/activate |
| 153 | +pip install -r requirements.txt |
| 154 | +``` |
| 155 | + |
| 156 | +### 3. Configure environment |
| 157 | + |
| 158 | +```bash |
| 159 | +cp .env.example .env |
| 160 | +# edit .env — add your OPENAI_API_KEY |
| 161 | +``` |
| 162 | + |
| 163 | +### 4. Download the dataset |
| 164 | + |
| 165 | +Follow `data/README.md` to download and convert the TMDB dataset. |
| 166 | + |
| 167 | +## Running |
| 168 | + |
| 169 | +```bash |
| 170 | +# Step 1 — load graph (takes ~5 seconds) |
| 171 | +python run.py --mode ingest |
| 172 | + |
| 173 | +# Step 2 — compute and store embeddings (takes ~2 minutes) |
| 174 | +python run.py --mode embed |
| 175 | + |
| 176 | +# Step 3 — query |
| 177 | +python run.py --mode query --question "Who directed Inception?" |
| 178 | +python run.py --mode query --question "Which movies did Tom Hanks and Robin Wright appear in together?" |
| 179 | +python run.py --mode query --question "Which production company made the most action movies?" |
| 180 | +python run.py --mode query --question "Recommend movies similar to Inception" |
| 181 | +python run.py --mode query --question "Find me war films rated above 7.5" |
| 182 | +python run.py --mode query --question "Which actors appeared in both a Christopher Nolan and a Steven Spielberg film?" |
| 183 | +``` |
| 184 | + |
| 185 | +## Project structure |
| 186 | + |
| 187 | +``` |
| 188 | +neo4j_graph_rag/ |
| 189 | +├── docker-compose.yml Neo4j 5 + APOC |
| 190 | +├── requirements.txt |
| 191 | +├── .env.example |
| 192 | +├── graph_schema.py Node/relationship definitions and Cypher constraints |
| 193 | +├── ingest_module.py Hamilton DAG: JSON -> Neo4j |
| 194 | +├── embed_module.py Hamilton DAG: Movie nodes -> embeddings -> vector index |
| 195 | +├── retrieval_module.py Hamilton DAG: query -> entity resolution -> retrieval |
| 196 | +├── generation_module.py Hamilton DAG: context + query -> gpt-4o -> answer |
| 197 | +├── run.py Entry point wiring all three pipelines |
| 198 | +├── docs/ |
| 199 | +│ └── images/ |
| 200 | +│ ├── ingest_dag.png |
| 201 | +│ ├── embed_dag.png |
| 202 | +│ └── rag_dag.png |
| 203 | +└── data/ |
| 204 | + └── README.md Dataset download and conversion instructions |
| 205 | +``` |
0 commit comments