Commit c8efc20

Authored by: kartikeyamandhar, jmarshrossney, skrawcz, pjfanning, dependabot[bot]
Add Neo4j GraphRAG example with TMDB movies dataset (#1532)
* Add Neo4j GraphRAG example with TMDB movies dataset. Full pipeline expressed as Hamilton DAGs:
  - Ingestion: TMDB JSON -> Neo4j via batched Cypher MERGE (4,803 movies, 111k+ person edges, 20 genres, 5,047 companies)
  - Embedding: OpenAI text-embedding-3-small on Movie nodes with Neo4j cosine vector index
  - Retrieval: semantic entity resolution + 4-strategy routing (VECTOR / CYPHER / AGGREGATE / HYBRID) with direct Neo4j Cypher
  - Generation: gpt-4o with graph-grounded context
  - Passes 40 test queries across all retrieval categories
  - DAG visualisations in docs/images/
* Fix README image URLs with correct GitHub username
* Fix .gitignore patterns and add DAG images
* Address PR review comments: add Apache 2 headers, remove unused deps, fix image URLs, add .env.example
* Add Neo4j GraphRAG example to ecosystem page
* Fix `@resolve` decorator not calling `validate()` on returned decorators (#1524)
  - Call decorator.validate() in resolve.resolve so that delayed evaluation of extract_fields works; add a unit test
  - Fix new failures in delayed-resolve tests by defining dummy functions with valid annotations within the test body
  - Correct a probable typo in a test name
  - Add a note above the 'defensive' hasattr check to explain its presence
  - Add additional tests for resolve + extract_fields; test resolve + an arbitrary decorator
  - Add one more test that uses parameterize_sources instead of extract_fields
* Various build & release fixes (#1529)
  - Fix release helper: use the --no-use-vcs flag (flit 3.12.0 does not respect the FLIT_USE_VCS=0 env var); restore twine to the prerequisites check; restore the verify_wheel_with_twine call before signing
  - Add release tooling: verify_apache_artifacts.py for GPG signature, checksum, and license verification; scripts/README.md with build, release, and voter verification instructions (uv-based); .rat-excludes for Apache RAT license header checks; verify_ui_build.sh with an Apache license header
  - Fix release helper: add a verify_wheel_with_twine function and a --dry-run flag; remove the -incubating suffix from the wheel (invalid per PEP 427); clean up original flit artifacts after creating incubating copies
  - Add tests, plugin_tests, and representative examples to the flit sdist includes
  - Add uv run to examples
  - Address PR review feedback: fix a typo (singed -> signed) in the release helper; use list unpacking instead of concatenation for files_to_upload; validate exactly one tarball/wheel from the glob (not just non-empty); add a release dependency group (flit, twine) to pyproject.toml; update the README to use uv sync --group release and uv sync --group test; fix grammar and add the --clean flag to uv venv in the README; add a uv run prefix to the verify_apache_artifacts.py epilog examples; run ruff format on verify_apache_artifacts.py
* Exclude databackend from license check in pre-commit
* Remove Apache License comments from databackend.py; fix pre-commit-config (#1535)
* Dependency bumps (dependabot):
  - lodash 4.17.23 -> 4.18.1 in /contrib/docs (#1538)
  - brace-expansion 1.1.12 -> 1.1.13 in /contrib/docs (#1537)
  - aiohttp 3.13.3 -> 3.13.4 in /ui/backend (#1536)
  - pygments 2.19.2 -> 2.20.0 in /ui/backend (#1534)
  - path-to-regexp 0.1.12 -> 0.1.13 in /contrib/docs (#1533)
  - brace-expansion 1.1.12 -> 1.1.13 in /dev_tools/vscode_extension (#1530)
  - requests 2.32.5 -> 2.33.0 in /ui/backend (#1528)
  - picomatch 2.3.1 -> 2.3.2 and 4.0.3 -> 4.0.4 in /ui/frontend (#1526)
  - yaml 2.8.2 -> 2.8.3 and 1.10.2 -> 1.10.3 in /ui/frontend (#1525)
  - braces 3.0.2 -> 3.0.3 in /dev_tools/vscode_extension (#1522)

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Joe Marsh Rossney <17361029+jmarshrossney@users.noreply.github.com>
Co-authored-by: Stefan Krawczyk <stefan@dagworks.io>
Co-authored-by: Stefan Krawczyk <stefank@cs.stanford.edu>
Co-authored-by: PJ Fanning <pjfanning@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
1 parent e84ed90 · commit c8efc20

17 files changed: 2151 additions & 0 deletions

docs/ecosystem/index.md

Lines changed: 1 addition & 0 deletions
```diff
@@ -148,6 +148,7 @@ Persist and cache your data:
 | <img src="../_static/logos/slack.svg" width="20" height="20" style="vertical-align: middle;"> **Slack** | Notifications and integrations | [Examples](https://github.com/apache/hamilton/tree/main/examples/slack) \| [Lifecycle Hook](../reference/lifecycle-hooks/SlackNotifierHook.rst) |
 | <img src="../_static/logos/geopandas.png" width="20" height="20" style="vertical-align: middle;"> **GeoPandas** | Geospatial data analysis | [Type extension](https://github.com/apache/hamilton/blob/main/hamilton/plugins/geopandas_extensions.py) for GeoDataFrame support |
 | <img src="../_static/logos/yaml.svg" width="20" height="20" style="vertical-align: middle;"> **YAML** | Configuration management | [IO Adapters](../reference/io/available-data-adapters.rst) |
+| **Neo4j** | Knowledge graph RAG | [Examples](https://github.com/apache/hamilton/tree/main/examples/LLM_Workflows/neo4j_graph_rag) |

 ---
```

Lines changed: 25 additions & 0 deletions
`.env.example`

@@ -0,0 +1,25 @@

```
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

# OpenAI
OPENAI_API_KEY=your-openai-api-key-here

# Neo4j
NEO4J_URI=bolt://localhost:7687
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=password
NEO4J_DATABASE=neo4j
```
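A minimal stdlib sketch of parsing such `KEY=value` lines (illustrative only; the example itself may load them differently, for instance via python-dotenv or exported shell variables):

```python
def parse_env(text: str) -> dict:
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            key, _, value = line.partition("=")  # split on the first '=' only
            env[key] = value
    return env

sample = "# Neo4j\nNEO4J_URI=bolt://localhost:7687\nNEO4J_USERNAME=neo4j\n"
print(parse_env(sample))
# -> {'NEO4J_URI': 'bolt://localhost:7687', 'NEO4J_USERNAME': 'neo4j'}
```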
Lines changed: 44 additions & 0 deletions
`.gitignore`

@@ -0,0 +1,44 @@

```
# (Apache License 2.0 header omitted: standard ASF boilerplate, as above)

# Environment
.env

# Python
__pycache__/
*.pyc
*.pyo
*.pyd
.Python
venv/
.venv/

# Data files (download separately per data/README.md)
data/*.json
data/*.csv

# DAG visualisations are committed — ignore regenerated copies at root
/ingest_dag.png
/embed_dag.png
/rag_dag.png

# Neo4j
*.dump

# OS
.DS_Store
Thumbs.db
```
Lines changed: 205 additions & 0 deletions
`README.md`

@@ -0,0 +1,205 @@

<!-- Apache License 2.0 header (standard ASF boilerplate) -->
# Neo4j GraphRAG — TMDB Movies

A full GraphRAG pipeline over a movie knowledge graph stored in Neo4j,
built entirely with Apache Hamilton. Ingestion, embedding, and retrieval
are each expressed as first-class Hamilton DAGs — dependencies declared
through function signatures, execution graph built automatically.

## Hamilton DAG visualisations

Run `--visualise` on any mode to regenerate these from source without
executing the pipeline.
### Ingestion DAG

```bash
python run.py --mode ingest --visualise
```

![Ingestion DAG](https://raw.githubusercontent.com/apache/hamilton/examples/neo4j-graph-rag/examples/LLM_Workflows/neo4j_graph_rag/docs/images/ingest_dag.png)

Raw TMDB JSON flows through parsing nodes into batched Neo4j writes.
Hamilton automatically parallelises the four independent branches
(movies, genres, companies, person edges) from the shared `raw_movies`
and `raw_credits` inputs.
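Batched writes of this shape can be sketched as follows. The Cypher text and batch size are illustrative, not the example's actual schema code, and only the batching logic runs here; executing the statements needs a `neo4j` driver session.

```python
# Hypothetical UNWIND + MERGE statement: one round trip per batch of rows.
MERGE_MOVIES = """
UNWIND $rows AS row
MERGE (m:Movie {id: row.id})
SET m.title = row.title, m.overview = row.overview
"""

def batched(rows: list, size: int = 500) -> list:
    """Split rows into chunks so each Cypher call stays bounded."""
    return [rows[i : i + size] for i in range(0, len(rows), size)]

rows = [{"id": i, "title": f"t{i}", "overview": ""} for i in range(1200)]
batches = batched(rows)
# Each batch would then be sent as: session.run(MERGE_MOVIES, rows=batch)
print(len(batches), [len(b) for b in batches])  # -> 3 [500, 500, 200]
```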
---

### Embedding DAG

```bash
python run.py --mode embed --visualise
```

![Embedding DAG](https://raw.githubusercontent.com/apache/hamilton/examples/neo4j-graph-rag/examples/LLM_Workflows/neo4j_graph_rag/docs/images/embed_dag.png)

Movie texts are fetched from Neo4j, batched through the OpenAI embeddings
API, written back to Movie nodes, and a cosine vector index is created.
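The cosine metric behind that index can be written in a few lines of standard-library Python (a sketch of the metric itself, not Neo4j's implementation):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Vectors pointing the same way score ~1.0; orthogonal vectors score 0.0.
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # -> 1.0 (up to float error)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # -> 0.0
```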
---

### Retrieval + Generation DAG

```bash
python run.py --mode query --visualise
```

![RAG DAG](https://raw.githubusercontent.com/apache/hamilton/examples/neo4j-graph-rag/examples/LLM_Workflows/neo4j_graph_rag/docs/images/rag_dag.png)

The full 13-node RAG pipeline. Hamilton wires all dependencies from
function signatures — no manual orchestration:

```
user_query + openai_api_key + neo4j_driver
  -> query_intent        classify into VECTOR / CYPHER / AGGREGATE / HYBRID
  -> entity_extraction   extract persons, movies, genres, companies, filters
  -> entity_resolution   fuzzy-match each entity against the live graph
  -> query_embedding     embed query (VECTOR / HYBRID only)
  -> vector_results      cosine similarity search (VECTOR / HYBRID only)
  -> cypher_query        LLM generates Cypher from resolved entities
  -> cypher_results      execute Cypher against Neo4j
  -> merged_results      combine both retrieval paths
  -> retrieved_context   format as numbered plain-text records
  -> system_prompt       inject context into LLM system prompt
  -> prompt_messages     assemble message list
  -> answer              gpt-4o generates final answer
```
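The merge step's behaviour (combine both retrieval paths, de-duplicated) can be sketched like this; it is an illustrative stand-in, not the code from `retrieval_module.py`, and de-duplicating on an `id` key is an assumption:

```python
def merged_results(cypher_results: list[dict], vector_results: list[dict]) -> list[dict]:
    """Combine CYPHER and VECTOR hits, keeping first occurrence per id."""
    seen: set = set()
    merged = []
    for record in cypher_results + vector_results:
        if record["id"] not in seen:
            seen.add(record["id"])
            merged.append(record)
    return merged

cypher_hits = [{"id": 1, "title": "Inception"}]
vector_hits = [{"id": 1, "title": "Inception"}, {"id": 2, "title": "Interstellar"}]
print(merged_results(cypher_hits, vector_hits))
# -> [{'id': 1, 'title': 'Inception'}, {'id': 2, 'title': 'Interstellar'}]
```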
## What it demonstrates

**Ingestion DAG** (`ingest_module.py`)
Loads TMDB JSON, parses entities and relationships, writes to Neo4j via
batched Cypher `MERGE`.

**Embedding DAG** (`embed_module.py`)
Computes OpenAI `text-embedding-3-small` embeddings over title + overview,
writes vectors to Movie nodes, creates a Neo4j cosine vector index.

**Retrieval DAG** (`retrieval_module.py`)
Classifies each query into one of four strategies, resolves named entities
against the graph to get canonical names, then executes retrieval:

| Strategy    | When used                    | How it retrieves                            |
|-------------|------------------------------|---------------------------------------------|
| `VECTOR`    | Thematic / semantic queries  | Cosine vector search over Movie embeddings  |
| `CYPHER`    | Relational / factual queries | LLM-generated Cypher with resolved entities |
| `AGGREGATE` | Counting / ranking queries   | Aggregation Cypher with popularity guard    |
| `HYBRID`    | Filtered + semantic queries  | CYPHER + VECTOR, results merged             |

The semantic entity resolution layer looks up every extracted entity in
Neo4j before generating Cypher, so "Warner Bros movies" always resolves
to the canonical `"Warner Bros."` name in the graph.

**Generation DAG** (`generation_module.py`)
Formats retrieved records into a grounded system prompt and calls gpt-4o.
## Knowledge graph schema

```
(:Movie {id, title, release_date, overview, popularity, vote_average})
(:Person {id, name})
(:Genre {name})
(:ProductionCompany {id, name})

(:Person)-[:ACTED_IN {order, character}]->(:Movie)
(:Person)-[:DIRECTED]->(:Movie)
(:Movie)-[:IN_GENRE]->(:Genre)
(:Movie)-[:PRODUCED_BY]->(:ProductionCompany)
```

Dataset: 4,803 movies · 56,603 persons · 106,257 ACTED_IN · 5,166 DIRECTED · 20 genres · 5,047 companies
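As an illustration, a parameterised Cypher query against this schema can be assembled in Python. The query text follows the schema above, but the `director_of` helper itself is hypothetical, and running the query requires a `neo4j` driver session:

```python
def director_of(title: str) -> tuple[str, dict]:
    """Build (query, params) for 'who directed <title>?' over the schema."""
    query = (
        "MATCH (p:Person)-[:DIRECTED]->(m:Movie {title: $title}) "
        "RETURN p.name AS director"
    )
    return query, {"title": title}

query, params = director_of("Inception")
print(params)  # -> {'title': 'Inception'}
# With a driver: session.run(query, **params)
```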
## Prerequisites

- Docker
- Python 3.10+
- OpenAI API key (`gpt-4o` access)
- TMDB dataset (see `data/README.md`)

## Setup

### 1. Start Neo4j

```bash
docker compose up -d
```

Neo4j browser: http://localhost:7474 (user: `neo4j`, password: `password`)

### 2. Install dependencies

```bash
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

### 3. Configure environment

```bash
cp .env.example .env
# edit .env — add your OPENAI_API_KEY
```

### 4. Download the dataset

Follow `data/README.md` to download and convert the TMDB dataset.
## Running

```bash
# Step 1 — load graph (takes ~5 seconds)
python run.py --mode ingest

# Step 2 — compute and store embeddings (takes ~2 minutes)
python run.py --mode embed

# Step 3 — query
python run.py --mode query --question "Who directed Inception?"
python run.py --mode query --question "Which movies did Tom Hanks and Robin Wright appear in together?"
python run.py --mode query --question "Which production company made the most action movies?"
python run.py --mode query --question "Recommend movies similar to Inception"
python run.py --mode query --question "Find me war films rated above 7.5"
python run.py --mode query --question "Which actors appeared in both a Christopher Nolan and a Steven Spielberg film?"
```
## Project structure

```
neo4j_graph_rag/
├── docker-compose.yml      Neo4j 5 + APOC
├── requirements.txt
├── .env.example
├── graph_schema.py         Node/relationship definitions and Cypher constraints
├── ingest_module.py        Hamilton DAG: JSON -> Neo4j
├── embed_module.py         Hamilton DAG: Movie nodes -> embeddings -> vector index
├── retrieval_module.py     Hamilton DAG: query -> entity resolution -> retrieval
├── generation_module.py    Hamilton DAG: context + query -> gpt-4o -> answer
├── run.py                  Entry point wiring all three pipelines
├── docs/
│   └── images/
│       ├── ingest_dag.png
│       ├── embed_dag.png
│       └── rag_dag.png
└── data/
    └── README.md           Dataset download and conversion instructions
```
Lines changed: 57 additions & 0 deletions
`data/README.md`

@@ -0,0 +1,57 @@

<!-- Apache License 2.0 header (standard ASF boilerplate) -->

# Data

This example uses the [TMDB 5000 Movie Dataset](https://www.kaggle.com/datasets/tmdb/tmdb-movie-metadata) from Kaggle.

## Download

1. Create a free Kaggle account at https://www.kaggle.com
2. Go to https://www.kaggle.com/datasets/tmdb/tmdb-movie-metadata
3. Click **Download** and unzip the archive
4. Place the following two files in this `data/` folder:

```
data/
├── tmdb_5000_movies.json
└── tmdb_5000_credits.json
```

## Note on file format

The Kaggle archive ships the files as CSV (`tmdb_5000_movies.csv`, `tmdb_5000_credits.csv`).
Several columns contain JSON strings (genres, cast, crew, production_companies).

Convert them to JSON before running ingestion:

```python
import json

import pandas as pd

movies = pd.read_csv("tmdb_5000_movies.csv")
credits = pd.read_csv("tmdb_5000_credits.csv")

with open("tmdb_5000_movies.json", "w") as f:
    json.dump(movies.to_dict(orient="records"), f)

with open("tmdb_5000_credits.json", "w") as f:
    json.dump(credits.to_dict(orient="records"), f)
```

Run this script once from inside the `data/` folder, then proceed with `python run.py --mode ingest`.
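After conversion, columns like `genres` are still JSON-encoded strings inside each record; they can be decoded per row with `json.loads` (the sample row below is made up for illustration):

```python
import json

# A converted record still carries nested JSON as a string column.
row = {
    "title": "Avatar",
    "genres": '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]',
}

# Decode the string column into Python objects before graph ingestion.
genre_names = [g["name"] for g in json.loads(row["genres"])]
print(genre_names)  # -> ['Action', 'Adventure']
```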
Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@

```python
# (Apache License 2.0 header omitted: standard ASF boilerplate)

import json

import pandas as pd

# Convert the Kaggle CSVs to JSON records for the ingestion DAG.
movies = pd.read_csv("examples/LLM_Workflows/neo4j_graph_rag/data/tmdb_5000_movies.csv")
credits = pd.read_csv("examples/LLM_Workflows/neo4j_graph_rag/data/tmdb_5000_credits.csv")

with open("examples/LLM_Workflows/neo4j_graph_rag/data/tmdb_5000_movies.json", "w") as f:
    json.dump(movies.to_dict(orient="records"), f)

with open("examples/LLM_Workflows/neo4j_graph_rag/data/tmdb_5000_credits.json", "w") as f:
    json.dump(credits.to_dict(orient="records"), f)
```
