Commit 01d25b5

Enhance Python bindings and examples for ArcadeDB
- Updated EXAMPLES_PLAN.md to reflect the current status and detailed architecture for the Stack Exchange dataset import example.
- Revised README.md to include installation instructions for optional dependencies and clarified example execution steps.
- Modified benchmark scripts to use the correct dataset directory for MovieLens instead of ml.
- Adjusted memory monitoring script examples to reference the correct database paths for MovieLens.
- Updated __init__.py to import XML importer instead of JSON and Neo4j.
- Enhanced importer.py to support XML import with detailed limitations and quirks documented.
- Added new tests in test_importer.py for complex data types, NULL handling, Unicode, and performance with larger datasets.
1 parent f9992c1 commit 01d25b5

7 files changed

Lines changed: 722 additions & 115 deletions

File tree

bindings/python/examples/EXAMPLES_PLAN.md

Lines changed: 248 additions & 16 deletions
@@ -99,23 +99,255 @@
9999
**Target**: Backend developers, full-stack engineers, data scientists
100100
**Concepts**: Documents + Graph + Vectors in one system, complex queries, multi-model integration
101101
**Why**: Comprehensive showcase of ArcadeDB's multi-model capabilities with rich dataset
102-
**Status**: 🚧 Planned
103-
**Dataset**: Stack Exchange data dump (Python/JavaScript tags, converted from XML to CSV)
104-
**Scope**: 300-400 lines (comprehensive multi-model example)
102+
**Status**: 🚧 In Progress
103+
**Dataset**: Stack Exchange XML data dump (8 XML files: Posts, Users, Tags, Comments, Votes, Badges, PostLinks, PostHistory)
104+
**Dataset Sizes**:
105+
- Small (cs.stackexchange.com): ~1.4M records, ~650MB, 668 tags
106+
- Medium (stats.stackexchange.com): ~5M records, ~2.5GB, 1,612 tags
107+
- Large (stackoverflow.com): ~350M records, ~325GB, 65K tags
108+
**Scope**: Single comprehensive Python file (~800-1000 lines) with class-based architecture
109+
110+
## Architecture: Three Phases in ONE File
111+
112+
### **Phase 1: Document Import (SchemaBuilder + DocumentImporter classes)**
113+
Import 8 XML files → 8 Document types with schema-first approach
114+
```python
115+
class SchemaBuilder:
116+
    """Creates ArcadeDB schema from analyzed JSON schema"""
117+
    def load_schema(self, json_path): ...  # Load stackoverflow_schema.json
118+
    def create_document_types(self): ...  # CREATE DOCUMENT TYPE Post, User, etc.
119+
    def create_properties(self): ...  # CREATE PROPERTY with explicit types
120+
    def handle_type_conflicts(self): ...  # Use STRING for INTEGER+DATETIME conflicts
121+
122+
class DocumentImporter:
123+
    """Imports XML → Documents with streaming parser"""
124+
    def import_entity(self, xml_path, entity): ...  # Stream XML, batch insert
125+
    def convert_types(self): ...  # STRING → INTEGER/DATETIME conversion
126+
    def handle_nulls(self): ...  # Missing attributes = NULL
127+
    def batch_commit(self): ...  # Commit every 5,000-10,000 records
128+
    def create_indexes_after_import(self): ...  # Primary keys + foreign keys
129+
```
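A minimal sketch of the streaming-and-batching idea behind `DocumentImporter`, using only the standard library. It assumes each dump file is a flat list of `<row>` elements whose attributes carry the record fields. The names `stream_rows`, `import_in_batches`, and `flush` are hypothetical, and `flush` stands in for whatever batched insert and commit the real importer performs.

```python
import io
import xml.etree.ElementTree as ET
from typing import Callable, Dict, Iterator, List


def stream_rows(source) -> Iterator[Dict[str, str]]:
    """Yield the attributes of each <row> element as the parser reaches it."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "row":
            yield dict(elem.attrib)  # copy before clearing the element
            elem.clear()  # drop the element's attributes and children


def import_in_batches(
    source, batch_size: int, flush: Callable[[List[Dict[str, str]]], None]
) -> int:
    """Group rows into batches of batch_size and hand each batch to flush."""
    batch: List[Dict[str, str]] = []
    total = 0
    for row in stream_rows(source):
        batch.append(row)
        if len(batch) >= batch_size:
            flush(batch)
            total += len(batch)
            batch = []
    if batch:  # final partial batch
        flush(batch)
        total += len(batch)
    return total


# Tiny in-memory demo; a real run would pass a file path and a batch size
# in the 5,000-10,000 range mentioned above.
sample = io.BytesIO(
    b'<posts><row Id="1" Score="5"/><row Id="2"/><row Id="3" Score="1"/></posts>'
)
batches: List[List[Dict[str, str]]] = []
total = import_in_batches(sample, batch_size=2, flush=batches.append)
```

Attributes that are absent from a row are simply missing from the yielded dict, which matches the "missing attributes = NULL" rule if the caller reads fields with `dict.get`.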
130+
131+
**Documents Created**:
132+
- Post (22 attrs): Id, PostTypeId, Title, Body, Tags, OwnerUserId, CreationDate, Score, etc.
133+
- User (12 attrs): Id, DisplayName, Reputation, AboutMe, Location, CreationDate, etc.
134+
- Tag (5 attrs): Id, TagName, Count, ExcerptPostId, WikiPostId
135+
- Comment (7 attrs): Id, PostId, UserId, Text, Score, CreationDate, UserDisplayName
136+
- Vote (6 attrs): Id, PostId, UserId, VoteTypeId, BountyAmount, CreationDate
137+
- Badge (6 attrs): Id, UserId, Name, Date, Class, TagBased
138+
- PostLink (5 attrs): Id, PostId, RelatedPostId, LinkTypeId, CreationDate
139+
- PostHistory (10 attrs): Id, PostId, UserId, PostHistoryTypeId, Text, Comment, CreationDate, etc.
140+
141+
**Indexes Created** (after import):
142+
```sql
143+
-- Primary Keys (UNIQUE)
144+
CREATE INDEX ON Post (Id) UNIQUE
145+
CREATE INDEX ON User (Id) UNIQUE
146+
CREATE INDEX ON Tag (TagName) UNIQUE
147+
148+
-- Foreign Keys (NOTUNIQUE) - for graph traversal
149+
CREATE INDEX ON Post (OwnerUserId, ParentId, AcceptedAnswerId, PostTypeId) NOTUNIQUE
150+
CREATE INDEX ON Comment (PostId, UserId) NOTUNIQUE
151+
CREATE INDEX ON Vote (PostId, UserId, VoteTypeId) NOTUNIQUE
152+
CREATE INDEX ON Badge (UserId, Name) NOTUNIQUE
153+
CREATE INDEX ON PostLink (PostId, RelatedPostId, LinkTypeId) NOTUNIQUE
154+
CREATE INDEX ON PostHistory (PostId, UserId) NOTUNIQUE
155+
156+
-- Temporal/Scoring (NOTUNIQUE)
157+
CREATE INDEX ON Post (CreationDate, Score) NOTUNIQUE
158+
CREATE INDEX ON User (Reputation) NOTUNIQUE
159+
```
160+
161+
**Type Conflict Handling**:
162+
```python
163+
# Fields with DATETIME conflicts in large dataset → use STRING
164+
CONFLICT_FIELDS = {
165+
'Posts': ['AcceptedAnswerId'], # 38 DATETIME in 1M samples
166+
'Users': ['AccountId'],
167+
'Comments': ['Id'], # 7,255 DATETIME!
168+
'Votes': ['Id'], # 13,188 DATETIME!
169+
'Badges': ['Id', 'UserId'],
170+
'PostLinks': ['Id', 'PostId', 'RelatedPostId'],
171+
'Tags': ['ExcerptPostId', 'WikiPostId'],
172+
}
173+
# Strategy: Store as STRING, convert to INTEGER at query time (safe)
174+
```
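One way to read "convert to INTEGER at query time (safe)" is a conversion that returns nothing instead of raising when a stored string is not an integer. The helper below is a hypothetical sketch of that idea in plain Python, not part of the bindings, and the sample values are made up for illustration.

```python
from typing import List, Optional


def to_int_or_none(value: Optional[str]) -> Optional[int]:
    """Return value as an int, or None when it is missing or not an integer."""
    if value is None:
        return None
    try:
        return int(value.strip())
    except ValueError:
        return None


# Hypothetical raw values for a conflicting field stored as STRING.
raw_values: List[Optional[str]] = ["42", None, "2008-07-31T21:42:52.667", ""]
converted = [to_int_or_none(v) for v in raw_values]
```

Rows whose value parses as a date rather than an integer come back as `None`, so callers can filter them out instead of failing mid-query.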
175+
176+
### **Phase 2: Graph Creation (GraphBuilder class)**
177+
Convert Documents → Vertices + Edges
105178
```python
106-
# Documents: Questions/Answers with full text, searchable content
107-
# Graph: User → ASKED/ANSWERED → Question, Question → TAGGED_WITH → Tag
108-
# Vectors: Question embeddings for duplicate detection
109-
# Multi-model queries:
110-
# - "Find Python experts" (graph traversal + aggregation)
111-
# - "Similar unanswered questions" (vector + document filtering)
112-
# - "Trending topics by tag relationships" (graph analytics)
113-
# XML → CSV conversion (provide script or use existing tools)
114-
```
115-
**Dataset Options**:
116-
- Stack Exchange data dump (8-10 CSV files after conversion)
117-
- Alternative: E-commerce dataset (products, reviews, customers)
118-
- Decision: To be determined based on conversion ease and dataset availability
179+
class GraphBuilder:
180+
    """Creates graph from documents"""
181+
    def convert_to_vertices(self): ...  # Post/User/Tag documents → vertices
182+
    def create_user_vertices(self): ...  # User documents → User vertices
183+
    def create_post_vertices(self): ...  # Post documents → Post vertices (Q/A)
184+
    def create_tag_vertices(self): ...  # Tag documents → Tag vertices
185+
    def create_edges(self): ...  # Create all relationship edges
186+
187+
    # Edge creation methods (use indexes for fast lookups)
188+
    def create_asked_answered_edges(self): ...  # Post.OwnerUserId → User
189+
    def create_answer_to_edges(self): ...  # Post.ParentId → Post (answers)
190+
    def create_has_tag_edges(self): ...  # Post.Tags → Tag (parse pipe-delimited)
191+
    def create_commented_edges(self): ...  # Comment.UserId → User, PostId → Post
192+
    def create_voted_edges(self): ...  # Vote.UserId → User (if not NULL)
193+
    def create_earned_badge_edges(self): ...  # Badge.UserId → User
194+
    def create_linked_edges(self): ...  # PostLink relationships
195+
```
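The tag-parsing step that `create_has_tag_edges` relies on could look roughly like this. `parse_tags` is a hypothetical helper, and it assumes the pipe-delimited `Tags` format described in this plan (for example `|python|sql|`).

```python
from typing import List, Optional


def parse_tags(raw: Optional[str]) -> List[str]:
    """Split a pipe-delimited tag string into unique tag names, in order."""
    if not raw:
        return []
    seen = set()
    tags: List[str] = []
    for name in raw.split("|"):
        name = name.strip()
        if name and name not in seen:  # skip empty pieces and repeats
            seen.add(name)
            tags.append(name)
    return tags


tags = parse_tags("|python|sql|")
```

Each returned name can then be looked up through the unique index on `Tag (TagName)` to find the target of a `HAS_TAG` edge.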
196+
197+
**Vertices Created**:
198+
- User vertex (from User document)
199+
- Post vertex (from Post document, discriminate Question vs Answer via PostTypeId)
200+
- Tag vertex (from Tag document)
201+
202+
**Edges Created**:
203+
```python
204+
# User → Post relationships
205+
User -[ASKED]-> Post # Post.PostTypeId = 1 (question)
206+
User -[ANSWERED]-> Post # Post.PostTypeId = 2 (answer)
207+
User -[COMMENTED]-> Post # From Comment.UserId → PostId
208+
User -[VOTED]-> Post # From Vote.UserId → PostId (if not NULL)
209+
210+
# Post → Post relationships
211+
Post -[ANSWER_TO]-> Post # Post.ParentId (answer to question)
212+
Post -[LINKED_TO]-> Post # From PostLink.PostId → RelatedPostId
213+
214+
# Post → Tag relationships
215+
Post -[HAS_TAG]-> Tag # Parse Post.Tags (pipe-delimited: "|python|sql|")
216+
217+
# User → Badge relationships
218+
User -[EARNED_BADGE]-> Badge # From Badge.UserId
219+
```
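As an illustration of how the `ASKED` and `ANSWERED` edges follow from `PostTypeId` and `OwnerUserId`, here is a small in-memory sketch. The function name and the dict-based lookup are assumptions. The dict stands in for the unique index on `User (Id)`, and the function returns plain tuples rather than creating edges in a database.

```python
from typing import Dict, List, Tuple

# PostTypeId values as described above: 1 = question, 2 = answer.
EDGE_TYPE_BY_POST_TYPE = {1: "ASKED", 2: "ANSWERED"}


def build_authorship_edges(
    users: List[Dict], posts: List[Dict]
) -> List[Tuple[int, str, int]]:
    """Return (user_id, edge_type, post_id) for posts with a known owner."""
    user_index = {u["Id"]: u for u in users}  # stand-in for the UNIQUE index
    edges = []
    for post in posts:
        owner = post.get("OwnerUserId")
        edge_type = EDGE_TYPE_BY_POST_TYPE.get(post.get("PostTypeId"))
        if owner is None or edge_type is None or owner not in user_index:
            continue  # NULL owner, other post type, or dangling reference
        edges.append((owner, edge_type, post["Id"]))
    return edges


users = [{"Id": 5}, {"Id": 6}]
posts = [
    {"Id": 1000, "PostTypeId": 1, "OwnerUserId": 5},
    {"Id": 1001, "PostTypeId": 2, "OwnerUserId": 6},
    {"Id": 1002, "PostTypeId": 2, "OwnerUserId": None},  # owner missing
    {"Id": 1003, "PostTypeId": 1, "OwnerUserId": 99},    # unknown user
]
edges = build_authorship_edges(users, posts)
```

Skipping posts with a NULL or unknown owner mirrors the "if not NULL" condition noted for votes.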
220+
221+
**Graph Queries** (after Phase 2):
222+
```sql
223+
-- Find user's questions and answers
224+
MATCH {User, as: u} -[ASKED|ANSWERED]-> {Post, as: p}
225+
WHERE u.Id = 5
226+
RETURN u.DisplayName, p.Title, p.Score
227+
228+
-- Find answers to a question
229+
MATCH {Post, as: q} <-[ANSWER_TO]- {Post, as: a}
230+
WHERE q.Id = 1000
231+
RETURN a.Score, a.Body
232+
ORDER BY a.Score DESC
233+
234+
-- Find posts by tag
235+
MATCH {Post, as: p} -[HAS_TAG]-> {Tag, as: t}
236+
WHERE t.TagName = 'python'
237+
RETURN p.Title, p.Score
238+
239+
-- Find top contributors (users with most answers)
240+
MATCH {User, as: u} -[ANSWERED]-> {Post, as: p}
241+
RETURN u.DisplayName, count(p) as answer_count, u.Reputation
242+
ORDER BY answer_count DESC
243+
LIMIT 10
244+
```
245+
246+
### **Phase 3: Vector Search (VectorBuilder class)**
247+
Add embeddings for semantic search
248+
```python
249+
class VectorBuilder:
250+
    """Generates embeddings and creates vector indexes"""
251+
    def generate_post_embeddings(self): ...  # Post.Title + Body → embedding
252+
    def generate_tag_embeddings(self): ...  # Tag.TagName + Excerpt → embedding
253+
    def generate_user_embeddings(self): ...  # User.DisplayName + AboutMe → embedding
254+
    def create_vector_indexes(self): ...  # HNSW indexes for each
255+
    def semantic_search(self): ...  # Find similar posts/tags/users
256+
```
257+
258+
**Embeddings**:
259+
```python
260+
# Post embeddings (duplicate question detection)
261+
Post.embedding = embed(Post.Title + "\n" + Post.Body) # 384 dims
262+
CREATE VECTOR INDEX ON Post (embedding) HNSW
263+
264+
# Tag embeddings (related tags)
265+
Tag.embedding = embed(Tag.TagName + "\n" + Tag.ExcerptPostId.Body)
266+
CREATE VECTOR INDEX ON Tag (embedding) HNSW
267+
268+
# User embeddings (similar expertise)
269+
User.embedding = embed(User.DisplayName + "\n" + User.AboutMe)
270+
CREATE VECTOR INDEX ON User (embedding) HNSW
271+
```
272+
273+
**Vector Queries** (after Phase 3):
274+
```python
275+
# Find duplicate/similar questions
276+
query_post = db.query("SELECT FROM Post WHERE Id = 1000")
277+
similar_posts = index.find_nearest(query_post.embedding, k=10)
278+
279+
# Find related tags
280+
query_tag = db.query("SELECT FROM Tag WHERE TagName = 'python'")
281+
related_tags = index.find_nearest(query_tag.embedding, k=10)
282+
283+
# Find users with similar expertise
284+
query_user = db.query("SELECT FROM User WHERE Id = 5")
285+
similar_users = index.find_nearest(query_user.embedding, k=10)
286+
```
287+
288+
**Multi-Model Queries** (combine all three):
289+
```sql
290+
-- Find Python experts (graph + aggregation)
291+
MATCH {User, as: u} -[ANSWERED]-> {Post, as: p} -[HAS_TAG]-> {Tag, as: t}
292+
WHERE t.TagName = 'python' AND p.Score >= 5
293+
RETURN u.DisplayName, count(p) as answers, sum(p.Score) as total_score
294+
ORDER BY total_score DESC LIMIT 10
295+
296+
-- Similar unanswered questions (vector + document filtering)
297+
-- 1. Find similar questions via vector search
298+
-- 2. Filter by AcceptedAnswerId IS NULL
299+
-- 3. Rank by Score
300+
301+
-- Trending topics (graph analytics on tag co-occurrence)
302+
MATCH {Post, as: p} -[HAS_TAG]-> {Tag, as: t1},
303+
{Post, as: p} -[HAS_TAG]-> {Tag, as: t2}
304+
WHERE t1 != t2
305+
RETURN t1.TagName, t2.TagName, count(p) as co_occurrence
306+
ORDER BY co_occurrence DESC LIMIT 20
307+
```
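The three steps listed for "Similar unanswered questions" can be sketched in plain Python. This is only an illustration under stated assumptions: a brute-force cosine scan replaces the HNSW index, posts are in-memory dicts, and `similar_unanswered` is a hypothetical name rather than an ArcadeDB API.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similar_unanswered(
    query_embedding: Sequence[float], posts: List[Dict], k: int = 10
) -> List[Dict]:
    # 1. Find the k most similar posts (brute force, standing in for HNSW).
    nearest = sorted(
        posts,
        key=lambda p: cosine_similarity(query_embedding, p["embedding"]),
        reverse=True,
    )[:k]
    # 2. Keep only posts without an accepted answer.
    unanswered = [p for p in nearest if p.get("AcceptedAnswerId") is None]
    # 3. Rank the remaining posts by Score, highest first.
    return sorted(unanswered, key=lambda p: p["Score"], reverse=True)


# Made-up sample data for illustration.
posts = [
    {"Id": 1, "embedding": [1.0, 0.0], "Score": 3, "AcceptedAnswerId": None},
    {"Id": 2, "embedding": [0.9, 0.1], "Score": 7, "AcceptedAnswerId": 55},
    {"Id": 3, "embedding": [0.8, 0.2], "Score": 9, "AcceptedAnswerId": None},
    {"Id": 4, "embedding": [0.0, 1.0], "Score": 100, "AcceptedAnswerId": None},
]
result_ids = [p["Id"] for p in similar_unanswered([1.0, 0.0], posts, k=3)]
```

Because the filter runs after the nearest-neighbour cut, fewer than `k` posts may come back. Filtering first, or over-fetching, is a design choice the plan leaves open.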
308+
309+
## Class Architecture
310+
311+
```python
312+
class StackOverflowDatabase:
313+
    """Main orchestrator - manages all three phases"""
314+
    def __init__(self, db_path, dataset_size): ...
315+
    def run_full_pipeline(self): ...  # Execute all phases
316+
    def run_phase_1_documents(self): ...  # Import documents only
317+
    def run_phase_2_graph(self): ...  # Add graph layer
318+
    def run_phase_3_vectors(self): ...  # Add vector search
319+
    def validate_each_phase(self): ...  # Check counts, run sample queries
320+
    def export_database(self): ...  # Export to JSONL for reproducibility
321+
322+
class SchemaBuilder:
323+
    """Phase 1: Schema creation"""
324+
325+
class DocumentImporter:
326+
    """Phase 1: XML import"""
327+
328+
class GraphBuilder:
329+
    """Phase 2: Graph layer"""
330+
331+
class VectorBuilder:
332+
    """Phase 3: Vector search"""
333+
```
334+
335+
## Dataset Options & Performance
336+
337+
**Small Dataset (cs.stackexchange.com)** - RECOMMENDED FOR DEVELOPMENT:
338+
- 105K posts, 138K users, 668 tags, 195K comments
339+
- ~650MB XML, imports in ~2-5 minutes
340+
- Perfect for development and testing
341+
342+
**Medium Dataset (stats.stackexchange.com)**:
343+
- 425K posts, 345K users, 1.6K tags, 819K comments
344+
- ~2.5GB XML, imports in ~10-20 minutes
345+
346+
**Large Dataset (stackoverflow.com)** - PRODUCTION SCALE:
347+
- 59M posts, 20M users, 65K tags, ~100M comments
348+
- ~325GB XML, imports in hours (97GB Posts.xml alone!)
349+
- Requires 8GB+ JVM heap, production server setup
350+
119351
**Note**: Largest example, demonstrates all three models working together
120352

121353
### 8. **Server Mode: HTTP API + Studio** (Priority: HIGH)

bindings/python/examples/run_benchmark_04_csv_import_documents.sh

Lines changed: 2 additions & 2 deletions
@@ -74,9 +74,9 @@ echo ""
7474
# Check if dataset exists
7575
DATA_BASE="./data"
7676
if [ "$SIZE" = "small" ]; then
77-
DATA_DIR="$DATA_BASE/ml-small"
77+
DATA_DIR="$DATA_BASE/movielens-small"
7878
else
79-
DATA_DIR="$DATA_BASE/ml-large"
79+
DATA_DIR="$DATA_BASE/movielens-large"
8080
fi
8181

8282
if [ ! -d "$DATA_DIR" ]; then

bindings/python/examples/run_benchmark_05_csv_import_graph.sh

Lines changed: 9 additions & 9 deletions
@@ -30,8 +30,8 @@
3030
# ./run_benchmark_05_csv_import_graph.sh small 5000 4 all_6 --export
3131
# ./run_benchmark_05_csv_import_graph.sh large 10000 8 java
3232
# ./run_benchmark_05_csv_import_graph.sh small 5000 4 all_java
33-
# ./run_benchmark_05_csv_import_graph.sh small 5000 4 java --import-jsonl ./exports/ml_small_db.jsonl.tgz
34-
# ./run_benchmark_05_csv_import_graph.sh small 5000 4 java --import-jsonl ./exports/ml_small_db.jsonl.tgz --export
33+
# ./run_benchmark_05_csv_import_graph.sh small 5000 4 java --import-jsonl ./exports/movielens_small_db.jsonl.tgz
34+
# ./run_benchmark_05_csv_import_graph.sh small 5000 4 java --import-jsonl ./exports/movielens_small_db.jsonl.tgz --export
3535
#
3636

3737
# Start timing
@@ -153,7 +153,7 @@ fi
153153
echo ""
154154

155155
# Check if source database exists (skip if using import mode)
156-
SOURCE_DB="./my_test_databases/ml_${SIZE}_db"
156+
SOURCE_DB="./my_test_databases/movielens_${SIZE}_db"
157157
if [ -z "$IMPORT_JSONL" ]; then
158158
if [ ! -d "$SOURCE_DB" ]; then
159159
echo "❌ Source database not found: $SOURCE_DB"
@@ -203,7 +203,7 @@ monitor_memory() {
203203
# Clean up any existing copies from previous runs
204204
echo "Cleaning up any existing database copies..."
205205
for i in {1..6}; do
206-
rm -rf "./my_test_databases/ml_${SIZE}_db_copy${i}"
206+
rm -rf "./my_test_databases/movielens_${SIZE}_db_copy${i}"
207207
done
208208

209209
# Create temporary copies for parallel runs (skip if using import mode - each run will import independently)
@@ -218,7 +218,7 @@ if [ $NUM_METHODS -gt 1 ] && [ -z "$IMPORT_JSONL" ]; then
218218
for method in "${!RUN_METHODS[@]}"; do
219219
COPY_NUM=$((COPY_NUM + 1))
220220
COPY_MAP[$method]=$COPY_NUM
221-
cp -r "$SOURCE_DB" "./my_test_databases/ml_${SIZE}_db_copy${COPY_NUM}" &
221+
cp -r "$SOURCE_DB" "./my_test_databases/movielens_${SIZE}_db_copy${COPY_NUM}" &
222222
eval "CP_PID${COPY_NUM}=$!"
223223
done
224224

@@ -243,7 +243,7 @@ get_source_db() {
243243
if [ ! -z "$IMPORT_JSONL" ]; then
244244
echo "" # No source DB when using import
245245
elif [ $NUM_METHODS -gt 1 ]; then
246-
echo "$(pwd)/my_test_databases/ml_${SIZE}_db_copy${COPY_MAP[$METHOD]}"
246+
echo "$(pwd)/my_test_databases/movielens_${SIZE}_db_copy${COPY_MAP[$METHOD]}"
247247
else
248248
echo "$(pwd)/$SOURCE_DB"
249249
fi
@@ -643,9 +643,9 @@ echo "Cleaning up temporary databases..."
643643
# Remove temporary database copies (used for parallel runs)
644644
if [ $NUM_METHODS -gt 1 ] && [ -z "$IMPORT_JSONL" ]; then
645645
for i in {1..6}; do
646-
if [ -d "./my_test_databases/ml_${SIZE}_db_copy${i}" ]; then
647-
rm -rf "./my_test_databases/ml_${SIZE}_db_copy${i}"
648-
echo " ✓ Removed ml_${SIZE}_db_copy${i}"
646+
if [ -d "./my_test_databases/movielens_${SIZE}_db_copy${i}" ]; then
647+
rm -rf "./my_test_databases/movielens_${SIZE}_db_copy${i}"
648+
echo " ✓ Removed movielens_${SIZE}_db_copy${i}"
649649
fi
650650
done
651651
fi

bindings/python/examples/run_with_memory_monitor.sh

Lines changed: 2 additions & 2 deletions
@@ -6,14 +6,14 @@
66
# ./run_with_memory_monitor.sh <log_prefix> <python_command>
77
#
88
# Example:
9-
# ./run_with_memory_monitor.sh vector_large "ARCADEDB_JVM_MAX_HEAP='8g' ARCADEDB_JVM_ARGS='-Xms8g' python 06_vector_search_recommendations.py --source-db my_test_databases/ml_graph_large_db --db-path my_test_databases/ml_graph_large_db_vectors"
9+
# ./run_with_memory_monitor.sh vector_large "ARCADEDB_JVM_MAX_HEAP='8g' ARCADEDB_JVM_ARGS='-Xms8g' python 06_vector_search_recommendations.py --source-db my_test_databases/movielens_graph_large_db --db-path my_test_databases/movielens_graph_large_db_vectors"
1010
#
1111

1212
if [ $# -lt 2 ]; then
1313
echo "Usage: $0 <log_prefix> <python_command>"
1414
echo ""
1515
echo "Example:"
16-
echo " $0 vector_large \"python 06_vector_search_recommendations.py --source-db my_test_databases/ml_graph_large_db --db-path my_test_databases/ml_graph_large_db_vectors\""
16+
echo " $0 vector_large \"python 06_vector_search_recommendations.py --source-db my_test_databases/movielens_graph_large_db --db-path my_test_databases/movielens_graph_large_db_vectors\""
1717
exit 1
1818
fi
1919

bindings/python/src/arcadedb_embedded/__init__.py

Lines changed: 2 additions & 3 deletions
@@ -30,7 +30,7 @@
3030
from .exporter import export_database, export_to_csv
3131

3232
# Import importer classes
33-
from .importer import Importer, import_csv, import_json, import_neo4j
33+
from .importer import Importer, import_csv, import_xml
3434

3535
# Import result classes
3636
from .results import Result, ResultSet
@@ -88,7 +88,6 @@
8888
"export_to_csv",
8989
# Data import
9090
"Importer",
91-
"import_json",
9291
"import_csv",
93-
"import_neo4j",
92+
"import_xml",
9493
]
