Enhance Python bindings and examples for ArcadeDB

tae898 · tae898 · commit 01d25b5550f7 · 2025-11-07T10:17:23.000+01:00
- Updated EXAMPLES_PLAN.md to reflect the current status and detailed architecture for the Stack Exchange dataset import example.
- Revised README.md to include installation instructions for optional dependencies and clarified example execution steps.
- Modified benchmark scripts to use the movielens-* dataset directories and movielens_* database names instead of the ml-* / ml_* ones.
- Adjusted memory monitoring script examples to reference the correct database paths for MovieLens.
- Updated __init__.py to export import_xml and to stop exporting import_json and import_neo4j.
- Enhanced importer.py to support XML import and documented its limitations and quirks.
- Added new tests in test_importer.py for complex data types, NULL handling, Unicode, and performance with larger datasets.
diff --git a/bindings/python/examples/EXAMPLES_PLAN.md b/bindings/python/examples/EXAMPLES_PLAN.md
@@ -99,23 +99,255 @@
 **Target**: Backend developers, full-stack engineers, data scientists
 **Concepts**: Documents + Graph + Vectors in one system, complex queries, multi-model integration
 **Why**: Comprehensive showcase of ArcadeDB's multi-model capabilities with rich dataset
-**Status**: 🚧 Planned
-**Dataset**: Stack Exchange data dump (Python/JavaScript tags, converted from XML to CSV)
-**Scope**: 300-400 lines (comprehensive multi-model example)
+**Status**: 🚧 In Progress
+**Dataset**: Stack Exchange XML data dump (8 XML files: Posts, Users, Tags, Comments, Votes, Badges, PostLinks, PostHistory)
+**Dataset Sizes**:
+- Small (cs.stackexchange.com): ~1.4M records, ~650MB, 668 tags
+- Medium (stats.stackexchange.com): ~5M records, ~2.5GB, 1,612 tags
+- Large (stackoverflow.com): ~350M records, ~325GB, 65K tags
+**Scope**: Single comprehensive Python file (~800-1000 lines) with class-based architecture
+
+## Architecture: Three Phases in ONE File
+
+### **Phase 1: Document Import (SchemaBuilder + DocumentImporter classes)**
+Import 8 XML files → 8 Document types with schema-first approach
+```python
+class SchemaBuilder:
+    """Creates ArcadeDB schema from analyzed JSON schema"""
+    def load_schema(self, json_path): ...  # Load stackoverflow_schema.json
+    def create_document_types(self): ...   # CREATE DOCUMENT TYPE Post, User, etc.
+    def create_properties(self): ...       # CREATE PROPERTY with explicit types
+    def handle_type_conflicts(self): ...   # Use STRING for INTEGER+DATETIME conflicts
+
+class DocumentImporter:
+    """Imports XML → Documents with streaming parser"""
+    def import_entity(self, xml_path, entity): ...  # Stream XML, batch insert
+    def convert_types(self): ...                    # STRING → INTEGER/DATETIME conversion
+    def handle_nulls(self): ...                     # Missing attributes = NULL
+    def batch_commit(self): ...                     # Commit every 5,000-10,000 records
+    def create_indexes_after_import(self): ...      # Primary keys + foreign keys
+```
+
+**Documents Created**:
+- Post (22 attrs): Id, PostTypeId, Title, Body, Tags, OwnerUserId, CreationDate, Score, etc.
+- User (12 attrs): Id, DisplayName, Reputation, AboutMe, Location, CreationDate, etc.
+- Tag (5 attrs): Id, TagName, Count, ExcerptPostId, WikiPostId
+- Comment (7 attrs): Id, PostId, UserId, Text, Score, CreationDate, UserDisplayName
+- Vote (6 attrs): Id, PostId, UserId, VoteTypeId, BountyAmount, CreationDate
+- Badge (6 attrs): Id, UserId, Name, Date, Class, TagBased
+- PostLink (5 attrs): Id, PostId, RelatedPostId, LinkTypeId, CreationDate
+- PostHistory (10 attrs): Id, PostId, UserId, PostHistoryTypeId, Text, Comment, CreationDate, etc.
+
+**Indexes Created** (after import):
+```sql
+-- Primary Keys (UNIQUE)
+CREATE INDEX ON Post (Id) UNIQUE
+CREATE INDEX ON User (Id) UNIQUE
+CREATE INDEX ON Tag (TagName) UNIQUE
+
+-- Foreign Keys (NOTUNIQUE) - for graph traversal
+CREATE INDEX ON Post (OwnerUserId, ParentId, AcceptedAnswerId, PostTypeId) NOTUNIQUE
+CREATE INDEX ON Comment (PostId, UserId) NOTUNIQUE
+CREATE INDEX ON Vote (PostId, UserId, VoteTypeId) NOTUNIQUE
+CREATE INDEX ON Badge (UserId, Name) NOTUNIQUE
+CREATE INDEX ON PostLink (PostId, RelatedPostId, LinkTypeId) NOTUNIQUE
+CREATE INDEX ON PostHistory (PostId, UserId) NOTUNIQUE
+
+-- Temporal/Scoring (NOTUNIQUE)
+CREATE INDEX ON Post (CreationDate, Score) NOTUNIQUE
+CREATE INDEX ON User (Reputation) NOTUNIQUE
+```
+
+**Type Conflict Handling**:
+```python
+# Fields with DATETIME conflicts in large dataset → use STRING
+CONFLICT_FIELDS = {
+    'Posts': ['AcceptedAnswerId'],      # 38 DATETIME in 1M samples
+    'Users': ['AccountId'],
+    'Comments': ['Id'],                 # 7,255 DATETIME!
+    'Votes': ['Id'],                    # 13,188 DATETIME!
+    'Badges': ['Id', 'UserId'],
+    'PostLinks': ['Id', 'PostId', 'RelatedPostId'],
+    'Tags': ['ExcerptPostId', 'WikiPostId'],
+}
+# Strategy: Store as STRING, convert to INTEGER at query time (safe)
+```
+
+### **Phase 2: Graph Creation (GraphBuilder class)**
+Convert Documents → Vertices + Edges
 ```python
-# Documents: Questions/Answers with full text, searchable content
-# Graph: User → ASKED/ANSWERED → Question, Question → TAGGED_WITH → Tag
-# Vectors: Question embeddings for duplicate detection
-# Multi-model queries:
-#   - "Find Python experts" (graph traversal + aggregation)
-#   - "Similar unanswered questions" (vector + document filtering)
-#   - "Trending topics by tag relationships" (graph analytics)
-# XML → CSV conversion (provide script or use existing tools)
-```
-**Dataset Options**:
-- Stack Exchange data dump (8-10 CSV files after conversion)
-- Alternative: E-commerce dataset (products, reviews, customers)
-- Decision: To be determined based on conversion ease and dataset availability
+class GraphBuilder:
+    """Creates graph from documents"""
+    def convert_to_vertices(self): ...          # Post/User/Tag documents → vertices
+    def create_user_vertices(self): ...         # User documents → User vertices
+    def create_post_vertices(self): ...         # Post documents → Post vertices (Q/A)
+    def create_tag_vertices(self): ...          # Tag documents → Tag vertices
+    def create_edges(self): ...                 # Create all relationship edges
+
+    # Edge creation methods (use indexes for fast lookups)
+    def create_asked_answered_edges(self): ...  # Post.OwnerUserId → User
+    def create_answer_to_edges(self): ...       # Post.ParentId → Post (answers)
+    def create_has_tag_edges(self): ...         # Post.Tags → Tag (parse pipe-delimited)
+    def create_commented_edges(self): ...       # Comment.UserId → User, PostId → Post
+    def create_voted_edges(self): ...           # Vote.UserId → User (if not NULL)
+    def create_earned_badge_edges(self): ...    # Badge.UserId → User
+    def create_linked_edges(self): ...          # PostLink relationships
+```
+
+**Vertices Created**:
+- User vertex (from User document)
+- Post vertex (from Post document; PostTypeId distinguishes Question from Answer)
+- Tag vertex (from Tag document)
+
+**Edges Created**:
+```text
+# User → Post relationships
+User -[ASKED]-> Post        # Post.PostTypeId = 1 (question)
+User -[ANSWERED]-> Post     # Post.PostTypeId = 2 (answer)
+User -[COMMENTED]-> Post    # From Comment.UserId → PostId
+User -[VOTED]-> Post        # From Vote.UserId → PostId (if not NULL)
+
+# Post → Post relationships
+Post -[ANSWER_TO]-> Post    # Post.ParentId (answer to question)
+Post -[LINKED_TO]-> Post    # From PostLink.PostId → RelatedPostId
+
+# Post → Tag relationships
+Post -[HAS_TAG]-> Tag       # Parse Post.Tags (pipe-delimited: "|python|sql|")
+
+# User → Badge relationships (needs a Badge vertex type in addition to User, Post, Tag)
+User -[EARNED_BADGE]-> Badge  # From Badge.UserId
+```
+
+**Graph Queries** (after Phase 2):
+```sql
+-- Find user's questions and answers
+MATCH {User, as: u} -[ASKED|ANSWERED]-> {Post, as: p}
+WHERE u.Id = 5
+RETURN u.DisplayName, p.Title, p.Score
+
+-- Find answers to a question
+MATCH {Post, as: q} <-[ANSWER_TO]- {Post, as: a}
+WHERE q.Id = 1000
+RETURN a.Score, a.Body
+ORDER BY a.Score DESC
+
+-- Find posts by tag
+MATCH {Post, as: p} -[HAS_TAG]-> {Tag, as: t}
+WHERE t.TagName = 'python'
+RETURN p.Title, p.Score
+
+-- Find top contributors (users with most answers)
+MATCH {User, as: u} -[ANSWERED]-> {Post, as: p}
+RETURN u.DisplayName, count(p) as answer_count, u.Reputation
+ORDER BY answer_count DESC
+LIMIT 10
+```
+
+### **Phase 3: Vector Search (VectorBuilder class)**
+Add embeddings for semantic search
+```python
+class VectorBuilder:
+    """Generates embeddings and creates vector indexes"""
+    def generate_post_embeddings(self): ...  # Post.Title + Body → embedding
+    def generate_tag_embeddings(self): ...   # Tag.TagName + Excerpt → embedding
+    def generate_user_embeddings(self): ...  # User.DisplayName + AboutMe → embedding
+    def create_vector_indexes(self): ...     # HNSW indexes for each
+    def semantic_search(self): ...           # Find similar posts/tags/users
+```
+
+**Embeddings**:
+```text
+# Post embeddings (duplicate question detection)
+Post.embedding = embed(Post.Title + "\n" + Post.Body)  # 384 dims
+CREATE VECTOR INDEX ON Post (embedding) HNSW
+
+# Tag embeddings (related tags)
+Tag.embedding = embed(Tag.TagName + "\n" + Tag.ExcerptPostId.Body)
+CREATE VECTOR INDEX ON Tag (embedding) HNSW
+
+# User embeddings (similar expertise)
+User.embedding = embed(User.DisplayName + "\n" + User.AboutMe)
+CREATE VECTOR INDEX ON User (embedding) HNSW
+```
+
+**Vector Queries** (after Phase 3):
+```python
+# Find duplicate/similar questions
+query_post = db.query("SELECT FROM Post WHERE Id = 1000")
+similar_posts = index.find_nearest(query_post.embedding, k=10)
+
+# Find related tags
+query_tag = db.query("SELECT FROM Tag WHERE TagName = 'python'")
+related_tags = index.find_nearest(query_tag.embedding, k=10)
+
+# Find users with similar expertise
+query_user = db.query("SELECT FROM User WHERE Id = 5")
+similar_users = index.find_nearest(query_user.embedding, k=10)
+```
+
+**Multi-Model Queries** (combine all three):
+```sql
+-- Find Python experts (graph + aggregation); tags are stored on questions, so hop via ANSWER_TO
+MATCH {User, as: u} -[ANSWERED]-> {Post, as: a} -[ANSWER_TO]-> {Post, as: q} -[HAS_TAG]-> {Tag, as: t}
+WHERE t.TagName = 'python' AND a.Score >= 5
+RETURN u.DisplayName, count(a) as answers, sum(a.Score) as total_score
+ORDER BY total_score DESC LIMIT 10
+
+-- Similar unanswered questions (vector + document filtering)
+-- 1. Find similar questions via vector search
+-- 2. Filter by AcceptedAnswerId IS NULL
+-- 3. Rank by Score
+
+-- Trending topics (graph analytics on tag co-occurrence)
+MATCH {Post, as: p} -[HAS_TAG]-> {Tag, as: t1},
+      {Post, as: p} -[HAS_TAG]-> {Tag, as: t2}
+WHERE t1.TagName < t2.TagName  -- count each unordered tag pair once
+RETURN t1.TagName, t2.TagName, count(p) as co_occurrence
+ORDER BY co_occurrence DESC LIMIT 20
+```
+
+## Class Architecture
+
+```python
+class StackOverflowDatabase:
+    """Main orchestrator - manages all three phases"""
+    def __init__(self, db_path, dataset_size): ...
+    def run_full_pipeline(self): ...      # Execute all phases
+    def run_phase_1_documents(self): ...  # Import documents only
+    def run_phase_2_graph(self): ...      # Add graph layer
+    def run_phase_3_vectors(self): ...    # Add vector search
+    def validate_each_phase(self): ...    # Check counts, run sample queries
+    def export_database(self): ...        # Export to JSONL for reproducibility
+
+class SchemaBuilder:
+    """Phase 1: Schema creation"""
+
+class DocumentImporter:
+    """Phase 1: XML import"""
+
+class GraphBuilder:
+    """Phase 2: Graph layer"""
+
+class VectorBuilder:
+    """Phase 3: Vector search"""
+```
+
+## Dataset Options & Performance
+
+**Small Dataset (cs.stackexchange.com)** - RECOMMENDED FOR DEVELOPMENT:
+- 105K posts, 138K users, 668 tags, 195K comments
+- ~650MB XML, imports in ~2-5 minutes
+- Perfect for development and testing
+
+**Medium Dataset (stats.stackexchange.com)**:
+- 425K posts, 345K users, 1.6K tags, 819K comments
+- ~2.5GB XML, imports in ~10-20 minutes
+
+**Large Dataset (stackoverflow.com)** - PRODUCTION SCALE:
+- 59M posts, 20M users, 65K tags, ~100M comments
+- ~325GB XML, imports in hours (97GB Posts.xml alone!)
+- Requires 8GB+ JVM heap, production server setup
+
 **Note**: Largest example, demonstrates all three models working together
 
 ### 8. **Server Mode: HTTP API + Studio** (Priority: HIGH)
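
The Phase 2 plan in EXAMPLES_PLAN.md derives `HAS_TAG` edges by parsing the pipe-delimited `Post.Tags` value. A minimal sketch of that step, assuming posts are available as plain dicts; `parse_tags` and `has_tag_pairs` are hypothetical helper names, and the actual edge creation against ArcadeDB is deliberately left out:

```python
def parse_tags(raw):
    """Split a pipe-delimited Tags value such as "|python|sql|" into tag names.

    Returns an empty list for None or an empty string (posts without tags,
    typically answers).
    """
    if not raw:
        return []
    return [name for name in raw.split("|") if name]


def has_tag_pairs(posts):
    """Yield unique (post_id, tag_name) pairs, one per future HAS_TAG edge."""
    seen = set()
    for post in posts:
        for name in parse_tags(post.get("Tags")):
            pair = (post["Id"], name)
            if pair not in seen:
                seen.add(pair)
                yield pair


# Illustrative sample data, not taken from a real dump.
posts = [
    {"Id": 1, "PostTypeId": 1, "Tags": "|python|sql|"},
    {"Id": 2, "PostTypeId": 2},  # answer: no Tags attribute
    {"Id": 3, "PostTypeId": 1, "Tags": "|python|python|"},
]
print(list(has_tag_pairs(posts)))
# → [(1, 'python'), (1, 'sql'), (3, 'python')]
```

Each pair would then be resolved to a Post and a Tag through the `Post (Id)` and `Tag (TagName)` indexes listed in the plan.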
diff --git a/bindings/python/examples/run_benchmark_04_csv_import_documents.sh b/bindings/python/examples/run_benchmark_04_csv_import_documents.sh
@@ -74,9 +74,9 @@ echo ""
 # Check if dataset exists
 DATA_BASE="./data"
 if [ "$SIZE" = "small" ]; then
-    DATA_DIR="$DATA_BASE/ml-small"
+    DATA_DIR="$DATA_BASE/movielens-small"
 else
-    DATA_DIR="$DATA_BASE/ml-large"
+    DATA_DIR="$DATA_BASE/movielens-large"
 fi
 
 if [ ! -d "$DATA_DIR" ]; then
diff --git a/bindings/python/examples/run_benchmark_05_csv_import_graph.sh b/bindings/python/examples/run_benchmark_05_csv_import_graph.sh
@@ -30,8 +30,8 @@
 #   ./run_benchmark_05_csv_import_graph.sh small 5000 4 all_6 --export
 #   ./run_benchmark_05_csv_import_graph.sh large 10000 8 java
 #   ./run_benchmark_05_csv_import_graph.sh small 5000 4 all_java
-#   ./run_benchmark_05_csv_import_graph.sh small 5000 4 java --import-jsonl ./exports/ml_small_db.jsonl.tgz
-#   ./run_benchmark_05_csv_import_graph.sh small 5000 4 java --import-jsonl ./exports/ml_small_db.jsonl.tgz --export
+#   ./run_benchmark_05_csv_import_graph.sh small 5000 4 java --import-jsonl ./exports/movielens_small_db.jsonl.tgz
+#   ./run_benchmark_05_csv_import_graph.sh small 5000 4 java --import-jsonl ./exports/movielens_small_db.jsonl.tgz --export
 #
 
 # Start timing
@@ -153,7 +153,7 @@ fi
 echo ""
 
 # Check if source database exists (skip if using import mode)
-SOURCE_DB="./my_test_databases/ml_${SIZE}_db"
+SOURCE_DB="./my_test_databases/movielens_${SIZE}_db"
 if [ -z "$IMPORT_JSONL" ]; then
     if [ ! -d "$SOURCE_DB" ]; then
         echo "❌ Source database not found: $SOURCE_DB"
@@ -203,7 +203,7 @@ monitor_memory() {
 # Clean up any existing copies from previous runs
 echo "Cleaning up any existing database copies..."
 for i in {1..6}; do
-    rm -rf "./my_test_databases/ml_${SIZE}_db_copy${i}"
+    rm -rf "./my_test_databases/movielens_${SIZE}_db_copy${i}"
 done
 
 # Create temporary copies for parallel runs (skip if using import mode - each run will import independently)
@@ -218,7 +218,7 @@ if [ $NUM_METHODS -gt 1 ] && [ -z "$IMPORT_JSONL" ]; then
     for method in "${!RUN_METHODS[@]}"; do
         COPY_NUM=$((COPY_NUM + 1))
         COPY_MAP[$method]=$COPY_NUM
-        cp -r "$SOURCE_DB" "./my_test_databases/ml_${SIZE}_db_copy${COPY_NUM}" &
+        cp -r "$SOURCE_DB" "./my_test_databases/movielens_${SIZE}_db_copy${COPY_NUM}" &
         eval "CP_PID${COPY_NUM}=$!"
     done
 
@@ -243,7 +243,7 @@ get_source_db() {
     if [ ! -z "$IMPORT_JSONL" ]; then
         echo "" # No source DB when using import
     elif [ $NUM_METHODS -gt 1 ]; then
-        echo "$(pwd)/my_test_databases/ml_${SIZE}_db_copy${COPY_MAP[$METHOD]}"
+        echo "$(pwd)/my_test_databases/movielens_${SIZE}_db_copy${COPY_MAP[$METHOD]}"
     else
         echo "$(pwd)/$SOURCE_DB"
     fi
@@ -643,9 +643,9 @@ echo "Cleaning up temporary databases..."
 # Remove temporary database copies (used for parallel runs)
 if [ $NUM_METHODS -gt 1 ] && [ -z "$IMPORT_JSONL" ]; then
     for i in {1..6}; do
-        if [ -d "./my_test_databases/ml_${SIZE}_db_copy${i}" ]; then
-            rm -rf "./my_test_databases/ml_${SIZE}_db_copy${i}"
-            echo "  ✓ Removed ml_${SIZE}_db_copy${i}"
+        if [ -d "./my_test_databases/movielens_${SIZE}_db_copy${i}" ]; then
+            rm -rf "./my_test_databases/movielens_${SIZE}_db_copy${i}"
+            echo "  ✓ Removed movielens_${SIZE}_db_copy${i}"
         fi
     done
 fi
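
The script above gives each parallel run its own copy of the source database (`movielens_${SIZE}_db_copy${N}`) and removes the copies afterwards. The same pattern could be sketched in Python as follows. `make_copies` and `remove_copies` are hypothetical names, the copies are made sequentially rather than in the background, and `"other"` is a placeholder method name:

```python
import shutil
import tempfile
from pathlib import Path


def make_copies(base_dir, size, methods):
    """Copy movielens_<size>_db once per method; return {method: copy_path}."""
    base = Path(base_dir)
    source = base / f"movielens_{size}_db"
    if not source.is_dir():
        raise FileNotFoundError(f"Source database not found: {source}")
    copies = {}
    for number, method in enumerate(methods, start=1):
        target = base / f"movielens_{size}_db_copy{number}"
        shutil.rmtree(target, ignore_errors=True)  # drop leftovers from earlier runs
        shutil.copytree(source, target)
        copies[method] = target
    return copies


def remove_copies(copies):
    """Delete the temporary copies created by make_copies."""
    for path in copies.values():
        shutil.rmtree(path, ignore_errors=True)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "movielens_small_db").mkdir()
        copies = make_copies(tmp, "small", ["java", "other"])
        print(sorted(path.name for path in copies.values()))
        remove_copies(copies)
```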
diff --git a/bindings/python/examples/run_with_memory_monitor.sh b/bindings/python/examples/run_with_memory_monitor.sh
@@ -6,14 +6,14 @@
 #   ./run_with_memory_monitor.sh <log_prefix> <python_command>
 #
 # Example:
-#   ./run_with_memory_monitor.sh vector_large "ARCADEDB_JVM_MAX_HEAP='8g' ARCADEDB_JVM_ARGS='-Xms8g' python 06_vector_search_recommendations.py --source-db my_test_databases/ml_graph_large_db --db-path my_test_databases/ml_graph_large_db_vectors"
+#   ./run_with_memory_monitor.sh vector_large "ARCADEDB_JVM_MAX_HEAP='8g' ARCADEDB_JVM_ARGS='-Xms8g' python 06_vector_search_recommendations.py --source-db my_test_databases/movielens_graph_large_db --db-path my_test_databases/movielens_graph_large_db_vectors"
 #
 
 if [ $# -lt 2 ]; then
     echo "Usage: $0 <log_prefix> <python_command>"
     echo ""
     echo "Example:"
-    echo "  $0 vector_large \"python 06_vector_search_recommendations.py --source-db my_test_databases/ml_graph_large_db --db-path my_test_databases/ml_graph_large_db_vectors\""
+    echo "  $0 vector_large \"python 06_vector_search_recommendations.py --source-db my_test_databases/movielens_graph_large_db --db-path my_test_databases/movielens_graph_large_db_vectors\""
     exit 1
 fi
 
diff --git a/bindings/python/src/arcadedb_embedded/__init__.py b/bindings/python/src/arcadedb_embedded/__init__.py
@@ -30,7 +30,7 @@
 from .exporter import export_database, export_to_csv
 
 # Import importer classes
-from .importer import Importer, import_csv, import_json, import_neo4j
+from .importer import Importer, import_csv, import_xml
 
 # Import result classes
 from .results import Result, ResultSet
@@ -88,7 +88,6 @@
     "export_to_csv",
     # Data import
     "Importer",
-    "import_json",
     "import_csv",
-    "import_neo4j",
+    "import_xml",
 ]
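
Since `import_json` and `import_neo4j` are no longer re-exported here, code that imports them from the package root would presumably fail after this commit. A small sketch for checking which importer names a module exposes; `check_exports` is a hypothetical helper, and the demonstration uses a stand-in object so that it runs without ArcadeDB installed:

```python
import types

# Names taken from the export list in this diff.
EXPECTED = ("Importer", "import_csv", "import_xml")
REMOVED = ("import_json", "import_neo4j")


def check_exports(module):
    """Return (missing, unexpected) importer names for the given module."""
    missing = [name for name in EXPECTED if not hasattr(module, name)]
    unexpected = [name for name in REMOVED if hasattr(module, name)]
    return missing, unexpected


# Stand-in for the package namespace after this commit (assumption).
stand_in = types.SimpleNamespace(Importer=object, import_csv=len, import_xml=len)
print(check_exports(stand_in))  # → ([], [])
```

Against the real package, the call would be `check_exports(arcadedb_embedded)` after `import arcadedb_embedded`.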
diff --git a/bindings/python/src/arcadedb_embedded/importer.py b/bindings/python/src/arcadedb_embedded/importer.py
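
The importer.py changes are not shown in this view. As a rough, standard-library-only illustration of the approach the plan describes for XML import (stream rows, treat missing attributes as NULL, keep conflicting fields as strings, commit in batches), and not the importer's actual code, consider the sketch below. `iter_rows`, `to_int_or_none`, and `batched` are hypothetical names, and the sample rows are made up:

```python
import io
import xml.etree.ElementTree as ET
from itertools import islice


def to_int_or_none(value):
    """Query-time conversion: int for numeric strings, None otherwise."""
    if value is None:
        return None
    try:
        return int(value)
    except ValueError:
        return None


def iter_rows(source, int_fields=()):
    """Stream <row .../> elements from a path or binary file-like object.

    Fields in int_fields are converted strictly with int(); fields with
    known type conflicts should be left out so they stay strings.
    Attributes absent from the XML are absent from the dict (NULL).
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "row":
            continue
        row = dict(elem.attrib)
        for name in int_fields:
            if name in row:
                row[name] = int(row[name])
        yield row
        elem.clear()  # free the element's data; the empty element stays attached to the root


def batched(iterable, size):
    """Yield lists of at most `size` items."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<posts>
  <row Id="1" PostTypeId="1" Score="5" AcceptedAnswerId="2" />
  <row Id="2" PostTypeId="2" Score="3" />
  <row Id="3" PostTypeId="1" Score="0" AcceptedAnswerId="2010-07-28T19:04:21.300" />
</posts>"""

if __name__ == "__main__":
    rows = iter_rows(io.BytesIO(SAMPLE), int_fields=("Id", "Score"))
    for batch in batched(rows, 2):
        # A real importer would insert the batch and commit here.
        print([(r["Id"], to_int_or_none(r.get("AcceptedAnswerId"))) for r in batch])
```

Leaving `AcceptedAnswerId` out of `int_fields` mirrors the plan's strategy of storing conflict fields as STRING and converting only when queried.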
diff --git a/bindings/python/tests/test_importer.py b/bindings/python/tests/test_importer.py