99 | 99 | **Target**: Backend developers, full-stack engineers, data scientists |
100 | 100 | **Concepts**: Documents + Graph + Vectors in one system, complex queries, multi-model integration |
101 | 101 | **Why**: Comprehensive showcase of ArcadeDB's multi-model capabilities with rich dataset |
102 | | -**Status**: 🚧 Planned |
103 | | -**Dataset**: Stack Exchange data dump (Python/JavaScript tags, converted from XML to CSV) |
104 | | -**Scope**: 300-400 lines (comprehensive multi-model example) |
| 102 | +**Status**: 🚧 In Progress |
| 103 | +**Dataset**: Stack Exchange XML data dump (8 XML files: Posts, Users, Tags, Comments, Votes, Badges, PostLinks, PostHistory) |
| 104 | +**Dataset Sizes**: |
| 105 | +- Small (cs.stackexchange.com): ~1.4M records, ~650MB, 668 tags |
| 106 | +- Medium (stats.stackexchange.com): ~5M records, ~2.5GB, 1,612 tags |
| 107 | +- Large (stackoverflow.com): ~350M records, ~325GB, 65K tags |
| 108 | +**Scope**: Single comprehensive Python file (~800-1000 lines) with class-based architecture |
| 109 | + |
| 110 | +## Architecture: Three Phases in ONE File |
| 111 | + |
| 112 | +### **Phase 1: Document Import (SchemaBuilder + DocumentImporter classes)** |
| 113 | +Import 8 XML files → 8 Document types with schema-first approach |
| 114 | +```python |
| 115 | +class SchemaBuilder: |
| 116 | + """Creates ArcadeDB schema from analyzed JSON schema""" |
| 117 | + def load_schema(json_path) # Load stackoverflow_schema.json |
| 118 | + def create_document_types() # CREATE DOCUMENT TYPE Post, User, etc. |
| 119 | + def create_properties() # CREATE PROPERTY with explicit types |
| 120 | + def handle_type_conflicts() # Use STRING for INTEGER+DATETIME conflicts |
| 121 | + |
| 122 | +class DocumentImporter: |
| 123 | + """Imports XML → Documents with streaming parser""" |
| 124 | + def import_entity(xml_path, entity) # Stream XML, batch insert |
| 125 | + def convert_types() # STRING → INTEGER/DATETIME conversion |
| 126 | + def handle_nulls() # Missing attributes = NULL |
| 127 | + def batch_commit() # Commit every 5,000-10,000 records |
| 128 | + def create_indexes_after_import() # Primary keys + foreign keys |
| 129 | +``` |
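The streaming and batching idea behind `DocumentImporter.import_entity` could look roughly like the sketch below, which uses only the standard library. It assumes the Stack Exchange dump layout of one `<row .../>` element per record with fields stored as XML attributes. `iter_rows` and `batched` are hypothetical helper names, not part of the planned classes, and the database insert and commit are left to the caller.

```python
import io
import xml.etree.ElementTree as ET
from itertools import islice


def iter_rows(source):
    """Yield each <row> element's attributes as a dict without loading the whole file."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)           # first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == "row":
            yield dict(elem.attrib)   # attributes missing in the XML are simply absent (NULL)
            root.clear()              # drop processed children so memory stays flat


def batched(rows, size):
    """Group an iterator of rows into lists of at most `size` items."""
    rows = iter(rows)
    while True:
        batch = list(islice(rows, size))
        if not batch:
            return
        yield batch


# Tiny in-memory stand-in for Posts.xml (assumed layout)
sample = (
    b'<posts>'
    b'<row Id="1" PostTypeId="1" Title="Q"/>'
    b'<row Id="2" PostTypeId="2" ParentId="1"/>'
    b'<row Id="3" PostTypeId="1"/>'
    b'</posts>'
)
batches = list(batched(iter_rows(io.BytesIO(sample)), size=2))
```

In a real import each batch would map to one transaction, with `size` set in the 5,000-10,000 range the plan mentions.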
| 130 | + |
| 131 | +**Documents Created**: |
| 132 | +- Post (22 attrs): Id, PostTypeId, Title, Body, Tags, OwnerUserId, CreationDate, Score, etc. |
| 133 | +- User (12 attrs): Id, DisplayName, Reputation, AboutMe, Location, CreationDate, etc. |
| 134 | +- Tag (5 attrs): Id, TagName, Count, ExcerptPostId, WikiPostId |
| 135 | +- Comment (7 attrs): Id, PostId, UserId, Text, Score, CreationDate, UserDisplayName |
| 136 | +- Vote (6 attrs): Id, PostId, UserId, VoteTypeId, BountyAmount, CreationDate |
| 137 | +- Badge (6 attrs): Id, UserId, Name, Date, Class, TagBased |
| 138 | +- PostLink (5 attrs): Id, PostId, RelatedPostId, LinkTypeId, CreationDate |
| 139 | +- PostHistory (10 attrs): Id, PostId, UserId, PostHistoryTypeId, Text, Comment, CreationDate, etc. |
| 140 | + |
| 141 | +**Indexes Created** (after import): |
| 142 | +```sql |
| 143 | +-- Primary Keys (UNIQUE) |
| 144 | +CREATE INDEX ON Post (Id) UNIQUE |
| 145 | +CREATE INDEX ON User (Id) UNIQUE |
| 146 | +CREATE INDEX ON Tag (TagName) UNIQUE |
| 147 | + |
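| 148 | +-- Foreign Keys (NOTUNIQUE) - for graph traversal; each statement below creates one composite index, so lookups on a non-leading property may need their own single-property index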
| 149 | +CREATE INDEX ON Post (OwnerUserId, ParentId, AcceptedAnswerId, PostTypeId) NOTUNIQUE |
| 150 | +CREATE INDEX ON Comment (PostId, UserId) NOTUNIQUE |
| 151 | +CREATE INDEX ON Vote (PostId, UserId, VoteTypeId) NOTUNIQUE |
| 152 | +CREATE INDEX ON Badge (UserId, Name) NOTUNIQUE |
| 153 | +CREATE INDEX ON PostLink (PostId, RelatedPostId, LinkTypeId) NOTUNIQUE |
| 154 | +CREATE INDEX ON PostHistory (PostId, UserId) NOTUNIQUE |
| 155 | + |
| 156 | +-- Temporal/Scoring (NOTUNIQUE) |
| 157 | +CREATE INDEX ON Post (CreationDate, Score) NOTUNIQUE |
| 158 | +CREATE INDEX ON User (Reputation) NOTUNIQUE |
| 159 | +``` |
| 160 | + |
| 161 | +**Type Conflict Handling**: |
| 162 | +```python |
| 163 | +# Fields with DATETIME conflicts in large dataset → use STRING |
| 164 | +CONFLICT_FIELDS = { |
| 165 | + 'Posts': ['AcceptedAnswerId'], # 38 DATETIME in 1M samples |
| 166 | + 'Users': ['AccountId'], |
| 167 | + 'Comments': ['Id'], # 7,255 DATETIME! |
| 168 | + 'Votes': ['Id'], # 13,188 DATETIME! |
| 169 | + 'Badges': ['Id', 'UserId'], |
| 170 | + 'PostLinks': ['Id', 'PostId', 'RelatedPostId'], |
| 171 | + 'Tags': ['ExcerptPostId', 'WikiPostId'], |
| 172 | +} |
| 173 | +# Strategy: Store as STRING, convert to INTEGER at query time (safe) |
| 174 | +``` |
| 175 | + |
| 176 | +### **Phase 2: Graph Creation (GraphBuilder class)** |
| 177 | +Convert Documents → Vertices + Edges |
105 | 178 | ```python |
106 | | -# Documents: Questions/Answers with full text, searchable content |
107 | | -# Graph: User → ASKED/ANSWERED → Question, Question → TAGGED_WITH → Tag |
108 | | -# Vectors: Question embeddings for duplicate detection |
109 | | -# Multi-model queries: |
110 | | -# - "Find Python experts" (graph traversal + aggregation) |
111 | | -# - "Similar unanswered questions" (vector + document filtering) |
112 | | -# - "Trending topics by tag relationships" (graph analytics) |
113 | | -# XML → CSV conversion (provide script or use existing tools) |
114 | | -``` |
115 | | -**Dataset Options**: |
116 | | -- Stack Exchange data dump (8-10 CSV files after conversion) |
117 | | -- Alternative: E-commerce dataset (products, reviews, customers) |
118 | | -- Decision: To be determined based on conversion ease and dataset availability |
| 179 | +class GraphBuilder: |
| 180 | + """Creates graph from documents""" |
| 181 | + def convert_to_vertices() # Post/User/Tag documents → vertices |
| 182 | + def create_user_vertices() # User documents → User vertices |
| 183 | + def create_post_vertices() # Post documents → Post vertices (Q/A) |
| 184 | + def create_tag_vertices() # Tag documents → Tag vertices |
| 185 | + def create_edges() # Create all relationship edges |
| 186 | + |
| 187 | + # Edge creation methods (use indexes for fast lookups) |
| 188 | + def create_asked_answered_edges() # Post.OwnerUserId → User |
| 189 | + def create_answer_to_edges() # Post.ParentId → Post (answers) |
| 190 | + def create_has_tag_edges() # Post.Tags → Tag (parse pipe-delimited) |
| 191 | + def create_commented_edges() # Comment.UserId → User, PostId → Post |
| 192 | + def create_voted_edges() # Vote.UserId → User (if not NULL) |
| 193 | + def create_earned_badge_edges() # Badge.UserId → User |
| 194 | + def create_linked_edges() # PostLink relationships |
| 195 | +``` |
| 196 | + |
| 197 | +**Vertices Created**: |
| 198 | +- User vertex (from User document) |
| 199 | +- Post vertex (from Post document, discriminate Question vs Answer via PostTypeId) |
| 200 | +- Tag vertex (from Tag document) |
| 201 | + |
| 202 | +**Edges Created**: |
| 203 | +```python |
| 204 | +# User → Post relationships |
| 205 | +User -[ASKED]-> Post # Post.PostTypeId = 1 (question) |
| 206 | +User -[ANSWERED]-> Post # Post.PostTypeId = 2 (answer) |
| 207 | +User -[COMMENTED]-> Post # From Comment.UserId → PostId |
| 208 | +User -[VOTED]-> Post # From Vote.UserId → PostId (if not NULL) |
| 209 | + |
| 210 | +# Post → Post relationships |
| 211 | +Post -[ANSWER_TO]-> Post # Post.ParentId (answer to question) |
| 212 | +Post -[LINKED_TO]-> Post # From PostLink.PostId → RelatedPostId |
| 213 | + |
| 214 | +# Post → Tag relationships |
| 215 | +Post -[HAS_TAG]-> Tag # Parse Post.Tags (pipe-delimited: "|python|sql|") |
| 216 | + |
| 217 | +# User → Badge relationships |
| 218 | +User -[EARNED_BADGE]-> Badge # From Badge.UserId |
| 219 | +``` |
| 220 | + |
| 221 | +**Graph Queries** (after Phase 2): |
| 222 | +```sql |
| 223 | +-- Find user's questions and answers |
| 224 | +MATCH {User, as: u} -[ASKED|ANSWERED]-> {Post, as: p} |
| 225 | +WHERE u.Id = 5 |
| 226 | +RETURN u.DisplayName, p.Title, p.Score |
| 227 | + |
| 228 | +-- Find answers to a question |
| 229 | +MATCH {Post, as: q} <-[ANSWER_TO]- {Post, as: a} |
| 230 | +WHERE q.Id = 1000 |
| 231 | +RETURN a.Score, a.Body |
| 232 | +ORDER BY a.Score DESC |
| 233 | + |
| 234 | +-- Find posts by tag |
| 235 | +MATCH {Post, as: p} -[HAS_TAG]-> {Tag, as: t} |
| 236 | +WHERE t.TagName = 'python' |
| 237 | +RETURN p.Title, p.Score |
| 238 | + |
| 239 | +-- Find top contributors (users with most answers) |
| 240 | +MATCH {User, as: u} -[ANSWERED]-> {Post, as: p} |
| 241 | +RETURN u.DisplayName, count(p) as answer_count, u.Reputation |
| 242 | +ORDER BY answer_count DESC |
| 243 | +LIMIT 10 |
| 244 | +``` |
| 245 | + |
| 246 | +### **Phase 3: Vector Search (VectorBuilder class)** |
| 247 | +Add embeddings for semantic search |
| 248 | +```python |
| 249 | +class VectorBuilder: |
| 250 | + """Generates embeddings and creates vector indexes""" |
| 251 | + def generate_post_embeddings() # Post.Title + Body → embedding |
| 252 | + def generate_tag_embeddings() # Tag.TagName + Excerpt → embedding |
| 253 | + def generate_user_embeddings() # User.DisplayName + AboutMe → embedding |
| 254 | + def create_vector_indexes() # HNSW indexes for each |
| 255 | + def semantic_search() # Find similar posts/tags/users |
| 256 | +``` |
| 257 | + |
| 258 | +**Embeddings**: |
| 259 | +```python |
| 260 | +# Post embeddings (duplicate question detection) |
| 261 | +Post.embedding = embed(Post.Title + "\n" + Post.Body) # 384 dims |
| 262 | +CREATE VECTOR INDEX ON Post (embedding) HNSW |
| 263 | + |
| 264 | +# Tag embeddings (related tags) |
| 265 | +Tag.embedding = embed(Tag.TagName + "\n" + Tag.ExcerptPostId.Body) |
| 266 | +CREATE VECTOR INDEX ON Tag (embedding) HNSW |
| 267 | + |
| 268 | +# User embeddings (similar expertise) |
| 269 | +User.embedding = embed(User.DisplayName + "\n" + User.AboutMe) |
| 270 | +CREATE VECTOR INDEX ON User (embedding) HNSW |
| 271 | +``` |
| 272 | + |
| 273 | +**Vector Queries** (after Phase 3): |
| 274 | +```python |
| 275 | +# Find duplicate/similar questions |
| 276 | +query_post = db.query("SELECT FROM Post WHERE Id = 1000") |
| 277 | +similar_posts = index.find_nearest(query_post.embedding, k=10) |
| 278 | + |
| 279 | +# Find related tags |
| 280 | +query_tag = db.query("SELECT FROM Tag WHERE TagName = 'python'") |
| 281 | +related_tags = index.find_nearest(query_tag.embedding, k=10) |
| 282 | + |
| 283 | +# Find users with similar expertise |
| 284 | +query_user = db.query("SELECT FROM User WHERE Id = 5") |
| 285 | +similar_users = index.find_nearest(query_user.embedding, k=10) |
| 286 | +``` |
| 287 | + |
| 288 | +**Multi-Model Queries** (combine all three): |
| 289 | +```sql |
| 290 | +-- Find Python experts (graph + aggregation) |
| 291 | +MATCH {User, as: u} -[ANSWERED]-> {Post, as: p} -[HAS_TAG]-> {Tag, as: t} |
| 292 | +WHERE t.TagName = 'python' AND p.Score >= 5 |
| 293 | +RETURN u.DisplayName, count(p) as answers, sum(p.Score) as total_score |
| 294 | +ORDER BY total_score DESC LIMIT 10 |
| 295 | + |
| 296 | +-- Similar unanswered questions (vector + document filtering) |
| 297 | +-- 1. Find similar questions via vector search |
| 298 | +-- 2. Filter by AcceptedAnswerId IS NULL |
| 299 | +-- 3. Rank by Score |
| 300 | + |
| 301 | +-- Trending topics (graph analytics on tag co-occurrence) |
| 302 | +MATCH {Post, as: p} -[HAS_TAG]-> {Tag, as: t1}, |
| 303 | + {Post, as: p} -[HAS_TAG]-> {Tag, as: t2} |
| 304 | +WHERE t1.TagName < t2.TagName -- count each unordered tag pair once
| 305 | +RETURN t1.TagName, t2.TagName, count(p) as co_occurrence |
| 306 | +ORDER BY co_occurrence DESC LIMIT 20 |
| 307 | +``` |
| 308 | + |
| 309 | +## Class Architecture |
| 310 | + |
| 311 | +```python |
| 312 | +class StackOverflowDatabase: |
| 313 | + """Main orchestrator - manages all three phases""" |
| 314 | + def __init__(db_path, dataset_size) |
| 315 | + def run_full_pipeline() # Execute all phases |
| 316 | + def run_phase_1_documents() # Import documents only |
| 317 | + def run_phase_2_graph() # Add graph layer |
| 318 | + def run_phase_3_vectors() # Add vector search |
| 319 | + def validate_each_phase() # Check counts, run sample queries |
| 320 | + def export_database() # Export to JSONL for reproducibility |
| 321 | + |
| 322 | +class SchemaBuilder: |
| 323 | + """Phase 1: Schema creation""" |
| 324 | + |
| 325 | +class DocumentImporter: |
| 326 | + """Phase 1: XML import""" |
| 327 | + |
| 328 | +class GraphBuilder: |
| 329 | + """Phase 2: Graph layer""" |
| 330 | + |
| 331 | +class VectorBuilder: |
| 332 | + """Phase 3: Vector search""" |
| 333 | +``` |
| 334 | + |
| 335 | +## Dataset Options & Performance |
| 336 | + |
| 337 | +**Small Dataset (cs.stackexchange.com)** - RECOMMENDED FOR DEVELOPMENT: |
| 338 | +- 105K posts, 138K users, 668 tags, 195K comments |
| 339 | +- ~650MB XML, imports in ~2-5 minutes |
| 340 | +- Perfect for development and testing |
| 341 | + |
| 342 | +**Medium Dataset (stats.stackexchange.com)**: |
| 343 | +- 425K posts, 345K users, 1.6K tags, 819K comments |
| 344 | +- ~2.5GB XML, imports in ~10-20 minutes |
| 345 | + |
| 346 | +**Large Dataset (stackoverflow.com)** - PRODUCTION SCALE: |
| 347 | +- 59M posts, 20M users, 65K tags, ~100M comments |
| 348 | +- ~325GB XML, imports in hours (97GB Posts.xml alone!) |
| 349 | +- Requires 8GB+ JVM heap, production server setup |
| 350 | + |
119 | 351 | **Note**: Largest example, demonstrates all three models working together |
120 | 352 |
|
121 | 353 | ### 8. **Server Mode: HTTP API + Studio** (Priority: HIGH) |
|