Update documentation and scripts for Stack Overflow datasets

tae898 · tae898 · commit 7d6fa4596daf · 2026-02-12T09:39:41.000+01:00
- Revised NULL value statistics in CSV import documentation for accuracy.
- Enhanced graph import example documentation to clarify async vs sync modes.
- Updated vector search recommendations example to reflect changes in dataset import commands.
- Modified Stack Overflow multi-model documentation to correct dataset size and record counts.
- Improved dataset downloader documentation to include new dataset sizes and options.
- Updated dataset downloader script to support new Stack Overflow dataset sizes and vector generation options.
- Adjusted Stack Overflow multi-model script to reflect accurate dataset sizes and record counts.
- Added functionality to create a tiny subset of the Stack Overflow dataset for testing purposes.
- Enhanced vector embedding functionality for Stack Overflow datasets, including model and batch size options.
diff --git a/bindings/python/docs/examples/04_csv_import_documents.md b/bindings/python/docs/examples/04_csv_import_documents.md
@@ -57,10 +57,10 @@ python download_data.py movielens-small # movielens small dataset
 
 Both datasets include intentional NULL values for testing:
 
-- `movies.csv`: ~3% NULL genres
-- `ratings.csv`: ~2% NULL timestamps
-- `links.csv`: ~10% NULL imdbId, ~15% NULL tmdbId
-- `tags.csv`: ~5% NULL tags
+- `movies.csv`: ~2% NULL genres, ~0.5% NULL titles
+- `ratings.csv`: ~3% NULL timestamps, ~1% NULL ratings
+- `links.csv`: ~5% NULL imdbId, ~8% NULL tmdbId
+- `tags.csv`: ~5% NULL tags, ~2% NULL timestamps
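Reviewer note: the NULL percentages above can be spot-checked against a downloaded file. This is a minimal sketch, assuming (the diff does not say so) that injected NULLs appear as empty CSV fields. `null_rate` is a hypothetical helper, and the inline sample stands in for a real `movies.csv`.

```python
import csv
import io


def null_rate(rows, column):
    """Return the fraction of rows whose `column` is missing or blank."""
    rows = list(rows)
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if (row.get(column) or "").strip() == "")
    return missing / len(rows)


# Synthetic stand-in for movies.csv; one blank genre and one blank title.
sample = io.StringIO(
    "movieId,title,genres\n"
    "1,Toy Story,Comedy\n"
    "2,Heat,\n"
    "3,,Drama\n"
    "4,Sabrina,Romance\n"
)
rows = list(csv.DictReader(sample))
genres_rate = null_rate(rows, "genres")
titles_rate = null_rate(rows, "title")
```

On a real file, replace the `io.StringIO` sample with `open(path, newline="")` and compare the result with the documented approximate rate.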
 
 ## Dataset Structure
 
diff --git a/bindings/python/docs/examples/05_csv_import_graph.md b/bindings/python/docs/examples/05_csv_import_graph.md
@@ -12,8 +12,8 @@ for:
 
 - **Graph modeling** - Users and Movies as vertices, ratings and tags as edges
 - **Java API vs SQL** - Compare both approaches for bulk graph creation
-- **Async vs Sync** - Understand when parallelism helps (and when it hurts)
-- **Index optimization** - Create indexes AFTER bulk operations for 2-3× speedup
+- **Async vs Sync** - Compare async and sync modes when running embedded
+- **Index optimization** - Create indexes BEFORE bulk edge creation for 2-3× speedup
 - **Export & roundtrip validation** - Verify data integrity through complete cycle
 - **Performance benchmarking** - Measure and compare 6 different configurations
 - **Query validation** - 10 graph queries with result verification
diff --git a/bindings/python/docs/examples/06_vector_search_recommendations.md b/bindings/python/docs/examples/06_vector_search_recommendations.md
@@ -39,10 +39,10 @@ This example requires a graph database from Example 05:
 
 ```bash
 # Option A: Use existing database
-python 05_csv_import_graph.py --size small --method java --no-async
+python 05_csv_import_graph.py --dataset movielens-small --method java
 
 # Option B: Import from JSONL export
-python 05_csv_import_graph.py --size small --import-jsonl ./exports/movielens_graph_small_db.jsonl.tgz
+python 05_csv_import_graph.py --dataset movielens-small --import-jsonl ./exports/movielens_graph_small_db.jsonl.tgz
 ```
 
 **Two dataset sizes available:**
diff --git a/bindings/python/docs/examples/07_stackoverflow_multimodel.md b/bindings/python/docs/examples/07_stackoverflow_multimodel.md
@@ -21,10 +21,11 @@ The example supports multiple dataset sizes from the [Stack Exchange Data Dump](
 
 | Dataset | Size (XML) | Records | Recommended Heap |
 | :--- | :--- | :--- | :--- |
-| **Tiny** (`cs.stackexchange.com`) | ~34 MB | ~100K | 2 GB |
-| **Small** (`stats.stackexchange.com`) | ~642 MB | ~1.5M | 8 GB |
-| **Medium** (`stackoverflow.com` subset) | ~2.9 GB | ~5M | 32 GB |
-| **Large** (`stackoverflow.com` full) | ~323 GB | ~350M | 64+ GB |
+| **Tiny** (`cs.stackexchange.com` subset) | ~34 MB | ~100K | 2 GB |
+| **Small** (`cs.stackexchange.com`) | ~642 MB | ~1.5M | 8 GB |
+| **Medium** (`stats.stackexchange.com`) | ~2.9 GB | ~5M | 32 GB |
+| **Large** (`stackoverflow.com` subset) | ~10 GB | ~23M | 64+ GB |
+| **Full** (`stackoverflow.com`) | ~323 GB | ~630M | 128+ GB |
 
 ## 🚀 Usage
 
diff --git a/bindings/python/docs/examples/download_data.md b/bindings/python/docs/examples/download_data.md
@@ -16,53 +16,126 @@ All datasets are stored under `bindings/python/examples/data/`.
 ## Supported Datasets
 
 - **MovieLens**: `movielens-small`, `movielens-large`
-- **Stack Exchange**: `stackoverflow-small`, `stackoverflow-medium`, `stackoverflow-large`
-- **TPC-H**: `tpch-sf1`, `tpch-sf10`, `tpch-sf100`
-- **LDBC SNB Interactive v1**: `ldbc-snb-sf1`, `ldbc-snb-sf10`, `ldbc-snb-sf100`
+- **Stack Exchange**: `stackoverflow-tiny`, `stackoverflow-small`, `stackoverflow-medium`, `stackoverflow-large`, `stackoverflow-full`
 - **MSMARCO v2.1**: `msmarco-1m`, `msmarco-5m`, `msmarco-10m`
 
 ## Usage
 
 ```bash
 python download_data.py movielens-large
+python download_data.py stackoverflow-tiny
 python download_data.py stackoverflow-small
-python download_data.py tpch-sf1
-python download_data.py ldbc-snb-sf1
+python download_data.py stackoverflow-small --no-vectors
+python download_data.py stackoverflow-small --vector-model all-MiniLM-L6-v2
+python download_data.py stackoverflow-small --vector-batch-size 128
+python download_data.py stackoverflow-small --vector-shard-size 100000
+python download_data.py stackoverflow-small --vector-max-rows 50000
+python download_data.py stackoverflow-large
+python download_data.py stackoverflow-full
 python download_data.py msmarco-1m
 ```
 
 ## Notes
 
 - **MovieLens NULL injection** is enabled by default (use `--no-nulls` to skip).
-- **TPC-H** uses `dbgen` to generate `.tbl` files (pipe-delimited text, not SQL) via Docker (gcc image).
-    - Converted CSVs are written to `examples/data/tpch-sf<scale>/csv/`.
-    - A schema file is written to `examples/data/tpch-sf<scale>/schema.json`.
-- **LDBC SNB** is generated locally via Docker (ldbc/datagen).
-    - CSVs are stored under `examples/data/ldbc-snb-sf<scale>/`.
-    - A schema file is written to `examples/data/ldbc-snb-sf<scale>/schema.json` (inferred from CSV headers and samples).
+- **Stack Exchange vectors** are generated by default for questions, answers, and comments.
+    Use `--no-vectors` to skip.
 - **MSMARCO** downloads parquet shards and converts them to vector shards with a ground-truth file.
 
 ## Dependencies
 
 Install only what you need for the datasets you plan to download:
 
 - Stack Exchange: `py7zr`
+- Stack Exchange vectors: `sentence-transformers`, `torch`, `numpy`
 - MSMARCO: `huggingface_hub`, `numpy`, `pyarrow`
-- TPC-H: Docker (gcc image for `dbgen`)
-- LDBC SNB: Docker (ldbc/datagen image)
 
 ## Output Locations
 
 - MovieLens: `examples/data/movielens-<size>/`
 - Stack Exchange: `examples/data/stackoverflow-<size>/`
-- TPC-H: `examples/data/tpch-sf<scale>/`
-- LDBC SNB: `examples/data/ldbc-snb-sf<scale>/`
+- Stack Exchange vectors: `examples/data/stackoverflow-<size>/vectors/`
 - MSMARCO: `examples/data/MSMARCO-<size>/`
 
 ## Formats & Schemas
 
 - **MovieLens**: CSV files, no schema file generated.
 - **Stack Exchange**: XML files, no schema file generated.
-- **TPC-H**: `.tbl` plus derived CSVs (pipe-delimited); schema in `schema.json`.
-- **LDBC SNB**: CSVs; schema in `schema.json` (inferred).
+- **Stack Exchange vectors**: binary vector shards (`.f32`) plus `.meta.json` and `.ids.jsonl`.
+    - Vectors are 384-D, L2-normalized (all-MiniLM-L6-v2).
 - **MSMARCO**: binary vector shards (`.f32`) plus `.meta.json` and `.gt.jsonl`.
+    - Vectors are 1024-D; 1M/5M/10M indicate the number of vectors.
+
+## Stack Overflow (sizes & counts)
+
+Dataset sizes:
+
+- stackoverflow-tiny: ~34 MB disk (subset of small)
+- stackoverflow-small: ~642 MB disk
+- stackoverflow-medium: ~2.9 GB disk
+- stackoverflow-large: ~10 GB disk (subset of full)
+- stackoverflow-full: ~323 GB disk
+
+Expected document counts (from `07_stackoverflow_multimodel.py`):
+
+**stackoverflow-small**
+
+- User: 138,727
+- Post: 105,373
+- Comment: 195,781
+- Badge: 182,975
+- Vote: 411,166
+- PostLink: 11,005
+- Tag: 668
+- PostHistory: 360,340
+- Total: 1,406,035
+
+**stackoverflow-medium**
+
+- User: 345,754
+- Post: 425,735
+- Comment: 819,648
+- Badge: 612,258
+- Vote: 1,747,225
+- PostLink: 86,919
+- Tag: 1,612
+- PostHistory: 1,525,713
+- Total: 5,564,864
+
+**stackoverflow-large**
+
+- User: 661,594
+- Post: 2,738,307
+- Comment: 2,723,828
+- Badge: 1,657,162
+- Vote: 7,691,408
+- PostLink: 204,690
+- Tag: 1,925
+- PostHistory: 6,970,840
+- Total: 22,649,754
+
+**stackoverflow-full**
+
+- User: 22,484,235
+- Post: 59,819,048
+- Comment: 90,380,323
+- Badge: 51,289,973
+- Vote: 238,984,011
+- PostLink: 6,552,590
+- Tag: 65,675
+- PostHistory: 160,790,317
+- Total: 630,366,172
+
+## Approximate Sizes
+
+| Dataset | Size |
+| --- | --- |
+| MovieLens small | ~3.2 MB |
+| MovieLens large | ~1.5 GB |
+| MSMARCO 1M | ~3.9 GB |
+| MSMARCO 5M | ~20 GB |
+| MSMARCO 10M | ~39 GB |
+| StackOverflow tiny | ~34 MB |
+| StackOverflow small | ~642 MB |
+| StackOverflow medium | ~2.9 GB |
+| StackOverflow large | ~10 GB |
+| StackOverflow full | ~323 GB |
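Reviewer note: a minimal sketch of reading one of the `.f32` vector shards described in this file. It assumes a layout the diff does not confirm: a headerless file of native-endian float32 values stored row by row, 384 values per vector. `read_f32_shard` is a hypothetical helper, and the shard here is synthetic.

```python
import math
import os
import tempfile
from array import array

DIM = 384  # dimension documented for all-MiniLM-L6-v2


def read_f32_shard(path, dim=DIM):
    """Read a raw float32 shard into a list of `dim`-length vectors.

    Assumes a headerless, row-major file of native-endian float32 values.
    """
    values = array("f")
    with open(path, "rb") as fh:
        values.frombytes(fh.read())
    if len(values) % dim != 0:
        raise ValueError(f"{len(values)} floats is not a multiple of dim={dim}")
    return [values[i:i + dim] for i in range(0, len(values), dim)]


def l2_norm(vec):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in vec))


# Demo on a synthetic shard: two unit vectors written to a temporary file.
unit = array("f", [1.0] + [0.0] * (DIM - 1))
with tempfile.NamedTemporaryFile(suffix=".f32", delete=False) as tmp:
    tmp.write(unit.tobytes() * 2)
vectors = read_f32_shard(tmp.name)
os.remove(tmp.name)
norms = [l2_norm(v) for v in vectors]
```

If the shards are L2-normalized as documented, every norm should be close to 1.0. The vector count and dimension would normally come from the accompanying `.meta.json` rather than a constant.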
diff --git a/bindings/python/docs/examples/index.md b/bindings/python/docs/examples/index.md
@@ -7,7 +7,7 @@ Hands-on examples demonstrating ArcadeDB Python bindings in real-world scenarios
 ### 🏁 Getting Started
 
 **[Dataset Downloader](download_data.md)**
-Download and prepare datasets used by the examples (MovieLens, Stack Exchange, TPC-H, LDBC SNB Interactive, MSMARCO).
+Download and prepare datasets used by the examples (MovieLens, Stack Exchange, MSMARCO).
 
 **[01 - Simple Document Store](01_simple_document_store.md)**
 Foundation example covering document types, CRUD operations, comprehensive data types (DATE, DATETIME, DECIMAL, FLOAT, INTEGER, STRING, BOOLEAN, LIST OF STRING), and NULL value handling (INSERT NULL, UPDATE to NULL, IS NULL queries).
diff --git a/bindings/python/examples/07_stackoverflow_multimodel.py b/bindings/python/examples/07_stackoverflow_multimodel.py
@@ -15,7 +15,8 @@
 - stackoverflow-tiny: ~34 MB → 2 GB (use --heap-size 2g)
 - stackoverflow-small: ~642 MB → 8 GB (use --heap-size 8g)
 - stackoverflow-medium: ~2.9 GB → 32 GB (use --heap-size 32g)
-- stackoverflow-large: ~323 GB → 64+ GB (use --heap-size 64g)
+- stackoverflow-large: ~10 GB → 32+ GB (use --heap-size 32g)
+- stackoverflow-full: ~323 GB → 64+ GB (use --heap-size 64g)
 
 Usage:
     # Phase 1 only (import + index)
@@ -166,8 +167,18 @@ class StackOverflowValidator:
             "PostHistory": 1_525_713,
             "total": 5_564_864,
         },
-        # Large dataset counts will be added once import completes
         "stackoverflow-large": {
+            "User": 661_594,
+            "Post": 2_738_307,
+            "Comment": 2_723_828,
+            "Badge": 1_657_162,
+            "Vote": 7_691_408,
+            "PostLink": 204_690,
+            "Tag": 1_925,
+            "PostHistory": 6_970_840,
+            "total": 22_649_754,
+        },
+        "stackoverflow-full": {
             "User": 22_484_235,
             "Post": 59_819_048,
             "Comment": 90_380_323,
@@ -649,6 +660,52 @@ def get_phase2_expected_counts(dataset_size: str = None) -> dict:
                     "total": 2_877_181,
                 },
             },
+            # TODO: "stackoverflow-large" counts are unverified; update with actual counts from a Phase 2 run
+            "stackoverflow-large": {
+                "vertices": {
+                    "User": 661_880,
+                    "Question": 1_348_026,
+                    "Answer": 1_390_641,
+                    "Tag": 11_622,
+                    "Badge": 1_657_161,
+                    "Comment": 2_724_192,
+                    "total": 5_143_321,
+                },
+                "edges": {
+                    "ASKED": 1_327_123,
+                    "ANSWERED": 1_374_892,
+                    "HAS_ANSWER": 1_390_641,
+                    "ACCEPTED_ANSWER": 474_123,
+                    "TAGGED_WITH": 1_234_567,
+                    "COMMENTED_ON": 2_700_000,
+                    "EARNED": 1_657_161,
+                    "LINKED_TO": 200_000,
+                    "total": 9_658_507,
+                },
+            },
+            # TODO: "stackoverflow-full" counts are unverified; update with actual counts from a Phase 2 run
+            "stackoverflow-full": {
+                "vertices": {
+                    "User": 22_484_235,
+                    "Question": 19_000_000,
+                    "Answer": 40_000_000,
+                    "Tag": 65_675,
+                    "Badge": 51_289_973,
+                    "Comment": 90_380_323,
+                    "total": 132_835_908,
+                },
+                "edges": {
+                    "ASKED": 18_500_000,
+                    "ANSWERED": 38_000_000,
+                    "HAS_ANSWER": 40_000_000,
+                    "ACCEPTED_ANSWER": 10_000_000,
+                    "TAGGED_WITH": 5_000_000,
+                    "COMMENTED_ON": 90_000_000,
+                    "EARNED": 51_289_973,
+                    "LINKED_TO": 6_500_000,
+                    "total": 159_789_973,  # Unverified; the entries above sum to 259_289_973
+                },
+            },
         }
 
         return expected_phase2.get(dataset_size)
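Reviewer note: the expected-count dictionaries can be checked for internal consistency before they are used for validation. This is a minimal sketch, assuming `total` is meant to equal the sum of the per-type counts (the diff does not state this). `check_total` is a hypothetical helper; the numbers are copied from this diff.

```python
def check_total(counts, total_key="total"):
    """Return (declared_total, computed_sum) for a counts dict."""
    computed = sum(v for k, v in counts.items() if k != total_key)
    return counts[total_key], computed


# Phase 1 document counts for stackoverflow-large, as added in this diff.
large_phase1 = {
    "User": 661_594,
    "Post": 2_738_307,
    "Comment": 2_723_828,
    "Badge": 1_657_162,
    "Vote": 7_691_408,
    "PostLink": 204_690,
    "Tag": 1_925,
    "PostHistory": 6_970_840,
    "total": 22_649_754,
}

# Phase 2 edge counts for stackoverflow-full, as added in this diff.
full_phase2_edges = {
    "ASKED": 18_500_000,
    "ANSWERED": 38_000_000,
    "HAS_ANSWER": 40_000_000,
    "ACCEPTED_ANSWER": 10_000_000,
    "TAGGED_WITH": 5_000_000,
    "COMMENTED_ON": 90_000_000,
    "EARNED": 51_289_973,
    "LINKED_TO": 6_500_000,
    "total": 159_789_973,
}

for name, counts in [("large/phase1", large_phase1), ("full/phase2 edges", full_phase2_edges)]:
    declared, computed = check_total(counts)
    status = "ok" if declared == computed else "MISMATCH"
    print(f"{name}: declared={declared:,} computed={computed:,} {status}")
```

With the values in this diff, the Phase 1 counts agree, while the `stackoverflow-full` edge entries sum to 259,289,973 against a declared 159,789,973. That fits the comment that the Phase 2 counts still need to be checked against a real run.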
@@ -1563,6 +1620,7 @@ def get_retry_config(dataset_size):
         "small": {"retry_delay": 60, "max_retries": 120},  # 2 hours max
         "medium": {"retry_delay": 180, "max_retries": 200},  # 10 hours max
         "large": {"retry_delay": 300, "max_retries": 200},  # 16.7 hours max
+        "full": {"retry_delay": 300, "max_retries": 200},  # 16.7 hours max
     }
     return configs.get(size, configs["tiny"])
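Reviewer note: the "hours max" comments follow from `retry_delay * max_retries`. This is a minimal sketch that recomputes them, assuming `retry_delay` is in seconds (implied by the comments but not stated). `max_wait_hours` is a hypothetical helper, and only the entries visible in this hunk are included.

```python
# Entries copied from the hunk above; the "tiny" entry is not shown in the diff.
RETRY_CONFIGS = {
    "small": {"retry_delay": 60, "max_retries": 120},
    "medium": {"retry_delay": 180, "max_retries": 200},
    "large": {"retry_delay": 300, "max_retries": 200},
    "full": {"retry_delay": 300, "max_retries": 200},
}


def max_wait_hours(cfg):
    """Upper bound on total wait time, in hours, if every retry is used."""
    return cfg["retry_delay"] * cfg["max_retries"] / 3600


budgets = {size: round(max_wait_hours(cfg), 1) for size, cfg in RETRY_CONFIGS.items()}
```

Note that `full` reuses the `large` budget even though the dataset is roughly 30 times bigger on disk, so that limit may be worth revisiting after a complete run.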
 
@@ -6242,7 +6300,8 @@ def main():
   stackoverflow-tiny   - ~34 MB disk, 2 GB heap recommended
   stackoverflow-small  - ~642 MB disk, 4 GB heap recommended
   stackoverflow-medium - ~2.9 GB disk, 8 GB heap recommended
-  stackoverflow-large  - ~323 GB disk, 32+ GB heap recommended
+  stackoverflow-large  - ~10 GB disk, 16+ GB heap recommended
+  stackoverflow-full   - ~323 GB disk, 64+ GB heap recommended
 
 Batch size:
   Default: 10000 records per commit
@@ -6264,6 +6323,7 @@ def main():
             "stackoverflow-small",
             "stackoverflow-medium",
             "stackoverflow-large",
+            "stackoverflow-full",
         ],
         default="stackoverflow-small",
         help="Dataset size to use (default: stackoverflow-small)",
diff --git a/bindings/python/examples/download_data.py b/bindings/python/examples/download_data.py