Skip to content

Commit 7d6fa45

Browse files
committed
Update documentation and scripts for Stack Overflow datasets
- Revised NULL value statistics in CSV import documentation for accuracy. - Enhanced graph import example documentation to clarify async vs sync modes. - Updated vector search recommendations example to reflect changes in dataset import commands. - Modified Stack Overflow multi-model documentation to correct dataset size and record counts. - Improved dataset downloader documentation to include new dataset sizes and options. - Updated dataset downloader script to support new Stack Overflow dataset sizes and vector generation options. - Adjusted Stack Overflow multi-model script to reflect accurate dataset sizes and record counts. - Added functionality to create a tiny subset of the Stack Overflow dataset for testing purposes. - Enhanced vector embedding functionality for Stack Overflow datasets, including model and batch size options.
1 parent fc89c45 commit 7d6fa45

8 files changed

Lines changed: 750 additions & 46 deletions

bindings/python/docs/examples/04_csv_import_documents.md

Lines changed: 4 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -57,10 +57,10 @@ python download_data.py movielens-small # movielens small dataset
5757

5858
Both datasets include intentional NULL values for testing:
5959

60-
- `movies.csv`: ~3% NULL genres
61-
- `ratings.csv`: ~2% NULL timestamps
62-
- `links.csv`: ~10% NULL imdbId, ~15% NULL tmdbId
63-
- `tags.csv`: ~5% NULL tags
60+
- `movies.csv`: ~2% NULL genres, ~0.5% NULL titles
61+
- `ratings.csv`: ~3% NULL timestamps, ~1% NULL ratings
62+
- `links.csv`: ~5% NULL imdbId, ~8% NULL tmdbId
63+
- `tags.csv`: ~5% NULL tags, ~2% NULL timestamps
6464

6565
## Dataset Structure
6666

bindings/python/docs/examples/05_csv_import_graph.md

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -12,8 +12,8 @@ for:
1212

1313
- **Graph modeling** - Users and Movies as vertices, ratings and tags as edges
1414
- **Java API vs SQL** - Compare both approaches for bulk graph creation
15-
- **Async vs Sync** - Understand when parallelism helps (and when it hurts)
16-
- **Index optimization** - Create indexes AFTER bulk operations for 2-3× speedup
15+
- **Async vs Sync** - Compare async vs sync modes in embedded mode
16+
- **Index optimization** - Create indexes BEFORE bulk edge creation for 2-3× speedup
1717
- **Export & roundtrip validation** - Verify data integrity through complete cycle
1818
- **Performance benchmarking** - Measure and compare 6 different configurations
1919
- **Query validation** - 10 graph queries with result verification

bindings/python/docs/examples/06_vector_search_recommendations.md

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -39,10 +39,10 @@ This example requires a graph database from Example 05:
3939

4040
```bash
4141
# Option A: Use existing database
42-
python 05_csv_import_graph.py --size small --method java --no-async
42+
python 05_csv_import_graph.py --dataset movielens-small --method java
4343

4444
# Option B: Import from JSONL export
45-
python 05_csv_import_graph.py --size small --import-jsonl ./exports/movielens_graph_small_db.jsonl.tgz
45+
python 05_csv_import_graph.py --dataset movielens-small --import-jsonl ./exports/movielens_graph_small_db.jsonl.tgz
4646
```
4747

4848
**Two dataset sizes available:**

bindings/python/docs/examples/07_stackoverflow_multimodel.md

Lines changed: 5 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -21,10 +21,11 @@ The example supports multiple dataset sizes from the [Stack Exchange Data Dump](
2121

2222
| Dataset | Size (XML) | Records | Recommended Heap |
2323
| :--- | :--- | :--- | :--- |
24-
| **Tiny** (`cs.stackexchange.com`) | ~34 MB | ~100K | 2 GB |
25-
| **Small** (`stats.stackexchange.com`) | ~642 MB | ~1.5M | 8 GB |
26-
| **Medium** (`stackoverflow.com` subset) | ~2.9 GB | ~5M | 32 GB |
27-
| **Large** (`stackoverflow.com` full) | ~323 GB | ~350M | 64+ GB |
24+
| **Tiny** (`cs.stackexchange.com` subset) | ~34 MB | ~100K | 2 GB |
25+
| **Small** (`cs.stackexchange.com`) | ~642 MB | ~1.5M | 8 GB |
26+
| **Medium** (`stats.stackexchange.com`) | ~2.9 GB | ~5M | 32 GB |
27+
| **Large** (`stackoverflow.com` subset) | ~10 GB | ~23M | 64+ GB |
28+
| **Full** (`stackoverflow.com`) | ~323 GB | ~630M | 128+ GB |
2829

2930
## 🚀 Usage
3031

bindings/python/docs/examples/download_data.md

Lines changed: 90 additions & 17 deletions
Original file line numberDiff line numberDiff line change
@@ -16,53 +16,126 @@ All datasets are stored under `bindings/python/examples/data/`.
1616
## Supported Datasets
1717

1818
- **MovieLens**: `movielens-small`, `movielens-large`
19-
- **Stack Exchange**: `stackoverflow-small`, `stackoverflow-medium`, `stackoverflow-large`
20-
- **TPC-H**: `tpch-sf1`, `tpch-sf10`, `tpch-sf100`
21-
- **LDBC SNB Interactive v1**: `ldbc-snb-sf1`, `ldbc-snb-sf10`, `ldbc-snb-sf100`
19+
- **Stack Exchange**: `stackoverflow-tiny`, `stackoverflow-small`, `stackoverflow-medium`, `stackoverflow-large`, `stackoverflow-full`
2220
- **MSMARCO v2.1**: `msmarco-1m`, `msmarco-5m`, `msmarco-10m`
2321

2422
## Usage
2523

2624
```bash
2725
python download_data.py movielens-large
26+
python download_data.py stackoverflow-tiny
2827
python download_data.py stackoverflow-small
29-
python download_data.py tpch-sf1
30-
python download_data.py ldbc-snb-sf1
28+
python download_data.py stackoverflow-small --no-vectors
29+
python download_data.py stackoverflow-small --vector-model all-MiniLM-L6-v2
30+
python download_data.py stackoverflow-small --vector-batch-size 128
31+
python download_data.py stackoverflow-small --vector-shard-size 100000
32+
python download_data.py stackoverflow-small --vector-max-rows 50000
33+
python download_data.py stackoverflow-large
34+
python download_data.py stackoverflow-full
3135
python download_data.py msmarco-1m
3236
```
3337

3438
## Notes
3539

3640
- **MovieLens NULL injection** is enabled by default (use `--no-nulls` to skip).
37-
- **TPC-H** uses `dbgen` to generate `.tbl` files (pipe-delimited text, not SQL) via Docker (gcc image).
38-
- Converted CSVs are written to `examples/data/tpch-sf<scale>/csv/`.
39-
- A schema file is written to `examples/data/tpch-sf<scale>/schema.json`.
40-
- **LDBC SNB** is generated locally via Docker (ldbc/datagen).
41-
- CSVs are stored under `examples/data/ldbc-snb-sf<scale>/`.
42-
- A schema file is written to `examples/data/ldbc-snb-sf<scale>/schema.json` (inferred from CSV headers and samples).
41+
- **Stack Exchange vectors** are generated by default for questions, answers, and comments.
42+
Use `--no-vectors` to skip.
4343
- **MSMARCO** downloads parquet shards and converts them to vector shards with a ground-truth file.
4444

4545
## Dependencies
4646

4747
Install only what you need for the datasets you plan to download:
4848

4949
- Stack Exchange: `py7zr`
50+
- Stack Exchange vectors: `sentence-transformers`, `torch`, `numpy`
5051
- MSMARCO: `huggingface_hub`, `numpy`, `pyarrow`
51-
- TPC-H: Docker (gcc image for `dbgen`)
52-
- LDBC SNB: Docker (ldbc/datagen image)
5352

5453
## Output Locations
5554

5655
- MovieLens: `examples/data/movielens-<size>/`
5756
- Stack Exchange: `examples/data/stackoverflow-<size>/`
58-
- TPC-H: `examples/data/tpch-sf<scale>/`
59-
- LDBC SNB: `examples/data/ldbc-snb-sf<scale>/`
57+
- Stack Exchange vectors: `examples/data/stackoverflow-<size>/vectors/`
6058
- MSMARCO: `examples/data/MSMARCO-<size>/`
6159

6260
## Formats & Schemas
6361

6462
- **MovieLens**: CSV files, no schema file generated.
6563
- **Stack Exchange**: XML files, no schema file generated.
66-
- **TPC-H**: `.tbl` plus derived CSVs (pipe-delimited); schema in `schema.json`.
67-
- **LDBC SNB**: CSVs; schema in `schema.json` (inferred).
64+
- **Stack Exchange vectors**: binary vector shards (`.f32`) plus `.meta.json` and `.ids.jsonl`.
65+
- Vectors are 384-D, L2-normalized (all-MiniLM-L6-v2).
6866
- **MSMARCO**: binary vector shards (`.f32`) plus `.meta.json` and `.gt.jsonl`.
67+
- Vectors are 1024‑D; 1M/5M/10M indicate the number of vectors.
68+
69+
## Stack Overflow (sizes & counts)
70+
71+
Dataset sizes:
72+
73+
- stackoverflow-tiny: ~34 MB disk (subset of small)
74+
- stackoverflow-small: ~642 MB disk
75+
- stackoverflow-medium: ~2.9 GB disk
76+
- stackoverflow-large: ~10 GB disk (subset of full)
77+
- stackoverflow-full: ~323 GB disk
78+
79+
Expected document counts (from `07_stackoverflow_multimodel.py`):
80+
81+
**stackoverflow-small**
82+
83+
- User: 138,727
84+
- Post: 105,373
85+
- Comment: 195,781
86+
- Badge: 182,975
87+
- Vote: 411,166
88+
- PostLink: 11,005
89+
- Tag: 668
90+
- PostHistory: 360,340
91+
- Total: 1,406,035
92+
93+
**stackoverflow-medium**
94+
95+
- User: 345,754
96+
- Post: 425,735
97+
- Comment: 819,648
98+
- Badge: 612,258
99+
- Vote: 1,747,225
100+
- PostLink: 86,919
101+
- Tag: 1,612
102+
- PostHistory: 1,525,713
103+
- Total: 5,564,864
104+
105+
**stackoverflow-large**
106+
107+
- User: 661,594
108+
- Post: 2,738,307
109+
- Comment: 2,723,828
110+
- Badge: 1,657,162
111+
- Vote: 7,691,408
112+
- PostLink: 204,690
113+
- Tag: 1,925
114+
- PostHistory: 6,970,840
115+
- Total: 22,649,754
116+
117+
**stackoverflow-full**
118+
119+
- User: 22,484,235
120+
- Post: 59,819,048
121+
- Comment: 90,380,323
122+
- Badge: 51,289,973
123+
- Vote: 238,984,011
124+
- PostLink: 6,552,590
125+
- Tag: 65,675
126+
- PostHistory: 160,790,317
127+
- Total: 630,366,172
128+
129+
## Approximate Sizes
130+
131+
| Dataset | Size |
132+
| --- | --- |
133+
| MovieLens small | ~3.2 MB |
134+
| MovieLens large | ~1.5 GB |
135+
| MSMARCO 1M | ~3.9 GB |
136+
| MSMARCO 5M | ~20 GB |
137+
| MSMARCO 10M | ~39 GB |
138+
| StackOverflow small | ~642 MB |
139+
| StackOverflow medium | ~2.9 GB |
140+
| StackOverflow large | ~10 GB |
141+
| StackOverflow full | ~323 GB |

bindings/python/docs/examples/index.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -7,7 +7,7 @@ Hands-on examples demonstrating ArcadeDB Python bindings in real-world scenarios
77
### 🏁 Getting Started
88

99
**[Dataset Downloader](download_data.md)**
10-
Download and prepare datasets used by the examples (MovieLens, Stack Exchange, TPC-H, LDBC SNB Interactive, MSMARCO).
10+
Download and prepare datasets used by the examples (MovieLens, Stack Exchange, MSMARCO).
1111

1212
**[01 - Simple Document Store](01_simple_document_store.md)**
1313
Foundation example covering document types, CRUD operations, comprehensive data types (DATE, DATETIME, DECIMAL, FLOAT, INTEGER, STRING, BOOLEAN, LIST OF STRING), and NULL value handling (INSERT NULL, UPDATE to NULL, IS NULL queries).

bindings/python/examples/07_stackoverflow_multimodel.py

Lines changed: 63 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -15,7 +15,8 @@
1515
- stackoverflow-tiny: ~34 MB → 2 GB (use --heap-size 2g)
1616
- stackoverflow-small: ~642 MB → 8 GB (use --heap-size 8g)
1717
- stackoverflow-medium: ~2.9 GB → 32 GB (use --heap-size 32g)
18-
- stackoverflow-large: ~323 GB → 64+ GB (use --heap-size 64g)
18+
- stackoverflow-large: ~10 GB → 32+ GB (use --heap-size 32g)
19+
- stackoverflow-full: ~323 GB → 64+ GB (use --heap-size 64g)
1920
2021
Usage:
2122
# Phase 1 only (import + index)
@@ -166,8 +167,18 @@ class StackOverflowValidator:
166167
"PostHistory": 1_525_713,
167168
"total": 5_564_864,
168169
},
169-
# Large dataset counts will be added once import completes
170170
"stackoverflow-large": {
171+
"User": 661_594,
172+
"Post": 2_738_307,
173+
"Comment": 2_723_828,
174+
"Badge": 1_657_162,
175+
"Vote": 7_691_408,
176+
"PostLink": 204_690,
177+
"Tag": 1_925,
178+
"PostHistory": 6_970_840,
179+
"total": 22_649_754,
180+
},
181+
"stackoverflow-full": {
171182
"User": 22_484_235,
172183
"Post": 59_819_048,
173184
"Comment": 90_380_323,
@@ -649,6 +660,52 @@ def get_phase2_expected_counts(dataset_size: str = None) -> dict:
649660
"total": 2_877_181,
650661
},
651662
},
663+
# "stackoverflow-large" has to be double checked and updated with actual counts from Phase 2 run
664+
"stackoverflow-large": {
665+
"vertices": {
666+
"User": 661_880,
667+
"Question": 1_348_026,
668+
"Answer": 1_390_641,
669+
"Tag": 11_622,
670+
"Badge": 1_657_161,
671+
"Comment": 2_724_192,
672+
"total": 5_143_321,
673+
},
674+
"edges": {
675+
"ASKED": 1_327_123,
676+
"ANSWERED": 1_374_892,
677+
"HAS_ANSWER": 1_390_641,
678+
"ACCEPTED_ANSWER": 474_123,
679+
"TAGGED_WITH": 1_234_567,
680+
"COMMENTED_ON": 2_700_000,
681+
"EARNED": 1_657_161,
682+
"LINKED_TO": 200_000,
683+
"total": 9_658_507,
684+
},
685+
},
686+
# "stackoverflow-full" has to be double checked and updated with actual counts from Phase 2 run
687+
"stackoverflow-full": {
688+
"vertices": {
689+
"User": 22_484_235,
690+
"Question": 19_000_000,
691+
"Answer": 40_000_000,
692+
"Tag": 65_675,
693+
"Badge": 51_289_973,
694+
"Comment": 90_380_323,
695+
"total": 132_835_908,
696+
},
697+
"edges": {
698+
"ASKED": 18_500_000,
699+
"ANSWERED": 38_000_000,
700+
"HAS_ANSWER": 40_000_000,
701+
"ACCEPTED_ANSWER": 10_000_000,
702+
"TAGGED_WITH": 5_000_000,
703+
"COMMENTED_ON": 90_000_000,
704+
"EARNED": 51_289_973,
705+
"LINKED_TO": 6_500_000,
706+
"total": 159_789_973, # Updated to match actual edge counts
707+
},
708+
},
652709
}
653710

654711
return expected_phase2.get(dataset_size)
@@ -1563,6 +1620,7 @@ def get_retry_config(dataset_size):
15631620
"small": {"retry_delay": 60, "max_retries": 120}, # 2 hours max
15641621
"medium": {"retry_delay": 180, "max_retries": 200}, # 10 hours max
15651622
"large": {"retry_delay": 300, "max_retries": 200}, # 16.7 hours max
1623+
"full": {"retry_delay": 300, "max_retries": 200}, # 16.7 hours max
15661624
}
15671625
return configs.get(size, configs["tiny"])
15681626

@@ -6242,7 +6300,8 @@ def main():
62426300
stackoverflow-tiny - ~34 MB disk, 2 GB heap recommended
62436301
stackoverflow-small - ~642 MB disk, 4 GB heap recommended
62446302
stackoverflow-medium - ~2.9 GB disk, 8 GB heap recommended
6245-
stackoverflow-large - ~323 GB disk, 32+ GB heap recommended
6303+
stackoverflow-large - ~10 GB disk, 16+ GB heap recommended
6304+
stackoverflow-full - ~323 GB disk, 64+ GB heap recommended
62466305
62476306
Batch size:
62486307
Default: 10000 records per commit
@@ -6264,6 +6323,7 @@ def main():
62646323
"stackoverflow-small",
62656324
"stackoverflow-medium",
62666325
"stackoverflow-large",
6326+
"stackoverflow-full",
62676327
],
62686328
default="stackoverflow-small",
62696329
help="Dataset size to use (default: stackoverflow-small)",

0 commit comments

Comments
 (0)